skills/do-reasoning-ask-questions/SKILL.md
Information-theoretic question-asking framework for disambiguating user intent through structured yes/no questions. Uses a three-agent architecture (Seeker, Oracle, Pruner) grounded in Shannon entropy to maximize information gain per question turn. Trigger phrases: 'clarify ambiguous request', 'ask better questions', 'narrow down requirements', 'disambiguate user intent', 'information-gathering dialogue', 'structured requirements elicitation'.
npx skillsauth add ndpvt-web/arxiv-claude-skills do-reasoning-ask-questions
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically resolve ambiguity in user requests by asking maximally informative yes/no questions, modeled on the Seeker-Oracle-Pruner triad from Pedrozo et al. (2026). Instead of asking vague open-ended questions or making assumptions, Claude constructs a hypothesis space of plausible user intents, computes which binary question would eliminate the most candidates (maximizing information gain in bits), asks that question, prunes the space based on the answer, and repeats until the intent is unambiguous. This transforms requirement gathering from guesswork into a principled search process.
The core insight from the paper is that asking questions should be treated as an information-theoretic optimization problem. Given a set of N plausible interpretations of a user's request (the hypothesis space), and treating them as equally likely, the current uncertainty is H = log2(N) bits. The optimal yes/no question is the one that splits the hypothesis space as close to 50/50 as possible, yielding ~1 bit of information gain per turn. This is the same principle behind binary search, but applied to intent disambiguation.
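A short derivation may make the 50/50 claim concrete. It assumes the N hypotheses are equally likely (implicit in using log2(N) as the uncertainty) and introduces k, a symbol not in the original, for the number of hypotheses consistent with a "yes" answer, with 0 < k < N:

```latex
\begin{align*}
H_{\text{before}} &= \log_2 N, \qquad p = \frac{k}{N} \\
\mathrm{IG}(k) &= \log_2 N - \bigl[\, p \log_2 k + (1 - p) \log_2 (N - k) \,\bigr] \\
               &= -p \log_2 p - (1 - p) \log_2 (1 - p)
\end{align*}
```

Under these assumptions the expected gain is the binary entropy of p, which peaks at exactly 1 bit when p = 1/2. For N = 8, a 4/4 split gives 1 bit, a 2/6 split about 0.81 bits, and a 1/7 split about 0.54 bits.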
The framework uses three cooperating roles. The Seeker generates candidate yes/no questions and selects the one with highest expected information gain. The Oracle (the user, in our case) answers truthfully. The Pruner eliminates hypotheses inconsistent with the answer, shrinking the candidate space. The key behavioral finding: reasoning through candidates explicitly (chain-of-thought) dramatically improves question quality -- models with CoT achieved 0.93 win rates versus 0.10 without it in partially observable settings, where the questioner cannot see the full hypothesis space and must infer it from conversation history alone.
A practical takeaway for coding agents: generate multiple candidate questions before selecting one, and explicitly reason about how many hypotheses each question would eliminate. Smaller models in the study compensated for limited capacity by exploring ~9.7 candidate questions per turn versus ~7.6 for larger models, but larger models selected candidates with higher information gain (IG). The lesson: always brainstorm several possible questions, then pick the one that most evenly partitions the remaining possibilities.
1. Enumerate the hypothesis space. Upon receiving an ambiguous request, list all plausible interpretations. Structure them hierarchically if possible (like the paper's Region > Country > State > City taxonomy). For a coding task, this might be: framework choice > architecture pattern > specific library > configuration variant.
2. Compute current entropy. Count the number of live hypotheses N. Your current uncertainty is log2(N) bits. This tells you roughly how many perfect binary questions you need (e.g., 16 hypotheses = 4 questions minimum).
3. Generate 5-10 candidate yes/no questions. Each question should be answerable with yes or no. Think about which dimension of the hypothesis space has the most variance -- target that dimension first. Prefer questions about categories (high-level splits) before specific instances.
4. Score each candidate by expected information gain. For each question, estimate how many hypotheses would remain under "yes" vs "no." The ideal question splits the space 50/50. Compute expected IG = H_before - (p_yes * H_after_yes + p_no * H_after_no), where p_yes and p_no are the fraction of hypotheses consistent with each answer.
5. Ask the highest-IG question. Present it clearly to the user as a yes/no question. Avoid compound questions -- one binary decision per turn.
6. Prune the hypothesis space. Based on the user's answer, eliminate all inconsistent hypotheses. Update N and recompute entropy.
7. Check termination condition. If only one hypothesis remains (or the remaining hypotheses are equivalent for implementation purposes), proceed to implementation. If entropy is still high, return to step 3.
8. Summarize what you've learned. Before implementing, state back to the user the specific interpretation you've converged on, giving them a chance to correct any misunderstanding.
9. Implement with confidence. With ambiguity resolved, proceed with the specific implementation. Reference the disambiguation decisions in code comments only where they affect non-obvious design choices.
10. Track cumulative information gain. If the user provides new requirements mid-implementation that reintroduce ambiguity, repeat the process on the new hypothesis space rather than guessing.
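The score, ask, prune loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a uniform prior over hypotheses, represents each question as a predicate over hypotheses, and simulates the user with an `oracle` callable. All names (`expected_ig`, `best_question`, `prune`, `disambiguate`) and the toy hypotheses are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, ...]
# A question maps a hypothesis to True if that hypothesis implies a "yes" answer.
Question = Callable[[Hypothesis], bool]


def entropy_bits(n: int) -> float:
    """Uncertainty over n equally likely hypotheses, in bits."""
    return math.log2(n) if n > 0 else 0.0


def expected_ig(hypotheses: List[Hypothesis], question: Question) -> float:
    """Expected IG = H_before - (p_yes * H_after_yes + p_no * H_after_no)."""
    n = len(hypotheses)
    if n == 0:
        return 0.0
    n_yes = sum(1 for h in hypotheses if question(h))
    n_no = n - n_yes
    p_yes, p_no = n_yes / n, n_no / n
    return entropy_bits(n) - (p_yes * entropy_bits(n_yes) + p_no * entropy_bits(n_no))


def best_question(hypotheses: List[Hypothesis], questions: Dict[str, Question]) -> str:
    """Pick the candidate with the highest expected IG (first one wins ties)."""
    return max(questions, key=lambda text: expected_ig(hypotheses, questions[text]))


def prune(hypotheses: List[Hypothesis], question: Question, answer_is_yes: bool) -> List[Hypothesis]:
    """Keep only hypotheses consistent with the answer."""
    return [h for h in hypotheses if question(h) == answer_is_yes]


def disambiguate(hypotheses, questions, oracle):
    """Ask, prune, and repeat until one hypothesis remains or no question helps."""
    live, asked = list(hypotheses), []
    while len(live) > 1:
        text = best_question(live, questions)
        if expected_ig(live, questions[text]) <= 1e-12:
            break  # no remaining question separates the live hypotheses
        live = prune(live, questions[text], oracle(text))
        asked.append(text)
    return live, asked


# Toy run: 4 made-up hypotheses, 2 made-up questions, and a simulated user.
hypotheses = [(m, s) for m in ("session", "jwt") for s in ("db", "memory")]
questions = {
    "Is it token-based (JWT)?": lambda h: h[0] == "jwt",
    "Is storage a database?": lambda h: h[1] == "db",
}
target = ("jwt", "db")
live, asked = disambiguate(hypotheses, questions, lambda text: questions[text](target))
print(live, len(asked))  # [('jwt', 'db')] 2
```

In practice the predicates are estimates made while reasoning, not exact functions, so the scores are best treated as rough rankings for choosing among brainstormed candidates.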
Example 1: Disambiguating an Authentication Feature Request
User: "Add authentication to my Express app"
Step 1 - Enumerate hypothesis space (8 candidates):
Auth method: [session-based, JWT, OAuth2, magic-link]
Storage: [database, in-memory]
= 8 combinations, H = 3.0 bits
Step 2 - Generate candidate questions, scored by IG:
Q1: "Do you need token-based auth (JWT/OAuth) rather than
session-based?" → splits 4/4 → IG ≈ 1.0 bit
Q2: "Do you need third-party login (Google, GitHub, etc.)?"
→ splits 2/6 → IG ≈ 0.81 bits
Q3: "Is this for an API (no browser sessions) or a web app
with browser sessions?" → splits 4/4 → IG ≈ 1.0 bit
Q4: "Do you need persistent sessions across server restarts?"
→ splits 4/4 → IG ≈ 1.0 bit
Step 3 - Ask Q3 (tied for highest, and most intuitive):
Claude: "Is this for a stateless API (consumed by mobile/SPA
clients) or a traditional web app with server-side sessions?"
User: "It's an API for a React frontend."
Step 4 - Prune: eliminate session-based and magic-link options → 4 remaining
[JWT+db, JWT+memory, OAuth2+db, OAuth2+memory], H = 2.0 bits
Step 5 - Next best question:
Claude: "Do users need to log in via third-party providers
like Google or GitHub?"
User: "No, just email and password."
Step 6 - Prune: eliminate OAuth2 → 2 remaining, H = 1.0 bit
Claude: "Should auth tokens survive server restarts (stored in
a database) or is in-memory sufficient for development?"
User: "Database, this is for production."
Step 7 - Converged: JWT + database storage for email/password auth.
Total questions: 3, Total IG: 3.0 bits (optimal for 8 hypotheses).
Step 8 - Confirm and implement:
Claude: "I'll implement JWT authentication with email/password,
storing refresh tokens in your database. I'll add login, register,
and token refresh endpoints to your Express app."
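The IG scores quoted for Q1 through Q4 in Example 1 can be reproduced with a short script. This is a sketch assuming a uniform prior over the 8 combinations. The predicates are one reading of which hypotheses answer "yes" to each question (for instance, treating JWT and OAuth2 as the stateless-API options, consistent with the pruning step), and `split_ig` is a hypothetical helper name.

```python
import math
from itertools import product

methods = ["session", "jwt", "oauth2", "magic-link"]
storage = ["database", "in-memory"]
hypotheses = list(product(methods, storage))  # 8 combinations, 3.0 bits


def split_ig(n_yes: int, n_total: int) -> float:
    """Information gain in bits of a yes/no split, assuming a uniform prior."""
    gain = math.log2(n_total)
    for count in (n_yes, n_total - n_yes):
        if count:  # an empty branch contributes nothing
            gain -= (count / n_total) * math.log2(count)
    return gain


# Assumed mapping from each question to the hypotheses that answer "yes".
questions = {
    "Q1 token-based (JWT/OAuth2)?": lambda h: h[0] in ("jwt", "oauth2"),
    "Q2 third-party login?": lambda h: h[0] == "oauth2",
    "Q3 stateless API?": lambda h: h[0] in ("jwt", "oauth2"),
    "Q4 persistent across restarts?": lambda h: h[1] == "database",
}

scores = {}
for text, is_yes in questions.items():
    n_yes = sum(1 for h in hypotheses if is_yes(h))
    scores[text] = split_ig(n_yes, len(hypotheses))
    print(f"{text}: {n_yes}/{len(hypotheses) - n_yes} split, IG = {scores[text]:.2f} bits")
```

Under this reading Q1, Q3, and Q4 each score 1.00 bit and Q2 scores 0.81 bits, matching the figures in the example.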
Example 2: Debugging an Ambiguous Bug Report
User: "The app crashes when users submit the form"
Step 1 - Enumerate hypothesis space:
Crash type: [uncaught exception, infinite loop, OOM, network timeout]
Trigger: [all inputs, specific input patterns, race condition, auth-expired]
Location: [client-side, server-side, database]
= 4 × 4 × 3 = 48 combinations, H ≈ 5.6 bits
Step 2 - Start with highest-level split:
Claude: "Does the crash happen on the client side (browser error/
white screen) or server side (5xx response)?"
User: "Server side, we get a 500 error."
→ Prune client-side hypotheses, ~16 remaining, H ≈ 4.0 bits
Claude: "Does it happen for every form submission, or only for
certain inputs?"
User: "Only when the description field is very long."
→ Prune to input-dependent server errors, ~4 remaining, H ≈ 2.0 bits
Claude: "Does your server log show a database error or an
application-level exception?"
User: "Database error -- something about packet size."
→ Converged: database packet size limit on large text input.
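Result: 3 questions to pinpoint root cause from 48 possibilities. This is
fewer than the 6 strict yes/no questions that 5.6 bits would require,
because each answer carried more detail than a single bit.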
Claude proceeds to check max_allowed_packet or equivalent config
and the column type (VARCHAR vs TEXT).
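The entropy trajectory in Example 2 can be checked with a few lines of arithmetic. This sketch takes the live-hypothesis counts from the example (the 16 and 4 are the example's own approximations) and assumes a uniform prior. The variable names are illustrative.

```python
import math

# Live-hypothesis counts from the example: at the start, then after each answer.
counts = [48, 16, 4, 1]

bits = [math.log2(n) for n in counts]
# Realized information gain of each answer = entropy before - entropy after.
gains = [before - after for before, after in zip(bits, bits[1:])]

for turn, gain in enumerate(gains, start=1):
    print(f"answer {turn}: {counts[turn - 1]} -> {counts[turn]} hypotheses, gained {gain:.2f} bits")
print(f"total: {sum(gains):.2f} bits")
```

Each answer here removes more than 1 bit (about 1.58, 2.00, and 2.00). A realized gain can exceed the 1-bit ceiling on expected gain when the answer lands on the smaller branch of a split or volunteers extra detail, as the user's replies do in this example.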
Example 3: Narrowing Down a Refactoring Target
User: "Refactor the data layer to be cleaner"
Step 1 - Hypothesis space:
Target: [separate concerns, reduce duplication, improve types,
add caching, switch ORM, add repository pattern]
Scope: [single module, full data layer]
= 12 combinations, H ≈ 3.6 bits
Step 2 - Questions:
Claude: "Is the main pain point about code organization (files/
modules are tangled) or about runtime behavior (performance,
error handling)?"
User: "Organization -- everything is in one huge file."
Claude: "Do you want to keep the current data access approach
(raw queries / current ORM) and just restructure the files, or
also change how data is accessed?"
User: "Just restructure, keep the queries as-is."
→ Converged in 2 questions: split monolithic data file into
separate modules by domain concern, preserving existing queries.
Pedrozo, D. M., Soares, T. W. de L., & de Oliveira, B. L. M. (2026). Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games. arXiv:2601.17716. Key takeaway: Explicit chain-of-thought reasoning over candidate questions dramatically improves disambiguation efficiency (0.93 vs 0.10 win rate), and the optimal strategy is to generate multiple candidate questions, score them by expected information gain (Shannon entropy reduction), and select the one that most evenly splits the hypothesis space.