skills/craft-calibrated-reasoning-answer-faithful/SKILL.md
Apply CRAFT (Calibrated Reasoning with Answer-Faithful Traces) for multi-hop question answering with verified reasoning chains. Use when: 'answer a complex question from multiple documents', 'build a faithful multi-hop QA pipeline', 'reason over retrieved passages with citations', 'verify reasoning chain faithfulness', 'structured RAG with auditable traces', 'multi-step question decomposition with evidence grounding'
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills craft-calibrated-reasoning-answer-faithful
This skill enables Claude to perform multi-hop question answering over retrieved documents using structured, auditable reasoning traces inspired by the CRAFT framework. Instead of generating a flat chain-of-thought that may hallucinate or contradict its own evidence, Claude decomposes complex questions into sub-questions, declares which documents it will cite, reasons step-by-step with explicit source attribution, and then extracts a final answer — all within a structured XML trace that can be mechanically verified for faithfulness.
CRAFT addresses three failure modes in RAG-based multi-hop QA: reasoning collapse (where complex multi-hop chains degrade under noisy retrieval), reasoning-answer inconsistency (where the model produces a correct answer but its stated reasoning doesn't actually support it), and loss of format control (where free-form chain-of-thought drifts from the required structure).
The core insight is a structured XML trace with four components: <plan> (sub-question decomposition), <gold_docs> (pre-declared evidence boundary), <reason> (step-by-step reasoning with document citations), and <answer> (final extracted answer). This structure isn't decorative — each component enables a specific faithfulness audit. The plan-to-reason check verifies reasoning follows the declared sub-questions. The gold_docs-to-reason check ensures all citations fall within the declared evidence set. The reason-to-answer check confirms the answer is entailed by the reasoning. The grounding check verifies each claim is supported by the cited document text.
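The four-component trace described above can be split mechanically before any audit runs. The following is a minimal sketch, not the paper's implementation: the `Trace` dataclass and `parse_trace` function are hypothetical names introduced here, and the parser assumes each tag appears exactly once and that `<gold_docs>` holds integer indices.

```python
import re
from dataclasses import dataclass

TAGS = ("plan", "gold_docs", "reason", "answer")


@dataclass
class Trace:
    plan: list[str]      # one entry per non-empty line of <plan>
    gold_docs: set[int]  # declared evidence indices
    reason: str          # raw reasoning text, citations included
    answer: str          # final extracted answer


def parse_trace(text: str) -> Trace:
    """Split a trace into its four components; raise ValueError on a format violation."""
    parts = {}
    for tag in TAGS:
        found = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if len(found) != 1:
            raise ValueError(f"expected exactly one <{tag}> block, found {len(found)}")
        parts[tag] = found[0].strip()
    plan = [line.strip() for line in parts["plan"].splitlines() if line.strip()]
    gold = {int(n) for n in re.findall(r"\d+", parts["gold_docs"])}
    return Trace(plan, gold, parts["reason"], parts["answer"])
```

Treating a missing or duplicated tag as an error means a format violation surfaces before the faithfulness checks are attempted.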
In the original paper, these audits serve as reward signals for GRPO-based reinforcement learning. For inference-time application, we use them as self-verification steps: generate the trace, then audit each dimension, and regenerate or flag components that fail consistency checks.
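The generate, audit, regenerate-or-flag cycle can be sketched as a small loop. This is an illustration under stated assumptions: `generate` stands in for whatever model call produces a trace, each entry in `audits` is a hypothetical callable that returns a failure message or `None`, and `max_attempts` is an arbitrary cap chosen for this sketch.

```python
from typing import Callable, Optional

# An audit inspects a trace and returns a failure message, or None if it passes.
Audit = Callable[[str], Optional[str]]


def answer_with_self_check(
    generate: Callable[[str], str],
    audits: list[Audit],
    question: str,
    max_attempts: int = 3,
) -> tuple[str, list[str]]:
    """Generate a trace, audit it, and regenerate with feedback until the audits pass."""
    prompt = question
    trace, failures = "", []
    for _ in range(max_attempts):
        trace = generate(prompt)
        failures = [msg for audit in audits if (msg := audit(trace)) is not None]
        if not failures:
            return trace, []
        # Feed the failed checks back so the next attempt can repair them.
        prompt = question + "\nFix these audit failures:\n" + "\n".join(failures)
    # Attempts exhausted: the caller should disclose the remaining failures.
    return trace, failures
```

Returning the unresolved failures, rather than raising, leaves the caller free to flag the inconsistency to the user.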
1. Receive the question and document set. Collect the user's question q and retrieved documents D = {d_1, d_2, ..., d_K}. Number each document with an index for citation tracking.
2. Decompose into sub-questions. Break the multi-hop question into an ordered sequence of sub-questions, each addressing one reasoning hop. Output these in a <plan> block. For a 2-hop question, this typically yields 2 sub-questions; for bridge-type questions, the answer to sub-question 1 feeds into sub-question 2.
3. Declare the evidence boundary. Before reasoning, scan the documents and pre-declare which document indices contain supporting evidence in a <gold_docs> block. This forces explicit commitment to an evidence set before reasoning begins, preventing post-hoc rationalization.
4. Reason step-by-step with citations. In a <reason> block, address each sub-question in order. For each reasoning step, cite the specific document index (e.g., [doc 3]) and quote or paraphrase the relevant passage. Chain the intermediate conclusions: the output of step N becomes input context for step N+1.
5. Extract the final answer. In an <answer> block, state the final answer derived from the reasoning chain. This must be a direct consequence of the last reasoning step.
6. Audit: Plan-to-Reason alignment. Verify that every sub-question from the plan is addressed in the reasoning, in order, and no reasoning steps exist that don't correspond to a planned sub-question.
7. Audit: Citation-to-Evidence grounding. Verify that every document index cited in <reason> appears in the <gold_docs> set, and that at least one citation exists. Flag any citation to a document not in the declared evidence set.
8. Audit: Reason-to-Answer entailment. Verify the final answer in <answer> logically follows from the conclusion of the reasoning chain. If the answer contradicts or is unsupported by the reasoning, flag it.
9. Audit: Claim-to-Source verification. For each factual claim in <reason>, verify it is actually stated or supported in the cited document text. Flag any claim that the cited document doesn't support.
10. Report results with confidence. Present the answer along with the structured trace. If any audit fails, either regenerate the failing component or explicitly disclose the inconsistency to the user.
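Of the steps above, document numbering and the citation-to-evidence audit are purely mechanical and can be sketched without a model in the loop. The helper names `number_documents` and `audit_citations` are hypothetical, and the citation pattern assumes citations look like `[3]` or `[doc 3]`, as in the examples in this document.

```python
import re

# Matches "[3]" and "[doc 3]"; the citation syntax is an assumption of this sketch.
CITATION = re.compile(r"\[(?:doc\s*)?(\d+)\]")


def number_documents(docs: list[str]) -> str:
    """Prefix each retrieved passage with a 1-based citation index."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(docs, start=1))


def audit_citations(reason: str, gold_docs: set[int], num_docs: int) -> list[str]:
    """Citation-to-evidence audit: return a list of problems, empty if the audit passes."""
    cited = {int(n) for n in CITATION.findall(reason)}
    problems = []
    if not cited:
        problems.append("no citations in <reason>")
    for idx in sorted(cited - gold_docs):
        problems.append(f"[{idx}] cited but not declared in <gold_docs>")
    for idx in sorted(i for i in cited | gold_docs if not 1 <= i <= num_docs):
        problems.append(f"[{idx}] does not exist in the document set")
    return problems
```

The plan-to-reason, reason-to-answer, and claim-to-source audits need semantic judgment, so they are left to the model's own verification pass rather than to string matching.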
Example 1: Bridge-type multi-hop question
User: Given these documents, answer: "What is the capital of the country where the creator of Python was born?"
Approach:
<plan>
Subquestion 1: Who created Python?
Subquestion 2: In which country was the creator of Python born?
Subquestion 3: What is the capital of that country?
</plan>
<gold_docs>[1, 2, 3]</gold_docs>
<reason>
From document [1]: Python was created by Guido van Rossum.
From document [2]: Guido van Rossum was born in Haarlem, Netherlands.
Therefore the country is the Netherlands.
From document [3]: The capital of the Netherlands is Amsterdam.
</reason>
<answer>Amsterdam</answer>
Audit results:
Example 2: Comparison-type question with distractors
User: From these passages, determine: "Are the directors of Inception and Interstellar the same person?"
Approach:
<plan>
Subquestion 1: Who directed Inception?
Subquestion 2: Who directed Interstellar?
Subquestion 3: Are they the same person?
</plan>
<gold_docs>[1, 3]</gold_docs>
<reason>
From document [1]: Inception was directed by Christopher Nolan.
From document [3]: Interstellar was directed by Christopher Nolan.
Both films were directed by the same person, Christopher Nolan.
</reason>
<answer>Yes, both Inception and Interstellar were directed by Christopher Nolan.</answer>
Audit results:
Example 3: Audit failure detection and recovery
User: "Which award did the university attended by the author of Harry Potter win in 2020?"
Initial (flawed) trace:
<plan>
Subquestion 1: Who authored Harry Potter?
Subquestion 2: Which university did they attend?
Subquestion 3: What award did that university win in 2020?
</plan>
<gold_docs>[1, 2, 4]</gold_docs>
<reason>
From document [1]: J.K. Rowling authored Harry Potter.
From document [2]: Rowling attended the University of Exeter.
From document [4]: The University of Exeter won a Queen's Anniversary Prize.
</reason>
<answer>Queen's Anniversary Prize in 2020</answer>
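Audit results: the trace fails the audit. The answer asserts "in 2020", but no reasoning step establishes a year, and document [4] dates the Queen's Anniversary Prize to 2015.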
Recovery: Disclose the inconsistency. The documents do not contain evidence of an award in 2020. Report: "Based on the available documents, I cannot find an award won by the University of Exeter in 2020. Document [4] mentions a Queen's Anniversary Prize in 2015. The question may reference information not present in the provided documents."
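The failure in this example has a simple surface signature: the answer contains a year that the reasoning never mentions. The following heuristic illustrates how that signature could be caught. It is a crude, hypothetical pre-check (the function name `unsupported_numbers` is invented here), not a substitute for the reason-to-answer entailment audit.

```python
import re

# Citation markers such as "[4]" or "[doc 4]" are stripped so that a document
# index is not mistaken for a number supported by the reasoning.
CITATION = re.compile(r"\[(?:doc\s*)?\d+\]")


def unsupported_numbers(reason: str, answer: str) -> set[str]:
    """Return numbers that appear in the answer but nowhere in the reasoning text."""
    reason_body = CITATION.sub("", reason)
    in_reason = set(re.findall(r"\d+", reason_body))
    in_answer = set(re.findall(r"\d+", answer))
    return in_answer - in_reason
```

Applied to the flawed trace above, the check flags `2020`, because the `<reason>` block never states a year.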
Do:
- Declare <gold_docs> before writing <reason>; this prevents post-hoc cherry-picking of evidence.

Avoid:
- Citing documents outside <gold_docs>. If you discover relevant evidence mid-reasoning, update the declaration first.
- <gold_docs> is the defense.

| Failure Mode | Detection | Recovery |
|---|---|---|
| Reasoning collapse (incoherent mid-chain) | Plan→Reason audit fails: reasoning steps don't follow sub-question order | Regenerate <reason> block, following <plan> step by step |
| Hallucinated citation | Citation index not in <gold_docs> or references non-existent document | Remove the claim; check if answer still holds without it |
| Answer not entailed | Reason→Answer audit fails | Re-derive answer strictly from final reasoning step; if no valid answer follows, report "insufficient evidence" |
| Unsupported claim | Claim→Source (grounding) audit fails for a specific claim | Remove or correct the claim; re-check downstream reasoning that depended on it |
| Format violation | XML tags missing or malformed | Regenerate the full trace; structured format is required for auditability |
| No evidence found | <gold_docs> is empty after scanning all documents | Report that the question cannot be answered from the provided documents |
Paper: CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering (Liu et al., 2026)
Key takeaway: The four-component XML trace (plan, gold_docs, reason, answer) combined with four-dimension faithfulness auditing (plan→reason, citation→evidence, reason→answer, claim→source) provides a mechanically verifiable structure that catches reasoning-answer inconsistencies — the most insidious failure mode where the answer is correct but the reasoning is fabricated.