skills/closing-reasoning-gaps-clinical/SKILL.md
Build systems that detect and fix reasoning gaps in LLM agents by comparing their chain-of-thought against reference reasoning, extracting structured discrepancies, and generating corrective instructions stored in a retrievable knowledge base. Use when: 'build a reasoning improvement pipeline', 'detect logic gaps in agent output', 'compare agent reasoning to expert reasoning', 'create a corrective knowledge base from reasoning errors', 'improve clinical decision support accuracy', 'patch reasoning with RAG-retrieved instructions'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills closing-reasoning-gaps-clinical
This skill enables Claude to implement the Differential Reasoning Learning (DRL) framework from Liu et al. (2026). DRL improves LLM-based agents — particularly in clinical and high-stakes domains — by systematically comparing an agent's chain-of-thought reasoning against reference rationales (expert-authored or from stronger models), extracting structured discrepancies as graph differences, converting those discrepancies into corrective natural-language instructions, and retrieving them at inference time via RAG to patch logic gaps. The technique is domain-adaptable: while the paper targets clinical reasoning, the graph-based discrepancy analysis applies to any scenario where reasoning fidelity matters.
DRL works in two stages. Stage 1 (Knowledge Mining) processes training examples: for each case, it extracts the agent's free-form chain-of-thought and a reference rationale, parses both into reasoning graphs — directed acyclic graphs (DAGs) where nodes are typed as Facts (observations with polarity: present/absent/uncertain), Hypotheses (conclusions with confidence levels), or Actions (interventions typed as TEST/TREAT/ASSESS/OBSERVE/PRESCRIBE), connected by typed edges (supports/contradicts/suggests_test). It then runs a clinically weighted graph edit distance (GED) analysis that decomposes discrepancies into three components: missing nodes (evidence the agent overlooked), hallucinated nodes (unsupported claims), and path discrepancies (broken or incorrect inference chains). Penalty weights escalate by clinical consequence: Facts=1.0x, Hypotheses=1.5x, Actions=2.0x. An LLM-as-a-judge performs semantic node alignment (handling paraphrases and equivalent medical concepts) before computing GED.
Each diagnosed discrepancy is fed to an Insight Generator that produces a structured corrective instruction containing: the error type, clinical context, what went wrong, why it matters, prevention steps, trigger keywords, and the underlying principle. These instructions populate the Differential Reasoning Knowledge Base (DR-KB).
Stage 2 (Inference) retrieves the top-k most relevant DR-KB instructions for a new query via BM25 over trigger keywords and context, injects them into the agent prompt, and generates a prediction. This is a training-free improvement — no fine-tuning required, just prompt augmentation. Optimal k=5 in the paper's experiments, with diminishing returns beyond that.
Define the reasoning domain and node taxonomy. Adapt the four node types (Facts, Hypotheses, Actions, Final) and three edge types (supports, contradicts, suggests_test) to your domain. For clinical: Facts are symptoms/vitals/labs; Hypotheses are diagnoses; Actions are interventions. For software debugging: Facts are error messages/logs; Hypotheses are root causes; Actions are fixes.
Collect reference rationales. Gather expert-authored reasoning traces, clinical guidelines, or outputs from a stronger model (e.g., GPT-4o reasoning on the same cases). These serve as the gold-standard reasoning graphs.
Generate agent chain-of-thought traces. Run your target agent on the same cases with explicit CoT prompting. Capture the full reasoning text.
Extract reasoning graphs from both traces. Use an LLM to parse each reasoning trace into a DAG with typed nodes and edges. Prompt the LLM to output structured JSON:
{
"nodes": [
{"id": "F1", "type": "fact", "content": "Patient has fever 39.2C", "polarity": "present"},
{"id": "H1", "type": "hypothesis", "content": "Bacterial pneumonia", "confidence": "high"},
{"id": "A1", "type": "action", "content": "Order chest X-ray", "action_type": "TEST"}
],
"edges": [
{"src": "F1", "dst": "H1", "type": "supports", "justification": "Fever suggests infection"}
]
}
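LLM-extracted JSON is not guaranteed to be well formed, so it can help to check it before alignment. The following is a minimal sketch, assuming the schema shown above; the helper name `validate_graph` and the specific checks (known types, no dangling edges, no cycles) are illustrative assumptions rather than something the paper prescribes.

```python
import json

NODE_TYPES = {"fact", "hypothesis", "action", "final"}
EDGE_TYPES = {"supports", "contradicts", "suggests_test"}


def validate_graph(raw: str) -> dict:
    """Parse an extracted reasoning graph and reject malformed output.

    Hypothetical helper: checks node and edge types, dangling edge
    endpoints, and cycles (the graph is expected to be a DAG).
    """
    graph = json.loads(raw)

    ids = set()
    for node in graph["nodes"]:
        if node["type"] not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node['type']}")
        if node["id"] in ids:
            raise ValueError(f"duplicate node id: {node['id']}")
        ids.add(node["id"])

    children = {node_id: [] for node_id in ids}
    for edge in graph["edges"]:
        if edge["type"] not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge['type']}")
        if edge["src"] not in ids or edge["dst"] not in ids:
            raise ValueError(f"edge references unknown node: {edge}")
        children[edge["src"]].append(edge["dst"])

    # Depth-first search; revisiting a node that is still "visiting"
    # means we followed a cycle back to it.
    state = {}

    def visit(node_id: str) -> None:
        if state.get(node_id) == "done":
            return
        if state.get(node_id) == "visiting":
            raise ValueError(f"cycle detected at node {node_id}")
        state[node_id] = "visiting"
        for child in children[node_id]:
            visit(child)
        state[node_id] = "done"

    for node_id in ids:
        visit(node_id)
    return graph
```

Rejecting a malformed graph early and re-prompting the extractor is usually simpler than handling bad structure in every later step.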
Align nodes semantically using LLM-as-a-judge. For each node in the reference graph, ask the LLM whether a semantically equivalent node exists in the agent graph. This handles paraphrases (e.g., "elevated WBC" vs "leukocytosis"). Output a mapping of aligned pairs and unmatched nodes on each side.
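The alignment step can be sketched as a greedy one-to-one matching with the judge passed in as a function. This is only an illustration: `align_nodes` and the `judge` callable are hypothetical names, the judge stands in for the actual LLM call, and restricting matches to nodes of the same type is an assumption made here, not a rule stated in the paper.

```python
from typing import Callable


def align_nodes(
    reference: list[dict],
    agent: list[dict],
    judge: Callable[[dict, dict], bool],
) -> dict:
    """Greedily align each reference node to at most one agent node.

    `judge(ref_node, agent_node)` is a placeholder for the LLM-as-a-judge
    call and should return True when the two nodes mean the same thing.
    """
    pairs = []
    used = set()          # agent node IDs that are already matched
    unmatched_ref = []

    for ref in reference:
        match = next(
            (
                cand for cand in agent
                if cand["id"] not in used
                and cand["type"] == ref["type"]  # assumption: same type only
                and judge(ref, cand)
            ),
            None,
        )
        if match is None:
            unmatched_ref.append(ref["id"])
        else:
            used.add(match["id"])
            pairs.append((ref["id"], match["id"]))

    unmatched_agent = [n["id"] for n in agent if n["id"] not in used]
    return {
        "pairs": pairs,
        "unmatched_reference": unmatched_ref,
        "unmatched_agent": unmatched_agent,
    }
```

Keeping the judge behind a plain callable makes it easy to swap in a cheap string-match stub for tests and a real model call in production.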
Compute weighted graph edit distance. From the alignment, classify discrepancies into three buckets:
d_miss: reference nodes absent from agent graph (weighted by type: Fact=1.0, Hypothesis=1.5, Action=2.0)
d_halluc: agent nodes with no reference counterpart (same weights)
d_path: edge-level differences (missing connections, reversed directions, wrong edge types)
Generate corrective instructions from discrepancies. Feed each discrepancy profile to an Insight Generator LLM with a structured output schema:
{
"title": "Missing differential for drug-induced hepatitis",
"error_type": "missing_hypothesis",
"domain": "hepatology",
"situation_context": "Patient with elevated ALT/AST and recent medication change",
"what_went_wrong": "Agent failed to consider drug-induced liver injury despite new statin",
"why_it_matters": "Drug-induced hepatitis is reversible if caught early; missing it leads to continued exposure",
"prevention_steps": ["Always check medication history when liver enzymes are elevated", "Consider drug-induced causes before infectious workup"],
"trigger_keywords": ["elevated ALT", "elevated AST", "new medication", "statin", "liver enzymes"],
"principle": "Medication review is mandatory in any hepatic enzyme elevation workup"
}
Store instructions in the DR-KB. Index each instruction by its trigger keywords and domain context. Use a BM25-compatible store (e.g., Elasticsearch, SQLite with FTS5, or an in-memory index).
At inference, retrieve top-k instructions and augment the prompt. For a new query, extract key clinical terms, retrieve the k=5 most relevant DR-KB instructions via BM25, and prepend them to the agent prompt as structured guidance:
## Reasoning Guidelines (retrieved from prior case analysis)
1. When liver enzymes are elevated with recent medication changes, always consider drug-induced hepatitis before infectious causes.
2. ...
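The retrieval and prompt-augmentation step could look roughly like the sketch below. It uses a small pure-Python Okapi BM25 so it runs without extra dependencies; a real deployment would more likely use one of the stores mentioned above. The names `BM25Index`, `instruction_text`, and `augment_prompt`, the `k1`/`b` defaults, and the choice to inject the `principle` field (falling back to `title`) are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Tiny in-memory Okapi BM25 index (k1 and b are common defaults)."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / max(len(docs), 1)
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        tf, length = self.tfs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            denom = tf[term] + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / denom
        return total

    def top_k(self, query: str, k: int = 5) -> list[int]:
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [i for _, i in ranked[:k]]


def instruction_text(ins: dict) -> str:
    """Text that gets indexed: trigger keywords plus situation context."""
    return " ".join(ins.get("trigger_keywords", [])) + " " + ins.get("situation_context", "")


def augment_prompt(query: str, instructions: list[dict], k: int = 5) -> str:
    """Prepend the top-k retrieved guidelines to the query."""
    index = BM25Index([instruction_text(i) for i in instructions])
    hits = [instructions[i] for i in index.top_k(query, k)]
    if not hits:
        return query  # nothing relevant retrieved; leave the prompt untouched
    lines = ["## Reasoning Guidelines (retrieved from prior case analysis)"]
    for n, ins in enumerate(hits, 1):
        lines.append(f"{n}. {ins.get('principle') or ins['title']}")
    return "\n".join(lines) + "\n\n" + query
```

Rebuilding the index on every call keeps the sketch short; with a DR-KB of any real size the index would be built once and reused.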
Evaluate both answer accuracy and reasoning fidelity. Run the augmented agent, extract its reasoning graph, and compute GED against reference. Track both final-answer accuracy and GED reduction over iterations.
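For tracking reasoning fidelity, the three discrepancy components can be computed directly from an alignment. The sketch below is one possible reading of the weighted GED described earlier, not the paper's exact formula: the type weights come from the text, while `weighted_ged`, the default weight of 1.0 for `final` nodes, and `path_weight` are assumptions. It takes aligned pairs as `(reference_id, agent_id)` tuples.

```python
# Type weights from the text; "final" is not given there, so it falls back to 1.0.
TYPE_WEIGHTS = {"fact": 1.0, "hypothesis": 1.5, "action": 2.0}


def weighted_ged(ref_graph: dict, agent_graph: dict,
                 pairs: list[tuple[str, str]], path_weight: float = 1.0) -> dict:
    """Decompose graph differences into d_miss, d_halluc, and d_path."""
    ref_types = {n["id"]: n["type"] for n in ref_graph["nodes"]}
    agent_types = {n["id"]: n["type"] for n in agent_graph["nodes"]}
    ref_to_agent = dict(pairs)
    matched_agent = set(ref_to_agent.values())

    # Reference nodes the agent never produced.
    d_miss = sum(TYPE_WEIGHTS.get(t, 1.0)
                 for i, t in ref_types.items() if i not in ref_to_agent)
    # Agent nodes with no reference counterpart.
    d_halluc = sum(TYPE_WEIGHTS.get(t, 1.0)
                   for i, t in agent_types.items() if i not in matched_agent)

    # Compare edges between aligned nodes only, after mapping reference
    # IDs into the agent's ID space. A reversed or mistyped edge shows up
    # as one missing plus one extra edge, so it counts twice here.
    ref_edges = {
        (ref_to_agent[e["src"]], ref_to_agent[e["dst"]], e["type"])
        for e in ref_graph["edges"]
        if e["src"] in ref_to_agent and e["dst"] in ref_to_agent
    }
    agent_edges = {
        (e["src"], e["dst"], e["type"])
        for e in agent_graph["edges"]
        if e["src"] in matched_agent and e["dst"] in matched_agent
    }
    d_path = path_weight * len(ref_edges ^ agent_edges)

    return {"d_miss": d_miss, "d_halluc": d_halluc, "d_path": d_path,
            "total": d_miss + d_halluc + d_path}
```

Logging the three components separately, rather than only the total, shows whether retrieved instructions are reducing missed evidence, hallucinated claims, or broken inference chains.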
Example 1: Building a Clinical QA Reasoning Improvement Pipeline
User: "I have a medical QA dataset with expert explanations. My LLM agent gets 60% accuracy. Build a pipeline to improve its reasoning."
Approach:
Output structure:
pipeline/
extract_graphs.py # LLM-based DAG extraction from CoT text
align_and_ged.py # Semantic node alignment + weighted GED computation
generate_instructions.py # Discrepancy-to-instruction conversion
dr_kb/
index.json # BM25-indexed instruction store
instructions/ # Individual instruction JSON files
inference.py # RAG retrieval + prompt augmentation + prediction
evaluate.py # Accuracy + GED metrics
Example 2: Debugging an Agent That Hallucinates Reasoning Steps
User: "My coding assistant sometimes invents plausible but wrong explanations for bugs. How do I systematically fix this?"
Approach:
d_halluc, missed evidence in d_miss
Output (corrective instruction):
{
"title": "Grounding NPE diagnosis in deserialization contexts",
"error_type": "hallucinated_hypothesis",
"domain": "java_debugging",
"what_went_wrong": "Agent attributed NPE to null user input, but actual cause was missing no-arg constructor required by Jackson",
"prevention_steps": [
"Check class constructors when NPE occurs during deserialization",
"Verify serialization framework requirements before blaming input data"
],
"trigger_keywords": ["NullPointerException", "deserialization", "Jackson", "ObjectMapper"]
}
Example 3: Implementing the Graph Extraction Prompt
User: "How do I prompt an LLM to extract a reasoning graph from free-text chain-of-thought?"
Approach — use this prompt template:
Extract a reasoning graph from the following chain-of-thought text.
Output a JSON object with:
- "nodes": array of objects with {id, type, content, metadata}
- type must be one of: "fact", "hypothesis", "action", "final"
- For facts: metadata includes "polarity" (present/absent/uncertain)
- For hypotheses: metadata includes "confidence" (high/medium/low)
- For actions: metadata includes "action_type" (TEST/TREAT/ASSESS/OBSERVE/PRESCRIBE)
- "edges": array of objects with {src, dst, type, justification}
- type must be one of: "supports", "contradicts", "suggests_test"
- src and dst are node IDs
Rules:
- Each distinct claim, observation, or conclusion is a separate node
- Edges must reflect the logical dependency stated or implied in the text
- Do not infer edges not supported by the text
- Use IDs like F1, F2 for facts; H1, H2 for hypotheses; A1, A2 for actions
Chain-of-thought text:
{cot_text}
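One practical detail when wiring this template into Python: it contains literal braces (for example `{id, type, content, metadata}`), so `str.format` would fail on it. A minimal sketch, assuming an abbreviated copy of the template and two hypothetical helpers, `build_prompt` and `parse_graph_reply`:

```python
import json

# Abbreviated stand-in for the full template above; note the literal braces.
TEMPLATE = (
    "Extract a reasoning graph from the following chain-of-thought text.\n"
    "Output a JSON object with:\n"
    '- "nodes": array of objects with {id, type, content, metadata}\n'
    '- "edges": array of objects with {src, dst, type, justification}\n'
    "Chain-of-thought text:\n"
    "{cot_text}"
)


def build_prompt(cot_text: str) -> str:
    # Substitute only the one placeholder; the other braces must survive as-is.
    return TEMPLATE.replace("{cot_text}", cot_text)


def parse_graph_reply(reply: str) -> dict:
    """Pull the JSON object out of a model reply that may include extra text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    graph = json.loads(reply[start:end + 1])
    if "nodes" not in graph or "edges" not in graph:
        raise ValueError("reply is missing 'nodes' or 'edges'")
    return graph
```

Slicing from the first `{` to the last `}` is a crude heuristic that tolerates surrounding prose or code fences; it will still raise if the model emits invalid JSON, which is a reasonable signal to retry the extraction.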
Paper: Liu et al., "Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning," arXiv:2602.09945v1 (2026). What to look for: Section 3 for the full DRL algorithm (graph extraction, GED computation, instruction generation), Section 4 for the DR-KB schema, and Section 5 for ablation results showing the impact of retrieval depth k and reference rationale quality.