skills/automated-rubrics-reliable-evaluation/SKILL.md
Generate fine-grained evaluation rubrics for medical dialogue systems using a retrieval-augmented multi-agent pipeline. Decomposes medical evidence into atomic facts, synthesizes them with interaction constraints, and produces weighted, auditable rubrics. Use when: 'evaluate medical chatbot responses', 'generate rubrics for clinical QA', 'build a medical LLM evaluation pipeline', 'score health dialogue quality', 'create automated clinical evaluation criteria', 'refine medical AI responses with rubric feedback'.
npx skillsauth add ndpvt-web/arxiv-claude-skills automated-rubrics-reliable-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build retrieval-augmented multi-agent pipelines that automatically generate instance-specific evaluation rubrics for medical dialogue systems. Rather than relying on generic metrics or expensive expert annotation, the technique retrieves authoritative medical evidence, decomposes it into atomic facts (positive assertions, contraindications, safety red flags), extracts interaction intent constraints from the user query, and synthesizes both tracks into weighted, verifiable evaluation criteria. The resulting rubrics can score LLM responses, discriminate between subtly different answer qualities, and guide targeted response refinement.
The core insight is dual-track constraint construction: medical evidence and user interaction intent are processed in parallel by specialized agents, then merged into a single rubric. The objective track retrieves content from authoritative sources (CDC, WHO, PubMed, Mayo Clinic, drug databases), synthesizes overlapping snippets, and decomposes them into three categories of atomic facts: positive facts (declarative assertions, dosage ranges, conditional logic), negative constraints (explicit prohibitions, contraindications), and safety red flags (emergency warnings). The subjective track extracts explicit instructions and implicit communication cues from the user query, identifying medically necessary but missing contextual variables (e.g., patient age, comorbidities not stated).
These two tracks feed into a Rubric Synthesis Agent that maps each atomic fact and interaction constraint to a structured criterion tuple (criterion_text, evaluation_axis, clinical_weight). Each criterion is assigned to one of five axes: accuracy (factual correctness, safety violations), completeness (topic coverage), context_awareness (clarifying questions for missing info), communication_quality (tone, empathy, clarity), and instruction_following (formatting constraints). Clinical weights range from -10 to +10, with the magnitude set by severity tier: safety-critical items get extreme weights (magnitude 8 to 10), completeness items get moderate weights (magnitude 4 to 7), and minor details get low weights (magnitude 1 to 3). Negative weights penalize harmful content.
An Auditing Agent then runs three-phase gap analysis: (1) scan source evidence for uncovered facts and generate missing criteria, (2) filter hallucinated or irrelevant criteria and validate that negative constraints are present, (3) merge fragmented criteria while enforcing a cap of 20 criteria per rubric and preserving all safety red flags. This audit loop is what prevents both hallucinated evaluation criteria and dangerous omissions.
Route the medical query to search: Implement a Routing Agent that transforms the user's medical query into 3-5 optimized search queries. Use a high-capacity model for intent identification and query generation, and a lightweight model for reranking retrieved results by source authority (prioritize CDC, WHO, PubMed over generic web results).
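The reranking by source authority can be illustrated with a small rule-based stand-in. This is only a sketch: the step above assigns reranking to a lightweight model, whereas the code below substitutes a fixed domain lookup. The `AUTHORITY` scores, the domain list, and the result dictionary shape are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical authority tiers. The text only says to prioritize
# CDC, WHO, and PubMed over generic web results.
AUTHORITY = {
    "cdc.gov": 3,
    "who.int": 3,
    "pubmed.ncbi.nlm.nih.gov": 3,
    "mayoclinic.org": 2,
}


def authority_score(url: str) -> int:
    """Return the tier of the URL's host, or 0 for unknown sources."""
    host = urlparse(url).hostname or ""
    for domain, score in AUTHORITY.items():
        if host == domain or host.endswith("." + domain):
            return score
    return 0


def rerank(results: list[dict], top_k: int = 5) -> list[dict]:
    """Sort by authority; sorted() is stable, so search rank breaks ties."""
    ordered = sorted(results, key=lambda r: authority_score(r["url"]), reverse=True)
    return ordered[:top_k]


results = [
    {"url": "https://example-health-blog.com/ibuprofen", "title": "blog"},
    {"url": "https://www.cdc.gov/some-page", "title": "cdc"},
    {"url": "https://www.mayoclinic.org/x", "title": "mayo"},
]
ranked = rerank(results)
```

A lookup table like this is easy to audit, but it cannot judge the quality of an individual page, which is presumably why the pipeline uses a model for this step.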
Retrieve and synthesize evidence: Fetch the top-5 results per search query from medical knowledge sources. Use an Evidence Synthesis Agent to cross-check overlapping claims, de-duplicate, and consolidate into a single evidence block per query. Extract text with a tool like Trafilatura to handle diverse HTML layouts.
Decompose evidence into atomic facts (Reference Board): Run a Medical Fact Agent that breaks the synthesized evidence into three categories:
positive_facts: Declarative assertions, quantitative data (dosages, thresholds), conditional logic ("if X then Y").
negative_constraints: Explicit prohibitions ("do not prescribe X with Y"), contraindications.
safety_red_flags: Emergency indicators ("seek immediate care if..."), critical alerts.
Apply query-aware filtering to retain only relevant positive facts, but always preserve all safety red flags and negative constraints regardless of query specificity.
Extract interaction intent constraints: Run an Interaction Intent Agent in parallel with the fact decomposition step. Parse the user query to extract explicit instructions (e.g., "explain in simple terms") and implicit communication cues (e.g., expressed anxiety suggesting an empathetic tone), and identify medically necessary but missing contextual variables (e.g., unstated age, allergies, current medications) that the response should ask about.
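The asymmetry in the Reference Board filtering (positive facts are filtered, safety items never are) can be sketched as follows. The `ReferenceBoard` class and `filter_board` function are hypothetical names, and plain keyword overlap is an assumed stand-in for whatever relevance judgment the Medical Fact Agent actually makes.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceBoard:
    positive_facts: list[str] = field(default_factory=list)
    negative_constraints: list[str] = field(default_factory=list)
    safety_red_flags: list[str] = field(default_factory=list)


def filter_board(board: ReferenceBoard, query_terms: list[str]) -> ReferenceBoard:
    """Keep positive facts that mention a query term; pass the rest through."""
    terms = {t.lower() for t in query_terms}
    kept = [f for f in board.positive_facts if any(t in f.lower() for t in terms)]
    # Negative constraints and red flags are copied unfiltered on purpose.
    return ReferenceBoard(
        positive_facts=kept,
        negative_constraints=list(board.negative_constraints),
        safety_red_flags=list(board.safety_red_flags),
    )


board = ReferenceBoard(
    positive_facts=[
        "Ibuprofen can blunt the effect of ACE inhibitors",
        "Acetaminophen is dosed by body weight in children",
    ],
    negative_constraints=["Do not combine NSAIDs with anticoagulants"],
    safety_red_flags=["Seek immediate care for swelling of the face or throat"],
)
filtered = filter_board(board, ["ibuprofen", "lisinopril"])
```

Here the pediatric dosing fact is dropped because it matches no query term, while the anticoagulant constraint and the red flag survive even though neither mentions the query.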
Synthesize the initial rubric: Feed both the Reference Board (atomic facts) and interaction constraints into a Rubric Synthesis Agent. For each fact or constraint, generate a criterion tuple:
{
"criterion": "Response must warn against combining ibuprofen with blood thinners",
"axis": "accuracy",
"weight": -9
}
Assign weights using clinical severity tiers: magnitude 8 to 10 for safety- or accuracy-critical items, 4 to 7 for completeness, and 1 to 3 for minor communication or formatting items. Use negative weights for criteria that penalize harmful content.
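A criterion tuple with its axis and weight constraints could be validated along these lines. This is a minimal sketch: the `Criterion` class, the `tier` labels, and the decision to reject a weight of 0 are assumptions, since the tiers described above only cover magnitudes 1 through 10.

```python
import json
from dataclasses import dataclass

AXES = (
    "accuracy",
    "completeness",
    "context_awareness",
    "communication_quality",
    "instruction_following",
)


@dataclass(frozen=True)
class Criterion:
    criterion: str
    axis: str
    weight: int

    def __post_init__(self):
        if self.axis not in AXES:
            raise ValueError(f"unknown axis: {self.axis!r}")
        # Assumption: 0 is rejected because no severity tier covers it.
        if not 1 <= abs(self.weight) <= 10:
            raise ValueError(f"weight magnitude must be 1-10, got {self.weight}")

    @property
    def tier(self) -> str:
        """Map the weight magnitude to a (hypothetically named) severity tier."""
        magnitude = abs(self.weight)
        if magnitude >= 8:
            return "critical"
        if magnitude >= 4:
            return "moderate"
        return "minor"


raw = (
    '{"criterion": "Response must warn against combining ibuprofen '
    'with blood thinners", "axis": "accuracy", "weight": -9}'
)
item = Criterion(**json.loads(raw))
print(item.tier)  # critical
```

Validating at construction time means a malformed criterion from the Rubric Synthesis Agent fails early instead of silently skewing a score later.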
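Audit and refine the rubric: Run a three-phase Auditing Agent: (1) scan the source evidence for uncovered facts and generate the missing criteria, (2) filter hallucinated or irrelevant criteria and confirm that negative constraints are present, (3) merge fragmented criteria, capping the rubric at 20 criteria while preserving all safety red flags.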
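The 20-criterion cap with red-flag preservation from the audit's final phase can be sketched as below. Note the substitution: the Auditing Agent merges fragmented criteria semantically, while this sketch simply truncates by weight magnitude. The `is_red_flag` key and the choice to keep every red flag even if that exceeds the cap are assumptions.

```python
def cap_rubric(criteria: list[dict], max_criteria: int = 20) -> list[dict]:
    """Keep all red-flag criteria, then fill remaining slots by |weight|."""
    protected = [c for c in criteria if c.get("is_red_flag")]
    others = [c for c in criteria if not c.get("is_red_flag")]
    # Highest-magnitude criteria survive first; sort is stable for ties.
    others.sort(key=lambda c: abs(c["weight"]), reverse=True)
    room = max(0, max_criteria - len(protected))
    return protected + others[:room]


# 23 ordinary criteria plus 2 low-weight red flags.
criteria = [
    {"criterion": f"detail {i}", "axis": "completeness", "weight": 1 + i % 7}
    for i in range(23)
]
criteria += [
    {"criterion": "flag A", "axis": "accuracy", "weight": 1, "is_red_flag": True},
    {"criterion": "flag B", "axis": "accuracy", "weight": -1, "is_red_flag": True},
]
capped = cap_rubric(criteria)
```

The two red flags carry the lowest magnitudes in this example, so a cap based on weight alone would have dropped them; protecting them first is what keeps the safety coverage intact.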
Score responses against the rubric: For each LLM response under evaluation, check every criterion. Use a structured judging protocol: run N=3 trials with order swapping (if comparing two responses) for 6 total runs, then decide by majority vote. Calculate a weighted sum of satisfied/violated criteria as the final score.
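The judging protocol above can be sketched in two small functions. The sign convention is an assumption, since the text does not spell it out: a criterion's weight is added whenever the judge marks it as triggered, so negative-weight criteria are taken to describe harmful behavior. The `judge` callable is hypothetical and stands in for an LLM judge that returns "first" or "second".

```python
from collections import Counter
from typing import Callable


def weighted_score(criteria: list[dict], triggered: list[bool]) -> int:
    """Sum the weights of every criterion the judge marked as triggered."""
    return sum(c["weight"] for c, hit in zip(criteria, triggered) if hit)


def pairwise_verdict(
    judge: Callable[[str, str], str], a: str, b: str, n_trials: int = 3
) -> str:
    """Run each trial in both orders (2 * n_trials votes), then take the majority."""
    votes = Counter()
    for _ in range(n_trials):
        votes["A" if judge(a, b) == "first" else "B"] += 1
        votes["A" if judge(b, a) == "second" else "B"] += 1  # swapped order
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"


criteria = [{"weight": 8}, {"weight": -10}, {"weight": 5}]
score = weighted_score(criteria, [True, False, True])


def length_judge(x: str, y: str) -> str:
    """Toy judge that prefers the longer response."""
    return "first" if len(x) >= len(y) else "second"


def biased_judge(x: str, y: str) -> str:
    """Toy judge that always picks whatever is shown first."""
    return "first"


verdict = pairwise_verdict(length_judge, "a long detailed answer", "short")
biased = pairwise_verdict(biased_judge, "a long detailed answer", "short")
```

The position-biased judge splits its six votes evenly and ends in a tie, which is the failure mode order swapping is meant to expose.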
Compute discriminative metrics: For paired evaluation (reference vs. candidate), calculate:
Generate refinement feedback (optional): Convert rubric violations into a structured Edit Plan:
{
"actions": [
{"type": "ADD", "priority": 1, "detail": "Include warning about renal impairment risk with NSAIDs in elderly patients"},
{"type": "REMOVE", "priority": 2, "detail": "Remove unsupported claim about herbal remedy efficacy"}
]
}
Feed this plan to a Refinement Agent that edits the response while strictly prohibited from introducing claims not in the evidence.
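One way to derive such an Edit Plan from judged criteria is sketched below. The mapping is an assumption, not something the text specifies: an unmet positive-weight criterion becomes an ADD, a triggered negative-weight criterion becomes a REMOVE, and priority follows descending weight magnitude. MODIFY actions are omitted because they need more context than a pass/fail flag provides.

```python
import json


def build_edit_plan(criteria: list[dict], triggered: list[bool]) -> dict:
    """Turn rubric violations into prioritized ADD/REMOVE actions."""
    violations = []
    for c, hit in zip(criteria, triggered):
        if c["weight"] > 0 and not hit:
            violations.append(("ADD", c))  # required content is missing
        elif c["weight"] < 0 and hit:
            violations.append(("REMOVE", c))  # harmful content is present
    # Most severe violations get the lowest priority number.
    violations.sort(key=lambda v: abs(v[1]["weight"]), reverse=True)
    return {
        "actions": [
            {"type": kind, "priority": rank, "detail": c["criterion"]}
            for rank, (kind, c) in enumerate(violations, start=1)
        ]
    }


criteria = [
    {"criterion": "Asks about current medication list", "weight": 6},
    {"criterion": "Claims an herbal remedy is proven effective", "weight": -10},
    {"criterion": "Uses plain language", "weight": 3},
]
plan = build_edit_plan(criteria, [False, True, True])
print(json.dumps(plan, indent=2))
```

Ordering by severity means the Refinement Agent addresses the unsupported claim before the missing clarifying question.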
Validate and report: Calculate Clinical Intent Alignment (CIA) against any available gold-standard rubrics: CIA = (matched keypoints / total keypoints). Report per-axis coverage breakdown and flag any axes with zero criteria as potential blind spots.
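The CIA ratio and the per-axis blind-spot check can be sketched as follows. How a gold keypoint is judged to "match" a generated criterion is not described here, so the `matches` callable is a hypothetical hook; the substring matcher in the example is only a toy.

```python
from typing import Callable

AXES = (
    "accuracy",
    "completeness",
    "context_awareness",
    "communication_quality",
    "instruction_following",
)


def cia(
    gold_keypoints: list[str],
    rubric_texts: list[str],
    matches: Callable[[str, str], bool],
) -> float:
    """CIA = matched keypoints / total keypoints."""
    if not gold_keypoints:
        raise ValueError("need at least one gold keypoint")
    matched = sum(
        1 for k in gold_keypoints if any(matches(k, r) for r in rubric_texts)
    )
    return matched / len(gold_keypoints)


def axis_coverage(criteria: list[dict]) -> tuple[dict, list[str]]:
    """Count criteria per axis and list axes with none (potential blind spots)."""
    counts = {axis: 0 for axis in AXES}
    for c in criteria:
        counts[c["axis"]] += 1
    blind_spots = [axis for axis in AXES if counts[axis] == 0]
    return counts, blind_spots


gold = ["renal risk", "physician consult", "emergency signs", "dose timing"]
rubric = [
    "Warns about renal risk in elderly patients",
    "Suggests a physician consult before combining",
    "Lists emergency signs requiring immediate care",
]
alignment = cia(gold, rubric, lambda k, r: k in r.lower())

counts, blind_spots = axis_coverage(
    [{"axis": "accuracy"}, {"axis": "accuracy"}, {"axis": "completeness"}]
)
```

In this toy run three of four keypoints are matched, and the three axes with no criteria are reported as blind spots.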
Example 1: Evaluating a medication interaction response
User: "Build an evaluation rubric for a chatbot answering: 'Can I take ibuprofen with my blood pressure medication lisinopril?'"
Approach:
Output rubric (abbreviated):
Criterion                                                      | Axis                  | Weight
---------------------------------------------------------------|-----------------------|-------
States ibuprofen can reduce lisinopril's BP-lowering effect    | accuracy              | +8
Warns about increased renal impairment risk                    | accuracy              | -9
Mentions occasional low-dose may be acceptable with MD consult | completeness          | +6
Asks about patient's kidney function status                    | context_awareness     | +5
Includes emergency signs requiring immediate care              | accuracy              | -8
Does NOT recommend chronic concurrent use as safe              | accuracy              | -10
Uses accessible language appropriate for patient query         | communication_quality | +3
Suggests consulting prescribing physician before combining     | completeness          | +7
Example 2: Discriminating between two chatbot responses about chest pain
User: "I have two chatbot responses to 'I'm having chest pain after exercise.' Score them with an automated rubric and tell me which is better."
Approach:
Output:
Response A score: 72/100 (missed: did not ask about pain radiation pattern, did not mention calling 911 for severe symptoms)
Response B score: 41/100 (missed: incorrectly stated exercise-induced chest pain is "usually nothing to worry about", omitted cardiac red flags)
Mean Score Delta: 31 points
Winner: Response A (preferred in all 6 judging runs)
Key discriminating criteria:
- "Must not dismiss chest pain as benign without ruling out cardiac causes" (weight: -10): A passed, B failed
- "Must list red flag symptoms requiring emergency care" (weight: -9): A passed, B failed
Example 3: Rubric-guided response refinement
User: "Here's a chatbot response about managing Type 2 diabetes. Use rubric-based feedback to improve it."
Approach:
Output:
Original score: 59/100
Edit Plan:
1. [ADD, priority 1] Include warning about hypoglycemia signs when on sulfonylureas
2. [ADD, priority 2] Ask about current HbA1c level and medication list
3. [MODIFY, priority 3] Replace "you should exercise daily" with evidence-based recommendation (150 min/week moderate activity per ADA guidelines)
4. [REMOVE, priority 4] Remove unsubstantiated claim about cinnamon supplements lowering blood sugar
Refined score: 68/100 (+9 points)
All safety red flags now covered. Context awareness improved from 2/5 to 4/5 criteria met.
Paper: "Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems" by Chen et al. (2026). arXiv:2601.15161. Focus on Section 3 (three-stage pipeline architecture), Section 4 (CIA metric and discriminative sensitivity), and Appendix B (agent prompt templates) for implementation details.