skills/agentic-ai-healthcare-medicine/SKILL.md
Design, evaluate, and improve LLM-based agentic systems for healthcare using a seven-dimensional taxonomy with 29 sub-dimensions. Triggers: 'build a healthcare AI agent', 'evaluate my medical agent', 'healthcare agent architecture review', 'audit agent capabilities for clinical use', 'design a multi-agent medical system', 'gap analysis for healthcare LLM agent'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

`npx skillsauth add ndpvt-web/arxiv-claude-skills agentic-ai-healthcare-medicine`
This skill enables Claude to architect, evaluate, and systematically improve LLM-based agentic systems for healthcare and medicine using a rigorous seven-dimensional taxonomy derived from an empirical review of 49 studies (Vatsal, Dubey & Singh, 2026). Rather than ad-hoc agent design, this approach maps every agent capability to one of 29 operational sub-dimensions across Cognitive Capabilities, Knowledge Management, Interaction Patterns, Adaptation & Learning, Safety & Ethics, Framework Typology, and Core Tasks & Subtasks — then uses quantitative benchmarks of capability prevalence to identify gaps, prioritize development, and avoid known architectural pitfalls.
The taxonomy organizes healthcare agent capabilities into seven dimensions with 29 sub-dimensions. Each sub-dimension has a three-level rubric: Fully Implemented (✓), Partially Implemented (Δ), and Not Implemented (✗), with precise criteria distinguishing each level. The empirical finding across 49 studies reveals stark asymmetries: retrieval-grounded capabilities dominate (External Knowledge Integration at 76% ✓, Multi-Agent Design at 82% ✓) while adaptation, safety, and action-oriented capabilities lag severely (Drift Detection at 96% ✗, Event-Triggered Activation at 92% ✗, Regulatory Compliance at 82% ✗).
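The three-level rubric lends itself to a small data model. The sketch below is illustrative only: `Rating`, `prevalence`, and the sample ratings are hypothetical names and made-up data, not the paper's tooling or results. It shows how per-study ratings for one sub-dimension could be turned into the ✓ / Δ / ✗ percentages quoted above.

```python
from collections import Counter
from enum import Enum


class Rating(Enum):
    """Three-level rubric from the taxonomy."""
    FULL = "✓"      # Fully Implemented
    PARTIAL = "Δ"   # Partially Implemented
    ABSENT = "✗"    # Not Implemented


def prevalence(ratings):
    """Return the share of each rating level as whole percentages."""
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    total = len(ratings)
    return {level: round(100 * counts[level] / total) for level in Rating}


# Made-up ratings for one sub-dimension across five hypothetical studies.
sample = [Rating.FULL, Rating.FULL, Rating.PARTIAL, Rating.ABSENT, Rating.ABSENT]
shares = prevalence(sample)
print(" / ".join(f"{shares[level]}% {level.value}" for level in Rating))
# prints: 40% ✓ / 20% Δ / 40% ✗
```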
The actionable insight is that most healthcare agents cluster in a "retrieval-advising" archetype: strong at ingesting knowledge and answering questions, weak at acting on decisions, adapting to distributional shifts, and satisfying regulatory requirements. A well-designed agent must consciously address the neglected dimensions. The taxonomy provides a checklist: if your agent scores ✗ on Treatment Planning, Safety Guardrails, or Human-in-the-Loop, treat those as empirically identified gaps that separate prototypes from production-grade systems, not as optional features.
The co-occurrence analysis adds further guidance: Multi-Agent Design pairs naturally with Conversational Mode; External Knowledge Integration pairs with Medical QA but rarely with Dynamic Updates (meaning RAG pipelines are typically static). These patterns reveal where architectural choices create downstream constraints.
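A co-occurrence count of this kind can be approximated by tallying which fully implemented capabilities appear together in the same system. The sketch below assumes a simple pairwise count; the study names and capability sets are invented for illustration and are not the paper's data, and `co_occurrence` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping: study -> sub-dimensions rated fully implemented (✓).
studies = {
    "study_a": {"Multi-Agent Design", "Conversational Mode",
                "External Knowledge Integration"},
    "study_b": {"Multi-Agent Design", "Conversational Mode"},
    "study_c": {"External Knowledge Integration", "Medical QA & Decision Support"},
    "study_d": {"External Knowledge Integration", "Medical QA & Decision Support"},
}


def co_occurrence(study_caps):
    """Count how often each pair of capabilities is present in the same study."""
    pairs = Counter()
    for caps in study_caps.values():
        # Sorting makes (a, b) and (b, a) collapse into one key.
        for a, b in combinations(sorted(caps), 2):
            pairs[(a, b)] += 1
    return pairs


pairs = co_occurrence(studies)
for (a, b), n in pairs.most_common(2):
    print(f"{a} + {b}: {n} studies")
```

Pairs that never appear together, such as External Knowledge Integration with Dynamic Updates in this toy data, simply have a count of zero, which is the signal for a systematically absent combination.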
| # | Dimension | Sub-Dimension | Benchmark (✓ / Δ / ✗) |
|---|-----------|---------------|------------------------|
| 1 | Cognitive Capabilities | Planning | 43% / 39% / 18% |
| 2 | | Perception (Input Processing) | 49% / 47% / 4% |
| 3 | | Action (Output & Execution) | 43% / 20% / 37% |
| 4 | | Meta-Capabilities | 33% / 37% / 30% |
| 5 | | Consistency & Conflict Resolution | 35% / 27% / 38% |
| 6 | Knowledge Management | External Knowledge Integration | 76% / 8% / 16% |
| 7 | | Memory Module | 45% / 49% / 6% |
| 8 | | Dynamic Updates & Forgetting | 2% / 51% / 47% |
| 9 | Interaction Patterns | Conversational Mode | 45% / 12% / 43% |
| 10 | | Event-Triggered Activation | 4% / 4% / 92% |
| 11 | | Human-in-the-Loop | 20% / 8% / 72% |
| 12 | | Error Recovery | 14% / 47% / 39% |
| 13 | Adaptation & Learning | Drift Detection & Mitigation | 0% / 4% / 96% |
| 14 | | Reinforcement-Based Adaptation | 24% / 6% / 70% |
| 15 | | Meta-Learning & Few-Shot | 35% / 2% / 63% |
| 16 | Safety & Ethics | Safety Guardrails & Adversarial Robustness | 10% / 37% / 53% |
| 17 | | Bias & Fairness | 16% / 39% / 45% |
| 18 | | Privacy-Preserving Mechanism | 18% / 29% / 53% |
| 19 | | Regulatory & Compliance Constraints | 12% / 6% / 82% |
| 20 | Framework Typology | Multi-Agent Design | 82% / 6% / 12% |
| 21 | | Centralized Orchestration | 45% / 39% / 16% |
| 22 | Core Tasks | Clinical Documentation & EHR Analysis | 47% / 29% / 24% |
| 23 | | Medical QA & Decision Support | 65% / 20% / 15% |
| 24 | | Triage & Differential Diagnosis | 39% / 31% / 30% |
| 25 | | Diagnostic Reasoning | 41% / 27% / 32% |
| 26 | | Treatment Planning & Prescription | 12% / 29% / 59% |
| 27 | | Drug Discovery & Clinical Trial Design | 18% / 10% / 72% |
| 28 | | Patient Interaction & Monitoring | 10% / 8% / 82% |
| 29 | | Benchmarking & Simulation | 12% / 6% / 82% |
Define the target Core Tasks — Select which of the 8 Core Task sub-dimensions the agent must address (e.g., Diagnostic Reasoning + Treatment Planning). Use the benchmark table above to understand baseline difficulty: Treatment Planning at 59% ✗ means expect significant engineering effort.
Select Framework Typology — Decide between multi-agent (specialized roles for retrieval, reasoning, safety checking) vs. monolithic agent. Multi-agent is dominant (82% ✓) for good reason: it enables modularity and redundancy. Design an explicit orchestration layer if choosing multi-agent — 39% of systems only partially implement orchestration.
Design the Cognitive Pipeline — For each agent, implement: (a) Planning — decompose clinical tasks into sub-goals with strategy comparison, not fixed workflows; (b) Perception — build encoders/parsers for each input modality (text, imaging, structured EHR); (c) Action — implement verified tool execution with precondition/postcondition checks, not just text generation; (d) Meta-Capabilities — add self-critique loops where the agent evaluates its own reasoning and flags uncertainty.
Build Knowledge Management — Implement a RAG pipeline with domain-specific medical knowledge bases (clinical guidelines, drug databases, ICD ontologies). Add a persistent memory module (episodic for patient history, semantic for domain knowledge). Critically, add dynamic update mechanisms — 47% of systems lack this, creating stale knowledge risk.
Wire Interaction Patterns — Implement conversational mode with session-scoped context. Add human-in-the-loop confirmation gates for high-stakes decisions (prescriptions, diagnoses). Build error recovery with transactional rollbacks and bounded retries. Consider event-triggered activation for monitoring use cases (92% absence means competitive advantage).
Layer Safety & Ethics — Implement multi-stage guardrails: input validation → reasoning audit → output filtering. Run stratified bias audits across demographic groups. Add privacy controls (role-based access, data retention schedules). Map to specific regulatory frameworks (HIPAA, GDPR) with documented consent flows.
Add Adaptation Mechanisms — Implement few-shot or in-context learning for new clinical scenarios. Add drift detection on incoming data distributions with automatic alerts. Consider RLHF or reward-based refinement from clinician feedback.
Score the system against all 29 sub-dimensions using the ✓/Δ/✗ rubric. Target ✓ on all sub-dimensions relevant to the deployment context. Flag any ✗ on Safety & Ethics sub-dimensions as blockers.
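Of the design steps above, drift detection is the one most systems skip, so a concrete starting point may help. The sketch below uses the Population Stability Index (PSI) on a single numeric feature. PSI is one common choice and is swapped in here for illustration; the source does not prescribe a method. The function names, the bin count, and the 0.2 alert threshold are assumptions to be tuned per deployment.

```python
import math


def population_stability_index(baseline, current, bins=5):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


def drift_alert(baseline, current, threshold=0.2):
    """Return True when PSI exceeds the (assumed) alert threshold."""
    return population_stability_index(baseline, current) > threshold


# Synthetic example: a lab value whose distribution shifts upward.
baseline = [x / 10 for x in range(100)]
shifted = [v + 4.0 for v in baseline]
print(drift_alert(baseline, baseline))  # False: identical distributions
print(drift_alert(baseline, shifted))   # True: most mass moved to the top bin
```

In practice the alert would feed the human-in-the-loop path rather than trigger automatic retraining.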
Collect implementation evidence — For each of the 29 sub-dimensions, gather concrete evidence from code, documentation, and test results.
Apply the rubric — Rate each sub-dimension as ✓ (end-to-end with demonstrated evidence), Δ (mechanism present but incomplete), or ✗ (absent or asserted without evidence). Be conservative: default to Δ when claims are implicit or simulation-only.
Generate the gap report — Compare ratings against the benchmark prevalence table. Highlight sub-dimensions rated ✗ where the benchmark shows >30% ✓ (the agent is behind the field). Flag sub-dimensions rated ✗ in Safety & Ethics regardless of benchmark.
Prioritize remediation — Rank gaps by clinical risk (Safety first), then by benchmark prevalence (catch up to field), then by co-occurrence dependencies (e.g., fixing Error Recovery unblocks Treatment Planning).
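The gap-report and prioritization steps above can be sketched as a small ranking function. This is a minimal illustration, assuming ratings are stored as the rubric symbols; `gap_report` and the sample `ratings` are hypothetical, the benchmark values are a subset copied from the table above, and co-occurrence dependencies are left out for brevity.

```python
# Share of studies rated ✓ per sub-dimension (subset of the benchmark table).
BENCHMARK_FULL = {
    "External Knowledge Integration": 76,
    "Medical QA & Decision Support": 65,
    "Planning": 43,
    "Human-in-the-Loop": 20,
    "Safety Guardrails & Adversarial Robustness": 10,
    "Drift Detection & Mitigation": 0,
}

SAFETY_AND_ETHICS = {
    "Safety Guardrails & Adversarial Robustness",
    "Bias & Fairness",
    "Privacy-Preserving Mechanism",
    "Regulatory & Compliance Constraints",
}


def gap_report(ratings):
    """Rank sub-dimensions rated ✗: safety first, then gaps where the field is ahead."""
    gaps = []
    for name, rating in ratings.items():
        if rating != "✗":
            continue
        if name in SAFETY_AND_ETHICS:
            gaps.append((0, name, "BLOCKER"))        # flagged regardless of benchmark
        elif BENCHMARK_FULL.get(name, 0) > 30:
            gaps.append((1, name, "BEHIND FIELD"))   # more than 30% of studies have it
        else:
            gaps.append((2, name, "GAP"))
    # Within a tier, higher benchmark prevalence comes first.
    return sorted(gaps, key=lambda g: (g[0], -BENCHMARK_FULL.get(g[1], 0)))


# Hypothetical audit result for a RAG chatbot.
ratings = {
    "Planning": "✗",
    "External Knowledge Integration": "✓",
    "Human-in-the-Loop": "✗",
    "Safety Guardrails & Adversarial Robustness": "✗",
    "Medical QA & Decision Support": "✗",
}
report = gap_report(ratings)
for _, name, label in report:
    print(f"[{label}] {name}")
```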
Example 1: Scaffold a Multi-Agent Diagnostic System
User: "Build me a multi-agent system that takes patient symptoms and lab results, generates differential diagnoses, and recommends next steps."
Approach:
Output structure:
```python
# agent_config.py
AGENTS = {
    "intake": {
        "role": "Parse patient data and retrieve relevant guidelines",
        "tools": ["ehr_parser", "rag_retriever", "lab_normalizer"],
        "perception": "multimodal",  # structured EHR + free text
    },
    "reasoner": {
        "role": "Generate ranked differential diagnoses with reasoning",
        "tools": ["knowledge_graph", "reasoning_chain"],
        "meta_capabilities": {
            "self_critique": True,
            "confidence_calibration": True,
            "uncertainty_threshold": 0.7,  # flag for human review
        },
    },
    "safety_checker": {
        "role": "Validate outputs against safety constraints",
        "tools": ["drug_interaction_db", "guideline_validator"],
        "guardrails": ["contraindication_check", "demographic_bias_audit"],
        "human_in_the_loop": {
            "trigger": "high_acuity_or_low_confidence",
            "gate_type": "blocking",
        },
    },
}

ORCHESTRATOR = {
    "routing": "sequential_with_gates",
    "error_recovery": {
        "rag_fallback": "parametric_with_confidence_penalty",
        "max_retries": 2,
    },
}
```
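One way to read `sequential_with_gates` and `max_retries` in the configuration above is sketched below. This is an assumption about what those settings could mean, not a documented orchestrator: `run_pipeline`, the stub stages, and the `needs_review` callback are hypothetical, and the 0.7 cutoff simply mirrors `uncertainty_threshold`. For simplicity the gate is checked after every stage rather than only in the safety checker.

```python
def run_pipeline(stages, payload, max_retries=2, needs_review=lambda result: False):
    """Run stages in order, retrying failures and stopping at a blocking review gate."""
    for name, stage in stages:
        for attempt in range(max_retries + 1):
            try:
                payload = stage(payload)
                break
            except RuntimeError:
                # Bounded retries: give up after the initial try plus max_retries.
                if attempt == max_retries:
                    return {"status": "failed", "stage": name, "payload": payload}
        if needs_review(payload):
            # Blocking gate: nothing downstream runs until a clinician signs off.
            return {"status": "awaiting_clinician", "stage": name, "payload": payload}
    return {"status": "completed", "stage": None, "payload": payload}


# Stub stages standing in for the intake, reasoner, and safety_checker agents.
def intake(p):
    return {**p, "parsed": True}


calls = {"n": 0}


def reasoner(p):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient retrieval failure")  # succeeds on retry
    return {**p, "confidence": 0.55}


def safety_checker(p):
    return {**p, "checked": True}


result = run_pipeline(
    [("intake", intake), ("reasoner", reasoner), ("safety_checker", safety_checker)],
    {"symptoms": ["fever"]},
    needs_review=lambda r: r.get("confidence", 1.0) < 0.7,
)
print(result["status"], "at", result["stage"])
```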
Example 2: Audit an Existing Medical QA Chatbot
User: "I have a medical chatbot that answers patient questions using RAG over clinical guidelines. Can you evaluate it against the taxonomy?"
Approach:
Output — Gap Report:
```
TAXONOMY AUDIT REPORT
=====================

COGNITIVE CAPABILITIES
  Planning:                        Δ (task breakdown exists, no strategy comparison)
  Perception:                      Δ (text only, no multimodal)
  Action:                          ✗ (text-only output, no tool execution)
  Meta-Capabilities:               ✗ (no self-critique or confidence calibration)
  Consistency & Conflict:          ✗ (no conflict detection across sources)

KNOWLEDGE MANAGEMENT
  External Knowledge Integration:  ✓ (RAG pipeline with clinical guidelines)
  Memory Module:                   Δ (conversation history only, no persistent memory)
  Dynamic Updates & Forgetting:    ✗ (static index, no temporal decay)

INTERACTION PATTERNS
  Conversational Mode:             ✓ (multi-turn with context)
  Event-Triggered Activation:      ✗ (user-initiated only)
  Human-in-the-Loop:               ✗ (no clinician review gates)
  Error Recovery:                  Δ (basic retry, no fallback strategy)

ADAPTATION & LEARNING
  Drift Detection:                 ✗ CRITICAL (no monitoring for guideline changes)
  Reinforcement-Based:             ✗
  Meta-Learning & Few-Shot:        ✗

SAFETY & ETHICS
  Safety Guardrails:               ✗ BLOCKER (no input/output filtering)
  Bias & Fairness:                 ✗ BLOCKER (no demographic audit)
  Privacy-Preserving:              Δ (basic deidentification only)
  Regulatory Compliance:           ✗ BLOCKER (no HIPAA/GDPR mapping)

FRAMEWORK TYPOLOGY
  Multi-Agent Design:              ✗ (monolithic)
  Centralized Orchestration:       ✗ (N/A for single agent)

PRIORITY REMEDIATION:
  1. [BLOCKER] Add safety guardrails: input sanitization + output filtering
  2. [BLOCKER] Implement bias audit across age/gender/race subgroups
  3. [BLOCKER] Map data handling to HIPAA requirements
  4. [HIGH] Add human-in-the-loop gates for clinical recommendations
  5. [HIGH] Implement confidence calibration and self-critique
  6. [MEDIUM] Add persistent memory for longitudinal patient context
  7. [MEDIUM] Implement dynamic knowledge base updates
```
Example 3: Add Safety Layer to Treatment Planning Agent
User: "My treatment planning agent generates prescription recommendations but has no safety checks. How do I add them?"
Approach:
Output:
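```python
class TreatmentSafetyPipeline:
    """Multi-stage safety verification for prescription recommendations.

    Maps to taxonomy: Safety Guardrails (✓), Human-in-the-Loop (✓),
    Error Recovery (✓), Bias & Fairness (Δ).
    """

    def __init__(self, drug_db, dosage_validator, bias_auditor, audit_logger,
                 confidence_threshold=0.8):
        # Collaborators are injected so each check can be swapped or mocked.
        # Their interfaces are assumed from how verify() calls them.
        self.drug_db = drug_db
        self.dosage_validator = dosage_validator
        self.bias_auditor = bias_auditor
        self.audit_logger = audit_logger
        self.confidence_threshold = confidence_threshold

    def block_with_explanation(self, interactions):
        # Placeholder: replace with your own rejection workflow.
        return {"status": "blocked", "reason": interactions}

    def route_to_clinician(self, recommendation, flags):
        # Placeholder: replace with your own escalation workflow.
        return {"status": "needs_review", "recommendation": recommendation,
                "flags": flags}

    def verify(self, recommendation, patient_context):
        # Stage 1: Drug interaction check (Safety Guardrails)
        interactions = self.drug_db.check_interactions(
            proposed=recommendation.medications,
            current=patient_context.current_medications,
        )
        if interactions.severity == "critical":
            return self.block_with_explanation(interactions)

        # Stage 2: Dosage validation against weight/age/renal function
        dosage_ok = self.dosage_validator.check(
            recommendation, patient_context.demographics
        )

        # Stage 3: Bias audit. Check whether the recommendation differs by
        # demographic group for the same clinical presentation.
        bias_flag = self.bias_auditor.check_demographic_parity(
            recommendation, patient_context
        )

        # Stage 4: Confidence gate. Route to human review if uncertain.
        if (recommendation.confidence < self.confidence_threshold
                or bias_flag or not dosage_ok):
            return self.route_to_clinician(
                recommendation,
                flags={"bias": bias_flag, "dosage": not dosage_ok},
            )

        # Stage 5: Regulatory logging (HIPAA audit trail).
        # Note: blocked and escalated recommendations return before this
        # point; log those paths too if your audit policy requires it.
        self.audit_logger.log_decision(
            recommendation, patient_context,
            rationale=recommendation.chain_of_thought,
        )
        return recommendation
```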
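The Stage 3 bias audit is the least specified part of the pipeline above. As a rough illustration of what a demographic parity check could compute, the sketch below compares recommendation rates across groups in a batch of past decisions. The `parity_gap` helper, the synthetic records, and the 0.1 tolerance are all assumptions; a real audit would also need to control for clinical presentation, as the Stage 3 comment notes.

```python
from collections import defaultdict


def parity_gap(records):
    """Return (largest difference in recommendation rate between groups, per-group rates).

    `records` is an iterable of (group, recommended) pairs, where
    `recommended` is True when the treatment was recommended.
    """
    totals, positives = defaultdict(int), defaultdict(int)
    for group, recommended in records:
        totals[group] += 1
        positives[group] += int(recommended)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Synthetic audit log: group A recommended 8 of 10 times, group B 5 of 10.
records = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 5 + [("B", False)] * 5
)
gap, rates = parity_gap(records)
bias_flag = gap > 0.1  # assumed tolerance; set this with clinical input
print(rates, bias_flag)
```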
Paper: Vatsal, Dubey & Singh (2026). "Agentic AI in Healthcare & Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation of LLM-based Agents." arXiv:2602.04813v1. https://arxiv.org/abs/2602.04813v1
What to look for: The full 29 sub-dimension definitions with ✓/Δ/✗ criteria (Section III), the per-study evaluation matrices (Tables I–VIII), and the co-occurrence analysis showing which capabilities cluster together and which are systematically absent.