skills/bioace-automated-framework-biomedical/SKILL.md
Evaluate biomedical QA outputs using the BioACE nugget-based framework: assess answer completeness, correctness, precision, recall, and citation quality against ground-truth nuggets.
Trigger phrases:
- "evaluate biomedical answers"
- "check citation quality for medical QA"
- "nugget-based evaluation of RAG output"
- "assess completeness and correctness of biomedical text"
- "verify biomedical citations with NLI"
- "score biomedical question answering output"
npx skillsauth add ndpvt-web/arxiv-claude-skills bioace-automated-framework-biomedical
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the BioACE evaluation framework to assess the quality of biomedical question-answering (QA) outputs. BioACE decomposes evaluation into two axes — answer quality (measured via nugget-based completeness, correctness, precision, and recall) and citation quality (measured via NLI-based and LLM-based entailment checking). When a user needs to evaluate whether a biomedical answer is factually complete, correctly grounded, and properly cited against source literature, this skill provides the structured methodology to do so.
Nugget-based answer evaluation. BioACE decomposes reference (gold-standard) answers into discrete atomic facts called "nuggets" — each nugget is a single verifiable claim or piece of information (e.g., "Metformin is a first-line treatment for type 2 diabetes"). The system then checks each nugget against the generated answer to compute four metrics: completeness (proportion of reference nuggets covered), correctness (factual accuracy of matched content), precision (ratio of correct claims to total claims in the generated answer), and recall (proportion of required nuggets successfully captured). This nugget decomposition is what makes the evaluation granular rather than relying on coarse text-similarity scores like ROUGE or BERTScore.
Citation entailment verification. For each citation attached to a claim in the generated answer, BioACE checks whether the cited passage actually entails the claim. This is done using Natural Language Inference (NLI) models — specifically biomedical-domain NLI models like BioNLI and SciNLI, which outperform general-domain alternatives (e.g., RoBERTa-MNLI) on this task. The framework also supports LLM-based entailment as an alternative. A citation is scored as "supporting" only if the NLI model classifies the relationship between the cited passage and the claim as entailment (not contradiction or neutral).
Why this matters. Standard text-similarity metrics fail in the biomedical domain because a semantically similar but factually incorrect answer can score highly. BioACE's nugget decomposition catches partial answers and hallucinated content that surface-level metrics miss. The citation verification layer adds a second check: even if an answer is correct, unsupported or fabricated citations undermine trust.
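To make the point about surface-level metrics concrete, here is a minimal sketch of a unigram-overlap F1 score applied to a factually inverted claim. The `token_f1` helper and the "Drug X" sentences are illustrative assumptions, not part of BioACE, and this simple overlap measure stands in for ROUGE-style metrics rather than reproducing any of them exactly.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two strings (lowercased, whitespace-split)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection: a shared token counts as often as it
    # appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: the candidate reverses the meaning of the reference.
reference = "Drug X increases the risk of bleeding"
inverted = "Drug X decreases the risk of bleeding"
score = token_f1(reference, inverted)
print(f"token F1 = {score:.2f}")  # 6 of 7 tokens overlap, so F1 = 6/7
```

Six of the seven tokens are shared, so the inverted claim scores about 0.86 even though it states the opposite of the reference. A nugget-level correctness check is meant to catch exactly this case.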
Decompose the reference answer into nuggets. Break the gold-standard answer into atomic, independently verifiable facts. Each nugget should be a single claim — e.g., "Drug X inhibits enzyme Y" rather than a compound sentence with multiple facts. Store nuggets as a JSON array.
Decompose the generated answer into claim segments. Parse the system-generated answer into individual claim segments, preserving any citation markers (e.g., [1], [PMID:12345]) attached to each segment.
Match generated claims to reference nuggets. For each reference nugget, determine whether the generated answer contains a semantically equivalent claim. Use sentence embeddings (e.g., all-MiniLM-L6-v2, all-mpnet-base-v2, or sup-simcse-roberta-large) to compute similarity, with a threshold for matching (typically cosine similarity >= 0.75).
Score answer completeness. Calculate completeness = number_of_matched_nuggets / total_reference_nuggets. A score of 1.0 means every required fact is present in the generated answer.
Score answer correctness. For each matched claim, verify factual accuracy — check that the generated claim does not distort or contradict the nugget it matches. Flag claims that are semantically close but factually inverted (e.g., "increases risk" vs. "decreases risk"). Score correctness = number_of_correct_matches / number_of_matched_nuggets.
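Compute precision and recall. Calculate precision = number_of_correct_claims / total_generated_claims, where a correct claim is a generated claim that correctly matches at least one reference nugget, and recall = number_of_correct_matches / total_reference_nuggets. Precision counts claims and recall counts nuggets, so a single claim that covers two nuggets adds one to the precision numerator and two to the recall numerator. Together these give a balanced view of over-generation (low precision) vs. under-coverage (low recall).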
Extract citation-claim pairs. For each citation in the generated answer, extract the (claim_text, cited_passage_text) pair. If the cited passage must be retrieved (e.g., from PubMed via PMID), fetch it first.
Run NLI-based citation verification. For each (claim, cited_passage) pair, classify the relationship using a biomedical NLI model. Prefer BioNLI or SciNLI over general-domain models. Label each citation as entailment (supports), contradiction (refutes), or neutral (irrelevant).
Compute citation scores. Calculate citation_precision = entailment_citations / total_citations and optionally citation_recall = claims_with_at_least_one_supporting_citation / total_claims_requiring_citation.
Produce a structured evaluation report. Aggregate all scores into a single report with per-question and overall metrics, including specific nuggets missed, incorrect claims, and unsupported citations.
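The scoring steps above (completeness, correctness, precision, recall) can be sketched as a small function over per-nugget match records. This is a sketch under stated assumptions, not the BioACE reference implementation: `NuggetMatch` and `score_answer` are hypothetical names, the "Drug X" nuggets are made up, and precision is computed over distinct generated claims so that one claim covering two nuggets is counted once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NuggetMatch:
    """Match outcome for one reference nugget (hypothetical record type)."""
    nugget: str
    claim_index: Optional[int] = None  # index of the matching generated claim, None if missed
    is_correct: bool = False           # True if that claim agrees with the nugget


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def score_answer(matches: list[NuggetMatch], total_generated_claims: int) -> dict:
    matched = [m for m in matches if m.claim_index is not None]
    correct = [m for m in matched if m.is_correct]
    # Assumption: precision counts distinct claims, so a claim that
    # correctly covers two nuggets contributes one to the numerator.
    correct_claims = {m.claim_index for m in correct}
    return {
        "completeness": _ratio(len(matched), len(matches)),
        "correctness": _ratio(len(correct), len(matched)),
        "precision": _ratio(len(correct_claims), total_generated_claims),
        "recall": _ratio(len(correct), len(matches)),
    }


# Made-up data: four nuggets, three generated claims.
matches = [
    NuggetMatch("Drug X inhibits enzyme Y", claim_index=0, is_correct=True),
    NuggetMatch("Drug X is taken orally", claim_index=0, is_correct=True),
    NuggetMatch("Drug X increases bleeding risk", claim_index=1, is_correct=False),  # inverted claim
    NuggetMatch("Drug X is contraindicated in pregnancy"),  # missed
]
scores = score_answer(matches, total_generated_claims=3)
print(scores)
```

With this data, three of four nuggets are matched (completeness 0.75), two of those three matches are correct (correctness 0.67), one of three generated claims is a correct claim (precision 0.33), and two of four nuggets are correctly covered (recall 0.50).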
Example 1: Evaluating a RAG system's answer about metformin
User: "I have a biomedical QA system that answered the question 'What are the side effects of metformin?' — evaluate its output against the reference answer."
Approach:
```json
{
  "question": "What are the side effects of metformin?",
  "reference_nuggets": [
    "Gastrointestinal symptoms are the most common side effects of metformin",
    "Metformin can cause lactic acidosis in rare cases",
    "Metformin may lead to vitamin B12 deficiency with long-term use",
    "Diarrhea, nausea, and abdominal pain are frequent GI side effects",
    "Lactic acidosis risk increases with renal impairment"
  ]
}
```

```json
{
  "generated_claims": [
    {"text": "Metformin commonly causes gastrointestinal issues such as diarrhea and nausea", "citations": ["PMID:19246357"]},
    {"text": "Lactic acidosis is a serious but rare complication", "citations": ["PMID:20536313"]},
    {"text": "Metformin can cause weight gain in some patients", "citations": []}
  ]
}
```
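A `generated_claims` structure like the one above could be produced from raw answer text with a small parser. The following is a rough sketch, assuming citation markers look like `[1]` or `[PMID:12345]` and appear before the sentence-final punctuation. `CITATION_RE` and `split_claims` are hypothetical names, and the naive sentence split will mis-handle abbreviations such as "e.g.".

```python
import re

# Matches bracketed markers such as [1] or [PMID:19246357].
CITATION_RE = re.compile(r"\[(PMID:\d+|\d+)\]")


def split_claims(answer: str) -> list[dict]:
    """Split an answer into claim segments, keeping each segment's citations."""
    claims = []
    # Naive sentence split: break where ., ! or ? is followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        citations = CITATION_RE.findall(sentence)
        text = CITATION_RE.sub("", sentence)
        # Tidy the gap left by a removed marker, then drop final punctuation.
        text = re.sub(r"\s+([.!?,;])", r"\1", text)
        text = re.sub(r"\s{2,}", " ", text).strip().rstrip(".!?").strip()
        if text:
            claims.append({"text": text, "citations": citations})
    return claims


answer = (
    "Metformin commonly causes gastrointestinal issues such as diarrhea "
    "and nausea [PMID:19246357]. Lactic acidosis is a serious but rare "
    "complication [PMID:20536313]. Metformin can cause weight gain in some patients."
)
claims = split_claims(answer)
for claim in claims:
    print(claim)
```

Sentence-level splitting is only an approximation of claim segmentation: a sentence holding two facts would still need to be broken into separate claims before matching.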
Answer Evaluation:
Matched nuggets: 3/5 (nuggets 1, 2, 4 matched; 3, 5 missed)
Correct matches: 3/3 (no matched claim distorts or contradicts its nugget)
Unmatched claims: 1/3 (the weight-gain claim corresponds to no reference nugget and is also incorrect: metformin typically causes weight loss)
Completeness: 0.60
Correctness: 1.00
Precision: 0.67 (2 of 3 generated claims correctly match a nugget)
Recall: 0.60 (3 correctly matched nuggets out of 5 reference nuggets)
Citation Evaluation:
PMID:19246357 → claim "GI issues": entailment ✓
PMID:20536313 → claim "lactic acidosis": entailment ✓
Claim "weight gain": no citation provided ✗
Citation precision: 2/2 = 1.00
Citation coverage: 2/3 claims cited = 0.67
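The citation figures in this example can be reproduced with a few lines of Python. This is a sketch only: `score_citations` is a hypothetical helper, the entailment labels are copied from the listing above rather than computed by an NLI model, and every claim is assumed to require a citation.

```python
def score_citations(claims: list[dict]) -> dict:
    """Aggregate citation scores from claims whose citations carry NLI labels."""
    total_citations = 0
    supporting_citations = 0
    cited_claims = 0
    supported_claims = 0
    for claim in claims:
        labels = [c["label"] for c in claim["citations"]]
        total_citations += len(labels)
        supporting_citations += labels.count("entailment")
        if labels:
            cited_claims += 1
        if "entailment" in labels:
            supported_claims += 1
    n_claims = len(claims)
    return {
        # Share of citations that entail their claim.
        "citation_precision": supporting_citations / total_citations if total_citations else 0.0,
        # Share of claims that carry at least one citation.
        "citation_coverage": cited_claims / n_claims if n_claims else 0.0,
        # Share of claims with at least one supporting citation
        # (assumes every claim requires a citation).
        "citation_recall": supported_claims / n_claims if n_claims else 0.0,
    }


example_claims = [
    {"text": "GI issues", "citations": [{"id": "PMID:19246357", "label": "entailment"}]},
    {"text": "lactic acidosis", "citations": [{"id": "PMID:20536313", "label": "entailment"}]},
    {"text": "weight gain", "citations": []},
]
result = score_citations(example_claims)
print({name: round(value, 2) for name, value in result.items()})
```

Coverage and recall coincide here because both cited claims are supported. They diverge as soon as a cited passage is labeled neutral or contradiction.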
Example 2: Building a citation verification module
User: "Help me write a Python function that checks whether PubMed citations actually support the claims they're attached to in a biomedical QA output."
Approach:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class CitationVerifier:
    # NOTE: the default checkpoint name comes from the original example and has
    # not been verified here; substitute any NLI sequence-classification model.
    def __init__(self, model_name="mosaicnlp/BioNLI"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        # Read the label order from the checkpoint instead of hard-coding it:
        # NLI models do not agree on which index means "entailment". If the
        # checkpoint only exposes generic names (LABEL_0, ...), map them manually.
        self.labels = [
            self.model.config.id2label[i].lower()
            for i in range(self.model.config.num_labels)
        ]

    def verify(self, claim: str, cited_passage: str) -> dict:
        """Check if a cited passage entails the claim."""
        # Premise first (cited passage), hypothesis second (claim).
        inputs = self.tokenizer(
            cited_passage, claim,
            return_tensors="pt", truncation=True, max_length=512
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        pred_label = self.labels[probs.argmax().item()]
        return {
            "label": pred_label,
            "confidence": probs.max().item(),
            "supports_claim": pred_label == "entailment"
        }

    def evaluate_citations(self, claim_citation_pairs: list[dict]) -> dict:
        """Evaluate all citation-claim pairs and return aggregate scores."""
        results = []
        for pair in claim_citation_pairs:
            result = self.verify(pair["claim"], pair["cited_passage"])
            result["claim"] = pair["claim"]
            results.append(result)
        supporting = sum(1 for r in results if r["supports_claim"])
        return {
            "per_citation": results,
            "citation_precision": supporting / len(results) if results else 0,
            "total_citations": len(results),
            "supporting_citations": supporting
        }
```
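The `cited_passage` values passed to the verifier have to be retrieved first when only a PMID is available. One possible sketch uses the NCBI E-utilities `efetch` endpoint with only the standard library. `normalize_pmid`, `build_efetch_url`, and `fetch_abstract` are hypothetical helper names. The plain-text response contains citation metadata along with the abstract, so some cleanup may be needed before passing it to an NLI model.

```python
import re
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def normalize_pmid(citation: str) -> str:
    """Turn 'PMID:19246357' or '19246357' into the bare numeric id."""
    match = re.fullmatch(r"(?:PMID:\s*)?(\d+)", citation.strip())
    if match is None:
        raise ValueError(f"Not a PMID citation: {citation!r}")
    return match.group(1)


def build_efetch_url(pmid: str) -> str:
    """Build an efetch URL that requests the record as plain text."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    return f"{EFETCH_URL}?{urllib.parse.urlencode(params)}"


def fetch_abstract(citation: str, timeout: float = 10.0) -> str:
    """Download the plain-text record for one PMID (performs a network call)."""
    url = build_efetch_url(normalize_pmid(citation))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; fetch_abstract does.
print(build_efetch_url(normalize_pmid("PMID:19246357")))
```

NCBI publishes request-rate limits for E-utilities, so a batch evaluation should throttle its calls and cache fetched records.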
Example 3: Implementing nugget-based completeness scoring
User: "Write code to score how complete a generated biomedical answer is against reference nuggets."
```python
import re

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def score_completeness(
    reference_nuggets: list[str],
    generated_answer: str,
    similarity_threshold: float = 0.75,
    model_name: str = "all-mpnet-base-v2",
) -> dict:
    """Score answer completeness using nugget matching."""
    # Split the generated answer into sentences as candidate claims. Splitting
    # only where punctuation is followed by whitespace keeps decimals such as
    # "0.5 mg" intact.
    generated_claims = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", generated_answer) if s.strip()
    ]
    # Nothing to compare: avoid calling cosine_similarity on empty input.
    if not reference_nuggets or not generated_claims:
        return {
            "completeness": 0.0,
            "matched": [],
            "missed": [{"nugget": n, "best_similarity": 0.0} for n in reference_nuggets],
            "total_nuggets": len(reference_nuggets),
            "matched_count": 0,
        }

    model = SentenceTransformer(model_name)
    nugget_embeddings = model.encode(reference_nuggets)
    claim_embeddings = model.encode(generated_claims)
    # sim_matrix[i, j] is the similarity between nugget i and claim j.
    sim_matrix = cosine_similarity(nugget_embeddings, claim_embeddings)

    matched_nuggets = []
    unmatched_nuggets = []
    for i, nugget in enumerate(reference_nuggets):
        best_match_idx = int(sim_matrix[i].argmax())
        max_sim = float(sim_matrix[i, best_match_idx])
        if max_sim >= similarity_threshold:
            matched_nuggets.append({
                "nugget": nugget,
                "matched_claim": generated_claims[best_match_idx],
                "similarity": max_sim,
            })
        else:
            unmatched_nuggets.append({"nugget": nugget, "best_similarity": max_sim})

    return {
        "completeness": len(matched_nuggets) / len(reference_nuggets),
        "matched": matched_nuggets,
        "missed": unmatched_nuggets,
        "total_nuggets": len(reference_nuggets),
        "matched_count": len(matched_nuggets),
    }
```
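A similarity threshold alone accepts any claim that is close enough to a nugget, so a factually inverted claim ("increases risk" vs. "decreases risk") may still be counted as matched. The sketch below is a deliberately crude keyword heuristic for flagging such pairs before they are scored as correct. It is an illustrative assumption, not the BioACE correctness check: the `OPPOSITES` and `NEGATIONS` word lists and the `looks_inverted` helper are hypothetical and far from exhaustive. An NLI model or an LLM judge would be a sturdier choice.

```python
import re

# Hypothetical, deliberately small lists of opposing direction words.
OPPOSITES = [
    ({"increase", "increases", "increased", "raises", "elevates"},
     {"decrease", "decreases", "decreased", "reduces", "lowers"}),
    ({"activates", "induces", "upregulates"},
     {"inhibits", "suppresses", "downregulates"}),
]
NEGATIONS = {"not", "no", "never", "without"}


def _tokens(text: str) -> set[str]:
    """Lowercased alphabetic tokens of a sentence."""
    return set(re.findall(r"[a-z]+", text.lower()))


def looks_inverted(nugget: str, claim: str) -> bool:
    """Flag a matched pair whose direction words or negation disagree."""
    nugget_tokens, claim_tokens = _tokens(nugget), _tokens(claim)
    for side_a, side_b in OPPOSITES:
        if (nugget_tokens & side_a and claim_tokens & side_b) or (
            nugget_tokens & side_b and claim_tokens & side_a
        ):
            return True
    # Flag the pair when exactly one side contains a negation word.
    return bool(nugget_tokens & NEGATIONS) != bool(claim_tokens & NEGATIONS)


print(looks_inverted("Drug X increases risk of bleeding",
                     "Drug X decreases risk of bleeding"))
print(looks_inverted("Drug X inhibits enzyme Y",
                     "Drug X inhibits enzyme Y strongly"))
```

A flagged pair is best treated as a candidate for review rather than an automatic error, since phrasing such as "does not reduce" can trip a keyword check in either direction.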
- …all-mpnet-base-v2 or sup-simcse-roberta-large) for nugget matching: they outperform token-overlap metrics like ROUGE for semantic matching.
- …roberta-large-mnli with a warning that biomedical accuracy may be degraded.
- …answer_eval.py and citation_eval.py.