skills/devtu-benchmark-harness/SKILL.md
Continuous improvement system for ToolUniverse tools, skills, and plugin. Run benchmarks, diagnose failures, route fixes to devtu skills, retest. Use after skill optimization, tool additions, or as regression check.
A 5-step feedback loop for improving ToolUniverse tools, skills, and plugin quality.
Note: This skill is dataset-agnostic. Per-benchmark score history, known-failing question IDs, and dataset-specific investigations belong in temp_docs_and_tests/benchmark_tracking/ (gitignored workfolder), NOT in this skill directory.
1. RUN benchmark → 2. ANALYZE results → 3. DIAGNOSE failures → 4. FIX via devtu skill → 5. RETEST → repeat
One command does steps 0 (memorization audit), 1 (build), 2 (run), 3 (analyze), 4 (diagnose + extract failures):
bash skills/devtu-benchmark-harness/scripts/run_harness_loop.sh --benchmark bixbench --n 20 --seed 42
# After reviewing diagnose.log and applying devtu skill fixes:
bash skills/devtu-benchmark-harness/scripts/run_harness_loop.sh --retest /path/to/failures.json
The script creates temp_docs_and_tests/benchmark_tracking/run_<TS>/ with results.json, analysis.log, diagnose.log, failures.json. Diagnose output lists each failure with the exact devtu skill to invoke — do NOT fix manually.
Before accepting any skill edit (from devtu-optimize-skills or manual), run:
python3 skills/devtu-benchmark-harness/scripts/check_memorization.py --all
Fails if any skill contains benchmark names, capsule UUIDs, bix-N question IDs, or known-to-be-GT specific numeric answers. This prevents overfitting the plugin to a single benchmark's answer key. Run in --strict mode to also flag specific gene names and dataset filenames (softer signal).
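The kind of scan described above can be sketched as follows. This is a minimal illustration, not the actual check_memorization.py: the regular expressions, the benchmark-name list, and the function name `find_memorization_hits` are all assumptions.

```python
import re

# Hypothetical patterns; the real check_memorization.py may use different ones.
PATTERNS = {
    "benchmark_name": re.compile(r"\b(bixbench|lab-bench)\b", re.IGNORECASE),
    "capsule_uuid": re.compile(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        re.IGNORECASE,
    ),
    "question_id": re.compile(r"\bbix-\d+\b", re.IGNORECASE),
}


def find_memorization_hits(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, kind, matched_text) for every suspicious token."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, kind, match.group(0)))
    return hits
```

A caller would run this over each skill's SKILL.md text and fail the check when the returned list is non-empty.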
bash scripts/build-plugin.sh # rebuild plugin with latest skills
# --benchmark: bixbench | lab-bench | custom
# --mode: plugin-only | baseline-only | comparison
# --n: number of questions
# --timeout: seconds per question
# --max-turns: agent turns per question
python skills/devtu-benchmark-harness/scripts/run_eval.py \
--benchmark bixbench \
--mode plugin-only \
--n 205 \
--timeout 1800 \
--max-turns 30
Options: --category DESeq2 (filter), --resume results.json (skip done), --guidance path.md (inject custom).
Skill auto-matching in interactive mode is variable — sometimes Claude reads the skill description but starts writing code before loading the skill body. To force the router's critical conventions into every request's system prompt (more reliable, measures the plugin's conventions as-designed rather than skill-routing-as-implemented):
APPEND_CONVENTIONS=1 python skills/evals/run_benchmark.py --benchmark bixbench --plugin-only
Use this mode when measuring the CORRECTNESS of the conventions (are they the right rules?). Use default mode when measuring the RELIABILITY of skill routing (does Claude actually invoke the skill?). The gap between these two numbers is the routing-reliability problem.
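The gap between the two numbers is a plain difference of accuracies. A minimal sketch, assuming both runs cover the same question set; the function name `routing_gap` is hypothetical and not part of the harness:

```python
def routing_gap(correct_with_conventions: int, correct_default: int, total: int) -> float:
    """Accuracy lost to unreliable skill routing, in percentage points.

    correct_with_conventions: questions answered correctly with APPEND_CONVENTIONS=1
    correct_default: questions answered correctly in default mode
    total: number of questions in each run (assumed identical for both runs)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round((correct_with_conventions - correct_default) / total * 100, 1)
```

A positive value means the conventions help when they are loaded, but routing fails to load them for some questions.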
Rscript skills/evals/install_r_packages.R # R packages
python3 skills/evals/bixbench/download_capsules.py # BixBench data (~5 GB)
| Benchmark | Questions | Tests | Data |
|-----------|----------|-------|------|
| lab-bench | 20 MCQ | Database lookup accuracy | skills/evals/lab-bench/questions.json |
| bixbench | 205 computational | Data analysis + statistics | skills/evals/bixbench/questions.json + capsule data |
| custom | User-defined | Any | Custom JSON file |
python skills/devtu-benchmark-harness/scripts/analyze_results.py \
--results results.json \
--questions skills/evals/bixbench/questions.json \
--benchmark bixbench
Output:
| Category | Skill |
|----------|-------|
| DESeq2, fold_change | tooluniverse-rnaseq-deseq2 |
| ANOVA, regression, chi_square, spline_fitting | tooluniverse-statistical-modeling |
| pathway_enrichment, DESeq2+enrichGO | tooluniverse-gene-enrichment |
| phylogenetics | tooluniverse-phylogenetics |
| variant_analysis, epigenomics | tooluniverse-variant-analysis |
| crispr_screen, functional_genomics | tooluniverse-crispr-screen-analysis |
| single_cell | tooluniverse-single-cell |
python skills/devtu-benchmark-harness/scripts/analyze_results.py \
--results results.json \
--questions skills/evals/bixbench/questions.json \
--diagnose
Each recommendation includes the failing category, responsible skill, failure type, and which devtu skill to invoke for the fix.
For each failure, verify whether it's an agent error or a GT (ground truth) issue:
temp_docs_and_tests/bixbench/bixbench/data/CapsuleFolder-{uuid}/*.py, *.R, analysis.R, run_*.py
Do not fix manually; use devtu skills so fixes follow established patterns and include tests.
| Diagnosis | What to do | Invoke |
|-----------|-----------|--------|
| Tool returns wrong data | Fix tool code + JSON config | Skill('devtu-fix-tool') |
| No tool exists for this computation | Create new ToolUniverse tool | Skill('devtu-create-tool') |
| Skill gives wrong guidance | Update SKILL.md conventions | Skill('devtu-optimize-skills') |
| Agent needs bundled script | Add script to skill's scripts/ dir | Skill('devtu-optimize-skills') Pattern 15 |
| Grader false negative | Fix grade_answers.py | Direct code fix |
| Multiple coordinated changes | Full cycle | Skill('devtu-self-evolve') |
1. analyze_results.py --diagnose → get recommendations
2. For each recommendation → invoke the appropriate devtu skill
3. bash scripts/build-plugin.sh → rebuild dist
4. run_eval.py --retest failures.json → verify fix
Diagnosis: "ANOVA wrong_answer → tooluniverse-statistical-modeling"
→ Invoke: Skill('devtu-optimize-skills')
→ Tell it: "statistical-modeling skill produces wrong F-statistics for
per-gene expression ANOVA. Agent aggregates at sample level instead
of gene level."
→ The skill handles: read SKILL.md, add convention, verify no
memorization, rebuild, suggest retest.
# Extract failed question IDs
python skills/devtu-benchmark-harness/scripts/analyze_results.py \
--results results.json --extract-failures /tmp/failures.json
# Retest only failures
python skills/devtu-benchmark-harness/scripts/run_eval.py \
--benchmark bixbench --mode plugin-only --retest /tmp/failures.json
Compare: how many flipped from wrong to correct? Update baseline if improved.
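The comparison above can be sketched as a small function. It assumes each run has been reduced to a mapping from question ID to a correct/incorrect flag; that shape and the name `count_flips` are illustrative assumptions, not the harness's actual results.json schema.

```python
def count_flips(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two runs and list the question IDs whose outcome changed.

    Only questions present in both runs are compared.
    """
    fixed, regressed = [], []
    for qid in sorted(before.keys() & after.keys()):
        if not before[qid] and after[qid]:
            fixed.append(qid)       # wrong -> correct
        elif before[qid] and not after[qid]:
            regressed.append(qid)   # correct -> wrong
    return {"fixed": fixed, "regressed": regressed}
```

Checking the regressed list as well as the fixed list guards against a fix that flips some questions to correct while breaking others.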
grade_answers.py applies 7 strategies in order:
Unicode normalization: minus signs (U+2212), superscript exponents (10⁻²⁶ → e-26).
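A minimal sketch of this normalization step, written independently of grade_answers.py. The function name `normalize_numeric_text` is hypothetical, and turning a bare power of ten such as 10⁻²⁶ into `1e-26` (so the result parses as a float) is an assumption about the intended behavior.

```python
import re

# Superscript digits and signs mapped to their ASCII equivalents.
_SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹⁻⁺", "0123456789-+")
_SUP_RUN = "[⁻⁺]?[⁰¹²³⁴⁵⁶⁷⁸⁹]+"

# Optional mantissa and multiplication sign, then "10" with a superscript exponent.
_POWER_OF_TEN = re.compile(
    rf"(?<![\d.])(?:(?P<mant>\d+(?:\.\d+)?)\s*[×x*]\s*)?10(?P<exp>{_SUP_RUN})"
)


def normalize_numeric_text(text: str) -> str:
    """Rewrite Unicode minus signs and superscript exponents as ASCII."""
    # U+2212 MINUS SIGN -> ASCII hyphen-minus
    text = text.replace("\u2212", "-")

    def to_e_notation(match: re.Match) -> str:
        exponent = match.group("exp").translate(_SUPERSCRIPTS)
        mantissa = match.group("mant") or "1"  # assumption: bare 10^n means 1e n
        return f"{mantissa}e{exponent}"

    return _POWER_OF_TEN.sub(to_e_notation, text)
```

After this step, an answer like `3.2 × 10⁻²⁶` becomes `3.2e-26` and can be compared numerically.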
# Re-grade with LLM
python skills/devtu-benchmark-harness/scripts/grade_answers.py \
--results results.json --output graded.json --llm
The ToolUniverse plugin uses router-only skill matching:
1 auto-matchable skill: "tooluniverse" (router, ~300 chars)
└── Routing table → 113 sub-skills (all disable-model-invocation: true)
Why: Claude Code has a character budget for skill descriptions (~1% of context). 114 skill descriptions at about 500 characters each come to roughly 57K characters, which exceeds that budget, so descriptions get dropped. With a single router skill, the agent always sees its description and can route through its table.
In -p mode, skills don't auto-match. The benchmark runner simulates interactive behavior via full_skill_injection mode: it programmatically detects the matching skill and injects that skill's full SKILL.md content.
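The injection idea can be sketched as below. This is only an illustration under stated assumptions: the keyword table `ROUTES`, the function `build_prompt`, and the substring matching rule are hypothetical, and the benchmark runner's real detection logic is not shown in this document.

```python
from pathlib import Path

# Hypothetical keyword -> skill routing; the real runner may match differently.
ROUTES = {
    "deseq2": "tooluniverse-rnaseq-deseq2",
    "anova": "tooluniverse-statistical-modeling",
    "phylogen": "tooluniverse-phylogenetics",
}


def build_prompt(question: str, skills_dir: Path) -> str:
    """Prepend the first matching skill's SKILL.md body to the question."""
    lowered = question.lower()
    for keyword, skill in ROUTES.items():
        if keyword in lowered:
            skill_file = skills_dir / skill / "SKILL.md"
            if skill_file.is_file():
                body = skill_file.read_text(encoding="utf-8")
                return f"{body}\n\n{question}"
    # No match (or missing file): send the question unchanged.
    return question
```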
Insert as Phase 3.5 between Testing and Fix:
Phase 3 (Test) → Phase 3.5 (Benchmark) → Phase 4 (Fix via devtu) → Phase 5 (Retest)
The --diagnose flag references these patterns:
| Pattern | Root cause | Fix action |
|---------|-----------|------------|
| DESeq2 wrong_answer | pydeseq2 vs R disagreement, wrong set operations | devtu-optimize-skills on rnaseq-deseq2 |
| ANOVA wrong_answer | F-stat vs p-value confusion, wrong aggregation | devtu-optimize-skills on statistical-modeling |
| spline wrong_answer | R ns() ≠ Python patsy; endpoint inclusion varies | devtu-optimize-skills on statistical-modeling |
| phylogenetics wrong_answer | PhyKIT output column selection, file pairing | devtu-fix-tool on phykit_batch_analysis |
| variant wrong_answer | Multi-row Excel headers, coding-variant denominator | devtu-optimize-skills on variant-analysis |
| enrichGO wrong_answer | R clusterProfiler version sensitivity | devtu-fix-tool on run_deseq2_analysis |
| timeout | Pipeline >30 min (Trimmomatic, GATK) | devtu-create-tool to wrap pipeline |
| GT issue | Ground truth unreproducible with current tools | Document in results, exclude from score |
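The pattern table above amounts to a lookup from (category, failure type) to a devtu skill and its target. A minimal sketch with entries copied from the table; the dict layout and the function name `route_fix` are illustrative and not the harness's actual data structure.

```python
from typing import Optional

# (category, failure_type) -> (devtu skill to invoke, target skill or tool)
FIX_ROUTES = {
    ("DESeq2", "wrong_answer"): ("devtu-optimize-skills", "rnaseq-deseq2"),
    ("ANOVA", "wrong_answer"): ("devtu-optimize-skills", "statistical-modeling"),
    ("spline", "wrong_answer"): ("devtu-optimize-skills", "statistical-modeling"),
    ("phylogenetics", "wrong_answer"): ("devtu-fix-tool", "phykit_batch_analysis"),
    ("variant", "wrong_answer"): ("devtu-optimize-skills", "variant-analysis"),
    ("enrichGO", "wrong_answer"): ("devtu-fix-tool", "run_deseq2_analysis"),
}


def route_fix(category: str, failure_type: str) -> Optional[tuple[str, str]]:
    """Return (devtu_skill, target) for a failure, or None if no pattern applies."""
    if failure_type == "timeout":
        # Timeouts are routed by failure type alone: wrap the slow pipeline as a tool.
        return ("devtu-create-tool", "wrap pipeline")
    return FIX_ROUTES.get((category, failure_type))
```

A `None` result covers cases outside the table, such as GT issues, which are documented and excluded from the score instead of being routed to a skill.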
When adding conventions to skills from benchmark findings: