Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

mims-harvard/devtu-benchmark-harness

Name: devtu-benchmark-harness
Author: mims-harvard

skills/devtu-benchmark-harness/SKILL.md

npx skillsauth add mims-harvard/tooluniverse devtu-benchmark-harness

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Benchmark Harness — Continuous Improvement System

A 5-step feedback loop for improving ToolUniverse tools, skills, and plugin quality.

Note: This skill is dataset-agnostic. Per-benchmark score history, known-failing question IDs, and dataset-specific investigations belong in temp_docs_and_tests/benchmark_tracking/ (gitignored workfolder), NOT in this skill directory.

The Feedback Loop

1. RUN benchmark → 2. ANALYZE results → 3. DIAGNOSE failures → 4. FIX via devtu skill → 5. RETEST → repeat

Orchestrated runner (preferred)

One command does steps 0 (memorization audit), 1 (build), 2 (run), 3 (analyze), 4 (diagnose + extract failures):

bash skills/devtu-benchmark-harness/scripts/run_harness_loop.sh --benchmark bixbench --n 20 --seed 42
# After reviewing diagnose.log and applying devtu skill fixes:
bash skills/devtu-benchmark-harness/scripts/run_harness_loop.sh --retest /path/to/failures.json

The script creates temp_docs_and_tests/benchmark_tracking/run_<TS>/ with results.json, analysis.log, diagnose.log, failures.json. Diagnose output lists each failure with the exact devtu skill to invoke — do NOT fix manually.

Anti-memorization guard

Before accepting any skill edit (from devtu-optimize-skills or manual), run:

python3 skills/devtu-benchmark-harness/scripts/check_memorization.py --all

Fails if any skill contains benchmark names, capsule UUIDs, bix-N question IDs, or known-to-be-GT specific numeric answers. This prevents overfitting the plugin to a single benchmark's answer key. Run in --strict mode to also flag specific gene names and dataset filenames (softer signal).

Step 1: RUN — Execute Benchmark

bash scripts/build-plugin.sh   # rebuild plugin with latest skills

python skills/devtu-benchmark-harness/scripts/run_eval.py \
  --benchmark bixbench \        # bixbench | lab-bench | custom
  --mode plugin-only \          # plugin-only | baseline-only | comparison
  --n 205 \                     # number of questions
  --timeout 1800 \              # seconds per question
  --max-turns 30                # agent turns per question

Options: --category DESeq2 (filter), --resume results.json (skip done), --guidance path.md (inject custom).

Reliability mode: APPEND_CONVENTIONS env var

Skill auto-matching in interactive mode is variable — sometimes Claude reads the skill description but starts writing code before loading the skill body. To force the router's critical conventions into every request's system prompt (more reliable, measures the plugin's conventions as-designed rather than skill-routing-as-implemented):

APPEND_CONVENTIONS=1 python skills/evals/run_benchmark.py --benchmark bixbench --plugin-only

Use this mode when measuring the CORRECTNESS of the conventions (are they the right rules?). Use default mode when measuring the RELIABILITY of skill routing (does Claude actually invoke the skill?). The gap between these two numbers is the routing-reliability problem.

Benchmark setup (first time only)

Rscript skills/evals/install_r_packages.R                  # R packages
python3 skills/evals/bixbench/download_capsules.py         # BixBench data (~5 GB)

Available benchmarks

| Benchmark | Questions | Tests | Data | |-----------|----------|-------|------| | lab-bench | 20 MCQ | Database lookup accuracy | skills/evals/lab-bench/questions.json | | bixbench | 205 computational | Data analysis + statistics | skills/evals/bixbench/questions.json + capsule data | | custom | User-defined | Any | Custom JSON file |

Step 2: ANALYZE — Map Failures to Skills

python skills/devtu-benchmark-harness/scripts/analyze_results.py \
  --results results.json \
  --questions skills/evals/bixbench/questions.json \
  --benchmark bixbench

Output:

By skill: which skills have lowest accuracy (fix those first)
By category: DESeq2, ANOVA, phylogenetics, variant_analysis, etc.
Failure types: timeout, wrong_answer, tool_error, api_key_missing

Category → Skill mapping

| Category | Skill | |----------|-------| | DESeq2, fold_change | tooluniverse-rnaseq-deseq2 | | ANOVA, regression, chi_square, spline_fitting | tooluniverse-statistical-modeling | | pathway_enrichment, DESeq2+enrichGO | tooluniverse-gene-enrichment | | phylogenetics | tooluniverse-phylogenetics | | variant_analysis, epigenomics | tooluniverse-variant-analysis | | crispr_screen, functional_genomics | tooluniverse-crispr-screen-analysis | | single_cell | tooluniverse-single-cell |

Step 3: DIAGNOSE — Get Improvement Recommendations

python skills/devtu-benchmark-harness/scripts/analyze_results.py \
  --results results.json \
  --questions skills/evals/bixbench/questions.json \
  --diagnose

Each recommendation includes the failing category, responsible skill, failure type, and which devtu skill to invoke for the fix.

Root cause investigation

For each failure, verify whether it's an agent error or a GT (ground truth) issue:

Find the capsule data: temp_docs_and_tests/bixbench/bixbench/data/CapsuleFolder-{uuid}/
Look for authoritative scripts: *.py, *.R, analysis.R, run_*.py
Run the script yourself — does it reproduce the GT value?
If your computation matches the agent (not the GT), it's a GT issue, not an agent error

Step 4: FIX — Route to the Right devtu Skill

Do not fix manually — use devtu skills so fixes follow established patterns and include tests.

| Diagnosis | What to do | Invoke | |-----------|-----------|--------| | Tool returns wrong data | Fix tool code + JSON config | Skill('devtu-fix-tool') | | No tool exists for this computation | Create new ToolUniverse tool | Skill('devtu-create-tool') | | Skill gives wrong guidance | Update SKILL.md conventions | Skill('devtu-optimize-skills') | | Agent needs bundled script | Add script to skill's scripts/ dir | Skill('devtu-optimize-skills') Pattern 15 | | Grader false negative | Fix grade_answers.py | Direct code fix | | Multiple coordinated changes | Full cycle | Skill('devtu-self-evolve') |

Fix workflow

1. analyze_results.py --diagnose → get recommendations
2. For each recommendation → invoke the appropriate devtu skill
3. bash scripts/build-plugin.sh → rebuild dist
4. run_eval.py --retest failures.json → verify fix

Example

Diagnosis: "ANOVA wrong_answer → tooluniverse-statistical-modeling"
  → Invoke: Skill('devtu-optimize-skills')
  → Tell it: "statistical-modeling skill produces wrong F-statistics for
     per-gene expression ANOVA. Agent aggregates at sample level instead
     of gene level."
  → The skill handles: read SKILL.md, add convention, verify no
     memorization, rebuild, suggest retest.

Step 5: RETEST — Verify Fixes

# Extract failed question IDs
python skills/devtu-benchmark-harness/scripts/analyze_results.py \
  --results results.json --extract-failures /tmp/failures.json

# Retest only failures
python skills/devtu-benchmark-harness/scripts/run_eval.py \
  --benchmark bixbench --mode plugin-only --retest /tmp/failures.json

Compare: how many flipped from wrong to correct? Update baseline if improved.

Grader

grade_answers.py applies 7 strategies in order:

Exact match — GT substring in prediction
MC match — letter answer detection (A/B/C/D)
Range match — numeric value within (low, high) with rounding tolerance
Normalized match — strip punctuation, bidirectional substring + bold-segment extraction
Numeric proximity — within 5% tolerance
Synonym match — scientific term equivalences
LLM verifier — Claude judges semantic correctness (for eval_mode=llm_verifier)

Unicode normalization: minus signs (U+2212), superscript exponents (10⁻²⁶ → e-26).

# Re-grade with LLM
python skills/devtu-benchmark-harness/scripts/grade_answers.py \
  --results results.json --output graded.json --llm

Plugin Architecture

The ToolUniverse plugin uses router-only skill matching:

1 auto-matchable skill: "tooluniverse" (router, ~300 chars)
  └── Routing table → 113 sub-skills (all disable-model-invocation: true)

Why: Claude Code has a character budget for skill descriptions (~1% of context). 114 skills × 500 chars = 57K exceeds budget → descriptions get dropped. With 1 router, the agent always sees it and routes correctly.

In -p mode, skills don't auto-match. The benchmark runner simulates interactive behavior via full_skill_injection mode: programmatically detects matching skill, injects its full SKILL.md content.

Integration with devtu-self-evolve

Insert as Phase 3.5 between Testing and Fix:

Phase 3 (Test) → Phase 3.5 (Benchmark) → Phase 4 (Fix via devtu) → Phase 5 (Retest)

Known Failure Patterns

The --diagnose flag references these patterns:

| Pattern | Root cause | Fix action | |---------|-----------|------------| | DESeq2 wrong_answer | pydeseq2 vs R disagreement, wrong set operations | devtu-optimize-skills on rnaseq-deseq2 | | ANOVA wrong_answer | F-stat vs p-value confusion, wrong aggregation | devtu-optimize-skills on statistical-modeling | | spline wrong_answer | R ns() ≠ Python patsy; endpoint inclusion varies | devtu-optimize-skills on statistical-modeling | | phylogenetics wrong_answer | PhyKIT output column selection, file pairing | devtu-fix-tool on phykit_batch_analysis | | variant wrong_answer | Multi-row Excel headers, coding-variant denominator | devtu-optimize-skills on variant-analysis | | enrichGO wrong_answer | R clusterProfiler version sensitivity | devtu-fix-tool on run_deseq2_analysis | | timeout | Pipeline >30 min (Trimmomatic, GATK) | devtu-create-tool to wrap pipeline | | GT issue | Ground truth unreproducible with current tools | Document in results, exclude from score |

Skill convention rules

When adding conventions to skills from benchmark findings:

General knowledge only — no dataset-specific values, no memorized answers
Principles over examples — "per-gene ANOVA not per-sample" rather than "F=0.77 on this dataset"
Tool preferences — "use R DESeq2 for dispersion" rather than "R gives 4, pydeseq2 gives 2"
Verify no contamination — grep for dataset names, specific numeric answers in the convention text

mims-harvard/devtu-benchmark-harness

skills/devtu-benchmark-harness/SKILL.md

Continuous improvement system for ToolUniverse tools, skills, and plugin. Run benchmarks, diagnose failures, route fixes to devtu skills, retest. Use after skill optimization, tool additions, or as regression check.

1,383 stars

tools

Updated May 27, 2026

$ install --global

skillsauth

npx skillsauth add mims-harvard/tooluniverse devtu-benchmark-harness

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 27, 2026, 7:09 AM229.9s15 files scanned

SKILL.md

name:: devtu-benchmark-harness
description:: Continuous improvement system for ToolUniverse tools, skills, and plugin. Run benchmarks, diagnose failures, route fixes to devtu skills, retest. Use after skill optimization, tool additions, or as regression check.

Benchmark Harness — Continuous Improvement System

A 5-step feedback loop for improving ToolUniverse tools, skills, and plugin quality.

The Feedback Loop

1. RUN benchmark → 2. ANALYZE results → 3. DIAGNOSE failures → 4. FIX via devtu skill → 5. RETEST → repeat

Orchestrated runner (preferred)

One command does steps 0 (memorization audit), 1 (build), 2 (run), 3 (analyze), 4 (diagnose + extract failures):

bash skills/devtu-benchmark-harness/scripts/run_harness_loop.sh --benchmark bixbench --n 20 --seed 42
# After reviewing diagnose.log and applying devtu skill fixes:
bash skills/devtu-benchmark-harness/scripts/run_harness_loop.sh --retest /path/to/failures.json

Anti-memorization guard

Before accepting any skill edit (from devtu-optimize-skills or manual), run:

python3 skills/devtu-benchmark-harness/scripts/check_memorization.py --all

Step 1: RUN — Execute Benchmark

bash scripts/build-plugin.sh   # rebuild plugin with latest skills

python skills/devtu-benchmark-harness/scripts/run_eval.py \
  --benchmark bixbench \        # bixbench | lab-bench | custom
  --mode plugin-only \          # plugin-only | baseline-only | comparison
  --n 205 \                     # number of questions
  --timeout 1800 \              # seconds per question
  --max-turns 30                # agent turns per question

Options: --category DESeq2 (filter), --resume results.json (skip done), --guidance path.md (inject custom).

Reliability mode: APPEND_CONVENTIONS env var

APPEND_CONVENTIONS=1 python skills/evals/run_benchmark.py --benchmark bixbench --plugin-only

Benchmark setup (first time only)

Rscript skills/evals/install_r_packages.R                  # R packages
python3 skills/evals/bixbench/download_capsules.py         # BixBench data (~5 GB)

Available benchmarks

Step 2: ANALYZE — Map Failures to Skills

python skills/devtu-benchmark-harness/scripts/analyze_results.py \
  --results results.json \
  --questions skills/evals/bixbench/questions.json \
  --benchmark bixbench

Output:

By skill: which skills have lowest accuracy (fix those first)
By category: DESeq2, ANOVA, phylogenetics, variant_analysis, etc.
Failure types: timeout, wrong_answer, tool_error, api_key_missing

Category → Skill mapping

Step 3: DIAGNOSE — Get Improvement Recommendations

python skills/devtu-benchmark-harness/scripts/analyze_results.py \
  --results results.json \
  --questions skills/evals/bixbench/questions.json \
  --diagnose

Each recommendation includes the failing category, responsible skill, failure type, and which devtu skill to invoke for the fix.

Root cause investigation

For each failure, verify whether it's an agent error or a GT (ground truth) issue:

Find the capsule data: temp_docs_and_tests/bixbench/bixbench/data/CapsuleFolder-{uuid}/
Look for authoritative scripts: *.py, *.R, analysis.R, run_*.py
Run the script yourself — does it reproduce the GT value?
If your computation matches the agent (not the GT), it's a GT issue, not an agent error

Step 4: FIX — Route to the Right devtu Skill

Do not fix manually — use devtu skills so fixes follow established patterns and include tests.

Fix workflow

1. analyze_results.py --diagnose → get recommendations
2. For each recommendation → invoke the appropriate devtu skill
3. bash scripts/build-plugin.sh → rebuild dist
4. run_eval.py --retest failures.json → verify fix

Example

Diagnosis: "ANOVA wrong_answer → tooluniverse-statistical-modeling"
  → Invoke: Skill('devtu-optimize-skills')
  → Tell it: "statistical-modeling skill produces wrong F-statistics for
     per-gene expression ANOVA. Agent aggregates at sample level instead
     of gene level."
  → The skill handles: read SKILL.md, add convention, verify no
     memorization, rebuild, suggest retest.

Step 5: RETEST — Verify Fixes

# Extract failed question IDs
python skills/devtu-benchmark-harness/scripts/analyze_results.py \
  --results results.json --extract-failures /tmp/failures.json

# Retest only failures
python skills/devtu-benchmark-harness/scripts/run_eval.py \
  --benchmark bixbench --mode plugin-only --retest /tmp/failures.json

Compare: how many flipped from wrong to correct? Update baseline if improved.

Grader

grade_answers.py applies 7 strategies in order:

Exact match — GT substring in prediction
MC match — letter answer detection (A/B/C/D)
Range match — numeric value within (low, high) with rounding tolerance
Normalized match — strip punctuation, bidirectional substring + bold-segment extraction
Numeric proximity — within 5% tolerance
Synonym match — scientific term equivalences
LLM verifier — Claude judges semantic correctness (for eval_mode=llm_verifier)

Unicode normalization: minus signs (U+2212), superscript exponents (10⁻²⁶ → e-26).

# Re-grade with LLM
python skills/devtu-benchmark-harness/scripts/grade_answers.py \
  --results results.json --output graded.json --llm

Plugin Architecture

The ToolUniverse plugin uses router-only skill matching:

1 auto-matchable skill: "tooluniverse" (router, ~300 chars)
  └── Routing table → 113 sub-skills (all disable-model-invocation: true)

In -p mode, skills don't auto-match. The benchmark runner simulates interactive behavior via full_skill_injection mode: programmatically detects matching skill, injects its full SKILL.md content.

Integration with devtu-self-evolve

Insert as Phase 3.5 between Testing and Fix:

Phase 3 (Test) → Phase 3.5 (Benchmark) → Phase 4 (Fix via devtu) → Phase 5 (Retest)

Known Failure Patterns

The --diagnose flag references these patterns:

Skill convention rules

When adding conventions to skills from benchmark findings:

General knowledge only — no dataset-specific values, no memorized answers
Principles over examples — "per-gene ANOVA not per-sample" rather than "F=0.77 on this dataset"
Tool preferences — "use R DESeq2 for dispersion" rather than "R gives 4, pydeseq2 gives 2"
Verify no contamination — grep for dataset names, specific numeric answers in the convention text

Related Skills

mims-harvard/tooluniverse-self-review

tools

VerifiedTrustedCommunity

Generate the success criteria for a task or question, then review work against them. Given a task, goal, or open-ended question, decompose it into scenarios, evaluation perspectives, and fine-grained weighted YES/NO criteria using the Recursive Expansion Tree (RET) method; if work is supplied, score it criterion-by-criterion and surface what is missing or could be better. Use when asked to self-review or check your own work, judge whether a task is done well or completely, build a definition-of-done or completeness checklist, create an evaluation rubric or grading criteria, score or grade answers to a question, set up an LLM-as-judge rubric, or when the user mentions self-review, completeness check, success criteria, evaluation criteria, scoring rubric, Qworld, or the RET algorithm.

1,583SKILL.mdUpdated Jul 22, 2026

mims-harvard/tooluniverse-self-review

mims-harvard/tooluniverse-peptide-target-deorphanization

tools

VerifiedTrustedCommunity

Find the real protein target(s) of a peptide from its sequence — peptide target deorphanization / off-target identification, for ANY target class (GPCR, ion channel, protease, cytokine/growth-factor receptor, enzyme, integrin), not only GPCRs. Use when a peptide has a phenotype but does not bind its hypothesized target, when a peptide binds a target in one species or assay but not another, or to screen candidate targets for an orphan peptide. A target-class router steers a multi-route keyless pipeline (PROSITE/ELM motif, BLAST homology, HGNC/InterPro/GPCRdb/GtoPdb target-family enumeration, OpenTargets phenotype anchor, EnsemblCompara/Alliance cross-species reconciliation) plus optional NVIDIA-NIM co-folding (Boltz2, AlphaFold2-Multimer, OpenFold3) for structural confirmation.

1,583SKILL.mdUpdated Jul 22, 2026

mims-harvard/tooluniverse-peptide-target-deorphanization

mims-harvard/tooluniverse-cs-setup

tools

VerifiedTrustedCommunity

Install or update ToolUniverse in Claude Science — create the conda env, install the tooluniverse pip package, and (re)build the tooluniverse-research skill by fetching the current workflow library from GitHub. Use for first-time setup, upgrading the ToolUniverse version, refreshing the bundled workflows after an upstream release, or reinstalling on a new machine.

1,583SKILL.mdUpdated Jul 22, 2026

mims-harvard/tooluniverse-cs-setup

mims-harvard/tooluniverse-codex-plugin

tools

VerifiedTrustedCommunity

Install, set up, verify, update, pin, uninstall, or troubleshoot the ToolUniverse plugin on OpenAI Codex. ALWAYS consult this skill for any of those — don't answer from memory, because the exact marketplace name (mims-harvard/ToolUniverse), the "codex plugin marketplace add" then "codex plugin add -m tooluniverse" flow, Codex's startup auto-upgrade behavior, the uvx tooluniverse MCP server, and the API-key env vars are easy to get wrong. Use it whenever someone wants to get ToolUniverse (or "the 1000+ scientific tools" / "the harvard tools") working on Codex, says the Codex plugin or its tools/skills won't load, hits a uvx or MCP-server startup error, asks how Codex updates it, wants to pin or remove it, or finds it running an old tool version — even if they never say the word "plugin". Not for the Claude Code plugin (use tooluniverse-claude-code-plugin), for running research with the tools, or for authoring new tools or skills.

1,583SKILL.mdUpdated Jul 22, 2026

mims-harvard/tooluniverse-codex-plugin

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/mims-harvard/tooluniverse.git

# Copy into Claude Code skills folder (global)
cp -r tooluniverse/skills/devtu-benchmark-harness ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

mims-harvard/tooluniverse

1,383 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT