skills/devtu-self-evolve/SKILL.md
Orchestrate the full ToolUniverse self-improvement cycle: discover APIs, create tools, test with researcher personas, fix issues, optimize skills, and push via git. References and dispatches to all other devtu skills. Use when asked to: run the self-improvement loop, do a debug/test round, expand tool coverage, improve tool quality, or evolve ToolUniverse.
npx skillsauth add mims-harvard/tooluniverse devtu-self-evolve

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Coordinates the full development lifecycle by dispatching to specialized devtu skills.
Discover → Create → Test → Fix → Optimize → Ship → Repeat
Each phase maps to a dedicated skill:
| Phase | Skill | What it does |
|-------|-------|-------------|
| Discover | devtu-auto-discover-apis | Gap analysis, web search for APIs, batch discovery |
| Create | devtu-create-tool | Build tool class + JSON config + test examples |
| Test | (this skill) | Launch researcher persona agents to find issues |
| Fix | devtu-fix-tool | Diagnose failures, implement fixes, validate |
| Optimize | devtu-optimize-skills | Improve skill reports, evidence handling, UX |
| Optimize | devtu-optimize-descriptions | Improve tool JSON descriptions for clarity |
| Docs | devtu-docs-quality | Validate documentation accuracy |
| Ship | devtu-github | Branch, commit, push, create PR |
Pick an entry point based on what's needed:
- `Skill(skill="devtu-auto-discover-apis")`
- `Skill(skill="devtu-create-tool")`
- `Skill(skill="devtu-fix-tool")`
- `Skill(skill="devtu-optimize-skills")`

Invoke `Skill(skill="devtu-auto-discover-apis")` to:
Invoke Skill(skill="devtu-create-tool") for each new API:
- `_lazy_registry_static.py` and `default_config.py`
- `python -m tooluniverse.cli test <ToolName>`

This is the core testing loop, run directly by this skill.
- `gh pr list --state open`
- `origin/main`
- `git fetch origin && git rebase origin/main`

Launch 2 agents per round (A + B) using the Agent tool with these parameters:
Each agent gets:
- `Feature-{round}{letter}-{num}` (e.g., `Feature-59A-001`)
- Agent prompt template: see references/persona-template.md
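The issue ID scheme above can be sketched as a small formatter. The function name `format_issue_id` is hypothetical, and the three-digit zero padding is an assumption inferred from the single example `Feature-59A-001`; the skill does not state a padding rule.

```python
def format_issue_id(round_number: int, letter: str, num: int) -> str:
    """Build an ID of the form Feature-{round}{letter}-{num}.

    Hypothetical helper. Zero-padding {num} to three digits is inferred
    from the example Feature-59A-001, not specified by the skill.
    """
    # Two agents are launched per round, labelled A and B.
    if letter not in ("A", "B"):
        raise ValueError("expected agent letter 'A' or 'B'")
    if round_number < 0 or num < 0:
        raise ValueError("round_number and num must be non-negative")
    return f"Feature-{round_number}{letter}-{num:03d}"


print(format_issue_id(59, "A", 1))  # Feature-59A-001
```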
Before implementing ANY agent-reported issue, verify via CLI:
python3 -m tooluniverse.cli run <ToolName> '<json_args>'
50%+ of agent reports are false positives from MCP interface confusion. Only fix verified issues.
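The verification step can be scripted so that every agent report is replayed through the CLI before any fix is attempted. This is a minimal sketch: the helper names and the example tool name and arguments are hypothetical, and it assumes the current interpreter (`sys.executable`) is the one with ToolUniverse installed, standing in for `python3` in the command above.

```python
import json
import subprocess
import sys


def build_verify_command(tool_name: str, args: dict) -> list[str]:
    """Return argv equivalent to: python3 -m tooluniverse.cli run <ToolName> '<json_args>'."""
    # Passing argv as a list avoids shell quoting problems in the JSON payload.
    return [sys.executable, "-m", "tooluniverse.cli", "run", tool_name, json.dumps(args)]


def verify_issue(tool_name: str, args: dict, timeout: float = 120.0) -> subprocess.CompletedProcess:
    """Replay a reported call through the CLI and return the raw result for review."""
    return subprocess.run(
        build_verify_command(tool_name, args),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a failing tool is a finding, not an exception
    )


# Hypothetical tool name and arguments, for illustration only.
command = build_verify_command("ExampleTool", {"query": "BRCA1"})
print(command[1:])
```

Comparing the captured `stdout` and `returncode` against the agent's claim is left to the reviewer; the sketch only makes the replay repeatable.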
Anti-patterns: hint text instead of validation, parameter aliases instead of fixing naming, post-hoc probing instead of pre-validation.
Standard testing verifies tools work. Usefulness testing verifies skills actually solve scientist problems. Run this after standard testing:
Score 1-10 rubric:
Common failure patterns found in usefulness tests:
| Pattern | Score Impact | Fix |
|---------|-------------|-----|
| "Call A, then B, then C" without explaining what to DO with results | -3 | Add interpretation tables |
| Tool params wrong (tool works but skill documents wrong names) | -2 | Verify ALL tool params via get_tool_info() |
| Promises data the API can't deliver (e.g., DepMap CRISPR scores) | -2 | Be honest about limitations; add computational procedure workaround |
| No synthesis phase at the end | -2 | Add "so what?" phase that combines all evidence |
| No evidence grading | -1 | Add T1-T4 or similar confidence tiers |
| No computational procedures for things tools can't do | -1 | Add Python code blocks using scipy/pandas/numpy |
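One way to apply the score impacts in the table is sketched below. The penalty values come from the table; the pattern keys, the starting score of 10, and the clamp to a minimum of 1 are assumptions made for illustration, since the rubric itself is not spelled out here.

```python
# Penalties copied from the failure-pattern table; the keys are hypothetical labels.
PENALTIES = {
    "no_interpretation": 3,            # "Call A, then B, then C" without guidance
    "wrong_tool_params": 2,            # skill documents wrong parameter names
    "undeliverable_data": 2,           # promises data the API can't deliver
    "no_synthesis": 2,                 # no "so what?" phase
    "no_evidence_grading": 1,          # no T1-T4 or similar tiers
    "no_computational_procedures": 1,  # no code for what tools can't do
}


def usefulness_score(observed: set[str], base: int = 10) -> int:
    """Subtract the penalty for each observed pattern, clamped to the 1-10 range.

    Assumes scoring starts from `base` and never drops below 1.
    """
    unknown = observed - set(PENALTIES)
    if unknown:
        raise ValueError(f"unknown failure patterns: {sorted(unknown)}")
    total = sum(PENALTIES[pattern] for pattern in observed)
    return max(1, base - total)


print(usefulness_score({"no_interpretation", "no_synthesis"}))  # 5
```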
When tools can't help, add computational procedures: Some analyses need Python code, not API calls. Skills should include working code blocks for:
See devtu-optimize-skills Patterns 14-15 for full guidance.
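As an illustration of what such a computational procedure might look like, the sketch below computes a gene-set over-representation p-value. The choice of analysis is an assumption (the list of procedures is not reproduced here), and the function name and parameters are hypothetical. It uses only the standard library; with SciPy available, `scipy.stats.hypergeom.sf(overlap - 1, background, gene_set_size, hits)` should give the same tail probability.

```python
from math import comb


def enrichment_p_value(overlap: int, hits: int, gene_set_size: int, background: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    X counts how many of `hits` genes drawn from `background` fall inside a
    gene set of size `gene_set_size`. Hypothetical helper for illustration.
    """
    if not (0 <= gene_set_size <= background and 0 <= hits <= background):
        raise ValueError("gene_set_size and hits must lie within the background size")
    upper = min(hits, gene_set_size)
    if not (0 <= overlap <= upper):
        raise ValueError("overlap must be between 0 and min(hits, gene_set_size)")
    total = comb(background, hits)
    tail = sum(
        comb(gene_set_size, k) * comb(background - gene_set_size, hits - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 8 of 40 hit genes fall in a 200-gene set, 20000-gene background.
print(enrichment_p_value(overlap=8, hits=40, gene_set_size=200, background=20000))
```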
Quantify plugin performance after testing. Uses Skill(skill="devtu-benchmark-harness").
# Run lab-bench (20 MCQ)
python skills/devtu-benchmark-harness/scripts/run_eval.py --benchmark lab-bench --mode plugin-only --n 20
# Run BixBench (computational, use first 20)
python skills/devtu-benchmark-harness/scripts/run_eval.py --benchmark bixbench --mode plugin-only --n 20
# Analyze results
python skills/devtu-benchmark-harness/scripts/analyze_results.py --results <results-file>
# Generate report
python skills/devtu-benchmark-harness/scripts/generate_report.py --results <results-file> --output BENCHMARK_REPORT.md
Compare with previous round. If any category regresses, prioritize fixing that skill/tool in Phase 4.
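The round-over-round comparison can be sketched as follows. This assumes each round's results have already been reduced to a mapping from category name to score; that shape, the function name, and the category names are hypothetical and not the benchmark harness's actual output format.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return categories whose score dropped by more than `tolerance`, worst drop first."""
    drops = {}
    for category, old_score in previous.items():
        if category not in current:
            continue  # category not run this round; nothing to compare
        drop = old_score - current[category]
        if drop > tolerance:
            drops[category] = drop
    return sorted(drops, key=lambda category: drops[category], reverse=True)


# Hypothetical category names and scores.
previous_round = {"sequence": 0.80, "literature": 0.60, "cloning": 0.50}
current_round = {"sequence": 0.70, "literature": 0.65, "cloning": 0.20}
print(find_regressions(previous_round, current_round))  # ['cloning', 'sequence']
```

Sorting by the size of the drop gives a natural priority order for the fix phase.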
- `Skill(skill="simplify")`: always after writing or modifying code
- `ruff check src/tooluniverse/<file>.py`
- `python -c "from tooluniverse.<module> import <Class>"`
- `python -m tooluniverse.cli run <Tool> '<json>'`
- `git push origin <branch>`

Also see `Skill(skill="devtu-code-optimization")` for reusable fix patterns and anti-patterns.
After fixes are stable:
- `Skill(skill="devtu-optimize-descriptions")`: improve tool descriptions
- `Skill(skill="devtu-optimize-skills")`: improve research skill quality
- `Skill(skill="devtu-docs-quality")`: validate docs accuracy

Invoke `Skill(skill="devtu-github")` or manually:
- `git fetch origin && git stash && git rebase origin/main && git stash pop`
- `git push --force-with-lease origin <branch>`
- `gh pr create` / verify with `gh pr view <N> --json mergeable`
- Confirm `"mergeable": "MERGEABLE"` before reporting done

GitHub repo: mims-harvard/ToolUniverse. Always verify with `git remote -v` before pushing.
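The mergeability check in the shipping steps can be automated by parsing the JSON that `gh pr view <N> --json mergeable` prints. A minimal sketch, with hypothetical helper names; it treats anything other than `MERGEABLE` (for example `CONFLICTING` or `UNKNOWN`) as not ready.

```python
import json
import subprocess


def is_mergeable(gh_json_output: str) -> bool:
    """True only when the gh output reports "mergeable": "MERGEABLE"."""
    return json.loads(gh_json_output).get("mergeable") == "MERGEABLE"


def pr_is_mergeable(pr_number: int) -> bool:
    """Run `gh pr view <N> --json mergeable` and check the reported state.

    Requires the GitHub CLI to be installed and authenticated.
    """
    result = subprocess.run(
        ["gh", "pr", "view", str(pr_number), "--json", "mergeable"],
        capture_output=True,
        text=True,
        check=True,  # surface gh failures instead of reporting a false "not mergeable"
    )
    return is_mergeable(result.stdout)


print(is_mergeable('{"mergeable": "MERGEABLE"}'))    # True
print(is_mergeable('{"mergeable": "CONFLICTING"}'))  # False
```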
git fetch origin && git rebase origin/main

| Category | Signal |
|----------|--------|
| Silent parameter miss | Wrong-field check; param ignored |
| Always-fires conditional | .get("field") on wrong type |
| Silent normalization | Auto-transform not disclosed |
| Wrong notation/case | Gene fusions, Title Case names |
| Substring match | Short symbol returns multiple targets |
| try/except indent | Mismatched → SyntaxError |
Full patterns → references/bug-patterns.md
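To make one row of the table concrete, the sketch below reproduces the "Substring match" signal and one possible fix. The target list and function names are illustrative assumptions, not code from ToolUniverse.

```python
# Illustrative target symbols; any real tool would load these from its data source.
TARGETS = ["AR", "ARID1A", "PARP1", "CDK4"]


def match_substring(symbol: str, targets: list[str]) -> list[str]:
    """Buggy pattern: a short symbol matches every target that contains it."""
    return [target for target in targets if symbol.upper() in target.upper()]


def match_exact(symbol: str, targets: list[str]) -> list[str]:
    """One possible fix: case-insensitive exact comparison."""
    return [target for target in targets if symbol.upper() == target.upper()]


print(match_substring("AR", TARGETS))  # ['AR', 'ARID1A', 'PARP1']
print(match_exact("AR", TARGETS))      # ['AR']
```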
After each round: advance counter, update patterns file, keep this SKILL.md under 150 lines.
Current round: 127 (rounds completed: 52-126)