skills/tier-5-automation/autoresearch/SKILL.md
Autonomous iterative improvement for any measurable system. A three-agent loop (researcher, critic, meta-reviewer) with holdout validation, coverage-driven exploration, and metacognitive self-modification. Extends Karpathy's autoresearch with generator/verifier separation and ideas from the HyperAgents paper.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add pbc-os/agent-skills-public autoresearch
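Built on Karpathy's autoresearch pattern, extended with generator/verifier separation (a critic with holdout validation), coverage-driven exploration, a stepping-stones archive, and metacognitive self-modification by a meta-reviewer.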
See the references/ folder for the academic and practitioner sources behind each of these extensions.
Three agents work in a pipeline, coordinated by a small set of files on disk:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌──────────────────┐     │
│  │ RESEARCHER  │───▶│   CRITIC    │───▶│  META-REVIEWER   │     │
│  │ (proposes)  │    │ (validates) │    │  (every N exps)  │     │
│  └─────────────┘    └─────────────┘    └──────────────────┘     │
│         │                  │                    │               │
│         ▼                  ▼                    ▼               │
│ 1. Analyze weakness  4. Holdout gate       7. Review full log   │
│ 2. Form hypothesis   5. Overfitting check  8. Update strategy   │
│ 3. Change 1 param    6. Binary pass/fail   9. Suggest new       │
│    + run eval           → KEEP/DISCARD        dimensions        │
│         │                  │                    │               │
│         └──────────────────┴────────────────────┘               │
│                            │                                    │
│                    ┌───────▼────────┐                           │
│                    │ COVERAGE MAP   │                           │
│                    │ What's been    │                           │
│                    │ explored?      │                           │
│                    │ What's missing?│                           │
│                    └────────────────┘                           │
│                            │                                    │
│                    ┌───────▼────────┐                           │
│                    │ ARCHIVE        │                           │
│                    │ Top N configs  │                           │
│                    │ Branch from    │                           │
│                    │ diverse parents│                           │
│                    └────────────────┘                           │
└─────────────────────────────────────────────────────────────────┘
| File | Role | Who Edits |
|------|------|-----------|
| research.md | Agent instructions — what to optimize, how to think, domain knowledge | Human initially; meta-reviewer can append |
| eval script | Fixed measurement — runs the metric calculation | Nobody (frozen) |
| parameters file | Tunable knobs — the ONLY thing the researcher changes | Researcher agent |
| experiments/ | Accountability trail — every experiment logged | All agents (append-only) |
| archive.json | Top N parameter configs with scores | Researcher agent |
| coverage.json | Which parameter dimensions have been explored | Auto-tracked |
The eval script is sacred. No agent ever modifies it. This prevents "improving" the score by weakening the test — the single most common failure mode in autonomous optimization systems.
Ask the user (or infer from context):
--holdout flag so the critic can validate on data the researcher never saw.
Create 6 files in the project. Adapt file names and formats to the domain.
research.md: Agent Instructions (mutable)
The "program.md" equivalent from Karpathy's original. It tells the researcher agent:
v2 change: The meta-reviewer agent can append to research.md in a clearly marked ## Learned Insights section. This is the metacognitive self-modification idea — the agent improves not just the parameters, but its own optimization strategy.
Use the template: templates/research-template.md.
A script that:
--holdout flag to evaluate on a held-out data split (the critic uses this)
This can be a Python/Node/Shell script, CLI command, API call, database query, etc.
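As a rough illustration of the `--holdout` behaviour described above, here is a minimal Python sketch. Only the `--holdout` flag comes from this document; the inline `ROWS` data, the choice of mean absolute percentage error as the metric, and the rule that the last fifth of the rows is held out are all assumptions made for the example.

```python
import argparse

# Hypothetical data: (actual, forecast) pairs. A real eval script would load
# these from the system being measured.
ROWS = [(100, 90), (200, 210), (150, 150), (120, 132), (80, 76)]


def split_rows(rows, holdout):
    """Return the held-out tail if holdout is True, else the remaining head.

    Assumption: the last fifth of the rows (integer division) is held out.
    """
    cut = len(rows) - len(rows) // 5
    return rows[cut:] if holdout else rows[:cut]


def mape(rows):
    """Mean absolute percentage error, in percent (lower is better)."""
    return 100 * sum(abs(a - f) / a for a, f in rows) / len(rows)


def main(argv=None):
    """Parse the command line, print the metric, and return it."""
    parser = argparse.ArgumentParser(description="Frozen eval script (sketch)")
    parser.add_argument("--holdout", action="store_true",
                        help="evaluate on the held-out split only")
    args = parser.parse_args(argv)
    score = mape(split_rows(ROWS, args.holdout))
    print(f"{score:.2f}")
    return score
```

Calling `main([])` scores the split the researcher sees, and `main(["--holdout"])` scores the rows reserved for the critic. A real script would invoke `main()` under the usual `if __name__ == "__main__":` guard.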
Mark it read-only after creation:
chmod 444 eval.py # or eval.sh, eval.js, etc.
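The read-only bit can be cleared again by anything with write access to the directory, so a runner may also want to record a digest of the eval script and compare it before each experiment. This digest check is not part of the skill as described; it is an assumption layered on top of the `chmod 444` step, and the function names below are hypothetical.

```python
import hashlib
import os
import stat


def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def freeze(path):
    """Make the file read-only (mode 444) and return its digest."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return sha256_of(path)


def is_unchanged(path, expected_digest):
    """True if the file still has the digest recorded at freeze time."""
    return sha256_of(path) == expected_digest
```

A runner could call `freeze("eval.py")` once at setup, store the digest next to the experiment log, and refuse to score any experiment for which `is_unchanged` returns `False`.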
A structured file (YAML, JSON, Markdown — whatever fits the domain) containing every parameter the agent can adjust. For each parameter:
Use the template: templates/parameters-template.md.
experiments/ Directory: The Log
Each experiment gets its own file: experiments/NNN-hypothesis-slug.md.
Use the template: templates/experiment-template.md.
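One way the `experiments/NNN-hypothesis-slug.md` naming could be produced is sketched below. The three-digit zero padding is read off the `NNN` placeholder; the slug rule (lowercase alphanumeric words, first six kept) is an assumption, and both function names are hypothetical.

```python
import re
from pathlib import Path


def slugify(text, max_words=6):
    """Lowercase, keep alphanumeric runs, join the first few with hyphens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words[:max_words])


def experiment_path(number, hypothesis, root="experiments"):
    """Build experiments/NNN-hypothesis-slug.md for a given experiment number."""
    return Path(root) / f"{number:03d}-{slugify(hypothesis)}.md"
```

For example, `experiment_path(7, "Raise recent weight to 0.85")` yields `experiments/007-raise-recent-weight-to-0-85.md`.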
archive.json: Stepping Stones Archive
Maintain the top N parameter configs (default: 5) with their scores. Instead of always branching from the current best, the researcher can branch from any archived config; this is how you escape local optima.
{
  "entries": [
    {
      "id": "exp-025",
      "metric": 3.50,
      "params_snapshot": { },
      "lineage": "baseline → exp-001 → exp-009 → exp-025",
      "notes": "Best blending weights, no structural features"
    }
  ],
  "max_entries": 5
}
Keep configs that are diverse, not just the top 5 by raw score. When two configs have similar scores, keep the one that's more different from existing entries — this is the quality-diversity principle from MAP-Elites and from the HyperAgents archive.
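The eviction rule in the paragraph above can be sketched as follows. This is one possible reading, not the skill's own implementation: it assumes a lower metric is better, that `params_snapshot` is a flat dict of numbers, that "similar scores" means within a `tolerance` of the worst score, and that diversity is the distance to the nearest other entry. All function names are hypothetical.

```python
def distance(a, b):
    """Mean absolute difference over the numeric parameters both configs share."""
    keys = [k for k in a if k in b]
    if not keys:
        return 0.0
    return sum(abs(a[k] - b[k]) for k in keys) / len(keys)


def novelty(entry, others):
    """Distance from an entry to its nearest neighbour among the others."""
    return min(
        (distance(entry["params_snapshot"], o["params_snapshot"]) for o in others),
        default=0.0,
    )


def add_to_archive(archive, candidate, tolerance=0.05):
    """Insert candidate, then evict until the archive fits max_entries.

    Among the entries whose metric is within `tolerance` of the worst one,
    the least novel entry is evicted, so near-ties are broken by diversity.
    """
    entries = archive["entries"] + [candidate]
    while len(entries) > archive["max_entries"]:
        worst = max(e["metric"] for e in entries)
        near_worst = [e for e in entries if worst - e["metric"] <= tolerance]
        victim = min(
            near_worst,
            key=lambda e: novelty(e, [o for o in entries if o is not e]),
        )
        entries = [e for e in entries if e is not victim]
    return {**archive, "entries": entries}
```

With this rule a clearly better config always displaces a clearly worse one, while two configs with near-identical scores compete on how different they are from the rest of the archive.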
coverage.json: Exploration Map
Auto-tracked file recording which parameter dimensions have been explored and how thoroughly:
{
  "dimensions": {
    "blending.recent_weight": {"experiments": 12, "range_tested": [0.50, 0.90], "last_explored": "exp-036"},
    "momentum.trend_weight": {"experiments": 0, "range_tested": null, "last_explored": null},
    "mean_reversion.strength": {"experiments": 2, "range_tested": [0.2, 0.5], "last_explored": "exp-041"}
  }
}
The researcher agent reads this before each experiment to steer toward under-explored dimensions. Don't spend 20 experiments micro-tuning one knob when another knob has never been touched.
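A small sketch of how the coverage map could steer the researcher and be kept up to date, using the `coverage.json` structure shown above. The selection rule (fewest experiments first, ties broken by file order) and the function names are assumptions for illustration.

```python
def next_dimension(coverage):
    """Pick the dimension with the fewest experiments (ties: first listed)."""
    dims = coverage["dimensions"]
    return min(dims, key=lambda name: dims[name]["experiments"])


def record_experiment(coverage, name, value, exp_id):
    """Update count, tested range, and last-explored id for one dimension.

    Mutates and returns the coverage dict.
    """
    entry = coverage["dimensions"][name]
    entry["experiments"] += 1
    low, high = entry["range_tested"] or (value, value)
    entry["range_tested"] = [min(low, value), max(high, value)]
    entry["last_explored"] = exp_id
    return coverage
```

Applied to the example file above, `next_dimension` returns `momentum.trend_weight`, the only dimension with zero experiments.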
- #000 (the starting point)
- archive.json with the baseline config
- coverage.json with all parameter dimensions at 0
The researcher proposes and tests changes:
- coverage.json. If any dimension has 0 experiments, explore it before micro-tuning explored dimensions.
- archive.json and explore from there.
The critic validates with binary pass/fail gates. Verification is fundamentally easier than generation (the P vs NP intuition: checking an answer is easier than producing one). The critic doesn't propose changes; it only validates or rejects.
- --holdout. Did the metric improve on data the researcher never saw? If the training metric improved but holdout worsened → FAIL (overfitting).
- research.md) can flag: "You increased the holiday threshold to 3.0×, which means you're barely trimming holidays at all. This is likely overfitting to the few holiday weeks in the test set."
Decision: ALL gates must pass → ✅ KEEP. Any gate fails → ❌ DISCARD with the specific gate failure noted.
The critic's gates are defined in research.md and are configurable per domain. Some domains need stricter or different gates. The critic never modifies the eval script.
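The all-gates-must-pass decision can be sketched as a list of named predicates. The holdout gate follows the rule stated above (the held-out metric must improve, with lower assumed to be better). The no-regression gate borrows its name from the gate statistics table in the session summary, but its logic here is a guess, and the `result` dict layout is hypothetical.

```python
def holdout_gate(result):
    """Pass only if the held-out metric improved (lower is assumed better)."""
    return result["holdout_after"] < result["holdout_before"]


def no_regression_gate(result, tolerance=0.0):
    """Hypothetical: no secondary metric may worsen by more than `tolerance`."""
    return all(
        after - before <= tolerance
        for before, after in result.get("secondary", {}).values()
    )


GATES = [
    ("Holdout validation", holdout_gate),
    ("No-regression", no_regression_gate),
]


def critic_decision(result, gates=GATES):
    """Run every gate; KEEP only if all pass, else DISCARD naming the failures."""
    failed = [name for name, gate in gates if not gate(result)]
    return ("KEEP", []) if not failed else ("DISCARD", failed)
```

Because each gate is just a function from the experiment result to a boolean, a domain-specific `research.md` could add or tighten gates without touching the decision logic.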
Every 10 experiments (configurable), a fresh meta-reviewer agent reads the entire experiment log and does four things:
Pattern analysis. "The last 8 experiments all tried blending weight variations and 6 failed. The parameter space for blending is exhausted."
Coverage gaps. "Momentum, mean reversion, and week-of-month have never been explored. These represent structural changes that could break through the current ceiling."
Strategy update. Append to research.md under ## Learned Insights:
## Learned Insights (auto-generated by meta-reviewer)
### Insight from review at experiment #030 (2026-04-03)
- Blending weights are saturated (0.78–0.82 range all within noise)
- The remaining error is structural: week 3 consistently under-forecasts by 7%
- RECOMMENDATION: Explore week-of-month adjustments before any more blending experiments
- RECOMMENDATION: Try per-segment parameter overrides — segments have different error profiles
Archive pruning. Suggest removing archive entries that are clearly dominated or too similar to other entries.
This is metacognitive self-modification from the HyperAgents paper — the system improves not just the parameters but its own improvement strategy. A meta-reviewer is the mechanism that stops the researcher from spending 50 experiments on blending weights when the real breakthrough is a structural change the researcher couldn't see from inside the loop.
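The mechanical part of the meta-review (when it fires, and how an insight block lands in `research.md`) might look like the sketch below. The heading text mirrors the example above; the function names, the zero-padded experiment number, and the choice to create the section on first use are assumptions. The analysis that produces the bullet text is the agent's job and is not modelled here.

```python
from pathlib import Path

SECTION = "## Learned Insights (auto-generated by meta-reviewer)"


def review_due(experiment_count, every=10):
    """True after every `every` experiments (10 is the default in the text)."""
    return experiment_count > 0 and experiment_count % every == 0


def append_insights(research_md, experiment_number, date, bullets):
    """Append one dated insight block, creating the section header if missing."""
    path = Path(research_md)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    # Start on a fresh line if the file does not already end with one.
    parts = ["" if not text or text.endswith("\n") else "\n"]
    if SECTION not in text:
        parts.append(f"\n{SECTION}\n")
    parts.append(
        f"### Insight from review at experiment #{experiment_number:03d} ({date})\n"
    )
    parts.extend(f"- {b}\n" for b in bullets)
    with path.open("a", encoding="utf-8") as fh:
        fh.write("".join(parts))
```

Opening the file in append mode keeps the human-written instructions untouched, which matches the rule that the meta-reviewer may only append.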
Present a summary table at the end of the session:
## Autoresearch Session: [System Name]
Date: YYYY-MM-DD
Duration: [time]
### Results
| Metric | Before | After | Change | Holdout |
|--------|--------|-------|--------|---------|
| [Primary] | X.XX | X.XX | -X.X% | X.XX |
| [Secondary] | X.XX | X.XX | -X.X% | X.XX |
### Experiments Run: N (kept: K, critic-rejected: C, researcher-reverted: R)
| # | Hypothesis | Change | Train | Holdout | Critic | Decision |
|---|------------|--------|-------|---------|--------|----------|
| 1 | [description] | [param: old → new] | -X.X% | -X.X% | ✅ PASS | ✅ KEEP |
| 2 | [description] | [param: old → new] | -X.X% | +X.X% | ❌ Gate 1 | ❌ DISCARD |
### Critic Gate Statistics
| Gate | Passed | Failed | Rejection Rate |
|------|--------|--------|----------------|
| Holdout validation | 15 | 3 | 17% |
| No-regression | 16 | 2 | 11% |
| Stability | 14 | 4 | 22% |
| Directional sanity | 18 | 0 | 0% |
### Coverage Map
| Dimension | Experiments | Range Tested | Status |
|-----------|-------------|--------------|--------|
| blending weights | 15 | 0.50–0.90 | Saturated |
| momentum | 0 | — | Unexplored |
### Meta-Reviewer Insights
[Key strategy changes recommended during the session]
### Archive (Top 5 Configs)
[Diverse set of high-performing parameter configs]
### Remaining Weaknesses
[Biggest remaining errors — seeds for the next session]
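The rejection rates in the Critic Gate Statistics table are failed divided by passed plus failed, rounded to a whole percent. A short sketch, with hypothetical function names, that reproduces those rows from raw counts:

```python
def rejection_rate(passed, failed):
    """Share of evaluated experiments that a gate rejected, as a whole percent."""
    total = passed + failed
    return round(100 * failed / total) if total else 0


def gate_table(stats):
    """Render {gate: (passed, failed)} as Markdown rows for the summary table."""
    return [
        f"| {gate} | {p} | {f} | {rejection_rate(p, f)}% |"
        for gate, (p, f) in stats.items()
    ]
```

For the sample counts in the template, 3 failures out of 18 gives 17%, 2 out of 18 gives 11%, and 4 out of 18 gives 22%.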
If the user wants continuous improvement:
- research.md (under ## Learned Insights) and update the archive. Its job is strategy, not tactics.
- revenue-forecaster skill, which ships with an eval script and parameters file pre-wired for autoresearch
- google-ads skill
- lighthouse --output=json URL | jq '.categories.performance.score'
This skill stands on the shoulders of several sources. The extensions in v2 are applications of ideas from the references below.
Andrej Karpathy — autoresearch (announcement thread) — the original pattern: a human writes program.md files that instruct an agent to iterate on experiments overnight. The single-file simplicity, the sacred eval, the git-log-as-experiment-trail, and the "you're programming the program.md" framing all come from here. See references/karpathy-autoresearch.md.
Generator/verifier separation (a.k.a. the critic pattern) — the principle that verification is easier than generation, and that you get better results by having one agent propose and a separate agent validate with concrete binary gates. Popularized in multi-agent LLM orchestration writing (see Anthropic's engineering posts on multi-agent systems) and in the trading-strategy literature around quality-diversity archives plus adversarial critics. See references/multi-agent-patterns.md.
HyperAgents — Zhang et al. (2026) (code) — introduces metacognitive self-modification: the improvement mechanism itself is editable, so the agent can improve both how it solves tasks and how it generates future improvements. The meta-reviewer agent, the archive of stepping stones, and coverage-driven exploration in this skill are direct applications of ideas from that paper. See references/hyperagents.md.
MAP-Elites and quality-diversity algorithms (Mouret & Clune, 2015) — the principle that you should maintain a diverse archive of good solutions rather than just the single best, because diverse parents lead to stepping-stone discoveries. This is why the archive here is scored on quality and diversity, not raw metric alone.
The original insight — that you should set this up, go to sleep, and wake up to better parameters — is still Karpathy's. Everything else is scaffolding around that core idea.
- playbook-discovery: Find repeatable workflows to optimize; use autoresearch once you know what to improve
- revenue-forecaster: Ships with an autoresearch-ready eval script, parameters file, and research.md template, so you can tune it on your own historical data
- semantic-layer-audit: Document the data sources your eval script pulls from
"You're not touching Python files like you normally would. Instead, you're programming the program.md files that provide context to the AI agents." (Andrej Karpathy)