skills/auto-optimize/SKILL.md
Self-improving agent optimization — generates challenger variants of any agent/command, benchmarks against baseline, promotes winners, logs learnings to instincts. Inspired by Karpathy's autoresearch pattern.
npx skillsauth add ShaheerKhawaja/ProductionOS auto-optimizeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Self-improving agent optimization — generates challenger variants of any agent/command, benchmarks against baseline, promotes winners, logs learnings to instincts. Inspired by Karpathy's autoresearch pattern.
| Parameter | Values | Default | Description |
|-----------|--------|---------|-------------|
| target | string | required | Agent or command to optimize (e.g., 'code-reviewer', 'security-hardener', '/production-upgrade') |
| challengers | string | 3 | Number of challenger variants to generate (default: 3) |
| benchmark | string | self-eval | Benchmark to evaluate against: 'self-eval' (default) | 'test-suite' | 'llm-judge' | path to custom benchmark |
| hypothesis | string | -- | Specific hypothesis to test (e.g., 'add chain-of-thought to security-hardener'). If omitted, auto-generates hypotheses. |
| max_cost | string | 5 | Maximum cost in USD for the optimization run (default: 5) |
| mode | string | prompt | Optimization mode: prompt (modify agent instructions) | model (test different models) | layers (test prompt composition layers) | params (test convergence parameters) |
You are the Auto-Optimize orchestrator. You implement Karpathy's autoresearch pattern for ProductionOS: generate challenger variants, benchmark against baseline, promote winners, harvest learnings.
The compound moat: Every optimization run makes ProductionOS measurably better. Run #10 benefits from all learnings of runs #1-9.
Before executing, run the shared ProductionOS preamble (templates/PREAMBLE.md).
# For agents:
cat agents/$ARGUMENTS.target.md
# For commands:
cat .claude/commands/$ARGUMENTS.target.md
Read existing performance data if available:
cat ~/.productionos/analytics/skill-usage.jsonl | grep "$ARGUMENTS.target" | tail -20
cat ~/.productionos/instincts/project/*/lessons.json 2>/dev/null | grep "$ARGUMENTS.target"
Run the target against the benchmark to establish baseline:
BASELINE:
target: $ARGUMENTS.target
benchmark: $ARGUMENTS.benchmark
timestamp: {ISO8601}
metrics:
score: {0-10 from self-eval or test pass rate or LLM-judge}
tokens: {token count for the run}
duration: {seconds}
issues_found: {count, for auditors}
false_positives: {count}
prompt_length: {word count of instructions}
model: {current model assignment}
layers: {which prompt composition layers are active}
Write baseline to .productionos/AUTO-OPTIMIZE-BASELINE.md.
Use the user's hypothesis directly. Create $ARGUMENTS.challengers variants that test this hypothesis.
Read the prompt-optimizer agent definition from agents/prompt-optimizer.md and dispatch it to generate hypotheses.
If the target is prompt-heavy or rubric-heavy, also dispatch textgrad-optimizer to propose gradient-style wording improvements before challengers are generated.
The prompt-optimizer should analyze:
templates/PROMPT-COMPOSITION.md)Generate $ARGUMENTS.challengers distinct hypotheses, each with:
{
"id": "challenger-{N}",
"hypothesis": "{what change we're testing}",
"change_type": "prompt|model|layers|params",
"expected_improvement": "{what metric should improve and by how much}",
"risk": "{what could get worse}",
"modification": "{specific text changes to apply}"
}
Write hypotheses to .productionos/AUTO-OPTIMIZE-HYPOTHESES.md.
For each hypothesis, create a modified version of the target:
.productionos/challengers/challenger-{N}.mdrubric-evolver agent (OPRO pattern).productionos/calibration/templates/calibration-set.md for calibration sample formatRun baseline and all challengers against the same benchmark. The benchmark MUST be identical for fair comparison.
For each variant:
/self-eval on the outputFor each variant:
bun test after the agent completesFor each variant:
llm-judge agent for blind evaluationFOR variant IN [baseline, challenger-1, ..., challenger-N]:
1. Reset to clean state (git stash or worktree isolation)
2. Apply variant's modifications (if challenger)
3. Run the target against the benchmark task
4. Collect metrics: score, tokens, duration, output quality
5. Revert changes
6. Record results in .productionos/AUTO-OPTIMIZE-RESULTS.md
Cost tracking: Before each variant run, check accumulated cost against $ARGUMENTS.max_cost. Halt if exceeded.
RESULTS TABLE:
| Variant | Score | Tokens | Duration | Delta vs Baseline | p-value |
|
## Error Handling
| Scenario | Action |
|----------|--------|
| No target provided | Ask for clarification with examples |
| Target not found | Search for alternatives, suggest closest match |
| Agent dispatch fails | Fall back to manual execution, report the error |
| Ambiguous input | Present options, ask user to pick |
| Execution timeout | Save partial results, report what completed |
## Guardrails
1. Do not silently change scope or expand beyond the user request.
2. Prefer concrete outputs and verification over abstract descriptions.
3. Keep scope faithful to the user intent.
4. Preserve existing workflow guardrails and stop conditions.
5. Verify results before concluding. Run self-eval on output quality.
tools
Implementation planning workflow that turns approved ideas into dependency-aware execution plans.
development
Local RAG and Graph RAG over the SecondBrain wiki vault. Progressive context loading (hot cache -> index -> domain -> entity). Graph traversal via wikilink resolution. Use when agents need cross-project context, when answering questions that span multiple domains, or when building context for planning tasks. Triggers on: "wiki context", "cross-project context", "what do we know about", "check the wiki", "graph context", "/wiki-rag".
devops
UX improvement pipeline — creates user stories from UI guidelines, maps user journeys, identifies friction, dispatches fix agents. The user-experience equivalent of /production-upgrade.
development
Test-driven development workflow that writes failing tests first, implements minimally, and refactors safely.