skills/experiment-loop/SKILL.md
Autonomous experiment loop: hypothesize > modify > test > evaluate > keep/discard > repeat. Run N experiments automatically with measurable metrics. Works for performance optimization, A/B testing, prompt engineering, and any measurable improvement task.
Install this skill globally with one command: `npx skillsauth add rubicanjr/FinCognis experiment-loop`. Works with Claude Code, Cursor, and Windsurf.
Autonomous, iterative improvement inspired by Karpathy's autoresearch methodology. Define a metric, set a target, and let the loop run until the target is met or the iteration limit is reached.
1. HYPOTHESIZE -> Form a specific, falsifiable improvement hypothesis
2. MODIFY -> Apply the minimal code/config/prompt change
3. TEST -> Run the measurement suite (benchmarks, tests, evals)
4. EVALUATE -> Compare result against baseline and previous best
5. DECIDE -> KEEP if better, DISCARD (git stash pop --index) if worse

Repeat from step 1 until the target is met OR max_iterations is reached.
Each iteration is atomic: one hypothesis, one change, one measurement, one decision.
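The five steps can be sketched as a small driver function. This is a minimal sketch, not the skill's actual implementation: `run_loop`, `try_hypothesis`, and `LoopResult` are hypothetical names, and counting a value equal to the target as "met" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class LoopResult:
    best: float
    iterations_used: int
    target_met: bool


def is_better(candidate: float, reference: float, direction: str) -> bool:
    """True when candidate strictly improves on reference."""
    if direction == "minimize":
        return candidate < reference
    if direction == "maximize":
        return candidate > reference
    raise ValueError(f"unknown direction: {direction!r}")


def meets_target(value: float, target: float, direction: str) -> bool:
    """Assumption: reaching the target exactly counts as success."""
    if direction == "minimize":
        return value <= target
    if direction == "maximize":
        return value >= target
    raise ValueError(f"unknown direction: {direction!r}")


def run_loop(
    baseline: float,
    target: float,
    direction: str,
    max_iterations: int,
    hypotheses: Iterable[str],
    try_hypothesis: Callable[[str], Optional[float]],
) -> LoopResult:
    """One hypothesis, one change, one measurement, one decision per pass.

    try_hypothesis applies the change and returns the measured value,
    or None when the measurement failed (treated as DISCARD).
    """
    best = baseline
    used = 0
    for hypothesis in hypotheses:
        if used >= max_iterations or meets_target(best, target, direction):
            break
        used += 1
        result = try_hypothesis(hypothesis)
        if result is not None and is_better(result, best, direction):
            best = result  # KEEP: this becomes the new reference point
        # otherwise DISCARD: best stays unchanged
    return LoopResult(best, used, meets_target(best, target, direction))
```

Passing the measurement step in as a callable keeps the control flow testable without touching git or running real benchmarks.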
Define an experiment in your task or in thoughts/EXPERIMENTS.md:
```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize              # minimize | maximize
  max_iterations: 10               # hard cap, never exceed
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"           # JSON key from bench output
  scope: "src/api/"                # files the loop is allowed to touch
```
| Field | Description |
|-------|-------------|
| metric | Human-readable name of what you are measuring |
| baseline | Measured value before any changes (run this first) |
| target | Success condition -- loop exits when this is met |
| direction | minimize for latency/size, maximize for coverage/score |
| max_iterations | Safety cap, default 10, absolute maximum 10 |
| measurement_cmd | Shell command that produces JSON with the metric value |
| scope | Directories/files the loop is allowed to modify |
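The field rules above could be enforced when a definition is loaded. The following is a hedged sketch: `ExperimentConfig` and `ABSOLUTE_MAX_ITERATIONS` are hypothetical names, and the check that the target lies on the improving side of the baseline is an added assumption rather than a documented rule.

```python
from dataclasses import dataclass

ABSOLUTE_MAX_ITERATIONS = 10  # from the field table: "absolute maximum 10"


@dataclass(frozen=True)
class ExperimentConfig:
    name: str
    metric: str
    baseline: float
    target: float
    direction: str
    measurement_cmd: str
    measurement_key: str
    scope: str
    max_iterations: int = 10  # documented default

    def __post_init__(self) -> None:
        if self.direction not in ("minimize", "maximize"):
            raise ValueError("direction must be 'minimize' or 'maximize'")
        if not 1 <= self.max_iterations <= ABSOLUTE_MAX_ITERATIONS:
            raise ValueError(
                f"max_iterations must be between 1 and {ABSOLUTE_MAX_ITERATIONS}"
            )
        # Assumption: a target that is not an improvement over the
        # baseline is almost certainly a configuration mistake.
        if self.direction == "minimize":
            improving = self.target < self.baseline
        else:
            improving = self.target > self.baseline
        if not improving:
            raise ValueError("target must improve on baseline")
```

Failing fast here means a mistyped direction or an oversized iteration cap is caught before any change is made to the working tree.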
Before every experiment iteration:
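```
# Save current state as a snapshot
git stash push -u -m "experiment-loop: iteration N baseline"
# stash push also reverts the working tree, so re-apply the snapshot
# (the stash entry stays) to keep earlier KEEP changes in place
git stash apply --index

# Run experiment
# ... apply hypothesis change ...
# ... run measurement ...

# Decision
if result is better:
    git stash drop           # keep changes, discard snapshot
else:
    git reset --hard         # discard the experimental change
    git clean -fd            # and any untracked files it created
    git stash pop --index    # restore exactly: staged + unstaged
```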
Never skip the stash. Never accumulate multiple iterations without a decision checkpoint. If the measurement command fails or times out, treat it as DISCARD.
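The checkpoint can be wrapped in a few helpers so that no iteration runs without it. This is a sketch under stated assumptions, not the skill's own code: `snapshot`, `keep`, `discard`, and `checkpoint` are hypothetical names. Because `git stash push` also reverts the working tree to HEAD, the sketch re-applies the snapshot immediately and, on DISCARD, clears the experimental change before popping. It also assumes there is something to stash: on a clean tree `git stash push` creates no entry, so a real implementation would need to detect that case first.

```python
import subprocess
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], None]


def git(args: Sequence[str]) -> None:
    """Run a git command; a non-zero exit raises CalledProcessError."""
    subprocess.run(["git", *args], check=True)


def snapshot(iteration: int, run: Runner = git) -> None:
    """Stash the current state, then restore it so work continues on top."""
    message = f"experiment-loop: iteration {iteration} baseline"
    run(["stash", "push", "-u", "-m", message])
    run(["stash", "apply", "--index"])  # the stash entry is kept


def keep(run: Runner = git) -> None:
    """KEEP: the working tree already holds the change; drop the snapshot."""
    run(["stash", "drop"])


def discard(run: Runner = git) -> None:
    """DISCARD: wipe the experimental change, then restore the snapshot."""
    run(["reset", "--hard"])
    run(["clean", "-fd"])
    run(["stash", "pop", "--index"])


def checkpoint(
    iteration: int,
    experiment: Callable[[], Optional[float]],
    is_better: Callable[[float], bool],
    run: Runner = git,
) -> str:
    """Run one iteration inside a snapshot and return 'KEEP' or 'DISCARD'.

    Errors raised by git itself are not caught here, so a stash failure
    stops the loop instead of being mistaken for a bad measurement.
    """
    snapshot(iteration, run)
    try:
        result = experiment()
    except Exception:
        result = None  # failed or timed-out measurement counts as DISCARD
    if result is not None and is_better(result):
        keep(run)
        return "KEEP"
    discard(run)
    return "DISCARD"
```

The `run` parameter exists only so the command sequence can be exercised with a recording stub instead of a real repository.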
The experiment loop coordinates three vibecosystem agents:
| Phase | Agent | Role |
|-------|-------|------|
| Hypothesize | profiler | Identify bottlenecks, suggest what to change |
| Modify | spark | Apply the focused code change |
| Test + Evaluate | verifier / tdd-guide | Run benchmarks, tests, evals and parse results |
Spawn profiler once at the start to get the initial hypothesis queue. Then run spark + verifier in tight loops per iteration.
```yaml
experiment:
  name: "optimize-bundle-size"
  metric: "gzipped bundle size (KB)"
  baseline: 420
  target: 300
  direction: minimize
  max_iterations: 10
  measurement_cmd: "npm run build && node scripts/measure-bundle.js"
  measurement_key: "gzipped_kb"
  scope: "src/"
```
Hypothesis queue to try in order:
1. Replace `moment` with `date-fns` (smaller footprint)
2. Use dynamic `import()` at route boundaries
3. Set `usedExports: true` in webpack/rollup config
4. Replace `axios` with native `fetch` wrapper

```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize
  max_iterations: 8
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"
  scope: "src/api/"
```
Hypothesis queue:
- Connection pool sizing (`max: 20`)
- … (`Promise.all`)

```yaml
experiment:
  name: "improve-test-coverage"
  metric: "line coverage (%)"
  baseline: 64
  target: 80
  direction: maximize
  max_iterations: 10
  measurement_cmd: "npm test -- --coverage --json > coverage.json"
  measurement_key: "coverageMap.total.lines.pct"
  scope: "src/"
```
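A `measurement_key` such as `coverageMap.total.lines.pct` suggests that keys may address nested JSON. The sketch below treats the key as a dot-separated path, which is an assumption, and it reads JSON from the command's standard output, which is also an assumption: a command that redirects its output to a file would need that file read instead. `extract_metric` and `measure` are hypothetical names.

```python
import json
import subprocess
from typing import Any, Optional


def extract_metric(payload: Any, key: str) -> float:
    """Follow a dotted key through nested JSON objects and return a float."""
    node = payload
    for part in key.split("."):
        node = node[part]
    return float(node)


def measure(cmd: str, key: str, timeout_s: float = 600.0) -> Optional[float]:
    """Run the measurement command and return the metric, or None on failure.

    None covers a non-zero exit, a timeout, invalid JSON, and a missing key,
    so the caller can treat every failure mode as DISCARD.
    """
    try:
        proc = subprocess.run(
            cmd,
            shell=True,  # measurement_cmd may chain commands with &&
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
        return extract_metric(json.loads(proc.stdout), key)
    except (subprocess.SubprocessError, KeyError, TypeError, ValueError):
        return None
```

The 600-second default timeout is arbitrary; `json.JSONDecodeError` is a subclass of `ValueError`, so malformed output falls into the same branch as a non-numeric value.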
```yaml
experiment:
  name: "improve-extraction-accuracy"
  metric: "extraction F1 score"
  baseline: 0.71
  target: 0.85
  direction: maximize
  max_iterations: 10
  measurement_cmd: "python eval/run_evals.py --output eval/results.json"
  measurement_key: "f1"
  scope: "prompts/"
```
Append each iteration result to thoughts/EXPERIMENTS.md:
```markdown
## Experiment: reduce-api-latency
Started: 2026-04-07T10:00:00Z
Baseline: 340ms | Target: 200ms | Direction: minimize

### Iteration 1
- Hypothesis: Add Redis cache for repeated DB reads
- Change: `src/api/users.ts` lines 45-67 -- wrap DB call with cache layer
- Result: 280ms (improvement: -60ms, -17.6%)
- Decision: KEEP
- Cumulative best: 280ms

### Iteration 2
- Hypothesis: Replace N+1 queries with JOIN
- Change: `src/api/users.ts` lines 89-102 -- rewrite fetchWithPosts()
- Result: 210ms (improvement: -70ms, -25%)
- Decision: KEEP
- Cumulative best: 210ms

### Iteration 3
- Hypothesis: Add connection pool sizing max:20
- Change: `src/db/pool.ts` line 12 -- max: 10 -> 20
- Result: 215ms (regression: +5ms)
- Decision: DISCARD (restored via git stash pop)
- Cumulative best: 210ms

### Final Result
- Target: 200ms | Achieved: 210ms | Status: NEAR_MISS (within 5%)
- Iterations: 3 of 10 used
- Total improvement: -38% from baseline
```
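The per-iteration entries follow a fixed shape, so they can be generated. A minimal sketch, assuming percentages are computed relative to the previous best (which matches the -17.6% and -25% figures in the sample log); `format_iteration` is a hypothetical helper.

```python
def _signed(value: float) -> str:
    """One decimal with an explicit sign, dropping a trailing '.0'."""
    return f"{value:+.1f}".rstrip("0").rstrip(".")


def format_iteration(
    n: int,
    hypothesis: str,
    change: str,
    previous_best: float,
    result: float,
    direction: str,
    unit: str = "",
) -> str:
    """Render one iteration as markdown in the log format used above."""
    delta = result - previous_best
    improved = delta < 0 if direction == "minimize" else delta > 0
    if improved:
        pct = delta / previous_best * 100
        detail = f"improvement: {delta:+g}{unit}, {_signed(pct)}%"
        decision = "KEEP"
        best = result
    else:
        label = "regression" if delta else "no change"
        detail = f"{label}: {delta:+g}{unit}"
        decision = "DISCARD (restored via git stash pop)"
        best = previous_best
    return "\n".join(
        [
            f"### Iteration {n}",
            f"- Hypothesis: {hypothesis}",
            f"- Change: {change}",
            f"- Result: {result:g}{unit} ({detail})",
            f"- Decision: {decision}",
            f"- Cumulative best: {best:g}{unit}",
        ]
    )
```

Appending the returned string to `thoughts/EXPERIMENTS.md` after each decision keeps the log in step with the working tree.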
| Condition | Action |
|-----------|--------|
| Target met | EXIT -- log SUCCESS, keep all accumulated changes |
| max_iterations reached | EXIT -- log PARTIAL, keep best achieved state |
| 3 consecutive DISCARDs | PAUSE -- re-run profiler for new hypothesis queue |
| Measurement command fails | DISCARD current iteration, continue loop |
| Git stash fails | STOP -- do not continue, report error |
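These conditions can be evaluated after every decision. The sketch below is one possible encoding: `Action` and `next_action` are hypothetical names, and the order in which the conditions are checked (stash failure first, then target, then iteration cap, then discard streak) is an assumption, since the table does not rank them. A failed measurement needs no branch of its own because it is already recorded as a DISCARD.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    EXIT_SUCCESS = "exit: SUCCESS"
    EXIT_PARTIAL = "exit: PARTIAL"
    PAUSE_REPROFILE = "pause: re-run profiler"
    STOP_ERROR = "stop: report error"


def next_action(
    target_met: bool,
    iterations_used: int,
    max_iterations: int,
    consecutive_discards: int,
    git_stash_failed: bool = False,
) -> Action:
    """Map the loop state after an iteration to the next step."""
    if git_stash_failed:
        return Action.STOP_ERROR  # repository state is no longer trustworthy
    if target_met:
        return Action.EXIT_SUCCESS
    if iterations_used >= max_iterations:
        return Action.EXIT_PARTIAL
    if consecutive_discards >= 3:
        return Action.PAUSE_REPROFILE
    return Action.CONTINUE
```

Checking the stash failure first reflects the idea that nothing else about the loop state can be trusted once the snapshot mechanism has broken.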
Invoke this skill by describing the experiment:
```
Use experiment-loop to reduce the API p95 latency from 340ms to under 200ms.
Baseline measurement: npm run bench:api
Max iterations: 8
Scope: src/api/
```
The loop will:
- Check `thoughts/EXPERIMENTS.md` for prior runs on the same metric
- Spawn `profiler` for an ordered hypothesis queue