.claude/skills/iterate/SKILL.md
--- name: iterate description: WF8 structured experiment iteration. Manages the hypothesis→code→run→eval cycle, maintains iteration_log.json, with optional Codex cross-validation. Supported commands: plan (design iteration), code (implement changes), run (execute training + collect metrics), eval (evaluate results), ablate (ablation experiments), status (view progress), log (full history). argument-hint: "[plan|code|run|eval|ablate|status|log] [details]" disable-model-invocation: true allowed-to
npx skillsauth add linzhe001/Harness-Research .claude/skills/iterateInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Input: Working codebase from WF7 (code-expert) + baseline metrics from WF5. Output: iteration_log.json (continuously updated), best checkpoint for WF9. On CONTINUE (final) → WF9 (final-exp).
The iteration log file is at iteration_log.json in the project root.
For the schema, see templates/iteration-log-schema.json.
For language behavior, see ../../shared/language-policy.md.
iteration_log.json — The single source of truth for experiments. All iteration data (hypothesis, metrics, decisions, lessons) is written here only.PROJECT_STATE.json — Only manages stage transitions. iterate does not write directly to PROJECT_STATE.json.
Stage-level decisions (CONTINUE→WF9, PIVOT→WF2) are handled by orchestrator reading iteration_log.json and then updating.project_map.json — Only manages code structure. Maintained by code-debug when interfaces change.Utility skills available:
/code-debug — for implementing code changes (called by code sub-command)/evaluate — for detailed metrics analysis (called by eval sub-command)Inter-skill context passing: Before calling /code-debug or /evaluate, write
context to .claude/iterations/iter{N}/context.json (persistent per-iteration context).
Symlink .claude/current_iteration.json → .claude/iterations/iter{N}/context.json
for sub-skill compatibility. After the sub-skill completes, remove the symlink but
keep the persistent context for crash recovery and historical traceability.
Codex MCP integration (always): The plan sub-command always calls Codex MCP for
cross-validation. Record codex_review: "used"|"unavailable" (never null).
Screening protocol: For experiments that don't introduce new architecture/loss,
recommend a 5K-10K step proxy run before full training. Add screening field to
iteration entry (screening.recommended as a structured boolean field).
/iterate phases are invoked via runtime adapter prompt, not directly by the user.iteration_log.json ownership is unchanged — /iterate still owns it exclusively.iteration_log.json via postcondition validation; it does not write to it..claude/iterations/** or any /iterate-owned state..auto_iterate/ is controller-owned; /iterate must not write to it.Before executing any sub-command, check whether .claude/current_iteration.json exists (symlink or regular file).
If it exists, the previous invocation was interrupted mid-execution (crash/timeout/cancellation), and cleanup is needed:
iteration_id from the file.claude/current_iteration.json symlink (keep .claude/iterations/iter{N}/context.json)Execute the corresponding sub-command based on the first word of $ARGUMENTS.
plan [hypothesis] — Design a new iterationPre-check: Read iteration_log.json, confirm there are no incomplete iterations with status="running" or "coding". If there are incomplete iterations, prompt the user to complete or abandon them first.
Assign a new iteration ID (incrementing, supports suffixes like iter25a, iter25b)
Record the hypothesis (extracted from $ARGUMENTS or asked from the user)
Repeated lesson check (Lesson Dedup Guard):
Scan the lessons field from all completed iterations in iteration_log.json.
If the current hypothesis is similar to a known failure pattern (keyword match or semantic similarity),
warn the user and list the related failed iterations and lessons. Proceed only after user confirmation.
Design the specific change plan:
Localize user-facing plan summaries and recommendations according to ../../shared/language-policy.md, while keeping schema keys, commands, and status/decision tokens unchanged.
Codex Cross-Validation (always triggered):
Every plan invocation calls Codex for review, regardless of change type.
If Codex MCP is available (mcp__codex__codex tool exists):
a. Format prompt: hypothesis + current best results + previously tried approaches + known lessons
b. Call mcp__codex__codex: "Review this experiment hypothesis. Are there known issues or better alternatives?"
c. Parse feedback
d. If there are concerns:
mcp__codex__codex-reply to reply with the updated plan
e. Maximum 3 rounds, until consensus or rounds exhausted
f. Record codex_review: "used" + review contentIf Codex MCP is unavailable → codex_review: "unavailable"
Create persistent context directory: mkdir -p .claude/iterations/iter{N}/
Write to iteration_log.json:
{
"id": "iter{N}",
"date": "{today}",
"hypothesis": "...",
"changes_summary": "...",
"config_diff": {...},
"status": "planned",
"screening": {"recommended": true},
"codex_review": "used" | "unavailable",
"codex_review_detail": {...} | null
}
code [description] — Implement changes.claude/iterations/iter{N}/context.json:
{
"caller": "iterate",
"sub_command": "code",
"mode": "planned_change",
"iteration_id": "iter{N}",
"hypothesis": "...",
"changes_summary": "...",
"config_diff": {...},
"best_iteration": "iter{X}",
"best_metric": "{value}",
"files_to_modify": ["src/...", "configs/..."],
"lessons_from_previous": ["lesson1", "lesson2"]
}
.claude/current_iteration.json → .claude/iterations/iter{N}/context.json/code-debug {description}, letting code-debug perform the actual code changes and commit.claude/current_iteration.json (keep persistent context)git_commit: commit hash (required, cannot be null)git_message: commit messagestatus: "training" (code is ready, awaiting training registration)run [config_path] — Execute training + collect metricsAutomatically executes training, runs eval, and collects metrics, replacing the previous manual training workflow.
Runtime variable resolution:
{TRAIN_SCRIPT}: Read from the Train line in CLAUDE.md ## Entry Scripts{EVAL_SCRIPT}: Read from the Eval line in CLAUDE.md ## Entry Scripts{exp_prefix}: Derived from PROJECT_STATE.json project_meta.name (lowercase + underscores)config_diff:
python {TRAIN_SCRIPT} --config {config_path} --no_snapshot
config_path obtained from $ARGUMENTS, or inferred from config_diffexp_dir (e.g., experiments/{exp_prefix}_{iter_id}/)run_in_background: true
started_at timestamppython {EVAL_SCRIPT} --checkpoint {best_ckpt} --scene_dir {scene_dir} --output_dir {exp_dir}/eval --downscale {downscale}
run_manifest: fill in command, config_path, exp_dir, duration_seconds, exit_code, checkpoint_pathmetrics: fill in only protocol-defined tracked metricstraining_trace: fill in auxiliary training info like best_step/final_step"running" (meaning "metrics collected, awaiting eval analysis")/iterate evalError handling:
Manual mode fallback: If the user passes --manual or training needs to run on a cluster,
degrade to registration mode: record command, config_path, exp_dir, expected_steps, status→"running",
user calls /iterate eval after training completes.
eval [log_path] — Evaluate results.claude/iterations/iter{N}/context.json:
{
"caller": "iterate",
"sub_command": "eval",
"iteration_id": "iter{N}",
"hypothesis": "...",
"changes_summary": "...",
"baseline_metrics": {...},
"best_iteration": "iter{X}",
"best_metric": "{value}",
"previous_iteration": {"id": "iter{N-1}", "primary_metric": ..., "metrics": {...}}
}
.claude/current_iteration.json → .claude/iterations/iter{N}/context.json/evaluate {log_path} for detailed analysis (or directly parse metrics).claude/current_iteration.jsonmetrics: fill in extracted metricsdecision: decisionlessons: lessons learnedstatus: "completed"best_iterationRecommended: /iterate plan "{improvement hypothesis based on lessons}"Recommended: /iterate plan "{improvement hypothesis based on lessons}" [debug-oriented]Recommended: /orchestrator next (advance to WF9 ablation experiments)Recommended: /orchestrator rollback 2 (roll back to architecture design)Recommended: /orchestrator decision (record termination decision)ablate [base_iter_id] --components "comp1,comp2,..." — Ablation experimentsQuickly determine individual component contributions during WF8 iteration, without waiting for WF9.
Usage: /iterate ablate {base_iter} --components "name1:override1,name2:override2"
Component format (each component needs name + config override):
| Component Name | Display Name | Config Override | Description |
|----------------|--------------|-----------------|-------------|
| {component_a} | {description} | {config.key=value} | Feature to disable |
| {component_b} | {description} | {config.key=value} | Feature to disable |
Example: /iterate ablate iter5a --components "aux_loss:loss.lambda_aux=0.0,lr_warmup:train.warmup_steps=0"
The component list is parsed from the --components argument (name:override pairs), or inferred from existing ablation records in iteration_log.json.
Execution flow:
name:override pairs from the --components argument in $ARGUMENTS){base_iter}_no_{component}
b. Check if iteration_log.json already has this ID with status="completed" → skip (supports resuming from interruption)
c. Build training command: python {TRAIN_SCRIPT} --config {base_config} --no_snapshot {override}
--components argument
d. Execute training using Bash tool with run_in_background: true
e. After training completes, run {EVAL_SCRIPT} to collect metrics
f. Record to iteration_log.json as a new iteration entry:{
"id": "{base_iter}_no_{component}",
"date": "{today}",
"hypothesis": "Ablation: remove {component_name} from {base_iter}",
"parent_iteration": "{base_iter}",
"ablation_component": "{component_name}",
"config_diff": {"{config.key}": "{disabled_value}"},
"status": "completed",
"metrics": {...}
}
ABLATION RESULTS (baseline: {base_iter}, primary metric: {primary_metric})
| Component | Metric | Delta | Contribution |
|-------------------|--------|--------|-------------|
| Full model | XX.XX | — | — |
| w/o {component_a} | XX.XX | -X.XX | significant |
| w/o {component_b} | XX.XX | -X.XX | moderate |
| w/o {component_c} | XX.XX | -X.XX | minimal |
significantmoderateminimalnegative (removal actually improves results)ablation_summary fieldError handling:
status — View current stateDisplay:
log — Full iteration historyDisplay all iterations in table format:
| Iter | Primary Metric | Status | Decision | Key Change | |------|----------------|--------|----------|------------| | iter{N} | XX.XX | completed | DEBUG | {key change description} | | ... | ... | ... | ... | ... |
If iteration_log.json does not exist, create the initial file:
project: obtained from PROJECT_STATE.json or CLAUDE.mdevaluation_protocol: obtained from PROJECT_STATE.json's evaluation_protocolbaseline_metrics: obtained from PROJECT_STATE.json's baseline_metrics, or prompt user for inputiterations: empty arraybest_iteration: nullIf .claude/iterations/ directory does not exist, create it.
</instructions>
development
WF7.5 training pipeline validation. Before entering WF8 iteration, first use Codex to review code for baseline equivalence, then run a 100-step smoke test to verify end-to-end pipeline functionality.
business
WF1 Inspiration survey and gap analysis. Takes the user's research idea, performs literature search, gap analysis, competitor analysis, and feasibility scoring, then outputs Feasibility_Report.md. Use when the user has a new CV research idea that needs a feasibility assessment.
tools
WF10 Submission/Release Tool. Multi-scene training, result packaging, filename validation, dry-run submission checks. Used after ablation experiments are complete and before competition submission.
development
WF2 Architecture refinement and MVP design. Reads the feasibility report, analyzes the base codebase architecture, designs plug-and-play new modules, defines the MVP, provides A/B/C alternative plans, and outputs Technical_Spec.md. Use when a research idea needs to be translated into a concrete technical architecture design.