.claude/skills/validate-run/SKILL.md
WF7.5 training pipeline validation. Before entering WF8 iteration, first use Codex to review code for baseline equivalence, then run a 100-step smoke test to verify end-to-end pipeline functionality.
npx skillsauth add linzhe001/Harness-Research validate-run
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill catches both kinds of problem: semantic errors via Codex code review and infrastructure errors via the smoke test. A failure here points to issues that must be fixed before entering WF8.
Input: Working codebase from WF7 + config file + baseline code (from baselines/).
Output: Code review findings + smoke test pass/fail report.
On PASS → WF8 (iterate). On FAIL → fix issues via /code-debug.
For language behavior, see ../../shared/language-policy.md.
</context>
<instructions>
1. **Determine configuration and locate files**
Get config_path from $ARGUMENTS, or infer the default config from CLAUDE.md. Read the config file to confirm training parameters.
Locate three sets of review materials:
① WF7 new code (subject of review):
- Files under the src/ node in project_map.json
- Training script {TRAIN_SCRIPT} and evaluation script {EVAL_SCRIPT} from CLAUDE.md ## Entry Scripts
② Baseline reference code (equivalence benchmark):
- entry_point from the baselines/ node in project_map.json
- Use the baseline with status: verified as reference
③ Design documents (implementation intent reference):
- docs/Technical_Spec.md (architecture design from WF2, specifying which parts should be equivalent to baseline and which are new additions)
2. **Codex code review (always attempt)**
WF7.5 is the only review gate before code enters iteration, so always attempt Codex review.
If Codex MCP is available (mcp__codex__codex tool exists):
a. Collect review materials: Read the three sets of files located in step 1. For each review dimension (data, model, loss, eval), pair each new-code module with its corresponding baseline module so Codex can compare them side by side.
b. Submit the review request: call mcp__codex__codex with the following prompt structure:
## Review Task
Check equivalence between new code and baseline, answering the review checklist item by item.
## New Code (WF7 Implementation)
### Data Loading: src/data/...
{file contents}
### Model: src/models/...
{file contents}
### Loss: src/losses/...
{file contents}
### Evaluation: scripts/{EVAL_SCRIPT} + src/utils/metrics.py
{file contents}
### Training Loop: scripts/{TRAIN_SCRIPT}
{file contents}
## Baseline Reference Implementation
### Data Loading: baselines/{name}/...
{file contents}
### Model: baselines/{name}/...
{file contents}
### Loss: baselines/{name}/...
{file contents}
### Evaluation: baselines/{name}/...
{file contents}
## Design Intent
{Key paragraphs from Technical_Spec.md}
## Review Checklist (answer each item)
{see below}
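The prompt structure above can be assembled mechanically once the paired files are known. The following is a minimal sketch, assuming the files have already been located in step 1; `render_section` and `build_review_prompt` are hypothetical helper names, not part of the skill, and the resulting string would still have to be passed to mcp__codex__codex by the agent.

```python
from pathlib import Path


def render_section(title: str, paths: list[Path]) -> str:
    """Render one '### title: path' heading per file, followed by its contents."""
    parts = []
    for path in paths:
        parts.append(f"### {title}: {path.as_posix()}")
        parts.append(path.read_text(encoding="utf-8"))
    return "\n".join(parts)


def build_review_prompt(
    new_code: dict[str, list[Path]],
    baseline: dict[str, list[Path]],
    design_intent: str,
    checklist: str,
) -> str:
    """Assemble the review prompt in the same section order as the template."""
    lines = [
        "## Review Task",
        "Check equivalence between new code and baseline, "
        "answering the review checklist item by item.",
        "## New Code (WF7 Implementation)",
    ]
    # Dict insertion order decides the order of review dimensions.
    lines += [render_section(title, paths) for title, paths in new_code.items()]
    lines.append("## Baseline Reference Implementation")
    lines += [render_section(title, paths) for title, paths in baseline.items()]
    lines += ["## Design Intent", design_intent]
    lines += ["## Review Checklist (answer each item)", checklist]
    return "\n".join(lines)
```

Using the same dimension keys (for example "Data Loading", "Model", "Loss", "Evaluation") in both dictionaries keeps the new and baseline sections in matching order, which makes the side-by-side comparison easier to follow.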
c. Review checklist (Codex must answer each item):
Data pipeline equivalence:
Model/rendering equivalence:
Loss computation equivalence:
Evaluation metric equivalence (critical, directly affects competition ranking):
Common ML bug checks:
d. Parse review results, classify as:
- critical: will definitely produce incorrect results (e.g., inconsistent metric computation, normalization errors)
- warning: may cause performance differences (e.g., different initialization strategy, loss weight deviations)
- info: style differences, does not affect correctness
e. If there are critical/warning level concerns:
- Use mcp__codex__codex-reply to reply with verification results, confirming or dismissing concerns
f. Record codex_review: "used" + review results
If Codex MCP is unavailable:
Claude performs a simplified self-review (only checking evaluation metric equivalence and data normalization),
recording codex_review: "unavailable".
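Both branches end by recording a codex_review status together with whatever findings survived verification. One way to hold that record is sketched below. The `Finding` dataclass, its field names, and the shape of the returned dictionary are assumptions made for illustration; the skill does not prescribe a storage format.

```python
from collections import Counter
from dataclasses import dataclass

SEVERITIES = ("critical", "warning", "info")


@dataclass
class Finding:
    dimension: str          # review dimension, e.g. "data", "model", "loss", "eval"
    severity: str           # one of SEVERITIES
    summary: str
    confirmed: bool = True  # set to False when verification dismisses the concern


def summarize_review(findings: list[Finding], codex_available: bool) -> dict:
    """Count confirmed findings per severity and record how the review was done."""
    for finding in findings:
        if finding.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {finding.severity!r}")
    active = [f for f in findings if f.confirmed]
    counts = Counter(f.severity for f in active)
    return {
        "codex_review": "used" if codex_available else "unavailable",
        "counts": {s: counts.get(s, 0) for s in SEVERITIES},
        # Only critical and warning items need the follow-up described in step e.
        "needs_followup": [f for f in active if f.severity in ("critical", "warning")],
    }
```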
3. **Run 100-step training**
Read {TRAIN_SCRIPT} from CLAUDE.md ## Entry Scripts:
python {TRAIN_SCRIPT} --config {config_path} --max_steps 100 --exp_name smoke_test
Record:
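A minimal sketch of launching the smoke test from Python and capturing a few facts about the run. The flags match the command above; `run_smoke_test`, the 30-minute timeout, and the recorded fields (exit code, elapsed time, log tail) are assumptions chosen for illustration, and `sys.executable` stands in for `python` so the same interpreter is reused.

```python
import subprocess
import sys
import time


def run_smoke_test(train_script: str, config_path: str,
                   max_steps: int = 100, timeout_s: int = 1800) -> dict:
    """Run the training script for a few steps and collect basic run facts."""
    cmd = [
        sys.executable, train_script,
        "--config", config_path,
        "--max_steps", str(max_steps),
        "--exp_name", "smoke_test",
    ]
    start = time.monotonic()
    # capture_output keeps stdout/stderr so later steps can inspect the log.
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    elapsed = time.monotonic() - start
    return {
        "command": cmd,
        "returncode": proc.returncode,
        "elapsed_s": elapsed,
        "log_tail": (proc.stdout + proc.stderr).splitlines()[-20:],
    }
```

A non-zero `returncode` is the simplest failure signal. `subprocess.run` raises `subprocess.TimeoutExpired` if the run exceeds `timeout_s`, which a caller may want to treat as a failure as well.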
4. **Verify checkpoint saving**
Check whether the smoke test generated checkpoint files:
- Checkpoint file loads successfully (torch.load does not error)
5. **Verify evaluation pipeline**
Read {EVAL_SCRIPT} from CLAUDE.md ## Entry Scripts:
python {EVAL_SCRIPT} --checkpoint {smoke_test_checkpoint} --split val
Check:
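The evaluation command needs a concrete {smoke_test_checkpoint}. The sketch below shows one way to pick a checkpoint and confirm it loads before evaluation runs. The file patterns and the helper names `find_checkpoints` and `checkpoint_loads` are assumptions; the actual checkpoint naming depends on the training script. `torch` is imported lazily so the helpers can also be exercised with another loader.

```python
from pathlib import Path
from typing import Callable, Optional


def find_checkpoints(exp_dir, patterns=("*.pt", "*.pth", "*.ckpt")) -> list[Path]:
    """Return checkpoint-like files under exp_dir, sorted by path."""
    root = Path(exp_dir)
    if not root.is_dir():
        return []
    found = []
    for pattern in patterns:
        found.extend(root.rglob(pattern))
    return sorted(found)


def checkpoint_loads(path, loader: Optional[Callable] = None) -> bool:
    """Return True if the checkpoint can be deserialized without raising."""
    if loader is None:
        import torch  # imported here so torch is only required when actually used

        def loader(p):
            # map_location="cpu" avoids needing a GPU just to validate the file.
            return torch.load(p, map_location="cpu")
    try:
        loader(path)
    except Exception:
        return False
    return True
```

An empty or truncated file fails the load check, which is the failure this step is meant to surface before the evaluation script runs.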
6. **Verify wandb connection (if enabled)**
Check whether wandb initialized successfully in the smoke test training logs.
7. **Verify git_snapshot**
Check whether git_snapshot executed successfully in the smoke test training logs.
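Both log checks amount to searching the smoke test output for a success marker. A minimal sketch follows. The regular expressions in `DEFAULT_MARKERS` are guesses at what the logs might contain, not confirmed output formats, and should be replaced with whatever the training script actually prints.

```python
import re

# Hypothetical markers: adjust to the messages the training script really logs.
DEFAULT_MARKERS = {
    "wandb": r"wandb: .*(Syncing run|View run)",
    "git_snapshot": r"git_snapshot.*(saved|success|done)",
}


def check_log_markers(log_text: str, markers=None) -> dict[str, bool]:
    """Report, per marker name, whether its pattern appears anywhere in the log."""
    markers = DEFAULT_MARKERS if markers is None else markers
    return {
        name: re.search(pattern, log_text, re.IGNORECASE) is not None
        for name, pattern in markers.items()
    }
```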
8. **Output report**
Report to the user:
Code review results:
Smoke test results:
Final verdict:
Keep checklist item names, status labels, commands, and identifiers stable, but localize surrounding narrative text according to ../../shared/language-policy.md unless a field is explicitly marked English-only.
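The report ends in one of three verdicts: PASS, REVIEW, or FAIL. The decision rule sketched below is an assumption, not the skill's stated rule: any failed smoke check or confirmed critical finding gives FAIL, remaining warnings give REVIEW, and everything else gives PASS. `final_verdict` is a hypothetical helper name.

```python
def final_verdict(smoke_checks: dict[str, bool], critical: int, warning: int) -> str:
    """Combine smoke test results and review counts into a single verdict.

    Assumed rule: FAIL dominates, then REVIEW, then PASS.
    """
    if critical > 0 or not all(smoke_checks.values()):
        return "FAIL"
    if warning > 0:
        return "REVIEW"
    return "PASS"


# Example: every smoke check passed, but one warning remains open.
checks = {"training": True, "checkpoint": True, "evaluation": True}
verdict = final_verdict(checks, critical=0, warning=1)
```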
9. **Cleanup**
Delete temporary files generated by the smoke test (checkpoints, logs) to avoid polluting the experiment directory.
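A minimal cleanup sketch, assuming the smoke test writes everything into a single directory named after --exp_name under some experiment root. That layout and the helper name `cleanup_smoke_test` are assumptions. The guard refuses to delete anything outside the experiment root.

```python
import shutil
from pathlib import Path


def cleanup_smoke_test(exp_root, exp_name: str = "smoke_test") -> bool:
    """Delete <exp_root>/<exp_name>; return True if a directory was removed."""
    root = Path(exp_root).resolve()
    target = (root / exp_name).resolve()
    # Refuse paths that escape the experiment root (e.g. exp_name="..").
    if root not in target.parents:
        raise ValueError(f"refusing to delete outside {root}: {target}")
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```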
10. **Update project state**
If PASS or REVIEW (user confirms to proceed):
If FAIL: fix issues via /code-debug
</instructions>