.claude/skills/evaluate/SKILL.md
Result analysis tool (utility). Parses training logs, diagnoses training issues, compares against baseline performance, predicts full-training results, and provides NEXT_ROUND/DEBUG/CONTINUE/PIVOT/ABORT decisions. Can be called by /iterate eval or used standalone.
npx skillsauth add linzhe001/Harness-Research evaluateInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
When called from /iterate, the decision is recorded in iteration_log.json (by iterate). When called standalone, the decision is recorded in PROJECT_STATE.json.
Context sources (check in order):
.claude/current_iteration.json — exists when called by /iterate eval (symlink to persistent context).
Contains iteration_id, hypothesis, baseline_metrics, best_iteration, previous_iteration.
If present, prioritize the baseline and best info within it for comparison.PROJECT_STATE.json — fallback, to get baseline metrics and experiment context.
For language behavior, see ../../shared/language-policy.md.
</context>
Get the log path from $ARGUMENTS, extract key information:
Diagnose Training Issues
<thinking> Systematically check for potential issues during training: - Is the loss converging? (loss change trend in the last 10 epochs) - Is there overfitting? (train loss ↓ but val loss ↑) - Is gradient norm stable? (sudden spikes may indicate numerical issues) - Are there NaN/Inf? (check loss values) - Is the learning rate schedule working properly? - If issues exist, are they code bugs or inherent method limitations? </thinking>Performance Comparison
Compare against baseline (using the metric set defined in the protocol):
| Metric | Baseline | Our Method | Difference | Significant | |--------|----------|------------|------------|-------------| | {metric_1} | X | Y | +/-Z | Yes/No | | {metric_2} | A | B | +/-C | - | | {metric_3} | D | E | +/-F | - |
Full Training Prediction
Based on subset/low-resolution data results, predict full-training performance:
Decision Recommendation
<thinking> Make a decision based on comprehensive analysis: - Is the performance gap due to code issues or method limitations? - If it's a method limitation, can alternative approaches resolve it? - Can full training reach submission-worthy performance? - What is the risk-reward ratio of continued investment? </thinking>/code-debugProvide detailed reasoning and specific actionable recommendations.
Output Report
Per-iteration report (when called from /iterate eval):
.claude/current_iteration.json to get iteration_iddocs/iterations/iter{N}.md (create docs/iterations/ if directory doesn't exist)docs/Stage_Report.md as a summary index pointing to the latest iteration reportStandalone invocation report:
docs/Stage_Report.mdReport contents:
Preserve the template structure and decision vocabulary, but localize headings and narrative text according to ../../shared/language-policy.md unless a field is explicitly marked English-only.
Update Project State (standalone invocation only)
When not called from /iterate eval (i.e., .claude/current_iteration.json does not exist):
Update PROJECT_STATE.json:
current_stage.status → "completed"artifacts.stage_report → file pathhistory append recorddecisions record NEXT_ROUND/DEBUG/CONTINUE/PIVOT/ABORT decisionWhen called from /iterate eval: Do not update PROJECT_STATE.json (iterate is responsible for writing iteration_log.json; orchestrator handles stage transitions).
</instructions>development
WF7.5 training pipeline validation. Before entering WF8 iteration, first use Codex to review code for baseline equivalence, then run a 100-step smoke test to verify end-to-end pipeline functionality.
business
WF1 Inspiration survey and gap analysis. Takes the user's research idea, performs literature search, gap analysis, competitor analysis, and feasibility scoring, then outputs Feasibility_Report.md. Use when the user has a new CV research idea that needs a feasibility assessment.
tools
WF10 Submission/Release Tool. Multi-scene training, result packaging, filename validation, dry-run submission checks. Used after ablation experiments are complete and before competition submission.
development
WF2 Architecture refinement and MVP design. Reads the feasibility report, analyzes the base codebase architecture, designs plug-and-play new modules, defines the MVP, provides A/B/C alternative plans, and outputs Technical_Spec.md. Use when a research idea needs to be translated into a concrete technical architecture design.