Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

linzhe001/evaluate

Name: evaluate
Author: linzhe001

.claude/skills/evaluate/SKILL.md

npx skillsauth add linzhe001/Harness-Research evaluate

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Result Analysis and Pivot Decision (Utility)

<role> You are a Machine Learning Research Scientist who specializes in experiment analysis, debugging training issues, and making data-driven decisions about research direction. </role> <context> This is a utility skill (not a numbered workflow stage). It can be called by `/iterate eval` or used standalone. Input: Training logs and metrics. Output: Per-iteration report with analysis and decision. Decisions: NEXT_ROUND / DEBUG / CONTINUE / PIVOT / ABORT.

When called from /iterate, the decision is recorded in iteration_log.json (by iterate). When called standalone, the decision is recorded in PROJECT_STATE.json.

Context sources (check in order):

.claude/current_iteration.json — exists when called by /iterate eval (symlink to persistent context). Contains iteration_id, hypothesis, baseline_metrics, best_iteration, previous_iteration. If present, prioritize the baseline and best info within it for comparison.
PROJECT_STATE.json — fallback, to get baseline metrics and experiment context. For language behavior, see ../../shared/language-policy.md. </context>

<instructions> 1. **Parse Training Logs**

Get the log path from $ARGUMENTS, extract key information:

Loss curves (train loss, val loss per epoch)
Learning rate schedule actual values
Gradient norms (if available)
GPU Memory usage
Training speed (iterations/sec)
Final metrics (based on the evaluation protocol established in WF5)

Diagnose Training Issues
<thinking> Systematically check for potential issues during training: - Is the loss converging? (loss change trend in the last 10 epochs) - Is there overfitting? (train loss ↓ but val loss ↑) - Is gradient norm stable? (sudden spikes may indicate numerical issues) - Are there NaN/Inf? (check loss values) - Is the learning rate schedule working properly? - If issues exist, are they code bugs or inherent method limitations? </thinking>
Performance Comparison

Compare against baseline (using the metric set defined in the protocol):

| Metric | Baseline | Our Method | Difference | Significant | |--------|----------|------------|------------|-------------| | {metric_1} | X | Y | +/-Z | Yes/No | | {metric_2} | A | B | +/-C | - | | {metric_3} | D | E | +/-F | - |
Full Training Prediction

Based on subset/low-resolution data results, predict full-training performance:
- Extrapolate using scaling laws (if references are available)
- Reference subset → full improvement margins from similar works
- Provide confidence intervals
Decision Recommendation
<thinking> Make a decision based on comprehensive analysis: - Is the performance gap due to code issues or method limitations? - If it's a method limitation, can alternative approaches resolve it? - Can full training reach submission-worthy performance? - What is the risk-reward ratio of continued investment? </thinking>
- NEXT_ROUND: Ordinary improvement round — stay in WF8, plan next iteration
- DEBUG: Fixable technical issues exist (bugs, config errors); stay in WF8, fix via /code-debug
- CONTINUE: Performance meets the success criteria set by the protocol; handoff to orchestrator/WF9 (not continue iterating)
- PIVOT: Performance gap too large (< baseline by 5%+); recommend rolling back to WF2 for alternative approach
- ABORT: Theoretical failure (core hypothesis disproven); abandon this idea
Provide detailed reasoning and specific actionable recommendations.
Output Report

Per-iteration report (when called from /iterate eval):
- Check .claude/current_iteration.json to get iteration_id
- Write to docs/iterations/iter{N}.md (create docs/iterations/ if directory doesn't exist)
- Also update docs/Stage_Report.md as a summary index pointing to the latest iteration report
Standalone invocation report:
- Write directly to docs/Stage_Report.md
Report contents:
- context_summary (≤20 lines)
- training_analysis (loss/lr/gradient analysis)
- metric_protocol (baseline/evaluation protocol used in this round)
- performance_comparison (comparison table)
- issue_diagnosis (issues found)
- scaling_prediction (full training prediction)
- recommendation (decision + reasoning + next steps)
Preserve the template structure and decision vocabulary, but localize headings and narrative text according to ../../shared/language-policy.md unless a field is explicitly marked English-only.
Update Project State (standalone invocation only)

When not called from /iterate eval (i.e., .claude/current_iteration.json does not exist): Update PROJECT_STATE.json:
- current_stage.status → "completed"
- artifacts.stage_report → file path
- history append record
- decisions record NEXT_ROUND/DEBUG/CONTINUE/PIVOT/ABORT decision
When called from /iterate eval: Do not update PROJECT_STATE.json (iterate is responsible for writing iteration_log.json; orchestrator handles stage transitions).
</instructions>

<constraints> - NEVER recommend CONTINUE without quantitative performance comparison - ALWAYS analyze both training and validation metrics - ALWAYS check for common training issues (overfitting, NaN, gradient issues) - ALWAYS provide specific actionable recommendations with each decision - ALWAYS write per-iteration reports to `docs/iterations/iter{N}.md` when called from iterate - NEVER overwrite previous iteration reports — each iteration gets its own file </constraints>

linzhe001/evaluate

.claude/skills/evaluate/SKILL.md

Result analysis tool (utility). Parses training logs, diagnoses training issues, compares against baseline performance, predicts full-training results, and provides NEXT_ROUND/DEBUG/CONTINUE/PIVOT/ABORT decisions. Can be called by /iterate eval or used standalone.

1 stars

tools

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add linzhe001/Harness-Research evaluate

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 1:59 AM12.5s2 files scanned

SKILL.md

name:: evaluate
description:: Result analysis tool (utility). Parses training logs, diagnoses training issues, compares against baseline performance, predicts full-training results, and provides NEXT_ROUND/DEBUG/CONTINUE/PIVOT/ABORT decisions. Can be called by /iterate eval or used standalone.
argument-hint:: [log_path]
allowed-tools:: Read, Write, Bash, Glob, Grep

Result Analysis and Pivot Decision (Utility)

When called from /iterate, the decision is recorded in iteration_log.json (by iterate). When called standalone, the decision is recorded in PROJECT_STATE.json.

Context sources (check in order):

.claude/current_iteration.json — exists when called by /iterate eval (symlink to persistent context). Contains iteration_id, hypothesis, baseline_metrics, best_iteration, previous_iteration. If present, prioritize the baseline and best info within it for comparison.
PROJECT_STATE.json — fallback, to get baseline metrics and experiment context. For language behavior, see ../../shared/language-policy.md. </context>

<instructions> 1. **Parse Training Logs**

Get the log path from $ARGUMENTS, extract key information:

Loss curves (train loss, val loss per epoch)
Learning rate schedule actual values
Gradient norms (if available)
GPU Memory usage
Training speed (iterations/sec)
Final metrics (based on the evaluation protocol established in WF5)

Diagnose Training Issues
<thinking> Systematically check for potential issues during training: - Is the loss converging? (loss change trend in the last 10 epochs) - Is there overfitting? (train loss ↓ but val loss ↑) - Is gradient norm stable? (sudden spikes may indicate numerical issues) - Are there NaN/Inf? (check loss values) - Is the learning rate schedule working properly? - If issues exist, are they code bugs or inherent method limitations? </thinking>
Performance Comparison

Compare against baseline (using the metric set defined in the protocol):

| Metric | Baseline | Our Method | Difference | Significant | |--------|----------|------------|------------|-------------| | {metric_1} | X | Y | +/-Z | Yes/No | | {metric_2} | A | B | +/-C | - | | {metric_3} | D | E | +/-F | - |
Full Training Prediction

Based on subset/low-resolution data results, predict full-training performance:
- Extrapolate using scaling laws (if references are available)
- Reference subset → full improvement margins from similar works
- Provide confidence intervals
Decision Recommendation
<thinking> Make a decision based on comprehensive analysis: - Is the performance gap due to code issues or method limitations? - If it's a method limitation, can alternative approaches resolve it? - Can full training reach submission-worthy performance? - What is the risk-reward ratio of continued investment? </thinking>
- NEXT_ROUND: Ordinary improvement round — stay in WF8, plan next iteration
- DEBUG: Fixable technical issues exist (bugs, config errors); stay in WF8, fix via /code-debug
- CONTINUE: Performance meets the success criteria set by the protocol; handoff to orchestrator/WF9 (not continue iterating)
- PIVOT: Performance gap too large (< baseline by 5%+); recommend rolling back to WF2 for alternative approach
- ABORT: Theoretical failure (core hypothesis disproven); abandon this idea
Provide detailed reasoning and specific actionable recommendations.
Output Report

Per-iteration report (when called from /iterate eval):
- Check .claude/current_iteration.json to get iteration_id
- Write to docs/iterations/iter{N}.md (create docs/iterations/ if directory doesn't exist)
- Also update docs/Stage_Report.md as a summary index pointing to the latest iteration report
Standalone invocation report:
- Write directly to docs/Stage_Report.md
Report contents:
- context_summary (≤20 lines)
- training_analysis (loss/lr/gradient analysis)
- metric_protocol (baseline/evaluation protocol used in this round)
- performance_comparison (comparison table)
- issue_diagnosis (issues found)
- scaling_prediction (full training prediction)
- recommendation (decision + reasoning + next steps)
Preserve the template structure and decision vocabulary, but localize headings and narrative text according to ../../shared/language-policy.md unless a field is explicitly marked English-only.
Update Project State (standalone invocation only)

When not called from /iterate eval (i.e., .claude/current_iteration.json does not exist): Update PROJECT_STATE.json:
- current_stage.status → "completed"
- artifacts.stage_report → file path
- history append record
- decisions record NEXT_ROUND/DEBUG/CONTINUE/PIVOT/ABORT decision
When called from /iterate eval: Do not update PROJECT_STATE.json (iterate is responsible for writing iteration_log.json; orchestrator handles stage transitions).
</instructions>

Related Skills

linzhe001/validate-run

development

VerifiedTrustedCommunity

WF7.5 training pipeline validation. Before entering WF8 iteration, first use Codex to review code for baseline equivalence, then run a 100-step smoke test to verify end-to-end pipeline functionality.

1SKILL.mdUpdated Apr 17, 2026

linzhe001/validate-run

linzhe001/survey-idea

business

VerifiedTrustedCommunity

WF1 Inspiration survey and gap analysis. Takes the user's research idea, performs literature search, gap analysis, competitor analysis, and feasibility scoring, then outputs Feasibility_Report.md. Use when the user has a new CV research idea that needs a feasibility assessment.

1SKILL.mdUpdated Apr 17, 2026

linzhe001/survey-idea

linzhe001/release

tools

VerifiedTrustedCommunity

WF10 Submission/Release Tool. Multi-scene training, result packaging, filename validation, dry-run submission checks. Used after ablation experiments are complete and before competition submission.

1SKILL.mdUpdated Apr 17, 2026

linzhe001/refine-arch

development

VerifiedTrustedCommunity

WF2 Architecture refinement and MVP design. Reads the feasibility report, analyzes the base codebase architecture, designs plug-and-play new modules, defines the MVP, provides A/B/C alternative plans, and outputs Technical_Spec.md. Use when a research idea needs to be translated into a concrete technical architecture design.

1SKILL.mdUpdated Apr 17, 2026

linzhe001/refine-arch

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/linzhe001/Harness-Research.git

# Copy into Claude Code skills folder (global)
cp -r Harness-Research/.claude/skills/evaluate ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

linzhe001/Harness-Research

1 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT