a-green-hand-jack/result-diagnosis

Name: result-diagnosis
Author: a-green-hand-jack

skills/result-diagnosis/SKILL.md

npx skillsauth add a-green-hand-jack/ml-research-skills result-diagnosis

Result Diagnosis

Diagnose what an experiment result means for the project. This skill is for decision-making after results exist, especially when they are negative, surprising, unstable, or hard to interpret.

Use this skill when:

a method does not improve over baseline
results vary strongly across seeds
a metric improves but another metric worsens
a baseline unexpectedly wins
a plot or table looks suspicious
a result may be caused by an implementation bug, metric bug, data issue, or unfair comparison
early experiments suggest revising the algorithm or paper claim
the user asks "what does this result mean?" or "what should we do next?"

Do not use this skill to write a polished report. Pair it with experiment-report-writer after the diagnosis is clear.

Pair this skill with:

research-project-memory when the diagnosis should update claims, evidence, risks, actions, or worktree status
experiment-report-writer when results need a shareable report
algorithm-design-planner when the diagnosis points to method revision
experiment-design-planner when the diagnosis requires a new controlled experiment
run-experiment when the next step is a rerun, sanity check, or ablation
conference-writing-adapter when the right action is to narrow or reframe paper claims

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── diagnosis-taxonomy.md
    ├── evidence-audit.md
    ├── next-decision-rules.md
    ├── report-template.md
    └── triage-protocol.md

Progressive Loading

Always read references/diagnosis-taxonomy.md, references/triage-protocol.md, and references/next-decision-rules.md.
Read references/evidence-audit.md when inspecting logs, configs, metrics, plots, runs, or code state.
Use references/report-template.md for full diagnosis reports.
If a result depends on current SOTA, benchmark conventions, or recent baseline performance, verify current sources with web search or user-provided papers.

Core Principles

Diagnose before optimizing.
Separate observed result from interpretation.
Prefer simple sanity checks before expensive reruns.
Treat negative results as information: they may kill a claim, not the whole project.
Do not blame the algorithm before checking implementation, data, metric, baseline, and selection rules.
Do not blame implementation forever when repeated controlled evidence falsifies the claim.
Every diagnosis should end with a decision: debug, rerun, ablate, revise method, narrow claim, write, park, or kill.
Record uncertainty explicitly.

Step 1 - Define the Result and Expected Behavior

Extract:

experiment question and linked claim
method and baseline
dataset/split
metrics and expected direction
observed result
number of seeds/repeats
configs, commit, logs, tables, and figures
what result was expected and why
whether this result affects paper claims or only internal debugging

Rewrite vague input into:

Expected [method] to improve [metric/diagnostic] over [baseline] on [setting], but observed [result] under [controls].

If expected behavior was never defined, route back to experiment-design-planner.
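The rewrite template above can be captured as a small structure so that no slot is left vague. This is a minimal sketch: the `ResultStatement` class and the example values are hypothetical illustrations, not part of the skill and not real results.

```python
from dataclasses import dataclass


@dataclass
class ResultStatement:
    """One slot per bracketed field in the rewrite template (hypothetical helper)."""
    method: str
    metric: str
    baseline: str
    setting: str
    observed: str
    controls: str

    def render(self) -> str:
        # Mirrors the template sentence from Step 1 word for word.
        return (
            f"Expected {self.method} to improve {self.metric} over "
            f"{self.baseline} on {self.setting}, but observed "
            f"{self.observed} under {self.controls}."
        )


# Illustrative values only; they do not come from any real experiment.
statement = ResultStatement(
    method="method A",
    metric="validation accuracy",
    baseline="baseline B",
    setting="dataset X, standard split",
    observed="no change within seed noise",
    controls="3 seeds and matched training budget",
)
print(statement.render())
```

Because every field is required, a missing expectation fails at construction time, which is a natural point to route back to experiment-design-planner.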

Step 2 - Classify the Symptom

Read references/diagnosis-taxonomy.md.

Classify the primary symptom:

no improvement
regression
instability or high variance
metric conflict
suspiciously large gain
baseline unexpectedly strong
diagnostic/performance mismatch
training failure or divergence
reproducibility failure
plot/table inconsistency
result contradicts paper story

Then classify likely diagnosis categories:

implementation bug
metric/evaluation bug
data/split/preprocessing issue
unfair baseline or tuning issue
seed variance or insufficient repeats
optimization/hyperparameter issue
method mechanism failure
scale/regime mismatch
claim/evidence mismatch
expected negative result
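One of the categories above, seed variance or insufficient repeats, can often be screened cheaply before any rerun. The sketch below is a crude heuristic rather than a statistical test: it flags a gap as "within noise" when the difference in means is no larger than the larger per-group standard deviation. The function names and the threshold rule are assumptions made for illustration.

```python
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); std is 0.0 for a single run."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def gap_within_noise(method_scores: List[float], baseline_scores: List[float]) -> bool:
    """Heuristic: True when the mean gap is no larger than the noisier group's std."""
    method_mean, method_std = summarize(method_scores)
    baseline_mean, baseline_std = summarize(baseline_scores)
    gap = abs(method_mean - baseline_mean)
    return gap <= max(method_std, baseline_std)


# Illustrative per-seed scores, not real results.
noisy = gap_within_noise([0.71, 0.74, 0.69], [0.70, 0.73, 0.72])
clear = gap_within_noise([0.80, 0.81, 0.82], [0.70, 0.71, 0.72])
print(noisy, clear)
```

A True result suggests classifying the symptom as instability first and adding repeats before blaming the method.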

Step 3 - Gather Evidence

Read references/evidence-audit.md.

Prefer primary artifacts:

config diffs
run commands
git commit
logs and stderr
metric files
checkpoints
seeds
dataset versions and split hashes
plots and tables
previous baseline runs
implementation changes

Mark missing evidence rather than guessing.
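A simple way to mark missing evidence rather than guess is to walk the artifact list and record an explicit placeholder for anything not found. This is a sketch under assumptions: the `audit_evidence` helper and the "MISSING" marker are hypothetical, and the contents of references/evidence-audit.md may prescribe a different format.

```python
from typing import Dict

# Artifact names taken from the list in Step 3.
PRIMARY_ARTIFACTS = [
    "config diffs",
    "run commands",
    "git commit",
    "logs and stderr",
    "metric files",
    "checkpoints",
    "seeds",
    "dataset versions and split hashes",
    "plots and tables",
    "previous baseline runs",
    "implementation changes",
]


def audit_evidence(found: Dict[str, str]) -> Dict[str, str]:
    """Map every primary artifact to its source path, or to 'MISSING' if absent."""
    return {name: found.get(name, "MISSING") for name in PRIMARY_ARTIFACTS}


# Hypothetical paths for illustration.
audit = audit_evidence({"git commit": "abc1234", "seeds": "configs/seeds.txt"})
missing = [name for name, source in audit.items() if source == "MISSING"]
print(f"{len(missing)} of {len(audit)} artifacts missing")
```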

Step 4 - Run Triage

Read references/triage-protocol.md.

Use this order:

1. Reproducibility and provenance: correct commit, config, data, seed, output path.
2. Metric and evaluation: metric direction, aggregation, split, leakage, postprocessing.
3. Baseline fairness: same budget, tuning, checkpoint rule, data, sampler, and code path.
4. Implementation sanity: feature flag, tensor shapes, gradient flow, loss scale, train/eval mode.
5. Statistical stability: seeds, variance, confidence intervals, outliers.
6. Mechanism diagnostic: whether the intended mechanism changed.
7. Claim alignment: whether the result supports, weakens, or falsifies the paper claim.

Stop early only when a blocking bug or invalid comparison is found.
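The ordering and the early-stop rule can be sketched as a list of checks run in sequence, halting at the first blocking problem. Only the first two triage stages are implemented here, and the check functions and the keys of the `run` dictionary are hypothetical; references/triage-protocol.md remains the authority on what each stage covers.

```python
from typing import Callable, List, Optional, Tuple

# A check returns None when it passes, or a short description of a blocking problem.
Check = Callable[[dict], Optional[str]]


def check_provenance(run: dict) -> Optional[str]:
    if run.get("commit") and run.get("config"):
        return None
    return "missing commit or config"


def check_metric(run: dict) -> Optional[str]:
    if run.get("metric_direction") in {"higher", "lower"}:
        return None
    return "metric direction undefined"


# Stages appear in the same order as the triage list; later stages are omitted.
TRIAGE_ORDER: List[Tuple[str, Check]] = [
    ("reproducibility and provenance", check_provenance),
    ("metric and evaluation", check_metric),
]


def run_triage(run: dict) -> Tuple[List[str], Optional[Tuple[str, str]]]:
    """Run checks in order; stop at the first blocker and report which stage failed."""
    passed: List[str] = []
    for stage, check in TRIAGE_ORDER:
        problem = check(run)
        if problem is not None:
            return passed, (stage, problem)
        passed.append(stage)
    return passed, None


passed, blocker = run_triage({"commit": "abc1234", "config": "configs/run.yaml"})
print(passed, blocker)
```

Returning the passed stages alongside the blocker keeps a record of what was actually verified before triage stopped.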

Step 5 - Build Competing Explanations

For each plausible explanation, state:

evidence for it
evidence against it
cheapest test that would distinguish it
decision if true

At minimum consider:

bug
bad metric
weak experiment design
baseline too strong or under-tuned
hyperparameter issue
mechanism false
claim too broad
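The four fields listed for each explanation map naturally onto a record, and sorting by the cost of the distinguishing test makes the cheapest check surface first. This is a sketch only: the `Explanation` class and the `test_cost_gpu_hours` field are hypothetical additions for illustration, and the entries are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Explanation:
    name: str
    evidence_for: List[str]
    evidence_against: List[str]
    cheapest_test: str
    test_cost_gpu_hours: float  # assumed cost estimate, not defined by the skill
    decision_if_true: str


def order_by_test_cost(explanations: List[Explanation]) -> List[Explanation]:
    """Cheapest distinguishing test first, so simple checks precede expensive reruns."""
    return sorted(explanations, key=lambda e: e.test_cost_gpu_hours)


# Illustrative entries only.
candidates = [
    Explanation(
        name="mechanism false",
        evidence_for=["diagnostic unchanged"],
        evidence_against=[],
        cheapest_test="ablation on small model",
        test_cost_gpu_hours=40.0,
        decision_if_true="revise-method",
    ),
    Explanation(
        name="bug",
        evidence_for=["feature flag default is off"],
        evidence_against=["loss curve differs from baseline"],
        cheapest_test="assert flag value in logged config",
        test_cost_gpu_hours=0.0,
        decision_if_true="debug",
    ),
]
ordered = order_by_test_cost(candidates)
print([e.name for e in ordered])
```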

Step 6 - Choose Next Decision

Read references/next-decision-rules.md.

Choose one primary decision:

debug: result is not trustworthy until a bug or provenance issue is resolved
rerun: result is plausible but underpowered or missing controls
ablate: result needs mechanism isolation
revise-method: mechanism likely needs design change
narrow-claim: evidence supports a smaller or different claim
write: evidence is trustworthy enough to report
park: result is inconclusive and not worth immediate compute
kill: claim or direction is falsified under fair controls

Do not pick "write" if basic provenance or fairness is unresolved.
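The rule against choosing write too early can be enforced as a guard. In this sketch the `validate_decision` function and its two boolean flags are hypothetical; it rejects an unknown decision and refuses write while provenance or fairness is unresolved, leaving the choice of an alternative to the decision rules in references/next-decision-rules.md.

```python
# Decision labels copied from the list in Step 6.
DECISIONS = {
    "debug", "rerun", "ablate", "revise-method",
    "narrow-claim", "write", "park", "kill",
}


def validate_decision(decision: str, provenance_ok: bool, fairness_ok: bool) -> str:
    """Return the decision unchanged if it is allowed, otherwise raise ValueError."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if decision == "write" and not (provenance_ok and fairness_ok):
        raise ValueError("write is blocked: provenance or fairness is unresolved")
    return decision


print(validate_decision("rerun", provenance_ok=True, fairness_ok=False))
```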

Step 7 - Write the Diagnosis

Use references/report-template.md for full reports.

If saving to a project and no path is given, use:

docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md
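The default path pattern can be filled in programmatically. The sketch below assumes an ISO date and a lowercase, hyphen-separated short name; that slug rule and the `diagnosis_path` helper are assumptions, since the skill only specifies the pattern itself.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def diagnosis_path(short_name: str, day: date) -> PurePosixPath:
    """Build docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics collapsed to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    filename = f"result_diagnosis_{day.isoformat()}_{slug}.md"
    return PurePosixPath("docs/diagnosis") / filename


print(diagnosis_path("Seed Variance!", date(2026, 5, 19)))
```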

Required output:

# Result Diagnosis: [Short Name]

## Result Snapshot
## Expected vs Observed
## Symptom Classification
## Evidence Checked
## Competing Explanations
## Most Likely Diagnosis
## Decision
## Next Checks or Actions
## Claim Impact
## Project Memory Writeback
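A short script can generate the required outline and check that a draft still contains every section. The section titles are copied from the required output above; the helper names and the "TODO" placeholder are hypothetical.

```python
from typing import List

REQUIRED_SECTIONS = [
    "Result Snapshot",
    "Expected vs Observed",
    "Symptom Classification",
    "Evidence Checked",
    "Competing Explanations",
    "Most Likely Diagnosis",
    "Decision",
    "Next Checks or Actions",
    "Claim Impact",
    "Project Memory Writeback",
]


def report_skeleton(short_name: str) -> str:
    """Return an empty diagnosis report with every required heading."""
    lines = [f"# Result Diagnosis: {short_name}", ""]
    for section in REQUIRED_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def missing_sections(report_text: str) -> List[str]:
    """List required sections whose exact heading line is absent from the report."""
    present = {line.strip() for line in report_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if f"## {s}" not in present]


skeleton = report_skeleton("seed-variance")
print(missing_sections(skeleton))
```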

Step 8 - Write Back to Project Memory

If the project uses research-project-memory, update:

memory/evidence-board.md: observed result, limitations, and source paths
memory/provenance-board.md: mark result provenance verified, stale, contradictory, or missing when diagnosis depends on source validity
memory/claim-board.md: claims supported, weakened, revised, evidence-needed, provisional, parked, or cut
memory/risk-board.md: bugs, metric risks, baseline risks, mechanism risks, or claim risks
memory/action-board.md: debug, rerun, ablation, method revision, writing, park, or kill actions
memory/handoff-board.md: create handoffs to method design, experiment design, paper evidence, or writing when diagnosis changes downstream work
memory/phase-dashboard.md: update the active gate when diagnosis advances evidence production or regresses the project to debugging, method revision, or claim narrowing
worktree .agent/worktree-status.md "Local Hot Results": update here first when in a code-worktree; mark confirmed/invalidated/superseded status locally before any graduation
<ProjectRoot>/memory/hot-results.md: graduate here only when the result is confirmed and changes a project-level claim; do not write here while diagnosis is still in progress
memory/decision-log.md: durable decisions such as killing a claim, changing method, or narrowing scope

Use "observed" for verified results and "inferred" for explanations. Mark stale claims explicitly.
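The observed versus inferred distinction is easier to keep when each memory entry carries its label explicitly. The line format below is an assumption for illustration only; the actual board format is defined by research-project-memory, and the `memory_entry` helper is hypothetical.

```python
ALLOWED_STATUS = {"observed", "inferred"}


def memory_entry(status: str, text: str, source: str) -> str:
    """Format one board line tagged as observed (verified) or inferred (explanation)."""
    if status not in ALLOWED_STATUS:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUS)}")
    return f"- [{status}] {text} (source: {source})"


# Illustrative entries with hypothetical paths.
print(memory_entry("observed", "method matches baseline within seed noise", "results/metrics.json"))
print(memory_entry("inferred", "gain may depend on model scale", "docs/diagnosis/"))
```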

Final Sanity Check

Before finalizing:

observed result and interpretation are separated
provenance and config are checked or listed as missing
metric direction and aggregation are clear
baseline fairness is addressed
implementation sanity checks are considered
seed variance and repeats are considered
mechanism diagnostic is checked when relevant
result is mapped to a concrete decision
paper claim impact is explicit
project memory is updated when present

Use when results are valid but surprising, negative, unstable, or ambiguous — to decide debug/rerun/ablate/revise/park. Not for engineering failures like NaN/OOM (use experiment-debugger). Not for confound or claim-drift audit before locking results into the paper (use research-results-auditor).

4 stars

development

Updated May 19, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add a-green-hand-jack/ml-research-skills result-diagnosis

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: May 19, 2026, 7:55 AM (58.1s, 1 file scanned)

SKILL.md

name: result-diagnosis
description: Use when results are valid but surprising, negative, unstable, or ambiguous, to decide debug/rerun/ablate/revise/park. Not for engineering failures like NaN/OOM (use experiment-debugger). Not for confound or claim-drift audit before locking results into the paper (use research-results-auditor).
argument-hint: [project-dir] [--result <summary>] [--mode quick|full|debug|decision]
allowed-tools: Read, Write, Edit, Bash, Glob, WebSearch, WebFetch

Related Skills

a-green-hand-jack/ml-research-bootstrap
Category: testing
Verified, Trusted, Community
Bootstrap project-local ml-research-skills. Use from global installs when creating a new ML research project, enabling this collection in an existing ML research repo, or deciding whether to install the full bundle locally. Route to project-init for new projects; do not handle paper or experiment work directly.
4 stars, SKILL.md, updated May 26, 2026

a-green-hand-jack/project-ops-router
Category: development
Verified, Trusted, Community
Route project operations tasks (git, memory, bootstrap, remote, workspace, code review, timeline, ops) to the correct skill. Use when the task involves commits, pushes, worktrees, project memory, enabling project-local skills, SSH/server coordination, sidecar runners, or audits. Do not solve the ops task directly.
4 stars, SKILL.md, updated May 19, 2026

a-green-hand-jack/paper-writing-router
Category: testing
Verified, Trusted, Community
Route ML/AI paper writing tasks to the correct skill: contract planning, prose drafting, section writing, consistency editing, review simulation, rebuttal, submission, or citation work. Use when the task involves writing, revising, reviewing, or submitting a paper instead of guessing between paper-writing-assistant, paper-writing-contract-planner, paper-reviewer-simulator, auto-paper-improvement-loop, or citation skills. Do not draft prose directly.
4 stars, SKILL.md, updated May 19, 2026

a-green-hand-jack/ml-research-router
Category: data-ai
Verified, Trusted, Community
Project-local router for ML research skill selection. Use inside an initialized ML research project, or while maintaining this skill repo, when the user describes an ML research/paper/experiment/discovery/ops/release workflow and may not know the skill; route to a domain router or high-signal leaf. Do not use for generic non-ML projects.
4 stars, SKILL.md, updated May 19, 2026

Install manually

# Clone the repo
git clone https://github.com/a-green-hand-jack/ml-research-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r ml-research-skills/skills/result-diagnosis ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository: a-green-hand-jack/ml-research-skills (4 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT