skills/ds-verify/SKILL.md
This skill should be used when the user asks to 'verify analysis results', 'check reproducibility', 'validate data science output', or 'confirm completion'.
Announce: "Using ds-verify (Phase 5) to confirm reproducibility and completion."
| Level | Remaining Context | Action |
|-------|------------------|--------|
| Normal | >35% | Proceed normally |
| Warning | 25-35% | Complete current review cycle, then trigger ds-handoff |
| Critical | ≤25% | Immediately trigger ds-handoff; do not start new review cycles |
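The thresholds in the context table can be sketched as a small lookup. This is a minimal illustration, not part of the skill itself: the function name `context_level` is hypothetical, and it assumes that exactly 25% falls in the Critical band because the table lists Critical as ≤25%.

```python
def context_level(remaining_pct: float) -> tuple[str, str]:
    """Map remaining context (in percent) to a (level, action) pair.

    Assumption: the boundary value 25 is Critical (the table says <=25%),
    so the Warning band is treated as 25 < x <= 35.
    """
    if remaining_pct > 35:
        return ("Normal", "Proceed normally")
    if remaining_pct > 25:
        return ("Warning", "Complete current review cycle, then trigger ds-handoff")
    return ("Critical", "Immediately trigger ds-handoff")


for pct in (40, 35, 25):
    level, action = context_level(pct)
    print(f"{pct}% remaining: {level}: {action}")
```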
Final verification with reproducibility checks and user acceptance interview.
<EXTREMELY-IMPORTANT>
## The Iron Law of DS Verification
NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION. This is not negotiable.
Load shared enforcement first.
Auto-load all constraints matching applies-to: ds-verify:
!uv run python3 ${CLAUDE_SKILL_DIR}/../../scripts/load-constraints.py ds-verify
You MUST have these constraints loaded before proceeding. No claiming you "remember" them.
Before claiming analysis is complete, you MUST:
This applies even when:
If you catch yourself thinking "I can skip verification," STOP — you're about to deliver unverified results that waste the user's time. </EXTREMELY-IMPORTANT>
| Drive | Shortcut | Consequence |
|-------|----------|-------------|
| Helpfulness | Skipping fresh re-run | You assumed prior results still hold. They don't reproduce, and the user publishes irreproducible work. Your assumption is the error they discover. Anti-helpful. |
| Competence | Verifying your own work | You ran the reproducibility check yourself instead of dispatching a fresh agent. You share the implementer's biases. A fresh agent would have caught the issue. Incompetent verification. |
| Efficiency | Not running reproducibility check | You skipped the 10-minute check. The irreproducible results take 10 days to debug when someone else tries to run them. Anti-efficient. |
| Approval | Skipping user acceptance interview | You declared completion without asking the user. They discover the results don't answer their question. They now require manual review of all analysis. Lost approval. |
| Honesty | Rubber-stamping verification | You reported 'verified' without re-executing. The analysis fails on fresh data, and your unverified claim wastes the user's time. |
| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "The results matched before" | Prior results don't prove current reproducibility. Code, data, or environment may have changed. | Re-run fresh and compare outputs |
| "I just need to check the numbers" | Reproducibility means re-running, not re-reading. Reading cached output proves nothing. | Execute the analysis fresh and verify outputs match |
| "The reviewer already verified this" | Review checks methodology, verify checks reproducibility. They are different gates. | Run the reproducibility demonstration yourself |
| "Fresh re-run will give same results" | If you're sure, running it costs nothing. If you're wrong, skipping it costs everything. | Run it. Proof is cheap, assumptions are expensive. |
| "The user is waiting" | Publishing irreproducible results wastes more time than verification. A 10-minute check prevents a 10-day retraction. | Run verification now: the user wants correct results, not fast wrong ones |
| Thought | Why It's Wrong | Do Instead |
|---------|----------------|------------|
| "Results should be the same" | Your "should" isn't verification | Re-run and compare |
| "I ran it earlier" | Your earlier run isn't fresh | Run it again now |
| "It's reproducible" | Your claim requires evidence | Demonstrate reproducibility |
| "User will be happy" | Your assumption isn't their acceptance | Ask explicitly |
| "Outputs look right" | Your visual inspection isn't verified | Check against criteria |
Before running runtime DQ checks, run the static analysis constraint check suite:
bash "${CLAUDE_SKILL_DIR}/../../scripts/check-all-ds.sh" "$(pwd)"
This runs all DS constraint check scripts (determinism, join audits, idempotency, error handling, schema contracts, standard errors, visualization integrity).
If any check FAILS: Report the failures in LEARNINGS.md. These are code quality issues in the analysis scripts that must be fixed before proceeding. Dispatch a fix subagent if needed.
If all checks PASS: Proceed to runtime DQ checks.
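The pass/fail branching described above could be wrapped roughly as follows. This is a sketch under stated assumptions: it assumes the check script signals failure through a non-zero exit code, and the helper name `run_static_checks` and the failure heading it writes are hypothetical, not taken from the skill's scripts.

```python
import subprocess
from pathlib import Path


def run_static_checks(command: list[str], learnings_path: str) -> bool:
    """Run a check command; return True on success.

    Assumption: a non-zero exit code means at least one check failed.
    On failure, the captured output is appended to the learnings file
    so the failures are reported before any fix is dispatched.
    """
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode == 0:
        return True
    with Path(learnings_path).open("a", encoding="utf-8") as fh:
        fh.write("\n## Static check failures\n")
        fh.write(proc.stdout)
        fh.write(proc.stderr)
    return False
```

A call might look like `run_static_checks(["bash", check_script, project_dir], ".planning/LEARNINGS.md")`, where `check_script` and `project_dir` stand in for the script path and working directory used in the command above. Runtime DQ checks would start only when the function returns `True`.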
Checkpoint type: decision (user confirms results — cannot auto-advance)
Before making ANY completion claim, follow this flowchart.
This flowchart IS the specification. If prose elsewhere and this diagram disagree, the diagram wins.
┌──────────────────────────────┐
│ 1. RE-RUN (fresh, not cached) │
└──────────────┬───────────────┘
▼
┌──────────────────────────────┐
│ 2. CHECK vs success criteria │
└──────────────┬───────────────┘
pass? │
┌───── no ──┴── yes ─────┐
▼ ▼
┌─────────────────┐ ┌──────────────────────────┐
│ NEEDS WORK → │ │ 3. REPRODUCE │
│ log + dispatch │ │ (same inputs→same outputs)│
│ fix subagent │ └────────────┬─────────────┘
└────────┬────────┘ match? │
│ ┌──── no ──────┴── yes ───┐
│ ▼ ▼
│ ┌─────────────────┐ ┌─────────────────────────┐
│ │ NEEDS WORK → │ │ 4. ASK — user │
│ │ non-determinism │ │ acceptance interview │
│ │ is a defect │ └───────────┬─────────────┘
│ └────────┬────────┘ accept? │
│ │ ┌── no/partial ────┴── yes ──┐
│ │ ▼ ▼
│ │ ┌──────────────────┐ ┌────────────────────┐
└───────────┴─▶│ loop: ds-fix / │ │ 5. CLAIM COMPLETE │
│ ds-implement, │ │ (only after 1-4) │
│ then re-verify │ └────────────────────┘
└──────────────────┘
Skipping any step is not verification. Reaching step 5 without passing 1-4 is a false completion claim.
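The gate order in the flowchart can be expressed as a short decision function. This is an illustrative sketch only: `GateResults` and `verdict` are hypothetical names, and a "partial" acceptance answer is assumed to count as not accepted, as the diagram's "no/partial" branch suggests.

```python
from dataclasses import dataclass


@dataclass
class GateResults:
    fresh_rerun_done: bool  # step 1: analysis re-run fresh, not read from cache
    criteria_pass: bool     # step 2: outputs meet the success criteria
    outputs_match: bool     # step 3: same inputs produced the same outputs
    user_accepts: bool      # step 4: user answered "yes" (partial counts as False)


def verdict(r: GateResults) -> str:
    """Walk the gates in flowchart order and stop at the first failure."""
    if not r.fresh_rerun_done:
        return "NEEDS WORK: no fresh re-run"
    if not r.criteria_pass:
        return "NEEDS WORK: success criteria failed"
    if not r.outputs_match:
        return "NEEDS WORK: non-determinism is a defect"
    if not r.user_accepts:
        return "NEEDS WORK: loop to ds-fix / ds-implement, then re-verify"
    return "COMPLETE"


print(verdict(GateResults(True, True, False, True)))
```

Because each gate returns early, a completion claim is only reachable when all four earlier steps have passed.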
When presenting verification results to the user in the acceptance interview, generate diagnostic plots to support the decision:
| Verification Check | Diagnostic to Generate |
|-------------------|----------------------|
| Reproducibility comparison | Overlay plot of Run 1 vs Run 2 key outputs |
| Data integrity | Pipeline waterfall chart (input rows → cleaning → joins → final) |
| Distribution sanity | Histogram/density plots of key variables with expected ranges annotated |
| Model performance | ROC curve, residual plot, or coefficient comparison (as appropriate) |
Format: Inline plots in notebooks, or saved to scratch/diagnostics/ for script-based workflows. Present alongside the acceptance interview questions.
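Before drawing the pipeline waterfall chart, the numbers behind it can be tabulated in plain text. The sketch below uses only the standard library; the function name `pipeline_waterfall` and the row counts are hypothetical and serve only to show the shape of the data a chart would be built from.

```python
def pipeline_waterfall(stages: list[tuple[str, int]]) -> list[str]:
    """Return one line per stage: row count plus change from the previous stage."""
    lines = []
    previous = None
    for name, rows in stages:
        # The first stage has no predecessor, so it gets no delta.
        delta = "" if previous is None else f" ({rows - previous:+d})"
        lines.append(f"{name}: {rows}{delta}")
        previous = rows
    return lines


# Hypothetical row counts, for illustration only.
example = [("input", 1000), ("cleaning", 940), ("joins", 940), ("final", 900)]
for line in pipeline_waterfall(example):
    print(line)
```

An unexpected positive delta at the joins stage is worth flagging in the acceptance interview, since it usually indicates row duplication.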
Trace to Requirements: For each success criterion, reference its requirement ID (e.g., "DATA-01: Panel has 50K+ firm-years — VERIFIED with df.shape output"). End-to-end traceability from SPEC.md through PLAN.md through VALIDATION.md through verification.
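One possible way to keep the requirement trace consistent is to generate each line from a small record. The names `CriterionCheck` and `trace_line` are hypothetical, and the output format (with a plain hyphen separator) is an assumption modelled on the DATA-01 example above.

```python
from dataclasses import dataclass


@dataclass
class CriterionCheck:
    req_id: str     # requirement ID from the spec, e.g. "DATA-01"
    criterion: str  # the success criterion in words
    passed: bool    # result of the fresh verification
    evidence: str   # what was actually inspected


def trace_line(check: CriterionCheck) -> str:
    """Format one traceability line for the verification report."""
    if check.passed:
        return f"{check.req_id}: {check.criterion} - VERIFIED with {check.evidence}"
    return f"{check.req_id}: {check.criterion} - NOT VERIFIED ({check.evidence})"


print(trace_line(CriterionCheck("DATA-01", "Panel has 50K+ firm-years", True, "df.shape output")))
```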
CRITICAL: Before claiming completion, conduct user interview.
AskUserQuestion:
question: "Were there specific methodology requirements I should have followed?"
options:
- label: "Yes, replicating existing analysis"
description: "Results should match a reference"
- label: "Yes, required methodology"
description: "Specific methods were mandated"
- label: "No constraints"
description: "Methodology was flexible"
If replicating:
AskUserQuestion:
question: "Do these results answer your original question?"
options:
- label: "Yes, fully"
description: "Analysis addresses the core question"
- label: "Partially"
description: "Some aspects addressed, others missing"
- label: "No"
description: "Does not answer the question"
If "Partially" or "No":
/ds-implement to address gaps
AskUserQuestion:
question: "Are the outputs in the format you need?"
options:
- label: "Yes"
description: "Format is correct"
- label: "Need adjustments"
description: "Format needs modification"
AskUserQuestion:
question: "Do you have any concerns about the methodology or results?"
options:
- label: "No concerns"
description: "Comfortable with approach and results"
- label: "Minor concerns"
description: "Would like clarification on some points"
- label: "Major concerns"
description: "Significant issues need addressing"
MANDATORY: Demonstrate reproducibility before completion.
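One way to demonstrate reproducibility is to compare content digests of the output files from two fresh runs. The sketch below is an assumption about how that comparison might be done, not the skill's prescribed method: the helper names and the idea of writing each run to its own directory are hypothetical.

```python
import hashlib
from pathlib import Path


def digest_outputs(output_dir: str) -> dict[str, str]:
    """Map each file path (relative to output_dir) to its SHA-256 hex digest."""
    root = Path(output_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_runs(run1_dir: str, run2_dir: str) -> list[str]:
    """Return relative paths that differ between runs or exist in only one run."""
    first = digest_outputs(run1_dir)
    second = digest_outputs(run2_dir)
    return sorted(
        name
        for name in first.keys() | second.keys()
        if first.get(name) != second.get(name)
    )
```

An empty list from `compare_runs` means the two runs are byte-identical. Byte-level comparison is strict: files that embed timestamps or other run metadata may differ even when the numerical results match, so those formats may need a content-level comparison instead.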
<EXTREMELY-IMPORTANT>
## Independent Verification Required
You MUST NOT verify your own work. Spawn a fresh Task agent for reproducibility.
The implementer shares biases and sunk-cost attachment. A fresh subagent sees only the spec and outputs — it verifies without context pollution.
If you're about to re-run the analysis yourself, STOP. Dispatch a Task agent. </EXTREMELY-IMPORTANT>
Dispatch a fresh Task agent to run the reproducibility check:
All paths below are relative to this skill's base directory.
Agent(subagent_type="general-purpose",
allowed_tools=["Read", "Glob", "Grep", "Bash(read-only)"],
prompt="""
# Reproducibility Verification
**Tool Restrictions:** The verifier is READ-ONLY. It re-runs analyses and checks output but MUST NOT modify notebooks, scripts, or code. It MUST NOT use Write or Edit.
Verify this analysis produces consistent results from a fresh run.
## Context
- Read .planning/SPEC.md for objectives and success criteria
- Read .planning/PLAN.md for expected outputs
- Read .planning/LEARNINGS.md for pipeline documentation
## Shared Checks
Read the shared check definitions:
Read `${CLAUDE_SKILL_DIR}/../../skills/ds-implement/references/ds-checks.md` and follow its instructions.
Run checks: DQ1-DQ4, DQ6, M1, R1
## Reproducibility Protocol
### For scripts:
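```python
import hashlib

def digest(result) -> str:
    # Built-in hash() is salted per process for strings, so digests from
    # separate runs are not comparable. SHA-256 is stable across processes.
    return hashlib.sha256(str(result).encode("utf-8")).hexdigest()

# run_analysis is a placeholder for the project's analysis entry point.
# Run 1
hash1 = digest(run_analysis(seed=42))
# Run 2
hash2 = digest(run_analysis(seed=42))
# Verify
assert hash1 == hash2, "Results not reproducible!"
print(f"Reproducibility confirmed: {hash1} == {hash2}")
```
### For notebooks:
```bash
# Execute the notebook from top to bottom
jupyter nbconvert --execute --inplace notebook.ipynb
# Or run it with papermill, passing the seed as a parameter
papermill notebook.ipynb output.ipynb -p seed 42
```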
Report:
**Post-subagent boundary (C5):** After verification agent returns, read its report only. Do NOT read source code, notebooks, or data files yourself. If FAIL, dispatch a fresh investigation subagent.
**If Task agent reports FAIL:** Dispatch a fresh Task agent to investigate the discrepancy. Do NOT investigate yourself — that violates the post-subagent boundary (C5 from ds-common-constraints.md).
## Claims Requiring Evidence
| Claim | Required Evidence |
|-------|-------------------|
| "Analysis complete" | All success criteria verified |
| "Results reproducible" | Same output from fresh run |
| "Matches reference" | Comparison showing match |
| "Data quality handled" | Documented cleaning steps |
| "Methodology appropriate" | Assumptions checked |
## Insufficient Evidence
These do NOT count as verification:
- Previous run results (must be fresh)
- "Should be reproducible" (demonstrate it)
- Visual inspection only (quantify where possible)
- Single run (need reproducibility check)
- Skipped user acceptance (must ask)
## Required Output Structure
```markdown
## Verification Report: [Analysis Name]
### Technical Verification
#### Outputs Generated
- [ ] Output 1: [location] - verified [date/time]
- [ ] Output 2: [location] - verified [date/time]
#### Reproducibility Check
- Run 1 hash: [value]
- Run 2 hash: [value]
- Match: YES/NO
#### Environment
- Python: [version]
- Key packages: [list with versions]
- Random seed: [value]
### User Acceptance
#### Replication Check
- Constraint: [none/replicating/required methodology]
- Reference: [if applicable]
- Match status: [if applicable]
#### User Responses
- Results address question: [yes/partial/no]
- Output format acceptable: [yes/needs adjustment]
- Methodology concerns: [none/minor/major]
### Verdict
**COMPLETE** or **NEEDS WORK**
[If COMPLETE]
- All technical checks passed
- User accepted results
- Reproducibility demonstrated
[If NEEDS WORK]
- [List items requiring attention]
- Recommended next steps
```
Maximum 3 verification cycles. If issues persist after 3 rounds, escalate to user with summary of blocking issues.
Chaining instruction (if NEEDS WORK). Discover and load ds-implement:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md and follow its instructions.
Then fix the identified issues and re-run verification.
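The three-cycle cap and the fix-then-re-verify loop can be sketched as a bounded loop. The names `MAX_CYCLES` and `verify_with_cap` are hypothetical, and `run_cycle` stands in for whatever performs one full verification round (including any fixes from the previous round) and returns the remaining blocking issues.

```python
from typing import Callable

MAX_CYCLES = 3  # cap on verification rounds before escalating


def verify_with_cap(run_cycle: Callable[[int], list[str]]) -> str:
    """Run up to MAX_CYCLES verification rounds.

    run_cycle(n) is assumed to perform round n and return a list of
    blocking issues; an empty list means the round passed.
    """
    issues: list[str] = []
    for cycle in range(1, MAX_CYCLES + 1):
        issues = run_cycle(cycle)
        if not issues:
            return f"PASSED in cycle {cycle}"
    # Still failing after the cap: summarize the blockers for the user.
    return "ESCALATE to user: " + "; ".join(issues)
```

The loop never runs a fourth round. Whatever is still blocking after the third is handed to the user as a summary.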
Only claim COMPLETE when ALL are true:
Both technical and user acceptance must pass. No shortcuts.
When user confirms all criteria are met:
Announce: "DS workflow complete. All 5 phases passed."
The /ds workflow is now finished. Offer to:
.planning/ files
/ds