Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

brycewang-stanford/experiment-plan

Name: experiment-plan
Author: brycewang-stanford

skills/42-wanshuiyin-ARIS/skills/experiment-plan/SKILL.md

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research experiment-plan


Experiment Plan: Claim-Driven, Paper-Oriented Validation

Refine and concretize: $ARGUMENTS

Overview

Use this skill after the method is stable enough that the next question becomes: what exact experiments should we run, in what order, to defend the paper? If the user wants the full chain in one request, prefer /research-refine-pipeline.

The goal is not to generate a giant benchmark wishlist. The goal is to turn a proposal into a claim -> evidence -> run order roadmap that supports four things:

the method actually solves the anchored problem
the dominant contribution is real and focused
the method is elegant enough that extra complexity is unnecessary
any frontier-model-era component is genuinely useful, not decorative

Constants

OUTPUT_DIR = refine-logs/ — Default destination for experiment planning artifacts.
MAX_PRIMARY_CLAIMS = 2 — Prefer one dominant claim plus one supporting claim.
MAX_CORE_BLOCKS = 5 — Keep the must-run experimental story compact.
MAX_BASELINE_FAMILIES = 3 — Prefer a few strong baselines over many weak ones.
DEFAULT_SEEDS = 3 — Use 3 seeds when stochastic variance matters and budget allows.
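
As a rough illustration, the constants above could be held in one small config object and checked against a draft plan. This is a minimal sketch: the `PlanLimits` class and the `over_limits` helper are hypothetical names, not part of the skill, and the limits are treated as soft warnings rather than hard errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanLimits:
    """Defaults copied from the Constants list; the class itself is hypothetical."""
    output_dir: str = "refine-logs/"
    max_primary_claims: int = 2
    max_core_blocks: int = 5
    max_baseline_families: int = 3
    default_seeds: int = 3


def over_limits(limits, n_claims, n_blocks, n_baseline_families):
    """Return the names of any soft limits that a draft plan exceeds."""
    checks = {
        "MAX_PRIMARY_CLAIMS": n_claims > limits.max_primary_claims,
        "MAX_CORE_BLOCKS": n_blocks > limits.max_core_blocks,
        "MAX_BASELINE_FAMILIES": n_baseline_families > limits.max_baseline_families,
    }
    return [name for name, exceeded in checks.items() if exceeded]


# A draft with 3 claims, 4 blocks, and 3 baseline families.
print(over_limits(PlanLimits(), n_claims=3, n_blocks=4, n_baseline_families=3))
# → ['MAX_PRIMARY_CLAIMS']
```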

Workflow

Phase 0: Load the Proposal Context

Read the most relevant existing files first if they exist:

refine-logs/FINAL_PROPOSAL.md
refine-logs/REVIEW_SUMMARY.md
refine-logs/REFINEMENT_REPORT.md

Extract:

Problem Anchor
Dominant contribution
Optional supporting contribution
Critical reviewer concerns
Data / compute / timeline constraints
Which frontier primitive is central, if any

If these files do not exist, derive the same information from the user's prompt.
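
The "read if they exist, otherwise fall back to the prompt" step might look like the following. It is a sketch under stated assumptions: `load_proposal_context` is a hypothetical helper, and only the three file names come from the list above.

```python
from pathlib import Path

# File names taken from the Phase 0 list.
CONTEXT_FILES = (
    "FINAL_PROPOSAL.md",
    "REVIEW_SUMMARY.md",
    "REFINEMENT_REPORT.md",
)


def load_proposal_context(log_dir="refine-logs"):
    """Return {filename: text} for whichever context files exist."""
    found = {}
    for name in CONTEXT_FILES:
        path = Path(log_dir) / name
        if path.is_file():
            found[name] = path.read_text(encoding="utf-8")
    return found


context = load_proposal_context()
if not context:
    # Nothing on disk: the same fields must come from the user's prompt.
    print("No refine-logs files found; derive the context from the prompt.")
```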

Phase 1: Freeze the Paper Claims

Before proposing experiments, write down the claims that must be defended.

Use this structure:

Primary claim: the main mechanism-level contribution
Supporting claim: optional, only if it directly strengthens the main paper story
Anti-claim to rule out: e.g. "the gain only comes from more parameters," "the gain only comes from a larger search space," or "the modern component is just decoration"
Minimum convincing evidence: what would make each claim believable to a strong reviewer?

Do not exceed MAX_PRIMARY_CLAIMS unless the paper truly has multiple inseparable claims.
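
One way to make the frozen claims concrete is a small record per claim plus a check on the claim count. The `Claim` fields and the `freeze_claims` function are hypothetical; the `allow_extra` flag is an assumed escape hatch for the "multiple inseparable claims" exception.

```python
from dataclasses import dataclass

MAX_PRIMARY_CLAIMS = 2  # from the Constants section


@dataclass
class Claim:
    claim_id: str          # e.g. "C1"
    statement: str         # the claim to defend
    anti_claim: str        # alternative explanation to rule out
    minimum_evidence: str  # what a strong reviewer would need to see
    supporting: bool = False


def freeze_claims(claims, allow_extra=False):
    """Reject claim sets that are too large or leave evidence unspecified."""
    if len(claims) > MAX_PRIMARY_CLAIMS and not allow_extra:
        raise ValueError(
            f"{len(claims)} claims exceeds MAX_PRIMARY_CLAIMS={MAX_PRIMARY_CLAIMS}"
        )
    vague = [c.claim_id for c in claims if not c.minimum_evidence.strip()]
    if vague:
        raise ValueError(f"claims without minimum convincing evidence: {vague}")
    return tuple(claims)  # a tuple signals that the set is now fixed
```

Returning a tuple is a small design choice: later phases can read the claims but cannot quietly append a third one.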

Phase 2: Build the Experimental Storyline

Design the paper around a compact set of experiment blocks. Default to the following blocks and delete any that are not needed:

Main anchor result — does the method solve the actual bottleneck?
Novelty isolation — does the dominant contribution itself matter?
Simplicity / elegance check — can a bigger or more fragmented version be avoided?
Frontier necessity check — if an LLM / VLM / Diffusion / RL-era component is central, is it actually the right tool?
Failure analysis or qualitative diagnosis — what does the method still miss?

For each block, decide whether it belongs in:

Main paper — essential to defend the core claims
Appendix — useful but non-blocking
Cut — interesting, but not worth the paper budget

Prefer one strong baseline family over many weak baselines. If a stronger modern baseline exists, use it instead of padding the list.
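
The placement decision can be recorded as a simple mapping from block to destination. The sketch below assumes a hypothetical `Placement` enum and `group_by_placement` helper; the decisions shown are illustrative only.

```python
from enum import Enum


class Placement(Enum):
    MAIN = "Main paper"
    APPENDIX = "Appendix"
    CUT = "Cut"


def group_by_placement(decisions):
    """Group block names by destination, preserving input order."""
    grouped = {placement: [] for placement in Placement}
    for block, placement in decisions.items():
        grouped[placement].append(block)
    return grouped


# Illustrative decisions only; a real plan would justify each one.
decisions = {
    "Main anchor result": Placement.MAIN,
    "Novelty isolation": Placement.MAIN,
    "Simplicity / elegance check": Placement.MAIN,
    "Frontier necessity check": Placement.CUT,
    "Failure analysis": Placement.APPENDIX,
}
for placement, blocks in group_by_placement(decisions).items():
    print(f"{placement.value}: {', '.join(blocks) or '(none)'}")
```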

Phase 3: Specify Each Experiment Block

For every kept block, fully specify:

Claim tested
Why this block exists
Dataset / split / task
Compared systems: strongest baselines, ablations, and variants only
Metrics: decisive metrics first, secondary metrics second
Setup details: backbone, frozen vs trainable parts, key hyperparameters, training budget, seeds
Success criterion: what outcome would count as convincing evidence?
Failure interpretation: if the result is negative, what does it mean?
Table / figure target: where this result should appear in the paper

Special rules:

A simplicity check should usually compare the final method against either an overbuilt variant or a tempting extra component that the paper intentionally rejects.
A frontier necessity check should usually compare the chosen modern primitive against the strongest plausible simpler or older alternative.
If the proposal is intentionally non-frontier, say so explicitly and skip the frontier block instead of forcing one.
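
"Fully specify" can be checked mechanically by listing which fields of a block are still empty. This is a minimal sketch: the `ExperimentBlock` record mirrors the field list above, while the class and the `missing_fields` helper are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ExperimentBlock:
    name: str
    claim_tested: str = ""
    why_it_exists: str = ""
    dataset_split_task: str = ""
    compared_systems: str = ""
    metrics: str = ""
    setup_details: str = ""
    success_criterion: str = ""
    failure_interpretation: str = ""
    table_figure_target: str = ""


def missing_fields(block):
    """Return the names of fields that are still blank."""
    return [f.name for f in fields(block) if not getattr(block, f.name).strip()]


draft = ExperimentBlock(
    name="Novelty isolation",
    claim_tested="C1",
    metrics="decisive metric first, then secondary metrics",
)
print(missing_fields(draft))
```

A block that still reports a blank `failure_interpretation` is a useful warning sign: the plan has not yet said what a negative result would mean.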

Phase 4: Turn the Plan Into an Execution Order

Build a realistic run order so the user knows what to do first.

Use this milestone structure:

Sanity stage — data pipeline, metric correctness, one quick overfit or toy split
Baseline stage — reproduce the strongest baseline(s)
Main method stage — run the final method on the primary setting
Decision stage — run the decisive ablations for novelty, simplicity, and frontier necessity
Polish stage — robustness, qualitative figures, appendix extras

For each milestone, estimate:

compute cost
expected turnaround time
stop / go decision gate
risk and mitigation

Separate must-run from nice-to-have experiments.
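
The milestone ordering can be sketched as a sort key: stage first, then must-run before nice-to-have within a stage. The helper names, run IDs, and GPU-hour figures below are made up for illustration and are not estimates for any real project.

```python
# Stage order follows the milestone structure above.
MILESTONES = ("sanity", "baseline", "main_method", "decision", "polish")


def run_order(runs):
    """Sort runs by stage, with must-run runs first within each stage."""
    return sorted(
        runs,
        key=lambda run: (MILESTONES.index(run["milestone"]), not run["must_run"]),
    )


def must_run_gpu_hours(runs):
    """Sum the estimated cost of the runs that cannot be skipped."""
    return sum(run["gpu_hours"] for run in runs if run["must_run"])


# Placeholder runs; the GPU-hour numbers are invented for the example.
runs = [
    {"id": "ablation-novelty", "milestone": "decision", "must_run": True, "gpu_hours": 8},
    {"id": "robustness", "milestone": "polish", "must_run": False, "gpu_hours": 12},
    {"id": "overfit-toy-split", "milestone": "sanity", "must_run": True, "gpu_hours": 1},
    {"id": "strongest-baseline", "milestone": "baseline", "must_run": True, "gpu_hours": 6},
    {"id": "final-method", "milestone": "main_method", "must_run": True, "gpu_hours": 10},
]
print([run["id"] for run in run_order(runs)])
print(must_run_gpu_hours(runs))  # → 25
```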

Phase 5: Write the Outputs

Step 5.1: Write `refine-logs/EXPERIMENT_PLAN.md`

Use this structure:

# Experiment Plan

**Problem**: [problem]
**Method Thesis**: [one-sentence thesis]
**Date**: [today]

## Claim Map
| Claim | Why It Matters | Minimum Convincing Evidence | Linked Blocks |
|-------|-----------------|-----------------------------|---------------|
| C1    | ...             | ...                         | B1, B2        |

## Paper Storyline
- Main paper must prove:
- Appendix can support:
- Experiments intentionally cut:

## Experiment Blocks

### Block 1: [Name]
- Claim tested:
- Why this block exists:
- Dataset / split / task:
- Compared systems:
- Metrics:
- Setup details:
- Success criterion:
- Failure interpretation:
- Table / figure target:
- Priority: MUST-RUN / NICE-TO-HAVE

### Block 2: [Name]
...

## Run Order and Milestones
| Milestone | Goal | Runs | Decision Gate | Cost | Risk |
|-----------|------|------|---------------|------|------|
| M0        | ...  | ...  | ...           | ...  | ...  |

## Compute and Data Budget
- Total estimated GPU-hours:
- Data preparation needs:
- Human evaluation needs:
- Biggest bottleneck:

## Risks and Mitigations
- [Risk]:
- [Mitigation]:

## Final Checklist
- [ ] Main paper tables are covered
- [ ] Novelty is isolated
- [ ] Simplicity is defended
- [ ] Frontier contribution is justified or explicitly not claimed
- [ ] Nice-to-have runs are separated from must-run runs

Step 5.2: Write `refine-logs/EXPERIMENT_TRACKER.md`

Use this structure:

# Experiment Tracker

| Run ID | Milestone | Purpose | System / Variant | Split | Metrics | Priority | Status | Notes |
|--------|-----------|---------|------------------|-------|---------|----------|--------|-------|
| R001   | M0        | sanity  | ...              | ...   | ...     | MUST     | TODO   | ...   |

Keep the tracker compact and execution-oriented.
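
A tracker row in the format above can be built from a run number and a few known cells, with unknown cells left as `...`. The `tracker_row` helper is hypothetical; the column names and the `R001` ID style follow the template.

```python
TRACKER_COLUMNS = [
    "Run ID", "Milestone", "Purpose", "System / Variant",
    "Split", "Metrics", "Priority", "Status", "Notes",
]


def tracker_row(run_number, cells):
    """Render one markdown row; new runs default to Status TODO."""
    row = {"Run ID": f"R{run_number:03d}", "Status": "TODO", **cells}
    return "| " + " | ".join(row.get(col, "...") for col in TRACKER_COLUMNS) + " |"


print(tracker_row(1, {"Milestone": "M0", "Purpose": "sanity", "Priority": "MUST"}))
# → | R001 | M0 | sanity | ... | ... | ... | MUST | TODO | ... |
```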

Step 5.3: Present a Brief Summary to the User

Experiment plan ready.

Must-run blocks:
- [Block 1]
- [Block 2]

Highest-risk assumption:
- [risk]

First three runs to launch:
1. [run]
2. [run]
3. [run]

Plan file: refine-logs/EXPERIMENT_PLAN.md
Tracker file: refine-logs/EXPERIMENT_TRACKER.md

Key Rules

Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
Every experiment must defend a claim. If it does not change a reviewer belief, cut it.
Prefer a compact paper story. Design the main table first, then add only the ablations that defend it.
Defend simplicity explicitly. If complexity is a concern, include a deletion study or a stronger-but-bloated variant comparison.
Defend frontier choices explicitly. If a modern primitive is central, prove why it is better than the strongest simpler alternative.
Prefer strong baselines over long baseline lists. A short, credible comparison set is better than a padded one.
Separate must-run from nice-to-have. Do not let appendix ideas delay the core paper evidence.
Reuse proposal constraints. Do not invent unrealistic budgets or data assumptions.
Do not fabricate results. Plan evidence; do not claim evidence.
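
The rule that every experiment must defend a claim can be audited from the Claim Map links. The sketch below uses a hypothetical `audit_links` helper that flags blocks no claim points to, and claims with no block defending them.

```python
def audit_links(claim_to_blocks, all_blocks):
    """Report blocks with no claim and claims with no block."""
    linked = {block for blocks in claim_to_blocks.values() for block in blocks}
    return {
        "blocks_without_claim": sorted(set(all_blocks) - linked),
        "claims_without_block": sorted(
            claim for claim, blocks in claim_to_blocks.items() if not blocks
        ),
    }


report = audit_links({"C1": ["B1", "B2"], "C2": []}, ["B1", "B2", "B3"])
print(report)
# → {'blocks_without_claim': ['B3'], 'claims_without_block': ['C2']}
```

Here B3 is a candidate for cutting, and C2 is a claim the plan does not yet defend.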

Composing with Other Skills

/research-refine-pipeline -> one-shot method + experiment planning
/research-refine   -> method and claim refinement
/experiment-plan   -> detailed experiment roadmap
/run-experiment    -> execute the runs
/auto-review-loop  -> react to results and iterate on the paper


Turn a refined research proposal or method idea into a detailed, claim-driven experiment roadmap. Use after `research-refine`, or when the user asks for a detailed experiment plan, ablation matrix, evaluation protocol, run order, compute budget, or paper-ready validation that supports the core problem, novelty, simplicity, and any LLM / VLM / Diffusion / RL-based contribution.

1,065 stars

testing

Updated May 20, 2026


npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research experiment-plan

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---------|-------------|--------|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| VirusTotal | Multi-engine malware detection | Error | 70% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 20, 2026, 1:07 PM · 224.7s · 1 file scanned

SKILL.md

name: experiment-plan
description: Turn a refined research proposal or method idea into a detailed, claim-driven experiment roadmap. Use after `research-refine`, or when the user asks for a detailed experiment plan, ablation matrix, evaluation protocol, run order, compute budget, or paper-ready validation that supports the core problem, novelty, simplicity, and any LLM / VLM / Diffusion / RL-based contribution.
allowed-tools: Bash(*), Read, Write, Edit, Grep, Glob, WebSearch, WebFetch, Agent


Related Skills

brycewang-stanford/literature-review-tools

tools

Verified · Trusted · Community

Recommend AND run open-source AI tools, agents, Claude Code / Codex skills, and MCP servers for any stage of a literature review: searching, reading, extracting, synthesizing, screening, citation-checking, and paper writing. Use when the user asks "what tool should I use to..." OR "install/run/use <tool> to ..." for research/lit-review work: automating a survey or related-work section, PDF→Markdown extraction for LLMs (MinerU/marker/docling), PRISMA / systematic review (ASReview), citation-backed Q&A over PDFs (PaperQA2), wiring papers into Claude/Cursor via MCP (arxiv/paper-search/zotero servers), or chatting with a Zotero library. Ships a launcher (scripts/litrun.py) that installs each tool in an isolated venv and runs it. Curated catalog of 70+ vetted projects. Supports both Chinese and English (for literature-review tool selection and one-click install/run).

3,109 · SKILL.md · Updated Jul 28, 2026

brycewang-stanford/auto-empirical-research-skills

development

Verified · Trusted · Community

Route empirical-research requests through the Auto-Empirical Research Skills catalog when this whole repository is installed as one skill in Codex, CodeBuddy, Claude Code, or another IDE. Use to choose and load the right vendored AERS skill for causal inference, econometrics, replication, data acquisition, manuscript writing, peer review and referee responses, citation checking, de-AIGC editing, or full empirical-paper workflows without reading the entire repository at once.

3,109 · SKILL.md · Updated Jun 27, 2026

brycewang-stanford/aer-preregistration

documentation

Verified · Trusted · Community

Use when the project collects primary data or runs a field, lab, or survey experiment, before the intervention begins: write the pre-analysis plan, size the sample from a power calculation, and register with the AEA RCT Registry. Apply after the design is chosen in aer-identification and before any outcome data are seen.

3,021 · SKILL.md · Updated Jul 23, 2026

brycewang-stanford/economist-data-skill

tools

Verified · Trusted · Community

Guide economists to authoritative data sources with explicit, confirmed data specifications before retrieval; interfaces with Playwright MCP to navigate portals and extract real data, not articles about data.

3,021 · SKILL.md · Updated Jul 23, 2026


Install manually


# Clone the repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research.git

# Copy into Claude Code skills folder (global), creating the folder if it does not exist
mkdir -p ~/.claude/skills
cp -r Awesome-Agent-Skills-for-Empirical-Research/skills/42-wanshuiyin-ARIS/skills/experiment-plan ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

1,065 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
