skills/compute-budget-planner/SKILL.md
Estimate GPU compute budget before running ML experiments. Use when planning how much compute an experiment, ablation matrix, or sweep will cost, sizing smoke tests, finding cheaper alternatives, or deciding whether a planned run fits available resources.
npx skillsauth add a-green-hand-jack/ml-research-skills compute-budget-planner
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Estimate and control compute spend before jobs are submitted. This skill prevents GPU-hour waste from under-scoped smoke tests, over-specified ablations, and jobs that could have been half the cost with a smarter design.
Use this skill when:
Do not use this skill to submit jobs — use run-experiment after sizing is done. Do not use this skill for experiment design — use experiment-design-planner first.
Pair this skill with:
- experiment-design-planner to align experiment scope with compute budget before design is locked
- run-experiment to submit the sized job with the correct resource spec
- baseline-selection-audit when baseline replication costs need to be included in the total budget
- research-project-memory to record compute decisions as evidence or risk items

<installed-skill-dir>/
├── SKILL.md
└── references/
    └── estimation-guide.md
- references/estimation-guide.md
- jobs/, configs/, or experiments/ when a previous run exists
- memory/evidence-board.md when the compute estimate is tied to a paper claim

Collect:
If no prior run exists, use references/estimation-guide.md to estimate throughput.
Compute:
total_steps = ceil(dataset_size / batch_size) × epochs
or planned_steps (if step-based)
gpu_hours = total_steps × seconds_per_step / 3600 × num_gpus
Report:
If throughput is unknown, provide a range using the estimation guide's reference values for the model class and hardware.
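The arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, not part of the skill itself: the function names and the example numbers are hypothetical, and the range helper assumes you have picked a fast and a slow seconds-per-step value yourself (for example from the estimation guide) when throughput is unknown.

```python
import math


def total_steps(dataset_size: int, batch_size: int, epochs: int) -> int:
    """Epoch-based step count: ceil(dataset_size / batch_size) * epochs."""
    return math.ceil(dataset_size / batch_size) * epochs


def gpu_hours(steps: int, seconds_per_step: float, num_gpus: int) -> float:
    """GPU-hours = steps * seconds_per_step / 3600 * num_gpus."""
    return steps * seconds_per_step / 3600 * num_gpus


def gpu_hours_range(steps: int, fast_seconds_per_step: float,
                    slow_seconds_per_step: float, num_gpus: int) -> tuple:
    """(low, high) GPU-hours when throughput is only known as a range."""
    return (
        gpu_hours(steps, fast_seconds_per_step, num_gpus),
        gpu_hours(steps, slow_seconds_per_step, num_gpus),
    )


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements.
    steps = total_steps(dataset_size=1_000_000, batch_size=256, epochs=3)
    low, high = gpu_hours_range(steps, 0.4, 0.6, num_gpus=4)
    print(f"{steps} steps, {low:.1f} to {high:.1f} GPU-hours")
```

For a step-based schedule, skip `total_steps` and pass `planned_steps` straight to `gpu_hours`.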
A smoke test should:
Recommend:
Smoke test steps: max(100, 0.01 × total_steps)
Smoke dataset: subsample 1% of training data, or use 1 batch per shard
Expected smoke wall-clock: < 30 min on 1× GPU
Flag if the smallest feasible smoke run exceeds 30 minutes. This is a sign the setup overhead is too high.
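The smoke-test sizing rule can be checked the same way. The sketch below assumes the 1% figure is rounded up to a whole step and that wall-clock time is steps times seconds per step plus a fixed setup cost; `plan_smoke`, `setup_seconds`, and the returned keys are hypothetical names used only for illustration.

```python
import math

SMOKE_LIMIT_MINUTES = 30  # threshold from the recommendation above


def smoke_steps(total_steps: int) -> int:
    """max(100, 1% of total_steps), with the 1% rounded up (assumption)."""
    return max(100, math.ceil(total_steps / 100))


def plan_smoke(total_steps: int, seconds_per_step: float,
               setup_seconds: float = 0.0) -> dict:
    """Size a smoke run on 1 GPU and flag it if it exceeds the limit."""
    steps = smoke_steps(total_steps)
    minutes = (steps * seconds_per_step + setup_seconds) / 60
    return {
        "steps": steps,
        "wall_clock_minutes": minutes,
        # True means the smoke run itself is too expensive and needs a rethink.
        "exceeds_limit": minutes > SMOKE_LIMIT_MINUTES,
    }


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements.
    print(plan_smoke(total_steps=50_000, seconds_per_step=2.0))
```

A flagged plan does not mean the full run is too large; it means the smoke configuration needs to shrink before anything is submitted.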
For each ablation dimension, record:
| Variant | Changed variable | Cost multiplier | GPU-hours | Priority |
|---|---|---|---|---|
| Baseline | - | 1× | | required |
| Ablation A | [removed component] | 1× | | required |
| Ablation B | [changed config] | 0.5× | | should-have |
| Sweep C | [hyperparameter] | 3× | | optional |
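One way to fill in the GPU-hours column and see what each priority tier costs is to multiply the baseline estimate by each cost multiplier and accumulate by tier. The sketch below is illustrative: the `Variant` class, `cumulative_by_priority`, and the 10 GPU-hour baseline are hypothetical, and the multipliers are the placeholder values from the template rows.

```python
from dataclasses import dataclass

# Tiers in the order they would be funded: required first, optional last.
PRIORITY_ORDER = ("required", "should-have", "optional")


@dataclass
class Variant:
    name: str
    cost_multiplier: float  # cost relative to one baseline run
    priority: str


def variant_gpu_hours(variant: Variant, baseline_gpu_hours: float) -> float:
    """GPU-hours for one variant, scaled from the baseline estimate."""
    return variant.cost_multiplier * baseline_gpu_hours


def cumulative_by_priority(variants: list, baseline_gpu_hours: float) -> dict:
    """Running GPU-hour total after funding each tier in PRIORITY_ORDER."""
    running = 0.0
    totals = {}
    for tier in PRIORITY_ORDER:
        running += sum(
            variant_gpu_hours(v, baseline_gpu_hours)
            for v in variants
            if v.priority == tier
        )
        totals[tier] = running
    return totals


EXAMPLE_MATRIX = [
    Variant("Baseline", 1.0, "required"),
    Variant("Ablation A", 1.0, "required"),
    Variant("Ablation B", 0.5, "should-have"),
    Variant("Sweep C", 3.0, "optional"),
]

if __name__ == "__main__":
    # Assumed baseline of 10 GPU-hours, purely for illustration.
    for tier, hours in cumulative_by_priority(EXAMPLE_MATRIX, 10.0).items():
        print(f"through {tier}: {hours:.1f} GPU-hours")
```

Comparing each cumulative total against the available budget shows the last tier that still fits.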
Classify each variant as:
- required: paper claim depends on this comparison
- should-have: reviewer will likely ask; defensible to omit with a reason
- optional: nice-to-have; drop first if compute is limited

Report:
Check whether any of the following can answer the same question at lower cost:
For each alternative, record the expected signal quality and the risk of a misleading result.
Save to:
code/.agent/compute-budget.md
The plan should include:
- memory/risk-board.md
- memory/decision-log.md
- memory/evidence-board.md when compute details need to be cited in the paper (hardware, training duration, GPU-hours)

Before finalizing: