skills/md-benchmark/SKILL.md
Run MDPrepBench and MDStudyBench tasks with prompt-driven MD agents and deterministic scorer commands. Use for benchmark runs, agent submissions, and comparing MD agents.
MDPrepBench and MDStudyBench evaluate prompt-driven MD agents under the MDAgentBench family. The agent may use MDClaw, MDCrow, GROMACS, Amber, OpenMM scripts, or another backend, but scoring is always artifact-based and script-driven.
Use the suite that matches the task:
- benchmarks/mdprepbench: preparation-only tasks (P01-P25), focused on source retrieval, preparation, topology artifacts, minimization evidence, and preparation provenance.
- benchmarks/mdstudybench: scientific question / study tasks (S01-S03), focused on comparative MD evidence, analysis metrics, methods drafts, provenance, decision logs, and calibrated scientific answers.

For MDClaw commands, do not use the external GNU timeout wrapper. macOS does
not ship timeout; rely on the task time limit, tool/runtime errors, and
MDClaw's internal timeout handling instead.
Give the agent the task prompt and a submission directory. The prompt is the problem statement; it names public sources such as PDB IDs, UniProt accessions, DOIs, URLs, protocols, and required outputs. Retrieval and provenance are part of the evaluated behavior.
Agent-facing files:
- <task_dir>/prompt.md
- mdclaw export_benchmark_public_package; give the agent only prompt.md, submission_contract.json, and submission_checklist.md.

Evaluator-facing files:
- benchmarks/mdprepbench/tasks/<task_id>/task.json
- benchmarks/mdstudybench/tasks/<task_id>/task.json
- mdclaw export_benchmark_private_package in a separate repo/container mount that the evaluated agent cannot read.

Never expose:
- task.json to the benchmark agent
- <task_dir>/truth/
- <task_dir>/scorer/
- harness_execution.json records before the task completes

No fake trajectories, fake metrics, fake citations, or guessed conclusions.
Treat canonical task.json as runner/scorer metadata, not a solution recipe.
Harness code may read failure_policy, time_limit_minutes, required outputs,
and scoring checks before deciding whether a task can be blocked; agents should
not read it. The runner must not inject task-specific MDClaw command-line
arguments, geometry values, selected chains, model numbers, or workflow knobs
that are not present in the public prompt/submission contract. Those choices
belong to the evaluated agent and must be justified in its submission.
Evaluated agents solve exactly one task at a time. They must not inspect,
categorize, or hardcode behavior across the benchmark suite, and must not write
benchmark-wide solver scripts. A task-local helper script is allowed only when
it executes real workflow steps for the current task and is recorded in
provenance.command_log.
For strict scoring, provenance.command_log is not enough by itself. The
operator or harness must keep a measured harness_execution.json outside the
solver-writable submission/ directory with stage, command/action, exit status,
and walltime for each required stage.
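As an illustration of what a measured record could look like, the sketch below times one stage command and appends a record to a JSON file. The field names (`stage`, `command`, `exit_status`, `walltime_seconds`) and the list-of-records layout are assumptions made for this example; the actual harness_execution.json schema is not specified here, and `run_stage` is a hypothetical helper, not an MDClaw function.

```python
import json
import subprocess
import sys
import time
from pathlib import Path


def run_stage(stage, command, record_file):
    """Run one stage command and append a measured record to record_file.

    Hypothetical helper: the record layout is an assumption, not the
    official harness_execution.json schema.
    """
    start = time.monotonic()
    completed = subprocess.run(command, capture_output=True, text=True)
    walltime = time.monotonic() - start

    record = {
        "stage": stage,                          # e.g. "source", "prep", "topo", "min"
        "command": command,                      # argv list that was actually executed
        "exit_status": completed.returncode,     # measured, not self-reported
        "walltime_seconds": round(walltime, 3),
    }

    # Append to the existing list of records, or start a new one.
    path = Path(record_file)
    records = json.loads(path.read_text()) if path.exists() else []
    records.append(record)
    path.write_text(json.dumps(records, indent=2))
    return record
```

A call such as `run_stage("min", [sys.executable, "-c", "pass"], "harness_execution.json")` would record one stage. The record file should live outside the solver-writable submission/ directory so that the evaluated agent cannot edit it.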
run_id is an opaque label for the run directory and records. Do not infer task
subset, smoke-test behavior, execution depth, or expected outcome from words in
the run ID. Suite and task selection come only from the benchmark name, dataset
directory, and explicit task IDs.
If failure_policy.blocked_by_missing_input_allowed=false and
failure_policy.insufficient_information_allowed=false, do not submit
manifest.status="blocked" just because the run may be slow or inconvenient.
Attempt the required stages until one of these happens:
For prep tasks, attempt source retrieval, preparation, explicit solvation by
default, topology build, and a short minimization / finite-energy check. Use an
implicit topology only when the prompt explicitly asks for implicit/no-solvent
handling; in that case, do not retain crystallographic or bulk ions as explicit
particles for implicit solvent. Vacuum/no-solvent is a separate explicit prompt
choice and may keep explicit ions. Full equilibration and production are not part of the prep battery. For
execution tasks outside the prep battery, attempt the MD work requested by the
prompt. For restart tasks, run the requested chunks and attempt the
concatenation/continuity checks. For MDStudyBench comparative answer tasks, run
the requested systems before reporting an effect direction, list real trajectory
artifacts in manifest.outputs.trajectories, and mirror the quantitative MD
analysis in metrics.md_analysis and evidence_report.evidence.md_metrics.
For MDStudyBench dry-run evidence-bundle tasks, do not invent trajectories;
submit the requested methods, decision log, evidence report, and study/report
provenance evidence.
Before writing manifest.status="blocked", record enough evidence to prove
that the task was actually attempted:
- mdclaw ... commands or sub-agent actions attempted
- the stage: source, prep, solv, topo, min, eq, prod, analysis, or report

Write this evidence in provenance.json, evidence_report.json, and a
decision log when useful. If no required stage was attempted, the submission is
not a valid MDClaw benchmark attempt.
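A pre-submission check for this rule might look like the following sketch. It assumes that provenance.command_log is a list of objects with `stage` and `command` keys; that layout and the function name are hypothetical, chosen only to make the rule concrete.

```python
# Stage names as listed in this skill.
KNOWN_STAGES = {
    "source", "prep", "solv", "topo", "min", "eq", "prod", "analysis", "report",
}


def blocked_submission_problems(manifest, provenance):
    """Return a list of problems with a blocked submission (empty if none).

    Assumes command_log entries are dicts with "stage" and "command" keys;
    this layout is an assumption, not the documented provenance schema.
    """
    if manifest.get("status") != "blocked":
        return []  # the rule only applies to blocked submissions

    attempted = [
        entry
        for entry in provenance.get("command_log", [])
        if entry.get("stage") in KNOWN_STAGES and entry.get("command")
    ]

    problems = []
    if not attempted:
        problems.append("no required stage was attempted")
    return problems
```

The check is deliberately narrow: it only confirms that at least one recognizable stage was attempted, and does not judge whether blocking was justified under the task's failure_policy.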
Do not call a run a full benchmark unless every selected task is validated and scored, and the MD execution / comparative-answer tasks were genuinely attempted. If long MD tasks only received blocked placeholders, report the run as partial or blocked-only, not full.
The intended user-facing prompts are short:
Run and evaluate MDPrepBench with run_id=prep_full_run
Run and evaluate only P11_prep_site_protonation_t4l_glu11 from MDPrepBench
Run and evaluate only S03_t4l_wt_vs_l99a_methods from MDStudyBench
For these prompts, prepare the run, execute each task from its generated
agent_prompt.md, then run score_benchmark_run. Keep the evaluated task
agent and the scorer separated as described below.
Prepare an MDClaw benchmark run from the repository root with:
mdclaw prepare_benchmark_run \
--output-dir benchmark_runs \
--run-id <run_id> \
--dataset-dir benchmarks/mdprepbench \
--execution-mode lite
The command writes <run_dir>/agent_tasks.json plus one
task_instructions.json per task. Each instruction points to the agent-safe
prompt.md, submission_contract.json, and the task's submission/
directory. Scoring metadata is written separately to harness_tasks.json and
harness_instructions.json; do not give those files to the evaluated agent.
harness_instructions.json also names the scorer-side harness_execution.json
path used by strict provenance checks.
Use --task-ids P01_prep_simple_monomer_t4l P02_prep_1ake_chain_ap5 to run a
subset.
For StudyBench, select the study dataset explicitly:
mdclaw prepare_benchmark_run \
--output-dir benchmark_runs \
--run-id <run_id> \
--dataset-dir benchmarks/mdstudybench \
--execution-mode lite \
--task-ids S01_stability_t4l_l99a
For MDClaw, launch one evaluated agent per task with the generated prompt:
Use the md-benchmark skill. Run the task in:
<run_dir>/tasks/<task_id>/agent_prompt.md
The generated agent_prompt.md points to task_instructions.json, which in
turn points to only agent-safe files. Do not hand-write long benchmark prompts;
keep task-specific requirements in prompt.md, submission_contract.json, and
submission_checklist.md.
Internal submission rules for this skill:
- Solve only the task named in agent_prompt.md; do not inspect sibling task directories or categorize all benchmark tasks.
- Treat run_id and directory names as labels only; do not infer smoke-test shortcuts, task subsets, or expected outcomes from them.
- Record executed commands in provenance.command_log.
- Run Python as conda run -n mdclaw python ...; system python3 may not have OpenMM, gemmi, or MDClaw installed.
- node.json, or progress.json. If a step must be retried, create a new node or use the MDClaw node/need tools so provenance remains auditable.
- min stage.
- Write manifest.outputs.topology as a list containing system.xml, topology.pdb, and state.xml.
- min node after topo and record run_minimization plus its minimized_structure.pdb / minimization_report.json artifacts.
- state.xml carries the topology-time minimized coordinates and topology.pdb supplies the atom/residue topology. Create the benchmark minimized_structure.pdb with:
mdclaw export_state_pdb \
--topology-pdb-file <topology.pdb> \
--state-xml-file <state.xml> \
--output-pdb-file <submission_dir>/minimized_structure.pdb
Record this command in provenance.command_log. Do not assume
topology.pdb itself is the minimized structure unless the workflow
explicitly wrote it with minimized coordinates.
- min node for MDClaw MDPrepBench submissions. If you package topology-time minimized coordinates directly from state.xml, record that command as the min stage in provenance.command_log.
- Set topology.backend = "openmm" in metrics.json when the public contract requires OpenMM topology artifacts.
- metric_requirements path and follow submission_blueprint; do not invent hidden task options.
- Include command_log entries for the stages named by the public checklist: normally source, prep, topo, and min for MDPrepBench.
- List real trajectory artifacts in manifest.outputs.trajectories and connect metrics.md_analysis to the conclusion.
- submission/; the harness runs scorer commands.

The scorer treats the submitted artifact as the source of truth. It detects
OpenMM by deserializing the system.xml + topology.pdb + state.xml triple
(not by trusting metrics.topology.backend) and recomputes physical properties
directly from the system: force field applied to every atom, net charge, the
water-model fingerprint, and ion molarity from the box volume. Declared
metrics.json values are downgraded to cross-checked declarations — a mismatch
between declared and recomputed becomes an integrity warning, and the recomputed
value is what scores. Do not rely on writing correct metrics over a wrong
topology; build the right system.
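The cross-check described above can be pictured with a small sketch. It is not the MDClaw scorer: the function name, the dictionary keys, and the tolerance are assumptions for illustration, and the recomputed values are taken as given rather than derived from an OpenMM system.

```python
import math


def cross_check(declared, recomputed, rel_tol=1e-6):
    """Score from recomputed values and warn where declared values disagree.

    Illustrative only: keys and tolerance are assumptions, and the real
    scorer recomputes values from the deserialized system.
    """
    warnings = []
    for key, actual in recomputed.items():
        if key not in declared:
            continue  # nothing was declared, so there is nothing to cross-check
        claimed = declared[key]
        if isinstance(actual, (int, float)) and isinstance(claimed, (int, float)):
            same = math.isclose(claimed, actual, rel_tol=rel_tol, abs_tol=1e-9)
        else:
            same = claimed == actual
        if not same:
            warnings.append(f"{key}: declared {claimed!r}, recomputed {actual!r}")

    # The recomputed values are what score; declarations only produce warnings.
    return dict(recomputed), warnings
```

For example, declaring `net_charge` as 0 when the artifact recomputes to 8 yields one integrity warning, and 8 is the value that is scored.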
Scoring is a small physical-validity gate plus graded per-capability partial
credit. A completed submission that fails to load, has non-finite energy, has no
force field applied, or is missing the required minimized structure scores zero.
Otherwise identity / fidelity / provenance checks contribute weighted partial
credit and roll up into a per-capability profile (identity,
physical_validity, fidelity, provenance) in the score and run summary.
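The gate-then-partial-credit shape can be sketched as follows. The gate names, weights, and normalization are hypothetical; the real weights and the way physical_validity enters the profile are defined by the scorer, not by this example.

```python
def score_submission(gate, checks):
    """Apply a validity gate, then weighted partial credit per capability.

    gate:   dict mapping gate name -> bool (all must be True to score at all)
    checks: list of (capability, weight, passed) tuples
    Weights and normalization here are assumptions for illustration.
    """
    if not all(gate.values()):
        # Any failed gate check zeroes the whole submission.
        return {"weighted_total": 0.0, "profile": {}}

    earned, possible = {}, {}
    for capability, weight, passed in checks:
        possible[capability] = possible.get(capability, 0.0) + weight
        earned[capability] = earned.get(capability, 0.0) + (weight if passed else 0.0)

    # Per-capability fraction of available credit.
    profile = {cap: earned[cap] / possible[cap] for cap in possible if possible[cap] > 0}
    total_weight = sum(possible.values())
    weighted_total = sum(earned.values()) / total_weight if total_weight else 0.0
    return {"weighted_total": weighted_total, "profile": profile}
```

With a passing gate and checks `[("identity", 2.0, True), ("fidelity", 1.0, False), ("provenance", 1.0, True)]`, this sketch gives a weighted total of 0.75. Flipping any gate entry to False gives 0.0 regardless of the checks.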
The required submission set is slim: manifest.json, metrics.json,
provenance.json, prepared_structure.pdb, minimized_structure.pdb,
minimization_report.json, and the OpenMM triple. evidence_report.json is
optional unless a specific task's contract lists it.
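A quick completeness check before scoring could look like the sketch below. It assumes that all required files sit at the top level of submission/ and that the OpenMM triple is system.xml, topology.pdb, and state.xml; the flat layout and the helper name are assumptions, since the submission contract may place files elsewhere.

```python
from pathlib import Path

# Required set as listed above; the flat layout is an assumption.
REQUIRED_FILES = [
    "manifest.json",
    "metrics.json",
    "provenance.json",
    "prepared_structure.pdb",
    "minimized_structure.pdb",
    "minimization_report.json",
    "system.xml",      # OpenMM triple
    "topology.pdb",
    "state.xml",
]


def missing_files(submission_dir, extra_required=()):
    """Return required file names that are absent from submission_dir."""
    root = Path(submission_dir)
    names = [*REQUIRED_FILES, *extra_required]
    return [name for name in names if not (root / name).is_file()]
```

Pass `extra_required=("evidence_report.json",)` when a task's contract lists that file.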
Any agent is valid if it solves the public prompt, retrieves public sources as
needed, and writes the standard submission/. Agents that use no MDClaw code on
the solver side (MDCrow, a plain OpenMM/pdbfixer script, an LLM writing its own
OpenMM code) are first-class entrants; the shared MDClaw scorer is the neutral
judge for everyone.
- Initialize the run with mdclaw init_benchmark_run --tooling-condition mdclaw-free (and --harness-name, --backend-name, --model-name for the actual toolchain). prepare_benchmark_run defaults to mdclaw-skills+cli; use it only for MDClaw skill-driven runs. Use mdclaw-cli-only when the solver used MDClaw CLI tools without the skills.
- Give the agent only prompt.md (and submission_contract.json). Do not inject task-specific options absent from the prompt.
- Package the agent's outputs into submission/:
  - MDClaw packager (counts as mdclaw-free; it only reshapes files): mdclaw package_openmm_submission --submission-dir ... --task-id ... --system-xml-file ... --topology-pdb-file ... --state-xml-file ... --command-log-file ....
  - python benchmarks/tools/package_submission.py --submission-dir ... --task-id ... --system-xml ... --topology-pdb ... --state-xml ....

  Neither packager invents force field, water model, chains, ions, or mutations; undeclared values are recorded as unspecified and recomputed from the artifact. Supply the agent's real steps via the command-log option.

See docs/benchmark/mdcrow-runner.md for the full MDCrow recipe and
docs/benchmark/fairness-protocol.md for comparison conditions, attestation,
and the verified flag.
After the agent writes submission/, score only with scripts:
mdclaw validate_and_score_benchmark_submission \
--task-file <canonical_task_dir>/task.json \
--submission-dir <submission_dir> \
--run-id <run_id> \
--harness-record-file <run_task_dir>/harness_execution.json \
--validation-output-file <run_task_dir>/validation.json \
--output-file <run_task_dir>/score.json
For a run directory prepared by MDClaw, score all task submissions and write the run summary with:
mdclaw score_benchmark_run \
--run-dir <run_dir> \
--dataset-dir benchmarks/mdprepbench
Use --dataset-dir benchmarks/mdstudybench for StudyBench runs.
Read the wrapper's normalized fields: validation_success, score_status,
weighted_total, and benchmark_passed. Do not infer benchmark pass/fail from
the wrapper's success field alone; success only means the evaluator wrapper
completed.
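The distinction between the wrapper's `success` field and the benchmark outcome can be made concrete with a small reader. The four field names come from the text above; their value types and the example `score_status` value are assumptions, and `summarize_result` is a hypothetical helper.

```python
def summarize_result(result):
    """Reduce a scorer wrapper result to a short verdict string.

    Deliberately ignores result["success"], which only says that the
    evaluator wrapper itself completed.
    """
    if not result.get("validation_success"):
        return "invalid submission"
    if result.get("benchmark_passed"):
        return f"passed ({result.get('weighted_total')})"
    return f"not passed ({result.get('score_status')}, {result.get('weighted_total')})"
```

A result with `success` set to True but `benchmark_passed` set to False is reported as not passed, which is the case the paragraph above warns about.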