skills/md-benchmark-openrouter/SKILL.md
Run MDAgentBench harness × OpenRouter model matrix evaluations. Use when comparing multiple agent harnesses, model slugs, or OpenRouter provider-routing settings.
MD Benchmark OpenRouter Matrix
Read skills/md-benchmark/SKILL.md, skills/common/preamble.md, and
skills/common/tool-output.md before acting.
Use this skill when the user wants to compare multiple harnesses and LLM models
against MDAgentBench. OpenRouter is the model/provider router; MDAgentBench
still scores only submission/ artifacts.
- harness = the agent runner, e.g. Pydantic AI, OpenAI Agents SDK, LangGraph, smolagents, Cursor, Claude Code, OpenCode, or a custom script.
- model_provider = openrouter for these runs.
- model_name = OpenRouter model slug, e.g. anthropic/claude-sonnet-4-5.
- backend_name = MD engine or workflow used by the harness, e.g. openmm, gromacs, literature-answer-workflow, or mock.
- run = one harness/model combination over one or more benchmark tasks.

Every combination must end in a normal MDAgentBench submission/, score.json, and summary.json.
- truth/, scorer/, or expected/ as the agent under test.
- manifest.status="blocked" or "partial" and run scoring.
- provenance.json:
  - router.name="openrouter", router.model, and router.provider.
  - harness_name, backend_name, model_provider="openrouter", model_name.
- provider.allow_fallbacks=false or explicit provider.only so the actual provider is controlled.
- provider.require_parameters where supported and record any unsupported-provider failures.
- OPENROUTER_API_KEY.

Create or inspect a matrix config such as examples/benchmark/harness_matrix.openrouter.json.
Run mock mode first:
```bash
python examples/benchmark/run_openrouter_matrix.py \
  --config examples/benchmark/harness_matrix.openrouter.json \
  --output-dir benchmark_runs \
  --mock
```
Inspect generated run_config.json, provenance.json, score.json, and
summary.json.
For real OpenRouter runs:
```bash
export OPENROUTER_API_KEY=...
python examples/benchmark/run_openrouter_matrix.py \
  --config examples/benchmark/harness_matrix.openrouter.json \
  --output-dir benchmark_runs
```
Compare runs by summary.json and keep per-task score.json for audit.
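
Before comparing runs, it can help to confirm that each run directory is complete and that its provenance carries the fields this skill asks for. The following is a minimal sketch, not part of MDClaw: the helper names (`lookup`, `audit_provenance`, `audit_run_dir`) are hypothetical, and it assumes that provenance.json, score.json, and summary.json sit directly inside each run directory and that dotted names such as router.name map to nested JSON objects. Adjust both assumptions to the layout the matrix runner actually produces.

```python
import json
from pathlib import Path

# Assumption: these files live directly in each run directory.
REQUIRED_FILES = ("provenance.json", "score.json", "summary.json")

# Field names come from the provenance rules in this skill;
# treating "router.name" as nested JSON ({"router": {"name": ...}}) is an assumption.
REQUIRED_PROVENANCE_KEYS = (
    "router.name",
    "router.model",
    "router.provider",
    "harness_name",
    "backend_name",
    "model_provider",
    "model_name",
)


def lookup(data, dotted_key):
    """Return the value at a dotted path, or None if any segment is missing."""
    node = data
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def audit_provenance(provenance):
    """Return a list of problems found in one parsed provenance dict."""
    problems = [
        f"missing {key}"
        for key in REQUIRED_PROVENANCE_KEYS
        if lookup(provenance, key) is None
    ]
    # Only flag a wrong value when the field is present; absence is reported above.
    if lookup(provenance, "router.name") not in (None, "openrouter"):
        problems.append("router.name is not 'openrouter'")
    if lookup(provenance, "model_provider") not in (None, "openrouter"):
        problems.append("model_provider is not 'openrouter'")
    return problems


def audit_run_dir(run_dir):
    """Check one run directory for required files and provenance fields."""
    run_dir = Path(run_dir)
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (run_dir / name).is_file()
    ]
    provenance_path = run_dir / "provenance.json"
    if provenance_path.is_file():
        problems += audit_provenance(json.loads(provenance_path.read_text()))
    return problems
```

Calling `audit_run_dir` on each subdirectory of benchmark_runs and printing any non-empty result gives a quick completeness report before summary.json files are compared.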
Matrix config fields:
- run_prefix: prefix for generated run IDs.
- tasks: list of MDAgentBench task_id values.
- harnesses: list of {name, adapter} objects.
- models: list of {name, provider} objects where name is an OpenRouter model slug and provider is passed to OpenRouter / provenance.
- budget: optional token, walltime, and cost budget metadata.

generic-openrouter is the built-in minimal adapter for plan-only tasks. It can
call OpenRouter directly, but it does not run MD and is not sufficient for
execution tasks that require real trajectories.
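
To make the matrix fields concrete, here is a minimal sketch of how a config with those fields could expand into one run per harness/model combination, and how a model's provider object could be forwarded in an OpenRouter chat-completions request body. It is an illustration under stated assumptions, not the behavior of run_openrouter_matrix.py: the task ID, harness names, the second model slug, the budget key, the run ID scheme, and the helper names `expand_matrix` and `chat_request_body` are all hypothetical.

```python
from itertools import product

# Hypothetical example config. Field names follow the list above;
# every value except the generic-openrouter adapter and the
# anthropic/claude-sonnet-4-5 slug is a placeholder.
config = {
    "run_prefix": "demo",
    "tasks": ["example-task-id"],
    "harnesses": [
        {"name": "generic", "adapter": "generic-openrouter"},
        {"name": "custom-script", "adapter": "custom"},
    ],
    "models": [
        {
            "name": "anthropic/claude-sonnet-4-5",
            "provider": {"allow_fallbacks": False, "require_parameters": True},
        },
        {"name": "openai/gpt-4o", "provider": {"allow_fallbacks": False}},
    ],
    "budget": {"max_cost_usd": 1.0},  # key name is an assumption
}


def expand_matrix(config):
    """Yield one run description per harness/model combination."""
    for harness, model in product(config["harnesses"], config["models"]):
        slug = model["name"].replace("/", "_")
        yield {
            # Hypothetical ID scheme: prefix, harness name, filesystem-safe slug.
            "run_id": f"{config['run_prefix']}-{harness['name']}-{slug}",
            "harness_name": harness["name"],
            "adapter": harness["adapter"],
            "model_provider": "openrouter",
            "model_name": model["name"],
            "provider": model.get("provider", {}),
            "tasks": list(config["tasks"]),
        }


def chat_request_body(run, messages):
    """Build a chat-completions JSON body that forwards provider routing."""
    body = {"model": run["model_name"], "messages": messages}
    if run["provider"]:
        # OpenRouter reads routing preferences from the top-level "provider" object.
        body["provider"] = run["provider"]
    return body


runs = list(expand_matrix(config))
```

With two harnesses and two models, `runs` holds four entries. Copying the same provider object into both the request body and provenance.json keeps the recorded routing settings identical to what was actually sent.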
Long-form guide:
docs/benchmark/openrouter-harness-matrix.md.