Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

matsunagalab/MD Benchmark OpenRouter Matrix

Name: MD Benchmark OpenRouter Matrix
Author: matsunagalab

skills/md-benchmark-openrouter/SKILL.md

npx skillsauth add matsunagalab/mdclaw MD Benchmark OpenRouter Matrix

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

MD Benchmark OpenRouter Matrix

Read skills/md-benchmark/SKILL.md, skills/common/preamble.md, and skills/common/tool-output.md before acting.

Use this skill when the user wants to compare multiple harnesses and LLM models against MDAgentBench. OpenRouter is the model/provider router; MDAgentBench still scores only submission/ artifacts.

Required Mental Model

harness = the agent runner, e.g. Pydantic AI, OpenAI Agents SDK, LangGraph, smolagents, Cursor, Claude Code, OpenCode, or a custom script.
model_provider = openrouter for these runs.
model_name = OpenRouter model slug, e.g. anthropic/claude-sonnet-4-5.
backend_name = MD engine or workflow used by the harness, e.g. openmm, gromacs, literature-answer-workflow, or mock.
run = one harness/model combination over one or more benchmark tasks.

Every combination must end in a normal MDAgentBench submission/, score.json, and summary.json.

Critical Rules

Never read truth/, scorer/, or expected/ as the agent under test.
Do not hide failed combinations. If a harness/model fails, still create a submission with manifest.status="blocked" or "partial" and run scoring.
Record routing in provenance.json: router.name="openrouter", router.model, and router.provider.
Record run metadata: harness_name, backend_name, model_provider="openrouter", model_name.
For publishable comparisons, prefer provider.allow_fallbacks=false or explicit provider.only so the actual provider is controlled.
If using tool calling or structured output, set provider.require_parameters where supported and record any unsupported-provider failures.
Keep OpenRouter API keys out of committed files. Use OPENROUTER_API_KEY.

Recommended Workflow

Create or inspect a matrix config such as examples/benchmark/harness_matrix.openrouter.json.

Run mock mode first:

python examples/benchmark/run_openrouter_matrix.py \
  --config examples/benchmark/harness_matrix.openrouter.json \
  --output-dir benchmark_runs \
  --mock

Inspect generated run_config.json, provenance.json, score.json, and summary.json.

For real OpenRouter runs:

export OPENROUTER_API_KEY=...
python examples/benchmark/run_openrouter_matrix.py \
  --config examples/benchmark/harness_matrix.openrouter.json \
  --output-dir benchmark_runs

Compare runs by summary.json and keep per-task score.json for audit.

Config Contract

Matrix config fields:

run_prefix: prefix for generated run IDs.
tasks: list of MDAgentBench task_id values.
harnesses: list of {name, adapter} objects.
models: list of {name, provider} objects where name is an OpenRouter model slug and provider is passed to OpenRouter / provenance.
budget: optional token, walltime, and cost budget metadata.

generic-openrouter is the built-in minimal adapter for plan-only tasks. It can call OpenRouter directly, but it does not run MD and is not sufficient for execution tasks that require real trajectories.

Documentation

Long-form guide: docs/benchmark/openrouter-harness-matrix.md.

matsunagalab/MD Benchmark OpenRouter Matrix

skills/md-benchmark-openrouter/SKILL.md

Run MDAgentBench harness × OpenRouter model matrix evaluations. Use when comparing multiple agent harnesses, model slugs, or OpenRouter provider-routing settings.

3 stars

data-ai

Updated May 11, 2026

$ install --global

skillsauth

npx skillsauth add matsunagalab/mdclaw MD Benchmark OpenRouter Matrix

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 11, 2026, 6:15 AM332.6s1 file scanned

SKILL.md

name:: MD Benchmark OpenRouter Matrix
description:: Run MDAgentBench harness × OpenRouter model matrix evaluations. Use when comparing multiple agent harnesses, model slugs, or OpenRouter provider-routing settings.

MD Benchmark OpenRouter Matrix

Read skills/md-benchmark/SKILL.md, skills/common/preamble.md, and skills/common/tool-output.md before acting.

Use this skill when the user wants to compare multiple harnesses and LLM models against MDAgentBench. OpenRouter is the model/provider router; MDAgentBench still scores only submission/ artifacts.

Required Mental Model

harness = the agent runner, e.g. Pydantic AI, OpenAI Agents SDK, LangGraph, smolagents, Cursor, Claude Code, OpenCode, or a custom script.
model_provider = openrouter for these runs.
model_name = OpenRouter model slug, e.g. anthropic/claude-sonnet-4-5.
backend_name = MD engine or workflow used by the harness, e.g. openmm, gromacs, literature-answer-workflow, or mock.
run = one harness/model combination over one or more benchmark tasks.

Every combination must end in a normal MDAgentBench submission/, score.json, and summary.json.

Critical Rules

Never read truth/, scorer/, or expected/ as the agent under test.
Do not hide failed combinations. If a harness/model fails, still create a submission with manifest.status="blocked" or "partial" and run scoring.
Record routing in provenance.json: router.name="openrouter", router.model, and router.provider.
Record run metadata: harness_name, backend_name, model_provider="openrouter", model_name.
For publishable comparisons, prefer provider.allow_fallbacks=false or explicit provider.only so the actual provider is controlled.
If using tool calling or structured output, set provider.require_parameters where supported and record any unsupported-provider failures.
Keep OpenRouter API keys out of committed files. Use OPENROUTER_API_KEY.

Recommended Workflow

Create or inspect a matrix config such as examples/benchmark/harness_matrix.openrouter.json.

Run mock mode first:

python examples/benchmark/run_openrouter_matrix.py \
  --config examples/benchmark/harness_matrix.openrouter.json \
  --output-dir benchmark_runs \
  --mock

Inspect generated run_config.json, provenance.json, score.json, and summary.json.

For real OpenRouter runs:

export OPENROUTER_API_KEY=...
python examples/benchmark/run_openrouter_matrix.py \
  --config examples/benchmark/harness_matrix.openrouter.json \
  --output-dir benchmark_runs

Compare runs by summary.json and keep per-task score.json for audit.

Config Contract

Matrix config fields:

run_prefix: prefix for generated run IDs.
tasks: list of MDAgentBench task_id values.
harnesses: list of {name, adapter} objects.
models: list of {name, provider} objects where name is an OpenRouter model slug and provider is passed to OpenRouter / provenance.
budget: optional token, walltime, and cost budget metadata.

Documentation

Long-form guide: docs/benchmark/openrouter-harness-matrix.md.

Related Skills

matsunagalab/md-analyze

tools

VerifiedTrustedCommunity

Molecular dynamics trajectory analysis using MDClaw CLI tools. Routes concat, metric, and troubleshooting workflows through focused guidance pages.

4SKILL.mdUpdated Apr 17, 2026

matsunagalab/md-analyze

matsunagalab/bioemu-sample

development

VerifiedTrustedCommunity

Generate monomer conformational source candidates with BioEmu, then hand them to MDClaw preparation.

3SKILL.mdUpdated May 15, 2026

matsunagalab/bioemu-sample

matsunagalab/md-study

testing

VerifiedTrustedCommunity

Study-level planning for MDClaw. Turns scientific questions into a small MD research plan, planned jobs, analysis intent, and decision criteria before handing off to stage skills.

3SKILL.mdUpdated May 14, 2026

matsunagalab/md-study