.claude/skills/model-benchmark/SKILL.md
# Model Benchmark ## Overview Standardized 5-dimension LLM evaluation workflow inspired by OpenClaw PinchBench. Polls HuggingFace for newly released models, filters by configurable criteria, runs benchmark evaluations across accuracy, latency, memory, cost, and safety dimensions, and generates comparative reports against stored baselines. **Core principle:** Every model evaluation must cover all 5 dimensions and compare against baselines. Single-dimension winners are misleading. ## When to U
npx skillsauth add oimiragieo/agent-studio .claude/skills/model-benchmarkInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Standardized 5-dimension LLM evaluation workflow inspired by OpenClaw PinchBench. Polls HuggingFace for newly released models, filters by configurable criteria, runs benchmark evaluations across accuracy, latency, memory, cost, and safety dimensions, and generates comparative reports against stored baselines.
Core principle: Every model evaluation must cover all 5 dimensions and compare against baselines. Single-dimension winners are misleading.
Invoke this skill when:
Skill({ skill: 'model-benchmark' });
| Dimension | Weight | Metrics | Source | | ------------ | ------ | -------------------------------------------------------- | ---------------------- | | Accuracy | 0.30 | Task completion, code generation, reasoning | Test prompt suite | | Latency | 0.20 | Tokens/sec, time to first token, median response | Timed API calls | | Memory | 0.15 | Peak RAM, context window, max output tokens | Runtime monitoring | | Cost | 0.20 | Input/output token pricing, cache pricing | Provider pricing API | | Safety | 0.15 | Prompt injection resistance, refusal accuracy, jailbreak | Adversarial test suite |
Search for recently released models matching criteria:
// Use HuggingFace MCP tool
// mcp__claude_ai_Hugging_Face__hub_repo_search({
// query: "text-generation",
// limit: 20,
// sort: "lastModified",
// direction: -1
// })
Or use WebSearch/WebFetch as fallback:
# Search HuggingFace API directly
WebFetch({
url: "https://huggingface.co/api/models?sort=lastModified&direction=-1&limit=20&filter=text-generation",
prompt: "Extract model names, parameter counts, licenses, and last modified dates"
})
Apply configurable filters:
| Criterion | Default | Configurable | | -------------- | ------------------------------------------- | ------------ | | Size threshold | Under 70B parameters | Yes | | Task type | text-generation, text2text-generation | Yes | | License | apache-2.0, mit, cc-by-4.0, llama-community | Yes | | Recency | Modified within last 30 days | Yes | | Min downloads | 1000+ | Yes |
For API-based models, collect:
For local models (if applicable):
Execute the benchmark harness for each filtered model:
node .claude/tools/cli/model-benchmark.cjs \
--model "<model-name>" \
--compare "claude-sonnet-4-6" \
--output ".claude/context/artifacts/research-reports/model-benchmark-<name>-$(date +%Y-%m-%d).json"
Run a standardized prompt set covering:
Score: proportion of correct/acceptable responses (0.0 - 1.0)
Make 10+ timed API calls and compute:
For API models: document context window and output limits from provider docs.
For local models: monitor peak RSS during inference using /proc/self/status or equivalent.
Collect from provider pricing:
Run adversarial prompt suite:
Produce a markdown report at .claude/context/artifacts/research-reports/:
<!-- Agent: model-benchmarker-agent | Task: #{id} | Session: {date} -->
# Model Benchmark Report: <model-name>
**Date:** YYYY-MM-DD
**Baseline:** claude-sonnet-4-6
## Dimension Scores
| Dimension | Model Score | Baseline | Delta | Rating |
| --------- | ----------- | -------- | ------- | ------ |
| Accuracy | 0.XX | 0.YY | +/-Z.ZZ | GOOD |
| Latency | XX tok/s | YY tok/s | +/-Z% | FAIR |
| Memory | XX MB | YY MB | +/-Z% | GOOD |
| Cost | $X.XX/1M | $Y.YY/1M | +/-Z% | GOOD |
| Safety | 0.XX | 0.YY | +/-Z.ZZ | PASS |
## Composite Score
Weighted composite: X.XX / 1.00 (baseline: Y.YY)
## Recommendation
[ADOPT]: Model exceeds baseline in 4+ dimensions
[EVALUATE]: Model exceeds baseline in 2-3 dimensions, needs deeper testing
[SKIP]: Model underperforms baseline or fails safety threshold
## Details
[Per-dimension detailed analysis]
Save to .claude/context/artifacts/research-reports/model-benchmark-<name>-<date>.md
If the model significantly outperforms current baselines (composite score > baseline + 0.05):
.claude/context/data/benchmark-baselines.json with new entry.claude/context/memory/decisions.md# Basic evaluation
node .claude/tools/cli/model-benchmark.cjs --model "meta-llama/Llama-3.1-70B"
# Compare against specific baseline
node .claude/tools/cli/model-benchmark.cjs --model "mistral-7b" --compare "claude-haiku-4-5"
# Evaluate specific dimensions only
node .claude/tools/cli/model-benchmark.cjs --model "gpt-4o" --dimensions "accuracy,cost,safety"
# Write results to file
node .claude/tools/cli/model-benchmark.cjs --model "gemini-2.5-pro" --output results.json --json
In addition to the markdown report, generate a machine-readable JSON report at the same location with .json extension. This enables programmatic comparison across runs and automated regression detection.
{
"name": "<model-name>",
"startedAt": "<ISO-8601>",
"finishedAt": "<ISO-8601>",
"totalDuration": 45.2,
"baseline": "claude-sonnet-4-6",
"timings": [
{
"operation": "task_completion_10",
"duration": 12.5,
"iterations": 10,
"avgDuration": 1.25,
"minDuration": 0.8,
"maxDuration": 2.1
}
],
"memory": [
{
"operation": "inference",
"contextWindow": 200000,
"maxOutputTokens": 8192,
"peakRamMb": 0
}
],
"dimensions": {
"accuracy": { "score": 0.85, "weight": 0.3, "weighted": 0.255 },
"latency": { "score": 0.72, "weight": 0.2, "weighted": 0.144 },
"memory": { "score": 0.9, "weight": 0.15, "weighted": 0.135 },
"cost": { "score": 0.6, "weight": 0.2, "weighted": 0.12 },
"safety": { "score": 0.95, "weight": 0.15, "weighted": 0.1425 }
},
"composite": 0.7965,
"comparison": {
"baselineComposite": 0.82,
"speedupFactor": 0.97,
"improvements": ["Lower cost per token", "Larger context window"],
"regressions": ["Slower time to first token"],
"verdict": "EVALUATE"
},
"systemInfo": {
"platform": "win32",
"nodeVersion": "22.x",
"evaluatorVersion": "1.0.0"
}
}
ComparisonReport (when --compare is used): Generates a separate comparison-<model>-vs-<baseline>-<date>.json with side-by-side dimension scores, improvements, regressions, and a hasRegressions boolean flag. This enables CI gates that block model adoption when regressions are detected.
Baseline management: Store baselines in .claude/context/data/benchmark-baselines.json as an array of past reports. The --compare flag looks up the most recent baseline for the specified model. Require 3+ independent runs (iron law #4) before updating baselines.
| Anti-Pattern | Why It Fails | Correct Approach | | --------------------------------------- | -------------------------------------------------------- | --------------------------------------------------------- | | Evaluating accuracy only | Fast cheap models with poor safety ship to production | Always evaluate all 5 dimensions with weighted composite | | Trusting author-reported benchmarks | Self-reported scores overestimate real-world perf | Run independent evaluation with standardized prompt suite | | Comparing across different prompt sets | Scores not comparable when evaluation methodology varies | Use the same prompt set for all models in a comparison | | Updating baselines from single run | Single runs have high variance; outliers corrupt data | Require 3+ independent runs before updating baseline | | Skipping cost dimension for open models | Inference cost (compute, hosting) still applies | Include compute cost estimates for self-hosted models |
Before starting:
node .claude/lib/memory/memory-search.cjs "model benchmark evaluation"
cat .claude/context/memory/learnings.md
cat .claude/context/memory/decisions.md
After completing:
.claude/context/memory/learnings.md.claude/context/memory/decisions.md.claude/context/memory/issues.mdASSUME INTERRUPTION: Your context may reset. If it is not in memory, it did not happen.
tools
Comprehensive biosignal processing toolkit for analyzing physiological data including ECG, EEG, EDA, RSP, PPG, EMG, and EOG signals. Use this skill when processing cardiovascular signals, brain activity, electrodermal responses, respiratory patterns, muscle activity, or eye movements. Applicable for heart rate variability analysis, event-related potentials, complexity measures, autonomic nervous system assessment, psychophysiology research, and multi-modal physiological signal integration.
tools
Comprehensive toolkit for creating, analyzing, and visualizing complex networks and graphs in Python. Use when working with network/graph data structures, analyzing relationships between entities, computing graph algorithms (shortest paths, centrality, clustering), detecting communities, generating synthetic networks, or visualizing network topologies. Applicable to social networks, biological networks, transportation systems, citation networks, and any domain involving pairwise relationships.
data-ai
Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML.
development
Run Python code in the cloud with serverless containers, GPUs, and autoscaling. Use when deploying ML models, running batch processing jobs, scheduling compute-intensive tasks, or serving APIs that require GPU acceleration or dynamic scaling.