Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

mathews-tom/benchmark-runner

Name: benchmark-runner
Author: mathews-tom

skills/benchmark-runner/SKILL.md

npx skillsauth add mathews-tom/armory benchmark-runner

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Benchmark Runner

Standardizes performance comparison methodology: metric selection, test case design, environment capture, result formatting, and tradeoff analysis. Produces reproducible benchmark reports that support informed decisions — not just "A is faster than B" but "A is faster for small inputs while B scales better."

Reference Files

| File | Contents | Load When | | ----------------------------------- | ---------------------------------------------------------------------------------------------------- | ----------------------------------- | | references/metric-selection.md | Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type | Always | | references/test-case-design.md | Representative input selection, scale variation, edge case coverage, warmup strategies | Always | | references/environment-capture.md | Hardware/software context recording, reproducibility requirements, variance control | Always | | references/statistical-rigor.md | Sample sizing, variance measurement, significance testing, outlier handling | Results need statistical validation |

Prerequisites

Clear candidates to compare (at least 2)
Access to run or observe the candidates (code, API, or existing results)
Representative workload definition

Workflow

Phase 1: Define Scope

What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
What is the decision context? — Why does this comparison matter? This determines which metrics are most important.

Phase 2: Select Metrics

Choose metrics that match the decision context:

| Metric Category | Specific Metrics | When Important | | --------------- | --------------------------------------- | -------------------------------------------- | | Latency | P50, P95, P99, mean, std dev | User-facing operations, API calls | | Throughput | ops/sec, tokens/sec, MB/sec | Batch processing, streaming | | Memory | Peak RSS, avg RSS, allocation rate | Resource-constrained environments | | Accuracy | F1, BLEU, exact match, precision/recall | ML models, algorithms with quality tradeoffs | | Cost | $/1K operations, $/hour, $/GB | Cloud services, API comparisons | | Startup | Time to first operation, cold start | Serverless, CLI tools |

Select 2-4 metrics. More than 4 makes comparison tables unreadable.

Phase 3: Design Test Cases

Create a matrix of inputs that reveal performance characteristics:

Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
Representative data — Use realistic inputs, not synthetic best-case data.
Edge cases — Empty input, maximum size, adversarial input.
Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.

Phase 4: Specify Environment

Record everything needed to reproduce the results:

Hardware — CPU model, core count, RAM size, GPU model (if applicable)
Software — OS version, language runtime version, dependency versions
Configuration — Thread count, batch size, connection pool size, cache settings
Isolation — What else was running? Background processes affect results.

Phase 5: Structure Results

Produce comparison tables with clear winners per metric, followed by tradeoff analysis.

Output Format

# Benchmark: {Descriptive Title}

**Date:** {YYYY-MM-DD}
**Hardware:** {CPU}, {RAM}, {GPU if applicable}
**Software:** {runtime versions}
**Configuration:** {key settings that affect results}

## Candidates

| # | Candidate | Version | Configuration |
|---|-----------|---------|---------------|
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |

## Test Cases

| # | Name | Input Size | Description | Warmup | Iterations |
|---|------|------------|-------------|--------|------------|
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |

## Results

### Latency (ms, lower is better)

| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
|-----------|---------------------|---------------------|--------|
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |

### Memory (MB, lower is better)

| Test Case | A (Peak) | B (Peak) | Winner |
|-----------|----------|----------|--------|
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |

## Analysis

### Overall Winner
**{Candidate}** wins on {N} of {M} metrics across all test cases.

### Tradeoff Summary
- **Choose A when:** {conditions where A is the better choice}
- **Choose B when:** {conditions where B is the better choice}

### Caveats
- {Limitation of this benchmark}
- {Condition under which results may differ}

## Reproduction

```bash
# Environment setup
{commands to recreate the environment}

# Run benchmark
{commands to execute the benchmark}


## Configuring Scope

| Mode | Candidates | Depth | When to Use |
|------|-----------|-------|-------------|
| `quick` | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| `standard` | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| `rigorous` | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |

## Calibration Rules

1. **Measure, don't guess.** Intuition about performance is unreliable. "Obviously
   faster" is not a benchmark result.
2. **Apples to apples.** Candidates must be compared under identical conditions.
   Different hardware, configuration, or input data invalidates the comparison.
3. **Report variance, not just means.** A mean of 50ms with std dev of 100ms is not
   the same as a mean of 50ms with std dev of 2ms. Always report spread.
4. **Warm up before measuring.** First-run performance includes JIT, cache warming,
   and connection setup. Exclude warmup iterations from results.
5. **Representative inputs only.** Benchmarking with synthetic best-case input is
   misleading. Use data that resembles production workloads.
6. **State the winner per metric, not overall.** "A is better" is lazy. "A has lower
   latency; B uses less memory" is useful.

## Error Handling

| Problem | Resolution |
|---------|------------|
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |

## When NOT to Benchmark

Push back if:
- The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
- The candidates are fundamentally different tools (comparing a database to a message queue)
- The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
- Results from others already exist and conditions match — link to existing benchmarks instead

mathews-tom/benchmark-runner

skills/benchmark-runner/SKILL.md

Designs structured benchmarks comparing algorithms, models, or implementations with metrics, test cases, hardware context, and reproduction steps. Triggers on: "benchmark", "compare performance", "which is faster", "latency comparison", "run benchmark", "throughput test", "speed test".

174 stars

development

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add mathews-tom/armory benchmark-runner

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 9:39 AM4.3s6 files scanned

SKILL.md

name:: benchmark-runner
description:: Designs structured benchmarks comparing algorithms, models, or implementations with metrics, test cases, hardware context, and reproduction steps. Triggers on: "benchmark", "compare performance", "which is faster", "latency comparison", "run benchmark", "throughput test", "speed test".
version:: 1.1.1
category:: data
tags:: [benchmarking, performance, comparison, metrics]
difficulty:: intermediate
phase:: verify

Benchmark Runner

Reference Files

Prerequisites

Clear candidates to compare (at least 2)
Access to run or observe the candidates (code, API, or existing results)
Representative workload definition

Workflow

Phase 1: Define Scope

What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
What is the decision context? — Why does this comparison matter? This determines which metrics are most important.

Phase 2: Select Metrics

Choose metrics that match the decision context:

Select 2-4 metrics. More than 4 makes comparison tables unreadable.

Phase 3: Design Test Cases

Create a matrix of inputs that reveal performance characteristics:

Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
Representative data — Use realistic inputs, not synthetic best-case data.
Edge cases — Empty input, maximum size, adversarial input.
Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.

Phase 4: Specify Environment

Record everything needed to reproduce the results:

Hardware — CPU model, core count, RAM size, GPU model (if applicable)
Software — OS version, language runtime version, dependency versions
Configuration — Thread count, batch size, connection pool size, cache settings
Isolation — What else was running? Background processes affect results.

Phase 5: Structure Results

Produce comparison tables with clear winners per metric, followed by tradeoff analysis.

Output Format

# Benchmark: {Descriptive Title}

**Date:** {YYYY-MM-DD}
**Hardware:** {CPU}, {RAM}, {GPU if applicable}
**Software:** {runtime versions}
**Configuration:** {key settings that affect results}

## Candidates

| # | Candidate | Version | Configuration |
|---|-----------|---------|---------------|
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |

## Test Cases

| # | Name | Input Size | Description | Warmup | Iterations |
|---|------|------------|-------------|--------|------------|
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |

## Results

### Latency (ms, lower is better)

| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
|-----------|---------------------|---------------------|--------|
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |

### Memory (MB, lower is better)

| Test Case | A (Peak) | B (Peak) | Winner |
|-----------|----------|----------|--------|
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |

## Analysis

### Overall Winner
**{Candidate}** wins on {N} of {M} metrics across all test cases.

### Tradeoff Summary
- **Choose A when:** {conditions where A is the better choice}
- **Choose B when:** {conditions where B is the better choice}

### Caveats
- {Limitation of this benchmark}
- {Condition under which results may differ}

## Reproduction

```bash
# Environment setup
{commands to recreate the environment}

# Run benchmark
{commands to execute the benchmark}


## Configuring Scope

| Mode | Candidates | Depth | When to Use |
|------|-----------|-------|-------------|
| `quick` | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| `standard` | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| `rigorous` | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |

## Calibration Rules

1. **Measure, don't guess.** Intuition about performance is unreliable. "Obviously
   faster" is not a benchmark result.
2. **Apples to apples.** Candidates must be compared under identical conditions.
   Different hardware, configuration, or input data invalidates the comparison.
3. **Report variance, not just means.** A mean of 50ms with std dev of 100ms is not
   the same as a mean of 50ms with std dev of 2ms. Always report spread.
4. **Warm up before measuring.** First-run performance includes JIT, cache warming,
   and connection setup. Exclude warmup iterations from results.
5. **Representative inputs only.** Benchmarking with synthetic best-case input is
   misleading. Use data that resembles production workloads.
6. **State the winner per metric, not overall.** "A is better" is lazy. "A has lower
   latency; B uses less memory" is useful.

## Error Handling

| Problem | Resolution |
|---------|------------|
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |

## When NOT to Benchmark

Push back if:
- The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
- The candidates are fundamentally different tools (comparing a database to a message queue)
- The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
- Results from others already exist and conditions match — link to existing benchmarks instead

Related Skills

mathews-tom/stacked-prs

testing

VerifiedTrustedCommunity

Manages dependent branch stacks and stacked pull requests using safe Git topology rules. Triggers on: "create stacked PRs", "publish this stack", "sync my PR stack", "rebase this stack", "merge the stack", "retarget child PRs", "split this branch into stacked PRs", "validate this stack", "cleanup stacked branches". Use when local branches or one source branch need to become a dependency-ordered PR stack with correct parent bases, validation, synchronization, merge order, and cleanup.

242SKILL.mdUpdated May 23, 2026

mathews-tom/stacked-prs

mathews-tom/project-context-setup

development

VerifiedTrustedCommunity

Scaffolds per-repository agent context so coding agents share the same issue tracker rules, triage label vocabulary, domain glossary, ADR layout, and handoff conventions. Triggers on: "set up project context", "configure agent docs", "create CONTEXT.md", "setup agent workflow", "agent issue tracker setup", "triage labels", "domain glossary for agents". Use when a repo needs durable context files before planning, triage, debugging, TDD, architecture review, or multi-agent implementation.

230SKILL.mdUpdated May 12, 2026

mathews-tom/project-context-setup

mathews-tom/task-decomposer

testing

VerifiedTrustedCommunity

Produces phased task boards from feature requests: dependency-mapped work items, parallelization flags, risk flags, edge cases, test matrices. Triggers on: "decompose this feature", "task breakdown with dependencies", "phased implementation plan", "work breakdown structure". NOT for effort estimates, use estimate-calibrator.

230SKILL.mdUpdated Apr 6, 2026

mathews-tom/task-decomposer

mathews-tom/debug-investigator

development

VerifiedTrustedCommunity

Hypothesis-driven debugging with ranked hypotheses, git bisect strategy, instrumentation planning, and minimal reproduction design. Triggers on: "debug this systematically", "root cause analysis", "bisect this bug", "rank hypotheses", "isolate this issue", "minimal reproduction". NOT for general reasoning.

230SKILL.mdUpdated Apr 6, 2026

mathews-tom/debug-investigator

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/mathews-tom/armory.git

# Copy into Claude Code skills folder (global)
cp -r armory/skills/benchmark-runner ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

mathews-tom/armory

174 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT