b-open-io/benchmark-skills

Name: benchmark-skills
Author: b-open-io

skills/benchmark-skills/SKILL.md

npx skillsauth add b-open-io/prompts benchmark-skills

Benchmark Skills

Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared to baseline (no skill).

The Core Principle

Only two types of skills produce measurable benchmark delta:

1. Behavioral suppression: The skill suppresses patterns the model naturally produces. The baseline consistently exhibits the bad behavior; the skill stops it. This is the highest-signal category.
2. Genuinely novel knowledge: The skill injects domain knowledge NOT in the model's training data. If a knowledgeable human would need to look it up, the model probably doesn't know it either.

What does NOT produce delta (don't waste time benchmarking these):

Knowledge the model already has (common frameworks, well-known patterns)
General quality improvement without a specific behavioral target
Skills requiring real system access (filesystem, APIs, browsers)
Skills requiring multi-turn interaction

Pre-Flight Checklist

Before writing evals for a skill, verify ALL of these:

[ ] The skill changes default model behavior OR injects genuinely novel knowledge
[ ] The skill works in single-prompt-in/single-response-out mode (no interactivity)
[ ] The skill doesn't require real system access to demonstrate value
[ ] The skill is ours (not copied from another publisher)
[ ] You can design at least 2 trap prompts that reliably elicit baseline failure
[ ] Assertions are concrete and binary (not vague quality judgments)

If any box fails, the skill is not a good benchmark candidate.

Eval File Format

Every skill to be benchmarked needs an evals/evals.json file next to its SKILL.md:

skills/
  my-skill/
    SKILL.md
    evals/
      evals.json

evals.json Structure

{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "The exact prompt to send to the model",
      "expected_output": "Description of what a good response looks like",
      "files": [],
      "assertions": [
        {
          "id": "unique-assertion-id",
          "text": "Specific, verifiable claim about the output",
          "type": "qualitative"
        }
      ]
    }
  ]
}
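As a rough illustration of the structure above, the sketch below types the file and runs a few sanity checks before it goes to the harness. The interfaces mirror only the fields shown in the sample. `validateEvalFile` is a hypothetical helper, and treating `files` as a list of path strings is an assumption; the real harness may validate differently or not at all.

```typescript
// Types mirror the sample evals.json above; nothing beyond those fields is assumed,
// except that "files" holds path strings (an assumption, the sample shows it empty).
interface Assertion {
  id: string;
  text: string;
  type: string; // the sample uses "qualitative"; other values are not documented here
}

interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  files: string[];
  assertions: Assertion[];
}

interface EvalFile {
  skill_name: string;
  evals: EvalCase[];
}

// Hypothetical helper: returns a list of problems, empty when the file looks well-formed.
function validateEvalFile(data: EvalFile): string[] {
  const problems: string[] = [];
  if (data.skill_name.trim() === "") problems.push("skill_name is empty");

  const evalIds = new Set<number>();
  for (const ev of data.evals) {
    if (evalIds.has(ev.id)) problems.push(`duplicate eval id ${ev.id}`);
    evalIds.add(ev.id);
    if (ev.prompt.trim() === "") problems.push(`eval ${ev.id}: prompt is empty`);
    if (ev.assertions.length === 0) problems.push(`eval ${ev.id}: no assertions`);

    const assertionIds = new Set<string>();
    for (const a of ev.assertions) {
      if (assertionIds.has(a.id)) problems.push(`eval ${ev.id}: duplicate assertion id ${a.id}`);
      assertionIds.add(a.id);
      if (a.text.trim() === "") problems.push(`eval ${ev.id}: assertion ${a.id} has empty text`);
    }
  }
  return problems;
}
```

An empty result only means the file is structurally plausible; it says nothing about whether the prompts are effective traps.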

Trap Input Design

Every eval prompt must be a trap — a prompt that reliably elicits the bad behavior the skill suppresses. If the baseline model passes your assertions without the skill, your test case is useless.

How to design traps

1. Identify what the skill changes (what patterns it suppresses or what knowledge it injects)
2. Write a prompt that naturally invites those patterns
3. Verify the baseline model actually falls into the trap (run without the skill first)
4. If the baseline passes, redesign the prompt or drop the test case

Examples of good traps

| Skill | Trap prompt | What baseline does wrong |
|-------|------------|------------------------|
| humanize | "Write 4 company values with descriptions" | Produces tricolons, binary contrasts, punchline endings |
| humanize | "Explain the pros and cons of X" | Uses "not X — it's Y" pattern |
| geo-optimizer | "Generate an AgentFacts schema following NANDA" | Doesn't know NANDA protocol, hallucinates |
| geo-optimizer | "Audit this site for AI search visibility" | Doesn't know hedge density, 1MB threshold |

Contrastive validation

A proper eval checks BOTH directions:

Baseline DOES exhibit the bad pattern (trap works)
Skill output does NOT exhibit the bad pattern (skill works)

If baseline passes an assertion, that assertion is not measuring delta.
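The two-direction check can be sketched as a small classifier over judge verdicts. The `Verdict` shape, the status names, and `contrastAssertions` are all hypothetical, chosen for illustration; the harness's actual result format is not shown in this document.

```typescript
// One judge verdict for one assertion on one output (hypothetical shape).
interface Verdict {
  assertionId: string;
  passed: boolean;
}

type ContrastStatus =
  | "measures-delta" // baseline fails, skill passes: the trap and the skill both work
  | "trap-failed" // baseline already passes: the assertion measures no delta
  | "skill-failed" // neither passes: the skill does not fix the behavior
  | "skill-hurts"; // baseline passes, skill fails

function contrastAssertions(
  baseline: Verdict[],
  withSkill: Verdict[],
): Record<string, ContrastStatus> {
  const skillById = new Map<string, boolean>();
  for (const v of withSkill) skillById.set(v.assertionId, v.passed);

  const report: Record<string, ContrastStatus> = {};
  for (const b of baseline) {
    const skillPassed = skillById.get(b.assertionId);
    if (skillPassed === undefined) {
      throw new Error(`no with-skill verdict for assertion ${b.assertionId}`);
    }
    if (!b.passed && skillPassed) report[b.assertionId] = "measures-delta";
    else if (b.passed && skillPassed) report[b.assertionId] = "trap-failed";
    else if (!b.passed && !skillPassed) report[b.assertionId] = "skill-failed";
    else report[b.assertionId] = "skill-hurts";
  }
  return report;
}
```

Any assertion that does not come back as `measures-delta` is a candidate for redesign or removal.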

Writing Assertions

Assertion types by reliability

| Type | Reliability | Cost | Best for |
|------|-------------|------|----------|
| not-contains / regex | Highest | Free | Banned phrases, specific patterns |
| Binary LLM judge | High | 1 API call | Presence/absence of behavior |
| G-Eval rubric (CoT) | Medium | 1 API call | Multi-dimensional quality |

Default to negative assertions for suppression skills. "Output does NOT contain tricolons" is more reliable than "output sounds natural."
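A regex-based negative assertion might look like the sketch below. The pattern is one possible approximation of the "not X, it's Y" contrast, written for illustration; `NegativeCheck` and `runNegativeChecks` are hypothetical names, and the document does not show how the harness itself evaluates non-judge assertions.

```typescript
interface NegativeCheck {
  id: string;
  pattern: RegExp; // avoid the `g` flag: RegExp.test keeps state between calls with it
}

// Approximates "not X <dash, comma, or semicolon> it's Y" within a single sentence.
// \u2014 and \u2013 are the em dash and en dash; \u2019 is the curly apostrophe.
const BINARY_CONTRAST =
  /\bnot\b[^.!?\n]{1,60}?(?:\u2014|\u2013|--|,|;)\s*it(?:'|\u2019)s\b/i;

// Returns assertion id -> passed. A negative assertion passes when the pattern is absent.
function runNegativeChecks(output: string, checks: NegativeCheck[]): Record<string, boolean> {
  const results: Record<string, boolean> = {};
  for (const check of checks) {
    results[check.id] = !check.pattern.test(output);
  }
  return results;
}
```

Checks like this cost nothing to run and are deterministic, which is why they sit at the top of the reliability table; the trade-off is that a regex only catches the surface forms it was written for.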

Good vs bad assertions

Bad assertions (will show 0% delta):

"The response is helpful" — too vague, baseline passes
"The response is correct" — not specific to skill
"The response describes three phases" — model already knows this

Good assertions (will show real delta):

"The output does NOT use binary contrast patterns such as 'not X — it's Y'" — specific, testable, baseline fails
"The response includes the @context field pointing to nanda.dev namespace" — genuinely novel knowledge
"Processes are categorized into safety levels rather than a flat list" — specific format the skill teaches

Rules

Be specific: test for exact patterns, not vibes
Be binary: the judge must answer yes/no unambiguously
Target what the skill uniquely provides: if the baseline would pass anyway, the assertion is worthless
3-5 assertions per eval: enough to measure, not so many that noise accumulates
Mix negative and positive: "does NOT contain X" AND "DOES contain Y"
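Some of these rules can be checked mechanically before spending API calls. The sketch below is a rough lint over assertion texts: the word lists are heuristic assumptions made for illustration, `lintAssertions` is a hypothetical helper, and a clean result does not prove an assertion is specific enough.

```typescript
// Heuristic word lists: assumptions for illustration, not part of the harness.
const VAGUE_WORDS = /\b(?:helpful|correct|natural|good|high[- ]quality)\b/i;
const NEGATION = /\b(?:not|never|without)\b/i;

// Returns warnings for one eval's assertion texts; empty when no rule is tripped.
function lintAssertions(texts: string[]): string[] {
  const warnings: string[] = [];
  if (texts.length < 3 || texts.length > 5) {
    warnings.push(`expected 3-5 assertions, found ${texts.length}`);
  }
  const negatives = texts.filter((t) => NEGATION.test(t)).length;
  if (negatives === 0) {
    warnings.push("no negative assertion (does NOT contain X)");
  }
  if (texts.length > 0 && negatives === texts.length) {
    warnings.push("no positive assertion (DOES contain Y)");
  }
  for (const t of texts) {
    if (VAGUE_WORDS.test(t)) warnings.push(`possibly vague: "${t}"`);
  }
  return warnings;
}
```

The remaining rule, whether the baseline would pass anyway, cannot be linted from text alone and still needs a baseline run.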

Assertion Discovery (VibeCheck Method)

If you're unsure what assertions to write for a new skill:

1. Generate 10-20 paired outputs (with skill vs. without) on diverse prompts
2. Have a model compare the two sets and propose behavioral differences
3. Check which differences appear consistently
4. Those consistent patterns become your formal assertions

This prevents guessing at assertions that don't actually differentiate.
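The consistency check in step 3 can be sketched as a frequency filter. It assumes the comparing model's proposals have already been normalized into short labels, one list per output pair; `consistentDifferences` and the 0.7 default threshold are illustrative assumptions, not values taken from the harness.

```typescript
// perPair[i] holds the difference labels proposed for output pair i.
// Returns labels seen in at least minFraction of the pairs, most frequent first.
function consistentDifferences(perPair: string[][], minFraction: number = 0.7): string[] {
  if (perPair.length === 0) return [];

  const counts = new Map<string, number>();
  for (const labels of perPair) {
    // Count each label at most once per pair, even if it was proposed twice.
    new Set(labels).forEach((label) => {
      counts.set(label, (counts.get(label) ?? 0) + 1);
    });
  }

  const needed = minFraction * perPair.length;
  return Array.from(counts.entries())
    .filter(([, n]) => n >= needed)
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([label]) => label);
}
```

Labels that survive the filter are the ones worth turning into binary assertions; one-off differences are more likely noise than skill effect.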

Running the Benchmark

bun run scripts/benchmark.tsx                                    # All skills with evals
bun run scripts/benchmark.tsx --skill geo-optimizer              # Single skill
bun run scripts/benchmark.tsx --model "$BENCHMARK_MODEL_ID"       # Override model (default: haiku)
bun run scripts/benchmark.tsx --concurrency 4                    # Parallel workers

From within Claude Code, prefix the command with CLAUDECODE= (for example, CLAUDECODE= bun run scripts/benchmark.tsx) to avoid nested session errors.

The harness runs each eval prompt twice: once with the skill injected via --append-system-prompt, once without. Both outputs are graded by LLM-as-judge.

Reading Results

Results are written to benchmarks/latest.json and to each skill's evals/benchmark.json.

Key Metrics

pass_rate: Assertion pass rate with skill active
baseline_pass_rate: Assertion pass rate without skill
Delta (pass_rate - baseline_pass_rate): The signal

| Delta | Meaning | Action |
|-------|---------|--------|
| > +20% | Strong skill | Publish |
| +1% to +20% | Weak signal | Improve evals or skill |
| 0% | No effect | Skill is redundant OR evals test wrong thing |
| Negative | Skill hurts | Skill confuses model or evals are bad |
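For concreteness, the delta arithmetic and the table's thresholds can be sketched as follows. The function names are hypothetical, and treating any positive delta up to 20 points as a weak signal (including values between 0 and 1) is an assumption made to cover the gap in the table.

```typescript
// Fraction of assertions that passed, in the range 0..1.
function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter((passed) => passed).length / results.length;
}

// Delta in percentage points (pass_rate - baseline_pass_rate), rounded to one decimal.
function deltaPoints(withSkill: boolean[], baseline: boolean[]): number {
  return Math.round((passRate(withSkill) - passRate(baseline)) * 1000) / 10;
}

// Maps a delta in percentage points onto the rows of the table above.
function classifyDelta(delta: number): string {
  if (delta > 20) return "Strong skill: publish";
  if (delta > 0) return "Weak signal: improve evals or skill";
  if (delta === 0) return "No effect: skill is redundant or evals test the wrong thing";
  return "Skill hurts: skill confuses the model or evals are bad";
}
```

With 4 of 5 assertions passing with the skill and 1 of 5 passing at baseline, `deltaPoints` gives 60, which falls in the strong row.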

Publishing Policy

Only publish skills with positive delta
Zero or negative delta: don't publish; refine the skill or the evals
When run with the --skill flag, the harness merges that skill's results into latest.json

Judge Quality

The LLM-as-judge has known failure modes. When results seem wrong:

| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| Everything passes | Assertions too vague | Make assertions more specific and binary |
| Inconsistent across runs | Judge is non-deterministic | Set temperature=0 and have the judge reason (CoT) before giving its verdict |
| Skill and baseline score the same | Testing knowledge the model already has | Redesign as a behavioral suppression test |
| Skill scores lower than baseline | Skill constrains the model too much | Check whether the skill instructions conflict with the prompt |

Lessons Learned

These patterns have been confirmed through multiple benchmark runs:

Behavioral suppression skills are easiest to benchmark (humanize: +53%)
Novel knowledge injection works if truly novel (geo-optimizer: +50%, NANDA protocol)
Common knowledge injection shows 0% delta (charting, prd-creator, hunter-skeptic-referee)
Skills needing system access can't be benchmarked this way (process-cleanup: -5%)
Long, expensive prompts waste money without improving signal (saas-launch-audit)
2-3 well-designed evals beat 10 mediocre ones

b-open-io/benchmark-skills

This skill should be used when the user asks to "write evals for a skill", "benchmark this skill", "test skill effectiveness", "run the skill benchmark harness", "measure skill quality vs baseline", or "add an evals.json alongside a skill". Invoke whenever someone wants to test, benchmark, or evaluate whether a skill actually helps compared to no skill at all.

Stars: 14
Category: testing
Updated: Jul 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add b-open-io/prompts benchmark-skills

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jul 15, 2026, 4:44 AM (319.2s, 1 file scanned)

SKILL.md

name: benchmark-skills
version: 2.0.1
description: >-

Install manually

# Clone the repo
git clone https://github.com/b-open-io/prompts.git

# Copy into Claude Code skills folder (global)
cp -r prompts/skills/benchmark-skills ~/.claude/skills/

Repository: b-open-io/prompts (14 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT