skills/benchmark-skills/SKILL.md
Use this skill when creating evals or assertions for a skill, running the skill benchmark harness, measuring skill effectiveness vs baseline, or writing evals.json files alongside skills. Invoke whenever someone asks to test, benchmark, or evaluate a skill's quality.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add b-open-io/prompts benchmark-skills
Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared to baseline (no skill).
Only two types of skills produce measurable benchmark delta:
What does NOT produce delta (don't waste time benchmarking these):
Before writing evals for a skill, verify ALL of these:
If any box fails, the skill is not a good benchmark candidate.
Every skill that wants benchmarking needs an evals/evals.json file:
skills/
  my-skill/
    SKILL.md
    evals/
      evals.json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "The exact prompt to send to the model",
      "expected_output": "Description of what a good response looks like",
      "files": [],
      "assertions": [
        {
          "id": "unique-assertion-id",
          "text": "Specific, verifiable claim about the output",
          "type": "qualitative"
        }
      ]
    }
  ]
}
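A quick structural check can catch a malformed eval file before a benchmark run. The following is a minimal sketch, assuming every field shown in the example above is required; the `validate_evals` helper and its rules (such as unique assertion ids within one eval) are illustrative assumptions, not part of the harness.

```python
# Hypothetical validator: the required-key sets mirror the example file above.
REQUIRED_EVAL_KEYS = {"id", "prompt", "expected_output", "files", "assertions"}
REQUIRED_ASSERTION_KEYS = {"id", "text", "type"}


def validate_evals(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the structure looks well-formed."""
    problems = []
    name = doc.get("skill_name")
    if not isinstance(name, str) or not name:
        problems.append("skill_name must be a non-empty string")

    evals = doc.get("evals")
    if not isinstance(evals, list) or not evals:
        problems.append("evals must be a non-empty list")
        return problems

    for index, ev in enumerate(evals):
        if not isinstance(ev, dict):
            problems.append(f"evals[{index}] must be an object")
            continue
        label = f"eval {ev.get('id', index)}"
        missing = REQUIRED_EVAL_KEYS - ev.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")

        seen_ids = set()
        for assertion in ev.get("assertions", []):
            a_missing = REQUIRED_ASSERTION_KEYS - assertion.keys()
            if a_missing:
                problems.append(f"{label}: assertion missing keys {sorted(a_missing)}")
            a_id = assertion.get("id")
            if a_id in seen_ids:
                problems.append(f"{label}: duplicate assertion id {a_id!r}")
            seen_ids.add(a_id)
    return problems
```

To use it, load the file with `json.load` and pass the resulting dictionary to `validate_evals`; any returned strings describe what to fix.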
Every eval prompt must be a trap — a prompt that reliably elicits the bad behavior the skill suppresses. If the baseline model passes your assertions without the skill, your test case is useless.
| Skill | Trap prompt | What baseline does wrong |
|-------|------------|------------------------|
| humanize | "Write 4 company values with descriptions" | Produces tricolons, binary contrasts, punchline endings |
| humanize | "Explain the pros and cons of X" | Uses "not X — it's Y" pattern |
| geo-optimizer | "Generate an AgentFacts schema following NANDA" | Doesn't know NANDA protocol, hallucinates |
| geo-optimizer | "Audit this site for AI search visibility" | Doesn't know hedge density, 1MB threshold |
A proper eval checks BOTH directions:
If baseline passes an assertion, that assertion is not measuring delta.
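One way to apply this rule is to sort assertions by how the two runs behaved. The sketch below assumes per-assertion pass/fail results are available as dictionaries keyed by assertion id; that shape and the `classify_assertions` name are hypothetical, not the harness's actual output format.

```python
def classify_assertions(
    baseline: dict[str, bool], with_skill: dict[str, bool]
) -> dict[str, list[str]]:
    """Group assertion ids by whether they can show a difference between runs."""
    report = {
        "measures_delta": [],           # baseline fails, skill passes
        "baseline_already_passes": [],  # cannot show improvement
        "skill_also_fails": [],         # the skill did not fix the behavior
    }
    for assertion_id, baseline_passed in baseline.items():
        # A missing result for the skill run is treated as a failure (assumption).
        skill_passed = with_skill.get(assertion_id, False)
        if baseline_passed:
            report["baseline_already_passes"].append(assertion_id)
        elif skill_passed:
            report["measures_delta"].append(assertion_id)
        else:
            report["skill_also_fails"].append(assertion_id)
    return report


report = classify_assertions(
    baseline={"no-tricolons": False, "has-four-values": True},
    with_skill={"no-tricolons": True, "has-four-values": True},
)
print(report["baseline_already_passes"])  # ['has-four-values']
```

Assertions that land in `baseline_already_passes` are candidates for rewriting or removal.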
| Type | Reliability | Cost | Best for |
|------|-------------|------|----------|
| not-contains / regex | Highest | Free | Banned phrases, specific patterns |
| Binary LLM judge | High | 1 API call | Presence/absence of behavior |
| G-Eval rubric (CoT) | Medium | 1 API call | Multi-dimensional quality |
Default to negative assertions for suppression skills. "Output does NOT contain tricolons" is more reliable than "output sounds natural."
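Negative checks of this kind need no judge call at all. The following is a rough sketch of the two deterministic checks from the table, written as standalone functions; how the harness represents such assertions in evals.json is not shown in this document, so the function names and arguments here are assumptions.

```python
import re


def passes_not_contains(output: str, banned_phrases: list[str]) -> bool:
    """Pass only if none of the banned phrases appear (case-insensitive)."""
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in banned_phrases)


def passes_regex_absent(output: str, pattern: str) -> bool:
    """Pass only if the pattern matches nowhere in the output."""
    return re.search(pattern, output, flags=re.IGNORECASE) is None


# Illustrative pattern for the "not X, it's Y" contrast within one sentence.
# It only handles a straight apostrophe; extend it for typographic quotes.
CONTRAST = r"\bnot\b[^.!?]*\bit's\b"

print(passes_regex_absent("This is not a tool - it's a platform.", CONTRAST))  # False
print(passes_regex_absent("We ship fast and fix bugs quickly.", CONTRAST))     # True
print(passes_not_contains("Let's delve into it.", ["delve"]))                  # False
```

Because these checks are deterministic, they return the same verdict on every run, which is what makes them the most reliable row in the table.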
Bad assertions (will show 0% delta):
Good assertions (will show real delta):
If you're unsure what assertions to write for a new skill:
This prevents guessing at assertions that don't actually differentiate.
bun run benchmark # All skills with evals
bun run benchmark --skill geo-optimizer # Single skill
bun run benchmark --model claude-sonnet-4-6 # Override model (default: haiku)
bun run benchmark --concurrency 4 # Parallel workers
From within Claude Code, prefix the command with CLAUDECODE= (for example, CLAUDECODE= bun run benchmark) to avoid nested-session errors.
The harness runs each eval prompt twice: once with the skill injected via --append-system-prompt, once without. Both outputs are graded by LLM-as-judge.
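The comparison between the two runs can be pictured as a difference in pass rates. This sketch assumes delta is the with-skill assertion pass rate minus the baseline pass rate, in percentage points; the document does not spell out the harness's exact formula, so treat both the formula and the function names as assumptions.

```python
def pass_rate(results: list[bool]) -> float:
    """Percentage of graded assertions that passed."""
    if not results:
        raise ValueError("no graded assertions")
    return 100.0 * sum(results) / len(results)


def benchmark_delta(with_skill: list[bool], baseline: list[bool]) -> float:
    """Assumed definition: pass rate with the skill minus pass rate without it."""
    return pass_rate(with_skill) - pass_rate(baseline)


# Four assertions graded for each run of the same prompt.
delta = benchmark_delta(
    with_skill=[True, True, True, False],
    baseline=[True, False, False, False],
)
print(delta)  # 50.0
```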
Results go to benchmarks/latest.json and per-skill evals/benchmark.json:
| Delta | Meaning | Action |
|-------|---------|--------|
| > +20% | Strong skill | Publish |
| +1% to +20% | Weak signal | Improve evals or skill |
| 0% | No effect | Skill is redundant OR evals test wrong thing |
| Negative | Skill hurts | Skill confuses model or evals are bad |
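The table above translates directly into a small lookup. In this sketch the `interpret_delta` name is hypothetical, and one boundary is an assumption: the table does not cover values between 0% and +1%, so they are grouped with "No effect" here.

```python
def interpret_delta(delta: float) -> tuple[str, str]:
    """Map a delta in percentage points to (meaning, action) from the table."""
    if delta > 20:
        return ("Strong skill", "Publish")
    if delta >= 1:
        return ("Weak signal", "Improve evals or skill")
    if delta < 0:
        return ("Skill hurts", "Skill confuses model or evals are bad")
    # Assumption: anything from 0 up to (but not including) +1 counts as no effect.
    return ("No effect", "Skill is redundant OR evals test wrong thing")


for value in (35.0, 20.0, 0.0, -5.0):
    meaning, action = interpret_delta(value)
    print(f"{value:+.0f}%: {meaning} -> {action}")
```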
latest.json merges per-skill results when using --skill flag.

The LLM-as-judge has known failure modes. When results seem wrong:
| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| Everything passes | Assertions too vague | Make assertions more specific and binary |
| Inconsistent across runs | Judge non-deterministic | Need temperature=0, CoT before verdict |
| Skill and baseline score the same | Testing knowledge model already has | Redesign as behavioral suppression test |
| Skill scores lower than baseline | Skill constraining model too much | Check if skill instructions conflict with prompt |
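The "CoT before verdict" fix can be sketched as a prompt that asks the judge to reason first and commit to a binary verdict on the final line, plus a strict parser for that line. The template wording, the `VERDICT:` convention, and both function names are assumptions for illustration; the harness's real judge prompt is not shown in this document. Setting temperature to 0 happens in the API call itself, which is left out here.

```python
import re

# Hypothetical judge template: reasoning comes first, the verdict comes last.
JUDGE_TEMPLATE = """You are grading a model response against one assertion.

Assertion: {assertion}

Response:
{response}

First, reason step by step about whether the assertion holds.
Then finish with exactly one line: VERDICT: PASS or VERDICT: FAIL"""


def build_judge_prompt(assertion: str, response: str) -> str:
    """Fill the template for a single binary assertion."""
    return JUDGE_TEMPLATE.format(assertion=assertion, response=response)


def parse_verdict(judge_output: str) -> bool:
    """Return True for PASS, False for FAIL; reject output with no verdict line."""
    matches = re.findall(
        r"^VERDICT:\s*(PASS|FAIL)\s*$", judge_output, flags=re.MULTILINE
    )
    if not matches:
        raise ValueError("judge output has no VERDICT line")
    # Use the last verdict line in case the reasoning quotes the format.
    return matches[-1] == "PASS"


sample = "The response lists three parallel items in one sentence.\nVERDICT: FAIL"
print(parse_verdict(sample))  # False
```

Raising on a missing verdict line, rather than defaulting to pass or fail, keeps malformed judge output from silently skewing the scores.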
These patterns have been confirmed through multiple benchmark runs: