.agents/skills/search-benchmark/SKILL.md
Cross-model search quality benchmark for Switchboard's tool discovery. Dispatches identical search scenarios to opus, sonnet, and haiku in parallel, compiles a comparison table, and identifies optimization opportunities. Use when: "benchmark search", "test search quality", "run search benchmark", after changing scoring logic, synonyms, stop words, IDF, or tool descriptions, after adding new integrations, or when evaluating Phase 2 tag impact. Also use when the user mentions "search hit rate", "search recall", or "did search get better/worse". Not for full MCP smoke tests (use mcp-benchmark) or unit testing (use make test).
Measures how well LLMs discover tools through Switchboard's search, across model tiers. The goal: prove that scoring changes translate to real LLM behavior improvement, not just synthetic test improvements.
We're optimizing call efficiency: how many search calls does an LLM need to find the N relevant tools?
- x = 1: one search finds everything
- 1 ≤ x < N: fewer search calls than relevant tools
- x ≥ N: one call per tool, or worse (retries on vocabulary misses)

Two dimensions of improvement:
Fast, deterministic, no LLM involved. Tests the scoring algorithm in isolation.
```bash
go test -v -run TestSearchBenchmark ./server/
```
Reports recall@K for both the old (substring AND) and the new (synonym + TF-IDF) scoring against 46 curated test cases (17 single-tool + 29 multi-tool intent across 9 personas).
Run this first — it's instant and catches regressions.
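For readers unfamiliar with the metric, recall@K can be sketched as below. This is the general definition, not necessarily how `TestSearchBenchmark` computes it; the function name `recall_at_k` and the sample ranking are illustrative assumptions (the tool names are borrowed from the expected-results table in this document).

```python
def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the relevant tools that appear in the first k results."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking for a multi-tool intent with two relevant tools.
ranked = [
    "sentry_list_project_events",
    "github_list_pulls",
    "datadog_search_logs",
    "slack_send_message",
]
relevant = {"sentry_list_project_events", "datadog_search_logs"}

print(recall_at_k(ranked, relevant, 1))  # 0.5
print(recall_at_k(ranked, relevant, 3))  # 1.0
```

A single-tool case is the special case where `relevant` has one element, so recall@K is either 0 or 1.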
Tests the full loop: LLM picks query terms → search returns results → we evaluate whether the right tools surfaced. Reveals how different model tiers interact with the scoring engine differently.
```
search({"limit": 0})
```
Check total tools and integrations. With --discover-all, expect ~898 tools
across 21 integrations. Without it, only enabled integrations are searchable.
Launch opus, sonnet, and haiku simultaneously using the Agent tool, setting the
model parameter for each agent and run_in_background: true.
Each agent gets identical instructions:
You are benchmarking Switchboard's search tool across {TOTAL} tools and
{INTEGRATIONS} integrations. For each scenario below, use the
mcp__switchboard__search tool to find relevant tools. DO NOT execute any
tools — search only. Make exactly ONE search call per scenario.
For EACH scenario, record: the exact query you used, total results, and the
top 3 tool names with their integrations.
Scenarios:
1. "I need to create a Linear ticket"
2. "Send a Slack message to the team"
3. "Look up Sentry errors from today"
4. "Find GitHub pull requests for review"
5. "Investigate a production error across logging and error tracking"
6. "What deployed recently and did anything break?"
7. "Slow database queries need investigation"
8. "Draft a follow-up email about what we agreed to fix"
After all 8 scenarios, output ONLY this JSON (no other text):
```json
{
  "model": "{MODEL}",
  "scenarios": [
    {
      "scenario": 1,
      "query_used": "...",
      "total_results": N,
      "top_3": [{"name": "...", "integration": "..."}, ...]
    }
  ]
}
```
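Before compiling the comparison, it may help to check that each agent's output really has the requested shape. The following is a minimal sketch, assuming each agent's final message is available as a string; `parse_agent_report` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json

REQUIRED_KEYS = {"scenario", "query_used", "total_results", "top_3"}


def parse_agent_report(raw: str, expected_scenarios: int = 8) -> dict:
    """Parse one agent's JSON report and check the shape the prompt asks for."""
    report = json.loads(raw)  # raises json.JSONDecodeError if extra text surrounds the JSON
    scenarios = report["scenarios"]
    if len(scenarios) != expected_scenarios:
        raise ValueError(
            f"expected {expected_scenarios} scenarios, got {len(scenarios)}"
        )
    for entry in scenarios:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"scenario entry missing keys: {sorted(missing)}")
        if len(entry["top_3"]) > 3:
            raise ValueError("top_3 holds more than three tools")
    return report


# Made-up single-scenario payload, only to exercise the checks.
sample = """{"model": "haiku", "scenarios": [{"scenario": 1,
  "query_used": "create linear ticket", "total_results": 4,
  "top_3": [{"name": "linear_create_issue", "integration": "linear"}]}]}"""

report = parse_agent_report(sample, expected_scenarios=1)
print(report["model"], report["scenarios"][0]["top_3"][0]["name"])
```

Failing loudly on a malformed report keeps a model that ignored the "ONLY this JSON" instruction from silently skewing the table.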
After all 3 complete, build the per-scenario table:
| Scenario | Opus query (results) | Sonnet query (results) | Haiku query (results) | #1 Tool | Correct? |
|----------|---------------------|----------------------|---------------------|---------|----------|
| # | Scenario | Expected #1 Tool | Expected Integration |
|---|----------|-------------------|---------------------|
| 1 | Create Linear ticket | linear_create_issue | linear |
| 2 | Send Slack message | slack_send_message | slack |
| 3 | Sentry errors | sentry_list_issues or sentry_list_org_issues | sentry |
| 4 | GitHub PRs | github_list_pulls | github |
| 5 | Production error | sentry_list_project_events + datadog_search_logs | sentry, datadog |
| 6 | Recent deploys | github_list_deployments or github_list_releases | github |
| 7 | Slow DB queries | pganalyze_get_query_stats | pganalyze |
| 8 | Draft email | gmail_create_draft | gmail |
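The expected-results table can be encoded as data so that each agent's results are scored the same way. This sketch treats any listed tool as a hit for rows that name several tools ("or" and "+" alike); that reading is an assumption, and `EXPECTED` and `first_hit_rank` are hypothetical names.

```python
from typing import Optional

# Expected tools per scenario, copied from the table above.
# Assumption: for rows listing several tools, any one of them counts as a hit.
EXPECTED = {
    1: {"linear_create_issue"},
    2: {"slack_send_message"},
    3: {"sentry_list_issues", "sentry_list_org_issues"},
    4: {"github_list_pulls"},
    5: {"sentry_list_project_events", "datadog_search_logs"},
    6: {"github_list_deployments", "github_list_releases"},
    7: {"pganalyze_get_query_stats"},
    8: {"gmail_create_draft"},
}


def first_hit_rank(scenario: int, result_names: list) -> Optional[int]:
    """Return the 1-based rank of the first expected tool, or None on a miss."""
    expected = EXPECTED[scenario]
    for rank, name in enumerate(result_names, start=1):
        if name in expected:
            return rank
    return None


print(first_hit_rank(3, ["sentry_list_org_issues", "github_list_pulls"]))  # 1
print(first_hit_rank(5, ["github_list_pulls", "datadog_search_logs"]))     # 2
print(first_hit_rank(7, ["slack_send_message"]))                           # None
```

If scenario 5 should require both tools to surface, the check would need to test for the full set rather than the first match.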
Report hit rate per model at multiple rank tiers:
### Hit Rate by Rank Tier
| Model | Top-1 | Top-3 | Top-5 | Top-8 |
|--------|-------|-------|-------|-------|
| Opus | X/8 | X/8 | X/8 | X/8 |
| Sonnet | X/8 | X/8 | X/8 | X/8 |
| Haiku | X/8 | X/8 | X/8 | X/8 |
The key metric is Top-3 — if the correct tool is in the top 3, the LLM has enough context to make the right choice. Top-1 is ideal but not required.
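Given the rank at which the expected tool appeared in each scenario (or no rank on a miss), the tier counts follow mechanically. A minimal sketch, with `hit_rate_row` as a hypothetical helper and made-up ranks:

```python
from typing import Optional

TIERS = (1, 3, 5, 8)  # the rank tiers reported in the table above


def hit_rate_row(model: str, ranks: list) -> str:
    """Format one table row; ranks holds a 1-based rank or None per scenario."""
    total = len(ranks)
    cells = []
    for k in TIERS:
        hits = sum(1 for r in ranks if r is not None and r <= k)
        cells.append(f"{hits}/{total}")
    return f"| {model} | " + " | ".join(cells) + " |"


# Made-up ranks for eight scenarios, purely for illustration.
ranks = [1, 1, 2, 4, None, 7, 1, 3]
print(hit_rate_row("Opus", ranks))  # | Opus | 3/8 | 5/8 | 6/8 | 7/8 |
```

Because each tier counts every rank at or below its cutoff, the cells never decrease from left to right; a row that does is a sign of a bookkeeping error.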
Classify each miss into one of these categories — this tells you what to fix:
| Category | Scoring layer fix | Example |
|----------|-------------------|---------|
| Missing synonym | Add to synonymGroups in server/search.go | "errors" should match "issues" |
| Stop-word gap | Add to stopWords in server/search.go | Verbose query drowning signal |
| Wrong integration ranked first | Phase 2 tags (not yet implemented) | rwx above sentry for "production error" |
| Zero results | Vocabulary gap — no matching words exist | "deployed recently" has no tool matches |
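Two of these categories can be detected mechanically from an agent's report; the other two need a human to read the query. The sketch below is a rough first-pass triage under that assumption, not part of Switchboard; `triage_miss` and its arguments are hypothetical.

```python
from typing import Optional


def triage_miss(
    total_results: int,
    top_integration: Optional[str],
    expected_integrations: set,
) -> str:
    """First-pass label for a scenario whose expected tool did not surface."""
    if total_results == 0:
        return "Zero results"
    if top_integration not in expected_integrations:
        return "Wrong integration ranked first"
    # Right integration on top but wrong tool: inspect the query by hand.
    return "Manual review: missing synonym or stop-word gap"


print(triage_miss(0, None, {"github"}))
print(triage_miss(12, "rwx", {"sentry", "datadog"}))
print(triage_miss(5, "sentry", {"sentry"}))
```

The labels only narrow the search; deciding between a missing synonym and a stop-word gap still depends on comparing the query terms with the tool descriptions.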
Output a summary block:
## Search Benchmark Results — {date}
Corpus: {N} tools, {M} integrations
Server flags: {--discover-all or default}
### Hit Rate by Rank Tier
| Model | Top-1 | Top-3 | Top-5 | Top-8 |
|--------|-------|-------|-------|-------|
| Opus | X/8 | X/8 | X/8 | X/8 |
| Sonnet | X/8 | X/8 | X/8 | X/8 |
| Haiku | X/8 | X/8 | X/8 | X/8 |
### Synthetic Benchmark
Single-tool recall@5: X/19 (XX%)
Integration recall: X/92 (XX%)
### Optimization Opportunities
[List from Step 5]
### Comparison to Previous
[If prior run data available, note improvements/regressions]
The 8 scenarios above cover common personas (DevOps, PM, CS, CEO, analyst). To add scenarios:
- Add matching cases to server/search_benchmark_test.go for the synthetic benchmark

Good scenarios are natural language (phrased the way a real person would say it, not in technical tool-speak) and cross-integration (they need tools from 2+ integrations to be fully addressed).