daltoniam/search-benchmark

Name: search-benchmark
Author: daltoniam

.agents/skills/search-benchmark/SKILL.md

npx skillsauth add daltoniam/switchboard search-benchmark

Search Quality Benchmark

Measures how well LLMs discover tools through Switchboard's search, across model tiers. The goal is to show that scoring changes improve real LLM behavior, not just synthetic test scores.

Mental Model

We're optimizing call efficiency: how many search calls (x) does an LLM need to find the N relevant tools?

Best case: x = 1 — one search finds everything
Target: 1 ≤ x < N
Worst case: x ≥ N — one call per tool, or worse (retries on vocabulary misses)
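The three bands above can be sketched as a small classifier. This is an illustrative helper only; the function name and the "invalid" label are assumptions, not part of Switchboard.

```go
package main

import "fmt"

// classifyEfficiency labels a run by how many search calls (x) were needed
// to find n relevant tools, following the best/target/worst bands above.
// Hypothetical helper for illustration; it is not part of Switchboard.
func classifyEfficiency(x, n int) string {
	switch {
	case x < 1 || n < 1:
		return "invalid"
	case x == 1:
		return "best" // one search found everything
	case x < n:
		return "target" // fewer calls than tools
	default:
		return "worst" // one call per tool, or more
	}
}

func main() {
	fmt.Println(classifyEfficiency(1, 4))
	fmt.Println(classifyEfficiency(2, 4))
	fmt.Println(classifyEfficiency(5, 4))
}
```

Note that x = 1 is reported as "best" even when N = 1, since a single call is the minimum possible.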

Two dimensions of improvement:

Fewer misses per tool — synonym expansion, stop-word filtering
More tools per hit — cross-integration discovery, keyword tags
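As a rough sketch of the first dimension, the snippet below drops stop words and expands synonyms before matching. The word lists and the expandQuery function are assumptions made up for this example; the real synonymGroups and stopWords live in server/search.go and may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical data for illustration only. The real lists live in
// server/search.go and are not reproduced here.
var stopWords = map[string]bool{"i": true, "need": true, "to": true, "a": true, "the": true}

var synonymGroups = [][]string{
	{"errors", "issues"},
	{"ticket", "issue"},
}

// expandQuery lowercases the query, drops stop words, and adds every synonym
// of each remaining term. Order is preserved and duplicates are skipped.
func expandQuery(query string) []string {
	seen := map[string]bool{}
	var out []string
	add := func(term string) {
		if !seen[term] {
			seen[term] = true
			out = append(out, term)
		}
	}
	for _, term := range strings.Fields(strings.ToLower(query)) {
		if stopWords[term] {
			continue
		}
		add(term)
		for _, group := range synonymGroups {
			for _, member := range group {
				if member == term {
					for _, syn := range group {
						add(syn)
					}
				}
			}
		}
	}
	return out
}

func main() {
	// Prints: [create linear ticket issue]
	fmt.Println(expandQuery("I need to create a Linear ticket"))
}
```

With these assumed lists, a verbose query shrinks to its signal terms and gains the vocabulary a tool name is likely to use, which is the "fewer misses per tool" effect.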

Two Benchmarks

1. Synthetic (Go test)

Fast, deterministic, no LLM involved. Tests the scoring algorithm in isolation.

go test -v -run TestSearchBenchmark ./server/

Reports recall@K for both the old scorer (substring AND) and the new scorer (synonym + TF-IDF) against 46 curated test cases (17 single-tool + 29 multi-tool intent across 9 personas).
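For readers unfamiliar with the metric, recall@K is the fraction of expected tools that show up in the first K results. A minimal sketch follows; recallAtK is a hypothetical name, and the actual test in ./server/ may compute it differently.

```go
package main

import "fmt"

// recallAtK returns the fraction of expected tools that appear in the
// first k entries of ranked. Hypothetical helper, not the project's code.
func recallAtK(ranked, expected []string, k int) float64 {
	if len(expected) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	top := make(map[string]bool, k)
	for _, name := range ranked[:k] {
		top[name] = true
	}
	hits := 0
	for _, name := range expected {
		if top[name] {
			hits++
		}
	}
	return float64(hits) / float64(len(expected))
}

func main() {
	// Made-up ranking for a multi-tool intent.
	ranked := []string{
		"sentry_list_issues",
		"sentry_list_project_events",
		"github_list_pulls",
		"datadog_search_logs",
	}
	expected := []string{"sentry_list_project_events", "datadog_search_logs"}
	fmt.Println(recallAtK(ranked, expected, 3)) // 0.5
	fmt.Println(recallAtK(ranked, expected, 5)) // 1
}
```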

Run this first — it's instant and catches regressions.

2. Live Cross-Model (this skill's main output)

Tests the full loop: LLM picks query terms → search returns results → we evaluate whether the right tools surfaced. Reveals how different model tiers interact with the scoring engine differently.

Protocol

Step 1: Verify Connection

search({"limit": 0})

Check total tools and integrations. With --discover-all, expect ~898 tools across 21 integrations. Without it, only enabled integrations are searchable.

Step 2: Dispatch 3 Agents in Parallel

Launch opus, sonnet, and haiku simultaneously using the Agent tool with the model parameter and run_in_background: true.

Each agent gets identical instructions:

You are benchmarking Switchboard's search tool across {TOTAL} tools and
{INTEGRATIONS} integrations. For each scenario below, use the
mcp__switchboard__search tool to find relevant tools. DO NOT execute any
tools — search only. Make exactly ONE search call per scenario.

For EACH scenario, record: the exact query you used, total results, and the
top 3 tool names with their integrations.

Scenarios:
1. "I need to create a Linear ticket"
2. "Send a Slack message to the team"
3. "Look up Sentry errors from today"
4. "Find GitHub pull requests for review"
5. "Investigate a production error across logging and error tracking"
6. "What deployed recently and did anything break?"
7. "Slow database queries need investigation"
8. "Draft a follow-up email about what we agreed to fix"

After all 8 scenarios, output ONLY this JSON (no other text):
{
  "model": "{MODEL}",
  "scenarios": [
    {
      "scenario": 1,
      "query_used": "...",
      "total_results": N,
      "top_3": [{"name": "...", "integration": "..."}, ...]
    }
  ]
}
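If the three JSON reports are post-processed in Go, they could be decoded with types that mirror the template above. The type names and the sample values in main are assumptions for illustration; only the JSON field names come from the template.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolHit is one entry of the "top_3" array.
type ToolHit struct {
	Name        string `json:"name"`
	Integration string `json:"integration"`
}

// ScenarioResult is one entry of the "scenarios" array.
type ScenarioResult struct {
	Scenario     int       `json:"scenario"`
	QueryUsed    string    `json:"query_used"`
	TotalResults int       `json:"total_results"`
	Top3         []ToolHit `json:"top_3"`
}

// AgentReport is the full JSON object one agent emits.
type AgentReport struct {
	Model     string           `json:"model"`
	Scenarios []ScenarioResult `json:"scenarios"`
}

// parseReport decodes one agent's output and wraps any decode error.
func parseReport(raw []byte) (AgentReport, error) {
	var r AgentReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return AgentReport{}, fmt.Errorf("parse agent report: %w", err)
	}
	return r, nil
}

func main() {
	// Sample payload with made-up values.
	raw := []byte(`{"model":"haiku","scenarios":[{"scenario":1,
		"query_used":"create linear issue","total_results":12,
		"top_3":[{"name":"linear_create_issue","integration":"linear"}]}]}`)
	r, err := parseReport(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Model, r.Scenarios[0].Top3[0].Name)
}
```

A decode failure is a useful signal in itself: it usually means the agent added text around the JSON despite the "no other text" instruction.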

Step 3: Compile Comparison Table

After all 3 agents complete, build the per-scenario table:

| Scenario | Opus query (results) | Sonnet query (results) | Haiku query (results) | #1 Tool | Correct? |
|----------|---------------------|----------------------|---------------------|---------|----------|

Step 4: Score Against Expected Results

| # | Scenario | Expected #1 Tool | Expected Integration |
|---|----------|-------------------|---------------------|
| 1 | Create Linear ticket | linear_create_issue | linear |
| 2 | Send Slack message | slack_send_message | slack |
| 3 | Sentry errors | sentry_list_issues or sentry_list_org_issues | sentry |
| 4 | GitHub PRs | github_list_pulls | github |
| 5 | Production error | sentry_list_project_events + datadog_search_logs | sentry, datadog |
| 6 | Recent deploys | github_list_deployments or github_list_releases | github |
| 7 | Slow DB queries | pganalyze_get_query_stats | pganalyze |
| 8 | Draft email | gmail_create_draft | gmail |
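One way to encode the expected-results table for automated scoring is sketched below. It assumes that "A or B" means either tool counts as a hit and that "A + B" means both must appear; that reading, and the Expectation type itself, are assumptions rather than something the skill defines.

```go
package main

import "fmt"

// Expectation lists acceptable answers for one scenario. Each inner slice is
// a set of alternatives ("A or B"); every inner slice must be satisfied
// ("A + B"). Hypothetical type for illustration.
type Expectation struct {
	Scenario int
	AllOf    [][]string
}

// hitAt reports whether the first k ranked tools satisfy the expectation.
func hitAt(ranked []string, exp Expectation, k int) bool {
	if len(exp.AllOf) == 0 || k <= 0 {
		return false
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	top := make(map[string]bool, k)
	for _, name := range ranked[:k] {
		top[name] = true
	}
	for _, alternatives := range exp.AllOf {
		satisfied := false
		for _, name := range alternatives {
			if top[name] {
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false
		}
	}
	return true
}

func main() {
	sentryErrors := Expectation{Scenario: 3, AllOf: [][]string{
		{"sentry_list_issues", "sentry_list_org_issues"},
	}}
	productionError := Expectation{Scenario: 5, AllOf: [][]string{
		{"sentry_list_project_events"},
		{"datadog_search_logs"},
	}}
	fmt.Println(hitAt([]string{"sentry_list_org_issues"}, sentryErrors, 1))
	fmt.Println(hitAt([]string{"sentry_list_project_events", "github_list_pulls"}, productionError, 3))
}
```

Under this reading, scenario 5 only counts as a hit when both the Sentry and the Datadog tool are in the window, which makes it the strictest row in the table.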

Report hit rate per model at multiple rank tiers:

### Hit Rate by Rank Tier
| Model  | Top-1 | Top-3 | Top-5 | Top-8 |
|--------|-------|-------|-------|-------|
| Opus   | X/8   | X/8   | X/8   | X/8   |
| Sonnet | X/8   | X/8   | X/8   | X/8   |
| Haiku  | X/8   | X/8   | X/8   | X/8   |

Top-1: LLM doesn't need to choose — the best tool is first
Top-3: LLM picks from a small set, almost always correct
Top-5: Tool is visible in the default result window
Top-8: Tool is present on the first page

The key metric is Top-3 — if the correct tool is in the top 3, the LLM has enough context to make the right choice. Top-1 is ideal but not required.
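The tier counts can be derived from the rank of the expected tool in each scenario. The sketch below assumes those ranks are already known (1-based, 0 for "not returned"); the ranks in main are made up, and tierHits is a hypothetical helper.

```go
package main

import "fmt"

// tiers are the rank cut-offs used in the hit-rate table.
var tiers = []int{1, 3, 5, 8}

// tierHits counts, for each tier K, how many scenarios had the expected tool
// at rank K or better. ranks holds 1-based ranks; 0 means the tool was not
// returned at all.
func tierHits(ranks []int) map[int]int {
	hits := make(map[int]int, len(tiers))
	for _, k := range tiers {
		hits[k] = 0
		for _, r := range ranks {
			if r >= 1 && r <= k {
				hits[k]++
			}
		}
	}
	return hits
}

func main() {
	ranks := []int{1, 1, 2, 4, 0, 7, 1, 3} // made-up ranks for 8 scenarios
	h := tierHits(ranks)
	for _, k := range tiers {
		fmt.Printf("Top-%d: %d/%d\n", k, h[k], len(ranks))
	}
}
```

With these sample ranks the program prints 3/8, 5/8, 6/8, and 7/8 for the four tiers. Because the counts are cumulative, each tier is always at least as large as the one before it.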

Step 5: Identify Optimization Opportunities

Classify each miss into one of these categories — this tells you what to fix:

| Category | Scoring layer fix | Example |
|----------|-------------------|---------|
| Missing synonym | Add to synonymGroups in server/search.go | "errors" should match "issues" |
| Stop-word gap | Add to stopWords in server/search.go | Verbose query drowning signal |
| Wrong integration ranked first | Phase 2 tags (not yet implemented) | rwx above sentry for "production error" |
| Zero results | Vocabulary gap: no matching words exist | "deployed recently" has no tool matches |
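Once each miss has been classified by hand, the categories can be tallied to see which fix pays off most. The sketch below only groups and counts; the Miss type and summarize function are assumptions, and the classification itself remains a manual judgment.

```go
package main

import (
	"fmt"
	"sort"
)

// MissCategory mirrors the categories in the table above.
type MissCategory string

const (
	MissingSynonym   MissCategory = "Missing synonym"
	StopWordGap      MissCategory = "Stop-word gap"
	WrongIntegration MissCategory = "Wrong integration ranked first"
	ZeroResults      MissCategory = "Zero results"
)

// fixFor maps each category to the scoring-layer fix from the table.
var fixFor = map[MissCategory]string{
	MissingSynonym:   "Add to synonymGroups in server/search.go",
	StopWordGap:      "Add to stopWords in server/search.go",
	WrongIntegration: "Phase 2 tags (not yet implemented)",
	ZeroResults:      "Vocabulary gap: no matching words exist",
}

// Miss is one manually classified miss. Hypothetical type for illustration.
type Miss struct {
	Model    string
	Scenario int
	Category MissCategory
}

// summarize groups misses by category and returns one line per category,
// sorted by descending count, then by name for stable output.
func summarize(misses []Miss) []string {
	counts := map[MissCategory]int{}
	for _, m := range misses {
		counts[m.Category]++
	}
	cats := make([]MissCategory, 0, len(counts))
	for c := range counts {
		cats = append(cats, c)
	}
	sort.Slice(cats, func(i, j int) bool {
		if counts[cats[i]] != counts[cats[j]] {
			return counts[cats[i]] > counts[cats[j]]
		}
		return cats[i] < cats[j]
	})
	lines := make([]string, 0, len(cats))
	for _, c := range cats {
		lines = append(lines, fmt.Sprintf("%s x%d: %s", c, counts[c], fixFor[c]))
	}
	return lines
}

func main() {
	// Made-up misses.
	misses := []Miss{
		{"haiku", 6, ZeroResults},
		{"sonnet", 6, ZeroResults},
		{"haiku", 3, MissingSynonym},
	}
	for _, line := range summarize(misses) {
		fmt.Println(line)
	}
}
```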

Step 6: Report

Output a summary block:

## Search Benchmark Results — {date}

Corpus: {N} tools, {M} integrations
Server flags: {--discover-all or default}

### Hit Rate by Rank Tier
| Model  | Top-1 | Top-3 | Top-5 | Top-8 |
|--------|-------|-------|-------|-------|
| Opus   | X/8   | X/8   | X/8   | X/8   |
| Sonnet | X/8   | X/8   | X/8   | X/8   |
| Haiku  | X/8   | X/8   | X/8   | X/8   |

### Synthetic Benchmark
Single-tool recall@5: X/19 (XX%)
Integration recall:   X/92 (XX%)

### Optimization Opportunities
[List from Step 5]

### Comparison to Previous
[If prior run data available, note improvements/regressions]
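The hit-rate section of the summary block could be generated rather than typed by hand. This is a minimal sketch using only the standard library; ModelHits and renderHitRateTable are hypothetical names, and the numbers in main are placeholders, not real results.

```go
package main

import (
	"fmt"
	"strings"
)

// ModelHits holds the per-tier hit counts for one model.
type ModelHits struct {
	Model                  string
	Top1, Top3, Top5, Top8 int
}

// renderHitRateTable formats the "Hit Rate by Rank Tier" section of the
// report. total is the number of scenarios (the hit-rate denominator).
func renderHitRateTable(rows []ModelHits, total int) string {
	var b strings.Builder
	b.WriteString("### Hit Rate by Rank Tier\n")
	b.WriteString("| Model  | Top-1 | Top-3 | Top-5 | Top-8 |\n")
	b.WriteString("|--------|-------|-------|-------|-------|\n")
	for _, r := range rows {
		fmt.Fprintf(&b, "| %-6s | %d/%d   | %d/%d   | %d/%d   | %d/%d   |\n",
			r.Model, r.Top1, total, r.Top3, total, r.Top5, total, r.Top8, total)
	}
	return b.String()
}

func main() {
	// Placeholder counts, not measured results.
	rows := []ModelHits{
		{Model: "Opus", Top1: 7, Top3: 8, Top5: 8, Top8: 8},
		{Model: "Sonnet", Top1: 6, Top3: 7, Top5: 8, Top8: 8},
		{Model: "Haiku", Top1: 5, Top3: 7, Top5: 7, Top8: 8},
	}
	fmt.Print(renderHitRateTable(rows, 8))
}
```

Passing the scenario count as a parameter keeps the denominator in one place, so adding scenarios does not require editing the template by hand.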

Adding New Scenarios

The 8 scenarios above cover common personas (DevOps, PM, CS, CEO, analyst). To add scenarios:

1. Add to the agent prompt template in Step 2
2. Add expected results to the table in Step 4
3. Update the hit rate denominator in Step 6
4. Consider adding matching cases to server/search_benchmark_test.go for the synthetic benchmark

Good scenarios are natural language (how a real person would phrase it, not technical tool-speak) and cross-integration (need tools from 2+ integrations to fully address).

Cross-model search quality benchmark for Switchboard's tool discovery. Dispatches identical search scenarios to opus, sonnet, and haiku in parallel, compiles a comparison table, and identifies optimization opportunities. Use when: "benchmark search", "test search quality", "run search benchmark", after changing scoring logic, synonyms, stop words, IDF, or tool descriptions, after adding new integrations, or when evaluating Phase 2 tag impact. Also use when the user mentions "search hit rate", "search recall", or "did search get better/worse". Not for full MCP smoke tests (use mcp-benchmark) or unit testing (use make test).

13 stars

tools

Updated Apr 17, 2026

npx skillsauth add daltoniam/switchboard search-benchmark

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percent |
|--------|---------|-------------|---------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 17, 2026, 10:22 AM (6.1s, 1 file scanned)

SKILL.md

name: search-benchmark
description: >
  Use when: "benchmark search", "test search quality", "run search benchmark", ...
author: switchboard
version: 1.0

Related Skills

daltoniam/pr-review

tools

Verified, Trusted, Community

Review a GitHub pull request for the Switchboard Go MCP server project. Enforces idiomatic Go, project conventions (hexagonal architecture, dispatch maps, port interfaces), test coverage, build/lint verification, and production readiness.

13 stars, SKILL.md, updated Apr 17, 2026

daltoniam/pr-comments

tools

Verified, Trusted, Community

Submit a PR review as inline GitHub comments on specific files and lines using the gh CLI.

13 stars, SKILL.md, updated Apr 17, 2026

daltoniam/optimize-integration

tools

Verified, Trusted, Community

Improve an existing Switchboard integration adapter's LLM usability: tool description enrichment, field compaction refinement, and response tuning. Use when: "optimize integration", "improve tool descriptions", "extend compaction", "make integration better for LLMs", after user story mapping, or when an LLM is making wrong tool choices or passing wrong IDs. Not for adding new integrations (use add-integration) or fixing bugs.

13 stars, SKILL.md, updated Apr 17, 2026

daltoniam/mcp-benchmark

tools

Verified, Trusted, Community

Live benchmark protocol for Switchboard's MCP server. Runs real tool-calling sequences against enabled integrations, tracks failure metrics, and identifies impediments to successful LLM tool usage. Use when: "benchmark", "test the MCP", "run user stories", "smoke test integrations", after adding/changing integrations or tools, after changing compaction specs or search logic, before releases. Not for unit testing (use make test) or load testing.

13 stars, SKILL.md, updated Apr 17, 2026

Install manually

# Clone the repo
git clone https://github.com/daltoniam/switchboard.git

# Copy into Claude Code skills folder (global)
cp -r switchboard/.agents/skills/search-benchmark ~/.claude/skills/

Repository

daltoniam/switchboard

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT