.agents/skills/mcp-benchmark/SKILL.md
Live benchmark protocol for Switchboard's MCP server. Runs real tool-calling sequences against enabled integrations, tracks failure metrics, and identifies impediments to successful LLM tool usage. Use when: "benchmark", "test the MCP", "run user stories", "smoke test integrations", after adding/changing integrations or tools, after changing compaction specs or search logic, before releases. Not for unit testing (use make test) or load testing.
Run live tool-calling sequences against enabled Switchboard integrations. Measure failure rates, identify silent data loss, and report impediments to LLM tool usage.
- `search` and `execute` MCP tools
- … `github_list_user_repos`) as safe entry points.
- `{}` is failure: An empty object response means compaction stripped all fields. Flag as Critical.

Execute phases in order. Record every result. Do not skip phases.
1. `search({"limit": 0})` to get total tool count and enabled integrations
2. `search({"query": "<integration_name>", "limit": 5})` to sample tool definitions

Output table:
| Integration | Tools | Status |
|-------------|-------|--------|
| github | 100 | enabled |
| slack | 42 | enabled |
| ... | ... | ... |
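A minimal sketch of how the inventory table could be assembled once the tool names are known. It assumes the integration name is the text before the first underscore in a tool name (true for every tool named in this document, but still an assumption), and that any integration appearing in the results counts as enabled. `inventoryTable` is a hypothetical helper, not part of Switchboard.

```javascript
// Group tool names by integration prefix and render the inventory table.
// Assumption: the integration name is the text before the first underscore.
function inventoryTable(toolNames) {
  const counts = new Map();
  for (const name of toolNames) {
    const integration = name.split('_')[0];
    counts.set(integration, (counts.get(integration) || 0) + 1);
  }
  const rows = [...counts.entries()]
    .sort((a, b) => a[0].localeCompare(b[0]))
    // Assumption: an integration that returns tools is treated as enabled.
    .map(([integration, count]) => `| ${integration} | ${count} | enabled |`);
  return [
    '| Integration | Tools | Status |',
    '|-------------|-------|--------|',
    ...rows,
  ].join('\n');
}

console.log(inventoryTable([
  'github_list_user_repos',
  'github_list_user_orgs',
  'slack_list_conversations',
]));
```

The tool names would come from the `search` responses; the exact response shape is not specified here, so extracting names from it is left to the caller.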
For each enabled integration, run ONE read-only list/search tool. Choose the safest entry point (list, not create/update/delete).
Suggested smoke tests per integration:
| Integration | Tool | Args |
|-------------|------|------|
| github | github_list_user_repos (safe — uses authenticated user; github_list_org_repos requires a known org name — discover first via github_list_user_orgs) | {per_page: 3} |
| linear | linear_list_issues | {first: 5} |
| sentry | sentry_list_org_issues | {query: "is:unresolved"} |
| slack | slack_list_conversations | {limit: 5} |
| notion | notion_search | {query: "<any known term>", limit: 3} |
| metabase | metabase_list_databases | {} |
| datadog | datadog_search_logs | {query: "*", from: "-1h"} |
| aws | aws_sts_get_caller_identity | {} |
| posthog | posthog_list_projects | {} |
| postgres | postgres_list_schemas | {} |
| clickhouse | clickhouse_list_databases | {} |
| pganalyze | pganalyze_list_servers | {} |
| rwx | rwx_list_workspaces | {} |
| gmail | gmail_list_messages | {max_results: 5} |
| homeassistant | homeassistant_list_states | {} |
| ynab | ynab_list_budgets | {} |
| gcp | gcp_list_projects | {} |
For any integration not listed, use the first list/search tool found via
search({"integration": "<name>"}).
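For unlisted integrations, the choice of a safe entry point can be approximated with a name-based filter. This is only a heuristic sketch: it assumes tool names follow the `<integration>_<verb>_<noun>` pattern seen in the table above, and `pickSmokeTool` is a hypothetical helper. A tool's description should still be checked before calling it.

```javascript
// Pick a read-only entry point from a list of tool names.
// Heuristic (an assumption): a "list" or "search" segment marks a read-only
// tool, and any create/update/delete segment disqualifies the tool.
const READ_SEGMENT = /_(list|search)(_|$)/;
const WRITE_SEGMENT = /_(create|update|delete)(_|$)/;

function pickSmokeTool(toolNames) {
  const safe = toolNames.filter(
    (name) => READ_SEGMENT.test(name) && !WRITE_SEGMENT.test(name)
  );
  // Mirror the rule above: use the first list/search tool found, or null if none.
  return safe.length > 0 ? safe[0] : null;
}

console.log(pickSmokeTool(['rwx_list_workspaces', 'notion_search']));
```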
For each call, record:
| Field | How to check |
|-------|-------------|
| Tool name | What you called |
| Response shape | Array, columnar (columns/rows/constants), object, or {} |
| Columnar format | If 8+ items, verify columnar shape with columns+rows. Check constants for lifted uniform values |
| Empty object {} | FLAG as compaction shape mismatch (Critical) |
| Error | Record error message verbatim |
| Approximate size | Eyeball response length; flag if approaching 50KB |
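The checks in this table can be sketched as a single classifier. The thresholds are assumptions: "50KB" is read as 50 × 1024, "approaching" as 90% of that, and size is measured in characters of the JSON text rather than bytes. `classifyResponse` is a hypothetical helper name.

```javascript
const SIZE_LIMIT = 50 * 1024;   // assumption: "50KB" means 50 * 1024
const SIZE_WARN_RATIO = 0.9;    // assumption: "approaching" means 90% of the limit
const COLUMNAR_MIN_ITEMS = 8;   // from the table: 8+ items should be columnar

function classifyResponse(response) {
  const flags = [];
  let shape = 'other'; // scalars, null, undefined
  if (Array.isArray(response)) {
    shape = 'array';
    if (response.length >= COLUMNAR_MIN_ITEMS) {
      flags.push('Warning: 8+ items but not columnar');
    }
  } else if (response !== null && typeof response === 'object') {
    if (Object.keys(response).length === 0) {
      shape = 'empty';
      flags.push('Critical: compaction shape mismatch');
    } else if (Array.isArray(response.columns) && Array.isArray(response.rows)) {
      shape = 'columnar';
    } else {
      shape = 'object';
    }
  }
  // Character count of the serialized response; an approximation of byte size.
  const size = JSON.stringify(response === undefined ? null : response).length;
  if (size >= SIZE_LIMIT * SIZE_WARN_RATIO) {
    flags.push('Warning: approaching 50KB');
  }
  return { shape, size, flags };
}

console.log(classifyResponse({}));
console.log(classifyResponse({ columns: ['id'], rows: [[1], [2]] }));
```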
Moved to /search-benchmark skill. Use that skill for cross-model search
quality benchmarking with synonym expansion, TF-IDF scoring, and stop-word
analysis. This phase is retained here only as a lightweight sanity check.
Quick smoke test — verify these return non-zero results:
- `search({"query": "create ticket"})` → should find `linear_create_issue`
- `search({"query": "send message"})` → should find `slack_send_message`

For full search quality analysis, run /search-benchmark instead.
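A small sketch of how each smoke query could be recorded so the results feed the search metrics later. It assumes the search response has already been reduced to an array of tool names; that reduction depends on the actual response shape, which is not specified here. `checkSearch` is a hypothetical helper.

```javascript
// results: tool names returned by one search call (shape is an assumption).
function checkSearch(query, expectedTool, results) {
  return {
    query,
    resultCount: results.length,
    miss: results.length === 0,                      // counts toward search miss rate
    expectedFound: results.includes(expectedTool),   // false counts as a false negative
  };
}

// The two smoke queries from this section, with canned results for illustration.
console.log(checkSearch('create ticket', 'linear_create_issue', ['linear_create_issue']));
console.log(checkSearch('send message', 'slack_send_message', []));
```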
Run 2-3 scripts that chain tools across integrations. Use api.tryCall for
resilience so partial results are preserved.
IMPORTANT: Read-only scripts only. Do NOT send Slack messages, create issues, or perform any write operations during benchmarking. Scripts should read and cross-reference data, not mutate state.
Template script pattern:
// Discover data from one integration, cross-reference another.
var data1 = api.call('<integration1_list_tool>', {<args>});
var data2 = api.tryCall('<integration2_tool>', {<args_from_data1>});
({source: data1, cross_ref: data2});
Example scripts (adapt to enabled integrations):
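As one illustration of the template, the sketch below lists a few GitHub repositories and then searches Notion for the first repository name. Both calls are read-only and use tools and arguments from the smoke-test table. The `api` stub, its canned data, the `name` field on a repository, and the error shape returned by `tryCall` are all assumptions made so the script can be dry-run outside Switchboard; remove the stub when running for real.

```javascript
// Hypothetical stand-in for the `api` object available inside scripts.
// Canned data and the tryCall error shape are assumptions, not Switchboard behavior.
var api = {
  call: function (tool, args) {
    if (tool === 'github_list_user_repos') {
      return [{name: 'repo-a'}, {name: 'repo-b'}, {name: 'repo-c'}].slice(0, args.per_page);
    }
    throw new Error('tool not available in stub: ' + tool);
  },
  tryCall: function (tool, args) {
    // Return an error object instead of throwing so partial results survive.
    try { return this.call(tool, args); } catch (e) { return {error: e.message}; }
  }
};

// Step 1: discover data from one integration (must succeed, so use api.call).
var repos = api.call('github_list_user_repos', {per_page: 3});

// Step 2: cross-reference another integration (may fail, so use api.tryCall).
var firstRepo = repos.length > 0 ? repos[0].name : '';
var pages = api.tryCall('notion_search', {query: firstRepo, limit: 3});

// Final expression mirrors the template: both results are returned together.
var result = {source: repos, cross_ref: pages};
result;
```

With the stub in place, `notion_search` fails and `cross_ref` carries the error while `source` is still returned, which is the partial-result behavior the phase is meant to exercise.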
Record per script:
| Field | How to check |
|-------|-------------|
| Number of api.call()s | Count calls in script |
| Errors | Which call failed and error message |
| Script rewritten? | Did you have to modify the script to get results? Why? |
| Execution time | Did it approach 30s timeout? |
Calculate metrics from all phases:
| Metric | Formula |
|--------|---------|
| Script rewrite rate | Scripts rewritten / scripts attempted |
| Integration error rate | Upstream API errors / total calls |
| Server error rate | (Tool-not-found + 50KB exceeded) / total calls |
| Silent data loss | Responses returning {} when data expected |
| Search miss rate | 0-result queries / total search queries |
| Search false-negative rate | Expected tool not in results / total queries |
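The formulas above can be sketched as one function over the recorded results. The record layout (`calls`, `searches`, the `error` labels, `emptyObject`, `dataExpected`) is a hypothetical bookkeeping format chosen for this example, not something Switchboard produces. Rates with a zero denominator are reported as 0 here, which is also an assumption.

```javascript
// Avoid NaN when a phase recorded nothing.
function rate(numerator, denominator) {
  return denominator === 0 ? 0 : numerator / denominator;
}

function computeMetrics(run) {
  const count = (items, predicate) => items.filter(predicate).length;
  const { calls, searches } = run;
  return {
    scriptRewriteRate: rate(run.scriptsRewritten, run.scriptsAttempted),
    integrationErrorRate: rate(count(calls, (c) => c.error === 'upstream'), calls.length),
    // Server errors: tool-not-found plus 50KB-exceeded, over total calls.
    serverErrorRate: rate(
      count(calls, (c) => c.error === 'tool_not_found' || c.error === 'size_exceeded'),
      calls.length
    ),
    // Silent data loss is a count, not a rate.
    silentDataLoss: count(calls, (c) => c.emptyObject === true && c.dataExpected === true),
    searchMissRate: rate(count(searches, (s) => s.resultCount === 0), searches.length),
    searchFalseNegativeRate: rate(count(searches, (s) => !s.expectedFound), searches.length),
  };
}

console.log(computeMetrics({
  scriptsAttempted: 3,
  scriptsRewritten: 1,
  calls: [
    { error: 'upstream' },
    { error: 'tool_not_found' },
    { error: null, emptyObject: true, dataExpected: true },
    { error: null },
  ],
  searches: [
    { resultCount: 0, expectedFound: false },
    { resultCount: 3, expectedFound: true },
  ],
}));
```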
Report template:
## MCP Benchmark Report - [date]
### Environment
- Enabled integrations: [list]
- Total tools: [count]
### Results
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Script rewrite rate | X% | <10% | PASS/FAIL |
| Integration errors | X | 0 | PASS/FAIL |
| Server errors | X | 0 | PASS/FAIL |
| Silent data loss | X | 0 | PASS/FAIL |
| Search miss rate | X% | 0% | PASS/FAIL |
### Smoke Test Results
| Integration | Tool | Shape | Status |
|-------------|------|-------|--------|
| ... | ... | ... | PASS/FAIL |
### Search Discoverability Results
| Query | Results | Expected Tool Found? | Status |
|-------|---------|---------------------|--------|
| ... | ... | ... | PASS/FAIL |
### Cross-Integration Script Results
| Script | Calls | Errors | Rewritten? | Status |
|--------|-------|--------|------------|--------|
| ... | ... | ... | ... | PASS/FAIL |
### Findings
[List each failure: tool name, what happened, severity, suggested fix]
### Comparison to Previous
[If previous benchmark exists, note improvements/regressions]
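The Results table in the report template can be filled in mechanically from the measured values. The sketch below copies the targets from that table; the input keys (`scriptRewriteRate`, `integrationErrors`, and so on) and the `resultsRows` helper are hypothetical names chosen for this example.

```javascript
// Targets copied from the Results table in the report template.
const TARGETS = [
  { metric: 'Script rewrite rate', key: 'scriptRewriteRate', percent: true, target: '<10%', pass: (v) => v < 0.10 },
  { metric: 'Integration errors', key: 'integrationErrors', percent: false, target: '0', pass: (v) => v === 0 },
  { metric: 'Server errors', key: 'serverErrors', percent: false, target: '0', pass: (v) => v === 0 },
  { metric: 'Silent data loss', key: 'silentDataLoss', percent: false, target: '0', pass: (v) => v === 0 },
  { metric: 'Search miss rate', key: 'searchMissRate', percent: true, target: '0%', pass: (v) => v === 0 },
];

// values: rates as fractions (0.05 means 5%), error metrics as raw counts.
function resultsRows(values) {
  return TARGETS.map((t) => {
    const value = values[t.key];
    const shown = t.percent ? `${(value * 100).toFixed(1)}%` : String(value);
    const status = t.pass(value) ? 'PASS' : 'FAIL';
    return `| ${t.metric} | ${shown} | ${t.target} | ${status} |`;
  });
}

console.log(resultsRows({
  scriptRewriteRate: 0.05,
  integrationErrors: 0,
  serverErrors: 1,
  silentDataLoss: 0,
  searchMissRate: 0,
}).join('\n'));
```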
… `per_page`/`limit`, that's a bug worth flagging.