daltoniam/mcp-benchmark

Name: mcp-benchmark
Author: daltoniam

.agents/skills/mcp-benchmark/SKILL.md

npx skillsauth add daltoniam/switchboard mcp-benchmark

MCP Benchmark

Run live tool-calling sequences against enabled Switchboard integrations. Measure failure rates, identify silent data loss, and report impediments to LLM tool usage.

When to Use

After adding or modifying an integration adapter
After changing compaction specs, search logic, or server error handling
Before releases
When evaluating MCP quality or comparing before/after changes

Prerequisites

Switchboard running locally or accessible via MCP
At least one integration enabled with valid credentials
Access to search and execute MCP tools

Hard Rules (apply to ALL phases)

Discovery before identity: NEVER pass an org name, team slug, channel ID, project name, or any entity identifier to a tool unless you discovered it from a prior list/search call in this session. Use authenticated-user tools (e.g., github_list_user_repos) as safe entry points.
Read-only: NEVER call create/update/delete/send tools. All scripts must be read-only.
Record everything: Every call gets a row in the results table, even failures.
{} is failure: An empty object response means compaction stripped all fields. Flag as Critical.
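The read-only and `{}` rules can be checked mechanically before and after each call. The sketch below is illustrative only: the helper names `isWriteTool` and `isEmptyObject` and the verb list are assumptions, not part of Switchboard, and a name-based check will miss write tools that use other verbs.

```javascript
// Hypothetical helpers; the names and the verb list are assumptions for illustration.
const WRITE_VERBS = ["create", "update", "delete", "send"];

// True when a tool name contains a write verb as an underscore-separated segment,
// e.g. "slack_send_message" is flagged and "github_list_user_repos" is not.
function isWriteTool(toolName) {
  const segments = toolName.toLowerCase().split("_");
  return segments.some((segment) => WRITE_VERBS.includes(segment));
}

// True only for a plain object with no own keys, i.e. the `{}` response
// that the hard rules treat as Critical. Empty arrays and null do not count.
function isEmptyObject(response) {
  return (
    response !== null &&
    typeof response === "object" &&
    !Array.isArray(response) &&
    Object.keys(response).length === 0
  );
}
```

A benchmark run could refuse any tool for which `isWriteTool` returns true and mark any response for which `isEmptyObject` returns true as Critical.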

Protocol

Execute phases in order. Record every result. Do not skip phases.

Phase 1: Discovery (read-only)

Call search({"limit": 0}) to get total tool count and enabled integrations
For each enabled integration, call search({"query": "<integration_name>", "limit": 5}) to sample tool definitions
Record: total tools, enabled integrations, tools per integration

Output table:

| Integration | Tools | Status |
|-------------|-------|--------|
| github | 100 | enabled |
| slack | 42 | enabled |
| ... | ... | ... |
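If the discovery response exposes tool names, the per-integration counts can be tallied with a few lines. This is a sketch under one assumption: that every tool name starts with its integration name followed by an underscore, as the examples in this document do (`github_list_user_repos`, `slack_list_conversations`). The function names are hypothetical.

```javascript
// Assumes tool names are prefixed with the integration name.
function countToolsByIntegration(toolNames) {
  const counts = {};
  for (const name of toolNames) {
    const integration = name.split("_")[0];
    counts[integration] = (counts[integration] || 0) + 1;
  }
  return counts;
}

// Renders the counts as rows for the Phase 1 output table, sorted by name.
function toTableRows(counts) {
  return Object.keys(counts)
    .sort()
    .map((integration) => `| ${integration} | ${counts[integration]} | enabled |`);
}
```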

Phase 2: Single-Tool Smoke Tests

For each enabled integration, run ONE read-only list/search tool. Choose the safest entry point (list, not create/update/delete).

Suggested smoke tests per integration:

| Integration | Tool | Args |
|-------------|------|------|
| github | github_list_user_repos (safe: uses the authenticated user; github_list_org_repos requires a known org name, so discover it first via github_list_user_orgs) | {per_page: 3} |
| linear | linear_list_issues | {first: 5} |
| sentry | sentry_list_org_issues | {query: "is:unresolved"} |
| slack | slack_list_conversations | {limit: 5} |
| notion | notion_search | {query: "<any known term>", limit: 3} |
| metabase | metabase_list_databases | {} |
| datadog | datadog_search_logs | {query: "*", from: "-1h"} |
| aws | aws_sts_get_caller_identity | {} |
| posthog | posthog_list_projects | {} |
| postgres | postgres_list_schemas | {} |
| clickhouse | clickhouse_list_databases | {} |
| pganalyze | pganalyze_list_servers | {} |
| rwx | rwx_list_workspaces | {} |
| gmail | gmail_list_messages | {max_results: 5} |
| homeassistant | homeassistant_list_states | {} |
| ynab | ynab_list_budgets | {} |
| gcp | gcp_list_projects | {} |

For any integration not listed, use the first list/search tool found via search({"integration": "<name>"}).

For each call, record:

| Field | How to check |
|-------|-------------|
| Tool name | What you called |
| Response shape | Array, columnar (columns/rows/constants), object, or {} |
| Columnar format | If 8+ items, verify columnar shape with columns+rows. Check constants for lifted uniform values |
| Empty object {} | FLAG as compaction shape mismatch (Critical) |
| Error | Record error message verbatim |
| Approximate size | Eyeball response length; flag if approaching 50KB |
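The checklist above can be turned into a small recorder so that every call produces a consistent row. This is a minimal sketch, not Switchboard code: the columnar test (arrays under `columns` and `rows`) is an assumed reading of "columns/rows/constants", the 80% warning margin is an assumption, and size is approximated by the length of the serialized JSON string.

```javascript
// 50KB comes from the checklist; the 80% warning margin is an assumption.
const SIZE_LIMIT = 50 * 1024;

// Classifies a parsed response into one of the shapes named in the checklist.
function classifyShape(response) {
  if (Array.isArray(response)) return "array";
  if (response === null || typeof response !== "object") return "other";
  if (Object.keys(response).length === 0) return "{}";
  if (Array.isArray(response.columns) && Array.isArray(response.rows)) {
    return "columnar";
  }
  return "object";
}

// Builds one results-table row for a smoke-test call, including failures.
function recordCall(toolName, response, error) {
  const shape = error ? "error" : classifyShape(response);
  const text = error ? "" : JSON.stringify(response) || "";
  return {
    tool: toolName,
    shape,
    critical: shape === "{}",
    error: error || null,
    nearSizeLimit: text.length >= SIZE_LIMIT * 0.8,
  };
}
```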

Phase 3: Search Discoverability Tests

Full search testing has moved to the /search-benchmark skill. Use that skill for cross-model search quality benchmarking with synonym expansion, TF-IDF scoring, and stop-word analysis. This phase is retained here only as a lightweight sanity check.

Quick smoke test — verify these return non-zero results:

search({"query": "create ticket"}) → should find linear_create_issue
search({"query": "send message"}) → should find slack_send_message

For full search quality analysis, run /search-benchmark instead.
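Each sanity query can be reduced to one row of the Search Discoverability Results table. The sketch below assumes `results` is an array of tool definitions with a `name` field; the real search response shape may differ, and `checkSearch` is a hypothetical helper.

```javascript
// `results` is assumed to be an array of objects with a `name` field.
function checkSearch(query, expectedTool, results) {
  const names = results.map((tool) => tool.name);
  const found = names.includes(expectedTool);
  return {
    query,
    resultCount: results.length,
    expectedToolFound: found,
    status: found ? "PASS" : "FAIL",
  };
}
```

A query that returns zero results counts toward the search miss rate; a query that returns results without the expected tool counts toward the false-negative rate.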

Phase 4: Cross-Integration Scripts

Run 2-3 scripts that chain tools across integrations. Use api.tryCall for resilience so partial results are preserved.

IMPORTANT: Read-only scripts only. Do NOT send Slack messages, create issues, or perform any write operations during benchmarking. Scripts should read and cross-reference data, not mutate state.

Template script pattern:

// Discover data from one integration, cross-reference another.
var data1 = api.call('<integration1_list_tool>', {<args>});
var data2 = api.tryCall('<integration2_tool>', {<args_from_data1>});
({source: data1, cross_ref: data2});

Example scripts (adapt to enabled integrations):

Sentry+Linear cross-ref: List unresolved Sentry issues, search Linear for matching issue titles
Linear+GitHub cross-ref: List Linear issues, search GitHub PRs matching issue IDs
Notion search + page content: Search Notion, get full page content for top result
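As one concrete reading of the Sentry+Linear example, the sketch below lists unresolved Sentry issues and matches their titles against Linear issues locally, using only read-only tools named earlier in this document. To make it runnable outside Switchboard, it includes a stub `api`. The stub's `tryCall` return shape (`{ok, data}` or `{ok, error}`) and the `title` field on issues are assumptions; check the real `api.tryCall` behavior before relying on them.

```javascript
// Stub standing in for Switchboard's script `api` object. The tryCall return
// shape and the canned response fields are assumptions for illustration.
function makeStubApi(responses) {
  const call = (tool, args) => {
    if (!(tool in responses)) throw new Error(`tool not found: ${tool}`);
    return responses[tool];
  };
  const tryCall = (tool, args) => {
    try {
      return { ok: true, data: call(tool, args) };
    } catch (err) {
      return { ok: false, error: String(err.message) };
    }
  };
  return { call, tryCall };
}

// Read-only cross-reference: Sentry issue titles matched against Linear issue titles.
function sentryLinearCrossRef(api) {
  const sentry = api.call("sentry_list_org_issues", { query: "is:unresolved" });
  const linear = api.tryCall("linear_list_issues", { first: 5 });
  if (!linear.ok) {
    // Partial result: keep the Sentry data and report which call failed.
    return { source: sentry, cross_ref: null, error: linear.error };
  }
  const linearTitles = new Set(linear.data.map((issue) => issue.title));
  const matched = sentry.filter((issue) => linearTitles.has(issue.title));
  return { source: sentry, cross_ref: matched, error: null };
}
```

Because the second call goes through `tryCall`, a Linear failure still returns the Sentry data, which is the partial-result behavior this phase asks for.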

Record per script:

| Field | How to check |
|-------|-------------|
| Number of api.call()s | Count calls in script |
| Errors | Which call failed and error message |
| Script rewritten? | Did you have to modify the script to get results? Why? |
| Execution time | Did it approach 30s timeout? |

Phase 5: Metrics & Report

Calculate metrics from all phases:

| Metric | Formula |
|--------|---------|
| Script rewrite rate | Scripts rewritten / scripts attempted |
| Integration error rate | Upstream API errors / total calls |
| Server error rate | (Tool-not-found + 50KB exceeded) / total calls |
| Silent data loss | Responses returning {} when data expected |
| Search miss rate | 0-result queries / total search queries |
| Search false-negative rate | Expected tool not in results / total queries |
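The formulas above can be computed directly from the recorded rows. The following is a sketch only: the record field names (`errorType`, `emptyObject`, `resultCount`, `expectedFound`, `rewritten`) are hypothetical and should be adapted to however results were actually recorded. Rates are returned as fractions between 0 and 1, and an empty denominator yields 0.

```javascript
// Guards against division by zero when a phase recorded nothing.
function ratio(numerator, denominator) {
  return denominator === 0 ? 0 : numerator / denominator;
}

// calls: tool-call rows, searches: search-query rows, scripts: Phase 4 rows.
function computeMetrics(calls, searches, scripts) {
  const count = (items, predicate) => items.filter(predicate).length;
  return {
    scriptRewriteRate: ratio(count(scripts, (s) => s.rewritten), scripts.length),
    integrationErrorRate: ratio(
      count(calls, (c) => c.errorType === "integration"),
      calls.length
    ),
    // "server" covers tool-not-found and 50KB-exceeded errors.
    serverErrorRate: ratio(count(calls, (c) => c.errorType === "server"), calls.length),
    silentDataLoss: count(calls, (c) => c.emptyObject),
    searchMissRate: ratio(count(searches, (q) => q.resultCount === 0), searches.length),
    searchFalseNegativeRate: ratio(
      count(searches, (q) => !q.expectedFound),
      searches.length
    ),
  };
}
```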

Report template:

## MCP Benchmark Report - [date]

### Environment
- Enabled integrations: [list]
- Total tools: [count]

### Results
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Script rewrite rate | X% | <10% | PASS/FAIL |
| Integration errors | X | 0 | PASS/FAIL |
| Server errors | X | 0 | PASS/FAIL |
| Silent data loss | X | 0 | PASS/FAIL |
| Search miss rate | X% | 0% | PASS/FAIL |

### Smoke Test Results
| Integration | Tool | Shape | Status |
|-------------|------|-------|--------|
| ... | ... | ... | PASS/FAIL |

### Search Discoverability Results
| Query | Results | Expected Tool Found? | Status |
|-------|---------|---------------------|--------|
| ... | ... | ... | PASS/FAIL |

### Cross-Integration Script Results
| Script | Calls | Errors | Rewritten? | Status |
|--------|-------|--------|------------|--------|
| ... | ... | ... | ... | PASS/FAIL |

### Findings
[List each failure: tool name, what happened, severity, suggested fix]

### Comparison to Previous
[If previous benchmark exists, note improvements/regressions]
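
The Status column of the Results table follows mechanically from the Target column. As a sketch, the helper below renders those rows from measured values; the targets are copied from the template above, while the function name and the input format (metric name mapped to a number, with percentages given as 0 to 100) are assumptions.

```javascript
// Targets copied from the Results table in the report template.
const TARGETS = [
  { metric: "Script rewrite rate", unit: "%", target: "<10%", passes: (v) => v < 10 },
  { metric: "Integration errors", unit: "", target: "0", passes: (v) => v === 0 },
  { metric: "Server errors", unit: "", target: "0", passes: (v) => v === 0 },
  { metric: "Silent data loss", unit: "", target: "0", passes: (v) => v === 0 },
  { metric: "Search miss rate", unit: "%", target: "0%", passes: (v) => v === 0 },
];

// `values` maps metric name to a number (percentages as 0-100).
function renderResultsRows(values) {
  return TARGETS.map(({ metric, unit, target, passes }) => {
    const value = values[metric];
    const status = passes(value) ? "PASS" : "FAIL";
    return `| ${metric} | ${value}${unit} | ${target} | ${status} |`;
  });
}
```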

Common Mistakes

Writing during benchmarks: NEVER send messages, create issues, or mutate state. All scripts must be read-only.
Ignoring parameter defaults: If a tool ignores your per_page/limit, that's a bug worth flagging.
Skipping discovery: Don't assume which integrations are enabled. Run Phase 1 first.

Live benchmark protocol for Switchboard's MCP server. Runs real tool-calling sequences against enabled integrations, tracks failure metrics, and identifies impediments to successful LLM tool usage. Use when: "benchmark", "test the MCP", "run user stories", "smoke test integrations", after adding/changing integrations or tools, after changing compaction specs or search logic, before releases. Not for unit testing (use make test) or load testing.

13 stars

tools

Updated Apr 17, 2026

npx skillsauth add daltoniam/switchboard mcp-benchmark

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---------|-------------|--------|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 17, 2026, 10:22 AM (4.4s, 1 file scanned)

SKILL.md

name: mcp-benchmark
description: >
  Use when: "benchmark", "test the MCP", "run user stories", "smoke test
author: switchboard
version: 1.0

Related Skills

daltoniam/search-benchmark

tools

Cross-model search quality benchmark for Switchboard's tool discovery. Dispatches identical search scenarios to opus, sonnet, and haiku in parallel, compiles a comparison table, and identifies optimization opportunities. Use when: "benchmark search", "test search quality", "run search benchmark", after changing scoring logic, synonyms, stop words, IDF, or tool descriptions, after adding new integrations, or when evaluating Phase 2 tag impact. Also use when the user mentions "search hit rate", "search recall", or "did search get better/worse". Not for full MCP smoke tests (use mcp-benchmark) or unit testing (use make test).

13 · SKILL.md · Updated Apr 17, 2026

daltoniam/pr-review

tools

Review a GitHub pull request for the Switchboard Go MCP server project. Enforces idiomatic Go, project conventions (hexagonal architecture, dispatch maps, port interfaces), test coverage, build/lint verification, and production readiness.

13 · SKILL.md · Updated Apr 17, 2026

daltoniam/pr-comments

tools

Submit a PR review as inline GitHub comments on specific files and lines using the gh CLI.

13 · SKILL.md · Updated Apr 17, 2026

daltoniam/optimize-integration

tools

Improve an existing Switchboard integration adapter's LLM usability: tool description enrichment, field compaction refinement, and response tuning. Use when: "optimize integration", "improve tool descriptions", "extend compaction", "make integration better for LLMs", after user story mapping, or when an LLM is making wrong tool choices or passing wrong IDs. Not for adding new integrations (use add-integration) or fixing bugs.

13 · SKILL.md · Updated Apr 17, 2026

Install manually

# Clone the repo
git clone https://github.com/daltoniam/switchboard.git

# Copy into Claude Code skills folder (global)
cp -r switchboard/.agents/skills/mcp-benchmark ~/.claude/skills/

Repository

daltoniam/switchboard

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT