skills/mcp-tester/SKILL.md
Test and evaluate MCP server tools in the current Claude Code session. Use when auditing MCP configurations, validating tool quality, testing MCP servers end-to-end, generating test cases, checking tool descriptions and schemas, analyzing tool efficiency and redundancy, or debugging MCP integration issues. Covers tool discovery, quality analysis, test generation with AAA pattern, execution, rating, and cross-tool redundancy analysis.
A comprehensive skill for testing and evaluating MCP (Model Context Protocol) server tools available in the current Claude Code session.
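Use this skill when:
- Auditing MCP configurations
- Validating tool quality
- Testing MCP servers end-to-end
- Generating test cases
- Checking tool descriptions and schemas
- Analyzing tool efficiency and redundancy
- Debugging MCP integration issues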
MCP servers must be configured in your Claude Code settings. If the user asks to test tools that don't exist, guide them to add the MCP server to their configuration.
Execute these phases in order for comprehensive testing:
### Phase 1: Tool Discovery
Identify all MCP tools available in the current session.
Steps:
- List the tools whose names start with the `mcp__` prefix

Output Format:
## Tool Inventory
| Server | Tool Name | Required Params | Optional Params | Description Preview |
|--------|-----------|-----------------|-----------------|---------------------|
| context7 | resolve-library-id | libraryName, query | - | Resolves package to Context7 ID... |
| context7 | query-docs | libraryId, query | - | Retrieves documentation... |
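The inventory rows can be derived from the tool names themselves. The following is a minimal sketch, assuming names follow the `mcp__<server>__<tool>` shape used elsewhere in this document; `parseToolName`, `groupByServer`, and the sample tool list are hypothetical names introduced for illustration.

```javascript
// Split a tool name of the form mcp__<server>__<tool> into its parts.
// Returns null for names that do not follow that pattern.
function parseToolName(fullName) {
  const match = /^mcp__(.+?)__(.+)$/.exec(fullName);
  return match ? { server: match[1], tool: match[2] } : null;
}

// Group tool names by server so each server becomes a block of inventory rows.
function groupByServer(fullNames) {
  const inventory = {};
  for (const name of fullNames) {
    const parsed = parseToolName(name);
    if (parsed === null) continue; // skip built-in or otherwise non-MCP tools
    if (!inventory[parsed.server]) inventory[parsed.server] = [];
    inventory[parsed.server].push(parsed.tool);
  }
  return inventory;
}

// Hypothetical session tool list, used only for illustration.
const sessionTools = [
  "mcp__context7__resolve-library-id",
  "mcp__context7__query-docs",
  "Read",
];
console.log(groupByServer(sessionTools));
```

Names without the prefix are skipped rather than reported, so the inventory only counts MCP tools.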
Metrics to Report:
### Phase 2: Quality Analysis
Evaluate each tool's design quality.
Check for:
- Kebab-case names: `get-user`, not `getUser` or `get_user`
- Verb-first names: `create-item`, `list-users`, `delete-record`

Description quality scale:

| Score | Criteria |
|-------|----------|
| Excellent | Clear purpose, usage context, examples, input/output expectations |
| Good | Clear purpose, some context, basic expectations |
| Fair | Purpose stated but lacking context or examples |
| Poor | Vague, missing, or misleading description |
Evaluate:
Check:
- `enum` vs generic `string` where applicable

Token Efficiency Formula:
Efficiency = (Useful information conveyed) / (Token count)
Flag tools where description is verbose relative to complexity.
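The ratio above has no fixed unit, so any automated check needs stand-ins for both sides. This sketch assumes roughly 4 characters per token and uses the parameter count as a proxy for complexity; both are assumptions, and `estimateTokens`, `isVerbose`, and the 100-token budget are hypothetical.

```javascript
// Rough token estimate: about 4 characters per token.
// This is a heuristic, not a real tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Flag a tool when its description spends more than `budgetPerParam`
// estimated tokens per parameter. Tools without parameters count as 1.
function isVerbose(tool, budgetPerParam = 100) {
  const paramCount = Math.max(1, tool.params.length);
  return estimateTokens(tool.description) / paramCount > budgetPerParam;
}

// Hypothetical tool: ~450 estimated tokens of description for 2 parameters.
const verboseTool = { description: "x".repeat(1800), params: ["libraryId", "query"] };
console.log(isVerbose(verboseTool));
```

A real tokenizer would give different counts; the heuristic is only meant to rank tools against each other.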
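Use these indicators for findings:
- 🔴 Critical (must address)
- 🟡 Warning (should address)
- 🔵 Suggestion (consider)
- ✅ No issue found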
Analysis Output Format:
### Tool: `mcp__server__tool-name`
**Naming**: ✅ Clear verb-first naming
**Description**: 🟡 Warning - Verbose (450 tokens), could be reduced to ~200
**Parameters**: 🔵 Suggestion - Consider enum for `format` param
**Detailed Findings:**
- [Specific observations]
**Recommendations:**
- [Actionable improvements]
### Phase 3: Test Generation
Generate test cases for each tool using the AAA (Arrange-Act-Assert) pattern.
1. Valid Inputs (Happy Path)
2. Invalid Inputs (Error Handling)
3. Edge Cases
- Empty strings (`""`)
- Null-like values (`null`, `undefined`, `None`)

### Test: [Tool Name] - [Scenario Name]
**Category**: Valid / Invalid / Edge Case
**Arrange**:
- Context: [What setup is needed]
- Preconditions: [What must be true]
**Act**:
```json
{
  "param1": "value1",
  "param2": "value2"
}
```
**Assert**:
#### Test Generation Guidelines
For each tool, generate at minimum:
1. 1 happy path test with minimal params
2. 1 happy path test with all params
3. 1 missing required param test
4. 1 wrong type test
5. 1 edge case test (empty/boundary)
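The five minimum cases can be generated mechanically from a tool's input schema. This is a sketch under the assumption that the schema is a JSON Schema object with `properties` and `required`; `generateMinimumTests`, `sampleFor`, and the sample values are hypothetical.

```javascript
// Illustrative sample values per JSON Schema type (assumed defaults).
const SAMPLE_VALUES = { string: "example", number: 1, integer: 1, boolean: true, array: [], object: {} };

function sampleFor(property) {
  return property.enum ? property.enum[0] : SAMPLE_VALUES[property.type];
}

// Build the five minimum test cases for one tool from its input schema.
function generateMinimumTests(schema) {
  const properties = schema.properties || {};
  const required = schema.required || [];
  const build = (names) =>
    Object.fromEntries(names.map((name) => [name, sampleFor(properties[name])]));

  const minimal = build(required);
  const tests = [
    { scenario: "minimal params", category: "Valid", params: minimal },
    { scenario: "all params", category: "Valid", params: build(Object.keys(properties)) },
  ];

  if (required.length > 0) {
    const [dropped, ...kept] = required;
    tests.push({ scenario: `missing ${dropped}`, category: "Invalid", params: build(kept) });
  }

  const target = required[0] || Object.keys(properties)[0];
  if (target !== undefined) {
    // A number where a string is expected, a string everywhere else.
    const wrong = properties[target].type === "string" ? 12345 : "not-a-value";
    tests.push({ scenario: `wrong type for ${target}`, category: "Invalid", params: { ...minimal, [target]: wrong } });
  }

  const stringParam = Object.keys(properties).find((name) => properties[name].type === "string");
  if (stringParam !== undefined) {
    tests.push({ scenario: `empty ${stringParam}`, category: "Edge Case", params: { ...minimal, [stringParam]: "" } });
  }
  return tests;
}
```

The generator drops only the first required parameter; extending it to drop each one in turn would follow the same pattern.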
### Phase 4: Test Execution
Execute generated tests and capture results.
#### Read-Only Tools
Execute immediately without confirmation:
- Tools that fetch/query data
- Tools that list/search resources
- Tools that analyze/inspect
#### Mutating Tools
**ALWAYS ask for confirmation before testing:**
````markdown
⚠️ **Mutating Tool Detected**
Tool: `mcp__server__create-item`
Operation: Creates new item in external system
**Test Parameters:**
```json
{
  "name": "test-item-12345",
  "type": "test"
}
```
Potential Effects:

Proceed with this test? (yes/no/skip)
````
Mutating operations include:
- `create`, `write`, `post`, `add`, `insert`
- `update`, `edit`, `modify`, `patch`, `put`
- `delete`, `remove`, `destroy`, `clear`
- `send`, `publish`, `trigger`, `execute`
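The verb list above can drive a first-pass classifier. This is a name-based sketch only, assuming kebab-case or snake_case tool names; `isMutating` and `MUTATING_VERBS` are hypothetical names, and a tool whose name hides its side effects would slip through.

```javascript
// Verbs from the list above that signal a state-changing operation.
const MUTATING_VERBS = [
  "create", "write", "post", "add", "insert",
  "update", "edit", "modify", "patch", "put",
  "delete", "remove", "destroy", "clear",
  "send", "publish", "trigger", "execute",
];

// Return true when any word of the tool name is a mutating verb.
function isMutating(toolName) {
  // Keep only the part after the last "__" so the server name is ignored.
  const tool = toolName.split("__").pop().toLowerCase();
  // Whole-word matching avoids false hits such as "put" inside "input".
  return tool.split(/[-_]/).some((word) => MUTATING_VERBS.includes(word));
}

console.log(isMutating("mcp__server__create-item")); // true
console.log(isMutating("mcp__context7__query-docs")); // false
```

When the name is ambiguous, treating the tool as mutating and asking for confirmation is the safer default.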
#### Response Capture
For each test, capture:
- Full response content (truncate if > 2000 chars)
- Response time (if perceivable delay)
- Error messages and codes
- Unexpected warnings or notices
### Phase 5: Rating & Feedback
Rate each tool's test results and provide actionable feedback.
#### Rating Criteria
| Rating | Symbol | Criteria |
|--------|--------|----------|
| **Worked** | ✅ | Response matches expected format, no errors, useful output |
| **Partially Worked** | 🟡 | Response returned but incomplete, warnings present, or unexpected format |
| **Failed** | ❌ | Error returned, timeout, or completely wrong behavior |
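Per-test ratings need a roll-up rule before a tool can get one overall rating. The table does not state that rule, so this sketch assumes one: all tests worked gives "worked", all failed gives "failed", and any mix gives "partial". `summarizeRatings` and the rating strings are hypothetical.

```javascript
// Summarize per-test ratings ("worked" | "partial" | "failed") for one tool.
function summarizeRatings(results) {
  if (results.length === 0) throw new Error("no test results to summarize");
  const counts = { worked: 0, partial: 0, failed: 0 };
  for (const result of results) counts[result.rating] += 1;

  // Assumed roll-up rule: any mix of outcomes is reported as "partial".
  let overall = "partial";
  if (counts.worked === results.length) overall = "worked";
  if (counts.failed === results.length) overall = "failed";

  return { ...counts, overall, passed: `${counts.worked}/${results.length}` };
}

// Hypothetical results for one tool: three worked, one partial, one failed.
const toolResults = ["worked", "worked", "partial", "failed", "worked"].map((rating) => ({ rating }));
console.log(summarizeRatings(toolResults));
```

Only tests rated "worked" count toward the pass fraction here; a stricter or looser rule is equally defensible as long as the report states it.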
#### Quality Assessment Dimensions
1. **Response Completeness** (High/Medium/Low)
- Does it return all expected data?
- Are there missing fields?
2. **Response Efficiency** (High/Medium/Low)
- Token usage vs. value provided
- Unnecessary verbosity in response?
3. **Error Handling** (Clear/Vague/Missing)
- Are error messages helpful?
- Do they indicate how to fix the issue?
4. **Format Consistency** (Consistent/Inconsistent)
- Does response format match description?
- Is format consistent across calls?
#### Feedback Template
```markdown
## Tool: `mcp__server__tool-name`
### Test Results Summary
| Test | Category | Rating | Notes |
|------|----------|--------|-------|
| Minimal params | Valid | ✅ Worked | Response in 200ms |
| All params | Valid | ✅ Worked | - |
| Missing required | Invalid | 🟡 Partial | Error unclear |
| Wrong type | Invalid | ❌ Failed | No error, silent fail |
| Empty string | Edge | ✅ Worked | Handled gracefully |
### Overall Rating: 🟡 Partially Worked (3/5 tests passed)
### Quality Assessment
- **Completeness**: High - Returns all documented fields
- **Efficiency**: Medium - Response includes redundant metadata
- **Error Handling**: Vague - Errors don't indicate fix
- **Consistency**: Consistent
### Critical Issues
🔴 Silent failure on wrong type - should return validation error
### Improvement Suggestions
1. Add input validation with descriptive error messages
2. Remove redundant `metadata.internal_id` from response (saves ~50 tokens)
3. Consider pagination for list responses
```
### Phase 6: Cross-Tool Analysis
Analyze the tool set as a whole.
Look for:
Output Format:
## Redundancy Findings
| Tool A | Tool B | Overlap | Recommendation |
|--------|--------|---------|----------------|
| mcp__a__get-user | mcp__b__fetch-user | 90% same function | Consolidate to single tool |
| mcp__a__list-all | mcp__a__search | Search can replace list | Deprecate list-all |
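Candidate pairs like `get-user` and `fetch-user` can be surfaced by normalizing verbs before comparing names. This sketch is a name-only heuristic; the synonym table, `normalizeToolName`, and `findOverlapCandidates` are hypothetical, and every pair it returns still needs a manual look at descriptions and schemas.

```javascript
// Assumed verb synonyms; extend to match the servers under test.
const VERB_SYNONYMS = new Map([
  ["fetch", "get"], ["retrieve", "get"], ["read", "get"],
  ["remove", "delete"], ["destroy", "delete"],
  ["add", "create"], ["insert", "create"],
  ["modify", "update"], ["edit", "update"],
]);

// Drop the server prefix and map the leading verb to its canonical form.
function normalizeToolName(fullName) {
  const tool = fullName.split("__").pop().toLowerCase();
  const [verb, ...rest] = tool.split("-");
  return [VERB_SYNONYMS.get(verb) || verb, ...rest].join("-");
}

// Return every pair of tools whose normalized names are identical.
function findOverlapCandidates(fullNames) {
  const pairs = [];
  for (let i = 0; i < fullNames.length; i++) {
    for (let j = i + 1; j < fullNames.length; j++) {
      if (normalizeToolName(fullNames[i]) === normalizeToolName(fullNames[j])) {
        pairs.push([fullNames[i], fullNames[j]]);
      }
    }
  }
  return pairs;
}

console.log(findOverlapCandidates(["mcp__a__get-user", "mcp__b__fetch-user", "mcp__a__list-all"]));
```

Overlaps such as search replacing list do not share a normalized name, so this check will not find them.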
Identify tools that could be:
Note gaps in tool coverage:
## Efficiency Recommendations
### High Impact
1. **Reduce description verbosity** - 3 tools have descriptions >500 tokens
- Potential savings: ~800 tokens total
### Medium Impact
2. **Add enum constraints** - 5 parameters accept free text but have limited valid values
- Improves: Validation, documentation, autocomplete
### Low Impact
3. **Standardize naming** - Mix of `get-X` and `fetch-X` patterns
- Improves: Consistency, discoverability
Generate this report after completing all phases:
# MCP Tool Test Report
**Generated**: [timestamp]
**Session ID**: [if available]
---
## Executive Summary
| Metric | Value |
|--------|-------|
| Servers Tested | [N] |
| Tools Tested | [N] |
| Tests Executed | [N] |
| Pass Rate | [X]% |
### Results Overview
- ✅ **Passed**: [X] tools
- 🟡 **Partial**: [Y] tools
- ❌ **Failed**: [Z] tools
### Key Findings
- 🔴 [N] critical issues requiring immediate attention
- 🟡 [N] warnings to address
- 🔵 [N] suggestions for improvement
---
## Tool Inventory
[Phase 1 output]
---
## Quality Analysis
[Phase 2 output for each tool]
---
## Test Results
[Phase 5 output for each tool]
---
## Cross-Tool Analysis
[Phase 6 output]
---
## Recommendations
### 🔴 Critical (Must Address)
1. [Issue] - [Tool] - [Impact] - [Fix]
### 🟡 Warning (Should Address)
1. [Issue] - [Tool] - [Impact] - [Fix]
### 🔵 Suggestions (Consider)
1. [Improvement] - [Tool] - [Benefit]
---
*Report generated by mcp-tester skill*
This section documents real failure modes and anti-patterns encountered when testing MCP tools. These gotchas are non-obvious and often discovered through painful iteration.
The Problem: Tests pass with happy-path inputs but fail in production with edge cases or error conditions. Missing error path testing creates blind spots in tool reliability.
Anti-patterns to Avoid:
Pattern: Comprehensive Test Matrix
For each tool, test:
✅ Happy path: minimal params
✅ Happy path: all params
✅ Missing required param (each required param tested separately)
✅ Wrong param type (string to number, number to boolean, etc.)
✅ Invalid enum value (param expects ['active', 'inactive'], test 'disabled')
✅ Empty string / null-like values
✅ Very long strings (1000+ chars, 10K+ chars)
✅ Special characters: !@#$%^&*()_+-={}[]|:;"'<>?,./~`
✅ Unicode: emoji, CJK, RTL text
✅ Boundary values: 0, -1, MAX_INT, Float edge cases
✅ Malformed data: incomplete JSON, null objects, circular references
Impact: Incomplete test coverage → tools fail when actually used with real-world data
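Parts of the matrix above can be kept as reusable value lists and expanded per parameter. A minimal sketch, with `Number.MAX_SAFE_INTEGER` standing in for MAX_INT; `STRING_EDGE_VALUES`, `NUMBER_EDGE_VALUES`, and `buildEdgeCases` are hypothetical names.

```javascript
// Edge-case strings from the matrix: empty, long, special characters, Unicode.
const STRING_EDGE_VALUES = [
  "",
  "a".repeat(1000),
  "a".repeat(10000),
  "!@#$%^&*()_+-={}[]|:;\"'<>?,./~`",
  "😀",
  "漢字",
  "مرحبا",
];

// Boundary numbers; MAX_SAFE_INTEGER is the JavaScript stand-in for MAX_INT.
const NUMBER_EDGE_VALUES = [0, -1, Number.MAX_SAFE_INTEGER, Number.MIN_VALUE];

// Produce one parameter object per edge value, leaving other params untouched.
function buildEdgeCases(baseParams, paramName, values) {
  return values.map((value) => ({ ...baseParams, [paramName]: value }));
}

// Hypothetical base call with one string and one numeric parameter.
const edgeCases = buildEdgeCases({ name: "x", limit: 5 }, "name", STRING_EDGE_VALUES);
console.log(edgeCases.length);
```

Varying one parameter at a time keeps a failure attributable to a single input.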
The Problem: Tests pass with mocked data but fail when calling real MCP servers. Mock behavior diverges from real server behavior in timeouts, error formats, and response sizes.
Anti-patterns to Avoid:
Pattern: Realistic Mock Behavior
When mocking MCP tool responses:
✅ Return realistic response sizes (don't shrink large responses to 1-2 items)
✅ Match actual error response format (test against real API error docs)
✅ Add realistic latency (~100-500ms for network calls)
✅ Include rate limit headers/indicators in mock
✅ Mock pagination correctly (include next_page token, limits)
✅ Match response schema exactly, including optional fields that are usually present
✅ Test against actual MCP server output when possible (don't guess schema)
Example mismatch:
```jsonc
// ❌ Mock (too clean)
{ "users": [{ "id": 1, "name": "Alice" }] }

// ✅ Real API (includes metadata, nulls, pagination)
{
  "data": [{ "id": 1, "name": "Alice", "email": null, "tags": [] }],
  "meta": { "total": 5000, "page": 1, "limit": 50, "next": "cursor-xyz" },
  "timing": { "ms": 234 }
}
```
Impact: Tests pass locally with mocks but fail against real MCP server
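A structural diff between a mock and a captured real response catches this kind of drift early. This sketch compares key paths only, not values or types; `keyPaths` and `missingFromMock` are hypothetical helpers.

```javascript
// Collect dotted key paths from a JSON-like value.
// All items of an array share a single "[]" segment.
function keyPaths(value, prefix = "") {
  if (Array.isArray(value)) {
    return value.flatMap((item) => keyPaths(item, prefix + "[]"));
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value).flatMap(([key, child]) => {
      const path = prefix ? `${prefix}.${key}` : key;
      return [path, ...keyPaths(child, path)];
    });
  }
  return []; // primitives and null contribute no further paths
}

// List the key paths present in the real response but absent from the mock.
function missingFromMock(mock, real) {
  const mockPaths = new Set(keyPaths(mock));
  return [...new Set(keyPaths(real))].filter((path) => !mockPaths.has(path));
}

const mockResponse = { data: [{ id: 1 }] };
const realResponse = { data: [{ id: 1, email: null }], meta: { total: 1 } };
console.log(missingFromMock(mockResponse, realResponse));
```

An empty result means the mock covers every key the real response has; it does not prove the values are realistic.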
The Problem: Tests pass when run individually but fail intermittently when run in parallel. Race conditions in setup/teardown, timeout assumptions that don't hold, or async operations that complete in unpredictable order.
Anti-patterns to Avoid:
Pattern: Robust Async Testing
✅ Set timeouts per tool (measure real response times first)
✅ Use async/await or .then() chains consistently, and leave no promise unhandled
✅ Isolate tests: each test gets fresh MCP server connection if possible
✅ Add explicit setup/teardown phases with proper waiting
✅ For concurrent tests: use semaphores/locks to prevent interference
✅ Test with real latency: add 100-500ms overhead to account for network
✅ Run tests multiple times to catch intermittent failures
✅ Use flake detection: run failing tests 5x to confirm it's not timing
Example race condition:
```javascript
// ❌ No await - assertion runs before operation completes
const testWithoutAwait = () => {
  callMCPTool({ param: "value" }); // promise is created but never awaited
  assert(responseReceived); // Fails! Call still pending
};

// ✅ Proper await
const testWithAwait = async () => {
  const response = await callMCPTool({ param: "value" });
  assert(response.success); // Now it's safe to assert
};
```
Impact: Flaky tests that pass sometimes, fail others → low confidence in test results
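Flake detection can be as simple as rerunning a test and counting outcomes. A minimal sketch, assuming each test is an async function that throws or rejects on failure; `detectFlake` and its verdict strings are hypothetical.

```javascript
// Run an async test `runs` times, one after another, and report how often it passed.
async function detectFlake(testFn, runs = 5) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await testFn();
      passes += 1;
    } catch (err) {
      // A thrown error or rejected promise counts as a failed run.
    }
  }
  let verdict = "flaky";
  if (passes === runs) verdict = "stable-pass";
  if (passes === 0) verdict = "stable-fail";
  return { runs, passes, verdict };
}

// Hypothetical test that fails on every second call.
let callCount = 0;
const alternating = async () => {
  callCount += 1;
  if (callCount % 2 === 0) throw new Error("intermittent failure");
};
detectFlake(alternating, 4).then((result) => console.log(result));
```

Runs are sequential on purpose, so the reruns themselves cannot interfere with each other.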
The Problem: Tools accept invalid parameter values that should be rejected by schema validation. MCP schema validation is incomplete, allowing garbage input to propagate to the server, causing cryptic errors downstream.
Anti-patterns to Avoid:
Pattern: Comprehensive Schema Testing
For each parameter:
✅ Test missing (should fail if required)
✅ Test wrong type (if expects number, pass string)
✅ Test enum validation (if expects ['a','b','c'], test 'd')
✅ Test length constraints (min/max length for strings)
✅ Test pattern/format validation (regex patterns, URLs, emails)
✅ Test numeric bounds (min/max for numbers, negative values)
✅ Test nested object structure (all required fields present?)
✅ Test null vs undefined behavior
✅ Test special characters in string params
✅ Validate response schema (returned data matches documented format)
Example validation gap:
```javascript
// Tool spec says: name (required, string, max 255 chars)
// Test reveals:
callMCPTool({ name: 123 });              // ❌ Accepted! Should reject (not a string)
callMCPTool({ name: "a".repeat(1000) }); // ❌ Accepted! Should reject (>255 chars)
callMCPTool({});                         // ❌ Accepted! Should reject (required param missing)
```
Impact: Invalid requests accepted by tool → confusing errors from MCP server, hard to debug root cause
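To decide which inputs a tool should reject, the documented spec can be turned into a small oracle and compared against what the tool actually accepts. This sketch covers only required, type, and max-length rules; `validateParams` and the `nameSpec` shape are hypothetical, not part of any MCP library.

```javascript
// Check params against a simplified spec and return a list of violations.
function validateParams(spec, params) {
  const errors = [];
  for (const [name, rule] of Object.entries(spec)) {
    const value = params[name];
    if (value === undefined || value === null) {
      if (rule.required) errors.push(`${name}: required parameter is missing`);
      continue;
    }
    if (typeof value !== rule.type) {
      errors.push(`${name}: expected ${rule.type}, got ${typeof value}`);
      continue;
    }
    if (rule.maxLength !== undefined && value.length > rule.maxLength) {
      errors.push(`${name}: length ${value.length} exceeds max ${rule.maxLength}`);
    }
  }
  return errors;
}

// Spec for the example above: name (required, string, max 255 chars).
const nameSpec = { name: { required: true, type: "string", maxLength: 255 } };
console.log(validateParams(nameSpec, { name: 123 }));
console.log(validateParams(nameSpec, { name: "a".repeat(1000) }));
console.log(validateParams(nameSpec, {}));
```

Any input for which the oracle reports a violation but the tool returns success is a validation gap worth recording.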
The Problem: Tools work in unit tests against mocks but fail during integration testing against a real MCP server. The tool schema doesn't match server expectations, response format is different, or endpoint has changed.
Anti-patterns to Avoid:
Pattern: Integration Testing
Separate unit tests (mocks) from integration tests (real server):
✅ Unit tests: test schema validation, parameter handling, response parsing
✅ Integration tests: call actual MCP server with real credentials/permissions
✅ Version your integration tests: document which server version they test
✅ Test error paths in integration: wrong credentials, missing permissions, server errors
✅ Validate response against schema in integration tests
✅ Test server downtime handling (timeouts, retries)
✅ Test against multiple server versions if applicable
✅ Use test fixtures: real-world data from actual server responses
Example mismatch:
```javascript
// Unit test passes
callMCPTool({ userId: 123 }); // ✅ Mock returns { user: { id: 123 } }

// Integration test fails
callMCPTool({ userId: 123 }); // ❌ Real server expects { user_id: 123 } (different key!)
```
Impact: Tests pass locally but fail in CI/production against real MCP server
Use this checklist before finalizing tool tests:
User: "Test all my MCP tools"
Response Flow:
User: "Test just the context7 MCP tools"
Response Flow:
User: "Audit my MCP tool descriptions without running tests"
Response Flow:
User: "Test the notion MCP tools"
Response:
I don't see any Notion MCP tools in the current session.
To add a Notion MCP server:
1. Install the server:
```bash
npm install -g @notionhq/notion-mcp-server
```
2. Add to your Claude Code MCP config (~/.claude/mcp_settings.json):
```json
{
  "mcpServers": {
    "notion": {
      "command": "notion-mcp-server",
      "env": {
        "NOTION_API_KEY": "your-api-key"
      }
    }
  }
}
```
3. Restart Claude Code to load the new server

Would you like me to help you configure this?
## Error Recovery
If a test fails or times out:
1. Log the failure with available details
2. Continue testing remaining tools
3. Include failure in final report
4. Suggest debugging steps for failed tools
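The recovery steps above amount to a runner that records failures instead of stopping. A minimal sketch, assuming each test exposes an async `run` function; `runAll`, the test object shape, and the 5000 ms default are hypothetical.

```javascript
// Run every test in order; log failures and timeouts, then keep going.
async function runAll(tests, timeoutMs = 5000) {
  const report = [];
  for (const test of tests) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
    });
    try {
      const response = await Promise.race([test.run(), timeout]);
      report.push({ name: test.name, status: "ok", response });
    } catch (err) {
      // Record the failure with available details and continue with the next tool.
      report.push({ name: test.name, status: "failed", error: err.message });
    } finally {
      clearTimeout(timer);
    }
  }
  return report;
}

// Hypothetical tests: one throws, one succeeds.
runAll([
  { name: "broken-tool", run: async () => { throw new Error("connection refused"); } },
  { name: "healthy-tool", run: async () => "ok" },
]).then((report) => console.log(report));
```

Every entry ends up in the report, so failed tools appear alongside passing ones in the final output.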
## MCP Inspector: Standalone Testing
For testing outside of Claude Code sessions, use the MCP Inspector:
```bash
# Launch MCP Inspector UI (browser-based)
npx @modelcontextprotocol/inspector

# Test a specific stdio server
npx @modelcontextprotocol/inspector stdio node /path/to/server.js

# Test a specific HTTP server
npx @modelcontextprotocol/inspector http http://localhost:3000/mcp

# With environment variables
MCP_API_KEY=your-key npx @modelcontextprotocol/inspector stdio node server.js
```
The MCP Inspector provides:
```bash
# Check if MCP server process starts correctly
node dist/index.js  # Should hang waiting for input (stdio) or log port (HTTP)

# Test stdio protocol manually
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}' | node dist/index.js

# Check server logs (stderr for stdio servers)
node dist/index.js 2>server.log &
# Then interact, then cat server.log

# For HTTP servers
curl -X POST http://localhost:3000/mcp \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
```