skills/skill-testing-framework/SKILL.md
Provides test cases and validation tools for skills. Use when creating tests for a new skill, adding regression tests after skill updates, running test suites to verify skill functionality, or validating that skill outputs match expected results. Supports unit tests, integration tests, and regression tests with input/output pair validation.
npx skillsauth add auldsyababua/instructor-workflow skill-testing-framework
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill provides a comprehensive framework for testing skills at multiple levels: unit tests for individual components, integration tests for complete workflows, and regression tests to catch breaking changes. It includes scripts for generating test templates, running test suites, and validating outputs against baselines.
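Use this skill when creating tests for a new skill, adding regression tests after skill updates, running test suites to verify skill functionality, or validating that skill outputs match expected results.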
Start by generating a test template for your skill:
scripts/generate_test_template.py /path/to/your-skill --output your-skill-tests.json
This analyzes your skill structure and creates a template with unit, integration, and regression test sections.
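As a rough illustration of what template generation could involve, the sketch below scans a skill directory for Python scripts and emits one unit-test stub per script. It is not the actual generate_test_template.py: the function names, the assumption that scripts live in a scripts/ subdirectory, and the section keys (unit_tests, integration_tests, regression_tests) are all hypothetical.

```python
import json
from pathlib import Path


def build_template(skill_path: str) -> dict:
    """Build a test-suite skeleton with one unit-test stub per script found.

    Assumption: the skill keeps its scripts in <skill>/scripts/*.py.
    """
    skill_dir = Path(skill_path)
    scripts_dir = skill_dir / "scripts"
    scripts = sorted(p.name for p in scripts_dir.glob("*.py")) if scripts_dir.is_dir() else []

    unit_tests = [
        {
            "name": f"Test {name}",
            "type": "script",
            "script": name,
            "args": [],
            "expected_exit_code": 0,
            "description": f"TODO: describe what {name} should do",
        }
        for name in scripts
    ]

    # Section names below are illustrative, not taken from the real template.
    return {
        "skill": skill_dir.name,
        "unit_tests": unit_tests,
        "integration_tests": [],
        "regression_tests": [],
    }


def write_template(skill_path: str, output_file: str) -> None:
    """Serialize the skeleton as indented JSON."""
    Path(output_file).write_text(json.dumps(build_template(skill_path), indent=2))
```

The stubs are deliberately incomplete (empty args, TODO descriptions) so that each generated case has to be reviewed before it can pass meaningfully.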
Edit the generated test file to add specific test cases. Use assets/test_template.json as a reference for test case structure.
Execute the test suite:
scripts/run_tests.py your-skill-tests.json --skill-path /path/to/your-skill
Add the --verbose flag for detailed output.
Test individual components in isolation (scripts, functions, modules).
Example:
{
"name": "Test PDF rotation script",
"type": "script",
"script": "rotate_pdf.py",
"args": ["input.pdf", "output.pdf", "90"],
"expected_exit_code": 0,
"description": "Verify PDF rotation executes without errors"
}
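A test case of type "script" like the one above could be evaluated roughly as follows. This is a minimal sketch, not the logic of run_tests.py: the function name run_script_test and the assumption that the script is resolved under the skill's scripts/ directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def run_script_test(test_case: dict, skill_path: str) -> bool:
    """Run the script named in a test case and compare its exit code.

    Assumption: scripts are located at <skill_path>/scripts/<script>.
    """
    script = Path(skill_path) / "scripts" / test_case["script"]
    result = subprocess.run(
        [sys.executable, str(script), *test_case.get("args", [])],
        capture_output=True,  # keep the script's output out of the test report
        text=True,
    )
    # The test passes only if the exit code matches the expectation (default 0).
    return result.returncode == test_case.get("expected_exit_code", 0)
```

Checking only the exit code keeps unit tests cheap; checking the content of the output is left to the validation step.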
Test complete workflows from start to finish.
Example:
{
"name": "Test document creation workflow",
"type": "workflow",
"description": "End-to-end test of document generation",
"input": {
"user_query": "Create a report with sections and formatting",
"files": []
},
"expected_output": {
"type": "docx",
"validation": "Contains title, sections, and proper formatting"
}
}
Compare outputs against known baselines to catch unintended changes.
Example:
{
"name": "Regression: Output format consistency",
"description": "Ensure output format hasn't changed",
"input": {
"user_query": "Process sample data",
"files": ["sample.csv"]
},
"baseline_file": "baselines/sample_output_v1.txt",
"validation_method": "exact_match"
}
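Test cases use input/output pairs to define expected behavior:
Input: the user query and any input files, as in the "input" objects of the examples above.
Expected Output: the expected output type and a validation description, or a baseline file for regression tests.
Choose the appropriate validation method: exact (exact match), contains (output contains a string), or pattern (pattern matching).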
For regression tests, create baseline files from known good outputs:
# Run skill and capture output
./skill-script.py input.txt > output.txt
# Create baseline from output
scripts/validate_test_results.py --create-baseline output.txt baselines/
Compare current output with baseline:
scripts/validate_test_results.py actual.txt baseline.txt --mode exact
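The comparison modes accepted by validate_test_results.py (exact, contains, pattern) can be pictured with the sketch below. It is an assumption about their semantics, not the script's actual code: the function name validate is hypothetical, and the real tool may normalize whitespace or interpret patterns differently.

```python
import re


def validate(actual: str, expected: str, mode: str = "exact") -> bool:
    """Compare actual output with an expectation under one of three modes."""
    if mode == "exact":
        # Byte-for-byte equality; any whitespace difference fails.
        return actual == expected
    if mode == "contains":
        # The expected text must appear somewhere in the output.
        return expected in actual
    if mode == "pattern":
        # Assumption: the expectation is treated as a regular expression.
        return re.search(expected, actual) is not None
    raise ValueError(f"unknown validation mode: {mode}")
```

Exact matching suits deterministic outputs; contains and pattern are more forgiving when outputs include timestamps or other variable parts.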
When skill behavior intentionally changes:
Important: Don't automatically update baselines when tests fail. Investigate first to ensure it's not a regression.
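One way to make that caution hard to bypass is to have baseline creation refuse to overwrite an existing file unless explicitly told to. The sketch below illustrates the idea; create_baseline and its overwrite parameter are hypothetical and do not describe how --create-baseline actually behaves.

```python
import shutil
from pathlib import Path


def create_baseline(output_file: str, baseline_dir: str, overwrite: bool = False) -> Path:
    """Copy a known-good output into the baseline directory.

    Raises FileExistsError if a baseline with the same name already exists
    and overwrite is False, so updates are always a deliberate choice.
    """
    src = Path(output_file)
    dest_dir = Path(baseline_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name

    if dest.exists() and not overwrite:
        raise FileExistsError(f"baseline already exists: {dest}")

    shutil.copyfile(src, dest)
    return dest
```

Requiring an explicit overwrite=True leaves a visible trace in scripts and reviews whenever a baseline is replaced.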
Organize test files and data using this structure:
tests/
├── your-skill-tests.json # Main test suite
├── fixtures/ # Test input files
│ ├── sample_input.pdf
│ ├── test_data.csv
│ └── edge_case_data.txt
├── baselines/ # Expected outputs for regression tests
│ ├── baseline_v1.txt
│ └── baseline_v2.json
└── outputs/ # Actual test outputs (gitignored)
    └── test_run_*.txt
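The layout above can be scaffolded with a few lines of Python. This is only a convenience sketch: scaffold_tests_dir is a hypothetical helper, and writing a .gitignore inside tests/ is one assumed way to keep outputs/ out of version control.

```python
from pathlib import Path


def scaffold_tests_dir(root: str) -> list[str]:
    """Create tests/fixtures, tests/baselines and tests/outputs under root."""
    base = Path(root) / "tests"
    for sub in ("fixtures", "baselines", "outputs"):
        (base / sub).mkdir(parents=True, exist_ok=True)

    # Ignore actual test outputs so only fixtures and baselines are committed.
    gitignore = base / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("outputs/\n")

    return sorted(p.name for p in base.iterdir())
```

The function is idempotent, so it can be rerun safely on a skill that already has part of the structure in place.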
Am I building a new skill? → Generate test template → Customize tests → Run tests
Am I updating an existing skill? → Run existing tests → Add new test cases for new functionality → Update baselines if needed
Am I debugging a failing test? → Run with --verbose → Compare outputs using validate_test_results.py → Fix issue or update test
Do I want to add regression tests? → Capture current output as baseline → Create regression test case → Document what the baseline represents
Creates test case templates based on skill structure.
# Basic usage
generate_test_template.py /path/to/skill
# Specify output file
generate_test_template.py /path/to/skill --output my-tests.json
# YAML format
generate_test_template.py /path/to/skill --format yaml
Executes test suites and reports results.
# Run all tests
run_tests.py test-suite.json
# Run with skill path
run_tests.py test-suite.json --skill-path /mnt/skills/public/pdf
# Verbose output
run_tests.py test-suite.json --verbose
Output:
Validates outputs against expected results and manages baselines.
# Compare two files (exact match)
validate_test_results.py actual.txt expected.txt
# Check if output contains string
validate_test_results.py output.txt expected.txt --mode contains
# Pattern matching
validate_test_results.py output.txt pattern.txt --mode pattern
# Create baseline
validate_test_results.py --create-baseline output.txt baselines/
For detailed guidance, see:
references/test_patterns.md - Examples of test cases for different skill types.
references/writing_tests.md - Best practices for effective testing.
See assets/test_template.json for a complete example test suite.