skills/skill-tester/SKILL.md
Test skills end-to-end: design test cases, run with/without skill, grade results, test description triggering accuracy, produce improvement report. Use when: "протестируй скилл", "запусти тесты для скилла", "проверь скилл", "run skill tests", "test this skill", "skill eval", "оцени скилл", "придумай тесты для скилла", "создай сценарии тестирования"
npx skillsauth add pavel-molyanov/molyanov-ai-dev skill-testerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Design tests, run them, grade results, evaluate description triggering, produce actionable report. All in one workflow — no separate test-design step needed.
You are team lead, test designer, user actor, and analyst — all in one.
Design test prompts applying criteria from test-design-guide.md (realistic prompts, assertion design, persona setup):
Design trigger eval queries applying patterns from trigger-eval-guide.md:
Present the full test plan:
"Here are the test cases and trigger queries I've prepared. Do these look right, or do you want to adjust anything?"
Wait for confirmation before proceeding.
Checkpoint: User confirmed test plan. All prompts have assertions. Trigger eval queries prepared.
~/.claude/skill-tests/{skill-name}/iteration-{N}/
(N = 1 for first run, increment for re-runs)evals.json with all prompts and assertionstrigger-evals.json with all trigger queriesShow plan to user: "I'll run {N} scenarios, {M} runners total. Model: {model}. Proceed?"
For each scenario, spawn all runners in parallel:
With-skill runners (2 per scenario):
Skill(skill="{tested-skill-name}")run_in_background: trueBaseline runner (1 per scenario):
run_in_background: trueSave each runner's task_id — needed for grader agents to retrieve transcripts.
Scenarios run sequentially. Runners within a scenario run in parallel.
If runners send questions, answer in character per the scenario's persona. Rules:
When each runner completes, immediately save timing data:
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
Save to timing.json in each runner's result directory.
Checkpoint: All runners completed. Timing data captured.
When all runners for a scenario finish, spawn grader agents — one per runner. Delegate transcript analysis to grader agents — transcripts are large and reading them directly would exhaust the lead agent's context, leaving no room for report compilation.
Each grader receives instructions from grading-guide.md and:
TaskOutput(task_id) for transcript)Spawn all graders in parallel. Wait for all to return.
Using only grader outputs (not transcripts):
Clean up runners for this scenario before moving to the next one.
Across all scenarios, compute:
Surface patterns the aggregate stats might hide:
scripts/ directory.Checkpoint: All scenarios graded. Benchmark computed. Analyst observations recorded.
For each trigger eval query from trigger-evals.json:
Categorize each query:
Trigger accuracy = (true positives + true negatives) / total queries
False negative rate = false negatives / should-trigger queries
False positive rate = false positives / should-not-trigger queries
False negatives (undertriggering) are the most costly — users won't discover the skill exists. False positives waste time but are less harmful.
If trigger accuracy < 85% or false negative rate > 20%:
Checkpoint: Trigger accuracy calculated. Description improvement suggested if needed.
Structure the report according to report-template.md.
The report includes:
Save to: ~/.claude/skill-tests/{skill-name}/reports/{timestamp}-report.md
Show report to user: "Here's the test report. Key findings: [summary]. The report is at [path] — you can share it with skill-master to apply fixes."
TeamDelete after report delivery.
If the user wants to iterate after receiving the report:
iteration-{N+1}/When iterating, keep these principles in mind:
development
Creates user-spec.md through adaptive interview with codebase scanning and dual validation. Use when: "сделай юзер спек", "проведи интервью для юзер спека", "создай юзерспек", "user spec", "detailed planning", "хочу продумать фичу", "опиши требования к фиче", "сделай описание фичи", "/new-user-spec" For tech planning use tech-spec-planning. For project planning use project-planning.
testing
Testing methodology: when to write which tests, how to ensure test quality, test pyramid strategy. Use when: "напиши тесты", "как тестировать", "проанализируй тесты", "проверь качество тестов", "ревью тестов", "тестовая стратегия"
testing
Creates tech-spec.md with architecture, decisions, testing strategy, and implementation plan. Use when: "сделай техспек", "составь техспек", "техническая спецификация", "tech spec", "создай тз", "составь тз", "new-tech-spec", "/new-tech-spec" Requires existing user-spec.md as input (create with user-spec-planning skill first if missing).
tools
Decompose approved tech-spec into atomic task files with parallel creation and validation. Use when: "разбей на задачи", "декомпозиция", "decompose tech-spec", "создай задачи из техспека", "/decompose-tech-spec"