plugin-tools/archive/test-skill/SKILL.md
Run or generate test suites for any skill. Use when testing a skill before deployment, after making changes, before/after plugin upgrades, when validating skill behavior, or when the user says "test skill", "run skill tests", "generate tests for skill", or "check for regressions".
npx skillsauth add kenneth-liao/ai-launchpad-marketplace test-skillInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Run or generate test suites for any skill -- task, discipline, or orchestrator -- and track results across versions for regression detection.
Arguments: $ARGUMENTS -- skill identifier (e.g., creator-stack:write-note, path to skill dir) and optional flags (--generate, --headless, --tag <tag>). If no skill specified, ask which one.
Positioning: Complements skill-creator (which focuses on iterative skill improvement). This skill answers: "does this skill still work?" Test artifacts are schema-compatible with skill-creator for deeper analysis when needed.
Find the skill to test:
plugin:skill format, resolve:
<cwd>/<plugin>/skills/<skill>/SKILL.md (source mode -- marketplace repo)~/.claude/plugins/cache/*/<plugin>/*/skills/<skill>/SKILL.md (installed mode)Set SKILL_ROOT to the directory containing SKILL.md.
Set PLUGIN_ROOT to the parent plugin directory (two levels up from skill).
Read SKILL.md and classify the skill into one of three types. This determines test generation strategy and execution behavior.
Read the skill's full content and apply these checks in order:
Discipline -- Skill enforces rules agents might resist:
| Excuse | Reality |)Orchestrator -- Skill sequences other skills without generating content:
plugin:skill backtick syntaxTask -- Everything else (default):
Set SKILL_TYPE to task, discipline, or orchestrator.
Present to user: "Detected [type] skill. Does that look right?" Allow override.
Look for ${SKILL_ROOT}/tests/evals.json.
If found: Validate the file structure against the schema (see references/schemas.md). Report eval count and proceed to Run Test Suite.
If not found: Inform the user and offer options:
references/test-generation.md)If --generate flag was passed, skip straight to generation without asking.
After generation, save to ${SKILL_ROOT}/tests/evals.json and proceed to Run Test Suite.
Create an ephemeral workspace as a sibling to the skill directory:
WORKSPACE="${SKILL_ROOT}-test-workspace"
RUN_DIR="${WORKSPACE}/run-$(date -u +%Y-%m-%dT%H-%M-%S)"
mkdir -p "${RUN_DIR}"
mkdir -p "${WORKSPACE}/history"
If the workspace is inside a git repo, check for a .gitignore entry. If missing, suggest adding *-test-workspace/ to .gitignore.
If --tag <tag> was provided, filter evals to only those with a matching tag. Otherwise run all evals.
For each eval in evals.json, create a task and dispatch subagents:
Spawn an executor subagent per eval. Independent evals run in parallel.
Executor instructions:
You are a skill test executor. Your job:
1. Read the skill at: ${SKILL_ROOT}/SKILL.md
2. Read any referenced files the skill points to
3. Execute this prompt as if you were a user invoking the skill:
"${eval.prompt}"
4. Use any provided input files from: ${SKILL_ROOT}/tests/fixtures/
5. Save your complete output to: ${RUN_DIR}/eval-${eval.id}/with_skill/outputs/
6. Write a detailed transcript to: ${RUN_DIR}/eval-${eval.id}/with_skill/transcript.md
Transcript format:
- Eval prompt (verbatim)
- Each action taken with tool used and result
- Final output
- Any issues encountered
Discipline skills with "run_baseline": true -- spawn a second executor WITHOUT the skill loaded:
You are a test executor. Execute this prompt WITHOUT any skill guidance.
Just respond naturally as you would by default.
"${eval.prompt}"
Save transcript to: ${RUN_DIR}/eval-${eval.id}/without_skill/transcript.md
Save outputs to: ${RUN_DIR}/eval-${eval.id}/without_skill/outputs/
After each executor completes, spawn a grader subagent:
You are a skill test grader. Evaluate whether the execution met expectations.
Read the transcript at: ${RUN_DIR}/eval-${eval.id}/with_skill/transcript.md
Read outputs at: ${RUN_DIR}/eval-${eval.id}/with_skill/outputs/
Grade each expectation as PASS or FAIL with evidence.
IMPORTANT: Executors simulate skill execution -- they read the skill and
produce the output it would generate, but do not run live commands against
real backends. When an expectation says "runs command X", grade PASS if the
executor correctly constructed and presented the command in its output.
Do not penalize for simulated vs actual execution.
Save results to: ${RUN_DIR}/eval-${eval.id}/with_skill/grading.json
Use the grading.json schema from references/schemas.md:
{
"expectations": [
{"text": "...", "passed": true/false, "evidence": "..."}
],
"summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.0-1.0}
}
Discipline skill grading -- the grader also checks:
anti_behaviors that appeared in the with-skill run are flagged as FAILOrchestrator skill grading -- the grader checks the transcript for:
When --headless flag is passed or when invoked via claude -p, run the same execution logic but:
PASS or FAIL on the first line# Example headless invocation:
claude -p "/test-skill creator-stack:write-note --headless" \
--permission-mode bypassPermissions \
--add-dir /path/to/plugin
Note: Headless mode is slower (sequential) but provides full isolation. Use it for final validation before shipping, not for iterative development.
After all evals are graded, compile a run summary:
grading.json files from the run directory${RUN_DIR}/summary.json -- see references/schemas.md for the full summary.json schemaCheck for previous snapshots at ${WORKSPACE}/history/snapshots.json.
If previous snapshots exist:
Append current run to ${WORKSPACE}/history/snapshots.json -- see references/schemas.md for the snapshots.json schema.
Include:
<plugin>@<version> from plugin.json)Present a concise report to the user:
## Test Results: <skill-name> (<skill-type>)
Pass rate: X% (N/M) [PASS | REGRESSION from Y% | FIRST RUN]
| Eval | Tags | Status | Details |
|------|------|--------|---------|
| #1 ... | happy-path | PASS | 4/4 expectations |
| #2 ... | edge-case | FAIL | 2/3 -- "Voice matches..." failed |
[If regression detected:]
Regressed expectations (were PASS, now FAIL):
- "expectation text" (eval #N)
Results saved to: <workspace path>
Snapshot recorded in: <snapshots.json path>
development
Manage scheduled Claude Code tasks — add (recurring or one-off), list, pause, resume, remove, view results, and test execution of skills, prompts, and scripts with safety controls and notifications. Use when the user mentions scheduling, cron, automated tasks, recurring tasks, background tasks, running something on a schedule, periodic execution, or wants a skill/prompt/script to run automatically at a set time. Cross-platform (macOS, Linux, Windows).
tools
Upgrade a plugin's skills, hooks, and patterns to align with latest Claude Code capabilities and best practices. Use when a plugin needs modernization, after Claude Code updates, or when the user says "upgrade plugin", "modernize plugin", or "update plugin to latest patterns".
tools
Use when reviewing how skills performed during a session, when the user wants to analyze skill invocations and identify improvements, or when the user says "skill retro", "review skills", "how did skills do", "improve this skill", or "skill retrospective".
tools
Update the context system based on our conversation so far.