
kenneth-liao/test-skill

Name: test-skill
Author: kenneth-liao

plugin-tools/archive/test-skill/SKILL.md

npx skillsauth add kenneth-liao/ai-launchpad-marketplace test-skill


Test Skill

Run or generate test suites for any skill -- task, discipline, or orchestrator -- and track results across versions for regression detection.

Arguments: $ARGUMENTS -- skill identifier (e.g., creator-stack:write-note, path to skill dir) and optional flags (--generate, --headless, --tag <tag>). If no skill specified, ask which one.

Positioning: Complements skill-creator (which focuses on iterative skill improvement). This skill answers: "does this skill still work?" Test artifacts are schema-compatible with skill-creator for deeper analysis when needed.

Locate Target Skill

Find the skill to test:

1. If the argument matches the plugin:skill format, resolve:
   - <cwd>/<plugin>/skills/<skill>/SKILL.md (source mode -- marketplace repo)
   - ~/.claude/plugins/cache/*/<plugin>/*/skills/<skill>/SKILL.md (installed mode)
2. If the argument is a path, use it directly.
3. If not found, list available skills and ask the user to clarify.

Set SKILL_ROOT to the directory containing SKILL.md. Set PLUGIN_ROOT to the parent plugin directory (two levels up from skill).
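The lookup rules above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the function names `candidate_skill_files` and `skill_and_plugin_root` are hypothetical, and the sketch assumes an identifier containing `:` but no `/` is a plugin:skill pair.

```python
from pathlib import Path


def candidate_skill_files(identifier: str, cwd: Path, home: Path) -> list[Path]:
    """Return existing SKILL.md files for a plugin:skill identifier or a path."""
    if ":" in identifier and "/" not in identifier:
        plugin, skill = identifier.split(":", 1)
        # Source mode: <cwd>/<plugin>/skills/<skill>/SKILL.md
        source = cwd / plugin / "skills" / skill / "SKILL.md"
        # Installed mode: ~/.claude/plugins/cache/*/<plugin>/*/skills/<skill>/SKILL.md
        cache = home / ".claude" / "plugins" / "cache"
        installed = []
        if cache.is_dir():
            installed = sorted(cache.glob(f"*/{plugin}/*/skills/{skill}/SKILL.md"))
        return [p for p in [source, *installed] if p.is_file()]
    # Otherwise treat the argument as a path to a skill directory or to SKILL.md.
    path = Path(identifier).expanduser()
    if path.is_dir():
        path = path / "SKILL.md"
    return [path] if path.is_file() else []


def skill_and_plugin_root(skill_file: Path) -> tuple[Path, Path]:
    """SKILL_ROOT holds SKILL.md; PLUGIN_ROOT is two levels above SKILL_ROOT."""
    skill_root = skill_file.parent
    return skill_root, skill_root.parent.parent
```

An empty result corresponds to the "not found" case, where the skill lists available skills and asks the user.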

Detect Skill Type

Read SKILL.md and classify the skill into one of three types. This determines test generation strategy and execution behavior.

Detection Heuristic

Read the skill's full content and apply these checks in order:

1. Discipline -- Skill enforces rules agents might resist:
   - Contains rationalization tables (| Excuse | Reality |)
   - Contains "Red Flags" or "STOP" sections
   - Uses "Iron Law", "NEVER", "No exceptions" enforcement language
   - Contains pressure scenarios or compliance checklists
2. Orchestrator -- Skill sequences other skills without generating content:
   - Invokes other skills via plugin:skill backtick syntax
   - Contains "Orchestrators never generate content" or equivalent
   - Workflow is primarily about sequencing, not producing output
3. Task -- Everything else (default):
   - Produces output (text, files, visuals, data)
   - Single responsibility -- does one thing well

Set SKILL_TYPE to task, discipline, or orchestrator.

Present to user: "Detected [type] skill. Does that look right?" Allow override.
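As a rough illustration of the ordered checks, the textual markers can be approximated with regular expressions. This is only a sketch under assumptions: the marker lists and the function name `detect_skill_type` are hypothetical, and signals that need judgment (pressure scenarios, whether a workflow is "primarily about sequencing") are not captured, which is one reason the user confirmation step matters.

```python
import re

# Textual markers taken from the heuristic; the regex forms are assumptions.
DISCIPLINE_MARKERS = [
    r"\|\s*Excuse\s*\|\s*Reality\s*\|",  # rationalization table header
    r"\bRed Flags\b",
    r"\bSTOP\b",
    r"\bIron Law\b",
    r"\bNEVER\b",
    r"\bNo exceptions\b",
]
ORCHESTRATOR_MARKERS = [
    r"`[\w-]+:[\w-]+`",  # plugin:skill written in backticks
    r"Orchestrators never generate content",
]


def detect_skill_type(skill_text: str) -> str:
    """Apply the checks in order: discipline, then orchestrator, else task."""
    if any(re.search(p, skill_text) for p in DISCIPLINE_MARKERS):
        return "discipline"
    if any(re.search(p, skill_text) for p in ORCHESTRATOR_MARKERS):
        return "orchestrator"
    return "task"
```

A single capitalized "NEVER" is enough to trigger the discipline branch here, so the result is best treated as a first guess to present for override.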

Check for Existing Tests

Look for ${SKILL_ROOT}/tests/evals.json.

If found: Validate the file structure against the schema (see references/schemas.md). Report eval count and proceed to Run Test Suite.

If not found: Inform the user and offer options:

- Generate tests -- Auto-generate a draft test suite from the skill content (see references/test-generation.md)
- Skip -- Exit without testing

If --generate flag was passed, skip straight to generation without asking.

After generation, save to ${SKILL_ROOT}/tests/evals.json and proceed to Run Test Suite.

Run Test Suite

Prepare Workspace

Create an ephemeral workspace as a sibling to the skill directory:

WORKSPACE="${SKILL_ROOT}-test-workspace"
RUN_DIR="${WORKSPACE}/run-$(date -u +%Y-%m-%dT%H-%M-%S)"
mkdir -p "${RUN_DIR}"
mkdir -p "${WORKSPACE}/history"

If the workspace is inside a git repo, check whether .gitignore already excludes it. If not, suggest adding *-test-workspace/ to .gitignore.

Filter Evals

If --tag <tag> was provided, filter evals to only those with a matching tag. Otherwise run all evals.
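A minimal sketch of the tag filter follows. The real eval schema lives in references/schemas.md, which is not reproduced here, so the `tags` list field and the function name `filter_evals` are assumptions made for illustration.

```python
from typing import Optional


def filter_evals(evals: list[dict], tag: Optional[str] = None) -> list[dict]:
    """Keep evals carrying the given tag; with no tag, keep every eval."""
    if tag is None:
        return list(evals)
    # Assumes each eval stores its tags in a list under "tags" (hypothetical field).
    return [e for e in evals if tag in e.get("tags", [])]
```

Evals without a `tags` field are dropped when a tag is requested and kept otherwise.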

Interactive Mode (Default)

For each eval in evals.json, create a task and dispatch subagents:

Step 1: Execute

Spawn an executor subagent per eval. Independent evals run in parallel.

Executor instructions:

You are a skill test executor. Your job:

1. Read the skill at: ${SKILL_ROOT}/SKILL.md
2. Read any referenced files the skill points to
3. Execute this prompt as if you were a user invoking the skill:

   "${eval.prompt}"

4. Use any provided input files from: ${SKILL_ROOT}/tests/fixtures/
5. Save your complete output to: ${RUN_DIR}/eval-${eval.id}/with_skill/outputs/
6. Write a detailed transcript to: ${RUN_DIR}/eval-${eval.id}/with_skill/transcript.md

Transcript format:
- Eval prompt (verbatim)
- Each action taken with tool used and result
- Final output
- Any issues encountered

Discipline skills with "run_baseline": true -- spawn a second executor WITHOUT the skill loaded:

You are a test executor. Execute this prompt WITHOUT any skill guidance.
Just respond naturally as you would by default.

"${eval.prompt}"

Save transcript to: ${RUN_DIR}/eval-${eval.id}/without_skill/transcript.md
Save outputs to: ${RUN_DIR}/eval-${eval.id}/without_skill/outputs/

Step 2: Grade

After each executor completes, spawn a grader subagent:

You are a skill test grader. Evaluate whether the execution met expectations.

Read the transcript at: ${RUN_DIR}/eval-${eval.id}/with_skill/transcript.md
Read outputs at: ${RUN_DIR}/eval-${eval.id}/with_skill/outputs/

Grade each expectation as PASS or FAIL with evidence.

IMPORTANT: Executors simulate skill execution -- they read the skill and
produce the output it would generate, but do not run live commands against
real backends. When an expectation says "runs command X", grade PASS if the
executor correctly constructed and presented the command in its output.
Do not penalize for simulated vs actual execution.

Save results to: ${RUN_DIR}/eval-${eval.id}/with_skill/grading.json

Use the grading.json schema from references/schemas.md:
{
  "expectations": [
    {"text": "...", "passed": true/false, "evidence": "..."}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.0-1.0}
}
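The `summary` object in the template above is derivable from the `expectations` list. A small sketch of that arithmetic, using a hypothetical helper name `summarize` and assuming an empty list yields a pass rate of 0.0:

```python
def summarize(expectations: list[dict]) -> dict:
    """Build the summary block from graded expectations."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: report 0.0 rather than dividing by zero when nothing was graded.
        "pass_rate": passed / total if total else 0.0,
    }
```

For example, three passing expectations out of four give a `pass_rate` of 0.75.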

Discipline skill grading -- the grader also checks:

- Baseline (without_skill) transcript shows violation of expected behavior
- With-skill transcript shows compliance
- Any anti_behaviors that appeared in the with-skill run are flagged as FAIL

Orchestrator skill grading -- the grader checks the transcript for:

- Skill invocations (look for Skill tool calls in transcript)
- Invocation order matches expectations
- No direct content generation (orchestrator didn't produce output itself)

Headless Mode

When --headless flag is passed or when invoked via claude -p, run the same execution logic but:

- All execution happens sequentially in the main loop (no subagents -- headless sessions have limited subagent support)
- Read the executor and grader instructions above and follow them inline
- After all evals complete, output a summary to stdout
- The summary should be parseable -- start with PASS or FAIL on the first line

# Example headless invocation:
claude -p "/test-skill creator-stack:write-note --headless" \
  --permission-mode bypassPermissions \
  --add-dir /path/to/plugin

Note: Headless mode is slower (sequential) but provides full isolation. Use it for final validation before shipping, not for iterative development.
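Because the headless summary starts with PASS or FAIL on its first line, a wrapper script could decide success from that line alone. The following is a sketch of such a check; the function name `headless_passed` is hypothetical, and nothing beyond the first-line convention is assumed about the rest of the summary.

```python
def headless_passed(stdout: str) -> bool:
    """Interpret the first line of a headless summary as the overall verdict."""
    lines = stdout.strip().splitlines()
    first = lines[0].strip() if lines else ""
    if first.startswith("PASS"):
        return True
    if first.startswith("FAIL"):
        return False
    # Anything else means the summary did not follow the first-line convention.
    raise ValueError(f"unparseable headless summary, first line: {first!r}")
```

Raising on an unexpected first line keeps a malformed run from being mistaken for a pass.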

Results and Reporting

Aggregate Results

After all evals are graded, compile a run summary:

1. Read all grading.json files from the run directory
2. Calculate overall pass rate and per-eval results
3. Save ${RUN_DIR}/summary.json -- see references/schemas.md for the full summary.json schema
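The aggregation step might look like the sketch below. It assumes the overall pass rate is pooled across all expectations (rather than averaged per eval), and the returned dictionary shape and the name `aggregate_run` are hypothetical; the actual summary.json schema is defined in references/schemas.md.

```python
import json
from pathlib import Path


def aggregate_run(run_dir: Path) -> dict:
    """Pool the summary blocks of every with_skill grading.json in a run."""
    per_eval = {}
    passed = total = 0
    for grading_file in sorted(run_dir.glob("eval-*/with_skill/grading.json")):
        summary = json.loads(grading_file.read_text())["summary"]
        # The directory is named eval-<id>; strip the prefix to recover the id.
        eval_id = grading_file.parent.parent.name.removeprefix("eval-")
        per_eval[eval_id] = summary
        passed += summary["passed"]
        total += summary["total"]
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "evals": per_eval,
    }
```

The result could then be written with `(run_dir / "summary.json").write_text(json.dumps(result, indent=2))` once its fields are mapped onto the real schema.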

Regression Detection

Check for previous snapshots at ${WORKSPACE}/history/snapshots.json.

If previous snapshots exist:

- Compare current pass rate against the most recent snapshot
- Compare per-expectation results -- identify any expectations that previously passed but now fail
- Flag regressions prominently in the report
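The per-expectation comparison reduces to a lookup: a regression is any expectation that passed before and fails now. In this sketch, results are assumed to be keyed by an `(eval id, expectation text)` pair, which is an illustrative choice rather than the stored snapshot format; `find_regressions` is a hypothetical name.

```python
def find_regressions(previous: dict, current: dict) -> list:
    """Return keys that passed in the previous run and fail in the current one."""
    return sorted(
        key
        for key, was_passing in previous.items()
        # Expectations missing from the current run are not counted as regressions.
        if was_passing and current.get(key) is False
    )
```

Expectations that were already failing, or that were removed from the suite, are deliberately left out so the report only lists genuine PASS-to-FAIL changes.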

Update Snapshot History

Append current run to ${WORKSPACE}/history/snapshots.json -- see references/schemas.md for the snapshots.json schema.

Include:

- Timestamp (ISO-8601)
- Skill version (<plugin>@<version> from plugin.json)
- Pass/fail/total counts and pass rate
- Notes field (ask user if they want to add a note, e.g., "Baseline before voice rewrite"; leave empty if not)
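Appending a snapshot could be sketched as follows. The sketch assumes snapshots.json holds a JSON array and uses hypothetical field names (`timestamp`, `skill_version`, `notes`, and the counts); the authoritative layout is the snapshots.json schema in references/schemas.md.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_snapshot(history_file: Path, skill_version: str,
                    passed: int, total: int, notes: str = "") -> dict:
    """Append one run record to the snapshot history and return it."""
    # Assumption: the history file is a JSON array of snapshot objects.
    snapshots = json.loads(history_file.read_text()) if history_file.exists() else []
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "skill_version": skill_version,  # e.g. "<plugin>@<version>" from plugin.json
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "notes": notes,
    }
    snapshots.append(entry)
    history_file.parent.mkdir(parents=True, exist_ok=True)
    history_file.write_text(json.dumps(snapshots, indent=2))
    return entry
```

Keeping the file append-only means the most recent snapshot is always the last element, which is what the regression comparison reads.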

Display Report

Present a concise report to the user:

## Test Results: <skill-name> (<skill-type>)

Pass rate: X% (N/M) [PASS | REGRESSION from Y% | FIRST RUN]

| Eval | Tags | Status | Details |
|------|------|--------|---------|
| #1 ... | happy-path | PASS | 4/4 expectations |
| #2 ... | edge-case | FAIL | 2/3 -- "Voice matches..." failed |

[If regression detected:]
Regressed expectations (were PASS, now FAIL):
- "expectation text" (eval #N)

Results saved to: <workspace path>
Snapshot recorded in: <snapshots.json path>


Run or generate test suites for any skill. Use when testing a skill before deployment, after making changes, before/after plugin upgrades, when validating skill behavior, or when the user says "test skill", "run skill tests", "generate tests for skill", or "check for regressions".

124 stars

tools

Updated Apr 6, 2026


Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 8:39 PM (1.8s, 1 file scanned)

SKILL.md

name: test-skill
description: Run or generate test suites for any skill. Use when testing a skill before deployment, after making changes, before/after plugin upgrades, when validating skill behavior, or when the user says "test skill", "run skill tests", "generate tests for skill", or "check for regressions".
user-invocable: true




Install manually


# Clone the repo
git clone https://github.com/kenneth-liao/ai-launchpad-marketplace.git

# Copy into Claude Code skills folder (global)
cp -r ai-launchpad-marketplace/plugin-tools/archive/test-skill ~/.claude/skills/


Repository

kenneth-liao/ai-launchpad-marketplace

124 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT