latestaiagents/skill-testing

Name: skill-testing
Author: latestaiagents

skills/skills-authoring/skill-testing/SKILL.md

npx skillsauth add latestaiagents/agent-skills skill-testing

Skill Testing

An untested skill is a skill that silently breaks. Build a test harness covering frontmatter, content, and activation, run it in CI, and you'll catch regressions long before users file bugs.

When to Use

Setting up a new skills repository
Adding skills to an existing repo without test coverage
Debugging why a skill exists but never activates
Reviewing PRs that add or modify skills

Three Layers of Testing

Static validation — frontmatter parses, required fields present
Content lint — length, structure, code-block sanity
Activation testing — does the skill fire on realistic queries?

Layer 1: Static Validation

import matter from "gray-matter";
import { readFileSync } from "fs";
import { globby } from "globby";

async function validateAll() {
  const files = await globby("skills/**/SKILL.md");
  const errors: string[] = [];

  for (const file of files) {
    const raw = readFileSync(file, "utf-8");
    let parsed;
    try {
      parsed = matter(raw);
    } catch (e) {
      errors.push(`${file}: invalid YAML`);
      continue;
    }
    const { data } = parsed;
    if (!data.name) errors.push(`${file}: missing name`);
    if (!data.description) errors.push(`${file}: missing description`);
    if (data.description && data.description.length < 50) {
      errors.push(`${file}: description too short (${data.description.length} chars)`);
    }
    // Name must match directory
    const dirName = file.split("/").slice(-2)[0];
    if (data.name && data.name !== dirName) {
      errors.push(`${file}: name '${data.name}' doesn't match directory '${dirName}'`);
    }
  }

  if (errors.length) {
    console.error(errors.join("\n"));
    process.exit(1);
  }
}
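// Surface unexpected failures (e.g. unreadable files) instead of leaving an unhandled rejection
validateAll().catch((err) => {
  console.error(err);
  process.exit(1);
});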

Run this on every PR. The author estimates it catches about 80% of authoring mistakes.

Layer 2: Content Lint

Check the body for common quality issues:

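function lintBody(body: string, file: string): string[] {
  const issues: string[] = [];
  // 15,000 characters is used as a rough proxy for about 500 lines
  if (body.length > 15_000) issues.push("body too long (>15,000 chars, ~500 lines equiv)");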
  if (!body.includes("## When to Use") && !body.includes("## When To Use")) {
    issues.push("missing 'When to Use' section");
  }
  if (!body.includes("## Best Practices")) {
    issues.push("missing 'Best Practices' section");
  }
  if (!/```/.test(body)) issues.push("no code example");
  // detect stale model IDs
  if (/claude-3[.-]/.test(body)) issues.push("stale model ID (claude-3-*)");
  // detect TODO markers
  if (/TODO|FIXME|XXX/.test(body)) issues.push("contains TODO marker");
  return issues.map((i) => `${file}: ${i}`);
}

Enforce via CI. Fail on errors; warn on style issues.
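
One way to implement the fail-versus-warn split is to tag each lint finding with a severity and derive the exit code from errors only. The sketch below is an assumption about how that might look: the `Finding` type, the `lintWithSeverity` and `exitCodeFor` helpers, and the choice of which checks count as errors are all illustrative and not part of the original linter.

```typescript
type Severity = "error" | "warning";

interface Finding {
  file: string;
  message: string;
  severity: Severity;
}

// Built at runtime so this example does not contain a literal code fence.
const FENCE = "`".repeat(3);

// Hypothetical classification: structural problems fail the build,
// style problems only produce warnings.
function lintWithSeverity(body: string, file: string): Finding[] {
  const findings: Finding[] = [];
  const add = (severity: Severity, message: string): void => {
    findings.push({ file, message, severity });
  };

  if (!/## When [Tt]o Use/.test(body)) add("error", "missing 'When to Use' section");
  if (!body.includes(FENCE)) add("error", "no code example");
  if (body.length > 15_000) add("warning", "body too long");
  if (/TODO|FIXME|XXX/.test(body)) add("warning", "contains TODO marker");
  return findings;
}

// CI exits non-zero only when at least one finding is an error.
function exitCodeFor(findings: Finding[]): number {
  return findings.some((f) => f.severity === "error") ? 1 : 0;
}
```

A CI script could print every finding and then call `process.exit(exitCodeFor(findings))`, so warnings stay visible in the log without blocking the merge.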

Layer 3: Activation Testing

The highest-value and hardest layer. For each skill, maintain a test file:

# skills/claude-4-6-features/memory-tool/tests.yaml
should_activate:
  - "how do I give my agent persistent memory?"
  - "I want Claude to remember user preferences across sessions"
  - "What's the memory tool in the Anthropic API?"
  - "My agent forgets everything after each conversation"
should_not_activate:
  - "How do I reduce my agent's token cost?"        # about cost, not memory
  - "My RAM is full, how do I fix it?"              # system memory, not the tool
  - "Add caching to my prompts"                     # different feature

Run these through your skill-activation engine and assert:

// Assumes engine, tests, skillName, and assert are in scope.
// await matches the smoke-test usage and is harmless if selectSkills is synchronous.
for (const query of tests.should_activate) {
  const activated = await engine.selectSkills(query);
  assert(activated.includes(skillName), `${skillName} failed to activate on: ${query}`);
}
for (const query of tests.should_not_activate) {
  const activated = await engine.selectSkills(query);
  assert(!activated.includes(skillName), `${skillName} wrongly activated on: ${query}`);
}

If you don't have a real activation engine handy, use Claude itself:

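import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

interface Skill {
  name: string;
  description: string;
}

async function measureActivation(skill: Skill, query: string): Promise<boolean> {
  const prompt = `Given this skill:
Name: ${skill.name}
Description: ${skill.description}

Should this skill activate for the query: "${query}"?
Answer only "yes" or "no".`;

  const response = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 10,
    messages: [{ role: "user", content: prompt }],
  });
  // content is a list of typed blocks; only text blocks carry a .text field
  const first = response.content[0];
  return first?.type === "text" && first.text.trim().toLowerCase().startsWith("y");
}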

Cheap, repeatable enough for CI, and effective at catching bad descriptions.

Layer 3b: Precision/Recall Scoreboard

Aggregate results across all skills:

interface Result {
  skill: string;
  truePositive: number;  // should activate + did
  falsePositive: number; // shouldn't activate + did
  falseNegative: number; // should activate + didn't
}

function score(results: Result[]) {
  for (const r of results) {
    const precision = r.truePositive / (r.truePositive + r.falsePositive || 1);
    const recall = r.truePositive / (r.truePositive + r.falseNegative || 1);
    console.log(`${r.skill}: P=${precision.toFixed(2)} R=${recall.toFixed(2)}`);
  }
}

Target: P ≥ 0.85, R ≥ 0.90. Lower P = over-activation; lower R = under-activation.
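
To feed the scoreboard, each skill's suite has to be reduced to counts. The following is a minimal sketch of that tally plus a threshold check. `SkillTests`, `Counts`, `tally`, `meetsTargets`, and the `select` callback (standing in for whatever activation engine is in use) are hypothetical names introduced for illustration.

```typescript
interface SkillTests {
  should_activate: string[];
  should_not_activate: string[];
}

interface Counts {
  skill: string;
  truePositive: number;
  falsePositive: number;
  falseNegative: number;
}

// `select` stands in for the activation engine: given a query, it returns
// the names of the skills that were chosen.
function tally(
  skill: string,
  tests: SkillTests,
  select: (query: string) => string[],
): Counts {
  const counts: Counts = { skill, truePositive: 0, falsePositive: 0, falseNegative: 0 };
  for (const q of tests.should_activate) {
    if (select(q).includes(skill)) counts.truePositive++;
    else counts.falseNegative++;
  }
  for (const q of tests.should_not_activate) {
    if (select(q).includes(skill)) counts.falsePositive++;
  }
  return counts;
}

// Same precision/recall formulas as the scoreboard, compared against the targets.
function meetsTargets(c: Counts, minP = 0.85, minR = 0.9): boolean {
  const precision = c.truePositive / (c.truePositive + c.falsePositive || 1);
  const recall = c.truePositive / (c.truePositive + c.falseNegative || 1);
  return precision >= minP && recall >= minR;
}
```

As a worked case: if 9 of 10 positive queries fire and 1 of 5 negative queries fires, then TP = 9, FN = 1, FP = 1, giving P = 9/10 = 0.90 and R = 9/10 = 0.90, which meets both targets.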

Regression Testing

When you change a skill's description, re-run the activation suite:

npm run test:skills -- --skill memory-tool

Any query that previously activated and now doesn't (or vice versa) is flagged for review.
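
A sketch of that flagging step, assuming each run is stored as a map from query to whether the skill activated. The snapshot format and the `diffActivations` helper are assumptions for illustration; the source does not specify how `test:skills` stores or compares results.

```typescript
// query → did the skill activate on that run?
type Snapshot = Record<string, boolean>;

interface Flip {
  query: string;
  before: boolean;
  after: boolean;
}

// Returns the queries whose activation outcome changed between two runs.
// Queries present in only one snapshot are skipped, since there is nothing to compare.
function diffActivations(previous: Snapshot, current: Snapshot): Flip[] {
  const flips: Flip[] = [];
  for (const [query, before] of Object.entries(previous)) {
    if (!(query in current)) continue;
    const after = current[query];
    if (after !== before) flips.push({ query, before, after });
  }
  return flips;
}
```

A flip from `true` to `false` on a `should_activate` query points to lost recall; a flip from `false` to `true` on a `should_not_activate` query points to new over-activation.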

Content Examples Testing

If your skill shows runnable code, compile-check it:

function extractCodeBlocks(body: string, lang: "ts" | "py") {
  const re = new RegExp("```" + lang + "\\n([\\s\\S]*?)```", "g");
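  const blocks: string[] = [];
  let m: RegExpExecArray | null;
  // The "g" flag makes exec() resume after the previous match on each call
  while ((m = re.exec(body)) !== null) blocks.push(m[1]);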
  return blocks;
}

Then write each TS block to a temp file and run tsc --noEmit on it; for Python, use python -m py_compile. This doesn't execute the code, but it catches syntax errors (and, for TypeScript, type errors).
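
The temp-file step might look like the sketch below. It assumes `npx tsc` (with TypeScript installed) and `python3` are available on PATH; `writeBlocks` and `compiles` are hypothetical helper names, and the temp-directory prefix is arbitrary.

```typescript
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { spawnSync } from "node:child_process";

// Writes each extracted block to its own file in a fresh temp directory
// and returns the file paths.
function writeBlocks(blocks: string[], ext: "ts" | "py"): string[] {
  const dir = mkdtempSync(join(tmpdir(), "skill-blocks-"));
  return blocks.map((code, i) => {
    const path = join(dir, `block-${i}.${ext}`);
    writeFileSync(path, code, "utf-8");
    return path;
  });
}

// Runs the matching checker on one file and reports whether it exited with status 0.
// Assumption: tsc is reachable through npx, and python3 is on PATH.
function compiles(path: string): boolean {
  const result = path.endsWith(".ts")
    ? spawnSync("npx", ["tsc", "--noEmit", "--skipLibCheck", path])
    : spawnSync("python3", ["-m", "py_compile", path]);
  return result.status === 0;
}
```

Checking each block in isolation means snippets that rely on names defined elsewhere in the skill will fail to type-check, so a suite built this way may need an allowlist for intentionally partial examples.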

CI Integration

# .github/workflows/skills.yml
name: skills
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npm run validate:skills
      - run: npm run lint:skills
      - run: npm run test:activation -- --changed  # only changed skills

Gate merges on all three.
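
The `--changed` flag implies some mapping from the PR's changed files to skill names. The source does not show how the script does this; the sketch below is one possible approach, assuming the `skills/<category>/<skill>/<file>` layout seen in the paths above and a list of changed paths such as the output of `git diff --name-only`. `changedSkills` is a hypothetical helper.

```typescript
// Maps changed file paths to the unique, sorted set of skill directory names.
// Assumes the layout skills/<category>/<skill>/<file>; any other path is ignored.
function changedSkills(changedPaths: string[]): string[] {
  const names = new Set<string>();
  for (const p of changedPaths) {
    const parts = p.split("/");
    if (parts.length >= 4 && parts[0] === "skills") names.add(parts[2]);
  }
  return Array.from(names).sort();
}
```

Editing either `SKILL.md` or `tests.yaml` inside a skill directory selects that skill, so a change to the test queries re-runs the suite just as a change to the description does.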

Smoke Testing Published Skills

After deploy, run a scheduled job that exercises a known query per skill:

for (const skill of allSkills) {
  const query = skill.tests.should_activate[0];
  const activated = await engine.selectSkills(query);
  if (!activated.includes(skill.name)) await alert(`Skill ${skill.name} not activating in prod`);
}

Catches deployment/config drift.

Anti-Patterns

No CI — regressions ship to users
Only frontmatter validation — catches typos, not activation quality
Activation tests without negative cases — you miss over-activation
Hand-testing — doesn't scale past 20 skills
Ignoring score drops — one-off dips become normalized

Best Practices

Three layers: static validation, content lint, activation tests
Per-skill tests.yaml with both positive and negative queries
Target precision ≥ 0.85, recall ≥ 0.90; track in a scoreboard
Run on every PR; gate merges
Scheduled smoke tests in prod to catch drift
When scoring drops, diff the skill description first — it's almost always the cause

Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.

2 stars

development

Updated Apr 23, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 3:27 AM (53.1s, 1 file scanned)

Related Skills

latestaiagents/skill-frontmatter (documentation)
Write the YAML frontmatter for a SKILL.md file so it activates reliably: name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.
2 stars, updated Apr 23, 2026

latestaiagents/skill-activation-patterns (development)
Design skills that fire at the right moment, neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.
2 stars, updated Apr 23, 2026

latestaiagents/progressive-disclosure (development)
Structure SKILL.md content so the model reads just enough: concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.
2 stars, updated Apr 23, 2026

latestaiagents/xxe-prevention (development)
OWASP A04 - XML External Entity (XXE) Prevention. Use this skill when parsing XML, processing SOAP requests, handling SVG uploads, or working with XML-based formats. Activate when: XML parsing, SOAP, SVG upload, XML input, DOCTYPE, DTD, external entity, XML bomb, billion laughs, XSLT.
2 stars, updated Apr 23, 2026

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/skills-authoring/skill-testing ~/.claude/skills/

Repository: latestaiagents/agent-skills (2 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT