aws-samples/skill-eval

Name: skill-eval
Author: aws-samples

/SKILL.md

Evaluate AI Agent Skills across safety, quality, reliability, and cost efficiency. Audit for security issues (secrets, injection, unsafe installs), test functional correctness with-skill vs without-skill, measure trigger precision, classify cost-efficiency tradeoffs, track version lifecycle, and generate unified grades. Use when evaluating a skill before installing, auditing marketplace skills, proving your skill works with automated tests, setting up CI/CD quality gates, or comparing two skill versions. NOT for: evaluating full agent systems, testing non-skill plugins, runtime performance benchmarking, or monitoring production agent behavior.

9 stars

tools

Updated May 28, 2026

npx skillsauth add aws-samples/sample-agent-skill-eval skill-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 28, 2026, 3:48 AM (22.5s, 74 files scanned)

SKILL.md

name: skill-eval
description: Evaluate AI Agent Skills across safety, quality, reliability, and cost efficiency. Audit for security issues (secrets, injection, unsafe installs), test functional correctness with-skill vs without-skill, measure trigger precision, classify cost-efficiency tradeoffs, track version lifecycle, and generate unified grades. Use when evaluating a skill before installing, auditing marketplace skills, proving your skill works with automated tests, setting up CI/CD quality gates, or comparing two skill versions. NOT for: evaluating full agent systems, testing non-skill plugins, runtime performance benchmarking, or monitoring production agent behavior.

Skill Eval — Agent Skill Evaluation Framework

Evaluate Agent Skills across four dimensions: safety (audit), quality (functional), reliability (trigger), and cost efficiency (Pareto classification).

Quick Start

skill-eval audit /path/to/skill          # Is it safe?
skill-eval report /path/to/skill         # Full grade (audit + functional + trigger)
skill-eval functional /path/to/skill     # Quality: with-skill vs without-skill
skill-eval trigger /path/to/skill        # Reliability: activation precision

Decision Tree

"Is this skill safe?" → skill-eval audit <path>
"Full evaluation with grade" → skill-eval report <path>
"Full repo security review" → skill-eval audit <path> --include-all
"Write eval cases" → skill-eval init <path>, then edit evals/
"Compare two versions" → skill-eval compare <old> <new>
"Check for regressions" → skill-eval snapshot <path>, then skill-eval regression <path>
"Track changes" → skill-eval lifecycle <path> --save --label v1.0

Commands

| Command | Purpose |
|---------|---------|
| audit | Security & structure scan (secrets, permissions, spec compliance) |
| functional | Quality eval: runs prompts with and without skill, grades output |
| trigger | Reliability eval: tests activation precision for relevant/irrelevant queries |
| report | Unified grade combining audit (40%) + functional (40%) + trigger (20%) |
| compare | Side-by-side comparison of two skills on the same eval cases |
| snapshot | Save current audit as regression baseline |
| regression | Check for score regressions against baseline |
| lifecycle | Version tracking and change detection |
| init | Generate eval scaffold from SKILL.md frontmatter |
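The `report` row gives the weights used for the unified grade. The sketch below illustrates that weighting only. It assumes each dimension is scored on a 0-100 scale and combined as a plain weighted average; the function name `unified_score` and the input shape are hypothetical and not taken from the tool itself.

```python
# Weights taken from the `report` row of the Commands table.
WEIGHTS = {"audit": 0.40, "functional": 0.40, "trigger": 0.20}


def unified_score(scores: dict) -> float:
    """Combine per-dimension scores (assumed 0-100) into one weighted score.

    Hypothetical helper for illustration; the real tool may combine
    scores differently.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    print(unified_score({"audit": 90, "functional": 80, "trigger": 70}))
```

Because audit and functional each carry 40%, a weak trigger score moves the total half as much as an equally weak audit or functional score.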

For detailed flags and examples, see references/cli-reference.md.

Eval File Format

Functional evals (evals/evals.json):

[{"id": "case-1", "prompt": "...", "assertions": ["contains 'expected'"], "files": ["files/input.csv"]}]

Trigger queries (evals/eval_queries.json):

[{"query": "relevant question", "should_trigger": true}, {"query": "unrelated question", "should_trigger": false}]
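Before running the evals, it can help to check that both files match the shapes shown above. The following is a minimal sketch, not part of skill-eval: the helper names `check_functional` and `check_trigger` are hypothetical, and the choice of which keys are required (with `files` treated as optional) is inferred from the one-line examples rather than from a published schema.

```python
import json

# Required keys are inferred from the examples above; treating "files"
# as optional is an assumption of this sketch.
FUNCTIONAL_REQUIRED = ("id", "prompt", "assertions")


def check_functional(cases):
    """Return a list of problems found in evals/evals.json content."""
    problems = []
    for i, case in enumerate(cases):
        for key in FUNCTIONAL_REQUIRED:
            if key not in case:
                problems.append(f"case {i}: missing '{key}'")
        for key in ("assertions", "files"):
            if key in case and not isinstance(case[key], list):
                problems.append(f"case {i}: '{key}' must be a list")
    return problems


def check_trigger(queries):
    """Return a list of problems found in evals/eval_queries.json content."""
    problems = []
    for i, q in enumerate(queries):
        if not isinstance(q.get("query"), str):
            problems.append(f"query {i}: 'query' must be a string")
        if not isinstance(q.get("should_trigger"), bool):
            problems.append(f"query {i}: 'should_trigger' must be true or false")
    return problems


if __name__ == "__main__":
    functional = json.loads(
        """[{"id": "case-1", "prompt": "...", "assertions": ["contains 'expected'"], "files": ["files/input.csv"]}]"""
    )
    trigger = json.loads(
        """[{"query": "relevant question", "should_trigger": true}, {"query": "unrelated question", "should_trigger": false}]"""
    )
    print(check_functional(functional), check_trigger(trigger))
```

Both functions return an empty list when nothing is wrong, so they can be used directly as a pre-flight check in a CI step.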

Scoring

Grades: A (90+), B (80-89), C (70-79), D (60-69), F (<60). Findings deduct: CRITICAL −25, WARNING −10, INFO −2.
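The grade bands and deductions can be illustrated with a short sketch. It assumes that an audit starts from 100 points, that each finding subtracts its deduction, and that the result is floored at 0; the source states the deduction values and grade bands but not the starting score or the floor, so those, along with the names `audit_score` and `letter_grade`, are assumptions.

```python
# Deduction values as listed in the Scoring section.
DEDUCTIONS = {"CRITICAL": 25, "WARNING": 10, "INFO": 2}


def audit_score(findings, start=100):
    """Subtract each finding's deduction from `start` (assumed 100), floored at 0."""
    total = sum(DEDUCTIONS[severity] for severity in findings)
    return max(0, start - total)


def letter_grade(score):
    """Map a numeric score to the letter bands A (90+) through F (<60).

    Fractional scores fall into the band whose lower bound they reach,
    so 89.5 is graded B in this sketch.
    """
    for threshold, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= threshold:
            return letter
    return "F"


if __name__ == "__main__":
    score = audit_score(["CRITICAL", "WARNING", "INFO"])
    print(score, letter_grade(score))
```

Under these assumptions a single CRITICAL finding is enough to drop a skill from A to C (100 to 75), which matches the intent of weighting critical issues most heavily.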

For the full security check reference and OWASP mapping, see references/security-checks.md.

Related Skills

aws-samples/scoped-skill (testing)
Verified, Trusted, Community
Test fixture for scoped vs full scanning
6 · SKILL.md · Updated Apr 3, 2026

aws-samples/tests/fixtures/no-frontmatter (testing)
Verified, Trusted, Community
No frontmatter here, just plain text. This is not a valid SKILL.md file.
6 · SKILL.md · Updated Apr 3, 2026

aws-samples/mcp-skill (tools)
Verified, Trusted, Community
A skill that references external MCP servers for testing SEC-009 detection.
6 · SKILL.md · Updated Apr 3, 2026

aws-samples/good-skill (tools)
Verified, Trusted, Community
A well-structured test skill that follows the agentskills.io spec. Use when testing skill evaluation tools.
6 · SKILL.md · Updated Apr 3, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

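# Clone the repo
git clone https://github.com/aws-samples/sample-agent-skill-eval.git

# Copy into the Claude Code skills folder (global), creating it first if needed.
# No trailing slash on the source, so GNU and BSD cp both copy the directory itself.
mkdir -p ~/.claude/skills
cp -r sample-agent-skill-eval ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.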

Repository: aws-samples/sample-agent-skill-eval (9 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT
