Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

aiden0z/skill-evaluator

Name: skill-evaluator
Author: aiden0z

skills/skill-evaluator/SKILL.md

npx skillsauth add aiden0z/skills skill-evaluator


Skill Evaluator

IRON LAW: Evaluate a skill against real-use failure modes, not against how convincing its instructions sound.

Purpose

Assess whether a skill is discoverable, concise, executable, verifiable, and robust under realistic agent behavior. Turn observed failures into concrete workflow gates, scripts, references, or eval cases.

Operating Rules

Treat a skill as a reusable capability module, not a long prompt.
Prefer raw evidence: user prompts, agent transcripts, diffs, output artifacts, validation logs, and failed runs.
Do not pass your diagnosis or intended fix into forward-testing prompts; pass the skill and realistic user request.
Do not grade success by output polish alone. Check whether the agent followed the required process and left verifiable evidence.
Do not require artificial success counts. Require process evidence that makes low-output results trustworthy.
Keep recommendations specific: name the file, rule, script, eval case, or validator check that should change.
If the user asks you to modify the evaluated skill, apply the smallest change that blocks the observed failure mode.
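
The rule about keeping your diagnosis out of forward-testing prompts can be sketched in code. The sketch below assumes the prompt shape given later under Forward-Test When Useful. The function name, the `private_notes` parameter, and the substring-based leak check are hypothetical illustrations, not part of the skill's scripts.

```python
from typing import Optional, Sequence


def build_forward_test_prompt(
    skill_name: str,
    skill_path: str,
    user_request: str,
    private_notes: Optional[Sequence[str]] = None,
) -> str:
    """Build a prompt shaped like: Use $<skill-name> at <path> to handle: <request>."""
    prompt = f"Use ${skill_name} at {skill_path} to handle: {user_request}"
    # Hypothetical guard: refuse to build a prompt that repeats any of the
    # evaluator's private notes (diagnosis, intended fix, expected answer).
    for note in private_notes or []:
        cleaned = note.strip().lower()
        if cleaned and cleaned in prompt.lower():
            raise ValueError(f"prompt leaks evaluator note: {note!r}")
    return prompt
```

A plain substring match only catches verbatim leaks; a paraphrased diagnosis would still need a human read of the prompt.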

Workflow

1. Scope the Evaluation ⚠️ REQUIRED
   - Identify the skill path, target runtime, intended users, and concrete usage examples.
   - If no examples are provided, derive three realistic prompts from the skill description and user context.
   - Collect evidence: current SKILL.md, relevant references/scripts, prior agent transcript, output artifacts, or failing behavior.
2. Run Mechanical Checks ⛔ BLOCKING
   - Run scripts/check_skill_quality.py <skill-path> and capture the full output (status, error count, warning count, metrics).
   - If the skill has its own validator, run that too.
   - A verdict is not valid without mechanical check evidence in the output. If the check script cannot be run (missing Python, broken script), record that as the first risk and set the verdict to not-ready.
   - Treat script results as a starting point; they do not replace semantic evaluation.
3. Evaluate Design Fit ⚠️ REQUIRED
   - Read references/best-practices.md.
   - Read references/skill-prompt-quality.md to review SKILL.md as executable agent instructions; treat it as a front gate, not proof of runtime compliance.
   - Use references/evaluation-rubric.md to score trigger quality, workflow control, progressive disclosure, deterministic resources, validation integrity, and output evidence.
   - Identify anti-patterns: workflow hidden in description, vague "ensure quality" instructions, no negative cases, no closed-loop validation, or final artifacts that can mask weak process.
4. Design or Review Eval Cases ⚠️ REQUIRED
   - Read references/eval-case-design.md.
   - For repeatable suites or CI-style checks, read references/harness-engineering.md.
   - If the evaluation starts from a known failure, read references/failure-regression.md first and convert the failure into a no-leak regression case.
   - Create or review eval cases that include happy paths, ambiguity, edge cases, and known failure modes.
   - Prefer assertions over broad expected-output prose: required actions, forbidden shortcuts, required artifacts, and validation commands.
   - Validate eval files with scripts/check_eval_cases.py <evals.json> when eval cases exist.
5. Forward-Test When Useful
   - Read references/forward-testing.md.
   - For complex, high-impact, or recently changed skills, forward-test with fresh agents when the runtime allows it and the user approves any costly or risky runs.
   - Before running a real-agent eval, use references/agent-runtime-discovery.md and optionally scripts/discover_agent_runtime.py to discover available agent runtimes and capture channels.
   - Then use references/harness-engineering.md → "Evidence Capture Discovery" to record the chosen evidence level. Codex JSONL is a reference implementation, not a hard dependency.
   - Prompt shape: Use $<skill-name> at <path> to handle: <realistic user request>.
   - Review available traces, transcripts, command logs, diffs, validator output, and artifacts against the rubric. Do not leak the expected answer unless testing retrieval of a known reference.
   - If only final artifacts are available, report artifact-level validation only; do not claim workflow adherence.
6. Recommend or Apply Changes
   - Convert each failure into one of: trigger metadata change, workflow gate, anti-pattern rule, deterministic script, validator check, reference split, eval case, or output schema.
   - Keep SKILL.md under 500 lines; move detailed standards into directly linked reference files.
   - After edits, rerun mechanical checks and any representative eval/smoke checks.
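
The blocking mechanical-check step can be illustrated with a minimal sketch. It assumes the check script is launched with the current Python interpreter. The helper names and the evidence dictionary are hypothetical. The sketch only detects a failure to launch: telling a crashing script apart from a script that reports errors depends on the script's exit-code convention, which is not stated here.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def check_skill_command(skill_path: str) -> List[str]:
    """Command line for the mechanical check, run with the current interpreter."""
    return [sys.executable, "scripts/check_skill_quality.py", skill_path]


def run_mechanical_check(command: Sequence[str]) -> dict:
    """Run a check command and capture its full output as evidence."""
    try:
        proc = subprocess.run(list(command), capture_output=True, text=True, check=False)
    except OSError as exc:
        # The command could not be launched at all (for example, a missing interpreter).
        return {"ran": False, "exit_code": None, "output": str(exc)}
    return {"ran": True, "exit_code": proc.returncode, "output": proc.stdout + proc.stderr}


def gate_verdict(evidence: Optional[dict], proposed: str) -> str:
    """Force not-ready whenever mechanical check evidence is missing."""
    if evidence is None or not evidence["ran"]:
        return "not-ready"
    return proposed
```

A typical call would be `gate_verdict(run_mechanical_check(check_skill_command("path/to/skill")), "ready")`, with the captured `output` copied into the report as evidence.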

Output Shape

For a review-only task, return these five sections in order. Omit any section only when the evaluation did not reach that phase:

1. Mechanical Check Results: output from scripts/check_skill_quality.py (status, error count, warning count, key metrics). If this section is absent, the evaluation is incomplete and the verdict must be not-ready.
2. Verdict: ready, usable-with-gaps, or not-ready.
3. Top Risks: ordered by impact, each with evidence from files, scripts, transcripts, or artifacts.
4. Recommended Changes: each names a target file, rule, script, or eval case, and explains why the change matters.
5. Suggested Eval Cases: at least one concrete prompt with must-do and must-not-do assertions.

For an edit task, also include changed files and verification commands.
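
One way to keep the review output in the required order, and to enforce the rule that a missing Mechanical Check Results section forces not-ready, is a small report object. The following is a sketch under assumptions: the `ReviewReport` class and its field names are hypothetical, and the skill itself does not prescribe any data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

VERDICTS = ("ready", "usable-with-gaps", "not-ready")


@dataclass
class ReviewReport:
    verdict: str
    mechanical_check_results: Optional[str] = None
    top_risks: List[str] = field(default_factory=list)
    recommended_changes: List[str] = field(default_factory=list)
    suggested_eval_cases: List[str] = field(default_factory=list)

    def final_verdict(self) -> str:
        """Downgrade to not-ready when mechanical check output is absent."""
        if self.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if not self.mechanical_check_results:
            return "not-ready"
        return self.verdict

    def render(self) -> str:
        """Emit the sections in the fixed order, skipping phases not reached."""
        sections = [
            ("Mechanical Check Results", self.mechanical_check_results),
            ("Verdict", self.final_verdict()),
            ("Top Risks", "\n".join(self.top_risks)),
            ("Recommended Changes", "\n".join(self.recommended_changes)),
            ("Suggested Eval Cases", "\n".join(self.suggested_eval_cases)),
        ]
        return "\n\n".join(f"{title}\n{body}" for title, body in sections if body)
```

Computing the verdict inside the report object means a caller cannot claim ready while leaving the evidence field empty.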

Resources

scripts/check_skill_quality.py — deterministic skill structure and hygiene checks. Pass --receipt <path> to write a timestamped JSON receipt after a passing run; use this as evidence that mechanical checks completed before a verdict.
scripts/check_eval_cases.py — validate portable eval case JSON files.
scripts/discover_agent_runtime.py — read-only discovery of local agent CLIs and likely evidence capture methods.
references/best-practices.md — distilled guidance from OpenAI, Claude, Trae, and local skill-forge/skill-creator principles.
references/skill-prompt-quality.md — static gate for whether SKILL.md can reliably steer agent behavior.
references/evaluation-rubric.md — scoring rubric and pass/fail gates.
references/eval-case-design.md — how to create regression and forward-test cases.
references/agent-runtime-discovery.md — how to discover capture channels for Codex, Claude Code, GitHub Copilot, Kimi, and custom agents.
references/harness-engineering.md — design repeatable eval harnesses: datasets, traces, graders, aggregation, and gates.
references/forward-testing.md — skill-creator style fresh-agent testing protocol.
references/failure-regression.md — convert observed failures into no-leak regression cases and skill changes.
templates/eval-case.json — starter schema for skill eval cases.
templates/harness-plan.md — lightweight plan for repeatable skill eval suites.
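
To illustrate assertion-style eval cases with must-do and must-not-do checks, here is a minimal sketch. The dictionary keys, the action labels, and the `grade` function are hypothetical and are not the schema of templates/eval-case.json; mapping a real transcript onto such action labels is left out.

```python
from typing import Dict, List

# Hypothetical eval case: one realistic prompt plus process assertions.
eval_case = {
    "prompt": "Use $skill-evaluator at skills/skill-evaluator to handle: "
              "is this skill ready to publish?",
    "must_do": ["run scripts/check_skill_quality.py"],
    "must_not_do": ["issue verdict before mechanical checks"],
}


def grade(case: dict, observed_actions: List[str]) -> Dict[str, List[str]]:
    """Compare observed action labels against the case's assertions."""
    observed = set(observed_actions)
    return {
        "missing": [a for a in case["must_do"] if a not in observed],
        "forbidden": [a for a in case["must_not_do"] if a in observed],
    }


def passed(result: Dict[str, List[str]]) -> bool:
    """A run passes only if nothing required is missing and nothing forbidden occurred."""
    return not result["missing"] and not result["forbidden"]
```

Grading on observed actions instead of the final answer keeps a polished output from masking a skipped step.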


Use when asked to review a skill's quality, test whether a skill works correctly, find why a skill behaves inconsistently or fails to trigger, check if a skill is ready to publish, harden a skill against known failure modes, or turn an observed failure into a repeatable test case.

3 stars

testing

Updated May 11, 2026


npx skillsauth add aiden0z/skills skill-evaluator

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%).
Semgrep: Static code analysis for vulnerabilities. Clean (95%).
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%).
Snyk (dep): Open source security scanning. Skipped (50%).
Socket.dev: Supply chain security analysis. Skipped (50%).
VirusTotal: Multi-engine malware detection. Skipped (50%).
CrowdStrike: Advanced threat intelligence. Skipped (50%).
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%).
OWASP Dep-Check: Skipped (50%).

Last scanned: May 11, 2026, 3:50 AM (116.7s, 16 files scanned)


Related Skills

aiden0z/repo-bug-audit

development

Verified · Trusted · Community

Use when asked to find Bugs, audit or review a repository, scan code for security/reliability/architecture risks, inspect a folder of many repos, produce evidence-backed Bug reports, continue a prior audit, or compare/triage candidate findings.

3 · SKILL.md · Updated May 8, 2026

aiden0z/vibe-deck

development

Verified · Trusted · Community

Vibe Deck — vibe-code professional slide presentations — describe what you want, AI builds it. Scaffolds a React + ECharts project, creates slides with charts, animations, theming, and PDF export. Use PROACTIVELY when the user mentions slides, deck, presentation, PPT, PPTX, slideshow, keynote, pitch deck, quarterly review, board meeting, investor update, sales deck, training deck, onboarding slides, report presentation, add a slide, build a deck, create slides, make a roadmap slide, put this data into a presentation, turn this Excel into slides, visualize this data as a deck. Also trigger when the user wants to modify, reorder, or delete slides in an existing slide-kit project. Also trigger when the user wants to share, export, or package the deck as a single HTML file for email or offline viewing. Chinese triggers: 做PPT, 做个deck, 写pptx, 创建演示, 制作幻灯片, 做幻灯片, 加一页, 新增slide, 做演示文稿, 工作汇报, 述职报告, 季度回顾, 方案展示, 写个汇报, 改一下这页, 调整幻灯片顺序, 删掉这页, 把数据做成图表展示, 帮我做个路线图, 导出单个HTML, 分享给别人看.

3 · SKILL.md · Updated Mar 30, 2026

aiden0z/email-designer

development

Verified · Trusted · Community

Generate Outlook-compatible email templates (EML + HTML) through conversation. Three modes: Design (create from scratch), Import (replicate an existing .eml), Production (fill Excel data into a crystallized template). Use when user wants to: create or design an email template, generate an .eml file, make a newsletter, format an email for Outlook, import/replicate an email (导入/复刻邮件), design a 邮件模板, do 邮件排版 or 邮件设计, build pixel-perfect HTML email with Outlook compatibility. Triggers: weekly report email (周报邮件), product update email, event invitation (活动邀请邮件), announcement (公告邮件), company newsletter, .eml import, replicate email template, or make an email look professional/beautiful for Outlook. Handles visual design, EML generation, and EML import — not SMTP, sending, or account management. Without this skill, Outlook emails break because Outlook uses Word rendering which ignores modern CSS.

3 · SKILL.md · Updated Mar 30, 2026

steipete/skill-creator

testing

Verified · Trusted · Community

Create, edit, improve, or audit AgentSkills. Use when creating a new skill from scratch or when asked to improve, review, audit, tidy up, or clean up an existing skill or SKILL.md file. Also use when editing or restructuring a skill directory (moving files to references/ or scripts/, removing stale content, validating against the AgentSkills spec). Triggers on phrases like "create a skill", "author a skill", "tidy up a skill", "improve this skill", "review the skill", "clean up the skill", "audit the skill".

356,423 · SKILL.md · Updated Apr 13, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


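# Clone the repo
git clone https://github.com/aiden0z/skills.git

# Make sure the global Claude Code skills folder exists,
# otherwise cp would copy the skill to a directory named "skills"
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r skills/skills/skill-evaluator ~/.claude/skills/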

Claude Code Skills — official skills path docs.

Repository

aiden0z/skills

3 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT