skills/create-a-skill/SKILL.md
Create new agent skills from scratch, modify and improve existing skills, and measure skill performance through evaluation and benchmarking. Use when users want to create a skill, write a skill, build a new skill, edit or optimize an existing skill, run evals to test a skill, benchmark skill performance, or optimize a skill's description for better triggering accuracy. Also triggers when users say "turn this into a skill", "make a skill for X", "skill for doing Y", or ask about skill structure, skill format, or SKILL.md files.
npx skillsauth add psenger/ai-agent-skills create-a-skillInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A skill for creating new skills from scratch and iteratively improving them through testing, user feedback, and evaluation.
The process of creating a skill follows this loop:
.skill fileYour job is to figure out where the user is in this process and jump in. Maybe they say "I want to make a skill for X" — help them from step 1. Maybe they already have a draft — go straight to testing. Always be flexible: if the user says "just vibe with me", skip the formal eval loop.
Start by understanding what the user wants. The conversation might already contain a workflow they want to capture (e.g., "turn this into a skill"). If so, extract answers from the conversation history first — tools used, sequence of steps, corrections the user made, input/output formats observed.
Ask these questions (skip any already answered):
Proactively ask about edge cases, input/output formats, example files, success criteria, and dependencies. Don't write test prompts until you've ironed this out.
Research the domain. If the skill involves specific technologies, frameworks, or APIs:
Wait for the user to confirm your understanding before proceeding to the draft.
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ — Executable code for deterministic/repetitive tasks
├── references/ — Docs loaded into context as needed
└── assets/ — Files used in output (templates, icons, fonts)
Required fields:
Optional fields:
The description is the only thing Claude sees when deciding which skill to load. It appears in the system prompt alongside all other installed skills.
Format:
Good example:
Extract text and tables from PDF files, fill forms, merge documents. Use when working
with PDF files or when user mentions PDFs, forms, or document extraction. Make sure to
use this skill whenever the user mentions any PDF-related task, even if they don't
explicitly ask for extraction.
Bad example:
Helps with documents.
Skills use a three-level loading system:
Key patterns:
Domain organization — when a skill supports multiple domains/frameworks:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
├── aws.md
├── gcp.md
└── azure.md
Claude reads only the relevant reference file.
Add utility scripts when:
Scripts save tokens and improve reliability vs. generated code. If all your test runs independently write similar helper scripts, that's a strong signal to bundle the script.
Defining output formats:
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern:
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
Skills must not contain malware, exploit code, or content that could compromise system security. A skill's contents should not surprise the user in their intent if described. Don't create misleading skills or skills designed to facilitate unauthorized access or data exfiltration. Roleplay skills are OK.
After drafting, verify:
Present the draft to the user and ask:
After writing the skill, create 2–3 realistic test prompts — the kind of thing a real user would actually say. Share them: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"
Save test cases to evals/evals.json (don't write assertions yet — just prompts):
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": []
}
]
}
See references/schemas.md for the full schema including the expectations field (added later).
This is one continuous sequence — don't stop partway through.
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Organize by iteration (iteration-1/, iteration-2/, etc.) and each test case gets a directory (eval-0/, eval-1/, etc.). Create directories as you go.
For each test case, spawn two subagents in the same turn — one with the skill, one without (baseline). Launch everything at once.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
Baseline run (depends on context):
without_skill/outputs/.cp -r), then point baseline at snapshot. Save to old_skill/outputs/.Write an eval_metadata.json for each test case with a descriptive name:
{
"eval_id": 0,
"eval_name": "descriptive-name-here",
"prompt": "The user's task prompt",
"assertions": []
}
Draft quantitative assertions and explain them to the user. Good assertions are objectively verifiable and have descriptive names. Subjective skills (writing style, design quality) are better evaluated qualitatively.
Update eval_metadata.json and evals/evals.json with the assertions.
When each subagent completes, save total_tokens and duration_ms to timing.json:
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
This data comes through the task notification and isn't persisted elsewhere — capture it immediately.
Once all runs are done:
Grade each run — use agents/grader.md instructions. Save to grading.json. The expectations array must use fields text, passed, and evidence. For programmatically checkable assertions, write and run a script.
Aggregate into benchmark:
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
Analyst pass — read benchmark data and surface patterns. See agents/analyzer.md ("Analyzing Benchmark Results" section).
Launch the viewer:
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
<workspace>/iteration-N \
--skill-name "my-skill" \
--benchmark <workspace>/iteration-N/benchmark.json \
> /dev/null 2>&1 &
VIEWER_PID=$!
For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.
Headless environments: Use --static <output_path> to write standalone HTML.
Tell the user the results are ready for review.
When the user is done, read feedback.json. Empty feedback means it was fine. Focus improvements on test cases with specific complaints.
Kill the viewer: kill $VIEWER_PID 2>/dev/null
This is the heart of the loop.
Generalize from the feedback. You're iterating on a few examples to move fast, but the skill will be used across many different prompts. Don't overfit — avoid fiddly changes or oppressively constrictive MUSTs. If something is stubborn, try different metaphors or recommend different patterns.
Keep the prompt lean. Remove things that aren't pulling their weight. Read the transcripts, not just final outputs — if the skill makes the model waste time on unproductive steps, trim those instructions.
Explain the why. Transmit your understanding of the task and the user's intent into the instructions. Use reasoning over rigid structures.
Look for repeated work. If all test runs independently wrote similar helper scripts or took the same multi-step approach, bundle that script in scripts/.
iteration-<N+1>/ directory with baseline runs--previous-workspaceKeep going until:
After the skill is done, offer to optimize the description for better triggering accuracy.
Create 20 eval queries — mix of should-trigger and should-not-trigger:
[
{"query": "the user prompt", "should_trigger": true},
{"query": "another prompt", "should_trigger": false}
]
Queries must be realistic — concrete, specific, with file paths, personal context, column names, URLs. Some casual with typos, some formal. Focus on edge cases.
Should-trigger (8–10): Different phrasings of the same intent, cases where the user doesn't name the skill explicitly but clearly needs it, uncommon use cases, competitive cases.
Should-not-trigger (8–10): Near-misses that share keywords but need something different. Avoid obviously irrelevant queries — the negative cases should be genuinely tricky.
Present using the HTML template from assets/eval_review.html:
__EVAL_DATA_PLACEHOLDER__, __SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__~/Downloads/eval_set.jsonpython -m scripts.run_loop \
--eval-set <path-to-trigger-eval.json> \
--skill-path <path-to-skill> \
--model <model-id-powering-this-session> \
--max-iterations 5 \
--verbose
This handles the full loop: 60/40 train/test split, 3x runs per query, iterative improvement.
Update SKILL.md frontmatter with best_description. Show before/after and report scores.
python -m scripts.quick_validate <path/to/skill-folder>
python -m scripts.package_skill <path/to/skill-folder>
If present_files tool is available, present the .skill file to the user. Otherwise, tell them the file path.
For rigorous comparison between two skill versions, read agents/comparator.md and agents/analyzer.md. Two outputs are given to an independent agent without telling it which is which, then analyzed for why the winner won. This is optional and most users won't need it.
People across a wide range of coding familiarity use this. Pay attention to context cues:
claude CLI)--static <output_path> for the viewer (no browser)feedback.jsonclaude -p)name frontmatter field/tmp/ firstRead these when you need them:
agents/grader.md — How to evaluate assertions against outputsagents/comparator.md — How to do blind A/B comparisonagents/analyzer.md — How to analyze why one version beat anotherreferences/schemas.md — JSON structures for evals.json, grading.json, benchmark.json, etc.eval-viewer/generate_review.pytesting
Exports a single Obsidian vault note and all its linked images into a portable zip or tar.gz archive, preserving vault-root-relative paths so the archive unpacks correctly anywhere. Use only when the user explicitly invokes /export-vault-note.
development
Provides the audit checklists, severity criteria (blocking/warning/suggestion), and artifact patterns needed to properly review Agent OS profiles and standards. Always invoke this skill before auditing - without it you can only give generic feedback, not structured severity-tagged findings. Invoke when the user pastes a standard and asks if it is good or what is wrong with it; when the user asks to review, audit, validate, or critique an agent-os profile or standard; or when the user mentions "agent-os profile", "agent-os standard", or "my agent-os setup" in a review or validation context.
development
Converts transcripts, video summaries, meeting notes, brainstorming sessions, strategy documents, and rough notes into polished Obsidian-flavored Markdown. Activates when creating or editing notes in an Obsidian vault, generating front matter, applying callout blocks, structuring knowledge base articles, or producing developer-facing guides. Also triggers on mentions of Obsidian, front matter, callout blocks, vault organisation, or requests for GitHub-compatible Markdown documents.
documentation
Generate git commit messages, PR titles/descriptions, and changelog entries. Analyzes staged changes, enforces Conventional Commits, scans for sensitive content, links tickets (GitHub Issues / Jira), and updates CHANGELOG.md. Triggers on: "commit", "create a PR", "push", "changelog", "release", or when the user is ready to commit or open a pull request.