openclaw/skills/skill-creator/SKILL.md
Create new OpenClaw skills, modify and improve existing skills, and measure skill performance with evals. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy. Also use when asked to "make a skill", "turn this into a skill", "improve this skill", or "test this skill".
npx skillsauth add Dbochman/dotfiles skill-creatorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A skill for creating new OpenClaw skills and iteratively improving them with eval-driven development.
The skill creation loop:
Jump in wherever the user is in this process. If they already have a draft, skip to eval/iterate. If they say "just vibe with me", skip the formal evals.
This skill runs on a headless Mac Mini via OpenClaw. Key constraints:
--static mode for all viewers, save HTML files
for the user to download or view via screen shareclaude -p requires keychain access — the eval/optimization scripts
(run_eval.py, run_loop.py, improve_description.py) use claude -p
which needs macOS keychain auth. These scripts will not work from the
OpenClaw gateway LaunchAgent session. Use the core workflow (capture intent
→ write SKILL.md → iterate via conversation) instead. The eval scripts are
available for manual use via Terminal or screen share.~/.openclaw/skills/ — real directory copies, not symlinks
(OpenClaw rejects symlinks via fs.realpathSync())claude CLI available at /opt/homebrew/opt/node@22/bin/claude/opt/homebrew/bin/python3openclaw/skills/TEMPLATE/SKILL.md in the dotfiles repoAdapt your language to context cues. Default: "evaluation" and "benchmark" are fine; explain "JSON" or "assertion" before using them unless the user signals technical fluency.
If the current conversation already contains a workflow to capture ("turn this into a skill"), extract answers from conversation history first.
Ask about edge cases, formats, dependencies. Check available MCPs for research. Wait to write test prompts until this is solid.
Follow OpenClaw conventions (see the TEMPLATE skill for reference):
Frontmatter (required):
name: kebab-case, under 64 charsdescription: Primary triggering mechanism. Include what it does AND when to
use it. Be slightly "pushy" — Claude undertriggers skills. All "when to use"
info goes here, not in the body.allowed-tools: Restrict tool access if the skill uses CLI wrappersmetadata: OpenClaw-specific metadata (emoji, required bins)Body: Instructions for using the skill and its bundled resources.
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
Skills use a three-level loading system:
Keep SKILL.md under 500 lines. If approaching this limit, split into reference files with clear pointers about when to read them.
Match specificity to task fragility:
Use imperative form. Explain the why behind instructions — today's LLMs are smart and respond better to reasoning than rigid MUSTs. If you find yourself writing ALWAYS or NEVER in caps, reframe as an explanation.
Include examples where helpful:
## Commit message format
**Example:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
After writing the skill draft, create 2-3 realistic test prompts. Share them
with the user for feedback. Save to evals/evals.json:
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": []
}
]
}
See references/schemas.md for the full schema including assertions.
This section is one continuous sequence. Put results in
<skill-name>-workspace/ as a sibling to the skill directory. Organize by
iteration (iteration-1/, iteration-2/, etc.) with each test case in its
own directory (eval-0/, eval-1/, etc.).
For each test case, spawn two subagents simultaneously — one with the skill, one baseline. Launch everything at once.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
Baseline run (same prompt, no skill or old skill version):
without_skill/outputs/old_skill/outputs/Write eval_metadata.json for each test case with descriptive names.
Use the wait time to draft quantitative assertions. Good assertions are objectively verifiable with descriptive names. Don't force assertions onto subjective outputs.
Update eval_metadata.json and evals/evals.json with assertions.
When each subagent completes, the notification contains total_tokens and
duration_ms. Save immediately to timing.json — this data isn't persisted
elsewhere.
Once all runs complete:
Grade each run — spawn a grader subagent using agents/grader.md.
Save to grading.json. Use fields text, passed, evidence (the viewer
depends on these exact names). For programmatically checkable assertions,
write and run a script.
Aggregate — run from the skill-creator directory:
python3 -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
Produces benchmark.json and benchmark.md.
Analyst pass — read benchmark data, surface patterns. See
agents/analyzer.md ("Analyzing Benchmark Results" section).
Generate static viewer:
python3 <skill-creator-path>/eval-viewer/generate_review.py \
<workspace>/iteration-N \
--skill-name "my-skill" \
--benchmark <workspace>/iteration-N/benchmark.json \
--static <workspace>/iteration-N/review.html
For iteration 2+, add --previous-workspace <workspace>/iteration-<N-1>.
Tell the user the viewer HTML is at the path — they can view it via screen share or download it.
When the user provides feedback (inline or via feedback.json), focus
improvements on test cases with specific complaints. Empty feedback = looked fine.
Generalize from feedback. Don't overfit to the test examples — make changes that would help across many different prompts. Rather than fiddly constraints, try different metaphors or patterns.
Keep the prompt lean. Read transcripts, not just outputs. If the skill makes the model waste time on unproductive steps, trim those instructions.
Explain the why. Transmit understanding into instructions. If you find yourself writing rigid constraints, reframe as reasoning.
Look for repeated work. If all test runs independently wrote similar
helper scripts, bundle that script in scripts/ so future invocations
don't reinvent the wheel.
iteration-<N+1>/--previous-workspace pointing at previous iterationKeep going until the user is happy, feedback is empty, or progress stalls.
For rigorous A/B comparison between skill versions, read agents/comparator.md
and agents/analyzer.md. Give two outputs to an independent agent without
revealing which skill produced which.
After creating or improving a skill, offer to optimize the description for better triggering accuracy.
Create 20 eval queries — mix of should-trigger and should-not-trigger:
[
{"query": "the user prompt", "should_trigger": true},
{"query": "another prompt", "should_trigger": false}
]
Queries must be realistic with concrete details (file paths, personal context, abbreviations, typos). For should-not-trigger queries, focus on near-misses — adjacent domains that share keywords but need something different.
Present the eval set using the HTML template:
assets/eval_review.html__EVAL_DATA_PLACEHOLDER__ → JSON array,
__SKILL_NAME_PLACEHOLDER__ → name, __SKILL_DESCRIPTION_PLACEHOLDER__ → descriptioncd <skill-creator-path> && \
python3 -m scripts.run_loop \
--eval-set <path-to-trigger-eval.json> \
--skill-path <path-to-skill> \
--model <model-id> \
--max-iterations 5 \
--report none \
--verbose
Use --report none since we're headless. The loop handles train/test split
(60/40), runs each query 3 times, and uses Claude with extended thinking to
propose improvements.
Take best_description from the JSON output and update the skill's frontmatter.
Report before/after and scores to the user.
Agent instructions (read when spawning the relevant subagent):
agents/grader.md — evaluate assertions against outputsagents/comparator.md — blind A/B comparisonagents/analyzer.md — analyze why one version beat another / benchmark patternsReference documentation:
references/schemas.md — JSON schemas for evals.json, grading.json, etc.development
Search the web for current information, news, facts, and answers. Use when asked questions about current events, needing to look something up, finding websites, researching topics, or when you need up-to-date information beyond your training data.
development
Summarize any URL, YouTube video, podcast, PDF, or file into concise text. Use when asked to read an article, summarize a link, get the gist of a video or podcast, extract content from a URL, or when you need to understand what a web page or document contains.
development
Play music via Spotify and control Google Home speakers. Use when asked to play music, songs, artists, playlists, podcasts, or control speakers/volume/audio.
tools
Run Apple Shortcuts for HomeKit scenes, automations, and system actions. Use when asked about home scenes, automations, fireplace, or triggering shortcuts.