skills/skill-creator/SKILL.md
Use this skill when creating a new Claude skill from scratch, editing or improving an existing skill, or measuring skill performance with evals and benchmarks. Invoke whenever the user says things like 'make a skill for X', 'turn this workflow into a skill', 'test my skill', 'improve my skill', 'run evals', 'benchmark this', or 'optimize my skill description'. Also use proactively when the conversation has produced a repeatable workflow that would benefit from being captured as a skill. Covers the full lifecycle: capture intent, draft SKILL.md, run evals, review with user, iterate, optimize description, package. NOT for general coding help, debugging runtime errors, building MCP servers, writing Claude hooks, or creating plugins - use domain-specific skills for those.
npx skillsauth add curiositech/windags-skills skill-creator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
A skill for creating new skills and iteratively improving them through systematic testing and evaluation.
Navigate based on the user's entry point and current skill state:
Entry Assessment
├── "Make a skill for X" → Capture Intent → Draft SKILL.md
├── "Turn this workflow into a skill" → Extract from conversation → Confirm intent
├── "Test/improve my skill" → Locate skill → Run test cases
├── "Optimize description" → Load skill → Generate trigger evals
└── Proactive detection → Assess if workflow is repeatable → Suggest skill creation
Skill State Check
├── No skill exists → Full creation flow (intent → draft → test → iterate)
├── Draft exists, no tests → Create test cases → Run evaluations
├── Has tests, poor performance → Analyze failures → Improve skill → Retest
└── Working skill → Description optimization → Package
Test Case Strategy
├── Objectively verifiable outputs (code, data, files) → Create quantitative assertions
├── Subjective outputs (writing, design) → Skip assertions, focus on qualitative review
└── Mixed outputs → Assertions for verifiable parts, human review for subjective
Evaluation Scope
├── First iteration → 2-3 focused test cases
├── Subsequent iterations → Rerun same cases for comparison
└── Final validation → Expand to 5-10 diverse cases
Improvement Trigger
├── User feedback identifies specific issues → Target those areas
├── Quantitative benchmarks show regression → Analyze root cause
├── No clear feedback but low pass rates → Review transcripts for patterns
└── Consistent success across tests → Move to description optimization
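The Skill State Check tree above can be read as a small routing function. The following is a minimal sketch, not part of the skill itself: the `SkillState` class, the `next_step` function, and the 0.8 passing threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillState:
    """Hypothetical snapshot of where a skill is in its lifecycle."""
    exists: bool
    has_tests: bool
    pass_rate: Optional[float] = None  # None until an evaluation has been run


def next_step(state: SkillState, passing_threshold: float = 0.8) -> str:
    """Map a skill state to the next action from the Skill State Check tree.

    passing_threshold is an assumed cut-off; the source gives no number.
    """
    if not state.exists:
        return "full creation flow: intent -> draft -> test -> iterate"
    if not state.has_tests:
        return "create test cases, then run evaluations"
    if state.pass_rate is None:
        return "run evaluations"
    if state.pass_rate < passing_threshold:
        return "analyze failures, improve skill, retest"
    return "optimize description, then package"


print(next_step(SkillState(exists=True, has_tests=True, pass_rate=0.6)))
```

Keeping the branches in the same order as the tree makes it easy to check the sketch against the diagram line by line.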
Detection: If the skill works on test cases but fails when users try different variations of the same task type. Diagnosis: The intent interview was too narrow, or it focused on specific examples rather than the general capability. Fix: Restart intent capture with broader questions: "What are all the ways someone might ask for this?" and "What edge cases should we handle?"
Detection: If assertions are passing but user feedback consistently identifies problems, or if the skill works only on the exact test prompts. Diagnosis: Test cases don't represent real usage patterns, or assertions are non-discriminating. Fix: Apply the discriminating-assertion test: would a clearly wrong output also pass this assertion? Rewrite assertions to check specific content that only a correct output would have.
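The discriminating-assertion test can be automated by running each assertion against outputs that are known to be wrong. This is a minimal sketch under the assumption that assertions are plain Python predicates over the output text; `is_discriminating` and the sample strings are hypothetical and not part of the skill's tooling.

```python
from typing import Callable, List

Assertion = Callable[[str], bool]


def is_discriminating(assertion: Assertion, good_output: str,
                      wrong_outputs: List[str]) -> bool:
    """True if the assertion accepts the good output and rejects every wrong one."""
    return assertion(good_output) and not any(assertion(w) for w in wrong_outputs)


def mentions_auth(output: str) -> bool:
    # Weak: a heading alone satisfies it, so a stub document would also pass.
    return "Authentication" in output


def shows_bearer_header(output: str) -> bool:
    # Stronger: checks for content that only a real example would contain.
    return "Authorization: Bearer" in output


good = "## Authentication\nSend `Authorization: Bearer <token>` with every request."
wrong = ["## Authentication\nTODO", ""]

print(is_discriminating(mentions_auth, good, wrong))        # False
print(is_discriminating(shows_bearer_header, good, wrong))  # True
```

An assertion that returns `False` here is a candidate for rewriting, since a clearly wrong output would still have passed it.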
Detection: If 3+ iterations show no meaningful progress, or changes only address test cases without improving general capability. Diagnosis: Over-optimization on specific examples rather than addressing root capability gaps. Fix: Step back and analyze failure patterns across all test cases. Consider fundamental approach changes rather than incremental tweaks.
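One way to make "3+ iterations show no meaningful progress" concrete is to track the pass rate per iteration and flag a stall when the recent gain stays below a chosen margin. The sketch below is illustrative only: `is_stalled`, the window of 3, and the 0.05 minimum gain are assumptions, since the source does not define "meaningful".

```python
from typing import List


def is_stalled(pass_rates: List[float], window: int = 3,
               min_gain: float = 0.05) -> bool:
    """Flag a stall when the last `window` iterations gained less than `min_gain`.

    pass_rates holds one value per iteration, oldest first. Both `window`
    and `min_gain` are assumed defaults, not values from the skill.
    """
    if len(pass_rates) < window + 1:
        return False  # not enough history to judge
    recent = pass_rates[-(window + 1):]
    # Compare the best recent result against the point just before the window.
    return max(recent[1:]) - recent[0] < min_gain


print(is_stalled([0.60, 0.62, 0.61, 0.63]))  # True
print(is_stalled([0.60, 0.75, 0.90, 0.95]))  # False
```

A stall flag is a prompt to step back and look at failure patterns across all test cases, not a verdict on the skill.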
Detection: If you're only testing with the skill and never comparing to without-skill performance. Diagnosis: No objective measure of whether the skill actually improves outcomes over baseline Claude. Fix: Always run baseline comparisons (without-skill for new skills, old-version for improvements) to measure actual value added.
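A baseline comparison reduces to running the same test cases twice and comparing pass rates. The following sketch assumes each run is recorded as a list of pass/fail booleans, one per assertion; `pass_rate`, `value_added`, and the sample results are hypothetical.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of assertions that passed; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0


def value_added(with_skill: List[bool], baseline: List[bool]) -> float:
    """Pass-rate difference between the with-skill run and the baseline run."""
    return pass_rate(with_skill) - pass_rate(baseline)


# Made-up results for the same four assertions under both configurations.
with_skill = [True, True, True, False]
baseline = [True, False, False, False]

print(f"{value_added(with_skill, baseline):+.2f}")  # +0.50
```

For an improvement to an existing skill, the same function applies with the old version's results passed as `baseline`.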
Detection: If time is being spent on description optimization before the skill reliably works on test cases. Diagnosis: Discoverability is being optimized before quality is ensured. Fix: Complete the test-improve cycle first. Only optimize descriptions once the skill consistently passes evaluations.
Initial Request: "I keep having to explain our API structure to new developers. Can we make a skill for that?"
Intent Capture Decision Tree:
Draft Process:
scripts/openapi_parser.py based on repetitive parsing logic
Test Cases Design:
{
  "skill_name": "api-documenter",
  "evals": [
    {
      "id": 1,
      "prompt": "Document our user authentication API endpoints",
      "files": ["evals/files/auth_spec.yaml"],
      "expectations": [
        "Output includes authentication section with bearer token example",
        "Each endpoint has request/response examples",
        "Parameter tables include required/optional indicators"
      ]
    }
  ]
}
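Before running a definition like the one above, it can help to check that it is well-formed. This is a rough sketch: the field names come from the example, but `validate_evals` and the choice of which keys are required (with "files" treated as optional) are assumptions, not a documented schema.

```python
import json
from typing import List

# Assumed required keys per eval; "files" is treated as optional here.
REQUIRED_EVAL_KEYS = {"id", "prompt", "expectations"}


def validate_evals(raw: str) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    data = json.loads(raw)
    problems = []
    if not data.get("skill_name"):
        problems.append("missing skill_name")
    evals = data.get("evals", [])
    if not evals:
        problems.append("no evals defined")
    seen_ids = set()
    for index, case in enumerate(evals):
        missing = REQUIRED_EVAL_KEYS - case.keys()
        if missing:
            problems.append(f"eval #{index}: missing {sorted(missing)}")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"eval #{index}: duplicate id {case['id']}")
            seen_ids.add(case["id"])
        if "expectations" in case and not case["expectations"]:
            problems.append(f"eval #{index}: expectations list is empty")
    return problems


sample = """
{
  "skill_name": "api-documenter",
  "evals": [
    {"id": 1,
     "prompt": "Document our user authentication API endpoints",
     "files": ["evals/files/auth_spec.yaml"],
     "expectations": ["Each endpoint has request/response examples"]}
  ]
}
"""
print(validate_evals(sample))  # []
```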
First Iteration Results:
Improvement Applied:
Second Iteration: Pass rate improved from 60% to 95%, and the user was satisfied with the comprehensive coverage.
Entry Point: User says "My code review skill keeps missing security issues"
State Assessment: Skill exists, has test cases, but low performance on security-related assertions.
Failure Analysis:
Improvement Strategy:
references/security_patterns.md with detailed detection guidance
Test Case Iteration:
Key Learning: The skill was solving the wrong problem: automating style checks instead of augmenting security analysis. Success required fundamental reframing, not incremental improvement.
What this skill should NOT handle:
When to delegate:
This skill focuses exclusively on the skill creation lifecycle: capturing requirements, drafting instructions, creating test cases, running evaluations, and iterating based on results.