skills/skill-creator-agnostic/SKILL.md
Adds runner-agnostic guardrails on top of Anthropic's skill-creator for creating, modifying, and benchmarking skills across Codex, GitHub Copilot, Opus, and similar agents. Use whenever skill work must follow temp-workspace isolation, valid `iteration-N/eval-name/{config}/run-N/` benchmark layout, honest measured-vs-simulated labeling, UTF-8-safe artifact generation, and repo-managed skill sync/README update rules. Treat requests like "turn this workflow into a skill", "benchmark this skill", "compare with_skill and without_skill", "why is aggregate_benchmark.py showing zeros", or "make this skill robust across agents" as automatic triggers.
npx skillsauth add codebeltnet/agentic skill-creator-agnostic
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill is a thin overlay on Anthropic's skill-creator. It does not replace the upstream workflow; it adds cross-runner and repo-level guardrails so skill work stays reliable across Codex, Copilot, Opus, and similar agents.
Before benchmarking, read references/benchmark-contract.md.
On Windows or when running from PowerShell, also read references/windows-powershell-benchmarking.md.
- Follow the upstream `skill-creator` workflow. Use this skill to add environment and repo guardrails, not to fork or replace the upstream skill.
- Do not modify `skill-creator` to encode repo-specific behavior. Keep those rules in repo-managed files and companion skills instead.
- Run paired `MEASURED` benchmarks in parallel by default instead of running evals serially.
- Create the workspace under `$env:TEMP/<skill-name>-workspace/`, never inside the source repo.
- Keep `skills/<name>/`, `~/.claude/skills/<name>/`, and `~/.agents/skills/<name>/` in sync before calling the work done.
- Keep eval definitions in the per-skill `evals/evals.json`.
- If an eval declares `files`, treat those paths as skill-relative fixtures and stage them into the temp workspace for both benchmark configurations.
- Follow `iteration-N/eval-name/{config}/run-N/` exactly, and the eval directory itself must start with `eval-`; do not flatten files directly under `with_skill/` or `without_skill/`.
- Each `grading.json` must include both `expectations` and a populated `summary` object with `passed`, `failed`, `total`, and `pass_rate`.
- Generate `benchmark.json` through `skill-creator/scripts/aggregate_benchmark.py`; never hand-author it.
- Generate the review artifact with `skill-creator/eval-viewer/generate_review.py`; do not build custom HTML when the upstream viewer already fits.
- Label runs `MEASURED` for real model executions, `SIMULATED` for hand-authored or scripted expected outputs. Never present simulated outputs as measured.
- A `MEASURED` benchmark may still show zero delta or parity between configurations. That is a valid measured result, not a reason to relabel the run as simulated.

Decide which of these modes the user is asking for:
Load Anthropic's skill-creator for the base workflow, then apply this overlay for cross-runner and repo-specific execution discipline.
Choose the benchmark execution path from actual available capabilities, not from memory or assumptions.
- If the available runner can execute real model runs, run a `MEASURED` benchmark.
- Otherwise, fall back to a `SIMULATED` benchmark, but label it clearly as such.

Do not frame the workflow around one vendor-specific CLI unless that CLI is actually present.
Create the workspace under temp and keep it isolated from the real repo.
- Use `$env:TEMP/<skill-name>-workspace/`.

Read or create the per-skill `evals/evals.json`, then ensure each eval has a corresponding workspace shape:
iteration-N/
  eval-1-name/
    eval_metadata.json
    fixtures/
    with_skill/
      run-1/
        grading.json
        timing.json
        outputs/
    without_skill/
      run-1/
        grading.json
        timing.json
        outputs/
Keep eval_metadata.json at the eval-directory level. Put run artifacts under run-N/ so aggregate_benchmark.py can discover them.
If evals/evals.json declares files, copy those skill-relative fixtures into fixtures/ at the eval-directory level and make them available to both runs.
Do not invent custom eval directory names such as dependency-upgrades-vs-build-refactor/ without the eval- prefix. Anthropic's aggregation tooling discovers eval-* directories, not arbitrary names.
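The layout rules above can be scaffolded with a short script. This is a minimal sketch using only the Python standard library: `scaffold_eval` and its parameters are hypothetical names chosen for illustration, not part of skill-creator, and `tempfile.gettempdir()` is used as a cross-platform stand-in for `$env:TEMP`.

```python
import tempfile
from pathlib import Path

CONFIGS = ("with_skill", "without_skill")


def scaffold_eval(workspace: Path, iteration: int, eval_name: str, run: int = 1) -> Path:
    """Create iteration-N/eval-name/{config}/run-N/outputs/ plus fixtures/.

    Hypothetical helper for illustration; not part of skill-creator.
    """
    # The aggregation tooling discovers eval-* directories, so reject other names early.
    if not eval_name.startswith("eval-"):
        raise ValueError(f"eval directory must start with 'eval-': {eval_name!r}")

    eval_dir = workspace / f"iteration-{iteration}" / eval_name
    (eval_dir / "fixtures").mkdir(parents=True, exist_ok=True)
    for config in CONFIGS:
        # Run artifacts live under run-N/, never directly under the config directory.
        (eval_dir / config / f"run-{run}" / "outputs").mkdir(parents=True, exist_ok=True)
    return eval_dir


if __name__ == "__main__":
    # Assumption: "my-skill" is a placeholder skill name.
    workspace = Path(tempfile.gettempdir()) / "my-skill-workspace"
    print(scaffold_eval(workspace, iteration=1, eval_name="eval-1-name"))
```

Rejecting a missing `eval-` prefix at creation time surfaces the naming mistake before any runs are executed, rather than later as an empty aggregation.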
Run each eval in paired configurations:
- `with_skill`: the skill under test is active
- `without_skill`: baseline with no skill for new skills, or a previous/original version for existing skills

For `MEASURED` runs:
- Stage the same fixtures for both `with_skill` and `without_skill` runs when the eval declares `files`
- Persist run output under `outputs/`; it is the fallback source of truth when a convenience output file such as `last-message.txt` is missing

For `SIMULATED` runs:
Prefer scripts or direct file/diff checks over impressionistic grading.
Each grading.json must minimally look like this:
{
  "expectations": [
    {
      "text": "Uses iteration-N/eval-name/{config}/run-N/ layout",
      "passed": true,
      "evidence": "Found run-1/grading.json and run-1/timing.json"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 0,
    "total": 1,
    "pass_rate": 1.0
  }
}
If summary is missing or empty, the aggregation output is not trustworthy.
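One way to keep `summary` populated is to derive it from the expectations instead of typing it by hand. The sketch below assumes each expectation is a dict with a boolean `passed` field, as in the example above. `build_grading` and `write_grading` are hypothetical helper names, and reporting a `pass_rate` of `0.0` for an empty expectation list is an assumption rather than documented upstream behavior.

```python
import json
from pathlib import Path


def build_grading(expectations: list[dict]) -> dict:
    """Return a grading document whose summary is computed from the expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "expectations": expectations,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            # Assumption: an empty expectation list reports a pass rate of 0.0.
            "pass_rate": passed / total if total else 0.0,
        },
    }


def write_grading(run_dir: Path, expectations: list[dict]) -> Path:
    """Write grading.json into a run-N/ directory as UTF-8."""
    path = run_dir / "grading.json"
    text = json.dumps(build_grading(expectations), indent=2, ensure_ascii=False)
    # Explicit encoding avoids platform-default code pages on Windows.
    path.write_text(text, encoding="utf-8")
    return path
```

Computing the counts in one place keeps `passed`, `failed`, `total`, and `pass_rate` consistent with each other across both configurations.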
After the paired runs are graded:
1. Resolve the `skill-creator` root first, usually under `~/.agents/skills/skill-creator/` or `~/.claude/skills/skill-creator/`.
2. Run `scripts/aggregate_benchmark.py` from that resolved `skill-creator` root.
3. Verify that `benchmark.json` contains discovered runs and a non-empty `run_summary`.
4. Run `eval-viewer/generate_review.py` from that same resolved `skill-creator` root to create the human review artifact.

If the viewer shows outputs but the benchmark metrics are zero, suspect run layout or `grading.json.summary` before blaming the viewer.
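A pre-flight check can catch the layout and `summary` problems described above before aggregation runs. This is a sketch under stated assumptions: `resolve_skill_creator_root` and `find_layout_problems` are hypothetical helpers, the two candidate roots are the locations named in this document, and the check only inspects directory names and `grading.json` contents. It does not call or reproduce `aggregate_benchmark.py`.

```python
import json
from pathlib import Path

SUMMARY_KEYS = {"passed", "failed", "total", "pass_rate"}


def resolve_skill_creator_root(candidates=None) -> Path:
    """Return the first candidate directory that contains scripts/aggregate_benchmark.py."""
    if candidates is None:
        home = Path.home()
        candidates = [
            home / ".agents" / "skills" / "skill-creator",
            home / ".claude" / "skills" / "skill-creator",
        ]
    for root in candidates:
        if (root / "scripts" / "aggregate_benchmark.py").is_file():
            return root
    raise FileNotFoundError("skill-creator root not found in any candidate location")


def find_layout_problems(iteration_dir: Path) -> list[str]:
    """List layout and grading issues likely to yield missing or zeroed metrics."""
    problems = []
    eval_dirs = sorted(p for p in iteration_dir.glob("eval-*") if p.is_dir())
    if not eval_dirs:
        problems.append(f"no eval-* directories under {iteration_dir}")
    for eval_dir in eval_dirs:
        for config in ("with_skill", "without_skill"):
            config_dir = eval_dir / config
            run_dirs = sorted(p for p in config_dir.glob("run-*") if p.is_dir())
            if not run_dirs:
                problems.append(f"{config_dir}: no run-N/ directories")
            for run_dir in run_dirs:
                grading = run_dir / "grading.json"
                if not grading.is_file():
                    problems.append(f"{grading}: missing")
                    continue
                data = json.loads(grading.read_text(encoding="utf-8"))
                summary = data.get("summary") or {}
                missing = SUMMARY_KEYS - set(summary)
                if missing:
                    problems.append(f"{grading}: summary lacks {sorted(missing)}")
    return problems
```

An empty list from `find_layout_problems` means the directory names and grading summaries match the contract described here; it does not by itself prove that the aggregated metrics are meaningful.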
Summarize benchmark outcomes with plain language:
- State whether the benchmark is `MEASURED` or `SIMULATED`

Call out benchmark limitations directly. For example:
For repo-managed skills:
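- Sync the skill across the repo, `~/.claude`, and `~/.agents`
- Update `README.md`
- Run `scripts/validate-skill-templates.ps1`

- Use `skill-creator` as the base workflow and this skill as the overlay.
- Resolve the `skill-creator` path before calling benchmark scripts or the review viewer.
- Produce artifacts that `aggregate_benchmark.py` and `generate_review.py` consume without repair work.
- Label hand-authored or scripted outputs as `SIMULATED` and live executions as `MEASURED`.

- Do not hand-author `benchmark.json` instead of generating it.
- Do not flatten files directly under `with_skill/` or `without_skill/`.
- Do not name eval directories without the `eval-*` prefix and then blame the aggregator for finding zero runs.
- Do not run a `MEASURED` benchmark serially by habit even though the available runner supports safe parallel paired runs.
- Do not relabel a measured run as `SIMULATED` just because the delta is zero.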