Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

etanhey/skill-creator

Name: skill-creator
Author: etanhey

skills/golem-powers/skill-creator/SKILL.md

npx skillsauth add etanhey/golems skill-creator

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Skill Creator

Take skills from "drafted" to "proven" through rigorous multi-layer evaluation.

Installation

bash ~/Gits/golems/skills/golem-powers/skill-creator/scripts/install.sh

The installer is idempotent — safe to re-run. It:

Symlinks the skill into ~/.claude/commands/skill-creator (golem-powers convention)
Ensures ~/Gits/skill-creator/.claude/agents/ and scripts/ exist as the project-scope home
Symlinks the session-miner agent definition into the project-scope agents dir (gates the sub-agent so only skillCreator-family sessions can spawn it)
Symlinks the parser into the project-scope scripts dir

Run bash scripts/install.sh --dry-run to preview without changes. bash scripts/install.sh --help for usage.

Core Loop: DRAFT → EVAL → RED → GREEN → SMOKE → SHIP

Every skill change goes through this pipeline. No exceptions.

1. DRAFT — Understand the Skill

Read the SKILL.md or MCP tool definition completely
brain_search for: prior evals, user complaints, known issues, related skills
Identify: what does this skill claim to do? What's the baseline without it?
Document intent, inputs, outputs, edge cases
Classify: capability uplift (may become obsolete) vs encoded preference (durable)

2. EVAL DESIGN — Create Structured Test Cases

Design eval cases as structured JSON in evals/evals.json:

{
  "id": 1,
  "name": "descriptive-kebab-name",
  "category": "compliance|structure|quality",
  "description": "What this eval tests",
  "prompt": "The input scenario given to the agent",
  "assertions": [
    {
      "type": "tool_usage|content|negative",
      "name": "assertion-name",
      "description": "What correct behavior looks like"
    }
  ]
}

Cover: happy path, edge cases, failure modes, interaction with other skills. See references/scoring-rubric.md for scoring methodology.

3. RED — Baseline Measurement (without skill)

Run each eval WITHOUT the skill loaded
Score baseline performance across all eval cases
Document baseline scores
Gate: If baseline >70%, the skill may not be adding value — flag it

4. GREEN — With-Skill Measurement

Run each eval WITH the skill loaded
Score delta between with_skill and without_skill
Gate: Delta <10% = marginal, consider retiring. Delta >30% = clearly valuable.
Maximum 3 iterations before flagging for human review

5. SMOKE — Live Agent Testing

Two tiers based on skill importance:

Tier A: Static Smoke (all skills)

Review evals.json assertions manually
Verify no contradictions between skill instructions and assertions
Check that strongest discriminators are tested

Tier B: Live Agent Smoke (flagship skills only)

Run live-eval-runner.sh for top 3 discriminator evals
Agent routing: Claude skills → Sonnet, code skills → Codex, audit skills → Cursor
Compare captured output against assertions
Gate: Live delta must be within 15% of static delta
Results stored in evals/results/live-{date}.json and brain_store'd

See workflows/live-eval.md for the full live eval workflow.

6. SHIP — Package and Document

Ensure every skill ships with its evals (no eval = no ship)
brain_store eval results, delta scores, issues found
Update skill metadata (version, last-eval-date, compliance-score)
Follow workflows/create-skill.md and workflows/audit-skill.md for SKILL.md structure compliance before shipping structural changes.
REGISTER a NEW skill — committed ≠ installed. A new golem-powers skill is invisible to every agent until it's symlinked into ~/.claude/skills/<name> + ~/.claude/commands/<name> (run golem-install, which auto-discovers + links every golem-powers/*/ dir; or symlink the one new skill). Then VERIFY it appears in the available-skills list — committing/merging does NOT register it. (2026-05-30: /weave was committed + merged but unusable by any agent until the symlinks existed. See workflows/create-skill.md "Final Step: REGISTER".)

Eval Methodology

with_skill vs without_skill comparison is MANDATORY. No exceptions.

Behavioral Compliance Scoring

| Weight | Dimension | What it measures | |--------|-----------|-----------------| | 70% | Compliance | Does the agent follow the skill's instructions? | | 20% | Structure | Does the output match expected format? | | 10% | Quality | Is the output actually good/useful? |

Scoring Thresholds

| Condition | Action | |-----------|--------| | Baseline >70% | Skill may not add value — flag and explain | | Delta <10% | Marginal — consider if complexity is worth it | | Delta >30% | Clearly valuable — ship with confidence | | Compliance <50% with skill | Instructions unclear — rewrite before shipping |

Agent Routing for Live Evals

| Skill type | Test with | Why | |------------|-----------|-----| | Claude behavior skills | Sonnet (default) | Tests actual Claude compliance | | Code implementation skills | Codex (default, no model flag) | Tests code quality | | Audit/review skills | Cursor (default) | Tests review thoroughness |

Workflows

| Workflow | When to use | |----------|-------------| | create-skill.md | Creating a new skill from scratch | | audit-skill.md | Auditing existing skill structure, scripts, workflows, and registration before deploy | | live-eval.md | Running live A/B tests with real agents | | ab-compare.md | Comparing skill versions or platforms | | mine-session.md | Mining a Claude Code session JSONL into a 10-section markdown digest (handoff docs, EOD waves, claim verification) |

References

| Reference | Content | |-----------|---------| | scoring-rubric.md | Full scoring methodology and rubric | | subagent-vs-skill.md | Classification reference — READ BEFORE SCAFFOLDING any new capability. Distinguishes skill (slash-triggered, parent-context) from sub-agent (name-spawned, isolated-context). Documents the misroute pattern (GitHub openai/codex#18823) that even Codex itself routinely makes. |

Integration with Other Skills

/cmux-agents — spawning live eval agents in cmux panes
/never-fabricate — Read() every file before reporting results
/pr-loop — shipping skill changes through the full PR lifecycle

Packaged Sub-Agents (skill-creator family only)

These sub-agents ship in BOTH Claude and Codex formats. They are invokable only from sessions with cwd inside ~/Gits/skill-creator/ (i.e., skillCreatorClaude / skillCreatorCodex / skillCreatorRepoGolem). Sessions in other repos cannot spawn them directly — they must dispatch a skillCreator first.

Canonical source-of-truth lives in this skill's agents/ dir. scripts/install.sh symlinks them into the project repo's project-scope dirs (.claude/agents/ for Claude, .codex/agents/ for Codex). The dual-format packaging means the same sub-agent is available regardless of whether the parent is Claude Code or Codex CLI.

| Sub-agent | Claude format | Codex format | Workflow | |---|---|---|---| | session-miner | agents/session-miner.md (model: inherit) | agents/session-miner.toml (model: gpt-5.3-codex-spark) | mine-session.md |

Both formats invoke the same deterministic parser at scripts/session-miner.py. The Claude format uses subagent_type="session-miner" via the Agent tool; the Codex format is referenced by name in natural-language prompts (session_miner, mine X to Y).

For when to choose sub-agent vs skill: see references/subagent-vs-skill.md.

Critical Rules

Never claim "tests pass" without running them. Read actual output.
Every skill ships with evals. No eval = no ship.
brain_search BEFORE starting work — someone may have already evaluated it.
brain_store AFTER completing work — results, delta scores, decisions.
Max 3 iterations per skill before flagging for human review.
Read() every file you cite. No citing from compacted summaries.
Sequential skill compounding — build skills one at a time, each learning from the last.

etanhey/skill-creator

skills/golem-powers/skill-creator/SKILL.md

Create, edit, audit, and evaluate golem-powers skills. Use for new skills, structural skill edits, workflow/adapter changes, pre-deploy validation, skill evals, benchmarks, live A/B tests, session JSONL mining, batch miners, and handoff digests. Triggers: create skill, edit skill, audit skill, validate skill, skill eval, live eval, mine session. NOT for invoking existing skills or convergence weaving.

3 stars

development

Updated Jun 7, 2026

$ install --global

skillsauth

npx skillsauth add etanhey/golems skill-creator

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 7, 2026, 3:07 AM10.8s33 files scanned

SKILL.md

name:: skill-creator
description:: Create, edit, audit, and evaluate golem-powers skills. Use for new skills, structural skill edits, workflow/adapter changes, pre-deploy validation, skill evals, benchmarks, live A/B tests, session JSONL mining, batch miners, and handoff digests. Triggers: create skill, edit skill, audit skill, validate skill, skill eval, live eval, mine session. NOT for invoking existing skills or convergence weaving.

Skill Creator

Take skills from "drafted" to "proven" through rigorous multi-layer evaluation.

Installation

bash ~/Gits/golems/skills/golem-powers/skill-creator/scripts/install.sh

The installer is idempotent — safe to re-run. It:

Symlinks the skill into ~/.claude/commands/skill-creator (golem-powers convention)
Ensures ~/Gits/skill-creator/.claude/agents/ and scripts/ exist as the project-scope home
Symlinks the session-miner agent definition into the project-scope agents dir (gates the sub-agent so only skillCreator-family sessions can spawn it)
Symlinks the parser into the project-scope scripts dir

Run bash scripts/install.sh --dry-run to preview without changes. bash scripts/install.sh --help for usage.

Core Loop: DRAFT → EVAL → RED → GREEN → SMOKE → SHIP

Every skill change goes through this pipeline. No exceptions.

1. DRAFT — Understand the Skill

Read the SKILL.md or MCP tool definition completely
brain_search for: prior evals, user complaints, known issues, related skills
Identify: what does this skill claim to do? What's the baseline without it?
Document intent, inputs, outputs, edge cases
Classify: capability uplift (may become obsolete) vs encoded preference (durable)

2. EVAL DESIGN — Create Structured Test Cases

Design eval cases as structured JSON in evals/evals.json:

{
  "id": 1,
  "name": "descriptive-kebab-name",
  "category": "compliance|structure|quality",
  "description": "What this eval tests",
  "prompt": "The input scenario given to the agent",
  "assertions": [
    {
      "type": "tool_usage|content|negative",
      "name": "assertion-name",
      "description": "What correct behavior looks like"
    }
  ]
}

Cover: happy path, edge cases, failure modes, interaction with other skills. See references/scoring-rubric.md for scoring methodology.

3. RED — Baseline Measurement (without skill)

Run each eval WITHOUT the skill loaded
Score baseline performance across all eval cases
Document baseline scores
Gate: If baseline >70%, the skill may not be adding value — flag it

4. GREEN — With-Skill Measurement

Run each eval WITH the skill loaded
Score delta between with_skill and without_skill
Gate: Delta <10% = marginal, consider retiring. Delta >30% = clearly valuable.
Maximum 3 iterations before flagging for human review

5. SMOKE — Live Agent Testing

Two tiers based on skill importance:

Tier A: Static Smoke (all skills)

Review evals.json assertions manually
Verify no contradictions between skill instructions and assertions
Check that strongest discriminators are tested

Tier B: Live Agent Smoke (flagship skills only)

Run live-eval-runner.sh for top 3 discriminator evals
Agent routing: Claude skills → Sonnet, code skills → Codex, audit skills → Cursor
Compare captured output against assertions
Gate: Live delta must be within 15% of static delta
Results stored in evals/results/live-{date}.json and brain_store'd

See workflows/live-eval.md for the full live eval workflow.

6. SHIP — Package and Document

Ensure every skill ships with its evals (no eval = no ship)
brain_store eval results, delta scores, issues found
Update skill metadata (version, last-eval-date, compliance-score)
Follow workflows/create-skill.md and workflows/audit-skill.md for SKILL.md structure compliance before shipping structural changes.
REGISTER a NEW skill — committed ≠ installed. A new golem-powers skill is invisible to every agent until it's symlinked into ~/.claude/skills/<name> + ~/.claude/commands/<name> (run golem-install, which auto-discovers + links every golem-powers/*/ dir; or symlink the one new skill). Then VERIFY it appears in the available-skills list — committing/merging does NOT register it. (2026-05-30: /weave was committed + merged but unusable by any agent until the symlinks existed. See workflows/create-skill.md "Final Step: REGISTER".)

Eval Methodology

with_skill vs without_skill comparison is MANDATORY. No exceptions.

Behavioral Compliance Scoring

Scoring Thresholds

Agent Routing for Live Evals

Workflows

References

Integration with Other Skills

/cmux-agents — spawning live eval agents in cmux panes
/never-fabricate — Read() every file before reporting results
/pr-loop — shipping skill changes through the full PR lifecycle

Packaged Sub-Agents (skill-creator family only)

For when to choose sub-agent vs skill: see references/subagent-vs-skill.md.

Critical Rules

Never claim "tests pass" without running them. Read actual output.
Every skill ships with evals. No eval = no ship.
brain_search BEFORE starting work — someone may have already evaluated it.
brain_store AFTER completing work — results, delta scores, decisions.
Max 3 iterations per skill before flagging for human review.
Read() every file you cite. No citing from compacted summaries.
Sequential skill compounding — build skills one at a time, each learning from the last.

Related Skills

etanhey/phoenix-human-view

tools

VerifiedTrustedCommunity

The human-eval UX contract for Phoenix views: turn-by-turn scrollable replay (not a scorecard), hide-but-copyable IDs, collapsed thinking, identity chips, tool filters, tiny frozen starter datasets, mark-wrong-in-thread, mobile-first. Use when: building or reviewing ANY Phoenix/eval view, annotation UI, session replay, or human-grading surface. Triggers: phoenix view, eval UI, annotation view, session replay, human eval UX, grading interface. NOT for: Phoenix data pipelines/ingest (capture scripts have their own specs).

3SKILL.mdUpdated Jun 7, 2026

etanhey/phoenix-human-view

etanhey/mac-systems

tools

VerifiedTrustedCommunity

macOS systems specialist — AppKit NSPanel architecture, launchd services, socket activation, MCP bridge resilience, syspolicyd, and high-frequency SwiftUI dashboards. Use when building menu-bar apps, LaunchAgents, debugging syspolicyd/Gatekeeper/TCC, resilient UDS/MCP bridges, or SwiftUI dashboards at 10Hz+.

3SKILL.mdUpdated Jun 7, 2026

etanhey/judge-fleet

development

VerifiedTrustedCommunity

Bulk LLM-judging protocol for fleet-dispatched verdict runs (KG cluster, eval harness). Use when: dispatching or running judge workers (J1/J2/RT), planning bulk-apply from verdict JSONL, or triaging evidence_degraded outputs. Triggers: judge fleet, bulk judge, R3 verdicts, kg-judge, RT gate, evidence_degraded. NOT for: single-item code review, Phoenix view UX (use phoenix-human-view), or non-judge eval pipelines.

3SKILL.mdUpdated Jun 7, 2026

etanhey/fleet-wrap

development

VerifiedTrustedCommunity

Quiet-down protocol for sprint close: when the fleet wraps, delete ALL polling crons and monitors, send ONE final dashboard + ONE message, then go SILENT. Use when: fleet wraps, all workers done, overnight queue exhausted, sprint close, Etan asleep/away with nothing approved left. Triggers: fleet wrap, wrap the fleet, stand down, going quiet, sprint close. NOT for: mid-sprint monitoring (keep your loops), spawning a successor (use /session-handoff first).

3SKILL.mdUpdated Jun 7, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/etanhey/golems.git

# Copy into Claude Code skills folder (global)
cp -r golems/skills/golem-powers/skill-creator ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

etanhey/golems

3 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT