skill-maker

Create production-quality Tessl skills from scratch and optimize them through eval-driven iteration.

Two modes of operation:

Create: Interactive interview → scaffold a complete skill directory → repo integration → lint → (when Tessl CLI exists) §2.6 pipeline through scenarios and eval.
Optimize: Refresh scenarios (Phase 3 Path T, which reuses §2.6 steps 3–5, or Phase 3 Path M) → run evals → analyze results → propose edits → apply → re-eval → log.

Detect which mode from the user's request. If ambiguous, ask.

Non-negotiables (do not skip)

1. Always use AskUserQuestion for every user-facing choice. Never assume answers.
2. Every AskUserQuestion must include an uncertainty option ("I'm not sure — help me decide"). When selected, either infer from prior answers or trigger a research loop.
3. Run all interview questions before scaffolding. Do not generate SKILL.md mid-interview.
4. Scaffold must lint-pass. After generating files, run tessl tile lint if available; otherwise run simulated lint checks (Phase 2.5).
5. Repo integration is mandatory. Every new skill gets a README row and a CI matrix entry. Check for existing entries before inserting (idempotency). Do not add or edit skills-lock.json: it pins vendored skills under .agents/skills/; first-party tiles live under skills/ and are wired via README + CI only.
6. Eval execution is read-only. Never modify skill files during tessl eval run or LLM-as-judge runs. All skill mutations (e.g. tessl skill review, Phase 5 apply) must occur before eval execution starts, at explicit workflow boundaries, never interleaved with a running eval.
7. Benchmark logging is append-only. Never overwrite or delete previous entries in benchmark-log.md.
8. Present optimization proposals as specific edits with before/after text, not vague suggestions. tessl skill review --optimize --yes is an exception: Tessl applies changes immediately without per-edit approval. Treat it as a distinct "Tessl apply" step and tell the user when you run it.
9. Every SKILL.md includes metadata.version. Use a semver string (e.g. 1.0.0 for new skills; bump minor or patch when behavior or documentation meaningfully changes).

Mode: Create

Phase 1 — Interview

Run all 10 questions using AskUserQuestion before generating any files. Collect answers into a working decision map held in memory. Complete the Completeness check at the end before proceeding.

| # | Question | Key options | If unsure |
|---|----------|-------------|-----------|
| 1 | What does this skill do? (one sentence) | Free text | Ask "What task should the AI do better?" and "What goes wrong without it?" |
| 2 | Who will use this skill? | Developers / Semi-technical / Both | Default to Both |
| 3 | What type of project? | Code generation / Writing / Tool use / Interview / Other | Ask for a brief domain description |
| 4 | What are the 3–5 things this skill MUST do every time? | Free text (list) | Ask "What would make you say 'it worked perfectly'?" |
| 5 | What should this skill NEVER do? | Free text (list) | Generate domain-specific anti-patterns from purpose + domain answers |
| 6 | What phrases or signals activate this skill? | Free text / Generate suggestions / Research similar | Produce ≥5 candidate trigger terms from purpose + domain + behaviors; present for approval |
| 7 | What does the final output look like? | Files / Structured message / Interactive flow | Research similar skills |
| 8 | Does this skill need companion files beyond SKILL.md? | No / Rules files / Templates | Recommend companion files if >5 core behaviors or estimated length >300 lines |
| 9 | Which tools does this skill need? | AskUserQuestion only / + file tools / + WebSearch / All + Bash | Infer from domain: Code-gen → file tools + Bash; Writing → file tools; Workflow → all; Interview → AskUserQuestion + optionally WebSearch |
| 10 | Describe 2–3 realistic test tasks for this skill | Free text / Generate / Skip | Generate from purpose + behaviors |

Completeness check: Before scaffolding, verify all 10 categories have resolved values. If any are missing or still "unsure," resolve them before continuing.
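The completeness check can be sketched as a small function over the decision map. This is an illustration only: the ten key names and the set of "unresolved" markers are hypothetical, since the skill does not prescribe how the decision map is stored.

```python
# Hypothetical keys for the 10 interview categories; the names are
# illustrative and not defined by the skill itself.
REQUIRED_KEYS = [
    "purpose", "audience", "project_type", "must_do", "never_do",
    "triggers", "output_shape", "companion_files", "tools", "test_tasks",
]

# Assumed markers for an answer that still needs resolving.
UNRESOLVED = {"", "unsure", "i'm not sure"}


def unresolved_categories(decision_map: dict) -> list[str]:
    """Return the categories that are missing or still marked unsure."""
    missing = []
    for key in REQUIRED_KEYS:
        value = decision_map.get(key)
        if value is None:
            missing.append(key)
        elif isinstance(value, str) and value.strip().lower() in UNRESOLVED:
            missing.append(key)
        elif isinstance(value, (list, tuple)) and len(value) == 0:
            missing.append(key)
    return missing
```

Scaffolding would proceed only when the returned list is empty; any remaining entry points back at the interview question to revisit.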

Phase 2 — Scaffold Generator

Turn the decision map into a complete skill directory. See scaffold-rules for full implementation details.

2.1 SKILL.md

Frontmatter:

---
name: <skill-name>
description: "<Purpose>. Triggers: <trigger terms>. Uses <tools>. Outputs: <deliverables>. Do NOT use for: <exclusions>."
metadata:
  version: "1.0.0"
  tags: <domain tags>
---

Apply activation-design heuristics: front-load trigger terms; use imperatives throughout ("Use X", "Do not Y", "Always Z") — never "consider", "may want", or "try to".
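One way to enforce the imperative-language rule is a quick scan of the generated SKILL.md for the three hedging phrases named above. The helper below is a hypothetical sketch, not part of Tessl or of the activation-design rules file.

```python
import re

# The three phrases the heuristic names; extend the tuple as needed.
HEDGES = ("consider", "may want", "try to")
_HEDGE_RE = re.compile(
    r"\b(" + "|".join(re.escape(h) for h in HEDGES) + r")\b",
    re.IGNORECASE,
)


def find_hedges(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) pairs for each hedging phrase found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _HEDGE_RE.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits
```

Each hit is a candidate for rewriting as an imperative ("Use X", "Do not Y", "Always Z").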

Body structure: title + one-liner → non-negotiables (numbered) → process/phases → integrated example (realistic, exercises ≥2 non-negotiables) → anti-patterns.

Length target: 150–400 lines. If content exceeds 400 lines, extract secondary rules into rules/*.md and reference with relative links.

2.2 tile.json

{
  "name": "oh-my-ai/<skill-name>",
  "version": "1.0.0",
  "private": false,
  "summary": "<one-line purpose>",
  "skills": { "<skill-name>": { "path": "SKILL.md" } }
}
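Rendering tile.json from the decision map could look like the sketch below. `build_tile` is a hypothetical helper; the field set mirrors the template above and nothing more.

```python
import json


def build_tile(skill_name: str, summary: str, version: str = "1.0.0") -> str:
    """Render tile.json text for a first-party skill."""
    tile = {
        "name": f"oh-my-ai/{skill_name}",
        "version": version,
        "private": False,
        "summary": summary,
        # The skill key matches the directory name; the path is relative to the tile.
        "skills": {skill_name: {"path": "SKILL.md"}},
    }
    return json.dumps(tile, indent=2) + "\n"
```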

2.3 Companion files (conditional)

If the interview identified companion file needs, create them as follows. Rules: rules/<rule-name>.md, each with YAML frontmatter (name, description) and structured content. Templates: placed at the skill root and referenced from SKILL.md with relative links.

2.4 Repo integration (automated, idempotent)

Both integrations are mandatory. Check for existing entries before inserting.

README.md: Insert a new alphabetically-ordered row in the | Skill | Description | table.
CI matrix (.github/workflows/tessl-publish.yml): Append - skills/<skill-name> to the tile: array. Validate YAML after editing (python3 -c "import yaml; yaml.safe_load(open(...))") — revert and report on failure.
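The idempotency requirement for the README row can be sketched as follows. This is a minimal illustration that assumes the first table cell holds the plain skill name (a real README may use links instead); `insert_readme_row` is a hypothetical helper.

```python
def insert_readme_row(readme: str, skill: str, description: str) -> str:
    """Insert a table row for `skill` in alphabetical order, at most once."""
    lines = readme.splitlines()
    # Locate the header of the | Skill | Description | table.
    header = next(
        (i for i, line in enumerate(lines)
         if [c.strip() for c in line.strip().strip("|").split("|")]
         == ["Skill", "Description"]),
        None,
    )
    if header is None:
        raise ValueError("Skill table not found")
    first = header + 2  # skip the header row and the |---|---| separator
    last = first
    while last < len(lines) and lines[last].lstrip().startswith("|"):
        last += 1

    def name_of(row: str) -> str:
        return row.strip().strip("|").split("|")[0].strip()

    rows = lines[first:last]
    if any(name_of(r) == skill for r in rows):
        return readme  # already present: leave the file untouched
    rows.append(f"| {skill} | {description} |")
    rows.sort(key=lambda r: name_of(r).lower())
    lines[first:last] = rows
    return "\n".join(lines) + ("\n" if readme.endswith("\n") else "")
```

Calling the function twice with the same skill returns the same text the second time, which is the property the check-before-insert rule asks for. The CI matrix edit would need an equivalent guard.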

2.5 Lint check

With tessl CLI: cd skills/<skill-name> && tessl tile lint

Simulated (no CLI): Verify that SKILL.md has valid YAML frontmatter with name, description, and metadata.version (semver string); tile.json has name, version, summary, and skills; the tile.json name matches oh-my-ai/<skill-name>; there are no broken relative links; and each rules/*.md has YAML frontmatter with name and description.
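A partial sketch of the simulated lint is shown below. It covers only the SKILL.md frontmatter and tile.json checks, not relative links or rules files. Two simplifications are assumptions of this sketch: the frontmatter parser handles only the flat shape from §2.1 rather than general YAML, and the semver pattern ignores pre-release and build tags.

```python
import json
import re

SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # simplified: no pre-release/build tags


def parse_frontmatter(skill_md: str) -> dict:
    """Parse the simple frontmatter shape from section 2.1 (not general YAML)."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    result, parent = {}, None
    for line in lines[1:]:
        if line.strip() == "---":
            return result
        if not line.strip() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip('"')
        if line[0].isspace() and parent is not None:
            result[parent][key.strip()] = value  # nested, e.g. under metadata
        elif value == "":
            parent = key.strip()
            result[parent] = {}
        else:
            parent = None
            result[key.strip()] = value
    return {}  # closing '---' never found


def simulated_lint(skill_md: str, tile_json: str, skill_name: str) -> list[str]:
    """Return failure messages; an empty list means these checks passed."""
    failures = []
    fm = parse_frontmatter(skill_md)
    for key in ("name", "description"):
        if not fm.get(key):
            failures.append(f"SKILL.md frontmatter missing {key}")
    meta = fm.get("metadata")
    version = meta.get("version", "") if isinstance(meta, dict) else ""
    if not SEMVER.match(version):
        failures.append("SKILL.md metadata.version is not a semver string")
    tile = json.loads(tile_json)
    for key in ("name", "version", "summary", "skills"):
        if key not in tile:
            failures.append(f"tile.json missing {key}")
    if tile.get("name") != f"oh-my-ai/{skill_name}":
        failures.append("tile.json name does not match oh-my-ai/<skill-name>")
    return failures
```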

Report results. Fix failures and re-lint.

2.6 Tessl CLI pipeline (preferred when CLI exists)

Run from repository root with paths like ./skills/<skill-name>. Use which tessl (or equivalent) first; if missing, skip this subsection and use Phase 3 Path M + Phase 4 Path B as needed.
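The CLI detection step maps directly onto Python's `shutil.which`, the standard-library counterpart of `which tessl`. The returned labels below simply name the paths described in this document; `choose_paths` itself is a hypothetical helper.

```python
import shutil


def choose_paths(cli_name: str = "tessl") -> dict:
    """Pick scenario and eval paths based on whether the CLI is on PATH."""
    if shutil.which(cli_name) is not None:
        return {"scenarios": "Path T", "eval": "Path A"}
    # CLI missing: fall back to manual scenarios and LLM-as-judge.
    return {"scenarios": "Path M", "eval": "Path B"}
```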

Boundary: tessl skill review writes the skill and must complete before tessl eval run starts (→ non-negotiable #6).

| Step | Command / action |
|------|------------------|
| 1 | tessl skill review --optimize --yes ./skills/<skill-name>: may rewrite SKILL.md (and other files per Tessl). This is Tessl auto-apply (non-negotiable #8). |
| 2 | If the skill has tile.json: cd skills/<skill-name> && tessl tile lint (same as Phase 2.5). |
| 3 | tessl scenario generate ./skills/<skill-name>: parse the generation id from stdout; do not guess. |
| 4 | tessl scenario download <generation>: use the id from step 3. |
| 5 | Place downloaded scenarios under the skill: if Tessl wrote ./evals/ at repo root, move it with mv ./evals/ ./skills/<skill-name>/ (or merge; see below). If output landed elsewhere, move that directory into skills/<skill-name>/evals/. |
| 6 | Continue to Phase 4, Path A in eval-runner. |

If skills/<skill-name>/evals/ already exists: Use AskUserQuestion before moving: replace entirely, merge (explain how), or download to a temp directory — never overwrite silently.

Local Tessl cache under .tessl/ stays out of git (typically gitignored).

Mode: Optimize

Phase 3 — Eval Scenario Generator

If the skill has no evals/ directory, or the user asks for eval scenarios, offer to create them via AskUserQuestion.

Path T — Tessl CLI (preferred): Run steps 3–5 from §2.6: tessl scenario generate → parse generation id → tessl scenario download → place under skills/<skill-name>/evals/. To tune the skill before generating scenarios, run steps 1–2 from §2.6 first. After download, verify coverage against benchmark-loop; add or adjust scenarios by hand if gaps remain.

Path M — Manual (fallback): Use when Tessl is missing, the user declines CLI generation, or download fails. Author scenarios directly.

Generate 2–3 scenarios (or validate CLI output) following benchmark-loop coverage rules (full scenario schema, scoring rules, and selection heuristics are defined there). For each scenario, ensure evals/<scenario-slug>/ contains:

task.md — A realistic problem (100–300 words) reflecting actual user prompts. Not a toy example.

criteria.json:

{
  "context": "Tests whether <specific capability>",
  "type": "weighted_checklist",
  "checklist": [
    { "name": "<criterion>", "max_score": N, "description": "<what to check>" }
  ]
}

Key constraints: all max_score values must sum to exactly 100; each criterion must be independently verifiable. Name scenarios as kebab-case slugs (e.g., core-interview-flow, noisy-context-retrieval).
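The two mechanical constraints (weights summing to 100, kebab-case slugs) are easy to check before running any eval. The validator below is a hypothetical sketch; it does not attempt to judge whether a criterion is independently verifiable.

```python
import json
import re

KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def validate_criteria(criteria_json: str, scenario_slug: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not KEBAB.match(scenario_slug):
        problems.append(f"slug '{scenario_slug}' is not kebab-case")
    data = json.loads(criteria_json)
    checklist = data.get("checklist", [])
    total = sum(item["max_score"] for item in checklist)
    if total != 100:
        problems.append(f"max_score values sum to {total}, expected 100")
    return problems
```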

Phase 4 — Eval Runner

See eval-runner for full implementation (including the full Tessl CLI pipeline and --json vs --agent). Summary:

Path A — Tessl CLI (preferred): From repo root, tessl eval run ./skills/<skill-name> --json (add --agent=... when you need a fixed judge model; see eval-runner). Parse JSON output into per-scenario, per-criterion scores.

Path B — LLM-as-Judge Fallback: For each scenario, run two subagents (Agent tool) — one with task only (baseline), one with SKILL.md prepended (with-skill). Score each criterion by launching a judge subagent with the criterion description and agent output; request a JSON response {"score": N, "reasoning": "..."}.

Assemble results into a unified schema: date, method, model, scenarios (each with baseline score, with-skill score, delta, and per-criterion breakdown).
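The unified schema might be modeled as below. The class and field names are hypothetical, and the sketch assumes a scenario score is the sum of its criterion scores, which is plausible given that max_score values total 100 but is not stated explicitly; eval-runner remains the authority.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionScore:
    name: str
    max_score: int
    baseline: float
    with_skill: float


@dataclass
class ScenarioResult:
    name: str
    criteria: list[CriterionScore] = field(default_factory=list)

    @property
    def baseline(self) -> float:
        return sum(c.baseline for c in self.criteria)

    @property
    def with_skill(self) -> float:
        return sum(c.with_skill for c in self.criteria)

    @property
    def delta(self) -> float:
        # Positive when the skill helps, negative on regression.
        return self.with_skill - self.baseline


@dataclass
class EvalRun:
    date: str    # ISO-8601 timestamp
    method: str  # "tessl-cli" or "llm-as-judge"
    model: str
    scenarios: list[ScenarioResult] = field(default_factory=list)
```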

Calibration: If both paths are available, run both on the same scenarios. Accept if within ±15%; otherwise flag to user and prefer CLI results.
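A calibration comparison could be sketched as follows. The text does not say whether ±15% is relative or absolute; this sketch assumes an absolute difference of 15 points on the 0–100 scale, and the function name is hypothetical.

```python
def calibration_mismatches(
    cli: dict[str, float],
    judge: dict[str, float],
    tolerance: float = 15.0,
) -> list[str]:
    """Return scenarios whose two scores differ by more than `tolerance` points."""
    shared = sorted(set(cli) & set(judge))  # only scenarios run on both paths
    return [name for name in shared if abs(cli[name] - judge[name]) > tolerance]
```

Any scenario in the returned list would be flagged to the user, with the CLI result preferred.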

Phase 5 — Optimizer

Analyze eval results, classify failures, and propose targeted edits. See activation-design and benchmark-loop for full failure pattern definitions and classification guidance.

5.1 Failure classification

| Pattern | Signal | Fix |
|---------|--------|-----|
| Activation gap | Skill didn't fire / agent ignored instructions | Add explicit triggers to description; front-load non-negotiables |
| Ambiguous instruction | Inconsistent behavior across runs | Replace "consider"/"may want" with imperatives |
| Missing example | Agent doesn't know expected output shape | Add integrated example showing input → decision points → output |
| Regression | Negative delta vs. baseline | Identify which edit caused it; revert or rewrite |
| Context overload | Skill too long, agent loses focus | Compress; extract rules to companion files |

5.2 Analysis workflow

1. Rank scenarios by delta (lowest first).
2. Within each low-scoring scenario, identify criteria scoring below 50% of their max_score.
3. Classify each into a failure pattern from the table above.
4. For the top 3 failures, propose specific edits: exact before/after text, failure pattern addressed, estimated score improvement.
5. Present proposals to the user. Do not apply without approval.
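The ranking and threshold steps of this workflow can be sketched in a few lines. The dictionary field names are hypothetical, and because "low-scoring" is not given a cutoff here, the sketch simply walks every scenario from lowest delta upward. Classification and edit proposals remain a judgment step.

```python
def weakest_criteria(scenarios: list[dict], top_n: int = 3) -> list[tuple[str, str]]:
    """Collect (scenario, criterion) pairs scoring below half of max_score,
    visiting scenarios from lowest delta to highest."""
    ranked = sorted(scenarios, key=lambda s: s["with_skill"] - s["baseline"])
    failures = []
    for scenario in ranked:
        for crit in scenario["criteria"]:
            if crit["score"] < 0.5 * crit["max_score"]:
                failures.append((scenario["name"], crit["name"]))
    return failures[:top_n]
```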

5.3 Apply and re-eval

On user approval: apply edits → re-run Phase 4 → compare new vs. previous results → log to benchmark-log.md → flag any negative deltas immediately.

Optional Tessl loop: Before Phase 4, re-run §2.6 from step 1 (tessl skill review through scenario refresh) to regenerate scenarios after major skill changes. All such mutations must finish before eval execution begins (→ non-negotiable #6).

Phase 6 — Benchmark Logging and Gate Check

6.1 Benchmark log

After every eval run, append to skills/<skill-name>/benchmark-log.md:

## Run: <ISO-8601 timestamp>

**Method:** <tessl-cli | llm-as-judge> | **Model:** <model-name>

| Scenario | Baseline | With Skill | Delta |
|----------|----------|------------|-------|
| <name>   | <score>  | <score>    | <+/-N> |

**Changes applied:** <summary of edits, or "Initial evaluation">

---

Create the file if it doesn't exist. Always append — never overwrite.
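The append-only rule corresponds to opening the log in append mode, which creates the file when it is absent and never truncates it. The helper below is a hypothetical sketch that renders the entry template above; it assumes integer scores.

```python
from pathlib import Path


def append_benchmark_entry(
    log_path: Path,
    timestamp: str,
    method: str,
    model: str,
    rows: list[tuple[str, int, int]],  # (scenario, baseline, with_skill)
    changes: str,
) -> None:
    """Append one run entry to benchmark-log.md without touching earlier ones."""
    lines = [
        f"## Run: {timestamp}",
        "",
        f"**Method:** {method} | **Model:** {model}",
        "",
        "| Scenario | Baseline | With Skill | Delta |",
        "|----------|----------|------------|-------|",
    ]
    for name, baseline, with_skill in rows:
        lines.append(f"| {name} | {baseline} | {with_skill} | {with_skill - baseline:+d} |")
    lines += ["", f"**Changes applied:** {changes}", "", "---", ""]
    # Mode "a" creates the file if needed and only ever writes at the end.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```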

6.2 Gate check

Negative delta on any scenario: WARN — "Scenario '<name>' regressed by <N> points. Review the last set of changes."
Any criterion at 0% with skill enabled: WARN — "Criterion '<name>' in '<scenario>' scored 0 with the skill active. The skill may be actively interfering."
All deltas positive and ≥10 points: PASS — "All scenarios show meaningful improvement."

Warnings do not block. If warnings exist, offer to run another optimization cycle (return to Phase 5).
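The three gate rules can be expressed as a small function. The input shape is hypothetical. Note one gap the rules leave open: when every delta is positive but some are below 10 points, neither WARN nor PASS applies, so this sketch returns no message in that case rather than inventing a verdict.

```python
def gate_check(scenarios: list[dict]) -> list[str]:
    """Return WARN/PASS messages for a completed eval run."""
    messages = []
    for s in scenarios:
        delta = s["with_skill"] - s["baseline"]
        if delta < 0:
            messages.append(
                f"WARN: Scenario '{s['name']}' regressed by {-delta} points. "
                "Review the last set of changes."
            )
        for crit in s["criteria"]:
            if crit["score"] == 0:  # with-skill score for this criterion
                messages.append(
                    f"WARN: Criterion '{crit['name']}' in '{s['name']}' scored 0 "
                    "with the skill active. The skill may be actively interfering."
                )
    all_meaningful = bool(scenarios) and all(
        s["with_skill"] - s["baseline"] >= 10 for s in scenarios
    )
    if not messages and all_meaningful:
        messages.append("PASS: All scenarios show meaningful improvement.")
    return messages
```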

Integrated example

User says: "Create a skill for writing git commit messages"

Interview summary → decision map:

| # | Answer |
|---|--------|
| 1 | "Generate conventional commit messages from staged diffs" |
| 2 | Developers |
| 3 | Code generation |
| 4 | Read staged diff; use Conventional Commits format; keep subject ≤72 chars; include body for non-trivial changes |
| 5 | Never fabricate changes not in the diff; never use vague subjects like "update code" |
| 6 | "commit message, write commit, git commit, conventional commit" |
| 7 | Structured message |
| 8 | No companion files |
| 9 | Bash (for git diff --staged) |
| 10 | Generate scenarios |

Scaffold produced:

skills/commit-message/
├── SKILL.md          # Frontmatter with triggers, non-negotiables, format rules, integrated example
├── tile.json         # oh-my-ai/commit-message, v1.0.0
└── evals/
    ├── simple-feature-commit/
    │   ├── task.md        # "Given this staged diff adding a login form..."
    │   └── criteria.json  # Tests: conventional format, subject length, body presence
    └── noisy-multi-file-commit/
        ├── task.md        # "Given this large diff touching 8 files..."
        └── criteria.json  # Tests: focus, not fabricating, correct scope

Repo integration: README row added (alphabetically); CI matrix updated.

Anti-patterns

Generating SKILL.md before the interview is complete.
Producing a skill with a description that lacks explicit trigger terms.
Writing eval criteria that don't sum to 100.
Proposing vague optimization suggestions instead of specific before/after edits.
Modifying skill files during eval execution (tessl eval run / judge runs) or overwriting previous benchmark-log.md entries.
Running tessl skill review in the middle of Phase 4, or guessing a scenario generation id instead of parsing CLI output.
Skipping the lint check after scaffold generation, or inserting duplicate rows in README or the CI matrix.
Adding or editing skills-lock.json when scaffolding a first-party skill under skills/.

kvokov/skill-maker
