skills/skillify/SKILL.md
The meta skill. Turn any raw feature into a properly-skilled, tested, resolvable unit of agent capability. Cross-modal eval is the recommended Phase 3 quality gate: 3 frontier models from different providers critique the output, you iterate to quality, THEN write tests that lock in the proven-good behavior.
Relationship to `/cross-modal-review`: That skill is the manual mid-flow "second opinion" gate (one model reviews work product before commit). This skill's Phase 3 below uses `gbrain eval cross-modal` instead: three different-provider frontier models score-and-iterate on a documented dimension list before tests cement behavior. Use `/cross-modal-review` for ad-hoc second opinions; use Phase 3 here when skillifying a feature.
A feature is "properly skilled" when all 11 checklist items pass. Item 3 (cross-modal eval) is informational in v1.1.0 — it does not gate the skillpack-check audit, but a missing or stale receipt is surfaced so the user knows where the gate stands.
□ 1. SKILL.md — skill file with frontmatter + contract + phases
□ 2. Code — deterministic script if applicable
□ 3. Cross-modal eval — 3 frontier models from 3 providers; informational
□ 4. Unit tests — cover every branch of deterministic logic
□ 5. Integration tests — exercise live endpoints
□ 6. LLM evals — quality/correctness cases for LLM-involving steps
□ 7. Resolver trigger — entry in skills/RESOLVER.md with real user trigger phrases
□ 8. Resolver eval — test that triggers route to this skill
□ 9. Check-resolvable — DRY + MECE audit, no orphans
□ 10. E2E test — smoke test: trigger → side effect
□ 11. Brain filing — if it writes pages, entry in brain/RESOLVER.md
Before skillifying, check:
If no to all three, it's a script, not a skill. Move on.
Feature: [name]
Code: [path]
Missing items: [check each of the 11]
---
name: my-skill
version: 1.0.0
description: |
  One paragraph. What it does, when to use it.
triggers:
  - "trigger phrase users actually say"
  - "another real trigger"
tools:
  - exec
  - read
  - write
mutating: false # true if it writes to brain/disk
---
Body must include: Contract (what it guarantees), Phases (step-by-step), Output Format (what it produces).
Extract deterministic code into scripts/*.ts.
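The body requirement above (Contract, Phases, Output Format) is easy to check mechanically. The following is a minimal sketch, not part of gbrain: `missingSections` is a hypothetical helper, and the matching rule (a markdown heading whose text contains the section name) is an assumption.

```typescript
// Section names come from the text above. The matching rule is an assumption:
// any markdown heading (# to ######) whose text contains the name counts.
const REQUIRED_SECTIONS = ["Contract", "Phases", "Output Format"] as const;

// Hypothetical helper: returns the required sections that have no heading in `body`.
function missingSections(body: string): string[] {
  const headings = body
    .split("\n")
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, "").trim().toLowerCase());
  return REQUIRED_SECTIONS.filter(
    (name) => !headings.some((heading) => heading.includes(name.toLowerCase())),
  );
}

// Usage: this draft has Contract and Phases but no Output Format heading.
const draftBody = "# my-skill\n\n## Contract\nGuarantees X.\n\n## Phases\n1. Do Y.\n";
console.log(missingSections(draftBody)); // lists "Output Format" as the one missing section
```

A check like this fits naturally in a `scripts/*.ts` file, since it is deterministic and can be unit tested.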
Tests lock in behavior. If the behavior is mediocre, tests lock in mediocrity. Cross-modal eval proves the quality bar FIRST, then tests cement it.
Choose the input that exercises the skill's hardest documented use case. If unsure: use the primary trigger example from SKILL.md, or the most complex real-world input from the last 7 days of memory files.
Run the skill on the representative input. The OUTPUT FILE is what gets evaluated.
gbrain eval cross-modal \
  --task "What this skill is supposed to accomplish" \
  --output skills/<slug>/SKILL.md
The command runs 3 frontier models from 3 different providers in parallel,
scores the OUTPUT against the TASK on 5 documented dimensions, and writes a
receipt under ~/.gbrain/.gbrain/eval-receipts/<slug>-<sha8>.json (the
sha-8 binds the receipt to the current SKILL.md content — re-running after
edits writes a new receipt).
Default models (override per slot via --slot-a-model, --slot-b-model,
--slot-c-model):
| Slot | Default | Provider |
|------|---------|----------|
| A | openai:gpt-4o | OpenAI |
| B | anthropic:claude-opus-4-7 | Anthropic |
| C | google:gemini-1.5-pro | Google |
These MUST be frontier models from DIFFERENT providers. Using a single provider's family or budget models defeats the purpose — different families have less correlated blind spots. Refresh the list when a new model generation ships.
Pass criteria (BOTH must be true):
Inconclusive: fewer than 2 of 3 models returned parseable scores. Receipt is still written (forensics) but the gate is not authoritative. Exit code 2; CI wrappers should treat this as "did not run cleanly", not "failed quality gate".
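A CI wrapper can separate "did not run cleanly" from "failed quality gate" by looking at the exit code. The sketch below assumes exit code 0 means pass and any code other than 0 or 2 means fail; only exit code 2 (inconclusive) is documented above. The function names are hypothetical.

```typescript
type GateOutcome = "pass" | "fail" | "inconclusive";

// Rule from the text: fewer than 2 of 3 models returned parseable scores.
function isInconclusive(parseableCount: number): boolean {
  return parseableCount < 2;
}

// ASSUMPTION: 0 = pass, 2 = inconclusive (documented), anything else = fail.
function interpretExit(code: number): GateOutcome {
  if (code === 0) return "pass";
  if (code === 2) return "inconclusive";
  return "fail";
}

// Hypothetical CI message: an inconclusive run is reported separately
// so it is never mistaken for a quality failure.
function ciSummary(code: number): string {
  switch (interpretExit(code)) {
    case "pass":
      return "quality gate passed";
    case "inconclusive":
      return "eval did not run cleanly; re-run before trusting the gate";
    case "fail":
      return "quality gate failed";
  }
}

console.log(ciSummary(2));
```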
CYCLE 1:
  Eval → scores + top 10 improvements
  IF pass: → done, write tests
  ELSE:
    Apply top 10 improvements to the actual file
    Log: which improvements applied, what changed
CYCLE 2:
  Re-eval the FIXED output (same 3 models, same dimensions)
  Compare: before/after scores per dimension (track delta)
  IF pass: → done, write tests
  ELSE: apply remaining improvements + new ones
CYCLE 3 (final):
  Re-eval
  IF pass: → ship
  ELSE: → ship with KNOWN_GAPS section listing:
    - Which dimensions are still below 7
    - Which improvements couldn't be resolved
    - Why (e.g., "would require architectural change")
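The cycle flow above can be sketched as a bounded loop. This is an illustration under stated assumptions, not the gbrain implementation: `runCycles`, `EvalResult`, and `LoopOutcome` are hypothetical names, and the evaluator is injected as a synchronous callback (a real run would call the models asynchronously).

```typescript
interface EvalResult {
  pass: boolean;
  improvements: string[]; // suggestions returned by the eval, best first
  belowBar: string[];     // dimensions still under the pass bar
}

interface LoopOutcome {
  cyclesRun: number;
  passed: boolean;
  knownGaps: string[]; // empty when the gate passed
  log: string[];
}

// Hypothetical driver: eval, stop on pass, otherwise apply the top 10
// improvements and re-eval, up to maxCycles (3 in the flow above).
function runCycles(
  evaluate: (cycle: number) => EvalResult,
  applyImprovements: (improvements: string[]) => void,
  maxCycles = 3,
): LoopOutcome {
  const log: string[] = [];
  let last: EvalResult = { pass: false, improvements: [], belowBar: [] };
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    last = evaluate(cycle);
    if (last.pass) {
      log.push(`cycle ${cycle}: pass`);
      return { cyclesRun: cycle, passed: true, knownGaps: [], log };
    }
    if (cycle < maxCycles) {
      const top = last.improvements.slice(0, 10);
      applyImprovements(top);
      log.push(`cycle ${cycle}: applied ${top.length} improvements`);
    }
  }
  // Final cycle failed: ship, but record what is still below the bar.
  log.push(`cycle ${maxCycles}: shipping with known gaps`);
  return { cyclesRun: maxCycles, passed: false, knownGaps: last.belowBar, log };
}

// Usage with a fake evaluator that passes on the second cycle.
const demoOutcome = runCycles(
  (cycle) => ({ pass: cycle >= 2, improvements: ["add file-level changes"], belowBar: [] }),
  () => {},
);
console.log(demoOutcome.cyclesRun, demoOutcome.passed);
```

Keeping the evaluator and the fix step as injected callbacks makes the loop itself deterministic, so it can be unit tested without calling any model.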
- `--cycles 3` in TTY, `--cycles 1` in non-TTY (limits scripted bulk spend in CI loops).
- `--max-tokens 4000`.
- `--budget-usd N` hard cap is a v0.27.x follow-up TODO.

Models resolve through the gbrain AI gateway. Configure once with:
gbrain providers test # see what's configured
gbrain config # set keys
Or set env vars: OPENAI_API_KEY, ANTHROPIC_API_KEY,
GOOGLE_GENERATIVE_AI_API_KEY, TOGETHER_API_KEY, etc. The gateway reads
from ~/.gbrain/config.json plus process.env.
3 cycles × 3 models = 9 frontier calls max per run. With Opus-class +
GPT-4o-class + Gemini-1.5-Pro, expect $1–3 per full run on default
--max-tokens 4000. Receipts include the per-call model identifiers so
you can audit retroactively.
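The call-count arithmetic above can be written out as a quick check. The per-call figure below is a back-of-envelope derivation, assuming a run that uses all 9 calls and splits the quoted $1–3 evenly across them; the function names are hypothetical.

```typescript
// Upper bound on frontier calls: one call per model per cycle.
function maxFrontierCalls(cycles: number, models: number): number {
  return cycles * models;
}

// ASSUMPTION: cost is spread evenly over calls; rounded to cents.
function perCallUsd(runUsd: number, calls: number): number {
  return Math.round((runUsd / calls) * 100) / 100;
}

const ttyCalls = maxFrontierCalls(3, 3); // 9 calls with --cycles 3
const ciCalls = maxFrontierCalls(1, 3);  // 3 calls with --cycles 1 (non-TTY default)
console.log(ttyCalls, ciCalls);
console.log(perCallUsd(1, ttyCalls), perCallUsd(3, ttyCalls)); // 0.11 0.33
```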
NOW that eval has proven quality, write tests that lock it in:
- Unit tests: every branch of deterministic logic. Mock external calls.
- Integration tests: hit real endpoints. Catch bugs mocks hide.
- LLM evals: quality/correctness for LLM steps. Lighter than cross-modal eval; they test specific behaviors.
bun test test/<slug>.test.ts # unit tests
gbrain skillify check skills/<slug>/scripts/<slug>.mjs --json | \
jq '.[] | .items[] | select(.name | contains("Cross-modal"))'
ls ~/.gbrain/.gbrain/eval-receipts/ # receipt landed
gbrain check-resolvable --json | jq .ok # resolver clean
Phase 0: Yes — invoked weekly, 50+ lines, clear trigger "summarize this PR"
Phase 1: Audit → SKILL.md missing, no tests, no resolver entry. Score: 1/11
Phase 2: Write SKILL.md + extract script to scripts/summarize-pr.ts
Phase 3: Cross-modal eval cycle 1 →
  GPT-4o: goal=6, depth=5, specificity=4 → "misses file-level diffs"
  Opus 4.7: goal=7, depth=6, specificity=5 → "no test plan in summary"
  Gemini 1.5 Pro: goal=6, depth=5, specificity=5 → "template feels generic"
  Aggregate: goal=6.3 FAIL, depth=5.3 FAIL
  Top improvements: add file-level changes, include test plan, use PR context
  → Apply fixes → Cycle 2: goal=8, depth=7.5, specificity=7 → PASS
Phase 4: Write 12 unit tests locking in the improved behavior
Phase 5: Add "summarize this PR" trigger to skills/RESOLVER.md
Phase 6: E2E test: feed a real PR URL → verify brain page created
Phase 7: All green. Score: 11/11
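The aggregate line in the worked example is consistent with a per-dimension arithmetic mean across the three models. The sketch below reproduces those numbers under that assumption; the real aggregation rule and pass bar are not spelled out here, and the bar of 7 is inferred from the "still below 7" wording in the cycle flow. `aggregate` and `belowBar` are hypothetical helpers.

```typescript
type Scores = Record<string, number>;

// ASSUMPTION: aggregate = arithmetic mean per dimension, rounded to one decimal.
function aggregate(perModel: Scores[]): Scores {
  const totals: Record<string, number> = {};
  for (const scores of perModel) {
    for (const [dim, value] of Object.entries(scores)) {
      totals[dim] = (totals[dim] ?? 0) + value;
    }
  }
  const means: Scores = {};
  for (const [dim, total] of Object.entries(totals)) {
    means[dim] = Math.round((total / perModel.length) * 10) / 10;
  }
  return means;
}

// ASSUMPTION: a dimension fails when its mean is below the bar (7 by default).
function belowBar(means: Scores, bar = 7): string[] {
  return Object.keys(means).filter((dim) => (means[dim] ?? 0) < bar);
}

// Cycle 1 scores from the worked example above.
const cycle1 = aggregate([
  { goal: 6, depth: 5, specificity: 4 }, // GPT-4o
  { goal: 7, depth: 6, specificity: 5 }, // Opus 4.7
  { goal: 6, depth: 5, specificity: 5 }, // Gemini 1.5 Pro
]);
console.log(cycle1);           // goal 6.3, depth 5.3, specificity 4.7
console.log(belowBar(cycle1)); // all three dimensions under 7
```

Under this assumption specificity (4.7) is also below the bar in cycle 1, although the example's aggregate line lists only goal and depth.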
NOT properly skilled until:
Skillify produces three durable artifacts per skill:
- `skills/<slug>/SKILL.md`, `scripts/<slug>.mjs`, `routing-eval.jsonl`, plus a `test/<slug>.test.ts` skeleton. Generated by `gbrain skillify scaffold <name>` and refined by the human/agent into a real implementation.
- `~/.gbrain/.gbrain/eval-receipts/<slug>-<sha8>.json`. The sha-8 binds the receipt to the current SKILL.md content. `gbrain skillify check` surfaces the status (found / stale / missing) as informational.
- The `gbrain skillify check` verdict: `properly skilled` | `close — create: <missing items>` | `needs skillify — run /skillify on <target>`. Score is `<passed>/<total>`. Required items gate the verdict; the cross-modal eval item (item 3 in the checklist above) is informational and never blocks PASS.

JSON output (`gbrain skillify check --json`) includes the same fields plus the per-item detail string, so agents can route on the structured envelope without parsing prose.
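The found / stale / missing receipt status can be illustrated with a small lookup over receipt file names. This is a sketch under assumptions, not the logic gbrain uses: `receiptStatus` is a hypothetical helper, the sha-8 is assumed to be 8 lowercase hex characters, and "stale" is assumed to mean that a receipt exists for the slug but none matches the current SKILL.md hash.

```typescript
type ReceiptStatus = "found" | "stale" | "missing";

// Hypothetical check over file names shaped like <slug>-<sha8>.json.
// ASSUMPTION: sha8 is exactly 8 lowercase hex characters.
function receiptStatus(
  fileNames: string[],
  slug: string,
  currentSha8: string,
): ReceiptStatus {
  const prefix = `${slug}-`;
  const suffix = ".json";
  const shas = fileNames
    .filter((name) => name.startsWith(prefix) && name.endsWith(suffix))
    .map((name) => name.slice(prefix.length, name.length - suffix.length))
    .filter((sha) => /^[0-9a-f]{8}$/.test(sha)); // skips longer slugs sharing the prefix
  if (shas.length === 0) return "missing";
  return shas.some((sha) => sha === currentSha8) ? "found" : "stale";
}

// Usage: one receipt on disk, written before SKILL.md was last edited.
const receipts = ["summarize-pr-0a1b2c3d.json"];
console.log(receiptStatus(receipts, "summarize-pr", "0a1b2c3d")); // found
console.log(receiptStatus(receipts, "summarize-pr", "ffff0000")); // stale
console.log(receiptStatus(receipts, "query-helper", "0a1b2c3d")); // missing
```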