plugin/skills/eval/SKILL.md
Use this skill when running pre-release validation, detecting calibration regression, or tuning a skill — to run plugin evaluation tiers (Tier 1 linters and Tier 2 judge-calibration drift smoke) by wrapping the eval/runner.py harness. Tier 1 is free (no LLM); Tier 2 budgets tokens per `eval/config.json`. Tier 3 (full behavioral suites) is planned but not yet shipped — runner returns error code 3 if invoked.
`npx skillsauth add avav25/ai-assets eval`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Validate the plugin against rubric-scored test cases. Two tiers shipped today:
<scenario>.score-X.X.md). This catches three drift classes:
- `runner.py --tier 3` currently exits with error code 3.
- Scope Tier 1 to one skill (`/eval <skill-name>`) or Tier 2 to one rubric (`/eval --tier 2 --rubric <name>`) while iterating.
- Use `/eval --all` to run Tier 1 + Tier 2 in a single pass.

Shipped commands:
/eval # Tier 1 linters across all skills
/eval feature-design # Tier 1 lint scoped to one skill
/eval --tier 2 # Tier 2 judge-calibration smoke (default 10 rubrics × 2 samples)
/eval --tier 2 --rubric feature-design # Tier 2 limited to one rubric
/eval --tier 2 --dry-run # plan-only; no API calls (also auto-applies if anthropic SDK unavailable)
/eval --all # Tier 1 + Tier 2 in one run
Planned (not yet shipped — runner returns error code 3):
/eval --tier 3 # full behavioral suite (planned)
/eval --skill <name> --tier 3 # Tier 3 per-skill (planned)
/eval --baseline <skill> # capture per-skill baseline scorecard (planned)
/eval --all --resume # resume after interruption (planned, Tier 3 only)
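A wrapper script can tell "not shipped yet" apart from a real failure by checking for exit code 3. The following is a minimal sketch of that check, not the plugin's own code: `run_eval` is a hypothetical helper, the runner path in the docstring is an assumption, and treating exit code 0 as success is the usual convention rather than something this page states.

```python
import subprocess
import sys

# Exit code the document says the runner returns for unshipped Tier 3 forms.
NOT_SHIPPED_EXIT_CODE = 3


def run_eval(argv):
    """Run a command line (list of strings) and classify its exit status.

    Example argv (path is an assumption):
    [sys.executable, "plugin/eval/runner.py", "--tier", "3"]
    """
    completed = subprocess.run(argv, capture_output=True, text=True)
    if completed.returncode == 0:
        status = "ok"  # assumed: 0 means success, as is conventional
    elif completed.returncode == NOT_SHIPPED_EXIT_CODE:
        status = "not-shipped"
    else:
        status = "failed"
    return status, completed.returncode


if __name__ == "__main__":
    # Stand-in child process that exits with code 3, so the sketch runs
    # even where the real runner is not available.
    fake_runner = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_eval(fake_runner))
```

Classifying the code in one place keeps callers from treating a planned-but-unshipped tier as a broken build.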
Sample plan (default seed 42): 10 rubrics × 2 samples = 20 judge calls. Override via runner flags --seed N, --sample-rubrics N, --samples-per-rubric N, --rubric NAME.
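To illustrate how a seeded plan yields the same 20 judge calls on every run, here is a small sketch. It is an assumption about the general approach, not the runner's actual selection logic: `build_sample_plan` and the placeholder rubric names are hypothetical, and the keyword arguments only mirror the `--seed`, `--sample-rubrics`, and `--samples-per-rubric` flags.

```python
import random


def build_sample_plan(rubrics, seed=42, sample_rubrics=10, samples_per_rubric=2):
    """Return a deterministic list of (rubric, sample_index) judge calls."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    count = min(sample_rubrics, len(rubrics))
    # Sort first so the result does not depend on the caller's input order.
    chosen = rng.sample(sorted(rubrics), count)
    return [(rubric, i) for rubric in chosen for i in range(samples_per_rubric)]


if __name__ == "__main__":
    # Placeholder names; only the count of rubrics matters for this sketch.
    all_rubrics = [f"rubric-{n:02d}" for n in range(45)]
    plan = build_sample_plan(all_rubrics)
    print(len(plan))  # 10 rubrics x 2 samples = 20 calls
```

Because the generator is seeded, two runs with the same seed and rubric set produce identical plans, which makes drift between runs attributable to the judge rather than to sampling.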
Requires ANTHROPIC_API_KEY and the anthropic Python SDK. Without either, the runner reports DRY-RUN ONLY and skips actual API calls.
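The fallback to dry-run can be pictured as a preflight check on the two prerequisites. This is a hedged sketch of one way to do it; `should_dry_run` is a hypothetical name, and the real runner may detect these conditions differently.

```python
import importlib.util
import os


def should_dry_run(env=None, find_spec=importlib.util.find_spec):
    """Return (dry_run, reasons) from API key and SDK availability.

    `env` and `find_spec` are injectable so the check can be tested
    without changing the real environment.
    """
    env = os.environ if env is None else env
    reasons = []
    if not env.get("ANTHROPIC_API_KEY"):
        reasons.append("ANTHROPIC_API_KEY is not set")
    if find_spec("anthropic") is None:
        reasons.append("anthropic SDK is not installed")
    return bool(reasons), reasons


if __name__ == "__main__":
    dry_run, reasons = should_dry_run()
    if dry_run:
        print("DRY-RUN ONLY:", "; ".join(reasons))
    else:
        print("live run possible")
```

Using `importlib.util.find_spec` checks for the package without importing it, so the preflight stays cheap and side-effect free.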
Default judge model: Haiku (claude-haiku-4-5). Per-rubric override allowed when calibration Spearman drops below threshold (see plugin/eval/config.json → judge_models).
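For readers unfamiliar with the calibration metric, the sketch below computes Spearman rank correlation between expected calibration scores and judge scores, then compares it with a threshold. It is a generic textbook implementation, not the plugin's code; `needs_judge_override` is a hypothetical helper, and the threshold value is supplied by the caller because this page does not state it.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


def needs_judge_override(expected, judged, threshold):
    """True when rank agreement falls below the calibration threshold."""
    return spearman(expected, judged) < threshold
```

Rank correlation only asks whether the judge orders samples the same way as the calibration scores, so a judge that is consistently harsher or more lenient still passes as long as the ordering holds.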
Per plugin/eval/config.json:
| Tier | Soft cap (tokens) | Hard cap (tokens) | Notes |
|---|---|---|---|
| Tier 1 | 0 | 0 | No LLM. Always free. |
| Tier 2 smoke | 50K | 150K | Default plan ~10–20K with Haiku. |
| Tier 3 (planned) | 30K | 100K | Per-skill (60K/150K with Sonnet judge override). Not shipped. |
| Tier 3 full suite (planned) | 500K | 1.5M | Release-gate run. Not shipped. |
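The caps in the table can be read as a preflight gate on the estimated token cost of a plan. The sketch below shows one plausible shape for that gate. The hard-cap rejection follows this page; treating a soft-cap breach as a warning is an assumption, and `TOKEN_CAPS`, `check_budget`, and `HardCapExceeded` are hypothetical names rather than the contents of `plugin/eval/config.json`.

```python
# Caps for the shipped tiers, copied from the table; values are in tokens.
TOKEN_CAPS = {
    "tier1": {"soft": 0, "hard": 0},
    "tier2": {"soft": 50_000, "hard": 150_000},
}


class HardCapExceeded(Exception):
    """Raised when a plan's estimate is above the tier's hard cap."""


def check_budget(tier, estimated_tokens, caps=TOKEN_CAPS):
    """Return "ok" or "warn"; raise before any API call if over the hard cap."""
    limits = caps[tier]
    if estimated_tokens > limits["hard"]:
        raise HardCapExceeded(
            f"{tier}: estimate {estimated_tokens} exceeds hard cap {limits['hard']}"
        )
    if estimated_tokens > limits["soft"]:
        return "warn"  # assumed behavior: proceed, but flag the overrun
    return "ok"


if __name__ == "__main__":
    print(check_budget("tier2", 20_000))  # within the default plan range
    print(check_budget("tier2", 60_000))  # above soft cap, below hard cap
```

Running the check before the first request means an oversized plan costs nothing, which matches the "rejected before any API call" behavior described for the hard cap.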
- `runner.py --tier 3` (and any `--baseline`/`--resume`/per-skill Tier 3 form) is not shipped; the runner exits with error code 3. Use Tier 1 or Tier 2 instead.
- Tier 2 requires `ANTHROPIC_API_KEY` and the `anthropic` Python SDK. Without either, the runner reports DRY-RUN ONLY, skips all API calls, and performs no real scoring.
- Token caps come from `plugin/eval/config.json`; a plan that would breach the hard cap is rejected before any API call.
- The `plugin/eval/cases/` and `plugin/eval/results/` directories do not exist, and the Tier-3 behavioral suite + blind-comparator are not yet wired, so relying on them today fails. (The `eval-judge` agent itself is shipped: `plugin/agents/eval-judge.md` powers RALF subjective oracles now, but the Tier-3 behavioral path that would invoke it from `/eval` is not yet wired.)

- `plugin/eval/runner.py` (entry) + `plugin/eval/tier2.py` (Tier 2 implementation)
- `plugin/eval/config.json` (token caps + judge models + calibration thresholds)
- `plugin/eval/judge-rubrics/<rubric>.md` (per-rubric scoring criteria)
- `plugin/eval/calibration/<rubric>/{good,bad}/<scenario>.score-X.X.md` (Tier 2 calibration samples: 45 rubrics × 6 samples = 270 total)

- `userConfig.ralph_session_*` caps if eval cases ever trigger embedded RALF (Tier 3, planned)
- `/plugin-doctor --calibrate-judge` (Tier 2 opt-in)

These are referenced by the planned Tier 3 design but not implemented:
- `plugin/eval/cases/<skill>/*.json`: per-skill behavioral test cases. Directory does not exist; expected to ship in Phase 4.
- `plugin/eval/results/<run-id>/`: per-run output artefacts.
- Wiring from `plugin/agents/eval-judge.md` (shipped) to `plugin/eval/cases/` + `plugin/eval/results/`: the judge agent exists today; the case/result harness that drives it does not.
- `--baseline <skill>` capture to `.committed/eval-baselines/<release-tag>.json`.

Do not write code that depends on these surfaces today; they are documented for design continuity, not behavior.
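Unlike the planned surfaces, the Tier 2 calibration samples are shipped, and their file names encode the expected score. As an illustration of that naming convention only, the sketch below splits such a path into its parts. `parse_calibration_path` is a hypothetical helper, the scenario name in the example is invented, and reading `X.X` as a decimal number such as `4.5` is an assumption.

```python
import re
from pathlib import PurePosixPath

# Matches "<scenario>.score-X.X.md", assuming X.X is a decimal like 4.5.
_SAMPLE_RE = re.compile(r"^(?P<scenario>.+)\.score-(?P<score>\d+\.\d+)\.md$")


def parse_calibration_path(path):
    """Split .../calibration/<rubric>/<good|bad>/<scenario>.score-X.X.md."""
    p = PurePosixPath(path)
    match = _SAMPLE_RE.match(p.name)
    if match is None or p.parent.name not in {"good", "bad"}:
        raise ValueError(f"not a calibration sample path: {path}")
    return {
        "rubric": p.parent.parent.name,
        "label": p.parent.name,
        "scenario": match.group("scenario"),
        "score": float(match.group("score")),
    }


if __name__ == "__main__":
    # "happy-path" is an invented scenario name used only for this example.
    example = "plugin/eval/calibration/feature-design/good/happy-path.score-4.5.md"
    print(parse_calibration_path(example))
```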
development
Use this skill when running the recurring (daily) knowledge-base rescan for a repo that already has knowledge/.knowledge-sync.yml — the main-thread dispatcher that reads the config, computes the git delta since last_scanned_sha, maps changed paths to affected doc areas, early-exits cheaply when nothing changed, then fans out one Agent(content-writer) per affected area, applies the propose/direct update policy, advances the baseline only on success, and writes an L4 run log — all with the G1 untrusted-content choke-point, secret-scan, deny-list, and budget controls woven in. For first-time setup use /knowledge-sync-init.
development
Use this skill when bootstrapping scheduled knowledge-base sync for a repo that has no knowledge/.knowledge-sync.yml yet — to run one-time setup that detects the knowledge_root from CLAUDE.md/AGENTS.md, maps doc areas to source globs, records opt-in external sources (Linear/Notion/WebFetch, all disabled by default), captures a baseline last_scanned_sha, sets the per-area update policy, generates or seeds knowledge/CONVENTIONS.md, provisions the L4 memory dir, and offers to register the daily routine. Routes ongoing recurring sync operations to /knowledge-sync.
tools
Use this skill when bootstrapping a target repository to be ai-skills-aware — on the first run of any ai-skills workflow in a fresh repo, when adopting the ai-skills plugin in an existing repo, or after upgrading to a plugin version that adds new memory paths or templates, including when the user does not say "init" but asks to "set up" or "onboard" the repo — to detect codebase type, create CLAUDE.md + AGENTS.md scaffolding, initialize the .ai-skills-memory/ directory tree from L1 templates, and configure .gitignore. Idempotent — safe to re-run. Accepts `--codebase-type <type>` and `--overwrite`. Not for re-initializing only memory — use `/memory-init` instead.
tools
Use this skill when extending, repairing, or improving plugin assets, when ingesting a `/feedback` report as a fix-cycle backlog, or when you do not remember which lower-level command is right for the job — the umbrella workflow for ai-skills plugin-asset authoring and maintenance: creating, auditing, fixing, improving, refactoring, and migrating skills, agents, rules, hooks, prompts, schemas, and rubrics inside the plugin. Auto-classifies the request, loads the right knowledge skills (`@prompt-engineering`, `@context-engineering`, `@team-protocols`), and spawns the right subagents (`prompt-engineer`, `system-architect`, `python-engineer`, `software-engineer`, `qa-engineer`, `eval-judge`) via the `Agent` tool.