skills/assay/SKILL.md
Recon-informed approach evaluator. Weighs competing options against codebase constraints and returns structured recommendations with confidence scoring, kill criteria, and evidence grounding. Consumes recon briefs or caller context. Used by design, spec, migrate. Triggers on /assay, 'evaluate approaches', 'which option', 'compare alternatives'.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Evaluate competing approaches against codebase constraints. Returns a structured Assay Report with a recommendation, alternatives with kill criteria, and confidence scoring. Evidence-grounded — recommendations cite specific file:line references, not generic best practices.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
Announce at start: "I'm using the assay skill to evaluate competing approaches."
Name origin: In metallurgy, an assay tests raw material to determine its quality and composition before committing it to the forge.
/assay
question: "How should the auth middleware handle token refresh?"
context: { ... }
decision_type: "architecture"
approaches: [...]
cascading_decisions: [...]
question (required) — The decision or question to evaluate. One clear sentence.
context (required) — Evidence for the evaluator to reason against. Accepts different shapes depending on the caller:
| Caller | Context Shape | Key Fields |
|---|---|---|
| /design | Recon brief + agent findings | project_structure, existing_patterns, scope_boundaries, prior_art |
| /spec | Recon brief + agent findings (autonomous) | project_structure, existing_patterns, scope_boundaries, prior_art |
| /migrate | Recon brief + migration analysis | project_structure, migration_target, breaking_changes, blast_radius |
| Generic caller | Freeform evidence | description (string) — unstructured context, lower confidence |
When context contains unrecognized keys, the evaluator treats them as additional evidence. When context is a bare string, treat as { "description": context }.
decision_type (optional) — architecture | strategy | diagnosis | optimization. Auto-detected from the question if omitted. Defaults to architecture when ambiguous.
approaches (optional) — Array of { name, description } candidates to evaluate. When omitted, the evaluator generates 2-4 candidates from the question and context.
cascading_decisions (optional) — Array of { decision, reasoning } representing prior decisions. Treated as hard constraints — the evaluator cannot modify or challenge them. Conflicts are reported in prior_decision_conflicts.
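The context rule described above (a bare string becomes a description, and unrecognized keys are kept as additional evidence) can be sketched as a small normalizer. This is a minimal illustration, not the skill's actual implementation; the function name `normalize_context` is hypothetical.

```python
from typing import Any, Union


def normalize_context(context: Union[str, dict[str, Any]]) -> dict[str, Any]:
    """Coerce caller-supplied context into an object (hypothetical helper)."""
    if isinstance(context, str):
        # A bare string is treated as { "description": context }.
        return {"description": context}
    if isinstance(context, dict):
        # Unrecognized keys are preserved so the evaluator can use them as evidence.
        return dict(context)
    raise TypeError("context must be an object or a string")
```

A generic caller passing `"Small team, read-heavy workload"` would end up with a single `description` field, while a recon brief passes through unchanged.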
- question is present and non-empty
- context is present (object or string)
- If decision_type is provided, validate that it is one of the 4 recognized values
- If approaches is provided, verify that it is an array with at least 2 entries, each having name and description

Dispatch a single Opus agent using skills/assay/assay-evaluator-prompt.md.
Fill template placeholders before writing the dispatch file:
- {{QUESTION}}: the decision question
- {{CONTEXT}}: the full context object/string
- {{DECISION_TYPE}}: the decision type (provided or "auto-detect")
- {{APPROACHES}}: the approaches array (or "Generate 2-4 candidates")
- {{CASCADING_DECISIONS}}: cascading decisions array (or "None")

Parse the evaluator's response as JSON. Validate:
- Top-level keys are present: decision_type, confidence, missing_information, recommended, alternatives, prior_decision_conflicts
- recommended has: name, rationale, evidence, risks, kill_criteria, constraint_fit
- Each alternative has: name, constraint_fit, pros, cons, would_recommend_if
- constraint_fit objects have: pattern_alignment, scope_fit, reversibility, integration_risk
- confidence is one of: high, medium, low

On validation failure: Retry once with the validation errors as feedback. On second failure, return:
{ "error": "Evaluator produced invalid output after retry", "raw_output": "..." }
Return the validated Assay Report to the caller.
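The parse, validate, and retry-once steps above could look roughly like the following. This is a sketch under assumptions: `dispatch` is a hypothetical callable standing in for the disk-mediated evaluator dispatch, and the error message wording inside `validate_report` is illustrative. Only the required key names and the final error object come from this document.

```python
import json
from typing import Any, Callable, Optional

TOP_LEVEL_KEYS = {"decision_type", "confidence", "missing_information",
                  "recommended", "alternatives", "prior_decision_conflicts"}
RECOMMENDED_KEYS = {"name", "rationale", "evidence", "risks",
                    "kill_criteria", "constraint_fit"}
ALTERNATIVE_KEYS = {"name", "constraint_fit", "pros", "cons", "would_recommend_if"}
CONSTRAINT_FIT_KEYS = {"pattern_alignment", "scope_fit",
                       "reversibility", "integration_risk"}
CONFIDENCE_LEVELS = ("high", "medium", "low")


def _missing(obj: Any, required: set, label: str) -> list[str]:
    """List the required keys that are absent from obj."""
    if not isinstance(obj, dict):
        return [f"{label} must be an object"]
    return [f"{label} is missing '{key}'" for key in sorted(required - set(obj))]


def validate_report(report: Any) -> list[str]:
    """Return validation errors; an empty list means the report is valid."""
    errors = _missing(report, TOP_LEVEL_KEYS, "report")
    if errors:
        return errors  # nested checks need the top-level keys
    if report["confidence"] not in CONFIDENCE_LEVELS:
        errors.append("confidence must be one of: high, medium, low")
    alternatives = report["alternatives"]
    if not isinstance(alternatives, list):
        errors.append("alternatives must be an array")
        alternatives = []
    # Every object that carries a constraint_fit, with a label for messages.
    holders = [("recommended", report["recommended"], RECOMMENDED_KEYS)]
    holders += [(f"alternatives[{i}]", alt, ALTERNATIVE_KEYS)
                for i, alt in enumerate(alternatives)]
    for label, obj, required in holders:
        errors += _missing(obj, required, label)
        if isinstance(obj, dict) and "constraint_fit" in obj:
            errors += _missing(obj["constraint_fit"], CONSTRAINT_FIT_KEYS,
                               f"{label}.constraint_fit")
    return errors


def evaluate_with_retry(dispatch: Callable[[Optional[list[str]]], str]) -> dict:
    """Call the evaluator, retrying once with validation errors as feedback.

    `dispatch` is a hypothetical stand-in: it receives the previous attempt's
    errors (None on the first attempt) and returns the raw evaluator output.
    """
    feedback: Optional[list[str]] = None
    raw = ""
    for _attempt in range(2):
        raw = dispatch(feedback)
        try:
            report = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = [f"invalid JSON: {exc.msg}"]
            continue
        feedback = validate_report(report)
        if not feedback:
            return report
    return {"error": "Evaluator produced invalid output after retry",
            "raw_output": raw}
```

Returning a list of errors rather than raising keeps the feedback for the retry in one place, so the second dispatch can be told exactly which fields were missing.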
The evaluator adapts scoring weights based on decision type:
| Type | Primary Weight | Secondary Weight |
|---|---|---|
| architecture | Reversibility, constraint fit | Long-term cost, extensibility |
| strategy | Risk, phasing | Blast radius, team capacity |
| diagnosis | Evidence strength, testability | Explanation coverage, simplicity |
| optimization | Measurable improvement | Disruption cost, reversibility |
{
  "decision_type": "architecture",
  "confidence": "high",
  "missing_information": [],
  "recommended": {
    "name": "Event-driven via message bus",
    "rationale": "Aligns with existing src/events/bus.ts pattern...",
    "evidence": ["src/events/bus.ts:14 — existing event dispatch"],
    "risks": ["Adds async complexity to currently synchronous flow"],
    "kill_criteria": "Switch away if latency requirements exceed 50ms p99",
    "constraint_fit": {
      "pattern_alignment": "high",
      "scope_fit": "high",
      "reversibility": "two-way door",
      "integration_risk": "low"
    }
  },
  "alternatives": [
    {
      "name": "Direct service calls",
      "constraint_fit": {
        "pattern_alignment": "medium",
        "scope_fit": "high",
        "reversibility": "one-way door",
        "integration_risk": "medium"
      },
      "pros": ["Simpler mental model", "Synchronous"],
      "cons": ["Tight coupling", "Requires shared deployment"],
      "would_recommend_if": "Latency is critical or team prefers simplicity"
    }
  ],
  "prior_decision_conflicts": []
}
| Level | Criteria |
|---|---|
| high | One approach clearly dominates on all weighted dimensions |
| medium | Two viable options with trade-offs that depend on priority |
| low | Need more information — missing_information lists what would help |
Every recommendation must cite specific evidence from the context:
"This is the industry standard approach" is NOT evidence. "This aligns with how src/api/routes/users.ts already handles it" IS evidence.
Without a recon brief, evidence cites the caller's context. Confidence scores skew lower.
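One way a caller might spot-check evidence grounding is to look for a file reference inside each evidence string. The sketch below is a heuristic only and not part of the skill. The regular expression and the helper name `find_file_reference` are assumptions. A miss does not by itself mean the evidence is invalid, since evidence may cite the caller's context when no recon brief exists.

```python
import re
from typing import Optional, Tuple

# Assumed heuristic: at least one directory segment, a file name with an
# extension, and an optional ":<line>" suffix, e.g. src/events/bus.ts:14.
FILE_REF = re.compile(r"(?:[\w.-]+/)+[\w-]+\.[A-Za-z]\w*(?::(\d+))?")


def find_file_reference(evidence: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (path, line) for the first file reference found, else None."""
    match = FILE_REF.search(evidence)
    if match is None:
        return None
    path = match.group(0).split(":")[0]
    line = int(match.group(1)) if match.group(1) else None
    return path, line
```

Applied to the two sentences above, the "industry standard" claim yields no reference, while the `src/api/routes/users.ts` sentence yields a path with no line number.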
- kill_criteria on recommended approach: condition that would flip the recommendation
- would_recommend_if on each alternative: condition that would make it the recommendation

These make decisions revisitable without re-running the full analysis.
| Failure | Behavior |
|---|---|
| Missing question or context | Return error immediately — no dispatch |
| Evaluator returns invalid JSON | Retry once with validation errors. Second failure returns { "error": ... } |
| Evaluator timeout | Return { "error": "Evaluator timed out" } |
| Invalid decision_type | Warn and default to architecture |
| approaches has fewer than 2 entries | Ignore provided approaches, let evaluator generate candidates |
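The input-side rows of the table above can be sketched as a single pre-dispatch check. This is an illustrative outline, not the skill's code: `prepare_request`, the error message wording, and the `warnings` field are hypothetical. The fallback strings ("auto-detect", "Generate 2-4 candidates", "None") are the template placeholder defaults. Treating malformed approach entries the same as too few entries is also an assumption.

```python
from typing import Any

DECISION_TYPES = ("architecture", "strategy", "diagnosis", "optimization")


def _valid_approaches(approaches: Any) -> bool:
    """At least 2 entries, each an object with name and description."""
    return (
        isinstance(approaches, list)
        and len(approaches) >= 2
        and all(isinstance(a, dict) and "name" in a and "description" in a
                for a in approaches)
    )


def prepare_request(request: dict[str, Any]) -> dict[str, Any]:
    """Validate caller input and resolve defaults before any dispatch."""
    question = request.get("question")
    if not isinstance(question, str) or not question.strip():
        # Missing question: return an error immediately, no dispatch.
        return {"error": "question is required and must be non-empty"}
    context = request.get("context")
    if not isinstance(context, (dict, str)):
        # Missing context: return an error immediately, no dispatch.
        return {"error": "context is required (object or string)"}

    warnings: list[str] = []
    decision_type = request.get("decision_type")
    if decision_type is not None and decision_type not in DECISION_TYPES:
        # Invalid decision_type: warn and default to architecture.
        warnings.append(
            f"unrecognized decision_type {decision_type!r}; defaulting to architecture"
        )
        decision_type = "architecture"

    approaches = request.get("approaches")
    if approaches is not None and not _valid_approaches(approaches):
        # Too few entries: ignore them and let the evaluator generate candidates.
        approaches = None

    return {
        "question": question,
        "context": context,
        "decision_type": decision_type or "auto-detect",
        "approaches": approaches or "Generate 2-4 candidates",
        "cascading_decisions": request.get("cascading_decisions") or "None",
        "warnings": warnings,
    }
```

Only the first two checks stop the run; the others degrade to defaults so a slightly malformed call still produces a report.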
| Skill | Decision Type | Context Source | Approaches |
|---|---|---|---|
| /design | architecture | Recon brief + cascading decisions | Evaluator generates |
| /spec | architecture | Recon brief + cascading decisions (autonomous — confidence routing) | Evaluator generates |
| /migrate | strategy | Recon brief + migration analysis | Evaluator generates |
Not called by (investigated, not a fit): /debugging (hypothesis evaluation uses quality-gate, not assay), /prospector (competing design evaluation is more sophisticated than assay for this use case). See #147 for rationale.
From /design:
/assay
question: "How should components communicate in the new auth module?"
context: { recon brief with project_structure, existing_patterns }
decision_type: "architecture"
cascading_decisions: [{ decision: "Using Redis for session store", reasoning: "..." }]
From /spec:
/assay
question: "How should the auth middleware handle token refresh?"
context: { recon brief + investigation findings }
decision_type: "architecture"
cascading_decisions: [{ decision: "Using Redis for session store", reasoning: "..." }]
Spec consumes assay output autonomously: high confidence = accept, medium = terminal alert, low = block alert.
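The confidence routing described for /spec amounts to a three-way lookup. A minimal sketch follows, assuming the report has already been validated. The function name `route_by_confidence` is hypothetical, and the returned strings simply echo the actions named above rather than any real /spec API.

```python
from typing import Any

# Actions as named in the routing rule: high = accept, medium = terminal alert,
# low = block alert.
ACTIONS = {"high": "accept", "medium": "terminal alert", "low": "block alert"}


def route_by_confidence(report: dict[str, Any]) -> str:
    """Map an Assay Report's confidence level to the /spec action."""
    confidence = report.get("confidence")
    action = ACTIONS.get(confidence) if isinstance(confidence, str) else None
    if action is None:
        # Error objects and unvalidated reports have no routable confidence.
        raise ValueError(f"cannot route report with confidence {confidence!r}")
    return action
```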
From /migrate:
/assay
question: "What migration strategy minimizes risk for the React 18→19 upgrade?"
context: { recon brief + migration_target: "React 19", breaking_changes: [...] }
decision_type: "strategy"
/assay question: "Should we use PostgreSQL or SQLite for this project?"
context: "Small team, <10K users, read-heavy workload, deployed on single server"
- skills/assay/assay-evaluator-prompt.md
- /recon
- /design's Challenger agent