skills/failure-taxonomy/SKILL.md
Build a structured taxonomy of failure modes from open-coded trace annotations. Use this skill whenever the user has freeform annotations from reviewing LLM traces and wants to cluster them into a coherent, non-overlapping set of binary failure categories (axial coding). Also use when the user mentions "failure modes", "error taxonomy", "axial coding", "cluster annotations", "categorize errors", "failure analysis", or wants to go from raw observation notes to structured evaluation criteria. This skill covers the full pipeline: grouping open codes, defining failure modes, re-labeling traces, and quantifying error rates.
npx skillsauth add maragudk/evals-skills failure-taxonomy
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Transform raw, freeform trace annotations from open coding sessions into a structured taxonomy of binary failure modes, following the grounded theory methodology from the Analyze-Measure-Improve evaluation lifecycle.
The user has already completed open coding — they've read through LLM pipeline traces and written short, freeform notes describing what went wrong (the "point of first failure"). Now they need to move from that chaotic pile of observations into an organized, actionable taxonomy. This is the axial coding step.
Typical inputs look like a JSON array, CSV, or spreadsheet of objects with fields like:
- `trace_id`: identifier for the trace
- `annotation` or `note`: the freeform open-coded observation
- `pass_fail`, `trace_summary`, `query`, or the full trace itself

Group similar annotations into a small set of coherent, non-overlapping failure categories.
Key principles — these matter a lot:
Process:
Present the draft taxonomy to the user as a table:
| # | Failure Mode | Definition | Example Annotations |
|---|-------------|------------|---------------------|
| 1 | [Title] | [One-line] | [2-3 examples] |
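A minimal sketch of how a draft taxonomy could be rendered into the table layout above. The `FailureMode` dataclass, the `render_taxonomy_table` helper, and the three-example cap are illustrative assumptions, not part of the skill itself.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """Hypothetical container for one draft failure mode."""
    title: str
    definition: str
    examples: list[str] = field(default_factory=list)


def render_taxonomy_table(modes: list[FailureMode], max_examples: int = 3) -> str:
    """Render failure modes as a markdown table with the columns shown above."""
    lines = [
        "| # | Failure Mode | Definition | Example Annotations |",
        "|---|-------------|------------|---------------------|",
    ]
    for number, mode in enumerate(modes, start=1):
        # Escape pipes so free-text annotations cannot break the table layout.
        examples = "; ".join(
            example.replace("|", "\\|") for example in mode.examples[:max_examples]
        )
        lines.append(f"| {number} | {mode.title} | {mode.definition} | {examples} |")
    return "\n".join(lines)
```

Capping the examples keeps each row readable; the full list of annotations per mode can still be kept in the taxonomy definition document.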
After presenting the draft, prompt the user to consider:
Iterate until the user confirms the taxonomy. Typical refinement takes 1–2 rounds.
Once the taxonomy is confirmed, systematically apply it back to every trace:
1 or 0 for each failure mode.

Compute error rates for each failure mode:
Present a summary table and recommend which failure modes to address first based on frequency. Note: frequency alone doesn't determine priority — the user may weight certain failures higher based on business impact. Ask them.
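One way the error-rate computation could look, as a sketch. It assumes each re-labeled row is a dict with one 0/1 value per failure mode, and that the rate is the count divided by the total number of reviewed traces; the function name and the failure mode names in the usage example are hypothetical.

```python
from collections import Counter


def error_rates(labeled_rows, failure_modes):
    """Return (mode, count, rate) tuples, most frequent failure mode first.

    Assumes rate = traces exhibiting the mode / all reviewed traces.
    """
    total = len(labeled_rows)
    if total == 0:
        return []
    counts = Counter()
    for row in labeled_rows:
        for mode in failure_modes:
            # A missing column is treated as 0 (mode not observed on this trace).
            counts[mode] += int(row.get(mode, 0))
    # sorted() is stable, so ties keep the taxonomy's original order.
    ranked = sorted(failure_modes, key=lambda mode: counts[mode], reverse=True)
    return [(mode, counts[mode], counts[mode] / total) for mode in ranked]


if __name__ == "__main__":
    rows = [
        {"trace_id": "t1", "hallucinated_fact": 1, "wrong_format": 0},
        {"trace_id": "t2", "hallucinated_fact": 1, "wrong_format": 1},
    ]
    for mode, count, rate in error_rates(rows, ["hallucinated_fact", "wrong_format"]):
        print(f"{mode}: {count} traces ({rate:.0%})")
```

The frequency ranking is only a starting point for the conversation about priority, since business impact may reorder it.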
The skill produces up to three artifacts:
Taxonomy definition (always produced) — A clean document defining each failure mode with its title, definition, and representative examples.
Re-labeled dataset (produced when input traces are provided) — The original annotations augmented with binary columns for each failure mode, as JSON or CSV.
Summary statistics (produced when re-labeling is done) — Error rates, counts, and a prioritized ranking.
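The re-labeled dataset could be produced along these lines. This is a sketch under assumptions: `assignments` is a hypothetical mapping from `trace_id` to the set of failure modes assigned during re-labeling, every annotation row has the same keys, and the column layout is illustrative rather than the schema in references/output-formats.md.

```python
import csv
import io


def relabel(annotations, assignments, failure_modes):
    """Return copies of the annotation rows with one 0/1 column per failure mode."""
    labeled = []
    for row in annotations:
        assigned = set(assignments.get(row["trace_id"], ()))
        unknown = assigned - set(failure_modes)
        if unknown:
            # Guard against labels that are not part of the confirmed taxonomy.
            raise ValueError(
                f"unknown failure modes for {row['trace_id']}: {sorted(unknown)}"
            )
        new_row = dict(row)  # copy so the original annotations stay untouched
        for mode in failure_modes:
            new_row[mode] = 1 if mode in assigned else 0
        labeled.append(new_row)
    return labeled


def to_csv(labeled_rows):
    """Serialize re-labeled rows to CSV text (assumes all rows share the same keys)."""
    if not labeled_rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(labeled_rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(labeled_rows)
    return buffer.getvalue()
```

Rejecting unknown labels keeps the dataset consistent with the confirmed taxonomy, so a typo in a mode name surfaces immediately instead of silently creating a new category.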
For detailed output schemas and file format guidance, read references/output-formats.md.
These are drawn directly from common pitfalls observed in practice:
When the user has 30+ annotations, it can help to use an LLM to propose initial groupings. If doing this, use the following prompt pattern:
Below is a list of open-ended annotations describing failures in [DOMAIN DESCRIPTION].
Please group them into a small set of coherent failure categories, where each category
captures similar types of mistakes. Each group should have:
- A short descriptive title (2-5 words)
- A brief one-line definition
- The annotation indices that belong to it
Do not invent new failure types; only cluster based on what is present in the notes.
Aim for 3-7 categories. If an annotation doesn't fit any group, list it separately
as "Uncategorized."
Annotations:
[PASTE ANNOTATIONS HERE]
Critical: LLM-generated groupings are a starting point, not the final answer. Always present them to the user for review and adjustment. The user's domain expertise is what makes the taxonomy meaningful.
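A small sketch of filling the prompt pattern above from a list of annotations. The helper name, the 1-based numbering of annotations, and the whitespace collapsing are assumptions made for illustration; the template text is the pattern above with its two placeholders turned into format fields. Sending the prompt to a model is left out on purpose.

```python
PROMPT_TEMPLATE = """Below is a list of open-ended annotations describing failures in {domain}.
Please group them into a small set of coherent failure categories, where each category
captures similar types of mistakes. Each group should have:
- A short descriptive title (2-5 words)
- A brief one-line definition
- The annotation indices that belong to it
Do not invent new failure types; only cluster based on what is present in the notes.
Aim for 3-7 categories. If an annotation doesn't fit any group, list it separately
as "Uncategorized."
Annotations:
{annotations}"""


def build_grouping_prompt(annotations, domain_description):
    """Fill the grouping prompt with numbered, single-line annotations."""
    numbered = "\n".join(
        # Collapse internal whitespace so each annotation stays on one indexed line.
        f"{index}. {' '.join(text.split())}"
        for index, text in enumerate(annotations, start=1)
    )
    return PROMPT_TEMPLATE.format(domain=domain_description, annotations=numbered)
```

Numbering the annotations gives the model stable indices to cite, which makes it easier to check its proposed groups against the original notes during review.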
After the taxonomy is built, the user typically moves to one of:
Mention these next steps when delivering the final taxonomy, so the user knows where to go from here.