skills/trace-annotation-tool/SKILL.md
Generate a custom trace annotation web app for open coding during LLM error analysis. Use when the user wants to review LLM traces, annotate failures with freeform comments, and do first-pass qualitative labeling (open coding). Also use when the user mentions "annotate traces", "trace review tool", "open coding tool", "label traces", "build an annotation interface", "review LLM outputs", or wants to manually inspect pipeline traces before building a failure taxonomy. This skill produces a tailored Python web application using FastHTML, TailwindCSS, and HTMX.
Generate a custom local web application for open coding of LLM traces — the first qualitative pass of error analysis in the Analyze phase of the evaluation lifecycle.
The tool includes these features by default:
Ask the user: "These are the default features. Do you want anything else before I generate the tool?" Then incorporate any additional requests.
Generate a single-directory Python web application with this structure:
trace-annotator/
├── app.py # FastHTML application (single file, all routes)
├── requirements.txt # Dependencies (fasthtml, python-fasthtml)
└── README.md # Brief usage instructions
- TailwindCSS via CDN (`<script src="https://cdn.tailwindcss.com">`) for styling
`app.py` is a single-file FastHTML app with these routes:
- `GET /`: main annotation view showing the current trace, annotation form, and progress
- `POST /annotate`: save annotation (notes + pass/fail/defer) and advance to next trace
- `GET /trace/{n}`: navigate to a specific trace (used by prev/next and keyboard nav)
- `GET /progress`: return progress stats (for HTMX partial updates)
Data flow:
- Load existing annotations from `annotations.jsonl` (if it exists) to preserve prior work.
- Save each annotation to `annotations.jsonl` immediately.
Annotations file format (`annotations.jsonl`):
{"trace_id": "abc-123", "status": "fail", "notes": "SQL query missed the pet-friendly constraint", "timestamp": "2025-01-15T10:32:00Z"}
{"trace_id": "abc-124", "status": "pass", "notes": "", "timestamp": "2025-01-15T10:32:45Z"}
{"trace_id": "abc-125", "status": "defer", "notes": "Not sure if tone is appropriate for investor", "timestamp": "2025-01-15T10:33:12Z"}
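The records above could be read and written with something like the following minimal sketch. It assumes an append-only file where the most recent line for a given `trace_id` wins; the function names `load_annotations` and `save_annotation` are hypothetical, not part of the skill.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

VALID_STATUSES = {"pass", "fail", "defer"}


def load_annotations(path):
    """Return {trace_id: record} from a JSONL file, or {} if the file is missing.

    Assumption: when a trace was annotated more than once, the last line wins.
    """
    annotations = {}
    file_path = Path(path)
    if not file_path.exists():
        return annotations
    with file_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            annotations[record["trace_id"]] = record
    return annotations


def save_annotation(path, trace_id, status, notes=""):
    """Append one annotation record to the JSONL file and return it."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    record = {
        "trace_id": trace_id,
        "status": status,
        "notes": notes,
        # UTC timestamp in the same shape as the examples above
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Appending one line per save keeps each write small and means an interrupted session loses at most the annotation in flight.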
This is the most important part of the tool. Tailor the HTML rendering to the user's specific trace structure. Apply these principles from HCI research on LLM review interfaces:
Bind these shortcuts via a small inline <script> block. Display them in a help tooltip
or footer so the user can reference them.
| Key | Action |
|-----|--------|
| p | Mark as Pass and advance |
| f | Mark as Fail and advance |
| d | Mark as Defer and advance |
| n | Next trace (without annotating) |
| b | Previous trace (back) |
| e | Focus the notes text field |
| ? | Toggle keyboard shortcut help |
Shortcuts must be suppressed when the notes text field is focused (so the user can type normally). Re-enable them on blur.
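One way the generated app might produce that inline script is to build the JavaScript source in Python from the shortcut table. This is a sketch under assumptions: the notes field is assumed to have the element id `notes`, and `handleShortcut` is a hypothetical page-level function that performs the named action.

```python
import json

# Key -> action name, mirroring the shortcut table above.
SHORTCUTS = {
    "p": "pass",
    "f": "fail",
    "d": "defer",
    "n": "next",
    "b": "back",
    "e": "focus_notes",
    "?": "toggle_help",
}


def shortcut_script(notes_id="notes"):
    """Return JavaScript source suitable for an inline <script> block."""
    return f"""
const SHORTCUTS = {json.dumps(SHORTCUTS)};
document.addEventListener('keydown', (event) => {{
  // Leave browser shortcuts such as Ctrl+F alone.
  if (event.ctrlKey || event.metaKey || event.altKey) return;
  // Suppress shortcuts while the notes field has focus so typing works normally.
  const active = document.activeElement;
  if (active && active.id === {json.dumps(notes_id)}) return;
  const action = SHORTCUTS[event.key];
  if (!action) return;
  event.preventDefault();
  // handleShortcut is a hypothetical function defined elsewhere on the page.
  window.handleShortcut(action);
}});
"""
```

Because the focus check runs on every keypress, shortcuts resume as soon as the notes field loses focus, with no separate blur handler.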
Use a clean, minimal layout with TailwindCSS:
Use TailwindCSS utility classes. The visual design should be:
After generating the tool, tell the user how to run it:
cd trace-annotator
pip install -r requirements.txt
python app.py path/to/traces.jsonl
Then explain the workflow:
Mention that annotations are saved to annotations.jsonl in the same directory.
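For context, the start of `app.py` might read the trace file named on the command line roughly as follows. This is only a sketch: it assumes one JSON object per line, and the helpers `load_traces` and `main` are hypothetical names.

```python
import json
import sys
from pathlib import Path


def load_traces(path):
    """Read a JSONL file and return its records as a list of dicts."""
    traces = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                traces.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Point at the offending line so the user can fix the file.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return traces


def main(argv):
    """Return an exit code: 0 on success, 2 on incorrect usage."""
    if len(argv) != 2:
        print("usage: python app.py path/to/traces.jsonl", file=sys.stderr)
        return 2
    traces = load_traces(argv[1])
    print(f"Loaded {len(traces)} traces from {argv[1]}")
    return 0
```

A real entry point would call `main(sys.argv)` and then start the web server; that part is omitted here.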
Open coding is the qualitative, exploratory first pass through trace data. The user reads traces and jots down raw observations about what's going wrong — without trying to categorize or structure the observations yet. The goal is to surface a broad, honest view of system behavior before imposing any taxonomy.
What to annotate: Focus on the point of first failure in each trace — the most upstream issue. In multi-step traces, a single early error often cascades into multiple downstream failures. Fixing the first error frequently resolves the entire chain.
When to stop: Continue until at least 20 failing traces are labeled and no fundamentally new failure patterns are appearing (theoretical saturation).
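The count-based half of this stopping rule can be checked mechanically, for example inside whatever backs the progress display. The sketch below assumes annotations are held as a dict of records keyed by `trace_id`; `progress_stats` is a hypothetical helper. Whether new failure patterns are still appearing remains a human judgment that the code does not attempt.

```python
from collections import Counter

MIN_FAILING_TRACES = 20  # threshold stated in the guidance above


def progress_stats(annotations, total_traces):
    """Summarize annotation progress and flag when the failure threshold is met."""
    counts = Counter(record["status"] for record in annotations.values())
    return {
        "total": total_traces,
        "annotated": len(annotations),
        "pass": counts["pass"],
        "fail": counts["fail"],
        "defer": counts["defer"],
        # True only means the count is reached, not that saturation is reached.
        "enough_failures": counts["fail"] >= MIN_FAILING_TRACES,
    }
```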
What comes next: Once the user has a body of freeform annotations, they move to
axial coding — clustering those observations into structured, binary failure modes.
This is covered by the failure-taxonomy skill.
See `references/beyond-open-coding.md` for when and how to add structure.
After open coding, the user's workflow typically continues with:
- `failure-taxonomy` skill: Cluster freeform annotations into structured, binary failure modes via axial coding.
- `llm-as-a-judge` skill: Once failure modes are defined, build automated evaluators for each one.
- See `references/beyond-open-coding.md` for guidance.
Mention these next steps when the tool is delivered.