maragudk/trace-annotation-tool

Name: trace-annotation-tool
Author: maragudk

skills/trace-annotation-tool/SKILL.md

npx skillsauth add maragudk/evals-skills trace-annotation-tool

Trace Annotation Tool Generator

Generate a custom local web application for open coding of LLM traces — the first qualitative pass of error analysis in the Analyze phase of the evaluation lifecycle.

Core Workflow

Step 1: Understand the User's Trace Data

Ask the user to point to their trace data file (CSV, JSONL, JSON, or any structured format).
Read a sample of the data to understand its structure: field names, nesting depth, which fields represent the user query, intermediate steps, tool calls, and final output.
Identify a unique trace identifier field (or generate sequential IDs if none exists).
Confirm the structure with the user: "I see fields X, Y, Z — which represent the trace steps, and which is the user query?"
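The data-loading part of Step 1 can be sketched with the standard library alone. This is a minimal sketch, not the skill's own implementation: the function names, the default `trace_id` field name, and the `trace-N` fallback ID format are all assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def load_traces(path):
    """Load traces from a .jsonl, .json, or .csv file into a list of dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as f:
            # One JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Accept either a top-level list or a single trace object.
        return data if isinstance(data, list) else [data]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported trace file type: {suffix}")


def ensure_trace_ids(traces, id_field="trace_id"):
    """Return (trace_id, trace) pairs, generating a sequential ID where none exists."""
    pairs = []
    for index, trace in enumerate(traces, start=1):
        trace_id = trace.get(id_field)
        if trace_id in (None, ""):
            trace_id = f"trace-{index}"  # hypothetical fallback ID format
        pairs.append((str(trace_id), trace))
    return pairs
```

Printing the keys of the first few loaded traces is usually enough to draft the confirmation question for the user.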

Step 2: Ask About Additional Features

The tool includes these features by default:

Trace viewer: One trace at a time, with tailored visual rendering of the trace structure
Freeform notes: Text field for open coding observations
Pass / Fail / Defer: Binary judgment with a defer option for uncertain traces
Keyboard shortcuts: Navigation and annotation hotkeys
Progress indicator: "17 / 100 reviewed" with pass/fail/defer counts
Auto-save: Annotations saved to a separate JSONL file on every action

Ask the user: "These are the default features. Do you want anything else before I generate the tool?" Then incorporate any additional requests.

Step 3: Generate the Application

Generate a single-directory Python web application with this structure:

trace-annotator/
├── app.py          # FastHTML application (single file, all routes)
├── requirements.txt # Dependencies (python-fasthtml)
└── README.md        # Brief usage instructions

Technology Stack

FastHTML for the web framework (HTMX is built-in)
TailwindCSS via CDN (<script src="https://cdn.tailwindcss.com">) for styling
Vanilla JavaScript only for keyboard shortcut bindings

Application Architecture

app.py — a single-file FastHTML app with these routes:

GET / — main annotation view showing the current trace, annotation form, and progress
POST /annotate — save annotation (notes + pass/fail/defer) and advance to next trace
GET /trace/{n} — navigate to a specific trace (used by prev/next and keyboard nav)
GET /progress — return progress stats (for HTMX partial updates)
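The numbers behind the progress route can be computed independently of the web framework. The sketch below assumes annotations are held in memory as a dict keyed by trace ID; the function name and the returned keys are illustrative, not part of the skill.

```python
from collections import Counter


def progress_stats(trace_ids, annotations):
    """Summarize review progress.

    trace_ids: ordered list of all trace IDs in the source file.
    annotations: dict mapping trace_id -> record with a "status" key.
    """
    # Only count annotations that belong to a trace in the current file.
    counts = Counter(
        annotations[tid]["status"] for tid in trace_ids if tid in annotations
    )
    reviewed = sum(counts.values())
    return {
        "reviewed": reviewed,
        "total": len(trace_ids),
        "pass": counts.get("pass", 0),
        "fail": counts.get("fail", 0),
        "defer": counts.get("defer", 0),
        "label": f"{reviewed} / {len(trace_ids)} reviewed",
    }
```

A route handler would then only need to render this dict into the top bar.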

Data flow:

On startup, read the trace data file from a path specified via command-line argument or environment variable.
Load existing annotations from annotations.jsonl (if it exists) to preserve prior work.
On each annotation action, immediately append a new entry to annotations.jsonl, or update the existing entry if the trace was already annotated.
The annotations file is separate from the source data — the original file is never modified.
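One way to satisfy this data flow is an append-only log in which the last line for a given trace ID wins when the file is reloaded. That is an assumption about how "update" could be handled, not something the skill prescribes; the function names are hypothetical.

```python
import json
from pathlib import Path


def load_annotations(path):
    """Read annotations.jsonl into {trace_id: record}; later lines override earlier ones."""
    annotations = {}
    path = Path(path)
    if not path.exists():
        return annotations  # first run: nothing to resume
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                annotations[record["trace_id"]] = record
    return annotations


def save_annotation(path, annotations, record):
    """Append one record to the file and update the in-memory index."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    annotations[record["trace_id"]] = record
```

Because the function only ever appends to the annotations file, the source trace file is never opened for writing, and an interrupted session loses at most the write in progress.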

Annotations file format (annotations.jsonl):

{"trace_id": "abc-123", "status": "fail", "notes": "SQL query missed the pet-friendly constraint", "timestamp": "2025-01-15T10:32:00Z"}
{"trace_id": "abc-124", "status": "pass", "notes": "", "timestamp": "2025-01-15T10:32:45Z"}
{"trace_id": "abc-125", "status": "defer", "notes": "Not sure if tone is appropriate for investor", "timestamp": "2025-01-15T10:33:12Z"}
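A record in this shape could be built as follows. This is a sketch: `make_annotation` and `VALID_STATUSES` are hypothetical names, and the timestamp format is inferred from the examples above (UTC, second precision, trailing `Z`).

```python
from datetime import datetime, timezone

VALID_STATUSES = ("pass", "fail", "defer")


def make_annotation(trace_id, status, notes="", now=None):
    """Build one annotations.jsonl record with a UTC ISO 8601 timestamp."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    # Allow the caller to inject a timezone-aware time, mainly for testing.
    now = now or datetime.now(timezone.utc)
    timestamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "trace_id": str(trace_id),
        "status": status,
        "notes": notes.strip(),
        "timestamp": timestamp,
    }
```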

Trace Rendering

This is the most important part of the tool. Tailor the HTML rendering to the user's specific trace structure. Apply these principles from HCI research on LLM review interfaces:

Visual hierarchy: Emphasize the user query and final output. Use distinct visual blocks (background colors, borders, indentation) for different trace components.
Collapsible sections: For multi-step traces, make intermediate steps (tool calls, reasoning, retrieval) collapsible — expanded by default for the first trace, then respecting the user's toggle state.
Domain-appropriate rendering: If the trace contains emails, render them like emails. If it contains SQL, syntax-highlight the SQL. If it contains JSON tool calls, format them as structured blocks. Match the visual presentation to the content type.
Readable text: Use comfortable line lengths (max-w-prose or similar), adequate spacing, and readable font sizes. Traces can be long — don't cram them.
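As an illustration of these principles, the sketch below renders a trace to an HTML string using only the standard library. The field names `query`, `steps`, `type`, `content`, and `output` are assumptions; a real tool would substitute whatever the user's trace format actually contains, and a FastHTML app would more likely build the same structure from its own components.

```python
from html import escape


def render_step(step, open_by_default=True):
    """Render one intermediate step as a collapsible <details> block."""
    title = escape(str(step.get("type", "step")))
    body = escape(str(step.get("content", "")))
    open_attr = " open" if open_by_default else ""
    return (
        f'<details class="border-l-4 border-gray-300 pl-4 my-2"{open_attr}>'
        f'<summary class="cursor-pointer font-mono text-sm">{title}</summary>'
        f'<pre class="whitespace-pre-wrap text-sm">{body}</pre>'
        "</details>"
    )


def render_trace(trace, expand_steps=True):
    """Render a trace with the query and final output visually emphasized."""
    query = escape(str(trace.get("query", "")))
    output = escape(str(trace.get("output", "")))
    steps = "".join(render_step(s, expand_steps) for s in trace.get("steps", []))
    return (
        '<article class="max-w-prose mx-auto space-y-4">'
        '<section class="bg-blue-50 border border-blue-200 p-4">'
        f'<h2 class="font-bold">User query</h2><p>{query}</p></section>'
        f"{steps}"
        '<section class="bg-green-50 border border-green-200 p-4">'
        f'<h2 class="font-bold">Final output</h2><p>{output}</p></section>'
        "</article>"
    )
```

Escaping every field before interpolation matters here because traces routinely contain markup or code produced by the model.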

Keyboard Shortcuts

Bind these shortcuts via a small inline <script> block. Display them in a help tooltip or footer so the user can reference them.

| Key | Action |
|-----|--------|
| p | Mark as Pass and advance |
| f | Mark as Fail and advance |
| d | Mark as Defer and advance |
| n | Next trace (without annotating) |
| b | Previous trace (back) |
| e | Focus the notes text field |
| ? | Toggle keyboard shortcut help |

Shortcuts must be suppressed when the notes text field is focused (so the user can type normally). Re-enable them on blur.
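The inline script can be generated from the Python side so that the key map lives in one place. The sketch below is one possible shape: the element IDs (`notes`, `shortcut-help`, and buttons named `btn-<action>`) are assumptions about the generated page, not names defined by the skill. Checking the focused element on every keypress means shortcuts resume automatically once the notes field loses focus.

```python
import json

# Key -> action name, following the shortcut table above.
KEYMAP = {
    "p": "pass", "f": "fail", "d": "defer",
    "n": "next", "b": "prev", "e": "focus-notes", "?": "toggle-help",
}


def shortcut_script(keymap=KEYMAP, notes_id="notes", help_id="shortcut-help"):
    """Return an inline <script> block that binds the annotation hotkeys."""
    return f"""<script>
const KEYMAP = {json.dumps(keymap)};
document.addEventListener('keydown', (event) => {{
  const active = document.activeElement;
  // Ignore keys while the user is typing in a text field.
  if (active && ['TEXTAREA', 'INPUT'].includes(active.tagName)) return;
  if (event.ctrlKey || event.metaKey || event.altKey) return;
  const action = KEYMAP[event.key];
  if (!action) return;
  event.preventDefault();
  if (action === 'focus-notes') {{
    document.getElementById({json.dumps(notes_id)}).focus();
  }} else if (action === 'toggle-help') {{
    document.getElementById({json.dumps(help_id)}).classList.toggle('hidden');
  }} else {{
    // Assumes a button with id "btn-<action>" exists on the page.
    const button = document.getElementById('btn-' + action);
    if (button) button.click();
  }}
}});
</script>"""
```

Routing each hotkey through the matching button keeps a single code path for saving, whether the user clicks or types.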

UI Layout

Use a clean, minimal layout with TailwindCSS:

Top bar: Progress indicator ("17 / 100 reviewed — 12 pass, 3 fail, 2 defer"), trace navigation (prev/next buttons), and keyboard shortcut help toggle.
Main area: The rendered trace, taking up most of the viewport. Scrollable if the trace is long.
Bottom panel (sticky): Annotation controls — the notes text field, and pass/fail/defer buttons. Always visible so the user can annotate without scrolling back up.

Styling Guidelines

Use TailwindCSS utility classes. The visual design should be:

Clean and minimal — this is a productivity tool, not a marketing page
High contrast for readability during long annotation sessions
Distinct visual treatment for different trace components (user input vs. LLM output vs. tool calls vs. metadata)
Responsive but optimized for desktop — this is a sit-down-and-work tool

Step 4: Provide Usage Instructions

After generating the tool, tell the user how to run it:

cd trace-annotator
pip install -r requirements.txt
python app.py path/to/traces.jsonl
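The command above passes the trace file as the first argument, and Step 3 also allows an environment variable. A small resolver covering both might look like the sketch below; the variable name `TRACE_FILE` is an assumption, since the skill does not name one.

```python
import os
import sys


def resolve_trace_path(argv=None, environ=None, env_var="TRACE_FILE"):
    """Return the trace file path from argv[1], else from the environment."""
    argv = sys.argv if argv is None else argv
    environ = os.environ if environ is None else environ
    if len(argv) > 1:
        return argv[1]
    if environ.get(env_var):
        return environ[env_var]
    raise SystemExit(
        f"Usage: python app.py path/to/traces.jsonl (or set {env_var})"
    )
```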

Then explain the workflow:

Open the browser (FastHTML will print the local URL)
Read each trace carefully, noting the point of first failure (the most upstream issue)
Write a short freeform note describing the observation
Mark as pass, fail, or defer
Use keyboard shortcuts to move quickly through traces
Annotations are saved automatically — you can close and resume anytime

Mention that annotations are saved to annotations.jsonl in the same directory.

What Open Coding Is (and Isn't)

Open coding is the qualitative, exploratory first pass through trace data. The user reads traces and jots down raw observations about what's going wrong — without trying to categorize or structure the observations yet. The goal is to surface a broad, honest view of system behavior before imposing any taxonomy.

What to annotate: Focus on the point of first failure in each trace — the most upstream issue. In multi-step traces, a single early error often cascades into multiple downstream failures. Fixing the first error frequently resolves the entire chain.

When to stop: Continue until at least 20 failing traces are labeled and no fundamentally new failure patterns are appearing (theoretical saturation).
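The stopping rule combines a count with a human judgment, which the following sketch makes explicit. Whether new failure patterns are still appearing cannot be computed from the statuses alone, so it is passed in as a flag supplied by the reviewer; the function and parameter names are hypothetical.

```python
def ready_to_stop(statuses, saw_new_pattern_recently, min_failures=20):
    """Apply the stopping rule: enough failures labeled and no new patterns emerging."""
    failures = sum(1 for status in statuses if status == "fail")
    return failures >= min_failures and not saw_new_pattern_recently
```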

What comes next: Once the user has a body of freeform annotations, they move to axial coding — clustering those observations into structured, binary failure modes. This is covered by the failure-taxonomy skill.

Anti-Patterns to Avoid

Over-engineering the tool: The annotation tool is a means to an end. Generate a working tool quickly and let the user start annotating. Don't add features they didn't ask for.
Premature structure: Don't add structured failure mode checkboxes or tag systems to the initial tool. Open coding is deliberately unstructured — the taxonomy emerges later. See references/beyond-open-coding.md for when and how to add structure.
Generic trace rendering: Don't just dump raw JSON. Take the time to understand the trace format and render it in a way that makes failures easy to spot.
Ignoring keyboard shortcuts: The textbook is emphatic that annotation speed directly correlates with engineering velocity. Hotkeys are not optional.

Connecting to Next Steps

After open coding, the user's workflow typically continues with:

Failure taxonomy (the failure-taxonomy skill): Cluster freeform annotations into structured, binary failure modes via axial coding.
LLM-as-Judge evaluators (the llm-as-a-judge skill): Once failure modes are defined, build automated evaluators for each one.
Extending the tool: The generated annotation tool can be extended to support structured failure tags after the taxonomy is built. See references/beyond-open-coding.md for guidance.

Mention these next steps when the tool is delivered.

Generate a custom trace annotation web app for open coding during LLM error analysis. Use when the user wants to review LLM traces, annotate failures with freeform comments, and do first-pass qualitative labeling (open coding). Also use when the user mentions "annotate traces", "trace review tool", "open coding tool", "label traces", "build an annotation interface", "review LLM outputs", or wants to manually inspect pipeline traces before building a failure taxonomy. This skill produces a tailored Python web application using FastHTML, TailwindCSS, and HTMX.

8 stars

tools

Updated Apr 7, 2026

npx skillsauth add maragudk/evals-skills trace-annotation-tool

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | % |
|--------|---------|-------------|---|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 7, 2026, 2:16 AM · 130.1s · 2 files scanned

Related Skills

maragudk/prompt-engineering

development


Use this skill when crafting, reviewing, or improving prompts for LLM pipelines — including task prompts, system prompts, and LLM-as-Judge prompts. Triggers include: requests to write or refine a prompt, diagnose why an LLM produces inconsistent or incorrect outputs, bridge the gap between intent and model behavior, reduce ambiguity in instructions, add few-shot examples, structure complex prompts, or improve output formatting. Also use when the user needs help distinguishing specification failures (unclear instructions) from generalization failures (model limitations), or when iterating on prompts based on observed failure modes. Do NOT use for general coding tasks, document creation, or non-LLM writing.

8 · SKILL.md · Updated Apr 7, 2026

maragudk/llm-as-a-judge

development


Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation.

8 · SKILL.md · Updated Apr 7, 2026

maragudk/failure-taxonomy

development


Build a structured taxonomy of failure modes from open-coded trace annotations. Use this skill whenever the user has freeform annotations from reviewing LLM traces and wants to cluster them into a coherent, non-overlapping set of binary failure categories (axial coding). Also use when the user mentions "failure modes", "error taxonomy", "axial coding", "cluster annotations", "categorize errors", "failure analysis", or wants to go from raw observation notes to structured evaluation criteria. This skill covers the full pipeline: grouping open codes, defining failure modes, re-labeling traces, and quantifying error rates.

8 · SKILL.md · Updated Apr 7, 2026

openclaw/taskflow

tools


Use when work should span one or more detached tasks but still behave like one job with a single owner context. TaskFlow is the durable flow substrate under authoring layers like Lobster, ACPX, plugins, or plain code. Keep conditional logic in the caller; use TaskFlow for flow identity, child-task linkage, waiting state, revision-checked mutations, and user-facing emergence.

357,764 · SKILL.md · Updated Apr 10, 2026

Install manually

# Clone the repo
git clone https://github.com/maragudk/evals-skills.git

# Copy into Claude Code skills folder (global)
cp -r evals-skills/skills/trace-annotation-tool ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

maragudk/evals-skills

8 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT