.cursor/skills/analyze-run/SKILL.md
Analyze all failures in a convex-evals run, spawning parallel sub-agents to investigate each failure and producing a report with classifications and recommendations. Use when the user asks to analyze an entire run, review all failures in a run, or wants to understand why a model scored poorly.
https://convex-evals.netlify.app/experiment/.../run/$runId/...
Extract the run ID from the visualizer URL. The URL pattern is:
/experiment/$experimentId/run/$runId/...
The $runId is the Convex document ID (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e).
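As a rough illustration of the extraction step, the sketch below pulls the run ID out of a URL that follows the pattern above. The function name `extractRunId` is hypothetical and not part of the convex-evals codebase; it only assumes the `/experiment/<experimentId>/run/<runId>` path layout described here.

```typescript
// Hypothetical helper (not part of convex-evals): extract the run ID from a
// visualizer URL or path of the form /experiment/<experimentId>/run/<runId>/...
function extractRunId(urlOrPath: string): string | null {
  // Capture the path segment that follows "/run/"; a segment ends at "/", "?" or "#".
  const match = /\/experiment\/[^/?#]+\/run\/([^/?#]+)/.exec(urlOrPath);
  return match?.[1] ?? null;
}
```

If the function returns `null`, the URL does not match the expected pattern and it is safer to ask the user for the run ID than to guess.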
Reports are stored in reports/{provider}/{model}/ (e.g. reports/anthropic/claude-opus-4-6/).
List the directory for the model being analyzed and read the most recent report(s). This gives you:
Reference prior findings when the same eval fails again: note whether it is a repeat and whether any prior fix should have resolved it.
Run from the evalScores/ directory:
npx convex run --prod debugQueries:getFailedEvalsForRun '{"runId": "<runId>"}'
This returns:
- run -- model name, provider, experiment, status
- totalEvals, passedCount, failedCount -- overall stats
- failedEvals -- array of failed evals, each with _id, evalPath, category, name, failureReason, and failedStep (which step failed and its error)

If there are no failures, report that all evals passed and stop.
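To make the result shape and the early-exit rule concrete, here is a minimal sketch. The field names come from the description above, but every type is an assumption (the real query may return richer objects), and `summarizeRun` is a hypothetical helper.

```typescript
// Assumed shape of the query result. Field names are taken from the
// description above; the types are guesses and should be checked against
// the actual query before relying on them.
interface FailedEval {
  _id: string;
  evalPath: string;
  category: string;
  name: string;
  failureReason: string;
  failedStep: unknown; // which step failed and its error; exact shape not specified here
}

interface FailedEvalsResult {
  run: unknown; // model name, provider, experiment, status
  totalEvals: number;
  passedCount: number;
  failedCount: number;
  failedEvals: FailedEval[];
}

// Hypothetical helper: decide whether the analysis should stop early.
function summarizeRun(result: FailedEvalsResult): { stop: boolean; message: string } {
  if (result.failedEvals.length === 0) {
    return { stop: true, message: `All ${result.totalEvals} evals passed.` };
  }
  return {
    stop: false,
    message: `${result.failedCount} of ${result.totalEvals} evals failed.`,
  };
}
```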
For each failed eval, spawn a sub-agent (up to 4 in parallel) with this prompt template:
You are investigating a failing eval from the convex-evals system.
The workspace is at c:\dev\convex\convex-evals
Run this command from the evalScores/ directory:
npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<EVAL_ID>"}'
Then analyze the result:
1. Which step failed and what was the exact error?
2. Look at the model's generated code in outputFiles.
3. Look at the expected answer and grader in evalSourceFiles.
4. Look at the task description in eval.task.
5. Is this a genuine model mistake, or is the test/lint/task unfair?
Classify the failure as one of:
- MODEL_FAULT: The model genuinely got it wrong
- OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
- AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
- KNOWN_GAP: A known limitation of this eval that affects all models (e.g. the Convex API returns fields the model can't predict without being told)
Return a structured summary:
- Eval: <name> (<category>)
- Failed step: <step name>
- Error: <one-line error summary>
- Classification: <one of the above>
- Reasoning: <2-3 sentences explaining your classification>
- Model output snippet: <the relevant problematic code, if applicable>
- Expected code snippet: <what the answer looks like, if applicable>
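The "up to 4 in parallel" limit on sub-agents is a bounded-concurrency pattern. How sub-agents are actually spawned depends on the host tool, so the sketch below only illustrates the limit itself: `mapWithLimit` is a hypothetical helper, and the `worker` callback stands in for whatever launches one investigation.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls
// in flight at once. Results come back in the same order as the input.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner repeatedly claims the next unclaimed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  // Never start more runners than items, and always start at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With this shape, a call such as `mapWithLimit(failedEvals, 4, investigate)` would keep four investigations running until the list is exhausted, where `investigate` is an assumed function that fills in the prompt template for one eval.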
Once all sub-agents return, build the analysis:
For each failure, list: eval name, failed step, classification, one-line reasoning.
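One way to assemble that per-failure listing from the sub-agent summaries is sketched below. The `Finding` type and both helpers are hypothetical; only the four classification labels come from the prompt template.

```typescript
// The four labels are the ones defined in the sub-agent prompt.
type Classification = "MODEL_FAULT" | "OVERLY_STRICT" | "AMBIGUOUS_TASK" | "KNOWN_GAP";

// Hypothetical record built from one sub-agent's structured summary.
interface Finding {
  evalName: string;
  failedStep: string;
  classification: Classification;
  reasoning: string; // one-line reasoning
}

// Render one line of the listing: eval name, failed step, classification, reasoning.
function summaryLine(f: Finding): string {
  return `- ${f.evalName} | ${f.failedStep} | ${f.classification} | ${f.reasoning}`;
}

// Tally failures per classification, which helps when looking for patterns.
function countByClassification(findings: readonly Finding[]): Record<Classification, number> {
  const counts: Record<Classification, number> = {
    MODEL_FAULT: 0,
    OVERLY_STRICT: 0,
    AMBIGUOUS_TASK: 0,
    KNOWN_GAP: 0,
  };
  for (const f of findings) {
    counts[f.classification] += 1;
  }
  return counts;
}
```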
Look for patterns across failures:
Group recommendations by type:
Always create a report file at:
reports/{provider}/{model}/{runIdPrefix}_{date}.md
For example: reports/anthropic/claude-opus-4-6/jn72t14a_2026-02-06.md
The runIdPrefix is the first 8 characters of the run ID.
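The naming rule can be sketched as a small function. `reportPath` is a hypothetical helper, and formatting the date as `YYYY-MM-DD` in UTC is an assumption inferred from the example filename.

```typescript
// Hypothetical helper: build the report path from its parts.
// Assumes the date is written as YYYY-MM-DD (UTC), matching the example above.
function reportPath(provider: string, model: string, runId: string, date: Date): string {
  const runIdPrefix = runId.slice(0, 8); // first 8 characters of the run ID
  const isoDate = date.toISOString().slice(0, 10); // e.g. "2026-02-06"
  return `reports/${provider}/${model}/${runIdPrefix}_${isoDate}.md`;
}
```

For a run ID beginning with `jn72t14a`, provider `anthropic`, model `claude-opus-4-6`, and a date of 6 February 2026, this yields the example path shown above.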
The report should contain:
Present the full analysis to the user. End with:
"These are my findings. Would you like me to implement any of these recommendations, or would you like to discuss specific failures in more detail?"
Do NOT make any code/config changes until the user explicitly asks.
If the user asks you to implement any recommendations, update the report file's "Actions taken" section after making the changes. Record:
This ensures future analysis sessions can see which recommendations were already acted on and avoid re-recommending changes that have already been made.