get-convex/analyze-run

Name: analyze-run
Author: get-convex

.cursor/skills/analyze-run/SKILL.md

npx skillsauth add get-convex/convex-evals analyze-run

Analyze Run

When to use

- User asks "analyze this run" or "why did this model score poorly?"
- User shares a run URL like https://convex-evals.netlify.app/experiment/.../run/$runId/...
- User wants to review all failures across an entire eval run

Step 1: Get the run ID

Extract the run ID from the visualizer URL. The URL pattern is:

/experiment/$experimentId/run/$runId/...

The $runId is the Convex document ID (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e).
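The extraction in Step 1 can be sketched as a small helper. This is a minimal illustration that assumes only the URL pattern shown above; the function name extract_run_id is hypothetical and not part of the convex-evals repo.

```python
import re
from urllib.parse import urlparse

# Matches the documented path pattern /experiment/$experimentId/run/$runId/...
_RUN_PATH = re.compile(r"/experiment/(?P<experiment_id>[^/]+)/run/(?P<run_id>[^/?#]+)")


def extract_run_id(url: str) -> str:
    """Return the run ID segment from a visualizer URL or a bare path."""
    path = urlparse(url).path
    match = _RUN_PATH.search(path)
    if match is None:
        raise ValueError(f"no /experiment/.../run/... segment in {url!r}")
    return match.group("run_id")
```

Anything after the run ID segment (further path parts, query string) is ignored, so the same helper works whichever page of the run the user linked to.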

Step 2: Check previous reports for this model

Reports are stored in reports/{provider}/{model}/ (e.g. reports/anthropic/claude-opus-4-6/). List the directory for the model being analyzed and read the most recent report(s). This gives you:

- Known recurring failures for this model
- Actions already taken (lint config changes, grader fixes, task updates)
- Classifications from prior analysis that may still apply

Reference prior findings when the same eval fails again — note whether it's a repeat and whether any prior fix should have resolved it.
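Finding the most recent report(s) could look like the sketch below. It assumes report files follow the {runIdPrefix}_{date}.md naming from Step 5e with ISO dates (as in the example there), so the date suffix sorts lexically; the helper name and the limit parameter are hypothetical.

```python
from pathlib import Path


def recent_reports(reports_root: Path, provider: str, model: str, limit: int = 3) -> list[Path]:
    """Return up to `limit` report files for a model, newest first."""
    model_dir = reports_root / provider / model
    if not model_dir.is_dir():
        return []  # no prior analysis exists for this model

    def date_key(path: Path) -> str:
        # File stems look like "<runIdPrefix>_<YYYY-MM-DD>"; take the part after the last "_".
        return path.stem.rpartition("_")[2]

    return sorted(model_dir.glob("*.md"), key=date_key, reverse=True)[:limit]
```

An empty result simply means there is no history to cross-reference, and the analysis proceeds without it.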

Step 3: Fetch the failure summary

Run from the evalScores/ directory:

npx convex run --prod debugQueries:getFailedEvalsForRun '{"runId": "<runId>"}'

This returns:

- run: model name, provider, experiment, status
- totalEvals, passedCount, failedCount: overall stats
- failedEvals: array of failed evals, each with _id, evalPath, category, name, failureReason, and failedStep (which step failed and its error)

If there are no failures, report that all evals passed and stop.
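As an illustration of handling this result, the sketch below turns the payload into a short summary and applies the early exit. It assumes the command output has already been captured as JSON text with the field names listed above; the function name is hypothetical, and it does not depend on the inner shape of failedStep, which is not specified here.

```python
import json


def summarize_failures(raw_json: str) -> str:
    """Turn the getFailedEvalsForRun payload into a short text summary."""
    result = json.loads(raw_json)
    total = result["totalEvals"]
    passed = result["passedCount"]
    failed = result["failedCount"]
    if failed == 0:
        # Nothing to analyze: report success and stop.
        return f"All {total} evals passed."
    lines = [f"{passed}/{total} evals passed, {failed} failed:"]
    for ev in result["failedEvals"]:
        lines.append(f"- {ev['category']}/{ev['name']}: {ev['failureReason']}")
    return "\n".join(lines)
```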

Step 4: Fan out sub-agents to analyze each failure

For each failed eval, spawn a sub-agent (up to 4 in parallel) with this prompt template:

You are investigating a failing eval from the convex-evals system.

The workspace is at c:\dev\convex\convex-evals
Run this command from the evalScores/ directory:

npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<EVAL_ID>"}'

Then analyze the result:
1. Which step failed and what was the exact error?
2. Look at the model's generated code in outputFiles.
3. Look at the expected answer and grader in evalSourceFiles.
4. Look at the task description in eval.task.
5. Is this a genuine model mistake, or is the test/lint/task unfair?

Classify the failure as one of:
- MODEL_FAULT: The model genuinely got it wrong
- OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
- AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
- KNOWN_GAP: A known limitation of this eval that affects all models (e.g. the Convex API returns fields the model can't predict without being told)

Return a structured summary:
- Eval: <name> (<category>)
- Failed step: <step name>
- Error: <one-line error summary>
- Classification: <one of the above>
- Reasoning: <2-3 sentences explaining your classification>
- Model output snippet: <the relevant problematic code, if applicable>
- Expected code snippet: <what the answer looks like, if applicable>
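The bounded fan-out can be sketched with a thread pool capped at four workers. This is only an illustration of the "up to 4 in parallel" rule: spawn is a hypothetical stand-in for whatever mechanism the host agent uses to launch a sub-agent and return its summary, and template is the prompt above with its <EVAL_ID> placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(
    failed_evals: list[dict],
    template: str,
    spawn: Callable[[str], str],
    max_parallel: int = 4,
) -> list[str]:
    """Run one sub-agent per failed eval, at most `max_parallel` at a time."""
    # Fill the <EVAL_ID> placeholder with each failed eval's _id.
    prompts = [template.replace("<EVAL_ID>", ev["_id"]) for ev in failed_evals]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() preserves input order, so summaries line up with failed_evals.
        return list(pool.map(spawn, prompts))
```

Because results come back in input order, each summary can be paired with the failedEvals entry that produced it without extra bookkeeping.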

Step 5: Collate, present, and create report

Once all sub-agents return, build the analysis:

5a. Overall summary

- Model, experiment, pass rate (X/Y evals passed)
- Breakdown by failure type: how many eslint, tsc, deploy, test failures
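A minimal sketch of the overall summary, assuming the failed step of each failure has already been reduced to a plain name such as "eslint" or "tsc" (the function name and the output wording are illustrative):

```python
from collections import Counter


def overall_summary(model: str, experiment: str, total: int, passed: int, failed_steps: list[str]) -> str:
    """Build the one-line summary: pass rate plus failure counts per step."""
    breakdown = Counter(failed_steps)
    parts = ", ".join(f"{step}: {count}" for step, count in breakdown.most_common())
    rate = 100 * passed / total if total else 0.0
    return (
        f"{model} ({experiment}): {passed}/{total} evals passed ({rate:.1f}%). "
        f"Failures by step: {parts or 'none'}"
    )
```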

5b. Failure classification table

For each failure, list: eval name, failed step, classification, one-line reasoning.
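One way to lay out this table as aligned plain text is sketched below. The dictionary keys in HEADERS are hypothetical names for the four columns listed above, not a format defined by convex-evals.

```python
# Hypothetical keys for the four columns: eval name, failed step, classification, reasoning.
HEADERS = ["eval", "failed_step", "classification", "reasoning"]


def classification_table(rows: list[dict]) -> str:
    """Render one aligned plain-text row per failure."""
    # Each column is as wide as its longest cell (or its header).
    widths = [max([len(h)] + [len(str(r[h])) for r in rows]) for h in HEADERS]

    def fmt(cells) -> str:
        return " | ".join(str(c).ljust(w) for c, w in zip(cells, widths)).rstrip()

    lines = [fmt(HEADERS), fmt("-" * w for w in widths)]
    lines += [fmt(r[h] for h in HEADERS) for r in rows]
    return "\n".join(lines)
```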

5c. Cross-cutting patterns

Look for patterns across failures:

- Are multiple failures caused by the same root issue? (e.g. same lint rule, same API misunderstanding, same missing pattern)
- Are there categories of evals that are systematically harder?
- Do prior reports for this model already document these issues?
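A crude first pass at spotting shared root issues is to group the sub-agent summaries by failed step and error text. The sketch below assumes each summary has been parsed into a dict with hypothetical keys eval, failed_step, and error; exact string matching on the error is a simplification, and real errors would likely need normalizing first.

```python
from collections import defaultdict


def recurring_issues(summaries: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group evals by (failed step, error) and keep groups with more than one eval."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for summary in summaries:
        groups[(summary["failed_step"], summary["error"])].append(summary["eval"])
    return {key: evals for key, evals in groups.items() if len(evals) > 1}
```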

5d. Recommendations

Group recommendations by type:

- Eval improvements: Tasks that should be clarified, tests that should be relaxed
- Lint/config changes: Rules that are too strict for what we're testing
- Model-specific notes: Patterns this model struggles with that other models might not
- No action needed: Failures that are genuinely the model's fault

5e. Create report file

Always create a report file at:

reports/{provider}/{model}/{runIdPrefix}_{date}.md

For example: reports/anthropic/claude-opus-4-6/jn72t14a_2026-02-06.md

The runIdPrefix is the first 8 characters of the run ID.
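The path construction can be sketched as follows. The ISO date format is inferred from the example above; the function name is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def report_path(provider: str, model: str, run_id: str, on: date) -> PurePosixPath:
    """Build reports/{provider}/{model}/{runIdPrefix}_{date}.md."""
    run_id_prefix = run_id[:8]  # first 8 characters of the run ID
    return PurePosixPath("reports") / provider / model / f"{run_id_prefix}_{on.isoformat()}.md"
```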

The report should contain:

- Run metadata (ID, model, experiment, date, pass rate)
- Failure summary table (by step type)
- Per-failure analysis with classification, reasoning, and code snippets
- Cross-cutting patterns (especially recurring failures from prior reports)
- Recommendations (eval improvements, lint/config changes, model-specific notes)
- Net impact assessment (how many failures are actionable vs genuine model faults)
- Actions taken: List any changes made as a result of this analysis (e.g. "Updated TASK.txt for 007-http_action_routing to clarify getSiteURL placement"). Default to "None" if no changes were made; this makes it explicit that recommendations were reviewed and deliberately not acted on, rather than simply forgotten.

5f. Present to user

Present the full analysis to the user. End with:

"These are my findings. Would you like me to implement any of these recommendations, or would you like to discuss specific failures in more detail?"

Do NOT make any code/config changes until the user explicitly asks.

5g. Update report after implementing changes

If the user asks you to implement any recommendations, update the report file's "Actions taken" section after making the changes. Record:

- What was changed (file path + brief description)
- Which failure(s) it addresses
- Date of the change

This ensures future analysis sessions can see which recommendations were already acted on and avoid re-recommending changes that have already been made.
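An "Actions taken" entry carrying those three pieces of information could be formatted as in the sketch below. The one-line layout and the function name are assumptions; the skill does not prescribe an exact format.

```python
from datetime import date


def action_entry(path: str, description: str, failures: list[str], on: date) -> str:
    """Format one 'Actions taken' line: date, changed file, description, failures addressed."""
    addressed = ", ".join(failures)
    return f"- {on.isoformat()}: {path}: {description} (addresses: {addressed})"
```

Appending one such line per change keeps the section easy to scan when a later session checks what was already acted on.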

Analyze all failures in a convex-evals run, spawning parallel sub-agents to investigate each failure and producing a report with classifications and recommendations. Use when the user asks to analyze an entire run, review all failures in a run, or wants to understand why a model scored poorly.

109 stars

testing

Updated Apr 17, 2026

npx skillsauth add get-convex/convex-evals analyze-run

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

- Trivy: Container and dependency vulnerability scanner. Status: Clean (95%)
- Semgrep: Static code analysis for vulnerabilities. Status: Clean (95%)
- mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
- Snyk (dep): Open source security scanning. Status: Skipped (50%)
- Socket.dev: Supply chain security analysis. Status: Skipped (50%)
- VirusTotal: Multi-engine malware detection. Status: Skipped (50%)
- CrowdStrike: Advanced threat intelligence. Status: Skipped (50%)
- OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
- OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 17, 2026, 11:21 AM (4.6s, 1 file scanned)

Related Skills

get-convex/validate-guidelines (testing)

Verified, Trusted, Community

Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.

109 stars, SKILL.md, Updated Apr 17, 2026

get-convex/analyze-eval (testing)

Verified, Trusted, Community

Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

109 stars, SKILL.md, Updated Apr 17, 2026

get-convex/analyze-ablation (documentation)

Verified, Trusted, Community

Analyze guideline ablation experiment results to determine which guideline sections are essential, marginal, or dispensable. Use when the user asks to analyze ablation results, interpret guideline compaction data, or wants to know which guidelines to keep for AGENTS.md.

109 stars, SKILL.md, Updated Apr 17, 2026

get-convex/add-model (data-ai)

Verified, Trusted, Community

Add a new AI model to the eval runner, update the manual eval workflow, push changes, and trigger baseline eval runs. Use when the user wants to add a new model, onboard a model, or mentions a new model name/link to add to the leaderboard.

109 stars, SKILL.md, Updated Apr 17, 2026

Install manually

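# Clone the repo
git clone https://github.com/get-convex/convex-evals.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r convex-evals/.cursor/skills/analyze-run ~/.claude/skills/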

Repository: get-convex/convex-evals (109 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT