.cursor/skills/add-model/SKILL.md
Add a new AI model to the eval runner, update the manual eval workflow, push changes, and trigger baseline eval runs. Use when the user wants to add a new model, onboard a model, or mentions a new model name/link to add to the leaderboard.
Follow these steps whenever the user asks to add a new AI model to the eval suite.
Determine the following (ask the user if not provided):
- name: the OpenRouter model id, e.g. anthropic/claude-opus-4.6. If the user gives a marketing name or URL, look up the OpenRouter model id.
- formattedName: the human-readable name, e.g. Claude 4.6 Opus.
- Predecessor: the previous model in the same family, if any (e.g. claude-opus-4.5 is the predecessor of claude-opus-4.6).
- apiKind: only needed for OpenAI Codex/Responses-API models; set to "responses". Omit for all other models.

If you're unsure, check how the closest existing model in the same family is configured in runner/models/index.ts and match it.
runner/models/index.ts
Open runner/models/index.ts and add a new entry to the ALL_MODELS array. Place it next to its family siblings, respecting the existing grouping comments.
Template:
{
  name: "<provider>/<model-id>",
  formattedName: "<Human Name>",
  // apiKind: "responses", // only for OpenAI Codex / Responses-API models
},
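For illustration, here is how the template might look once filled in with the example values mentioned earlier. The `ModelEntry` type and the `looksLikeOpenRouterId` helper are hypothetical names introduced for this sketch; the real type in runner/models/index.ts may have more fields, so treat this as an assumption rather than the repository's definition.

```typescript
// Hypothetical shape inferred from the template; the actual type in
// runner/models/index.ts may differ.
type ModelEntry = {
  name: string; // OpenRouter model id, "<provider>/<model-id>"
  formattedName: string; // human-readable name
  apiKind?: "responses"; // only for OpenAI Codex / Responses-API models
};

// Example entry using the values from the checklist above.
const exampleEntry: ModelEntry = {
  name: "anthropic/claude-opus-4.6",
  formattedName: "Claude 4.6 Opus",
};

// Cheap pre-flight check that an id has the "<provider>/<model-id>" shape:
// two non-empty parts without whitespace, separated by a single slash.
function looksLikeOpenRouterId(name: string): boolean {
  return /^[^/\s]+\/[^/\s]+$/.test(name);
}
```

A check like this only catches formatting slips such as a missing provider prefix; it does not confirm that the model actually exists on OpenRouter, which is what the local sanity check is for.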
Open .github/workflows/manual_evals.yml and replace the entire matrix.model list with only the new model. This workflow exists solely to collect baseline data for newly added models, so it should only ever contain the latest addition.
matrix:
  model:
    - "<provider>/<model-id>"
Run bun run typecheck to verify no type errors were introduced.
Before committing, run a quick local sanity check with one or two simple evals to confirm the model ID is valid, the API key works, and results are being produced. Use the simplest fundamentals evals:
MODELS=<new-model-name> TEST_FILTER="000-fundamentals/000" bun run local:run
If that passes, optionally run one more:
MODELS=<new-model-name> TEST_FILTER="000-fundamentals/001" bun run local:run
What to look for:
.env file before proceeding

Only proceed to the next step once at least one eval completes successfully.
Create a descriptive commit message and push to main:
git add runner/models/index.ts .github/workflows/manual_evals.yml
git commit -m "add <model-name>; demote older <family> versions"
git push origin main
Use the GitHub CLI to dispatch the manual eval workflow 3 times (to get a statistically meaningful baseline):
gh workflow run manual_evals.yml --ref main
Run this command 3 times, waiting ~5 seconds between dispatches to avoid collisions.
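The dispatch-and-wait pattern can be sketched as a small loop. This is a minimal sketch, not part of the repository: `DispatchDeps` and `dispatchBaselineRuns` are hypothetical names, and the command runner and sleep function are passed in so the loop can be exercised without calling the GitHub CLI.

```typescript
// Hypothetical helper types; not part of the convex-evals repository.
type DispatchDeps = {
  run: (cmd: string, args: string[]) => void; // executes one command
  sleep: (ms: number) => void; // blocks for the given time
};

// Dispatch the manual eval workflow `count` times, pausing between
// dispatches (but not after the last one) to avoid collisions.
function dispatchBaselineRuns(
  deps: DispatchDeps,
  count: number = 3,
  delayMs: number = 5000,
): void {
  for (let i = 0; i < count; i++) {
    deps.run("gh", ["workflow", "run", "manual_evals.yml", "--ref", "main"]);
    if (i < count - 1) {
      deps.sleep(delayMs);
    }
  }
}
```

In a real script, `run` could wrap `execFileSync` from `node:child_process`, and `sleep` could be any blocking wait; both choices are left to the caller here.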
You MUST poll until all 3 runs reach a terminal state (completed/failed/cancelled). Do not stop monitoring early or hand back to the user while runs are still in progress.
Poll every ~2 minutes using:
gh run list --workflow=manual_evals.yml --limit=6
Runs typically take 20-30 minutes. Keep checking until all show completed. If a run fails, immediately investigate:
gh run view <run-id> --log-failed
Report the final pass/fail status for each run to the user once all 3 are done.
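The "are we done yet?" decision in the polling loop can be sketched as a pure function over the CLI's JSON output. This sketch assumes the runs were listed with `gh run list --workflow=manual_evals.yml --limit=6 --json databaseId,status,conclusion` and that the newest runs come first; `RunSummary`, `parseRuns`, `allRunsFinished`, and `reportLines` are hypothetical names introduced here.

```typescript
// Subset of the fields requested via --json; names match the gh CLI fields.
type RunSummary = {
  databaseId: number;
  status: string; // e.g. "queued", "in_progress", "completed"
  conclusion: string | null; // e.g. "success", "failure", "cancelled"; empty until finished
};

// Parse the JSON array printed by `gh run list --json ...`.
function parseRuns(json: string): RunSummary[] {
  return JSON.parse(json) as RunSummary[];
}

// True only when the `expected` most recent runs have all reached the
// "completed" status (whatever their conclusion was).
function allRunsFinished(runs: RunSummary[], expected: number = 3): boolean {
  const latest = runs.slice(0, expected);
  return (
    latest.length === expected &&
    latest.every((r) => r.status === "completed")
  );
}

// One line per run for the final report: the conclusion if there is one,
// otherwise the current status.
function reportLines(runs: RunSummary[], expected: number = 3): string[] {
  return runs
    .slice(0, expected)
    .map((r) => `${r.databaseId}: ${r.conclusion || r.status}`);
}
```

A polling script would call `allRunsFinished` every couple of minutes and only produce the report once it returns true; any run whose conclusion is not `success` is a candidate for `gh run view <run-id> --log-failed`.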
- New entry added to ALL_MODELS in runner/models/index.ts
- .github/workflows/manual_evals.yml matrix replaced with only the new model
- bun run typecheck passes
- Changes pushed to main