Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

get-convex/add-model

Name: add-model
Author: get-convex

.cursor/skills/add-model/SKILL.md

npx skillsauth add get-convex/convex-evals add-model


Add a New Model to the Eval Runner

Follow these steps whenever the user asks to add a new AI model to the eval suite.

Step 0: Gather Information

Determine the following (ask the user if not provided):

Model identifier - the OpenRouter-style name, e.g. anthropic/claude-opus-4.6. If the user gives a marketing name or URL, look up the OpenRouter model id.
Formatted name - human-readable, e.g. Claude 4.6 Opus.
Provider family & version - needed to find older siblings (e.g. claude-opus-4.5 is the predecessor of claude-opus-4.6).
apiKind - only needed for OpenAI Codex/Responses-API models; set to "responses". Omit for all other models.

If you're unsure, check how the closest existing model in the same family is configured in runner/models/index.ts and match it.
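
The details gathered in this step can be captured in a small record before any file is touched. The sketch below is illustrative only: `NewModelInfo` and `parseModelId` are hypothetical names, not part of the repository, and the check assumes an identifier has exactly the form `<provider>/<model-id>` with a single slash.

```typescript
// Hypothetical record for the details gathered in Step 0 (not a type from the repo).
interface NewModelInfo {
  name: string; // OpenRouter-style identifier, e.g. "anthropic/claude-opus-4.6"
  formattedName: string; // human-readable, e.g. "Claude 4.6 Opus"
  apiKind?: "responses"; // only for OpenAI Codex / Responses-API models
}

// Split "<provider>/<model-id>" into its two parts; throw on anything else.
function parseModelId(name: string): { provider: string; modelId: string } {
  const parts = name.split("/");
  const [provider = "", modelId = ""] = parts;
  if (parts.length !== 2 || provider === "" || modelId === "") {
    throw new Error(`Expected "<provider>/<model-id>", got "${name}"`);
  }
  return { provider, modelId };
}

const info: NewModelInfo = {
  name: "anthropic/claude-opus-4.6",
  formattedName: "Claude 4.6 Opus",
};
const parsed = parseModelId(info.name);
console.log(parsed.provider, parsed.modelId);
```

A marketing name such as "Claude 4.6 Opus" is rejected by this check, which is the cue to look up the OpenRouter model ID first.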

Step 1: Add the Model to `runner/models/index.ts`

Open runner/models/index.ts and add a new entry to the ALL_MODELS array. Place it next to its family siblings, respecting the existing grouping comments.

Template:

{
  name: "<provider>/<model-id>",
  formattedName: "<Human Name>",
  // apiKind: "responses",  // only for OpenAI Codex / Responses-API models
},
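
To show how the template might sit next to its family siblings, here is a minimal sketch. The `ModelEntry` type is an assumption inferred from the template (the real type in runner/models/index.ts may carry more fields), the array contents and the formatted name of the older sibling are made up for illustration, and `duplicateNames` is a hypothetical helper for catching an accidental double entry.

```typescript
// Assumed entry shape, inferred from the template; the real type may have more fields.
type ModelEntry = {
  name: string;
  formattedName: string;
  apiKind?: "responses";
};

// Illustrative array only; the actual ALL_MODELS contents are not shown in this document.
const ALL_MODELS: ModelEntry[] = [
  // Anthropic
  { name: "anthropic/claude-opus-4.5", formattedName: "Claude 4.5 Opus" },
  { name: "anthropic/claude-opus-4.6", formattedName: "Claude 4.6 Opus" }, // new entry
];

// Return every name that appears more than once in the list.
function duplicateNames(models: ModelEntry[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const m of models) {
    if (seen.has(m.name)) dupes.add(m.name);
    seen.add(m.name);
  }
  return Array.from(dupes);
}

console.log(duplicateNames(ALL_MODELS)); // prints []
```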

Step 2: Update the Manual Evals Workflow

Open .github/workflows/manual_evals.yml and replace the entire matrix.model list with only the new model. This workflow exists solely to collect baseline data for newly added models, so it should only ever contain the latest addition.

matrix:
  model:
    - "<provider>/<model-id>"
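
If the edit is scripted rather than made by hand, the replacement can be sketched as a line-based rewrite. `replaceMatrixModels` is a hypothetical helper, not something the repository provides. It assumes the list items sit directly under a bare `model:` key with no comments or blank lines between them, and it rewrites every such key it finds; a real YAML parser would be more robust.

```typescript
// Hypothetical helper: replace the list under each bare `model:` key with one entry.
function replaceMatrixModels(yamlText: string, newModel: string): string {
  const lines = yamlText.split("\n");
  const out: string[] = [];
  let i = 0;
  while (i < lines.length) {
    const line = lines[i] ?? "";
    out.push(line);
    i++;
    const key = /^(\s*)model:\s*$/.exec(line);
    if (key === null) continue;
    // Default indentation: two spaces deeper than the key itself.
    let itemIndent = (key[1] ?? "") + "  ";
    let sawItem = false;
    // Skip the existing list items, remembering the indentation of the first one.
    while (i < lines.length) {
      const item = /^(\s*)-\s/.exec(lines[i] ?? "");
      if (item === null) break;
      if (!sawItem) itemIndent = item[1] ?? itemIndent;
      sawItem = true;
      i++;
    }
    out.push(`${itemIndent}- "${newModel}"`);
  }
  return out.join("\n");
}

const before = 'matrix:\n  model:\n    - "a/old-1"\n    - "a/old-2"\nsteps: []';
console.log(replaceMatrixModels(before, "anthropic/claude-opus-4.6"));
```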

Step 3: Typecheck

Run bun run typecheck to verify no type errors were introduced.

Step 4: Smoke Test the New Model Locally

Before committing, run a quick local sanity check with one or two simple evals to confirm the model ID is valid, the API key works, and results are being produced. Use the simplest fundamentals evals:

MODELS=<new-model-name> TEST_FILTER="000-fundamentals/000" bun run local:run

If that passes, optionally run one more:

MODELS=<new-model-name> TEST_FILTER="000-fundamentals/001" bun run local:run

What to look for:

No authentication or "model not found" errors
The run completes and produces a score (even a low score is fine - we just want to confirm it runs)
If it fails with an API/auth error, stop and fix the model ID or check the .env file before proceeding

Only proceed to the next step once at least one eval completes successfully.
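
As a small aid, the two smoke-test command lines can be generated from the model ID so the placeholder is never pasted literally. `smokeTestCommands` is a hypothetical helper; the character check is an assumption that model IDs need no shell quoting, and the helper only builds the strings rather than running anything.

```typescript
// Build the two smoke-test command lines from Step 4 for a given model ID.
function smokeTestCommands(model: string): string[] {
  // Assumption: model IDs contain only characters that are safe unquoted in a shell.
  if (!/^[A-Za-z0-9._:\/-]+$/.test(model)) {
    throw new Error(`Unexpected characters in model ID: ${model}`);
  }
  const filters = ["000-fundamentals/000", "000-fundamentals/001"];
  return filters.map(
    (filter) => `MODELS=${model} TEST_FILTER="${filter}" bun run local:run`,
  );
}

for (const command of smokeTestCommands("anthropic/claude-opus-4.6")) {
  console.log(command);
}
```

Run the first command, and only run the optional second one if the first completes with a score.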

Step 5: Commit and Push

Stage the two changed files, commit with a message that describes what changed, and push to main:

git add runner/models/index.ts .github/workflows/manual_evals.yml
git commit -m "add <model-name>; demote older <family> versions"
git push origin main

Step 6: Trigger Manual Eval Runs for Baseline Data

Use the GitHub CLI to dispatch the manual eval workflow 3 times, so the baseline reflects more than a single run:

gh workflow run manual_evals.yml --ref main

Run this command 3 times, waiting ~5 seconds between dispatches to avoid collisions.
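
The dispatch sequence can be written out explicitly as the shell lines to execute. `dispatchPlan` is a hypothetical helper that only produces the lines, with a pause between consecutive dispatches and none after the last; the defaults of 3 runs and 5 seconds come from the step above.

```typescript
// Produce the shell lines for `count` dispatches with a pause between consecutive ones.
function dispatchPlan(count = 3, pauseSeconds = 5): string[] {
  const lines: string[] = [];
  for (let i = 0; i < count; i++) {
    if (i > 0) lines.push(`sleep ${pauseSeconds}`);
    lines.push("gh workflow run manual_evals.yml --ref main");
  }
  return lines;
}

console.log(dispatchPlan().join("\n"));
```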

Step 7: Monitor the Runs Until Completion

You MUST poll until all 3 runs reach a terminal state, meaning a status of completed with a conclusion such as success, failure, or cancelled. Do not stop monitoring early or hand back to the user while runs are still in progress.

Poll every ~2 minutes using:

gh run list --workflow=manual_evals.yml --limit=6

Runs typically take 20-30 minutes. Keep checking until every run shows a status of completed. If a run fails, investigate immediately:

gh run view <run-id> --log-failed

Report the final pass/fail status for each run to the user once all 3 are done.
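
The stop condition for polling can be made precise with a small pure function. This sketch assumes the run list is fetched as JSON, for example with `gh run list --workflow=manual_evals.yml --limit=6 --json databaseId,status,conclusion`, and that a run still in progress has an empty conclusion. `RunInfo` and `summarizeRuns` are hypothetical names, and the caller is expected to pass only the 3 runs it just dispatched.

```typescript
// Subset of the fields requested via `--json databaseId,status,conclusion`.
type RunInfo = { databaseId: number; status: string; conclusion: string };

// A run is terminal once its status is "completed"; the conclusion says how it ended.
function summarizeRuns(runs: RunInfo[]): { done: boolean; failedIds: number[] } {
  // An empty list is treated as "not done" so polling never stops on missing data.
  const done = runs.length > 0 && runs.every((r) => r.status === "completed");
  const failedIds = runs
    .filter((r) => r.status === "completed" && r.conclusion !== "success")
    .map((r) => r.databaseId);
  return { done, failedIds };
}

// Made-up sample data in the assumed JSON shape.
const sample: RunInfo[] = JSON.parse(
  '[{"databaseId":101,"status":"completed","conclusion":"success"},' +
    '{"databaseId":102,"status":"completed","conclusion":"failure"},' +
    '{"databaseId":103,"status":"in_progress","conclusion":""}]',
);
console.log(summarizeRuns(sample));
```

Here polling would continue because run 103 is still in progress, while run 102 is already a candidate for `gh run view <run-id> --log-failed`.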

Summary Checklist

[ ] Model added to ALL_MODELS in runner/models/index.ts
[ ] .github/workflows/manual_evals.yml matrix replaced with only the new model
[ ] bun run typecheck passes
[ ] Smoke test: at least one eval completes successfully locally
[ ] Changes committed and pushed to main
[ ] Manual eval workflow dispatched 3 times
[ ] All 3 runs monitored to completion


Add a new AI model to the eval runner, update the manual eval workflow, push changes, and trigger baseline eval runs. Use when the user wants to add a new model, onboard a model, or mentions a new model name/link to add to the leaderboard.

109 stars

data-ai

Updated Apr 17, 2026

Install

npx skillsauth add get-convex/convex-evals add-model

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner - Clean (95%)
Semgrep - Static code analysis for vulnerabilities - Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation - Clean (95%)
Snyk (dep) - Open source security scanning - Skipped (50%)
Socket.dev - Supply chain security analysis - Skipped (50%)
VirusTotal - Multi-engine malware detection - Skipped (50%)
CrowdStrike - Advanced threat intelligence - Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check - Skipped (50%)
OWASP Dep-Check - Skipped (50%)

Last scanned: Apr 17, 2026, 11:21 AM (5.6s, 1 file scanned)


Related Skills

get-convex/validate-guidelines

Category: testing
Verified | Trusted | Community

Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.

109 stars, SKILL.md, updated Apr 17, 2026

get-convex/analyze-run

Category: testing
Verified | Trusted | Community

Analyze all failures in a convex-evals run, spawning parallel sub-agents to investigate each failure and producing a report with classifications and recommendations. Use when the user asks to analyze an entire run, review all failures in a run, or wants to understand why a model scored poorly.

109 stars, SKILL.md, updated Apr 17, 2026

get-convex/analyze-eval

Category: testing
Verified | Trusted | Community

Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

109 stars, SKILL.md, updated Apr 17, 2026

get-convex/analyze-ablation

Category: documentation
Verified | Trusted | Community

Analyze guideline ablation experiment results to determine which guideline sections are essential, marginal, or dispensable. Use when the user asks to analyze ablation results, interpret guideline compaction data, or wants to know which guidelines to keep for AGENTS.md.

109 stars, SKILL.md, updated Apr 17, 2026


Install manually

# Clone the repo
git clone https://github.com/get-convex/convex-evals.git

# Copy into the Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r convex-evals/.cursor/skills/add-model ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

get-convex/convex-evals

109 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
