get-convex/analyze-ablation

Name: analyze-ablation
Author: get-convex

.cursor/skills/analyze-ablation/SKILL.md

npx skillsauth add get-convex/convex-evals analyze-ablation

Analyze Ablation Results

When to use

User says "analyze the ablation results" or "analyze ablation for <model>"
User wants to know which guideline sections to keep or remove
User asks about guideline compaction or AGENTS.md token budget
After a guideline ablation experiment has been run (via GitHub Action or locally)

Step 1: Find available results

List the ablation/results/ directory to see which models have results:

ls ablation/results/

If the directory is empty or doesn't exist, the user needs to either:

Run the ablation locally: bun run scripts/runAblation.ts --model <model>

Or download results from a GitHub Actions run:

# List recent ablation workflow runs
gh run list --workflow=ablation_experiment.yml --limit=5
# Download the artifact
gh run download <run-id> -n ablation-<model>-<run-id> -D ablation/results/

Step 2: Read the summary JSON

For the requested model, read the latest JSON file from ablation/results/<model>/. The file contains:

model: The model name
timestamp: When the experiment ran
baseline: Overall pass/fail counts and per-eval results with the full guideline set
sections: Array of per-section ablation results, each with:
- name: Section name (e.g. "function_guidelines", "query_guidelines")
- tokensInSection: How many tokens this section costs
- verdict: "ESSENTIAL" (2+ regressions), "MARGINAL" (1 regression), "DISPENSABLE" (0 regressions)
- regressions: Eval names that flipped from pass to fail when this section was removed
- improvements: Eval names that flipped from fail to pass when this section was removed (a sign that the guideline was confusing the model)
- score: Pass/fail counts for this ablation variant
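
The fields above could be modeled with types like the following. This is a minimal sketch, not the repository's actual definitions: the `passed`/`failed` shape used for `baseline` and `score`, the string type for `timestamp`, and the helper names `verdictFor` and `findInconsistentVerdicts` are all assumptions. Only the field names and the verdict thresholds come from the list above.

```typescript
// Hypothetical shape for pass/fail counts; the real field names may differ.
interface Score {
  passed: number;
  failed: number;
}

type Verdict = "ESSENTIAL" | "MARGINAL" | "DISPENSABLE";

interface SectionResult {
  name: string;
  tokensInSection: number;
  verdict: Verdict;
  regressions: string[];
  improvements: string[];
  score: Score;
}

interface AblationSummary {
  model: string;
  timestamp: string; // assumed to be serialized as a string
  baseline: Score; // simplified: the real baseline also carries per-eval results
  sections: SectionResult[];
}

// Thresholds as documented: 2+ regressions, exactly 1, or none.
function verdictFor(regressions: string[]): Verdict {
  if (regressions.length >= 2) return "ESSENTIAL";
  if (regressions.length === 1) return "MARGINAL";
  return "DISPENSABLE";
}

// Sanity check: names of sections whose stored verdict disagrees
// with the verdict implied by their regression count.
function findInconsistentVerdicts(summary: AblationSummary): string[] {
  return summary.sections
    .filter((s) => s.verdict !== verdictFor(s.regressions))
    .map((s) => s.name);
}
```

Running a check like `findInconsistentVerdicts` before presenting results is optional, but it catches a hand-edited or truncated summary file early.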

Step 3: Present the classification table

Present a summary table showing:

| Section | Verdict | Regressions | Improvements | Tokens | Score |
|---------|---------|-------------|--------------|--------|-------|

Sort by verdict: ESSENTIAL first, then MARGINAL, then DISPENSABLE.
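
One way to produce the sorted table is sketched below. The `SectionRow` type, the `passed`/`failed` score shape, and the function name `renderTable` are assumptions made for illustration. Showing regression and improvement counts (rather than eval names) and rendering the score as `passed/total` are also choices of this sketch, not something the skill prescribes.

```typescript
type Verdict = "ESSENTIAL" | "MARGINAL" | "DISPENSABLE";

interface SectionRow {
  name: string;
  verdict: Verdict;
  regressions: string[];
  improvements: string[];
  tokensInSection: number;
  score: { passed: number; failed: number }; // assumed shape
}

// Sort key: ESSENTIAL first, then MARGINAL, then DISPENSABLE.
const VERDICT_ORDER: Record<Verdict, number> = {
  ESSENTIAL: 0,
  MARGINAL: 1,
  DISPENSABLE: 2,
};

function renderTable(sections: SectionRow[]): string {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...sections].sort(
    (a, b) => VERDICT_ORDER[a.verdict] - VERDICT_ORDER[b.verdict],
  );
  const header = [
    "| Section | Verdict | Regressions | Improvements | Tokens | Score |",
    "|---------|---------|-------------|--------------|--------|-------|",
  ];
  const rows = sorted.map((s) => {
    const total = s.score.passed + s.score.failed;
    return `| ${s.name} | ${s.verdict} | ${s.regressions.length} | ${s.improvements.length} | ${s.tokensInSection} | ${s.score.passed}/${total} |`;
  });
  return [...header, ...rows].join("\n");
}
```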

Step 4: Detailed analysis

For each ESSENTIAL and MARGINAL section:

List the specific evals that regressed
Check if the regressed evals are related to the section topic (e.g. query_guidelines causes query eval regressions — expected)
Flag any surprising regressions (section removal caused failures in unrelated evals)

For any section with improvements:

These are cases where the guideline was actively confusing the model
Flag these as candidates for rewording even if the section is kept

Step 5: Cross-model comparison (if multiple models have results)

If ablation/results/ has results for multiple models:

Compare verdicts across models
A section ESSENTIAL in ALL models is definitely essential
A section ESSENTIAL in some models should still be kept (conservative approach)
A section DISPENSABLE in ALL models is safe to remove
Flag any sections with conflicting verdicts across models
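
The cross-model rules above can be expressed as a small decision function. This is a sketch under assumptions: the `Decision` labels and the name `combineVerdicts` are hypothetical, and the `REVIEW` fallback covers combinations the rules do not address (for example, MARGINAL in every model).

```typescript
type Verdict = "ESSENTIAL" | "MARGINAL" | "DISPENSABLE";

// Hypothetical labels for the outcomes described in the rules above.
type Decision =
  | "KEEP_ESSENTIAL_IN_ALL" // ESSENTIAL in every model
  | "KEEP_CONSERVATIVE" // ESSENTIAL in at least one model
  | "SAFE_TO_REMOVE" // DISPENSABLE in every model
  | "REVIEW"; // not covered by the rules; needs a human decision

interface CrossModelResult {
  decision: Decision;
  conflicting: boolean; // true when the models disagree on the verdict
}

// `verdicts` holds one section's verdict from each model that has results.
function combineVerdicts(verdicts: Verdict[]): CrossModelResult {
  if (verdicts.length === 0) {
    throw new Error("need at least one verdict");
  }
  const conflicting = new Set(verdicts).size > 1;
  if (verdicts.every((v) => v === "ESSENTIAL")) {
    return { decision: "KEEP_ESSENTIAL_IN_ALL", conflicting };
  }
  if (verdicts.some((v) => v === "ESSENTIAL")) {
    return { decision: "KEEP_CONSERVATIVE", conflicting };
  }
  if (verdicts.every((v) => v === "DISPENSABLE")) {
    return { decision: "SAFE_TO_REMOVE", conflicting };
  }
  return { decision: "REVIEW", conflicting };
}
```

Keeping `conflicting` separate from `decision` lets a section be both kept and flagged, which matches the instruction to keep sections that are ESSENTIAL in only some models while still reporting the disagreement.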

Step 6: Token budget summary

Calculate:

Current full guidelines: Total tokens (from baseline)
Essential sections only: Sum of tokens for ESSENTIAL sections
Essential + Marginal: Sum of tokens for ESSENTIAL and MARGINAL sections
Savings: How many tokens saved by removing DISPENSABLE (and optionally MARGINAL) sections
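
The arithmetic for this step might look like the sketch below. The type and function names are hypothetical. It also assumes the full total equals the sum of `tokensInSection` over all sections; if the baseline reports its own total (for example, one that includes text outside any section), prefer that figure.

```typescript
type Verdict = "ESSENTIAL" | "MARGINAL" | "DISPENSABLE";

interface SectionTokens {
  name: string;
  verdict: Verdict;
  tokensInSection: number;
}

interface TokenBudget {
  full: number; // all sections
  essentialOnly: number; // ESSENTIAL sections only
  essentialPlusMarginal: number; // ESSENTIAL and MARGINAL sections
  savedDroppingDispensable: number; // full - essentialPlusMarginal
  savedDroppingMarginalToo: number; // full - essentialOnly
}

function tokenBudget(sections: SectionTokens[]): TokenBudget {
  // Sum the tokens of every section whose verdict satisfies `keep`.
  const sumWhere = (keep: (v: Verdict) => boolean): number =>
    sections
      .filter((s) => keep(s.verdict))
      .reduce((total, s) => total + s.tokensInSection, 0);

  const full = sumWhere(() => true);
  const essentialOnly = sumWhere((v) => v === "ESSENTIAL");
  const essentialPlusMarginal = sumWhere((v) => v !== "DISPENSABLE");

  return {
    full,
    essentialOnly,
    essentialPlusMarginal,
    savedDroppingDispensable: full - essentialPlusMarginal,
    savedDroppingMarginalToo: full - essentialOnly,
  };
}
```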

Step 7: Recommend next steps

Based on the results, recommend one of:

Ready to build AGENTS.md: If the classification is clear and token savings are meaningful, suggest building the compact guideline set and running a validation run.
Subsection ablation needed: If function_guidelines is ESSENTIAL (likely — it's the largest section at ~2400 tokens), suggest a follow-up ablation of its 8 subsections to find further savings.
Cross-model validation needed: If only one model has been tested, suggest running ablation on 1-2 additional models for confidence.
Results are noisy: If many sections show exactly 1 regression (MARGINAL), the run-to-run variance may be too high. Suggest re-running or using a different model.

Present findings to the user and ask which direction they want to go. Do NOT make any code changes until asked.

Reference: Guideline sections

The 10 top-level sections in runner/models/guidelines.ts are:

function_guidelines — Function syntax, HTTP endpoints, validators, registration, calling conventions, function references, API design, pagination
validator_guidelines — v.bigint deprecation, v.record usage
schema_guidelines — Schema location, system fields, index naming, index field ordering
typescript_guidelines — Id types, Record types, strict typing, as const, Array/Record patterns, @types/node
full_text_search_guidelines — Search index query syntax
query_guidelines — No filter, no .delete(), .unique(), async iteration, ordering
mutation_guidelines — ctx.db.replace vs ctx.db.patch
action_guidelines — "use node", no ctx.db, action syntax
scheduling_guidelines — Cron syntax, FunctionReference usage, crons.ts patterns
file_storage_guidelines — Storage API, getUrl, system table queries, Blob handling

Analyze guideline ablation experiment results to determine which guideline sections are essential, marginal, or dispensable. Use when the user asks to analyze ablation results, interpret guideline compaction data, or wants to know which guidelines to keep for AGENTS.md.

109 stars

documentation

Updated Apr 17, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add get-convex/convex-evals analyze-ablation

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 17, 2026, 11:21 AM (4.3s, 1 file scanned)

Related Skills

get-convex/validate-guidelines (testing): Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.

get-convex/analyze-run (testing): Analyze all failures in a convex-evals run, spawning parallel sub-agents to investigate each failure and producing a report with classifications and recommendations. Use when the user asks to analyze an entire run, review all failures in a run, or wants to understand why a model scored poorly.

get-convex/analyze-eval (testing): Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

get-convex/add-model (data-ai): Add a new AI model to the eval runner, update the manual eval workflow, push changes, and trigger baseline eval runs. Use when the user wants to add a new model, onboard a model, or mentions a new model name/link to add to the leaderboard.

Install manually

# Clone the repo
git clone https://github.com/get-convex/convex-evals.git

# Copy into the Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r convex-evals/.cursor/skills/analyze-ablation ~/.claude/skills/

Repository: get-convex/convex-evals (109 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT