.cursor/skills/analyze-ablation/SKILL.md
Analyze guideline ablation experiment results to determine which guideline sections are essential, marginal, or dispensable. Use when the user asks to analyze ablation results, interpret guideline compaction data, or wants to know which guidelines to keep for AGENTS.md.
List the ablation/results/ directory to see which models have results:
ls ablation/results/
If the directory is empty or doesn't exist, the user needs to either:
bun run scripts/runAblation.ts --model <model>
# List recent ablation workflow runs
gh run list --workflow=ablation_experiment.yml --limit=5
# Download the artifact
gh run download <run-id> -n ablation-<model>-<run-id> -D ablation/results/
For the requested model, read the latest JSON file from ablation/results/<model>/.
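One way to pick the latest run is to compare the timestamp field stored in each results file. The sketch below assumes the files have already been parsed into objects and that timestamp is something `Date` can parse (an ISO string or epoch milliseconds); `ResultHeader` and `pickLatest` are hypothetical names, not part of the repository.

```typescript
// Hypothetical minimal view of a results file: only the fields needed here.
interface ResultHeader {
  model: string;
  timestamp: string | number; // assumed to be Date-parseable
}

// Return the entry with the most recent timestamp, or undefined if none parse.
function pickLatest<T extends ResultHeader>(results: T[]): T | undefined {
  let latest: T | undefined;
  let latestTime = -Infinity;
  for (const result of results) {
    const time = new Date(result.timestamp).getTime();
    if (Number.isNaN(time)) continue; // skip entries with unparseable timestamps
    if (time > latestTime) {
      latest = result;
      latestTime = time;
    }
  }
  return latest;
}
```

If the file names already embed a sortable timestamp, sorting the directory listing would work just as well; that naming scheme is not confirmed here, so the sketch relies on the field inside the file instead.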
The file contains:
- model: The model name
- timestamp: When the experiment ran
- baseline: Overall pass/fail counts and per-eval results with the full guideline set
- sections: Array of per-section ablation results, each with:
  - name: Section name (e.g. "function_guidelines", "query_guidelines")
  - tokensInSection: How many tokens this section costs
  - verdict: "ESSENTIAL" (2+ regressions), "MARGINAL" (1 regression), "DISPENSABLE" (0 regressions)
  - regressions: Eval names that flipped from pass to fail when this section was removed
  - improvements: Eval names that flipped from fail to pass when removed (guidelines confusing the model)
  - score: Pass/fail counts for this ablation variant

Present a summary table showing:
| Section | Verdict | Regressions | Improvements | Tokens | Score |
|---------|---------|-------------|--------------|--------|-------|
Sort by verdict: ESSENTIAL first, then MARGINAL, then DISPENSABLE.
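The verdict thresholds and the sort order can be sketched as follows. This is an illustration, not the repository's code: the `passed`/`failed` field names inside score are an assumption (the results file only promises "pass/fail counts"), and `verdictFor` and `summaryTable` are hypothetical helpers.

```typescript
type Verdict = "ESSENTIAL" | "MARGINAL" | "DISPENSABLE";

interface SectionResult {
  name: string;
  tokensInSection: number;
  verdict: Verdict;
  regressions: string[];
  improvements: string[];
  score: { passed: number; failed: number }; // assumed field names
}

// Thresholds as described above: 2+ regressions, exactly 1, or 0.
function verdictFor(regressionCount: number): Verdict {
  if (regressionCount >= 2) return "ESSENTIAL";
  if (regressionCount === 1) return "MARGINAL";
  return "DISPENSABLE";
}

const VERDICT_ORDER: Record<Verdict, number> = {
  ESSENTIAL: 0,
  MARGINAL: 1,
  DISPENSABLE: 2,
};

// Render the summary table as markdown, sorted ESSENTIAL > MARGINAL > DISPENSABLE.
function summaryTable(sections: SectionResult[]): string {
  const sorted = [...sections].sort(
    (a, b) => VERDICT_ORDER[a.verdict] - VERDICT_ORDER[b.verdict],
  );
  const header = [
    "| Section | Verdict | Regressions | Improvements | Tokens | Score |",
    "|---------|---------|-------------|--------------|--------|-------|",
  ];
  const rows = sorted.map((s) => {
    const total = s.score.passed + s.score.failed;
    return `| ${s.name} | ${s.verdict} | ${s.regressions.length} | ${s.improvements.length} | ${s.tokensInSection} | ${s.score.passed}/${total} |`;
  });
  return [...header, ...rows].join("\n");
}
```

The table shows counts in the Regressions and Improvements columns; listing the eval names instead is equally reasonable when the lists are short.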
For each ESSENTIAL and MARGINAL section:
For any section with improvements:
If ablation/results/ has results for multiple models:
Calculate:
Based on the results, recommend one of:
Ready to build AGENTS.md: If the classification is clear and token savings are meaningful, suggest building the compact guideline set and running a validation run.
Subsection ablation needed: If function_guidelines is ESSENTIAL (likely — it's the largest section at ~2400 tokens), suggest a follow-up ablation of its 8 subsections to find further savings.
Cross-model validation needed: If only one model has been tested, suggest running ablation on 1-2 additional models for confidence.
Results are noisy: If many sections show exactly 1 regression (MARGINAL), the run-to-run variance may be too high. Suggest re-running or using a different model.
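Two numbers help when choosing among these options: how many tokens the DISPENSABLE sections would free up, and what share of sections landed on MARGINAL. The sketch below computes both. `SectionSummary`, `tokenSavings`, and `marginalShare` are hypothetical names, and the skill sets no cutoff for "meaningful" savings or "many" MARGINAL sections, so those judgments are left to the reader.

```typescript
interface SectionSummary {
  name: string;
  tokensInSection: number;
  verdict: "ESSENTIAL" | "MARGINAL" | "DISPENSABLE";
}

// Tokens that would be saved by dropping every DISPENSABLE section.
function tokenSavings(sections: SectionSummary[]): {
  total: number;
  dispensable: number;
  savedFraction: number;
} {
  const total = sections.reduce((sum, s) => sum + s.tokensInSection, 0);
  const dispensable = sections
    .filter((s) => s.verdict === "DISPENSABLE")
    .reduce((sum, s) => sum + s.tokensInSection, 0);
  // Guard against an empty or zero-token section list.
  const savedFraction = total === 0 ? 0 : dispensable / total;
  return { total, dispensable, savedFraction };
}

// Share of sections with exactly one regression; a high value hints at noise.
function marginalShare(sections: SectionSummary[]): number {
  if (sections.length === 0) return 0;
  const marginal = sections.filter((s) => s.verdict === "MARGINAL").length;
  return marginal / sections.length;
}
```

A large `marginalShare` alongside a small `savedFraction` points toward re-running before drawing conclusions, since single-eval flips may be run-to-run variance.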
Present findings to the user and ask which direction they want to go. Do NOT make any code changes until asked.
The 10 top-level sections in runner/models/guidelines.ts are:
- function_guidelines: Function syntax, HTTP endpoints, validators, registration, calling conventions, function references, API design, pagination
- validator_guidelines: v.bigint deprecation, v.record usage
- schema_guidelines: Schema location, system fields, index naming, index field ordering
- typescript_guidelines: Id types, Record types, strict typing, as const, Array/Record patterns, @types/node
- full_text_search_guidelines: Search index query syntax
- query_guidelines: No filter, no .delete(), .unique(), async iteration, ordering
- mutation_guidelines: ctx.db.replace vs ctx.db.patch
- action_guidelines: "use node", no ctx.db, action syntax
- scheduling_guidelines: Cron syntax, FunctionReference usage, crons.ts patterns
- file_storage_guidelines: Storage API, getUrl, system table queries, Blob handling