.cursor/skills/validate-guidelines/SKILL.md
Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.
npx skillsauth add get-convex/convex-evals validate-guidelines
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill when the user has changed runner/models/guidelines.ts and wants to ensure the changes don't regress other models.

Guideline changes are validated by running evals twice per model: once with the current (before) guidelines and once with the proposed (after) guidelines. The two result sets are compared, and any eval that passed before and fails after is a regression. The goal is to ensure changes improve, or at least do not regress, scores across multiple models.
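The regression rule above can be sketched in a few lines. This is only an illustration, not the script's actual implementation: the `EvalResults` shape (eval name mapped to pass/fail) and the `compareRuns` helper are hypothetical names introduced here.

```typescript
// Hypothetical result shape: eval name -> whether it passed in that run.
type EvalResults = Record<string, boolean>;

interface Comparison {
  regressions: string[]; // passed before, fail after
  improvements: string[]; // failed before, pass after
}

// Compare one model's "before" and "after" runs.
function compareRuns(before: EvalResults, after: EvalResults): Comparison {
  const regressions: string[] = [];
  const improvements: string[] = [];
  for (const evalName of Object.keys(before)) {
    // Only compare evals that ran in both passes.
    if (!(evalName in after)) continue;
    if (before[evalName] && !after[evalName]) regressions.push(evalName);
    if (!before[evalName] && after[evalName]) improvements.push(evalName);
  }
  return { regressions, improvements };
}
```

Under this sketch, a change is acceptable for a model only when `regressions` is empty; `improvements` is informational.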
Determine which guideline sections were modified in runner/models/guidelines.ts (e.g. function_guidelines, query_guidelines, file_storage_guidelines) and the intent (new rule, clarification, token compaction).
- Before: run bun run buildRelease.ts and use dist/AGENTS.md, or render compact guidelines to a temp file. If the repo is in a clean state, dist/AGENTS.md after the build is the "before" snapshot.
- After: one way to produce it is to apply the proposed change to runner/models/guidelines.ts, run bun run buildRelease.ts, copy dist/AGENTS.md to a temp path (e.g. guideline-validation/after.md), then revert the file.

Ensure both paths are absolute or relative to the repo root and that the script can read them.
Use the mapping below to choose a --filter regex or omit it for the full suite.
| Guideline section | Suggested TEST_FILTER (regex) |
|-------------------|--------------------------------|
| function_guidelines (http, validators, registration, calling, pagination) | 000-fundamentals\|006-clients or full |
| validator_guidelines | 000-fundamentals/009 |
| schema_guidelines | 001-data_modeling |
| typescript_guidelines | Omit (run all) |
| full_text_search_guidelines | 002-queries/009\|002-queries/020 |
| query_guidelines | 002-queries |
| mutation_guidelines | 003-mutations |
| action_guidelines | 004-actions |
| scheduling_guidelines | 000-fundamentals/003\|000-fundamentals/004 |
| file_storage_guidelines | 000-fundamentals/007\|004-actions/004\|004-actions/005 |
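When a change touches several sections, the suggested filters can be combined with regex alternation. The sketch below copies the table into a lookup and joins the entries. `SECTION_FILTERS` and `filterForSections` are hypothetical helpers, and falling back to the full suite for unknown sections is an assumption rather than documented behavior.

```typescript
// Section -> suggested filter, copied from the table above.
// null means "omit the filter and run the full suite".
const SECTION_FILTERS: Record<string, string | null> = {
  // The table also allows the full suite here; the narrower filter is used.
  function_guidelines: "000-fundamentals|006-clients",
  validator_guidelines: "000-fundamentals/009",
  schema_guidelines: "001-data_modeling",
  typescript_guidelines: null,
  full_text_search_guidelines: "002-queries/009|002-queries/020",
  query_guidelines: "002-queries",
  mutation_guidelines: "003-mutations",
  action_guidelines: "004-actions",
  scheduling_guidelines: "000-fundamentals/003|000-fundamentals/004",
  file_storage_guidelines: "000-fundamentals/007|004-actions/004|004-actions/005",
};

// Returns a combined --filter regex, or undefined to run everything.
function filterForSections(sections: string[]): string | undefined {
  const parts: string[] = [];
  for (const section of sections) {
    const filter = SECTION_FILTERS[section] ?? null;
    // Unknown section or "run all": fall back to the full suite (assumption).
    if (filter === null) return undefined;
    parts.push(filter);
  }
  return parts.length > 0 ? parts.join("|") : undefined;
}
```

For example, `filterForSections(["query_guidelines", "mutation_guidelines"])` yields `002-queries|003-mutations`, while any list containing `typescript_guidelines` yields `undefined`, meaning the flag should be left off.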
Omit --filter to run all evals.

Default set (preferred for validation): claude-sonnet-4-5, claude-opus-4-6, gemini-3-pro-preview, gpt-5.2-codex.
Check which API keys are set in .env (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY). The script skips models whose provider key is missing and prints a warning. Use a subset if some keys are unavailable; at least two models are recommended.
Do not set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN so results stay local.
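A pre-flight check along these lines can show which models will actually run before a long validation is started. It is a rough sketch: mapping a model to its provider key by name prefix is an assumption made for illustration (the real mapping lives in runner/models/index.ts), and `preflight` is a hypothetical helper, not part of the script.

```typescript
type Env = Record<string, string | undefined>;

// ASSUMPTION: the provider key is inferred from the model-name prefix.
function providerKeyFor(model: string): string | undefined {
  if (model.startsWith("claude-")) return "ANTHROPIC_API_KEY";
  if (model.startsWith("gpt-")) return "OPENAI_API_KEY";
  if (model.startsWith("gemini-")) return "GOOGLE_API_KEY";
  return undefined;
}

// Split the requested models into runnable ones and warnings.
function preflight(models: string[], env: Env) {
  const runnable: string[] = [];
  const warnings: string[] = [];
  for (const model of models) {
    const key = providerKeyFor(model);
    if (key !== undefined && env[key]) {
      runnable.push(model);
    } else {
      warnings.push(`skipping ${model}: ${key ?? "provider key"} is not set`);
    }
  }
  // Results should stay local, so these two must be unset.
  for (const name of ["CONVEX_EVAL_URL", "CONVEX_AUTH_TOKEN"]) {
    if (env[name]) warnings.push(`${name} is set; unset it to keep results local`);
  }
  if (runnable.length < 2) warnings.push("fewer than two models are runnable");
  return { runnable, warnings };
}
```

Passing the environment in as a plain object (for instance the parsed contents of .env) keeps the check easy to test without touching real keys.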
bun run validate:guidelines --before <path-to-before.md> --after <path-to-after.md> --models claude-sonnet-4-5,claude-opus-4-6,gemini-3-pro-preview,gpt-5.2-codex
With an eval filter:
bun run validate:guidelines --before <before.md> --after <after.md> --models claude-sonnet-4-5,gpt-5.2-codex --filter "002-queries"
Optional: --output <path> to write the JSON summary to a specific file. By default it is written to guideline-validation/results/<timestamp>.json.
The script runs each model sequentially: first all evals with "before" guidelines, then all evals with "after" guidelines. Pass/fail is collected and deltas are computed.
IMPORTANT: You must orchestrate the entire run end-to-end. Start the command in the background (block_until_ms: 0), then poll the terminal output file periodically until the run finishes (look for the exit_code footer or the GUIDELINE VALIDATION SUMMARY banner). Use exponential backoff for polling (e.g. 30s, 60s, 120s). Do NOT return to the user until the run is fully complete and you have read and analyzed the results. The user expects a complete report, not a "check back later" handoff.
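The polling loop described above might look like the following sketch. The 30s starting delay, doubling, and 120s cap mirror the example schedule, while matching the literal strings `GUIDELINE VALIDATION SUMMARY` and `exit_code` in the output is an assumption about how the banner and footer appear. `readOutput` and `sleep` are injected so the loop stays independent of any particular terminal tooling.

```typescript
// Delays double from firstMs and are capped at capMs: 30s, 60s, 120s, 120s, ...
function backoffDelays(count: number, firstMs = 30000, capMs = 120000): number[] {
  const delays: number[] = [];
  let delay = firstMs;
  for (let i = 0; i < count; i++) {
    delays.push(Math.min(delay, capMs));
    delay *= 2;
  }
  return delays;
}

// ASSUMPTION: completion is detected by substring match on the output.
function isRunFinished(output: string): boolean {
  return output.includes("GUIDELINE VALIDATION SUMMARY") || output.includes("exit_code");
}

// Read the output, stop if the run finished, otherwise wait and retry.
async function pollUntilDone(
  readOutput: () => Promise<string>,
  sleep: (ms: number) => Promise<void>,
  maxPolls = 50,
): Promise<string> {
  for (const delay of backoffDelays(maxPolls)) {
    const output = await readOutput();
    if (isRunFinished(output)) return output;
    await sleep(delay);
  }
  throw new Error("run did not finish within the polling budget");
}
```

Only after `pollUntilDone` resolves should the results be read and reported, which matches the requirement not to hand back a partial run.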
The script prints a summary table and a verdict. Read the script output and present both in full to the user.
bun run validate:guidelines --before <path> --after <path> --models <m1,m2,...> [--filter <regex>] [--output <path>]
- --before, --after: Paths to guideline markdown files (current vs proposed).
- --models: Comma-separated model names from runner/models/index.ts (e.g. gpt-5.2-codex, claude-sonnet-4-5).
- --filter: Optional regex on eval category/name (e.g. 005-idioms or 002-queries/015).
- --output: Optional path for the JSON summary file.

API keys are loaded from .env via dotenv (see AGENTS.md). The script does not report to Convex.
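For callers that assemble the command programmatically, the flags can be put together as below. This is a minimal sketch using only the flags listed here; `ValidateOptions` and `buildValidateArgs` are hypothetical names, and the result is the argument list to pass to `bun`.

```typescript
interface ValidateOptions {
  before: string; // path to the current guidelines markdown
  after: string; // path to the proposed guidelines markdown
  models: string[]; // model names from runner/models/index.ts
  filter?: string; // optional eval regex
  output?: string; // optional JSON summary path
}

// Build the argument list for `bun`, leaving optional flags out when unset.
function buildValidateArgs(opts: ValidateOptions): string[] {
  if (opts.models.length === 0) throw new Error("at least one model is required");
  const args = [
    "run", "validate:guidelines",
    "--before", opts.before,
    "--after", opts.after,
    "--models", opts.models.join(","),
  ];
  if (opts.filter) args.push("--filter", opts.filter);
  if (opts.output) args.push("--output", opts.output);
  return args;
}
```

Keeping the arguments as an array, rather than one joined string, avoids shell-quoting problems when the filter regex contains `|`.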