Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

get-convex/validate-guidelines

Name: validate-guidelines
Author: get-convex

.cursor/skills/validate-guidelines/SKILL.md

npx skillsauth add get-convex/convex-evals validate-guidelines

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Validate Guidelines

When to use

User proposes or has made changes to runner/models/guidelines.ts and wants to ensure they don't regress other models
User says "validate the guideline changes" or "run the guideline validation"
Before committing guideline edits, to confirm improvements or no-regression across sonnet, opus, gemini, chatgpt (or a subset)

Overview

Guideline changes are validated by running evals twice per model: once with the current (before) guidelines and once with the proposed (after) guidelines. Results are compared; any eval that passed before and fails after is a regression. The goal is to ensure changes improve or at least do not regress scores across multiple models.

Step 1: Identify the change

Determine which guideline sections were modified in runner/models/guidelines.ts (e.g. function_guidelines, query_guidelines, file_storage_guidelines) and the intent (new rule, clarification, token compaction).

Step 2: Build before and after guideline files

Before: Current committed guidelines. Generate by running bun run buildRelease.ts and use dist/AGENTS.md, or render compact guidelines to a temp file. If the repo is in a clean state, dist/AGENTS.md after build is the "before" snapshot.
After: Guidelines with the proposed changes. Either:
- Temporarily apply the proposed edits to runner/models/guidelines.ts, run bun run buildRelease.ts, copy dist/AGENTS.md to a temp path (e.g. guideline-validation/after.md), then revert the file; or
- Write the proposed full guideline markdown to a temp file (e.g. by building from a branch or a copy of the file).

Ensure both paths are absolute or relative to the repo root and that the script can read them.

Step 3: Select target evals

Use the mapping below to choose a --filter regex or omit it for the full suite.

| Guideline section | Suggested TEST_FILTER (regex) | |-------------------|--------------------------------| | function_guidelines (http, validators, registration, calling, pagination) | 000-fundamentals\|006-clients or full | | validator_guidelines | 000-fundamentals/009 | | schema_guidelines | 001-data_modeling | | typescript_guidelines | Omit (run all) | | full_text_search_guidelines | 002-queries/009\|002-queries/020 | | query_guidelines | 002-queries | | mutation_guidelines | 003-mutations | | action_guidelines | 004-actions | | scheduling_guidelines | 000-fundamentals/003\|000-fundamentals/004 | | file_storage_guidelines | 000-fundamentals/007\|004-actions/004\|004-actions/005 |

Targeted change (e.g. one section): use a filter that matches the evals most likely affected.
Broad change (e.g. wording across many sections): omit --filter to run all evals.

Step 4: Select models

Default set (preferred for validation): claude-sonnet-4-5, claude-opus-4-6, gemini-3-pro-preview, gpt-5.2-codex.

Check which API keys are set in .env (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY). The script skips models whose provider key is missing and prints a warning. Use a subset if some keys are unavailable; at least two models are recommended.

Step 5: Run the validation script and monitor to completion

Do not set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN so results stay local.

bun run validate:guidelines --before <path-to-before.md> --after <path-to-after.md> --models claude-sonnet-4-5,claude-opus-4-6,gemini-3-pro-preview,gpt-5.2-codex

With an eval filter:

bun run validate:guidelines --before <before.md> --after <after.md> --models claude-sonnet-4-5,gpt-5.2-codex --filter "002-queries"

Optional: --output <path> to write the JSON summary to a specific file. By default it is written to guideline-validation/results/<timestamp>.json.

The script runs each model sequentially: first all evals with "before" guidelines, then all evals with "after" guidelines. Pass/fail is collected and deltas are computed.

IMPORTANT: You must orchestrate the entire run end-to-end. Start the command in the background (block_until_ms: 0), then poll the terminal output file periodically until the run finishes (look for the exit_code footer or the GUIDELINE VALIDATION SUMMARY banner). Use exponential backoff for polling (e.g. 30s, 60s, 120s). Do NOT return to the user until the run is fully complete and you have read and analyzed the results. The user expects a complete report, not a "check back later" handoff.

Step 6: Parse and report results

The script prints:

A comparison table: per-model before pass count, after pass count, delta, number of regressions, number of improvements.
Regressions: evals that passed before and failed after (by model).
Improvements: evals that failed before and passed after (by model).
A verdict line: either "REGRESSIONS DETECTED" or "Safe to commit."

Read the script output and present the full summary table and verdict to the user.

If there are regressions: list them and recommend reverting or narrowing the guideline change; optionally run analyze-eval on a regression to see why it failed.
If there are no regressions: recommend committing the guideline change; mention any improvements.

Step 7: Recommend next steps

No regressions, with or without improvements: Safe to commit the guideline changes.
Any regressions: Do not commit. Suggest reverting the change or narrowing it (e.g. only add the new rule to a subsection that doesn’t affect the regressed eval). Re-run validation after adjusting.
Unclear or noisy: If only one model regresses one eval, consider re-running that model to check for flakiness, or run the full suite once more.

Reference: Script usage

bun run validate:guidelines --before <path> --after <path> --models <m1,m2,...> [--filter <regex>] [--output <path>]

--before, --after: Paths to guideline markdown files (current vs proposed).
--models: Comma-separated model names from runner/models/index.ts (e.g. gpt-5.2-codex, claude-sonnet-4-5).
--filter: Optional regex on eval category/name (e.g. 005-idioms or 002-queries/015).
--output: Optional path for the JSON summary file.

API keys are loaded from .env via dotenv (see AGENTS.md). The script does not report to Convex.

get-convex/validate-guidelines

.cursor/skills/validate-guidelines/SKILL.md

Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.

109 stars

testing

Updated Apr 17, 2026

$ install --global

skillsauth

npx skillsauth add get-convex/convex-evals validate-guidelines

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 11:21 AM4.4s1 file scanned

SKILL.md

name:: validate-guidelines
description:: Empirically verify guideline changes by running before/after eval runs across multiple models and ensuring no regressions. Use when proposing or reviewing changes to runner/models/guidelines.ts, or when the user asks to validate guidelines.

Validate Guidelines

When to use

User proposes or has made changes to runner/models/guidelines.ts and wants to ensure they don't regress other models
User says "validate the guideline changes" or "run the guideline validation"
Before committing guideline edits, to confirm improvements or no-regression across sonnet, opus, gemini, chatgpt (or a subset)

Overview

Step 1: Identify the change

Step 2: Build before and after guideline files

Before: Current committed guidelines. Generate by running bun run buildRelease.ts and use dist/AGENTS.md, or render compact guidelines to a temp file. If the repo is in a clean state, dist/AGENTS.md after build is the "before" snapshot.
After: Guidelines with the proposed changes. Either:
- Temporarily apply the proposed edits to runner/models/guidelines.ts, run bun run buildRelease.ts, copy dist/AGENTS.md to a temp path (e.g. guideline-validation/after.md), then revert the file; or
- Write the proposed full guideline markdown to a temp file (e.g. by building from a branch or a copy of the file).

Ensure both paths are absolute or relative to the repo root and that the script can read them.

Step 3: Select target evals

Use the mapping below to choose a --filter regex or omit it for the full suite.

Targeted change (e.g. one section): use a filter that matches the evals most likely affected.
Broad change (e.g. wording across many sections): omit --filter to run all evals.

Step 4: Select models

Default set (preferred for validation): claude-sonnet-4-5, claude-opus-4-6, gemini-3-pro-preview, gpt-5.2-codex.

Step 5: Run the validation script and monitor to completion

Do not set CONVEX_EVAL_URL or CONVEX_AUTH_TOKEN so results stay local.

bun run validate:guidelines --before <path-to-before.md> --after <path-to-after.md> --models claude-sonnet-4-5,claude-opus-4-6,gemini-3-pro-preview,gpt-5.2-codex

With an eval filter:

bun run validate:guidelines --before <before.md> --after <after.md> --models claude-sonnet-4-5,gpt-5.2-codex --filter "002-queries"

Optional: --output <path> to write the JSON summary to a specific file. By default it is written to guideline-validation/results/<timestamp>.json.

The script runs each model sequentially: first all evals with "before" guidelines, then all evals with "after" guidelines. Pass/fail is collected and deltas are computed.

Step 6: Parse and report results

The script prints:

A comparison table: per-model before pass count, after pass count, delta, number of regressions, number of improvements.
Regressions: evals that passed before and failed after (by model).
Improvements: evals that failed before and passed after (by model).
A verdict line: either "REGRESSIONS DETECTED" or "Safe to commit."

Read the script output and present the full summary table and verdict to the user.

If there are regressions: list them and recommend reverting or narrowing the guideline change; optionally run analyze-eval on a regression to see why it failed.
If there are no regressions: recommend committing the guideline change; mention any improvements.

Step 7: Recommend next steps

No regressions, with or without improvements: Safe to commit the guideline changes.
Any regressions: Do not commit. Suggest reverting the change or narrowing it (e.g. only add the new rule to a subsection that doesn’t affect the regressed eval). Re-run validation after adjusting.
Unclear or noisy: If only one model regresses one eval, consider re-running that model to check for flakiness, or run the full suite once more.

Reference: Script usage

bun run validate:guidelines --before <path> --after <path> --models <m1,m2,...> [--filter <regex>] [--output <path>]

--before, --after: Paths to guideline markdown files (current vs proposed).
--models: Comma-separated model names from runner/models/index.ts (e.g. gpt-5.2-codex, claude-sonnet-4-5).
--filter: Optional regex on eval category/name (e.g. 005-idioms or 002-queries/015).
--output: Optional path for the JSON summary file.

API keys are loaded from .env via dotenv (see AGENTS.md). The script does not report to Convex.

Related Skills

get-convex/analyze-run

testing

VerifiedTrustedCommunity

Analyze all failures in a convex-evals run, spawning parallel sub-agents to investigate each failure and producing a report with classifications and recommendations. Use when the user asks to analyze an entire run, review all failures in a run, or wants to understand why a model scored poorly.

109SKILL.mdUpdated Apr 17, 2026

get-convex/analyze-run

get-convex/analyze-eval

testing

VerifiedTrustedCommunity

Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

109SKILL.mdUpdated Apr 17, 2026

get-convex/analyze-eval

get-convex/analyze-ablation

documentation

VerifiedTrustedCommunity

Analyze guideline ablation experiment results to determine which guideline sections are essential, marginal, or dispensable. Use when the user asks to analyze ablation results, interpret guideline compaction data, or wants to know which guidelines to keep for AGENTS.md.

109SKILL.mdUpdated Apr 17, 2026

get-convex/analyze-ablation

get-convex/add-model

data-ai

VerifiedTrustedCommunity

Add a new AI model to the eval runner, update the manual eval workflow, push changes, and trigger baseline eval runs. Use when the user wants to add a new model, onboard a model, or mentions a new model name/link to add to the leaderboard.

109SKILL.mdUpdated Apr 17, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/get-convex/convex-evals.git

# Copy into Claude Code skills folder (global)
cp -r convex-evals/.cursor/skills/validate-guidelines ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

get-convex/convex-evals

109 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT