Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

raddue/skill-selection-evals

Name: skill-selection-evals
Author: raddue

skills/skill-selection-evals/SKILL.md

npx skillsauth add raddue/crucible skill-selection-evals

Skill-Selection Evals

This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.

Purpose

Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.

Eval Types

Direct selection: Given a prompt, does the agent pick the correct skill?
Negative selection: Given a prompt that sounds like skill X but is not, does the agent avoid the false positive?
Context-dependent: When the same verb appears in different contexts, does the agent pick the skill that fits each context?
Cascade ordering: For a multi-skill task, does the agent invoke the skills in the correct order?
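
The four eval types above could be represented and scored roughly as follows. This is a minimal sketch, not Crucible's actual eval schema: the SelectionEval record, its field names, the passes function, and the placeholder skill names (skill-x, skill-y) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SelectionEval:
    """One routing test case. All field names are hypothetical."""
    prompt: str
    kind: str  # "direct", "negative", "context", or "cascade"
    expected: list[str] = field(default_factory=list)   # skills to invoke, in order
    forbidden: list[str] = field(default_factory=list)  # look-alike skills to avoid


def passes(case: SelectionEval, invoked: list[str]) -> bool:
    """Return True if the skills the agent invoked satisfy the case."""
    if any(skill in case.forbidden for skill in invoked):
        return False  # false positive: a forbidden look-alike was picked
    if case.kind == "negative" and not case.expected:
        return True  # nothing was required, and nothing forbidden was invoked
    # direct, context-dependent, and cascade cases all require an exact match;
    # for cascade cases the list comparison also enforces invocation order
    return invoked == case.expected


# Placeholder cases with invented prompts, for illustration only.
direct = SelectionEval("prompt that clearly needs skill-x", "direct", expected=["skill-x"])
negative = SelectionEval("prompt that only sounds like skill-x", "negative", forbidden=["skill-x"])
cascade = SelectionEval("task needing two skills", "cascade", expected=["skill-x", "skill-y"])

print(passes(direct, ["skill-x"]))              # True
print(passes(negative, ["skill-x"]))            # False
print(passes(cascade, ["skill-y", "skill-x"]))  # False: wrong order
```

A context-dependent case needs no special handling in this sketch: it is simply two or more records whose prompts share a verb but whose expected lists differ.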

Boundaries Tested

test-methodology — TDD vs test-coverage vs adversarial-tester
review-direction — temper vs review-feedback
adversarial-scope — red-team vs inquisitor vs audit vs siege
completion-claims — verify vs finish
bug-handling — debugging vs verify vs audit
build-vs-raw-dispatch — build (full idea→PR pipeline) vs a single-skill dispatch (planning, test-driven-development, …)
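
One way to use these boundaries is to tally, per boundary, which skill was picked when a different one was expected. The sketch below assumes each result is a hypothetical (boundary, expected, picked) tuple. The sample rows are invented to show the shape of the output; they are not measured results, and they do not claim which skill is correct for any real prompt.

```python
from collections import Counter


def confusion_by_boundary(results):
    """Count (expected, picked) mismatches within each boundary."""
    confusions = {}
    for boundary, expected, picked in results:
        if picked != expected:
            confusions.setdefault(boundary, Counter())[(expected, picked)] += 1
    return confusions


# Invented sample rows, for illustration only.
sample = [
    ("completion-claims", "verify", "finish"),
    ("completion-claims", "verify", "verify"),
    ("completion-claims", "verify", "finish"),
    ("review-direction", "temper", "review-feedback"),
]

for boundary, pairs in confusion_by_boundary(sample).items():
    for (expected, picked), count in pairs.most_common():
        print(f"{boundary}: expected {expected}, picked {picked} (x{count})")
```

Grouping by boundary rather than by individual skill keeps the report focused on the specific pairs of skills that are easy to confuse.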

Difficulty Ratings

Each eval is rated easy, medium, or hard according to how ambiguous the routing decision is. This enables stratified baseline measurement: it separates improvements that lift hard cases (high value) from results that merely confirm easy cases already work (low signal).
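
Stratified measurement can be sketched as computing accuracy per difficulty tier and comparing a baseline run against a candidate run. The function name, the (difficulty, passed) input shape, and the outcomes below are assumptions made for illustration; the numbers are invented, not Crucible results.

```python
from collections import defaultdict


def accuracy_by_difficulty(results):
    """Map each difficulty tier to the fraction of its evals that passed."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for difficulty, passed in results:
        totals[difficulty] += 1
        correct[difficulty] += int(passed)
    return {tier: correct[tier] / totals[tier] for tier in totals}


# Invented outcomes for two runs over the same six evals.
baseline = [("easy", True), ("easy", True),
            ("hard", False), ("hard", True), ("hard", False), ("hard", False)]
candidate = [("easy", True), ("easy", True),
             ("hard", True), ("hard", True), ("hard", False), ("hard", True)]

before = accuracy_by_difficulty(baseline)
after = accuracy_by_difficulty(candidate)
for tier in ("easy", "medium", "hard"):
    if tier in before and tier in after:
        print(f"{tier}: {before[tier]:.2f} -> {after[tier]:.2f}")
# easy: 1.00 -> 1.00
# hard: 0.25 -> 0.75
```

In this invented example the whole gain comes from the hard tier, which a single pooled accuracy figure would not show.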

Eval-only skill for measuring skill routing accuracy. Not invoked directly — contains selection evals that test whether the agent picks the correct skill for a given prompt.

10 stars

testing

Updated Jun 12, 2026

$ install --global

npx skillsauth add raddue/crucible skill-selection-evals

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%.
Semgrep: Static code analysis for vulnerabilities. Clean, 95%.
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%.
Snyk (dep): Open source security scanning. Skipped, 50%.
Socket.dev: Supply chain security analysis. Skipped, 50%.
VirusTotal: Multi-engine malware detection. Skipped, 50%.
CrowdStrike: Advanced threat intelligence. Skipped, 50%.
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%.
OWASP Dep-Check: Skipped, 50%.

Last scanned: Jun 12, 2026, 6:06 AM · 363.7s · 4 files scanned

SKILL.md

name: skill-selection-evals
description: Eval-only skill for measuring skill routing accuracy. Not invoked directly — contains selection evals that test whether the agent picks the correct skill for a given prompt.

Related Skills

raddue/delve

testing

Standalone instance-bug reviewer — runs a parallel finder fan-out + verify gate over a diff or a path and prints ranked, verified findings. Use when the user says "delve", "find bugs in this diff", "review this for bugs", "scan this file/subsystem for defects", "instance-bug sweep", or wants concrete reproducible defects (not a merge verdict, not systemic health). Works on a PR id, a base..head range, or a path, on any forge (GitHub, GitLab, Bitbucket, self-hosted).

10 stars · SKILL.md · Updated Jun 4, 2026

raddue/ledger

testing

Render the Crucible calibration ledger weekly report — the honest "Crucible caught N silent bugs" headline, verdict breakdown, per-skill severity rates, and the inflation detector. Triggers on "/ledger", "weekly report", "weekly ledger", "caught N", "quality ledger", "calibration report", "render the ledger".

10 stars · SKILL.md · Updated Jun 2, 2026

raddue/grudge

development

The Book of Grudges — cross-session bug graveyard. Every fixed bug is recorded as a structured "grudge"; before touching code, skills query the grudgebook for the files in scope and surface past regressions as forced "DO NOT REPEAT" context. Read mode (pre-flight) and write mode (on bug resolution / fix(*) PR). Machine-local, per-repo, never committed. Triggers on /grudge, "check grudges", "record a grudge", "any past bugs here", "regression oracle", "bug graveyard".

10 stars · SKILL.md · Updated Jun 2, 2026

raddue/calibration-reconcile

testing

Reconcile the Crucible calibration ledger — walk merged fix/hotfix branches to falsify the originating gating-verdicts, compute per-skill Brier calibration scores, and append a falsification log. Triggers on "/calibration-reconcile", "reconcile ledger", "reconcile calibration", "falsify verdicts", "brier score", "calibration reconcile", "compute brier".

10 stars · SKILL.md · Updated Jun 2, 2026

Install manually

# Clone the repo
git clone https://github.com/raddue/crucible.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r crucible/skills/skill-selection-evals ~/.claude/skills/

Repository

raddue/crucible

10 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT
