bcbeidel/pick-model

Name: pick-model
Author: bcbeidel

plugins/consider/skills/pick-model/SKILL.md

npx skillsauth add bcbeidel/wos pick-model

<objective>
Recommend the right AI model for a task by routing through three decisions: what type of work, how much a mistake costs, and what the budget allows. Recommendations are grounded in external benchmarks, not vendor marketing.
</objective>

<process>
1. **Classify the task type.** Ask what the user is trying to accomplish. Decompose broad categories into specific sub-types: "coding" could mean function completion (saturated; any model works), bug fixing (Claude/Gemini lead), or agentic multi-file refactoring (scaffold matters more than model). Read `references/model-landscape.md` for the task-type mapping table.
2. **Assess stakes.** Ask: "What happens if the model gets this wrong?" Map to one of three levels:
   - Exploratory: brainstorming, throwaway drafts, learning. Mistakes are free. Route to budget tier, low effort.
   - Professional: real work that matters but is recoverable. Wrong answers cost time, not catastrophe. Route to value tier, medium effort.
   - Critical: production-facing, high-stakes, expensive to fix. Wrong answers have business impact. Route to frontier tier, high effort.
3. **Check cost sensitivity.** Ask if there's a budget constraint:
   - Unconstrained: recommend the leader for the task type.
   - Cost-aware: recommend the value pick (best quality-per-dollar).
   - Budget-limited: recommend the budget pick (cheapest viable).
4. **Apply the crossover check.** Before recommending a cheaper model, check: does it score >70% of the frontier on the relevant benchmark? If yes, recommend it with high effort. If <50%, skip to frontier. Between 50-70%, note the tradeoff explicitly.
5. **Set the effort parameter.** Based on stakes and provider, recommend the specific effort setting. Read the effort controls table in `references/model-landscape.md`.
6. **Present the recommendation.** Use the output format below. Include the primary pick, an alternative at a different price point, the effort setting, and a one-line rationale citing the benchmark evidence.
</process>
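The stakes, cost-sensitivity, and crossover steps above can be sketched as lookup tables plus one threshold function. This is only an illustration: the function and dictionary names are hypothetical, and because the process does not say how the stakes tier and the cost pick are merged into one model choice, the sketch returns both without combining them.

```python
# Stakes level -> (price tier, effort), as listed in the "Assess stakes" step.
STAKES_ROUTING = {
    "exploratory": ("budget", "low"),
    "professional": ("value", "medium"),
    "critical": ("frontier", "high"),
}

# Cost sensitivity -> which pick to recommend, as listed in "Check cost sensitivity".
COST_ROUTING = {
    "unconstrained": "leader",
    "cost-aware": "value",
    "budget-limited": "budget",
}


def route(stakes: str, cost_sensitivity: str) -> dict:
    """Look up the tier, effort, and pick for the two routing answers."""
    tier, effort = STAKES_ROUTING[stakes]
    return {"tier": tier, "effort": effort, "pick": COST_ROUTING[cost_sensitivity]}


def crossover_check(candidate_score: float, frontier_score: float) -> str:
    """Compare a cheaper model's benchmark score to the frontier score."""
    ratio = candidate_score / frontier_score
    if ratio > 0.70:
        return "recommend cheaper model with high effort"
    if ratio < 0.50:
        return "skip to frontier"
    return "note the tradeoff explicitly"


# Hypothetical scores, not taken from any benchmark table.
print(route("professional", "cost-aware"))
print(crossover_check(60.0, 80.0))  # ratio 0.75
print(crossover_check(30.0, 80.0))  # ratio 0.375
```

A ratio of exactly 70% falls into the "note the tradeoff" branch here, since the process only names the strict `>70%` and `<50%` cases.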

<output_format>

Model Recommendation: [Task Description]

Primary Pick

[Model Name] ([Provider]) — [one-line rationale with benchmark cite]

Cost: $X.XX / $X.XX per M tokens (input/output)
Effort: [parameter_name] = [value]

Alternative

[Model Name] — [why this is the tradeoff option]

Cost: $X.XX / $X.XX per M tokens

Why This Over That

[2-3 sentences explaining the key differentiator — benchmark evidence, cost math, or capability gap that drives the recommendation]

Caveats

Data as of [month year]. Benchmark rankings shift quarterly.
[Any task-specific caveats — scaffold matters, score variance, etc.] </output_format>
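As a rough illustration of how the template above could be filled in programmatically, the sketch below renders the same sections from a small record type. The `Pick` class and `render` function are hypothetical names introduced here, not part of the skill.

```python
from dataclasses import dataclass

DASH = " \u2014 "  # the dash separator used in the template lines


@dataclass
class Pick:
    model: str
    rationale: str
    input_cost: float   # USD per million input tokens
    output_cost: float  # USD per million output tokens
    provider: str = ""


def render(task, primary, alternative, effort_param, effort_value,
           why, data_as_of, caveats=()):
    """Fill the recommendation template and return it as one string."""
    lines = [
        f"Model Recommendation: {task}", "",
        "Primary Pick", "",
        f"{primary.model} ({primary.provider}){DASH}{primary.rationale}", "",
        f"Cost: ${primary.input_cost:.2f} / ${primary.output_cost:.2f} per M tokens (input/output)",
        f"Effort: {effort_param} = {effort_value}", "",
        "Alternative", "",
        f"{alternative.model}{DASH}{alternative.rationale}", "",
        f"Cost: ${alternative.input_cost:.2f} / ${alternative.output_cost:.2f} per M tokens", "",
        "Why This Over That", "",
        why, "",
        "Caveats", "",
        f"Data as of {data_as_of}. Benchmark rankings shift quarterly.",
        *caveats,
    ]
    return "\n".join(lines)
```

Keeping the freshness line inside the renderer means it cannot be left out of a recommendation by accident.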

<example> User: "I need to pick a model for our code review pipeline. It reviews PRs automatically, leaves comments, and sometimes suggests fixes. We process about 200 PRs a day. Cost matters but quality matters more — bad reviews erode developer trust."

Model Recommendation: Automated PR Code Review

Primary Pick

Claude Sonnet 4.6 (Anthropic) — SWE-bench Verified 79.6%, Chatbot Arena Coding ELO 1523. Within 1.2 points of Opus on coding benchmarks at 60% of the cost.

Cost: $3.00 / $15.00 per M tokens (input/output)
Effort: effort = high

Alternative

DeepSeek V3.2 — 73% SWE-bench at $0.26/M input (17x cheaper). For 200 PRs/day, the cost difference is significant. Quality gap is real but may be acceptable for initial triage before human review.

Cost: $0.26 / $0.38 per M tokens

Why This Over That

At 200 PRs/day, the Sonnet-to-Opus upgrade buys 1.2 SWE-bench points for a 67% price increase — not justified when developer trust depends on consistency, not peak capability. Sonnet's coding ELO (1523) is third overall, behind only Opus variants. The DeepSeek alternative is viable for a two-tier setup: DeepSeek for initial scan, Sonnet for flagged PRs.

Caveats

Data as of April 2026. SWE-bench Verified is saturated at the top — the 1.2-point Sonnet-to-Opus gap is within noise.
For agentic PR review (multi-file reasoning, not just line comments), agent scaffold quality matters more than model choice — the ~22% scaffold effect dwarfs the ~1% model effect at the frontier. </example>
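The cost arithmetic in the example can be checked with a short sketch. The prices are copied from the example; the tokens-per-PR figures are made-up assumptions for illustration only, and the helper names are hypothetical.

```python
# Prices in USD per million tokens, copied from the example above.
SONNET = {"input": 3.00, "output": 15.00}
DEEPSEEK = {"input": 0.26, "output": 0.38}


def price_ratio(expensive, cheap, input_share):
    """Blended price ratio when `input_share` of all tokens are input tokens."""
    def blended(prices):
        return prices["input"] * input_share + prices["output"] * (1 - input_share)
    return blended(expensive) / blended(cheap)


def increase_from_fraction(fraction_of_cost):
    """If A costs `fraction_of_cost` of B, moving from A to B raises the price by this much."""
    return 1 / fraction_of_cost - 1


def daily_cost(prices, prs_per_day, input_tokens_per_pr, output_tokens_per_pr):
    """Daily spend in USD for a fixed per-PR token budget."""
    per_pr = (prices["input"] * input_tokens_per_pr
              + prices["output"] * output_tokens_per_pr) / 1_000_000
    return per_pr * prs_per_day


print(round(price_ratio(SONNET, DEEPSEEK, 1.0), 1))    # input tokens only
print(round(price_ratio(SONNET, DEEPSEEK, 6 / 7), 1))  # 6:1 input-to-output mix
print(round(increase_from_fraction(0.60), 3))          # "60% of the cost" as an increase

# Assumed workload: 20,000 input and 2,000 output tokens per PR (hypothetical).
print(daily_cost(SONNET, 200, 20_000, 2_000))
print(daily_cost(DEEPSEEK, 200, 20_000, 2_000))
```

The price multiple depends heavily on the input-to-output mix: on input tokens alone the ratio is about 11.5x, while a 6:1 mix gives roughly 17x. The example does not state how its "17x cheaper" figure was computed, so the 6:1 mix is only one reading that happens to match it.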

Key Instructions

Ground every recommendation in external benchmarks. Cite the benchmark name and score. Do not recommend based on vendor marketing, model descriptions, or general reputation. Read the benchmark tables in references/model-landscape.md.
Decompose broad task types before recommending. "Coding" spans function completion (saturated), bug fixing (differentiated), and agentic workflows (scaffold-dependent). Different sub-types get different models. Ask the user to be specific.
State the freshness date on every recommendation. Benchmark data is from April 2026. Rankings shift. Say so explicitly.
Do not recommend infrastructure, scaffolds, or architectures; produce model selection guidance only. For agentic tasks, note that scaffold investment may yield more than model upgrades, but don't design the scaffold.
Do not make provider-loyalty assumptions. Recommend across providers based on benchmark evidence. A user on Claude may get a Gemini recommendation if the benchmarks support it.

Anti-Pattern Guards

Recommending frontier for everything — the most expensive model is not always the best choice. On saturated benchmarks (function completion, simple math), budget models perform equivalently. Cost sensitivity matters — a 19x price difference for 1.2 benchmark points is rarely justified.
Treating benchmark scores as cardinal — the same model scores 2-17 points apart depending on evaluation protocol. Close-margin comparisons (e.g., "Opus beats Sonnet by 1.2 points") should be presented as ties, not winners. Use ordinal rankings.
Ignoring the scaffold effect — for agentic coding tasks, recommending a model upgrade when the real bottleneck is scaffold quality wastes the user's money. Flag this when the task is agentic.
Stale data presented as current — benchmark landscapes shift quarterly. If the user asks about a model or benchmark not in the reference data, say so rather than guessing.
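One way to apply the "ties, not winners" guard is to group models into ordinal ranks using a noise margin. The sketch below is an assumption-laden illustration: `rank_with_ties` is a hypothetical helper, the 2.0-point margin is simply the low end of the 2-17 point protocol variance quoted above, and the model names and scores are placeholders.

```python
def rank_with_ties(scores: dict[str, float], margin: float) -> list[list[str]]:
    """Group models into ordinal ranks; a model within `margin` of a group's top score ties with it."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups: list[list[str]] = []
    top_score = None
    for name, score in ordered:
        if top_score is not None and top_score - score <= margin:
            groups[-1].append(name)   # within noise of the group leader: a tie
        else:
            groups.append([name])     # clear gap: start the next rank
            top_score = score
    return groups


# Placeholder scores, not taken from any benchmark table.
ranks = rank_with_ties({"model-a": 80.8, "model-b": 79.6, "model-c": 73.0}, margin=2.0)
print(ranks)  # [['model-a', 'model-b'], ['model-c']]
```

Anchoring each group to its top score keeps a long chain of small gaps from merging every model into a single rank.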

Handoff

Chainable to: opportunity-cost (to analyze what switching models gives up), pareto (to find the 20% of workloads driving 80% of model costs)

Select the right AI model for a task based on external benchmarks, pricing, and effort tradeoffs. Use when the user asks "which model should I use?", "what's the best model for X?", "help me pick a model", "model comparison", "should I use Opus or Sonnet?", or needs to decide between AI models for a specific workload. Also use when someone mentions cost optimization for LLM usage or wants to know if a cheaper model would work.

1 star

testing

Updated May 8, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add bcbeidel/wos pick-model

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Reported % |
| --- | --- | --- | --- |
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: May 8, 2026, 3:29 AM. Scan time: 129.9s. 2 files scanned.

SKILL.md

name: pick-model
description: Select the right AI model for a task based on external benchmarks, pricing, and effort tradeoffs. Use when the user asks "which model should I use?", "what's the best model for X?", "help me pick a model", "model comparison", "should I use Opus or Sonnet?", or needs to decide between AI models for a specific workload. Also use when someone mentions cost optimization for LLM usage or wants to know if a cheaper model would work.
argument-hint: [task or workload to select a model for]
user-invocable: true
license: MIT

Install manually

# Clone the repo
git clone https://github.com/bcbeidel/wos.git

# Copy into Claude Code skills folder (global)
cp -r wos/plugins/consider/skills/pick-model ~/.claude/skills/

Repository: bcbeidel/wos (1 star)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT