skills/smart-model-routing/SKILL.md
Dynamic model selection based on task complexity scoring. Replaces static model mappings with a weighted signal system that picks Opus, Sonnet, or Haiku-class speed per task. Works with agent-assignment-matrix.md.
Install: `npx skillsauth add rubicanjr/FinCognis smart-model-routing`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Dynamic model selection based on task complexity. Never use a sledgehammer where a scalpel will do — and never use a scalpel where you need a sledgehammer.
Static model assignments ("always use Sonnet for code") waste money on trivial tasks and produce poor results on complex ones. This skill scores every task across three signal categories and routes it to the appropriate model tier automatically.
Default rule: inherit parent model. Only override when the score clearly warrants a different tier. Never specify a model without a scoring reason.
Three weighted categories are evaluated. Their raw scores combine into a single weighted complexity score (typically 0-20, though it can fall outside that range).
Analyze the raw text of the request.
Trigger words that indicate high-stakes design work:
Example: "Redesign the authentication system" → +3
Domains where mistakes are expensive or irreversible:
Example: "Update the payment processing logic" → +2
Signals the task is mechanical and bounded:
Example: "Fix the typo in the header" → -2
Count nested conditions, multiple constraints, OR clauses:
Analyze the task structure, not just the words.
How many discrete steps or deliverables?
Example: "Build an endpoint, write tests, add validation, update docs, add migration, hook up auth, notify the frontend team" → +3
How many files will be touched?
Use `tldr impact <function>` or `tldr calls .` to estimate before scoring.
Where in the system does this change land?
Examples:
Can this be easily undone?
What level of test coverage is needed?
The situation around the request, not just the request itself.
Has this task or related area already failed once?
Rationale: repeated failures indicate the problem is genuinely hard. Escalate.
Is this a top-level user request or the 3rd handoff in a chain?
Deep chains accumulate complexity. Each layer adds ambiguity risk.
Is this a well-trodden path or new territory?
After computing the weighted sum:
raw_score = (lexical_raw * 0.3) + (structural_raw * 0.4) + (context_raw * 0.3)
| Score | Tier | Model | Use Case |
|-------|------|-------|----------|
| >= 8 | HIGH | Opus | System design, security, complex debugging, migration planning |
| >= 4 | MEDIUM | Sonnet | Standard development, feature work, code review |
| < 4 | LOW | Haiku-class speed | Typo fixes, label changes, simple renames, formatting |
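The weighted sum and the tier thresholds can be sketched in a few lines. This is a minimal illustration, not the skill's own implementation: the names `RawSignals`, `weightedScore`, and `tierFor` are hypothetical, and only the weights (0.3 / 0.4 / 0.3) and the thresholds (8 and 4) come from the text.

```typescript
// Sketch only: type and function names are illustrative, not part of the skill.
type Tier = "LOW" | "MEDIUM" | "HIGH";

interface RawSignals {
  lexical: number;    // sum of lexical signal points
  structural: number; // sum of structural signal points
  context: number;    // sum of context signal points
}

// Weights taken from the raw_score formula: 0.3 / 0.4 / 0.3.
function weightedScore(s: RawSignals): number {
  return s.lexical * 0.3 + s.structural * 0.4 + s.context * 0.3;
}

// Thresholds taken from the tier table: >= 8 HIGH, >= 4 MEDIUM, otherwise LOW.
function tierFor(score: number): Tier {
  if (score >= 8) return "HIGH";
  if (score >= 4) return "MEDIUM";
  return "LOW";
}

const migrationScore = weightedScore({ lexical: 2, structural: 10, context: 0 });
console.log(migrationScore.toFixed(1), tierFor(migrationScore)); // 4.6 MEDIUM
```

This covers only the computed tier. Confidence handling and hard-rule overrides are separate steps that run after it.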
Score proximity to a threshold boundary determines confidence.
| Distance from nearest threshold | Confidence | Action |
|----------------------------------|-----------|--------|
| Score within 1 point of boundary | LOW | Escalate to higher tier |
| Score 2+ points from boundary | HIGH | Use computed tier |
When confidence is LOW, always round up to the higher tier. A false negative (using a weak model on a hard task) is far more costly than a false positive (using a strong model on an easy task).
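The round-up rule can be sketched as follows. The function names are hypothetical. The table does not say what happens for distances between 1 and 2 points, so this sketch assumes anything more than 1 point from both boundaries counts as HIGH confidence.

```typescript
// Sketch only: function and type names are illustrative.
type RouteTier = "LOW" | "MEDIUM" | "HIGH";
type RouteConfidence = "LOW" | "HIGH";

// Tier boundaries from the tier table: >= 8 HIGH, >= 4 MEDIUM, otherwise LOW.
function computedTier(score: number): RouteTier {
  if (score >= 8) return "HIGH";
  if (score >= 4) return "MEDIUM";
  return "LOW";
}

// ASSUMPTION: distances between 1 and 2 points are not covered by the table;
// here only scores within 1 point of a boundary (4 or 8) are LOW confidence.
function confidenceFor(score: number): RouteConfidence {
  const distance = Math.min(Math.abs(score - 4), Math.abs(score - 8));
  return distance <= 1 ? "LOW" : "HIGH";
}

// LOW confidence rounds up one tier; a HIGH tier cannot go any higher.
function routeTier(score: number): { tier: RouteTier; confidence: RouteConfidence } {
  const confidence = confidenceFor(score);
  const tier = computedTier(score);
  if (confidence === "HIGH") return { tier: tier, confidence: confidence };
  const escalated: RouteTier = tier === "LOW" ? "MEDIUM" : "HIGH";
  return { tier: escalated, confidence: confidence };
}

const nearBoundary = routeTier(3.0); // 1 point below the MEDIUM boundary
console.log(nearBoundary.tier, nearBoundary.confidence); // MEDIUM LOW
```

Escalating on LOW confidence follows the cost asymmetry described above: a weak model on a hard task costs more than a strong model on an easy one.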
Example:
These override the scoring system completely. No exceptions.
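One way to picture an override is as a minimum tier (a floor) applied after scoring. The sketch below uses only the overrides that this document's worked examples apply (auth, security review, repeated failures, database migration). The flag names, rule labels, and the assumption that each HIGH rule fires independently are illustrative, and the list is not exhaustive.

```typescript
// Sketch only: flag names and rule labels are illustrative.
type FloorTier = "LOW" | "MEDIUM" | "HIGH";

const RANK: Record<FloorTier, number> = { LOW: 0, MEDIUM: 1, HIGH: 2 };

interface HardRuleFlags {
  touchesAuth: boolean;
  securityReview: boolean;
  databaseMigration: boolean;
  priorFailures: number;
}

interface HardRule {
  label: string;
  floor: FloorTier; // minimum tier once the rule fires
  fires: (f: HardRuleFlags) => boolean;
}

// ASSUMPTION: each rule fires on its own; only rules named in the worked examples are listed.
const HARD_RULES: HardRule[] = [
  { label: "security-review", floor: "HIGH", fires: (f) => f.securityReview },
  { label: "repeated-failures", floor: "HIGH", fires: (f) => f.priorFailures >= 2 },
  { label: "database-migration", floor: "HIGH", fires: (f) => f.databaseMigration },
  { label: "auth", floor: "MEDIUM", fires: (f) => f.touchesAuth },
];

function applyHardRules(
  scoredTier: FloorTier,
  flags: HardRuleFlags
): { tier: FloorTier; hardRule: string } {
  let tier: FloorTier = scoredTier;
  const applied: string[] = [];
  HARD_RULES.forEach((rule) => {
    if (!rule.fires(flags)) return;
    // Record a rule only if it would raise the scored tier by itself.
    if (RANK[rule.floor] > RANK[scoredTier]) applied.push(rule.label);
    if (RANK[rule.floor] > RANK[tier]) tier = rule.floor;
  });
  return { tier: tier, hardRule: applied.length > 0 ? applied.join(" + ") : "none" };
}

const audit = applyHardRules("MEDIUM", {
  touchesAuth: true,
  securityReview: true,
  databaseMigration: false,
  priorFailures: 2,
});
console.log(audit.tier, audit.hardRule); // HIGH security-review + repeated-failures
```

Treating a rule as a floor means it can only raise the tier, so a task that already scores HIGH is never downgraded by a MEDIUM-floor rule.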
When spawning sub-agents, include model selection in the spawn parameters only when the score warrants it. Omit model entirely when the task should inherit.
```typescript
// CORRECT: score = 9 (HIGH) → specify Opus
Task({
  prompt: "Design the new event sourcing architecture...",
  model: "claude-opus-4-5"
})

// CORRECT: score = 5 (MEDIUM) → inherit, omit model
Task({
  prompt: "Add input validation to the create endpoint..."
  // no model param
})

// WRONG: always specifying model
Task({
  prompt: "Fix the typo in the README",
  model: "claude-opus-4-5" // wasteful
})
```
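The "omit `model` when inheriting" rule can be enforced by a small helper that builds the spawn parameters. This is a sketch under stated assumptions: `buildSpawnParams` and `MODEL_IDS` are hypothetical, the parent's tier is assumed to be known, and only the Opus model ID appears in this document; the other two entries are placeholders, not real model IDs.

```typescript
// Sketch only: helper and constant names are illustrative.
type SpawnTier = "LOW" | "MEDIUM" | "HIGH";

interface SpawnParams {
  prompt: string;
  model?: string; // left out entirely when the sub-agent should inherit
}

const MODEL_IDS: Record<SpawnTier, string> = {
  HIGH: "claude-opus-4-5",       // taken from the examples above
  MEDIUM: "<sonnet-model-id>",   // placeholder, fill in for your setup
  LOW: "<haiku-class-model-id>", // placeholder, fill in for your setup
};

// ASSUMPTION: a task "inherits" when its tier matches the parent's tier.
function buildSpawnParams(
  prompt: string,
  taskTier: SpawnTier,
  parentTier: SpawnTier
): SpawnParams {
  if (taskTier === parentTier) return { prompt: prompt }; // no model key at all
  return { prompt: prompt, model: MODEL_IDS[taskTier] };
}

console.log(buildSpawnParams("Add input validation to the create endpoint...", "MEDIUM", "MEDIUM"));
console.log(buildSpawnParams("Design the new event sourcing architecture...", "HIGH", "MEDIUM"));
```

Returning an object with no `model` key, rather than `model: undefined`, keeps the inherit case identical to the second example above.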
When deciding which model to use for your own reasoning:
Request: "Change the button text from 'Submit' to 'Save'"
| Signal | Value | Score |
|--------|-------|-------|
| Lexical: simple keyword ("change") | -2 | |
| Lexical: word count (7 words) | -1 | |
| Structural: 1 file, 1 subtask, function-level, reversible, no tests | 0 | |
| Context: no errors, direct request, familiar codebase | 0 | |
| Weighted total | | (-3 * 0.3) + (0 * 0.4) + (0 * 0.3) = -0.9 (LOW) |
Tier: LOW. Model: Haiku-class. Confidence: HIGH.
Request: "Add a POST /api/orders endpoint with Zod validation, rate limiting, auth middleware, and write integration tests"
| Signal | Value | Score |
|--------|-------|-------|
| Lexical: no arch keywords, no risk keywords | 0 | |
| Lexical: word count ~20 | 0 | |
| Structural: 4 subtasks (+2), 3-4 files (+1), module-level (+1), reversible (0), new tests (+2) | 6 | |
| Context: no prior errors, direct, familiar | 0 | |
| Weighted total | | (0 * 0.3) + (6 * 0.4) + (0 * 0.3) = 2.4 → LOW |
Wait — re-check: "auth middleware" is a risk keyword (+2 lexical). Update:
| Signal | Value | Score |
|--------|-------|-------|
| Lexical: risk keyword "auth" (+2) | 2 | |
| Structural: 6 signals total | 6 | |
| Context: 0 | 0 | |
| Weighted total | | (2 * 0.3) + (6 * 0.4) + (0 * 0.3) = 3.0 → LOW, but hard rule: auth → override to MEDIUM |
Tier: MEDIUM. Model: Sonnet. Hard rule override applied.
Request: "Audit the authentication flow. We've had two incidents this week where tokens weren't being validated correctly in production. Find root cause and fix."
| Signal | Value | Score |
|--------|-------|-------|
| Lexical: risk keywords "authentication", "tokens", "production" (+2 +2 +2) | 6 | |
| Lexical: question depth (find root cause AND fix, two constraints) | +2 | |
| Lexical: word count 36 | 0 | |
| Structural: unknown subtask count (investigation = open-ended) → treat as 4+ (+2), 3+ files (+2), system-wide (+3), hard to reverse (+2) | 9 | |
| Context: 2 prior production incidents = treat as 2 errors (+1), direct request (0), known codebase (0) | 1 | |
| Weighted total | | (8 * 0.3) + (9 * 0.4) + (1 * 0.3) = 2.4 + 3.6 + 0.3 = 6.3 → MEDIUM |
But: hard rule "security review of any kind" AND "2+ prior failures" → override to HIGH.
Tier: HIGH. Model: Opus. Hard rule override applied (security + repeated failures).
Request: "Migrate the users table to add a composite index on (email, tenant_id), backfill existing rows, and update all queries that touch this table"
| Signal | Value | Score |
|--------|-------|-------|
| Lexical: risk keywords "database migration" (+2) | 2 | |
| Lexical: word count 30 | 0 | |
| Structural: 3 subtasks (+1), 4+ files (+2), system-wide (+3), irreversible (+2), new tests for migration (+2) | 10 | |
| Context: no errors yet, direct, familiar | 0 | |
| Weighted total | | (2 * 0.3) + (10 * 0.4) + (0 * 0.3) = 0.6 + 4.0 + 0 = 4.6 → MEDIUM |
But: hard rule "database migration" → override to HIGH.
Tier: HIGH. Model: Opus. Score near boundary + hard rule.
Every routing decision must be logged:
```
ROUTING DECISION
TASK: [brief description]
SCORE: [weighted total] (lexical=[X], structural=[Y], context=[Z])
TIER: [LOW|MEDIUM|HIGH]
CONFIDENCE: [HIGH|LOW]
HARD_RULE: [rule name if override applied, else "none"]
MODEL: [selected model]
REASON: [1-sentence justification]
```
Example:
```
ROUTING DECISION
TASK: Security audit of auth token validation
SCORE: 6.3 (lexical=8, structural=9, context=1)
TIER: HIGH (override from MEDIUM)
CONFIDENCE: HIGH
HARD_RULE: security-review + repeated-failures
MODEL: claude-opus-4-5
REASON: Active production security incident with two prior failures; hard rule override applied
```
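A small formatter can keep every log entry in this shape. The sketch below is illustrative: the `RoutingDecision` interface and `formatRoutingDecision` function are hypothetical names, and the optional `overriddenFrom` field is an assumption made so the `TIER` line can match the example above.

```typescript
// Sketch only: interface and function names are illustrative.
type LogTier = "LOW" | "MEDIUM" | "HIGH";

interface RoutingDecision {
  task: string;
  score: number;
  lexical: number;
  structural: number;
  context: number;
  tier: LogTier;
  overriddenFrom?: LogTier; // set only when a hard rule changed the tier
  confidence: "HIGH" | "LOW";
  hardRule: string;         // "none" when no override applied
  model: string;
  reason: string;
}

function formatRoutingDecision(d: RoutingDecision): string {
  const tierLine =
    d.overriddenFrom !== undefined
      ? `TIER: ${d.tier} (override from ${d.overriddenFrom})`
      : `TIER: ${d.tier}`;
  return [
    "ROUTING DECISION",
    `TASK: ${d.task}`,
    `SCORE: ${d.score.toFixed(1)} (lexical=${d.lexical}, structural=${d.structural}, context=${d.context})`,
    tierLine,
    `CONFIDENCE: ${d.confidence}`,
    `HARD_RULE: ${d.hardRule}`,
    `MODEL: ${d.model}`,
    `REASON: ${d.reason}`,
  ].join("\n");
}

console.log(
  formatRoutingDecision({
    task: "Security audit of auth token validation",
    score: 6.3,
    lexical: 8,
    structural: 9,
    context: 1,
    tier: "HIGH",
    overriddenFrom: "MEDIUM",
    confidence: "HIGH",
    hardRule: "security-review + repeated-failures",
    model: "claude-opus-4-5",
    reason: "Active production security incident with two prior failures; hard rule override applied",
  })
);
```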
This skill overrides the default model assignments in agent-assignment-matrix.md when a scoring reason exists. The matrix defines WHICH agent handles a task; this skill defines WHICH MODEL that agent uses.
Priority order:
When a task enters its second or third retry in the qa-loop.md Dev-QA cycle:
This prevents the failure loop of using the same weak model repeatedly on a hard task.
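Retry escalation could look like the sketch below. The exact step per retry is not spelled out here, so the rule used (keep the tier on the first retry, then move up one tier per further retry, capped at HIGH) is purely an assumption, as are the names `tierForRetry`, `RETRY_RANK`, and `tierFromRank`.

```typescript
// Sketch only: names and the escalation schedule are assumptions.
type RetryTier = "LOW" | "MEDIUM" | "HIGH";

const RETRY_RANK: Record<RetryTier, number> = { LOW: 0, MEDIUM: 1, HIGH: 2 };

function tierFromRank(rank: number): RetryTier {
  if (rank >= 2) return "HIGH";
  if (rank === 1) return "MEDIUM";
  return "LOW";
}

// ASSUMPTION: retryNumber 0 is the first attempt and retry 1 keeps the tier;
// the second retry moves up one tier, the third moves up two, capped at HIGH.
function tierForRetry(initialTier: RetryTier, retryNumber: number): RetryTier {
  const bump = Math.max(0, retryNumber - 1);
  return tierFromRank(RETRY_RANK[initialTier] + bump);
}

console.log(tierForRetry("LOW", 1)); // LOW
console.log(tierForRetry("LOW", 2)); // MEDIUM
console.log(tierForRetry("LOW", 3)); // HIGH
```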
Before scoring, check if a similar task has been attempted before:
```bash
cd ~/.claude && PYTHONPATH=scripts python3 scripts/core/recall_learnings.py \
  --query "model routing [task description]" --text-only
```
If a prior attempt used a different tier and succeeded or failed, factor that into the context signal score.
| Anti-pattern | Why It Fails | Correct Behavior |
|---|---|---|
| Always specifying Opus "to be safe" | Slow, expensive, no learning signal | Score the task, use the right tier |
| Always inheriting parent model | Complex sub-tasks get wrong model | Score independently, override when needed |
| Using model as a retry strategy | Same model will fail the same way | If score warrants higher tier, upgrade |
| Ignoring hard rules | Security bugs, data loss | Hard rules exist because the cost of failure is asymmetric |
| Scoring once per session | Context changes, errors accumulate | Re-score on retry, re-score when context shifts |
Before routing any agent spawn or model selection decision:
Omit the `model` param if inheriting from parent.

Remember: The goal is the right model for the job. Not the most expensive one. Not the default one. The right one.