skills/evals/cost-quality-tradeoff/SKILL.md
Measure and optimize the cost/quality curve — which model, prompt, and settings give the best quality per dollar. Covers Pareto analysis, break-even thresholds, and when to spend more vs less. Use this skill when optimizing LLM spend, picking a default model for a feature, or deciding whether a premium model is worth it. Activate when: cost vs quality, model selection, eval cost, Pareto frontier, cheaper model, premium model tradeoff.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
`npx skillsauth add latestaiagents/agent-skills cost-quality-tradeoff`
Quality without cost context is half a decision. You need the Pareto frontier — for each quality bar, what's the cheapest config that hits it?
Plot each candidate config (model × prompt × settings) on quality (y-axis) vs cost per request (x-axis). The frontier is the set of configs where no other config is both cheaper AND better.
Any config not on the frontier is dominated: some other config is at least as cheap and at least as good, and strictly better on one of the two. Drop it.
```
quality
  ↑
1 |                              *A (opus + thinking)
  |                      *B (opus)
  |              *D (sonnet + few-shot)
  |       *C (sonnet)
  |              *G
  | *E (haiku)
0 | *F
  +--------------------------------→ cost
```
Pareto: A, B, D, C, E. Dominated: F (worse than E at same cost), G (worse than D at same cost).
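The frontier can also be computed rather than read off a plot. The sketch below is one minimal way to do it, assuming each config has already been reduced to a single cost and a single quality number. The `Candidate` shape, the function names, and all figures are illustrative and not part of the skill; the numbers are made up so that F shares E's cost and G shares D's cost, as in the diagram.

```typescript
// Hypothetical shape for one measured config; field names are illustrative.
interface Candidate {
  name: string;
  costPerRequest: number; // USD, lower is better
  quality: number;        // eval score in [0, 1], higher is better
}

// `a` dominates `b` if it is no worse on both axes and strictly better on at least one.
function dominates(a: Candidate, b: Candidate): boolean {
  return (
    a.costPerRequest <= b.costPerRequest &&
    a.quality >= b.quality &&
    (a.costPerRequest < b.costPerRequest || a.quality > b.quality)
  );
}

// Keep every candidate that no other candidate dominates, sorted by cost.
function paretoFrontier(candidates: Candidate[]): Candidate[] {
  return candidates
    .filter((c) => !candidates.some((other) => dominates(other, c)))
    .sort((x, y) => x.costPerRequest - y.costPerRequest);
}

// Made-up measurements, not benchmarks.
const candidates: Candidate[] = [
  { name: "A", costPerRequest: 0.06, quality: 0.95 },
  { name: "B", costPerRequest: 0.04, quality: 0.9 },
  { name: "C", costPerRequest: 0.008, quality: 0.7 },
  { name: "D", costPerRequest: 0.012, quality: 0.8 },
  { name: "E", costPerRequest: 0.002, quality: 0.5 },
  { name: "F", costPerRequest: 0.002, quality: 0.4 },
  { name: "G", costPerRequest: 0.012, quality: 0.6 },
];

const frontierNames = paretoFrontier(candidates).map((c) => c.name);
console.log(frontierNames.join(", ")); // E, C, D, B, A
```

The pairwise check is quadratic in the number of configs, which is fine for the handful of candidates a typical comparison involves.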
For each candidate, measure:
| Metric | Example |
|---|---|
| Input tokens / request | 2,500 |
| Output tokens / request | 400 |
| $ / request | $0.012 |
| Quality score | 0.87 |
| p95 latency | 1.8s |
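```typescript
// Rates are USD per million tokens. Cache fields may be absent, so default them to 0.
const costPerRequest =
  (usage.input_tokens / 1e6) * inputRate +
  (usage.output_tokens / 1e6) * outputRate +
  ((usage.cache_creation_input_tokens ?? 0) / 1e6) * cacheWriteRate +
  ((usage.cache_read_input_tokens ?? 0) / 1e6) * cacheReadRate;
```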
Always include cache costs — they dominate on cached workloads.
For any feature, try at least:
One of these usually sits on the frontier for your workload. Don't assume — measure.
Before jumping to a bigger model, try prompt levers:
A better prompt on Haiku can beat a mediocre prompt on Sonnet — and cost 10× less.
When considering an upgrade, compute when it pays off:
Cost increase per request: Δcost = new - old
Quality increase: Δquality = new - old
Value per quality point: V (estimated from business metrics)
Worth it if: Δquality × V > Δcost
Example: if every 1% quality gain increases user retention revenue by $0.003/request, and upgrading Haiku→Sonnet costs +$0.002/request for +5% quality, then Δquality × V = 5 × $0.003 = $0.015, which exceeds Δcost = $0.002, so the upgrade is worth it.
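The break-even rule above can be sketched as a small helper. This is an illustration only: the `UpgradeCase` shape and function names are assumptions, quality is expressed in percentage points, and the inputs are the Haiku→Sonnet figures from the example.

```typescript
// Inputs for the break-even check; all names here are illustrative.
interface UpgradeCase {
  deltaCostPerRequest: number;  // USD: new cost minus old cost
  deltaQualityPoints: number;   // percentage points: new quality minus old quality
  valuePerQualityPoint: number; // USD per request per percentage point (V)
}

// Net value per request: positive means the upgrade pays for itself.
function netValuePerRequest(u: UpgradeCase): number {
  return u.deltaQualityPoints * u.valuePerQualityPoint - u.deltaCostPerRequest;
}

// Smallest V at which the upgrade breaks even.
function breakEvenValue(u: UpgradeCase): number {
  return u.deltaCostPerRequest / u.deltaQualityPoints;
}

const haikuToSonnet: UpgradeCase = {
  deltaCostPerRequest: 0.002,
  deltaQualityPoints: 5,
  valuePerQualityPoint: 0.003,
};

const net = netValuePerRequest(haikuToSonnet);   // 5 * 0.003 - 0.002, about 0.013
const breakEven = breakEvenValue(haikuToSonnet); // 0.002 / 5 = 0.0004
console.log(net > 0 ? "worth it" : "not worth it", net.toFixed(4), breakEven.toFixed(4));
```

The break-even value is often the more useful number: if a quality point is plausibly worth more than $0.0004 per request, the upgrade holds even when V is only roughly known.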
You don't have to pick one. Route by difficulty:
```typescript
// Pick the model tier from the classifier's difficulty label.
const difficulty = await classifyDifficulty(query);
const model =
  difficulty === "simple" ? "claude-haiku-4-5"
  : difficulty === "medium" ? "claude-sonnet-4-6"
  : "claude-opus-4-6";
```
Classification is a cheap Haiku call. Most queries are simple; you save money. Hard queries get the premium treatment.
Measure: does tiered routing actually improve your cost/quality position? Sometimes classification errors wipe out the gains.
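One way to run that measurement is to compute the traffic-weighted cost and quality of the routed setup and compare it with a single-model baseline on the same eval set. The sketch below assumes you have per-tier traffic shares and per-tier quality measured on the queries the classifier actually sent there, so misroutes are already reflected in the scores. The types, names, and numbers are all placeholders.

```typescript
// Per-tier measurements; every number below is a placeholder, not a benchmark.
interface Tier {
  share: number;          // fraction of traffic routed to this tier (shares sum to 1)
  costPerRequest: number; // USD for the model serving this tier
  quality: number;        // measured quality on the queries actually routed here
}

interface Blended {
  cost: number;
  quality: number;
}

// Traffic-weighted cost and quality, plus the classifier call paid on every request.
function blendedRouting(tiers: Tier[], classifierCost: number): Blended {
  const cost = tiers.reduce((sum, t) => sum + t.share * t.costPerRequest, classifierCost);
  const quality = tiers.reduce((sum, t) => sum + t.share * t.quality, 0);
  return { cost, quality };
}

const routed = blendedRouting(
  [
    { share: 0.7, costPerRequest: 0.002, quality: 0.9 },  // simple -> small model
    { share: 0.2, costPerRequest: 0.012, quality: 0.85 }, // medium -> mid model
    { share: 0.1, costPerRequest: 0.06, quality: 0.8 },   // hard -> large model
  ],
  0.0005, // cost of the classifier call
);

// Single-model baseline measured on the same eval set (placeholder values).
const baseline: Blended = { cost: 0.012, quality: 0.84 };
const routingWins = routed.cost <= baseline.cost && routed.quality >= baseline.quality;
console.log(routed, routingWins);
```

With these placeholder inputs the routed setup costs about $0.0103 per request at 0.88 quality, so it beats the baseline. With a weaker classifier the per-tier quality numbers drop and the comparison can flip.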
Cost-quality isn't enough; latency matters too. Examples where it dominates:
Report 3-tuples: (quality, cost, p95 latency). The frontier in 3D is usually larger, because a config can earn its place on latency alone; narrow it by fixing whichever axis has a hard constraint.
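Choosing by constraint can be sketched as a filter followed by a sort: discard configs that miss the quality floor or the latency ceiling, then take the cheapest survivor. The `Measured` shape, the `cheapestWithin` helper, and the sample values are assumptions made for illustration.

```typescript
// One measured config as a (quality, cost, p95 latency) tuple; names are illustrative.
interface Measured {
  name: string;
  quality: number;      // higher is better
  cost: number;         // USD per request, lower is better
  p95LatencyMs: number; // lower is better
}

// Cheapest config that clears a quality floor and a latency ceiling,
// or undefined if no config satisfies both.
function cheapestWithin(
  configs: Measured[],
  minQuality: number,
  maxP95LatencyMs: number,
): Measured | undefined {
  return configs
    .filter((c) => c.quality >= minQuality && c.p95LatencyMs <= maxP95LatencyMs)
    .sort((a, b) => a.cost - b.cost)[0];
}

// Placeholder measurements, not benchmarks.
const measured: Measured[] = [
  { name: "small", quality: 0.7, cost: 0.002, p95LatencyMs: 600 },
  { name: "mid", quality: 0.85, cost: 0.012, p95LatencyMs: 1800 },
  { name: "large", quality: 0.93, cost: 0.06, p95LatencyMs: 4000 },
];

console.log(cheapestWithin(measured, 0.8, 2000)?.name); // mid
console.log(cheapestWithin(measured, 0.9, 2000)?.name); // undefined: nothing is both good and fast enough
```

An `undefined` result is itself useful: it says the constraints cannot be met by any measured config, so either a constraint relaxes or a new config is needed.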
If you can cache 90% of your input:
Decisions made without caching factored in are usually wrong. Re-measure with cache.
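To see how much caching can move a config along the cost axis, the sketch below computes input cost per request with and without a 90% cache hit share. It is a simplification: it ignores the one-time cache write, and the rates are placeholders chosen for round numbers, not real prices. The function name is hypothetical.

```typescript
// Effective input cost per request when a fraction of input tokens is read from cache.
// Ignores the one-time cache write; rates are USD per million tokens.
function inputCostWithCache(
  inputTokens: number,
  cachedFraction: number, // 0..1, share of input tokens served from cache
  inputRate: number,
  cacheReadRate: number,
): number {
  const cachedTokens = inputTokens * cachedFraction;
  const freshTokens = inputTokens - cachedTokens;
  return (freshTokens / 1e6) * inputRate + (cachedTokens / 1e6) * cacheReadRate;
}

// Placeholder rates, not real prices: 3.00 per MTok fresh, 0.30 per MTok cached read.
const uncached = inputCostWithCache(2500, 0, 3.0, 0.3); // 0.0075
const cached = inputCostWithCache(2500, 0.9, 3.0, 0.3); // 0.001425
console.log(uncached.toFixed(6), cached.toFixed(6));
```

Under these assumed rates the cached input cost is about a fifth of the uncached one, which is enough to reorder configs on the frontier. Substitute your provider's actual rates before drawing conclusions.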
Don't eval each config on 10,000 items. Start small:
Saves 10-100× on eval cost.
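As a rough illustration of where the savings come from, the sketch below counts scored items for a full run versus a staged run. The staging scheme (every config on a small sample, then only the survivors on a larger one), the stage sizes, and the survivor count are all assumptions chosen for the example.

```typescript
// Total scored items for a staged eval: every config on a small sample,
// then only the surviving configs on a larger one. Sizes are placeholders.
function stagedEvalItems(
  configs: number,
  smallN: number,
  survivors: number,
  largeN: number,
): number {
  return configs * smallN + survivors * largeN;
}

const fullRun = 20 * 10_000;                       // 20 configs on 10,000 items each
const staged = stagedEvalItems(20, 100, 3, 1_000); // 2,000 + 3,000 = 5,000
console.log(`${(fullRun / staged).toFixed(0)}x fewer scored items`); // 40x fewer scored items
```

The small first stage only needs to be large enough to separate clearly dominated configs from plausible ones; close calls are settled in the larger stage.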