Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

latestaiagents/cost-quality-tradeoff

Name: cost-quality-tradeoff
Author: latestaiagents

skills/evals/cost-quality-tradeoff/SKILL.md

npx skillsauth add latestaiagents/agent-skills cost-quality-tradeoff


Cost vs Quality Tradeoff

Quality without cost context is half a decision. You need the Pareto frontier — for each quality bar, what's the cheapest config that hits it?

When to Use

Choosing a default model for a new feature
Reducing LLM spend on an existing feature
Justifying (or not) an upgrade to a premium model
Trading off prompt complexity, model size, and thinking budget

The Pareto Frontier

Plot each candidate config (model × prompt × settings) on quality (y-axis) vs cost per request (x-axis). The frontier is the set of configs that no other config beats: nothing else is at least as cheap and at least as good while being strictly better on one of the two.

Any config NOT on the frontier is dominated: some other option matches or beats it on both cost and quality. Drop it.

  quality
   ↑
 1 |                    *A (opus + thinking)
   |               *B (opus)
   |          *D (sonnet + few-shot)
   |          *G
   |     *C (sonnet)
   |*E (haiku)
 0 |*F
   +---------------------------→ cost

Pareto: A, B, D, C, E. Dominated: F (worse than E at same cost), G (worse than D at same cost).
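The dominance rule can be sketched in a few lines. This is a minimal illustration, not the skill's own tooling: the `Candidate` shape, the function names, and all numbers are assumptions, chosen only so that the result matches the A-G example above.

```typescript
// Hypothetical shape for one measured config; field names are illustrative.
interface Candidate {
  name: string;
  quality: number;        // higher is better
  costPerRequest: number; // lower is better, in dollars
}

// a dominates b if a is at least as good on both axes and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  const noWorse = a.quality >= b.quality && a.costPerRequest <= b.costPerRequest;
  const strictlyBetter = a.quality > b.quality || a.costPerRequest < b.costPerRequest;
  return noWorse && strictlyBetter;
}

// Keep only configs that no other config dominates, sorted cheapest first.
function paretoFrontier(candidates: Candidate[]): Candidate[] {
  return candidates
    .filter((c) => !candidates.some((other) => dominates(other, c)))
    .sort((x, y) => x.costPerRequest - y.costPerRequest);
}

// Made-up measurements arranged like the A-G example.
const exampleCandidates: Candidate[] = [
  { name: "A", quality: 0.95, costPerRequest: 0.06 },
  { name: "B", quality: 0.9, costPerRequest: 0.04 },
  { name: "C", quality: 0.78, costPerRequest: 0.008 },
  { name: "D", quality: 0.85, costPerRequest: 0.012 },
  { name: "E", quality: 0.6, costPerRequest: 0.002 },
  { name: "F", quality: 0.55, costPerRequest: 0.002 },
  { name: "G", quality: 0.8, costPerRequest: 0.012 },
];

const frontierNames = paretoFrontier(exampleCandidates).map((c) => c.name);
// frontierNames is ["E", "C", "D", "B", "A"]; F and G are dominated.
```

The pairwise check is quadratic in the number of configs, which is fine for the handful of candidates a typical comparison involves.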

Measurement

For each candidate, measure:

| Metric | Example |
|---|---|
| Input tokens / request | 2,500 |
| Output tokens / request | 400 |
| $ / request | $0.012 |
| Quality score | 0.87 |
| p95 latency | 1.8s |

// Rates are in $ per million tokens; cache fields may be absent, so default them to 0.
const costPerRequest = (usage.input_tokens / 1e6) * inputRate +
                       (usage.output_tokens / 1e6) * outputRate +
                       ((usage.cache_creation_input_tokens ?? 0) / 1e6) * cacheWriteRate +
                       ((usage.cache_read_input_tokens ?? 0) / 1e6) * cacheReadRate;

Always include cache costs — they dominate on cached workloads.
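To turn per-request numbers into the table's metrics, one option is a small summary helper like the sketch below. The `RequestSample` and `ConfigSummary` shapes and the nearest-rank percentile are assumptions for illustration; use whatever grader and percentile definition your team already reports.

```typescript
// Hypothetical per-request record collected during an eval run.
interface RequestSample {
  costUsd: number;      // dollars for this request
  qualityScore: number; // 0..1, from whatever grader you use
  latencyMs: number;
}

interface ConfigSummary {
  meanCostUsd: number;
  meanQuality: number;
  p95LatencyMs: number;
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(samples: RequestSample[]): ConfigSummary {
  if (samples.length === 0) {
    throw new Error("summarize needs at least one sample");
  }
  return {
    meanCostUsd: mean(samples.map((s) => s.costUsd)),
    meanQuality: mean(samples.map((s) => s.qualityScore)),
    p95LatencyMs: percentile(samples.map((s) => s.latencyMs), 95),
  };
}

// Twenty made-up samples with latencies 100, 200, ..., 2000 ms.
const demoSamples: RequestSample[] = Array.from({ length: 20 }, (_, i) => ({
  costUsd: 0.012,
  qualityScore: i % 2 === 0 ? 1 : 0.5,
  latencyMs: (i + 1) * 100,
}));
const demoSummary = summarize(demoSamples);
// demoSummary.p95LatencyMs is 1900; demoSummary.meanQuality is 0.75.
```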

Common Configs to Compare

For any feature, try at least:

Haiku with concise prompt
Haiku with longer / few-shot prompt
Sonnet with concise prompt
Sonnet with few-shot + structured output
Sonnet with extended thinking
Opus with concise prompt
Opus with extended thinking

One of these usually sits on the frontier for your workload. Don't assume — measure.

Prompt as a Lever

Before jumping to a bigger model, try prompt levers:

Few-shot examples (2-5) often bump quality 5-15% at small cost
Structured output (JSON schema) reduces parsing errors
Chain-of-thought prompting helps reasoning without thinking tokens
Better system prompt scoping (what's in vs out) improves accuracy

A better prompt on Haiku can beat a mediocre prompt on Sonnet at a fraction of the cost; the exact ratio depends on current pricing.

Break-Even Analysis

When considering an upgrade, compute when it pays off:

Cost increase per request: Δcost = new - old
Quality increase: Δquality = new - old
Value per quality point: V (estimated from business metrics)

Worth it if: Δquality × V > Δcost

Example: If every 1% quality gain increases user retention revenue by $0.003/request, and upgrading Haiku→Sonnet costs +$0.002/request for +5% quality:

Value gain: 5 × $0.003 = $0.015
Cost: $0.002
Net: +$0.013 per request → upgrade.
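The same arithmetic as a small helper, in case you want to run it over several candidate upgrades. The `UpgradeInputs` shape and function names are hypothetical; the numbers are the ones from the example above.

```typescript
// Hypothetical input shape; quality is measured in percentage points.
interface UpgradeInputs {
  deltaCostUsd: number;       // Δcost per request
  deltaQualityPoints: number; // Δquality in percentage points
  valuePerPointUsd: number;   // V: dollars per request per quality point
}

// Net value per request: Δquality × V − Δcost.
function netValuePerRequest(u: UpgradeInputs): number {
  return u.deltaQualityPoints * u.valuePerPointUsd - u.deltaCostUsd;
}

function isUpgradeWorthIt(u: UpgradeInputs): boolean {
  return netValuePerRequest(u) > 0;
}

// The value per quality point at which the upgrade exactly breaks even.
function breakEvenValuePerPoint(u: UpgradeInputs): number {
  if (u.deltaQualityPoints <= 0) {
    throw new RangeError("break-even is only defined for a positive quality gain");
  }
  return u.deltaCostUsd / u.deltaQualityPoints;
}

// Haiku→Sonnet example from the text: +$0.002/request for +5 points, V = $0.003.
const haikuToSonnet: UpgradeInputs = {
  deltaCostUsd: 0.002,
  deltaQualityPoints: 5,
  valuePerPointUsd: 0.003,
};
const net = netValuePerRequest(haikuToSonnet);           // about 0.013
const worthIt = isUpgradeWorthIt(haikuToSonnet);         // true
const breakEvenV = breakEvenValuePerPoint(haikuToSonnet); // 0.0004
```

The break-even figure is often the more useful output: if you are unsure about V, you only need to argue that it exceeds $0.0004 per point, not estimate it precisely.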

Tiered Routing

You don't have to pick one. Route by difficulty:

const difficulty = await classifyDifficulty(query);
const model = difficulty === "simple" ? "claude-haiku-4-5"
            : difficulty === "medium" ? "claude-sonnet-4-6"
            : "claude-opus-4-6";

Classification is a cheap Haiku call. Most queries are simple; you save money. Hard queries get the premium treatment.

Measure: does tiered routing actually improve your cost/quality position? Sometimes classification errors wipe out the gains.
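One way to check this before building the router is to estimate its expected cost and quality from a classifier confusion matrix. The sketch below is a simplified model under stated assumptions: the `RoutingModel` shape is hypothetical, every number is made up, and quality is assumed to depend only on the query's true tier and the model that served it.

```typescript
type Tier = "simple" | "medium" | "hard";
const TIERS: Tier[] = ["simple", "medium", "hard"];

// Hypothetical description of a routing setup.
interface RoutingModel {
  trafficShare: Record<Tier, number>;            // fraction of queries truly in each tier
  confusion: Record<Tier, Record<Tier, number>>; // confusion[trueTier][predictedTier]
  costUsd: Record<Tier, number>;                 // cost of the model serving each predicted tier
  quality: Record<Tier, Record<Tier, number>>;   // quality[trueTier][servingTier]
  classifierCostUsd: number;                     // paid on every request
}

// Expected cost and quality per request, averaged over traffic and classifier errors.
function expectedRouting(m: RoutingModel): { costUsd: number; quality: number } {
  let costUsd = m.classifierCostUsd;
  let quality = 0;
  for (const trueTier of TIERS) {
    for (const predicted of TIERS) {
      const weight = m.trafficShare[trueTier] * m.confusion[trueTier][predicted];
      costUsd += weight * m.costUsd[predicted];
      quality += weight * m.quality[trueTier][predicted];
    }
  }
  return { costUsd, quality };
}

// Made-up example: a classifier that sends 20% of hard queries to the small model.
const noisyRouter: RoutingModel = {
  trafficShare: { simple: 0.7, medium: 0.2, hard: 0.1 },
  confusion: {
    simple: { simple: 0.95, medium: 0.05, hard: 0 },
    medium: { simple: 0.1, medium: 0.85, hard: 0.05 },
    hard: { simple: 0.2, medium: 0.2, hard: 0.6 },
  },
  costUsd: { simple: 0.002, medium: 0.012, hard: 0.06 },
  quality: {
    simple: { simple: 0.95, medium: 0.96, hard: 0.97 },
    medium: { simple: 0.7, medium: 0.88, hard: 0.92 },
    hard: { simple: 0.3, medium: 0.6, hard: 0.9 },
  },
  classifierCostUsd: 0.0003,
};
const routed = expectedRouting(noisyRouter);
```

Compare `routed` against the single-model points on your frontier: if a plain mid-tier config is both cheaper and better, the router is dominated and not worth its complexity.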

Latency as a Third Axis

Cost-quality isn't enough; latency matters too. Examples where it dominates:

Chat UX needs first token < 2s → can rule out Opus with extended thinking, which delays the first visible token
Voice agent needs full response < 500ms → usually forces the smallest, fastest model (Haiku)
Background summarization: latency doesn't matter, optimize cost/quality only

Report 3-tuples: (quality, cost, p95 latency). Adding an axis never shrinks the frontier, since a config can now survive by being faster, so narrow the choice by first applying whichever axis has a hard constraint.
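A simple way to choose from the 3-tuples is to treat the constrained axes as filters and minimize cost over what remains. The sketch below assumes a hypothetical `Measured` record and made-up numbers; the config names are placeholders, not real measurements.

```typescript
// Hypothetical measured 3-tuple for one config.
interface Measured {
  name: string;
  quality: number;      // 0..1
  costUsd: number;      // per request
  p95LatencyMs: number;
}

// Cheapest config that meets both the quality bar and the latency budget, if any.
function cheapestWithin(
  configs: Measured[],
  minQuality: number,
  maxP95LatencyMs: number,
): Measured | undefined {
  const feasible = configs.filter(
    (c) => c.quality >= minQuality && c.p95LatencyMs <= maxP95LatencyMs,
  );
  if (feasible.length === 0) return undefined;
  return feasible.reduce((best, c) => (c.costUsd < best.costUsd ? c : best));
}

// Made-up measurements for three placeholder configs.
const measured: Measured[] = [
  { name: "small", quality: 0.7, costUsd: 0.002, p95LatencyMs: 400 },
  { name: "mid", quality: 0.85, costUsd: 0.012, p95LatencyMs: 1200 },
  { name: "large-thinking", quality: 0.95, costUsd: 0.06, p95LatencyMs: 9000 },
];

const chatPick = cheapestWithin(measured, 0.8, 2000);      // "mid"
const voicePick = cheapestWithin(measured, 0.6, 500);      // "small"
const impossiblePick = cheapestWithin(measured, 0.9, 2000); // undefined: nothing qualifies
```

An `undefined` result is informative in itself: it says the quality bar and latency budget cannot both be met by any tested config, so one of them has to move.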

Cache-Aware Selection

If you can cache 90% of your input:

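Effective input cost drops roughly 5× if cache reads are billed at about a tenth of the base input rate (0.9 × 0.1 + 0.1 = 0.19 of the uncached cost), before cache-write charges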
Long-context + cache often beats short-context + no cache at the same quality
Larger models become more affordable per request

Decisions made without caching factored in are usually wrong. Re-measure with cache.
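As a rough sanity check on the input-side savings, the blended input price can be computed from the cached fraction. This sketch assumes cache reads are billed at a fixed multiple of the base input rate (0.1 is used here purely as an assumed value; check your provider's price sheet) and deliberately ignores cache-write surcharges.

```typescript
// Effective input price as a fraction of the uncached price.
// cacheReadMultiplier is an assumption (0.1 = reads billed at 10% of base input);
// cache-write surcharges are left out of this sketch.
function effectiveInputFraction(cachedFraction: number, cacheReadMultiplier: number): number {
  if (cachedFraction < 0 || cachedFraction > 1) {
    throw new RangeError("cachedFraction must be between 0 and 1");
  }
  return cachedFraction * cacheReadMultiplier + (1 - cachedFraction);
}

const fraction90 = effectiveInputFraction(0.9, 0.1);  // about 0.19
const reduction90 = 1 / fraction90;                   // about 5.3x cheaper
const fraction100 = effectiveInputFraction(1.0, 0.1); // 0.1, i.e. 10x cheaper
```

Under these assumptions the uncached remainder sets the floor: with 90% of input cached, the 10% that is not cached already costs more than all the cached reads combined.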

Sample Budget

Don't eval each config on 10,000 items. Start small:

Shortlist with 50 items → narrow to 3 configs
Confirm with 200 items → pick winner
Validate in production with canary on 5% traffic → full rollout

Saves 10-100× on eval cost.
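To see where a figure in that range can come from, count eval calls for a staged plan versus an exhaustive one. The sketch assumes seven starting configs (the size of the comparison list earlier in this document); the `EvalStage` shape is hypothetical.

```typescript
// Hypothetical description of one eval stage.
interface EvalStage {
  label: string;
  configs: number;
  itemsPerConfig: number;
}

// Total model calls across all stages (one call per config per item).
function totalEvalCalls(stages: EvalStage[]): number {
  return stages.reduce((sum, s) => sum + s.configs * s.itemsPerConfig, 0);
}

const stagedPlan: EvalStage[] = [
  { label: "shortlist", configs: 7, itemsPerConfig: 50 },
  { label: "confirm", configs: 3, itemsPerConfig: 200 },
];
const exhaustivePlan: EvalStage[] = [
  { label: "full", configs: 7, itemsPerConfig: 10000 },
];

const stagedCalls = totalEvalCalls(stagedPlan);         // 950
const exhaustiveCalls = totalEvalCalls(exhaustivePlan); // 70000
const savingsFactor = exhaustiveCalls / stagedCalls;    // about 73.7
```

The canary stage is not counted here because it runs on real traffic you would be serving anyway.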

Anti-Patterns

Choosing the biggest model by default — often dominated
Ignoring cache costs — skews the whole picture
One-dimensional optimization — quality-only misses cost blowouts
Tiered routing without measuring — classification errors can negate gains
Per-product model choices — you can't reuse infrastructure investment
Skipping prompt levers — jumping to bigger model before trying few-shot

Best Practices

Measure cost AND quality AND latency; plot the Pareto frontier
Try prompt levers (few-shot, CoT, structure) before upsizing the model
Compute break-even: is the quality gain worth the cost delta?
Consider tiered routing; measure whether it actually helps
Always factor in cache reads/writes; they can shift input cost several-fold and flip the decision
Eval with small samples first (50-200); canary in prod before full rollout
Revisit choices quarterly — pricing and model quality move


Measure and optimize the cost/quality curve — which model, prompt, and settings give the best quality per dollar. Covers Pareto analysis, break-even thresholds, and when to spend more vs less. Use this skill when optimizing LLM spend, picking a default model for a feature, or deciding whether a premium model is worth it. Activate when: cost vs quality, model selection, eval cost, Pareto frontier, cheaper model, premium model tradeoff.

2 stars

testing

Updated Apr 23, 2026


npx skillsauth add latestaiagents/agent-skills cost-quality-tradeoff

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 2:55 AM · 8.3s · 1 file scanned

SKILL.md

name: cost-quality-tradeoff
description: |
  Activate when: cost vs quality, model selection, eval cost, Pareto frontier, cheaper model, premium model tradeoff.


Related Skills

latestaiagents/skill-testing

development

Verified · Trusted · Community

Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-frontmatter

documentation

Verified · Trusted · Community

Write the YAML frontmatter for a SKILL.md file so it activates reliably — name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-activation-patterns

development

Verified · Trusted · Community

Design skills that fire at the right moment — neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/progressive-disclosure

development

Verified · Trusted · Community

Structure SKILL.md content so the model reads just enough — concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.

2 · SKILL.md · Updated Apr 23, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r agent-skills/skills/evals/cost-quality-tradeoff ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT