skills/llm-cost-optimizer/SKILL.md
Track and reduce LLM API costs with per-request token tracking, model routing, budget alerts, and prompt compression. Activate on: reduce AI costs, token tracking, model routing, LLM budget, prompt compression. NOT for: general cloud cost optimization (cost-accrual-tracker), model training costs (ai-engineer).
npx skillsauth add curiositech/windags-skills llm-cost-optimizerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Track per-request token usage, implement intelligent model routing, set budget alerts, and compress prompts to reduce LLM API costs by 40-70%.
High API costs (>$500/month) AND unknown spend breakdown?
├─ YES → Start with token tracking middleware
│ ├─ Instrument all LLM calls for 1 week
│ └─ Generate cost breakdown report
└─ NO → Skip to routing or compression
Cost breakdown shows 80% from 20% of endpoints?
├─ YES → Implement model routing for top cost drivers
│ ├─ Simple tasks (classify/extract) → Haiku/GPT-4o-mini
│ ├─ Medium tasks (summarize/explain) → Sonnet/GPT-4o
│ └─ Complex tasks (reason/create) → Opus/GPT-4
└─ NO → Focus on prompt compression
System prompts >1000 tokens AND high request volume?
├─ YES → Compress prompts first (highest ROI)
│ ├─ Enable prompt caching (Anthropic/OpenAI)
│ ├─ Reduce few-shot examples to 2-3 best
│ └─ LLMLingua compress system instructions
└─ NO → Set budget alerts and monitoring
Budget overruns happening frequently?
├─ YES → Implement automated throttling
│ ├─ Daily caps with 80% soft limit warnings
│ ├─ Auto-downgrade expensive → cheap models
│ └─ Emergency circuit breaker at 95%
└─ NO → Set up monitoring dashboards
Request Analysis:
├─ Input tokens <200 AND structured output needed?
│ └─ Route to: Haiku ($0.80/M) or GPT-4o-mini ($0.15/M)
├─ Single-step reasoning OR summarization <2000 tokens?
│ └─ Route to: Sonnet ($3/M) or GPT-4o ($2.50/M)
├─ Multi-step reasoning OR creative writing OR high-stakes?
│ └─ Route to: Opus ($15/M) or GPT-4 ($30/M)
└─ Latency <100ms required?
└─ Route to: Fastest model regardless of cost
Symptom: Model performance drops after cost optimization Detection: Eval metrics decline >5% after routing/compression changes Diagnosis: Over-aggressive optimization sacrificing capability for cost Fix:
Symptom: Simple tasks routed to expensive models, complex to cheap Detection: Haiku/mini models showing high retry rates or error responses Diagnosis: Complexity classifier is miscalibrated or missing edge cases Fix:
Symptom: Team ignores budget alerts, overruns become normal Detection: >3 budget alerts per week with no corrective action Diagnosis: Alerts are noise without automated enforcement Fix:
Symptom: Cost calculations wrong, optimization decisions based on old pricing Detection: Calculated costs don't match actual API bills (>10% variance) Diagnosis: Hardcoded pricing table outdated, new models not included Fix:
Symptom: Prompt caching provides no savings despite implementation Detection: Cache hit rate <20% despite repeated system prompts Diagnosis: Prompts have subtle variations breaking exact-match caching Fix:
Scenario: API costs jumped from $800/month to $3,200/month after launching new chat feature.
# After 1 week of tracking
breakdown = {
"/chat/respond": {"requests": 45000, "cost": 2100, "avg_tokens": 850},
"/chat/summarize": {"requests": 12000, "cost": 400, "avg_tokens": 600},
"/admin/classify": {"requests": 8000, "cost": 120, "avg_tokens": 200}
}
# Insight: Chat responses drive 77% of cost but use Opus for everything
Novice approach: "Let's switch everything to Haiku to save money" Expert decision: "Chat classification can use Haiku, but creative responses need Sonnet/Opus routing"
def route_chat_request(prompt, conversation_length, task_type):
if task_type == "classify_intent" or len(prompt) < 200:
return "claude-haiku-3-5" # $0.80/M input
elif task_type in ["summarize", "explain"] and conversation_length < 5:
return "claude-sonnet-3-5" # $3/M input
else: # creative, long conversations, complex reasoning
return "claude-opus-3" # $15/M input
Results after 1 week:
/chat/respond: 60% Haiku, 35% Sonnet, 5% Opus# Original system prompt: 1,800 tokens
original_prompt = """You are a helpful AI assistant. Your role is to provide accurate, helpful, and engaging responses to user questions. You should always be polite and professional. Here are some examples of good responses:
Example 1: [300 tokens of example]
Example 2: [300 tokens of example]
Example 3: [300 tokens of example]
Example 4: [300 tokens of example]
Remember to always follow these guidelines: [400 tokens of detailed rules]
"""
# Compressed version: 720 tokens (60% reduction)
compressed_prompt = """You are a helpful AI assistant providing accurate, engaging responses.
Best examples:
- [100 token example 1]
- [100 token example 2]
Guidelines: [320 tokens of essential rules only]
"""
A/B Test Results:
Trade-off analysis shown to stakeholders:
This skill should NOT be used for:
cost-accrual-tracker for compute, storage, networking costsai-engineer for training optimization and compute allocationrag-document-ingestion-pipeline for embedding and retrieval optimizationmodel-serving-api-builder for inference server tuningdata-pipeline-builder for ETL cost optimizationDelegate to other skills when:
ai-engineerrag-document-ingestion-pipelinellm-response-caching-layercost-accrual-trackertools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.