cost-aware-llm-pipeline/SKILL.md
Cost optimization patterns for LLM API usage — model routing by task complexity, budget tracking, retry logic, and prompt caching.
npx skillsauth add lidge-jun/cli-jaw-skills cost-aware-llm-pipelineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.
Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"
_SONNET_TEXT_THRESHOLD = 10_000 # chars
_SONNET_ITEM_THRESHOLD = 30 # items
def select_model(
text_length: int,
item_count: int,
force_model: str | None = None,
) -> str:
"""Select model based on task complexity."""
if force_model is not None:
return force_model
if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
return MODEL_SONNET # Complex task
return MODEL_HAIKU # Simple task (3-4x cheaper)
Track cumulative spend with frozen dataclasses. Each API call returns a new tracker — never mutates state.
from dataclasses import dataclass
@dataclass(frozen=True, slots=True)
class CostRecord:
model: str
input_tokens: int
output_tokens: int
cost_usd: float
@dataclass(frozen=True, slots=True)
class CostTracker:
budget_limit: float = 1.00
records: tuple[CostRecord, ...] = ()
def add(self, record: CostRecord) -> "CostTracker":
"""Return new tracker with added record (never mutates self)."""
return CostTracker(
budget_limit=self.budget_limit,
records=(*self.records, record),
)
@property
def total_cost(self) -> float:
return sum(r.cost_usd for r in self.records)
@property
def over_budget(self) -> bool:
return self.total_cost > self.budget_limit
Retry only on transient errors. Fail fast on authentication or bad request errors.
from anthropic import (
APIConnectionError,
InternalServerError,
RateLimitError,
)
_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3
def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
"""Retry only on transient errors, fail fast on others."""
for attempt in range(max_retries):
try:
return func()
except _RETRYABLE_ERRORS:
if attempt == max_retries - 1:
raise
time.sleep(2 ** attempt) # Exponential backoff
# AuthenticationError, BadRequestError etc. → raise immediately
Cache long system prompts to avoid resending them on every request.
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": system_prompt,
"cache_control": {"type": "ephemeral"}, # Cache this
},
{
"type": "text",
"text": user_input, # Variable part
},
],
}
]
Combine all four techniques in a single pipeline function:
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
# 1. Route model
model = select_model(len(text), estimated_items, config.force_model)
# 2. Check budget
if tracker.over_budget:
raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)
# 3. Call with retry + caching
response = call_with_retry(lambda: client.messages.create(
model=model,
messages=build_cached_messages(system_prompt, text),
))
# 4. Track cost (immutable)
record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
tracker = tracker.add(record)
return parse_result(response), tracker
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Relative Cost | |-------|---------------------|----------------------|---------------| | Haiku 4.5 | $1.00 | $5.00 | 1x | | Sonnet 4.6 | $3.00 | $15.00 | ~3x | | Opus 4.8 | $5.00 | $25.00 | ~5x |
Batch API: 50% discount on all models for async batch processing. Prompt caching: Up to 90% savings on cached input tokens (24-hour retention).
Start with the cheapest model; escalate only when quality is insufficient:
def cascade_call(prompt: str, quality_threshold: float = 0.8) -> tuple[str, str]:
"""Try cheap model first, escalate if quality is low."""
result = call_model(MODEL_HAIKU, prompt)
if evaluate_quality(result) >= quality_threshold:
return result, MODEL_HAIKU
return call_model(MODEL_SONNET, prompt), MODEL_SONNET
Hard-stop when budget is exhausted — never silently overspend:
def check_circuit(tracker: CostTracker) -> None:
"""Raise immediately if budget exceeded. Check BEFORE every call."""
if tracker.total_cost >= tracker.budget_limit:
raise BudgetExceededError(
f"Budget exhausted: ${tracker.total_cost:.2f} / ${tracker.budget_limit:.2f}"
)
Cache semantically similar requests to avoid duplicate API calls:
def get_or_call(prompt: str, cache: dict, similarity_threshold: float = 0.95) -> str:
"""Return cached result for semantically similar prompts."""
for cached_prompt, cached_result in cache.items():
if semantic_similarity(prompt, cached_prompt) >= similarity_threshold:
return cached_result
result = call_model(MODEL_SONNET, prompt)
cache[prompt] = result
return result
development
Native Web UI structured renderer schemas for compose-block drafts, search-results cards, dataframe tables, chart-json charts, and diff output
tools
Unified search hub. Route any web/real-time/X lookup through a 4-tier escalation: built-in web search → cli-jaw browser CDP → progrok Grok OAuth → web-ai (Grok Expert / GPT Pro). Use for: search, 검색, web search, latest news, real-time info, X/Twitter, fact lookup, deep research.
development
UI/UX intent discovery, design vocabulary, product personalities, UX state patterns, typography line break judgment, favicon/product logo design, and logo trust section design. Use when user design direction is vague, when building onboarding/empty/error states, when setting up favicons or product logos, or when referencing a product aesthetic.
development
Canonical owner of module boundary rules, circular dependency detection/prevention, implicit coupling taxonomy, barrel/re-export discipline, and boundary-only defensive programming. Referenced by dev, dev-code-reviewer, dev-backend, dev-frontend stubs.