skills/practices/llm-cost-audit/SKILL.md
# LLM Cost Audit
Audit checklist and patterns for optimizing LLM costs in LangChain and LangGraph pipelines. Every LLM call is a billing event, so this skill ensures no money is wasted on avoidable tokens, wrong models, or missed caching opportunities.

---

## Audit Checklist

Run through each section against the target pipeline. Flag any finding as a cost issue with estimated savings.

---

## 1. Prompt Caching

Anthropic prompt caching gives a **90% discount** on cached token reads. Missing it is the single most expensive mistake.
- `cache_control`: Every `ChatAnthropic` call with a static system prompt longer than 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku 3.5) MUST use `cache_control: {"type": "ephemeral"}`.
- The cache prefix follows `tools → system → messages` order. Changing anything earlier invalidates everything later.
- Changing `tool_choice`, web search, citations, speed mode, or image presence between calls breaks the cache.
- Put `cache_control` on the last user message so the full conversation prefix gets cached.
- Serialize any JSON in the prompt deterministically (`json.dumps` with `sort_keys=True`).
- Use `"ttl": "1h"` for infrequent access patterns (costs 2x instead of 1.25x on write).

Minimum cacheable prompt length per model:

| Model | Min tokens |
|-------|-----------|
| Opus 4.5/4.6 | 4,096 |
| Opus 4/4.1, Sonnet 3.7/4/4.5/4.6 | 1,024 |
| Haiku 4.5 | 4,096 |
| Haiku 3/3.5 | 2,048 |
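A minimal sketch of the checklist above, written with plain dicts so it runs without the SDK. The helper names (`stable_json`, `build_system_blocks`, `mark_last_user_message`) are hypothetical; only the `cache_control` block shape and the `ttl` field come from the text above.

```python
import json
from typing import Optional


def stable_json(payload: dict) -> str:
    """Serialize with sorted keys so equal data always yields identical text."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def build_system_blocks(instructions: str, reference_data: dict,
                        ttl: Optional[str] = None) -> list:
    """Build a system prompt as one text block carrying a cache breakpoint."""
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl  # e.g. "1h" for infrequent access
    text = f"{instructions}\n\nReference data:\n{stable_json(reference_data)}"
    return [{"type": "text", "text": text, "cache_control": cache_control}]


def mark_last_user_message(messages: list) -> list:
    """Return a copy of `messages` with a breakpoint on the last user message."""
    marked = [dict(m) for m in messages]
    for msg in reversed(marked):
        if msg["role"] == "user":
            content = msg["content"]
            if isinstance(content, str):
                content = [{"type": "text", "text": content}]
            else:
                content = [dict(block) for block in content]
            content[-1]["cache_control"] = {"type": "ephemeral"}
            msg["content"] = content
            break
    return marked
```

Because the reference data is serialized with sorted keys, two calls that pass the same data in a different key order produce the same prefix text, which is what a cache hit depends on.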
Cache read = 0.1x base input price (90% off). Cache write = 1.25x (5m TTL) or 2x (1h TTL). Break-even: a 5m write pays for itself on the first cache hit, a 1h write on the second, both within the TTL window.
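The break-even point follows directly from those multipliers. The sketch below works in units of "one full-price send of the prefix"; the function names are illustrative, not part of any SDK.

```python
CACHE_READ = 0.10   # cache read multiplier on the base input price
WRITE_5M = 1.25     # cache write multiplier, 5-minute TTL
WRITE_1H = 2.00     # cache write multiplier, 1-hour TTL


def relative_cost(hits: int, write_multiplier: float):
    """Return (cached, uncached) prefix cost for one write plus `hits` reuses."""
    cached = write_multiplier + hits * CACHE_READ
    uncached = 1.0 + hits * 1.0  # the first send plus every reuse at full price
    return cached, uncached


def break_even_hits(write_multiplier: float) -> int:
    """Smallest number of cache hits at which caching is strictly cheaper."""
    hits = 0
    while True:
        cached, uncached = relative_cost(hits, write_multiplier)
        if cached < uncached:
            return hits
        hits += 1


print(break_even_hits(WRITE_5M))  # 5m TTL: 1.35 vs 2.0 after one hit
print(break_even_hits(WRITE_1H))  # 1h TTL: 2.2 vs 3.0 after two hits
```

Under these multipliers a 5m write is ahead after a single hit, while a 1h write needs two, so "two hits within the TTL" is the conservative rule for both.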
Flag as findings:

- `httpx`/`requests` calls to the Anthropic API without `cache_control` set
- `ChatAnthropic` calls that re-send large system prompts without caching

---

## 2. Compaction

Long-running conversations and agentic workflows burn tokens as context grows. Compaction is how you manage this, and doing it wrong can double the cost.
- The built-in compaction strategy (`compact_20260112`) handles edge cases automatically. Use it instead of rolling your own unless you need custom behavior.
- `pause_after_compaction`: If your agent needs to re-attach files, recent messages, or specific instructions after compaction, use `pause_after_compaction: true` and re-inject before continuing.
- `cache_control`: Without a separate cache breakpoint on the system prompt, compaction invalidates the system prompt cache too. Always cache the system prompt separately.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `system_prompt` and `messages` come from the surrounding agent loop
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-6",
    max_tokens=4096,
    system=[{"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}}],  # Survives compaction
    messages=messages,
    context_management={
        "edits": [{
            "type": "compact_20260112",
            "trigger": {"type": "input_tokens", "value": 150000},
            "pause_after_compaction": True,
        }]
    },
)
```

With `pause_after_compaction` enabled, counting compactions gives a rough token budget for the whole run:

```python
TRIGGER = 100_000   # should match the `trigger` value sent in the request
BUDGET = 3_000_000  # total tokens the run is allowed to consume
n_compactions = 0

# Run this check after each response inside the agent loop
if response.stop_reason == "compaction":
    n_compactions += 1
    if n_compactions * TRIGGER >= BUDGET:
        # Inject wrap-up instruction
        messages.append({"role": "user",
                         "content": "Please wrap up and summarize the final state."})
```
| Approach | Cost for 150K token conversation |
|----------|------|
| No compaction (re-send everything) | Full price every turn |
| Compaction with different prefix (cache miss) | ~$4.50 input on Opus |
| Cache-safe compaction (same prefix) | ~$0.45 input (90% cached) |
| Background compaction + cache sharing | ~$0.09 input (cached + pre-built) |
Flag as findings:

- `usage.iterations` not being aggregated for billing (top-level `usage` excludes compaction iterations)
- System prompt without its own `cache_control` breakpoint (re-cached on every compaction)

---

## 3. Model Selection

Opus is 18.75x more expensive than Haiku per token. Sending a classification task to Opus is burning money.
| Model | Input $/1M | Output $/1M | Best for |
|-------|-----------|------------|----------|
| Haiku 3.5 | $0.80 | $4.00 | Classification, extraction, formatting, routing |
| Sonnet 4 | $3.00 | $15.00 | Code gen, analysis, summarization, conversation |
| Opus 4 | $15.00 | $75.00 | Complex reasoning, architecture, multi-step agents |
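One way to apply the table is a lookup from task type to model tier. The sketch below is an assumption-laden illustration: the task-type keys, the tier labels (which are not API model IDs), and the fallback to the mid tier are all hypothetical choices; only the prices come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str               # label only, not an API model ID
    input_per_mtok: float   # USD per 1M input tokens
    output_per_mtok: float  # USD per 1M output tokens


HAIKU = ModelTier("haiku-3.5", 0.80, 4.00)
SONNET = ModelTier("sonnet-4", 3.00, 15.00)
OPUS = ModelTier("opus-4", 15.00, 75.00)

# Hypothetical task-type names mapped to the "Best for" column
ROUTES = {
    "classification": HAIKU, "extraction": HAIKU,
    "formatting": HAIKU, "routing": HAIKU,
    "code_gen": SONNET, "analysis": SONNET,
    "summarization": SONNET, "conversation": SONNET,
    "complex_reasoning": OPUS, "architecture": OPUS,
    "multi_step_agent": OPUS,
}


def pick_model(task_type: str) -> ModelTier:
    """Unknown task types fall back to the mid tier, not the most expensive one."""
    return ROUTES.get(task_type, SONNET)


def call_cost(tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call at list price."""
    return (input_tokens * tier.input_per_mtok
            + output_tokens * tier.output_per_mtok) / 1_000_000
```

For the same token counts, `call_cost(OPUS, ...)` divided by `call_cost(HAIKU, ...)` gives the 18.75x ratio quoted above, since both the input and output prices differ by that factor.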
- Route by task type: classify → Haiku, analyze → Sonnet, architect → Opus.

Flag as findings:

- The same `model=` parameter across all LLM calls in the pipeline

---

## 4. Extended Thinking

Thinking tokens are billed at output token rates. On Opus, 10K thinking tokens = $0.75.
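The arithmetic behind that figure, as a small sketch. The helper names are hypothetical; the 3,000-token default mirrors the starting budget this checklist recommends, and the $75 rate is the Opus output price from the model table.

```python
OPUS_OUTPUT_PER_MTOK = 75.00  # USD per 1M output tokens


def thinking_cost_usd(thinking_tokens: int,
                      output_price_per_mtok: float = OPUS_OUTPUT_PER_MTOK) -> float:
    """Thinking tokens are priced like output tokens."""
    return thinking_tokens * output_price_per_mtok / 1_000_000


def thinking_config(budget_tokens: int = 3_000) -> dict:
    """Build a `thinking` parameter that always carries an explicit cap."""
    return {"type": "enabled", "budget_tokens": budget_tokens}


print(thinking_cost_usd(10_000))   # the 10K-token case from the text
print(thinking_cost_usd(30_000))   # an uncapped 30K-token run
print(thinking_config())
```

A capped 3,000-token budget on Opus costs at most about $0.23 of thinking per call, against $2.25 for a 30K-token run.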
- If `thinking={"type": "enabled"}` is set globally or on classification/extraction nodes, it's wasting tokens.
- `budget_tokens` cap: Without a cap, the model can generate 30K+ thinking tokens. Always set `budget_tokens` (start with 3,000, increase only if quality demands it).

Flag as findings:

- `thinking={"type": "enabled"}` without `budget_tokens`

---

## 5. LangGraph State

LangGraph checkpoints the entire state at every node. Bloated state = bloated checkpoints + wasted input tokens when state is passed to LLM calls.
- `messages: Annotated[list, add_messages]` grows forever in conversational graphs. Must have trimming or summarization.
- `RemoveMessage` usage: In long-running conversations, old messages should be removed or summarized to bound token usage.
- `trim_messages`: Before every LLM call, messages should be trimmed to a token budget.

```python
from langchain_core.messages import trim_messages
from langgraph.graph import MessagesState

# `llm` is the chat model used by this node, configured elsewhere


async def call_model(state: MessagesState):
    # Keep the system message plus the most recent messages that fit the budget
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4000,
        strategy="last",
        token_counter=llm,
        include_system=True,
    )
    return {"messages": [await llm.ainvoke(trimmed)]}
```
Flag as findings:

- State with `raw_transcript: str` or `full_response: dict` fields
- A `results: list` that appends every intermediate output

---

## 6. Token Optimization

Every token costs money. Reduce input tokens, bound output tokens, and use structured output.

- Short keys (e.g., `"n"` vs `"full_name"`) save tokens in large payloads.
- `max_tokens` set: Always set `max_tokens` appropriate to the task. Classification: 50-100. Extraction: 200-500. Generation: 500-2000.
- `stop_sequences`: Use `stop_sequences` to halt generation at known endpoints (e.g., `"}"`, ```` "```" ````, `"</result>"`).
- Use `tool_use` to force schema-compliant JSON output. Typically 50-70% fewer output tokens than free-form.
- Prefill the assistant turn (`{"role": "assistant", "content": "{"}`) to skip preamble.

Flag as findings:

- Missing `max_tokens` parameter on LLM calls
- Free-form JSON generation where `tool_use` would work

---

## 7. Batch API

Anthropic's Batch API gives a 50% discount on both input and output tokens. 24-hour SLA, up to 100K requests per batch.
- Use `client.messages.batches.create()` for work that tolerates the 24-hour turnaround.

Batch API discount (50%) stacks with prompt caching read discount (90%). A cached batch read costs 0.05x the standard input price.
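A sketch of preparing a batch, under stated assumptions: `build_batch_requests` is a hypothetical helper, the document IDs and prompt are made up, and the payloads are plain dicts so the example runs offline. Each entry pairs a `custom_id` with the `params` of an ordinary Messages call, which is the shape the Batch API expects.

```python
def build_batch_requests(documents: dict, model: str, system_prompt: str,
                         max_tokens: int = 100) -> list:
    """Turn {doc_id: text} into batch entries that share one cached system prompt."""
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": [{
                    "type": "text",
                    "text": system_prompt,
                    # Identical prefix in every request, so cache reads can apply
                    "cache_control": {"type": "ephemeral"},
                }],
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]


def cached_batch_read_multiplier(batch_discount: float = 0.5,
                                 cache_read: float = 0.1) -> float:
    """Stacked multiplier on the base input price for a cached read in a batch."""
    return batch_discount * cache_read


requests = build_batch_requests(
    {"doc-1": "First ticket text", "doc-2": "Second ticket text"},
    model="claude-3-5-haiku-20241022",
    system_prompt="Classify the ticket as bug, feature, or question.",
)
# With the SDK installed and an API key set, the list would be submitted as:
#   client.messages.batches.create(requests=requests)
```

Whether a given request in a batch actually hits the cache depends on processing order and timing, so treat the 0.05x figure as the best case rather than a guarantee.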
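---

## 8. Cost Tracking

You can't optimize what you don't measure.

- Track `input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens` from `response.usage` or `AIMessage.usage_metadata`.

```python
from langchain_core.callbacks import BaseCallbackHandler

# USD per 1M tokens; Sonnet 4 rates, swap in the rates for the model in use
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 3.00, 15.00


class CostTracker(BaseCallbackHandler):
    def __init__(self):
        self.total_cost = 0.0

    def on_llm_end(self, response, **kwargs):
        for gen_list in response.generations:
            for gen in gen_list:
                # Chat generations carry an AIMessage; usage_metadata may be None
                usage = getattr(getattr(gen, "message", None), "usage_metadata", None)
                if not usage:
                    continue
                # Simplified: plain input rate for all input tokens, no cache pricing
                self.total_cost += (
                    usage["input_tokens"] * INPUT_PER_MTOK
                    + usage["output_tokens"] * OUTPUT_PER_MTOK
                ) / 1_000_000
```

A node can also enforce a per-run budget:

```python
async def budget_aware_node(state):
    # Stop before making another call once the run has used its budget
    if state.get("cumulative_cost_usd", 0) >= state.get("budget_usd", 1.0):
        return {"errors": ["Cost budget exceeded"], "should_terminate": True}
    # ... proceed with LLM call, update cumulative_cost_usd ...
```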
---

## 9. Graph Architecture

LangGraph subgraphs isolate state, preventing token bleed across concerns. Conditional execution skips expensive nodes when they're not needed.
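Conditional execution usually comes down to small routing functions that read the state and return the name of the next node. The sketch below is illustrative only: the state fields and node names (`deep_analysis`, `finalize`, `retry`, `give_up`) are hypothetical, and the functions are plain Python so they run without LangGraph installed.

```python
from typing import TypedDict


class PipelineState(TypedDict, total=False):
    needs_deep_analysis: bool
    validation_passed: bool
    retry_count: int
    max_retries: int


def route_after_triage(state: PipelineState) -> str:
    """Skip the expensive node unless triage asked for it."""
    return "deep_analysis" if state.get("needs_deep_analysis") else "finalize"


def route_after_validation(state: PipelineState) -> str:
    """Retry a failed step, but only up to a fixed number of attempts."""
    if state.get("validation_passed"):
        return "finalize"
    if state.get("retry_count", 0) >= state.get("max_retries", 2):
        return "give_up"  # bounded: no further LLM calls are made
    return "retry"
```

Functions of this shape can be passed to `add_conditional_edges` on a LangGraph `StateGraph`; the retry node would be responsible for incrementing `retry_count`.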
Flag as findings:

- Retry loops without `max_retries` guards

---

## 10. Response Caching

Beyond Anthropic's prompt caching, LangChain offers response-level caching that avoids API calls entirely for repeated inputs.

- `InMemoryCache` for dev/single-process, `SQLiteCache` for single-server production, `RedisCache` for multi-server, `RedisSemanticCache` for fuzzy matching.
- Disable the cache for creative calls (`cache=False`).

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

set_llm_cache(InMemoryCache())

# Per-LLM override
deterministic_llm = ChatAnthropic(model="claude-3-5-haiku-20241022", cache=True)
creative_llm = ChatAnthropic(model="claude-sonnet-4-20250514", cache=False)
```

Flag as findings:

- No `set_llm_cache()` call anywhere in the codebase

---

## Pricing Reference

Prices in USD per 1M tokens:

| Model | Input | Output | Cache write (5m) | Cache read | Batch input | Batch output |
|-------|-------|--------|-----------------|------------|-------------|-------------|
| Opus 4 | $15.00 | $75.00 | $18.75 | $1.50 | $7.50 | $37.50 |
| Sonnet 4 | $3.00 | $15.00 | $3.75 | $0.30 | $1.50 | $7.50 |
| Haiku 3.5 | $0.80 | $4.00 | $1.00 | $0.08 | $0.40 | $2.00 |
| Configuration | Cost |
|---------------|------|
| Opus, no optimization | $52.50 |
| Sonnet, no optimization | $10.50 |
| Haiku, no optimization | $3.60 |
| Sonnet + prompt caching (1.5K cached) | $5.55 |
| Sonnet + batch | $5.25 |
| Haiku + batch + caching | $0.99 |
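To estimate a configuration for your own workload, the per-million rates from the pricing table can be combined in a few lines. This is a sketch, not a billing tool: the model labels, the function name, and the token volumes in the usage lines are hypothetical, and `input_tokens` here means uncached input only.

```python
# USD per 1M tokens, copied from the pricing table
PRICING = {
    "opus-4": {"input": 15.00, "output": 75.00, "cache_write_5m": 18.75, "cache_read": 1.50},
    "sonnet-4": {"input": 3.00, "output": 15.00, "cache_write_5m": 3.75, "cache_read": 0.30},
    "haiku-3.5": {"input": 0.80, "output": 4.00, "cache_write_5m": 1.00, "cache_read": 0.08},
}
BATCH_DISCOUNT = 0.5  # Batch API halves the price


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  cache_read_tokens: int = 0, cache_write_tokens: int = 0,
                  batch: bool = False) -> float:
    """Estimated USD cost for a workload, with optional caching and batching."""
    p = PRICING[model]
    cost = (
        input_tokens * p["input"]
        + output_tokens * p["output"]
        + cache_read_tokens * p["cache_read"]
        + cache_write_tokens * p["cache_write_5m"]
    ) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


# Hypothetical workload: 1M input tokens and 500K output tokens in total
print(estimate_cost("sonnet-4", 1_000_000, 500_000))
print(estimate_cost("sonnet-4", 1_000_000, 500_000, batch=True))
```

Feeding real totals from the tracked usage fields into a function like this makes it easy to compare configurations before changing the pipeline.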