LLM Cost Audit

Audit checklist and patterns for optimizing LLM costs in LangChain and LangGraph pipelines. Every LLM call is a billing event — this skill ensures no money is wasted on avoidable tokens, wrong models, or missed caching opportunities.

Audit Checklist

Run through each section against the target pipeline. Flag any finding as a cost issue with estimated savings.

1. Prompt Caching

Anthropic prompt caching gives a 90% discount on cached token reads. Missing it is the single most expensive mistake.

What to check

System prompts with cache_control: Every ChatAnthropic call with a static system prompt longer than the model's minimum cacheable length (1,024 to 4,096 tokens depending on the model; see the table below) MUST use cache_control: {"type": "ephemeral"}.
Cache hierarchy preserved: Content must follow tools → system → messages order. Changing anything earlier invalidates everything later.
No cache-breaking mutations: Toggling tool_choice, web search, citations, speed mode, or image presence between calls breaks the cache.
Multi-turn caching: Place cache_control on the last user message so the full conversation prefix gets cached.
JSON key ordering stability: Tool use content blocks must serialize with stable key order (Python json.dumps with sort_keys=True).
Max 4 breakpoints per request: The API rejects requests that carry more than 4 cache_control blocks, so spend them on the largest stable prefixes.
20-block lookback window: If you have more than 20 content blocks before the last breakpoint, add an earlier breakpoint.
TTL awareness: Default is 5 minutes (refreshed on each hit). Use "ttl": "1h" for infrequent access patterns (costs 2x instead of 1.25x on write).
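A minimal sketch of the breakpoint placement described above, built as a plain request dictionary so it runs without network access. The helper name `build_cached_request` and the sample conversation are illustrative assumptions, not part of any SDK.

```python
from typing import Any


def build_cached_request(
    system_prompt: str,
    messages: list[dict[str, Any]],
    model: str = "claude-sonnet-4-20250514",
    max_tokens: int = 1024,
) -> dict[str, Any]:
    """Return Messages API keyword arguments with two cache breakpoints.

    Breakpoint 1 sits on the static system prompt. Breakpoint 2 sits on the
    last user message so the whole conversation prefix can be cached.
    """
    system = [
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ]

    # Shallow-copy each message so the caller's list is not mutated.
    out = [dict(m) for m in messages]
    for msg in reversed(out):
        if msg["role"] == "user":
            content = msg["content"]
            # Normalize a plain string into a list of content blocks.
            if isinstance(content, str):
                content = [{"type": "text", "text": content}]
            else:
                content = [dict(block) for block in content]
            content[-1]["cache_control"] = {"type": "ephemeral"}
            msg["content"] = content
            break

    return {"model": model, "max_tokens": max_tokens, "system": system, "messages": out}


request = build_cached_request(
    "You are a support agent. <long static instructions here>",
    [
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Can you share the order number?"},
        {"role": "user", "content": "It is 12345."},
    ],
)
```

With the anthropic SDK the result can be passed as client.messages.create(**request). Only the final user turn carries a breakpoint, which keeps the request well under the 4-breakpoint limit.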

Minimum cacheable tokens by model

| Model | Min tokens |
|-------|-----------|
| Opus 4.5/4.6 | 4,096 |
| Opus 4/4.1, Sonnet 3.7/4/4.5/4.6 | 1,024 |
| Haiku 4.5 | 4,096 |
| Haiku 3/3.5 | 2,048 |

Pricing impact

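Cache read = 0.1x base input price (90% off). Cache write = 1.25x (5m TTL) or 2x (1h TTL). Break-even: with the 5m TTL, caching pays for itself on the first cache hit (1.25x + 0.1x = 1.35x versus 2x uncached); with the 1h TTL it takes two hits (2x + 0.2x = 2.2x versus 3x uncached).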
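The break-even point follows directly from the multipliers. The sketch below counts costs in units of one uncached pass over the prefix; the function name is an illustrative assumption.

```python
CACHE_READ_MULT = 0.1  # cache read price relative to the base input price


def hits_to_break_even(write_mult: float, max_hits: int = 100) -> int:
    """Smallest number of cache hits after which caching is cheaper overall."""
    for hits in range(1, max_hits + 1):
        cached = write_mult + hits * CACHE_READ_MULT  # one write, then reads
        uncached = 1.0 * (1 + hits)  # every request pays the full input price
        if cached < uncached:
            return hits
    raise ValueError("caching never breaks even within max_hits")


print(hits_to_break_even(1.25))  # 5-minute TTL write multiplier
print(hits_to_break_even(2.0))   # 1-hour TTL write multiplier
```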

Red flags

Raw httpx/requests calls to the Anthropic API whose request bodies set no cache_control on any content block
ChatAnthropic calls that re-send large system prompts without caching
System prompts below minimum threshold that could be padded with useful few-shot examples to cross the line

2. Compaction & Context Window Management

Long-running conversations and agentic workflows burn tokens as context grows. Compaction is how you manage this — and doing it wrong can double the cost.

What to check

No compaction strategy at all: Any agent or chat that can exceed 50% of the context window MUST have a compaction plan. Without one, you either hit the limit and fail, or re-send the entire growing conversation on every turn.
Compaction calls with different prefix: When summarizing a conversation, the compaction call MUST use the exact same system prompt, tools, and message prefix as the parent conversation. This ensures a cache hit (1/10 price). A separate "summarizer" prompt pays full price for all input tokens.
Not using the Compaction API: Anthropic's server-side compaction (compact_20260112) handles edge cases automatically. Use it instead of rolling your own unless you need custom behavior.
Reactive-only compaction: Blocking the user for 30-60s when context fills is a bad experience. Use background compaction: trigger summary creation at a soft threshold (e.g., 75% full), swap instantly at the hard threshold.
No total token budget enforcement: Long agentic tasks with many tool iterations can consume millions of tokens across compaction cycles. Track compaction count × trigger threshold to estimate cumulative spend.
Missing pause_after_compaction: If your agent needs to re-attach files, recent messages, or specific instructions after compaction, use pause_after_compaction: true and re-inject before continuing.
System prompt without its own cache_control: Without a separate cache breakpoint on the system prompt, compaction invalidates the system prompt cache too. Always cache the system prompt separately.
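The soft/hard threshold idea for background compaction can be sketched as a small decision function. The 75% soft threshold comes from the text; the 90% hard threshold, the enum, and the function name are assumptions for illustration.

```python
from enum import Enum


class CompactionAction(Enum):
    NONE = "none"
    START_BACKGROUND_SUMMARY = "start_background_summary"
    SWAP_IN_SUMMARY = "swap_in_summary"


def compaction_action(
    input_tokens: int,
    context_window: int,
    summary_ready: bool,
    soft: float = 0.75,  # from the text: begin summarizing around 75% full
    hard: float = 0.90,  # assumption: the text does not fix a hard threshold
) -> CompactionAction:
    """Decide what to do before the next turn, given current context usage."""
    fill = input_tokens / context_window
    if fill >= hard and summary_ready:
        # The summary was prepared in the background, so the swap is instant.
        return CompactionAction.SWAP_IN_SUMMARY
    if fill >= soft and not summary_ready:
        # Kick off (or fall back to) summary creation without blocking the user.
        return CompactionAction.START_BACKGROUND_SUMMARY
    return CompactionAction.NONE
```

A caller would evaluate this after each response using the reported input token count, and start the summary call with the same system prompt, tools, and message prefix so that it reads from the cache.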

Compaction API usage

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-6",
    max_tokens=4096,
    system=[{"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}}],  # Survives compaction
    messages=messages,
    context_management={
        "edits": [{
            "type": "compact_20260112",
            "trigger": {"type": "input_tokens", "value": 150000},
            "pause_after_compaction": True,
        }]
    },
)

Budget enforcement with compaction counting

TRIGGER = 100_000
BUDGET = 3_000_000
n_compactions = 0

if response.stop_reason == "compaction":
    n_compactions += 1
    if n_compactions * TRIGGER >= BUDGET:
        # Inject wrap-up instruction
        messages.append({"role": "user",
            "content": "Please wrap up and summarize the final state."})

Cost comparison: compaction approaches

| Approach | Cost for 150K token conversation |
|----------|------|
| No compaction (re-send everything) | Full price every turn |
| Compaction with different prefix (cache miss) | ~$4.50 input on Opus |
| Cache-safe compaction (same prefix) | ~$0.45 input (90% cached) |
| Background compaction + cache sharing | ~$0.09 input (cached + pre-built) |

Red flags

Agent or chat with no context window management
Compaction using a separate "summarizer" system prompt
Blocking user during compaction (no background strategy)
No compaction count tracking for budget enforcement
usage.iterations not being aggregated for billing (top-level usage excludes compaction iterations)
System prompt missing cache_control breakpoint (re-cached on every compaction)

3. Model Selection & Routing

Opus is 18.75x more expensive than Haiku per token. Sending a classification task to Opus is burning money.

Model tiers

| Model | Input $/1M | Output $/1M | Best for |
|-------|-----------|------------|----------|
| Haiku 3.5 | $0.80 | $4.00 | Classification, extraction, formatting, routing |
| Sonnet 4 | $3.00 | $15.00 | Code gen, analysis, summarization, conversation |
| Opus 4 | $15.00 | $75.00 | Complex reasoning, architecture, multi-step agents |

What to check

Single model for all tasks: If every node in the graph uses the same model, there's a routing opportunity. Classification/extraction nodes should use Haiku.
Opus for simple tasks: Any Opus call that does classification, extraction, formatting, or routing is overspending.
No cascading/fallback: For quality-sensitive tasks, try Haiku → Sonnet → Opus escalation rather than always hitting Opus.
Tool use on Haiku: Haiku's tool use is less reliable. If a Haiku node uses tools and fails often, upgrade to Sonnet for that node only.

Routing patterns

Rule-based (zero overhead): Route by task type — classify → Haiku, analyze → Sonnet, architect → Opus.
Classifier-based: Use Haiku to classify complexity, then route. Adds one cheap call but adapts to input.
Cascading: Start cheap, escalate on low confidence. Best for variable-complexity inputs.
Parallel racing: Send to multiple models, take the first/best. Only for latency-critical paths.
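A rough sketch of the rule-based and cascading patterns. The model IDs are examples, the task-type names are assumptions, and `call_model` is a hypothetical caller-supplied function returning an answer together with a confidence score in [0, 1].

```python
from typing import Callable

HAIKU = "claude-3-5-haiku-20241022"
SONNET = "claude-sonnet-4-20250514"
OPUS = "claude-opus-4-20250514"

MODEL_BY_TASK = {
    "classify": HAIKU,
    "extract": HAIKU,
    "analyze": SONNET,
    "architect": OPUS,
}
CASCADE = [HAIKU, SONNET, OPUS]  # cheapest first


def route_by_task(task_type: str) -> str:
    """Rule-based routing; unknown task types fall back to the mid tier."""
    return MODEL_BY_TASK.get(task_type, SONNET)


def cascade(
    prompt: str,
    call_model: Callable[[str, str], tuple[str, float]],
    min_confidence: float = 0.8,
) -> tuple[str, str]:
    """Try cheap models first and escalate while confidence stays below the bar."""
    model, answer = CASCADE[0], ""
    for model in CASCADE:
        answer, confidence = call_model(model, prompt)
        if confidence >= min_confidence:
            break
    return model, answer


def fake_call(model: str, prompt: str) -> tuple[str, float]:
    """Stand-in for a real LLM call: only the larger models are confident."""
    return f"answer from {model}", 0.5 if model == HAIKU else 0.9


chosen_model, _ = cascade("Summarize the incident report.", fake_call)
```

In this stubbed run the cascade stops at the Sonnet tier, so the Opus call is never made. How confidence is measured (a self-reported score, a validator, a schema check) is left to the pipeline.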

Red flags

Same model= parameter across all LLM calls in the pipeline
Opus used for extraction, tagging, or formatting
No quality evaluation to know whether a cheaper model would suffice

4. Extended Thinking

Thinking tokens are billed at output token rates. On Opus, 10K thinking tokens = $0.75.

What to check

Extended thinking on simple tasks: If thinking={"type": "enabled"} is set globally or on classification/extraction nodes, it's wasting tokens.
No budget_tokens cap: Without a cap, the model can generate 30K+ thinking tokens. Always set budget_tokens (start with 3,000, increase only if quality demands it).
Thinking on Haiku: Haiku 3.5 doesn't support extended thinking. If the code tries to enable it, it's a no-op or error.
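To make the cost of thinking tokens concrete, the sketch below prices a thinking budget at the output rate and enables thinking only for an allow-list of task types. The allow-list names and the helper functions are assumptions for illustration.

```python
from typing import Optional

OPUS_OUTPUT_PER_MTOK = 75.00  # USD per 1M output tokens, Opus 4 row of the model tiers table

# Hypothetical task-type names; adapt to the node names in your graph.
THINKING_TASKS = {"math", "debugging", "multi_variable_decision"}


def thinking_cost_usd(thinking_tokens: int, output_price_per_mtok: float) -> float:
    """Thinking tokens are billed at the output token rate."""
    return thinking_tokens / 1_000_000 * output_price_per_mtok


def thinking_config(task_type: str, budget_tokens: int = 3_000) -> Optional[dict]:
    """Return a capped thinking config for allow-listed tasks, else None."""
    if task_type in THINKING_TASKS:
        return {"type": "enabled", "budget_tokens": budget_tokens}
    return None


print(thinking_cost_usd(10_000, OPUS_OUTPUT_PER_MTOK))
print(thinking_config("classification"))
```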

When to use extended thinking

Math, logic, or formal reasoning
Debugging that requires step-by-step tracing
Multi-variable decision-making
Tasks where standard mode demonstrably produces worse results

Red flags

thinking={"type": "enabled"} without budget_tokens
Extended thinking on every node in the graph
No evidence that thinking improves quality for the specific task

5. State Management (LangGraph)

LangGraph checkpoints the entire state at every node. Bloated state = bloated checkpoints + wasted input tokens when state is passed to LLM calls.

What to check

Unbounded message lists: messages: Annotated[list, add_messages] grows forever in conversational graphs. Must have trimming or summarization.
Large objects in state: Full transcripts, raw API responses, embedding vectors, or binary data stored in state fields. These should be in a database with only references in state.
No RemoveMessage usage: In long-running conversations, old messages should be removed or summarized to bound token usage.
Missing trim_messages: Before every LLM call, messages should be trimmed to a token budget.
State fields that only one node reads: These add checkpoint overhead without benefiting the graph.
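One way to keep large objects out of checkpointed state is to store them elsewhere and carry only a key. In this sketch an in-memory dictionary stands in for a real database, and all names are illustrative assumptions.

```python
import uuid
from typing import TypedDict

# Stand-in for a database or object store (assumption for illustration).
TRANSCRIPT_STORE: dict[str, str] = {}


class PipelineState(TypedDict):
    transcript_id: str  # small reference, cheap to checkpoint at every node
    summary: str        # only the distilled text that LLM nodes actually need


def store_transcript(raw_transcript: str) -> str:
    """Persist the large payload outside the graph state and return its key."""
    key = uuid.uuid4().hex
    TRANSCRIPT_STORE[key] = raw_transcript
    return key


def load_transcript(state: PipelineState) -> str:
    """Fetch the payload only in the node that really needs it."""
    return TRANSCRIPT_STORE[state["transcript_id"]]


state = PipelineState(transcript_id=store_transcript("speaker A: hello\n" * 500), summary="")
```

The checkpoint then holds a 32-character key instead of the full transcript.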

Trimming pattern

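from langchain_anthropic import ChatAnthropic
from langchain_core.messages import trim_messages
from langgraph.graph import MessagesState

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

async def call_model(state: MessagesState):
    # Keep only the most recent messages that fit the token budget.
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4000,
        strategy="last",
        token_counter=llm,  # use the model's own token counting
        include_system=True,  # always keep the system message
    )
    # The node is async because the model call is awaited.
    return {"messages": [await llm.ainvoke(trimmed)]}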

Red flags

State TypedDict with raw_transcript: str or full_response: dict fields
No message trimming before LLM calls
Checkpoint storage growing unboundedly over time
Large list fields without reducers (e.g., results: list that appends every intermediate output)

6. Token Optimization

Every token costs money. Reduce input tokens, bound output tokens, and use structured output.

Input token reduction

Verbose system prompts: Rewrite "Please ensure that you carefully..." → direct imperatives. Saves 20-40% of prompt tokens.
Too many few-shot examples: Use 1-2 high-quality examples instead of 5-10. Dynamic selection by similarity is even better.
Full documents where excerpts suffice: Only include relevant chunks, not entire documents.
JSON keys in prompts: Short keys with a legend ("n" vs "full_name") save tokens in large payloads.
XML over JSON for Claude prompt structure: Claude was trained on XML; XML tags are more token-efficient for structured instructions.

Output token reduction

No max_tokens set: Always set max_tokens appropriate to the task. Classification: 50-100. Extraction: 200-500. Generation: 500-2000.
No stop_sequences: Use stop_sequences to halt generation at known endpoints (e.g., "}", "```", "</result>").
Free-form output for structured data: Use tool_use to force schema-compliant JSON output. Typically 50-70% fewer output tokens than free-form.
No assistant prefilling: Prefill the assistant response (e.g., {"role": "assistant", "content": "{"}) to skip preamble.
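A sketch of bounding output for an extraction call: a per-task max_tokens table taken from the upper bounds above, plus a forced tool call so the model returns schema-shaped JSON instead of prose. The tool name, schema fields, and helper name are hypothetical.

```python
from typing import Any

# Upper bounds from the guidance above.
MAX_TOKENS_BY_TASK = {"classification": 100, "extraction": 500, "generation": 2000}


def build_extraction_request(
    text: str, model: str = "claude-3-5-haiku-20241022"
) -> dict[str, Any]:
    """Return Messages API keyword arguments that force a structured tool call."""
    tool = {
        "name": "record_contact",  # hypothetical tool name
        "description": "Record the contact details found in the text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["name", "email"],
        },
    }
    return {
        "model": model,
        "max_tokens": MAX_TOKENS_BY_TASK["extraction"],
        "tools": [tool],
        # Forcing this tool removes free-form preamble from the output.
        "tool_choice": {"type": "tool", "name": tool["name"]},
        "messages": [{"role": "user", "content": text}],
    }


extraction_request = build_extraction_request("Reach Dana at dana@example.com")
```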

Red flags

No max_tokens parameter on LLM calls
Free-form text output parsed with regex when tool_use would work
System prompts with verbose explanatory language
Five or more few-shot examples included in every call
Full documents in context when only a summary or specific section is needed

7. Batch API Opportunities

Anthropic's Batch API gives a 50% discount on both input and output tokens. 24-hour SLA, up to 100K requests per batch.

What to check

Offline processing not using batch: Any pipeline step that processes historical data, runs evaluations, generates bulk content, or validates in batch should use client.messages.batches.create().
Real-time API for batch-eligible work: If the results aren't needed within seconds, they're batch-eligible.

Batch-eligible workloads

Bulk document processing / summarization
Dataset annotation and labeling
Evaluation and testing runs
Content generation pipelines
Module validation sweeps
Correction aggregation
Historical transcript analysis

Batch + caching stacking

Batch API discount (50%) stacks with prompt caching read discount (90%). A cached batch read costs 0.05x the standard input price.
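A sketch of preparing a batch in which every request shares the same cached system block, so the two discounts can combine. The helper name and document IDs are assumptions; the request shape (custom_id plus params) is the one client.messages.batches.create() expects.

```python
from typing import Any

BATCH_MULT = 0.5        # batch discount on all token prices
CACHE_READ_MULT = 0.1   # cache read price relative to base input price
STACKED_INPUT_MULT = BATCH_MULT * CACHE_READ_MULT  # 0.05x for cached batch reads


def build_batch_requests(
    documents: dict[str, str],
    system_prompt: str,
    model: str = "claude-3-5-haiku-20241022",
    max_tokens: int = 500,
) -> list[dict[str, Any]]:
    """One batch entry per document, all sharing an identical cacheable prefix."""
    system = [
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ]
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": system,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]


batch_requests = build_batch_requests(
    {"doc-1": "First transcript...", "doc-2": "Second transcript..."},
    "Summarize the transcript in three bullet points.",
)
```

The list can then be submitted with client.messages.batches.create(requests=batch_requests). Cache hits inside a batch depend on processing order and timing, so treat the 0.05x figure as a best case.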

Red flags

Historical data processed one-by-one through the real-time API
Evaluation suites that call the standard API instead of batch
Any loop that processes 100+ items sequentially through the API

8. Cost Tracking & Observability

You can't optimize what you don't measure.

What to check

No token tracking: Every LLM call should track input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens from response.usage or AIMessage.usage_metadata.
No cost attribution: Costs should be tagged by pipeline, customer, and step so you can identify the most expensive operations.
No budget enforcement: Production pipelines should have per-run cost budgets that halt execution if exceeded.
No observability platform: LangSmith or Langfuse should be integrated for trace-level cost visibility.

Callback pattern for LangChain

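from langchain_core.callbacks import BaseCallbackHandler

class CostTracker(BaseCallbackHandler):
    """Collects per-call usage so cost can be computed from model pricing."""

    def __init__(self):
        self.total_cost = 0.0
        self.usages = []  # one usage_metadata dict per completed LLM call

    def on_llm_end(self, response, **kwargs):
        for gen_list in response.generations:
            for gen in gen_list:
                # Only chat generations carry a message; usage_metadata may be None.
                message = getattr(gen, "message", None)
                usage = getattr(message, "usage_metadata", None)
                if usage:
                    self.usages.append(usage)
                    # Calculate cost from usage + model pricing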
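The cost calculation left as a comment in the callback could look roughly like the sketch below. It uses the Anthropic response.usage field names listed earlier and assumes input_tokens excludes cached tokens; LangChain's usage_metadata may report these counts under different keys, so check before reusing it there. The tier keys and function name are illustrative.

```python
# USD per 1M tokens, copied from the pricing table in this document.
PRICES = {
    "opus-4": {"input": 15.00, "output": 75.00, "cache_write": 18.75, "cache_read": 1.50},
    "sonnet-4": {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30},
    "haiku-3.5": {"input": 0.80, "output": 4.00, "cache_write": 1.00, "cache_read": 0.08},
}


def cost_usd(usage: dict, tier: str) -> float:
    """Price one call from its usage counts (missing fields count as zero)."""
    p = PRICES[tier]
    total = (
        usage.get("input_tokens", 0) * p["input"]
        + usage.get("output_tokens", 0) * p["output"]
        + usage.get("cache_creation_input_tokens", 0) * p["cache_write"]
        + usage.get("cache_read_input_tokens", 0) * p["cache_read"]
    )
    return total / 1_000_000


example_usage = {
    "input_tokens": 500,
    "output_tokens": 500,
    "cache_read_input_tokens": 1_500,
}
print(cost_usd(example_usage, "sonnet-4"))
```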

Budget enforcement pattern

async def budget_aware_node(state):
    if state.get("cumulative_cost_usd", 0) >= state.get("budget_usd", 1.0):
        return {"errors": ["Cost budget exceeded"], "should_terminate": True}
    # ... proceed with LLM call, update cumulative_cost_usd ...

Red flags

No callbacks or usage tracking on any LLM instance
No per-run or per-pipeline cost logging
No alerts for cost anomalies
Running in production without LangSmith/Langfuse tracing

9. Subgraph & Conditional Execution

LangGraph subgraphs isolate state, preventing token bleed across concerns. Conditional execution skips expensive nodes when they're not needed.

What to check

Monolithic graphs: A single graph with 10+ nodes sharing one state accumulates tokens across all nodes. Break into subgraphs that pass only necessary data.
No conditional skipping: Expensive nodes (Sonnet/Opus calls, large context retrieval) should be gated behind conditions. Low-priority inputs can take a cheaper path.
No retry limits: Conditional edges that retry on failure without a cap can loop indefinitely, burning tokens.
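A capped retry loop can be expressed as the routing function behind a conditional edge. The state keys, node names, and the cap of 3 are assumptions for illustration.

```python
MAX_RETRIES = 3  # assumption: choose a cap that fits the pipeline


def route_after_validation(state: dict) -> str:
    """Return the name of the next node based on validation result and retry count."""
    if state.get("is_valid"):
        return "finish"
    if state.get("retry_count", 0) >= MAX_RETRIES:
        # Stop burning tokens: hand off to an error or fallback node.
        return "give_up"
    return "retry"
```

In LangGraph this function would be passed to add_conditional_edges on the validating node, and the retry node must increment retry_count in state for the guard to take effect.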

Red flags

All nodes share the same message history (full accumulation)
No conditional edges to skip expensive processing for simple inputs
Retry loops without max_retries guards
Linear pipelines where intermediate outputs are carried through all downstream nodes

10. Caching at the Application Layer

Beyond Anthropic's prompt caching, LangChain offers response-level caching that avoids API calls entirely for repeated inputs.

What to check

No LangChain cache configured: For deterministic tasks (classification, extraction), identical inputs should return cached results.
Cache type appropriate to deployment: InMemoryCache for dev/single-process, SQLiteCache for single-server production, RedisCache for multi-server, RedisSemanticCache for fuzzy matching.
Caching enabled on non-deterministic nodes: Creative/generative nodes should have caching disabled (cache=False).

Cache configuration

from langchain_anthropic import ChatAnthropic
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Global default: identical prompts are answered from memory.
set_llm_cache(InMemoryCache())

# Per-LLM override
deterministic_llm = ChatAnthropic(model="claude-3-5-haiku-20241022", cache=True)
creative_llm = ChatAnthropic(model="claude-sonnet-4-20250514", cache=False)

Red flags

No set_llm_cache() call anywhere in the codebase
Same prompt sent repeatedly to the API (visible in traces as duplicate calls)
Cache enabled on synthesis/creative nodes (stale outputs)

Pricing Quick Reference

Standard pricing (per 1M tokens)

| Model | Input | Output | Cache write (5m) | Cache read | Batch input | Batch output |
|-------|-------|--------|-----------------|------------|-------------|-------------|
| Opus 4 | $15.00 | $75.00 | $18.75 | $1.50 | $7.50 | $37.50 |
| Sonnet 4 | $3.00 | $15.00 | $3.75 | $0.30 | $1.50 | $7.50 |
| Haiku 3.5 | $0.80 | $4.00 | $1.00 | $0.08 | $0.40 | $2.00 |

Cost multipliers

Cache read: 0.1x input price (90% savings)
Batch API: 0.5x all prices (50% savings)
Batch + cache read: 0.05x input price (95% savings)
Extended thinking: billed at output token rate

Example: 1,000 requests, 2K input + 500 output tokens each

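| Configuration | Cost |
|---------------|------|
| Opus, no optimization | $67.50 |
| Sonnet, no optimization | $13.50 |
| Haiku, no optimization | $3.60 |
| Sonnet + prompt caching (1.5K cached) | $9.45 |
| Sonnet + batch | $6.75 |
| Haiku + batch + caching (1.5K cached) | $1.26 |

Totals are 2M input and 0.5M output tokens at the standard prices above. Caching rows assume every request reads the 1.5K-token prefix from a warm cache; the one-time cache write adds under $0.01. The Haiku caching row is illustrative only: a 1.5K-token prefix is below Haiku 3.5's 2,048-token minimum, so a real pipeline needs a longer prefix to get this rate.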
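The arithmetic behind a comparison of this kind can be reproduced from the standard prices and the cost multipliers. This is a sketch under stated assumptions: every request hits a warm cache, cache writes are ignored, and the function and constant names are illustrative.

```python
REQUESTS = 1_000
INPUT_TOK = 2_000
OUTPUT_TOK = 500
CACHED_TOK = 1_500  # portion of the input read from cache when caching is on

# (input, output) in USD per 1M tokens, from the standard pricing table.
PRICE = {"opus": (15.00, 75.00), "sonnet": (3.00, 15.00), "haiku": (0.80, 4.00)}


def run_cost(model: str, cached: bool = False, batch: bool = False) -> float:
    """Total USD for the whole run, rounded to cents."""
    inp, out = PRICE[model]
    cached_tok = CACHED_TOK if cached else 0
    per_request = (
        (INPUT_TOK - cached_tok) * inp   # uncached input at full price
        + cached_tok * inp * 0.1         # cache reads at 0.1x
        + OUTPUT_TOK * out               # output tokens
    ) / 1_000_000
    total = REQUESTS * per_request
    return round(total * (0.5 if batch else 1.0), 2)  # batch halves everything


for label, kwargs in [
    ("Opus, no optimization", {"model": "opus"}),
    ("Sonnet, no optimization", {"model": "sonnet"}),
    ("Haiku, no optimization", {"model": "haiku"}),
    ("Sonnet + prompt caching", {"model": "sonnet", "cached": True}),
    ("Sonnet + batch", {"model": "sonnet", "batch": True}),
    ("Haiku + batch + caching", {"model": "haiku", "cached": True, "batch": True}),
]:
    print(f"{label}: ${run_cost(**kwargs):.2f}")
```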
