Input tokens
What the AI reads
Your prompt + system instructions + conversation history + uploaded documents + tool definitions. Claude carries ~5,000 tokens of system overhead before you type a word.
Every AI interaction starts with tokens — the atomic unit of cost. Understanding them is understanding your bill.
A token is roughly ¾ of a word. The sentence "What's the weather today?" is about 6 tokens. A 1,000-word blog post runs approximately 1,333 tokens.
Input tokens
What the AI reads
Your prompt + system instructions + conversation history + uploaded documents + tool definitions. Claude carries ~5,000 tokens of system overhead before you type a word.
Output tokens
What the AI writes
The response, code, analysis — generated sequentially, one token at a time. Always 3–5x more expensive than input because sequential generation is memory-bandwidth limited on GPUs.
Key insight: Output tokens dominate your bill not because there are more of them, but because they cost 3–5x more per token. A model that produces the same answer in fewer output tokens can be cheaper even at a higher per-token price.
NVIDIA now calls data centers "AI token factories." Here's the invisible chain that determines your price.
NVIDIA Hopper → Blackwell
35x cost reduction
From $4.20 to $0.12 per million tokens at the GPU level. The GB300 NVL72 generates 6,000 tokens/GPU — an order of magnitude more than the H200.
Energy efficiency gain
10x tokens per watt
A 1-megawatt Blackwell facility produces 10x the inference output of the same power envelope on Hopper. Power cost is the #1 TCO driver for inference.
Source:NVIDIA InferenceMAX v1 benchmarks (Oct 2025), NVIDIA blog "Lowest Token Cost AI Factories" (Apr 16, 2026), SemiAnalysis benchmark reports. A $5M GB200 NVL72 investment can generate $75M in token revenue — a 15x ROI.
Only the top-tier models — no mini, nano, or budget variants. Three flagships, head to head.
| Claude Opus 4.7 | Claude Opus 4.6 | GPT-5.4 | |
|---|---|---|---|
| Release | Apr 16, 2026 | Feb 4, 2026 | Mar 5, 2026 |
| Input price | $5.00 / MTok | $5.00 / MTok | $2.50 / MTok |
| Output price | $25.00 / MTok | $25.00 / MTok | $15.00 / MTok |
| Context window | 1M tokens | 1M tokens | 1.05M tokens |
| Max output | 128K tokens | 64K tokens | 128K tokens |
| Long-context surcharge | None | None | 2x input above 272K |
| Cached input cost | $0.50 / MTok (90% off) | $0.50 / MTok (90% off) | $1.25 / MTok (50% off) |
| Cache write fee | 1.25x base (5-min TTL) | 1.25x base (5-min TTL) | Free (automatic) |
| Batch discount | 50% | 50% | 50% |
| Tokenizer note | New tokenizer — up to 35% more tokens for same text | Standard | Standard |
Hidden cost traps:Opus 4.7's new tokenizer may inflate effective cost per request even at the same per-token rate. GPT-5.4 doubles input price above 272K tokens. Always test with real prompts — sticker price ≠ task price.
(800 input tokens + 300 output tokens per conversation)
Opus 4.6 (no optimization)
$1,150/mo
GPT-5.4 (no optimization)
$650/mo
Claude stack (optimized)
$314/mo
Haiku 85% + Opus 15% with caching. 73% savings.
Platform-specific strategies ranked by impact. Stack them for 50-90% total savings.
Cache reads cost just $0.50/MTok on Opus (10% of base). Add cache_control to your request. Static content first, dynamic content last. Minimum 1,024 tokens. 5-min TTL writes cost 1.25x; 1-hour writes cost 2x.
Async processing within 24 hours at half price. Opus 4.6/4.7 supports 300K output tokens per batch request. Ideal for content generation, data processing, and evaluation runs.
Haiku 4.5 $1/$5 → Sonnet 4.6 $3/$15 → Opus 4.7 $5/$25. Use intent classification to auto-route. The opusplan alias uses Opus for planning, Sonnet for implementation.
Opus 4.7 adds xhigh effort level. Use low for simple tasks (fewer thinking tokens), high minimum for quality-critical work. Cap thinking tokens at 10,000 for most tasks.
Caching is free and automatic on GPT-5.4+. No code changes needed. Cached input costs $1.25/MTok (50% of base). Extend to 24h with prompt_cache_retention: "24h".
Batch API: 50% off, 24-hour turnaround. Flex tier combines batch pricing with extended caching. Flex showed 23% input cost reduction vs standard Batch in testing.
Nano $0.20/$1.25 → Mini $0.40/$1.60 → Standard $2.50/$15. Mini scores 54.38% on SWE-bench Pro vs Standard's 57.7% at 6x lower cost.
GPT-5.4 input cost doubles above 272K tokens ($2.50 → $5.00). Output jumps to $22.50. Monitor prompt size. Use conversation windowing or summarization for long sessions.
Use Headroom as a local proxy to compress tool outputs, JSON, logs, and code. Works with both Claude and OpenAI. CCR architecture means nothing is lost permanently.
One task per session. Use /compact while cache is warm (within 5 min) or /clear after breaks. Keep CLAUDE.md under 500 tokens. Strip unnecessary fields from tool responses.
Open-source tools developers use to slash AI bills — ranked by community traction and real-world impact.
Makes AI agents talk like a caveman — cutting output verbosity dramatically while preserving technical accuracy. Created by a 19-year-old CS student, went #1 trending on GitHub and #1 on Hacker News.
Works with: Claude Code, Codex, Gemini CLI, Cursor, Windsurf, Copilot, Cline
Context optimization layer that compresses tool outputs, JSON, code, and logs before they hit the LLM. Runs as a local proxy — your data never leaves your machine. Features lossless CCR (Compress-Cache-Retrieve) architecture.
Integrations: Claude, OpenAI, LiteLLM, Bedrock, LangChain, Agno, Strands, MCP
CLI proxy that compresses command outputs (git, cargo, npm, etc.) before they reach the LLM context. Single Rust binary, 100+ supported commands, under 10ms overhead.
Example: git push output drops from ~200 tokens to ~10 tokens
Entropy-based prompt compression that minimizes token complexity without accessing model weights. Supports LangChain and OpenAI JSON format. Protected tags keep critical sections intact.
Academic-grade curated collection of 200+ papers on token pruning, merging, clustering, and compression across vision, language, and multimodal models. Updated through CVPR/ICLR/AAAI 2026.
A single CLAUDE.md file that keeps responses terse. Benchmarked 63% output word reduction. Drop-in for any Claude Code project. Includes profiles for coding, documentation, and review workflows.
Reality check on "caveman" claims: The viral caveman repo claims 65-75% output token savings. Independent benchmarking by dev.to found 14-21% actual savings across real coding tasks, and a simpler 6-line prompt (85 tokens) outperformed the full 552-token skill. Output compression helps, but total session savings are modest because input tokens dominate most sessions. Real savings come from stacking strategies — caching + routing + compression together.
Security-verified skills help agents do the right work with less back-and-forth. Browse the SkillsAuth marketplace for skills you can trust in production.
Browse skill marketplace