skills/claude-4-6-features/prompt-caching-ttl/SKILL.md
Use Claude's prompt caching with 5-minute and 1-hour TTLs to slash costs on repeated context — codebases, system prompts, long documents. Covers cache breakpoints, hit-rate optimization, and the common mistakes that silently disable caching. Use this skill when building apps with repeated large context, optimizing LLM spend, or debugging "why are my cache reads zero?" Activate when: prompt caching, cache_control, cache hit rate, 5 minute cache, 1 hour TTL cache, ephemeral cache, reduce Claude cost.
Prompt caching reuses already-processed prefixes. Cache reads cost ~10% of fresh input. For apps with large repeated context, this is the single biggest lever on your bill.
| TTL | Use case | Cost of cache write |
|---|---|---|
| 5 min (ephemeral) | Conversation, active session, interactive tools | ~1.25× input |
| 1 hour | System prompts, knowledge bases, codebases | ~2× input |
Cache reads are ~0.1× input cost regardless of TTL. The only difference is how long the cache persists and what the write costs.
Rule: use 1h when the cache lives across sessions or independent users; use 5min for in-session reuse.
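To make the rule concrete, here is a rough sketch that prices a cached prefix under each TTL for a given call pattern. The multipliers are the approximate figures from the table; `prefixCost` and its inputs are hypothetical names for illustration, and the model assumes every cache hit restarts the TTL window.

```typescript
// Approximate multipliers from the table above, relative to uncached input cost.
const WRITE_MULTIPLIER = { "5m": 1.25, "1h": 2 } as const;
const TTL_MINUTES = { "5m": 5, "1h": 60 } as const;
const READ_MULTIPLIER = 0.1;

type Ttl = keyof typeof WRITE_MULTIPLIER;

/**
 * Cost of the cached prefix across a series of calls, in units of
 * "one uncached pass over the prefix".
 * Assumption: each hit restarts the TTL window.
 */
function prefixCost(callTimesMin: number[], ttl: Ttl): number {
  let cost = 0;
  let lastUse: number | undefined;
  for (const t of callTimesMin) {
    const hit = lastUse !== undefined && t - lastUse <= TTL_MINUTES[ttl];
    cost += hit ? READ_MULTIPLIER : WRITE_MULTIPLIER[ttl];
    lastUse = t;
  }
  return cost;
}

// Two short bursts of activity, 30 minutes apart (times in minutes).
const calls = [0, 3, 30, 33];
console.log("5m:", prefixCost(calls, "5m").toFixed(2)); // 5m: 2.70 (two writes, two reads)
console.log("1h:", prefixCost(calls, "1h").toFixed(2)); // 1h: 2.30 (one write, three reads)
```

With gaps longer than five minutes between bursts, the 5-minute cache is rewritten each time, which is where the 1-hour TTL starts to pay for its higher write cost.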
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 4096,
system: [
{ type: "text", text: "You are a helpful assistant." },
{
type: "text",
text: giantCodebase,
cache_control: { type: "ephemeral", ttl: "1h" },
},
],
messages: [{ role: "user", content: "Where is auth handled?" }],
});
console.log(response.usage);
// { input_tokens: 120, cache_creation_input_tokens: 450000, cache_read_input_tokens: 0, output_tokens: 200 }
Next call within 1h:
{ input_tokens: 120, cache_creation_input_tokens: 0, cache_read_input_tokens: 450000, output_tokens: 180 }
You can place up to 4 cache breakpoints per request. Everything up to a breakpoint is cached as a prefix. Typical pattern:
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 4096,
system: [
{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral", ttl: "1h" } },
],
tools: [
// all tool definitions
{ ...lastTool, cache_control: { type: "ephemeral", ttl: "1h" } }, // breakpoint at end of tools
],
messages: [
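{ role: "user", content: [{ type: "text", text: "Long context document...", cache_control: { type: "ephemeral", ttl: "1h" } }] }, // breakpoint after the document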
{
role: "assistant",
content: [{ type: "text", text: "Understood.", cache_control: { type: "ephemeral", ttl: "5m" } }],
},
{ role: "user", content: "Current question." }, // NOT cached — this changes every call
],
});
Four breakpoints: system, tools, document, conversation history. The current turn stays uncached.
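Because the limit is per request, it can help to count markers before sending. The sketch below uses minimal hand-written types rather than the SDK's own, and `countBreakpoints` / `assertWithinBreakpointLimit` are hypothetical helper names.

```typescript
// Minimal structural types for this sketch; the real SDK types are richer.
type Block = { cache_control?: unknown; [key: string]: unknown };
type Message = { role: string; content: string | Block[] };
type RequestParams = { system?: string | Block[]; tools?: Block[]; messages: Message[] };

const MAX_BREAKPOINTS = 4; // limit stated above

/** Counts every top-level block in the request that carries a cache_control marker. */
function countBreakpoints(params: RequestParams): number {
  const blocks: Block[] = [];
  for (const tool of params.tools ?? []) blocks.push(tool);
  if (Array.isArray(params.system)) blocks.push(...params.system);
  for (const message of params.messages) {
    if (Array.isArray(message.content)) blocks.push(...message.content);
  }
  return blocks.filter((b) => b.cache_control !== undefined).length;
}

/** Throws before the request is sent if it carries too many markers. */
function assertWithinBreakpointLimit(params: RequestParams): void {
  const n = countBreakpoints(params);
  if (n > MAX_BREAKPOINTS) {
    throw new Error(`Request has ${n} cache breakpoints; the limit is ${MAX_BREAKPOINTS}`);
  }
}
```

Calling `assertWithinBreakpointLimit(params)` just before `client.messages.create(params)` turns a confusing API error into an explicit local one.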
Cache hits require byte-exact prefix match up to the breakpoint. Common breakers:
- `model`: caches are per-model
- `max_tokens`: doesn't break cache, but other param changes might

Keep everything before the breakpoint stable and deterministic.
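One way to find what changed between two requests is to serialize the cached part of each and locate the first point of divergence. This is a minimal sketch: `firstDifference` is a hypothetical helper, it compares JavaScript string characters rather than raw bytes, and the timestamped system prompt is an invented example of an unstable prefix.

```typescript
/** Index of the first differing character, or -1 if the strings are identical. */
function firstDifference(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : shared;
}

// Hypothetical example: a date interpolated into the system prompt changes daily.
const previous = JSON.stringify({ system: "Today is 2025-01-01. You are a code reviewer." });
const current = JSON.stringify({ system: "Today is 2025-01-02. You are a code reviewer." });

const at = firstDifference(previous, current);
if (at !== -1) {
  const context = current.slice(Math.max(0, at - 20), at + 20);
  console.log(`Prefix diverges at character ${at}: ...${context}...`);
}
```

Anything that varies per call (dates, request IDs, user names) belongs after the last breakpoint, not inside the cached prefix.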
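If the prefix is shorter than the model's minimum cacheable length, `cache_control` is silently ignored: the request succeeds, but nothing is written to or read from the cache.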
const usage = response.usage;
const hitRate = usage.cache_read_input_tokens /
(usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens);
console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}%`);
Target: > 80% hit rate for production workloads on stable prefixes.
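A single response says little about a workload, so the target is easier to track as a token-weighted rate over many calls. The sketch below assumes you collect `response.usage` objects yourself; `aggregateHitRate` is a hypothetical helper, and it also guards against the divide-by-zero case.

```typescript
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

/** Token-weighted cache hit rate over many responses; 0 when there are no input tokens. */
function aggregateHitRate(usages: Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    const r = u.cache_read_input_tokens ?? 0;
    const c = u.cache_creation_input_tokens ?? 0;
    read += r;
    total += r + c + u.input_tokens;
  }
  return total === 0 ? 0 : read / total;
}

// The two usage objects shown earlier: one cache write, then one cache read.
const rate = aggregateHitRate([
  { input_tokens: 120, cache_creation_input_tokens: 450000, cache_read_input_tokens: 0 },
  { input_tokens: 120, cache_creation_input_tokens: 0, cache_read_input_tokens: 450000 },
]);
console.log(`Cache hit rate: ${(rate * 100).toFixed(1)}%`); // Cache hit rate: 50.0%
```

The first call in any TTL window is always a write, so the rate only approaches the target once several reads follow each write.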
For conversations, put a 5m breakpoint at the last assistant turn. Each new user turn extends the cache:
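function addCacheBreakpoint(messages: any[]) {
  // Locate the last assistant turn; the prefix up to it is what gets cached.
  const idx = messages.map((m) => m.role).lastIndexOf("assistant");
  if (idx === -1) return messages;
  const target = messages[idx];
  // Normalize string content to a block array so cache_control can be attached.
  const blocks = typeof target.content === "string"
    ? [{ type: "text", text: target.content }]
    : [...target.content];
  const last = blocks[blocks.length - 1];
  if (last?.type !== "text") return messages;
  // Copy instead of mutating, so stored history never accumulates stale breakpoints.
  blocks[blocks.length - 1] = { ...last, cache_control: { type: "ephemeral", ttl: "5m" } };
  const copy = [...messages];
  copy[idx] = { ...target, content: blocks };
  return copy;
}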
As the conversation grows, each call reuses the accumulated cache.
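If the stored history already carries markers from earlier turns (for example, because blocks were marked in place), they add up over a long conversation and can exceed the four-breakpoint limit. A minimal sketch for clearing them before marking the newest turn, using hand-written types and a hypothetical helper name:

```typescript
type Block = { type: string; cache_control?: unknown; [key: string]: unknown };
type Message = { role: string; content: string | Block[] };

/** Returns a copy of the history with every message-level cache_control removed. */
function stripMessageBreakpoints(messages: Message[]): Message[] {
  return messages.map((message) => {
    if (!Array.isArray(message.content)) return message; // plain strings carry no marker
    const content = message.content.map((block) => {
      const copy: Block = { ...block };
      delete copy.cache_control;
      return copy;
    });
    return { ...message, content };
  });
}
```

The input array is left untouched, so the helper is safe to call on history you also persist elsewhere.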
1M context is expensive. Cache it:
const response = await client.messages.create(
{
model: "claude-sonnet-4-6",
max_tokens: 4096,
system: [
{ type: "text", text: giantCodebase, cache_control: { type: "ephemeral", ttl: "1h" } },
],
messages: [{ role: "user", content: userQuestion }],
},
{ headers: { "anthropic-beta": "context-1m-2025-08-07" } },
);
First call: pay ~2× input for the cache write. Every subsequent question within the hour: ~10% of input. Two questions cost about the same as without caching (~2.1× vs 2×); from the third question on, caching is cheaper.
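The break-even point follows from the approximate multipliers quoted earlier. As a worked sketch, let P be the cost of one uncached pass over the prefix, w the write multiplier, r the read multiplier, and n the number of calls that land inside the TTL window:

```latex
\begin{aligned}
C_{\text{cached}}(n) &= wP + (n-1)\,rP, \qquad C_{\text{uncached}}(n) = nP \\
C_{\text{cached}}(n) < C_{\text{uncached}}(n) &\iff n > \frac{w - r}{1 - r} \\
w = 2,\ r = 0.1 &\implies n > \tfrac{1.9}{0.9} \approx 2.11 \quad (\text{1h: cheaper from the 3rd call}) \\
w = 1.25,\ r = 0.1 &\implies n > \tfrac{1.15}{0.9} \approx 1.28 \quad (\text{5m: cheaper from the 2nd call})
\end{aligned}
```

This ignores the uncached tail of each request (the user question), which costs the same either way.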
- `cache_control` missing, nothing cached
- Log `cache_creation_input_tokens` and `cache_read_input_tokens` every call
- `creation > 0` on every call: the prefix is changing. Diff consecutive requests byte by byte
- Both zero on every call: `cache_control` isn't being sent, or the prefix is below the minimum cacheable length
- `creation` on first call but `read` is 0 on second within TTL: different model, different region, or parameter change broke the cache
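The checks above can be folded into a small classifier that runs on every response. This is a sketch under the assumption that you pass in `response.usage`; `classifyCacheUse` and the status labels are hypothetical names.

```typescript
type Usage = {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};
type CacheStatus = "hit" | "write" | "partial" | "none";

/** Maps the usage counters of one response to a coarse cache status. */
function classifyCacheUse(usage: Usage): CacheStatus {
  const created = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  if (read > 0 && created > 0) return "partial"; // reused a prefix and extended it
  if (read > 0) return "hit";
  if (created > 0) return "write";
  return "none"; // marker missing, or prefix below the minimum cacheable length
}

// Example with the first usage object shown earlier (a fresh cache write).
console.log(classifyCacheUse({ cache_creation_input_tokens: 450000, cache_read_input_tokens: 0 })); // write
```

A long run of `"write"` on a prefix that should be stable points to a changing prefix; a long run of `"none"` points to a missing marker or a prefix that is too short.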