latestaiagents/prompt-caching-ttl

Name: prompt-caching-ttl
Author: latestaiagents

skills/claude-4-6-features/prompt-caching-ttl/SKILL.md

npx skillsauth add latestaiagents/agent-skills prompt-caching-ttl

Prompt Caching — 5min & 1h TTL

Prompt caching reuses already-processed prefixes. Cache reads cost ~10% of fresh input. For apps with large repeated context, this is the single biggest lever on your bill.

When to Use

Large system prompts reused across many requests
Long documents/codebases with many follow-up questions
Multi-turn conversations with growing history
Tool/function definitions shared across sessions
Any call where > 1024 tokens would be repeated (2048 for Haiku)

Two TTLs

| TTL | Use case | Cost of cache write |
|---|---|---|
| 5 min (default) | Conversation, active session, interactive tools | ~1.25× input |
| 1 hour | System prompts, knowledge bases, codebases | ~2× input |

Cache reads are ~0.1× input cost regardless of TTL. The only difference is how long the cache persists and what the write costs.

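Rule: use 1h when the cached prefix is shared across sessions or independent users, where requests may arrive more than 5 minutes apart; use 5min for in-session reuse, where each hit keeps the cache warm.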
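To make the rule concrete, here is a rough sketch that compares the two TTLs for a given request pattern. It assumes the multipliers from the table above (about 1.25× and 2× for writes, 0.1× for reads) and that every cache hit restarts the TTL window. `relativeCost` and its parameters are hypothetical names for illustration, not part of any SDK.

```typescript
// Hypothetical helper: relative input cost (in multiples of one uncached
// prefix) for a series of requests that share the same cached prefix.
// Assumption: each use of the cache restarts the TTL window.
type Ttl = "5m" | "1h";

const TTL_SECONDS: Record<Ttl, number> = { "5m": 300, "1h": 3600 };
const WRITE_MULTIPLIER: Record<Ttl, number> = { "5m": 1.25, "1h": 2 };
const READ_MULTIPLIER = 0.1;

function relativeCost(arrivalTimesSec: number[], ttl: Ttl): number {
  let cost = 0;
  let expiresAt = -Infinity; // no cache entry exists yet
  for (const t of arrivalTimesSec) {
    if (t <= expiresAt) {
      cost += READ_MULTIPLIER; // hit: cheap read
    } else {
      cost += WRITE_MULTIPLIER[ttl]; // miss: pay for the write again
    }
    expiresAt = t + TTL_SECONDS[ttl]; // window restarts on every use
  }
  return cost;
}

// Six requests, one every 10 minutes (a cross-session pattern).
const sparse = [0, 600, 1200, 1800, 2400, 3000];
// Six requests, one every 30 seconds (an interactive session).
const dense = [0, 30, 60, 90, 120, 150];

console.log(relativeCost(sparse, "5m").toFixed(2)); // 7.50 (every call is a write)
console.log(relativeCost(sparse, "1h").toFixed(2)); // 2.50
console.log(relativeCost(dense, "5m").toFixed(2)); // 1.75
console.log(relativeCost(dense, "1h").toFixed(2)); // 2.50
```

With sparse traffic the 5m cache expires between calls and every request pays the write price, while the dense session is cheaper on 5m because the higher 1h write cost buys nothing.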

Basic Usage

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: [
    { type: "text", text: "You are a helpful assistant." },
    {
      type: "text",
      text: giantCodebase,
      cache_control: { type: "ephemeral", ttl: "1h" },
    },
  ],
  messages: [{ role: "user", content: "Where is auth handled?" }],
});

console.log(response.usage);
// { input_tokens: 120, cache_creation_input_tokens: 450000, cache_read_input_tokens: 0, output_tokens: 200 }

Next call within 1h:

{ input_tokens: 120, cache_creation_input_tokens: 0, cache_read_input_tokens: 450000, output_tokens: 180 }

Cache Breakpoints

You can place up to 4 cache breakpoints per request. Everything up to a breakpoint is cached as a prefix. Typical pattern:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: [
    // Breakpoint 1: end of the system prompt
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral", ttl: "1h" } },
  ],
  tools: [
    // all tool definitions
    { ...lastTool, cache_control: { type: "ephemeral", ttl: "1h" } }, // Breakpoint 2: end of tools
  ],
  messages: [
    {
      role: "user",
      // Breakpoint 3: end of the long document
      content: [
        { type: "text", text: "Long context document...", cache_control: { type: "ephemeral", ttl: "1h" } },
      ],
    },
    {
      role: "assistant",
      // Breakpoint 4: conversation history so far
      content: [{ type: "text", text: "Understood.", cache_control: { type: "ephemeral", ttl: "5m" } }],
    },
    { role: "user", content: "Current question." }, // NOT cached: this changes every call
  ],
});

Four breakpoints: system, tools, document, conversation history. The current turn stays uncached. When mixing TTLs, keep the 1h breakpoints ahead of the 5m ones in the prefix.
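Because the limit is easy to exceed once breakpoints are added in several places, a pre-flight check can help. The following is a minimal sketch that counts `cache_control` markers in a request body shaped like the example above. The type names and the `countBreakpoints` and `assertWithinLimit` helpers are hypothetical, and nested blocks (for example inside tool results) are not inspected.

```typescript
// Hypothetical pre-flight check: count cache_control markers on top-level
// system blocks, tool definitions, and message content blocks.
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };
type Cacheable = { cache_control?: CacheControl; [key: string]: unknown };
type Message = { role: string; content: string | Cacheable[] };
type RequestBody = { system?: string | Cacheable[]; tools?: Cacheable[]; messages: Message[] };

const MAX_BREAKPOINTS = 4; // the per-request limit stated above

function countBreakpoints(req: RequestBody): number {
  const blocks: Cacheable[] = [];
  if (Array.isArray(req.system)) blocks.push(...req.system);
  if (req.tools) blocks.push(...req.tools);
  for (const m of req.messages) {
    if (Array.isArray(m.content)) blocks.push(...m.content);
  }
  return blocks.filter((b) => b.cache_control !== undefined).length;
}

function assertWithinLimit(req: RequestBody): void {
  const n = countBreakpoints(req);
  if (n > MAX_BREAKPOINTS) {
    throw new Error(`Request has ${n} cache breakpoints; the limit is ${MAX_BREAKPOINTS}`);
  }
}

const example: RequestBody = {
  system: [{ type: "text", text: "stable system prompt", cache_control: { type: "ephemeral", ttl: "1h" } }],
  tools: [{ name: "search", cache_control: { type: "ephemeral", ttl: "1h" } }],
  messages: [
    { role: "user", content: [{ type: "text", text: "document", cache_control: { type: "ephemeral", ttl: "1h" } }] },
    { role: "assistant", content: [{ type: "text", text: "Understood.", cache_control: { type: "ephemeral", ttl: "5m" } }] },
    { role: "user", content: "Current question." },
  ],
};

console.log(countBreakpoints(example)); // 4
assertWithinLimit(example); // does not throw
```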

Prefix Matching Rules

Cache hits require a byte-exact prefix match up to the breakpoint. Common breakers:

Timestamps / UUIDs in system prompt: the prefix changes on every call, so nothing is ever read back
Reordering tools: same tools in a different order = cache miss
Changing any earlier message: even a whitespace change invalidates the cache from that point on
Different model: caches are per-model
Different max_tokens: doesn't break the cache, but other parameter changes can (for example tool_choice or extended-thinking settings)

Keep everything before the breakpoint stable and deterministic.
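One way to keep the prefix deterministic is to build it through a single function and compare consecutive versions locally. The sketch below sorts tools by name before serializing and reports where two serialized prefixes first diverge. `serializePrefix` and `firstDifference` are hypothetical helpers; they only compare what you send and do not reproduce how the API derives cache keys.

```typescript
// Hypothetical local check for prefix drift between consecutive requests.
type Tool = { name: string; [key: string]: unknown };
type Prefix = { model: string; system: string; tools: Tool[] };

// Serialize the prefix with tools in a fixed order (sorted by name) so that
// accidental reordering upstream does not change the result.
function serializePrefix(p: Prefix): string {
  const tools = [...p.tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  return JSON.stringify({ model: p.model, tools, system: p.system });
}

// Index of the first differing character, or -1 when the strings are equal.
function firstDifference(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : n;
}

const first = serializePrefix({
  model: "claude-sonnet-4-6",
  system: "You are a helpful assistant.",
  tools: [{ name: "search" }, { name: "calculator" }],
});
const reordered = serializePrefix({
  model: "claude-sonnet-4-6",
  system: "You are a helpful assistant.",
  tools: [{ name: "calculator" }, { name: "search" }],
});
const withTimestamp = serializePrefix({
  model: "claude-sonnet-4-6",
  system: "You are a helpful assistant. Now: 2026-01-01T00:00:00Z",
  tools: [{ name: "search" }, { name: "calculator" }],
});

console.log(firstDifference(first, reordered)); // -1: tool order no longer matters
console.log(firstDifference(first, withTimestamp) >= 0); // true: the timestamp changed the prefix
```

Logging the offset from `firstDifference` alongside the usage counters makes it quicker to see which part of the prefix moved when reads drop to zero.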

Minimum Cacheable Size

Most models: 1024 tokens
Haiku: 2048 tokens

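Below the minimum, cache_control is silently ignored: the request succeeds, but both cache token counts in usage stay at 0. Minimums have varied between model generations, so confirm the value for the model you call.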

Inspecting Hit Rate

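const usage = response.usage;
// The cache fields can be null or absent, so default them to 0.
const cacheRead = usage.cache_read_input_tokens ?? 0;
const cacheWrite = usage.cache_creation_input_tokens ?? 0;
const hitRate = cacheRead / (cacheRead + cacheWrite + usage.input_tokens);
console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}%`);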

Target: > 80% hit rate for production workloads on stable prefixes.
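A single response says little about a workload, so the target is more useful when measured across many calls. Here is a minimal sketch of an aggregator that sums the same usage fields over time. `HitRateTracker` is a hypothetical class, and the sample numbers are the two usage objects from Basic Usage.

```typescript
// Hypothetical aggregator: sums usage fields across calls and reports the
// share of input tokens that were served from cache.
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

class HitRateTracker {
  private read = 0;
  private total = 0;

  record(usage: Usage): void {
    const read = usage.cache_read_input_tokens ?? 0;
    const created = usage.cache_creation_input_tokens ?? 0;
    this.read += read;
    this.total += read + created + usage.input_tokens;
  }

  // Returns 0 when nothing has been recorded, to avoid dividing by zero.
  hitRate(): number {
    return this.total === 0 ? 0 : this.read / this.total;
  }
}

const tracker = new HitRateTracker();
tracker.record({ input_tokens: 120, cache_creation_input_tokens: 450000, cache_read_input_tokens: 0 });
tracker.record({ input_tokens: 120, cache_creation_input_tokens: 0, cache_read_input_tokens: 450000 });

console.log(`${(tracker.hitRate() * 100).toFixed(1)}%`); // 50.0%
```

The first call is always a write, so the aggregate only approaches the target after enough reads have accumulated.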

Multi-Turn Pattern

For conversations, put a 5m breakpoint at the last assistant turn. Each new user turn extends the cache:

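function addCacheBreakpoint(messages: any[]) {
  // Clone each message and its content blocks so the caller's history is not
  // mutated and old breakpoints do not pile up across turns.
  const copy = messages.map((m) => ({
    ...m,
    content: Array.isArray(m.content) ? m.content.map((b: any) => ({ ...b })) : m.content,
  }));
  const lastAssistant = [...copy].reverse().find((m) => m.role === "assistant");
  if (lastAssistant && Array.isArray(lastAssistant.content)) {
    const lastBlock = lastAssistant.content[lastAssistant.content.length - 1];
    if (lastBlock?.type === "text") lastBlock.cache_control = { type: "ephemeral", ttl: "5m" };
  }
  return copy;
}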

As the conversation grows, each call reuses the accumulated cache.

Combining with 1M Context

1M context is expensive. Cache it:

const response = await client.messages.create(
  {
    model: "claude-sonnet-4-6",
    max_tokens: 4096,
    system: [
      { type: "text", text: giantCodebase, cache_control: { type: "ephemeral", ttl: "1h" } },
    ],
    messages: [{ role: "user", content: userQuestion }],
  },
  { headers: { "anthropic-beta": "context-1m-2025-08-07" } },
);

First call: pay ~2× input for the cache write. Every subsequent question within the hour: ~10% of input. After two calls the cached path is roughly at par (2.1× vs 2× uncached); from the third call on it is cheaper.
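The break-even point follows directly from the multipliers. As a sketch, assuming one cache write followed only by cache reads, and ignoring output tokens and any long-context pricing differences, the hypothetical helpers below compute it for either TTL.

```typescript
// Relative cost of n calls sharing one prefix, in multiples of the uncached
// prefix cost. Assumes one cache write followed by n - 1 cache reads.
function cachedCost(n: number, writeMultiplier: number, readMultiplier = 0.1): number {
  return n === 0 ? 0 : writeMultiplier + readMultiplier * (n - 1);
}

// Smallest number of calls at which caching is strictly cheaper than sending
// the prefix uncached every time (which costs n).
function breakEvenCalls(writeMultiplier: number, readMultiplier = 0.1): number {
  if (readMultiplier >= 1) {
    throw new Error("Caching never pays off when reads cost as much as fresh input");
  }
  let n = 1;
  while (cachedCost(n, writeMultiplier, readMultiplier) >= n) n++;
  return n;
}

console.log(breakEvenCalls(2)); // 3 (1h TTL: 2.2 cached vs 3.0 uncached)
console.log(breakEvenCalls(1.25)); // 2 (5m TTL: 1.35 cached vs 2.0 uncached)
```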

Anti-Patterns

Dynamic content before the breakpoint: a timestamp, request ID, or random value in the system prompt
No breakpoint placed: cache_control is missing, so nothing is cached
Cached prefix smaller than the minimum: silently ignored
Ignoring hit rate metrics: you think caching is working, but it isn't
Using 5m for cross-session knowledge bases: you keep re-paying the cache write

Debugging Zero Hit Rate

Log cache_creation_input_tokens and cache_read_input_tokens on every call
If creation > 0 on every call and reads stay at 0: the prefix is changing. Diff consecutive requests byte by byte
If both are 0: the prefix is below the minimum token count, or cache_control isn't being sent
If the first call writes but the second call within the TTL reads 0: a different model, a different region, or a parameter change broke the cache
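The checklist above can be turned into a small classifier over logged usage records. This is only a sketch: `diagnose` and its labels are hypothetical, and it looks at nothing but the two counters, so treat its answer as a starting point for investigation.

```typescript
// Hypothetical classifier over usage records logged from consecutive calls
// that are expected to share a cached prefix.
type Usage = {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

type Diagnosis =
  | "insufficient-data" // fewer than two calls logged
  | "hitting" // at least one call read from cache
  | "not-cached" // both counters are 0 on every call
  | "prefix-changing" // every call writes, none reads
  | "written-but-never-read"; // a write happened, later calls neither read nor write

function diagnose(calls: Usage[]): Diagnosis {
  if (calls.length < 2) return "insufficient-data";
  const created = calls.map((u) => u.cache_creation_input_tokens ?? 0);
  const read = calls.map((u) => u.cache_read_input_tokens ?? 0);

  if (read.some((r) => r > 0)) return "hitting";
  if (created.every((c) => c === 0)) return "not-cached";
  if (created.every((c) => c > 0)) return "prefix-changing";
  return "written-but-never-read";
}

console.log(
  diagnose([
    { cache_creation_input_tokens: 450000, cache_read_input_tokens: 0 },
    { cache_creation_input_tokens: 450000, cache_read_input_tokens: 0 },
  ]),
); // "prefix-changing"
```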

Best Practices

Place breakpoints at stable boundaries: end of system prompt, end of tools, end of large document
Use 1h TTL for knowledge bases reused across requests; 5m for conversations
Keep everything before breakpoints byte-stable — no timestamps, no UUIDs
Monitor hit rate as a first-class metric
For multi-turn, move a 5m breakpoint to the last assistant message each call
Combine with 1M context for big corpora — caching makes long-context affordable

Use Claude's prompt caching with 5-minute and 1-hour TTLs to slash costs on repeated context — codebases, system prompts, long documents. Covers cache breakpoints, hit-rate optimization, and the common mistakes that silently disable caching. Use this skill when building apps with repeated large context, optimizing LLM spend, or debugging "why are my cache reads zero?" Activate when: prompt caching, cache_control, cache hit rate, 5 minute cache, 1 hour TTL cache, ephemeral cache, reduce Claude cost.

2 stars

development

Updated Apr 23, 2026

npx skillsauth add latestaiagents/agent-skills prompt-caching-ttl

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 2:55 AM · 14.0s · 1 file scanned

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/claude-4-6-features/prompt-caching-ttl ~/.claude/skills/

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT