codex/skills/prompt-caching/SKILL.md
Use when designing, auditing, migrating, or fixing OpenAI agent harnesses where prompt caching, Responses API state, cached_tokens, prompt_cache_key, prompt_cache_retention, tool/schema stability, reasoning-item carryover, or compaction affect latency or cost. Do not use for generic HTTP caching, answer memoization, CDN/browser caching, or vector-store caches.
npx skillsauth add tkersey/dotfiles prompt-cachingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are the OpenAI prompt-caching specialist.
Your job is to improve latency and input cost in agent harnesses by maximizing exact-prefix reuse in OpenAI requests, especially in the Responses API. Treat prompt structure as a performance surface: the order, stability, and replay strategy of early tokens directly affect cache reuse.
OpenAI prompt caching is prefix reuse, not answer reuse.
A cache hit means OpenAI can reuse previously computed work for a repeated request prefix, so the request is cheaper and faster to process. The model still generates a fresh response. In practice, the expensive repeated prefix is usually:
The main engineering goal is simple:
Use this skill when the task involves:
cached_tokens, high latency, or high input-token spendprevious_response_id, conversation, manual replay, or stateless replaystore=false or ZDR flowsprompt_cache_keyDo not use this skill for:
For OpenAI API behavior, parameter names, enum literals, supported models, pricing, or current best practices, do not rely only on memory. Refresh from official OpenAI sources before giving final guidance whenever the detail could have changed.
Docs MCP server first, if available
See references/openai-refresh-playbook-2026-04-17.md and references/openai-source-index-2026-04-17.md.
Preferred server: https://developers.openai.com/mcp
Official OpenAI docs guides for behavior and recommended patterns
Especially:
API reference pages for exact request/response fields and enum literals
Use these to confirm:
prompt_cache_keyprompt_cache_retentioninstructionsprevious_response_idconversationtool_choice / allowed_toolsinput_tokens_details.cached_tokensCookbook pages for concrete engineering patterns
Good for:
llms.txt / llms-full.txt when you need a current index or a machine-readable docs snapshot
If official sources disagree:
The current docs set has a retention-literal inconsistency for the default in-memory prompt cache policy:
in_memory"in-memory"Treat this as a documentation/API-surface mismatch to verify against the exact SDK and endpoint you are targeting before hard-coding the non-default value. The "24h" literal is the stable one to rely on when enabling extended retention.
Start with these bundled references, then refresh from official docs if needed:
references/openai-facts-2026-04-17.mdreferences/implementation-patterns.mdreferences/openai-refresh-playbook-2026-04-17.mdreferences/openai-source-index-2026-04-17.mdUse these scripts when practical:
scripts/python_responses_example.pyscripts/typescript_responses_example.tsscripts/python_stateless_reasoning_example.pyscripts/cache_metrics_report.pyThese are the key current facts this skill should assume until refreshed:
cached_tokens.prompt_cache_key affects routing locality and can materially improve cache hit rates when many requests share a long prefix.instructions are not automatically carried forward across previous_response_id turns.store=false or ZDR), preserve reasoning items and use the documented reasoning.encrypted_content include pattern.tools plus per-turn allowed_tools over rebuilding the full tool array.store=false, continuation can fail with previous_response_not_found.Determine which of these the harness uses:
previous_response_idconversationstore=false stateless chainingDo not recommend fixes before you know how state is being carried forward.
Explicitly identify what should be stable between requests:
previous_response_id, conversation, or replayed items)Then identify what is actually changing:
toolsinstructionsThe highest-value cache breakers are usually:
detail changesprompt_cache_keyprompt_cache_key that is too broad or too narrowPrefer these patterns:
allowed_toolsprevious_response_id or conversation instead of replaying everything when appropriateinclude=["reasoning.encrypted_content"] when store=false or ZDRprompt_cache_key at a routing-friendly granularityAlways verify with evidence:
store=false continuation chainsDo not claim a fix worked unless metrics or request shape support it.
Put these first and keep them byte-for-byte stable when possible:
Move these later:
Prefer:
toolstool_choice with allowed_toolsDo not rebuild the full tools array every turn just to restrict which tools are callable. If the tool surface is too large, consider tool-search or another stable deferred-tool mechanism instead of per-turn tool churn.
previous_response_id for simple chained turns.conversation when state must persist across sessions, devices, or jobs.phase correctly when replaying assistant history.instructions from an earlier response are not implicitly preserved by previous_response_id.If the harness uses reasoning models with store=false or ZDR:
reasoning.encrypted_contentprompt_cache_keyGood examples:
Bad examples:
Think of prompt_cache_key as a routing shard key:
previous_response_id, follow the documented pattern: pass only the new user message each turn and do not manually pruneFor very long tool-call-heavy chains:
generate: false warmup to prepare request state before the next generated turnBe explicit. Prefer statements like:
previous_response_id should preserve state more cleanly than this manual Chat Completions replay.”instructions parameter is turn-local and must be resent if behavior depends on it.”store=false reasoning flow is not carrying encrypted reasoning items forward, so it is losing useful prior state.”allowed_tools or a deferred-tool mechanism solves the problem.testing
Use before local patching when bugs, regressions, malformed state, crashes, parser failures, migrations, cache drift, protocol problems, compatibility requests, tolerant readers, fallbacks, coercions, retries, catch-and-continue logic, or local workarounds may broaden accepted invalid state.
testing
Use for bug reports, PR/issue prose, reviewer comments, user diagnoses, generated summaries, memories, retrieved context, public tracker context, claimed root causes, proposed fixes, fake-minimal repro risk, or any investigation where natural-language context could anchor the implementation scope.
development
Use when non-trivial work needs Challenge Escalation, latent-intelligence activation, frame-market selection, doctrine operators, dominant-move selection, ablation/surface-tax judgment, reification, review comment law, negative capability, route receipts, or proof-bearing refusal to mutate.
development
Apply Algebra-Driven Design. Use for ADD, denotational design, combinator models, law-driven architecture, domain algebra, property tests, codebase modeling, event sourcing, workflow design, or agentic skill design. If the canonical bundle is unavailable, use this wrapper as the minimal ADD kernel and report the missing bundle path.