skills/llm-response-caching-layer/SKILL.md
Implement semantic and exact-match caching for LLM responses to reduce cost 40-60% and latency. Activate on: LLM caching, semantic cache, reduce API costs, cache AI responses. NOT for: general web caching (caching-strategies), CDN config (cloudflare-worker-dev).
npx skillsauth add curiositech/windags-skills llm-response-caching-layerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Implement semantic and exact-match caching for LLM API responses to reduce costs 40-60% and cut P50 latency from seconds to milliseconds.
Activate on: "cache LLM responses", "semantic cache", "reduce OpenAI costs", "LLM API caching", "cache embeddings", "deduplicate LLM calls", "response memoization for AI"
NOT for: General HTTP/CDN caching (caching-strategies), browser cache headers (caching-strategies), or database query caching (ORM-specific)
| Domain | Technologies | Notes | |--------|-------------|-------| | Exact-Match Cache | Redis, DynamoDB, Memcached | Hash(prompt + model + temperature + params) as key | | Semantic Cache | GPTCache, Qdrant, Pinecone, pgvector | Embed query, find similar cached responses | | Similarity Threshold | Cosine similarity >= 0.95 typical | Tune per use case; lower = more hits, more risk | | Cache Invalidation | TTL-based, version-tagged, manual purge | LLM responses rarely need real-time freshness | | Observability | Cache hit/miss rates, cost savings, latency delta | Essential for ROI justification |
LLM Request
│
▼
[Exact Match Cache (Redis)] ──hit──→ Return cached response (< 5ms)
│ miss
▼
[Semantic Cache (Vector DB)] ──hit (similarity > 0.95)──→ Return cached response (< 50ms)
│ miss
▼
[LLM API Call] ──→ response ──→ Store in both caches ──→ Return response
# Two-tier caching middleware
import hashlib, json, numpy as np
class LLMCacheMiddleware:
def __init__(self, redis_client, vector_db, embedder, threshold=0.95):
self.redis = redis_client
self.vdb = vector_db
self.embedder = embedder
self.threshold = threshold
def cache_key(self, prompt: str, model: str, **params) -> str:
blob = json.dumps({"prompt": prompt, "model": model, **params}, sort_keys=True)
return f"llm:{hashlib.sha256(blob.encode()).hexdigest()}"
async def query(self, prompt: str, model: str, **params) -> str:
# Tier 1: Exact match
key = self.cache_key(prompt, model, **params)
cached = await self.redis.get(key)
if cached:
return json.loads(cached)["response"] # < 5ms
# Tier 2: Semantic match
query_emb = self.embedder.embed(prompt)
results = self.vdb.search(query_emb, top_k=1)
if results and results[0].score >= self.threshold:
return results[0].payload["response"] # < 50ms
# Cache miss: call LLM
response = await llm_call(prompt, model, **params)
# Store in both tiers
await self.redis.setex(key, 86400, json.dumps({"response": response}))
self.vdb.upsert(query_emb, {"prompt": prompt, "response": response})
return response
Incoming Request
│
▼
[Classify Cacheability]
├── temperature == 0 AND structured output → HIGHLY CACHEABLE (TTL: 7 days)
├── temperature == 0 AND free-form → CACHEABLE (TTL: 24 hours)
├── temperature > 0 AND repeated pattern → SEMANTIC CACHE ONLY (TTL: 1 hour)
└── temperature > 0 AND unique/creative → DO NOT CACHE
Cache Key = hash(prompt + model_version + system_prompt_version + params)
Model upgrade (gpt-4o-2026-01 → gpt-4o-2026-03):
→ All cache keys change automatically (model_version in hash)
→ Old cache entries expire via TTL
→ No manual invalidation needed
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.