curiositech/llm-response-caching-layer

Name: llm-response-caching-layer
Author: curiositech

skills/llm-response-caching-layer/SKILL.md

npx skillsauth add curiositech/windags-skills llm-response-caching-layer

LLM Response Caching Layer

Implement semantic and exact-match caching for LLM API responses to reduce costs by 40-60% and cut latency on cache hits from seconds to milliseconds.

Activation Triggers

Activate on: "cache LLM responses", "semantic cache", "reduce OpenAI costs", "LLM API caching", "cache embeddings", "deduplicate LLM calls", "response memoization for AI"

NOT for: General HTTP/CDN caching (caching-strategies), browser cache headers (caching-strategies), or database query caching (ORM-specific)

Quick Start

1. Classify request types: deterministic (structured extraction, classification) vs. creative (open-ended generation). Deterministic requests are highly cacheable.
2. Choose a cache strategy: exact-match (hash of prompt + model + params) for deterministic requests, semantic similarity for natural-language queries.
3. Set up the cache store: Redis for exact-match (fast, TTL-native), a vector DB for the semantic cache (similarity search).
4. Integrate as middleware: wrap your LLM client with a cache-check layer that intercepts requests before they reach the API.
5. Monitor hit rates: target a 30-50% hit rate for mixed workloads and 70%+ for classification/extraction.

Core Capabilities

| Domain | Technologies | Notes |
|--------|-------------|-------|
| Exact-Match Cache | Redis, DynamoDB, Memcached | Hash(prompt + model + temperature + params) as key |
| Semantic Cache | GPTCache, Qdrant, Pinecone, pgvector | Embed query, find similar cached responses |
| Similarity Threshold | Cosine similarity >= 0.95 typical | Tune per use case; lower = more hits, more risk |
| Cache Invalidation | TTL-based, version-tagged, manual purge | LLM responses rarely need real-time freshness |
| Observability | Cache hit/miss rates, cost savings, latency delta | Essential for ROI justification |
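
To make the similarity-threshold row concrete, here is a minimal sketch of the comparison a semantic cache performs. The function names (`cosine_similarity`, `is_semantic_hit`) and the toy 3-dimensional vectors are illustrative assumptions; real embeddings come from an embedding model and have far more dimensions, and a vector DB normally performs this comparison for you.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_semantic_hit(
    query_emb: Sequence[float],
    cached_emb: Sequence[float],
    threshold: float = 0.95,
) -> bool:
    """A cached entry counts as a hit only at or above the threshold."""
    return cosine_similarity(query_emb, cached_emb) >= threshold


# Toy vectors (hypothetical, not real embeddings)
cached = [0.9, 0.1, 0.4]
near = [0.88, 0.12, 0.41]   # almost the same direction as `cached`
far = [0.1, 0.9, 0.2]       # points somewhere else entirely

print(is_semantic_hit(near, cached))  # similar query reuses the cached response
print(is_semantic_hit(far, cached))   # dissimilar query goes to the LLM
```

Lowering `threshold` turns more near-misses into hits, which is the "more hits, more risk" trade-off noted in the table.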

Architecture Patterns

Pattern 1: Two-Tier LLM Cache

LLM Request
    │
    ▼
[Exact Match Cache (Redis)] ──hit──→ Return cached response (< 5ms)
    │ miss
    ▼
[Semantic Cache (Vector DB)] ──hit (similarity >= 0.95)──→ Return cached response (< 50ms)
    │ miss
    ▼
[LLM API Call] ──→ response ──→ Store in both caches ──→ Return response

# Two-tier caching middleware
import hashlib
import json


class LLMCacheMiddleware:
    def __init__(self, redis_client, vector_db, embedder, llm_call, threshold=0.95, ttl=86400):
        self.redis = redis_client    # async Redis client (e.g. redis.asyncio.Redis)
        self.vdb = vector_db         # vector store wrapper exposing search() and upsert()
        self.embedder = embedder     # embedding client exposing embed()
        self.llm_call = llm_call     # async callable: (prompt, model, **params) -> str
        self.threshold = threshold   # minimum similarity for a semantic hit
        self.ttl = ttl               # exact-match TTL in seconds (default: 24 hours)

    def cache_key(self, prompt: str, model: str, **params) -> str:
        # sort_keys makes the hash independent of keyword-argument order
        blob = json.dumps({"prompt": prompt, "model": model, **params}, sort_keys=True)
        return f"llm:{hashlib.sha256(blob.encode()).hexdigest()}"

    async def query(self, prompt: str, model: str, **params) -> str:
        # Tier 1: Exact match
        key = self.cache_key(prompt, model, **params)
        cached = await self.redis.get(key)
        if cached is not None:
            return json.loads(cached)["response"]  # < 5ms

        # Tier 2: Semantic match (only reuse entries produced by the same model)
        query_emb = self.embedder.embed(prompt)
        results = self.vdb.search(query_emb, top_k=1)
        if (
            results
            and results[0].score >= self.threshold
            and results[0].payload.get("model") == model
        ):
            return results[0].payload["response"]  # < 50ms

        # Cache miss: call LLM
        response = await self.llm_call(prompt, model, **params)

        # Store in both tiers
        await self.redis.setex(key, self.ttl, json.dumps({"response": response}))
        self.vdb.upsert(query_emb, {"prompt": prompt, "model": model, "response": response})
        return response

Pattern 2: Cache-Aware Request Classification

Incoming Request
    │
    ▼
[Classify Cacheability]
    ├── temperature == 0 AND structured output → HIGHLY CACHEABLE (TTL: 7 days)
    ├── temperature == 0 AND free-form         → CACHEABLE (TTL: 24 hours)
    ├── temperature > 0 AND repeated pattern   → SEMANTIC CACHE ONLY (TTL: 1 hour)
    └── temperature > 0 AND unique/creative    → DO NOT CACHE
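
The decision tree above can be sketched as a small pure function. This is an illustration under stated assumptions, not part of the skill: the names `CacheStrategy`, `CachePolicy`, and `classify_cacheability` are hypothetical, the two boolean flags are assumed to be computed by the caller (the tree does not say how to detect a "repeated pattern"), and mapping the two "cacheable" branches to both cache tiers is my reading of the diagram.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

HOUR = 3600
DAY = 24 * HOUR


class CacheStrategy(Enum):
    ALL_TIERS = "exact+semantic"     # assumed meaning of (HIGHLY) CACHEABLE
    SEMANTIC_ONLY = "semantic-only"
    NONE = "none"


@dataclass(frozen=True)
class CachePolicy:
    strategy: CacheStrategy
    ttl_seconds: Optional[int]  # None when the request is not cached


def classify_cacheability(
    temperature: float,
    structured_output: bool,
    repeated_pattern: bool,
) -> CachePolicy:
    """Map a request's properties to a cache strategy and TTL."""
    if temperature == 0:
        if structured_output:
            return CachePolicy(CacheStrategy.ALL_TIERS, 7 * DAY)
        return CachePolicy(CacheStrategy.ALL_TIERS, DAY)
    if repeated_pattern:
        return CachePolicy(CacheStrategy.SEMANTIC_ONLY, HOUR)
    return CachePolicy(CacheStrategy.NONE, None)


policy = classify_cacheability(0.0, structured_output=True, repeated_pattern=False)
print(policy.strategy.value, policy.ttl_seconds)
```

Keeping the classifier free of I/O makes it easy to unit-test each branch before wiring it into the middleware.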

Pattern 3: Versioned Cache with Model Upgrades

Cache Key = hash(prompt + model_version + system_prompt_version + params)

Model upgrade (gpt-4o-2026-01 → gpt-4o-2026-03):
  → All cache keys change automatically (model_version in hash)
  → Old cache entries expire via TTL
  → No manual invalidation needed
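
A minimal sketch of the versioned key described above. The function name `versioned_cache_key` and the version strings in the usage lines are hypothetical; the only technique shown is hashing the prompt, both version identifiers, and the request parameters together so that changing any of them yields a different key.

```python
import hashlib
import json
from typing import Any


def versioned_cache_key(
    prompt: str,
    model_version: str,
    system_prompt_version: str,
    **params: Any,
) -> str:
    """Build a key that changes whenever the model or system prompt version changes."""
    blob = json.dumps(
        {
            "prompt": prompt,
            "model_version": model_version,
            "system_prompt_version": system_prompt_version,
            # Nested so a parameter named "prompt" cannot overwrite the field above
            "params": params,
        },
        sort_keys=True,  # key is independent of keyword-argument order
    )
    return "llm:" + hashlib.sha256(blob.encode("utf-8")).hexdigest()


# Hypothetical version identifiers, for illustration only
old_key = versioned_cache_key("Classify: great product", "model-v1", "sys-v3", temperature=0)
new_key = versioned_cache_key("Classify: great product", "model-v2", "sys-v3", temperature=0)
print(old_key != new_key)  # entries written under the old version are never read again
```

Entries stored under the old key are simply never looked up again and age out through their TTL.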

Anti-Patterns

Caching creative/high-temperature responses: temperature > 0.7 means the user expects variety. A cache returns identical responses, which feels broken.
Global similarity threshold: a 0.95 threshold works for factual Q&A but is too loose for code generation and too tight for casual chat. Tune per endpoint.
No cache versioning on model upgrades: if the cache key omits the model, switching from GPT-4o to Claude keeps returning stale GPT-4o responses. Include the model version in cache keys.
Infinite TTL: even deterministic responses can become stale when the underlying data changes. Set TTL based on data freshness requirements (hours to days, not forever).
Caching without cost tracking: you cannot prove ROI without measuring cache hit rate, tokens saved, dollars saved, and latency improvement.
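
As one way to address the last anti-pattern, the sketch below keeps running counters for hit rate, tokens saved, and dollars saved. The class `CacheStats` and the per-1,000-token price are hypothetical; plug in your provider's actual pricing, and track latency separately (it is omitted here for brevity).

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    price_per_1k_tokens: float  # hypothetical blended price in dollars
    hits: int = 0
    misses: int = 0
    tokens_saved: int = 0

    def record_hit(self, tokens_avoided: int) -> None:
        """Call on a cache hit with the token count the LLM call would have used."""
        self.hits += 1
        self.tokens_saved += tokens_avoided

    def record_miss(self) -> None:
        self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def dollars_saved(self) -> float:
        return self.tokens_saved / 1000 * self.price_per_1k_tokens


stats = CacheStats(price_per_1k_tokens=0.01)
for _ in range(3):
    stats.record_hit(tokens_avoided=500)
stats.record_miss()
print(f"hit rate={stats.hit_rate:.0%}, tokens saved={stats.tokens_saved}")
```

In production these counters would typically be exported to a metrics system so that a drop in hit rate can trigger an alert.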

Quality Checklist

[ ] Request types classified by cacheability (deterministic vs. creative)
[ ] Exact-match cache uses hash of full request (prompt + model + params)
[ ] Semantic cache threshold tuned per use case (tested on sample queries)
[ ] Cache key includes model version for automatic invalidation on upgrades
[ ] TTL set based on data freshness requirements (not infinite)
[ ] High-temperature (> 0.7) requests excluded from caching
[ ] Cache hit/miss rates monitored with alerting on rate drops
[ ] Cost savings calculated and reported: tokens saved, dollars saved per day
[ ] Cache store sized for expected working set (Redis memory, vector DB storage)
[ ] Graceful degradation: cache failure falls through to LLM API (never blocks)

Category: development
Updated: Apr 4, 2026

Install this skill globally with the npx skillsauth command shown above. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 4, 2026, 2:15 PM (6.0s, 1 file scanned)

SKILL.md

license: Apache-2.0
name: llm-response-caching-layer
description: Implement semantic and exact-match caching for LLM responses to reduce cost 40-60% and latency. Activate on: LLM caching, semantic cache, reduce API costs, cache AI responses. NOT for: general web caching (caching-strategies), CDN config (cloudflare-worker-dev).
allowed-tools: Read,Write,Edit,Bash(python:*,pip:*,npm:*,npx:*)
category: Backend & Infrastructure
- skill: ai-engineer
  reason: Cache layer integrates into LLM application pipelines

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r windags-skills/skills/llm-response-caching-layer ~/.claude/skills/

Repository: curiositech/windags-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT