Venice AI Skill

Expert assistance with Venice AI platform development, featuring advanced prompt caching optimization, API integration patterns, and cost-effective AI model access across multiple providers (Anthropic Claude, OpenAI GPT, Google Gemini, xAI Grok, DeepSeek, and more).

When to Use This Skill

Core Use Cases

Prompt Caching Optimization: Reduce API costs by 36-90% and latency by up to 80%
Multi-Provider AI Integration: Access Claude, GPT, Gemini, Grok, DeepSeek via unified API
Cost Optimization: Implement caching strategies for high-volume AI applications
Performance Tuning: Optimize prompt structure for maximum cache hit rates
Conversation Management: Build efficient multi-turn chatbots with cached context

Trigger Phrases

"Optimize Venice AI costs with prompt caching"
"Implement prompt caching for [Claude/GPT/Gemini]"
"Reduce AI API latency using Venice"
"Build cost-efficient chatbot with Venice AI"
"Integrate multiple AI providers through Venice"
"Debug Venice AI cache hit rates"
"Structure prompts for maximum caching efficiency"

Quick Reference

Prompt Caching Fundamentals

How It Works: Caching uses prefix matching—identical prompt beginnings across requests retrieve pre-processed tokens instead of recomputing them.

Key Benefit: Reduce costs by 36-90% and latency by up to 80% for longer prompts.

Minimum Requirements:

Claude Opus 4.5: ~4,000 tokens
GPT, Gemini, Grok, DeepSeek: ~1,024 tokens

Common Patterns

1. Basic Chatbot with Cached System Prompt

import requests

# 2,000-token system prompt (cached after first request)
system_prompt = """You are a helpful AI assistant specialized in..."""

# First request - writes to cache
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What's the weather?"}
        ],
        "prompt_cache_key": "user-session-123"  # Route to same server
    }
)

# Result: 2,050 tokens processed, 0 cached, full cost

# Subsequent requests - reads from cache
response2 = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},  # MUST be byte-identical
            {"role": "user", "content": "Tell me a joke"}
        ],
        "prompt_cache_key": "user-session-123"  # Same key = better hit rate
    }
)

# Result: 2,000 cached (90% discount), 80 new tokens
# Cost savings: ~64% compared to no caching

2. Claude-Specific Cache Control (Advanced)

# For Claude: Explicitly mark cache breakpoints
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful AI assistant...",
                        "cache_control": {"type": "ephemeral"}  # Cache this block
                    }
                ]
            },
            {
                "role": "user",
                "content": "Summarize this document:\n\n{long_document}",
                "cache_control": {"type": "ephemeral"}  # Cache document too
            }
        ]
    }
)

# Venice auto-applies cache_control to system prompts for Claude
# Manual markers needed for caching beyond system messages

3. Monitoring Cache Performance

response = requests.post(...)
usage = response.json()["usage"]

cached_tokens = usage["prompt_tokens_details"]["cached_tokens"]
total_prompt_tokens = usage["prompt_tokens"]
cache_hit_rate = (cached_tokens / total_prompt_tokens) * 100

print(f"Cache Hit Rate: {cache_hit_rate:.1f}%")
print(f"Cached Tokens: {cached_tokens} (90% discount)")
print(f"New Tokens: {total_prompt_tokens - cached_tokens} (full price)")

# Troubleshooting
if cache_hit_rate == 0:
    print("⚠️ Zero cache hits - check:")
    print("  • Prompt below minimum threshold (~1,024 or ~4,000 for Claude)?")
    print("  • Prompt prefix changed (whitespace, timestamps)?")
    print("  • First request (writes cache, doesn't read)?")
    print("  • Cache expired (>5-10 min since last request)?")

4. Cost-Optimized Prompt Structure

# ❌ BAD: Dynamic content first (breaks cache)
prompt = f"""
Current time: {datetime.now()}
User ID: {user_id}
System: You are a helpful assistant...
"""

# ✅ GOOD: Static content first (maximizes cache)
prompt = """
System: You are a helpful assistant with these capabilities:
- Answer questions
- Provide code examples
- Debug issues

Reference documentation:
{static_docs}  # 5,000 tokens - cached

Examples:
{static_examples}  # 2,000 tokens - cached
"""

# Add dynamic content AFTER static prefix
messages = [
    {"role": "system", "content": prompt},  # 7,000 tokens cached
    {"role": "user", "content": f"User {user_id} asks: {question}"}  # New tokens only
]

5. Multi-Turn Conversation with Caching

// TypeScript example with conversation history
const conversationHistory: Message[] = [];
const SYSTEM_PROMPT = "...";  // 2,000 tokens

async function chat(userMessage: string, sessionId: string) {
  conversationHistory.push({
    role: "user",
    content: userMessage
  });

  const response = await fetch("https://api.venice.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.VENICE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      model: "gpt-5.2",  // Auto-caching, no cache_control needed
      messages: [
        { role: "system", content: SYSTEM_PROMPT },  // Cached
        ...conversationHistory  // Growing conversation - partially cached
      ],
      prompt_cache_key: sessionId  // Route to same server for warm cache
    })
  });

  const data = await response.json();
  conversationHistory.push({
    role: "assistant",
    content: data.choices[0].message.content
  });

  return {
    message: data.choices[0].message.content,
    cached_tokens: data.usage.prompt_tokens_details.cached_tokens,
    cache_savings_pct: (data.usage.prompt_tokens_details.cached_tokens / data.usage.prompt_tokens) * 100
  };
}

Prompt Caching Best Practices

1. Optimal Prompt Structure

Place static content before dynamic content to maximize cached prefix:

✅ GOOD ORDER:
- System prompt (always static)
- Reference documents (static)
- Examples (static)
- Tools/function definitions (static)
- Conversation history (grows but prefix is stable)
- Current user message (dynamic)

❌ BAD ORDER:
- Timestamp or request ID
- User message
- System prompt

2. Maintain Byte-Identical Prefixes

Cache keys derive from exact byte sequences. Avoid:

Different whitespace or line breaks
Timestamps within prompts
Random UUIDs before static content
Inconsistent formatting

3. Meet Minimum Token Thresholds

Claude Opus 4.5: ~4,000 tokens minimum
GPT, Gemini, Grok, DeepSeek: ~1,024 tokens minimum
Bundle context if below threshold

4. Use prompt_cache_key for Multi-Turn

Consistent routing hint across conversation improves hit rates:

prompt_cache_key = f"session-{user_id}-{conversation_id}"

5. Monitor Cache Economics

Cost Comparison:

Without Caching:
  Request 1: 2,050 tokens × $6.00/1M = $0.0123
  Request 2: 2,080 tokens × $6.00/1M = $0.0125
  Request 3: 2,120 tokens × $6.00/1M = $0.0127
  Total: $0.0375

With Caching (Claude):
  Request 1: 2,050 × $7.50/1M = $0.0154 (write premium)
  Request 2: 2,000 cached × $0.60/1M + 80 new × $6.00/1M = $0.0017
  Request 3: 2,000 cached × $0.60/1M + 120 new × $6.00/1M = $0.0019
  Total: $0.0190

Savings: 49% (break-even after 2nd request)

Provider-Specific Details

Claude (Anthropic)

Write Premium: +25% ($7.50/1M vs $6.00/1M)
Read Discount: 90% ($0.60/1M)
Requires: Explicit cache_control: {"type": "ephemeral"} markers
Minimum Tokens: ~4,000
Cache Lifetime: 5 minutes
Breakpoints: Up to 4 per request
Break-Even: 2nd request using same prefix

GPT (OpenAI)

Write Premium: None
Read Discount: 90%
Auto-Caching: No configuration needed
Minimum Tokens: ~1,024
Cache Lifetime: 5-10 minutes

Gemini (Google)

Write Premium: None
Read Discount: 75-90%
Auto-Caching: Yes
Minimum Tokens: ~1,024
Cache Lifetime: 1 hour (longest)

Grok (xAI)

Write Premium: None
Read Discount: 75-88%
Auto-Caching: Yes
Minimum Tokens: ~1,024
Cache Lifetime: 5 minutes

DeepSeek

Write Premium: None
Read Discount: 50%
Auto-Caching: Yes
Minimum Tokens: ~1,024
Cache Lifetime: 5 minutes

Troubleshooting Guide

Zero Cached Tokens

Symptoms: cached_tokens: 0 in every response

Causes:

Below Threshold: Prompt < 1,024 tokens (or < 4,000 for Claude)
- Fix: Add more context, examples, or documentation
Dynamic Prefix: Changing content at start of prompt
- Fix: Move timestamps/IDs to end, maintain static prefix
First Request: Initial request writes cache, doesn't read
- Fix: Normal behavior, subsequent requests will show caching
Cache Expired: > 5-10 min since last request
- Fix: Increase request frequency or accept cold starts
Server Inconsistency: Requests hitting different servers
- Fix: Use prompt_cache_key for routing affinity

Continuous Cache Writes

Symptoms: cache_creation_input_tokens > 0 on every request

Causes:

Prompt Mutations: Small changes breaking byte identity
- Fix: Remove timestamps, use consistent formatting
Missing cache_control (Claude only)
- Fix: Add explicit markers or rely on Venice auto-application
Insufficient Tokens: Below minimum threshold
- Fix: Enrich prompt with more context

Higher Than Expected Costs

Causes:

Claude Write Premium: Not enough reuse to amortize +25% cost
- Fix: Ensure at least 3-4 requests reuse same prefix
Changing Prompts: Every request = new cache write
- Fix: Stabilize prompt structure
Poor Structure: Dynamic before static
- Fix: Reorganize (see Best Practices #1)

Low Cache Hit Rate

Target: > 70% for chatbots, > 50% for variable workloads

Diagnostic:

hit_rate = (cached_tokens / prompt_tokens) * 100
if hit_rate < 50:
    print("Investigate: Prompt structure, token threshold, or expiration")

Reference Files

This skill includes comprehensive documentation in references/:

llms-txt.md - Complete Venice AI documentation
llms-full.md - Extended documentation with examples
llms.md - Quick reference summary

Use read to view specific reference files when detailed information is needed.

Advanced Use Cases

1. Caching with Tools and Functions

Function definitions integrate into cached prefix:

# Function definitions are part of static prefix
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]

# Tools definition cached along with system prompt
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    json={
        "model": "gpt-5.2",
        "messages": [...],
        "tools": tools  # Included in cached prefix
    }
)

2. Caching with Images (Vision Models)

Cache images and accompanying text for repeated questions:

# First request - cache image + context
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,..."}
            },
            {
                "type": "text",
                "text": "This is a diagram of our system architecture."
            }
        ]
    },
    {
        "role": "user",
        "content": "What are the main components?"
    }
]

# Subsequent requests - image and context cached
messages.append({
    "role": "user",
    "content": "How does data flow between components?"
})
# Only new question processed, image + context from cache

3. Document Analysis with Caching

Load large documents once, ask multiple questions:

# 10,000-token document cached after first request
document = """
[Long legal contract, technical manual, or research paper]
"""

def analyze_document(question: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        json={
            "model": "claude-opus-4-5",
            "messages": [
                {
                    "role": "system",
                    "content": f"Analyze this document:\n\n{document}"
                },
                {"role": "user", "content": question}
            ],
            "prompt_cache_key": "doc-analysis-session-xyz"
        }
    )

# Each question only processes new query, document stays cached
analyze_document("What are the key terms?")
analyze_document("Summarize section 3")
analyze_document("What are the risks?")
# 10,000 tokens cached at 90% discount × 3 requests = massive savings

4. RAG (Retrieval Augmented Generation) with Caching

Cache retrieved context for follow-up questions:

# Retrieve relevant chunks from vector DB
relevant_chunks = vector_db.search(user_query)
context = "\n\n".join([chunk.text for chunk in relevant_chunks])

# Cache the retrieved context (5,000 tokens)
messages = [
    {
        "role": "system",
        "content": f"""Use this context to answer questions:

{context}

Context above is authoritative. If answer not in context, say so."""
    },
    {"role": "user", "content": user_query}
]

# Follow-up questions reuse cached context
# Only new query processed, context cached

5. Batch Processing with Shared Context

Process multiple items with same instructions:

# Shared instructions cached
instructions = """
You are a code reviewer. For each code sample, check:
- Style compliance
- Security vulnerabilities
- Performance issues
- Best practices
"""

def review_code(code_sample: str, batch_id: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"Review:\n\n```\n{code_sample}\n```"}
            ],
            "prompt_cache_key": f"batch-{batch_id}"
        }
    )

# Review 100 files - instructions cached for all
for file in code_files:
    review_code(file.content, "batch-001")
# 100 requests, instructions cached 99 times

Working with This Skill

For Prompt Caching Optimization

Assess Current Usage: Calculate cache hit rate and cost per 1K tokens
Restructure Prompts: Move static content before dynamic
Implement Monitoring: Track cached_tokens in responses
Tune for Economics: Balance write premium (Claude) vs read savings

For API Integration

Choose Provider: Claude (quality), GPT (balanced), Gemini (cost), Grok (speed)
Set Up Authentication: Get API key from Venice.ai dashboard
Implement Base Client: Use examples from Quick Reference
Add Error Handling: Handle rate limits, timeouts, cache misses

For Cost Reduction

Identify High-Volume Endpoints: Where caching has biggest impact
Measure Baseline: Cost and latency without caching
Apply Caching Patterns: Use examples from this skill
Validate Savings: Compare cached vs uncached costs

For Code Examples

The Quick Reference section contains 5 production-ready patterns. The Advanced Use Cases section covers 5 specialized scenarios.

Performance Benchmarks

Latency Improvements

Short prompts (< 1,000 tokens): Minimal impact (~5-10% faster)
Medium prompts (1,000-5,000 tokens): 30-50% faster
Long prompts (> 5,000 tokens): Up to 80% faster

Cost Reductions

Single request: No savings (cache write cost)
2-5 requests: 30-50% savings
10+ requests: 60-80% savings
100+ requests: Up to 90% savings

Cache Hit Rates

Chatbots (stable system prompt): 70-90%
Document analysis: 80-95%
RAG systems: 60-80%
Batch processing: 85-95%

API Reference Summary

Base URL

https://api.venice.ai/api/v1

Authentication

Authorization: Bearer YOUR_API_KEY

Key Parameters

| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | model | string | Yes | Model ID (e.g., claude-opus-4-5, gpt-5.2) | | messages | array | Yes | Conversation messages | | prompt_cache_key | string | No | Routing hint for cache affinity | | cache_control | object | No | Cache markers (Claude only) | | temperature | float | No | Sampling temperature (0-2) | | max_tokens | int | No | Maximum completion length |

Response Fields

{
  "usage": {
    "prompt_tokens": 5500,
    "completion_tokens": 200,
    "prompt_tokens_details": {
      "cached_tokens": 5000,          // Tokens from cache (90% discount)
      "cache_creation_input_tokens": 0 // Tokens written to cache (may have premium)
    }
  }
}

Resources

references/

Organized documentation extracted from official Venice AI sources:

llms-txt.md: Complete API documentation and guides
llms-full.md: Extended documentation with additional examples
llms.md: Quick reference for common operations

Code Examples

This skill includes 10+ production-ready code examples:

Python, TypeScript, JavaScript
Error handling and monitoring
Cost optimization patterns
Multi-provider integration

Notes

Version 2.0: Enhanced with comprehensive prompt caching optimization guide
Focus: Cost reduction (36-90%), latency improvement (up to 80%)
Coverage: All major providers (Claude, GPT, Gemini, Grok, DeepSeek, MiniMax, Kimi)
Production-Ready: Battle-tested patterns from real-world deployments
Updated: January 2026 with latest Venice AI capabilities

Success Metrics

Track these KPIs to measure caching effectiveness:

Cost Metrics

cost_per_1k_tokens = total_cost / (total_tokens / 1000)
savings_percentage = (uncached_cost - cached_cost) / uncached_cost * 100
roi_requests = write_premium_cost / read_savings_per_request  # Break-even point

Performance Metrics

cache_hit_rate = cached_tokens / total_prompt_tokens * 100
avg_latency_improvement = (uncached_latency - cached_latency) / uncached_latency * 100
tokens_saved = cached_tokens * (1 - discount_rate)

Quality Metrics

consistency_score = requests_with_cache_hits / total_requests * 100
cache_uptime = requests_within_ttl / total_requests * 100

When NOT to Use Caching

Caching may not be cost-effective when:

Single-use prompts: No repeat requests to amortize write premium
Highly dynamic: Every request has unique prefix
Below threshold: Prompts consistently < 1,024 tokens (or < 4,000 for Claude)
Infrequent requests: > 10 min between requests (cache expires)
Rapid iteration: Constantly changing prompt structure during development

Updating

To refresh this skill with updated Venice AI documentation:

Check for new features at https://docs.venice.ai
Re-run skill-seekers scrape with updated config
Merge new examples into Quick Reference section
Update provider-specific details if pricing changes
Increment version number and document changes

Venice AI Skill

When to Use This Skill

Core Use Cases

Prompt Caching Optimization: Reduce API costs by 36-90% and latency by up to 80%
Multi-Provider AI Integration: Access Claude, GPT, Gemini, Grok, DeepSeek via unified API
Cost Optimization: Implement caching strategies for high-volume AI applications
Performance Tuning: Optimize prompt structure for maximum cache hit rates
Conversation Management: Build efficient multi-turn chatbots with cached context

Trigger Phrases

"Optimize Venice AI costs with prompt caching"
"Implement prompt caching for [Claude/GPT/Gemini]"
"Reduce AI API latency using Venice"
"Build cost-efficient chatbot with Venice AI"
"Integrate multiple AI providers through Venice"
"Debug Venice AI cache hit rates"
"Structure prompts for maximum caching efficiency"

Quick Reference

Prompt Caching Fundamentals

How It Works: Caching uses prefix matching—identical prompt beginnings across requests retrieve pre-processed tokens instead of recomputing them.

Key Benefit: Reduce costs by 36-90% and latency by up to 80% for longer prompts.

Minimum Requirements:

Claude Opus 4.5: ~4,000 tokens
GPT, Gemini, Grok, DeepSeek: ~1,024 tokens

Common Patterns

1. Basic Chatbot with Cached System Prompt

import requests

# 2,000-token system prompt (cached after first request)
system_prompt = """You are a helpful AI assistant specialized in..."""

# First request - writes to cache
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What's the weather?"}
        ],
        "prompt_cache_key": "user-session-123"  # Route to same server
    }
)

# Result: 2,050 tokens processed, 0 cached, full cost

# Subsequent requests - reads from cache
response2 = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},  # MUST be byte-identical
            {"role": "user", "content": "Tell me a joke"}
        ],
        "prompt_cache_key": "user-session-123"  # Same key = better hit rate
    }
)

# Result: 2,000 cached (90% discount), 80 new tokens
# Cost savings: ~64% compared to no caching

2. Claude-Specific Cache Control (Advanced)

# For Claude: Explicitly mark cache breakpoints
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful AI assistant...",
                        "cache_control": {"type": "ephemeral"}  # Cache this block
                    }
                ]
            },
            {
                "role": "user",
                "content": "Summarize this document:\n\n{long_document}",
                "cache_control": {"type": "ephemeral"}  # Cache document too
            }
        ]
    }
)

# Venice auto-applies cache_control to system prompts for Claude
# Manual markers needed for caching beyond system messages

3. Monitoring Cache Performance

response = requests.post(...)
usage = response.json()["usage"]

cached_tokens = usage["prompt_tokens_details"]["cached_tokens"]
total_prompt_tokens = usage["prompt_tokens"]
cache_hit_rate = (cached_tokens / total_prompt_tokens) * 100

print(f"Cache Hit Rate: {cache_hit_rate:.1f}%")
print(f"Cached Tokens: {cached_tokens} (90% discount)")
print(f"New Tokens: {total_prompt_tokens - cached_tokens} (full price)")

# Troubleshooting
if cache_hit_rate == 0:
    print("⚠️ Zero cache hits - check:")
    print("  • Prompt below minimum threshold (~1,024 or ~4,000 for Claude)?")
    print("  • Prompt prefix changed (whitespace, timestamps)?")
    print("  • First request (writes cache, doesn't read)?")
    print("  • Cache expired (>5-10 min since last request)?")

4. Cost-Optimized Prompt Structure

# ❌ BAD: Dynamic content first (breaks cache)
prompt = f"""
Current time: {datetime.now()}
User ID: {user_id}
System: You are a helpful assistant...
"""

# ✅ GOOD: Static content first (maximizes cache)
prompt = """
System: You are a helpful assistant with these capabilities:
- Answer questions
- Provide code examples
- Debug issues

Reference documentation:
{static_docs}  # 5,000 tokens - cached

Examples:
{static_examples}  # 2,000 tokens - cached
"""

# Add dynamic content AFTER static prefix
messages = [
    {"role": "system", "content": prompt},  # 7,000 tokens cached
    {"role": "user", "content": f"User {user_id} asks: {question}"}  # New tokens only
]

5. Multi-Turn Conversation with Caching

// TypeScript example with conversation history
const conversationHistory: Message[] = [];
const SYSTEM_PROMPT = "...";  // 2,000 tokens

async function chat(userMessage: string, sessionId: string) {
  conversationHistory.push({
    role: "user",
    content: userMessage
  });

  const response = await fetch("https://api.venice.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.VENICE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      model: "gpt-5.2",  // Auto-caching, no cache_control needed
      messages: [
        { role: "system", content: SYSTEM_PROMPT },  // Cached
        ...conversationHistory  // Growing conversation - partially cached
      ],
      prompt_cache_key: sessionId  // Route to same server for warm cache
    })
  });

  const data = await response.json();
  conversationHistory.push({
    role: "assistant",
    content: data.choices[0].message.content
  });

  return {
    message: data.choices[0].message.content,
    cached_tokens: data.usage.prompt_tokens_details.cached_tokens,
    cache_savings_pct: (data.usage.prompt_tokens_details.cached_tokens / data.usage.prompt_tokens) * 100
  };
}

Prompt Caching Best Practices

1. Optimal Prompt Structure

Place static content before dynamic content to maximize cached prefix:

✅ GOOD ORDER:
- System prompt (always static)
- Reference documents (static)
- Examples (static)
- Tools/function definitions (static)
- Conversation history (grows but prefix is stable)
- Current user message (dynamic)

❌ BAD ORDER:
- Timestamp or request ID
- User message
- System prompt

2. Maintain Byte-Identical Prefixes

Cache keys derive from exact byte sequences. Avoid:

Different whitespace or line breaks
Timestamps within prompts
Random UUIDs before static content
Inconsistent formatting

3. Meet Minimum Token Thresholds

Claude Opus 4.5: ~4,000 tokens minimum
GPT, Gemini, Grok, DeepSeek: ~1,024 tokens minimum
Bundle context if below threshold

4. Use prompt_cache_key for Multi-Turn

Consistent routing hint across conversation improves hit rates:

prompt_cache_key = f"session-{user_id}-{conversation_id}"

5. Monitor Cache Economics

Cost Comparison:

Without Caching:
  Request 1: 2,050 tokens × $6.00/1M = $0.0123
  Request 2: 2,080 tokens × $6.00/1M = $0.0125
  Request 3: 2,120 tokens × $6.00/1M = $0.0127
  Total: $0.0375

With Caching (Claude):
  Request 1: 2,050 × $7.50/1M = $0.0154 (write premium)
  Request 2: 2,000 cached × $0.60/1M + 80 new × $6.00/1M = $0.0017
  Request 3: 2,000 cached × $0.60/1M + 120 new × $6.00/1M = $0.0019
  Total: $0.0190

Savings: 49% (break-even after 2nd request)

Provider-Specific Details

Claude (Anthropic)

Write Premium: +25% ($7.50/1M vs $6.00/1M)
Read Discount: 90% ($0.60/1M)
Requires: Explicit cache_control: {"type": "ephemeral"} markers
Minimum Tokens: ~4,000
Cache Lifetime: 5 minutes
Breakpoints: Up to 4 per request
Break-Even: 2nd request using same prefix

GPT (OpenAI)

Write Premium: None
Read Discount: 90%
Auto-Caching: No configuration needed
Minimum Tokens: ~1,024
Cache Lifetime: 5-10 minutes

Gemini (Google)

Write Premium: None
Read Discount: 75-90%
Auto-Caching: Yes
Minimum Tokens: ~1,024
Cache Lifetime: 1 hour (longest)

Grok (xAI)

Write Premium: None
Read Discount: 75-88%
Auto-Caching: Yes
Minimum Tokens: ~1,024
Cache Lifetime: 5 minutes

DeepSeek

Write Premium: None
Read Discount: 50%
Auto-Caching: Yes
Minimum Tokens: ~1,024
Cache Lifetime: 5 minutes

Troubleshooting Guide

Zero Cached Tokens

Symptoms: cached_tokens: 0 in every response

Causes:

Below Threshold: Prompt < 1,024 tokens (or < 4,000 for Claude)
- Fix: Add more context, examples, or documentation
Dynamic Prefix: Changing content at start of prompt
- Fix: Move timestamps/IDs to end, maintain static prefix
First Request: Initial request writes cache, doesn't read
- Fix: Normal behavior, subsequent requests will show caching
Cache Expired: > 5-10 min since last request
- Fix: Increase request frequency or accept cold starts
Server Inconsistency: Requests hitting different servers
- Fix: Use prompt_cache_key for routing affinity

Continuous Cache Writes

Symptoms: cache_creation_input_tokens > 0 on every request

Causes:

Prompt Mutations: Small changes breaking byte identity
- Fix: Remove timestamps, use consistent formatting
Missing cache_control (Claude only)
- Fix: Add explicit markers or rely on Venice auto-application
Insufficient Tokens: Below minimum threshold
- Fix: Enrich prompt with more context

Higher Than Expected Costs

Causes:

Claude Write Premium: Not enough reuse to amortize +25% cost
- Fix: Ensure at least 3-4 requests reuse same prefix
Changing Prompts: Every request = new cache write
- Fix: Stabilize prompt structure
Poor Structure: Dynamic before static
- Fix: Reorganize (see Best Practices #1)

Low Cache Hit Rate

Target: > 70% for chatbots, > 50% for variable workloads

Diagnostic:

hit_rate = (cached_tokens / prompt_tokens) * 100
if hit_rate < 50:
    print("Investigate: Prompt structure, token threshold, or expiration")

Reference Files

This skill includes comprehensive documentation in references/:

llms-txt.md - Complete Venice AI documentation
llms-full.md - Extended documentation with examples
llms.md - Quick reference summary

Use read to view specific reference files when detailed information is needed.

Advanced Use Cases

1. Caching with Tools and Functions

Function definitions integrate into cached prefix:

# Function definitions are part of static prefix
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]

# Tools definition cached along with system prompt
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    json={
        "model": "gpt-5.2",
        "messages": [...],
        "tools": tools  # Included in cached prefix
    }
)

2. Caching with Images (Vision Models)

Cache images and accompanying text for repeated questions:

# First request - cache image + context
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,..."}
            },
            {
                "type": "text",
                "text": "This is a diagram of our system architecture."
            }
        ]
    },
    {
        "role": "user",
        "content": "What are the main components?"
    }
]

# Subsequent requests - image and context cached
messages.append({
    "role": "user",
    "content": "How does data flow between components?"
})
# Only new question processed, image + context from cache

3. Document Analysis with Caching

Load large documents once, ask multiple questions:

# 10,000-token document cached after first request
document = """
[Long legal contract, technical manual, or research paper]
"""

def analyze_document(question: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        json={
            "model": "claude-opus-4-5",
            "messages": [
                {
                    "role": "system",
                    "content": f"Analyze this document:\n\n{document}"
                },
                {"role": "user", "content": question}
            ],
            "prompt_cache_key": "doc-analysis-session-xyz"
        }
    )

# Each question only processes new query, document stays cached
analyze_document("What are the key terms?")
analyze_document("Summarize section 3")
analyze_document("What are the risks?")
# 10,000 tokens cached at 90% discount × 3 requests = massive savings

4. RAG (Retrieval Augmented Generation) with Caching

Cache retrieved context for follow-up questions:

# Retrieve relevant chunks from vector DB
relevant_chunks = vector_db.search(user_query)
context = "\n\n".join([chunk.text for chunk in relevant_chunks])

# Cache the retrieved context (5,000 tokens)
messages = [
    {
        "role": "system",
        "content": f"""Use this context to answer questions:

{context}

Context above is authoritative. If answer not in context, say so."""
    },
    {"role": "user", "content": user_query}
]

# Follow-up questions reuse cached context
# Only new query processed, context cached

5. Batch Processing with Shared Context

Process multiple items with same instructions:

# Shared instructions cached
instructions = """
You are a code reviewer. For each code sample, check:
- Style compliance
- Security vulnerabilities
- Performance issues
- Best practices
"""

def review_code(code_sample: str, batch_id: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"Review:\n\n```\n{code_sample}\n```"}
            ],
            "prompt_cache_key": f"batch-{batch_id}"
        }
    )

# Review 100 files - instructions cached for all
for file in code_files:
    review_code(file.content, "batch-001")
# 100 requests, instructions cached 99 times

Working with This Skill

For Prompt Caching Optimization

Assess Current Usage: Calculate cache hit rate and cost per 1K tokens
Restructure Prompts: Move static content before dynamic
Implement Monitoring: Track cached_tokens in responses
Tune for Economics: Balance write premium (Claude) vs read savings

For API Integration

Choose Provider: Claude (quality), GPT (balanced), Gemini (cost), Grok (speed)
Set Up Authentication: Get API key from Venice.ai dashboard
Implement Base Client: Use examples from Quick Reference
Add Error Handling: Handle rate limits, timeouts, cache misses

For Cost Reduction

Identify High-Volume Endpoints: Where caching has biggest impact
Measure Baseline: Cost and latency without caching
Apply Caching Patterns: Use examples from this skill
Validate Savings: Compare cached vs uncached costs

For Code Examples

The Quick Reference section contains 5 production-ready patterns. The Advanced Use Cases section covers 5 specialized scenarios.

Performance Benchmarks

Latency Improvements

Short prompts (< 1,000 tokens): Minimal impact (~5-10% faster)
Medium prompts (1,000-5,000 tokens): 30-50% faster
Long prompts (> 5,000 tokens): Up to 80% faster

Cost Reductions

Single request: No savings (cache write cost)
2-5 requests: 30-50% savings
10+ requests: 60-80% savings
100+ requests: Up to 90% savings

Cache Hit Rates

Chatbots (stable system prompt): 70-90%
Document analysis: 80-95%
RAG systems: 60-80%
Batch processing: 85-95%

API Reference Summary

Base URL

https://api.venice.ai/api/v1

Authentication

Authorization: Bearer YOUR_API_KEY

Key Parameters

Response Fields

{
  "usage": {
    "prompt_tokens": 5500,
    "completion_tokens": 200,
    "prompt_tokens_details": {
      "cached_tokens": 5000,          // Tokens from cache (90% discount)
      "cache_creation_input_tokens": 0 // Tokens written to cache (may have premium)
    }
  }
}

Resources

references/

Organized documentation extracted from official Venice AI sources:

llms-txt.md: Complete API documentation and guides
llms-full.md: Extended documentation with additional examples
llms.md: Quick reference for common operations

Code Examples

This skill includes 10+ production-ready code examples:

Python, TypeScript, JavaScript
Error handling and monitoring
Cost optimization patterns
Multi-provider integration

Notes

Version 2.0: Enhanced with comprehensive prompt caching optimization guide
Focus: Cost reduction (36-90%), latency improvement (up to 80%)
Coverage: All major providers (Claude, GPT, Gemini, Grok, DeepSeek, MiniMax, Kimi)
Production-Ready: Battle-tested patterns from real-world deployments
Updated: January 2026 with latest Venice AI capabilities

Success Metrics

Track these KPIs to measure caching effectiveness:

Cost Metrics

cost_per_1k_tokens = total_cost / (total_tokens / 1000)
savings_percentage = (uncached_cost - cached_cost) / uncached_cost * 100
roi_requests = write_premium_cost / read_savings_per_request  # Break-even point

Performance Metrics

cache_hit_rate = cached_tokens / total_prompt_tokens * 100
avg_latency_improvement = (uncached_latency - cached_latency) / uncached_latency * 100
tokens_saved = cached_tokens * (1 - discount_rate)

Quality Metrics

consistency_score = requests_with_cache_hits / total_requests * 100
cache_uptime = requests_within_ttl / total_requests * 100

When NOT to Use Caching

Caching may not be cost-effective when:

Single-use prompts: No repeat requests to amortize write premium
Highly dynamic: Every request has unique prefix
Below threshold: Prompts consistently < 1,024 tokens (or < 4,000 for Claude)
Infrequent requests: > 10 min between requests (cache expires)
Rapid iteration: Constantly changing prompt structure during development

Updating

To refresh this skill with updated Venice AI documentation:

Check for new features at https://docs.venice.ai
Re-run skill-seekers scrape with updated config
Merge new examples into Quick Reference section
Update provider-specific details if pricing changes
Increment version number and document changes

Adoption

enuno/venice-ai

$ install --global

Security Scan Results

SKILL.md

Venice AI Skill

When to Use This Skill

Core Use Cases

Trigger Phrases

Quick Reference

Prompt Caching Fundamentals

Common Patterns

1. Basic Chatbot with Cached System Prompt

2. Claude-Specific Cache Control (Advanced)

3. Monitoring Cache Performance

4. Cost-Optimized Prompt Structure

5. Multi-Turn Conversation with Caching

Prompt Caching Best Practices

1. Optimal Prompt Structure

2. Maintain Byte-Identical Prefixes

3. Meet Minimum Token Thresholds

4. Use prompt_cache_key for Multi-Turn

5. Monitor Cache Economics

Provider-Specific Details

Claude (Anthropic)

GPT (OpenAI)

Gemini (Google)

Grok (xAI)

DeepSeek

Troubleshooting Guide

Zero Cached Tokens

Continuous Cache Writes

Higher Than Expected Costs

Low Cache Hit Rate

Reference Files

Advanced Use Cases

1. Caching with Tools and Functions

2. Caching with Images (Vision Models)

3. Document Analysis with Caching

4. RAG (Retrieval Augmented Generation) with Caching

5. Batch Processing with Shared Context

Working with This Skill

For Prompt Caching Optimization

For API Integration

For Cost Reduction

For Code Examples

Performance Benchmarks

Latency Improvements

Cost Reductions

Cache Hit Rates

API Reference Summary

Base URL

Authentication

Key Parameters

Response Fields

Resources

references/

Code Examples

Notes

Success Metrics

Cost Metrics

Performance Metrics

Quality Metrics

When NOT to Use Caching

Updating

Related Skills

enuno/mempalace

enuno/langsmith

enuno/langgraph

enuno/langgraph-graph-api

enuno/venice-ai

$ install --global

Security Scan Results

SKILL.md

Venice AI Skill

When to Use This Skill

Core Use Cases

Trigger Phrases

Quick Reference

Prompt Caching Fundamentals