skills-templates/venice-ai/SKILL.md
Comprehensive Venice AI platform expertise including prompt caching optimization, API integration, cost reduction strategies, and multi-provider AI model access
Expert assistance with Venice AI platform development, featuring advanced prompt caching optimization, API integration patterns, and cost-effective AI model access across multiple providers (Anthropic Claude, OpenAI GPT, Google Gemini, xAI Grok, DeepSeek, and more).
How It Works: Caching uses prefix matching—identical prompt beginnings across requests retrieve pre-processed tokens instead of recomputing them.
Key Benefit: Reduce costs by 36-90% and latency by up to 80% for longer prompts.
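To make prefix matching concrete, the sketch below measures how many leading bytes two prompts share. It is only an illustration: real caches match on tokens rather than raw bytes, and `shared_prefix_len` is a hypothetical helper, not part of the Venice API.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Return the number of leading UTF-8 bytes that two prompts share."""
    a_bytes, b_bytes = a.encode("utf-8"), b.encode("utf-8")
    n = 0
    for x, y in zip(a_bytes, b_bytes):
        if x != y:
            break
        n += 1
    return n


static = "You are a helpful assistant.\n"
stable_a = static + "User: What's the weather?"
stable_b = static + "User: Tell me a joke"
drifting = "Time: 12:00\n" + static + "User: Tell me a joke"

# Static content first: the whole static part (and a bit more) is shared
print(shared_prefix_len(stable_a, stable_b))
# Dynamic content first: the prompts diverge at the very first byte
print(shared_prefix_len(stable_a, drifting))
```

Only the shared leading portion can be served from cache, which is why a timestamp at the top of a prompt defeats caching for everything after it.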
Minimum Requirements:
```python
import os

import requests

API_KEY = os.environ["VENICE_API_KEY"]  # Read the key from the environment

# 2,000-token system prompt (cached after first request)
system_prompt = """You are a helpful AI assistant specialized in..."""

# First request - writes to cache
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What's the weather?"}
        ],
        "prompt_cache_key": "user-session-123"  # Route to same server
    }
)
# Result: 2,050 tokens processed, 0 cached, full cost

# Subsequent requests - reads from cache
response2 = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},  # MUST be byte-identical
            {"role": "user", "content": "Tell me a joke"}
        ],
        "prompt_cache_key": "user-session-123"  # Same key = better hit rate
    }
)
# Result: 2,000 cached (90% discount), 80 new tokens
# Cost savings: ~64% compared to no caching
```
```python
# For Claude: Explicitly mark cache breakpoints
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful AI assistant...",
                        "cache_control": {"type": "ephemeral"}  # Cache this block
                    }
                ]
            },
            {
                "role": "user",
                # f-string so that long_document (your document text) is interpolated
                "content": f"Summarize this document:\n\n{long_document}",
                "cache_control": {"type": "ephemeral"}  # Cache document too
            }
        ]
    }
)
# Venice auto-applies cache_control to system prompts for Claude
# Manual markers needed for caching beyond system messages
```
```python
response = requests.post(...)
usage = response.json()["usage"]
cached_tokens = usage["prompt_tokens_details"]["cached_tokens"]
total_prompt_tokens = usage["prompt_tokens"]
cache_hit_rate = (cached_tokens / total_prompt_tokens) * 100

print(f"Cache Hit Rate: {cache_hit_rate:.1f}%")
print(f"Cached Tokens: {cached_tokens} (90% discount)")
print(f"New Tokens: {total_prompt_tokens - cached_tokens} (full price)")

# Troubleshooting
if cache_hit_rate == 0:
    print("⚠️ Zero cache hits - check:")
    print(" • Prompt below minimum threshold (~1,024 or ~4,000 for Claude)?")
    print(" • Prompt prefix changed (whitespace, timestamps)?")
    print(" • First request (writes cache, doesn't read)?")
    print(" • Cache expired (>5-10 min since last request)?")
```
```python
from datetime import datetime

# ❌ BAD: Dynamic content first (breaks cache)
prompt = f"""
Current time: {datetime.now()}
User ID: {user_id}
System: You are a helpful assistant...
"""

# ✅ GOOD: Static content first (maximizes cache)
# static_docs (~5,000 tokens) and static_examples (~2,000 tokens) are cached
prompt = f"""
System: You are a helpful assistant with these capabilities:
- Answer questions
- Provide code examples
- Debug issues

Reference documentation:
{static_docs}

Examples:
{static_examples}
"""

# Add dynamic content AFTER static prefix
messages = [
    {"role": "system", "content": prompt},  # 7,000 tokens cached
    {"role": "user", "content": f"User {user_id} asks: {question}"}  # New tokens only
]
```
```typescript
// TypeScript example with conversation history
type Message = { role: "system" | "user" | "assistant"; content: string };

const conversationHistory: Message[] = [];
const SYSTEM_PROMPT = "..."; // 2,000 tokens

async function chat(userMessage: string, sessionId: string) {
  conversationHistory.push({
    role: "user",
    content: userMessage
  });

  const response = await fetch("https://api.venice.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.VENICE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      model: "gpt-5.2", // Auto-caching, no cache_control needed
      messages: [
        { role: "system", content: SYSTEM_PROMPT }, // Cached
        ...conversationHistory // Growing conversation - partially cached
      ],
      prompt_cache_key: sessionId // Route to same server for warm cache
    })
  });

  const data = await response.json();
  conversationHistory.push({
    role: "assistant",
    content: data.choices[0].message.content
  });

  return {
    message: data.choices[0].message.content,
    cached_tokens: data.usage.prompt_tokens_details.cached_tokens,
    cache_savings_pct: (data.usage.prompt_tokens_details.cached_tokens / data.usage.prompt_tokens) * 100
  };
}
```
Place static content before dynamic content to maximize cached prefix:
✅ GOOD ORDER:
- System prompt (always static)
- Reference documents (static)
- Examples (static)
- Tools/function definitions (static)
- Conversation history (grows but prefix is stable)
- Current user message (dynamic)
❌ BAD ORDER:
- Timestamp or request ID
- User message
- System prompt
Cache keys derive from exact byte sequences. Avoid:
Consistent routing hint across conversation improves hit rates:
prompt_cache_key = f"session-{user_id}-{conversation_id}"
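Because cache lookups depend on exact byte sequences, it can help to fingerprint the static prefix and to build the routing hint in one place. The sketch below is a minimal illustration using only the standard library; `prefix_fingerprint` and `make_cache_key` are hypothetical helper names, and the fingerprint is a local debugging aid, not something the API requires.

```python
import hashlib


def prefix_fingerprint(static_prefix: str) -> str:
    """Hash the exact bytes of the static prefix so accidental drift is detectable."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:16]


def make_cache_key(user_id: str, conversation_id: str) -> str:
    """Build the same routing hint for every request in one conversation."""
    return f"session-{user_id}-{conversation_id}"


baseline = prefix_fingerprint("You are a helpful assistant.")
with_trailing_space = prefix_fingerprint("You are a helpful assistant. ")

print(baseline == with_trailing_space)  # False: one extra space changes the bytes
print(make_cache_key("42", "abc"))      # session-42-abc
```

Logging the fingerprint alongside `cached_tokens` makes it easier to tell whether a cache miss came from a changed prefix or from expiry.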
Cost Comparison:
Without Caching:
Request 1: 2,050 tokens × $6.00/1M = $0.0123
Request 2: 2,080 tokens × $6.00/1M = $0.0125
Request 3: 2,120 tokens × $6.00/1M = $0.0127
Total: $0.0375
With Caching (Claude):
Request 1: 2,050 × $7.50/1M = $0.0154 (write premium)
Request 2: 2,000 cached × $0.60/1M + 80 new × $6.00/1M = $0.0017
Request 3: 2,000 cached × $0.60/1M + 120 new × $6.00/1M = $0.0019
Total: $0.0190
Savings: 49% (break-even after 2nd request)
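The arithmetic in this comparison can be reproduced with a few lines of Python. This is a sketch that hard-codes the example prices and token counts shown above; the function name `request_cost` is hypothetical, and actual prices vary by model.

```python
# USD per 1M input tokens, copied from the example comparison
BASE_PRICE = 6.00   # uncached input
WRITE_PRICE = 7.50  # cache write (premium)
READ_PRICE = 0.60   # cache read (90% discount)


def request_cost(new_tokens: int, cached_tokens: int = 0, write_tokens: int = 0) -> float:
    """Input cost in USD for one request, split by how each token is billed."""
    return (
        new_tokens * BASE_PRICE
        + cached_tokens * READ_PRICE
        + write_tokens * WRITE_PRICE
    ) / 1_000_000


uncached_total = sum(request_cost(n) for n in (2050, 2080, 2120))
cached_total = (
    request_cost(0, write_tokens=2050)        # request 1 writes the cache
    + request_cost(80, cached_tokens=2000)    # request 2 reads it
    + request_cost(120, cached_tokens=2000)   # request 3 reads it
)
savings_pct = (uncached_total - cached_total) / uncached_total * 100

print(f"Without caching: ${uncached_total:.4f}")
print(f"With caching:    ${cached_total:.4f}")
print(f"Savings:         {savings_pct:.0f}%")
```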
cache_control: {"type": "ephemeral"} markers
Symptoms: cached_tokens: 0 in every response
Causes:
prompt_cache_key for routing affinity
Symptoms: cache_creation_input_tokens > 0 on every request
Causes:
Causes:
Target: > 70% for chatbots, > 50% for variable workloads
Diagnostic:
```python
hit_rate = (cached_tokens / prompt_tokens) * 100
if hit_rate < 50:
    print("Investigate: Prompt structure, token threshold, or expiration")
```
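A single response says little about overall cache health, so the target is easier to check against a token-weighted rate over many responses. The sketch below assumes each item is a `usage` object shaped like the ones this skill reads (`prompt_tokens` and `prompt_tokens_details.cached_tokens`); `aggregate_hit_rate` is a hypothetical helper.

```python
def aggregate_hit_rate(usages: list[dict]) -> float:
    """Token-weighted cache hit rate (percent) across many responses."""
    cached = sum(
        (u.get("prompt_tokens_details") or {}).get("cached_tokens", 0) for u in usages
    )
    total = sum(u.get("prompt_tokens", 0) for u in usages)
    # Guard against an empty window so the diagnostic never divides by zero
    return (cached / total) * 100 if total else 0.0


# Three requests: one cache write followed by two cache reads
sample = [
    {"prompt_tokens": 2050, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 2080, "prompt_tokens_details": {"cached_tokens": 2000}},
    {"prompt_tokens": 2120, "prompt_tokens_details": {"cached_tokens": 2000}},
]

rate = aggregate_hit_rate(sample)
print(f"Aggregate hit rate: {rate:.1f}%")
if rate < 50:
    print("Investigate: Prompt structure, token threshold, or expiration")
```

Weighting by tokens rather than averaging per-request percentages keeps short requests from skewing the figure.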
This skill includes comprehensive documentation in references/:
Use read to view specific reference files when detailed information is needed.
Function definitions integrate into cached prefix:
```python
# Function definitions are part of static prefix
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]

# Tools definition cached along with system prompt
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    json={
        "model": "gpt-5.2",
        "messages": [...],
        "tools": tools  # Included in cached prefix
    }
)
```
Cache images and accompanying text for repeated questions:
```python
# First request - cache image + context
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,..."}
            },
            {
                "type": "text",
                "text": "This is a diagram of our system architecture."
            }
        ]
    },
    {
        "role": "user",
        "content": "What are the main components?"
    }
]

# Subsequent requests - image and context cached
messages.append({
    "role": "user",
    "content": "How does data flow between components?"
})
# Only new question processed, image + context from cache
```
Load large documents once, ask multiple questions:
```python
# 10,000-token document cached after first request
document = """
[Long legal contract, technical manual, or research paper]
"""

def analyze_document(question: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        json={
            "model": "claude-opus-4-5",
            "messages": [
                {
                    "role": "system",
                    "content": f"Analyze this document:\n\n{document}"
                },
                {"role": "user", "content": question}
            ],
            "prompt_cache_key": "doc-analysis-session-xyz"
        }
    )

# The first call writes the cache; later calls only process the new question
analyze_document("What are the key terms?")
analyze_document("Summarize section 3")
analyze_document("What are the risks?")
# 10,000 document tokens are read at a 90% discount on every request after the first
```
Cache retrieved context for follow-up questions:
```python
# Retrieve relevant chunks from vector DB
relevant_chunks = vector_db.search(user_query)
context = "\n\n".join([chunk.text for chunk in relevant_chunks])

# Cache the retrieved context (5,000 tokens)
messages = [
    {
        "role": "system",
        "content": f"""Use this context to answer questions:
{context}
Context above is authoritative. If answer not in context, say so."""
    },
    {"role": "user", "content": user_query}
]
# Follow-up questions reuse cached context
# Only new query processed, context cached
```
Process multiple items with same instructions:
```python
# Shared instructions cached
instructions = """
You are a code reviewer. For each code sample, check:
- Style compliance
- Security vulnerabilities
- Performance issues
- Best practices
"""

def review_code(code_sample: str, batch_id: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"Review:\n\n```\n{code_sample}\n```"}
            ],
            "prompt_cache_key": f"batch-{batch_id}"
        }
    )

# Review 100 files - instructions cached for all
for file in code_files:
    review_code(file.content, "batch-001")
# 100 requests, instructions cached 99 times
```
cached_tokens in responses
The Quick Reference section contains 5 production-ready patterns. The Advanced Use Cases section covers 5 specialized scenarios.
https://api.venice.ai/api/v1
Authorization: Bearer YOUR_API_KEY
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| model | string | Yes | Model ID (e.g., claude-opus-4-5, gpt-5.2) |
| messages | array | Yes | Conversation messages |
| prompt_cache_key | string | No | Routing hint for cache affinity |
| cache_control | object | No | Cache markers (Claude only) |
| temperature | float | No | Sampling temperature (0-2) |
| max_tokens | int | No | Maximum completion length |
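As a rough illustration of how the parameters in this table fit together, the sketch below assembles a request body and rejects a temperature outside the documented 0-2 range. `build_payload` is a hypothetical helper, not part of any Venice SDK, and it leaves `cache_control` to the individual message content since that is where the examples in this skill place it.

```python
from typing import Optional


def build_payload(
    model: str,
    messages: list[dict],
    prompt_cache_key: Optional[str] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> dict:
    """Assemble a chat completions body, including only the options that were set."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    payload: dict = {"model": model, "messages": messages}
    optional = {
        "prompt_cache_key": prompt_cache_key,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # Omit unset options so the request stays minimal
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


body = build_payload(
    "claude-opus-4-5",
    [{"role": "user", "content": "Hello"}],
    prompt_cache_key="user-session-123",
)
print(sorted(body))  # ['messages', 'model', 'prompt_cache_key']
```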
```json
{
  "usage": {
    "prompt_tokens": 5500,
    "completion_tokens": 200,
    "prompt_tokens_details": {
      "cached_tokens": 5000,            // Tokens from cache (90% discount)
      "cache_creation_input_tokens": 0  // Tokens written to cache (may have premium)
    }
  }
}
```
Organized documentation extracted from official Venice AI sources:
This skill includes 10+ production-ready code examples:
Track these KPIs to measure caching effectiveness:
cost_per_1k_tokens = total_cost / (total_tokens / 1000)
savings_percentage = (uncached_cost - cached_cost) / uncached_cost * 100
roi_requests = write_premium_cost / read_savings_per_request # Break-even point
cache_hit_rate = cached_tokens / total_prompt_tokens * 100
avg_latency_improvement = (uncached_latency - cached_latency) / uncached_latency * 100
tokens_saved = cached_tokens * discount_rate  # full-price token equivalents avoided; discount_rate = 0.90 for a 90% discount
consistency_score = requests_with_cache_hits / total_requests * 100
cache_uptime = requests_within_ttl / total_requests * 100
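The break-even formula can be worked through with the example Claude prices used in this skill's cost comparison ($6.00 base, $7.50 cache write, $0.60 cache read per 1M tokens). The sketch below is illustrative only; `break_even_reads` is a hypothetical helper, and real prices depend on the model.

```python
import math


def break_even_reads(
    prefix_tokens: int, base_price: float, write_price: float, read_price: float
) -> int:
    """Number of cache reads needed to recover the one-time write premium.

    Prices are in USD per 1M input tokens.
    """
    write_premium_cost = prefix_tokens * (write_price - base_price) / 1_000_000
    read_savings_per_request = prefix_tokens * (base_price - read_price) / 1_000_000
    if read_savings_per_request <= 0:
        raise ValueError("cache reads must be cheaper than uncached input")
    return math.ceil(write_premium_cost / read_savings_per_request)


reads_needed = break_even_reads(
    prefix_tokens=2000, base_price=6.00, write_price=7.50, read_price=0.60
)
print(reads_needed)  # 1: the first cache read already covers the write premium
```

With these prices the write premium is $0.0030 and each read saves $0.0108, so caching pays for itself on the second request.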
Caching may not be cost-effective when:
To refresh this skill with updated Venice AI documentation: