latestaiagents/production-rag-checklist

Name: production-rag-checklist
Author: latestaiagents

plugins/rag-architect/skills/production-rag-checklist/SKILL.md

npx skillsauth add latestaiagents/agent-skills production-rag-checklist

Production RAG Checklist

Everything you need to deploy RAG systems with confidence.

Pre-Production Checklist

Data Pipeline

[ ] Document ingestion automated
- Scheduled updates for dynamic sources
- Change detection for modified documents
- Deletion handling for removed documents
[ ] Chunking strategy validated
- Chunk sizes tested with retrieval quality
- Overlap tuned for context preservation
- Document-specific splitters for code/tables
[ ] Metadata enriched
- Source tracking (URL, file path, version)
- Timestamps (created, updated, indexed)
- Document type classification
- Access control tags (if needed)
[ ] Embedding pipeline robust
- Batch processing for efficiency
- Rate limiting for API-based embeddings
- Fallback for embedding failures
- Version tracking for re-embedding
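
The overlap item above can be tried out with a small splitter. This is a minimal sketch, assuming word-based chunks; the function name `chunk_text` and the default sizes are illustrative assumptions, not part of the skill, and production splitters are usually token-aware or document-specific.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Running the same evaluation set against a few `chunk_size` and `overlap` combinations is one way to validate the chunking strategy rather than picking values by feel.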

Vector Store

[ ] Index configured properly
- Appropriate index type (HNSW, IVF, etc.)
- Parameters tuned (ef_construction, m, nlist)
- Distance metric matches embedding model
[ ] Scaling planned
- Estimated vector count and growth rate
- Sharding strategy if needed
- Backup and recovery procedures
[ ] High availability
- Replicas configured
- Failover tested
- Connection pooling enabled
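
For the scaling items above, a back-of-the-envelope estimate is often enough to decide whether sharding is needed. The sketch below assumes float32 vectors and the commonly quoted HNSW rule of thumb of roughly `dim * 4 + m * 2 * 4` bytes per vector; real usage depends on the vector store, so treat the result as an order-of-magnitude figure. Both function names are hypothetical.

```python
def estimate_hnsw_memory_gb(num_vectors: int, dim: int, m: int = 16) -> float:
    """Rough in-memory size of an HNSW index holding float32 vectors, in GiB."""
    # dim * 4 bytes for the vector, plus about m * 2 neighbor links of 4 bytes each
    bytes_per_vector = dim * 4 + m * 2 * 4
    return num_vectors * bytes_per_vector / 1024**3


def project_vector_count(current: int, monthly_growth: float, months: int) -> int:
    """Project the vector count assuming compound monthly growth."""
    return round(current * (1 + monthly_growth) ** months)


if __name__ == "__main__":
    # Assumed example: 1M vectors today, 1536 dimensions, 10% growth per month
    in_a_year = project_vector_count(1_000_000, 0.10, 12)
    print(f"vectors in 12 months: {in_a_year:,}")
    print(f"estimated memory: {estimate_hnsw_memory_gb(in_a_year, 1536):.1f} GiB")
```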

Retrieval Quality

[ ] Evaluation dataset created
- Minimum 100 query-answer pairs
- Edge cases covered
- Regular updates with new patterns
[ ] Baseline metrics established
- Recall@5 > 0.8
- MRR > 0.7
- Latency p99 < 500ms
[ ] Hybrid search configured (if applicable)
- BM25/keyword weight tuned
- Reranker added and tested
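
The baseline metrics above can be computed from an evaluation set with a few lines of code. This is a minimal sketch, assuming each evaluation item pairs a ranked list of retrieved document IDs with the set of IDs judged relevant; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average Recall@k and MRR over (retrieved, relevant) pairs."""
    if not runs:
        raise ValueError("evaluation set is empty")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"z"})]
    print(evaluate(runs))
```

Comparing these averages against the thresholds in the checklist after every chunking, embedding, or reranker change helps catch regressions early.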

Generation Quality

[ ] Prompt engineering complete
- System prompt tested across scenarios
- Few-shot examples if needed
- Output format specified
[ ] Guardrails in place
- Hallucination detection
- Toxicity filtering
- PII redaction (if needed)
[ ] Fallback responses defined
- "I don't know" for low confidence
- Error messages user-friendly
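
One way to wire in the low-confidence fallback is a small post-processing step on the pipeline result. The sketch below assumes the result is a dict with `answer`, `documents`, and `confidence` keys, as in the API layer example; the 0.5 threshold, the fallback wording, and the `fallback` flag are assumptions to tune for your system.

```python
FALLBACK_ANSWER = "I don't know based on the available documents."


def apply_fallback(result: dict, min_confidence: float = 0.5) -> dict:
    """Replace the answer with a fallback when confidence is low or nothing was retrieved."""
    confidence = result.get("confidence", 0.0)
    has_documents = bool(result.get("documents"))
    if confidence < min_confidence or not has_documents:
        # Keep the original fields for logging, but never show a low-confidence answer
        return {**result, "answer": FALLBACK_ANSWER, "fallback": True}
    return {**result, "fallback": False}


if __name__ == "__main__":
    weak = {"answer": "Maybe 42?", "documents": ["doc-1"], "confidence": 0.2}
    print(apply_fallback(weak)["answer"])
```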

Infrastructure

API Layer

import asyncio
import logging
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

logger = logging.getLogger(__name__)
app = FastAPI()

# `rag_pipeline` is assumed to be created elsewhere in your application and to
# expose an async `ainvoke(query, top_k, filters)` method.

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5
    filters: dict | None = None

class QueryResponse(BaseModel):
    answer: str
    sources: list
    confidence: float
    latency_ms: float

@app.post("/query", response_model=QueryResponse)
async def query_rag(request: QueryRequest):
    start = time.perf_counter()

    try:
        # Abort queries that run longer than 30 seconds
        result = await asyncio.wait_for(
            rag_pipeline.ainvoke(request.query, request.top_k, request.filters),
            timeout=30.0
        )

        return QueryResponse(
            answer=result["answer"],
            sources=[doc.metadata["source"] for doc in result["documents"]],
            confidence=result.get("confidence", 0.0),
            latency_ms=(time.perf_counter() - start) * 1000
        )

    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Query timeout")
    except Exception:
        # Log the full traceback, but return a generic message to the client
        logger.exception("RAG error")
        raise HTTPException(status_code=500, detail="Internal error")

Caching Layer

import hashlib
import json
from redis import Redis

class RAGCache:
    def __init__(self, redis_url: str, ttl_seconds: int = 3600):
        self.redis = Redis.from_url(redis_url)
        self.ttl = ttl_seconds

    def _hash_query(self, query: str, filters: dict | None) -> str:
        # sort_keys makes the key independent of filter ordering
        key = f"{query}:{json.dumps(filters, sort_keys=True)}"
        return hashlib.sha256(key.encode()).hexdigest()

    def get(self, query: str, filters: dict | None = None) -> dict | None:
        key = self._hash_query(query, filters)
        cached = self.redis.get(key)
        return json.loads(cached) if cached else None

    def set(self, query: str, filters: dict | None, result: dict):
        key = self._hash_query(query, filters)
        self.redis.setex(key, self.ttl, json.dumps(result))
        # Record which cached answers depend on each source for targeted invalidation
        for source in result.get("sources", []):
            index_key = f"source:{source}"
            self.redis.sadd(index_key, key)
            self.redis.expire(index_key, self.ttl)

    def invalidate_by_source(self, source: str):
        """Invalidate cached answers when a source document changes."""
        index_key = f"source:{source}"
        keys = self.redis.smembers(index_key)
        if keys:
            self.redis.delete(*keys)
        self.redis.delete(index_key)

Rate Limiting

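from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Add the limiter to the /query handler from the API layer.
# slowapi needs a parameter named `request` of type Request, so the
# Pydantic body is passed under a different name.
@app.post("/query")
@limiter.limit("100/minute")  # Per client IP
async def query_rag(request: Request, body: QueryRequest):
    ...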

Monitoring

Metrics to Track

import functools

from prometheus_client import Counter, Histogram

# Request metrics
rag_requests = Counter('rag_requests_total', 'Total RAG requests', ['status'])
rag_latency = Histogram('rag_latency_seconds', 'RAG latency', buckets=[0.1, 0.5, 1, 2, 5, 10])

# Quality metrics
retrieval_count = Histogram('rag_retrieval_count', 'Documents retrieved', buckets=[0, 1, 3, 5, 10])
confidence_score = Histogram('rag_confidence', 'Answer confidence', buckets=[0.1, 0.3, 0.5, 0.7, 0.9])

# System metrics
vector_store_latency = Histogram('vectorstore_latency_seconds', 'Vector store query time')
llm_latency = Histogram('llm_latency_seconds', 'LLM generation time')
cache_hits = Counter('rag_cache_hits_total', 'Cache hit count')

def track_request(func):
    """Record latency, outcome, and confidence for an async function that returns a dict."""
    @functools.wraps(func)  # Preserve the wrapped function's name and signature
    async def wrapper(*args, **kwargs):
        with rag_latency.time():
            try:
                result = await func(*args, **kwargs)
                rag_requests.labels(status='success').inc()
                confidence_score.observe(result.get('confidence', 0))
                return result
            except Exception:
                rag_requests.labels(status='error').inc()
                raise
    return wrapper

Alerting Rules

# Prometheus alerting rules
groups:
  - name: rag_alerts
    rules:
      - alert: RAGHighLatency
        # histogram_quantile needs per-bucket rates aggregated by the `le` label
        expr: histogram_quantile(0.99, sum by (le) (rate(rag_latency_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RAG p99 latency above 5s"

      - alert: RAGHighErrorRate
        # sum() drops the status label so the error series can be divided by the total
        expr: sum(rate(rag_requests_total{status="error"}[5m])) / sum(rate(rag_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "RAG error rate above 5%"

      - alert: RAGLowConfidence
        expr: histogram_quantile(0.5, sum by (le) (rate(rag_confidence_bucket[15m]))) < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "RAG median confidence below 0.5"

Logging

import hashlib

import structlog

logger = structlog.get_logger()

async def log_rag_request(query: str, result: dict, latency_ms: float):
    logger.info(
        "rag_request",
        query=query[:100],  # Truncated to limit how much user text is logged
        query_hash=hashlib.sha256(query.encode()).hexdigest()[:8],  # Short ID for correlating repeats
        num_sources=len(result.get("sources", [])),
        confidence=result.get("confidence"),
        latency_ms=latency_ms,
        cache_hit=result.get("cache_hit", False),
        model=result.get("model_used")
    )

Security

[ ] Input validation
- Query length limits
- Injection prevention
- Rate limiting per user
[ ] Access control
- Document-level permissions
- User authentication
- API key management
[ ] Data privacy
- PII handling defined
- Data retention policy
- Audit logging enabled
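
The query length limit above can be enforced before the query reaches retrieval or the LLM. This is a minimal sketch: the 2000-character limit, the `validate_query` name, and the choice to strip control characters are assumptions. It covers basic hygiene only and does not by itself prevent prompt injection.

```python
import re

MAX_QUERY_CHARS = 2000  # Assumed limit; tune to your context window and cost budget

# ASCII control characters other than tab, newline, and carriage return
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def validate_query(raw: str, max_chars: int = MAX_QUERY_CHARS) -> str:
    """Return a cleaned query, or raise ValueError if it is empty or too long."""
    cleaned = _CONTROL_CHARS.sub("", raw).strip()
    if not cleaned:
        raise ValueError("query is empty")
    if len(cleaned) > max_chars:
        raise ValueError(f"query exceeds {max_chars} characters")
    return cleaned


if __name__ == "__main__":
    print(validate_query("  What is our refund policy?\x00 "))
```

In a FastAPI handler, a `ValueError` raised here would typically be translated into a 4xx response rather than a 500.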

Cost Management

def estimate_monthly_cost(
    queries_per_day: int,
    avg_tokens_per_query: int = 2000,
    embedding_calls_per_day: int = 1000
) -> dict:
    """Estimate monthly RAG costs in USD."""

    # LLM costs (GPT-4 list prices; confirm against current pricing)
    llm_input_cost = 0.03 / 1000  # per token
    llm_output_cost = 0.06 / 1000  # per token

    # Embedding costs (text-embedding-3-small)
    embedding_cost = 0.00002 / 1000  # per token

    # Vector DB flat monthly fee (placeholder; confirm against your provider's plan)
    vector_db_monthly = 70  # USD

    monthly_queries = queries_per_day * 30
    monthly_embeddings = embedding_calls_per_day * 30

    # Output is assumed to be about 30% of the input token count
    llm_cost = monthly_queries * avg_tokens_per_query * (llm_input_cost + llm_output_cost * 0.3)
    # Each embedding call is assumed to cover about 500 tokens
    embedding_total = monthly_embeddings * 500 * embedding_cost

    return {
        "llm_cost": llm_cost,
        "embedding_cost": embedding_total,
        "vector_db_cost": vector_db_monthly,
        "estimated_total": llm_cost + embedding_total + vector_db_monthly
    }
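
As a worked example of the arithmetic, the sketch below plugs in an assumed load of 1,000 queries per day and reuses the rates from the function above. The traffic figure is illustrative, and the prices should be checked against current provider pricing before relying on the result.

```python
# Assumed load: 1,000 queries/day, 2,000 input tokens per query,
# 1,000 embedding calls/day at about 500 tokens each
monthly_queries = 1000 * 30
monthly_embeddings = 1000 * 30

llm_rate = 0.03 / 1000 + (0.06 / 1000) * 0.3  # input price plus 30% output, per token
llm_cost = monthly_queries * 2000 * llm_rate
embedding_cost = monthly_embeddings * 500 * (0.00002 / 1000)
vector_db_cost = 70

total = llm_cost + embedding_cost + vector_db_cost
print(f"LLM: ${llm_cost:,.2f}")              # LLM: $2,880.00
print(f"Embeddings: ${embedding_cost:,.2f}")  # Embeddings: $0.30
print(f"Total: ${total:,.2f}")               # Total: $2,950.30
```

With these assumptions the LLM dominates the bill, so caching and shorter prompts have far more effect on cost than the embedding or vector store choices.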

Go-Live Checklist

[ ] Load testing passed (target QPS achieved)
[ ] Failover tested
[ ] Rollback procedure documented
[ ] On-call rotation set up
[ ] Runbook created
[ ] User documentation ready
[ ] Feedback collection mechanism in place

Best Practices

Start with caching - reduces cost and latency significantly
Monitor quality metrics - not just uptime
Version everything - embeddings, prompts, indexes
Plan for reindexing - you will need to rebuild
Set cost alerts - LLM costs can spike unexpectedly
Collect user feedback - thumbs up/down on answers
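
For "Version everything", one lightweight approach is to derive a fingerprint from the settings that affect the index and store it with every vector. This is a sketch under assumed config keys (`embedding_model`, `chunk_size`, and so on are hypothetical); a mismatch between the stored and current fingerprint signals that reindexing is due.

```python
import hashlib
import json


def pipeline_version(config: dict) -> str:
    """Return a short, stable fingerprint for an indexing configuration."""
    # Canonical JSON so that key order does not change the fingerprint
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    config = {
        "embedding_model": "text-embedding-3-small",
        "chunk_size": 200,
        "overlap": 40,
        "prompt_version": "v3",
    }
    print(pipeline_version(config))
```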

Comprehensive checklist for deploying RAG systems to production with reliability and scale. Use this skill when preparing RAG for production deployment. Activate when: production RAG, RAG deployment, RAG checklist, RAG scaling, RAG monitoring, production-ready RAG.

2 stars

testing

Updated Apr 23, 2026

npx skillsauth add latestaiagents/agent-skills production-rag-checklist

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 24, 2026, 2:55 AM (9.8s, 1 file scanned)

Related Skills

latestaiagents/skill-testing
development · Verified, Trusted, Community
Test skills for correct activation, content quality, and regression, using both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.
2 stars · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-frontmatter
documentation · Verified, Trusted, Community
Write the YAML frontmatter for a SKILL.md file so it activates reliably: name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.
2 stars · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-activation-patterns
development · Verified, Trusted, Community
Design skills that fire at the right moment, neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.
2 stars · SKILL.md · Updated Apr 23, 2026

latestaiagents/progressive-disclosure
development · Verified, Trusted, Community
Structure SKILL.md content so the model reads just enough: a concise summary up front, progressively deeper detail, and examples on demand. Covers section ordering, length budgets, and when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.
2 stars · SKILL.md · Updated Apr 23, 2026

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/plugins/rag-architect/skills/production-rag-checklist ~/.claude/skills/

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT