infrastructure/local-ai/rag-infrastructure/SKILL.md
Build and operate Retrieval-Augmented Generation (RAG) infrastructure with vector stores, embedding pipelines, and hybrid search. Covers ingestion, chunking strategies, reranking, and production deployment patterns.
npx skillsauth add bagelhole/devops-security-agent-skills rag-infrastructureInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.
Use this skill when:
pipsentence-transformers)Documents → Chunker → Embedder → Vector Store
↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
# Local embedding model (no API cost)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Connect to Qdrant
client = QdrantClient("http://localhost:6333")
# Create collection
client.create_collection(
collection_name="knowledge-base",
vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
def ingest_documents(docs: list[dict]):
"""Chunk, embed, and upsert documents."""
points = []
for doc in docs:
chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
for chunk, embedding in zip(chunks, embeddings):
points.append(PointStruct(
id=str(uuid.uuid4()),
vector=embedding.tolist(),
payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
))
client.upsert(collection_name="knowledge-base", points=points)
print(f"Ingested {len(points)} chunks")
from langchain.text_splitter import RecursiveCharacterTextSplitter
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
"""Recursive character splitter — best general-purpose strategy."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap,
separators=["\n\n", "\n", ". ", " ", ""],
)
return splitter.split_text(text)
# For code/markdown — use language-aware splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
from qdrant_client.models import SparseVector, SparseVectorParams, NamedSparseVector
from fastembed import SparseTextEmbedding
# Qdrant hybrid collection (dense + BM25 sparse)
client.create_collection(
collection_name="hybrid-kb",
vectors_config={"dense": VectorParams(size=1024, distance=Distance.COSINE)},
sparse_vectors_config={"sparse": SparseVectorParams()},
)
sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")
def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
dense_vec = model.encode(query).tolist()
sparse_vec = list(sparse_model.embed(query))[0]
results = client.query_points(
collection_name="hybrid-kb",
prefetch=[
{"query": dense_vec, "using": "dense", "limit": 20},
{"query": SparseVector(indices=sparse_vec.indices.tolist(),
values=sparse_vec.values.tolist()),
"using": "sparse", "limit": 20},
],
query={"fusion": "rrf"}, # Reciprocal Rank Fusion
limit=top_k,
)
return [{"text": p.payload["text"], "score": p.score} for p in results.points]
import cohere
co = cohere.Client("your-api-key")
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
"""Rerank retrieved chunks for relevance (improves RAG quality ~20-30%)."""
response = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=candidates,
top_n=top_n,
)
return [candidates[r.index] for r in response.results]
# Alternative: local reranker (no API cost)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
pairs = [[query, c] for c in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return [text for text, _ in ranked[:top_n]]
from openai import OpenAI
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")
def rag_query(user_question: str) -> str:
# 1. Retrieve
candidates = hybrid_search(user_question, top_k=20)
texts = [c["text"] for c in candidates]
# 2. Rerank
top_chunks = local_rerank(user_question, texts, top_n=5)
# 3. Generate
context = "\n\n---\n\n".join(top_chunks)
response = llm.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": (
"Answer the question using only the provided context. "
"If the answer isn't in the context, say so.\n\nContext:\n" + context
)},
{"role": "user", "content": user_question},
],
temperature=0.1,
max_tokens=1024,
)
return response.choices[0].message.content
services:
qdrant:
image: qdrant/qdrant:latest
volumes:
- qdrant-data:/qdrant/storage
ports:
- "6333:6333"
restart: unless-stopped
redis:
image: redis:7-alpine
volumes:
- redis-data:/data
restart: unless-stopped
ingestion-worker:
build: ./ingestion
environment:
- QDRANT_URL=http://qdrant:6333
- REDIS_URL=redis://redis:6379
depends_on: [qdrant, redis]
restart: unless-stopped
rag-api:
build: ./api
ports:
- "8080:8080"
environment:
- QDRANT_URL=http://qdrant:6333
- LLM_BASE_URL=http://vllm:8000/v1
depends_on: [qdrant]
restart: unless-stopped
volumes:
qdrant-data:
redis-data:
| Issue | Cause | Fix |
|-------|-------|-----|
| Poor retrieval quality | Chunk size too large | Try 256–512 tokens; overlap 10–15% |
| LLM ignores retrieved context | Context too long | Rerank and keep top 3–5 chunks |
| Slow ingestion | Sequential embedding | Use batch_size=64 and async upserts |
| Stale documents | No re-ingestion pipeline | Track doc_hash; re-embed on change |
| High embedding costs | All chunks re-embedded | Cache embeddings with hash-based dedup |
BAAI/bge-large-en-v1.5 or nomic-embed-text for strong free embeddings.development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.