Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

bagelhole/rag-infrastructure

Name: rag-infrastructure
Author: bagelhole

infrastructure/local-ai/rag-infrastructure/SKILL.md

npx skillsauth add bagelhole/devops-security-agent-skills rag-infrastructure

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

RAG Infrastructure

Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.

When to Use This Skill

Use this skill when:

Building a knowledge base Q&A system over internal documents
Implementing semantic search over large document collections
Reducing LLM hallucinations with retrieved context
Setting up embedding pipelines and vector store infrastructure
Deploying hybrid search (dense + sparse/BM25)

Prerequisites

Python 3.10+ with pip
A vector database (Qdrant, Weaviate, Pinecone, or pgvector)
An embedding model (OpenAI, Cohere, or local via sentence-transformers)
An LLM endpoint (OpenAI API or self-hosted vLLM)
Docker for local vector DB deployment

Architecture Overview

Documents → Chunker → Embedder → Vector Store
                                      ↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer

Embedding Pipeline

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Local embedding model (no API cost)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Connect to Qdrant
client = QdrantClient("http://localhost:6333")

# Create collection
client.create_collection(
    collection_name="knowledge-base",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

def ingest_documents(docs: list[dict]):
    """Chunk, embed, and upsert documents."""
    points = []
    for doc in docs:
        chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
        embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding.tolist(),
                payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
            ))
    client.upsert(collection_name="knowledge-base", points=points)
    print(f"Ingested {len(points)} chunks")

Chunking Strategies

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Recursive character splitter — best general-purpose strategy."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# For code/markdown — use language-aware splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)

Hybrid Search (Dense + Sparse)

from qdrant_client.models import SparseVector, SparseVectorParams, NamedSparseVector
from fastembed import SparseTextEmbedding

# Qdrant hybrid collection (dense + BM25 sparse)
client.create_collection(
    collection_name="hybrid-kb",
    vectors_config={"dense": VectorParams(size=1024, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams()},
)

sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    dense_vec = model.encode(query).tolist()
    sparse_vec = list(sparse_model.embed(query))[0]

    results = client.query_points(
        collection_name="hybrid-kb",
        prefetch=[
            {"query": dense_vec, "using": "dense", "limit": 20},
            {"query": SparseVector(indices=sparse_vec.indices.tolist(),
                                   values=sparse_vec.values.tolist()),
             "using": "sparse", "limit": 20},
        ],
        query={"fusion": "rrf"},   # Reciprocal Rank Fusion
        limit=top_k,
    )
    return [{"text": p.payload["text"], "score": p.score} for p in results.points]

Reranking

import cohere

co = cohere.Client("your-api-key")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rerank retrieved chunks for relevance (improves RAG quality ~20-30%)."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]

# Alternative: local reranker (no API cost)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

RAG Query Pipeline

from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

def rag_query(user_question: str) -> str:
    # 1. Retrieve
    candidates = hybrid_search(user_question, top_k=20)
    texts = [c["text"] for c in candidates]

    # 2. Rerank
    top_chunks = local_rerank(user_question, texts, top_n=5)

    # 3. Generate
    context = "\n\n---\n\n".join(top_chunks)
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": (
                "Answer the question using only the provided context. "
                "If the answer isn't in the context, say so.\n\nContext:\n" + context
            )},
            {"role": "user", "content": user_question},
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return response.choices[0].message.content

Docker Compose: Full RAG Stack

services:
  qdrant:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant-data:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

  ingestion-worker:
    build: ./ingestion
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
    depends_on: [qdrant, redis]
    restart: unless-stopped

  rag-api:
    build: ./api
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      - LLM_BASE_URL=http://vllm:8000/v1
    depends_on: [qdrant]
    restart: unless-stopped

volumes:
  qdrant-data:
  redis-data:

Common Issues

| Issue | Cause | Fix | |-------|-------|-----| | Poor retrieval quality | Chunk size too large | Try 256–512 tokens; overlap 10–15% | | LLM ignores retrieved context | Context too long | Rerank and keep top 3–5 chunks | | Slow ingestion | Sequential embedding | Use batch_size=64 and async upserts | | Stale documents | No re-ingestion pipeline | Track doc_hash; re-embed on change | | High embedding costs | All chunks re-embedded | Cache embeddings with hash-based dedup |

Best Practices

Use BAAI/bge-large-en-v1.5 or nomic-embed-text for strong free embeddings.
Always rerank before passing to LLM — 5 precise chunks beat 20 noisy ones.
Store source metadata (URL, page, section) in vector payloads for citations.
Use namespace/tenant isolation in the vector store for multi-tenant RAG.
Evaluate with RAGAS metrics: faithfulness, answer relevancy, context precision.

Related Skills

vector-database-ops - Qdrant/Weaviate management
vllm-server - Self-hosted LLM endpoint
ollama-stack - Local LLM for development
ai-pipeline-orchestration - Ingestion pipelines

bagelhole/rag-infrastructure

infrastructure/local-ai/rag-infrastructure/SKILL.md

Build and operate Retrieval-Augmented Generation (RAG) infrastructure with vector stores, embedding pipelines, and hybrid search. Covers ingestion, chunking strategies, reranking, and production deployment patterns.

18 stars

development

Updated Apr 3, 2026

$ install --global

skillsauth

npx skillsauth add bagelhole/devops-security-agent-skills rag-infrastructure

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 3, 2026, 5:06 PM8.6s1 file scanned

SKILL.md

name:: rag-infrastructure
description:: Build and operate Retrieval-Augmented Generation (RAG) infrastructure with vector stores, embedding pipelines, and hybrid search. Covers ingestion, chunking strategies, reranking, and production deployment patterns.
license:: MIT
author:: devops-skills
version:: 1.0

RAG Infrastructure

Production infrastructure for Retrieval-Augmented Generation: ingest documents, generate embeddings, store in vector databases, and serve grounded LLM responses.

When to Use This Skill

Use this skill when:

Building a knowledge base Q&A system over internal documents
Implementing semantic search over large document collections
Reducing LLM hallucinations with retrieved context
Setting up embedding pipelines and vector store infrastructure
Deploying hybrid search (dense + sparse/BM25)

Prerequisites

Python 3.10+ with pip
A vector database (Qdrant, Weaviate, Pinecone, or pgvector)
An embedding model (OpenAI, Cohere, or local via sentence-transformers)
An LLM endpoint (OpenAI API or self-hosted vLLM)
Docker for local vector DB deployment

Architecture Overview

Documents → Chunker → Embedder → Vector Store
                                      ↓
User Query → Embedder → Vector Store (search) → Reranker → LLM → Answer

Embedding Pipeline

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Local embedding model (no API cost)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Connect to Qdrant
client = QdrantClient("http://localhost:6333")

# Create collection
client.create_collection(
    collection_name="knowledge-base",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

def ingest_documents(docs: list[dict]):
    """Chunk, embed, and upsert documents."""
    points = []
    for doc in docs:
        chunks = chunk_text(doc["text"], chunk_size=512, overlap=50)
        embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding.tolist(),
                payload={"text": chunk, "source": doc["source"], "title": doc["title"]},
            ))
    client.upsert(collection_name="knowledge-base", points=points)
    print(f"Ingested {len(points)} chunks")

Chunking Strategies

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Recursive character splitter — best general-purpose strategy."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

# For code/markdown — use language-aware splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [("#", "H1"), ("##", "H2"), ("###", "H3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)

Hybrid Search (Dense + Sparse)

from qdrant_client.models import SparseVector, SparseVectorParams, NamedSparseVector
from fastembed import SparseTextEmbedding

# Qdrant hybrid collection (dense + BM25 sparse)
client.create_collection(
    collection_name="hybrid-kb",
    vectors_config={"dense": VectorParams(size=1024, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams()},
)

sparse_model = SparseTextEmbedding("prithivida/Splade_PP_en_v1")

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    dense_vec = model.encode(query).tolist()
    sparse_vec = list(sparse_model.embed(query))[0]

    results = client.query_points(
        collection_name="hybrid-kb",
        prefetch=[
            {"query": dense_vec, "using": "dense", "limit": 20},
            {"query": SparseVector(indices=sparse_vec.indices.tolist(),
                                   values=sparse_vec.values.tolist()),
             "using": "sparse", "limit": 20},
        ],
        query={"fusion": "rrf"},   # Reciprocal Rank Fusion
        limit=top_k,
    )
    return [{"text": p.payload["text"], "score": p.score} for p in results.points]

Reranking

import cohere

co = cohere.Client("your-api-key")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Rerank retrieved chunks for relevance (improves RAG quality ~20-30%)."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in response.results]

# Alternative: local reranker (no API cost)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def local_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

RAG Query Pipeline

from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

def rag_query(user_question: str) -> str:
    # 1. Retrieve
    candidates = hybrid_search(user_question, top_k=20)
    texts = [c["text"] for c in candidates]

    # 2. Rerank
    top_chunks = local_rerank(user_question, texts, top_n=5)

    # 3. Generate
    context = "\n\n---\n\n".join(top_chunks)
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": (
                "Answer the question using only the provided context. "
                "If the answer isn't in the context, say so.\n\nContext:\n" + context
            )},
            {"role": "user", "content": user_question},
        ],
        temperature=0.1,
        max_tokens=1024,
    )
    return response.choices[0].message.content

Docker Compose: Full RAG Stack

services:
  qdrant:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant-data:/qdrant/storage
    ports:
      - "6333:6333"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

  ingestion-worker:
    build: ./ingestion
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
    depends_on: [qdrant, redis]
    restart: unless-stopped

  rag-api:
    build: ./api
    ports:
      - "8080:8080"
    environment:
      - QDRANT_URL=http://qdrant:6333
      - LLM_BASE_URL=http://vllm:8000/v1
    depends_on: [qdrant]
    restart: unless-stopped

volumes:
  qdrant-data:
  redis-data:

Common Issues

Best Practices

Use BAAI/bge-large-en-v1.5 or nomic-embed-text for strong free embeddings.
Always rerank before passing to LLM — 5 precise chunks beat 20 noisy ones.
Store source metadata (URL, page, section) in vector payloads for citations.
Use namespace/tenant isolation in the vector store for multi-tenant RAG.
Evaluate with RAGAS metrics: faithfulness, answer relevancy, context precision.

Related Skills

vector-database-ops - Qdrant/Weaviate management
vllm-server - Self-hosted LLM endpoint
ollama-stack - Local LLM for development
ai-pipeline-orchestration - Ingestion pipelines

Related Skills

bagelhole/sre-dashboards

development

VerifiedTrustedCommunity

Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.

28SKILL.mdUpdated May 23, 2026

bagelhole/sre-dashboards

bagelhole/openclaw-security-hardening

testing

VerifiedTrustedCommunity

Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.

28SKILL.mdUpdated Apr 3, 2026

bagelhole/openclaw-security-hardening

bagelhole/vector-database-ops

devops

VerifiedTrustedCommunity

Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.

28SKILL.mdUpdated Apr 3, 2026

bagelhole/vector-database-ops

bagelhole/model-serving-kubernetes

testing

VerifiedTrustedCommunity

Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.

28SKILL.mdUpdated Apr 3, 2026

bagelhole/model-serving-kubernetes

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/bagelhole/devops-security-agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r devops-security-agent-skills/infrastructure/local-ai/rag-infrastructure ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

bagelhole/devops-security-agent-skills

18 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT