skills/rag-document-ingestion-pipeline/SKILL.md
Build production document ingestion pipelines with chunking, embedding, and vector DB storage. Activate on: document ingestion, chunking strategy, embedding pipeline, vector DB ingestion, RAG indexing. NOT for: LLM prompt design (prompt-engineer), retrieval query logic (ai-engineer), or vector DB ops/migration (vector-database-migration-tool).
npx skillsauth add curiositech/windags-skills rag-document-ingestion-pipelineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Build production-grade document ingestion pipelines that chunk, embed, and store documents in vector databases for retrieval-augmented generation.
Activate on: "document ingestion", "chunking strategy", "embedding pipeline", "vector DB ingestion", "RAG indexing", "ingest PDFs", "build knowledge base", "semantic chunking", "recursive chunking"
NOT for: LLM prompt design or retrieval query tuning (prompt-engineer, ai-engineer), vector DB operational migration (vector-database-migration-tool), or fine-tuning data preparation (fine-tuning-dataset-curator)
unstructured or docling for parsing.text-embedding-3-large (OpenAI), embed-v4 (Cohere), or BAAI/bge-m3 (local). Match dimensionality to your vector DB plan.| Domain | Technologies | Notes | |--------|-------------|-------| | Document Parsing | unstructured, docling, PyMuPDF, markitdown | Handles PDF, DOCX, HTML, Markdown, images with OCR | | Chunking | LangChain splitters, semantic-chunkers, chonkie | Recursive, semantic, markdown-header, code-aware | | Embedding Models | OpenAI text-embedding-3, Cohere embed-v4, BGE-M3, Nomic Embed | Local or API; 256-3072 dimensions | | Vector Databases | Pinecone, Qdrant, Weaviate, pgvector, Milvus | Managed or self-hosted; HNSW or IVF indexing | | Orchestration | LangChain, LlamaIndex, Haystack, custom Python | Pipeline DAGs with retry and checkpointing |
Document Type?
├── Structured (Markdown, HTML, code)
│ └── Structure-aware chunking (headers, functions)
│ └── Preserve hierarchy as metadata
├── Semi-structured (PDF with tables)
│ └── docling/unstructured → table extraction + text chunking
│ └── Embed tables as markdown, text as paragraphs
└── Unstructured (plain text, transcripts)
└── Semantic chunking (embedding similarity breakpoints)
└── Fallback: recursive character split (512-1024 tokens, 10% overlap)
Sources ──→ [Parser] ──→ [Chunker] ──→ [Enricher] ──→ [Embedder] ──→ [Vector DB]
│ │ │ │ │ │
│ unstructured recursive/ add metadata: batch embed upsert with
│ docling semantic source, date, (batch=256) namespace
│ section, hash partitioning
│
└── Dedup by content hash before embedding (saves 30-50% cost)
# Production ingestion skeleton
from langchain_text_splitters import RecursiveCharacterTextSplitter
from hashlib import sha256
def ingest_documents(docs: list[str], collection: str):
splitter = RecursiveCharacterTextSplitter(
chunk_size=512, chunk_overlap=64,
separators=["\n\n", "\n", ". ", " "]
)
seen_hashes = set()
chunks = []
for doc in docs:
for chunk in splitter.split_text(doc):
h = sha256(chunk.encode()).hexdigest()[:16]
if h not in seen_hashes:
seen_hashes.add(h)
chunks.append({"text": chunk, "hash": h})
# Batch embed and upsert
embeddings = embed_batch([c["text"] for c in chunks], batch_size=256)
vector_db.upsert(collection, chunks, embeddings)
Always store metadata alongside vectors for filtered retrieval:
metadata = {
"source": "docs/api-reference.md",
"section": "Authentication",
"chunk_index": 3,
"total_chunks": 12,
"ingested_at": "2026-03-20T00:00:00Z",
"content_hash": "a1b2c3d4",
"token_count": 487,
}
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.