curiositech/rag-document-ingestion-pipeline

Name: rag-document-ingestion-pipeline
Author: curiositech

skills/rag-document-ingestion-pipeline/SKILL.md

npx skillsauth add curiositech/windags-skills rag-document-ingestion-pipeline

RAG Document Ingestion Pipeline

Build production-grade document ingestion pipelines that chunk, embed, and store documents in vector databases for retrieval-augmented generation.

Activation Triggers

Activate on: "document ingestion", "chunking strategy", "embedding pipeline", "vector DB ingestion", "RAG indexing", "ingest PDFs", "build knowledge base", "semantic chunking", "recursive chunking"

NOT for: LLM prompt design or retrieval query tuning (prompt-engineer, ai-engineer), vector DB operational migration (vector-database-migration-tool), or fine-tuning data preparation (fine-tuning-dataset-curator)

Quick Start

1. Identify sources: PDFs, HTML, Markdown, databases, APIs. Use unstructured or docling for parsing.
2. Choose a chunking strategy: recursive character splitting for general text, semantic chunking for domain-specific content, or document-structure-aware chunking for technical docs.
3. Select an embedding model: text-embedding-3-large (OpenAI), embed-v4 (Cohere), or BAAI/bge-m3 (local). Match the model's output dimensionality to your vector DB index configuration.
4. Configure the vector DB: Pinecone (managed), Qdrant (self-hosted or cloud), Weaviate (multi-tenant), or pgvector (Postgres-native).
5. Run ingestion with observability: batch embed, upsert with metadata, and validate retrieval quality on a test set.
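
The final step asks for retrieval quality to be validated on a test set. One simple way to do that, sketched here under assumptions, is a hit rate at k: the share of test queries whose expected source shows up in the top k results. The function name `hit_rate_at_k`, the `retrieve(query, k)` callable, and the `(query, expected_source)` test-set shape are all illustrative and not part of the skill.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected source appears in the top-k results.

    `retrieve` is a hypothetical callable wrapping your vector DB query; it
    should return the source identifiers of the top-k chunks for a query.
    """
    if not test_set:
        return 0.0
    hits = sum(1 for query, expected in test_set if expected in retrieve(query, k))
    return hits / len(test_set)
```

Tracking this number before and after a change to chunk size, overlap, or embedding model gives a rough signal of whether the change helped; it does not replace a fuller evaluation.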

Core Capabilities

| Domain | Technologies | Notes |
|--------|-------------|-------|
| Document Parsing | unstructured, docling, PyMuPDF, markitdown | Handles PDF, DOCX, HTML, Markdown, images with OCR |
| Chunking | LangChain splitters, semantic-chunkers, chonkie | Recursive, semantic, markdown-header, code-aware |
| Embedding Models | OpenAI text-embedding-3, Cohere embed-v4, BGE-M3, Nomic Embed | Local or API; 256-3072 dimensions |
| Vector Databases | Pinecone, Qdrant, Weaviate, pgvector, Milvus | Managed or self-hosted; HNSW or IVF indexing |
| Orchestration | LangChain, LlamaIndex, Haystack, custom Python | Pipeline DAGs with retry and checkpointing |

Architecture Patterns

Pattern 1: Chunking Decision Tree

Document Type?
├── Structured (Markdown, HTML, code)
│   └── Structure-aware chunking (headers, functions)
│       └── Preserve hierarchy as metadata
├── Semi-structured (PDF with tables)
│   └── docling/unstructured → table extraction + text chunking
│       └── Embed tables as markdown, text as paragraphs
└── Unstructured (plain text, transcripts)
    └── Semantic chunking (embedding similarity breakpoints)
        └── Fallback: recursive character split (512-1024 tokens, 10% overlap)
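
As an illustration of the "Structured" branch, the sketch below splits Markdown at ATX headers and keeps the header path as metadata on each chunk. It uses only the standard library and is deliberately naive: the name `chunk_markdown_by_headers` and the `hierarchy` key are assumptions, and lines that look like headers inside fenced code blocks are not handled.

```python
import re

# Matches ATX headers such as "## Authentication"
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown_by_headers(text: str) -> list[dict]:
    """Split Markdown at headers; attach the header path to each chunk."""
    chunks: list[dict] = []
    path: list[str] = []    # current header hierarchy, e.g. ["API", "Auth"]
    buffer: list[str] = []  # body lines collected under the current header

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"text": body, "hierarchy": list(path)})
        buffer.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level, then push this one
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

Sections that come out larger than the target chunk size could then be passed through the recursive splitter, with the `hierarchy` list copied onto each resulting piece.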

Pattern 2: Production Ingestion Pipeline

Sources ──→ [Parser] ──→ [Chunker] ──→ [Enricher] ──→ [Embedder] ──→ [Vector DB]
  │            │            │              │              │              │
  │         unstructured  recursive/    add metadata:   batch embed   upsert with
  │         docling       semantic      source, date,   (batch=256)   namespace
  │                                     section, hash                 partitioning
  │
  └── Dedup by content hash before embedding (saves 30-50% cost)

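# Production ingestion skeleton
# embed_batch and vector_db are placeholders for your embedding client and
# vector store wrapper; they are passed in so the skeleton has no undefined names.
from hashlib import sha256

from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_documents(docs: list[str], collection: str, embed_batch, vector_db):
    splitter = RecursiveCharacterTextSplitter(
        # Sizes are measured in characters by default, not tokens
        chunk_size=512, chunk_overlap=64,
        separators=["\n\n", "\n", ". ", " "]
    )
    seen_hashes = set()
    chunks = []
    for doc in docs:
        for chunk in splitter.split_text(doc):
            # Truncated SHA-256 of the chunk text serves as the dedup key
            h = sha256(chunk.encode()).hexdigest()[:16]
            if h not in seen_hashes:
                seen_hashes.add(h)
                chunks.append({"text": chunk, "hash": h})
    if not chunks:
        return  # nothing to embed
    # Batch embed and upsert
    embeddings = embed_batch([c["text"] for c in chunks], batch_size=256)
    vector_db.upsert(collection, chunks, embeddings)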

Pattern 3: Metadata-Enriched Chunks

Always store metadata alongside vectors for filtered retrieval:

metadata = {
    "source": "docs/api-reference.md",
    "section": "Authentication",
    "chunk_index": 3,
    "total_chunks": 12,
    "ingested_at": "2026-03-20T00:00:00Z",
    "content_hash": "a1b2c3d4",
    "token_count": 487,
}
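
A minimal sketch of how records shaped like the one above might be produced for every chunk of a document. The helper name `build_chunk_metadata` is hypothetical, the hash is truncated to 16 hex characters to match the ingestion skeleton, and `token_count` is approximated by a whitespace word count because no tokenizer is specified in the skill.

```python
from datetime import datetime, timezone
from hashlib import sha256


def build_chunk_metadata(source: str, section: str, chunks: list[str]) -> list[dict]:
    """Return one metadata dict per chunk, in chunk order."""
    # One timestamp per ingestion run, formatted as UTC ISO 8601
    ingested_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    total = len(chunks)
    records = []
    for index, text in enumerate(chunks):
        records.append({
            "source": source,
            "section": section,
            "chunk_index": index,
            "total_chunks": total,
            "ingested_at": ingested_at,
            "content_hash": sha256(text.encode("utf-8")).hexdigest()[:16],
            # Assumption: whitespace word count as a rough stand-in for tokens
            "token_count": len(text.split()),
        })
    return records
```

Swapping the word count for the tokenizer that matches the chosen embedding model would make `token_count` meaningful for enforcing model input limits.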

Anti-Patterns

Fixed-size chunking without overlap: Splits mid-sentence and destroys context. Always use overlap (10-15%) or semantic boundaries.
Embedding everything without deduplication: Duplicate and near-duplicate chunks waste storage and degrade retrieval. Use exact content hashes to drop identical chunks, and MinHash for fuzzy (near-duplicate) dedup.
Ignoring chunk metadata: Without source, section, and date metadata, you cannot filter, cite, or refresh stale content.
One embedding model for all domains: Legal text and code have different semantic spaces. Benchmark retrieval quality per domain before committing.
No ingestion idempotency: Re-running ingestion creates duplicates. Use content hashes as vector IDs for upsert-based idempotency.
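
The idempotency point can be illustrated with a small sketch. `InMemoryVectorStore` is a hypothetical stand-in for a vector DB with upsert-by-ID semantics, and `content_id` and `ingest` are illustrative names; a real store may constrain the ID format, in which case the hash would need converting.

```python
from hashlib import sha256
from typing import Callable


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB that upserts by record ID."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def upsert(self, record_id: str, vector: list[float], metadata: dict) -> None:
        # Writing to an existing ID overwrites it instead of adding a new row
        self.records[record_id] = {"vector": vector, "metadata": metadata}


def content_id(text: str) -> str:
    """Derive a deterministic ID from the chunk text."""
    return sha256(text.encode("utf-8")).hexdigest()


def ingest(
    store: InMemoryVectorStore,
    chunks: list[str],
    embed: Callable[[str], list[float]],
) -> None:
    """Upsert each chunk under its content hash, so re-runs are no-ops."""
    for text in chunks:
        store.upsert(content_id(text), embed(text), {"text": text})
```

Because the ID depends only on the chunk text, running `ingest` twice over the same input leaves the store with one record per distinct chunk.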

Quality Checklist

[ ] Chunking strategy matches document structure (not just fixed-size)
[ ] Chunk sizes benchmarked for retrieval quality (typically 256-1024 tokens)
[ ] Overlap configured to preserve cross-boundary context
[ ] Deduplication prevents redundant embeddings
[ ] Metadata attached to every chunk (source, section, date, hash)
[ ] Embedding model dimensionality matches vector DB index config
[ ] Batch embedding with retry logic for API failures
[ ] Ingestion is idempotent (re-run safe via content-hash IDs)
[ ] Retrieval quality validated on test queries before production
[ ] Monitoring: ingestion throughput, embedding latency, DB storage growth
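
For the "batch embedding with retry logic" item, a minimal sketch with exponential backoff follows. `embed_in_batches` and its parameters are assumptions for illustration, and `embed_fn` stands in for whatever embedding client call is in use. Catching the bare `Exception` keeps the sketch short; production code would more likely catch only the client's transient error types.

```python
import time
from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 256,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Embed texts batch by batch, retrying a failed batch with backoff."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the final retry
                # Wait 1x, 2x, 4x, ... base_delay between attempts
                sleep(base_delay * 2 ** attempt)
    return vectors
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without real waiting.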

Updated Apr 4, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add curiositech/windags-skills rag-document-ingestion-pipeline

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | % |
|--------|---------|-------------|---|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 4, 2026, 2:28 PM. Duration: 284.7s. 1 file scanned.

SKILL.md

license: Apache-2.0
name: rag-document-ingestion-pipeline
description: Build production document ingestion pipelines with chunking, embedding, and vector DB storage. Activate on: document ingestion, chunking strategy, embedding pipeline, vector DB ingestion, RAG indexing. NOT for: LLM prompt design (prompt-engineer), retrieval query logic (ai-engineer), or vector DB ops/migration (vector-database-migration-tool).
allowed-tools: Read,Write,Edit,Bash(python:*,pip:*,npm:*,npx:*)
category: AI & Machine Learning
- skill: vector-database-migration-tool
  reason: Schema and index design for the target vector store

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/rag-document-ingestion-pipeline ~/.claude/skills/

Repository: curiositech/windags-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT