.ai-rulez/skills/chunking-embeddings/SKILL.md
chunking embeddings
npx skillsauth add kreuzberg-dev/kreuzberg chunking-embeddings
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration
Location: crates/kreuzberg/src/chunking/, crates/kreuzberg/src/embeddings.rs
Extracted Text
|
[1. Normalization] -> Clean whitespace, remove control chars
|
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
|
[3. Overlap Management] -> Control context window overlap
|
[4. Optional Embedding] -> Generate vectors with FastEmbed
|
Output: Vec<Chunk> with text, vectors, metadata
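
As a rough illustration of step 1 in the diagram above, the sketch below collapses whitespace runs and drops control characters. The function name `normalize` is hypothetical and this is not Kreuzberg's actual implementation; a real normalizer would likely preserve paragraph breaks for the structure-aware strategies.

```rust
/// Hypothetical normalization step (not Kreuzberg's code): drops control
/// characters, collapses whitespace runs to one space, trims both ends.
pub fn normalize(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut pending_space = false;
    for ch in text.chars() {
        if ch.is_whitespace() {
            // Remember the gap, but never emit a leading space.
            pending_space = !out.is_empty();
        } else if ch.is_control() {
            // Non-whitespace control characters (NUL, BEL, ...) are skipped.
            continue;
        } else {
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            out.push(ch);
        }
    }
    // A trailing gap is never flushed, so the result is right-trimmed too.
    out
}
```

Collapsing every newline is a simplification chosen to keep the example short.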
Location: crates/kreuzberg/src/chunking/mod.rs
| Strategy | Pattern | Best For |
|----------|---------|----------|
| Fixed-Size | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| Semantic | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| Syntax-Aware | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| Recursive (LangChain pattern) | Try separators in order: `"\n\n"`, `"\n"`, `" "`, `""` | Best general-purpose chunking; auto-finds optimal split points |
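
To make the Fixed-Size row concrete, here is a minimal sliding-window sketch. It counts characters rather than tokens, and the function name `fixed_size_chunks` is an assumption for illustration; only the parameter names `chunk_size` and `overlap` come from the config fields named in this document.

```rust
/// Hypothetical sliding-window chunker over characters (not tokens).
/// Each chunk starts `chunk_size - overlap` characters after the previous one.
pub fn fixed_size_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    // Work on chars so multi-byte UTF-8 sequences are never split.
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        let chunk: String = chars[start..end].iter().collect();
        chunks.push(chunk);
        if end == chars.len() {
            // The window reached the end of the text; stop here.
            break;
        }
        start += step;
    }
    chunks
}
```

With `chunk_size = 4` and `overlap = 1`, the input `"abcdefghij"` yields `abcd`, `defg`, `ghij`: each chunk repeats the last character of the previous one.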
Key config fields per strategy (see struct definitions in chunking/mod.rs):
- Fixed-Size: chunk_size, overlap, trim_whitespace
- Semantic: target_chunk_size, min/max_chunk_size, semantic_threshold, use_sentence_boundaries
- Syntax-Aware: chunk_by (Paragraph/Section/Heading/Sentence/CodeBlock), max_chunk_size, respect_code_blocks
- Recursive: separators[], chunk_size, overlap

Location: crates/kreuzberg/src/chunking/mod.rs
| Preset | Chunk Size | Overlap | Strategy | Use Case |
|--------|-----------|---------|----------|----------|
| Balanced | 512 tokens | 50 | Semantic | RAG sweet spot |
| Compact | 256 tokens | 32 | Fixed-Size | Dense vectors |
| Extended | 1024 tokens | 100 | Recursive | Full context |
| Minimal | 128 tokens | 16 | (default) | Lightweight embeddings |
Usage: set config.chunking.preset = Some("balanced") in ExtractionConfig.
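
The preset names can be read as a lookup from a string to a size/overlap pair. The sketch below restates the preset values listed above; `PresetValues` and `preset_values` are hypothetical names, and case-insensitive matching is an assumption rather than documented Kreuzberg behavior.

```rust
/// Hypothetical container for the two numeric columns of the preset table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PresetValues {
    pub chunk_size: usize, // in tokens, per the table
    pub overlap: usize,
}

/// Resolves a preset name to its values; unknown names return None.
/// Case-insensitive matching is an assumption made for this sketch.
pub fn preset_values(name: &str) -> Option<PresetValues> {
    let (chunk_size, overlap) = match name.to_ascii_lowercase().as_str() {
        "balanced" => (512, 50),
        "compact" => (256, 32),
        "extended" => (1024, 100),
        "minimal" => (128, 16),
        _ => return None,
    };
    Some(PresetValues { chunk_size, overlap })
}
```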
Location: crates/kreuzberg/src/embeddings.rs
| Model | Dimensions | Notes |
|-------|-----------|-------|
| BAAI/bge-small-en-v1.5 (default) | 384 | Fast, excellent for RAG |
| BAAI/bge-small-zh-v1.5 | 384 | Chinese optimized |
| BAAI/bge-base-en-v1.5 | 768 | Better quality, slower |
| jinaai/jina-embeddings-v2-base-en | 768 | Long context (up to 8192 tokens) |
| Custom(path) | varies | Custom ONNX model path |
TextEmbeddingManager provides singleton-cached models per config. Pattern:
- get_or_init_model() -- lazy-loads ONNX model (downloads if needed), caches in Arc<RwLock<HashMap>>
- embed_chunks() -- collects chunk texts, calls model.embed(texts, batch_size), zips results back to ChunkWithEmbedding

Default config: batch_size=256, device=CPU, parallel_requests=4.
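
The lazy-load-and-cache pattern described for get_or_init_model() can be sketched with the standard library alone. `Model` and `ModelCache` are hypothetical stand-ins, the cache is keyed by a plain model name instead of a full config, and no ONNX model is actually loaded; the real TextEmbeddingManager may differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for a loaded embedding model (hypothetical type).
pub struct Model {
    pub name: String,
}

/// Caches one shared model per name and creates it on first request.
#[derive(Default)]
pub struct ModelCache {
    models: RwLock<HashMap<String, Arc<Model>>>,
}

impl ModelCache {
    pub fn get_or_init_model(&self, name: &str) -> Arc<Model> {
        // Fast path: a shared read lock suffices when the model is cached.
        // The read guard is dropped at the end of this statement.
        let cached = self.models.read().unwrap().get(name).cloned();
        if let Some(model) = cached {
            return model;
        }
        // Slow path: take the write lock. `entry` re-checks the map, so a
        // model inserted by another thread in the meantime is reused.
        let mut models = self.models.write().unwrap();
        models
            .entry(name.to_string())
            .or_insert_with(|| Arc::new(Model { name: name.to_string() }))
            .clone()
    }
}
```

Two calls with the same name return clones of the same `Arc`, so the expensive load happens once per key.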
Embeddings require ONNX Runtime. Feature-gated via:
[features]
embeddings = ["dep:fastembed", "dep:ort"]
Install: brew install onnxruntime (macOS) / apt install libonnxruntime libonnxruntime-dev (Linux). Verify: echo $ORT_DYLIB_PATH.
The full extraction-to-RAG pipeline:
1. extract_file(path, config) -> ExtractionResult
2. result.content -> Vec<Chunk>
3. TextEmbeddingManager::embed_chunks() -> Vec<ChunkWithEmbedding>
4. RagDocument { file_path, metadata, chunks } ready for vector DB ingestion

See ChunkWithEmbedding struct in types.rs: contains text, embedding: Vec<f32>, dimensions, norm, metadata.
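
The sketch below shows one way the embedding output could be assembled and used, under stated assumptions. Only the field `embedding: Vec<f32>` is confirmed by the text above; the other field types are guesses, `metadata` is omitted, and `zip_embeddings` and `cosine_similarity` are hypothetical helpers, not Kreuzberg APIs.

```rust
/// Simplified stand-in for ChunkWithEmbedding (field types are assumptions).
#[derive(Debug, Clone)]
pub struct ChunkWithEmbedding {
    pub text: String,
    pub embedding: Vec<f32>,
    pub dimensions: usize,
    pub norm: f32,
}

impl ChunkWithEmbedding {
    pub fn new(text: String, embedding: Vec<f32>) -> Self {
        // L2 norm, stored once so similarity queries can reuse it.
        let norm = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
        Self { text, dimensions: embedding.len(), norm, embedding }
    }
}

/// Pairs each chunk text with its vector, one vector per chunk.
pub fn zip_embeddings(texts: Vec<String>, vectors: Vec<Vec<f32>>) -> Vec<ChunkWithEmbedding> {
    assert_eq!(texts.len(), vectors.len(), "one vector per chunk expected");
    texts
        .into_iter()
        .zip(vectors)
        .map(|(text, vector)| ChunkWithEmbedding::new(text, vector))
        .collect()
}

/// Cosine similarity; None when dimensions differ or a vector is all zeros.
pub fn cosine_similarity(a: &ChunkWithEmbedding, b: &ChunkWithEmbedding) -> Option<f32> {
    if a.dimensions != b.dimensions || a.norm == 0.0 || b.norm == 0.0 {
        return None;
    }
    let dot: f32 = a.embedding.iter().zip(&b.embedding).map(|(x, y)| x * y).sum();
    Some(dot / (a.norm * b.norm))
}
```

Storing `norm` alongside the vector is what makes a cached cosine computation cheap, which is presumably why the struct carries it.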