.ai-rulez/skills/format-specific-extraction/SKILL.md
format specific extraction
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

`npx skillsauth add kreuzberg-dev/kreuzberg .ai-rulez/skills/format-specific-extraction`
Office documents (DOCX, PPTX, ODT)

ZIP archive → Security validation → XML parsing → Text + tables + metadata

- Security validation: `ZipBombValidator::new(limits).validate(&mut archive)?`
- Main XML parts: `word/document.xml`, `ppt/slides/*.xml`, `content.xml`
- XML parsing: `quick_xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
- Metadata: `crate::extraction::office_metadata::extract_metadata()`

See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`

PDF

Bytes → pdfium-render → Per-page text + OCR fallback → Tables → Metadata
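- Load the document: `pdfium.create_document_from_bytes(content, None)?`
- OCR fallback condition: `config.force_ocr || !has_searchable_text()`
- Per-page output when `config.pages` is enabled
- Feature gate: `#[cfg(feature = "pdf")]`

See: `extractors/pdf/mod.rs`

Archives

Validate → Extract metadata → Extract plaintext files only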
- Run `ZipBombValidator` BEFORE any extraction
- Build the result with the `build_archive_result()` helper

See: `extractors/archive.rs`, `extraction/archive/*.rs`

Structured data

Detect format from MIME → Parse → Pretty-print → Metadata
A single `StructuredExtractor` handles multiple MIME types: it parses the input with a format-specific library, then pretty-prints the result to text.

See: `extractors/structured.rs`
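
The "detect format from MIME" step can be sketched as a small lookup. This is a minimal illustration, not the code in `extractors/structured.rs`: the `StructuredFormat` enum, the `detect_format` function, and the specific MIME strings are all assumptions chosen for the example.

```rust
/// Hypothetical set of structured formats handled by one extractor.
/// The real list of formats is an assumption here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredFormat {
    Json,
    Yaml,
    Toml,
}

/// Map a MIME type to a format so one extractor can pick the right parser.
/// The MIME strings below are illustrative, not Kreuzberg's actual table.
pub fn detect_format(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "application/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" => Some(StructuredFormat::Yaml),
        "application/toml" => Some(StructuredFormat::Toml),
        // Unknown types fall through so another extractor can claim them.
        _ => None,
    }
}
```

Returning `Option` rather than an error keeps the decision local: the caller decides whether an unrecognised MIME type is a failure or simply a job for a different extractor.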
Email

Parse headers → Extract body (text/html) → Process attachments

See: `extraction/email.rs`, `extractors/email.rs`
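
The first two stages of the email pipeline (parse headers, then reach the body) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `split_message` and `header` are hypothetical names, MIME multipart handling and attachment processing are left out, and the real extractor in `extraction/email.rs` may work quite differently.

```rust
/// Split a raw message into unfolded headers and the body text.
/// Hypothetical helper: no MIME multipart or attachment handling.
pub fn split_message(raw: &str) -> (Vec<(String, String)>, String) {
    let mut headers: Vec<(String, String)> = Vec::new();
    let mut lines = raw.lines(); // handles both LF and CRLF line endings

    for line in lines.by_ref() {
        if line.is_empty() {
            // The first empty line separates headers from the body.
            break;
        }
        if line.starts_with(' ') || line.starts_with('\t') {
            // Folded header: a continuation of the previous header's value.
            if let Some((_, value)) = headers.last_mut() {
                value.push(' ');
                value.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
        // Lines that are neither a header nor a continuation are skipped.
    }

    // Everything after the separator is the body.
    let body = lines.collect::<Vec<_>>().join("\n");
    (headers, body)
}

/// Case-insensitive header lookup (header names are case-insensitive).
pub fn header<'a>(headers: &'a [(String, String)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(n, _)| n.eq_ignore_ascii_case(name))
        .map(|(_, v)| v.as_str())
}
```

A caller would typically look up `Content-Type` with `header(...)` to decide whether the body is plain text, HTML, or a multipart container before moving on to attachments.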
| Helper | Location | Purpose |
|--------|----------|---------|
| office_metadata::extract_metadata() | extraction/office.rs | Office XML metadata |
| cells_to_markdown() | extraction/mod.rs | Convert cell grid to GFM table |
| build_archive_result() | extraction/archive/mod.rs | Standard archive result |
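
To make the "cell grid to GFM table" conversion concrete, here is a minimal sketch in the spirit of `cells_to_markdown()`. The function name `grid_to_gfm_table`, its signature, and its escaping rules are assumptions for illustration; the actual helper in `extraction/mod.rs` may differ.

```rust
/// Render a grid of cells as a GitHub Flavored Markdown table.
/// The first row is treated as the header. Hypothetical sketch only.
pub fn grid_to_gfm_table(cells: &[Vec<String>]) -> String {
    // Use the widest row so ragged input still yields a rectangular table.
    let width = cells.iter().map(|row| row.len()).max().unwrap_or(0);
    if width == 0 {
        return String::new();
    }

    let mut out = String::new();
    for (i, row) in cells.iter().enumerate() {
        out.push('|');
        for col in 0..width {
            // Missing cells in short rows are rendered as empty.
            let cell = row.get(col).map(String::as_str).unwrap_or("");
            // Escape pipes and flatten newlines so a cell cannot break the row.
            let cell = cell.replace('|', "\\|").replace('\n', " ");
            out.push(' ');
            out.push_str(cell.trim());
            out.push_str(" |");
        }
        out.push('\n');

        if i == 0 {
            // GFM requires a delimiter row directly after the header row.
            out.push('|');
            for _ in 0..width {
                out.push_str(" --- |");
            }
            out.push('\n');
        }
    }
    out
}
```

Padding short rows to the widest row is a design choice made here for robustness against ragged tables.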
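
To add a new format:

1. Map the file extension in `EXT_TO_MIME` in `core/mime.rs`
2. Implement the `DocumentExtractor` trait
3. Provide `supported_mime_types()` and `priority()` (default: 50)
4. Register the extractor in `extractors/mod.rs` → `register_default_extractors()`
5. Gate it behind `#[cfg(feature = "my-format")]`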
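
The trait-and-registration pattern can be sketched as follows. This is a simplified stand-in, not Kreuzberg's actual `DocumentExtractor` definition: the `name()` and `extract()` methods, the `Registry` type, both example extractors, and the MIME string `application/x-my-format` are assumptions. Only `supported_mime_types()` and `priority()` with its default of 50 come from the notes above.

```rust
/// Simplified stand-in for the extractor trait (method set is an assumption).
pub trait DocumentExtractor {
    fn name(&self) -> &str;
    fn supported_mime_types(&self) -> &[&str];
    /// Default priority of 50; override to take precedence for a MIME type.
    fn priority(&self) -> i32 {
        50
    }
    fn extract(&self, content: &[u8]) -> Result<String, String>;
}

/// Hypothetical extractor that keeps the default priority.
pub struct MyFormatExtractor;

impl DocumentExtractor for MyFormatExtractor {
    fn name(&self) -> &str {
        "my-format"
    }
    fn supported_mime_types(&self) -> &[&str] {
        &["application/x-my-format"]
    }
    fn extract(&self, content: &[u8]) -> Result<String, String> {
        String::from_utf8(content.to_vec()).map_err(|e| e.to_string())
    }
}

/// Hypothetical second extractor for the same MIME type with a higher priority.
pub struct PreferredExtractor;

impl DocumentExtractor for PreferredExtractor {
    fn name(&self) -> &str {
        "my-format-preferred"
    }
    fn supported_mime_types(&self) -> &[&str] {
        &["application/x-my-format"]
    }
    fn priority(&self) -> i32 {
        90
    }
    fn extract(&self, _content: &[u8]) -> Result<String, String> {
        Ok("preferred".to_string())
    }
}

/// Minimal registry: the highest-priority extractor wins for a MIME type.
#[derive(Default)]
pub struct Registry {
    extractors: Vec<Box<dyn DocumentExtractor>>,
}

impl Registry {
    pub fn register(&mut self, extractor: Box<dyn DocumentExtractor>) {
        self.extractors.push(extractor);
    }

    pub fn get(&self, mime: &str) -> Option<&dyn DocumentExtractor> {
        self.extractors
            .iter()
            .filter(|e| e.supported_mime_types().contains(&mime))
            .max_by_key(|e| e.priority())
            .map(|e| e.as_ref())
    }
}
```

In the real crate the registration call would sit inside `register_default_extractors()` behind the feature gate, so the extractor only exists when the feature is enabled.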