.ai-rulez/skills/ocr-backend-management/SKILL.md
ocr uackend management
npx skillsauth add kreuzberg-dev/kreuzberg ocr-backend-managementInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Kreuzberg's pluggable OCR engine selection, image preprocessing, and table detection
Location: crates/kreuzberg/src/ocr/, crates/kreuzberg-tesseract/, crates/kreuzberg/src/extraction/image.rs
Kreuzberg provides a pluggable OCR system with multiple backend implementations:
| Backend | Language | Module | Features |
|---------|----------|--------|----------|
| Tesseract (Default) | Rust FFI binding | kreuzberg-tesseract crate | Multi-language, HOCR output, Tesseract 5.x |
| PaddleOCR | Rust (ONNX Runtime) | kreuzberg-paddle-ocr crate | 80+ languages, 11 script families, auto model download |
| EasyOCR | Python-only | kreuzberg-py embedding | 80+ languages, GPU support |
crates/kreuzberg/src/ocr/
├── mod.rs # OCR system orchestration
├── processor.rs # Main OCR processing pipeline
├── tesseract_backend.rs # Native Tesseract integration (30KB)
├── language_registry.rs # Language pack management (17KB)
├── cache.rs # OCR result caching (15KB)
├── hocr.rs # HOCR parsing (6KB)
├── types.rs # OCR configuration types (13KB)
└── table/
├── detection.rs # Table boundary detection
└── reconstruction.rs # Structured table output
Location: crates/kreuzberg/src/ocr/tesseract_backend.rs
pub struct TesseractConfig {
// Engine configuration
pub tessdata_prefix: Option<String>, // Custom Tesseract data dir
pub language_pack: Option<String>, // Language: "eng", "fra+deu", etc.
pub oem: Option<OcrEngineMode>, // 0=Legacy, 1=Neural, 2=Both, 3=Default
// Processing options
pub psm: Option<PageSegmentationMode>, // PSM 0-13 (see Tesseract docs)
pub dpi: Option<u32>, // Image DPI for scaling
pub quality_threshold: f32, // Min confidence 0.0-1.0
// Output
pub output_hocr: bool, // Enable hOCR bounding boxes
pub extract_tables: bool, // Auto table detection
}
Key Pattern: Lazy-initialize Tesseract instance per language pack to avoid repeated startup overhead.
// Location: ocr/processor.rs
lazy_static! {
static ref TESSERACT_INSTANCES: RwLock<HashMap<String, TesseractInstance>> = RwLock::new(HashMap::new());
}
async fn get_or_init_tesseract(language: &str) -> Result<TesseractInstance> {
let mut instances = TESSERACT_INSTANCES.write().map_err(|e| OCRError::LockPoisoned)?;
if let Some(instance) = instances.get(language) {
Ok(instance.clone())
} else {
let instance = TesseractInstance::new(language)?;
instances.insert(language.to_string(), instance.clone());
Ok(instance)
}
}
Location: crates/kreuzberg/src/ocr/language_registry.rs (17KB)
pub struct LanguageRegistry {
available_languages: HashMap<String, LanguageMetadata>,
cache: RwLock<HashMap<String, Vec<u8>>>, // Cached language packs
}
impl LanguageRegistry {
pub fn validate_language(&self, lang_code: &str) -> Result<LanguageMetadata> {
// ISO 639-1/3 validation
// Check if tessdata available for language
// Return metadata: script, rtl, confidence_threshold
}
pub fn list_available(&self) -> Vec<AvailableLanguage> {
// Return all installed language packs via tessdata check
}
pub fn download_language(&self, lang: &str) -> Result<()> {
// Download missing language pack if available
// Cache in TESSDATA_PREFIX
}
}
Supported Languages: Tesseract ships with 100+ languages; Kreuzberg validates on startup.
Location: crates/kreuzberg/src/extraction/image.rs (16KB)
Input Image
↓
[1. Load & Validate] → Check format, dimensions, color space
↓
[2. Preprocessing] → Denoise, deskew, binarize (optional, config-driven)
↓
[3. Scale/DPI Adjustment] → Resample to optimal DPI for OCR
↓
[4. Table Detection] → Identify table regions for separate OCR
↓
[5. OCR Processing] → Tesseract → hOCR output with bounding boxes
↓
[6. Table Reconstruction] → Structured table output (rows/cols)
↓
Output: ExtractionResult with text, tables, metadata (EXIF, dimensions)
pub struct ImagePreprocessingConfig {
// Denoising
pub denoise_enabled: bool, // Bilateral filter or morphological
pub denoise_strength: f32, // 0.0-1.0 aggressiveness
// Geometric
pub deskew_enabled: bool, // Auto-rotate skewed pages
pub rotation_threshold_deg: f32, // Min rotation angle to correct
// Binarization
pub binarize_enabled: bool, // Convert to B&W (improves OCR)
pub threshold_auto: bool, // Otsu vs fixed threshold
pub threshold_value: u8, // If threshold_auto=false
// Scaling
pub target_dpi: u32, // Resample to DPI (e.g., 300 DPI)
pub preserve_aspect: bool, // Don't distort
}
Implementation Pattern: Use image crate + SIMD operations for preprocessing.
// Location: extraction/image.rs
pub async fn preprocess_image(
image_data: &[u8],
config: &ImagePreprocessingConfig,
) -> Result<ImageBuffer<Rgba<u8>, Vec<u8>>> {
let mut img = image::load_from_memory(image_data)?;
// Deskew if needed
if config.deskew_enabled {
img = deskew_image(img, config.rotation_threshold_deg)?;
}
// Denoise if needed
if config.denoise_enabled {
img = denoise_image(img, config.denoise_strength)?;
}
// Binarize if needed
if config.binarize_enabled {
img = binarize_image(img, config.threshold_auto, config.threshold_value)?;
}
// DPI adjustment
if config.target_dpi > 0 {
img = resample_to_dpi(img, config.target_dpi)?;
}
Ok(img)
}
Location: crates/kreuzberg/src/ocr/table/
pub struct TableDetectionConfig {
pub enabled: bool, // Enable table detection
pub min_cells: u32, // Minimum cell count to consider table
pub confidence_threshold: f32, // Min confidence in table structure
pub line_thickness_threshold: u32, // Min pixels for table borders
pub merge_adjacent_cells: bool, // Merge cells with same span
}
Pattern: Detect table regions via:
TableData { headers: Vec<String>, rows: Vec<Vec<String>> }Location: crates/kreuzberg/src/ocr/hocr.rs
Tesseract's hOCR output contains XML with bounding boxes:
<span class="ocr_word" id="word_1_1" title="bbox 100 200 250 230">word</span>
Kreuzberg parses hOCR to:
TextWithBoundingBoxes { text: String, boxes: Vec<BBox> }Location: crates/kreuzberg/src/plugins/ocr.rs
pub trait OcrBackend: Send + Sync + 'static {
/// Process image and return text with optional HOCR/tables
async fn process(
&self,
image: &[u8],
config: &OcrConfig,
) -> Result<OcrResult>;
/// List supported languages
fn supported_languages(&self) -> Vec<String>;
/// Health check (verify engine is available)
async fn health_check(&self) -> Result<()>;
}
pub struct OcrPluginRegistry {
backends: HashMap<String, Arc<dyn OcrBackend>>,
}
impl OcrPluginRegistry {
pub fn register(&mut self, name: &str, backend: Arc<dyn OcrBackend>) {
self.backends.insert(name.to_string(), backend);
}
pub fn get_backend(&self, name: &str) -> Result<Arc<dyn OcrBackend>> {
self.backends.get(name).cloned()
.ok_or_else(|| OCRError::BackendNotFound(name.to_string()))
}
}
Usage Pattern: Register custom backend (e.g., cloud-based OCR, proprietary engine) at startup.
// Example: Custom AWS Textract backend
let textract_backend = Arc::new(AwsTextractBackend::new(credentials)?);
plugins::ocr::register_backend("aws-textract", textract_backend)?;
// Use via config
let config = ExtractionConfig {
image: Some(ImageConfig {
ocr_backend: Some("aws-textract".to_string()),
..Default::default()
}),
..Default::default()
};
Location: crates/kreuzberg/src/language_detection/
Before OCR, auto-detect document language:
pub struct LanguageDetectionConfig {
pub enabled: bool, // Auto-detect language
pub confidence_threshold: f32, // Min confidence 0.0-1.0
pub fallback_languages: Vec<String>, // If detection fails
}
// Usage in image extraction
let detected_lang = detect_language(&image_text)?;
let ocr_config = ImageConfig {
ocr_language: Some(detected_lang),
..Default::default()
};
Location: crates/kreuzberg/src/ocr/cache.rs
pub struct OcrResultCache {
// LRU cache: (image_hash, language, config_hash) → OcrResult
cache: Arc<RwLock<LruCache<CacheKey, OcrResult>>>,
}
impl OcrResultCache {
pub async fn get_or_process(
&self,
image: &[u8],
language: &str,
config: &OcrConfig,
) -> Result<OcrResult> {
let key = CacheKey::new(image, language, config);
// Check cache
{
let cache = self.cache.read()?;
if let Some(result) = cache.get(&key) {
return Ok(result.clone());
}
}
// Cache miss: process
let result = process_with_tesseract(image, language, config).await?;
self.cache.write()?.put(key, result.clone());
Ok(result)
}
}
Cache Key: Hash of image + language + config to avoid false hits.
// Process multiple images in parallel
pub async fn batch_ocr(
images: Vec<(String, Vec<u8>)>, // (file_id, image_data)
config: &OcrConfig,
) -> Result<Vec<(String, OcrResult)>> {
let tasks: Vec<_> = images
.into_iter()
.map(|(id, data)| tokio::spawn(process_image(id, data, config)))
.collect();
futures::future::try_join_all(tasks).await?
}
Location: crates/kreuzberg/src/core/config.rs
OCR backend selection precedence:
config.image.ocr_backend = Some("aws-textract")ocr_timeout_secs)tools
Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
testing
test execution patterns
data-ai
mime detection routing
development
format specific extraction