Edge Case Handling in html-to-markdown
The html-to-markdown converter handles numerous edge cases and adversarial inputs with robust validation, error recovery, and fallback mechanisms. This skill documents binary detection, encoding issues, malformed HTML recovery, and other robustness strategies.
The converter implements multiple detection techniques to prevent binary data from being processed as HTML.
Located in /crates/html-to-markdown/src/lib.rs (lines 50-57, 111-118):
const BINARY_MAGIC_PREFIXES: &[(&[u8], &str)] = &[
(b"\x1F\x8B", "gzip-compressed data"), // gzip
(b"\x28\xB5\x2F\xFD", "zstd-compressed data"), // zstd
(b"PK\x03\x04", "zip archive"), // ZIP
(b"PK\x05\x06", "zip archive"),
(b"PK\x07\x08", "zip archive"),
(b"%PDF-", "PDF data"), // PDF
];
fn detect_binary_magic(bytes: &[u8]) -> Option<&'static str> {
for (prefix, label) in BINARY_MAGIC_PREFIXES {
if bytes.starts_with(prefix) {
return Some(*label);
}
}
None
}
Detection Flow:
Input bytes
|
+-- Check for known magic signatures (2-5 byte prefixes)
|
+-- Match found? Return error immediately
|
+-- No match? Continue to next detection layer
Example:
// Gzip-compressed file
let html = b"\x1F\x8B\x08\x00...gzipped content...".to_vec();
let result = convert(&String::from_utf8_lossy(&html), None);
// Error: "binary data detected (gzip-compressed data)"
Located in /crates/html-to-markdown/src/lib.rs (lines 120-147):
BOM (Byte Order Mark) Detection:
// UTF-16LE BOM
if bytes.starts_with(b"\xFF\xFE") {
return Some("UTF-16LE BOM");
}
// UTF-16BE BOM
if bytes.starts_with(b"\xFE\xFF") {
return Some("UTF-16BE BOM");
}
Heuristic UTF-16 Detection (without BOM):
const BINARY_UTF16_NULL_RATIO: f64 = 0.2; // 20% null bytes
// Count null bytes in sample
let nul_ratio = nul_count as f64 / sample_len as f64;
if nul_ratio < BINARY_UTF16_NULL_RATIO {
return None; // Not enough nulls for UTF-16
}
// UTF-16 has consistent null pattern (even or odd positions)
let dominant_ratio = (even_nul_count.max(odd_nul_count) as f64) / nul_count as f64;
if dominant_ratio >= 0.9 {
return Some("UTF-16 data without BOM");
}
Examples:
UTF-16LE without BOM: <\0h\0t\0m\0l\0>\0
- Nulls at even byte positions counting from 1 (2, 4, 6, 8, 10, 12), i.e. odd 0-based offsets
- Detected by the UTF-16 heuristic
UTF-16BE without BOM: \0<\0h\0t\0m\0l\0>
- Nulls at odd byte positions counting from 1 (1, 3, 5, 7, 9, 11), i.e. even 0-based offsets
- Detected by the UTF-16 heuristic
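The BOM check and the null-byte heuristic above are shown only as fragments. A minimal sketch of how they might fit together follows; the function name `looks_like_utf16`, the constant name `DOMINANT_PARITY_RATIO`, and reusing the 8 KB scan limit for the sample are assumptions for illustration, not the crate's actual code.

```rust
// Thresholds copied from the excerpts above.
const BINARY_SCAN_LIMIT: usize = 8192;
const BINARY_UTF16_NULL_RATIO: f64 = 0.2;
// Hypothetical name for the 0.9 literal in the excerpt.
const DOMINANT_PARITY_RATIO: f64 = 0.9;

/// Hypothetical helper: returns a label when `bytes` look like UTF-16.
fn looks_like_utf16(bytes: &[u8]) -> Option<&'static str> {
    // 1. An explicit BOM is the cheapest and most reliable signal.
    if bytes.starts_with(b"\xFF\xFE") {
        return Some("UTF-16LE BOM");
    }
    if bytes.starts_with(b"\xFE\xFF") {
        return Some("UTF-16BE BOM");
    }

    // 2. Count NUL bytes at even and odd 0-based offsets in the sample.
    let sample = &bytes[..bytes.len().min(BINARY_SCAN_LIMIT)];
    let (mut even_nul_count, mut odd_nul_count) = (0usize, 0usize);
    for (offset, &byte) in sample.iter().enumerate() {
        if byte == 0 {
            if offset % 2 == 0 {
                even_nul_count += 1;
            } else {
                odd_nul_count += 1;
            }
        }
    }
    let nul_count = even_nul_count + odd_nul_count;
    if nul_count == 0 {
        return None; // Also covers the empty-input case.
    }

    // 3. Too few NULs overall: not UTF-16-encoded ASCII-range text.
    let nul_ratio = nul_count as f64 / sample.len() as f64;
    if nul_ratio < BINARY_UTF16_NULL_RATIO {
        return None;
    }

    // 4. UTF-16 puts nearly all NULs on one parity; random binary does not.
    let dominant_ratio = even_nul_count.max(odd_nul_count) as f64 / nul_count as f64;
    if dominant_ratio >= DOMINANT_PARITY_RATIO {
        return Some("UTF-16 data without BOM");
    }
    None
}
```

With this sketch, the two BOM-less examples above are both flagged, while ordinary UTF-8 HTML contains no NUL bytes and passes through.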
Located in /crates/html-to-markdown/src/lib.rs (lines 71-106):
const BINARY_SCAN_LIMIT: usize = 8192; // Scan first 8KB
const BINARY_CONTROL_RATIO: f64 = 0.3; // 30% threshold
fn validate_input(html: &str) -> Result<()> {
let bytes = html.as_bytes();
if bytes.is_empty() {
return Ok(());
}
// Check magic prefixes first
if let Some(label) = detect_binary_magic(bytes) {
return Err(ConversionError::InvalidInput(format!(
"binary data detected ({label}); decode/decompress to UTF-8 HTML first"
)));
}
let sample_len = bytes.len().min(BINARY_SCAN_LIMIT);
let mut control_count = 0usize;
for &byte in bytes[..sample_len].iter() {
let is_control = (byte < 0x09) || (0x0E..0x20).contains(&byte);
if is_control {
control_count += 1;
}
}
let control_ratio = control_count as f64 / sample_len as f64;
if control_ratio > BINARY_CONTROL_RATIO {
return Err(ConversionError::InvalidInput(
"binary data detected (excess control bytes)".to_string(),
));
}
Ok(())
}
Control Character Ranges:
0x00-0x08: NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS
0x0E-0x1F: Shift Out through Unit Separator
TAB (0x09), LF (0x0A), and CR (0x0D) are not counted; per the is_control check above, neither are VT (0x0B) and FF (0x0C).
Examples:
HTML with more than 30% control chars: REJECTED
"Hello\x00\x01\x02World\x03\x04\x05..."
6 control bytes out of the 16 bytes shown = 37.5% > 30%
Text with \r\n newlines: ACCEPTED
"Hello\r\nWorld\r\n" = 2 CR, 2 LF within normal ranges (0x0D, 0x0A allowed)
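To check the arithmetic in examples like these, the ratio can be computed in isolation. The sketch below lifts the counting rule from `validate_input`; the helper names `is_counted_control` and `control_ratio` are hypothetical and exist only for this illustration.

```rust
const BINARY_SCAN_LIMIT: usize = 8192;

/// Same rule as the `is_control` expression in `validate_input`:
/// 0x00-0x08 and 0x0E-0x1F count, while TAB, LF, VT, FF, and CR do not.
fn is_counted_control(byte: u8) -> bool {
    byte < 0x09 || (0x0E..0x20).contains(&byte)
}

/// Hypothetical helper: fraction of counted control bytes in the scanned sample.
fn control_ratio(bytes: &[u8]) -> f64 {
    let sample = &bytes[..bytes.len().min(BINARY_SCAN_LIMIT)];
    if sample.is_empty() {
        return 0.0;
    }
    let control_count = sample.iter().filter(|&&b| is_counted_control(b)).count();
    control_count as f64 / sample.len() as f64
}
```

For the 16 bytes `Hello\x00\x01\x02World\x03\x04\x05` this yields 6 / 16 = 0.375, above the 0.3 threshold, while `Hello\r\nWorld\r\n` yields 0.0 because CR and LF are not counted.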
All binary detection errors return ConversionError::InvalidInput:
pub enum ConversionError {
InvalidInput(String), // Binary, encoding, or malformed
ParseError(String),
ConfigError(String),
Panic(String),
Other(String),
}
Example Error Flow:
let html = read_file("archive.zip"); // Binary ZIP file
match convert(&html, None) {
Err(ConversionError::InvalidInput(msg)) => {
eprintln!("Cannot convert: {}", msg);
// Error: "binary data detected (zip archive); decode/decompress to UTF-8 HTML first"
}
_ => {}
}
\x61\x00 in UTF-16LE or \x00\x61 in UTF-16BE
<meta charset> specify UTF-8
Since detection happens before parsing:
// User receives error
Err("binary data detected (UTF-16LE BOM); decode to UTF-8 HTML first")
// User must decode first:
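let utf16_bytes = fs::read("utf16.html").unwrap(); // Vec<u8>
// Pair the bytes into little-endian UTF-16 code units (a trailing odd byte is dropped)
let units: Vec<u16> = utf16_bytes
    .chunks_exact(2)
    .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
    .collect();
let utf8_html = String::from_utf16(&units).expect("decode UTF-16LE to UTF-8");
// Strip the BOM (U+FEFF) if the file had one, then convert
let markdown = convert(utf8_html.trim_start_matches('\u{FEFF}'), None)?;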
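import html_to_markdown

# Common mistake: file saved as UTF-16
with open('document.html', 'rb') as f:
    raw_bytes = f.read()
# Detect and convert
if raw_bytes.startswith((b'\xff\xfe', b'\xfe\xff')):
    # BOM present: the 'utf-16' codec uses it to pick the byte order and strips it
    html = raw_bytes.decode('utf-16')
else:
    # No BOM: assume UTF-16LE (typical for files saved on Windows).
    # Decoding with the wrong byte order rarely raises, so try/except is not a reliable test.
    html = raw_bytes.decode('utf-16-le')
# Now convert
markdown = html_to_markdown.convert(html)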
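For Rust callers, a BOM-aware decoder that handles both byte orders can be sketched with the standard library alone. The helper name `decode_utf16` is hypothetical, and defaulting to little-endian when no BOM is present is an assumption; the string it returns is what would then be passed to `convert`.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: decode UTF-16 bytes to a Rust `String` (UTF-8).
/// A BOM selects the byte order and is removed; without one, little-endian is assumed.
fn decode_utf16(bytes: &[u8]) -> Result<String, FromUtf16Error> {
    let (payload, big_endian) = if let Some(rest) = bytes.strip_prefix(b"\xFE\xFF") {
        (rest, true)
    } else if let Some(rest) = bytes.strip_prefix(b"\xFF\xFE") {
        (rest, false)
    } else {
        (bytes, false)
    };

    // Pair bytes into 16-bit code units; a trailing odd byte is silently dropped.
    let units: Vec<u16> = payload
        .chunks_exact(2)
        .map(|pair| {
            let pair = [pair[0], pair[1]];
            if big_endian {
                u16::from_be_bytes(pair)
            } else {
                u16::from_le_bytes(pair)
            }
        })
        .collect();

    // Fails on unpaired surrogates, which usually means the input was not UTF-16.
    String::from_utf16(&units)
}
```

Returning the `FromUtf16Error` instead of panicking lets the caller decide whether to try another encoding or report the file as unreadable.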
The converter uses two parsers with different robustness levels:
astral-tl (fast, primary):
/crates/html-to-markdown/src/converter.rs
html5ever (robust, fallback):
<!-- Input -->
<div>
<p>Paragraph 1
<p>Paragraph 2
</div>
<!-- astral-tl behavior: Treats as flat text flow -->
<!-- html5ever behavior: Auto-closes <p>, handles nesting -->
Output:
Paragraph 1
Paragraph 2
<!-- Input: Divs inside table cells (invalid structure) -->
<table>
<tr>
<td>
<div>Content in div</div>
</td>
</table>
<!-- Missing closing </tr>, </td> -->
<!-- astral-tl: Best-effort recovery -->
<!-- html5ever: Proper reconstruction of table structure -->
<!-- Input: Unescaped quotes in attribute values -->
<a href="page?a=1&amp;b=2">Link</a>
<!-- astral-tl: Attribute parsing stops at first " -->
<a href="page?a=1">
<!-- html5ever: Proper entity decoding (&amp; → &) -->
<a href="page?a=1&b=2">
Before parsing, normalize line endings:
fn normalize_line_endings(html: &str) -> Cow<'_, str> {
if html.contains('\r') {
Cow::Owned(html.replace("\r\n", "\n").replace('\r', "\n"))
} else {
Cow::Borrowed(html)
}
}
Input: <p>Line1\r\nLine2\r<br></p> (mixed CRLF and bare CR)
Output: <p>Line1\nLine2\n<br></p> (normalized to LF)
For text-only inputs (no < character), skip HTML parsing entirely:
fn fast_text_only(html: &str, options: &ConversionOptions) -> Option<String> {
if html.contains('<') {
return None; // Has HTML, can't use fast path
}
// Apply text processing directly
let decoded = text::decode_html_entities_cow(html);
let normalized = if options.whitespace_mode == WhitespaceMode::Normalized {
text::normalize_whitespace_cow(&decoded)
} else {
Cow::Borrowed(&decoded)
};
Some(normalized.into_owned() + "\n")
}
Benefit: Plain text documents converted without parsing overhead
Handle HTML entities even with malformed markup:
// From text.rs
pub fn decode_html_entities_cow(text: &str) -> Cow<'_, str> {
// Handles:
// &amp; → &
// &#123; → {
// &#xHHHH; (hex reference) → (Unicode char)
// &nbsp; → (non-breaking space)
// ... 100+ HTML5 entities
}
Examples:
&lt;script&gt; → <script>
&quot;test&quot; → "test"
&euro; → € (Euro symbol)
&#x1F600; → 😀 (Emoji)
When the parser encounters malformed attributes, it degrades gracefully:
<!-- Input -->
<img src="image.jpg" alt=Unquoted width=100 data-x='mixed quote>
<!-- Outcome: Some attributes lost, image still converts -->

Located in /crates/html-to-markdown/src/error.rs:
pub enum ConversionError {
Panic(String), // Catches panics in conversion pipeline
}
Bindings wrap conversion in catch-panic:
// From Python bindings
#[pyfunction]
fn convert(html: String, options: Option<PyObject>) -> PyResult<String> {
let result = std::panic::catch_unwind(|| {
html_to_markdown_rs::convert(&html, None)
});
match result {
Ok(Ok(md)) => Ok(md),
Ok(Err(e)) => Err(PyErr::new::<PyException, _>(e.to_string())),
Err(_) => Err(PyErr::new::<PyException, _>("panic in conversion")),
}
}
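The same catch-panic pattern can be shown without the Python binding layer. The sketch below uses only the standard library; the `guard` function is hypothetical and the local `ConversionError` is a two-variant stand-in for the crate's enum, included so the example compiles on its own.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Local stand-in for the crate's error type (only the variants this sketch needs).
#[derive(Debug, PartialEq)]
enum ConversionError {
    Panic(String),
    Other(String),
}

/// Hypothetical wrapper: runs `f` and turns a panic into `ConversionError::Panic`.
fn guard<F>(f: F) -> Result<String, ConversionError>
where
    F: FnOnce() -> Result<String, ConversionError>,
{
    match catch_unwind(AssertUnwindSafe(f)) {
        // No panic: pass the inner Ok/Err through unchanged.
        Ok(result) => result,
        // Panic payloads are usually `&str` or `String`; fall back to a fixed message.
        Err(payload) => {
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(ConversionError::Panic(message))
        }
    }
}
```

Note that `catch_unwind` only helps when the binary is built with unwinding panics; with `panic = "abort"` the process still terminates.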
Never unwrap in conversion code:
// WRONG - Can panic
let tag_name = element.name().expect("tag name");
// CORRECT - Returns Result
match element.name() {
Some(name) => { /* use name */ },
None => { /* handle missing name gracefully */ },
}
fn validate_input(html: &str) -> Result<()> {
let bytes = html.as_bytes();
if bytes.is_empty() {
return Ok(()); // Empty string is valid
}
// Continue with other checks
// ...
}
convert("", None) → Ok("\n") // Single newline
convert(" ", None) → Ok("\n") // Whitespace stripped
convert(null, None) → Error (from binding) // Python None → error
The converter handles large documents efficiently:
Tested sizes:
For extremely large documents (> 100 MB), consider:
Streaming parsing (future enhancement):
Document splitting:
<div> sections
Chunked metadata:
// Use pre-allocated buffers
let mut output = String::with_capacity(html.len());
// LRU cache for common patterns
static PATTERN_CACHE: Lazy<Mutex<LruCache<String, String>>> =
Lazy::new(|| Mutex::new(LruCache::new(NonZeroUsize::new(1024).unwrap())));
// Metadata size-bounded
config.max_structured_data_size = 10 * 1024 * 1024; // 10 MB limit
<!-- Missing tbody, thead -->
<table>
<tr><td>Cell 1</td></tr>
</table>
<!-- Cells spanning beyond row width -->
<table>
<tr>
<td colspan="5">Wide cell</td>
<td>Normal</td>
</tr>
</table>
<!-- Nested tables (complex) -->
<table>
<tr>
<td>
<table><tr><td>Inner</td></tr></table>
</td>
</tr>
</table>
Single-cell table:
<table><tr><td>Data</td></tr></table>
→ | Data |
|------|
Colspan expansion:
<table>
<tr><td colspan="3">Wide</td></tr>
<tr><td>A</td><td>B</td><td>C</td></tr>
</table>
→ | Wide | | |
|------|---|---|
| A | B | C |
Nested tables: Converts outer table, treats inner table as cell content:
<table><tr><td>Cell with [nested table] inside</td></tr></table>
→ Outer table with nested table markdown in cell
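Colspan expansion can be sketched as padding each spanning cell with empty placeholder columns so that every row reaches the table's full width. The code below is an illustration only: `Cell`, `expand_row`, and `render_table` are hypothetical names, and the exact spacing of the output is not claimed to match the converter's.

```rust
/// One cell: its text and its `colspan` (1 when the attribute is absent).
type Cell<'a> = (&'a str, usize);

/// Expand one row to `width` columns: a cell spanning N columns
/// contributes its text followed by N - 1 empty placeholders.
fn expand_row(cells: &[Cell<'_>], width: usize) -> Vec<String> {
    let mut out = Vec::with_capacity(width);
    for &(text, span) in cells {
        out.push(text.to_string());
        for _ in 1..span.max(1) {
            out.push(String::new());
        }
    }
    // Pad short rows so every row has the same number of columns.
    while out.len() < width {
        out.push(String::new());
    }
    out
}

/// Render rows as a pipe table, treating the first row as the header.
fn render_table(rows: &[Vec<Cell<'_>>]) -> String {
    // The table width is the largest sum of colspans in any row.
    let width = rows
        .iter()
        .map(|row| row.iter().map(|&(_, span)| span.max(1)).sum::<usize>())
        .max()
        .unwrap_or(0);

    let mut lines = Vec::new();
    for (index, row) in rows.iter().enumerate() {
        lines.push(format!("| {} |", expand_row(row, width).join(" | ")));
        if index == 0 {
            lines.push(format!("|{}|", vec!["---"; width].join("|")));
        }
    }
    lines.join("\n")
}
```

Computing the width from the widest row first is what lets a `colspan="5"` cell followed by a normal cell produce a six-column table instead of overflowing the header.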
Based on ConversionOptions:
pub struct ConversionOptions {
pub escape_misc: bool, // \ & < ` [ > ~ # = + | -
pub escape_asterisks: bool, // *
pub escape_underscores: bool, // _
pub escape_ascii: bool, // All ASCII punctuation
}
Input: "Price: $10 & shipping"
escape_misc=true → "Price: $10 \& shipping"
Input: "*not bold*"
escape_asterisks=true → "\*not bold\*"
Input: "Text_with_underscore"
escape_underscores=true → "Text\_with\_underscore"
Input: "1. First point"
No explicit option; the leading "1." is auto-escaped to "1\. First point" so the line is not read as an ordered list item
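The escaping rules above can be illustrated with a small standalone sketch. Both function names are hypothetical, and the converter itself may implement escaping differently (for example with cached regular expressions); only the input/output behaviour listed above is reproduced here.

```rust
/// Hypothetical helper: backslash-escape `*` and `_` when the matching flag is set.
fn escape_text(text: &str, escape_asterisks: bool, escape_underscores: bool) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if (ch == '*' && escape_asterisks) || (ch == '_' && escape_underscores) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Hypothetical helper: escape the dot in a leading "<digits>. " so the line
/// is not parsed as an ordered list item.
fn escape_ordered_list_prefix(line: &str) -> String {
    // ASCII digits are one byte each, so the char count is also a byte offset.
    let digits = line.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits > 0 && line[digits..].starts_with(". ") {
        format!("{}\\{}", &line[..digits], &line[digits..])
    } else {
        line.to_string()
    }
}
```

Keeping the list-prefix rule separate from the per-character flags mirrors the note above that it has no dedicated option.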
Deep nesting (h1 inside 50 nested divs)
Large attribute maps
Regex escaping
// Cache compiled regex patterns (already done in text.rs)
static ESCAPE_MISC_RE: Lazy<Regex> =
Lazy::new(|| Regex::new(r"([\\&<`\[\]>~#=+|\-])").unwrap());
// Use iterators instead of allocating vectors
for byte in bytes.iter() {
// Process in-place
}
// Limit depth checks in visitor
if ctx.depth > 100 {
return VisitResult::Skip; // Avoid stack exhaustion
}
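The depth guard is easiest to see on a toy tree. In the sketch below, `Node`, `collect_text`, and `nest` are hypothetical stand-ins for the converter's DOM and visitor types; only the `depth > 100` cutoff comes from the snippet above.

```rust
const MAX_DEPTH: usize = 100;

/// Toy DOM: either a text leaf or an element with children.
enum Node {
    Text(String),
    Element(Vec<Node>),
}

/// Depth-limited traversal: subtrees below `MAX_DEPTH` are skipped,
/// which bounds recursion and avoids stack exhaustion on hostile input.
fn collect_text(node: &Node, depth: usize, out: &mut String) {
    if depth > MAX_DEPTH {
        return;
    }
    match node {
        Node::Text(text) => out.push_str(text),
        Node::Element(children) => {
            for child in children {
                collect_text(child, depth + 1, out);
            }
        }
    }
}

/// Wrap `text` in `levels` nested elements (like a heading inside nested divs).
fn nest(text: &str, levels: usize) -> Node {
    let mut node = Node::Text(text.to_string());
    for _ in 0..levels {
        node = Node::Element(vec![node]);
    }
    node
}
```

The trade-off is that content nested deeper than the limit is dropped silently: a heading inside 50 nested divs is still collected, while one inside 150 is not.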
pub enum ConversionError {
InvalidInput(String), // Binary, encoding, empty (recoverable)
ParseError(String), // HTML parsing failed (unrecoverable)
ConfigError(String), // Invalid config (unrecoverable)
Panic(String), // Unexpected panic (unrecoverable)
Other(String), // Misc errors (mixed)
}
| Error | Cause | Recovery |
|-------|-------|----------|
| InvalidInput | UTF-16, gzip, control chars | User must fix input, then retry |
| ParseError | HTML unparseable | Fall back to html5ever or give up |
| ConfigError | Bad options | Fix config, retry |
| Panic | Code bug | Report as issue, workaround TBD |
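A caller might encode this recovery table as a single match, so each error variant maps to one follow-up action. The sketch below redeclares the enum locally so it compiles on its own; `NextStep` and `next_step` are hypothetical names, not part of the crate's API.

```rust
/// Local copy of the error enum shown above.
#[derive(Debug)]
enum ConversionError {
    InvalidInput(String),
    ParseError(String),
    ConfigError(String),
    Panic(String),
    Other(String),
}

/// Hypothetical caller-side classification of what to do next.
#[derive(Debug, PartialEq)]
enum NextStep {
    FixInputAndRetry,
    TryFallbackParser,
    FixConfigAndRetry,
    ReportBug,
    Inspect,
}

fn next_step(err: &ConversionError) -> NextStep {
    match err {
        // Binary, UTF-16, or control-heavy input: decode or decompress, then retry.
        ConversionError::InvalidInput(_) => NextStep::FixInputAndRetry,
        // The parser gave up: fall back to the more robust parser or stop.
        ConversionError::ParseError(_) => NextStep::TryFallbackParser,
        // Bad options: correct the configuration and retry.
        ConversionError::ConfigError(_) => NextStep::FixConfigAndRetry,
        // A panic indicates a bug in the conversion code.
        ConversionError::Panic(_) => NextStep::ReportBug,
        // Mixed bag: read the message before deciding.
        ConversionError::Other(_) => NextStep::Inspect,
    }
}
```

Because the match is exhaustive, adding a variant to the enum forces the caller to decide how to handle it at compile time.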
Location: /crates/html-to-markdown/src/lib.rs (lines 722-765)
#[test]
fn test_binary_input_rejected() {
let html = "PDF\0DATA";
let result = convert(html, None);
assert!(matches!(result, Err(ConversionError::InvalidInput(_))));
}
#[test]
fn test_binary_magic_rejected() {
let html = String::from_utf8_lossy(b"\x1F\x8B\x08\x00gzip").to_string();
let result = convert(&html, None);
assert!(matches!(result, Err(ConversionError::InvalidInput(_))));
}
#[test]
fn test_utf16_hint_rejected() {
let html = String::from_utf8_lossy(b"\xFF\xFE<\0h\0t\0m\0l\0>\0").to_string();
let result = convert(&html, None);
assert!(matches!(result, Err(ConversionError::InvalidInput(_))));
}
#[test]
fn test_plain_text_allowed() {
let result = convert("Just text", None).unwrap();
assert!(result.contains("Just text"));
}
| Input Type | Detection | Outcome | User Action |
|-----------|-----------|---------|------------|
| Empty string | Whitespace check | Ok("\n") | None needed |
| Plain text (no HTML) | fast_text_only() | Fast path OK | None |
| UTF-16 BOM | Magic prefix | Error | Decode to UTF-8 first |
| UTF-16 heuristic | Null byte pattern | Error | Decode to UTF-8 first |
| Gzip/ZIP | Magic prefix | Error | Decompress first |
| PDF | Magic prefix | Error | Extract text first |
| 35% control chars | Ratio > 0.3 | Error | Check encoding |
| Malformed HTML | astral-tl best-effort | Best-effort conversion | Check for quality |
| Deep nesting | Depth tracking | Slowdown (acceptable) | Consider chunking |
| Large document | Memory allocation | Proportional memory | Monitor RAM |