.ai-rulez/skills/html-parsing-strategies/SKILL.md
HTML Parsing Strategies for html-to-markdown
The html-to-markdown project uses two complementary HTML parsers to handle different conversion scenarios and performance requirements:
tl crate
Strengths:
Weaknesses:
Use Cases:
Implementation Location:
/crates/html-to-markdown/src/converter.rs - Primary conversion logic
tl.workspace = true in Cargo.toml
html5ever crate (with markup5ever_rcdom)
Strengths:
Weaknesses:
Use Cases:
Implementation Location:
html5ever.workspace = true and markup5ever_rcdom.workspace = true
The tl parser provides a lightweight DOM interface:
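```rust
// Simplified from converter.rs, written against the tl crate's public API
// Using tl parser for fast sequential traversal
fn walk(html: &str) -> Result<(), tl::ParseError> {
    let dom = tl::parse(html, tl::ParserOptions::default())?;
    for node in dom.nodes() {
        match node {
            tl::Node::Tag(tag) => {
                // Handle element
                let tag_name = tag.name().as_utf8_str();
                let attrs = tag.attributes();
                // Process children recursively
            }
            tl::Node::Raw(text) => {
                // Handle text node
                let content = text.as_bytes();
            }
            // Comments are skipped
            _ => {}
        }
    }
    Ok(())
}
```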
Characteristics:
The markup5ever_rcdom crate provides a full DOM tree:
```rust
// html5ever creates a reference-counted DOM tree
// Traversal via RcDom with Handle references
// Recursive descent through child_nodes()
// Full attribute access via attributes()
```
Characteristics:
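To illustrate the reference-counted tree shape, here is a minimal sketch that uses only the standard library. `Node`, `NodeData`, and `Handle` borrow rcdom's naming but are hypothetical stand-ins, not the real markup5ever_rcdom types, and the helpers `element`, `text`, `collect_text`, and `find_attr` are invented for this example.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for an rcdom-style node (not the real crate types).
enum NodeData {
    Element { name: String, attrs: Vec<(String, String)> },
    Text(String),
}

struct Node {
    data: NodeData,
    children: RefCell<Vec<Handle>>,
}

// Shared ownership: several places may hold a handle to the same node.
type Handle = Rc<Node>;

fn element(name: &str, attrs: &[(&str, &str)], children: Vec<Handle>) -> Handle {
    Rc::new(Node {
        data: NodeData::Element {
            name: name.to_string(),
            attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
        },
        children: RefCell::new(children),
    })
}

fn text(content: &str) -> Handle {
    Rc::new(Node {
        data: NodeData::Text(content.to_string()),
        children: RefCell::new(Vec::new()),
    })
}

/// Recursive descent: append all text content in document order.
fn collect_text(handle: &Handle, out: &mut String) {
    if let NodeData::Text(t) = &handle.data {
        out.push_str(t);
    }
    for child in handle.children.borrow().iter() {
        collect_text(child, out);
    }
}

/// Return the value of `key` on the first `tag` element found depth-first.
fn find_attr(handle: &Handle, tag: &str, key: &str) -> Option<String> {
    if let NodeData::Element { name, attrs } = &handle.data {
        if name.as_str() == tag {
            if let Some((_, v)) = attrs.iter().find(|(k, _)| k.as_str() == key) {
                return Some(v.clone());
            }
        }
    }
    handle.children.borrow().iter().find_map(|c| find_attr(c, tag, key))
}
```

Building a small tree with `element("div", &[], vec![...])` and calling `collect_text` on the root visits each child through its `Rc` handle. That reference counting is what gives a full tree its higher per-node cost than a flat node list.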
Plain text input (no < characters)
fast_text_only() before any parsing
Well-formed HTML
converter.rs
UTF-8 validated input
Binary detection triggered
validate_input() in lib.rs
Severely malformed markup
<table><div> structure)
Legacy or fuzzing inputs
The converter preserves whitespace exactly as parsed:
```rust
// From converter.rs
// All text nodes retain original spacing
// No HTML5 whitespace collapsing applied
// Raw text preservation mode (not normalized)
```
Modes:
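The difference between preserving and collapsing whitespace can be sketched as follows. The `Mode` enum and `apply_whitespace` function are hypothetical names chosen for this example, and the collapsing rule (each run of ASCII whitespace becomes one space) is an assumption, not the project's documented behavior.

```rust
use std::borrow::Cow;

/// Hypothetical mode names for illustration only.
#[derive(Clone, Copy)]
enum Mode {
    /// Return text exactly as parsed.
    Preserve,
    /// Replace each run of ASCII whitespace with a single space.
    Collapse,
}

fn apply_whitespace(text: &str, mode: Mode) -> Cow<'_, str> {
    match mode {
        // No allocation: the original slice is handed back untouched.
        Mode::Preserve => Cow::Borrowed(text),
        Mode::Collapse => {
            let mut out = String::with_capacity(text.len());
            let mut in_run = false;
            for ch in text.chars() {
                if ch.is_ascii_whitespace() {
                    // Emit one space for the first character of a run only.
                    if !in_run {
                        out.push(' ');
                    }
                    in_run = true;
                } else {
                    out.push(ch);
                    in_run = false;
                }
            }
            Cow::Owned(out)
        }
    }
}
```

Returning `Cow` keeps the preserving path allocation-free, which suits a mode whose purpose is to pass text through unchanged.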
Both parsers decode HTML entities:
html-escape crate for quick common entities
Numeric (`&#123;`) and named (`&amp;`) entities
text::decode_html_entities_cow()

```
Input HTML string
    |
    +-- validate_input() checks for binary/encoding issues
        |
        +-- FAIL? Reject with ConversionError::InvalidInput
        |
        +-- PASS? Proceed to conversion
            |
            +-- Is plain text (no '<')? Use fast_text_only()
            |
            +-- Otherwise: Use tl parser (default)
                |
                +-- Converts successfully? Return markdown
                |
                +-- Need html5ever for edge case?
                    (handled via feature gates or fallback)
```
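The decision flow above can be sketched as a small routing function. This is an illustration under stated assumptions, not the project's code: `Route` and `choose_route` are hypothetical names, and the validation step (reject a UTF-16 byte order mark, NUL bytes, or invalid UTF-8) is a stand-in for whatever `validate_input()` actually checks.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    /// Validation failed; the caller would return an error.
    Rejected,
    /// No '<' present, so no parser is needed.
    FastTextOnly,
    /// Default parser path.
    TlParser,
}

/// Hypothetical router that mirrors the decision flow.
fn choose_route(input: &[u8]) -> Route {
    // Step 1: stand-in validation (assumed heuristics, see the lead-in).
    let has_utf16_bom = input.starts_with(&[0xFF, 0xFE]) || input.starts_with(&[0xFE, 0xFF]);
    if has_utf16_bom || input.contains(&0) {
        return Route::Rejected;
    }
    let html = match std::str::from_utf8(input) {
        Ok(s) => s,
        Err(_) => return Route::Rejected,
    };

    // Step 2: plain text skips parsing entirely.
    if !html.contains('<') {
        return Route::FastTextOnly;
    }

    // Step 3: everything else goes to the default parser.
    Route::TlParser
}
```

The html5ever branch is left out because the source describes it only as "handled via feature gates or fallback" and does not say what triggers it.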
Parser performance characteristics:
cargo flamegraph
if html.contains('<') first
Input validation: /crates/html-to-markdown/src/lib.rs (lines 59-147)
validate_input() function
Fast text path: /crates/html-to-markdown/src/lib.rs (lines 157-197)
fast_text_only() optimization
Converter logic: /crates/html-to-markdown/src/converter.rs
Configuration: /crates/html-to-markdown/src/options.rs
WhitespaceMode enum (Strict vs Normalized)
PreprocessingOptions for input handling
convert_html() or create new function

| Input Type | Parser | Reason |
|-----------|--------|--------|
| Plain text | None (fast_text_only) | No parsing needed |
| Standard web HTML | tl | Fast, sufficient correctness |
| Untrusted/fuzzing input | html5ever | Full spec compliance |
| UTF-16 detected | Error | Reject with validation error |
| Severely malformed | html5ever | Better error recovery |
| Large documents | tl | Better streaming potential |
| WASM target | tl | Smaller binary footprint |
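The selection table can also be read as a lookup from input category to parser. The sketch below encodes its rows directly; `InputKind`, `Parser`, and `parser_for` are hypothetical names for this example, and how an input would be classified into a category is outside its scope.

```rust
#[derive(Debug, PartialEq)]
enum Parser {
    /// No parser: the fast text-only path.
    FastTextOnly,
    Tl,
    Html5ever,
}

/// One variant per row of the table (hypothetical names).
#[derive(Debug, PartialEq)]
enum InputKind {
    PlainText,
    StandardWeb,
    Untrusted,
    Utf16,
    SeverelyMalformed,
    LargeDocument,
    WasmTarget,
}

/// Map an input category to the parser the table recommends.
fn parser_for(kind: InputKind) -> Result<Parser, &'static str> {
    match kind {
        InputKind::PlainText => Ok(Parser::FastTextOnly),
        InputKind::StandardWeb | InputKind::LargeDocument | InputKind::WasmTarget => {
            Ok(Parser::Tl)
        }
        InputKind::Untrusted | InputKind::SeverelyMalformed => Ok(Parser::Html5ever),
        // UTF-16 input is rejected during validation rather than parsed.
        InputKind::Utf16 => Err("rejected with validation error"),
    }
}
```

Because the `match` is exhaustive, the compiler flags any new `InputKind` variant that has no arm, so the mapping cannot silently fall out of date.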