.ai-rulez/skills/whitespace-handling/SKILL.md
Whitespace Handling in html-to-markdown
Whitespace handling in html-to-markdown is a critical aspect of conversion fidelity. The library provides multiple modes to control how multiple newlines, indentation, spaces, and tabs are handled during HTML-to-Markdown conversion.
Principle: "Preserve exactly as it appears in HTML source"
By default, html-to-markdown does NOT apply HTML5's automatic whitespace collapsing rules.
This differs from browsers, which normalize whitespace according to CSS/display rules.
Located in /crates/html-to-markdown/src/options.rs:
pub enum WhitespaceMode {
    #[default]
    Normalized, // Collapse multiple spaces to single; normalize newlines
    Strict,     // Preserve every space and newline exactly
}
let options = ConversionOptions {
    whitespace_mode: WhitespaceMode::Normalized,
    ..Default::default()
};
let html = "<p>Hello    world\n\n\nwith   spaces</p>";
// Multiple spaces → single space
// Multiple newlines → single newline
let markdown = convert(html, Some(options))?;
// → "Hello world with spaces\n"
Rules:
Use Cases:
let options = ConversionOptions {
    whitespace_mode: WhitespaceMode::Strict,
    ..Default::default()
};
let html = "<p>Hello    world\n\n\nwith   spaces</p>";
// Every space and newline preserved exactly
let markdown = convert(html, Some(options))?;
// → "Hello    world\n\n\nwith   spaces\n"
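The contrast between the two modes can be sketched without the crate itself. The snippet below is a minimal illustration under stated assumptions, not the library's implementation: `Mode` and `apply_mode` are hypothetical names, and Normalized is modeled simply as "split on whitespace, join with one space".

```rust
/// Hypothetical stand-in for the crate's `WhitespaceMode` (illustration only).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum Mode {
    /// Collapse every run of whitespace into one space.
    #[default]
    Normalized,
    /// Leave the text untouched.
    Strict,
}

/// Apply the chosen mode to a piece of text content.
pub fn apply_mode(text: &str, mode: Mode) -> String {
    match mode {
        // split_whitespace drops leading/trailing whitespace and treats any
        // run of spaces, tabs, or newlines as one separator.
        Mode::Normalized => text.split_whitespace().collect::<Vec<_>>().join(" "),
        // Strict returns the input byte-for-byte.
        Mode::Strict => text.to_owned(),
    }
}
```

Calling `apply_mode(text, Mode::Strict)` returns the input unchanged, which is why Strict suits poetry or ASCII art, while Normalized suits ordinary prose.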
Rules:
Use Cases:
Located in /crates/html-to-markdown/src/options.rs:
pub struct PreprocessingOptions {
    pub strip_newlines: bool,          // Remove \n from text content
    pub collapse_whitespace: bool,     // Multiple spaces → single
    pub remove_empty_elements: bool,   // Skip empty <p>, <div>, etc.
    pub trim_text_nodes: bool,         // Strip leading/trailing whitespace
    pub normalize_unicode: bool,       // NFC normalization
    pub remove_comments: bool,         // Strip HTML comments
    pub remove_empty_paragraphs: bool, // Skip <p></p>
}
pub enum PreprocessingPreset {
    Minimal,    // Least processing
    #[default]
    Standard,   // Balanced
    Aggressive, // Maximum cleanup
}
Minimal:
Standard (Default):
Aggressive:
let options = ConversionOptions {
    preprocessing: PreprocessingOptions {
        strip_newlines: false,
        collapse_whitespace: true,
        remove_empty_elements: true,
        trim_text_nodes: true,
        normalize_unicode: false,
        remove_comments: true,
        remove_empty_paragraphs: true,
    },
    whitespace_mode: WhitespaceMode::Normalized,
    ..Default::default()
};
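To make the text-level flags concrete, here is a small sketch of how three of them might act on a single text node. It is an assumption-laden illustration: `TextPrep` and `preprocess_text` are hypothetical names, only the three whitespace flags are modeled, and the order in which the real crate applies them is not confirmed by this document.

```rust
/// Hypothetical subset of the whitespace-related preprocessing flags.
#[derive(Clone, Copy, Debug, Default)]
pub struct TextPrep {
    pub strip_newlines: bool,
    pub collapse_whitespace: bool,
    pub trim_text_nodes: bool,
}

/// Apply the enabled flags to one text node, in a fixed (assumed) order.
pub fn preprocess_text(text: &str, opts: TextPrep) -> String {
    let mut out: String = if opts.strip_newlines {
        // Drop CR and LF characters entirely.
        text.chars().filter(|c| *c != '\n' && *c != '\r').collect()
    } else {
        text.to_owned()
    };
    if opts.collapse_whitespace {
        // Collapse runs of spaces/tabs into one space; newlines are kept.
        let mut collapsed = String::with_capacity(out.len());
        let mut prev_blank = false;
        for c in out.chars() {
            let blank = c == ' ' || c == '\t';
            if !blank {
                collapsed.push(c);
            } else if !prev_blank {
                collapsed.push(' ');
            }
            prev_blank = blank;
        }
        out = collapsed;
    }
    if opts.trim_text_nodes {
        out = out.trim().to_owned();
    }
    out
}
```

Keeping each flag as an independent step makes it easy to see which option is responsible for a given change in the output.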
Located in /crates/html-to-markdown/src/options.rs:
Controls how <br> tags are rendered:
pub enum NewlineStyle {
    #[default]
    Spaces,    // Two spaces at end of line
    Backslash, // Backslash at end of line
}
Spaces (default):
Line 1  
Line 2
Generated:
Line 1  \n
Line 2\n
Markdown parsers recognize: Two spaces before newline as hard line break
HTML Input:
<p>Line 1<br>Line 2</p>
Markdown Output:
Line 1
Line 2
Backslash:
Line 1\
Line 2
Generated:
Line 1\\n
Line 2\n
Markdown parsers recognize: Backslash before newline as hard line break
HTML Input:
<p>Line 1<br>Line 2</p>
Markdown Output:
Line 1\
Line 2
| Style | Pros | Cons | Use Case |
|-------|------|------|----------|
| Spaces | Standard, widely supported, visual | Invisible, can be lost in editing | Default choice |
| Backslash | Visible, explicit, CommonMark spec | Less common support | Standards-strict, visibility preferred |
let options = ConversionOptions {
    newline_style: NewlineStyle::Backslash, // Override default
    ..Default::default()
};
let html = "<p>A<br>B<br>C</p>";
let markdown = convert(html, Some(options))?;
// With Backslash: "A\\\nB\\\nC\n"
// With Spaces: "A  \nB  \nC\n"
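The two hard-break encodings can be reproduced with a few lines of standalone Rust. This is a sketch rather than the crate's code: `BreakStyle`, `hard_break`, and `join_with_breaks` are hypothetical names chosen for illustration.

```rust
/// Hypothetical stand-in for the crate's `NewlineStyle` (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BreakStyle {
    Spaces,
    Backslash,
}

/// The character sequence that encodes a Markdown hard line break.
pub fn hard_break(style: BreakStyle) -> &'static str {
    match style {
        BreakStyle::Spaces => "  \n",     // two trailing spaces, then newline
        BreakStyle::Backslash => "\\\n",  // one backslash, then newline
    }
}

/// Join the text segments that were separated by `<br>` in the source HTML.
pub fn join_with_breaks(segments: &[&str], style: BreakStyle) -> String {
    let mut out = segments.join(hard_break(style));
    out.push('\n'); // paragraph ends with a plain newline
    out
}
```

For the segments `["A", "B", "C"]` this yields the same two strings shown in the comments above, one per style.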
Located in /crates/html-to-markdown/src/options.rs:
Controls indentation for nested lists:
pub enum ListIndentType {
    #[default]
    Spaces, // 2 or 4 spaces per level
    Tabs,   // One tab per level
}
- Item 1
- Nested 1.1
- Deeply nested 1.1.1
- Item 2
- Nested 2.1
Generated:
- Item 1\n
- Nested 1.1\n
- Deeply nested 1.1.1\n
- Item 2\n
- Nested 2.1\n
Characteristics:
- Item 1
- Nested 1.1
- Deeply nested 1.1.1
- Item 2
- Nested 2.1
Generated:
- Item 1\n
\t- Nested 1.1\n
\t\t- Deeply nested 1.1.1\n
- Item 2\n
\t- Nested 2.1\n
Characteristics:
- \t per nesting level
let options = ConversionOptions {
    list_indent_type: ListIndentType::Tabs,
    ..Default::default()
};
let html = "<ul><li>A<ul><li>B</li></ul></li></ul>";
let markdown = convert(html, Some(options))?;
// With Tabs: "- A\n\t- B\n"
// With Spaces: "- A\n - B\n"
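How the indent unit translates into output can be sketched independently of the converter. The names below (`Indent`, `indent_prefix`, `render_items`) are hypothetical, and the space width is passed explicitly because this document only says "2 or 4 spaces per level".

```rust
/// Hypothetical indent setting: a space width per level, or one tab per level.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Indent {
    Spaces(usize),
    Tabs,
}

/// Build the leading whitespace for a list item at the given nesting depth.
pub fn indent_prefix(indent: Indent, depth: usize) -> String {
    match indent {
        Indent::Spaces(width) => " ".repeat(width * depth),
        Indent::Tabs => "\t".repeat(depth),
    }
}

/// Render `(depth, text)` pairs as an unordered Markdown list.
pub fn render_items(items: &[(usize, &str)], indent: Indent) -> String {
    items
        .iter()
        .map(|(depth, text)| format!("{}- {}\n", indent_prefix(indent, *depth), text))
        .collect()
}
```

With `Indent::Tabs`, a two-level list renders as `"- A\n\t- B\n"`, matching the Tabs comment in the example above.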
Located in /crates/html-to-markdown/src/text.rs:
use std::borrow::Cow;

pub fn normalize_whitespace_cow(text: &str) -> Cow<'_, str> {
    if text.is_empty() {
        return Cow::Borrowed("");
    }
    // Normalization is needed if the text has leading/trailing whitespace,
    // a run of two or more spaces, or any whitespace other than a plain space.
    let needs_norm = text.starts_with(char::is_whitespace)
        || text.ends_with(char::is_whitespace)
        || text.contains("  ")
        || text.chars().any(|c| c.is_whitespace() && c != ' ');
    if !needs_norm {
        return Cow::Borrowed(text); // No-op if already normalized
    }
    // Split on whitespace runs and rejoin with single spaces
    let words = text.split_whitespace().collect::<Vec<_>>();
    let result = words.join(" ");
    Cow::Owned(result)
}
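Algorithm:
- Return the input unchanged (borrowed) when it is empty or already normalized
- Split on runs of whitespace (\s+)
- Join the resulting words with a single space
Examples:
"hello    world" → "hello world"
"  leading/trailing  " → "leading/trailing"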
"multiple\n\nlines" → "multiple lines"
"text\twith\ttabs" → "text with tabs"
Text processing flow:
Raw HTML text
|
+-- Decode HTML entities: &amp; → &
|
+-- Normalize whitespace (if mode = Normalized)
|
+-- Trim leading/trailing spaces
|
+-- Escape special Markdown characters
|
+-- Output
Example:
Input HTML: "<p> Hello world </p>"
After decode: " Hello world "
After normalize: "Hello world"
After escape (misc): "Hello world"
Final: "Hello world\n"
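The flow above can be sketched as three small functions chained together. This is a simplified illustration, not the crate's code: `decode_basic_entities`, `normalize`, `escape_markdown`, and `process_text` are hypothetical names, only four entities are decoded, and only a few Markdown characters are escaped.

```rust
/// Decode a handful of common entities (a real decoder covers the full table).
/// `&amp;` is decoded last so `&amp;lt;` becomes the text `&lt;`, not `<`.
pub fn decode_basic_entities(text: &str) -> String {
    text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
}

/// Collapse whitespace runs and trim both ends.
pub fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Backslash-escape a few characters that Markdown treats as syntax.
pub fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if matches!(c, '\\' | '*' | '_' | '`') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Decode, then normalize, then escape, in the order shown in the flow.
pub fn process_text(raw: &str) -> String {
    escape_markdown(&normalize(&decode_basic_entities(raw)))
}
```

Escaping runs last so that characters produced by entity decoding are escaped too, and so that the backslashes it inserts are never touched by normalization.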
Located in /crates/html-to-markdown/src/wrapper.rs:
pub struct ConversionOptions {
    pub wrap: bool,                // Enable wrapping
    pub wrap_width: usize,         // Line width (default: 80)
    pub wrap_preserve_words: bool, // Don't break mid-word
}
Without wrapping (default):
let options = ConversionOptions {
    wrap: false,
    ..Default::default()
};
let markdown = "This is a very long line that would normally wrap at 80 characters if wrapping was enabled but since it's disabled it stays on one line.\n";
With wrapping at 80 characters:
let options = ConversionOptions {
    wrap: true,
    wrap_width: 80,
    ..Default::default()
};
let markdown = "This is a very long line that would normally wrap at 80 characters if wrapping\nwas enabled but since it's disabled it stays on one line.\n";
pub struct ConversionOptions {
pub wrap_preserve_words: bool, // true = don't break words
}
With word preservation (true):
Line with a very_long_word_that_exceeds_wrap_width...
→ "Line with a\nvery_long_word_that_exceeds_wrap_width..."
Without word preservation (false):
Line with a very_long_word_that_exceeds_wrap_width...
→ "Line with a\nvery_long_word_that_exce\neds_wrap_width..."
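Word-preserving wrapping is typically a greedy fill: keep adding words to the current line until the next one would exceed the width. The sketch below shows that technique under stated assumptions; `wrap_words` is a hypothetical helper, not the crate's `wrap_markdown()`, and it measures width in characters rather than display columns.

```rust
/// Greedy word wrap that never splits a word. A single word longer than
/// `width` is placed on its own line and allowed to overflow.
pub fn wrap_words(text: &str, width: usize) -> String {
    let mut lines: Vec<String> = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        let current_len = current.chars().count();
        // Start a new line if appending " word" would exceed the width.
        if !current.is_empty() && current_len + 1 + word_len > width {
            lines.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines.join("\n")
}
```

Breaking inside an over-long word, as in the "without word preservation" case, would need an extra step that slices the word at the width boundary.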
<pre>
The forest is dark
and full of secrets
sleeping in moonlight
</pre>
Conversion with Strict mode:
let options = ConversionOptions {
    whitespace_mode: WhitespaceMode::Strict,
    code_block_style: CodeBlockStyle::Indented,
    ..Default::default()
};
let markdown = convert(html, Some(options))?;
Output preserves exact spacing:
The forest is dark
and full of secrets
sleeping in moonlight
<p>
Welcome to our
site! We have great
content for you.
</p>
Conversion with Normalized mode (default):
let markdown = convert(html, None)?; // Uses default options
Output (spaces collapsed):
Welcome to our site! We have great content for you.
<ul>
<li>Feature 1
<ul>
<li>Sub-feature 1.1</li>
<li>Sub-feature 1.2</li>
</ul>
</li>
<li>Feature 2</li>
</ul>
With Tab indentation:
let options = ConversionOptions {
    list_indent_type: ListIndentType::Tabs,
    ..Default::default()
};
let markdown = convert(html, Some(options))?;
Output:
- Feature 1
	- Sub-feature 1.1
	- Sub-feature 1.2
- Feature 2
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
Without wrapping:
let markdown = convert(html, None)?;
Output (single line):
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
With wrapping at 60 chars:
let options = ConversionOptions {
    wrap: true,
    wrap_width: 60,
    ..Default::default()
};
let markdown = convert(html, Some(options))?;
Output (multiple lines):
Lorem ipsum dolor sit amet, consectetur adipiscing
elit, sed do eiusmod tempor incididunt ut labore et
dolore magna aliqua.
Whitespace normalization has measurable performance characteristics:
| Option | Impact | Notes |
|--------|--------|-------|
| Strict mode | +0% | No extra processing |
| Normalized mode | +2-3% | Single regex pass |
| Wrapping (80 chars) | +5-10% | Line-by-line processing |
| Unicode normalization | +1-2% | Optional feature |
| Comment removal | +1% | Single pass removal |
- Use Strict mode for large documents if whitespace preservation is needed
- Disable wrapping if it is not needed
- Batch preprocessing before conversion
Core Files:
- /crates/html-to-markdown/src/options.rs - All whitespace option definitions
  - WhitespaceMode enum (lines 42-57)
  - NewlineStyle enum (lines 59-76)
  - ListIndentType enum (lines 25-40)
  - PreprocessingOptions struct
  - PreprocessingPreset enum
- /crates/html-to-markdown/src/text.rs - Text processing
  - normalize_whitespace_cow() function
  - decode_html_entities_cow() function
  - escape() function (whitespace-aware)
- /crates/html-to-markdown/src/converter.rs - Element conversion
  - convert_text()
  - convert_br()
  - convert_list()
- /crates/html-to-markdown/src/wrapper.rs - Line wrapping
  - wrap_markdown() function
- /crates/html-to-markdown/src/lib.rs - Preprocessing
  - normalize_line_endings() function (lines 149-155)
  - fast_text_only() function (lines 157-197)
// Default options (sensible for most web content)
ConversionOptions::default()
→ Normalized whitespace, Spaces newlines, Space indentation, No wrapping
// Poetry/ASCII art (preserve exact spacing)
ConversionOptions {
    whitespace_mode: WhitespaceMode::Strict,
    ..Default::default()
}
// Readable web conversion with line wrapping
ConversionOptions {
    wrap: true,
    wrap_width: 80,
    whitespace_mode: WhitespaceMode::Normalized,
    ..Default::default()
}
// Tab-indented nested lists
ConversionOptions {
    list_indent_type: ListIndentType::Tabs,
    ..Default::default()
}
// Strict CommonMark compliance
ConversionOptions {
    newline_style: NewlineStyle::Backslash,
    whitespace_mode: WhitespaceMode::Normalized,
    ..Default::default()
}
// Aggressive cleanup for content indexing
ConversionOptions {
    preprocessing: PreprocessingOptions {
        collapse_whitespace: true,
        remove_comments: true,
        remove_empty_elements: true,
        ..Default::default()
    },
    whitespace_mode: WhitespaceMode::Normalized,
    ..Default::default()
}
Located throughout /crates/html-to-markdown/tests/:
# Run whitespace-specific tests
task rust:test -- --exact "test_whitespace"
task rust:test -- --exact "test_normalize"
task rust:test -- --exact "test_wrap"
Test patterns: