.codex/skills/visitor-pattern-usage/SKILL.md
Visitor Pattern Usage for html-to-markdown
npx skillsauth add kreuzberg-dev/html-to-markdown visitor-pattern-usage
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The visitor pattern in html-to-markdown provides extensible hooks into the HTML-to-Markdown conversion pipeline. Custom visitors can inspect, modify, or replace the default conversion behavior for any of the 60+ HTML element types.
Key Principles:
The visitor pattern is conditionally compiled:
#[cfg(feature = "visitor")]
pub mod visitor;
In Cargo.toml:
[features]
default = ["metadata"]
visitor = []
The NodeType enum, located in /crates/html-to-markdown/src/visitor.rs, categorizes all HTML elements:
pub enum NodeType {
// Text content
Text,
// Block elements
Heading,
Paragraph,
Div,
Blockquote,
Pre,
Hr,
// Lists
List, // ul, ol
ListItem, // li
DefinitionList, // dl
DefinitionTerm, // dt
DefinitionDescription, // dd
// Tables
Table,
TableRow,
TableCell,
TableHeader,
TableBody,
TableHead,
TableFoot,
// Inline formatting
Link,
Image,
Strong,
Em,
Code,
Strikethrough,
Mark,
Sub,
Sup,
LineBreak,
Ruby,
// Semantic HTML5
Article,
Section,
Nav,
Aside,
Header,
Footer,
Main,
// Media
Audio,
Video,
Picture,
Iframe,
Svg,
// Forms
Input,
Select,
Button,
Textarea,
Fieldset,
// Other
Form,
Label,
Span,
Generic(String), // Unknown tags
}
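To make the categorization concrete, here is a minimal sketch of how tag names might map onto these variants. The enum below is a trimmed local stand-in, and `node_type_for_tag` is a hypothetical helper written for illustration; the library's actual mapping may differ.

```rust
// Trimmed stand-in for the NodeType enum above; only a few variants are reproduced.
#[derive(Debug, Clone, PartialEq)]
enum NodeType {
    Heading,
    Paragraph,
    List,
    ListItem,
    Link,
    Image,
    Generic(String), // Unknown tags
}

// Hypothetical helper (not part of the library): classify a tag name.
fn node_type_for_tag(tag: &str) -> NodeType {
    // HTML tag names are case-insensitive, so normalize before matching.
    match tag.to_ascii_lowercase().as_str() {
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" => NodeType::Heading,
        "p" => NodeType::Paragraph,
        "ul" | "ol" => NodeType::List,
        "li" => NodeType::ListItem,
        "a" => NodeType::Link,
        "img" => NodeType::Image,
        // Anything unrecognized keeps its (lowercased) tag name.
        other => NodeType::Generic(other.to_string()),
    }
}
```

Unknown tags fall through to `Generic`, so a visitor can still route on the tag name it carries.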
The VisitResult enum specifies what action the conversion should take:
pub enum VisitResult {
/// Use default conversion for this element
Default,
/// Skip this element entirely (no output)
Skip,
/// Custom markdown for this element
Custom(String),
// Note: a Wrap variant (process children normally, wrap with custom before/after)
// could also be supported; it is not defined here.
/// Replace element content with custom markdown
Replace(String),
}
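As a rough illustration of how a converter could act on these results, the sketch below resolves the output for a single element. The enum is a local stand-in covering three of the variants, and `resolve_output` is a hypothetical helper, not a function from the library.

```rust
// Local stand-in for three of the VisitResult variants listed above.
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Default,
    Skip,
    Custom(String),
}

// Hypothetical converter-side helper: choose the Markdown emitted for one element.
// `default_markdown` is what the built-in conversion would have produced.
fn resolve_output(result: VisitResult, default_markdown: &str) -> String {
    match result {
        // Keep the built-in conversion.
        VisitResult::Default => default_markdown.to_string(),
        // Emit nothing for this element.
        VisitResult::Skip => String::new(),
        // Use the visitor's Markdown verbatim.
        VisitResult::Custom(markdown) => markdown,
    }
}
```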
The NodeContext struct provides context about the current node being visited:
pub struct NodeContext {
pub node_type: NodeType,
pub tag_name: Option<String>, // Actual HTML tag if element
pub attributes: BTreeMap<String, String>, // All HTML attributes
pub parent_node_type: Option<NodeType>, // Parent element type
pub depth: usize, // Nesting depth
pub position_in_parent: usize, // Sibling index
}
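The fields above lend themselves to small predicate helpers. The following is a sketch using local copies of the types (with a two-variant NodeType); `has_attribute`, `is_first_child_of`, and `sample_context` are hypothetical names introduced only for this example.

```rust
use std::collections::BTreeMap;

// Two-variant stand-in for NodeType, enough for this example.
#[derive(Debug, Clone, PartialEq)]
enum NodeType {
    Paragraph,
    Blockquote,
}

// Local copy of the NodeContext fields shown above.
#[derive(Debug, Clone)]
struct NodeContext {
    node_type: NodeType,
    tag_name: Option<String>,
    attributes: BTreeMap<String, String>,
    parent_node_type: Option<NodeType>,
    depth: usize,
    position_in_parent: usize,
}

// Hypothetical helper: true when the attribute exists with exactly this value.
fn has_attribute(ctx: &NodeContext, name: &str, expected: &str) -> bool {
    ctx.attributes.get(name).map(String::as_str) == Some(expected)
}

// Hypothetical helper: true when the node is the first child of the given parent type.
fn is_first_child_of(ctx: &NodeContext, parent: &NodeType) -> bool {
    ctx.position_in_parent == 0 && ctx.parent_node_type.as_ref() == Some(parent)
}

// Builds a context for `<blockquote><p data-featured="true">...</p></blockquote>`.
fn sample_context() -> NodeContext {
    let mut attributes = BTreeMap::new();
    attributes.insert("data-featured".to_string(), "true".to_string());
    NodeContext {
        node_type: NodeType::Paragraph,
        tag_name: Some("p".to_string()),
        attributes,
        parent_node_type: Some(NodeType::Blockquote),
        depth: 1,
        position_in_parent: 0,
    }
}
```

Comparing through `Option<&str>` avoids allocating a `String` just to test an attribute value.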
HtmlVisitor is the main visitor trait, with a method for each element type:
pub trait HtmlVisitor {
// Generic element fallback
fn visit_element(
&mut self,
ctx: &NodeContext,
tag: &str,
attributes: &BTreeMap<String, String>,
) -> VisitResult;
// Text content
fn visit_text(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
// Headings
fn visit_heading(
&mut self,
ctx: &NodeContext,
level: u8, // 1-6
text: &str,
) -> VisitResult;
fn visit_paragraph(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
// Links and images
fn visit_link(
&mut self,
ctx: &NodeContext,
href: &str,
text: &str,
title: Option<&str>,
) -> VisitResult;
fn visit_image(
&mut self,
ctx: &NodeContext,
src: &str,
alt: &str,
title: Option<&str>,
) -> VisitResult;
// Formatting
fn visit_strong(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
fn visit_em(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
fn visit_code(&mut self, ctx: &NodeContext, code: &str) -> VisitResult;
fn visit_code_block(
&mut self,
ctx: &NodeContext,
code: &str,
language: Option<&str>,
) -> VisitResult;
fn visit_strikethrough(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
// Lists
fn visit_list(
&mut self,
ctx: &NodeContext,
ordered: bool,
items: &[String],
) -> VisitResult;
fn visit_list_item(
&mut self,
ctx: &NodeContext,
content: &str,
index: usize,
) -> VisitResult;
// Tables
fn visit_table(
&mut self,
ctx: &NodeContext,
rows: &[Vec<String>],
) -> VisitResult;
fn visit_table_cell(
&mut self,
ctx: &NodeContext,
content: &str,
is_header: bool,
) -> VisitResult;
// ... and 40+ more visitor methods
}
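The listing above shows signatures only. Assuming the trait methods carry default bodies that return `VisitResult::Default` (an assumption; the bodies are not shown here), an implementor would override only the hooks it cares about. The sketch below uses a deliberately simplified two-method stand-in trait without the `NodeContext` argument, and `UppercaseLinks` is a hypothetical visitor.

```rust
// Local stand-in for two of the VisitResult variants.
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Default,
    Custom(String),
}

// Simplified stand-in trait: every method has a default body,
// so implementors override only what they need.
trait HtmlVisitor {
    fn visit_text(&mut self, _text: &str) -> VisitResult {
        VisitResult::Default
    }

    fn visit_link(&mut self, _href: &str, _text: &str) -> VisitResult {
        VisitResult::Default
    }
}

// Hypothetical visitor that only customizes links.
struct UppercaseLinks;

impl HtmlVisitor for UppercaseLinks {
    fn visit_link(&mut self, href: &str, text: &str) -> VisitResult {
        // Uppercase the link text, keep the URL untouched.
        VisitResult::Custom(format!("[{}]({})", text.to_uppercase(), href))
    }
    // visit_text is not overridden and falls back to the default body.
}
```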
Convert all links to plain text with URLs in parentheses:
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};
#[derive(Debug)]
struct PlainTextLinkVisitor;
impl HtmlVisitor for PlainTextLinkVisitor {
fn visit_link(
&mut self,
_ctx: &NodeContext,
href: &str,
text: &str,
_title: Option<&str>,
) -> VisitResult {
// Convert all links to plain text with URL
VisitResult::Custom(format!("{} ({})", text, href))
}
// ... implement other visitor methods as Default
}
// Usage
let html = r#"<p>Visit <a href="https://example.com">our site</a></p>"#;
let mut visitor = PlainTextLinkVisitor;
let markdown = convert(html, None, Some(&mut visitor))?;
// Output: Visit our site (https://example.com)
Highlight code blocks with language-specific syntax:
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};
#[derive(Debug)]
struct HighlightingVisitor;
impl HtmlVisitor for HighlightingVisitor {
fn visit_code_block(
&mut self,
_ctx: &NodeContext,
code: &str,
language: Option<&str>,
) -> VisitResult {
match language {
Some("python") => {
// Custom Python highlighting
VisitResult::Custom(format!(
"```python\n<!-- HIGHLIGHTED -->\n{}\n```",
code
))
}
Some("rust") => {
// Custom Rust highlighting
VisitResult::Custom(format!(
"```rust\n<!-- WITH SYNTAX HIGHLIGHTING -->\n{}\n```",
code
))
}
_ => VisitResult::Default, // Use default for other languages
}
}
fn visit_link(
&mut self,
_ctx: &NodeContext,
href: &str,
text: &str,
_title: Option<&str>,
) -> VisitResult {
// Links in documentation: add reference-style syntax
VisitResult::Custom(format!("[{}][{}]", text, href))
}
fn visit_heading(
&mut self,
_ctx: &NodeContext,
level: u8,
text: &str,
) -> VisitResult {
// Add anchor links to all headings
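// str::replace takes a &str replacement, so use "-" rather than '-'
let id = text.to_lowercase().replace(' ', "-");
// Emit the heading text itself, followed by the anchor id
VisitResult::Custom(format!(
"{} {} {{#{}}}\n",
"#".repeat(level as usize),
text,
id
))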
}
}
// Usage
let html = r#"
<h1>Documentation</h1>
<p>See <a href="https://docs.rs">our docs</a></p>
<pre><code class="language-rust">fn main() {}</code></pre>
"#;
let mut visitor = HighlightingVisitor;
let markdown = convert(html, None, Some(&mut visitor))?;
Visit only specific element types:
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, NodeType};
#[derive(Debug)]
struct ImageOnlyVisitor {
image_count: usize,
}
impl HtmlVisitor for ImageOnlyVisitor {
fn visit_image(
&mut self,
_ctx: &NodeContext,
src: &str,
alt: &str,
_title: Option<&str>,
) -> VisitResult {
self.image_count += 1;
println!("Image {}: {} ({})", self.image_count, alt, src);
// Could extract images to separate directory
VisitResult::Custom(format!("![{}]({})", alt, src))
}
fn visit_text(&mut self, _ctx: &NodeContext, _text: &str) -> VisitResult {
VisitResult::Skip // Skip all text, only output images
}
}
// Usage
let mut visitor = ImageOnlyVisitor { image_count: 0 };
let markdown = convert(html, None, Some(&mut visitor))?;
println!("Found {} images", visitor.image_count);
Use parent context and depth to transform based on structure:
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, NodeType};
#[derive(Debug)]
struct DepthTrackingVisitor {
current_depth: usize,
}
impl HtmlVisitor for DepthTrackingVisitor {
fn visit_paragraph(
&mut self,
ctx: &NodeContext,
text: &str,
) -> VisitResult {
// Different formatting based on depth
match ctx.depth {
0 => VisitResult::Custom(format!("**{}**\n", text)), // Bold at top level
1 => VisitResult::Custom(format!("*{}*\n", text)), // Italic nested once
_ => VisitResult::Default, // Normal elsewhere
}
}
fn visit_link(
&mut self,
ctx: &NodeContext,
href: &str,
text: &str,
_title: Option<&str>,
) -> VisitResult {
// Links in blockquotes get footnote style
if let Some(NodeType::Blockquote) = ctx.parent_node_type {
VisitResult::Custom(format!("{}[^{}]", text, href))
} else {
VisitResult::Default
}
}
}
// Usage
let mut visitor = DepthTrackingVisitor { current_depth: 0 };
let markdown = convert(html, None, Some(&mut visitor))?;
Route handling based on HTML attributes:
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};
#[derive(Debug)]
struct AttributeRoutingVisitor;
impl HtmlVisitor for AttributeRoutingVisitor {
fn visit_link(
&mut self,
ctx: &NodeContext,
href: &str,
text: &str,
_title: Option<&str>,
) -> VisitResult {
// Custom handling for data attributes
if let Some(tracking_id) = ctx.attributes.get("data-tracking-id") {
return VisitResult::Custom(format!(
"[{}]({} \"{}\")",
text,
href,
tracking_id
));
}
// Skip links marked with data-skip="true"
if ctx.attributes.get("data-skip").map_or(false, |v| v == "true") {
return VisitResult::Skip;
}
VisitResult::Default
}
fn visit_paragraph(
&mut self,
ctx: &NodeContext,
text: &str,
) -> VisitResult {
// Render paragraphs marked data-featured="true" as blockquotes
if ctx.attributes.get("data-featured").map(String::as_str) == Some("true") {
VisitResult::Custom(format!("> {}\n", text))
} else {
VisitResult::Default
}
}
}
// Usage
let html = r#"
<a href="/page" data-tracking-id="click-001">Track me</a>
<a href="/skip" data-skip="true">Skip me</a>
<p data-featured="true">Important paragraph</p>
"#;
let mut visitor = AttributeRoutingVisitor;
let markdown = convert(html, None, Some(&mut visitor))?;
Maintain state across multiple visits:
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};
use std::collections::HashSet;
#[derive(Debug)]
struct LinkCollectorVisitor {
external_links: HashSet<String>,
email_links: HashSet<String>,
internal_links: HashSet<String>,
}
impl HtmlVisitor for LinkCollectorVisitor {
fn visit_link(
&mut self,
_ctx: &NodeContext,
href: &str,
_text: &str,
_title: Option<&str>,
) -> VisitResult {
if href.starts_with("mailto:") {
self.email_links.insert(href.to_string());
} else if href.starts_with("http") {
self.external_links.insert(href.to_string());
} else {
self.internal_links.insert(href.to_string());
}
VisitResult::Default // Keep default link formatting
}
}
// Usage
let mut visitor = LinkCollectorVisitor {
external_links: HashSet::new(),
email_links: HashSet::new(),
internal_links: HashSet::new(),
};
let markdown = convert(html, None, Some(&mut visitor))?;
println!("External: {:?}", visitor.external_links);
println!("Email: {:?}", visitor.email_links);
println!("Internal: {:?}", visitor.internal_links);
Fast path for most elements:
fn visit_text(&mut self, _ctx: &NodeContext, _text: &str) -> VisitResult {
VisitResult::Default // Quick return for most text nodes
}
Only override when needed:
// Only override link handling
// All other methods inherit Default implementation
Avoid allocations in hot path:
// Slower: format! runs the formatting machinery for every node
VisitResult::Custom(format!(">{}<", text))
// Better: build the string once with the exact capacity, or use Cow where nothing changes
let mut result = String::with_capacity(text.len() + 2);
result.push('>');
result.push_str(text);
result.push('<');
VisitResult::Custom(result)
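The `Cow` option mentioned above can be sketched as follows: borrow the input when no change is needed and allocate only when the text is actually rewritten. `escape_asterisks` is a hypothetical helper for illustration and uses only the standard library.

```rust
use std::borrow::Cow;

// Hypothetical helper: escape asterisks for Markdown output.
// Returns a borrowed slice when the text has no asterisks, so the
// common case performs no allocation.
fn escape_asterisks(text: &str) -> Cow<'_, str> {
    if text.contains('*') {
        // Only this branch allocates a new String.
        Cow::Owned(text.replace('*', "\\*"))
    } else {
        Cow::Borrowed(text)
    }
}
```

A visitor could call such a helper first and fall back to `VisitResult::Default` whenever the result is still borrowed, since nothing needed to change.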
Visitors work alongside ConversionOptions:
use html_to_markdown_rs::{ConversionOptions, HeadingStyle};
let options = ConversionOptions {
heading_style: HeadingStyle::AtxClosed, // User preference
wrap: true,
wrap_width: 80,
..Default::default()
};
// Visitor can override specific behaviors
let mut visitor = CustomVisitor;
let markdown = convert(html, Some(options), Some(&mut visitor))?;
Priority: the visitor always takes precedence. If the visitor returns Custom or Skip, the conversion options are bypassed for that element.
Visitor methods return a VisitResult rather than a Result, so they cannot propagate errors directly. Validate the input and fall back to Default, Skip, or a safe Custom value instead:
impl HtmlVisitor for SafeVisitor {
fn visit_link(
&mut self,
_ctx: &NodeContext,
href: &str,
text: &str,
_title: Option<&str>,
) -> VisitResult {
// Can't return error, so validate and fallback
if href.is_empty() {
return VisitResult::Custom(text.to_string()); // Fallback to text
}
VisitResult::Default
}
}
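One way to keep failures visible despite the fallback is to record them in the visitor's own state and inspect them after conversion. This is a sketch of that idea, not something the library prescribes: `ValidatingVisitor` and its `problems` field are hypothetical, and `visit_link` is a simplified stand-in without the `NodeContext` and `title` parameters.

```rust
// Local stand-in for two of the VisitResult variants.
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Default,
    Custom(String),
}

// Hypothetical visitor that remembers every problem it had to work around.
#[derive(Debug, Default)]
struct ValidatingVisitor {
    problems: Vec<String>,
}

impl ValidatingVisitor {
    // Simplified stand-in for HtmlVisitor::visit_link.
    fn visit_link(&mut self, href: &str, text: &str) -> VisitResult {
        if href.is_empty() {
            // Cannot return an error, so record it and fall back to plain text.
            self.problems
                .push(format!("link '{}' has an empty href", text));
            return VisitResult::Custom(text.to_string());
        }
        VisitResult::Default
    }
}
```

After the conversion finishes, the caller can check `problems` and decide whether to treat a non-empty list as a failure.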
Visitor tests are located in the Rust crate and in the binding test suites (Python, TypeScript, Ruby, PHP):
# Test visitor feature
task rust:test # Includes visitor tests
# Binding-specific visitor tests
task python:test # tests/test_visitor.py
task typescript:test # packages/typescript/tests/visitor.spec.ts
task ruby:test # packages/ruby/spec/visitor_spec.rb
Core Files:
- /crates/html-to-markdown/src/visitor.rs - Trait definitions and NodeType enum
- /crates/html-to-markdown/src/visitor_helpers.rs - VisitorHandle and async support
- /crates/html-to-markdown/src/converter.rs - Integration with conversion pipeline
Binding Examples:
- /crates/html-to-markdown-py/src/lib.rs - PyO3 visitor wrapping
- /crates/html-to-markdown-node/src/lib.rs - NAPI-RS visitor support
- /packages/ruby/lib/visitor.rb - Ruby visitor interface
- /packages/php/src/Visitor.php - PHP visitor base class
// Primary API -- visitor is an optional third argument
pub fn convert(
html: &str,
options: Option<ConversionOptions>,
visitor: Option<visitor::VisitorHandle>,
) -> Result<ConversionResult>
| Use Case | Implementation |
|----------|----------------|
| Skip certain elements | Return VisitResult::Skip |
| Modify element output | Return VisitResult::Custom(new_markdown) |
| Track state | Use &mut self fields to accumulate data |
| Conditional routing | Use ctx fields (parent, depth, attributes) |
| Preserve default | Return VisitResult::Default |
| Context-aware | Match on ctx.parent_node_type, ctx.depth |
| Attribute-based | Read from ctx.attributes map |
| Stateless transformation | Implement stateless visitor struct |