.codex/skills/metadata-extraction/SKILL.md
Metadata Extraction for html-to-markdown
The html-to-markdown library provides comprehensive, single-pass metadata extraction during HTML-to-Markdown conversion. This enables content analysis, SEO optimization, document indexing, and structured data processing without extra parsing passes.
Metadata is collected during the single convert() tree walk and returned as part of the ConversionResult:
// v3 API: convert() returns ConversionResult with metadata included
let result = convert(html, Some(options))?;
// Access metadata from the result
let metadata = result.metadata;
Key Benefits:
Located in /crates/html-to-markdown/src/metadata.rs:
pub struct MetadataConfig {
pub extract_document: bool, // <head> meta tags, title, etc.
pub extract_headers: bool, // h1-h6 with hierarchy
pub extract_links: bool, // All hyperlinks with classification
pub extract_images: bool, // All images with dimensions
pub extract_structured_data: bool, // JSON-LD, Microdata, RDFa
pub max_structured_data_size: usize, // Prevent memory exhaustion
}
impl Default for MetadataConfig {
fn default() -> Self {
MetadataConfig {
extract_document: true,
extract_headers: true,
extract_links: true,
extract_images: true,
extract_structured_data: true,
max_structured_data_size: DEFAULT_MAX_STRUCTURED_DATA_SIZE, // typically 10MB
}
}
}
Metadata extraction is switched on or off through the extract_metadata flag in ConversionOptions:
let options = ConversionOptions {
extract_metadata: true, // Enable metadata extraction (default)
..Default::default()
};
let result = convert(html, Some(options))?;
// result.metadata contains document, headers, links, images, structured_data
Document metadata collects head-level information:
pub struct DocumentMetadata {
pub title: Option<String>, // <title> element
pub description: Option<String>, // meta[name="description"]
pub author: Option<String>, // meta[name="author"]
pub language: Option<String>, // html[lang] or meta[http-equiv]
pub charset: Option<String>, // meta[charset] or meta[http-equiv]
pub canonical_url: Option<String>, // <link rel="canonical">
pub viewport: Option<String>, // meta[name="viewport"]
pub text_direction: TextDirection, // html[dir] or meta property
pub open_graph: BTreeMap<String, String>, // og:* properties
pub twitter_card: BTreeMap<String, String>, // twitter:* properties
pub other_meta: BTreeMap<String, String>, // Remaining meta tags
}
pub enum TextDirection {
Ltr, // <html dir="ltr">
Rtl, // <html dir="rtl">
Auto, // Default or <html dir="auto">
}
<html lang="en" dir="ltr">
<head>
<title>My Article</title>
<meta name="description" content="Article about HTML conversion">
<meta name="author" content="John Doe">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="canonical" href="https://example.com/article">
<!-- Open Graph -->
<meta property="og:title" content="Article Title">
<meta property="og:description" content="Description for sharing">
<meta property="og:image" content="https://example.com/image.jpg">
<meta property="og:type" content="article">
<meta property="og:url" content="https://example.com/article">
<!-- Twitter Card -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Article Title">
<meta name="twitter:description" content="Description">
<meta name="twitter:image" content="https://example.com/image.jpg">
</head>
</html>
Extracted Result:
DocumentMetadata {
title: Some("My Article"),
description: Some("Article about HTML conversion"),
author: Some("John Doe"),
language: Some("en"),
charset: Some("utf-8"),
canonical_url: Some("https://example.com/article"),
viewport: Some("width=device-width, initial-scale=1"),
text_direction: TextDirection::Ltr,
open_graph: {
"title" => "Article Title",
"description" => "Description for sharing",
"image" => "https://example.com/image.jpg",
"type" => "article",
"url" => "https://example.com/article",
},
twitter_card: {
"card" => "summary_large_image",
"title" => "Article Title",
"description" => "Description",
"image" => "https://example.com/image.jpg",
},
other_meta: { /* remaining meta tags */ },
}
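The extracted result above stores Open Graph and Twitter Card entries with their og: and twitter: prefixes removed. A minimal sketch of that routing step follows, assuming each meta tag has already been reduced to a key/content pair. MetaBuckets and its insert method are hypothetical names for illustration and are not part of the library; dedicated fields such as description or viewport are left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical holder for the three meta-tag maps of DocumentMetadata.
#[derive(Debug, Default)]
pub struct MetaBuckets {
    pub open_graph: BTreeMap<String, String>,
    pub twitter_card: BTreeMap<String, String>,
    pub other_meta: BTreeMap<String, String>,
}

impl MetaBuckets {
    /// Route one key/content pair into the matching map.
    /// `key` is the value of the tag's `property` or `name` attribute.
    pub fn insert(&mut self, key: &str, content: &str) {
        if let Some(rest) = key.strip_prefix("og:") {
            // "og:title" is stored under "title"
            self.open_graph.insert(rest.to_string(), content.to_string());
        } else if let Some(rest) = key.strip_prefix("twitter:") {
            // "twitter:card" is stored under "card"
            self.twitter_card.insert(rest.to_string(), content.to_string());
        } else {
            // Anything else keeps its full key
            self.other_meta.insert(key.to_string(), content.to_string());
        }
    }
}
```

BTreeMap keeps keys sorted, so iterating any of the three maps yields entries in a stable alphabetical order regardless of where the tags appeared in the head.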
pub struct HeaderMetadata {
pub level: u8, // 1-6 (h1-h6)
pub text: String, // Extracted text content
pub id: Option<String>, // id attribute
pub hierarchy_depth: u8, // Nesting depth (0 = top-level)
pub position: usize, // Document order position
}
Headers are extracted with context about document structure:
<h1>Main Title</h1> <!-- hierarchy_depth: 0 -->
<p>Introduction</p>
<h2 id="section-1">Section 1</h2> <!-- hierarchy_depth: 1 -->
<p>Content</p>
<h3>Subsection 1.1</h3> <!-- hierarchy_depth: 2 -->
<h3>Subsection 1.2</h3> <!-- hierarchy_depth: 2 -->
<h2>Section 2</h2> <!-- hierarchy_depth: 1 -->
Extracted Headers:
vec![
HeaderMetadata { level: 1, text: "Main Title", id: None, hierarchy_depth: 0, position: 0 },
HeaderMetadata { level: 2, text: "Section 1", id: Some("section-1"), hierarchy_depth: 1, position: 2 },
HeaderMetadata { level: 3, text: "Subsection 1.1", id: None, hierarchy_depth: 2, position: 4 },
HeaderMetadata { level: 3, text: "Subsection 1.2", id: None, hierarchy_depth: 2, position: 5 },
HeaderMetadata { level: 2, text: "Section 2", id: None, hierarchy_depth: 1, position: 6 },
]
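One way to arrive at the hierarchy_depth values shown above is to keep a stack of open heading levels while walking the headings in document order. The sketch below assumes that depth means the number of shallower headings still open; hierarchy_depths is a hypothetical helper, and the library may compute the field differently.

```rust
/// Compute a nesting depth for each heading level (1-6), in document order.
/// Assumption: depth = number of shallower headings that are still open.
pub fn hierarchy_depths(levels: &[u8]) -> Vec<u8> {
    let mut open: Vec<u8> = Vec::new(); // stack of ancestor heading levels
    let mut depths = Vec::with_capacity(levels.len());
    for &level in levels {
        // A new heading closes every open heading at the same or a deeper level.
        while open.last().map_or(false, |&top| top >= level) {
            open.pop();
        }
        depths.push(open.len() as u8);
        open.push(level);
    }
    depths
}
```

For the example document, the levels 1, 2, 3, 3, 2 produce the depths 0, 1, 2, 2, 1. Under this assumption a document that skips levels (an h2 followed directly by an h4) still gets contiguous depths, which differs from simply taking level minus one.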
pub enum LinkType {
Anchor, // #section-id (internal fragment)
Internal, // /page, ../other, /path/to/page
External, // https://example.com
Email, // mailto:[email protected]
Phone, // tel:+1234567890
Other, // Unknown schemes (ftp, data, etc.)
}
pub struct LinkMetadata {
pub href: String, // Full href attribute
pub text: String, // Link display text
pub title: Option<String>, // title attribute
pub link_type: LinkType, // Classification
pub rel_attributes: Vec<String>, // rel attribute values
pub custom_attributes: BTreeMap<String, String>, // data-*, aria-*, etc.
pub is_external: bool, // Convenience flag
}
href="#intro" → LinkType::Anchor
href="/page" → LinkType::Internal
href="../sibling" → LinkType::Internal
href="relative/path" → LinkType::Internal
href="https://example.com" → LinkType::External
href="http://example.com" → LinkType::External
href="mailto:[email protected]" → LinkType::Email
href="tel:+1234567890" → LinkType::Phone
href="ftp://server.com" → LinkType::Other
href="javascript:void(0)" → LinkType::Other
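The classification table above can be approximated with a few prefix checks. This is a standalone sketch that reproduces the listed mappings, not the library's actual classifier; LinkKind, has_scheme, and classify_link are hypothetical names, and the case-insensitive matching is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkKind {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// True when `s` starts with a URI scheme such as `ftp:` or `javascript:`.
fn has_scheme(s: &str) -> bool {
    match s.find(':') {
        Some(idx) if idx > 0 => {
            let scheme = &s[..idx];
            scheme.starts_with(|c: char| c.is_ascii_alphabetic())
                && scheme
                    .chars()
                    .all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'))
        }
        _ => false,
    }
}

/// Classify an href by its prefix, checking the most specific forms first.
pub fn classify_link(href: &str) -> LinkKind {
    let h = href.trim().to_ascii_lowercase();
    if h.starts_with('#') {
        LinkKind::Anchor
    } else if h.starts_with("mailto:") {
        LinkKind::Email
    } else if h.starts_with("tel:") {
        LinkKind::Phone
    } else if h.starts_with("http://") || h.starts_with("https://") {
        LinkKind::External
    } else if has_scheme(&h) {
        // Any other scheme (ftp, javascript, data, ...)
        LinkKind::Other
    } else {
        // No scheme: absolute or relative path on the same site
        LinkKind::Internal
    }
}
```

The scheme check runs last so that paths containing a colon after a slash, such as /docs/a:b, still fall through to Internal.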
<body>
<!-- Anchor link -->
<a href="#main-section" title="Jump to main">Main</a>
<!-- Internal links -->
<a href="/about">About Us</a>
<a href="../other-page" rel="internal">Other</a>
<!-- External links -->
<a href="https://google.com" rel="external nofollow">Google</a>
<!-- Email -->
<a href="mailto:[email protected]">Contact</a>
<!-- Phone -->
<a href="tel:+1-555-1234">Call us</a>
<!-- Data attributes -->
<a href="/product" data-id="123" data-category="electronics">Product</a>
</body>
Extracted Links:
vec![
LinkMetadata {
href: "#main-section",
text: "Main",
title: Some("Jump to main"),
link_type: LinkType::Anchor,
rel_attributes: vec![],
custom_attributes: {},
is_external: false,
},
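LinkMetadata {
href: "/about",
text: "About Us",
title: None,
link_type: LinkType::Internal,
rel_attributes: vec![],
custom_attributes: {},
is_external: false,
},
LinkMetadata {
href: "../other-page",
text: "Other",
title: None,
link_type: LinkType::Internal,
rel_attributes: vec!["internal"],
custom_attributes: {},
is_external: false,
},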
LinkMetadata {
href: "https://google.com",
text: "Google",
title: None,
link_type: LinkType::External,
rel_attributes: vec!["external", "nofollow"],
custom_attributes: {},
is_external: true,
},
LinkMetadata {
href: "mailto:[email protected]",
text: "Contact",
title: None,
link_type: LinkType::Email,
rel_attributes: vec![],
custom_attributes: {},
is_external: false,
},
LinkMetadata {
href: "tel:+1-555-1234",
text: "Call us",
title: None,
link_type: LinkType::Phone,
rel_attributes: vec![],
custom_attributes: {},
is_external: false,
},
LinkMetadata {
href: "/product",
text: "Product",
title: None,
link_type: LinkType::Internal,
rel_attributes: vec![],
custom_attributes: {
"data-id" => "123",
"data-category" => "electronics",
},
is_external: false,
},
]
pub enum ImageType {
DataUri, // data:image/png;base64,...
External, // https://example.com/image.jpg
Relative, // /images/photo.jpg or ../img/pic.png
InlineSvg, // <svg>...</svg> embedded (if inline-images feature)
}
pub struct ImageMetadata {
pub src: String, // Image source URL or data URI
pub alt: Option<String>, // alt attribute text
pub title: Option<String>, // title attribute
pub image_type: ImageType, // Classification
pub width: Option<u32>, // width attribute (pixels)
pub height: Option<u32>, // height attribute (pixels)
pub custom_attributes: BTreeMap<String, String>, // data-*, aria-*, etc.
}
<body>
<!-- External image -->
<img src="https://example.com/photo.jpg" alt="A photo" title="Photo title" width="800" height="600">
<!-- Relative path -->
<img src="/images/icon.png" alt="Icon">
<!-- Data URI (base64-encoded PNG) -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANS..." alt="Embedded">
<!-- With metadata -->
<img src="product.jpg" alt="Product" data-id="456" data-category="gadgets">
</body>
Extracted Images:
vec![
ImageMetadata {
src: "https://example.com/photo.jpg",
alt: Some("A photo"),
title: Some("Photo title"),
image_type: ImageType::External,
width: Some(800),
height: Some(600),
custom_attributes: {},
},
ImageMetadata {
src: "/images/icon.png",
alt: Some("Icon"),
title: None,
image_type: ImageType::Relative,
width: None,
height: None,
custom_attributes: {},
},
ImageMetadata {
src: "data:image/png;base64,iVBORw0KGgoAAAANS...",
alt: Some("Embedded"),
title: None,
image_type: ImageType::DataUri,
width: None,
height: None,
custom_attributes: {},
},
ImageMetadata {
src: "product.jpg",
alt: Some("Product"),
title: None,
image_type: ImageType::Relative,
width: None,
height: None,
custom_attributes: {
"data-id" => "456",
"data-category" => "gadgets",
},
},
]
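The image results above follow from two small steps: classifying the src value and parsing the optional width and height attributes. A sketch of both is given here under the assumption that non-numeric dimensions (for example "50%") are dropped. ImageKind, classify_image_src, and parse_dimension are hypothetical names, and inline SVG is not covered because it is not an img element.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageKind {
    DataUri,
    External,
    Relative,
}

/// Classify an image source by its prefix.
pub fn classify_image_src(src: &str) -> ImageKind {
    let s = src.trim().to_ascii_lowercase();
    if s.starts_with("data:") {
        ImageKind::DataUri
    } else if s.starts_with("http://") || s.starts_with("https://") {
        ImageKind::External
    } else {
        // Absolute paths (/images/icon.png) and bare names (product.jpg)
        ImageKind::Relative
    }
}

/// Parse a `width` or `height` attribute into pixels.
/// Assumption: values that are not plain unsigned integers yield None.
pub fn parse_dimension(attr: Option<&str>) -> Option<u32> {
    attr.and_then(|v| v.trim().parse::<u32>().ok())
}
```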
pub enum StructuredDataType {
JsonLd, // <script type="application/ld+json">
Microdata, // itemscope, itemtype, itemprop attributes
RDFa, // vocab, property, typeof attributes
}
pub struct StructuredData {
pub data_type: StructuredDataType,
pub raw_data: String, // Original JSON, HTML, or RDF markup
}
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "My Article",
"author": {
"@type": "Person",
"name": "John Doe"
},
"datePublished": "2025-12-29",
"articleBody": "Content here..."
}
</script>
Extracted:
StructuredData {
data_type: StructuredDataType::JsonLd,
raw_data: r#"{"@context":"https://schema.org","@type":"Article",...}"#,
}
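The max_structured_data_size setting in MetadataConfig bounds how much structured data is kept. The sketch below shows one plausible form of that guard, assuming the limit applies per block, is measured in bytes, and causes oversized blocks to be skipped rather than truncated. accept_structured_data is a hypothetical helper; the library's real behavior may differ on each of these points.

```rust
/// Return the trimmed block if it fits the byte budget, otherwise None.
/// Assumptions: per-block limit, measured in bytes, oversized blocks skipped.
pub fn accept_structured_data(raw: &str, max_size: usize) -> Option<String> {
    let trimmed = raw.trim();
    if trimmed.is_empty() || trimmed.len() > max_size {
        None
    } else {
        Some(trimmed.to_string())
    }
}
```

Checking the length before copying means an oversized script body is never duplicated into the metadata, which is the point of having the limit.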
<div itemscope itemtype="https://schema.org/Article">
<h1 itemprop="headline">Article Title</h1>
<p itemprop="articleBody">Content...</p>
<span itemprop="author" itemscope itemtype="https://schema.org/Person">
<span itemprop="name">John Doe</span>
</span>
</div>
Extracted: Raw HTML fragment preserved with attributes
<div vocab="https://schema.org/" typeof="Article">
<h1 property="headline">Article Title</h1>
<p property="articleBody">Content...</p>
<span property="author" typeof="Person">
<span property="name">John Doe</span>
</span>
</div>
The v3 API uses a single convert() function that returns a ConversionResult containing all extracted data:
use html_to_markdown_rs::{convert, ConversionOptions};
let html = "<html><head><title>My Page</title></head><body><h1>Hello</h1></body></html>";
let result = convert(html, None)?;
// Access the converted markdown
println!("{}", result.content);
// Access metadata from the result
if let Some(metadata) = &result.metadata {
println!("Title: {:?}", metadata.document.title);
println!("Headers: {:?}", metadata.headers);
println!("Links: {:?}", metadata.links);
println!("Images: {:?}", metadata.images);
}
// Tables and warnings are also available
println!("Tables: {:?}", result.tables);
println!("Warnings: {:?}", result.warnings);
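As an illustration of consuming the extracted headers, the following sketch renders a Markdown table of contents from them. Header is a local stand-in that mirrors three fields of HeaderMetadata so the example compiles without the crate; table_of_contents is a hypothetical helper, not a library function.

```rust
/// Local stand-in mirroring the HeaderMetadata fields used here.
pub struct Header {
    pub text: String,
    pub id: Option<String>,
    pub hierarchy_depth: u8,
}

/// Render one list item per header, indented two spaces per depth level.
/// Headers with an id become links to that fragment.
pub fn table_of_contents(headers: &[Header]) -> String {
    let mut out = String::new();
    for h in headers {
        let indent = "  ".repeat(h.hierarchy_depth as usize);
        match &h.id {
            Some(id) => out.push_str(&format!("{}- [{}](#{})\n", indent, h.text, id)),
            None => out.push_str(&format!("{}- {}\n", indent, h.text)),
        }
    }
    out
}
```

With the real crate, the same loop could presumably run over metadata.headers directly, since the fields used here have the same names.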
Benchmarking:
Memory Safety:
- max_structured_data_size prevents DoS from huge JSON-LD blocks
Core Files:
- /crates/html-to-markdown/src/metadata.rs - All metadata types and collector
- /crates/html-to-markdown/src/lib.rs - convert() public API returning ConversionResult
- /crates/html-to-markdown/src/converter.rs - Integration with conversion pipeline
Testing:
- /crates/html-to-markdown/src/lib.rs - Tests starting at line 604
The v3 API uses a single convert() function for all use cases:
// Basic conversion -- returns ConversionResult with .content, .metadata, .tables, .images, .warnings
let result = convert(html, None)?;
// With options
let result = convert(html, Some(options))?;
// With visitor (for custom element handling)
let result = convert(html, Some(options), Some(visitor))?;
- metadata feature in Cargo.toml (enabled by default)
- use html_to_markdown_rs::convert;
- let result = convert(html, None)?;
- result.metadata.document.title, result.metadata.headers, result.metadata.links, result.metadata.images
- extract_metadata in ConversionOptions to disable metadata extraction