BeautifulSoup HTML Parsing

You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.

Core Expertise

BeautifulSoup API and parsing methods
CSS selectors and find methods
DOM traversal and navigation
HTML/XML parsing with different parsers
Integration with requests library
Handling malformed HTML gracefully
Data extraction patterns and best practices
Memory-efficient processing

Key Principles

Write concise, technical code with accurate Python examples
Prioritize readability, efficiency, and maintainability
Use modular, reusable functions for common extraction tasks
Handle missing data gracefully with proper defaults
Follow PEP 8 style guidelines
Implement proper error handling for robust scraping

Basic Setup

pip install beautifulsoup4 requests lxml html5lib

Loading HTML

from bs4 import BeautifulSoup
import requests

# From string
html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'lxml')

# From file
with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# From URL
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'lxml')

Parser Options

# lxml - Fast, lenient (recommended)
soup = BeautifulSoup(html, 'lxml')

# html.parser - Built-in, no dependencies
soup = BeautifulSoup(html, 'html.parser')

# html5lib - Most lenient, slowest
soup = BeautifulSoup(html, 'html5lib')

# lxml-xml - For XML documents
soup = BeautifulSoup(xml, 'lxml-xml')
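
The parsers differ most in how they repair invalid markup. A minimal sketch that feeds the same broken fragment to each parser and prints the result; `parse_with` is a hypothetical helper, and parsers that are not installed are reported as None rather than raising:

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # stray closing tag, deliberately invalid


def parse_with(parser):
    """Return the repaired markup, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, repaired in results.items():
    print(f"{name:12} {repaired}")
```

Because the repaired trees differ, selectors written against one parser's output may not match another's; it is usually safest to name the parser explicitly everywhere.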

Finding Elements

By Tag

# First matching element
soup.find('h1')

# All matching elements
soup.find_all('p')

# Shorthand
soup.h1  # Same as soup.find('h1')

By Attributes

# By class
soup.find('div', class_='article')
soup.find_all('div', class_='article')

# By ID
soup.find(id='main-content')

# By any attribute
soup.find('a', href='https://example.com')
soup.find_all('input', attrs={'type': 'text', 'name': 'email'})

# By data attributes
soup.find('div', attrs={'data-id': '123'})
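
One detail worth checking when filtering by class: class is a multi-valued attribute, so class_ matches any single class, while a multi-class string is compared against the exact attribute value. A small sketch on made-up markup (html.parser is used only so it runs without extra dependencies):

```python
from bs4 import BeautifulSoup

html = '<div class="article featured">A</div><div class="article">B</div>'
soup = BeautifulSoup(html, "html.parser")

# Matches both divs: each one carries the "article" class
any_article = soup.find_all("div", class_="article")

# Matches nothing: the string is compared to the exact value "article featured"
wrong_order = soup.find_all("div", class_="featured article")

# A CSS selector matches both classes regardless of order
both_classes = soup.select("div.featured.article")
```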

CSS Selectors

# Single element
soup.select_one('div.article > h2')

# Multiple elements
soup.select('div.article h2')

# Complex selectors
soup.select('a[href^="https://"]')  # Starts with
soup.select('a[href$=".pdf"]')      # Ends with
soup.select('a[href*="example"]')   # Contains
soup.select('li:nth-child(2)')
soup.select('h1, h2, h3')           # Multiple
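
To see what these selectors return, here is a short sketch run against a made-up fragment; the markup and variable names are illustrative only:

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
  <h2>Direct heading</h2>
  <section><h2>Nested heading</h2></section>
  <ul>
    <li>one</li>
    <li>two</li>
  </ul>
  <a href="https://example.com/report.pdf">Report</a>
  <a href="/local/page">Local</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" matches direct children only; a space matches any descendant
direct = [h.get_text() for h in soup.select("div.article > h2")]
nested = [h.get_text() for h in soup.select("div.article h2")]

# Attribute suffix match and positional match
pdf_links = [a["href"] for a in soup.select('a[href$=".pdf"]')]
second_item = soup.select_one("li:nth-child(2)").get_text()
```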

With Functions

import re

# By regex
soup.find_all('a', href=re.compile(r'^https://'))

# By function
def has_data_attr(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attr)

# String matching
soup.find_all(string='exact text')
soup.find_all(string=re.compile('pattern'))
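
A runnable sketch of the three filter styles on a made-up fragment. `is_external_link` is a hypothetical predicate; note that string searches return text nodes rather than tags, so `.parent` is needed to get back to the element:

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p><a href="https://a.example/">A</a> <a href="/b">B</a> '
    '<a name="c">C</a> <span data-id="7">x</span></p>'
)
soup = BeautifulSoup(html, "html.parser")


def is_external_link(tag):
    """True for <a> tags whose href starts with https://."""
    return tag.name == "a" and tag.get("href", "").startswith("https://")


external = [a["href"] for a in soup.find_all(is_external_link)]
relative = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/"))]

# string= matches text nodes; .parent recovers the enclosing tag
text_nodes = soup.find_all(string=re.compile(r"^[AB]$"))
parents = [node.parent.name for node in text_nodes]
```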

Extracting Data

Text Content

# Get text
element.text
element.get_text()

# Get text with separator
element.get_text(separator=' ')

# Get stripped text
element.get_text(strip=True)

# Get strings (generator)
for string in element.stripped_strings:
    print(string)
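
The text options interact in ways that are easy to get wrong, in particular strip=True without a separator joins fragments with nothing between them. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Price: <b> 42 </b><i>USD</i></p>", "html.parser")
p = soup.p

raw = p.get_text()                               # fragments joined as-is
stripped = p.get_text(strip=True)                # each fragment stripped, no separator
joined = p.get_text(separator=" ", strip=True)   # stripped, then joined with spaces
parts = list(p.stripped_strings)                 # the stripped fragments themselves

print(repr(raw), repr(stripped), repr(joined), parts)
```

For human-readable output, combining separator=" " with strip=True is usually the safer default.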

Attributes

# Get attribute
element['href']
element.get('href')  # Returns None if missing
element.get('href', 'default')  # With default

# Get all attributes
element.attrs  # Returns dict

# Check attribute exists
element.has_attr('class')
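
Two behaviors are worth seeing concretely: subscript access raises KeyError for a missing attribute, and class comes back as a list. A short sketch on a made-up link:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="btn primary" href="/x">Go</a>', "html.parser")
link = soup.a

href = link["href"]                 # present, so subscript access is safe
classes = link["class"]             # multi-valued attribute, returned as a list
title = link.get("title")           # missing, so .get() returns None
title_or_empty = link.get("title", "")

try:
    link["title"]                   # missing attribute with subscript access
    raised = False
except KeyError:
    raised = True
```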

HTML Content

# Outer HTML (the tag itself plus its contents)
str(element)

# Inner HTML (contents only)
element.decode_contents()

# Just the tag
element.name

# Prettified HTML
element.prettify()
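
A quick sketch of the difference between serializing a tag and serializing only its contents, using a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box"><p>Hi <b>there</b></p></div>', "html.parser")
box = soup.div

outer = str(box)               # includes the <div> itself
inner = box.decode_contents()  # only what is inside the <div>
tag_name = box.name
```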

DOM Navigation

Parent/Ancestors

element.parent
element.parents  # Generator of all ancestors

# Find specific ancestor
for parent in element.parents:
    if parent.name == 'div' and 'article' in parent.get('class', []):
        break
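
The manual loop over parents can usually be replaced by find_parent(), which accepts the same filters as find(). A sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div class="article featured"><section><p><span>x</span></p></section></div>'
soup = BeautifulSoup(html, "html.parser")
span = soup.span

# Nearest ancestor <div> carrying the "article" class, or None if there is none
article = span.find_parent("div", class_="article")

# Ancestor names, nearest first; the last entry is the BeautifulSoup object itself
ancestors = [parent.name for parent in span.parents]
```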

Children

element.children      # Direct children (generator)
list(element.children)

element.contents      # Direct children (list)
element.descendants   # All descendants (generator)

# Find in children
element.find('span')  # Searches descendants

Siblings

element.next_sibling
element.previous_sibling

element.next_siblings      # Generator
element.previous_siblings  # Generator

# Next/previous node in parse order (may be a text node, not a sibling)
element.next_element
element.previous_element
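
In indented markup the immediate sibling of a tag is often a whitespace text node. A minimal sketch showing that, and how find_next_sibling() returns the next tag instead:

```python
from bs4 import BeautifulSoup, NavigableString

html = """<ul>
  <li id="a">A</li>
  <li id="b">B</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", id="a")

raw_next = first.next_sibling             # the whitespace between the two <li> tags
tag_next = first.find_next_sibling("li")  # the next <li> tag
next_node = first.next_element            # first node parsed after the tag: its own text
```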

Data Extraction Patterns

Safe Extraction

def safe_text(element, selector, default=''):
    """Safely extract text from element."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):
    """Safely extract attribute from element."""
    found = element.select_one(selector)
    return found.get(attr, default) if found else default

Table Extraction

def extract_table(table):
    """Extract table data as list of dictionaries."""
    headers = [th.get_text(strip=True) for th in table.select('th')]

    rows = []
    # Match every row: lxml and html.parser do not insert a missing <tbody>
    for tr in table.select('tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(dict(zip(headers, cells)))

    return rows
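
A usage sketch for this pattern on a table that has no tbody element, which is common in real pages. `table_to_dicts` is a hypothetical variant that reads every row; the sample markup is made up:

```python
from bs4 import BeautifulSoup


def table_to_dicts(table):
    """Hypothetical variant: map each data row to a dict keyed by header text."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th> cells, so they are skipped
            rows.append(dict(zip(headers, cells)))
    return rows


html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$9.99</td></tr>
  <tr><td>Gadget</td><td>$24.50</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
rows = table_to_dicts(table)

# With html.parser no <tbody> is added, so this selector finds nothing here
tbody_rows = table.select("tbody tr")
```

Nested tables would need direct-child matching, which this sketch does not attempt.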

List Extraction

def extract_items(soup, selector, extractor):
    """Extract multiple items using a custom extractor function."""
    return [extractor(item) for item in soup.select(selector)]

# Usage
def extract_product(item):
    return {
        'name': safe_text(item, '.name'),
        'price': safe_text(item, '.price'),
        'url': safe_attr(item, 'a', 'href')
    }

products = extract_items(soup, '.product', extract_product)

URL Resolution

from urllib.parse import urljoin

def resolve_url(base_url, relative_url):
    """Convert relative URL to absolute."""
    if not relative_url:
        return None
    return urljoin(base_url, relative_url)

# Usage
base_url = 'https://example.com/products/'
for link in soup.select('a'):
    href = link.get('href')
    absolute_url = resolve_url(base_url, href)
    print(absolute_url)
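
How urljoin resolves a link depends on the form of the link and on whether the base ends with a slash. A standard-library sketch with illustrative URLs:

```python
from urllib.parse import urljoin

base = "https://example.com/products/"

resolved = {
    "item/42": urljoin(base, "item/42"),                # relative to the base directory
    "/about": urljoin(base, "/about"),                  # relative to the site root
    "../cart": urljoin(base, "../cart"),                # one directory up
    "//cdn.example.com/a.js": urljoin(base, "//cdn.example.com/a.js"),  # inherits scheme
    "https://other.example/x": urljoin(base, "https://other.example/x"),  # already absolute
}

# Without the trailing slash, "products" is treated as a file and replaced
no_slash = urljoin("https://example.com/products", "item/42")

for href, absolute in resolved.items():
    print(f"{href:28} -> {absolute}")
```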

Handling Malformed HTML

# lxml parser is lenient with malformed HTML
soup = BeautifulSoup(malformed_html, 'lxml')

# For very broken HTML, use html5lib
soup = BeautifulSoup(very_broken_html, 'html5lib')

# Handle encoding issues
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')
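
When the correct encoding is known in advance, another option is to pass the raw bytes together with from_encoding and let BeautifulSoup decode them. A sketch with a made-up Latin-1 payload standing in for response.content:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: bytes encoded as ISO-8859-1, not UTF-8
raw_bytes = "<p>café</p>".encode("iso-8859-1")

# from_encoding tells BeautifulSoup how to decode the bytes
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="iso-8859-1")
text = soup.p.get_text()
```

from_encoding only applies to bytes input; for an already decoded str it has no effect.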

Complete Scraping Example

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

class ProductScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
        })

    def fetch_page(self, url):
        """Fetch and parse a page."""
        response = self.session.get(url, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):
        """Extract product data from a card element."""
        return {
            'name': self._safe_text(item, '.product-title'),
            'price': self._parse_price(item.select_one('.price')),
            'rating': self._safe_attr(item, '.rating', 'data-rating'),
            'image': self._resolve(self._safe_attr(item, 'img', 'src')),
            'url': self._resolve(self._safe_attr(item, 'a', 'href')),
            'in_stock': not item.select_one('.out-of-stock')
        }

    def scrape_products(self, url):
        """Scrape all products from a page."""
        soup = self.fetch_page(url)
        items = soup.select('.product-card')
        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):
        found = element.select_one(selector)
        return found.get(attr, default) if found else default

    def _parse_price(self, element):
        if not element:
            return None
        text = element.get_text(strip=True)
        try:
            return float(text.replace('$', '').replace(',', ''))
        except ValueError:
            return None

    def _resolve(self, url):
        return urljoin(self.base_url, url) if url else None


# Usage
scraper = ProductScraper('https://example.com')
products = scraper.scrape_products('https://example.com/products')
for product in products:
    print(product)
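
The scraper above uses a plain Session with no retries or delay between requests. One possible way to add both, sketched under assumptions: `build_session` and `Throttle` are hypothetical helpers, and the retry counts, status codes, and interval are placeholder values rather than recommendations.

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent, retries=3, backoff=1.0):
    """Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


session = build_session("Mozilla/5.0 (compatible; MyScraper/1.0)")
throttle = Throttle(min_interval=1.0)
# Before each request: throttle.wait(); then session.get(url, timeout=30)
```

A scraper class could accept the session and throttle as constructor arguments so they can be swapped out in tests.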

Performance Optimization

# Use SoupStrainer to parse only needed elements
from bs4 import SoupStrainer

only_articles = SoupStrainer('article')
soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)

# Use lxml parser for speed
soup = BeautifulSoup(html, 'lxml')  # Fastest

# Decompose unneeded elements
for script in soup.find_all('script'):
    script.decompose()

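# Use generators for memory efficiency
# (yield is only valid inside a function; extract_data is your own extractor)
def iter_items(soup):
    for item in soup.select('.item'):
        yield extract_data(item)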
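
A runnable sketch of the SoupStrainer technique on a made-up page: only the link tags are turned into objects, so everything else is absent from the resulting tree. Note that parse_only is not supported by the html5lib parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<html><body><p>Intro</p><a href='/a'>A</a>"
    "<div><a href='/b'>B</a></div></body></html>"
)

only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
paragraph = soup.find("p")  # filtered out during parsing
```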

Key Dependencies

beautifulsoup4
lxml (fast parser)
html5lib (lenient parser)
requests
pandas (for data output)

Best Practices

Prefer the lxml parser for speed; switch to html5lib when markup is badly broken
Handle missing elements with default values
Use select() and select_one() for CSS selectors
Use get_text(strip=True) for clean text extraction
Resolve relative URLs to absolute
Validate extracted data types
Implement rate limiting between requests
Use proper User-Agent headers
Handle character encoding properly
Use SoupStrainer for large documents
Follow robots.txt and website terms of service
Implement retry logic for failed requests
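
For the robots.txt item, the standard library's urllib.robotparser can check a URL before it is fetched. A minimal offline sketch: the rules and user agent string are made up, and in real use set_url() followed by read() would download the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, parsed from memory so the example needs no network access
rules = """User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

robots = RobotFileParser()
robots.parse(rules)

user_agent = "MyScraper/1.0"
allowed = robots.can_fetch(user_agent, "https://example.com/products")
blocked = robots.can_fetch(user_agent, "https://example.com/private/data")
delay = robots.crawl_delay(user_agent)  # seconds between requests, or None
```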

