PDF Manipulation Skill

Comprehensive guide for working with PDF files in Python, covering extraction, manipulation, creation, and advanced operations using progressive disclosure for efficiency.

Core Capabilities

Extract and manipulate PDF content:

Extract text with layout preservation
Extract tables and parse structured data
Fill PDF forms programmatically
Merge multiple PDFs into a single document
Split PDFs by pages or ranges
Create PDFs from scratch with text, images, and graphics
Add watermarks and annotations
Extract and modify metadata (author, title, keywords)
Add password protection and encryption
Perform OCR on scanned documents
Convert images to PDF
Compress and optimize PDF files
Extract images from PDFs
Rotate and reorder pages

Quick Start

Install required libraries:

pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow

For detailed installation instructions including system dependencies, see:

Library Installation Guide

Python Libraries Overview

pypdf: Basic operations (merge, split, rotate, metadata) pdfplumber: Advanced text/table extraction with layout awareness reportlab: Create PDFs from scratch (reports, invoices, documents) PyMuPDF (fitz): Advanced manipulation, annotations, compression pdf2image: Convert PDF pages to images (requires poppler) pytesseract: OCR for scanned documents (requires tesseract)

Text Extraction Workflow

Basic Extraction

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

Layout-Aware Extraction

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        words = page.extract_words()  # With positioning
        print(text)

Extract from Specific Region

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    bbox = (0, 0, 612, 100)  # x0, y0, x1, y1
    header = page.crop(bbox).extract_text()

For detailed text extraction methods including OCR fallback and encoding handling, see:

Text Extraction Reference

Table Extraction Workflow

Extract All Tables

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            print(table)

Advanced Table Detection

table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3
}

tables = page.extract_tables(table_settings=table_settings)

For detailed table extraction strategies and data cleaning, see:

Table Extraction Reference

PDF Form Operations

Fill Form Fields

import fitz

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        if widget.field_name == "name":
            widget.field_value = "John Doe"
            widget.update()
doc.save("filled.pdf")
doc.close()

Extract Form Field Names

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        print(f"{widget.field_name}: {widget.field_type_string}")
doc.close()

For form filling, flattening, and debugging, see:

PDF Operations Reference

Merging and Splitting

Merge PDFs

from pypdf import PdfMerger

merger = PdfMerger()
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    merger.append(pdf)
merger.write("merged.pdf")
merger.close()

Merge with Page Ranges

merger = PdfMerger()
merger.append("doc1.pdf", pages=(0, 3))  # First 3 pages
merger.append("doc2.pdf")  # All pages
merger.write("compiled.pdf")
merger.close()

Split into Individual Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", 'wb') as f:
        writer.write(f)

For merging with bookmarks and splitting by size, see:

PDF Operations Reference

Creating PDFs

Simple Text PDF

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

c = canvas.Canvas("output.pdf", pagesize=letter)
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Hello, World!")
c.save()

Styled Report

from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf")
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Content here", styles['BodyText']))

doc.build(story)

PDF with Table

from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

data = [
    ['Product', 'Quantity', 'Price'],
    ['Widget A', '10', '$50'],
    ['Widget B', '5', '$75']
]

table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))

For complete PDF creation workflows including images, multi-column layouts, and custom fonts, see:

PDF Creation Reference

For practical examples:

Invoice Generator
Report Automation

Metadata and Security

Extract Metadata

from pypdf import PdfReader

reader = PdfReader("document.pdf")
metadata = reader.metadata
print(f"Title: {metadata.get('/Title')}")
print(f"Author: {metadata.get('/Author')}")

Modify Metadata

from pypdf import PdfWriter

writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

writer.add_metadata({
    '/Title': 'New Title',
    '/Author': 'John Doe'
})

with open("updated.pdf", 'wb') as f:
    writer.write(f)

Add Password Protection

writer.encrypt(
    user_password="user123",
    owner_password="owner456",
    algorithm="AES-256"
)

For detailed security operations and comprehensive metadata management, see:

Metadata, Security, and OCR Reference

OCR for Scanned Documents

Basic OCR

from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")

Multi-Language OCR

text = pytesseract.image_to_string(image, lang='eng+fra+deu')

For searchable PDF creation and OCR preprocessing, see:

Metadata, Security, and OCR Reference

Watermarks and Annotations

Add Text Watermark

import fitz

doc = fitz.open("document.pdf")
for page in doc:
    page.insert_textbox(
        page.rect,
        "CONFIDENTIAL",
        fontsize=50,
        rotate=45,
        opacity=0.3,
        color=(0.7, 0.7, 0.7)
    )
doc.save("watermarked.pdf")
doc.close()

Add Annotations

page.add_highlight_annot(rect)  # Highlight
page.add_text_annot(point, "Note")  # Text note
page.add_underline_annot(rect)  # Underline

For stamps and image watermarks, see:

Metadata, Security, and OCR Reference

Page Operations

Rotate Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)
    writer.add_page(page)

with open("rotated.pdf", 'wb') as f:
    writer.write(f)

Extract Images

import fitz

doc = fitz.open("document.pdf")
for page_num in range(len(doc)):
    page = doc[page_num]
    for img_index, img in enumerate(page.get_images()):
        xref = img[0]
        base_image = doc.extract_image(xref)
        with open(f"image_{page_num}_{img_index}.png", "wb") as f:
            f.write(base_image["image"])
doc.close()

Convert Images to PDF

from PIL import Image
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf")
for img_path in ["img1.jpg", "img2.jpg"]:
    img = Image.open(img_path)
    c.setPageSize(img.size)
    c.drawImage(img_path, 0, 0, width=img.width, height=img.height)
    c.showPage()
c.save()

For detailed page operations, see:

PDF Operations Reference

Optimization

Compress PDF

import fitz

doc = fitz.open("large.pdf")
doc.save(
    "optimized.pdf",
    garbage=4,
    deflate=True,
    clean=True
)
doc.close()

Best Practices

Memory Management

Process large PDFs in chunks:

from pypdf import PdfReader
import gc

reader = PdfReader("large.pdf")
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    # Process text
    if i % 10 == 0:
        gc.collect()

Error Handling

Always handle encryption and errors:

from pypdf import PdfReader

try:
    reader = PdfReader("document.pdf")

    if reader.is_encrypted:
        reader.decrypt(password)

    for page in reader.pages:
        text = page.extract_text()
except Exception as e:
    print(f"Error: {e}")

OCR Fallback

Detect and handle scanned documents:

import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()

if not text.strip():
    # Use OCR for scanned document
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path("document.pdf")
    text = pytesseract.image_to_string(images[0])

For comprehensive best practices, common pitfalls, and troubleshooting, see:

Best Practices and Common Pitfalls

Common Pitfalls

Scanned Documents: Text extraction returns empty for scanned PDFs. Use OCR (pytesseract).

Table Detection: Tables not detected correctly. Adjust table_settings strategies.

Encrypted PDFs: Operations fail. Check and decrypt with password first.

Form Fields: Can't find field names. Use debug helper to list all fields.

Memory Issues: Large PDFs cause crashes. Process in chunks with garbage collection.

Encoding Issues: Special characters corrupted. Handle with UTF-8 encoding explicitly.

For detailed solutions and debugging strategies, see:

Best Practices and Common Pitfalls

Quick Reference

Text Extraction:

Simple: pypdf - page.extract_text()
Advanced: pdfplumber - page.extract_text() + page.extract_words()

Table Extraction:

Always use: pdfplumber - page.extract_tables()

PDF Creation:

Use: reportlab - canvas.Canvas() or SimpleDocTemplate()

Advanced Operations:

Use: PyMuPDF (fitz) - forms, annotations, compression

OCR:

Use: pytesseract + pdf2image

Merging/Splitting:

Use: pypdf - PdfMerger() and PdfWriter()

Helper Scripts

The skill includes helper scripts for common operations:

# See scripts directory for utilities
python scripts/pdf_helper.py --help

Additional Resources

Comprehensive References:

Library Installation - Setup and dependencies
Text Extraction - All extraction methods
Table Extraction - Table detection strategies
PDF Operations - Forms, merge, split, pages
PDF Creation - Creating PDFs from scratch
Metadata, Security, OCR - Advanced operations
Best Practices - Pitfalls and solutions

Practical Examples:

Invoice Generator - Professional invoice templates
Report Automation - Automated report generation

Implementation Guidelines

When working with PDFs:

Choose the right library for your task (see Quick Reference)
Handle errors with try-except blocks
Check for encryption before processing
Use OCR fallback for scanned documents
Process large files in chunks to manage memory
Validate input files before operations
Close documents to free resources: doc.close()

For production use, always implement proper error handling, validate inputs, and test with various PDF types and versions.

PDF Manipulation Skill

Comprehensive guide for working with PDF files in Python, covering extraction, manipulation, creation, and advanced operations using progressive disclosure for efficiency.

Core Capabilities

Extract and manipulate PDF content:

Extract text with layout preservation
Extract tables and parse structured data
Fill PDF forms programmatically
Merge multiple PDFs into a single document
Split PDFs by pages or ranges
Create PDFs from scratch with text, images, and graphics
Add watermarks and annotations
Extract and modify metadata (author, title, keywords)
Add password protection and encryption
Perform OCR on scanned documents
Convert images to PDF
Compress and optimize PDF files
Extract images from PDFs
Rotate and reorder pages

Quick Start

Install required libraries:

pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow

For detailed installation instructions including system dependencies, see:

Library Installation Guide

Python Libraries Overview

Text Extraction Workflow

Basic Extraction

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)

Layout-Aware Extraction

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        words = page.extract_words()  # With positioning
        print(text)

Extract from Specific Region

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    bbox = (0, 0, 612, 100)  # x0, y0, x1, y1
    header = page.crop(bbox).extract_text()

For detailed text extraction methods including OCR fallback and encoding handling, see:

Text Extraction Reference

Table Extraction Workflow

Extract All Tables

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            print(table)

Advanced Table Detection

table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3
}

tables = page.extract_tables(table_settings=table_settings)

For detailed table extraction strategies and data cleaning, see:

Table Extraction Reference

PDF Form Operations

Fill Form Fields

import fitz

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        if widget.field_name == "name":
            widget.field_value = "John Doe"
            widget.update()
doc.save("filled.pdf")
doc.close()

Extract Form Field Names

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        print(f"{widget.field_name}: {widget.field_type_string}")
doc.close()

For form filling, flattening, and debugging, see:

PDF Operations Reference

Merging and Splitting

Merge PDFs

from pypdf import PdfMerger

merger = PdfMerger()
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    merger.append(pdf)
merger.write("merged.pdf")
merger.close()

Merge with Page Ranges

merger = PdfMerger()
merger.append("doc1.pdf", pages=(0, 3))  # First 3 pages
merger.append("doc2.pdf")  # All pages
merger.write("compiled.pdf")
merger.close()

Split into Individual Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", 'wb') as f:
        writer.write(f)

For merging with bookmarks and splitting by size, see:

PDF Operations Reference

Creating PDFs

Simple Text PDF

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

c = canvas.Canvas("output.pdf", pagesize=letter)
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Hello, World!")
c.save()

Styled Report

from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf")
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Content here", styles['BodyText']))

doc.build(story)

PDF with Table

from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

data = [
    ['Product', 'Quantity', 'Price'],
    ['Widget A', '10', '$50'],
    ['Widget B', '5', '$75']
]

table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))

For complete PDF creation workflows including images, multi-column layouts, and custom fonts, see:

PDF Creation Reference

For practical examples:

Invoice Generator
Report Automation

Metadata and Security

Extract Metadata

from pypdf import PdfReader

reader = PdfReader("document.pdf")
metadata = reader.metadata
print(f"Title: {metadata.get('/Title')}")
print(f"Author: {metadata.get('/Author')}")

Modify Metadata

from pypdf import PdfWriter

writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

writer.add_metadata({
    '/Title': 'New Title',
    '/Author': 'John Doe'
})

with open("updated.pdf", 'wb') as f:
    writer.write(f)

Add Password Protection

writer.encrypt(
    user_password="user123",
    owner_password="owner456",
    algorithm="AES-256"
)

For detailed security operations and comprehensive metadata management, see:

Metadata, Security, and OCR Reference

OCR for Scanned Documents

Basic OCR

from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")

Multi-Language OCR

text = pytesseract.image_to_string(image, lang='eng+fra+deu')

For searchable PDF creation and OCR preprocessing, see:

Metadata, Security, and OCR Reference

Watermarks and Annotations

Add Text Watermark

import fitz

doc = fitz.open("document.pdf")
for page in doc:
    page.insert_textbox(
        page.rect,
        "CONFIDENTIAL",
        fontsize=50,
        rotate=45,
        opacity=0.3,
        color=(0.7, 0.7, 0.7)
    )
doc.save("watermarked.pdf")
doc.close()

Add Annotations

page.add_highlight_annot(rect)  # Highlight
page.add_text_annot(point, "Note")  # Text note
page.add_underline_annot(rect)  # Underline

For stamps and image watermarks, see:

Metadata, Security, and OCR Reference

Page Operations

Rotate Pages

from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)
    writer.add_page(page)

with open("rotated.pdf", 'wb') as f:
    writer.write(f)

Extract Images

import fitz

doc = fitz.open("document.pdf")
for page_num in range(len(doc)):
    page = doc[page_num]
    for img_index, img in enumerate(page.get_images()):
        xref = img[0]
        base_image = doc.extract_image(xref)
        with open(f"image_{page_num}_{img_index}.png", "wb") as f:
            f.write(base_image["image"])
doc.close()

Convert Images to PDF

from PIL import Image
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf")
for img_path in ["img1.jpg", "img2.jpg"]:
    img = Image.open(img_path)
    c.setPageSize(img.size)
    c.drawImage(img_path, 0, 0, width=img.width, height=img.height)
    c.showPage()
c.save()

For detailed page operations, see:

PDF Operations Reference

Optimization

Compress PDF

import fitz

doc = fitz.open("large.pdf")
doc.save(
    "optimized.pdf",
    garbage=4,
    deflate=True,
    clean=True
)
doc.close()

Best Practices

Memory Management

Process large PDFs in chunks:

from pypdf import PdfReader
import gc

reader = PdfReader("large.pdf")
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    # Process text
    if i % 10 == 0:
        gc.collect()

Error Handling

Always handle encryption and errors:

from pypdf import PdfReader

try:
    reader = PdfReader("document.pdf")

    if reader.is_encrypted:
        reader.decrypt(password)

    for page in reader.pages:
        text = page.extract_text()
except Exception as e:
    print(f"Error: {e}")

OCR Fallback

Detect and handle scanned documents:

import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()

if not text.strip():
    # Use OCR for scanned document
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path("document.pdf")
    text = pytesseract.image_to_string(images[0])

For comprehensive best practices, common pitfalls, and troubleshooting, see:

Best Practices and Common Pitfalls

Common Pitfalls

Scanned Documents: Text extraction returns empty for scanned PDFs. Use OCR (pytesseract).

Table Detection: Tables not detected correctly. Adjust table_settings strategies.

Encrypted PDFs: Operations fail. Check and decrypt with password first.

Form Fields: Can't find field names. Use debug helper to list all fields.

Memory Issues: Large PDFs cause crashes. Process in chunks with garbage collection.

Encoding Issues: Special characters corrupted. Handle with UTF-8 encoding explicitly.

For detailed solutions and debugging strategies, see:

Best Practices and Common Pitfalls

Quick Reference

Text Extraction:

Simple: pypdf - page.extract_text()
Advanced: pdfplumber - page.extract_text() + page.extract_words()

Table Extraction:

Always use: pdfplumber - page.extract_tables()

PDF Creation:

Use: reportlab - canvas.Canvas() or SimpleDocTemplate()

Advanced Operations:

Use: PyMuPDF (fitz) - forms, annotations, compression

OCR:

Use: pytesseract + pdf2image

Merging/Splitting:

Use: pypdf - PdfMerger() and PdfWriter()

Helper Scripts

The skill includes helper scripts for common operations:

# See scripts directory for utilities
python scripts/pdf_helper.py --help

Additional Resources

Comprehensive References:

Library Installation - Setup and dependencies
Text Extraction - All extraction methods
Table Extraction - Table detection strategies
PDF Operations - Forms, merge, split, pages
PDF Creation - Creating PDFs from scratch
Metadata, Security, OCR - Advanced operations
Best Practices - Pitfalls and solutions

Practical Examples:

Invoice Generator - Professional invoice templates
Report Automation - Automated report generation

Implementation Guidelines

When working with PDFs:

Choose the right library for your task (see Quick Reference)
Handle errors with try-except blocks
Check for encryption before processing
Use OCR fallback for scanned documents
Process large files in chunks to manage memory
Validate input files before operations
Close documents to free resources: doc.close()

For production use, always implement proper error handling, validate inputs, and test with various PDF types and versions.

Adoption

aiskillstore/pdf

$ install --global

Security Scan Results

SKILL.md

PDF Manipulation Skill

Core Capabilities

Quick Start

Python Libraries Overview

Text Extraction Workflow

Basic Extraction

Layout-Aware Extraction

Extract from Specific Region

Table Extraction Workflow

Extract All Tables

Advanced Table Detection

PDF Form Operations

Fill Form Fields

Extract Form Field Names

Merging and Splitting

Merge PDFs

Merge with Page Ranges

Split into Individual Pages

Creating PDFs

Simple Text PDF

Styled Report

PDF with Table

Metadata and Security

Extract Metadata

Modify Metadata

Add Password Protection

OCR for Scanned Documents

Basic OCR

Multi-Language OCR

Watermarks and Annotations

Add Text Watermark

Add Annotations

Page Operations

Rotate Pages

Extract Images

Convert Images to PDF

Optimization

Compress PDF

Best Practices

Memory Management

Error Handling

OCR Fallback

Common Pitfalls

Quick Reference

Helper Scripts

Additional Resources

Implementation Guidelines

Related Skills

aiskillstore/hig-components-content

aiskillstore/helpdesk-automation

aiskillstore/haskell-pro

aiskillstore/graphql

aiskillstore/pdf

$ install --global

Security Scan Results

SKILL.md

PDF Manipulation Skill

Core Capabilities

Quick Start

Python Libraries Overview

Text Extraction Workflow

Basic Extraction

Layout-Aware Extraction

Extract from Specific Region

Table Extraction Workflow

Extract All Tables

Advanced Table Detection

PDF Form Operations

Fill Form Fields

Extract Form Field Names

Merging and Splitting

Merge PDFs

Merge with Page Ranges

Split into Individual Pages

Creating PDFs