skills/pdf/SKILL.md
Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, cleaning background noise from scanned PDFs, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()

# Merge several PDFs into one
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

# Split a PDF into one file per page
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
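The split example above writes one file per page. When multi-page chunks are wanted instead, the page ranges can be computed first. The helper below is a minimal sketch; `chunk_ranges` is a hypothetical name, not part of pypdf.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, end) page index pairs, 0-based with end exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]

# A 12-page document split into 5-page chunks
print(chunk_ranges(12, 5))  # [(0, 5), (5, 10), (10, 12)]
```

Each pair can then drive the same `PdfWriter` loop as above, adding the pages from `start` up to (but not including) `end` before writing each chunk to its own file.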

# Read document metadata
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")

# Rotate the first page and write it to a new file
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
    writer.write(output)

import pdfplumber

# Extract text from every page
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

# Extract tables from every page
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

# Export extracted tables to Excel
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
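The DataFrame construction above uses the first row of each table as column names. Because pdfplumber returns `None` for empty cells, headers can come back missing or duplicated. The following is a sketch of one way to normalize a table before building the DataFrame. `clean_table` and the `column_N` fallback names are illustrative assumptions, not part of pdfplumber or pandas.

```python
def clean_table(table):
    """Split a pdfplumber-style table into (header, rows) with safe cell values.

    Assumes the first row is the header, as in the example above.
    """
    def cell_text(cell):
        # Empty cells come back as None; collapse internal line breaks too
        return "" if cell is None else " ".join(str(cell).split())

    header, seen = [], {}
    for i, cell in enumerate(table[0]):
        name = cell_text(cell) or f"column_{i + 1}"
        # Make repeated header names unique: "Qty", "Qty_2", ...
        seen[name] = seen.get(name, 0) + 1
        header.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    rows = [[cell_text(cell) for cell in row] for row in table[1:]]
    return header, rows

raw = [["Item", None, "Qty", "Qty"], ["Bolt", "M4\nsteel", "10", None]]
header, rows = clean_table(raw)
print(header)  # ['Item', 'column_2', 'Qty', 'Qty_2']
print(rows)    # [['Bolt', 'M4 steel', '10', '']]
```

The result can be passed on as `pd.DataFrame(rows, columns=header)`.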
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
# Add a line
c.line(100, height - 140, 400, height - 140)
# Save
c.save()
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
# Build PDF
doc.build(story)
IMPORTANT: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.
Instead, use ReportLab's XML markup tags in Paragraph objects:
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet
styles = getSampleStyleSheet()
# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.
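The manual adjustment for canvas-drawn text might look like the sketch below. `draw_with_script` is a hypothetical helper. The 0.6 size scale and the vertical offsets are illustrative choices, not ReportLab defaults. The `c` argument is assumed to behave like a `reportlab.pdfgen.canvas.Canvas`, of which only `setFont`, `drawString`, and `stringWidth` are used.

```python
def draw_with_script(c, x, y, base, script, font="Helvetica", size=12, superscript=False):
    """Draw `base`, then a smaller `script` run raised or lowered from the baseline.

    Returns the x position where following text can continue.
    """
    small = size * 0.6  # illustrative scale for the script run
    # Raise superscripts, lower subscripts (offsets are illustrative)
    shift = size * 0.35 if superscript else -size * 0.2

    c.setFont(font, size)
    c.drawString(x, y, base)

    # Start the script run where the base text ends
    x_script = x + c.stringWidth(base, font, size)
    c.setFont(font, small)
    c.drawString(x_script, y + shift, script)

    return x_script + c.stringWidth(script, font, small)
```

With a real canvas, calling `draw_with_script(c, 100, 700, "H", "2")` and then drawing "O" at the returned x position (after resetting the font size) would render the formula for water with a lowered 2.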
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
# Remove password (use environment variable for security)
qpdf --password="$PDF_PASSWORD" --decrypt encrypted.pdf decrypted.pdf
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split
pdftk input.pdf burst
# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
Ghostscript compresses PDF files reliably by downsampling images and applying more efficient compression.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=<setting> -dNOPAUSE -dQUIET -dBATCH -sOutputFile="compressed.pdf" "input.pdf"
| Setting | Quality Description | Image Resolution (DPI) | Typical Use Case |
|---|---|---|---|
| /screen | Lowest quality, highest compression | 72 dpi | On-screen viewing, smallest file size |
| /ebook | Medium quality, balanced compression | 150 dpi | E-readers and general digital use |
| /printer | High quality | 300 dpi | Printing with good resolution |
| /prepress | Highest quality, color preserving | 300 dpi | Professional printing workflows |
| /default | General purpose output | Varies by parameter | Versatile but may result in larger files |
# Maximum compression for web/email (/screen preset)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile="compressed.pdf" "input.pdf"
# Balanced compression for general use (/ebook preset)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="compressed.pdf" "input.pdf"
# High quality for printing (/printer preset)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -dNOPAUSE -dQUIET -dBATCH -sOutputFile="compressed.pdf" "input.pdf"
For finer control, override specific parameters after the -dPDFSETTINGS flag:
# Use /ebook settings but force higher color image resolution
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dColorImageResolution=300 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="output.pdf" "input.pdf"
# Disable image downsampling entirely
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/default -dDownsampleColorImages=false -dDownsampleGrayImages=false -dDownsampleMonoImages=false -dNOPAUSE -dQUIET -dBATCH -sOutputFile="output.pdf" "input.pdf"
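When these Ghostscript commands are run from Python, building the argument list in code avoids shell-quoting problems and catches a mistyped preset early. The helper below is a sketch under those assumptions. `build_gs_command` and `PRESETS` are hypothetical names. The flags are the ones shown in the commands above.

```python
PRESETS = {"/screen", "/ebook", "/printer", "/prepress", "/default"}

def build_gs_command(input_pdf, output_pdf, preset="/ebook", overrides=None):
    """Build the Ghostscript argument list for the commands shown above.

    `overrides` maps parameter names to values, e.g. {"ColorImageResolution": 300}.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    cmd = ["gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4", f"-dPDFSETTINGS={preset}"]
    # Overrides go after -dPDFSETTINGS so they take effect on top of the preset
    for name, value in (overrides or {}).items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase, as in the commands above
        cmd.append(f"-d{name}={value}")
    cmd += ["-dNOPAUSE", "-dQUIET", "-dBATCH", f"-sOutputFile={output_pdf}", input_pdf]
    return cmd

cmd = build_gs_command("input.pdf", "output.pdf", overrides={"ColorImageResolution": 300})
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True)
```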
Use scripts/clean_pdf_background.py to remove background noise and shadows from scanned PDFs while preserving color content (photos, stamps) and text clarity.
python scripts/clean_pdf_background.py input.pdf output.pdf [dpi]
Parameters:
- input.pdf: Source PDF with background noise
- output.pdf: Cleaned output PDF
- dpi: Resolution for processing (default: 300, use 600 for higher quality)

Processing method:
Example:
# Standard quality cleaning
python scripts/clean_pdf_background.py scanned.pdf clean.pdf
# High quality cleaning
python scripts/clean_pdf_background.py scanned.pdf clean.pdf 600
Requirements:
- ImageMagick (convert)
- poppler-utils (pdftoppm)
- Python packages: Pillow, numpy

# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
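After `pdfimages` has run, the extracted files can be collected from Python by matching the numbered naming pattern described above. This is a minimal sketch: `extracted_images` is a hypothetical helper. It matches any file extension because `-j` writes only JPEG-encoded images as .jpg, and other images come out as .ppm or .pbm.

```python
from pathlib import Path

def extracted_images(prefix, directory="."):
    """Return files written by pdfimages for `prefix`, in numeric order."""
    def index(path):
        # Outputs are named <prefix>-NNN.<ext>; sort on the number, not the string
        return int(path.stem.rsplit("-", 1)[1])

    matches = [
        p for p in Path(directory).glob(f"{prefix}-*")
        if p.stem.rsplit("-", 1)[-1].isdigit()
    ]
    return sorted(matches, key=index)

for path in extracted_images("output_prefix"):
    print(path)
```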

import os
from pypdf import PdfReader, PdfWriter

# Load passwords from environment variables for security
USER_PASSWORD = os.environ.get('PDF_USER_PASSWORD')
OWNER_PASSWORD = os.environ.get('PDF_OWNER_PASSWORD')
if not USER_PASSWORD or not OWNER_PASSWORD:
    raise ValueError("PDF_USER_PASSWORD and PDF_OWNER_PASSWORD environment variables must be set")

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

# Add password protection
writer.encrypt(USER_PASSWORD, OWNER_PASSWORD)

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Compress PDF | Ghostscript | gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook |
| Clean background noise | scripts/clean_pdf_background.py | python scripts/clean_pdf_background.py input.pdf output.pdf |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |