skills/table-extractor/SKILL.md
Extract tables from PDFs with high accuracy using camelot. Handles complex table structures including merged cells, multi-line rows, and spanning headers. Use when a user asks to extract a table from a PDF, pull tabular data from a document, convert PDF tables to CSV or Excel, or parse structured tables from reports.
npx skillsauth add tusosos/manus-knowledge-base table-extractorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. Outputs clean DataFrames that can be exported to CSV, Excel, or JSON.
When a user asks you to extract tables from a PDF, follow this process:
# Install camelot and its dependencies
pip install "camelot-py[base]" ghostscript opencv-python-headless pandas
# Verify ghostscript is available (required by camelot)
gs --version 2>/dev/null || echo "Install ghostscript: sudo apt install ghostscript"
If ghostscript is not available, fall back to pdfplumber:
pip install pdfplumber pandas
import camelot
# Quick scan: how many tables are in the document?
tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
print(f"Found {len(tables)} tables using lattice detection")
# If no tables found, try stream detection (for borderless tables)
if len(tables) == 0:
tables = camelot.read_pdf("document.pdf", pages="all", flavor="stream")
print(f"Found {len(tables)} tables using stream detection")
# Summary of each table
for i, table in enumerate(tables):
print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols (page {table.page})")
print(f"Accuracy: {table.accuracy:.1f}%")
print(table.df.head(3))
Lattice flavor (for tables with visible borders/gridlines):
tables = camelot.read_pdf(
"document.pdf",
pages="1,2,3", # Specific pages
flavor="lattice",
line_scale=40, # Adjust line detection sensitivity
process_background=True # Detect lines on colored backgrounds
)
Stream flavor (for borderless tables, whitespace-separated):
tables = camelot.read_pdf(
"document.pdf",
pages="1",
flavor="stream",
edge_tol=50, # Tolerance for edge detection
row_tol=10, # Tolerance for grouping text into rows
columns=["72,200,350,500"] # Manual column boundaries if auto-detect fails
)
import pandas as pd
for i, table in enumerate(tables):
df = table.df
# Promote first row to header if it contains column names
if df.iloc[0].str.match(r'^[A-Za-z]').all():
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)
# Clean whitespace and newlines within cells
df = df.apply(lambda col: col.str.strip().str.replace(r'\n', ' ', regex=True))
# Remove completely empty rows
df = df.dropna(how='all').replace('', pd.NA).dropna(how='all')
# Convert numeric columns
for col in df.columns:
try:
df[col] = pd.to_numeric(df[col].str.replace(',', '').str.replace('$', ''))
except (ValueError, AttributeError):
pass # Keep as string
print(f"\nCleaned Table {i}:")
print(df.head())
Merged cells and spanning headers:
# Forward-fill merged cells (common in row headers)
df.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()
# Handle multi-level column headers
if df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() < 20:
# Combine first two rows as multi-level header
new_cols = df.iloc[0] + " - " + df.iloc[1]
df.columns = new_cols.str.strip(" - ")
df = df[2:].reset_index(drop=True)
Tables spanning multiple pages:
# Extract from all pages and concatenate
all_tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
# Group tables that are continuations (same column count)
groups = {}
for t in all_tables:
key = t.shape[1]
groups.setdefault(key, []).append(t.df)
for col_count, dfs in groups.items():
combined = pd.concat(dfs, ignore_index=True)
# Remove duplicate header rows that appear at page breaks
combined = combined[~combined.duplicated(keep='first')]
# CSV (one file per table)
for i, table in enumerate(tables):
table.df.to_csv(f"table_{i+1}.csv", index=False)
# Excel (all tables as separate sheets)
with pd.ExcelWriter("extracted_tables.xlsx") as writer:
for i, table in enumerate(tables):
table.df.to_excel(writer, sheet_name=f"Table_{i+1}", index=False)
# JSON
for i, table in enumerate(tables):
table.df.to_json(f"table_{i+1}.json", orient="records", indent=2)
print(f"Exported {len(tables)} tables")
User request: "Extract all tables from this annual report PDF"
Actions:
Output: "Extracted 7 tables across 42 pages. Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs."
User request: "Get the results table from page 8 of this paper"
Actions:
camelot.read_pdf("paper.pdf", pages="8")Output: A single CSV file with the results table, plus a preview of the first few rows printed to the console.
User request: "Extract the summary table from each of these 20 monthly reports"
Actions:
import glob
results = []
for pdf_path in sorted(glob.glob("reports/*.pdf")):
tables = camelot.read_pdf(pdf_path, pages="1", flavor="lattice")
if tables:
df = tables[0].df # First table on first page
df["source_file"] = pdf_path
results.append(df)
combined = pd.concat(results, ignore_index=True)
combined.to_csv("all_summaries.csv", index=False)
Output: A single CSV combining the summary table from all 20 reports with a source_file column for traceability.
lattice flavor first (bordered tables). Fall back to stream for borderless tables.accuracy score on each table. Below 80% indicates extraction issues that need manual review.ocrmypdf) before table extraction.page.extract_table(table_settings={...}).tools
Download video and audio from YouTube and other platforms with yt-dlp. Use when a user asks to download YouTube videos, extract audio from videos, download playlists, get subtitles, download specific formats or qualities, batch download, archive channels, extract metadata, embed thumbnails, download from social media platforms (Twitter, Instagram, TikTok), or build media ingestion pipelines. Covers format selection, audio extraction, playlists, subtitles, metadata, and automation.
development
Download YouTube videos with customizable quality and format options. Use this skill when the user asks to download, save, or grab YouTube videos. Supports various quality settings (best, 1080p, 720p, 480p, 360p), multiple formats (mp4, webm, mkv), and audio-only downloads as MP3.
development
Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.
development
Use when you have a spec or requirements for a multi-step task, before touching code