skills/pdf/SKILL.md
Extract text, tables, and metadata from scientific PDF papers and reports
npx skillsauth add lamm-mit/scienceclaw pdf
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
PDF processing toolkit for extracting text, tables, and metadata from scientific papers, supplementary data files, and technical reports. Uses pdfplumber for high-fidelity text and table extraction with layout preservation, falling back to pypdf when pdfplumber is unavailable.
Supports page range selection for large documents and targeted extraction modes (text-only, tables-only, metadata-only) for efficient processing.
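How the script parses `--pages` is not shown here. As a minimal sketch, assuming 1-based inclusive ranges such as `"1-5"`, a range string can be turned into 0-based page indices as follows. The function name `parse_page_range` and the comma-separated form (`"2,4-6"`) are illustrative assumptions, not part of the skill.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-5" or "2,4-6" into sorted 0-based indices.

    Hypothetical helper: the comma-separated form is an assumption; the
    skill's documentation only shows a single "start-end" range.
    """
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start  # "3" means just page 3
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length so "10-20" on a 12-page file still works.
        for page in range(start, min(end, page_count) + 1):
            indices.add(page - 1)
    return sorted(indices)


print(parse_page_range("1-5", page_count=12))
```

Clamping to `page_count` rather than raising is a design choice for this sketch; a stricter tool might reject ranges that run past the last page.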
# Extract everything from a PDF
python3 skills/pdf/scripts/pdf_extract.py --file /path/to/paper.pdf
# Extract only text from pages 1-5
python3 skills/pdf/scripts/pdf_extract.py --file /path/to/paper.pdf --pages "1-5" --extract text
# Extract tables only
python3 skills/pdf/scripts/pdf_extract.py --file /path/to/supplementary.pdf --extract tables
# Extract metadata only
python3 skills/pdf/scripts/pdf_extract.py --file /path/to/paper.pdf --extract metadata
{
  "file": "/path/to/paper.pdf",
  "text": "Abstract\n\nWe present a novel approach to protein structure...",
  "tables": [
    [["Gene", "Expression", "p-value"], ["BRCA1", "2.4x", "0.001"]],
    [["Compound", "IC50 (nM)"], ["Compound A", "12.3"]]
  ],
  "metadata": {
    "title": "Novel Approach to Protein Structure Prediction",
    "author": "Smith et al.",
    "creation_date": "2024-01-15",
    "pages": 12
  },
  "page_count": 12
}
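In the sample output above, each table is a list of rows whose first row looks like a header. Under that assumption, which may not hold for every PDF, a table can be reshaped into one dictionary per data row for easier downstream use. The helper name `tables_to_records` is hypothetical, and the sample data is copied from the example output.

```python
import json


def tables_to_records(result: dict) -> list[list[dict]]:
    """Convert each table (list of rows) into a list of {header: cell} dicts.

    Assumes the first row of every table is its header row.
    """
    records = []
    for table in result.get("tables", []):
        if not table:
            records.append([])  # keep positions aligned with the input
            continue
        header, *rows = table
        records.append([dict(zip(header, row)) for row in rows])
    return records


# Tables taken from the example output; other keys omitted for brevity.
sample = json.loads("""
{
  "tables": [
    [["Gene", "Expression", "p-value"], ["BRCA1", "2.4x", "0.001"]],
    [["Compound", "IC50 (nM)"], ["Compound A", "12.3"]]
  ]
}
""")

for table in tables_to_records(sample):
    print(table)
```

Cell values stay as strings, exactly as extracted; converting entries such as `"0.001"` to numbers is left to the caller.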
Install with pip:
pip install pdfplumber pypdf
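The fallback order described earlier (pdfplumber first, then pypdf) can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `pdf_extract.py`: the names `pick_backend` and `extract_text` are hypothetical, and only text extraction is shown.

```python
import importlib.util


def pick_backend(is_available=None):
    """Return "pdfplumber", "pypdf", or None, in that order of preference."""
    if is_available is None:
        # Default check: is the package importable in this environment?
        is_available = lambda name: importlib.util.find_spec(name) is not None
    for name in ("pdfplumber", "pypdf"):
        if is_available(name):
            return name
    return None


def extract_text(path: str) -> str:
    """Extract text from every page using whichever backend is installed."""
    backend = pick_backend()
    if backend == "pdfplumber":
        import pdfplumber

        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if backend == "pypdf":
        from pypdf import PdfReader

        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise RuntimeError("Neither pdfplumber nor pypdf is installed")


print("Available backend:", pick_backend())
```

Running the snippet after `pip install` is a quick way to confirm which backend will be used in the current environment.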