skills/literature/search/arxiv-paper-processor/SKILL.md
Process and analyze arXiv papers systematically for research workflows
The arXiv Paper Processor skill provides a complete pipeline for downloading, parsing, and analyzing arXiv papers programmatically. While the arXiv API provides metadata, researchers often need to work with the full text: extracting sections, equations, figures, and references for deeper analysis.
This skill covers the entire processing chain: retrieving papers by ID or search query, downloading PDF and LaTeX source files, extracting structured content, and producing analysis-ready outputs. It is particularly valuable for researchers conducting large-scale literature analysis, building training datasets from academic text, or automating evidence extraction for systematic reviews.
The pipeline handles common challenges in academic PDF processing including multi-column layouts, mathematical notation, table extraction, and reference parsing. It integrates with tools like GROBID for PDF parsing and can work directly with arXiv LaTeX sources for higher-fidelity extraction.
The most reliable method is to fetch papers by their arXiv identifier:
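import urllib.request

import feedparser  # third-party: pip install feedparser

# Fetch metadata via the arXiv Atom feed
arxiv_id = "2301.07041"
url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
with urllib.request.urlopen(url, timeout=30) as response:
    feed = feedparser.parse(response.read())

if not feed.entries:
    raise LookupError(f"No arXiv entry found for {arxiv_id}")
entry = feed.entries[0]
title = entry.title
abstract = entry.summary
authors = [a.name for a in entry.authors]
# Select the PDF link by MIME type instead of relying on its position in the list
pdf_url = next((link.href for link in entry.links if link.get("type") == "application/pdf"), None)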
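Retrieval by search query uses the same endpoint with different parameters. The following is a minimal sketch of a URL builder covering both cases; `build_query_url` is a hypothetical helper name, and the parameters it sets (`id_list`, `search_query`, `start`, `max_results`) are the ones the arXiv API query interface accepts.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(id_list=None, search_query=None, start=0, max_results=10):
    """Build an arXiv API query URL from IDs, a search query, or both."""
    if not id_list and not search_query:
        raise ValueError("Provide id_list, search_query, or both")
    params = {"start": start, "max_results": max_results}
    if id_list:
        # Multiple identifiers are passed as one comma-separated value
        params["id_list"] = ",".join(id_list)
    if search_query:
        # Field prefixes such as ti:, au:, abs:, cat:, all: go inside the query string
        params["search_query"] = search_query
    return f"{ARXIV_API}?{urlencode(params)}"


by_id = build_query_url(id_list=["2301.07041"])
by_search = build_query_url(search_query="ti:transformer AND cat:cs.CL", max_results=5)
```

The resulting URL can be passed to `urllib.request.urlopen` and parsed with `feedparser` exactly as in the ID-based example.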
arXiv stores LaTeX source files for most papers. These provide much richer structure than PDFs:
# Download LaTeX source (typically a .tar.gz)
wget https://arxiv.org/e-print/2301.07041 -O paper_source.tar.gz
# Create the target directory first; tar -C fails if it does not exist
mkdir -p paper_source
tar -xzf paper_source.tar.gz -C paper_source/
Source files contain the original .tex files, figures, bibliography files, and any custom style files. Parsing LaTeX directly gives you access to section structure, equations in their original notation, citation keys, and figure captions without the ambiguity of PDF extraction.
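The e-print download is not always a tarball: in practice it can also be a single gzipped .tex file, or a plain PDF when no source was submitted. The sketch below handles those three cases with the standard library only. `unpack_eprint` and the fallback file names `main.tex` and `paper.pdf` are hypothetical choices for illustration, not something arXiv prescribes.

```python
import gzip
import io
import tarfile
from pathlib import Path


def unpack_eprint(data, dest):
    """Unpack raw e-print bytes into dest and return the names written."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
            names = []
            for member in tar.getmembers():
                target = (dest / member.name).resolve()
                # Skip links, directories, and paths that would escape dest
                if not member.isfile() or root not in target.parents:
                    continue
                target.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src:
                    target.write_bytes(src.read())
                names.append(member.name)
            return names
    except tarfile.ReadError:
        pass  # not a tar archive; fall through to the single-file cases

    if data.startswith(b"%PDF"):
        (dest / "paper.pdf").write_bytes(data)
        return ["paper.pdf"]

    try:
        content = gzip.decompress(data)  # single gzipped .tex file
    except (OSError, EOFError):
        content = data  # assume uncompressed TeX
    (dest / "main.tex").write_bytes(content)
    return ["main.tex"]
```

Writing members one by one, rather than calling `extractall`, keeps the path check explicit and avoids depending on extraction-filter behavior that differs between Python versions.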
When downloading multiple papers, respect arXiv's usage policies.
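One way to stay within those policies is to fetch sequentially with a fixed pause between requests. The sketch below assumes a three-second default, based on the delay arXiv's API guidance has suggested; verify it against the current terms of use. `fetch_politely` is a hypothetical helper, and the `fetch` callable is supplied by the caller (for example, a function that downloads one e-print).

```python
import time


def fetch_politely(arxiv_ids, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Yield (arxiv_id, result) pairs, pausing between consecutive requests.

    Requests run one at a time on a single connection; `sleep` is injectable
    so the pacing can be tested without actually waiting.
    """
    for index, arxiv_id in enumerate(arxiv_ids):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first request
        yield arxiv_id, fetch(arxiv_id)


# Example with a stand-in fetch function and a recorded (not real) sleep
pauses = []
results = list(
    fetch_politely(
        ["2301.07041", "2301.07042"],
        fetch=lambda arxiv_id: f"downloaded {arxiv_id}",
        sleep=pauses.append,
    )
)
```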
For papers where only PDF is available, use GROBID (GeneRation Of BIbliographic Data) for structured extraction:
# Run GROBID as a local service
docker run --rm -p 8070:8070 grobid/grobid:0.8.0
# Process a PDF
curl -X POST "http://localhost:8070/api/processFulltextDocument" \
  -F "input=@paper.pdf" \
  -F "consolidateHeader=1" \
  -F "consolidateCitations=1" \
  > paper_tei.xml
GROBID outputs TEI-XML with structured sections, including header metadata, body text divided by section headings, and parsed bibliographic references.
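The TEI output can be read with the standard library. The following is a minimal sketch, assuming the element layout GROBID typically emits (`titleStmt/title`, `profileDesc/abstract`, `text/body/div` with a `head`, and `listBibl/biblStruct`); the function name `parse_grobid_tei` and the returned dictionary shape are illustrative choices, and real output may need extra handling for nested divisions, formulas, and figures.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _clean(element):
    """Return the element's text content with whitespace collapsed."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_grobid_tei(xml_text):
    """Extract title, abstract, sections, and reference titles from TEI-XML."""
    root = ET.fromstring(xml_text)
    sections = []
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        paragraphs = [_clean(p) for p in div.findall("tei:p", TEI_NS)]
        sections.append(
            {"heading": _clean(div.find("tei:head", TEI_NS)), "text": "\n".join(paragraphs)}
        )
    references = [
        _clean(bibl.find(".//tei:title", TEI_NS))
        for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
    ]
    return {
        "title": _clean(root.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "abstract": _clean(root.find(".//tei:profileDesc/tei:abstract", TEI_NS)),
        "sections": sections,
        "references": references,
    }
```

Typical usage would be `parse_grobid_tei(open("paper_tei.xml", "rb").read())` on the file produced by the curl command.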
When LaTeX source is available, parse it directly for higher fidelity:
1. Identify the main .tex file (look for \documentclass or \begin{document})
2. Resolve \input{} and \include{} directives to build the complete document
3. Split the document into sections at the \section{}, \subsection{} markers
4. Extract equations from equation, align, gather environments
5. Collect \cite{} commands and cross-reference with the .bib file
6. Extract figure captions from \caption{} commands

Produce a standardized JSON output for each processed paper:
{
"arxiv_id": "2301.07041",
"title": "Paper Title",
"authors": ["Author One", "Author Two"],
"abstract": "...",
"sections": [
{"heading": "Introduction", "level": 1, "text": "..."},
{"heading": "Related Work", "level": 1, "text": "..."}
],
"equations": ["E = mc^2", "..."],
"figures": [{"id": "fig1", "caption": "..."}],
"references": [{"key": "smith2020", "title": "...", "doi": "..."}],
"processed_date": "2026-03-10"
}
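The LaTeX parsing steps and the JSON layout above can be sketched with regular expressions. This is a simplified illustration, not a full LaTeX parser: it ignores comments, macros, and nested braces in headings or captions, and it would also pick up table captions. The helper names `resolve_inputs` and `parse_latex` are hypothetical, and `files` is assumed to be an in-memory mapping of file name to contents taken from the unpacked source.

```python
import json
import re

INPUT_RE = re.compile(r"\\(?:input|include)\{([^{}]+)\}")
SECTION_RE = re.compile(r"\\(section|subsection)\*?\{([^{}]*)\}")
EQUATION_RE = re.compile(
    r"\\begin\{(equation|align|gather)\*?\}(.*?)\\end\{\1\*?\}", re.DOTALL
)
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^{}]*)\}")
CAPTION_RE = re.compile(r"\\caption\{([^{}]*)\}")


def resolve_inputs(tex, files, depth=0):
    """Inline \\input{} and \\include{} targets found in the files mapping."""
    def replace(match):
        name = match.group(1).strip()
        content = files.get(name) or files.get(name + ".tex")
        if content is None or depth >= 10:  # leave unknown or too-deep includes as-is
            return match.group(0)
        return resolve_inputs(content, files, depth + 1)
    return INPUT_RE.sub(replace, tex)


def parse_latex(tex):
    """Extract sections, equations, figure captions, and citation keys."""
    body = tex.split(r"\begin{document}", 1)[-1].split(r"\end{document}", 1)[0]
    headings = list(SECTION_RE.finditer(body))
    sections = []
    for i, match in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(body)
        sections.append({
            "heading": match.group(2).strip(),
            "level": 1 if match.group(1) == "section" else 2,
            "text": body[match.end():end].strip(),
        })
    keys = []
    for match in CITE_RE.finditer(body):
        for key in match.group(1).split(","):
            if key.strip() and key.strip() not in keys:
                keys.append(key.strip())
    captions = CAPTION_RE.findall(body)
    return {
        "sections": sections,
        "equations": [m.group(2).strip() for m in EQUATION_RE.finditer(body)],
        "figures": [{"id": f"fig{i}", "caption": c.strip()} for i, c in enumerate(captions, 1)],
        "references": [{"key": key} for key in keys],
    }


files = {"intro.tex": r"\section{Introduction} We build on \cite{smith2020, doe2021}."}
main = r"\documentclass{article}\begin{document}\input{intro}\end{document}"
record = parse_latex(resolve_inputs(main, files))
print(json.dumps(record, indent=2))
```

The returned dictionary covers only the content fields; metadata such as `arxiv_id`, `title`, `authors`, and `abstract` would come from the Atom feed entry, and reference titles or DOIs from the .bib file.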
Once papers are processed into a structured format, several downstream analyses become possible.
Integrate processed outputs with your reference manager by generating BibTeX entries enriched with extracted metadata, or feed structured JSON into a local search index for full-text retrieval across your paper collection.
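As an illustration of the reference-manager route, the sketch below turns one processed JSON record into a BibTeX `@misc` entry. `to_bibtex` is a hypothetical helper; the citation-key scheme (first author's last name plus year) is an arbitrary choice, and the year is inferred from the first two digits of the identifier, which assumes a new-style arXiv ID of the form YYMM.NNNNN.

```python
import re


def to_bibtex(paper):
    """Render a processed paper record as a BibTeX @misc entry."""
    authors = paper.get("authors") or ["Unknown"]
    # Assumption: new-style arXiv IDs begin with a two-digit year
    year = "20" + paper["arxiv_id"][:2]
    key = re.sub(r"\W+", "", authors[0].split()[-1].lower()) + year
    fields = {
        "title": paper["title"],
        "author": " and ".join(authors),
        "year": year,
        "eprint": paper["arxiv_id"],
        "archivePrefix": "arXiv",
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


entry = to_bibtex({
    "arxiv_id": "2301.07041",
    "title": "Paper Title",
    "authors": ["Author One", "Author Two"],
})
print(entry)
```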