skills/brunoasm/extract-from-pdfs/SKILL.md
This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.
npx skillsauth add aiskillstore/marketplace extract-from-pdfs
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.
Core capabilities:
Use when:
Do not use for:
Read the setup guide for installation and configuration:
cat references/setup_guide.md
Key setup steps:
conda env create -f environment.yml
export ANTHROPIC_API_KEY='your-key'
Ask the user:
Provide 2-3 example PDFs to analyze structure and design schema.
Create custom schema from template:
cp assets/schema_template.json my_schema.json
Customize for the specific domain:
objective describing what to extract
output_schema with field types and descriptions
instructions for Claude
output_example showing desired format
See assets/example_flower_visitors_schema.json for real-world ecology example.
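A minimal sketch of what a customized schema might contain, assuming only the four top-level keys named above. The field names, the JSON-Schema-style type descriptors, and the list form of instructions are hypothetical illustrations, not the contents of assets/schema_template.json.

```python
import json

# Hypothetical schema. Only the four top-level keys come from the list above;
# the field names and type descriptors inside them are illustrative.
schema = {
    "objective": "Extract plant-pollinator interaction records from each paper.",
    "output_schema": {
        "type": "object",
        "properties": {
            "plant_species": {"type": "string", "description": "Scientific name of the plant"},
            "visitor_species": {"type": "string", "description": "Scientific name of the flower visitor"},
            "location": {"type": "string", "description": "Study site as reported in the paper"},
        },
        "required": ["plant_species", "visitor_species"],
    },
    "instructions": [
        "Report scientific names exactly as written in the paper.",
        "Leave location empty if the paper does not state a study site.",
    ],
    "output_example": {
        "plant_species": "Trifolium pratense",
        "visitor_species": "Bombus terrestris",
        "location": "Uppsala, Sweden",
    },
}

REQUIRED_KEYS = ("objective", "output_schema", "instructions", "output_example")


def missing_keys(candidate: dict) -> list:
    """Return the top-level keys that are absent from a schema dict."""
    return [key for key in REQUIRED_KEYS if key not in candidate]


# Serialize; writing this string to my_schema.json is left to the caller.
schema_json = json.dumps(schema, indent=2)
print(missing_keys(schema))
```

Checking for missing keys before running the extraction step is a cheap way to catch an incomplete schema early.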
Run the 6-step pipeline (plus optional validation):
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source library.bib \
--pdf-dir pdfs/ \
--output metadata.json
# Step 2: Filter papers (optional - recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
# Step 4: Repair JSON
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
# Step 6: Export to analysis format
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--output results
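The six commands above can also be chained from a small driver. This is a sketch only: the script paths and flags are copied from the commands above, while the `STEPS` list and `run_pipeline` helper are hypothetical names, not part of the skill. It defaults to a dry run that prints each command without executing it.

```python
import shlex
import subprocess

# Script paths and flags are taken from the six steps above.
STEPS = [
    ["scripts/01_organize_metadata.py", "--source-type", "bibtex",
     "--source", "library.bib", "--pdf-dir", "pdfs/", "--output", "metadata.json"],
    ["scripts/02_filter_abstracts.py", "--metadata", "metadata.json",
     "--backend", "anthropic-haiku", "--use-batches", "--output", "filtered_papers.json"],
    ["scripts/03_extract_from_pdfs.py", "--metadata", "filtered_papers.json",
     "--schema", "my_schema.json", "--method", "batches", "--output", "extracted_data.json"],
    ["scripts/04_repair_json.py", "--input", "extracted_data.json",
     "--schema", "my_schema.json", "--output", "cleaned_data.json"],
    ["scripts/05_validate_with_apis.py", "--input", "cleaned_data.json",
     "--apis", "my_api_config.json", "--output", "validated_data.json"],
    ["scripts/06_export_database.py", "--input", "validated_data.json",
     "--format", "python", "--output", "results"],
]


def run_pipeline(dry_run: bool = True) -> list:
    """Print each step's command; execute it only when dry_run is False."""
    lines = []
    for args in STEPS:
        cmd = ["python", *args]
        line = shlex.join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return lines


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Stopping at the first failure matters here because each step reads the previous step's output file.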
Calculate extraction quality metrics:
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper
# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
Validation produces precision, recall, and F1 metrics per field and overall.
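As an illustration of how these metrics are defined, here is a sketch for one field of one paper. It assumes exact string matching between extracted and annotated values; how scripts/08_calculate_validation_metrics.py actually matches values is not stated here, and the `prf1` helper and species names are hypothetical.

```python
def prf1(extracted, ground_truth):
    """Set-based precision, recall, and F1 using exact matching."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical values: two of three extracted names are correct,
# and two of four annotated names were found.
extracted = ["Apis mellifera", "Bombus terrestris", "Osmia bicornis"]
annotated = ["Apis mellifera", "Bombus terrestris", "Xylocopa violacea", "Andrena flavipes"]

precision, recall, f1 = prf1(extracted, annotated)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Low precision points to values the extraction invented or misread; low recall points to values it missed.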
Access comprehensive guides in the references/ directory:
Setup and installation:
cat references/setup_guide.md
Complete workflow with examples:
cat references/workflow_guide.md
Validation methodology:
cat references/validation_guide.md
API integration details:
cat references/api_reference.md
Modify my_schema.json to match the research domain:
Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.
Configure external database validation in my_api_config.json:
Map extracted fields to validation APIs:
gbif_taxonomy - Biological taxonomy
wfo_plants - Plant names specifically
geonames - Geographic locations
geocode - Address to coordinates
pubchem - Chemical compounds
ncbi_gene - Gene identifiers
See assets/example_api_config_ecology.json for ecology-specific example.
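A sketch of the field-to-API mapping idea. The layout of my_api_config.json is not shown here, so the flat `field_to_api` dictionary and its field names are assumptions for illustration; only the six API names come from the list above.

```python
# The six validator names listed above.
KNOWN_APIS = {
    "gbif_taxonomy", "wfo_plants", "geonames",
    "geocode", "pubchem", "ncbi_gene",
}

# Hypothetical mapping from extracted field names to validators.
# The real config format may differ; see assets/api_config_template.json.
field_to_api = {
    "plant_species": "wfo_plants",
    "visitor_species": "gbif_taxonomy",
    "location": "geonames",
}


def unknown_apis(mapping: dict) -> list:
    """Return validator names in the mapping that are not in KNOWN_APIS."""
    return sorted({api for api in mapping.values() if api not in KNOWN_APIS})


print(unknown_apis(field_to_api))
```

A check like this catches a misspelled API name before the validation step makes any network calls.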
Edit filtering criteria in scripts/02_filter_abstracts.py (line 74):
Replace TODO section with domain-specific criteria:
Use conservative criteria (when in doubt, include paper) to avoid false negatives.
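One way to express the "when in doubt, include" rule is to exclude a paper only on an explicit negative verdict. This is a sketch under assumptions: the `keep_paper` helper and the yes/no/unsure verdict labels are hypothetical and are not taken from scripts/02_filter_abstracts.py.

```python
from typing import Optional


def keep_paper(verdict: Optional[str]) -> bool:
    """Keep a paper unless the filter returned an explicit 'no'.

    Missing, empty, or ambiguous verdicts all count as 'keep', which
    trades some extra extraction cost for fewer false negatives.
    """
    if verdict is None:
        return True
    return verdict.strip().lower() != "no"


# Hypothetical verdicts as a filtering backend might return them.
for verdict in ["yes", "no", "unsure", "", None]:
    print(repr(verdict), keep_paper(verdict))
```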
Backend selection for filtering (Step 2):
Typical costs for 100 papers:
Optimization strategies:
--use-caching
--use-batches
Validation workflow provides:
Use metrics to:
Recommended sample sizes:
See references/validation_guide.md for detailed guidance on interpreting metrics and improving extraction quality.
Data organization:
scripts/01_organize_metadata.py - Standardize PDFs and metadata
Filtering:
scripts/02_filter_abstracts.py - Filter by abstract (Haiku/Sonnet/Ollama)
Extraction:
scripts/03_extract_from_pdfs.py - Extract from PDFs with Claude vision
Processing:
scripts/04_repair_json.py - Repair and validate JSON
scripts/05_validate_with_apis.py - Enrich with external databases
scripts/06_export_database.py - Export to analysis formats
Validation:
scripts/07_prepare_validation_set.py - Sample papers for annotation
scripts/08_calculate_validation_metrics.py - Calculate P/R/F1 metrics
Templates:
assets/schema_template.json - Blank extraction schema template
assets/api_config_template.json - API validation configuration template
Examples:
assets/example_flower_visitors_schema.json - Ecology extraction example
assets/example_api_config_ecology.json - Ecology API validation example