aiskillstore/extract-from-pdfs

Name: extract-from-pdfs
Author: aiskillstore

skills/brunoasm/extract-from-pdfs/SKILL.md

npx skillsauth add aiskillstore/marketplace extract-from-pdfs

Extract Structured Data from Scientific PDFs

Purpose

Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.

Core capabilities:

Organize metadata from BibTeX, RIS, directories, or DOI lists
Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
Extract structured data from PDFs with customizable schemas
Repair and validate JSON outputs automatically
Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
Calculate precision/recall metrics for quality assurance
Export to Python, R, CSV, Excel, or SQLite

When to Use This Skill

Use when:

Conducting systematic literature reviews requiring data extraction
Building databases from scientific publications
Converting PDF collections to structured datasets
Validating extraction quality with ground truth metrics
Comparing extraction approaches (different models, prompts)

Do not use for:

Single PDF summarization (use basic PDF reading instead)
Full-text PDF search (use document search tools)
PDF editing or manipulation

Getting Started

1. Initial Setup

Read the setup guide for installation and configuration:

cat references/setup_guide.md

Key setup steps:

Install dependencies: conda env create -f environment.yml
Set API keys: export ANTHROPIC_API_KEY='your-key'
Optional: Install Ollama for free local filtering

2. Define Extraction Requirements

Ask the user:

Research domain and extraction goals
How PDFs are organized (reference manager, directory, DOI list)
Approximate collection size
Preferred analysis environment (Python, R, etc.)

Ask the user to provide 2-3 example PDFs so their structure can be analyzed and the schema designed around them.

3. Design Extraction Schema

Create a custom schema from the template:

cp assets/schema_template.json my_schema.json

Customize for the specific domain:

Set objective describing what to extract
Define output_schema with field types and descriptions
Add domain-specific instructions for Claude
Provide output_example showing desired format

See assets/example_flower_visitors_schema.json for a real-world ecology example.
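As a minimal sketch of this step, the Python snippet below builds a schema containing the four parts listed above and checks that the example output covers every required field. The top-level keys (objective, instructions, output_schema, output_example) come from that list. The field names, the nesting, and the helpers check_example and save_schema are hypothetical, and the actual layout of assets/schema_template.json may differ, so treat the template as the authority.

```python
import json

# Hypothetical schema: field names and nesting are illustrative assumptions,
# not the contents of assets/schema_template.json.
schema = {
    "objective": "Extract flower-visitor observations reported in each paper.",
    "instructions": [
        "Record one entry per plant-visitor pair.",
        "Use null when a value is not reported.",
    ],
    "output_schema": {
        "type": "object",
        "properties": {
            "plant_species": {"type": "string", "description": "Scientific name of the plant"},
            "visitor_species": {"type": "string", "description": "Scientific name of the visitor"},
            "location": {"type": ["string", "null"], "description": "Study site, if reported"},
        },
        "required": ["plant_species", "visitor_species"],
    },
    "output_example": {
        "plant_species": "Salvia pratensis",
        "visitor_species": "Bombus terrestris",
        "location": None,
    },
}


def check_example(schema):
    """Return the required fields that are missing from output_example."""
    required = schema["output_schema"].get("required", [])
    return [name for name in required if name not in schema["output_example"]]


def save_schema(schema, path="my_schema.json"):
    """Write the schema to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(schema, handle, indent=2)


missing = check_example(schema)
print("Missing required fields in example:", missing)
```

Calling save_schema(schema) would then write my_schema.json for use in Step 3. Checking the example against the required fields before running an extraction is a cheap way to catch a schema whose example and definition have drifted apart.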

Workflow Execution

Complete Pipeline

Run the 6-step pipeline (plus optional validation):

# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source library.bib \
  --pdf-dir pdfs/ \
  --output metadata.json

# Step 2: Filter papers (optional - recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json

# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json

# Step 4: Repair JSON
python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json

# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json

# Step 6: Export to analysis format
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --output results
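The six steps above can also be chained from a small Python driver. This is a sketch, not part of the skill: the helper names pipeline_commands and run_pipeline are hypothetical, the arguments are copied from the commands above, and it assumes each script exits with a non-zero status on failure.

```python
import subprocess
import sys


def pipeline_commands(schema="my_schema.json", api_config="my_api_config.json"):
    """Return the six pipeline commands as argument lists, in execution order."""
    py = sys.executable  # run the scripts with the current interpreter
    return [
        [py, "scripts/01_organize_metadata.py", "--source-type", "bibtex",
         "--source", "library.bib", "--pdf-dir", "pdfs/",
         "--output", "metadata.json"],
        [py, "scripts/02_filter_abstracts.py", "--metadata", "metadata.json",
         "--backend", "anthropic-haiku", "--use-batches",
         "--output", "filtered_papers.json"],
        [py, "scripts/03_extract_from_pdfs.py", "--metadata", "filtered_papers.json",
         "--schema", schema, "--method", "batches",
         "--output", "extracted_data.json"],
        [py, "scripts/04_repair_json.py", "--input", "extracted_data.json",
         "--schema", schema, "--output", "cleaned_data.json"],
        [py, "scripts/05_validate_with_apis.py", "--input", "cleaned_data.json",
         "--apis", api_config, "--output", "validated_data.json"],
        [py, "scripts/06_export_database.py", "--input", "validated_data.json",
         "--format", "python", "--output", "results"],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd[1:]))
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling run_pipeline(pipeline_commands()) from the skill directory would execute the steps in sequence. Passing argument lists rather than a single shell string avoids quoting problems with file names that contain spaces.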

Validation (Optional but Recommended)

Calculate extraction quality metrics:

# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json

# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper

# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt

Validation produces precision, recall, and F1 metrics per field and overall.

Detailed Documentation

Access comprehensive guides in the references/ directory:

Setup and installation:

cat references/setup_guide.md

Complete workflow with examples:

cat references/workflow_guide.md

Validation methodology:

cat references/validation_guide.md

API integration details:

cat references/api_reference.md

Customization

Schema Customization

Modify my_schema.json to match the research domain:

Objective: Describe what data to extract
Instructions: Step-by-step extraction guidance
Output schema: JSON schema defining structure
Important notes: Domain-specific rules
Examples: Show desired output format

Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.

API Configuration

Configure external database validation in my_api_config.json:

Map extracted fields to validation APIs:

gbif_taxonomy - Biological taxonomy
wfo_plants - Plant names specifically
geonames - Geographic locations
geocode - Address to coordinates
pubchem - Chemical compounds
ncbi_gene - Gene identifiers

See assets/example_api_config_ecology.json for an ecology-specific example.

Filtering Customization

Edit filtering criteria in scripts/02_filter_abstracts.py (line 74):

Replace TODO section with domain-specific criteria:

What constitutes primary data vs review?
What data types are relevant?
What scope (geographic, temporal, taxonomic) is needed?

Use conservative criteria (when in doubt, include the paper) to avoid false negatives.

Cost Optimization

Backend selection for filtering (Step 2):

Ollama (local): $0 - Best for privacy and high volume
Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
Sonnet (API): ~$3/M tokens - Best for complex filtering

Typical costs for 100 papers:

With filtering (Haiku + Sonnet): ~$4
With local Ollama + Sonnet: ~$3.75
Without filtering (Sonnet only): ~$7.50
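The figures above are consistent with a simple per-token estimate. The sketch below reproduces two of them under assumptions that the source does not state: roughly 25,000 Sonnet tokens per paper, and a filter that keeps half of the papers. The function name extraction_cost is hypothetical.

```python
def extraction_cost(n_papers, tokens_per_paper, price_per_million, keep_fraction=1.0):
    """Estimated API cost in dollars for extracting from papers that pass filtering."""
    tokens = n_papers * keep_fraction * tokens_per_paper
    return tokens / 1_000_000 * price_per_million


SONNET_PRICE = 3.00        # ~$3/M tokens, from the backend list above
TOKENS_PER_PAPER = 25_000  # assumption, chosen to match the ~$7.50 figure

no_filter = extraction_cost(100, TOKENS_PER_PAPER, SONNET_PRICE)
with_filter = extraction_cost(100, TOKENS_PER_PAPER, SONNET_PRICE, keep_fraction=0.5)

print(f"Sonnet only: ${no_filter:.2f}")           # Sonnet only: $7.50
print(f"Free filter + Sonnet: ${with_filter:.2f}")  # Free filter + Sonnet: $3.75
```

Under the same assumptions, the remaining ~$0.25 in the Haiku + Sonnet figure would be the cost of the filtering pass itself. Actual costs depend on PDF length, the share of papers that pass the filter, and current API pricing.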

Optimization strategies:

Use abstract filtering to reduce PDF processing
Use local Ollama for filtering (free)
Enable prompt caching with --use-caching
Process in batches with --use-batches

Quality Assurance

Validation workflow provides:

Precision: % of extracted items that are correct
Recall: % of true items that were extracted
F1 score: Harmonic mean of precision and recall
Per-field metrics: Identify weak fields
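The three definitions above can be sketched for a single field as follows. This assumes items are compared by exact match after being collected into sets; the skill's own metrics script may match values differently, and the function name and species names here are hypothetical.

```python
def precision_recall_f1(extracted, ground_truth):
    """Compute precision, recall, and F1 for one field using exact set matching."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0

    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical example: 3 extracted species, 4 annotated species, 2 in common.
extracted = ["Apis mellifera", "Bombus terrestris", "Xylocopa violacea"]
annotated = ["Apis mellifera", "Bombus terrestris", "Osmia bicornis", "Eristalis tenax"]

p, r, f1 = precision_recall_f1(extracted, annotated)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Here precision is 2/3 because one extracted item is wrong, recall is 2/4 because two annotated items were missed, and F1 works out to 4/7.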

Use metrics to:

Establish baseline extraction quality
Compare different approaches (models, prompts, schemas)
Identify areas for improvement
Report extraction quality in publications

Recommended sample sizes:

Small projects (<100 papers): 10-20 papers
Medium projects (100-500 papers): 20-50 papers
Large projects (>500 papers): 50-100 papers
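The size guidance above can be written as a small lookup. This is a sketch with a hypothetical function name; capping the range at the collection size is an added assumption for very small collections.

```python
def recommended_sample_range(n_papers):
    """Return (low, high) validation sample sizes for a collection of n_papers."""
    if n_papers < 100:
        low, high = 10, 20      # small projects
    elif n_papers <= 500:
        low, high = 20, 50      # medium projects
    else:
        low, high = 50, 100     # large projects
    # Assumption: never recommend sampling more papers than exist.
    return min(low, n_papers), min(high, n_papers)


for size in (40, 250, 1200):
    print(size, recommended_sample_range(size))
# 40 (10, 20)
# 250 (20, 50)
# 1200 (50, 100)
```

The upper bound of the returned range could be passed as --sample-size in Step 7 when annotation time allows.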

Iterative Improvement

Run initial extraction with baseline schema
Validate on sample using Steps 7-9
Analyze field-level metrics and error patterns
Revise schema, prompts, or model selection
Re-extract and re-validate
Compare metrics to verify improvement
Repeat until acceptable quality is achieved

See references/validation_guide.md for detailed guidance on interpreting metrics and improving extraction quality.

Available Scripts

Data organization:

scripts/01_organize_metadata.py - Standardize PDFs and metadata

Filtering:

scripts/02_filter_abstracts.py - Filter by abstract (Haiku/Sonnet/Ollama)

Extraction:

scripts/03_extract_from_pdfs.py - Extract from PDFs with Claude vision

Processing:

scripts/04_repair_json.py - Repair and validate JSON
scripts/05_validate_with_apis.py - Enrich with external databases
scripts/06_export_database.py - Export to analysis formats

Validation:

scripts/07_prepare_validation_set.py - Sample papers for annotation
scripts/08_calculate_validation_metrics.py - Calculate P/R/F1 metrics

Assets

Templates:

assets/schema_template.json - Blank extraction schema template
assets/api_config_template.json - API validation configuration template

Examples:

assets/example_flower_visitors_schema.json - Ecology extraction example
assets/example_api_config_ecology.json - Ecology API validation example

This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.

Stars: 231
Category: testing
Updated: Mar 29, 2026

npx skillsauth add aiskillstore/marketplace extract-from-pdfs

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean.

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner - Clean - 95%
Semgrep - Static code analysis for vulnerabilities - Clean - 95%
mcp-scan (Snyk) - Model Context Protocol security validation - Clean - 95%
Snyk (dep) - Open source security scanning - Skipped - 50%
Socket.dev - Supply chain security analysis - Skipped - 50%
VirusTotal - Multi-engine malware detection - Skipped - 50%
CrowdStrike - Advanced threat intelligence - Skipped - 50%
OSV-Scanner - Open Source Vulnerability database check - Skipped - 50%
OWASP Dep-Check - Skipped - 50%

Last scanned: Mar 31, 2026, 1:37 PM (64.5s, 20 files scanned)

Install manually

# Clone the repo
git clone https://github.com/aiskillstore/marketplace.git

# Copy into Claude Code skills folder (global)
cp -r marketplace/skills/brunoasm/extract-from-pdfs ~/.claude/skills/

Repository

aiskillstore/marketplace

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT