skills/43-wentorai-research-plugins/skills/literature/fulltext/open-access-mining-guide/SKILL.md
Mine open access full-text repositories for research data extraction
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research open-access-mining-guideInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A skill for systematically mining open access full-text repositories to extract structured research data at scale. Covers legal frameworks for text and data mining (TDM), major open access repositories and their APIs, full-text retrieval and parsing, section-level extraction, entity recognition in scientific text, and building reproducible mining pipelines.
Text and data mining of published literature operates within a specific legal framework that varies by jurisdiction. Understanding these rules is essential before starting any mining project.
Legal landscape for TDM:
EU Directive 2019/790 (DSM Directive):
- Article 3: TDM exception for research organizations
- Lawful access required (institutional subscription counts)
- Must be for scientific research purposes
- No opt-out possible for publishers
- Applies to EU/EEA research institutions
- Article 4: General TDM exception
- Available to anyone with lawful access
- Publishers CAN opt out (via robots.txt or metadata)
UK: TDM exception for non-commercial research (CDPA s.29A)
US: No specific TDM law; relies on fair use doctrine
- Transformative use generally favored by courts
- Google Books case (2015) supports large-scale text analysis
- But: database protection via Terms of Service
Practical guidelines:
- Mine open access content (CC-BY, CC-BY-SA) freely
- Mine subscription content under institutional license
- Check publisher TDM policies (Elsevier, Springer, Wiley
all have TDM APIs for licensed content)
- Never redistribute full text; share derived data only
- Credit the data source in publications
Repository overview for full-text mining:
PubMed Central (PMC):
- Coverage: 8M+ full-text articles (biomedical/life sciences)
- Access: Free, OA subset freely downloadable
- Formats: XML (JATS), PDF
- API: E-utilities (Entrez), bulk FTP download
- License: varies by article (check individual licenses)
- Best for: biomedical systematic reviews, meta-analyses
- Bulk download: ftp.ncbi.nlm.nih.gov/pub/pmc/
Europe PMC:
- Coverage: PMC content + European-funded research
- Access: Free, REST API
- Formats: XML, JSON
- API: europepmc.org/RestfulWebService
- Annotations: sentence-level annotations, concepts, data links
- Best for: European research, annotated text mining
CORE (core.ac.uk):
- Coverage: 200M+ metadata records, 36M+ full texts
- Access: Free API (registration required)
- Formats: JSON, full text as extracted plain text
- Sources: aggregates from 10,000+ repositories worldwide
- Best for: cross-disciplinary mining, thesis/dissertation text
arXiv:
- Coverage: 2M+ preprints (physics, math, CS, etc.)
- Access: Free bulk download, API
- Formats: LaTeX source, PDF
- Bulk: Kaggle dataset, S3 requester-pays bucket
- Best for: STEM preprint analysis, citation studies
Unpaywall / OpenAlex:
- Coverage: tracks OA status of 200M+ works
- Access: Free API, database dump
- Use: Find OA versions of any DOI
- Best for: Locating freely available versions of papers
OpenAlex:
- Coverage: 250M+ works, all disciplines
- Access: Free API, no key required
- Features: Concepts, citation counts, author profiles, institution data
- Best for: Cross-disciplinary metadata and OA discovery
import requests
import xml.etree.ElementTree as ET
import time
def fetch_pmc_fulltext(pmc_id):
"""
Fetch full-text XML from PubMed Central via E-utilities.
Args:
pmc_id: PMC identifier (e.g., "PMC7096724")
Returns:
Parsed article as structured dictionary
"""
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {
"db": "pmc",
"id": pmc_id.replace("PMC", ""),
"rettype": "xml",
}
response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()
root = ET.fromstring(response.content)
article = parse_jats_xml(root)
return article
def parse_jats_xml(root):
"""
Parse JATS XML (Journal Article Tag Suite) into structured data.
JATS is the standard XML format for PMC articles.
"""
article = {}
# Title
title_elem = root.find(".//article-title")
article["title"] = "".join(title_elem.itertext()) if title_elem is not None else ""
# Abstract
abstract_elem = root.find(".//abstract")
if abstract_elem is not None:
article["abstract"] = "".join(abstract_elem.itertext()).strip()
# Body sections
body = root.find(".//body")
if body is not None:
article["sections"] = extract_sections(body)
# References
ref_list = root.find(".//ref-list")
if ref_list is not None:
article["references"] = extract_references(ref_list)
return article
def extract_sections(body_element):
"""
Extract sections with their titles and text content.
Preserves the hierarchical structure of the paper.
"""
sections = []
for sec in body_element.findall(".//sec"):
title_elem = sec.find("title")
title = title_elem.text if title_elem is not None else "Untitled"
paragraphs = []
for p in sec.findall("p"):
text = "".join(p.itertext()).strip()
if text:
paragraphs.append(text)
sections.append({
"title": title,
"text": "\n".join(paragraphs),
"id": sec.get("id", ""),
})
return sections
def batch_mine_pmc(pmc_ids, output_dir, delay=0.4):
"""
Mine multiple PMC articles with rate limiting.
NCBI E-utilities rate limit:
- Without API key: 3 requests/second
- With API key: 10 requests/second
- Register for API key at ncbi.nlm.nih.gov/account/
"""
import json
import os
results = []
errors = []
for i, pmc_id in enumerate(pmc_ids):
try:
article = fetch_pmc_fulltext(pmc_id)
results.append(article)
# Save individual article
output_path = os.path.join(output_dir, f"{pmc_id}.json")
with open(output_path, "w") as f:
json.dump(article, f, indent=2)
if (i + 1) % 100 == 0:
print(f"Processed {i + 1}/{len(pmc_ids)} articles")
except Exception as e:
errors.append({"pmc_id": pmc_id, "error": str(e)})
# Rate limiting
time.sleep(delay)
print(f"Successfully mined {len(results)} articles, "
f"{len(errors)} errors")
return results, errors
Targeted extraction by paper section:
Introduction:
- Research questions and hypotheses
- Knowledge gaps identified
- Theoretical framework references
Methods:
- Study design (RCT, cohort, case-control, etc.)
- Sample size and population characteristics
- Measurement instruments and their validity
- Statistical analysis methods
- Software and versions used
Results:
- Effect sizes with confidence intervals
- P-values and test statistics
- Participant flow (enrollment, dropout, analysis)
- Tables and figures (structured data)
Discussion:
- Key findings summarized
- Comparison with prior work
- Limitations acknowledged
- Future directions proposed
- Clinical/practical implications
def extract_scientific_entities(text):
"""
Extract scientific named entities from full text.
For biomedical text, use specialized NER models:
- SciSpaCy: biomedical NER (diseases, chemicals, genes)
- BioBERT: contextual biomedical NER
- PubTator: NCBI's annotation service
"""
import scispacy
import spacy
nlp = spacy.load("en_ner_bionlp13cg_md")
doc = nlp(text)
entities = []
for ent in doc.ents:
entities.append({
"text": ent.text,
"label": ent.label_,
"start": ent.start_char,
"end": ent.end_char,
})
return entities
Recommended pipeline structure:
1. Query definition:
- Define search terms, date ranges, inclusion criteria
- Document in a protocol file (version-controlled)
2. Article retrieval:
- Search API for matching articles
- Download full text (XML/PDF)
- Store raw data with metadata
3. Text extraction:
- Parse XML or extract text from PDF
- Section segmentation
- Table and figure extraction (if needed)
4. Information extraction:
- NER for entities of interest
- Relation extraction
- Numeric data extraction (effect sizes, p-values)
5. Quality control:
- Sample-based manual validation (10-20% of results)
- Inter-annotator agreement on validation sample
- Error analysis and pipeline refinement
6. Data export:
- Structured output (CSV, JSON, database)
- Provenance tracking (which article, which section)
- Ready for downstream analysis
Best practices:
- Version control the entire pipeline code
- Log all API queries and responses
- Set random seeds for any sampling steps
- Share the pipeline code in supplementary materials
- Use DOIs or PMCIDs as stable article identifiers
- Cache downloaded articles to avoid re-fetching
Open access full-text mining enables research at a scale impossible with manual reading. A single researcher can systematically extract data from thousands of papers, enabling comprehensive evidence synthesis, trend analysis, and hypothesis generation. The key requirements are respecting legal and ethical boundaries, building robust parsing pipelines, and rigorously validating extracted data against manual review.
development
Conduct rigorous thematic analysis (TA) of qualitative data following Braun and Clarke's (2006) six-phase framework. Use whenever the user mentions 'thematic analysis', 'TA', 'Braun and Clarke', 'qualitative coding', 'identifying themes', or asks for help analysing interviews, focus groups, open-ended survey responses, or transcripts to identify patterns. Also trigger for questions about inductive vs theoretical coding, semantic vs latent themes, essentialist vs constructionist epistemology, building a thematic map, or writing up a qualitative findings section. Covers all six phases, the four upfront analytic decisions, the 15-point quality checklist, and the five common pitfalls. Produces a Word document write-up and an annotated thematic map. Does NOT cover IPA, grounded theory, discourse analysis, conversation analysis, or narrative analysis — use a different method for those.
development
Guide users through writing a systematic literature review (SLR) following the PRISMA 2020 framework. Use this skill whenever the user mentions 'systematic review', 'systematic literature review', 'SLR', 'PRISMA', 'PRISMA 2020', 'PRISMA flow diagram', 'PRISMA checklist', or asks for help writing, structuring, or auditing a literature review that follows reporting guidelines. Also trigger when the user asks about inclusion/exclusion criteria for a review, search strategies for databases like Scopus/WoS/PubMed, study selection processes, risk of bias assessment, or narrative synthesis for a review paper. This skill covers the full PRISMA 2020 checklist (27 items), produces a Word document manuscript in strict journal article format, generates an annotated PRISMA flow diagram, and enforces APA 7th Edition referencing throughout. It does NOT cover meta-analysis or statistical pooling. By Chuah Kee Man.
testing
Performs placebo-in-time sensitivity analysis with hierarchical null model and optional Bayesian assurance. Use when checking model robustness, verifying lack of pre-intervention effects, or estimating study power.
data-ai
Fit, summarize, plot, and interpret a chosen CausalPy experiment. Use after the causal method has been selected, including when configuring PyMC/sklearn models and scale-aware custom priors.