skills/literature/fulltext/bioc-pmc-api/SKILL.md
Access PMC Open Access articles in BioC format for text mining
npx skillsauth add wentorai/research-plugins bioc-pmc-api
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The BioC API provides full-text articles from PubMed Central (PMC) in the BioC format, a simplified XML/JSON structure designed specifically for biomedical text mining. Unlike the standard PMC OAI service (which returns JATS XML), BioC pre-segments text into passages with offset annotations, which makes it well suited to NLP pipelines, named entity recognition, relation extraction, and other text mining tasks. The API is free and requires no authentication.
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{PMCID}/unicode
# JSON format (recommended for programmatic use)
curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC6267067/unicode"
# XML format
curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/PMC6267067/unicode"
# ASCII encoding (strips non-ASCII characters)
curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC6267067/ascii"
# Convert PMID to PMCID first, then query
curl "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=29346600&format=json"
# Returns: {"records": [{"pmid": "29346600", "pmcid": "PMC6267067", ...}]}
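The PMID-to-PMCID lookup above can be wrapped in a small parsing helper. The sketch below works only on an already-decoded response of the shape shown in the comment; the function name `pmcid_from_idconv` is hypothetical, and a record without a `pmcid` key is assumed to mean that no PMC copy is available.

```python
from typing import Optional


def pmcid_from_idconv(payload: dict, pmid: str) -> Optional[str]:
    """Return the PMCID for `pmid` from a decoded ID-converter response, or None."""
    for record in payload.get("records", []):
        # Compare as strings: the sample response carries the PMID as a string.
        if str(record.get("pmid")) == str(pmid):
            # Assumption: a missing "pmcid" key means the article is not in PMC.
            return record.get("pmcid")
    return None


# Sample payload copied from the response shape shown above.
sample = {"records": [{"pmid": "29346600", "pmcid": "PMC6267067"}]}
print(pmcid_from_idconv(sample, "29346600"))
```

Returning `None` instead of raising lets a batch job skip articles that have no full text without extra error handling.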
{
  "source": "PMC",
  "date": "2024-01-15",
  "key": "collection.key",
  "documents": [
    {
      "id": "PMC6267067",
      "passages": [
        {
          "infons": {
            "section_type": "TITLE",
            "type": "title"
          },
          "offset": 0,
          "text": "Article Title Here"
        },
        {
          "infons": {
            "section_type": "ABSTRACT",
            "type": "abstract"
          },
          "offset": 25,
          "text": "Background: This study investigates..."
        },
        {
          "infons": {
            "section_type": "INTRO",
            "type": "paragraph"
          },
          "offset": 350,
          "text": "The introduction text..."
        }
      ]
    }
  ]
}
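Because each passage carries its own `offset`, a match found inside a passage's `text` can be mapped back to a position in the whole document by adding the two numbers. The following is a minimal sketch under that assumption (offsets counted in characters from the document start); the helper name `find_term` is hypothetical and the sample data simply mirrors the structure above.

```python
import re
from typing import Iterator, Tuple


def find_term(bioc_collection: dict, term: str) -> Iterator[Tuple[str, str, int]]:
    """Yield (document id, section_type, document-level offset) for each match."""
    if not term:
        return  # an empty pattern would match at every position
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for doc in bioc_collection.get("documents", []):
        for passage in doc.get("passages", []):
            base = passage.get("offset", 0)
            section = passage.get("infons", {}).get("section_type", "OTHER")
            for match in pattern.finditer(passage.get("text", "")):
                # Passage-relative position + passage offset = document position.
                yield doc.get("id", ""), section, base + match.start()


sample = {
    "documents": [
        {
            "id": "PMC6267067",
            "passages": [
                {"infons": {"section_type": "TITLE"}, "offset": 0,
                 "text": "Article Title Here"},
                {"infons": {"section_type": "INTRO"}, "offset": 350,
                 "text": "The introduction text..."},
            ],
        }
    ]
}
for hit in find_term(sample, "text"):
    print(hit)
```

Keeping the section type alongside each hit makes it easy to weight matches differently, for example treating a mention in RESULTS as stronger evidence than one in REF.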
Key fields:
- passages[].infons.section_type: TITLE, ABSTRACT, INTRO, METHODS, RESULTS, DISCUSS, CONCL, REF, FIG, TABLE
- passages[].offset: Character offset from document start
- passages[].text: Plain text content of the passage

import requests
import json


def get_bioc_article(pmcid: str, fmt: str = "json") -> dict:
    """Fetch a PMC article in BioC format."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_{fmt}/{pmcid}/unicode"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json() if fmt == "json" else resp.text


def extract_sections(bioc_doc: dict) -> dict:
    """Extract text organized by section type."""
    sections = {}
    for doc in bioc_doc.get("documents", []):
        for passage in doc.get("passages", []):
            section = passage.get("infons", {}).get("section_type", "OTHER")
            text = passage.get("text", "")
            sections.setdefault(section, []).append(text)
    return {k: "\n".join(v) for k, v in sections.items()}


# Example: fetch and parse
article = get_bioc_article("PMC6267067")
sections = extract_sections(article)
print(f"Title: {sections.get('TITLE', 'N/A')}")
print(f"Abstract length: {len(sections.get('ABSTRACT', ''))} chars")
print(f"Sections found: {list(sections.keys())}")
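For NLP pipelines it is often useful to keep only narrative passages and drop references, figures, and tables. The sketch below filters passages by `section_type` before joining them. The function name `body_text` and the default set of excluded types are illustrative assumptions, not part of the API.

```python
from typing import Iterable

# Assumption: these section types are usually noise for narrative text mining.
DEFAULT_EXCLUDED = frozenset({"REF", "FIG", "TABLE"})


def body_text(bioc_collection: dict, excluded: Iterable[str] = DEFAULT_EXCLUDED) -> str:
    """Join passage texts in document order, skipping excluded section types."""
    skip = set(excluded)
    parts = []
    for doc in bioc_collection.get("documents", []):
        for passage in doc.get("passages", []):
            section = passage.get("infons", {}).get("section_type", "OTHER")
            text = passage.get("text", "")
            if section not in skip and text:
                parts.append(text)
    return "\n".join(parts)


# Hand-written sample in the BioC JSON shape; no network access needed.
sample = {
    "documents": [
        {
            "id": "PMC0000000",
            "passages": [
                {"infons": {"section_type": "TITLE"}, "offset": 0, "text": "A title"},
                {"infons": {"section_type": "REF"}, "offset": 8, "text": "Some reference"},
                {"infons": {"section_type": "RESULTS"}, "offset": 23, "text": "A result"},
            ],
        }
    ]
}
print(body_text(sample))
```

Passing `excluded=()` keeps every passage, which is handy when comparing the filtered and unfiltered text for one article.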
Add tool=your_tool_name&[email protected] to requests for the priority queue.
When using this API in publications, cite:
Comeau DC, Wei CH, Islamaj Dogan R, Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing. Bioinformatics, btz070, 2019.