database-access/uniprot-access/SKILL.md
Query UniProt's REST API (post-2022 endpoint at rest.uniprot.org) for protein sequences, annotations, GO terms, cross-references, ID mappings, and proteomes. Use when fetching UniProtKB entries, navigating the JSON schema, choosing between UniProtKB/UniRef/UniParc/Proteomes resources, deciding stream vs search endpoint for batch retrieval, running ID-mapping jobs with the async pattern, handling isoform suffixes, or filtering reviewed Swiss-Prot vs auto-annotated TrEMBL. Encodes the legacy URL migration (2022), the new JSON schema layout, and bulk-pull patterns.
Reference examples tested with: requests 2.31+, pandas 2.2+; UniProt REST API as of 2024_06 release
Before using code patterns, verify that installed versions match:

```
pip show requests pandas
```

The REST API JSON schema is stable within a release; major schema changes are documented at https://www.uniprot.org/release-notes. The 2022 migration broke the legacy https://www.uniprot.org/uniprot/... endpoints.
"Get protein information from UniProt" -> Two facts dominate every UniProt workflow in 2026: (1) the API endpoint migrated in 2022 from https://www.uniprot.org/uniprot/... to https://rest.uniprot.org/uniprotkb/... with a substantially different JSON schema; pre-2022 code does not work as-is. (2) ?fields= is essential — default JSON returns the full entry (~20-30 KB each); for bulk pulls, request only the fields actually needed.
The major databases under the UniProt umbrella have different scopes:
UniProtKB: the curated knowledgebase — Swiss-Prot (manually reviewed, ~570K entries as of 2024) + TrEMBL (auto-annotated, ~250M). Always specify reviewed:true for high-quality reference work.
UniRef: clustered sequences at 100%, 90%, 50% identity. UniRef50 is the standard for redundancy reduction.
UniParc: archival "every unique sequence ever seen" — for provenance and historical lookup.
Proteomes: organism-level groupings; reference proteomes (one per species) are the canonical subset.
Python: requests.get('https://rest.uniprot.org/uniprotkb/...') (REST API)
Python: Bio.ExPASy.get_sprot_raw() (BioPython; legacy SwissProt format)
CLI: curl https://rest.uniprot.org/uniprotkb/P04637.json
```python
import requests
import pandas as pd
import time
```
No API key required. Rate limit is generous (~200 req/sec tolerated empirically); ID-mapping has its own job queue.
Base: https://rest.uniprot.org/
| Resource | Endpoint | Use |
|---|---|---|
| Single entry | /uniprotkb/{accession} | One protein record |
| Search | /uniprotkb/search | Query with up to 500 results per page |
| Stream | /uniprotkb/stream | No 500-result limit; for bulk |
| Batch by accession | /uniprotkb/accessions | Multiple specific accessions |
| ID Mapping (run) | /idmapping/run | Submit conversion job |
| ID Mapping (status) | /idmapping/status/{jobId} | Poll |
| ID Mapping (results) | /idmapping/results/{jobId} | Retrieve |
| UniRef entry | /uniref/{cluster_id} | One cluster |
| UniRef search | /uniref/search | UniRef cluster queries |
| Proteome | /proteomes/{upid} | Organism proteome |
| Proteome FASTA | /proteomes/{upid}.fasta.gz | Download whole proteome |
| Taxonomy | /taxonomy/{taxid} | Taxonomy info |
Append .json, .fasta, .tsv, .xml, .txt, or .gff to single-entry URLs to control format.
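The suffix rule above can be wrapped in a small helper. This is a minimal sketch: `entry_url` and `ALLOWED_FORMATS` are hypothetical names, and the format set simply mirrors the suffixes listed above. No request is made.

```python
BASE = 'https://rest.uniprot.org'
# Suffixes listed above for single-entry URLs
ALLOWED_FORMATS = {'json', 'fasta', 'tsv', 'xml', 'txt', 'gff'}

def entry_url(accession, fmt='json', resource='uniprotkb'):
    '''Build a single-entry URL such as .../uniprotkb/P04637.fasta.'''
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f'unsupported format: {fmt!r}')
    return f'{BASE}/{resource}/{accession}.{fmt}'

print(entry_url('P04637', 'fasta'))
```

Rejecting unknown suffixes locally avoids spending a round trip on a URL the server would refuse.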
UniProt search queries use a Lucene-like syntax distinct from Entrez:
| Query | Means |
|---|---|
| gene:TP53 | Gene name TP53 |
| gene_exact:TP53 | Exact gene name (no wildcard match) |
| organism_id:9606 | Human (NCBI taxonomy ID) |
| organism_name:"Homo sapiens" | By name (slower than taxid) |
| reviewed:true | Swiss-Prot only |
| reviewed:false | TrEMBL only |
| length:[100 TO 500] | Sequence length range |
| go:0006915 | GO term (apoptosis) |
| keyword:KW-0067 | UniProt keyword by ID (KW-0067 is ATP-binding) |
| ec:2.7.1.1 | Enzyme classification |
| database:pdb | Has PDB cross-ref |
| xref:pdb | Same as above |
| existence:1 | Evidence at protein level (1 = strongest) |
Combine: organism_id:9606 AND reviewed:true AND keyword:KW-0067 AND xref:pdb.
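Combined queries like the one above are easy to mistype by hand. The sketch below assembles one from a dict; `build_query` is a hypothetical helper, and it only covers plain AND-joined `field:value` clauses (no OR, NOT, or grouping).

```python
def build_query(filters):
    '''Join field:value pairs with AND; quote values that contain spaces.'''
    parts = []
    for field, value in filters.items():
        value = str(value)
        # Quote free text with spaces; leave ranges like [100 TO 500] untouched
        if ' ' in value and not value.startswith('['):
            value = f'"{value}"'
        parts.append(f'{field}:{value}')
    return ' AND '.join(parts)

query = build_query({
    'organism_id': 9606,
    'reviewed': 'true',
    'keyword': 'KW-0067',
    'xref': 'pdb',
})
print(query)  # organism_id:9606 AND reviewed:true AND keyword:KW-0067 AND xref:pdb
```

The resulting string can be passed as the `query` parameter of a search or stream request.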
?fields= for bulk pulls

Default JSON entry is ~20-30 KB. For batch work, restrict fields:

```python
fields = 'accession,id,gene_names,protein_name,length,sequence,xref_pdb,xref_alphafolddb'
url = 'https://rest.uniprot.org/uniprotkb/search'
params = {'query': 'organism_id:9606 AND reviewed:true', 'fields': fields, 'format': 'tsv', 'size': 500}
```
Common field selectors:
| Field | Returns |
|---|---|
| accession, id | Primary accession (P04637), entry name (P53_HUMAN) |
| gene_names | All gene names |
| gene_primary | Primary gene name only |
| protein_name | Recommended name |
| organism_name, organism_id | Species |
| length, mass | Sequence stats |
| sequence | The actual sequence |
| cc_function, cc_subcellular_location | Function and localization comments |
| ft_domain, ft_binding, ft_active_site | Domain/site features |
| go_p, go_c, go_f | GO biological process / cellular component / molecular function |
| xref_pdb, xref_alphafolddb, xref_ensembl, xref_refseq | Cross-references |
| keyword | UniProt keywords |
| ec | Enzyme classification |
| reviewed | Swiss-Prot vs TrEMBL flag |
| cc_alternative_products | Isoforms |
| Endpoint | When | Limit |
|---|---|---|
| /uniprotkb/{acc} | One accession | 1 entry |
| /uniprotkb/accessions?accessions=... | Several known accessions | Up to ~100 per call |
| /uniprotkb/search?query=... | Query-driven; need pagination | 500 results per page; cursor= for paging |
| /uniprotkb/stream?query=... | Bulk query (>500) | No hard limit; one HTTP stream |
For 1000+ results, /stream is the right endpoint. Stream returns one HTTP response; iterate over the stream to avoid memory blowup.
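When /search is preferred over /stream, the pages can be walked as sketched below. The sketch assumes the API advertises the next page in a `Link` response header with `rel="next"`, which `requests` exposes as `response.links`. `iter_search_pages` and its `get` parameter are hypothetical names; `get` exists only so a stand-in can be injected for offline testing.

```python
def iter_search_pages(query, fields, size=500, get=None):
    '''Yield the TSV text of each /uniprotkb/search page, in order.'''
    if get is None:
        import requests  # lazy import so a stand-in `get` can be injected
        get = requests.get
    url = 'https://rest.uniprot.org/uniprotkb/search'
    params = {'query': query, 'fields': ','.join(fields), 'format': 'tsv', 'size': size}
    while url:
        r = get(url, params=params)
        r.raise_for_status()
        yield r.text
        # requests parses the Link header into r.links; the rel="next" URL
        # is assumed to already carry the query and the cursor
        url = r.links.get('next', {}).get('url')
        params = None  # do not resend params alongside the next-page URL
```

Each TSV page is expected to repeat the header row, so parse pages individually (for example with `pd.read_csv` per page followed by `pd.concat`) rather than concatenating the raw text.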
The new schema is deeply nested. Common access patterns:
```python
entry = requests.get('https://rest.uniprot.org/uniprotkb/P04637.json').json()

acc = entry['primaryAccession']        # 'P04637'
entry_name = entry['uniProtkbId']      # 'P53_HUMAN'
sequence = entry['sequence']['value']  # actual AA sequence
length = entry['sequence']['length']

# Names (nested; defensive .get() because some fields are optional)
recommended = entry.get('proteinDescription', {}).get('recommendedName', {}).get('fullName', {}).get('value')
# `or [{}]` also covers an empty genes list, where [0] would raise IndexError
primary_gene = (entry.get('genes') or [{}])[0].get('geneName', {}).get('value')

# Cross-references, grouped by database name
xrefs_by_db = {}
for xref in entry.get('uniProtKBCrossReferences', []):
    xrefs_by_db.setdefault(xref['database'], []).append(xref['id'])

# Features (domains, binding sites)
domains = [f for f in entry.get('features', []) if f.get('type') == 'Domain']
binding = [f for f in entry.get('features', []) if f.get('type') == 'Binding site']

# Isoforms
isoforms = []
for comment in entry.get('comments', []):
    if comment.get('commentType') == 'ALTERNATIVE PRODUCTS':
        isoforms = [iso['name']['value'] for iso in comment.get('isoforms', [])]
```
Canonical sequence is returned for the bare accession (e.g. P04637). Isoforms have -2, -3, etc. suffixes (P04637-2). To fetch a specific isoform:
```python
iso = requests.get('https://rest.uniprot.org/uniprotkb/P04637-2.fasta').text
```
The canonical entry's comments[type=ALTERNATIVE PRODUCTS] lists all isoforms with their differences. For workflows needing all isoforms, iterate the list and fetch separately.
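The iterate-and-fetch step can be sketched as a function that turns a parsed entry into per-isoform FASTA URLs. `isoform_fasta_urls` is a hypothetical helper, and the `isoformIds` key is an assumption about the JSON layout that should be checked against a real entry. The `example` dict is a hand-written stand-in, not API output.

```python
def isoform_fasta_urls(entry):
    '''Return {isoform_id: FASTA URL} from a parsed UniProtKB entry dict.'''
    urls = {}
    for comment in entry.get('comments', []):
        if comment.get('commentType') != 'ALTERNATIVE PRODUCTS':
            continue
        for iso in comment.get('isoforms', []):
            # 'isoformIds' (e.g. ['P04637-2']) is assumed, not confirmed by the text above
            for iso_id in iso.get('isoformIds', []):
                urls[iso_id] = f'https://rest.uniprot.org/uniprotkb/{iso_id}.fasta'
    return urls

# Hand-written stand-in for a parsed entry
example = {
    'comments': [
        {'commentType': 'FUNCTION'},
        {'commentType': 'ALTERNATIVE PRODUCTS',
         'isoforms': [
             {'name': {'value': '1'}, 'isoformIds': ['P04637-1']},
             {'name': {'value': '2'}, 'isoformIds': ['P04637-2']},
         ]},
    ]
}
for iso_id, url in isoform_fasta_urls(example).items():
    print(iso_id, url)
```

Each URL can then be fetched separately, as with the single-isoform request above.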
Convert between identifier systems (Ensembl Gene -> UniProt; PDB -> UniProt; UniProt -> RefSeq; etc.). The job pattern:
1. POST /idmapping/run with ids, from, to.
2. GET /idmapping/status/{jobId}: returns {'jobStatus': 'RUNNING'} or {'results': [...]}.
3. GET /idmapping/results/{jobId} once status is complete.

A job typically completes in about 30 s; larger batches take 5-10 min. Always set a poll timeout, because the API doesn't fail-soft on stuck jobs.
| From | To | Notes |
|---|---|---|
| UniProtKB_AC-ID | UniProtKB | Resolve obsolete to current accessions |
| Gene_Name | UniProtKB | Symbol -> accession (lossy; check matches) |
| Ensembl | UniProtKB | Ensembl Gene/Transcript/Protein |
| EMBL-GenBank-DDBJ | UniProtKB | INSDC nucleotide accessions |
| RefSeq_Protein | UniProtKB | NP_/XP_ accessions |
| PDB | UniProtKB | PDB chain to protein |
| UniProtKB | EMBL-GenBank-DDBJ | Reverse direction |
Full from/to list at https://rest.uniprot.org/configure/idmapping/fields.
Goal: Fetch one UniProt entry as JSON and extract canonical name, gene, sequence, PDB cross-refs without KeyErrors.
Approach: GET /uniprotkb/{acc}.json; navigate with .get() chains; handle missing fields gracefully.
Reference (UniProt REST as of 2024_06):
```python
import requests

def fetch_uniprot_entry(accession):
    r = requests.get(f'https://rest.uniprot.org/uniprotkb/{accession}.json')
    r.raise_for_status()
    e = r.json()
    return {
        'accession': e['primaryAccession'],
        'entry_name': e.get('uniProtkbId'),
        'reviewed': e.get('entryType') == 'UniProtKB reviewed (Swiss-Prot)',
        'protein_name': e.get('proteinDescription', {}).get('recommendedName', {}).get('fullName', {}).get('value'),
        'gene_primary': (e.get('genes') or [{}])[0].get('geneName', {}).get('value'),
        'sequence': e['sequence']['value'],
        'length': e['sequence']['length'],
        'pdb_ids': [x['id'] for x in e.get('uniProtKBCrossReferences', []) if x['database'] == 'PDB'],
        'alphafold_id': next((x['id'] for x in e.get('uniProtKBCrossReferences', []) if x['database'] == 'AlphaFoldDB'), None),
    }

print(fetch_uniprot_entry('P04637'))
```
fields= (bulk-friendly)

Goal: Get a DataFrame of human reviewed kinases with their PDB and AlphaFold IDs.
Approach: /search with format=tsv and explicit fields; paginate via cursor if results exceed 500.
Reference (requests 2.31+):
```python
import requests
import pandas as pd
from io import StringIO

def search_uniprot_tsv(query, fields, size=500):
    '''Fetch one page (up to `size` rows) of search results as a DataFrame.'''
    url = 'https://rest.uniprot.org/uniprotkb/search'
    params = {'query': query, 'fields': ','.join(fields), 'format': 'tsv', 'size': size}
    r = requests.get(url, params=params)
    r.raise_for_status()
    return pd.read_csv(StringIO(r.text), sep='\t')

df = search_uniprot_tsv(
    'organism_id:9606 AND reviewed:true AND keyword:"Kinase"',
    fields=['accession', 'gene_primary', 'protein_name', 'length', 'xref_pdb', 'xref_alphafolddb'],
)
# Only the first page is counted here; more than 500 hits need cursor paging or /stream
print(f'{len(df)} reviewed human kinases')
print(df.head())
```
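Cross-reference columns in the TSV output usually need unpacking before use. The sketch below assumes multi-valued cells arrive as semicolon-separated strings with a trailing semicolon (for example `1A1U;1AIE;`); that layout is an assumption to verify against real output. `split_multi` is a hypothetical helper.

```python
def split_multi(cell, sep=';'):
    '''Split a multi-valued TSV cell into a list; empty or NaN cells give [].'''
    if not isinstance(cell, str):
        return []  # pandas represents empty cells as NaN (a float)
    return [part.strip() for part in cell.split(sep) if part.strip()]

print(split_multi('1A1U;1AIE;'))  # ['1A1U', '1AIE']
```

With a DataFrame this would be applied per column, for example `df['PDB'].apply(split_multi)`, where the column header `PDB` is itself an assumption about how the API labels the `xref_pdb` field.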
```python
import requests
import pandas as pd

def stream_uniprot(query, fields):
    url = 'https://rest.uniprot.org/uniprotkb/stream'
    params = {'query': query, 'fields': ','.join(fields), 'format': 'tsv'}
    with requests.get(url, params=params, stream=True) as r:
        r.raise_for_status()
        # Parse directly from the response stream; r.text would buffer the whole body first
        r.raw.decode_content = True  # transparently undo gzip/deflate encoding
        return pd.read_csv(r.raw, sep='\t')

# All human reviewed proteins (~20K)
df = stream_uniprot(
    'organism_id:9606 AND reviewed:true',
    fields=['accession', 'gene_primary', 'protein_name', 'length'],
)
print(f'All human Swiss-Prot: {len(df)}')
```
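For result sets too large to hold as one DataFrame, the stream can be consumed row by row. This is a sketch under stated assumptions: `iter_tsv_records` and `stream_records` are hypothetical names, and the `session` parameter exists so a stand-in can be injected for offline testing.

```python
import csv

def iter_tsv_records(lines):
    '''Yield one dict per TSV row from an iterable of text lines (header first).'''
    reader = csv.DictReader(lines, delimiter='\t')
    for row in reader:
        yield row

def stream_records(query, fields, session=None):
    '''Yield rows from /uniprotkb/stream without loading the whole body.'''
    if session is None:
        import requests  # lazy import; the parser above needs only the stdlib
        session = requests.Session()
    url = 'https://rest.uniprot.org/uniprotkb/stream'
    params = {'query': query, 'fields': ','.join(fields), 'format': 'tsv'}
    with session.get(url, params=params, stream=True) as r:
        r.raise_for_status()
        r.encoding = r.encoding or 'utf-8'  # so iter_lines yields str, not bytes
        yield from iter_tsv_records(r.iter_lines(decode_unicode=True))

# Offline demonstration on hand-written sample lines
for row in iter_tsv_records(['Entry\tLength', 'P04637\t393']):
    print(row['Entry'], row['Length'])
```

All values arrive as strings, so numeric columns such as length need an explicit `int()` conversion.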
Goal: Convert Ensembl Gene IDs to UniProt accessions.
Approach: Submit job; poll with timeout; retrieve results.
Reference (UniProt REST 2024_06):
```python
import time
import requests

def map_ids(ids, from_db='Ensembl', to_db='UniProtKB', timeout=600, poll_interval=3):
    submit = requests.post('https://rest.uniprot.org/idmapping/run',
                           data={'ids': ','.join(ids), 'from': from_db, 'to': to_db})
    submit.raise_for_status()
    job_id = submit.json()['jobId']
    print(f'Submitted job {job_id}')

    elapsed = 0
    while elapsed < timeout:
        status = requests.get(f'https://rest.uniprot.org/idmapping/status/{job_id}')
        status.raise_for_status()
        js = status.json()
        # A queued job reports NEW before RUNNING; both mean "not finished yet"
        if js.get('jobStatus') in ('NEW', 'RUNNING'):
            time.sleep(poll_interval)
            elapsed += poll_interval
            continue
        # Completed (results in status response) or has results endpoint
        break
    else:
        # Reached only when the loop ends without break, i.e. the timeout expired
        raise TimeoutError(f'ID mapping job {job_id} did not complete in {timeout}s')

    results = requests.get(f'https://rest.uniprot.org/idmapping/results/{job_id}')
    results.raise_for_status()
    return results.json()

mapping = map_ids(['ENSG00000141510', 'ENSG00000171862', 'ENSG00000139618'])
for r in mapping.get('results', []):
    print(f"  {r['from']:<20} -> {r['to']}")
for failed in mapping.get('failedIds', []):
    print(f"  {failed:<20} -> NOT MAPPED")
```
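Because one source ID can map to several targets (gene symbols especially), it often helps to group the result rows before use. The sketch below assumes the `results` / `failedIds` layout used in the loop above. `group_mapping` is a hypothetical helper, and the IDs in `sample` are placeholders rather than real mappings.

```python
from collections import defaultdict

def group_mapping(mapping):
    '''Collapse mapping results into {from_id: [to_id, ...]}; failed IDs map to [].'''
    grouped = defaultdict(list)
    for row in mapping.get('results', []):
        target = row['to']
        # Assumption: some targets may return a full entry object instead of a plain ID
        if isinstance(target, dict):
            target = target.get('primaryAccession')
        grouped[row['from']].append(target)
    for failed in mapping.get('failedIds', []):
        grouped.setdefault(failed, [])
    return dict(grouped)

# Placeholder IDs for illustration only
sample = {
    'results': [
        {'from': 'GENE_A', 'to': 'ACC_1'},
        {'from': 'GENE_A', 'to': {'primaryAccession': 'ACC_2'}},
    ],
    'failedIds': ['GENE_B'],
}
print(group_mapping(sample))  # {'GENE_A': ['ACC_1', 'ACC_2'], 'GENE_B': []}
```

Keeping failed IDs as empty lists makes one-to-many and unmapped cases visible in the same structure.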
```python
def resolve_obsolete(accessions):
    '''Use ID mapping to update obsolete accessions to current primary IDs.'''
    return map_ids(accessions, from_db='UniProtKB_AC-ID', to_db='UniProtKB')
```
```python
import requests

def download_proteome(upid, out_path):
    '''upid: UniProt Proteome ID, e.g. UP000005640 (human reference).'''
    url = f'https://rest.uniprot.org/proteomes/{upid}.fasta.gz'
    r = requests.get(url, stream=True)
    r.raise_for_status()
    # Write the gzipped bytes to disk in chunks rather than holding them in memory
    with open(out_path, 'wb') as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
    return out_path

download_proteome('UP000005640', 'human.fasta.gz')  # human reference proteome
```
```python
def uniref_cluster(uniref_id):
    '''e.g. UniRef50_P04637 -- the UniRef50 cluster centered on P04637.'''
    r = requests.get(f'https://rest.uniprot.org/uniref/{uniref_id}.json')
    r.raise_for_status()
    j = r.json()
    return {
        'id': j['id'],
        'representative': j['representativeMember']['memberId'],
        'member_count': j['memberCount'],
        'identity': j.get('entryType'),
    }
```
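- Legacy URLs: code still calling https://www.uniprot.org/uniprot/{acc}.json fails, or raises KeyError from old field paths. Fix: use https://rest.uniprot.org/uniprotkb/{acc}.json; update field navigation to the new nested layout.
- ?fields= not specified: bulk pulls return the full entry for every hit and run slowly. Fix: specify fields= for bulk; request only the fields actually needed.
- Pagination: /search returns at most 500 results per page, with a cursor for the next. Fix: use /stream for >500 results; or paginate /search with cursor.
- ID mapping: a stuck job leaves the poll loop waiting indefinitely. Fix: set timeout= on polling; surface TimeoutError.
- Isoforms: fetching P04637 and assuming that's the only sequence. Fix: check comments[type=ALTERNATIVE PRODUCTS]; fetch each isoform with -N suffix.
- Reviewed filter: searching without reviewed:true returning millions of TrEMBL hits. Fix: add reviewed:true.
- Gene symbols: gene:TP53 returns multiple species or duplicates. Fix: add organism_id:9606 (or specific taxon); use gene_exact: to avoid wildcard matches.

| Error / symptom | Cause | Solution |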
|---|---|---|
| 404 on legacy URL | Pre-2022 endpoint | Use rest.uniprot.org/uniprotkb/ |
| KeyError on old field path | Schema migration 2022 | Update to new nested layout; use .get() |
| Bulk fetch very slow | Default JSON entry size | Specify fields= for TSV bulk |
| Mid-pagination data missing | 500-record cap | Use /stream or paginate with cursor |
| ID mapping job hangs | API doesn't fail stuck jobs | Set timeout= on poll loop |
| Mixed-species search results | Symbol shared across species | Add organism_id: filter |
| Million-row search returning TrEMBL | No reviewed filter | Add reviewed:true |
| Missing isoform | Default returns canonical only | Fetch with -N suffix per isoform |