database-access/entrez-link/SKILL.md
Find cross-database references between NCBI databases using Biopython Bio.Entrez (ELink). Use when navigating gene to protein/structure, sequence to publication, PubMed to GEO, BioProject to SRA runs, or discovering all link relationships for a record. Covers linkname semantics, cmd= variants, asymmetric link warnings, neighbor_history for >200 input IDs, and per-database link tables.
npx skillsauth add GPTomics/bioSkills bio-entrez-linkInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+
Before using code patterns, verify installed versions match. If versions differ:
pip show biopython then help(Bio.Entrez.elink) to check signatureselink -version then elink -help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Find records linked to this record in another NCBI database" -> ELink walks the curated, weekly-maintained link tables between Entrez databases. A link is an asserted relationship (e.g. "this PubMed article describes this nucleotide sequence"), not a similarity hit.
ELink is the navigation layer of Entrez. The decision that matters most is which linkname to use — not which databases. A single (dbfrom, db) pair can have a dozen linkname variants distinguishing curation level, evidence type, and direction. Picking the wrong one is the difference between 5 high-confidence matches and 500 noisy automated assertions.
Entrez.elink(dbfrom=..., db=..., id=..., linkname=...) (BioPython)elink -db pubmed -target gene -name pubmed_gene_rif (Entrez Direct)entrez_link(dbfrom=..., db=..., id=...) (rentrez)from Bio import Entrez
Entrez.email = '[email protected]'
Entrez.api_key = 'optional_api_key' # raises rate to 10 req/sec
linkname decision (most important)For most (dbfrom, db) pairs NCBI exposes multiple link tables. The qualifiers in the name encode the curation level and the evidence source. Choose deliberately.
| linkname | Returns | When to use |
|---|---|---|
| gene_protein | All linked proteins (curated + automated) | Exploration; expect 10-1000x more hits |
| gene_protein_refseq | RefSeq proteins only | Reference-quality analyses; orthology |
| gene_protein_swissprot | Reviewed UniProt entries with NCBI cross-ref | Functional annotation; literature support |
| linkname | Returns |
|---|---|
| pubmed_gene | Genes mentioned in this paper (text-mined + curated) |
| pubmed_gene_rif | Genes with a Reference Into Function (curated, high-quality) |
| pubmed_gene_pubmed | Other PubMed records sharing gene linkage (rare use) |
| linkname | Returns |
|---|---|
| nuccore_protein | All proteins encoded by this nucleotide record (CDS-linked) |
| nuccore_protein_refseq | RefSeq proteins only |
h = Entrez.elink(dbfrom='gene', db='protein', id='672', cmd='acheck')
record = Entrez.read(h); h.close()
for ls in record[0]['IdCheckList']['IdLinkSet'][0]['LinkInfo']:
print(f'{ls["Name"]} -> {ls["DbTo"]} | {ls["MenuTag"]} ({ls["HtmlTag"]})')
cmd='acheck' is the only authoritative way to enumerate available linknames — they change with each NCBI release.
cmd for which goal| Goal | cmd | Returns |
|---|---|---|
| Get linked records | neighbor (default) | Linked IDs in target db |
| Get linked + relevance scores | neighbor_score | IDs with similarity scores (mostly pubmed_pubmed) |
| Get >200 source IDs in one go | neighbor_history | WebEnv + QueryKey for downstream EFetch |
| Enumerate available links | acheck | List of all linknames for source IDs |
| Check if any link exists | ncheck | Boolean per source ID |
| Check specific link exists | lcheck | Boolean per source ID + linkname |
| Get NCBI HTML link URLs | llinks | URLs to Entrez record pages |
| Get external provider links | prlinks | URLs to journal sites, etc. |
The neighbor_history cmd is essential when source id count exceeds ~200 — past that, the URL-length limit makes the comma-joined form fail. With neighbor_history ELink puts results on the history server and returns WebEnv/QueryKey for downstream pickup.
ELink relationships are not guaranteed symmetric. pubmed_gene and gene_pubmed may return different sets because:
If round-trip consistency matters (e.g. "every gene mentioned in this paper, then every paper mentioning each gene"), expect the round-trip set to be larger than the input — and never assume A -> B -> A returns the original ID alone.
| Target | Common linknames | Notes |
|---|---|---|
| protein | gene_protein, gene_protein_refseq, gene_protein_swissprot | RefSeq is the safe default |
| nuccore | gene_nuccore, gene_nuccore_refseqrna, gene_nuccore_refseqgene | refseqrna for mRNA, refseqgene for the curated gene region |
| pubmed | gene_pubmed, gene_pubmed_rif | RIF is curated and high-quality |
| homologene | gene_homologene | Deprecated 2014 but data still queryable |
| snp | gene_snp | dbSNP entries in gene region |
| clinvar | gene_clinvar | Clinical variants |
| omim | gene_omim | Disease associations |
| Target | Common linknames |
|---|---|
| protein | nuccore_protein, nuccore_protein_refseq |
| gene | nuccore_gene |
| taxonomy | nuccore_taxonomy |
| biosample | nuccore_biosample |
| sra | nuccore_sra |
| pubmed | nuccore_pubmed, nuccore_pubmed_refseq |
| Target | Common linknames |
|---|---|
| nuccore | protein_nuccore, protein_nuccore_cds, protein_nuccore_mrna |
| gene | protein_gene |
| structure | protein_structure |
| cdd | protein_cdd (conserved domains) |
| pubmed | protein_pubmed |
| Target | Common linknames |
|---|---|
| pubmed | pubmed_pubmed, pubmed_pubmed_citedin, pubmed_pubmed_refs |
| gene | pubmed_gene, pubmed_gene_rif |
| protein | pubmed_protein |
| nuccore | pubmed_nuccore |
| gds | pubmed_gds (GEO datasets cited in paper) |
| sra | pubmed_sra |
| Target | Common linknames |
|---|---|
| biosample | bioproject_biosample |
| sra | bioproject_sra |
| pubmed | bioproject_pubmed |
Goal: Get RefSeq proteins for a single gene.
Approach: ELink with explicit linkname to restrict to curated set.
Reference (BioPython 1.83+):
def gene_to_refseq_proteins(gene_id):
h = Entrez.elink(dbfrom='gene', db='protein', id=gene_id, linkname='gene_protein_refseq')
r = Entrez.read(h); h.close()
if not r[0]['LinkSetDb']:
return []
return [link['Id'] for link in r[0]['LinkSetDb'][0]['Link']]
print(gene_to_refseq_proteins('672')) # BRCA1
Goal: Get linked proteins for a list of <200 gene IDs in one call.
Approach: Comma-join IDs; one linkset per input in the response.
Reference (BioPython 1.83+):
def batch_gene_protein(gene_ids):
h = Entrez.elink(dbfrom='gene', db='protein', id=','.join(gene_ids), linkname='gene_protein_refseq')
r = Entrez.read(h); h.close()
out = {}
for linkset in r:
src = linkset['IdList'][0]
out[src] = [link['Id'] for link in linkset['LinkSetDb'][0]['Link']] if linkset['LinkSetDb'] else []
return out
Goal: Link 5,000 gene IDs to proteins without hitting URL-length limits.
Approach: EPost the IDs first (chunked at 200), then ELink with cmd='neighbor_history' referencing the WebEnv. Downstream EFetch picks up linked IDs from the history server.
Reference (BioPython 1.83+):
def post_then_link(gene_ids, target='protein', linkname='gene_protein_refseq'):
# EPost in chunks of 200
webenv = None
for i in range(0, len(gene_ids), 200):
chunk = gene_ids[i:i+200]
kwargs = {'db': 'gene', 'id': ','.join(chunk)}
if webenv:
kwargs['WebEnv'] = webenv
h = Entrez.epost(**kwargs)
r = Entrez.read(h); h.close()
webenv = r['WebEnv']
query_key = r['QueryKey']
time.sleep(0.1 if Entrez.api_key else 0.34)
# Link with neighbor_history
h = Entrez.elink(dbfrom='gene', db=target, linkname=linkname,
cmd='neighbor_history', WebEnv=webenv, query_key=query_key)
r = Entrez.read(h); h.close()
# WebEnv is at the top level of the response; QueryKey is per-LinkSetDbHistory entry.
return r[0]['WebEnv'], r[0]['LinkSetDbHistory'][0]['QueryKey']
we, qk = post_then_link(['672', '675', '7157'] * 1000)
# Downstream: Entrez.efetch(db='protein', WebEnv=we, query_key=qk, retstart=..., retmax=500)
Goal: Before writing a pipeline, enumerate what link tables NCBI exposes for a (dbfrom, source-id) pair.
Approach: cmd='acheck' returns the full LinkInfo list per source.
Reference (BioPython 1.83+):
def list_link_names(dbfrom, id):
h = Entrez.elink(dbfrom=dbfrom, id=id, cmd='acheck')
r = Entrez.read(h); h.close()
info = r[0]['IdCheckList']['IdLinkSet'][0]['LinkInfo']
return [(i['Name'], i['DbTo'], i['MenuTag']) for i in info]
for name, target, label in list_link_names('gene', '672'):
print(f'{name:<40} -> {target:<15} ({label})')
def gene_to_structures(gene_id):
h = Entrez.elink(dbfrom='gene', db='protein', id=gene_id, linkname='gene_protein_refseq')
r = Entrez.read(h); h.close()
if not r[0]['LinkSetDb']:
return []
prot_ids = [l['Id'] for l in r[0]['LinkSetDb'][0]['Link'][:10]]
time.sleep(0.1 if Entrez.api_key else 0.34)
h = Entrez.elink(dbfrom='protein', db='structure', id=','.join(prot_ids))
r = Entrez.read(h); h.close()
out = []
for ls in r:
if ls['LinkSetDb']:
out.extend(l['Id'] for l in ls['LinkSetDb'][0]['Link'])
return out
def related_pubmed(pmid, top=10):
h = Entrez.elink(dbfrom='pubmed', db='pubmed', id=pmid,
linkname='pubmed_pubmed', cmd='neighbor_score')
r = Entrez.read(h); h.close()
if not r[0]['LinkSetDb']:
return []
return [(l['Id'], int(l['Score'])) for l in r[0]['LinkSetDb'][0]['Link'][:top]]
For SRA discovery, pysradb.SRAweb().sra_metadata(prjna, detailed=True) (see sra-data) is the higher-fidelity path — returns SRR accessions directly with run-level metadata in one call. Use ELink only when staying inside Bio.Entrez:
def bioproject_to_sra(prjna):
# Convert PRJNA to UID first
h = Entrez.esearch(db='bioproject', term=f'{prjna}[BioProject]')
r = Entrez.read(h); h.close()
if not r['IdList']:
return []
bp_uid = r['IdList'][0]
time.sleep(0.1 if Entrez.api_key else 0.34)
# Link to SRA
h = Entrez.elink(dbfrom='bioproject', db='sra', id=bp_uid)
r = Entrez.read(h); h.close()
return [l['Id'] for l in r[0]['LinkSetDb'][0]['Link']] if r[0]['LinkSetDb'] else []
gene_protein when gene_protein_refseq was intended.gene_protein includes all automated and predicted entries (XP_* RefSeq plus all GenBank submissions).record[0]['LinkSetDb'] is an empty list, not raising an error.KeyError if code assumes record[0]['LinkSetDb'][0] always exists.if not record[0]['LinkSetDb']: return [].genes_for_paper(pmid) -> papers_for_each_gene -> set of PMIDs.pubmed_gene (text-mined + curated) is larger than gene_pubmed (curated only); the round-trip set is not closed.*_rif variants) when fidelity matters.id= with 200+ IDs.cmd='neighbor_history'.record[0]['LinkSetDb'][0]['Link'] expecting the union.LinkSet per input UID, indexed by position.for linkset in record: and map by linkset['IdList'][0].dbfrom='nucleotide'.| Error / symptom | Cause | Solution |
|---|---|---|
| KeyError: 'LinkSetDb' | Empty result not guarded | if not record[0]['LinkSetDb']: return [] |
| HTTPError 414 | Comma-joined id too long | Use EPost + neighbor_history |
| HTTPError 400 | Invalid linkname or wrong db namespace | Use cmd='acheck' to enumerate valid links |
| 500 hits instead of 5 | Wrong linkname (e.g. gene_protein vs _refseq) | Pick curated variant |
| Round-trip set differs from input | Asymmetric link tables | Document; use curated variants |
neighbor_historytools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.