database-access/geo-data/SKILL.md
Query and download from NCBI Gene Expression Omnibus (GEO) and EMBL-EBI's BioStudies/ArrayExpress mirror. Use when finding expression datasets, navigating SuperSeries vs SubSeries, choosing between series-matrix (submitter-normalized) and raw supplementary files, downloading via GEOparse (Python) or GEOquery (R/Bioconductor), linking GEO to SRA for raw reads, or distinguishing GSE/GSM/GPL/GDS record types. Encodes the SuperSeries trap, the series-matrix normalization-trust caveat, GEOmetadb deprecation, ArrayExpress migration to BioStudies, and processed-vs-raw decision matrix.
Reference examples tested with: BioPython 1.83+, GEOparse 2.0+, R Bioconductor GEOquery 2.70+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show biopython geoparse`, then introspect signatures
- R: `packageVersion('GEOquery')`

If the GSE structure doesn't match expectations (missing fields, malformed series matrix), re-fetch from FTP directly and inspect the SOFT or MINiML file as source of truth.
"Pull expression data from GEO accession GSE..." -> GEO stores Series (GSE), Samples (GSM), Platforms (GPL), and curated DataSets (GDS, frozen 2018). The single most consequential decision is processed (series matrix) vs raw (supplementary files / linked SRA) — the answer turns on how much trust the submitter's normalization deserves.
The single most-missed gotcha: SuperSeries. A GSE may be a meta-container (!Series_relation = SuperSeries of: GSExxxxx) holding multiple sub-studies on different platforms. Naively pulling samples from a SuperSeries gives mixed Affymetrix + Illumina + RNA-seq, mis-batched.
- Python: `Entrez.esearch(db='gds')`, GEOparse for full series download
- R: `GEOquery::getGEO()` (Bioconductor; more mature than GEOparse)
- CLI: `wget` from ftp.ncbi.nlm.nih.gov/geo/series/...

```bash
pip install biopython GEOparse pandas
# OR for R-side:
# R: BiocManager::install('GEOquery')
```

```python
from Bio import Entrez

Entrez.email = 'you@example.com'  # placeholder: NCBI asks for a real contact address
Entrez.api_key = None             # optional: set to your NCBI API key for a higher rate limit
```
| Prefix | Type | Granularity | What's in it |
|---|---|---|---|
| GSE | Series | One study | Title, summary, design, links to GSMs, supplementary files |
| GSM | Sample | One biological/technical sample | Submitter metadata, per-sample processed data, link to raw SRA |
| GPL | Platform | One array / sequencer | Probe annotations or sequencer model |
| GDS | DataSet | Curated, normalized subset of one GSE | Re-normalized expression matrix (frozen 2018; new GDS no longer created) |
| GSEXXX SuperSeries | Series meta-container | Wraps multiple SubSeries | !Series_relation = SuperSeries of: ... |
GDS is dead-as-format: NCBI stopped creating new GDS records in 2018. Existing GDS still queryable but use GSE for anything current.
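The prefix table above can be turned into a small validation step before any network call. The following is a minimal sketch, not part of any GEO library: `classify_accession` and `GEO_RECORD_TYPES` are hypothetical names introduced here for illustration. Note that a SuperSeries cannot be recognized from the accession alone, since it carries an ordinary GSE prefix.

```python
import re

# Prefix -> record type, mirroring the table above.
GEO_RECORD_TYPES = {
    'GSE': 'Series',
    'GSM': 'Sample',
    'GPL': 'Platform',
    'GDS': 'DataSet',
}

_ACCESSION_RE = re.compile(r'^(GSE|GSM|GPL|GDS)(\d+)$')


def classify_accession(accession):
    """Return (record_type, numeric_id) for a GEO accession, or raise ValueError."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f'Not a GEO accession: {accession!r}')
    prefix, number = match.groups()
    return GEO_RECORD_TYPES[prefix], int(number)
```

Rejecting anything that is not a GSE early (for example an SRA run accession pasted by mistake) keeps later FTP path construction from failing with an opaque 404.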
A SuperSeries (GSE) wraps multiple SubSeries, often with different platforms. Detection:
```python
# Read the !Series_relation field from SOFT format
from Bio import Entrez

h = Entrez.esummary(db='gds', id='200122288')  # example
r = Entrez.read(h)[0]
h.close()
print(r.get('summary'))  # may or may not flag SuperSeries

# Definitive check: download SOFT and grep:
# curl ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE122nnn/GSE122288/soft/GSE122288_family.soft.gz | zgrep Series_relation
```
A SuperSeries of: GSE12345 line means the SuperSeries' samples are the union of all SubSeries — almost certainly mixed-platform / mixed-batch. Process each SubSeries independently.
Symmetric trap: a paper may cite a SubSeries (SubSeries of: GSEsuper) where the wider context is essential — check both directions.
| Question | Source | Trust level |
|---|---|---|
| "I want expression values; submitter normalization is fine" | Series matrix (GSE_series_matrix.txt.gz) | Trust submitter's normalization |
| "I want raw Affymetrix CEL files and to do my own RMA" | Supplementary files (suppl/) | Re-normalize locally |
| "I want raw RNA-seq FASTQ" | pysradb gse_to_srp -> srp_to_srr (Entrez gds->sra ELink unreliable) | Always raw; processed at submitter is rarely re-usable |
| "I want submitter-provided counts (RNA-seq)" | Supplementary files (usually a *_counts.txt.gz) | Trust at risk; submitter pipelines vary |
| "I want a curated subset across many studies" | Use ArchS4 (https://archs4.org) or recount3 | Curated re-processing |
Default to raw whenever possible. For Affymetrix: CEL + locally-run RMA is far more reliable than the submitter's "normalized" matrix. For RNA-seq: SRA FASTQ + locally-run alignment/quantification is the only reproducible path; submitter counts often use a private pipeline.
A series matrix (GSE12345_series_matrix.txt.gz) is a header (sample metadata as !Sample_* lines) plus a sample-by-feature expression table. The format is fragile and the values' provenance is whatever the submitter chose. Critical caveats:
- Read `!Series_overall_design` and `!Sample_data_processing` to know how the values were produced.
- Locate the `!Sample_characteristics_ch1` rows that hold the metadata of interest; these are submitter-formatted strings, often inconsistent within one series.

| Format | Content | Parser support |
|---|---|---|
| SOFT (*_family.soft.gz) | Plain-text, key=value style | GEOparse (Python), GEOquery (R), Entrez Direct |
| MINiML (*_family.xml.tgz) | XML-structured | GEOparse, GEOquery, custom XML |
Both contain the same content. SOFT is the legacy, MINiML the XML successor. GEOparse handles SOFT well; for very large series (1000+ samples) MINiML's XML structure is slower to parse.
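Because `!Sample_characteristics_ch1` values are free-form submitter strings, it helps to normalize them defensively. The sketch below assumes the common `key: value` convention; `parse_characteristics` is a hypothetical helper, and entries that do not follow the convention are kept aside rather than guessed at.

```python
def parse_characteristics(values):
    """Turn ['tissue: liver', 'age: 45'] into {'tissue': 'liver', 'age': '45'}.

    Assumes the 'key: value' convention; anything without a colon is
    collected under '_unparsed' so that nothing is silently dropped.
    """
    parsed = {}
    unparsed = []
    for raw in values:
        text = raw.strip().strip('"')
        # Split on the first colon only, so values like '12:30' survive intact.
        key, sep, value = text.partition(':')
        if sep and key.strip():
            parsed[key.strip().lower()] = value.strip()
        elif text:
            unparsed.append(text)
    if unparsed:
        parsed['_unparsed'] = unparsed
    return parsed
```

Lower-casing the keys is a design choice to absorb inconsistencies such as `Tissue` versus `tissue` within one series; drop it if case carries meaning in a given dataset.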
| Aspect | GEOparse (Python) | GEOquery (R/Bioconductor) |
|---|---|---|
| Maturity | OK; some known supplementary-file fetch issues since ~2022 | Mature; Bioconductor-supported |
| Output | GEOparse.GSE object with gsms, gpls, metadata dicts | ExpressionSet or list per platform |
| Supplementary files | gse.download_supplementary_files() (sometimes flakey) | getGEOSuppFiles(gse) (more reliable) |
| Integration | Pandas DataFrames | Bioconductor ecosystem |
| When | Python-first pipelines | R-first / use ExpressionSet downstream |
For production GEO workflows in R, GEOquery is the stable choice. For Python, GEOparse is the only option but verify file counts after download.
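One way to verify file counts after a download is to compare the names you expect (for example copied from the `suppl/` directory listing) against what actually landed on disk. This is a minimal sketch under stated assumptions: `missing_supplementary_files` is a hypothetical helper, the expected names are supplied by the caller, and the search is recursive because downloads may be placed in per-sample subdirectories.

```python
from pathlib import Path


def missing_supplementary_files(expected_names, local_dir):
    """Return the expected file names that are absent or empty under local_dir."""
    local_dir = Path(local_dir)
    # Only count non-empty regular files; a zero-byte file is treated as a failed download.
    present = {
        p.name
        for p in local_dir.rglob('*')
        if p.is_file() and p.stat().st_size > 0
    }
    return sorted(set(expected_names) - present)
```

An empty return value means every expected file is present and non-empty; it does not check content integrity, so checksums are still worth comparing when they are available.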
GEOmetadb (Zhu 2008) was a SQLite mirror of GEO metadata enabling fast SQL queries. Unmaintained since 2020; downloads still work but data is stale. Modern replacement: pysradb (pysradb gse_to_srp, pysradb metadata) covers most of the GEO->SRA mapping; for full GEO queries fall back to Entrez gds.
ArrayExpress (EMBL-EBI's microarray archive, mirroring GEO) was migrated into BioStudies in 2020. Old E-MTAB-#### accessions still resolve but the API moved:
| Old (pre-2020) | New (BioStudies) |
|---|---|
| https://www.ebi.ac.uk/arrayexpress/... | https://www.ebi.ac.uk/biostudies/... |
| ArrayExpress REST | BioStudies REST: https://www.ebi.ac.uk/biostudies/api/v1/... |
For new workflows, use BioStudies. For legacy ArrayExpress URLs in old papers, redirect via BioStudies.
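Legacy experiment links found in older papers can be rewritten mechanically. The sketch below assumes the old `.../arrayexpress/experiments/<accession>/` layout and the `.../biostudies/arrayexpress/studies/<accession>` layout used elsewhere in this document, and that accessions look like `E-XXXX-1234` (four capital letters); `arrayexpress_to_biostudies` is a hypothetical helper, not an EBI API.

```python
import re

# Assumed legacy pattern: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1234/
_OLD_AE_RE = re.compile(
    r'^https?://www\.ebi\.ac\.uk/arrayexpress/experiments/(E-[A-Z]{4}-\d+)/?$'
)


def arrayexpress_to_biostudies(url):
    """Map a legacy ArrayExpress experiment URL to its BioStudies form, or return None."""
    match = _OLD_AE_RE.match(url.strip())
    if match is None:
        return None
    return f'https://www.ebi.ac.uk/biostudies/arrayexpress/studies/{match.group(1)}'
```

Returning `None` for anything that does not match keeps the function from inventing URLs for pages (search results, file downloads) whose new location is not covered here.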
Goal: Find GSE accessions matching keywords + organism + study type.
Approach: ESearch on gds db with field-qualified terms; filter to gse[Entry Type]; summarize with ESummary.
Reference (BioPython 1.83+):
```python
from Bio import Entrez

Entrez.email = 'you@example.com'  # placeholder: NCBI asks for a real contact address


def search_geo(term, study_type='gse', organism=None, max_results=50):
    # Restrict to one record type, e.g. gse[Entry Type]
    full_term = f'{term} AND {study_type}[Entry Type]'
    if organism:
        full_term += f' AND {organism}[Organism]'
    h = Entrez.esearch(db='gds', term=full_term, retmax=max_results)
    s = Entrez.read(h)
    h.close()
    if not s['IdList']:
        return []
    h = Entrez.esummary(db='gds', id=','.join(s['IdList']))
    summaries = Entrez.read(h)
    h.close()
    return summaries


for s in search_geo('breast cancer RNA-seq', organism='Homo sapiens', max_results=10):
    # Surface SuperSeries (heuristic: the summary text usually mentions it)
    relation = s.get('summary', '')
    is_super = 'SuperSeries' in str(relation)
    print(f" {s['Accession']:12} {s['n_samples']:>4} samples {'[SuperSeries]' if is_super else '':12} {s['title'][:60]}")
```
Goal: Avoid mixing platforms by detecting SuperSeries structure first.
Approach: Download SOFT family file and read !Series_relation keys.
```python
import gzip
import urllib.request


def check_super_or_sub_series(gse):
    # GSE122288 -> GSE122nnn; accessions with three or fewer digits map to GSEnnn
    stub = 'GSE' + gse[3:-3] + 'nnn'
    url = f'https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{gse}/soft/{gse}_family.soft.gz'
    urllib.request.urlretrieve(url, f'{gse}.soft.gz')
    super_of = []
    sub_of = None
    with gzip.open(f'{gse}.soft.gz', 'rt') as f:
        for line in f:
            if line.startswith('!Series_relation'):
                if 'SuperSeries of' in line:
                    super_of.append(line.split('SuperSeries of: ')[1].strip())
                elif 'SubSeries of' in line:
                    sub_of = line.split('SubSeries of: ')[1].strip()
            if line.startswith('^SAMPLE'):
                break  # Speed: don't read past header
    return {'super_of': super_of, 'sub_of': sub_of}


print(check_super_or_sub_series('GSE122288'))
# {'super_of': ['GSExxxxx', 'GSEyyyyy'], 'sub_of': None} -> SuperSeries; process subseries separately
```
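The FTP directory stub (`GSE122nnn` for `GSE122288`) recurs in every download path, so it can be worth isolating. This is a minimal sketch; `geo_series_ftp_dir` is a hypothetical helper name. It covers the short-accession case, where series with three or fewer digits live under `GSEnnn`.

```python
def geo_series_ftp_dir(gse):
    """Return the HTTPS directory URL for a GEO Series on the NCBI FTP mirror."""
    gse = gse.strip().upper()
    if not (gse.startswith('GSE') and gse[3:].isdigit()):
        raise ValueError(f'Not a GSE accession: {gse!r}')
    digits = gse[3:]
    # Replace the last three digits with 'nnn'; shorter accessions collapse to GSEnnn.
    stub = 'GSE' + digits[:-3] + 'nnn'
    return f'https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{gse}/'
```

The `soft/`, `matrix/`, `miniml/` and `suppl/` subdirectories can then be appended to the returned URL as needed.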
```python
import gzip
import urllib.request

import pandas as pd


def download_series_matrix(gse):
    stub = 'GSE' + gse[3:-3] + 'nnn'
    url = f'https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{gse}/matrix/{gse}_series_matrix.txt.gz'
    urllib.request.urlretrieve(url, f'{gse}_matrix.txt.gz')
    return f'{gse}_matrix.txt.gz'


def parse_series_matrix(path):
    metadata = {}
    with gzip.open(path, 'rt') as f:
        for line in f:
            if line.startswith('!series_matrix_table_begin'):
                break
            if line.startswith('!'):
                key, *vals = line.rstrip('\n').split('\t')
                # Keys such as !Sample_characteristics_ch1 repeat, so keep every row
                metadata.setdefault(key, []).append([v.strip('"') for v in vals])
        # Read the table while the file is still open; comment='!' skips the end marker
        expr = pd.read_csv(f, sep='\t', index_col=0, comment='!')
    # Series matrix values are whatever submitter chose -- check metadata['!Sample_data_processing']
    return metadata, expr


meta, expr = parse_series_matrix(download_series_matrix('GSE123456'))
print('Sample-level data processing notes:')
notes = {note for row in meta.get('!Sample_data_processing', []) for note in row}
for note in sorted(notes):
    print(f'  - {note}')
```
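Since the processing notes are not always explicit, a quick numeric sanity check on the matrix values can flag whether they look log-transformed. The sketch below is a heuristic only: the thresholds (99th percentile above 100, or a range above 50 with a positive lower quartile) are assumptions in the spirit of the quantile check found in GEO2R-generated scripts, and `looks_log_transformed` is a hypothetical helper. It takes a flat iterable of numbers so that it has no third-party dependencies.

```python
import math


def _quantile(sorted_values, fraction):
    """Nearest-rank quantile on an already sorted, non-empty list."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]


def looks_log_transformed(values):
    """Heuristically guess whether expression values are already on a log scale."""
    # Drop missing and non-finite entries before computing quantiles.
    finite = sorted(
        float(v) for v in values
        if v is not None and math.isfinite(float(v))
    )
    if not finite:
        raise ValueError('no finite expression values to inspect')
    q0, q25, q99, q100 = (_quantile(finite, f) for f in (0.0, 0.25, 0.99, 1.0))
    # Large upper tail or wide positive range suggests linear-scale intensities or counts.
    looks_linear = q99 > 100 or (q100 - q0 > 50 and q25 > 0)
    return not looks_linear
```

With the DataFrame returned by `parse_series_matrix`, a call such as `looks_log_transformed(expr.to_numpy().ravel())` would apply the check to the whole matrix. Treat the answer as a prompt to read `!Sample_data_processing`, not as a substitute for it.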
```python
from pysradb import SRAweb


def gse_to_srr(gse):
    db = SRAweb()
    srp_df = db.gse_to_srp(gse)
    # Guard against both a missing result and an empty table
    if srp_df is None or srp_df.empty:
        return []
    srp = srp_df['study_accession'].iloc[0]
    srr_df = db.srp_to_srr(srp)
    return srr_df['run_accession'].tolist()


srrs = gse_to_srr('GSE123456')
print(f'GSE123456 -> {len(srrs)} SRR runs')
```
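When the pysradb lookup comes back empty, the SOFT header is a useful cross-check, because `!Series_relation` lines commonly carry `BioProject:` and `SRA:` links alongside the SuperSeries and SubSeries entries. The following is a minimal offline sketch that works on lines already read from a SOFT file; `parse_series_relations` and `srp_from_relations` are hypothetical helpers, and the `key = value` layout is assumed (series-matrix files use tabs instead).

```python
import re

# Assumed SOFT layout: "!Series_relation = <kind>: <target>"
_RELATION_RE = re.compile(r'^!Series_relation\s*=\s*(?P<kind>[^:]+):\s*(?P<target>.+)$')


def parse_series_relations(lines):
    """Group !Series_relation targets by kind (e.g. 'SRA', 'BioProject', 'SuperSeries of')."""
    relations = {}
    for line in lines:
        match = _RELATION_RE.match(line.strip())
        if match:
            kind = match['kind'].strip()
            relations.setdefault(kind, []).append(match['target'].strip())
    return relations


def srp_from_relations(relations):
    """Extract SRA study accessions (SRP/ERP/DRP) from the 'SRA' relation targets."""
    found = []
    for target in relations.get('SRA', []):
        found.extend(re.findall(r'\b[SED]RP\d+\b', target))
    return found
```

A study accession recovered this way can be passed to `srp_to_srr` in place of the `gse_to_srp` result.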
```python
import GEOparse


def get_gse(gse_id, dest='./geo_cache'):
    gse = GEOparse.get_GEO(geo=gse_id, destdir=dest)
    print(f'{gse_id}: {len(gse.gsms)} samples, {len(gse.gpls)} platforms')
    for gsm_name, gsm in list(gse.gsms.items())[:3]:
        print(f'  {gsm_name}: {gsm.metadata.get("title", ["?"])[0]}')
    return gse


# Supplementary files (raw data) -- verify file count manually after
gse = get_gse('GSE123456')
gse.download_supplementary_files(directory='./geo_cache')
```
```r
# Reference: Bioconductor GEOquery 2.70+ | Verify API if version differs
library(GEOquery)

gse <- getGEO('GSE123456', GSEMatrix = TRUE)
length(gse)             # one ExpressionSet per platform
head(pData(gse[[1]]))   # sample metadata
head(exprs(gse[[1]]))   # expression matrix (submitter-normalized -- verify processing notes)

# Raw / supplementary files
supp_dir <- getGEOSuppFiles('GSE123456', baseDir = './geo_cache')
rownames(supp_dir)      # row names hold the paths of the downloaded files
```
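```python
from Bio import Entrez

Entrez.email = 'you@example.com'  # placeholder: NCBI asks for a real contact address


def geo_from_pubmed(pmid):
    # Follow PubMed -> GEO (gds) links, then summarize the linked records
    h = Entrez.elink(dbfrom='pubmed', db='gds', id=pmid)
    r = Entrez.read(h)
    h.close()
    if not r[0]['LinkSetDb']:
        return []
    gds_ids = [l['Id'] for l in r[0]['LinkSetDb'][0]['Link']]
    h = Entrez.esummary(db='gds', id=','.join(gds_ids))
    summaries = Entrez.read(h)
    h.close()
    return summaries
```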
- Check `!Series_relation` in SOFT before pulling; process SubSeries independently.
- Read `!Sample_data_processing` to know what's in the matrix; re-normalize from raw if in doubt.
- `*_counts.txt.gz` supplementary file as the count matrix.
- `GSE_series_matrix.txt.gz` is the merged one; per-platform are `GSE-GPLxxx_series_matrix.txt.gz`.
- Check the `!Series_platform_id` count.
- `gse.download_supplementary_files()` silently misses files.
- Fall back to `wget -r` on the `suppl/` subdirectory.
- Old: `https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1234/`.
- New: `https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-1234`.
- `GEOmetadb.sqlite` for fast queries.

| Error / symptom | Cause | Solution |
|---|---|---|
| Empty IdList for gse[entry_type] | Wrong field name | Use gse[Entry Type] (case-sensitive) |
| Matrix file has no expression data | SuperSeries with no aggregate matrix | Pull per-SubSeries matrices |
| Submitter "normalized" matrix gives different result than paper | Hidden submitter transforms | Re-process from raw |
| 404 on ArrayExpress URL | Migrated to BioStudies | Use new BioStudies URL |
| GEOparse missing CEL files | Known flake | Use R GEOquery or direct FTP |
| GEOmetadb-based pipeline missing recent series | DB unmaintained | Switch to pysradb / Entrez |