plugin/skills/tooluniverse-expression-data-retrieval/SKILL.md
Retrieve gene expression and omics datasets from ArrayExpress and BioStudies with gene disambiguation and quality assessment. Use for finding RNA-seq/microarray datasets by organism/tissue/condition, comparing across studies (case-control, time-series, dose-response), and assessing dataset suitability before downloading. Always uses English search terms.
Retrieve gene expression experiments and multi-omics datasets with disambiguation and quality assessment.
IMPORTANT: Always use English terms in tool calls. Respond in the user's language.
LOOK UP DON'T GUESS: Never assume which datasets exist or their accessions. Always search to confirm.
Before retrieving, determine: organism, tissue, experimental design (case-control/time-series/dose-response). These affect which database to search and how to interpret results. RNA-seq provides wider dynamic range; microarray has extensive legacy data. Prioritize experiments with >=3 biological replicates, complete annotations, and both raw+processed data.
Phase 0: Clarify (if ambiguous) → Phase 1: Disambiguate → Phase 2: Search & Retrieve → Phase 3: Report
Ask ONLY if: gene name ambiguous, tissue/condition unclear, organism not specified. Skip for: specific accessions (E-MTAB-*, E-GEOD-*, S-BSST*), clear disease/tissue+organism, explicit platform requests.
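The skip rule for specific accessions can be sketched as a small check. This is a minimal illustration, not part of the skill: the function names are hypothetical, and the regular expression assumes accessions look like `E-` plus a four-letter code plus digits, or `S-BSST` plus digits, which is inferred only from the prefixes named above.

```python
import re
from typing import List, Optional

# Assumed accession shapes, inferred from the E-MTAB-, E-GEOD-, and S-BSST prefixes.
ACCESSION_RE = re.compile(r"\b(?:E-[A-Z]{4}-\d+|S-BSST\d+)\b", re.IGNORECASE)


def find_accessions(query: str) -> List[str]:
    """Return accession-like tokens found in a free-text query, upper-cased."""
    return [m.group(0).upper() for m in ACCESSION_RE.finditer(query)]


def needs_clarification(query: str, organism: Optional[str]) -> bool:
    """Hypothetical Phase 0 gate: skip questions when an accession is present,
    otherwise ask when the organism is missing."""
    if find_accessions(query):
        return False
    return organism is None
```

A query such as "details for E-MTAB-5061" would bypass Phase 0 under this sketch, while "TP53 breast cancer" with no organism would trigger a clarifying question.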
Resolve official gene symbol (HGNC for human, MGI for mouse). Note common aliases for search expansion.
| User Query Type | Search Strategy |
|-----------------|-----------------|
| Specific accession | Direct retrieval |
| Gene + condition | "[gene] [condition]" + species filter |
| Disease only | "[disease]" + species filter |
| Technology-specific | Add platform keywords |
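One way to turn the strategy table into search arguments is sketched below. The helper `build_search_kwargs` is hypothetical; it only assembles the `keywords`, `species`, and `limit` arguments that the ArrayExpress search tool is shown accepting in this document, and it does not call any tool itself.

```python
from typing import Any, Dict, Optional


def build_search_kwargs(
    gene: Optional[str] = None,
    condition: Optional[str] = None,
    disease: Optional[str] = None,
    platform: Optional[str] = None,
    species: Optional[str] = None,
    limit: int = 20,
) -> Dict[str, Any]:
    """Join the supplied terms into one keyword string and add the species filter."""
    terms = [t.strip() for t in (gene, condition, disease, platform) if t and t.strip()]
    if not terms:
        raise ValueError("at least one search term is required")
    kwargs: Dict[str, Any] = {"keywords": " ".join(terms), "limit": limit}
    if species:
        kwargs["species"] = species  # scientific name, e.g. "Homo sapiens"
    return kwargs
```

The returned dictionary could then be unpacked into the search call, for example `tu.tools.arrayexpress_search_experiments(**kwargs)`.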
Search silently. Do NOT narrate the process.
# ArrayExpress search
result = tu.tools.arrayexpress_search_experiments(keywords="[gene/disease]", species="[species]", limit=20)
# Get experiment details, samples, files
details = tu.tools.arrayexpress_get_experiment(accession=accession)
samples = tu.tools.arrayexpress_get_experiment_samples(accession=accession)
files = tu.tools.arrayexpress_get_experiment_files(accession=accession)
# BioStudies for multi-omics
biostudies = tu.tools.biostudies_search(query="[keywords]", limit=10)
study = tu.tools.biostudies_get_study(accession=study_accession)
study_files = tu.tools.biostudies_get_study_files(accession=study_accession)
| Primary | Fallback |
|---------|----------|
| ArrayExpress search | BioStudies search |
| arrayexpress_get_experiment | biostudies_get_study |
| arrayexpress_get_experiment_files | Note "Files unavailable" |
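The primary/fallback chain can be sketched as a generic helper that tries each step in order. This is an illustration under assumptions: `first_nonempty` is a hypothetical name, each step is passed in as a zero-argument callable, and an "empty" result is assumed to be falsy (for example `None`, `[]`, or `{}`), which may not match the actual tool return shapes.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

Step = Tuple[str, Callable[[], Any]]


def first_nonempty(steps: Iterable[Step]) -> Tuple[Optional[str], Any]:
    """Run steps in order and return (label, result) for the first non-empty result.

    A step that raises is treated as unavailable and the next step is tried.
    Returns (None, None) when every step fails or comes back empty.
    """
    for label, call in steps:
        try:
            result = call()
        except Exception:
            continue  # fall through to the next source
        if result:
            return label, result
    return None, None
```

In use, the steps might be `("ArrayExpress", lambda: tu.tools.arrayexpress_search_experiments(...))` followed by `("BioStudies", lambda: tu.tools.biostudies_search(...))`. A `(None, None)` return corresponds to the "Files unavailable" style note in the table.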
Present as a Dataset Search Report. Hide search process. Include:
| Tier | Symbol | Criteria |
|------|--------|----------|
| High | ●●● | >=3 bio replicates, complete metadata, processed data available |
| Medium | ●●○ | 2-3 replicates OR some metadata gaps |
| Low | ●○○ | No replicates, sparse metadata, or access issues |
| Caution | ○○○ | Single sample, no replication, outdated platform |
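The tier table leaves some overlap between rows, so any implementation has to pick an order of precedence. The sketch below is one possible reading, not the skill's definition: it checks Caution first, then Low, then High, and treats everything else as Medium. The function name, the `metadata` labels ("complete", "gaps", "sparse"), and the flag names are all assumptions.

```python
from typing import Tuple


def evidence_tier(
    replicates: int,
    metadata: str,
    processed_data: bool,
    accessible: bool = True,
    single_sample: bool = False,
    outdated_platform: bool = False,
) -> Tuple[str, str]:
    """Map dataset properties to (tier, symbol) under one assumed precedence order.

    replicates: biological replicates per condition (1 means no replication).
    metadata: one of "complete", "gaps", "sparse" (assumed labels).
    """
    if single_sample or outdated_platform:
        return "Caution", "○○○"
    if replicates < 2 or metadata == "sparse" or not accessible:
        return "Low", "●○○"
    if replicates >= 3 and metadata == "complete" and processed_data:
        return "High", "●●●"
    return "Medium", "●●○"
```

Checking the most restrictive tiers first keeps a well-replicated dataset on an outdated platform from being reported as High.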
Dataset quality: Prioritize >=3 biological replicates, complete annotations, both raw+processed data. Single-replicate experiments can inform but not be sole evidence.
Platform comparison: RNA-seq = wider dynamic range, novel transcripts. Microarray = probe-limited but extensive legacy data. Cross-platform combining requires batch correction.
Metadata scoring: Rate 0-5 on: (1) sample annotations, (2) design documented, (3) pipeline described, (4) raw data deposited, (5) publication linked. Score <=2 warrants caution.
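The 0-5 rating can be sketched as one point per satisfied criterion. The `MetadataChecks` class and `metadata_score` function are hypothetical names for illustration; how each boolean is derived from the experiment record is left open here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MetadataChecks:
    sample_annotations: bool   # (1) samples are annotated
    design_documented: bool    # (2) experimental design is documented
    pipeline_described: bool   # (3) processing pipeline is described
    raw_data_deposited: bool   # (4) raw data are deposited
    publication_linked: bool   # (5) a publication is linked


def metadata_score(checks: MetadataChecks) -> Tuple[int, bool]:
    """Return (score, caution), where caution is True for scores of 2 or less."""
    score = sum(bool(v) for v in vars(checks).values())
    return score, score <= 2
```

A dataset with annotated samples and a documented design but nothing else would score 2 and be flagged for caution.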
GEO vs ArrayExpress: GEO has broader coverage (older studies); ArrayExpress enforces stricter metadata. BioStudies captures multi-omics. Search both.
| Error | Response |
|-------|----------|
| "No experiments found" | Broaden keywords, remove species filter, try synonyms |
| "Accession not found" | Verify format, check if withdrawn |
| "Files not available" | Note: "Data files restricted by submitter" |
| "API timeout" | Retry once, note "(metadata retrieval incomplete)" |
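The "retry once" rule for timeouts can be sketched as a thin wrapper. This assumes a timeout surfaces as Python's built-in `TimeoutError`; the actual tools may signal timeouts differently (for example through an error field in the response), in which case the `except` clause would need to change. The name `call_with_one_retry` is hypothetical.

```python
from typing import Any, Callable, Optional, Tuple

INCOMPLETE_NOTE = "(metadata retrieval incomplete)"


def call_with_one_retry(fn: Callable[[], Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Call fn, retrying exactly once on TimeoutError.

    Returns (result, None) on success, or (None, INCOMPLETE_NOTE) when both
    attempts time out. Other exceptions propagate unchanged.
    """
    for _attempt in range(2):  # first try plus one retry
        try:
            return fn(), None
        except TimeoutError:
            continue
    return None, INCOMPLETE_NOTE
```

The note string can be appended to the report entry so the reader sees that metadata for that dataset is partial.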
ArrayExpress: arrayexpress_search_experiments (search), arrayexpress_get_experiment (metadata), arrayexpress_get_experiment_files (downloads), arrayexpress_get_experiment_samples (annotations)
BioStudies: biostudies_search (search), biostudies_get_study (metadata+sections), biostudies_get_study_files (files)
Additional Sources:
- GEO_search_rnaseq_datasets / geo_search_datasets -- GEO (largest RNA-seq repo)
- OmicsDI_search_datasets -- cross-repository aggregation (GEO+ArrayExpress+PRIDE+MassIVE)
- GTEx_get_expression_summary -- baseline tissue expression (54 normal tissues, param: gene_symbol)
- ENAPortal_search_studies -- sequencing studies (param: query with description="...")
- CxGDisc_search_datasets -- single-cell datasets (needs exact disease ontology terms)
- PubMed_search_articles -- dataset discovery via publications

ArrayExpress: keywords (free text), species (scientific name), array (platform filter), limit
BioStudies: query (free text), limit