skills/tooluniverse-dataset-discovery/SKILL.md
Find and evaluate research datasets for any scientific question. Maps research questions to required study designs (longitudinal vs cross-sectional, observational vs experimental, single-cohort vs multi-cohort). Use when the user asks 'find data about X', 'where can I get data on Y', or needs a specific cohort/survey/repository. Covers GEO, ArrayExpress, dbGaP, NHANES, UK Biobank, ClinicalTrials.gov, GWAS Catalog, and 30+ scientific repositories.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-dataset-discovery
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Before searching, determine the minimum data requirements:
Study design needed:
Variables needed:
Population needed:
Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast wide net): Search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
Layer 2 — Domain-specific repositories: Search repositories specialized for your data type.
Layer 3 — Literature-based discovery: Many datasets aren't in any repository — they're described in paper methods sections.
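The three layers can be chained so that broad results come first and later layers only add what is new. The sketch below is one possible shape for that loop, not the skill's own implementation: `layered_search`, the `"id"` key, and the `"found_by"` tag are hypothetical names, and each search function stands in for a tool you would locate with find_tools.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A search layer is any callable that takes a query string and yields
# candidate dataset records (dicts with at least an "id" key).
# This record shape is an assumption made for illustration.
SearchFn = Callable[[str], Iterable[Dict]]


def layered_search(query: str, layers: List[Tuple[str, SearchFn]]) -> List[Dict]:
    """Run each layer in order, tag hits with the layer that found them,
    and skip datasets an earlier layer already returned."""
    seen = set()
    results = []
    for layer_name, search in layers:
        for hit in search(query):
            dataset_id = hit["id"]
            if dataset_id in seen:
                continue  # an earlier (broader) layer already found it
            seen.add(dataset_id)
            results.append({**hit, "found_by": layer_name})
    return results
```

Pass the layers in the order described above, for example `[("cross-repository", fn1), ("domain-specific", fn2), ("literature", fn3)]`. Keeping the `found_by` tag makes it easy to say in the final report how each dataset was located.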
For each candidate dataset, assess these dimensions:
Variables:
Design match:
Sample:
Access:
Quality:
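One way to keep the assessment honest is to record each dimension explicitly and treat anything not yet checked as unverified instead of assuming it passes. The following is a minimal sketch under that assumption; `DatasetAssessment` and its method names are hypothetical, and the dimension keys simply mirror the five headings above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The five dimensions listed above, as lookup keys.
DIMENSIONS = ("variables", "design_match", "sample", "access", "quality")


@dataclass
class DatasetAssessment:
    name: str
    # dimension -> (passes, note); dimensions not recorded count as unverified
    checks: Dict[str, Tuple[bool, str]] = field(default_factory=dict)

    def record(self, dimension: str, passes: bool, note: str = "") -> None:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        self.checks[dimension] = (passes, note)

    def unverified(self) -> List[str]:
        return [d for d in DIMENSIONS if d not in self.checks]

    def failed(self) -> List[str]:
        return [d for d in DIMENSIONS if d in self.checks and not self.checks[d][0]]

    def usable(self) -> bool:
        # Usable only when every dimension was checked and none failed.
        return not self.unverified() and not self.failed()
```

A dataset with an unchecked codebook or unknown access terms then shows up in `unverified()` and cannot be reported as usable by accident.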
Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.
Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.
import requests, io, pandas as pd
# --- Tabular files (most common) ---
df = pd.read_csv("data.csv") # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx") # Excel
df = pd.read_stata("data.dta") # Stata
df = pd.read_sas("data.xpt", format="xport") # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat") # SAS native
df = pd.read_parquet("data.parquet") # Parquet
df = pd.read_json("data.json") # JSON (records or columnar)
df = pd.read_fwf("data.dat") # Fixed-width (some legacy surveys)
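# --- Download from URL first, then parse ---
# `url` is the direct download link for the file
resp = requests.get(url, timeout=120)
resp.raise_for_status()  # fail on HTTP errors instead of parsing an error page
content = resp.content
# Detect format from the URL suffix (case-insensitive)
name = url.lower()
# pandas does not infer compression from an in-memory buffer, so set it explicitly
compression = "gzip" if name.endswith(".gz") else None
if name.endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif name.endswith((".csv", ".csv.gz")):
    df = pd.read_csv(io.BytesIO(content), compression=compression)
elif name.endswith((".tsv", ".tsv.gz")):
    df = pd.read_csv(io.BytesIO(content), sep="\t", compression=compression)
elif name.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    # Unknown suffix: try CSV first, then inspect the result
    df = pd.read_csv(io.BytesIO(content))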
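# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
# `api_url` is the endpoint; the paging parameter names ("offset", "limit")
# and the response key ("data") vary by API, so check its documentation
all_records = []
offset = 0
while True:
    resp = requests.get(api_url, params={"offset": offset, "limit": 100}, timeout=30)
    resp.raise_for_status()  # stop on HTTP errors instead of looping on bad responses
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)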
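When the URL suffix is missing or misleading, the first bytes of a small sample often identify the format. The helper below is a rough sketch of that inspection step using only well-known file signatures; `sniff_format` is a hypothetical name, the delimiter check is a heuristic, and formats without a simple signature (for example older Stata files) fall through to "unknown".

```python
def sniff_format(head: bytes) -> str:
    """Guess a file format from its first bytes (a few hundred are enough)."""
    if head.startswith(b"\x1f\x8b"):
        return "gzip"  # decompress, then sniff the inner file again
    if head.startswith(b"PAR1"):
        return "parquet"
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # also matches .xlsx, which is a zip container
    if head.startswith(b"HEADER RECORD*******LIB"):
        return "xport"  # SAS transport file
    if head.startswith(b"<stata_dta>"):
        return "stata"  # Stata format 117 and later only
    # Text formats: drop a UTF-8 byte-order mark and leading whitespace
    text = head
    if text.startswith(b"\xef\xbb\xbf"):
        text = text[3:]
    text = text.lstrip()
    if text[:1] in (b"{", b"["):
        return "json"
    first_line = text.split(b"\n", 1)[0]
    if b"\t" in first_line:
        return "tsv"
    if b"," in first_line:
        return "csv"
    return "unknown"
```

To get the sample without downloading the whole file, one option is a streamed request, for example `next(requests.get(url, stream=True, timeout=30).iter_content(chunk_size=512))`, and then `sniff_format` on the returned bytes. Treat the result as a hint and still inspect the parsed frame with `df.head()` and `df.dtypes`.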
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")
# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()
# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
subset = subset.dropna(subset=["exposure", "outcome"])  # same column names as the model below
# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())
# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
Always run the code and report actual numbers (β, p-value, CI, N).
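To pull those numbers out of a fitted model instead of reading them off the printed summary, something like the following can help. It is a sketch that assumes `result` is a fitted statsmodels results object (such as `model` from the regression above) exposing `params`, `pvalues`, `conf_int()`, and `nobs`; `report_term` and `format_report` are hypothetical helper names.

```python
def report_term(result, term: str, alpha: float = 0.05) -> dict:
    """Collect β, confidence interval, p-value, and N for one model term."""
    ci_low, ci_high = result.conf_int(alpha=alpha).loc[term]
    return {
        "term": term,
        "beta": float(result.params[term]),
        "ci_level": 1 - alpha,
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
        "p_value": float(result.pvalues[term]),
        "n": int(result.nobs),  # rows actually used in the fit
    }


def format_report(r: dict) -> str:
    """Render the collected numbers as one line for the written report."""
    return (
        f"{r['term']}: β = {r['beta']:.3f} "
        f"({r['ci_level']:.0%} CI {r['ci_low']:.3f} to {r['ci_high']:.3f}), "
        f"p = {r['p_value']:.3g}, N = {r['n']}"
    )
```

With the regression above, `print(format_report(report_term(model, "exposure")))` gives a single line that can be pasted into the report. Taking N from `result.nobs` matters because rows with missing values are dropped before fitting, so it can be smaller than `len(subset)`.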
Structure the report as:
Critical honesty rules:
Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.