plugin/skills/tooluniverse-dataset-discovery/SKILL.md
Find and evaluate research datasets for any scientific question. Maps research questions to required study designs (longitudinal vs cross-sectional, observational vs experimental, single-cohort vs multi-cohort). Use when the user asks 'find data about X', 'where can I get data on Y', or needs a specific cohort/survey/repository. Covers GEO, ArrayExpress, dbGaP, NHANES, UK Biobank, ClinicalTrials.gov, GWAS Catalog, and 30+ scientific repositories.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-dataset-discoveryInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Before searching, determine the minimum data requirements:
Study design needed:
Variables needed:
Population needed:
Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast wide net): Search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
Layer 2 — Domain-specific repositories: Search repositories specialized for your data type.
Layer 3 — Literature-based discovery: Many datasets aren't in any repository — they're described in paper methods sections.
For each candidate dataset, assess these dimensions:
Variables:
Design match:
Sample:
Access:
Quality:
Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.
Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.
import requests, io, pandas as pd
# --- Tabular files (most common) ---
df = pd.read_csv("data.csv") # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx") # Excel
df = pd.read_stata("data.dta") # Stata
df = pd.read_sas("data.xpt", format="xport") # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat") # SAS native
df = pd.read_parquet("data.parquet") # Parquet
df = pd.read_json("data.json") # JSON (records or columnar)
df = pd.read_fwf("data.dat") # Fixed-width (some legacy surveys)
# --- Download from URL first, then parse ---
resp = requests.get(url, timeout=120)
content = resp.content
# Detect format from URL or content header
if url.endswith(".XPT") or url.endswith(".xpt"):
df = pd.read_sas(io.BytesIO(content), format="xport")
elif url.endswith(".csv") or url.endswith(".csv.gz"):
df = pd.read_csv(io.BytesIO(content))
elif url.endswith(".tsv") or url.endswith(".tsv.gz"):
df = pd.read_csv(io.BytesIO(content), sep="\t")
elif url.endswith(".json"):
df = pd.read_json(io.BytesIO(content))
else:
# Try CSV first, then inspect
df = pd.read_csv(io.BytesIO(content))
# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
import json
all_records = []
offset = 0
while True:
resp = requests.get(f"{api_url}?offset={offset}&limit=100", timeout=30)
batch = resp.json().get("data", [])
if not batch:
break
all_records.extend(batch)
offset += len(batch)
df = pd.DataFrame(all_records)
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")
# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()
# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
subset = subset.dropna(subset=["exposure_var", "outcome_var"])
# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())
# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
Always run the code and report actual numbers (β, p-value, CI, N).
Structure the report as:
Critical honesty rules:
Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.
tools
PCR / qPCR primer and oligo design — design forward/reverse primers for a target region (SantaLucia nearest-neighbor thermodynamics), compute melting temperature (Tm) and annealing temperature (Ta), check GC content, and screen an oligo for hairpins and primer-dimers. Use when you need primers for a sequence, want to QC an existing primer pair, or need the Tm of an oligo. Covers the primer-design rules (Tm matching, GC clamp, 3'-end, length) and the tools' constraint quirks.
tools
Pharmacokinetic (PK) analysis of concentration-time data — non-compartmental analysis (NCA) for Cmax, Tmax, AUC (0-t and 0-∞), terminal half-life, clearance (CL), volume of distribution (Vd), MRT, and absolute bioavailability (F). Also one-compartment fitting. Use when you have plasma/serum drug concentrations over time after a dose and need PK parameters, or to compute bioavailability from IV + oral AUCs. NOT for ADMET property prediction from structure (use tooluniverse-admet-prediction).
tools
Molecular cloning assembly design — Gibson Assembly (overlap design for seamless multi-fragment joining) and Golden Gate Assembly (Type IIS / BsaI / BbsI design with unique 4-bp fusion overhangs). Use when you need to plan how to join DNA fragments into a construct, design assembly overlaps/overhangs, or decide between cloning methods. Covers the domestication (internal-site removal), overhang-uniqueness, and overlap-Tm rules. For PCR primers to generate the fragments, see tooluniverse-primer-design.
tools
Meta-analysis / evidence synthesis — pool effect sizes across studies (odds ratios, risk ratios, hazard ratios, mean differences, correlations, GWAS betas) with fixed- or random-effects models, quantify heterogeneity (Q, I², τ²), and build a forest plot. Use when you have results from MULTIPLE studies and need a single pooled estimate, or to synthesize evidence from a systematic review / multiple GWAS / replicated experiments. Handles the error-prone effect-size + standard-error preparation (converting OR/HR/CI, two-group means±SD, proportions, and correlations into the (effect, SE) the pooling step needs).