plugin/skills/tooluniverse-epidemiological-analysis/SKILL.md
End-to-end observational epidemiology analysis — from research question (PECO Population/Exposure/Comparator/Outcome) to publication-ready statistical report. Covers cohort/case-control/cross-sectional design, regression with confounders, propensity scoring, sensitivity analysis. Writes Python code for every step. Use for epidemiology study analysis, NHANES/UK-Biobank-style analyses.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-epidemiological-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Complete workflow for observational epidemiology — from research question to publication-ready report. Write and run Python code for every step. Never describe what you "would do" — do it.
Define Population, Exposure, Comparator, Outcome before touching data.
Study design check: Does the question require temporality?
If the question implies causation ("does X cause Y?") but only cross-sectional data is available, state the limitation explicitly and proceed with association language.
Use ToolUniverse to discover datasets and find what prior studies used:
# Search for relevant datasets — use find_tools to discover what's available
find_tools("dataset search")
find_tools("your domain keywords") # e.g., "cancer genomics", "clinical trial", "survey health"
# Search literature for study precedents — papers cite their data sources
execute_tool("PubMed_search_articles", {"query": "[exposure] [outcome] [study design]", "max_results": 5})
execute_tool("EuropePMC_search_articles", {"query": "[exposure] [outcome] cohort", "limit": 5})
Evaluate dataset fitness: Does it have the exposure variable? The outcome? Key confounders (age, sex, plus domain-specific)? Adequate sample size?
Power analysis (run before committing to a dataset):
from scipy.stats import norm
import numpy as np
def sample_size_logistic(p0, OR, alpha=0.05, power=0.80):
"""Minimum N for logistic regression detecting OR at given power."""
p1 = (p0 * OR) / (1 - p0 + p0 * OR)
z_a, z_b = norm.ppf(1 - alpha/2), norm.ppf(power)
n = ((z_a + z_b)**2 * (1/(p0*(1-p0)) + 1/(p1*(1-p1)))) / (np.log(OR))**2
return int(np.ceil(n))
print(f"Need N={sample_size_logistic(0.10, 1.5)} for OR=1.5 with 10% baseline prevalence")
Download data programmatically. Adapt the loading code to your data source's format.
import pandas as pd
import requests, io
# Generic download helper — adapt URL and format to your source
def download_and_parse(url, fmt="csv"):
r = requests.get(url, timeout=120)
content = io.BytesIO(r.content)
if fmt == "xpt":
return pd.read_sas(content, format="xport")
elif fmt == "csv":
return pd.read_csv(content)
elif fmt == "tsv":
return pd.read_csv(content, sep="\t")
elif fmt == "stata":
return pd.read_stata(content)
elif fmt == "json":
return pd.read_json(content)
else:
return pd.read_csv(content) # default fallback
# Load and merge multiple files on shared ID column
df1 = download_and_parse(url1, fmt="xpt")
df2 = download_and_parse(url2, fmt="xpt")
df = df1.merge(df2, on="id_col", how="inner")
# Filter population (inclusion/exclusion criteria)
df = df[(df['age'] >= 20) & (df['age'] < 80)]
# Handle missing data
missing_pct = df.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
# Decision: complete case if <5% missing; multiple imputation if 5-20%; drop variable if >20%
# Variable coding (adapt to your data)
df['age_group'] = pd.cut(df['age'], bins=[20,40,60,80], labels=['20-39','40-59','60-79'])
df['outcome_binary'] = (df['outcome_continuous'] >= threshold).astype(int)
Survey weights: Some surveys (NHANES, BRFSS, MEPS) require sampling weights for valid inference. Check the survey documentation. For weighted regression, use statsmodels.stats.weightstats or linearmodels.
REST API data: For sources like GDC (TCGA), ClinicalTrials.gov, or OpenTargets, paginate through the API:
all_records = []
offset = 0
while True:
resp = requests.get(f"{api_url}?offset={offset}&limit=500", timeout=30)
batch = resp.json().get("data", [])
if not batch:
break
all_records.extend(batch)
offset += len(batch)
df = pd.DataFrame(all_records)
# Table 1: mean +/- SD for continuous, N(%) for categorical, by exposure group
continuous_vars = ['age', 'bmi'] # adapt to your variables
for var in continuous_vars:
print(df.groupby('exposure_group')[var].agg(['mean', 'std', 'count']))
categorical_vars = ['sex', 'race'] # adapt to your variables
for var in categorical_vars:
print(pd.crosstab(df['exposure_group'], df[var], normalize='index') * 100)
Check distributions: df[var].skew(), scipy.stats.shapiro(), histograms for outliers.
Sequential adjustment strategy (build evidence for confounding):
import statsmodels.formula.api as smf
import numpy as np
# Model 1: Unadjusted
m1 = smf.logit('outcome ~ exposure', data=df).fit(disp=0)
# Model 2: + demographics
m2 = smf.logit('outcome ~ exposure + age + sex + race', data=df).fit(disp=0)
# Model 3: + clinical factors
m3 = smf.logit('outcome ~ exposure + age + sex + race + bmi + smoking + alcohol', data=df).fit(disp=0)
# Report ORs with 95% CI
for name, model in [('Unadjusted', m1), ('Demographics', m2), ('Fully adjusted', m3)]:
or_val = np.exp(model.params['exposure'])
ci = np.exp(model.conf_int().loc['exposure'])
print(f"{name}: OR={or_val:.2f} (95% CI: {ci[0]:.2f}-{ci[1]:.2f}), p={model.pvalues['exposure']:.4f}")
Model selection by outcome type:
smf.ols()smf.logit()OrderedModel from statsmodelsCoxPHFitter from lifelinessmf.poisson() or smf.negativebinomial()Assumption checks:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Multicollinearity (VIF > 5 is concerning, > 10 is severe)
X = df[['age', 'bmi', 'exposure']].dropna()
for i, col in enumerate(X.columns):
print(f"VIF {col}: {variance_inflation_factor(X.values, i):.1f}")
# Stratified analysis (effect modification)
for stratum_var in ['sex', 'age_group', 'race']:
print(f"\n--- Stratified by {stratum_var} ---")
for level, sub in df.groupby(stratum_var):
if len(sub) < 50: continue
try:
m = smf.logit('outcome ~ exposure + age + bmi', data=sub).fit(disp=0)
or_val = np.exp(m.params['exposure'])
ci = np.exp(m.conf_int().loc['exposure'])
print(f" {level}: OR={or_val:.2f} ({ci[0]:.2f}-{ci[1]:.2f}), p={m.pvalues['exposure']:.3f}, N={len(sub)}")
except: print(f" {level}: model failed (N={len(sub)})")
# Exclude outliers (+/- 3 SD) and re-run
df_no_outliers = df[np.abs(stats.zscore(df['exposure'].dropna())) < 3]
m_robust = smf.logit('outcome ~ exposure + age + sex + bmi', data=df_no_outliers).fit(disp=0)
# Confounder-adjusted exposure (residual method, e.g., energy-adjusted nutrient intake)
# Use when exposure correlates strongly with a confounder (total calories, body size, etc.)
adj_model = smf.ols('exposure ~ confounder', data=df).fit()
df['exposure_adj'] = adj_model.resid
# Multiple comparisons note
n_tests = 5 # number of exposure-outcome pairs tested
bonferroni_threshold = 0.05 / n_tests
This is where ToolUniverse adds value beyond any statistics package. After finding a statistical association, investigate the biological plausibility. Use find_tools to discover the right tools for your domain.
# 1. Literature evidence — search for mechanism connecting exposure to outcome
execute_tool("PubMed_search_articles", {"query": "[exposure] [outcome] mechanism", "max_results": 5})
execute_tool("EuropePMC_search_articles", {"query": "[exposure] [outcome] mouse model in vivo", "limit": 5})
# 2. Pathway/molecular context — discover tools for your exposure type
find_tools("[exposure type] pathway") # e.g., "nutrient pathway", "drug target", "chemical toxicology"
find_tools("[outcome type] gene disease") # e.g., "diabetes gene", "cancer survival", "cardiac risk"
# 3. Gene-disease evidence (if a gene/variant is involved)
find_tools("gene disease association")
find_tools("variant functional annotation")
# 4. Drug/chemical mechanisms (if a drug or chemical is the exposure)
find_tools("drug mechanism target")
find_tools("chemical gene interaction")
The pattern: Exposure X → (what molecular pathway?) → (what biological process?) → Outcome Y. Use ToolUniverse tools to fill in the middle steps. This converts a statistical association into a biologically plausible hypothesis.
Key plots to produce (use matplotlib):
sns.regplot()Write the final report in this order:
Key limitations to always state:
Before finalizing any epidemiological analysis:
tools
Post-market safety surveillance and recall/adverse-event RETRIEVAL across the full spectrum of FDA-regulated products that are NOT covered by the drug-AE signal skills: medical devices, food / dietary supplements / cosmetics, veterinary drugs, and drug supply (shortages). Orchestrates openFDA endpoints (MAUDE device adverse events + device recalls + 510(k), CAERS food/supplement/ cosmetic adverse events, veterinary adverse events, drug shortages, and cross-product enforcement/recall reports). USE WHEN the user asks: "are there adverse events for [device / pacemaker / infusion pump / insulin pump]", "device recalls for [firm/product]", "supplement / vitamin / cosmetic adverse reactions", "is [drug] in shortage", "what injectables are on shortage", "veterinary / animal adverse events for [drug] in [dog/cat/horse]", "food recall for listeria", "MAUDE report for [device]", "CAERS reactions for [brand]". DO NOT USE for drug adverse-event SIGNAL detection or disproportionality (PRR / ROR / IC) or drug-AE association scoring — that is `tooluniverse-pharmacovigilance` / `tooluniverse-adverse-event-detection`. This skill is multi-product surveillance and retrieval, not drug-AE statistical signal mining.
tools
--- name: tooluniverse-phewas description: Cross-ancestry / cross-biobank phenome-wide association (PheWAS) and replication. Given ONE variant (rsID) or ONE gene, look up every phenotype it associates with across European/UK (UKB-TOPMed), Finnish (FinnGen), Japanese (BioBank Japan), and Taiwanese (TPMI) biobanks, plus exome-wide gene-burden PheWAS (Genebass), then judge whether an association replicates across ancestries or is population-specific. Use whenever the user asks "what else is this va
tools
Dereplicate a putative natural product and assign its chemical taxonomy. Use to answer "is [compound] a known natural product", "what microbe/organism produces [compound]", "what chemical class is [compound]", "dereplicate this metabolite (by formula/exact mass/InChIKey/SMILES)", or "classify this molecule into ChemOnt". Searches NPAtlas for known microbial natural products (producing organism + literature reference), assigns the ChemOnt kingdom→superclass→class→subclass hierarchy via ClassyFire, resolves systematic IUPAC names to structure via OPSIN, and cross-references identity in PubChem. NOT for general drug/compound identity or ADMET (use tooluniverse-chemical-compound-retrieval / tooluniverse-small-molecule-discovery) and NOT for metabolomics pathway/enrichment analysis (use tooluniverse-metabolomics skills).
tools
Genome-ASSEMBLY discovery, QC, and replicon mapping for any organism (bacteria, archaea, fungi, and beyond) using NCBI Datasets. Resolves an organism name or taxid to assemblies, picks the reference/representative or best-quality assembly, pulls assembly QC metrics (total length, contig/scaffold N50, contig count, GC%, assembly level, RefSeq category), enumerates chromosomes and plasmids via per-replicon sequence reports, and compares candidate assemblies on quality. Use for "what genomes are available for [organism]", "assembly stats / N50 / GC content for [GCF_/GCA_ accession]", "how many plasmids does [strain] have", "compare assemblies for [species]", "find the reference genome for [taxon]", "is this assembly Complete Genome or just contigs". NOT for gene-level orthology/synteny (use tooluniverse-comparative-genomics), plant gene structure (use tooluniverse-plant-genomics), de novo assembly from raw reads (no tool exists), or taxonomy-only name/lineage lookups.