clinical-databases/polygenic-risk/SKILL.md
Constructs and validates polygenic risk scores using LDpred2-auto, SBayesRC, MegaPRS, PRS-CS, PROSPER, MUSSEL, BridgePRS, JointPRS, PRSmix, or PGS Catalog Calculator with ancestry-aware reference panels (HapMap3, UKB-LD), Pejaver-style calibration, and PRS-RS reporting standards. Use when computing PRS for cohorts, applying Whiffin-style absolute-risk transformation, assessing cross-ancestry portability (Martin 2017 / Ding 2023 continuous ancestry), or auditing PRS manuscripts against the 22-item PRS-RS reviewer checklist.
npx skillsauth add GPTomics/bioSkills bio-clinical-databases-polygenic-riskInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bigsnpr 1.12+ (LDpred2; Privé 2020), PRSice-2 2.3.5+, PRS-CS 1.0.0+ (Ge 2019), gctb 2.5+ (SBayesR/SBayesS/SBayesRC; Zheng 2024), LDAK 6.0+ (MegaPRS; Zhang 2021), pgsc_calc 2.0+ (nf-core; Lambert 2024), Hail 0.2.130+, numpy 1.26+, pandas 2.2+. No general FDA PRS guidance document exists as of May 2026; the operative regulatory text is the August 2025 Federal Register notice on Cancer Predisposition Risk Assessment Systems (Class II device with special controls).
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. The LDpred2-auto snp_ldpred2_auto() signature changed in bigsnpr 1.11+; pin allow_jump_sign = FALSE and shrink_corr = 0.95 explicitly.
'Compute a PRS for my cohort using these GWAS summary statistics' -> Match variants to target genotypes, choose method by data availability + trait architecture, derive LD-aware effect estimates, score, normalize by ancestry, transform to absolute risk.
pgsc_calc --target target.vcf --pgs_id PGS000001 (nf-core, Lambert 2024)bigsnpr::snp_ldpred2_auto() (Privé 2020)PRS-CSx, PROSPER, MUSSEL, BridgePRS, JointPRSgctb --sbayes-rc --bfile target --gwas-summary sumstats.maPRSice_linux (clumping + thresholding; still cited for some clinical scores)| Method | Approach | Best for | Fails when | |--------|----------|----------|-----------| | SBayesRC (Zheng 2024 Nat Genet) | Bayesian + 96 functional annotations | EUR; sparse traits | Sumstats LD-incoherent with reference; chain divergence (run --impute-summary first) | | MegaPRS (Zhang 2021 Nat Commun) | BLD-LDAK heritability model | EUR; sparser traits | Lacks GCTA-model assumptions; legacy GCTA pipelines | | LDpred2-auto (Privé 2020) | Bayesian + auto-tuning | Polygenic EUR | LD ref mismatch (s > 0.05); allow_jump_sign default True (must pin FALSE) | | PRS-CS-auto (Ge 2019 Nat Commun) | Continuous-shrinkage prior | Polygenic EUR | Sparse-trait architecture; HapMap3-restricted variants only | | lassosum2 (Privé 2022) | Penalized regression | EUR alternative | Highly polygenic (Bayesian methods better); requires tuning data | | C+T (PRSice-2; Choi 2019) | Clumping + thresholding | Legacy clinical scores (PRS313) | Highly polygenic; Bayesian methods dominate | | PROSPER (Zhang 2024 Nat Commun) | Ensemble penalized regression | Multi-ancestry, AFR + others | Single-ancestry; tuning set < 1000 | | MUSSEL (Jin 2024 Cell Genomics) | Spike-slab + super-learner | Multi-ancestry; admixed AFR | Single-ancestry; lacks tuning data | | JointPRS (Hu S et al 2025 Nat Commun) | Data-adaptive Bayesian | Multi-ancestry; sumstats only | Single-ancestry; very small target. (Verify exact volume/page reference in the latest Nat Commun citation; the earlier "16:59243" attribution appears implausible.) | | PRS-CSx (Ruan 2022 Nat Genet) | PRS-CS multi-ancestry extension | Multi-ancestry with EUR + non-EUR sumstats | Low causal-variant overlap across ancestries | | BridgePRS (Hoggart 2024 Nat Genet) | Ridge-bridge sharing | Low-h^2 AFR / low causal overlap | Standard scenarios (PROSPER/MUSSEL win) | | PolyPred / PolyPred+ (Weissbrod 2022) | BOLT-LMM + PolyFun-SuSIE | Multi-ancestry; biobank-scale | Small individual-level data; expensive |
Citation traps caught by senior PIs:
Martin 2017 Nat Genet established the 4.5x R^2 attenuation between EUR and AFR. Updates:
Multi-ancestry method choice:
--ldm-eigen).Khera 2018 Nat Genet established the clinical-PRS narrative; Hingorani 2023 BMJ Med is the operative critique:
Calibration mechanics:
22-item checklist. Reviewer-priority items: cohort independence between development + evaluation (item 13); confounder adjustment in evaluation (item 16); absolute-risk reporting (item 19); ancestry composition of validation (item 21). Nature Genetics-tier manuscripts without PRS-RS adherence are rejected at review.
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| EUR cohort + individual-level data | LDpred2-auto or SBayesRC | SBayesRC integrates functional annotations |
| EUR cohort + sumstats only | MegaPRS or SBayesRC | LDpred2 also viable |
| Highly polygenic trait (height, BMI, education) | LDpred2-auto, PRS-CS-auto | Continuous-shrinkage priors well-suited |
| Sparse trait (lipids, AMD) | MegaPRS, SBayesRC | BLD-LDAK or SBayesR mixture priors |
| Multi-ancestry, large tuning set | PROSPER or MUSSEL | Top performance per benchmarks |
| Multi-ancestry, sumstats + small tuning | JointPRS or PRS-CSx | Joint Bayesian framework |
| Multi-ancestry, target = AFR | MUSSEL or BridgePRS | Best non-EUR performance |
| Combining multiple PGS Catalog scores | PRSmix (single trait) or PRSmix+ (cross-trait) | Elastic-net combination |
| Production score for biobank | pgsc_calc Nextflow nf-core | Handles liftover, ancestry, normalization automatically |
| No tuning data available | LDpred2-auto, PRS-CS-auto, JointPRS-auto | Bayesian auto-tuning |
| Clinical reporting | Apply Hingorani-aware framing: absolute-risk transform via external incidence curve | HR per SD = 1.3-1.7 alone is inadequate for screening |
Goal: Compute LDpred2-auto PRS from sumstats + target genotypes with appropriate LD reference and sample-overlap detection.
Approach: Use bigsnpr R package; match variants strand-aware; compute LD or use prebuilt UK Biobank LD; run snp_ldpred2_auto() with shrinkage and jump-sign guards; assess s parameter for LD mismatch.
library(bigsnpr)
library(data.table)
# Load target genotypes (PLINK .bed/.bim/.fam -> bigSNP object .rds)
# snp_readBed('target.bed', 'target.rds')
obj <- snp_attach('target.rds')
G <- obj$genotypes
map <- obj$map
# Load GWAS summary stats (standardized format)
sumstats <- fread('gwas_sumstats.txt')
# Required columns: chr, pos, a0 (ref), a1 (effect), beta, beta_se, n_eff, p
# Match variants strand-aware (snp_match handles A/T C/G ambiguity)
df_beta <- snp_match(sumstats, map, strand_flip = TRUE)
# Compute LD correlation matrix (in-sample) OR use prebuilt UKB LD.
# For a 3 cM window, pass the cM positions via `infos.pos = CHR_POS_CM` and set
# `size = 3`. Writing `size = 3/1000` is silently broken because 0.003 rounds to 0.
corr <- snp_cor(G, ind.col = df_beta[['_NUM_ID_']],
infos.pos = df_beta[['cM']],
size = 3, # 3 cM window when infos.pos is in cM
ncores = 8)
# LDSC heritability estimate + LD mismatch (s) diagnostic
ldsc_res <- snp_ldsc2(corr, df_beta)
h2_est <- ldsc_res[['h2']]
ldsc_s <- ldsc_res[['int']] # intercept; large => sample overlap
# LDpred2-auto; run multiple chains in parallel
multi_auto <- snp_ldpred2_auto(
corr, df_beta,
h2_init = h2_est,
vec_p_init = seq_log(1e-4, 0.2, 30),
burn_in = 500, num_iter = 200,
allow_jump_sign = FALSE, # CRITICAL: pin to FALSE to avoid sign artifacts
shrink_corr = 0.95, # LD-shrinkage guard
ncores = 8
)
# Filter divergent chains
beta_auto <- sapply(multi_auto, function(x) x$beta_est)
range_auto <- sapply(multi_auto, function(x) diff(range(x$corr_est)))
keep <- range_auto > (0.95 * quantile(range_auto, 0.95))
beta_final <- rowMeans(beta_auto[, keep, drop = FALSE])
# Score
pred <- big_prodMat(G, beta_final, ind.col = df_beta[['_NUM_ID_']])
Goal: Compute PRS integrating ~96 functional annotations (baseline-LD v2.2) for +14% R^2 over SBayesR.
Approach: GCTB CLI with the SBayesRC algorithm; preprocess sumstats with --impute-summary.
# 1. Impute missing variants in sumstats against reference panel
gctb \
--sbayes-rc \
--impute-summary \
--gwas-summary gwas_sumstats.ma \
--ldm-eigen ukb_eigen_ld.eigen \
--annot baselineLD_v2.2 \
--out sbayesrc_preprocessed.ma
# 2. Run SBayesRC main step
gctb \
--sbayes-rc \
--gwas-summary sbayesrc_preprocessed.ma \
--ldm-eigen ukb_eigen_ld.eigen \
--annot baselineLD_v2.2 \
--num-chains 4 --chain-length 25000 --burn-in 5000 \
--out sbayesrc_run
# 3. Score target genotypes
plink2 --bfile target \
--score sbayesrc_run.snpRes 2 5 8 header \
--out sbayesrc_scores
Goal: Compute multiple published PGS scores in one pipeline with automatic liftover, ancestry projection, and normalization.
Approach: Nextflow nf-core pgsc_calc workflow handles GRCh37/38 liftover, PC-projection-based ancestry assignment, ambiguous-SNP filtering, and mean/variance normalization.
# Calculate multiple PGS for cohort
nextflow run pgscatalog/pgsc_calc \
--input samplesheet.csv \
--target_build GRCh38 \
--pgs_id PGS000004,PGS000019,PGS001775 \
--run_ancestry resources/pgsc_HGDP+1kGP_v1.tar.zst \
--outdir pgsc_output/ \
-profile docker
Goal: Compute multi-ancestry PRS using ancestry-specific GWAS sumstats jointly via continuous-shrinkage Bayesian framework.
Approach: PRS-CSx jointly models multiple ancestries with shared shrinkage prior; outperforms per-ancestry PRS-CS for non-EUR.
python PRScsx.py \
--ref_dir=ldblk_1kg \
--bim_prefix=target \
--sst_file=eur_sumstats.txt,afr_sumstats.txt,eas_sumstats.txt \
--n_gwas=200000,30000,50000 \
--pop=EUR,AFR,EAS \
--out_dir=prscsx_out --out_name=cohort \
--phi=1e-2 # tune by trait architecture: 1e-2 polygenic, 1e-4 sparse
# Score each ancestry-specific posterior, then combine on tuning set
for pop in EUR AFR EAS; do
plink2 --bfile target \
--score prscsx_out/cohort_${pop}_pst_eff_a1_b0.5_phi1e-02.txt 2 4 6 \
--out scores_${pop}
done
Goal: Convert raw PRS to ancestry-conditional percentiles for clinical interpretation.
Approach: Subtract conditional mean given test-cohort PCs (NOT discovery PCs); divide by conditional SD.
import numpy as np
from scipy import stats
import statsmodels.api as sm
def ancestry_conditional_normalize(prs, pcs, n_pcs=10):
'''Recalibrate PRS by removing ancestry effects (Sun 2021 + Ding 2023 continuous).
pcs: matrix of principal components computed in the TEST cohort (not discovery).
Returns: ancestry-conditional Z scores.
'''
X = sm.add_constant(pcs[:, :n_pcs])
mean_model = sm.OLS(prs, X).fit()
expected = mean_model.predict(X)
residuals = prs - expected
log_var_model = sm.OLS(np.log(residuals ** 2 + 1e-12), X).fit()
expected_log_var = log_var_model.predict(X)
sd = np.sqrt(np.exp(expected_log_var))
return residuals / sd
def prs_to_percentile(z_scores):
'''Convert ancestry-conditional Z to population percentile.'''
return stats.norm.cdf(z_scores) * 100
def absolute_risk_transform(percentile, incidence_curve_age, age):
'''Integrate over external age-conditional incidence curve to get absolute risk.
Khera 2018 used FOS for CAD; Lee 2019 BOADICEA v6 for breast cancer.
The HR-per-SD-only framing (Hingorani 2023 critique) is insufficient for screening.
'''
base_risk = incidence_curve_age(age)
z = stats.norm.ppf(percentile / 100)
relative_risk = np.exp(z * 0.5) # placeholder; trait-specific log HR
return base_risk * relative_risk
Goal: Detect overlap between PRS-discovery cohort and target cohort, which inflates apparent PRS performance.
Approach: Run bivariate LDSC; |intercept| > 0.05 with target n >= 1000 is the typical alarm.
# bivariate LDSC for sample overlap
ldsc.py \
--rg target_sumstats.sumstats.gz,discovery_sumstats.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out overlap_check
# Inspect *.log: gcov_int is the sample-overlap-driven intercept (after gencov correction)
# |gcov_int| > 0.05 with target n >= 1000 => substantial overlap
1. Discovery + tuning + test overlap (the inflated R^2 trap)
2. Treating discrete ancestry boxes as ground truth
3. Strand-ambiguous SNPs (A/T, C/G) handled wrong
snp_match() frequency-matches with 0.4-0.6 MAF tolerance.4. LD reference mismatch (high s parameter)
snp_ldsc2() returns LD-mismatch parameter s; values > 0.05 indicate problematic mismatch.s routinely.5. PCs in derivation but not in evaluation
6. Treating HR per SD = 1.5 as clinically actionable
7. PRSmix / MUSSEL / PROSPER cited in wrong journal
8. HLA region included in PRS without explicit handling
9. Sex chromosomes mishandled
| Pattern | Likely cause | Action |
|---------|-------------|--------|
| LDpred2-auto vs PRS-CS-auto disagree on top decile | Trait sparsity differs; method assumption mismatch | Use SBayesRC as tiebreaker; ensemble via PRSmix |
| Multi-ancestry methods rank differently | Tuning set size; ancestry composition | Report ensemble; document hyperparameter sensitivity |
| EUR PRS R^2 << non-EUR target R^2 | Expected; Martin 2017 transferability | Switch to PROSPER / MUSSEL / BridgePRS |
| PRS performance differs across age strata | Mostafavi 2020; within-ancestry heterogeneity | Report age-stratified PRS performance |
| PRS unrelated to phenotype despite high h^2 | Sample overlap; strand-ambiguous SNPs; PC issues | Run EraSOR; check snp_match() flips; recompute PCs in test |
| PRS percentile shifts with array platform | Different ascertainment of common variants | Recalibrate per-platform; use ancestry-conditional Z |
| Threshold | Convention | Source |
|-----------|-----------|--------|
| HapMap3 SNPs | ~1.1M variants; required for PRS-CS, LDpred2 | HapMap project |
| LDpred2 LD ref s | < 0.05 indicates well-matched LD | Privé 2022 |
| EraSOR | |intercept| > 0.05 with n >= 1000 => sample overlap | Choi 2023 |
| MAF range for ambiguous SNPs | Drop or frequency-match 0.4-0.6 | LDpred2 default |
| INFO score (imputed) | >= 0.8 for inclusion | Yengo 2018 |
| Kinship coefficient cutoff | KING > 0.0884 = 3rd-degree or closer; remove | KING documentation |
| HLA region exclusion | chr6 28-34 Mb (some use 25-35) | Convention |
| HR per SD typical | 1.3-1.7 for most diseases | Hingorani 2023 BMJ Med |
| Top 2.5% PRS CAD detection rate | ~7% of cases captured | Hingorani 2023 |
| Mavaddat PRS313 | 313 SNPs, breast cancer | Mavaddat 2019 |
| Khera CAD PRS | ~6.6M variants | Khera 2018 |
| Aragam CAD PRS | GPS_Mult; multi-ancestry SOTA 2023 | Aragam 2023 |
| Symptom | Cause | Solution |
|---------|-------|----------|
| PRS R^2 << expected on held-out data | Sample overlap with discovery | Run EraSOR; rebuild three-way disjoint splits |
| LDpred2-auto chains diverge | LD mismatch (s > 0.05) or wrong h^2 init | Use UKB LD panel; check snp_ldsc2() output |
| Strand-flip warnings ignored | snp_match defaults | Set strand_flip = TRUE; review flipped variants |
| HLA region produces extreme PRS values | Long-range LD | Exclude chr6 28-34 Mb; model HLA separately |
| Non-EUR PRS at chance | Method not multi-ancestry-aware | Switch to PROSPER/MUSSEL/BridgePRS |
| Different ranking with same data | Stochastic MCMC | Seed; report posterior CIs; check convergence |
| PGS Catalog liftover failure | Incompatible build | Use pgsc_calc auto-liftover; check log |
| Top 1% PRS appears benign | Conditional mean drift across ancestries | Apply Sun 2021 / Ding 2023 ancestry-conditional Z |
| Pushback | Standard response | |----------|-------------------| | "PRS HR 1.5 per SD is similar to family history" | Acknowledged (Hingorani 2023). We report absolute-risk integrated over age-conditional incidence (Wald-Hingorani framing); we do NOT recommend population-level screening on PRS alone. | | "Why use Bayesian methods over C+T?" | LDpred2-auto / PRS-CS / SBayesRC capture polygenic architecture better; +20-50% R^2 vs C+T per Pain 2021 / Privé 2022 benchmarks. | | "Why not use the largest published GWAS?" | We checked sample overlap between discovery and target with EraSOR / bivariate LDSC intercept; |intercept| < 0.05. | | "Discrete ancestry boxes don't capture genetic structure" | Agreed; we report PRS performance vs continuous PC distance (Ding 2023 Nature); discrete labels are summary only. | | "Why exclude HLA?" | HLA region extreme LD makes SNP-based posteriors unstable; we model classical HLA alleles separately via HIBAG / SNP2HLA. | | "PRS-CS phi parameter; how chosen?" | We use --phi=auto with multiple chains; for sparse traits (lipids) 1e-4; for highly polygenic (height) 1e-2. | | "Where is the absolute risk?" | Reported per Wand 2021 PRS-RS item 19; integrated over age-conditional incidence using external curve. | | "FDA PRS guidance?" | No general FDA PRS draft guidance exists as of 2026; August 2025 Federal Register Class II classification for Cancer Predisposition Risk Assessment Systems is the operative regulatory text. | | "Why three different references for SBayesRC?" | Zheng 2024 Nat Genet is the primary paper; baseline-LD v2.2 is Finucane 2018; UKB LD panel is the bigsnpr/Privé 2020 release. |
https://www.pgscatalog.orghttps://github.com/PGScatalog/pgsc_calchttps://www.federalregister.gov/documents/2025/08/21/2025-16035/testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.