data-visualization/manhattan-qq-locuszoom/SKILL.md
Build Manhattan, Miami, QQ, and locuszoom-style regional plots from GWAS, TWAS, PWAS, and QTL summary statistics with correct genomic-inflation diagnostics, multi-trait overlays, lead-SNP labeling, and LD-aware regional rendering. Use when visualizing association results across the genome, comparing two traits, computing genomic inflation lambda, or zooming into a locus with LD coloring.
npx skillsauth add GPTomics/bioSkills bio-data-visualization-manhattan-qq-locuszoomInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: qqman 0.1.9 (R), CMplot 4.5+ (R), matplotlib 3.8+, pandas 2.2+, scipy 1.12+, plinkQC 0.3+. For locuszoom-style: locuszoomr 0.3+ (R) or pyranges + matplotlib.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Plot my GWAS results" -> Render per-variant -log10(p) across the genome (Manhattan), compare expected vs observed p quantiles (QQ + λGC), overlay two traits with mirrored axes (Miami), and zoom into a locus with LD-colored points + recombination rate + gene track (locuszoom). The choices that matter: significance thresholds, axis truncation for ultra-significant peaks, lead-SNP labeling, and LD reference selection for regional plots.
qqman::manhattan / qqman::qq (Turner 2018), CMplot::CMplot, locuszoomr::locus_plotmatplotlib + pandas for custom; assocplots for ready-madeThe "genome-wide significant" line at p < 5e-8 (Pe'er 2008 Genet Epidemiol 32:381) is calibrated for European-ancestry common-variant GWAS assuming ~1M effectively independent tests. It is the wrong threshold for:
A Manhattan plot's significance line is a contract with the reader about which multiple-testing regime applies. Mismatched thresholds over- or under-report hits.
| Analysis | Genome-wide threshold | Suggestive threshold | Reference | |----------|----------------------|---------------------|-----------| | Common-variant GWAS (Eur) | 5e-8 | 1e-5 | Pe'er 2008 Genet Epidemiol 32:381 | | Whole-genome sequencing (all variants, EUR) | 5e-9 | 5e-8 | Pulit 2017 Genet Epidemiol 41:145; Xu 2014 Genet Epidemiol 38:281 | | Non-European ancestry (empirical per pop) | ~3.24e-8 AFR; ~9.26e-8 EAS | – | Kanai 2016 J Hum Genet 61:861 | | TWAS (~20k genes) | 2.5e-6 (Bonferroni) | 1e-4 | Standard practice | | PWAS (~5k proteins) | 1e-5 | 1e-4 | Standard practice | | eQTL trans (genome-wide per probe) | Bonferroni over genes × variants | Per-tissue | GTEx convention | | eQTL cis (within 1Mb) | nominal p < 1e-5 with permutation | – | GTEx FastQTL | | Rare-variant gene burden | 2.5e-6 | 1e-4 | Bonferroni 20k genes | | Trans-ancestry meta-analysis | 5e-9 | – | Convention |
Goal: Quantify whether observed p-values are inflated relative to the chi-square null, indicating cryptic population structure, relatedness, or technical artifacts.
Approach: Convert observed p to chi-square; compute median chi-square divided by 0.4549 (the median of chi-square_1; Devlin-Roeder 1999); plot expected vs observed quantiles (QQ plot).
library(qqman)
chisq <- qchisq(1 - df$P, df = 1)
lambda <- median(chisq) / 0.4549
# lambda = 1.0 -> no inflation
# lambda > 1.1 -> investigate; could indicate confounding
# lambda > 1.2 -> almost certainly confounded; principal components or LMM needed
qq(df$P, main = paste('QQ plot (lambda =', round(lambda, 3), ')'))
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
chisq = stats.chi2.isf(df['P'], df=1)
lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, df=1)
# QQ plot
expected = -np.log10(np.arange(1, len(df)+1) / (len(df) + 1))
observed = -np.log10(np.sort(df['P']))
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(expected, observed, s=2)
ax.plot([0, max(expected)], [0, max(expected)], 'r--')
ax.set_xlabel(r'Expected $-\log_{10}(p)$')
ax.set_ylabel(r'Observed $-\log_{10}(p)$')
ax.set_title(f'QQ ($\\lambda_{{GC}} = {lambda_gc:.3f}$)')
Interpretation:
λ_1000 = 1 + (λ - 1) * 1000/n)Inflation can also be legitimate polygenic signal (Yang 2011 Eur J Hum Genet 19:807). Distinguish via LD-score regression intercept: confounding inflates intercept; polygenic signal inflates slope.
Standard logistic regression / Wald test p-values are anti-conservative when (a) case count < 200, (b) per-variant minor-allele count < 20, (c) case-control ratio is severely unbalanced (typical in EHR-derived cohorts). λGC may look normal but per-variant p-values are inflated independently — a Manhattan plot of these p-values is misleading regardless of inflation diagnostics.
| Regime | Test choice | Tool | |--------|-------------|------| | Balanced case-control, N>5000, MAF>0.01 | Standard logistic / linear regression | PLINK, REGENIE | | Unbalanced (case fraction <10%), large N | SPA-corrected logistic regression | SAIGE, REGENIE Firth/SPA | | Small N (<5000) | Penalized regression with bias correction | SAIGE Firth, REGENIE | | Rare variants (MAC <20) | Gene-burden or SKAT-O | STAAR, REGENIE burden |
Specifically: SAIGE (Zhou 2018 Nat Genet 50:1335) and REGENIE (Mbatchou 2021 Nat Genet 53:1097) implement saddlepoint-approximation (SPA) and Firth-bias correction. Use them for any cohort with severe case-control imbalance; the Manhattan / QQ output is then defensibly calibrated.
library(qqman)
manhattan(df,
chr = 'CHR', bp = 'BP', p = 'P', snp = 'SNP',
col = c('#0072B2', '#56B4E9'),
genomewideline = -log10(5e-8),
suggestiveline = -log10(1e-5),
ylim = c(0, max(-log10(df$P)) * 1.1),
highlight = lead_snps,
annotatePval = 5e-8,
annotateTop = TRUE)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def manhattan_plot(df, p_threshold=5e-8, suggestive=1e-5, y_cap=None):
df = df.sort_values(['CHR', 'BP']).copy()
df['neg_log10_p'] = -np.log10(df['P'])
if y_cap:
df['neg_log10_p'] = df['neg_log10_p'].clip(upper=y_cap)
df['x'] = np.arange(len(df))
chr_ticks = df.groupby('CHR')['x'].median()
chr_colors = ['#0072B2', '#56B4E9']
fig, ax = plt.subplots(figsize=(10, 4))
for i, (chrom, group) in enumerate(df.groupby('CHR')):
ax.scatter(group['x'], group['neg_log10_p'],
c=chr_colors[i % 2], s=3, rasterized=True)
ax.axhline(-np.log10(p_threshold), color='red', linestyle='--', lw=0.5)
ax.axhline(-np.log10(suggestive), color='grey', linestyle='--', lw=0.5)
ax.set_xticks(chr_ticks)
ax.set_xticklabels(chr_ticks.index, rotation=0)
ax.set_xlabel('Chromosome')
ax.set_ylabel(r'$-\log_{10}(p)$')
return fig
Genome-wide significant peaks routinely reach -log10(p) = 100+ (e.g., GWAS of BMI at FTO). The visual effect: one peak fills the y-axis, all other signal is invisible.
Fixes (ordered by preference):
y_cap = 25; clip points above the cap; mark capped points with ^ arrow at the top of the panel. Use Y-axis label "−log10(P), capped at 25"ggbreak::scale_y_break() (R) or axes_grid1.divider (matplotlib)# CMplot supports Miami natively
library(CMplot)
CMplot(list(trait1_df, trait2_df),
plot.type = 'm',
multraits = TRUE,
threshold = 5e-8,
threshold.col = 'red',
col = list(c('#0072B2','#56B4E9'), c('#D55E00','#E69F00')),
file = 'jpg', file.output = TRUE)
Miami plot mirrors trait 1 above the x-axis, trait 2 below. Useful for shared-locus discovery (mirrored peaks at the same locus = pleiotropy candidate).
A locuszoom plot zooms into a ~1Mb window around a lead SNP, colors SNPs by LD r² to the lead, overlays recombination rate (cM/Mb), and shows the gene track. The canonical tool is locuszoom.org (Pruim 2010 Bioinformatics 26:2336); the R package is locuszoomr (Lai 2024).
library(locuszoomr)
loc <- locus(gene = 'TCF7L2',
flank = 5e5,
ens_db = 'EnsDb.Hsapiens.v86',
data = gwas_df,
snp = 'SNP', chrom = 'CHR', pos = 'BP', p = 'P', labs = 'SNP')
# LD computed via LDlinkR / 1000G reference
loc <- link_LD(loc, pop = 'EUR', token = ldlink_token)
locus_plot(loc, labels = c('index', 'top'))
LD reference choice: ALWAYS match the GWAS population. Using a 1000G European LD reference for a Japanese GWAS produces wrong LD colorings and misleads fine-mapping.
Trigger: λGC = 1.15; analyst concludes "polygenic," moves on.
Mechanism: Inflation can be confounding (population structure, relatedness, technical) OR polygenicity. LD-score regression separates them: intercept = confounding; slope = polygenicity.
Symptom: Top hits replicate poorly in independent cohorts.
Fix: Run LDSC (ldsc.py --h2); intercept significantly > 1 indicates confounding. Adjust with 10-20 PCs or switch to LMM.
Trigger: Plotting Bonferroni-naive 5e-8 line on a TWAS or rare-variant burden plot.
Mechanism: 5e-8 is calibrated for ~1M independent common-variant tests; other analyses have different effective test counts.
Symptom: Reviewer flags "this gene doesn't pass Bonferroni" but the plot's red line is at 5e-8.
Fix: Match the threshold to the analysis (table above).
Trigger: Y-axis clipped at 25 without arrow markers for capped points.
Mechanism: Reader cannot tell how high the true peak is.
Symptom: Reviewer asks for actual p-value; the reported 1e-100 conflicts with the displayed cap at 25.
Fix: Mark capped points with ^ symbol; annotate axis "(capped at 25)"; include unclipped numerical value in caption.
Trigger: col = rainbow(22) for chromosome alternating.
Mechanism: 22 random hues add no information; visual chaos.
Symptom: Reader cannot quickly identify which chromosome a peak is on.
Fix: Two-color alternation (c('#0072B2', '#56B4E9')). Chromosome boundaries are clear from spacing alone.
Trigger: 1000G EUR LD reference used to color a Japanese / African GWAS regional plot.
Mechanism: LD differs by population; r² is population-specific.
Symptom: Locuszoom shows uncorrelated SNPs in red (high r²) or vice versa.
Fix: Match LD reference to GWAS population; for trans-ancestry GWAS, show per-population panels or use largest-N population reference and annotate the discrepancy.
Trigger: Unsorted input data frame.
Mechanism: qqman plots in input row order, not coordinate order.
Symptom: Peaks render as vertical scatter at wrong x-position.
Fix: df <- df %>% arrange(CHR, BP) before plotting.
| Pattern | Likely cause | Action | |---------|--------------|--------| | λGC > 1.1 but LDSC intercept ~1 | Polygenic signal | Document; no action needed | | λGC > 1.1 AND LDSC intercept > 1 | Confounding | Add PCs / use LMM | | QQ plot "S-shaped" | Severe inflation or non-additive model misspecification | Inspect; possibly model misspecified | | QQ plot deflated below diagonal | Conservative p (e.g., score test); or wrong test stat | Review test statistic computation | | Top SNP genome-wide but small effect | Likely true; or relatedness | Verify in unrelated subset | | Replication fails for top hits | Confounding (winner's curse); or true heterogeneity | Trans-ancestry meta-analysis or LMM rerun |
| Threshold | Value | Source | |-----------|-------|--------| | Common-variant GWAS sig | 5e-8 | Pe'er 2008 | | WGS sig (all variants, EUR) | 5e-9 | Pulit 2017; Xu 2014 | | Empirical pop-specific (e.g., EAS) | ~9.26e-8 EAS | Kanai 2016 | | TWAS / PWAS Bonferroni | 0.05 / n_genes | Standard | | λGC well-calibrated | 1.00 ± 0.02 | Standard | | λGC investigate | >1.05 | Common practice | | λGC confounded | >1.10 | Common practice | | Sample-size adjusted λ | λ_1000 = 1 + (λ - 1) × 1000/n | Standard scaling | | Suggestive threshold | 1e-5 | Pe'er 2008 | | Y-cap typical | 25-50 -log10(p) | Visualization choice |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Peaks at wrong x position | Data not sorted by CHR, BP | arrange(CHR, BP) upstream |
| λGC reported as conclusion alone | Confounding vs polygenicity not separated | Run LDSC for intercept vs slope |
| Threshold line at 5e-8 on TWAS | Wrong multiple-testing regime | Use Bonferroni-correct threshold |
| Y-axis crushed by one peak | No cap, no split | Cap at 25-50 with arrow markers OR split axis |
| Regional plot LD colors look wrong | LD reference mismatched to GWAS pop | Match LD reference to ancestry |
| QQ plot deflated | Conservative test or wrong stat | Verify test statistic |
| Manhattan with 22 distinct hues | Cosmetic clutter | Two-color alternation |
| Lead SNPs unlabeled | Default labeling off | annotatePval = 5e-8, annotateTop = TRUE |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.