workflows/gwas-pipeline/SKILL.md
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
npx skillsauth add GPTomics/bioSkills bio-workflows-gwas-pipelineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: ggplot2 3.5+
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parameters<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Run a GWAS from my genotype data" -> Orchestrate sample/variant QC (PLINK2), population stratification (PCA), association testing (linear/logistic regression), multiple testing correction, and Manhattan/QQ plot visualization.
Complete workflow for genome-wide association studies from genotype data to significant associations.
VCF/PLINK files
|
v
[1. QC Filtering] ------> Sample and variant QC
|
v
[2. Phase + Impute] ----> Align to panel, phase, impute to dosages, filter by R2 (-> phasing-imputation)
|
v
[3. LD Pruning] --------> Independent variants for PCA
|
v
[4. Population Structure] --> PCA for covariates
|
v
[5. Association Testing] --> Logistic/linear regression on dosages
|
v
[6. Results] -----------> Manhattan plot, QQ plot
|
v
Significant associations
# VCF to PLINK binary format
plink2 --vcf input.vcf.gz \
--make-bed \
--out study
# Or with phenotype/covariate files
plink2 --vcf input.vcf.gz \
--pheno phenotypes.txt \
--make-bed \
--out study
# Calculate sample statistics
plink2 --bfile study \
--missing \
--out study_stats
# Remove samples with high missing rate (>5%)
plink2 --bfile study \
--mind 0.05 \
--make-bed \
--out study_sample_qc
# Check for sex discrepancies (if sex chromosome data available)
plink2 --bfile study_sample_qc \
--check-sex \
--out study_sex_check
# Remove related individuals (optional, requires IBD)
plink2 --bfile study_sample_qc \
--king-cutoff 0.0884 \
--make-bed \
--out study_unrelated
# Apply standard variant filters
plink2 --bfile study_sample_qc \
--geno 0.05 \
--maf 0.01 \
--hwe 1e-6 \
--make-bed \
--out study_qc
# Summary
plink2 --bfile study_qc --freq --out study_qc
QC Checkpoint:
Most GWAS impute the QC'd array genotypes up to a dense reference panel before association, to increase variant density and harmonize across platforms and studies. This stage is owned by the phasing-imputation skills; the workflow only orchestrates the handoff. The decisions that matter here, in order:
Goal: Increase variant density and harmonize across platforms by imputing the QC'd genotypes against a dense ancestry-matched panel, carrying dosages (not hard calls) into association.
Approach: Export the QC'd genotypes to VCF, hand off to the phasing-imputation skills for strand/build harmonization, phasing, and per-chromosome imputation (cases and controls together), then filter on the engine's quality field plus a MAF floor.
# Convert QC'd PLINK back to VCF, align to the panel, phase + impute (see phasing-imputation skills for the full commands)
plink2 --bfile study_qc --export vcf bgz --out study_qc
# ... reference-panels: strand/build harmonization; haplotype-phasing: phase; genotype-imputation: impute to dosages ...
# Post-imputation quality filter on the engine's field (DR2 Beagle / R2 Minimac), with a MAF floor
bcftools view -e 'INFO/DR2<0.3 || INFO/AF<0.01 || INFO/AF>0.99' imputed.vcf.gz -Oz -o imputed.qc.vcf.gz
QC Checkpoint: cases and controls imputed together; INFO/R2 filtered (MAF-stratified) with a MAF floor; dosages (DS), not hard calls, carried into association.
# Identify independent variants
plink2 --bfile study_qc \
--indep-pairwise 50 5 0.2 \
--out pruned
# Extract pruned variants
plink2 --bfile study_qc \
--extract pruned.prune.in \
--make-bed \
--out study_pruned
# Calculate principal components
plink2 --bfile study_pruned \
--pca 10 \
--out study_pca
# The eigenvec file contains PCs for use as covariates
library(ggplot2)
# Load PCA results
pca <- read.table('study_pca.eigenvec', header = FALSE)
colnames(pca) <- c('FID', 'IID', paste0('PC', 1:10))
# Load phenotype for coloring
pheno <- read.table('phenotypes.txt', header = TRUE)
pca <- merge(pca, pheno, by = c('FID', 'IID'))
# Plot
ggplot(pca, aes(x = PC1, y = PC2, color = as.factor(PHENO))) +
geom_point(alpha = 0.5) +
labs(title = 'PCA of Study Samples', color = 'Phenotype') +
theme_minimal()
ggsave('pca_plot.pdf', width = 8, height = 6)
Run the association on imputed DOSAGES, not hard calls, so the imputation uncertainty is propagated (PLINK2 reads dosages with --vcf imputed.qc.vcf.gz dosage=DS, or use a .pgen built from dosages). The examples below use the QC'd best-guess genotypes for brevity; substitute the dosage input for an imputed analysis.
# Logistic regression with PCA covariates
plink2 --bfile study_qc \
--pheno phenotypes.txt \
--covar study_pca.eigenvec \
--covar-col-nums 3-12 \
--glm hide-covar \
--out gwas_results
# Results in gwas_results.PHENO.glm.logistic
# Linear regression
plink2 --bfile study_qc \
--pheno phenotypes.txt \
--pheno-name BMI \
--covar study_pca.eigenvec \
--covar-col-nums 3-12 \
--glm hide-covar \
--out gwas_bmi
# Results in gwas_bmi.BMI.glm.linear
# Include age, sex, and PCs
plink2 --bfile study_qc \
--pheno phenotypes.txt \
--covar covariates.txt \
--covar-name AGE,SEX,PC1-PC10 \
--glm hide-covar \
--out gwas_adjusted
library(qqman)
# Load results
results <- read.table('gwas_results.PHENO.glm.logistic', header = TRUE)
results <- results[!is.na(results$P),]
# Manhattan plot
png('manhattan.png', width = 1200, height = 600)
manhattan(results, chr = 'X.CHROM', bp = 'POS', snp = 'ID', p = 'P',
suggestiveline = -log10(1e-5), genomewideline = -log10(5e-8))
dev.off()
# QQ plot
png('qq_plot.png', width = 600, height = 600)
qq(results$P)
dev.off()
# Lambda (genomic inflation factor)
chisq <- qchisq(1 - results$P, 1)
lambda <- median(chisq) / qchisq(0.5, 1)
cat('Lambda:', round(lambda, 3), '\n')
# Lambda should be close to 1.0 (1.0-1.1 acceptable)
# Genome-wide significant (p < 5e-8)
awk '$12 < 5e-8' gwas_results.PHENO.glm.logistic > significant_hits.txt
# Suggestive (p < 1e-5)
awk '$12 < 1e-5' gwas_results.PHENO.glm.logistic > suggestive_hits.txt
| Step | Parameter | Value | |------|-----------|-------| | Sample QC | --mind | 0.05 | | Variant QC | --geno | 0.05 | | Variant QC | --maf | 0.01 | | Variant QC | --hwe | 1e-6 | | LD pruning | --indep-pairwise | 50 5 0.2 | | PCA | --pca | 10 | | Significance | p-value | 5e-8 |
| Issue | Likely Cause | Solution | |-------|--------------|----------| | High lambda (>1.1) | Population stratification | Add more PCs, check ancestry | | No significant hits | Low power | Increase sample size, meta-analysis | | Deflated lambda (<1) | Over-correction | Reduce PC covariates | | QQ deviation at low end | Batch effects | Check for technical artifacts |
#!/bin/bash
set -e
INPUT_VCF="genotypes.vcf.gz"
PHENO="phenotypes.txt"
OUTDIR="gwas_results"
mkdir -p ${OUTDIR}
# Step 1: Convert and QC
plink2 --vcf ${INPUT_VCF} --make-bed --out ${OUTDIR}/raw
plink2 --bfile ${OUTDIR}/raw --mind 0.05 --geno 0.05 --maf 0.01 --hwe 1e-6 \
--make-bed --out ${OUTDIR}/qc
# Step 2: LD pruning
plink2 --bfile ${OUTDIR}/qc --indep-pairwise 50 5 0.2 --out ${OUTDIR}/pruned
plink2 --bfile ${OUTDIR}/qc --extract ${OUTDIR}/pruned.prune.in \
--make-bed --out ${OUTDIR}/pruned
# Step 3: PCA
plink2 --bfile ${OUTDIR}/pruned --pca 10 --out ${OUTDIR}/pca
# Step 4: Association
plink2 --bfile ${OUTDIR}/qc --pheno ${PHENO} \
--covar ${OUTDIR}/pca.eigenvec --covar-col-nums 3-12 \
--glm hide-covar --out ${OUTDIR}/gwas
echo "=== GWAS Complete ==="
echo "Results: ${OUTDIR}/gwas.*.glm.*"
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.
development
End-to-end pooled and single-cell CRISPR screen analysis from FASTQ to hit genes. Orchestrates library design QC, guide counting, six-stage screen QC (plasmid Gini, replicate Pearson, CEGv2 PR-AUC, copy-number artifact), method-appropriate hit calling across MAGeCK RRA/MLE, BAGEL2, drugZ, JACKS, and Chronos, cancer-cell-line copy-number correction (CRISPRcleanR / Chronos), batch correction for multi-batch screens, and the specialized branches for combinatorial paralog screens, single-cell Perturb-seq, base-editor variant-function screens, prime-editor screens, and in vivo bottleneck-aware screens. Use when analyzing any pooled CRISPR screen end-to-end, choosing the correct hit-calling method by experimental design, integrating copy-number correction into the pipeline, or branching the workflow for single-cell, combinatorial, base-editor, prime-editor, or in vivo variants.