skills/genomics-bioinformatics/plink2-gwas-analysis/SKILL.md
GWAS and population genetics tool. Processes PLINK (.bed/.bim/.fam), VCF, and BGEN; runs QC (MAF, HWE, missingness), IBD estimation, PCA, and linear/logistic regression GWAS. Outputs Manhattan-ready summary stats. Use regenie or SAIGE for biobanks (>100k samples) needing mixed models.
npx skillsauth add jaechang-hits/scicraft plink2-gwas-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
PLINK2 is the high-performance successor to PLINK 1.9, designed for genome-wide association studies (GWAS) and population genetics analysis on large cohorts. It processes genotype data in PLINK binary format (.bed/.bim/.fam), VCF, and BGEN formats — performing sample and variant quality control (QC), kinship estimation, principal component analysis (PCA), and linear/logistic regression association testing. PLINK2 is 10–100× faster than PLINK 1.9 on most tasks due to multithreading and optimized I/O. Output files are compatible with downstream visualization (Manhattan/QQ plots) and meta-analysis tools.
Check before installing: The tool may already be available (e.g., inside a
pixi/condaenv). Always runcommand -v plink2first and skip the install block if it returns a path. When executing tools inside a pixi project, preferpixi run <tool>over plain<tool>.
# Skip install if already present
if command -v plink2 >/dev/null 2>&1; then
echo "plink2 already installed: $(plink2 --version)"
else
# Download PLINK2 pre-compiled binary (Linux)
wget https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_avx2_20241112.zip
unzip plink2_linux_avx2_20241112.zip
chmod +x plink2
export PATH="$PWD:$PATH"
# macOS
# wget https://s3.amazonaws.com/plink2-assets/alpha6/plink2_mac_20241112.zip
# unzip plink2_mac_20241112.zip
plink2 --version
# PLINK v2.00a6LM
fi
# Python for downstream analysis
pip install pandas numpy matplotlib scipy
# Run GWAS: linear regression for quantitative trait
plink2 \
--bfile cohort_qc \
--pheno phenotypes.txt \
--covar covariates.txt \
--linear hide-covar \
--out results/gwas_result \
--threads 8
# View top hits
head results/gwas_result.*.glm.linear | sort -k12,12g | head -20
Convert input genotype data to PLINK binary format for fast processing.
# Convert VCF to PLINK binary
plink2 \
--vcf cohort_genotyped.vcf.gz \
--make-bed \
--out cohort_plink \
--threads 8
# Convert BGEN (imputed data from Michigan/TopMed imputation server)
plink2 \
--bgen cohort_imputed.bgen ref-first \
--sample cohort_imputed.sample \
--make-bed \
--out cohort_imputed_plink \
--threads 8
echo "Files created:"
ls -lh cohort_plink.{bed,bim,fam}
echo "Samples: $(wc -l < cohort_plink.fam)"
echo "Variants: $(wc -l < cohort_plink.bim)"
Remove samples with high missingness or heterozygosity outliers.
# Compute per-sample and per-variant missingness
plink2 \
--bfile cohort_plink \
--missing \
--out qc/sample_missingness \
--threads 8
# Remove samples with > 2% missingness and variants with > 5% missing
plink2 \
--bfile cohort_plink \
--mind 0.02 \
--geno 0.05 \
--make-bed \
--out cohort_sample_qc \
--threads 8
echo "After sample QC:"
echo "Samples: $(wc -l < cohort_sample_qc.fam)"
echo "Variants: $(wc -l < cohort_sample_qc.bim)"
Filter variants by minor allele frequency, Hardy-Weinberg equilibrium, and imputation quality.
# Variant QC: MAF, HWE, missingness
plink2 \
--bfile cohort_sample_qc \
--maf 0.01 \
--hwe 1e-6 \
--geno 0.05 \
--make-bed \
--out cohort_variantqc \
--threads 8
echo "After variant QC:"
echo "Variants remaining: $(wc -l < cohort_variantqc.bim)"
# For imputed data: also filter by INFO score (using VCF INFO field)
# plink2 --bfile cohort_variantqc --var-min-qual 0.8 --make-bed --out cohort_info_qc
Compute principal components from LD-pruned variants.
# LD pruning: remove one variant from each pair with r² > 0.2
plink2 \
--bfile cohort_variantqc \
--indep-pairwise 50 5 0.2 \
--out qc/ld_pruned \
--threads 8
# Compute PCA on LD-pruned variants (top 20 PCs)
plink2 \
--bfile cohort_variantqc \
--extract qc/ld_pruned.prune.in \
--pca 20 \
--out pca/cohort_pca \
--threads 8
echo "PCA files: pca/cohort_pca.eigenvec (PCs) and pca/cohort_pca.eigenval (variance)"
head pca/cohort_pca.eigenvec
Perform logistic regression (case-control) or linear regression (quantitative trait).
# Case-control GWAS: logistic regression with covariates
plink2 \
--bfile cohort_variantqc \
--pheno phenotypes.txt \
--1 \
--covar covariates.txt \
--covar-variance-standardize \
--logistic hide-covar \
--ci 0.95 \
--out results/gwas_cc \
--threads 8
# Quantitative trait GWAS: linear regression
plink2 \
--bfile cohort_variantqc \
--pheno phenotypes.txt \
--covar covariates.txt \
--covar-variance-standardize \
--linear hide-covar \
--ci 0.95 \
--out results/gwas_qt \
--threads 8
echo "GWAS complete. Files: results/gwas_*.glm.*"
wc -l results/gwas_qt.*.glm.linear
Visualize GWAS results with Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
# Load GWAS results (PLINK2 linear output)
df = pd.read_csv("results/gwas_qt.PHENO1.glm.linear", sep="\t",
usecols=["#CHROM", "POS", "ID", "P"])
df.columns = ["CHR", "POS", "SNP", "P"]
df = df.dropna(subset=["P"])
df["P"] = pd.to_numeric(df["P"], errors="coerce")
df = df[df["P"] > 0].copy()
df["-log10P"] = -np.log10(df["P"])
print(f"Variants: {len(df)}")
print(f"Genome-wide significant (p<5e-8): {(df['P'] < 5e-8).sum()}")
# Manhattan plot
fig, ax = plt.subplots(figsize=(14, 4))
colors = plt.cm.Set1.colors
chrom_pos = 0
ticks = []
for chrom, group in df.groupby("CHR", sort=False):
col = colors[int(chrom) % 2] if str(chrom).isdigit() else "gray"
ax.scatter(group["POS"] + chrom_pos, group["-log10P"],
c=[col], s=2, alpha=0.7)
ticks.append((chrom_pos + group["POS"].mean(), str(chrom)))
chrom_pos += group["POS"].max() + 1e7
ax.axhline(-np.log10(5e-8), color="red", linestyle="--", lw=1, label="p=5×10⁻⁸")
ax.set_xticks([t[0] for t in ticks[::2]])
ax.set_xticklabels([t[1] for t in ticks[::2]], fontsize=6)
ax.set_xlabel("Chromosome")
ax.set_ylabel("-log₁₀(p)")
ax.set_title("GWAS Manhattan Plot")
plt.tight_layout()
plt.savefig("manhattan.png", dpi=150)
print("Saved: manhattan.png")
| Parameter | Default | Range/Options | Effect |
|-----------|---------|---------------|--------|
| --maf | — | 0–0.5 | Minor allele frequency threshold; 0.01 removes singletons |
| --hwe | — | 0–1 | Hardy-Weinberg equilibrium p-value threshold; 1e-6 typical |
| --geno | — | 0–1 | Maximum variant missingness; 0.05 = 5% |
| --mind | — | 0–1 | Maximum sample missingness; 0.02 = 2% |
| --linear / --logistic | — | flag | Regression mode: linear for quantitative, logistic for case-control |
| --covar | — | file path | Covariate file (FID IID COV1 COV2...); add PCs here |
| --pca | — | integer | Number of PCs to compute from LD-pruned data |
| --indep-pairwise | — | window step r² | LD pruning: window size (variants), step, r² threshold |
| --threads | 1 | 1–64 | CPU threads; 8–16 typical for cluster |
| --ci | — | 0–1 | Confidence interval for effect size output (0.95 for 95% CI) |
| --1 | off | flag | Recode phenotype: 1=control, 2=case → 0=control, 1=case (logistic) |
# Compute kinship coefficients (King-robust estimator)
plink2 \
--bfile cohort_variantqc \
--extract qc/ld_pruned.prune.in \
--king-cutoff 0.0625 \
--out qc/kinship \
--threads 8
# Remove related individuals (king-cutoff auto-generates exclusion list)
plink2 \
--bfile cohort_variantqc \
--remove qc/kinship.king.cutoff.out.id \
--make-bed \
--out cohort_unrelated \
--threads 8
echo "After relatedness QC: $(wc -l < cohort_unrelated.fam) samples"
import pandas as pd
import numpy as np
def load_gwas_results(filepath: str, p_threshold: float = 5e-8) -> pd.DataFrame:
"""Load PLINK2 GWAS linear/logistic output and return genome-wide significant hits."""
df = pd.read_csv(filepath, sep="\t",
usecols=["#CHROM", "POS", "ID", "REF", "ALT", "A1", "BETA", "SE", "P"])
df.columns = ["CHR", "POS", "SNP", "REF", "ALT", "A1", "BETA", "SE", "P"]
df = df.dropna(subset=["P"])
df["P"] = pd.to_numeric(df["P"], errors="coerce")
sig = df[df["P"] < p_threshold].sort_values("P")
return sig
# Load results for one phenotype
hits = load_gwas_results("results/gwas_qt.PHENO1.glm.linear")
print(f"Genome-wide significant loci: {len(hits)}")
print(hits.head(10)[["CHR", "POS", "SNP", "BETA", "SE", "P"]])
hits.to_csv("gwas_top_hits.tsv", sep="\t", index=False)
| Output | Format | Description |
|--------|--------|-------------|
| *.glm.linear | TSV | Linear GWAS results: CHR, POS, ID, BETA, SE, P per variant |
| *.glm.logistic.hybrid | TSV | Logistic GWAS results: OR, 95% CI, P per variant |
| *.eigenvec | TSV | PCA loadings: FID IID PC1 PC2 ... PC20 |
| *.eigenval | Text | Eigenvalues for PCA variance explained |
| *.king.cutoff.out.id | TSV | Sample IDs to remove for relatedness QC |
| *.bed/.bim/.fam | Binary | PLINK binary format after QC filtering |
| Problem | Cause | Solution |
|---------|-------|----------|
| Error: No samples left after --mind filter | Missingness threshold too strict | Relax to --mind 0.05; check input data quality |
| Logistic regression fails to converge | Rare variant, few cases, or collinear covariates | Use --logistic firth for Firth regression; remove correlated covariates |
| PCA shows clear outlier cluster | Admixed population or sample contamination | Remove outliers using PC cutoffs; check ancestry with 1000 Genomes reference panel |
| --hwe removes too many variants | Population structure inflating HWE test | Apply HWE filter to controls only: --hwe 1e-6 ctrl-only |
| BGEN import fails | Wrong ref-first/ref-last flag | Check imputation server documentation; try --bgen file.bgen ref-last |
| Very slow association test | Too many covariates or large file | Pre-filter to GWAS-significant window; use --threads 16 |
| Sex mismatch warnings | X chromosome heterozygosity outside sex-specific thresholds | Run --check-sex and remove F-statistic outliers |
| Memory error on large dataset | RAM insufficient for 500k+ variants | Use --memory 32000 to cap RAM; split by chromosome |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.