population-genetics/association-testing/SKILL.md
Genome-wide association studies (GWAS) with PLINK. Perform case-control and quantitative trait association testing using logistic/linear regression with covariates, generate Manhattan and QQ plots for result visualization. Use when running GWAS or association tests.
npx skillsauth add GPTomics/bioSkills bio-population-genetics-association-testingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Run a GWAS on my genotyping data" -> Perform genome-wide association testing using logistic (case-control) or linear (quantitative) regression with covariates, then visualize results with Manhattan and QQ plots.
plink2 --glm for association testing with covariatesGWAS analysis using PLINK 2.0's unified --glm command for case-control and quantitative traits.
# Basic logistic regression
plink2 --bfile data --glm --out results
# With phenotype file
plink2 --bfile data --pheno pheno.txt --glm --out results
# Linear regression for quantitative traits
plink2 --bfile data --pheno pheno.txt --glm --out results
# Include covariates (sex, age, PCs)
plink2 --bfile data \
--pheno pheno.txt \
--covar covariates.txt \
--glm --out results
# Specify which covariates to use
plink2 --bfile data \
--pheno pheno.txt \
--covar covariates.txt \
--covar-name PC1,PC2,PC3,age,sex \
--glm --out results
# pheno.txt: FID IID pheno
# For binary: 1=control, 2=case, -9=missing
# For quantitative: continuous values
FAM001 IND001 2
FAM002 IND002 1
FAM003 IND003 1.5
# covariates.txt: FID IID cov1 cov2 ...
FAM001 IND001 0.15 35 1
FAM002 IND002 -0.22 42 2
FAM003 IND003 0.08 28 1
# Multiple phenotypes (test all)
plink2 --bfile data --pheno pheno_multi.txt --glm --out results
# Specific phenotype column
plink2 --bfile data --pheno pheno_multi.txt --pheno-name trait1 --glm --out results
# Missing phenotype handling
plink2 --bfile data --glm allow-no-covars --out results
# Additive model (default)
plink2 --bfile data --glm --out results
# Dominant model
plink2 --bfile data --glm dominant --out results
# Recessive model
plink2 --bfile data --glm recessive --out results
# Genotypic (2df test)
plink2 --bfile data --glm genotypic --out results
# Hide covariates from output (cleaner output)
plink2 --bfile data --covar cov.txt --glm hide-covar --out results
# Enable Firth fallback for case-control (default in PLINK 2.0)
plink2 --bfile data --glm firth-fallback --out results
# Force Firth regression
plink2 --bfile data --glm firth --out results
# Disable Firth
plink2 --bfile data --glm no-firth --out results
# Default output: results.PHENO1.glm.logistic or results.PHENO1.glm.linear
# Columns: CHROM, POS, ID, REF, ALT, A1, FIRTH?, TEST, OBS_CT, OR/BETA, SE, Tstat, P
# Add specific columns
plink2 --bfile data --glm cols=+a1freq,+machr2 --out results
# Available columns:
# +a1freq: A1 allele frequency
# +machr2: MaCH R-squared
# +ax: Reference allele dosage
# +err: Standard errors
# 1. Run PCA
plink2 --bfile data --pca 10 --out pca_results
# 2. Use PCs as covariates
plink2 --bfile data \
--pheno pheno.txt \
--covar pca_results.eigenvec \
--covar-name PC1,PC2,PC3,PC4,PC5 \
--glm --out results
Goal: Run a complete GWAS pipeline from raw genotypes through population stratification correction to association testing.
Approach: Apply MAF, genotyping rate, and HWE quality filters, compute principal components for population structure correction, then run logistic/linear regression with PCs as covariates.
# QC, PCA, and GWAS in sequence
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc
plink2 --bfile qc --pca 10 --out pca
plink2 --bfile qc \
--pheno pheno.txt \
--covar pca.eigenvec \
--covar-name PC1-PC5 \
--glm hide-covar --out gwas
# Filter significant results
awk 'NR==1 || $13 < 5e-8' results.PHENO1.glm.logistic > significant.txt
# Extract top hits
sort -k13 -g results.PHENO1.glm.logistic | head -100 > top_hits.txt
import pandas as pd
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
significant = results[results['P'] < 5e-8]
print(f'Genome-wide significant hits: {len(significant)}')
suggestive = results[results['P'] < 1e-5]
print(f'Suggestive hits: {len(suggestive)}')
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
results = results[results['TEST'] == 'ADD']
results['-log10P'] = -np.log10(results['P'])
chrom_colors = ['#1f77b4', '#ff7f0e']
results['color'] = results['#CHROM'].apply(lambda x: chrom_colors[x % 2])
cumulative_pos = []
offset = 0
for chrom in sorted(results['#CHROM'].unique()):
chrom_data = results[results['#CHROM'] == chrom]
cumulative_pos.extend(chrom_data['POS'] + offset)
offset += chrom_data['POS'].max()
results['cumulative_pos'] = cumulative_pos
plt.figure(figsize=(14, 6))
plt.scatter(results['cumulative_pos'], results['-log10P'], c=results['color'], s=1)
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', label='Genome-wide (5e-8)')
plt.axhline(y=-np.log10(1e-5), color='blue', linestyle='--', label='Suggestive (1e-5)')
plt.xlabel('Chromosome')
plt.ylabel('-log10(P)')
plt.legend()
plt.savefig('manhattan.png', dpi=150)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
observed_p = results[results['TEST'] == 'ADD']['P'].dropna().sort_values()
n = len(observed_p)
expected_p = np.arange(1, n + 1) / (n + 1)
plt.figure(figsize=(6, 6))
plt.scatter(-np.log10(expected_p), -np.log10(observed_p), s=1)
plt.plot([0, 8], [0, 8], 'r--')
plt.xlabel('Expected -log10(P)')
plt.ylabel('Observed -log10(P)')
lambda_gc = np.median(stats.chi2.ppf(1 - observed_p, 1)) / stats.chi2.ppf(0.5, 1)
plt.title(f'QQ Plot (λ = {lambda_gc:.3f})')
plt.savefig('qqplot.png', dpi=150)
from scipy import stats
import numpy as np
results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
pvalues = results[results['TEST'] == 'ADD']['P'].dropna()
chisq = stats.chi2.ppf(1 - pvalues, 1)
lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, 1)
print(f'Genomic inflation factor: {lambda_gc:.3f}')
# Good: 1.0-1.05, Acceptable: 1.05-1.1, Concerning: >1.1
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.