population-genetics/linkage-disequilibrium/SKILL.md
Calculate linkage disequilibrium statistics (r², D'), perform LD pruning for population structure analysis, identify haplotype blocks, and visualize LD patterns using PLINK, scikit-allel, and LDBlockShow. Use when calculating LD or pruning variants.
npx skillsauth add GPTomics/bioSkills bio-population-genetics-linkage-disequilibriumInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Calculate LD between my variants" -> Compute pairwise LD statistics (r², D'), prune correlated variants for independent sets, and identify haplotype blocks from genotype data.
plink2 --r2 for LD calculation, --indep-pairwise for pruningallel.rogers_huff_r() for windowed LD in scikit-allelCalculate LD statistics, prune correlated variants, and identify haplotype blocks.
# All pairs within window
plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results
# With SNP names in output
plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs
# Squared correlation matrix
plink2 --bfile data --r2-phased square --out ld_matrix
# ld_results.ld contains:
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
# r² with D' statistics
plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime
# Inter-chromosome LD
plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld
# Calculate pruning list
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
# Output files:
# prune.prune.in - Variants to keep
# prune.prune.out - Variants to remove
# Extract pruned set
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned
| Parameter | Description | Common Values | |-----------|-------------|---------------| | Window (50) | Variants per window | 50-200 | | Step (10) | Variants to shift | 5-50 | | r² threshold (0.1) | Max LD allowed | 0.1-0.5 |
# Strict pruning for PCA/Admixture
plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune
# Moderate pruning for polygenic scores
plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune
# Region-based pruning
plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune
import allel
import numpy as np
callset = allel.read_vcf('data.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']
gn = gt.to_n_alt()
r2 = allel.rogers_huff_r(gn[:100]) ** 2
Goal: Plot LD decay as a function of physical distance to characterize the extent of linkage in the population.
Approach: Compute pairwise r² values between nearby variants using scikit-allel, bin by physical distance, and plot mean r² per distance bin.
import allel
import numpy as np
import matplotlib.pyplot as plt
gn = gt.to_n_alt()
r2, dist = [], []
n_variants = min(1000, gn.shape[0])
for i in range(n_variants):
for j in range(i + 1, min(i + 100, n_variants)):
r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2
d = pos[j] - pos[i]
r2.append(r)
dist.append(d)
r2 = np.array(r2)
dist = np.array(dist)
bins = np.arange(0, 100001, 1000)
bin_means = []
for i in range(len(bins) - 1):
mask = (dist >= bins[i]) & (dist < bins[i + 1])
if mask.sum() > 0:
bin_means.append(np.mean(r2[mask]))
else:
bin_means.append(np.nan)
plt.figure(figsize=(10, 6))
plt.plot(bins[:-1] / 1000, bin_means)
plt.xlabel('Distance (kb)')
plt.ylabel('Mean r²')
plt.title('LD Decay')
plt.savefig('ld_decay.png')
# Identify haplotype blocks (Gabriel et al.)
plink --bfile data --blocks no-pheno-req --out blocks
# Output: blocks.blocks (block boundaries)
# Output: blocks.blocks.det (block details)
import pandas as pd
blocks = pd.read_csv('blocks.blocks.det', sep='\s+')
print(f'Number of blocks: {len(blocks)}')
print(f'Mean block size: {blocks["KB"].mean():.1f} kb')
print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')
import allel
import numpy as np
import matplotlib.pyplot as plt
gn = gt.to_n_alt()[:200]
r = allel.rogers_huff_r(gn)
r2_matrix = r ** 2
plt.figure(figsize=(10, 10))
plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1)
plt.colorbar(label='r²')
plt.xlabel('Variant index')
plt.ylabel('Variant index')
plt.title('LD Matrix')
plt.savefig('ld_matrix.png', dpi=150)
# Clump GWAS results by LD
plink --bfile data \
--clump gwas_results.txt \
--clump-p1 5e-8 \
--clump-p2 1e-5 \
--clump-r2 0.1 \
--clump-kb 250 \
--out clumped
# Output: clumped.clumped (independent signals)
| Parameter | Description | |-----------|-------------| | --clump-p1 | Index SNP p-value threshold | | --clump-p2 | Clumped SNP p-value threshold | | --clump-r2 | LD threshold for clumping | | --clump-kb | Physical distance threshold |
# Pairwise LD for region
vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results
# Output: ld_results.geno.ld
# Haplotype-based r²
vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld
# 1. Calculate genome-wide LD
plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome
# 2. Generate pruned set for PCA
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile data --extract prune.prune.in --make-bed --out pruned
# 3. Identify haplotype blocks
plink --bfile data --blocks no-pheno-req --out blocks
# 4. Visualize LD for specific region
plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.