phasing-imputation/haplotype-phasing/SKILL.md
Phase genotypes into haplotypes using Beagle or SHAPEIT. Resolves which alleles are inherited together on each chromosome. Use when preparing VCF files for imputation, HLA typing, or population genetic analyses requiring phased haplotypes.
npx skillsauth add GPTomics/bioSkills bio-phasing-imputation-haplotype-phasingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bcftools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Phase my genotypes into haplotypes" -> Resolve which alleles are inherited together on each chromosome for downstream imputation, HLA typing, or population genetic analyses.
java -jar beagle.jar gt=input.vcf out=phased (Beagle 5.4)shapeit4 --input input.vcf --output phased.vcf (SHAPEIT)# Download Beagle 5.4
wget https://faculty.washington.edu/browning/beagle/beagle.22Jul22.46e.jar
# Basic phasing
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
out=phased
# Output: phased.vcf.gz (phased genotypes)
# With genetic map (improves accuracy)
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
map=plink.chr22.GRCh38.map \
out=phased
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
out=phased \
map=genetic_map.txt \
nthreads=8 \
window=40 \
overlap=4 \
ne=20000 \ # Effective population size
seed=12345 # For reproducibility
# Process each chromosome separately
for chr in {1..22}; do
java -Xmx16g -jar beagle.jar \
gt=input.chr${chr}.vcf.gz \
map=genetic_maps/plink.chr${chr}.GRCh38.map \
out=phased.chr${chr} \
nthreads=8
done
# Concatenate chromosomes
bcftools concat phased.chr*.vcf.gz -Oz -o phased.all.vcf.gz
bcftools index phased.all.vcf.gz
Goal: Phase large biobank-scale datasets using SHAPEIT5's two-stage approach for accurate haplotype resolution of both common and rare variants.
Approach: First phase common variants to build a haplotype scaffold, then phase rare variants onto that scaffold using the common-variant structure as a guide.
# Phase common variants first
shapeit5_phase_common \
--input input.vcf.gz \
--map genetic_map.txt \
--output phased_common.bcf \
--thread 8 \
--log phased.log
# Then phase rare variants
shapeit5_phase_rare \
--input input.vcf.gz \
--scaffold phased_common.bcf \
--map genetic_map.txt \
--output phased.bcf \
--thread 8
# Improves phasing using reference haplotypes
shapeit5_phase_common \
--input input.vcf.gz \
--reference reference_panel.bcf \
--map genetic_map.txt \
--output phased.bcf \
--thread 8
# Use reference panel for better phasing
java -jar beagle.22Jul22.46e.jar \
gt=input.vcf.gz \
ref=reference.vcf.gz \
map=genetic_map.txt \
out=phased \
nthreads=8
# Filter variants before phasing
bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o biallelic_snps.vcf.gz
# Remove missing genotypes (optional)
bcftools view -g ^miss biallelic_snps.vcf.gz -Oz -o no_missing.vcf.gz
# Normalize (important!)
bcftools norm -f reference.fa -Oz -o normalized.vcf.gz input.vcf.gz
# View phased genotypes (| instead of /)
bcftools query -f '%CHROM\t%POS\t[%GT\t]\n' phased.vcf.gz | head
# Unphased: 0/1
# Phased: 0|1 or 1|0
# Count phased vs unphased
bcftools query -f '[%GT\n]' phased.vcf.gz | grep -c '|'
# Download genetic maps (GRCh38)
wget https://faculty.washington.edu/browning/beagle/genetic_maps/plink.GRCh38.map.zip
unzip plink.GRCh38.map.zip
# Format: chromosome position rate(cM/Mb) genetic_position(cM)
# chr1 55550 2.981822 0.000000
| Parameter | Beagle | SHAPEIT5 | Description | |-----------|--------|----------|-------------| | Threads | nthreads | --thread | CPU threads | | Window | window | --window | Analysis window size | | Eff. pop size | ne | --effective-size | For LD modeling | | Seed | seed | --seed | Random seed |
| Dataset Size | Beagle Memory | SHAPEIT5 Memory | |--------------|--------------|-----------------| | 1,000 samples | 8 GB | 4 GB | | 10,000 samples | 32 GB | 16 GB | | 100,000 samples | 64+ GB | 32 GB |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.