skills/genomics-bioinformatics/gatk-variant-calling/SKILL.md
GATK Best Practices for germline SNP/indel calling from WGS/WES BAMs. Per-sample HaplotypeCaller GVCFs, GenomicsDBImport, GenotypeGVCFs joint calling, VQSR or hard filters. Requires BWA-MEM2-aligned, markdup, BQSR BAMs. Use DeepVariant for a faster DL alternative; GATK is the NIH/ENCODE standard.
npx skillsauth add jaechang-hits/sciagent-skills gatk-variant-callingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
GATK (Genome Analysis Toolkit) implements the GATK Best Practices workflow for calling SNPs and indels from Illumina WGS and WES data. The pipeline runs HaplotypeCaller per sample (producing GVCF files), consolidates GVCFs with GenomicsDBImport, performs joint genotyping with GenotypeGVCFs, and filters variants with VQSR (Variant Quality Score Recalibration) or hard filters. GATK requires BWA-MEM2-aligned, duplicate-marked, and base quality score recalibrated (BQSR) BAM files as input. It integrates with Picard tools, samtools, and bcftools for pre- and post-processing. The GATK4 workflow is the NIH/ENCODE standard for germline variant calling in research and clinical genomics.
@RG read group headers (from BWA-MEM2)Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v gatkfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run gatkrather than baregatk.
# Install GATK4
wget https://github.com/broadinstitute/gatk/releases/download/4.6.0.0/gatk-4.6.0.0.zip
unzip gatk-4.6.0.0.zip
export GATK="$PWD/gatk-4.6.0.0/gatk"
# Or with conda
conda install -c bioconda gatk4
# Verify
gatk --version
# GATK v4.6.0.0
# Download GATK resource bundle files (GRCh38)
# From gs://gcp-public-data--broad-references/hg38/v0/ (requires gsutil or Broad FTP)
GENOME="GRCh38.fa"
DBSNP="dbsnp_146.hg38.vcf.gz"
# Run HaplotypeCaller in GVCF mode
gatk HaplotypeCaller \
-R $GENOME \
-I sample1.markdup.bam \
-O sample1.g.vcf.gz \
-ERC GVCF \
--dbsnp $DBSNP \
--native-pair-hmm-threads 4
echo "GVCF: sample1.g.vcf.gz"
Correct systematic errors in base quality scores before variant calling.
GENOME="GRCh38.fa"
KNOWN_SITES="dbsnp_146.hg38.vcf.gz Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
KNOWN_FLAGS=$(printf -- '--known-sites %s ' $KNOWN_SITES)
# Step 1a: Build recalibration table
gatk BaseRecalibrator \
-R $GENOME \
-I sample1.markdup.bam \
$KNOWN_FLAGS \
-O sample1.recal.table
# Step 1b: Apply recalibration
gatk ApplyBQSR \
-R $GENOME \
-I sample1.markdup.bam \
--bqsr-recal-file sample1.recal.table \
-O sample1.bqsr.bam
echo "BQSR BAM: sample1.bqsr.bam"
samtools flagstat sample1.bqsr.bam | head -3
Run per-sample variant calling, producing an intermediate GVCF for joint genotyping.
# HaplotypeCaller in GVCF mode (recommended for cohort analysis)
gatk HaplotypeCaller \
-R GRCh38.fa \
-I sample1.bqsr.bam \
-O gvcfs/sample1.g.vcf.gz \
-ERC GVCF \
--dbsnp dbsnp_146.hg38.vcf.gz \
--native-pair-hmm-threads 4
# For WES: specify target intervals
# gatk HaplotypeCaller ... -L exome_targets.interval_list --interval-padding 100
echo "GVCF: gvcfs/sample1.g.vcf.gz"
zcat gvcfs/sample1.g.vcf.gz | grep -v "^#" | wc -l
Merge per-sample GVCFs for efficient joint genotyping.
# Create sample map file: sample_name\tpath_to_gvcf
printf "ctrl_1\tgvcfs/ctrl_1.g.vcf.gz\n" > sample_map.txt
printf "ctrl_2\tgvcfs/ctrl_2.g.vcf.gz\n" >> sample_map.txt
printf "treat_1\tgvcfs/treat_1.g.vcf.gz\n" >> sample_map.txt
printf "treat_2\tgvcfs/treat_2.g.vcf.gz\n" >> sample_map.txt
# Import GVCFs into GenomicsDB for each chromosome
for CHR in chr1 chr2 chr3; do
gatk GenomicsDBImport \
--sample-name-map sample_map.txt \
--genomicsdb-workspace-path genomicsdb/${CHR} \
-L $CHR \
--reader-threads 4
done
echo "GenomicsDB created for $(ls genomicsdb/ | wc -l) chromosomes"
Genotype all samples simultaneously across the GenomicsDB.
# Joint genotype all samples
mkdir -p vcfs
for CHR in chr1 chr2 chr3; do
gatk GenotypeGVCFs \
-R GRCh38.fa \
-V gendb://genomicsdb/${CHR} \
--dbsnp dbsnp_146.hg38.vcf.gz \
-O vcfs/cohort_${CHR}.vcf.gz
done
# Merge per-chromosome VCFs
gatk MergeVcfs \
$(ls vcfs/cohort_chr*.vcf.gz | sed 's/^/-I /') \
-O vcfs/cohort_all.vcf.gz
echo "Joint genotyping complete: vcfs/cohort_all.vcf.gz"
gatk CountVariants -V vcfs/cohort_all.vcf.gz
Apply hard filters for small cohorts where VQSR is underpowered.
# Separate SNPs and indels
gatk SelectVariants -V vcfs/cohort_all.vcf.gz --select-type-to-include SNP -O vcfs/snps.vcf.gz
gatk SelectVariants -V vcfs/cohort_all.vcf.gz --select-type-to-include INDEL -O vcfs/indels.vcf.gz
# Apply hard filters: SNPs
gatk VariantFiltration \
-V vcfs/snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-O vcfs/snps_filtered.vcf.gz
# Apply hard filters: Indels
gatk VariantFiltration \
-V vcfs/indels.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O vcfs/indels_filtered.vcf.gz
# Merge filtered SNPs + indels
gatk MergeVcfs -I vcfs/snps_filtered.vcf.gz -I vcfs/indels_filtered.vcf.gz \
-O vcfs/cohort_filtered.vcf.gz
echo "PASS variants: $(bcftools view -f PASS vcfs/cohort_filtered.vcf.gz | grep -v '^#' | wc -l)"
Extract variants, annotate with gene info, and prepare a DataFrame.
import subprocess
import pandas as pd
import io
# Use bcftools query to extract fields from filtered VCF
result = subprocess.run(
["bcftools", "query",
"-f", "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO/QD\t%INFO/FS\n",
"-i", "FILTER='PASS'",
"vcfs/cohort_filtered.vcf.gz"],
capture_output=True, text=True
)
cols = ["CHR", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "QD", "FS"]
df = pd.read_csv(io.StringIO(result.stdout), sep="\t", names=cols)
df["QUAL"] = pd.to_numeric(df["QUAL"], errors="coerce")
df["QD"] = pd.to_numeric(df["QD"], errors="coerce")
print(f"PASS variants: {len(df)}")
print(f"SNPs: {(df['REF'].str.len() == 1) & (df['ALT'].str.len() == 1)).sum()}")
print(f"Indels: {((df['REF'].str.len() > 1) | (df['ALT'].str.len() > 1)).sum()}")
print(df.head())
df.to_csv("pass_variants.tsv", sep="\t", index=False)
| Parameter | Default | Range/Options | Effect |
|-----------|---------|---------------|--------|
| -ERC | NONE | GVCF, BP_RESOLUTION | Emit reference confidence mode; GVCF for cohort workflows |
| --native-pair-hmm-threads | 4 | 1–32 | Threads for pair-HMM in HaplotypeCaller (most CPU-intensive step) |
| -L | whole genome | interval file or chr | Restrict calling to intervals (exome targets, BED regions) |
| --dbsnp | — | VCF path | dbSNP VCF for rsID annotation in output |
| --stand-call-conf | 30 | 0–100 | Min genotype quality score to emit a variant call |
| -G | StandardAnnotation | annotation group | Annotation modules to apply (StandardHCAnnotation for HaplotypeCaller) |
| --sample-name-map | — | TSV file | Sample-to-GVCF mapping for GenomicsDBImport |
| --reader-threads | 1 | 1–16 | Threads for GenomicsDBImport reading |
| --filter-expression | — | JEXL expression | Hard filter expression (e.g., "QD < 2.0") |
| --java-options | — | -Xmx4g | Java heap size; use -Xmx16g for large genomes |
# For a single sample, skip GenomicsDBImport and call directly
gatk HaplotypeCaller \
-R GRCh38.fa \
-I sample.bqsr.bam \
-O sample.vcf.gz \
--dbsnp dbsnp_146.hg38.vcf.gz \
--native-pair-hmm-threads 8
# Hard filter the single-sample VCF
gatk VariantFiltration \
-V sample.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
-O sample_filtered.vcf.gz
echo "PASS variants: $(bcftools view -f PASS sample_filtered.vcf.gz | grep -v '^#' | wc -l)"
# Snakefile — GATK BQSR + HaplotypeCaller
configfile: "config.yaml"
SAMPLES = config["samples"]
GENOME = config["genome"]
DBSNP = config["dbsnp"]
KNOWN = config["known_sites"]
rule all:
input: expand("gvcfs/{sample}.g.vcf.gz", sample=SAMPLES)
rule bqsr:
input: bam="markdup/{sample}.markdup.bam"
output: bam="bqsr/{sample}.bqsr.bam"
shell:
"""
gatk BaseRecalibrator -R {GENOME} -I {input.bam} \
--known-sites {KNOWN} -O bqsr/{wildcards.sample}.recal.table &&
gatk ApplyBQSR -R {GENOME} -I {input.bam} \
--bqsr-recal-file bqsr/{wildcards.sample}.recal.table \
-O {output.bam}
"""
rule haplotypecaller:
input: bam="bqsr/{sample}.bqsr.bam"
output: gvcf="gvcfs/{sample}.g.vcf.gz"
threads: 4
shell:
"gatk HaplotypeCaller -R {GENOME} -I {input.bam} -O {output.gvcf} "
"-ERC GVCF --dbsnp {DBSNP} --native-pair-hmm-threads {threads}"
| Output | Format | Description |
|--------|--------|-------------|
| *.g.vcf.gz | GVCF | Per-sample GVCF with reference confidence blocks; input to GenomicsDBImport |
| cohort_all.vcf.gz | VCF | Joint-genotyped multi-sample VCF; unfiltered |
| cohort_filtered.vcf.gz | VCF | Filtered VCF; FILTER=PASS for passing variants |
| *.recal.table | BQSR | Base quality recalibration table from BaseRecalibrator |
| *.bqsr.bam | BAM | Recalibrated BAM; use as HaplotypeCaller input |
| genomicsdb/ | Directory | GenomicsDB workspace per chromosome for joint genotyping |
| Problem | Cause | Solution |
|---------|-------|----------|
| SAM/BAM file has no @RG header | Missing read group from BWA-MEM2 | Re-align with -R "@RG\tID:...\tSM:...\tPL:ILLUMINA" |
| Java OutOfMemoryError | Insufficient heap size | Add --java-options "-Xmx16g" or more |
| HaplotypeCaller very slow | Single-threaded HMM | Add --native-pair-hmm-threads 8; use interval lists to parallelize by chr |
| Empty GVCF output | Wrong interval or no reads in region | Check samtools view -c sample.bam chrN for read counts |
| VQSR fails (< 10k variants) | Too few variants for training | Use hard filters instead of VQSR for small cohorts or exomes |
| GenomicsDB import fails on existing path | GenomicsDB workspace already exists | Delete existing workspace: rm -rf genomicsdb/chr1 before re-running |
| IndexOutOfBoundsException | Chromosome name mismatch between BAM and reference | Ensure genome FASTA and BAM use same chr naming (chr1 vs 1) |
| BCFtools/tabix index missing | Tabix index (.tbi) not created | Run gatk IndexFeatureFile -I file.vcf.gz or tabix -p vcf file.vcf.gz |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.