alignment-files/alignment-validation/SKILL.md
Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.
npx skillsauth add GPTomics/bioSkills bio-alignment-validationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, picard 3.1+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Post-alignment quality control to verify alignment quality and identify issues.
"Check alignment quality" -> Compute post-alignment QC metrics (mapping rate, pairing, insert size, strand balance) to identify issues before downstream analysis.
samtools flagstat, samtools stats, Picard CollectAlignmentSummaryMetricspysam.AlignmentFile iteration with metric calculations| Concern | Tools | What it catches |
|---------|-------|-----------------|
| File integrity | samtools quickcheck, picard ValidateSamFile | Truncation, missing EOF, malformed records, wrong CIGAR, MAPQ out of range |
| Sequence dictionary identity | samtools dict + M5 diff | BAM aligned to wrong reference flavor / different decoy / chr vs no-chr |
| QC metrics | samtools stats, flagstat, mosdepth, Picard CollectMultipleMetrics / CollectHsMetrics / CollectWgsMetrics | Are the data biologically reasonable for the assay? |
| Contamination / sample swap | verifybamid2, somalier, Picard CrosscheckFingerprints | Cross-sample contamination, tumor-normal swap, mislabeled sample |
A file can pass quickcheck and still be malformed in ways that crash GATK three hours into HaplotypeCaller. Conversely, a QC-poor BAM can be structurally valid.
# Fast: header + EOF block check (misses mid-file truncation, invalid CIGAR)
samtools quickcheck -v in.bam || echo "QUICKCHECK FAILED"
samtools quickcheck -v *.bam > bad_bams.fofn # one fail-line per bad file
# Slow but thorough: structural validation
picard ValidateSamFile I=in.bam MODE=SUMMARY R=ref.fa
# Production: ignore expected-but-noisy
picard ValidateSamFile I=in.bam MODE=SUMMARY R=ref.fa \
IGNORE=INVALID_MAPPING_QUALITY \
IGNORE=MISMATCH_FLAG_MATE_NEG_STRAND
CI-safe one-liner:
test -s in.bam \
&& samtools quickcheck -v in.bam \
&& [ $(samtools view -c -F 2304 in.bam) -gt 1000 ] \
|| { echo "BAM failed integrity"; exit 1; }
# Compare per-contig MD5 between BAM and reference
diff \
<(samtools view -H in.bam | grep '^@SQ' | tr '\t' '\n' | grep '^M5:' | sort) \
<(samtools dict ref.fa | grep '^@SQ' | tr '\t' '\n' | grep '^M5:' | sort)
If M5s differ, the BAM was aligned to a different sequence than the current reference (even if contig names match). Concrete failure modes: GRCh38 vs GRCh38.p13 vs GRCh38_no_alt (alt contigs differ); UCSC chr1 vs Ensembl 1 (names differ, M5s match -- pure renaming); soft-masked vs hard-masked (M5 matches, viewers differ). The M5 tag is the only definitive identity check.
No alignment QC is complete without these in production:
# Cross-sample contamination
verifybamid2 --SVDPrefix /resources/1000g.b38.vcf.gz.SVD \
--Reference ref.fa --BamFile sample.bam --Output sample.contam
# VerifyBamID2 README flags FREEMIX > 0.03 as concerning; values escalate from there.
# Relatedness, sex check, sample swap detection
somalier extract -d extracted/ -s /resources/sites.GRCh38.vcf.gz \
-f ref.fa sample.bam
somalier relate --infer extracted/*.somalier
# Tumor/normal pairing verification
picard CrosscheckFingerprints I=tumor.bam I=normal.bam \
HAPLOTYPE_MAP=Homo_sapiens_assembly38.haplotype_database.txt
# LOD > 5 = same individual; < -5 = different
Sample-swap rates of 0.5-1% in production cohorts are typical. Without somalier or CrosscheckFingerprints, swaps are detected only when a downstream finding contradicts clinical expectation.
Goal: Verify that the fragment length distribution matches the library preparation protocol.
Approach: Extract template_length from properly paired reads and compare the distribution to expected values for the library type.
samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt
java -jar picard.jar CollectInsertSizeMetrics \
I=input.bam \
O=insert_metrics.txt \
H=insert_histogram.pdf
| Library | Mean insert | Distribution shape | Diagnostic | |---------|-------------|-------------------|-----------| | TruSeq DNA PCR-free WGS | 400-500 bp | Roughly Gaussian | Sharp peak; bimodality = degraded sample | | TruSeq DNA Nano (PCR) WGS | 300-400 bp | Gaussian, narrower | | | Twist / IDT exome capture | 250-350 bp | Gaussian | | | TruSeq Stranded mRNA | 200-300 bp | Right-skewed (transcript distribution) | Long tail = poor size selection | | Ribo-Zero rRNA-depleted | 250-400 bp | Right-skewed | | | Smart-seq2 / Smart-seq3 | 200-700 bp | Broad | | | 10x Chromium (3') | n/a -- not informative | n/a | | | TruSeq ChIP | 200-400 bp | Sharp | | | ATAC-seq (Buenrostro / Omni-ATAC) | Multimodal | Peaks at ~50, ~180, ~340 bp | Missing multimodal pattern = bad library; missing ~180 bp = under-digested | | Hi-C / Micro-C | Multimodal | Peak at ligation-junction size | | | cfDNA / ctDNA | 160-180 bp | Multimodal; ~167 bp mononucleosomal + ~340 dinuc | Tumor-derived shorter (~145 bp); shape itself is a biomarker | | FFPE | 100-250 bp | Right-skewed, broad | | | aDNA | 30-80 bp | Sharp left-skewed | | | ONT (native) | 1-30 kb | n/a | | | PacBio HiFi | 10-25 kb | Sharp peak | |
For ATAC, the multimodal pattern is the QC. If the mononucleosomal peak (~180 bp) is absent, Tn5 was over-titrated, under-titrated, or DNA was degraded. Use ATACseqQC fragSizeDist() for the standard ATAC fragment-size diagnostic.
import pysam
import numpy as np
import matplotlib.pyplot as plt
def get_insert_sizes(bam_file, max_reads=100000):
sizes = []
bam = pysam.AlignmentFile(bam_file, 'rb')
for i, read in enumerate(bam.fetch()):
if i >= max_reads:
break
if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
sizes.append(read.template_length)
bam.close()
return sizes
sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')
plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')
Percentage of reads correctly paired.
samtools flagstat input.bam
samtools flagstat input.bam | grep "properly paired"
proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"
| Metric | Good | Marginal | Poor | |--------|------|----------|------| | Proper pair | > 90% | 80-90% | < 80% | | Mapped | > 95% | 90-95% | < 90% | | Singletons | < 5% | 5-10% | > 10% |
GC content correlation with coverage.
java -jar picard.jar CollectGcBiasMetrics \
I=input.bam \
O=gc_bias_metrics.txt \
CHART=gc_bias_chart.pdf \
S=gc_summary.txt \
R=reference.fa
computeGCBias \
-b input.bam \
--effectiveGenomeSize 2913022398 \
-g hg38.2bit \
-o gc_bias.txt \
--biasPlot gc_bias.pdf
| Issue | Symptom | |-------|---------| | Under-representation | Low GC coverage drops | | Over-representation | High GC coverage elevated | | PCR bias | Strong correlation |
A balanced 0.48-0.52 forward/reverse ratio applies to WGS / WES / generic DNA-seq on autosomes. Expected to deviate for: stranded RNA-seq (deliberately strand-asymmetric -- verify with RSeQC infer_experiment.py), bisulfite (CT vs GA), small-RNA / strand-specific RNA-seq, and chrY/chrM regions. Per-chromosome strand imbalance >5% on autosomes is a field-convention rule of thumb (no single primary citation) — it picks up aligner artifacts; on chrX/chrY it suggests sex-mismatch.
forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"
for chr in chr1 chr2 chr3; do
fwd=$(samtools view -c -F 16 input.bam $chr)
rev=$(samtools view -c -f 16 input.bam $chr)
echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done
samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n
samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'
Mean MAPQ is misleading; distributions are bimodal (0 and aligner-max). For aligner-specific scales and "unique mapping" sentinels, see sam-bam-basics. The fraction of primary mapped reads at MAPQ >= 30 is a more informative summary than the mean.
samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25
Median-normalized per-autosome coverage (1.0 = expected diploid; 0.5 = monosomy/sex; 1.5 = trisomy):
samtools idxstats in.bam | awk '$2>0 && $1!~/^chr[XYM]|^GL|^KI|^chrUn|^chrEBV/ {
cov[$1] = $3 / $2
}
END {
n = asort(cov, sorted)
med = sorted[int(n/2)+1]
for (c in cov) printf "%s\t%.3f\n", c, cov[c]/med
}'
For full ancestry / contamination / relatedness checking, use verifybamid2, somalier, or peddy -- they account for population AFs, not just per-contig depth.
java -jar picard.jar CollectAlignmentSummaryMetrics \
I=input.bam \
R=reference.fa \
O=alignment_summary.txt
| Metric | Description | Good Value | |--------|-------------|------------| | PCT_PF_READS_ALIGNED | Mapped % | > 95% | | PF_MISMATCH_RATE | Mismatches | < 1% | | PF_INDEL_RATE | Indels | < 0.1% | | STRAND_BALANCE | Strand ratio | ~0.5 |
Goal: Run all key alignment QC checks in a single pass and generate a summary report.
Approach: Combine samtools flagstat, stats, idxstats, and strand counts into one script that outputs pass/warn/fail calls.
#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}
mkdir -p $OUTDIR
echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt
echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt
echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt
echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt
echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt
echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt
echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt
echo -e "\nReport: $OUTDIR/report.txt"
A skeleton; full implementation is in examples/validate_alignment.py:
import pysam
class AlignmentValidator:
def __init__(self, bam_file):
self.bam = pysam.AlignmentFile(bam_file, 'rb')
def report(self, sample_size=100000):
# Sample reads, compute mapping rate, proper-pair rate, MAPQ dist, strand balance
# See examples/validate_alignment.py for full implementation
...
The first-N-reads sampling pattern is biased toward chr1 (different GC content and complexity than chrM/chrX/chrY/alt contigs). For unbiased per-chromosome statistics, use samtools view -s 42.01 input.bam (seed-prefixed fraction INT.FRAC is reproducible; bare 0.01 is not) instead of head-of-file iteration.
| Metric | Good | Warning | Fail | |--------|------|---------|------| | Mapping rate | > 95% | 90-95% | < 90% | | Proper pairing | > 90% | 80-90% | < 80% | | Duplicate rate (assay-specific) | see bam-statistics decision table | -- | -- | | Strand balance | 0.48-0.52 | 0.45-0.55 | Outside | | Mean MAPQ | > 40 | 30-40 | < 30 | | GC bias | < 1.2x | 1.2-1.5x | > 1.5x (field-convention bands; Picard CollectGcBiasMetrics does not prescribe specific cutoffs) |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.