alignment-files/bam-statistics/SKILL.md
Generate alignment statistics using samtools flagstat, stats, depth, coverage, and mosdepth. Use when assessing alignment quality, calculating coverage, or generating QC reports.
npx skillsauth add GPTomics/bioSkills bio-bam-statisticsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Get alignment statistics and coverage from my BAM file" -> Generate read counts, mapping rates, per-chromosome statistics, depth profiles, and coverage summaries.
samtools flagstat, samtools stats, samtools depth, samtools coverage (samtools)pysam.AlignmentFile with pileup() and get_index_statistics() (pysam)Generate alignment statistics using samtools and pysam.
| Question | Best tool | Why |
|----------|-----------|-----|
| Quick read counts by FLAG category | samtools flagstat | Fast; counts secondary+supp in totals |
| Per-chromosome counts | samtools idxstats | Fast (needs index); counts secondary+supp |
| Insert size, MAPQ, error, GC | samtools stats -r ref.fa | Comprehensive; feeds MultiQC |
| Per-position depth (small region) | samtools depth or pysam pileup | Slow on full genome |
| Per-position depth (genome-wide) | mosdepth | 3-10x faster than samtools depth |
| Per-region coverage (BED) | mosdepth --by regions.bed | Production default |
| Coverage histogram / cumulative | mosdepth -t 4 --no-per-base | Single-pass histogram |
| Breadth at depth thresholds | mosdepth --thresholds 1,10,30,100 | Standard exome QC |
| Targeted enrichment QC | picard CollectHsMetrics | PCT_OFF_BAIT, FOLD_80_BASE_PENALTY, AT/GC dropout |
| Cross-sample contamination | verifybamid2, somalier | FREEMIX < 0.01 expected |
| Counting category | flagstat | stats | idxstats |
|-------------------|----------|-------|----------|
| Primary alignments | in total minus supp | raw total sequences | mapped column |
| Secondary | secondary line | filtered out | counted in mapped |
| Supplementary | supplementary line | filtered out | counted in mapped |
| Mapping rate denominator | total including supp | primary only | mapped+unmapped |
For long-read data where one read produces many supplementary alignments, the senior cross-check:
input_read_count = flagstat_total - secondary - supplementary
= stats_raw_total_sequences
Reports of "the file has 1.2M reads" where the input was actually 800k with 400k supplementary chimeric splits trace to flagstat misinterpretation.
Fast summary of alignment flags.
samtools flagstat input.bam
Output:
10000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
50000 + 0 supplementary
0 + 0 duplicates
9800000 + 0 mapped (98.00% : N/A)
9950000 + 0 paired in sequencing
4975000 + 0 read1
4975000 + 0 read2
9700000 + 0 properly paired (97.49% : N/A)
9750000 + 0 with itself and mate mapped
100000 + 0 singletons (1.01% : N/A)
25000 + 0 with mate mapped to a different chr
10000 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat -@ 4 input.bam
samtools flagstat input.bam > flagstat.txt
Per-chromosome read counts (requires index).
samtools idxstats input.bam
Output format: chrom length mapped unmapped
chr1 248956422 5000000 1000
chr2 242193529 4800000 800
chrM 16569 50000 100
* 0 0 150000
# Total mapped reads
samtools idxstats input.bam | awk '{sum += $3} END {print sum}'
# Mitochondrial percentage
samtools idxstats input.bam | awk '
/^chrM/ {mt = $3}
{total += $3}
END {print mt/total*100 "% mitochondrial"}'
Comprehensive statistics including insert size, base quality, and more.
samtools stats input.bam > stats.txt
samtools stats input.bam | grep "^SN"
Key summary fields:
raw total sequences - Total readsreads mapped - Mapped readsreads mapped and paired - Properly pairedinsert size average - Mean insert sizeinsert size standard deviation - Insert size spreadaverage length - Mean read lengtherror rate - Mismatch ratesamtools stats input.bam > stats.txt
plot-bamstats -p plots/ stats.txt
samtools stats input.bam chr1:1000000-2000000 > region_stats.txt
Per-position read depth.
samtools depth input.bam > depth.txt
Output: chrom position depth
samtools depth -r chr1:1000-2000 input.bam
samtools depth -a input.bam > depth_with_zeros.txt
# samtools mpileup historically capped depth at 8000 per position -- the cap was in mpileup, not depth.
# samtools depth -d/--max-depth is deprecated in 1.13+ (silently ignored).
# For mpileup, raise the cap explicitly when working with deep targeted/amplicon data:
samtools mpileup -d 1000000 -f ref.fa input.bam
Pipelines that historically break the 8000 mpileup cap: targeted oncology hotspots (5000-50000x), mitochondrial DNA (small genome, large read share), amplicon viral (ARTIC: 1000-100000x per amplicon), UMI-deduped capture (14000-17000x post-collapse), highly expressed transcripts (rRNA, mt-RNA).
# When fragment length < 2 * read_length, R1 and R2 overlap.
# Default samtools depth double-counts overlap; -s deducts:
samtools depth -s input.bam
Without -s, doubled support inflates somatic VAFs at sites covered by overlapping pairs (especially in fragmented samples: FFPE, cfDNA). mosdepth does not double-count overlap. samtools mpileup and bcftools mpileup both enable overlap detection by default; pass -x/--ignore-overlaps to disable.
mosdepth -t 4 sample input.bam # genome-wide per-base
mosdepth -t 4 --by exome.bed --thresholds 1,10,20,30,100 --no-per-base sample input.bam # exome QC
mosdepth -t 4 --quantize 0:1:10:100: sample input.bam # CNV-style bands
mosdepth -t 4 -f ref.fa sample input.cram # CRAM with reference
mosdepth filters dup/secondary/supp by default; configurable via --flag. Memory ~ 4 bytes x longest chrom (1 GB for human chr1, 12+ GB for axolotl). Does not honor base quality; use samtools depth -q INT if needed.
samtools depth -b regions.bed input.bam
samtools depth input.bam | awk '{sum += $3; n++} END {print sum/n}'
Per-chromosome or per-region coverage statistics (faster than depth).
samtools coverage input.bam
Output columns:
#rname - Reference namestartpos - Start positionendpos - End positionnumreads - Number of readscovbases - Bases with coveragecoverage - Percentage of bases coveredmeandepth - Mean depthmeanbaseq - Mean base qualitymeanmapq - Mean mapping qualitysamtools coverage -r chr1:1000000-2000000 input.bam
samtools coverage -b regions.bed input.bam
samtools coverage -m input.bam
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
total = mapped = paired = proper = 0
for read in bam:
total += 1
if not read.is_unmapped:
mapped += 1
if read.is_paired:
paired += 1
if read.is_proper_pair:
proper += 1
print(f'Total: {total}')
print(f'Mapped: {mapped} ({mapped/total*100:.1f}%)')
print(f'Properly paired: {proper} ({proper/paired*100:.1f}%)')
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for stat in bam.get_index_statistics():
print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for pileup in bam.pileup('chr1', 1000000, 1000001):
print(f'Position {pileup.pos}: depth {pileup.n}')
import pysam
def mean_depth(bam_path, chrom, start, end):
depths = []
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup in bam.pileup(chrom, start, end, truncate=True):
depths.append(pileup.n)
if depths:
return sum(depths) / len(depths)
return 0
depth = mean_depth('input.bam', 'chr1', 1000000, 2000000)
print(f'Mean depth: {depth:.1f}x')
Goal: Compute coverage breadth and depth for a genomic region from a BAM file.
Approach: Iterate pileup columns in the region, count covered positions and accumulate depth, then derive percentages and means.
Reference (pysam 0.22+):
import pysam
def coverage_stats(bam_path, chrom, start, end):
covered = 0
total_depth = 0
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup in bam.pileup(chrom, start, end, truncate=True):
covered += 1
total_depth += pileup.n
length = end - start
pct_covered = covered / length * 100
mean_depth = total_depth / length if length > 0 else 0
return {
'length': length,
'covered_bases': covered,
'pct_covered': pct_covered,
'mean_depth': mean_depth
}
stats = coverage_stats('input.bam', 'chr1', 1000000, 2000000)
print(f'Coverage: {stats["pct_covered"]:.1f}%')
print(f'Mean depth: {stats["mean_depth"]:.1f}x')
Goal: Compute the insert size distribution to assess library preparation quality.
Approach: Iterate properly paired read1 records, accumulate template lengths into a Counter, then compute summary statistics.
Reference (pysam 0.22+):
import pysam
from collections import Counter
insert_sizes = Counter()
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for read in bam:
if read.is_proper_pair and read.is_read1 and read.template_length > 0:
insert_sizes[read.template_length] += 1
sizes = list(insert_sizes.keys())
mean_insert = sum(s * c for s, c in insert_sizes.items()) / sum(insert_sizes.values())
print(f'Mean insert size: {mean_insert:.0f}')
print(f'Min: {min(sizes)}, Max: {max(sizes)}')
| Task | Command |
|------|---------|
| Quick counts | samtools flagstat input.bam |
| Per-chrom counts | samtools idxstats input.bam |
| Full stats | samtools stats input.bam |
| Coverage summary | samtools coverage input.bam |
| Per-position depth | samtools depth input.bam |
| Mean depth | samtools depth input.bam \| awk '{sum+=$3;n++}END{print sum/n}' |
A single "mapping rate > 95%" rule rejects valid ATAC, ChIP, RNA-seq, metagenomics, and aDNA samples. The threshold question is "is this rate normal for this assay?" not "is this rate above 95%?"
| Metric | WGS PCR-free | WGS PCR | WES | Targeted panel | Deep panel (UMI) | RNA-seq | scRNA (10x) | ATAC | ChIP | Long-read | aDNA | |--------|--------------|---------|-----|----------------|------------------|---------|-------------|------|------|-----------|------| | Mapping rate | >99% | >98% | >95% | >95% | >95% | >90% | >70% | >50% | >60% | >95% | 1-50% | | Duplicate rate | <5% | 5-15% | 20-50% | 20-50% | 50-90% pre-consensus | (skip) | (use UMI) | 10-30% | 5-30% | n/a | 20-60% | | Proper pair rate | >95% | >95% | >85% | >80% | >80% | >70% | n/a | >50% | >70% | n/a | >60% | | Mean MAPQ | bimodal at 0/60 | bimodal | bimodal | bimodal | bimodal | bimodal incl 255 (STAR) | 0/3/255 | 30-55 | 30-55 | 30-50 | 20-40 | | Mt fraction | 0.1-2% | 0.1-2% | <1% | <0.1% | <0.1% | varies | varies | **<10% target (ENCODE ATAC-seq pipeline standards, Buenrostro 2013-style libraries) ** | <2% | n/a | varies |
Mean MAPQ is misleading; the distribution is bimodal (0 and aligner-max). The fraction at MAPQ >= 30 is more informative:
samtools view -c -F 2308 -q 30 in.bam # primary, mapped, MAPQ>=30
samtools view -c -F 2308 in.bam # primary, mapped (denominator)
# For STAR/STARsolo, use -q 255 instead of -q 30 (255 is the unique-mapping sentinel)
A 99% flagstat mapping rate does NOT mean the data is usable. Common false-positive scenarios:
samtools stats input.bam | grep "bases soft-clipped" # >5% suggests adapter contamination
picard CollectHsMetrics PCT_OFF_BAIT or PCT_SELECTED_BASES.verifybamid2 or somalier (FREEMIX > 1% degrades somatic calling; > 5% breaks germline calling).@SQ M5: from BAM header with samtools dict ref.fa -- see alignment-validation.samtools stats reports the IS section only for FR-oriented properly paired reads. So:
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.