alignment-files/sam-bam-basics/SKILL.md
View, convert, and understand SAM/BAM/CRAM alignment files using samtools and pysam. Use when inspecting alignments, converting between formats, or understanding alignment file structure.
npx skillsauth add GPTomics/bioSkills bio-sam-bam-basicsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Read a BAM file" -> Open a binary alignment file and iterate over aligned reads with their mapping coordinates, flags, and quality scores.
pysam.AlignmentFile() (pysam)samtools view (samtools)scanBam() (Rsamtools)View and convert alignment files using samtools and pysam.
| Format | Description | Use Case | |--------|-------------|----------| | SAM | Text format, human-readable | Debugging, small files | | BAM | Binary compressed SAM | Standard storage format | | CRAM | Reference-based compression | Long-term archival, smaller than BAM |
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@RG ID:sample1 SM:sample1
@PG ID:bwa PN:bwa VN:0.7.17
read1 0 chr1 100 60 50M * 0 0 ACGT... FFFF... NM:i:0
Header lines start with @:
@HD - Header metadata (version, sort order)@SQ - Reference sequence dictionary@RG - Read group information@PG - Program used to create fileAlignment fields (tab-separated):
samtools view input.bam | head
samtools view -h input.bam | head -100
samtools view -H input.bam
samtools view input.bam chr1:1000-2000
samtools view -c input.bam
Goal: Convert between SAM (text), BAM (binary), and CRAM (reference-compressed) alignment formats.
Approach: Use samtools view with format flags (-b for BAM, -C for CRAM, -h for SAM with header). CRAM requires a reference FASTA with -T.
samtools view -h -o output.sam input.bam
samtools view -b -o output.bam input.sam
samtools view -C -T reference.fa -o output.cram input.bam
samtools view -b -T reference.fa -o output.bam input.cram
samtools view -b input.sam > output.bam
| Flag | Decimal | Meaning | |------|---------|---------| | 0x1 | 1 | Paired | | 0x2 | 2 | Proper pair | | 0x4 | 4 | Unmapped | | 0x8 | 8 | Mate unmapped | | 0x10 | 16 | Reverse strand | | 0x20 | 32 | Mate reverse strand | | 0x40 | 64 | First in pair | | 0x80 | 128 | Second in pair | | 0x100 | 256 | Secondary alignment | | 0x200 | 512 | Failed QC | | 0x400 | 1024 | PCR duplicate | | 0x800 | 2048 | Supplementary |
# Number to mnemonics
samtools flags 147
# 0x93 147 PAIRED,PROPER_PAIR,REVERSE,READ2
# Mnemonics to number
samtools flags PAIRED,PROPER_PAIR,REVERSE,READ2 # 147
Two different concepts that are routinely conflated:
| Bit | Name | Meaning | Filter implication |
|-----|------|---------|--------------------|
| 0x100 (256) | Secondary | An alternative candidate alignment for the same read; not the primary location | -F 256 is correct for SNV/indel calling on short reads |
| 0x800 (2048) | Supplementary | A piece of a chimeric/split alignment (the read is split across loci) | Carries SA:Z tag; required by SV callers (Manta, Sniffles, cuteSV, GRIDSS, Delly) |
-F 2304 removes both. Strip supplementary only when downstream is small-variant calling; keep supplementary for SV calling, fusion detection, or any analysis that follows split-reads.
samtools view -q 30 does different things depending on what produced the BAM. MAPQ is an aligner-specific scale, not a universal probability:
| Aligner | MAPQ scale | "Unique" sentinel | Common gotcha |
|---------|-----------|-------------------|----------------|
| BWA-MEM / BWA-MEM2 | 0-60 | 60 | -q 30 is sensible "high confidence" |
| minimap2 (DNA / pbmm2) | 0-60 | 60 | Spec-compliant |
| HISAT2 | 0-60 | 60 | Spec-compliant |
| Bowtie2 | 0-42 | 42 (rare) | -q 60 drops everything; -q 23 is the established 99% threshold |
| STAR | 0, 1, 2, 3, 255 | 255 = uniquely mapped (sentinel, not a quality) | -q 255 for "unique only"; -q 30 accidentally keeps unique only too |
| DRAGEN | 0 to --mapq-max (often ~250) | varies | -q 30 still meaningful; distribution shape differs |
| Cell Ranger / STARsolo | inherits STAR | 255 | Same trap as STAR |
Verify the actual scale of any unfamiliar BAM:
samtools view input.bam | awk '{print $5}' | sort -un | head
samtools view -H input.bam | grep '^@PG' | head -1 # which aligner produced this BAM
| Context | Coordinate system |
|---------|-------------------|
| SAM text POS | 1-based, inclusive |
| samtools view chr1:100-200 | 1-based, closed interval |
| samtools faidx chr1:100-200 | 1-based, closed interval |
| BAM binary internal | 0-based, half-open |
| pysam read.reference_start | 0-based |
| bam.fetch('chr1', 100, 200) | 0-based, half-open |
| BED files | 0-based, half-open |
| VCF | 1-based |
| GFF/GTF | 1-based, inclusive |
samtools view bam chr1:100-200 and bam.fetch('chr1', 100, 200) return different read sets at boundaries.
| Op | Description | |----|-------------| | M | Alignment match (can be mismatch) | | I | Insertion to reference | | D | Deletion from reference | | N | Skipped region (introns in RNA-seq; do NOT count as covered bases) | | S | Soft clipping (sequence in SEQ but not aligned) | | H | Hard clipping (sequence not in SEQ) | | = | Sequence match (explicit) | | X | Sequence mismatch (explicit) | | P | Padding (rare; multiple-sequence-alignment context) |
Example: 50M2I30M = 50 bases match, 2 base insertion, 30 bases match
CIGAR M is overloaded -- it is the union of = and X. Some aligners (minimap2, BWA with -Y) emit =/X directly; bcftools / Picard often need M and rebuild MD/NM with samtools calmd. N operations break naive coverage calculations: a 1000 bp RNA-seq read with one 50 kb intron does not cover 50 kb. Distinguish soft-clip (S, bases retained) from hard-clip (H, bases discarded -- irreversible).
Beyond the standard fields, downstream tools depend on optional tags whose presence depends on aligner and assay. Inspect with samtools view input.bam | head -1 | tr '\t' '\n' or pysam read.get_tag('XX').
| Tag | Set by | Meaning | Required by | |-----|--------|---------|-------------| | NM:i | bwa, samtools calmd | Edit distance to reference | mapDamage, many filters | | MD:Z | bwa, samtools calmd | Mismatch positions (text) | bcftools mpileup BAQ, IGV mismatch coloring | | MC:Z | samtools fixmate -m | Mate CIGAR | samtools markdup | | MS:i | samtools fixmate -m | Mate score | samtools markdup | | RG:Z | aligner from -R | Read group ID | GATK BQSR, MarkDuplicates LB lookup | | SA:Z | All split-read aligners | Comma-list of supplementary coords | Sniffles, Manta, cuteSV, GRIDSS, Delly | | NH:i | STAR, HISAT2 | Number of reported hits | featureCounts multimapper handling, Salmon | | HI:i | STAR | Hit index (0-based among NH) | RSEM | | XS:A | STAR, HISAT2, minimap2 -ax splice | Strand inferred from splice motif | StringTie, Cufflinks | | CB:Z | Cell Ranger, STARsolo | Corrected cell barcode | scRNA quantification | | UB:Z | Cell Ranger, STARsolo | Corrected UMI | UMI-aware dedup | | RX:Z | fgbio AnnotateBamWithUmis | Raw UMI (bulk) | fgbio GroupReadsByUmi | | MI:Z | fgbio CallMolecularConsensusReads | Molecular identifier (consensus) | Duplex calling | | cs:Z | minimap2 --cs | Compact CIGAR-with-bases | paftools, SV tools |
Missing tags fail in two modes: silently wrong (featureCounts ignoring multimappers without NH; markdup marking nothing without MC/MS) or loudly (consensus tools rejecting input without MD).
The @PG lines record every tool that touched the BAM, linked through PP (previous program) tags. This is the audit trail.
samtools view -H input.bam | grep '^@PG'
A clean germline pipeline:
@PG ID:bwa-mem PN:bwa VN:0.7.17
@PG ID:samtools.1 PN:samtools VN:1.20 PP:bwa-mem CL:samtools sort
@PG ID:samtools.2 PN:samtools VN:1.20 PP:samtools.1 CL:samtools fixmate
@PG ID:samtools.3 PN:samtools VN:1.20 PP:samtools.2 CL:samtools markdup
A broken/missing chain (no PP, unknown tools, gaps) means the BAM cannot be reliably reproduced. Production pipelines often reject inputs without a complete chain.
CRAM stores reads relative to a reference; without it, the file is unreadable. htslib resolves the reference in this order:
-T ref.fa / --referenceREF_CACHE env var (local MD5-named cache)REF_PATH env var (colon-separated; can include URLs)UR: URL in SAM @SQ headerM5: tag (fails on offline HPC)On HPC nodes without internet, populate a local cache once:
mkdir -p $HOME/cram_cache
seq_cache_populate.pl -root $HOME/cram_cache reference.fa
export REF_CACHE=$HOME/cram_cache/%2s/%2s/%s
export REF_PATH=$REF_CACHE # disables ENA fallback
samtools quickcheck -v file.cram # header + EOF only
samtools view -c file.cram # forces full decode; proves reference reachable
CRAM operations can be irreversibly lossy: --output-fmt-option=archive=1 enables 8-bin Illumina quality binning (~30-50% additional size reduction; benign for >=30x germline WGS, harmful for low-coverage / somatic / forensic / archival). Convert against the exact reference the BAM was aligned to (matched by @SQ M5:); a different reference silently corrupts bases on read-back.
Goal: Read and manipulate alignment data programmatically in Python.
Approach: Use pysam.AlignmentFile to open BAM/CRAM files, iterate over reads, and access properties like coordinates, flags, CIGAR, and tags.
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for read in bam:
print(f'{read.query_name}\t{read.reference_name}:{read.reference_start}')
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for sq in bam.header['SQ']:
print(f'{sq["SN"]}: {sq["LN"]} bp')
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for read in bam:
print(f'Name: {read.query_name}')
print(f'Flag: {read.flag}')
print(f'Chrom: {read.reference_name}')
print(f'Pos: {read.reference_start}') # 0-based
print(f'MAPQ: {read.mapping_quality}')
print(f'CIGAR: {read.cigarstring}')
print(f'Seq: {read.query_sequence}')
print(f'Qual: {read.query_qualities}')
break
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for read in bam:
if read.is_paired and read.is_proper_pair:
if read.is_reverse:
strand = '-'
else:
strand = '+'
print(f'{read.query_name} on {strand} strand')
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for read in bam.fetch('chr1', 1000, 2000):
print(read.query_name)
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('output.sam', 'w', header=infile.header) as outfile:
for read in infile:
outfile.write(read)
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('output.cram', 'wc', reference_filename='reference.fa', header=infile.header) as outfile:
for read in infile:
outfile.write(read)
| Task | samtools | pysam |
|------|----------|-------|
| View BAM | samtools view file.bam | AlignmentFile('file.bam', 'rb') |
| View header | samtools view -H file.bam | bam.header |
| Count reads | samtools view -c file.bam | sum(1 for _ in bam) |
| Get region | samtools view file.bam chr1:1-1000 | bam.fetch('chr1', 0, 1000) |
| BAM to SAM | samtools view -h -o out.sam in.bam | Open with 'w' mode |
| SAM to BAM | samtools view -b -o out.bam in.sam | Open with 'wb' mode |
| BAM to CRAM | samtools view -C -T ref.fa -o out.cram in.bam | Open with 'wc' mode |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.