alignment-files/alignment-filtering/SKILL.md
Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
npx skillsauth add GPTomics/bioSkills bio-alignment-filteringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Filter my BAM file to keep only high-quality reads" -> Select reads by FLAG bits, mapping quality, and genomic regions using samtools view or pysam.
samtools view with -F/-f/-q/-L flags (samtools)pysam.AlignmentFile iteration with attribute filters (pysam)Filter alignments by flags, quality, and regions using samtools and pysam.
| Option | Description |
|--------|-------------|
| -f FLAG | Include reads with ALL bits set |
| -F FLAG | Exclude reads with ANY bits set |
| -G FLAG | Exclude reads with ALL bits set |
| -q MAPQ | Minimum mapping quality |
| -L BED | Include reads overlapping regions |
| Flag | Hex | Meaning | |------|-----|---------| | 1 | 0x1 | Paired | | 2 | 0x2 | Proper pair | | 4 | 0x4 | Unmapped | | 8 | 0x8 | Mate unmapped | | 16 | 0x10 | Reverse strand | | 32 | 0x20 | Mate reverse strand | | 64 | 0x40 | First in pair (read1) | | 128 | 0x80 | Second in pair (read2) | | 256 | 0x100 | Secondary alignment | | 512 | 0x200 | Failed QC | | 1024 | 0x400 | Duplicate | | 2048 | 0x800 | Supplementary |
samtools view -F 4 -o mapped.bam input.bam
samtools view -f 4 -o unmapped.bam input.bam
samtools view -f 2 -o proper.bam input.bam
samtools view -F 1024 -o nodup.bam input.bam
samtools view -F 2304 -o primary.bam input.bam
samtools view -F 256 -F 2048 -o primary.bam input.bam
# Or combined: -F 2304
samtools view -f 64 -o read1.bam input.bam
samtools view -f 128 -o read2.bam input.bam
samtools view -F 16 -o forward.bam input.bam
samtools view -f 16 -o reverse.bam input.bam
samtools view -q 30 -o highqual.bam input.bam
samtools view -F 4 -q 30 -o filtered.bam input.bam
MAPQ scales differ by aligner; the same -q 30 filter does different things. See sam-bam-basics for the full MAPQ-by-aligner table. Filtering recommendations:
| Aligner | "Drop ambiguous" | "High confidence" |
|---------|------------------|-------------------|
| BWA-MEM / BWA-MEM2 | -q 1 | -q 30 (or -q 60 for unique only) |
| Bowtie2 | -q 1 | -q 23 (Bowtie2 MAPQ saturates at 42; 23 is the conventional "uniquely mapped" cutoff in the Langmead lab Bowtie2 manual) |
| STAR | -q 255 | -q 255 (255 is the unique-mapped sentinel; -q 60 drops everything) |
| HISAT2 | -q 1 | -q 60 |
| minimap2 (DNA, long-read) | -q 1 | -q 60 |
| pbmm2 (PacBio) | -q 1 | -q 60 |
For Phred-scaled aligners (BWA, minimap2), MAPQ Q maps to ~10^(-Q/10) probability of wrong mapping. For STAR, the values 0/1/2/3/255 are sentinels, not probabilities.
samtools view -q 1 in.bam # exclude MAPQ=0; works for all aligners
samtools view -o region.bam input.bam chr1:1000000-2000000
samtools view -o regions.bam input.bam chr1:1000-2000 chr2:3000-4000
samtools view -L targets.bed -o targets.bam input.bam
samtools view -q 30 -L targets.bed -o filtered.bam input.bam
Goal: Produce a clean BAM containing only primary, mapped, non-duplicate reads with high mapping confidence.
Approach: Combine FLAG exclusion (-F for unmapped + secondary + duplicate + supplementary) with a MAPQ threshold.
Reference (samtools 1.19+):
samtools view -F 3332 -q 30 -o filtered.bam input.bam
# 3332 = 4 (unmapped) + 256 (secondary) + 1024 (duplicate) + 2048 (supplementary)
Goal: Choose a filter that matches what the downstream caller expects. Stripping supplementary alignments breaks SV callers; requiring proper-pair drops valid spliced RNA-seq reads.
| Assay / caller | Recommended filter | Why |
|----------------|-------------------|-----|
| Germline WGS short-variant (HaplotypeCaller, DeepVariant) | -f 2 -F 3328 -q 20 | Primary, no dup, proper pair, MAPQ>=20 |
| Somatic short-variant (Mutect2, Strelka2) | -F 3328 -q 1 | Drop only MAPQ=0; somatic callers handle low MAPQ; chimeric reads at SVs may carry real somatic SNVs |
| Long-read short-variant (clair3, DeepVariant ONT) | -F 3328 -q 5 | Long-read MAPQ scale is lower |
| Long-read SV (Sniffles, cuteSV) | -F 1024 only | Keep supplementary -- SA tag is the SV signal |
| Short-read SV (Manta, GRIDSS, Delly, SvABA) | -F 1024 only | Same -- supplementary required |
| ChIP-seq peak calling | -F 1804 -q 30 | Drop dup + secondary + supp + unmapped + mate-unmapped + QC-fail |
| ATAC-seq | -F 1804 -q 30 -f 2 | Same plus proper pair |
| RNA-seq quantification (STAR) | -q 255 | Unique only (STAR sentinel) |
| RNA-seq quantification (HISAT2) | -F 256 -q 60 | Different aligner semantics |
| RNA-seq variant (after SplitNCigarReads) | -F 3328 -q 20 | Standard germline after split-N-trim |
| Panel / amplicon | After samtools ampliconclip; -F 1024 -q 20 | Primer overlap makes proper-pair unreliable |
| ctDNA / cfDNA (UMI) | After fgbio consensus; do not pre-filter raw | |
Reference (samtools 1.19+):
# Short-variant germline
samtools view -f 2 -F 3328 -q 20 -o clean.bam input.bam
# 3328 = 256 (secondary) + 1024 (duplicate) + 2048 (supplementary)
# SV calling: KEEP supplementary
samtools view -F 1024 -o sv_input.bam input.bam # NOT -F 2304 or -F 3328
# ChIP-seq / ATAC-seq common filter
samtools view -F 1804 -q 30 -o filtered.bam input.bam
# 1804 = 4 + 8 + 256 + 512 + 1024 = unmapped + mate-unmapped + secondary + QC-fail + duplicate
Cost of getting this wrong: filtering -F 2304 or -F 3328 before SV calling produces zero SV calls -- a single-flag mistake that silently invalidates the analysis.
samtools view -s SEED.FRAC -- integer is the hash seed; fractional is the keep fraction. The hash is on QNAME, so:
-s 1.5 then -s 1.25 keeps a nested 5/8 of the original (not 12.5%). Use different integer seeds for independent samples.# 10% with seed 42 (always the same reads; pair-consistent)
samtools view -s 42.1 -b -o subset.bam input.bam
# Sequential cuts with INDEPENDENT seeds
samtools view -s 1.5 -b in.bam > half1.bam
samtools view -s 2.25 -b half1.bam > quarter.bam # 12.5% of original
# Coverage-matching to a target read count
total=$(samtools view -c -F 2304 input.bam)
target=10000000
frac=$(awk -v t=$target -v n=$total 'BEGIN{printf "%.6f", t/n}')
samtools view -s "1.${frac#*.}" -b -o matched.bam input.bam
# Tumor-normal coverage matching (pull tumor down to normal)
normal_reads=$(samtools view -c -F 2308 normal.bam)
tumor_reads=$(samtools view -c -F 2308 tumor.bam)
if [ "$tumor_reads" -gt "$normal_reads" ]; then
frac=$(awk -v n=$normal_reads -v t=$tumor_reads 'BEGIN{printf "%.6f", n/t}')
samtools view -s "1.${frac#*.}" -b -o tumor_matched.bam tumor.bam
fi
A subsampled BAM without an integer seed (-s 0.1) is non-reproducible -- production pipelines should reject it.
samtools view -e EXPR (or --expr, since samtools 1.16) supports arbitrary expression filtering on tags, FLAG, MAPQ, RNAME, CIGAR, etc. Powerful for filtering by NM, AS, NH, cs, etc. that the FLAG-based filters cannot reach:
# Reads with >=2 mismatches (NM tag)
samtools view -e '[NM] >= 2' in.bam
# Soft clip on the left, on chr1
samtools view -e 'cigar=~"^[0-9]+S" && rname=="chr1"' in.bam
# Combine with FLAG and MAPQ
samtools view -F 2308 -q 30 -e '[NM] <= 5 && [AS] >= 100' in.bam
# Drop reads with low mapped fraction (samtools-internal helpers)
samtools view -e 'sclen / qlen < 0.2' in.bam
Note: in samtools 1.16+, ![NM] is true only if NM is missing (was buggy in earlier versions); NULL values from missing tags propagate through arithmetic.
samtools view -r library_A in.bam # single read group
samtools view -R rg_list.txt in.bam # multiple via file (one ID per line)
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('filtered.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if read.is_unmapped:
continue
if read.mapping_quality < 30:
continue
if read.is_duplicate:
continue
outfile.write(read)
Goal: Apply a multi-criteria quality filter to produce clean alignments for downstream analysis.
Approach: Define a predicate checking mapped status, primary alignment, duplicate flag, and MAPQ; stream reads through it.
Reference (pysam 0.22+):
import pysam
def passes_filter(read):
if read.is_unmapped:
return False
if read.is_secondary or read.is_supplementary:
return False
if read.is_duplicate:
return False
if read.mapping_quality < 30:
return False
return True
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('filtered.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if passes_filter(read):
outfile.write(read)
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('region.bam', 'wb', header=infile.header) as outfile:
for read in infile.fetch('chr1', 1000000, 2000000):
outfile.write(read)
Goal: Extract only reads overlapping target regions defined in a BED file.
Approach: Parse BED into a list of (chrom, start, end) tuples, then fetch reads from each region and write to output.
Reference (pysam 0.22+):
import pysam
def read_bed(bed_path):
regions = []
with open(bed_path) as f:
for line in f:
if line.startswith('#'):
continue
parts = line.strip().split('\t')
regions.append((parts[0], int(parts[1]), int(parts[2])))
return regions
regions = read_bed('targets.bed')
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('targets.bam', 'wb', header=infile.header) as outfile:
for chrom, start, end in regions:
for read in infile.fetch(chrom, start, end):
outfile.write(read)
Hash on QNAME so mates stay together (a fresh random.random() per read drops mates inconsistently and breaks paired-end tools):
import pysam
import zlib
fraction = 0.1
seed = 42
threshold = int(0xffffffff * fraction)
def template_hash(qname, seed):
return zlib.crc32(qname.encode()) ^ seed
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('subset.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if template_hash(read.query_name, seed) <= threshold:
outfile.write(read)
| Task | samtools command |
|------|------------------|
| Mapped only | view -F 4 |
| Unmapped only | view -f 4 |
| Properly paired | view -f 2 |
| Primary only | view -F 2304 |
| No duplicates | view -F 1024 |
| High MAPQ | view -q 30 |
| Region | view file.bam chr1:1-1000 |
| BED regions | view -L file.bed |
| Subsample 10% (reproducible) | view -s 42.1 |
| Standard filter | view -F 3332 -q 30 |
| Purpose | Flags |
|---------|-------|
| Clean reads | -F 3332 -q 30 (mapped, primary, no dups, high qual) |
| Variant calling | -f 2 -F 3328 -q 20 (proper pair, primary, no dups) |
| Coverage analysis | -F 1284 -q 1 (mapped, primary, no dups) |
| Count unique | -F 2304 (primary only) |
Flag breakdowns:
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.