alignment-files/alignment-sorting/SKILL.md
Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.
npx skillsauth add GPTomics/bioSkills bio-alignment-sortingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Sort alignment files by coordinate or read name using samtools and pysam.
"Sort a BAM file" -> Reorder reads by genomic coordinate (for indexing/variant calling) or by name (for paired-end processing).
samtools sort -o sorted.bam input.bampysam.sort('-o', 'sorted.bam', 'input.bam')| Order | Flag | Use Case |
|-------|------|----------|
| Coordinate | default | Indexing, visualization, variant calling |
| Name | -n | Paired-end processing, fixmate, markdup |
| Tag | -t TAG | Sort by specific tag value |
samtools sort -o sorted.bam input.bam
samtools sort -n -o namesorted.bam input.bam
samtools sort -@ 8 -o sorted.bam input.bam
samtools sort -m 4G -@ 4 -o sorted.bam input.bam
samtools sort -T /tmp/sort_tmp -o sorted.bam input.bam
# Output as BAM (default)
samtools sort -O bam -o sorted.bam input.bam
# Output as CRAM
samtools sort -O cram --reference ref.fa -o sorted.cram input.bam
# Sort by cell barcode (10x Genomics)
samtools sort -t CB -o sorted_by_barcode.bam input.bam
bwa mem ref.fa reads.fq | samtools sort -o aligned.bam
| Tool | Algorithm | Speed | Memory | Output guarantee |
|------|-----------|-------|--------|------------------|
| sort -n | Full lexicographic sort by QNAME | Slowest | Spills to -T | Strict total order by name |
| collate | Hash-bucket grouping | ~3-10x faster | Bounded | Mates adjacent; between-mate order undefined |
Use collate when extracting paired FASTQ, re-aligning, or streaming through markdup. Use sort -n only when a tool requires true lexicographic name order (e.g. RSEM, Salmon alignment-mode).
# Fast paired FASTQ extraction
samtools collate -O -u in.bam tmp_prefix | \
samtools fastq -1 R1.fq.gz -2 R2.fq.gz -0 /dev/null -s /dev/null -n -
# Markdup pre-processing (collate beats sort -n here)
samtools collate -O -u in.bam tmp_prefix | \
samtools fixmate -m -u - - | \
samtools sort -u - | \
samtools markdup - out.bam
| Operation | Required sort |
|-----------|---------------|
| samtools index | coordinate (hard requirement) |
| samtools fixmate -m | name (or collate; needs mates adjacent) |
| samtools markdup | coordinate (after fixmate) |
| GATK MarkDuplicatesSpark | coordinate or queryname |
| samtools mpileup / bcftools mpileup | coordinate |
| GATK HaplotypeCaller, Mutect2 | coordinate |
| featureCounts / HTSeq | coordinate or name (-p for paired) |
| umi_tools dedup | coordinate (with index) |
| fgbio GroupReadsByUmi | queryname (hard requirement) |
| fgbio CallMolecularConsensusReads | TemplateCoordinate (fgbio SortBam) |
| Sniffles, cuteSV, Manta, Delly | coordinate (need SA tags) |
| Salmon alignment-mode | name |
| RSEM (with STAR --quantMode TranscriptomeSAM) | name (hard requirement) |
samtools view -H input.bam | grep "^@HD"
# SO:coordinate = coordinate sorted
# SO:queryname = name sorted
# SO:unsorted = not sorted
# Check if coordinate sorted (returns 0 if sorted)
samtools view input.bam | awk '$4 < prev {exit 1} {prev=$4}'
import pysam
pysam.sort('-o', 'sorted.bam', 'input.bam')
pysam.sort('-n', '-o', 'namesorted.bam', 'input.bam')
pysam.sort('-@', '4', '-m', '2G', '-o', 'sorted.bam', 'input.bam')
Do not load BAM records into a list and call sorted(). pysam.sort() calls samtools' external-merge sort which spills to disk; loading reads into memory blows up around ~30M reads (~10 GB human BAM). Always delegate to pysam.sort():
import pysam
pysam.sort('-@', '4', '-m', '2G', '-T', '/tmp/sortpfx',
'-o', 'sorted.bam', 'input.bam')
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as bam:
hd = bam.header.get('HD', {})
sort_order = hd.get('SO', 'unknown')
print(f'Sort order: {sort_order}')
For streaming from aligners, use shell pipes (simpler and more reliable):
import subprocess
subprocess.run(
'bwa mem ref.fa reads.fq | samtools sort -o aligned.bam',
shell=True, check=True
)
Combine multiple BAM files into one. samtools merge does NOT validate sort-order consistency across inputs; mismatched inputs silently produce a malformed output.
for f in *.bam; do samtools view -H "$f" | head -1; done | sort -u
# Should print exactly ONE line, e.g. "@HD VN:1.6 SO:coordinate"
# -c deduplicates @RG records; -p deduplicates @PG records (samtools-merge(1))
samtools merge -c -p -@ 8 merged.bam sample1.bam sample2.bam sample3.bam
When merging BAMs from different lanes / machines / aligners, RG IDs may collide. -c and -p deduplicate header records, but RG IDs that genuinely refer to different lane-level read groups must be made unique upstream (samtools addreplacerg) before merge -- otherwise GATK BQSR (which keys models by RGID/PU) silently produces wrong recalibration.
samtools merge -@ 4 merged.bam sample1.bam sample2.bam sample3.bam
samtools merge -b files.txt merged.bam # one BAM path per line
samtools merge -f merged.bam sample1.bam sample2.bam
samtools merge -R chr1:1000000-2000000 merged_region.bam sample1.bam sample2.bam
import pysam
pysam.merge('-c', '-p', '-f', 'merged.bam', 'sample1.bam', 'sample2.bam', 'sample3.bam')
Goal: Combine sorting with other alignment processing steps into efficient pipelines.
Approach: Pipe aligner output directly into samtools sort to avoid writing unsorted intermediates, then index for downstream access.
bwa mem -t 8 ref.fa R1.fq R2.fq | samtools sort -@ 4 -o aligned.bam
samtools index aligned.bam
# Full workflow: sort by name, fixmate, sort by coord, markdup
samtools sort -n -o namesorted.bam input.bam
samtools fixmate -m namesorted.bam fixmate.bam
samtools sort -o sorted.bam fixmate.bam
samtools markdup sorted.bam marked.bam
samtools sort -o coord_sorted.bam name_sorted.bam
samtools index coord_sorted.bam
# Collate first to group pairs
samtools collate -u -O input.bam /tmp/collate | \
samtools fastq -1 R1.fq -2 R2.fq -0 /dev/null -s /dev/null -
| Parameter | Effect |
|-----------|--------|
| -@ N | Use N additional threads |
| -m SIZE | Memory per thread (e.g., 4G) |
| -T PREFIX | Temp file location (use fast SSD scratch) |
| -l LEVEL | Compression level (1-9, default 6) |
| Level | Use | Wall-time vs default | Size vs default |
|-------|-----|----------------------|------------------|
| -l 0 / -u | Pipe between samtools tools | 0% (skips BGZF) | +200-400% |
| -l 1 | Final output if disk is cheap | ~+10% | ~+30% |
| -l 6 | Default | baseline | baseline |
| -l 9 | Archival, write-once | ~+50-100% | ~-2-5% |
# WRONG -- pipe re-compresses then decompresses every step
samtools fixmate -m in.bam - | samtools sort -o out.bam
# RIGHT -- uncompressed (-u) between piped samtools commands
samtools fixmate -m -u in.bam - | samtools sort -o out.bam
# 8 threads, 2GB per thread, low compression for output written to fast disk
samtools sort -@ 8 -m 2G -l 1 -T /scratch/sortpfx -o sorted.bam input.bam
| Task | Command |
|------|---------|
| Sort by coordinate | samtools sort -o out.bam in.bam |
| Sort by name | samtools sort -n -o out.bam in.bam |
| Sort with threads | samtools sort -@ 8 -o out.bam in.bam |
| Collate pairs | samtools collate -o out.bam in.bam |
| Merge BAMs | samtools merge out.bam in1.bam in2.bam |
| Check sort order | samtools view -H in.bam \| grep "^@HD" |
| Sort + index | samtools sort -o out.bam in.bam && samtools index out.bam |
| Error | Cause | Solution |
|-------|-------|----------|
| out of memory | Insufficient RAM | Use -m to limit per-thread memory |
| disk full | Temp files filling disk | Use -T to specify different location |
| truncated file | Interrupted sort | Re-run sort from original |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.