alignment-files/reference-operations/SKILL.md
Generate consensus sequences and manage reference files using samtools. Use when creating consensus from alignments, indexing references, or creating sequence dictionaries.
npx skillsauth add GPTomics/bioSkills bio-reference-operationsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: GATK 4.5+, bcftools 1.19+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Generate consensus sequences and manage reference files using samtools.
"Prepare a reference genome" -> Index the FASTA and create a sequence dictionary for downstream tools.
samtools faidx ref.fa + samtools dict ref.fa -o ref.dictpysam.FastaFile('ref.fa') (auto-uses .fai index)"Build a consensus from BAM" -> Derive the most-supported base at each position from aligned reads.
samtools consensus input.bam -o consensus.faCreate index for random access to reference sequences.
samtools faidx reference.fa
# Creates reference.fa.fai
samtools faidx reference.fa chr1:1000-2000
samtools faidx reference.fa chr1:1000-2000 chr2:3000-4000
samtools faidx reference.fa chr1
samtools faidx reference.fa chr1:1000-2000 > region.fa
samtools faidx -i reference.fa chr1:1000-2000
chr1 248956422 6 60 61
chr2 242193529 253404903 60 61
Columns: name, length, offset, line bases, line width
Create SAM header dictionary for reference (used by GATK, Picard).
samtools dict reference.fa -o reference.dict
samtools dict -a GRCh38 -s "Homo sapiens" reference.fa -o reference.dict
@HD VN:1.6 SO:unsorted
@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:file:reference.fa
@SQ SN:chr2 LN:242193529 M5:f98db672eb0993dcfdabafe2a882905c UR:file:reference.fa
The M5: (MD5) tag is the only definitive reference-identity check -- two references named "GRCh38" with different decoy/alt content have different M5s. CRAM enforces M5 match on read-back. See alignment-validation for BAM-vs-reference M5 cross-check.
| Reference flavor | ALT | Decoy | EBV | HLA | Use case | |------------------|-----|-------|-----|-----|----------| | GRCh38 no-alt | no | no | no | no | Conservative analyses | | GRCh38 + decoy + EBV (1000G analysis set) | no | yes | yes | no | Cohort projects | | GRCh38 ALT + decoy + EBV + HLA (Broad / hs38DH) | yes | yes | yes | yes | GATK Best Practices | | T2T-CHM13 v2.0 | n/a | n/a | n/a | n/a | Distinct coordinates -- NOT interchangeable |
Mixing no-alt and ALT-aware BAMs in one cohort produces inconsistent multi-mapping behavior at HLA, KIR, and segmental-duplication regions. Standardize before joint calling.
| Convention | Source | chr1 | mitochondrion | |-----------|--------|------|---------------| | UCSC (hg19, hg38) | UCSC Genome Browser | chr1 | chrM | | Ensembl (GRCh37, GRCh38) | Ensembl, ENA | 1 | MT | | NCBI RefSeq (recent) | NCBI | chr1 | chrM | | 1000G analysis sets | 1000G Phase II/III | chr1 | chrM |
A BAM with @SQ SN:chr1 cannot be analyzed against a 1-named reference (and vice versa). Detect:
samtools view -H sample.bam | grep '^@SQ' | head -3
samtools dict ref.fa | head -3
Convert: bcftools annotate --rename-chrs for VCF; for BAM there is no clean conversion -- re-align.
Create consensus sequence from alignments.
samtools consensus input.bam -o consensus.fa
samtools consensus -r chr1:1000-2000 input.bam -o region_consensus.fa
# FASTA (default)
samtools consensus -f fasta input.bam -o consensus.fa
# FASTQ (includes quality)
samtools consensus -f fastq input.bam -o consensus.fq
# Minimum depth to call base
samtools consensus -d 5 input.bam -o consensus.fa
# Call all positions (including low coverage)
samtools consensus -a input.bam -o consensus.fa
# Emit IUPAC codes (R, Y, S, W, K, M, B, D, H, V, N) for heterozygous columns
# --ambig is REQUIRED -- without it, output is restricted to A,C,G,T,N,*
samtools consensus --ambig --het-fract 0.2 --call-fract 0.5 input.bam -o consensus.fa
--het-fract controls the fraction of the second-most-common base relative to the most common required to call a heterozygote (default ~0.15). Without --ambig, columns where the second base passes --het-fract resolve to N rather than the IUPAC code. --show-ins / --show-del control insertion / deletion display, not ambiguity.
# Default: Bayesian Illumina profile
samtools consensus -f fasta input.bam -o consensus.fa
# Platform-specific profiles (samtools 1.21+; verify via samtools consensus --help for installed version)
samtools consensus --config hifi input.bam -o consensus.fa # PacBio HiFi
samtools consensus --config ont input.bam -o consensus.fa # ONT R10.4+
samtools consensus --config ultima input.bam -o consensus.fa # Ultima Genomics
samtools consensus --config illumina input.bam -o consensus.fa # default
# Report ref base where consensus unavailable (low coverage)
samtools consensus -T ref.fa input.bam -o consensus.fa
Different operations -- conflating them produces nonsense:
| Tool | Input | Output | Use case |
|------|-------|--------|----------|
| samtools consensus | BAM | Consensus FASTA derived from reads (Bayesian) | Viral, de novo / amplicon, low-coverage species |
| bcftools consensus | reference + VCF | Reference with VCF variants applied | Apply called variants (haplotype reconstruction, custom ref for re-mapping) |
For viral consensus from BAM:
# Modern: samtools consensus
samtools consensus --config illumina -d 10 --het-fract 0.5 \
--show-ins yes --show-del yes input.bam -o consensus.fa
# Apply called variants to reference (different question)
bcftools consensus -f reference.fa variants.vcf.gz -o sample_consensus.fa
bcftools consensus -f reference.fa -H 1 phased.vcf.gz -o haplotype1.fa # phased haplotype 1
For bacterial / phage assembly polishing, prefer Pilon (short-read) or medaka (ONT); samtools consensus is not iterative.
import pysam
with pysam.FastaFile('reference.fa') as ref:
seq = ref.fetch('chr1', 999, 2000) # 0-based
print(seq)
with pysam.FastaFile('reference.fa') as ref:
for name in ref.references:
length = ref.get_reference_length(name)
print(f'{name}: {length:,} bp')
with pysam.FastaFile('reference.fa') as ref:
for chrom in ref.references:
seq = ref.fetch(chrom)
print(f'>{chrom}')
print(seq[:100] + '...')
import pysam
from collections import Counter
def consensus_at_position(bam, chrom, pos):
bases = Counter()
for pileup in bam.pileup(chrom, pos, pos + 1, truncate=True):
if pileup.pos == pos:
for read in pileup.pileups:
if not read.is_del and not read.is_refskip:
bases[read.alignment.query_sequence[read.query_position]] += 1
if bases:
return bases.most_common(1)[0][0]
return 'N'
with pysam.AlignmentFile('input.bam', 'rb') as bam:
consensus = consensus_at_position(bam, 'chr1', 1000000)
print(f'Consensus at chr1:1000000 = {consensus}')
The Python majority-vote consensus below is illustrative, NOT production. samtools consensus is Bayesian, quality-aware, and platform-aware; majority vote ignores base qualities and produces wrong calls on low-coverage / low-quality regions. Use for teaching pileup iteration mechanics; use samtools consensus for any real consensus.
import pysam
from collections import Counter
def build_consensus(bam_path, chrom, start, end, min_depth=3):
consensus = []
with pysam.AlignmentFile(bam_path, 'rb') as bam:
for pileup in bam.pileup(chrom, start, end, truncate=True):
bases = Counter()
for read in pileup.pileups:
if not read.is_del and not read.is_refskip:
base = read.alignment.query_sequence[read.query_position]
bases[base] += 1
if sum(bases.values()) >= min_depth:
consensus.append(bases.most_common(1)[0][0])
else:
consensus.append('N')
return ''.join(consensus)
import pysam
def create_dict_header(fasta_path):
header = {'HD': {'VN': '1.6', 'SO': 'unsorted'}, 'SQ': []}
with pysam.FastaFile(fasta_path) as ref:
for name in ref.references:
length = ref.get_reference_length(name)
header['SQ'].append({'SN': name, 'LN': length})
return header
header = create_dict_header('reference.fa')
for sq in header['SQ'][:5]:
print(f'{sq["SN"]}: {sq["LN"]:,} bp')
Goal: Set up a reference genome with all indices needed by common analysis tools.
Approach: Create FASTA index (.fai), sequence dictionary (.dict), and aligner-specific indices in sequence.
# 1. Index FASTA for samtools/pysam
samtools faidx reference.fa
# 2. Create sequence dictionary for GATK/Picard
samtools dict reference.fa -o reference.dict
# 3. Pre-populate CRAM REF_CACHE (for offline HPC nodes)
seq_cache_populate.pl -root $REF_CACHE_DIR reference.fa
For aligner-specific indices (BWA, Bowtie2, STAR, minimap2, Salmon), see read-alignment.
# Verify FAI exists
ls -la reference.fa.fai
# Verify dict exists
head reference.dict
# Test fetch
samtools faidx reference.fa chr1:1-100
samtools faidx reference.fa chr1 > chr1.fa
samtools faidx chr1.fa # Index the subset
cut -f1,2 reference.fa.fai > chrom.sizes
samtools faidx reference.fa chr1 chr2 chr3 > subset.fa
samtools faidx subset.fa
# Generate consensus
samtools consensus input.bam -o consensus.fa
# Align consensus back to reference
minimap2 -a reference.fa consensus.fa > comparison.sam
| Task | Command |
|------|---------|
| Index FASTA | samtools faidx ref.fa |
| Fetch region | samtools faidx ref.fa chr1:1-1000 |
| Create dict | samtools dict ref.fa -o ref.dict |
| Build consensus | samtools consensus in.bam -o out.fa |
| Chrom sizes | cut -f1,2 ref.fa.fai |
bcftools consensustesting
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.