alignment-files/alignment-amplicon-clipping/SKILL.md
Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
npx skillsauth add GPTomics/bioSkills bio-alignment-amplicon-clippingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: samtools 1.19+, pysam 0.22+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Trim primer-derived bases from amplicon BAM" -> Soft- or hard-clip the 5' primer footprint after alignment using a primer BED, then repair fixmate/MD/NM tags.
samtools ampliconclip -b primers.bed input.bam -o clipped.bam (since samtools 1.11)iVar trim, BAMClipper, fgbio ClipBamAmplicon panels (SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, fusion panels, 16S rRNA) use designed PCR primers for enrichment. Primer-derived bases at read 5' ends do NOT reflect biological sequence -- they reflect the primer template. Without trimming:
Standard amplicon BAMs should NEVER be processed by samtools markdup -- by design every read at a primer location is a "duplicate" by coordinate. See duplicate-handling for the assay-aware decision.
| Tool | When | Notes |
|------|------|-------|
| samtools ampliconclip | Default for amplicon panels (since 1.11) | Soft- or hard-clip from BED; modifies CIGAR; invalidates MD/NM |
| iVar trim | SARS-CoV-2 ARTIC pipelines | Coordinates by primer name/position; applies hard or soft clip |
| BAMClipper | Capture / hybrid panels with primer overlap | 5'-end clipping with overlap handling |
| fgbio ClipBam | When read-pair coordination matters | Soft/hard-clip with mate-aware end adjustment |
| cutadapt (pre-alignment) | Legacy / when alignment is downstream | Trims at FASTQ stage; less precise for amplicon |
Goal: Decide whether trimmed bases are kept in the BAM (reversible) or discarded (irreversible).
Approach: Soft-clip is the safe default; hard-clip only when archiving and disk is constrained.
| Mode | Flag | What it does | Reversible? |
|------|------|--------------|-------------|
| Soft-clip | (default) / --soft-clip | Bases kept in SEQ; CIGAR uses S; bases not aligned | Yes (CIGAR can be re-extended) |
| Hard-clip | --hard-clip | Bases removed from SEQ; CIGAR uses H | No (bases lost) |
Soft-clip is the recommended default. Hard-clip is irreversible -- once applied, the trimmed bases cannot be recovered for re-analysis with different primer coordinates.
Goal: Trim primers from a coordinate-sorted, indexed amplicon BAM and produce a downstream-ready BAM.
Approach: Run ampliconclip with primer BED, optionally --both-ends and --strand, then re-fixmate (CIGAR changed) and re-calmd (MD/NM tags invalidated by clip).
# 1. Soft-clip primers (default; reversible)
samtools ampliconclip --both-ends --strand --soft-clip \
-b primers.bed input.bam -o clipped.bam
# 2. Re-pair tags (CIGARs changed -- mate info needs refresh)
samtools sort -n clipped.bam | \
samtools fixmate -m - - | \
samtools sort -o sorted.bam -
# 3. Repair MD/NM tags (invalidated by clip; required by mpileup BAQ and IGV)
samtools calmd -b sorted.bam reference.fa > clipped_final.bam
samtools index clipped_final.bam
--strand clips primer bases only on the strand the primer is designed for. Without --strand, both strands are clipped at the primer site, removing valid biological sequence on the off-strand.
--both-ends allows clipping at both 5' and 3' positions of the read (some primers can appear at either end after alignment). Necessary for amplicon designs where reads can read through the entire amplicon.
# tab-separated, 0-based half-open like all BED
chr1 100 125 primer_1_F +
chr1 500 525 primer_1_R -
chr1 600 625 primer_2_F +
chr1 1000 1025 primer_2_R -
Tools that consume the BED: column 1-3 (region), column 6 (strand) is required for --strand. ARTIC primer schemes ship pre-built BEDs (e.g., primer.bed from artic-network/primer-schemes).
| Tool | Approach | When |
|------|----------|------|
| samtools ampliconclip | Soft-clip from BED, post-alignment | nf-core/viralrecon, modern ARTIC workflows |
| iVar trim | Soft- or hard-clip with primer-position parsing | Original ARTIC field bioinformatics |
Modern viral consensus pipelines tend to use ampliconclip then samtools consensus --config illumina --ambig for IUPAC heterozygote handling. See reference-operations for consensus generation.
Clipping invalidates several tags and CIGAR-derived fields:
| Field | Impact | Repair |
|-------|--------|--------|
| CIGAR | New S or H operations added | Automatic from ampliconclip |
| MD:Z | Mismatch positions now wrong | samtools calmd -b in.bam ref.fa |
| NM:i | Edit distance recomputed | samtools calmd |
| TLEN | Template length changes when both mates clipped | samtools fixmate -m |
| ms, MC:Z | Mate score (lowercase per SAMtags) / mate CIGAR | samtools fixmate -m |
A clipped BAM that bypasses fixmate + calmd causes silent failures in bcftools mpileup BAQ (which depends on MD), IGV mismatch coloring, and any tool using NM for filtering.
Amplicon reads at primer locations are by design coordinate-degenerate -- every read mapped to the same amplicon shares the same start/end coordinates because they all come from the same primer pair. samtools markdup would mark essentially every read as a duplicate and erase the dataset. For amplicon panels:
fgbio GroupReadsByUmi -> CallMolecularConsensusReads instead of markdup. See duplicate-handling.| Error | Cause | Solution |
|-------|-------|----------|
| MD tag mismatch after clipping | calmd not run | Run samtools calmd -b clipped.bam ref.fa |
| Variant calls with strand bias at every amplicon end | Forgot --strand | Re-run with strand-aware clipping |
| Markdup output shows ~100% duplicates | Amplicon BAM was processed with markdup | Restart from raw alignment; use ampliconclip; skip markdup |
| Unexpected reference confirmation at primer-overlapping variants | ampliconclip not run | Run before variant calling |
| Task | Command |
|------|---------|
| Soft-clip primers | samtools ampliconclip --both-ends --strand -b primers.bed in.bam -o clipped.bam |
| Hard-clip (irreversible) | samtools ampliconclip --both-ends --strand --hard-clip -b primers.bed in.bam -o clipped.bam |
| Repair MD/NM after clip | samtools calmd -b clipped.bam ref.fa > final.bam |
| Repair mate info | samtools sort -n - \| samtools fixmate -m - - \| samtools sort -o out.bam - |
-aa -A -d 600000 -B)tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.