liquid-biopsy/cfdna-preprocessing/SKILL.md
Decides how to preprocess plasma cfDNA sequencing data so the recoverable signal survives - library-prep-aware fragment expectations (dsDNA vs ssDNA/adaptase prep), UMI/duplex consensus with fgbio (ExtractUmisFromBam, GroupReadsByUmi --strategy paired for duplex, CallMolecularConsensusReads vs CallDuplexConsensusReads, FilterConsensusReads min-reads "total s1 s2"), the align->group->consensus->RE-align ordering, and the cfDNA dedup trap where naive coordinate dedup collapses nucleosome-coincident independent molecules. Covers when single-strand consensus suffices vs when duplex is mandatory, the singleton/sensitivity tax at low input, and reading the insert-size histogram as a pre-analytical QC instrument. Use when processing plasma cfDNA reads before fragmentomics, ctDNA mutation calling, or tumor-fraction estimation.
npx skillsauth add GPTomics/bioSkills bio-cfdna-preprocessingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bwa 0.7.17+, fgbio 2.1+, numpy 1.26+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Notes specific to this skill: fgbio flag semantics drift across major versions - in CallDuplexConsensusReads the --min-reads is a permissive PRE-filter (fgbio issue #1009), and in FilterConsensusReads the --min-reads (-M) takes up to three values "total strand1 strand2" where if values two and three differ the more stringent value must come first. Confirm both against the installed fgbio --help before scripting.
"Preprocess my plasma cfDNA reads" -> Convert reads to consensus molecules and an analysis-ready BAM without destroying the fragment-size signal or collapsing independent molecules.
fgbio ExtractUmisFromBam -> bwa mem -Y -> fgbio GroupReadsByUmi -> fgbio Call*ConsensusReads -> RE-align -> fgbio FilterConsensusReadssubprocess; read insert sizes with pysam for QCWhat a cfDNA assay can ever see is fixed before any bioinformatics runs. cfDNA is not randomly sheared - it is a nucleosome footprint with a mononucleosome mode at ~167 bp and a ctDNA-enriched short tail (134-144 bp), so fragment length is structured biological signal, not noise to be normalized away. Two upstream choices determine what survives: (1) the blood draw and plasma prep (a delayed draw dumps leukocyte gDNA into the denominator and irreversibly dilutes tumor fraction - no algorithm recovers it; defer that QC to liquid-biopsy/analytical-validation), and (2) the library chemistry (dsDNA ligation polishes native ends and discards the sub-100 bp population; ssDNA/adaptase prep recovers short and damaged molecules). Reverse engineering a fragment the prep threw away, or a molecule a bad draw never delivered, is impossible.
The second reframe: UMI/duplex consensus is error suppression, not merely duplicate removal. Grouping reads by UMI and majority-voting a consensus erases PCR/sequencing errors that are not shared across a family - but a single-strand consensus votes unanimously for damage (C->T deamination, G->T 8-oxoG) that was on the template before amplification. Only DUPLEX consensus, requiring the same base on both independently-copied strands, removes those lesions. Single-strand consensus cannot. Choosing simplex vs duplex is choosing an error floor, and it trades against molecular recovery at low input (the singleton tax below).
| Choice | What it does | Recoverable distribution / error floor | |--------|--------------|----------------------------------------| | dsDNA ligation prep (NEB/KAPA-style) | needs duplex substrate; end-repair/A-tail polishes native ends | clean ~167 bp mode; sub-100 bp tail and native-end signal LOST | | ssDNA prep, SRSLY/Kircher-Meyer lineage | denatures, ligates single strands; retains native ends + sawtooth | recovers short/nicked/damaged + sub-nucleosomal; prep for fragmentomics | | ssDNA prep, adaptase/tail-based (Swift/Accel-1S) | single-strand but adaptase chemistry shifts apparent size | ~10 bp short of the canonical mode; sawtooth blunted (a chemistry signature, not a bug) | | Single-strand UMI consensus (CallMolecularConsensusReads) | majority-vote one source strand's reads | removes PCR/sequencer error (~1e-4 to 1e-5); CANNOT remove deamination/oxidation damage | | DUPLEX consensus (CallDuplexConsensusReads) | combine both-strand single-strand consensuses | error must occur identically on both strands to survive (~<1e-7); removes deamination/oxidation |
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| Deep targeted panel with single-strand UMIs | adjacency group -> CallMolecularConsensusReads -> FilterConsensusReads | simplex consensus suppresses PCR/sequencer error to ~1e-4; the panel-VAF workhorse |
| Need VAF below ~0.1% / MRD-grade specificity | duplex prep -> --strategy paired -> CallDuplexConsensusReads -> filter 2 1 1 | duplex removes damage artifacts; reaches the ~1e-7 floor single-strand cannot |
| sWGS / ULP-WGS for tumor fraction | minimal processing: trim -> align -> light dedup; NO consensus | TF from copy number needs even coverage, not error suppression; defer to tumor-fraction-estimation |
| Degraded / low-input / FFPE-adjacent sample | ssDNA prep (SRSLY) to recover short+damaged molecules | dsDNA prep discards exactly the molecules a degraded sample has left |
| Fragmentomics / end-motif readout | ssDNA prep (native ends), NO in-silico size selection upstream | dsDNA fills jagged ends; size-selecting conditions on length and biases every feature |
| Picogram input, detection (not genotyping) | call permissively (--min-reads 1), accept singletons | requiring duplicate observation discards the only evidence for a low-VAF variant |
| No UMIs at all, quantitative readout | do NOT coordinate-dedup; document the bias (see Failure Modes) | nucleosome-positioned ends make naive dedup delete real molecules |
Methodology evolves: confirm current fgbio flag semantics and prep-vendor size behavior against live docs before committing a pipeline.
Goal: Turn UMI-tagged raw reads into error-suppressed consensus molecules with correct coordinates.
Approach: Extract UMIs into the RX tag, align (consensus needs coordinates to group), group by UMI + approximate position, call consensus (which emits UNMAPPED reads because the consensus sequence differs from any input read), RE-align the consensus, then apply the real quality gate with FilterConsensusReads. The two-pass alignment is mandatory.
# 1. Extract inline UMIs into RX. Read structure tokens: M=UMI, S=skip/stem, T=template, +=all remaining.
# 6M11S+T per end = 6 bp UMI, 11 bp stem (the S that bleeds into T if omitted), rest = insert.
fgbio ExtractUmisFromBam --input raw.unmapped.bam --output with_umis.bam \
--read-structure 6M11S+T 6M11S+T --single-tag RX
# 2. Align. -Y soft-clips supplementaries so tag-bearing short-fragment sequence is not dropped.
bwa mem -t 8 -Y reference.fa with_umis.bam | samtools sort -o aligned.bam -
samtools index aligned.bam
# 3. Group by UMI. adjacency = simplex default; paired = MANDATORY for duplex (reconstructs strand pairing).
fgbio GroupReadsByUmi --input aligned.bam --output grouped.bam \
--strategy paired --edits 1 # use --strategy adjacency for single-strand UMIs
# 4a. SIMPLEX: single-strand consensus, --min-reads takes ONE value.
fgbio CallMolecularConsensusReads --input grouped.bam --output consensus.unmapped.bam --min-reads 1
# 4b. DUPLEX: --min-reads here is a permissive PRE-filter (fgbio #1009) - call low, filter later.
fgbio CallDuplexConsensusReads --input grouped.bam --output consensus.unmapped.bam --min-reads 1
# 5. RE-align: consensus reads are emitted UNMAPPED by design. Re-map, then ZipperBams
# transfers the consensus/UMI tags from the unmapped BAM onto the new alignments.
samtools fastq consensus.unmapped.bam | bwa mem -t 8 -Y -p reference.fa - \
| fgbio ZipperBams --unmapped consensus.unmapped.bam --ref reference.fa \
| samtools sort -o consensus.bam -
# 6. The REAL quality gate. --min-reads "total strand1 strand2"; "2 1 1" = true duplex (both strands seen).
# If values two and three differ, the more stringent must come first (e.g. "6 3 0", not "0 3").
fgbio FilterConsensusReads --input consensus.bam --output filtered.bam --ref reference.fa \
--min-reads 2 1 1 --max-read-error-rate 0.025 --max-base-error-rate 0.1 \
--min-base-quality 40 --reverse-per-base-tags
Key flags: --strategy paired is non-negotiable for duplex (adjacency cannot reconstruct A/B strand pairing). --min-reads 1 1 0 on the filter accepts single-strand consensus too (one strand may be absent); 2 1 1 requires both strands = true duplex. --reverse-per-base-tags makes per-base depth/error tags read in genomic orientation after alignment.
Goal: Use the fragment-length distribution as a pre-analytical instrument before trusting any downstream number.
Approach: Tabulate proper-pair template lengths from the BAM, locate the mode, and compare the shape against the expected nucleosome footprint - the mode, the sawtooth, and the long-fragment fraction each diagnose a specific failure.
import pysam
import numpy as np
def insert_size_qc(bam_path, max_size=600):
'''Summarize cfDNA fragment lengths as a QC readout.'''
bam = pysam.AlignmentFile(bam_path, 'rb')
sizes = [r.template_length for r in bam.fetch()
if r.is_proper_pair and not r.is_secondary and 0 < r.template_length <= max_size]
bam.close()
sizes = np.array(sizes)
long_frac = np.mean(sizes > 250) # excess >250 bp signals gDNA/leukocyte-lysis contamination
return {'n': len(sizes), 'mode_bp': int(np.bincount(sizes).argmax()),
'median_bp': float(np.median(sizes)), 'frac_over_250bp': float(long_frac)}
A mode at ~167 bp with a ~10.4 bp sawtooth below it is healthy. A mode drifting up plus excess mass >250 bp is gDNA contamination. A mode ~10 bp low is the adaptase chemistry signature, not contamination. A ~120-130 bp spike is adapter dimer (trim adapters before reading the histogram).
Trigger: running Picard MarkDuplicates / samtools markdup on cfDNA without UMIs. Mechanism: cfDNA ends pile up non-randomly at nucleosome/linker boundaries, so many independent molecules genuinely share the same ~167 bp start+end; coordinate dedup declares them PCR duplicates and keeps one. Symptom: deflated unique-molecule counts and depressed VAF for true low-frequency variants - worst exactly where coverage and nucleosome positioning are strongest. Fix: use UMIs and group by family; if none, do not dedup by position alone for quantitative readouts and document the bias. This is the strongest single argument for putting UMIs on a cfDNA assay.
Trigger: setting CallDuplexConsensusReads --min-reads high. Mechanism: it is a PRE-filter (fgbio #1009) that discards data before the real filter sees it. Symptom: lower molecular recovery than expected with no specificity gain. Fix: call permissively (--min-reads 1), apply strictness in FilterConsensusReads.
Trigger: --strategy adjacency for a duplex library. Mechanism: adjacency cannot link the two strands of one duplex (which carry the UMI in opposite order). Symptom: duplex consensus finds no two-strand molecules; output collapses to simplex. Fix: --strategy paired.
Trigger: a global MAPQ >= 30/60 filter. Mechanism: a 40-80 bp insert has fewer anchoring bases and lands in the low-MAPQ tail even when correctly placed. Symptom: the short, ctDNA-enriched fragments are preferentially deleted - the molecules size-selection was meant to keep. Fix: filter MAPQ with the size distribution in mind; use a gentler threshold for fragmentomics.
--min-reads >= 2 at low input (the singleton tax)Trigger: requiring duplicate observation per family on a picogram-input library. Mechanism: a large fraction of unique cfDNA molecules are sequenced once (singletons); requiring two reads discards genuine, often the only, evidence for a variant. Symptom: sensitivity loss disguised as quality. Fix: for detection favor recovery (call permissively); reserve 2 1 1 for genotyping/MRD where error suppression dominates.
Trigger: computing end-motif/ratio/VAF features after in-vitro (Pippin) or in-silico size selection. Mechanism: selection conditions on length, so length-derived features are biased by construction. Symptom: distorted end-motif spectra and fragment-ratio features that do not reproduce. Fix: size-select for detection sensitivity only; never report length-derived features from a length-selected library; keep the selection step in metadata.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Mononucleosome mode 167 bp; ~10.4 bp sub-mode periodicity | Snyder 2016 Cell 164:57 | core particle (~147 bp) + linker; periodicity = helical pitch of nucleosome-bound DNA, a QC sanity check | | TF/CTCF-footprint short fragments 35-80 bp | Snyder 2016 Cell 164:57 | sub-nucleosomal protection; only surfaces with ssDNA prep | | ctDNA principal length 134-144 bp vs 167 bp germline | Underhill 2016 PLoS Genet 12:e1006162 | tumor fragments shorter; BRAF V600E mutant 132-145 bp vs WT 165 bp | | ctDNA-enrichment selection windows 90-150 bp (and 250-320 bp) | Mouliere 2018 Sci Transl Med 10:eaat4921 | selecting the short window enriches mutant allele fraction at no extra sequencing cost | | ssDNA prep mitochondrial / microbial cfDNA enrichment ~10.7x / ~71.3x | Burnham 2016 Sci Rep 6:27859 | dsDNA prep is blind to the ultrashort fraction these reside in | | iDES error-suppression gain ~3x (barcode) x ~3x (in-silico) ~= ~15x | Newman 2016 Nat Biotechnol 34:547 | family-size and background polishing are complementary, not redundant | | Duplex error floor <1 per 1e7 nt | Kennedy 2014 Nat Protoc 9:2586 | both-strand concordance requirement | | FilterConsensusReads defaults: max-read-error 0.025, max-base-error 0.1, max-no-calls 0.2 | fgbio docs | per-read vs per-base vs fraction/count switch (<1.0 = fraction, >=1.0 = count) | | gDNA-contamination signature: excess mass >180-250 bp / ladder distortion | Snyder 2016 (biology); pre-analytical QC | leukocyte-lysis gDNA is long; a "too clean, too long" library is contaminated, not pristine |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Consensus BAM has garbage coordinates | skipped re-alignment (consensus emitted unmapped) | RE-align consensus reads before FilterConsensusReads |
| UMI bases corrupt mapping / consensus | omitted the S skip in the read structure | include the stem: e.g. 6M11S+T not 6M+T |
| Duplex run finds no two-strand molecules | grouped with --strategy adjacency | regroup with --strategy paired |
| FilterConsensusReads rejects valid strand spec | --min-reads 0 3 (less stringent first) | put the more stringent value first: 3 0 (or 6 3 0) |
| Deflated VAF / low unique-molecule count | coordinate dedup on no-UMI cfDNA | use UMI families; never naive-dedup quantitative cfDNA |
| Short ctDNA fragments missing after filtering | blanket high MAPQ filter | lower MAPQ threshold; account for short-fragment mapping |
| Apparent ~10 bp size shift "needs correcting" | adaptase (Swift/Accel-1S) chemistry signature | record the prep; do not correct a real chemistry effect |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.