copy-number/copy-ratio-segmentation/SKILL.md
Normalize read-depth copy-ratio profiles and segment them into copy-number regions using circular binary segmentation (CBS, DNAcopy), hidden Markov models, HaarSeg, and fused-lasso methods. Covers GC-content, mappability, and replication-timing (wave-artifact) bias correction, panel-of-normals/PCA denoising, diploid-baseline centering, and algorithm selection by sequencing depth and event size. Use when choosing a segmentation algorithm, correcting depth bias, diagnosing oversegmentation or a mis-centered baseline, tuning CBS or HMM parameters, or understanding why a downstream CNV caller produced fragmented or shifted segments.
npx skillsauth add GPTomics/bioSkills bio-copy-number-copy-ratio-segmentationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: R 4.3+ with DNAcopy 1.76+, Python 3.10+ with numpy 1.26+, pandas 2.2+; QDNAseq 1.38+ (optional, GC/mappability normalization).
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('DNAcopy') then ?segment to confirm argumentspip show numpy pandasIf code throws an error, introspect the installed package and adapt the example. CBS lives in Bioconductor DNAcopy; HMM segmentation is provided by caller-specific backends (CNVkit uses pomegranate; HaarSeg has its own R/Python packages).
"Turn noisy per-bin depth into clean copy-number segments" -> Two stages, both error-prone. First, normalize the depth profile so the only remaining variation is copy number (not GC, mappability, or replication timing). Second, partition the normalized profile into segments of constant copy number. The segmentation algorithm choice has a predictable bias signature, and the diploid-baseline choice can invert every call.
DNAcopy::segment (CBS, the reference implementation)pomegranate; HaarSeg via haarsegRaw read depth confounds copy number with three systematic biases:
| Bias | Cause | Correction | |------|-------|------------| | GC content | PCR efficiency and probe hybridization vary with GC | Loess fit of depth vs GC (QDNAseq), or matched normal | | Mappability | Multi-mapping reads under-counted in repetitive regions | Mappability track filter/weight; exclude low-mappability bins | | Replication timing | Late-replicating DNA is under-represented — the "wave artifact" | Matched normal or PoN; GC correction alone does NOT remove it | | Capture efficiency | Per-probe hybridization varies 10-100x (hybrid capture) | Panel of normals — the dominant bias for exomes/panels |
The key postdoc-level point: GC correction alone is insufficient. The wave artifact in cancer WGS is driven by replication timing, a biological signal GC normalization cannot flatten. Only a matched normal or a panel of normals removes it. This is why a CNVkit flat reference (GC-only) produces systematic false focal calls and why GATK tangent normalization exists.
| Algorithm | Model | Strength | Fails when | |-----------|-------|----------|------------| | CBS (circular binary segmentation) | Recursive t-statistic breakpoint test | High precision; excellent on small focal segments | Low depth (~3x): recall drops to ~42% (worse under over-dispersed counts); ~2 orders slower; fragments across assembly gaps | | HMM | Hidden CN states, emission + transition | Depth-robust; high recall at low coverage | Less precise on small focal segments (~5 kb: ~76% precision vs CBS ~96%, Poisson model); EM finds only local optima | | HaarSeg | Wavelet (Haar) multiscale edge detection | Very fast; good for shallow WGS | Less precise breakpoints than CBS; threshold-sensitive | | Fused lasso (flasso) | L1-penalized piecewise-constant fit | Smooth; tunable sparsity | Penalty hard to set; can over-smooth focal events | | ASPCF | Allele-specific piecewise-constant fit | Joint logR+BAF segmentation (ASCAT) | Needs BAF; see allele-specific-copy-number |
Quantitative benchmark (Zhang et al 2024, Brief Bioinform): the cited precision/recall numbers (CBS ~42% recall at 3x; HMM ~81% recall at 3x; CBS ~96% precision vs HMM ~76% on 5 kb focal segments under a Poisson model) summarise that paper's reported direction of the trade-off. Verify the exact figures against the published tables before quoting them in print; the qualitative trade-off (depth-vs-event-size, CBS-vs-HMM) is robust across recent benchmarks but the precise percentages depend on the simulation model (Poisson vs over-dispersed negative-binomial). There is no universally correct choice.
| Scenario | Algorithm | Rationale |
|----------|-----------|-----------|
| Panel / exome, adequate depth, focal events matter | CBS | Precise on small segments |
| Shallow WGS (< ~5x), broad events | HMM or HaarSeg | CBS recall degrades at low depth |
| Heterogeneous / impure tumor | HMM (e.g. CNVkit hmm-tumor) | Broader state transitions absorb noise |
| Germline, near-diploid | HMM with diploid-tight priors | Priors stabilize calls near CN=2 |
| Allele-specific (need BAF) | ASPCF / FACETS joint CBS | See allele-specific-copy-number |
| Very large WGS, speed-critical | HaarSeg | Near-linear; CBS is ~100x slower |
Goal: Remove GC-content bias from a per-bin depth profile.
Approach: Fit a loess curve of depth versus GC content, divide each bin by its fitted value, log2-transform. This corrects GC but not replication timing — use a normal for that.
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess
def gc_correct(bins):
'''GC-correct a per-bin depth profile. bins: columns chrom, start, depth, gc.
Returns log2 copy ratio relative to the GC-corrected genome median.'''
df = bins[(bins['depth'] > 0) & bins['gc'].between(0.3, 0.7)].copy()
fitted = lowess(df['depth'], df['gc'], frac=0.3, return_sorted=False)
df['corrected'] = df['depth'] / fitted
df['log2'] = np.log2(df['corrected'] / df['corrected'].median())
return df
For exomes and panels, a panel of normals (per-bin median of normals, or PCA denoising) is preferred over GC-only correction because it also removes capture and replication-timing bias.
Goal: Segment a normalized log2 profile into copy-number regions.
Approach: Build a CNA object, smooth single-bin outliers, run CBS, then merge adjacent segments whose means differ by less than a noise-scaled threshold (sdundo).
library(DNAcopy)
# bins: data frame with chrom, maploc (bin midpoint), log2
cna <- CNA(genomdat = bins$log2, chrom = bins$chrom, maploc = bins$maploc,
data.type = 'logratio', sampleid = 'tumor')
cna <- smooth.CNA(cna) # damp single-bin outliers
# alpha = breakpoint significance; undo.splits='sdundo' merges segments whose means
# are within undo.SD noise standard deviations -- the main guard against oversegmentation.
seg <- segment(cna, alpha = 0.01, undo.splits = 'sdundo', undo.SD = 2,
verbose = 1)
write.table(seg$output, 'tumor.segments.tsv', sep = '\t',
quote = FALSE, row.names = FALSE)
Trigger: CBS alpha too liberal, undo.SD too small, or a noisy (high-MAD) profile; FACETS cval too low.
Mechanism: The breakpoint test fires on noise; the profile shatters into many tiny segments that do not correspond to real copy-number changes.
Symptom: Hundreds of short segments; segment count scales with noise, not biology; downstream integer CN incoherent with per-bin medians.
Fix: Raise alpha toward 0.01 or stricter, increase undo.SD (e.g. 2-3), or denoise the input first (better PoN, drop low-coverage bins). Three signatures (Steele 2022) had to be discarded as oversegmentation artifacts — fragmentation propagates into every downstream analysis, including copy-number signatures.
Trigger: Centering the log2 profile on its median or mode in a hyper-aneuploid or whole-genome-doubled genome.
Mechanism: Centering assumes the commonest log2 value is diploid. In a WGD genome the commonest state is tetraploid; centering on it shifts the whole profile so true diploid regions read as deletions and amplifications read as neutral.
Symptom: Genome-wide gain or loss inconsistent with biology; segmentation is fine but every call has the wrong sign.
Fix: Do not depth-center aneuploid genomes. Anchor the diploid baseline with BAF/SNV data via an allele-specific caller, which estimates absolute ploidy. GISTIC and most callers require a correctly centered seg file as input.
Trigger: CBS on shallow data (< ~5x WGS, or low-coverage bins).
Mechanism: The two-sample t-statistic loses power when per-bin variance swamps the mean difference; CBS misses real breakpoints (recall ~42% at 3x under a Poisson model, worse under over-dispersed counts), while the segments it does call stay fairly precise.
Symptom: Real events absent from the segmentation; recall poor on a genome with known CNVs.
Fix: Use HMM (depth-robust, ~81% recall at 3x) or HaarSeg for shallow data; or increase bin size to raise per-bin counts before segmenting.
Trigger: CBS run over a profile with centromere/telomere gaps not handled as chromosome breaks.
Mechanism: CBS treats gapped data as independent subsets; spurious breakpoints appear at gap edges.
Symptom: Segment boundaries clustered at centromeres; tiny artifactual segments flanking gaps.
Fix: Segment per chromosome arm, or supply gap-aware chromosome coordinates so CBS does not bridge gaps.
Trigger: HMM with poor initial parameters or too few iterations.
Mechanism: Baum-Welch EM is not globally optimal; emission/transition parameters can settle in a local optimum, mis-assigning states.
Symptom: Reruns give different state assignments; CN states inconsistent with the visible profile.
Fix: Use informative priors (diploid-centered for germline, broader for tumor), run multiple initializations, and sanity-check state means against the per-bin distribution.
| Pattern | Likely cause | Action | |---------|--------------|--------| | CBS shatters where HMM gives clean broad segments | Low depth — CBS over-fits noise | Trust HMM; CBS needs more depth | | HMM misses a focal event CBS finds | HMM window resolution too coarse | Trust CBS for focal; HMM blurs small events | | Both agree on arms, differ on focal boundaries | Different breakpoint resolution | Arm calls are robust; treat focal boundaries as approximate | | Segmentation differs run-to-run | HMM local optima, or unfixed random seed | Fix seeds; use multiple HMM initializations |
Operational rule: Match the algorithm to depth and event size — CBS for adequate-depth focal work, HMM/HaarSeg for shallow or broad. Confirm the diploid baseline against an allele-specific ploidy estimate before any sign-dependent interpretation. Report arm-level segments with confidence; treat focal boundaries as algorithm-dependent.
| Threshold | Value | Source / Rationale |
|-----------|-------|--------------------|
| CBS alpha | 0.01 | DNAcopy default; breakpoint significance |
| CBS undo.SD | 2-3 | Merges segments within N noise SD; guards oversegmentation |
| Depth where CBS recall degrades | < ~5x WGS | Zhang 2024; CBS recall falls sharply (HMM is depth-robust) |
| GC range kept for loess | 0.3-0.7 | Extreme-GC bins are unreliable; standard restriction |
| Bin size, shallow WGS CNV | ~500 kb - 1 Mb | Larger bins raise per-bin counts for stable segmentation |
| Sample MAD usable | < 0.5 | Above this, segmentation chases noise regardless of algorithm |
| Error / symptom | Cause | Solution | |-----------------|-------|----------| | Hundreds of tiny segments | Oversegmentation (liberal alpha / low cval / noisy input) | Tighten alpha, raise undo.SD, denoise input | | Whole genome wrong-signed | Baseline centered on a non-diploid mode | Anchor ploidy with BAF; do not depth-center aneuploid genomes | | Real events missed at low depth | CBS recall degrades | Use HMM/HaarSeg or larger bins | | Breakpoints clustered at centromeres | CBS bridging assembly gaps | Segment per arm; supply gap-aware coordinates | | Segmentation not reproducible | HMM local optima / unfixed seed | Fix seeds; multiple initializations | | Wave artifact remains after GC correction | Replication-timing bias, not GC | Use a matched normal or PoN |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.