copy-number/recurrent-cnv/SKILL.md
Identify recurrent and driver copy number alterations across a tumor cohort with GISTIC2 (G-score, Ziggurat deconstruction, focal vs broad/arm-level analysis, q-values from permutation) and quantify copy-number signatures with the Steele 2022 COSMIC framework and the Drews 2022 CINSignatures framework. Covers driver-gene localization from recurrence peaks, distinguishing focal drivers from arm-level passengers, and the caller-sensitivity caveats of copy-number signatures. Use when finding recurrently amplified or deleted regions in a cohort, localizing driver genes, separating focal from broad events, running GISTIC2, or extracting copy-number mutational signatures.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

    npx skillsauth add GPTomics/bioSkills bio-copy-number-recurrent-cnv
Reference examples tested with: GISTIC 2.0.23, R 4.3+ with CINSignatureQuantification 1.2+; Python 3.10+ with SigProfilerAssignment 0.1+ (optional, COSMIC CN signatures).
Before using code patterns, verify installed versions match. If versions differ:
- CLI: gistic2 --help (GISTIC 2.0 is a MATLAB-compiled binary; needs the MCR runtime)
- R: packageVersion('CINSignatureQuantification')
- Python: pip show SigProfilerAssignment

GISTIC 2.0 has had no substantive release since ~2017; it is effectively frozen. It runs as a compiled binary against the MATLAB Compiler Runtime, and there is no R or Python package. Verify the reference (-refgene) .mat file matches the genome build.
"Which copy number changes recur across my cohort, and which gene is the driver" -> A CNV in one tumor is an observation; a CNV recurring across many tumors beyond chance is evidence of selection. GISTIC2 separates recurrent driver events from passengers by modeling a background rate and scoring each locus by how often, and how strongly, it is altered. Copy-number signatures decompose the genome-wide pattern of alterations into the mutational processes that generated them.
- CLI: gistic2 for cohort-level recurrence, focal vs broad, driver localization
- R: CINSignatureQuantification (Drews 2022); Python: SigProfilerAssignment (Steele 2022 COSMIC)

GISTIC2 scores each genomic marker with a G-score = frequency of alteration x mean amplitude, separately for amplifications and deletions. Significance (q-value) comes from permuting events along the genome under the null that all are passengers. Ziggurat deconstruction decomposes each sample's profile into the additive arm-level and focal events that produced it, so the background rate is estimated separately for broad and focal alterations; without this, ubiquitous arm-level events swamp the focal signal. A peel-off procedure removes the contribution of each significant peak before testing the next, so one strong driver does not mask its neighbors.
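As a rough illustration of the G-score idea described above, the sketch below scores a single marker on toy log2 values. The function name `g_score`, the 0.1 amplitude threshold, and the data are assumptions made for this example; GISTIC2 itself scores Ziggurat-deconstructed events, which this sketch does not reproduce.

```python
def g_score(values, threshold=0.1):
    """Toy amplification G-score for one marker.

    values: log2 copy ratios at the marker, one per sample.
    Returns (fraction of samples altered) x (mean amplitude among altered samples).
    The threshold is a hypothetical amplitude cutoff, not a GISTIC2 default.
    """
    altered = [v for v in values if v > threshold]
    if not altered:
        return 0.0
    frequency = len(altered) / len(values)
    mean_amplitude = sum(altered) / len(altered)
    return frequency * mean_amplitude


# Hypothetical cohort of 4 samples at two markers.
focal_hit = [1.0, 2.0, 0.0, 0.0]          # altered in 2 of 4, mean amplitude 1.5
weak_common = [0.25, 0.25, 0.25, 0.25]    # altered in 4 of 4, mean amplitude 0.25
scores = {
    "focal_hit": g_score(focal_hit),
    "weak_common": g_score(weak_common),
}
```

In this toy case the rarer but high-amplitude marker outscores the ubiquitous low-amplitude one, which is why frequency alone is not the ranking criterion.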
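Two postdoc-level caveats define how GISTIC2 output must be read: q-values depend on cohort size and are not comparable across cohorts, and a significant peak localizes a region rather than nominating a driver gene.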
| Goal | Approach | Notes |
|------|----------|-------|
| Find recurrent focal drivers in a cohort | GISTIC2, focal analysis, peak regions | Driver = recurrence-peak gene with a known role |
| Quantify arm-level / broad events | GISTIC2 -broad 1, arm-level output | -brlen sets the focal/broad length cutoff |
| Compare cohorts of different size | Recurrence frequency, not q-value | q-value is not portable across N |
| Characterize mutational processes | Copy-number signatures | Drews CINSignatures or Steele COSMIC CN |
| Localize the gene within a wide peak | GISTIC2 -genegistic 1 + known drivers | Wide peaks need orthogonal driver evidence |
| Single tumor (no cohort) | GISTIC2 does not apply | Use focal-amplification-ecdna / per-sample annotation |
    # Segment file: 6 columns -- sample, chrom, start, end, num_markers, seg.mean (log2).
    # It MUST be diploid-centered. Pool per-sample segments (e.g. cnvkit.py export seg).
    # Create the output directory first; GISTIC2 writes into it.
    mkdir -p gistic_output/
    gistic2 \
        -b gistic_output/ \
        -seg cohort.seg \
        -refgene hg38.refgene.mat \
        -genegistic 1 \
        -broad 1 \
        -brlen 0.7 \
        -conf 0.99 \
        -armpeel 1 \
        -savegene 1 \
        -gcm extreme \
        -rx 0
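Before pooling, a quick sanity check on the seg file can catch column-count and centering problems early. The following is a minimal sketch under stated assumptions: the function name `check_seg`, the 0.2 tolerance, and the toy data are hypothetical, and the unweighted per-sample median is only a crude proxy for the baseline (it is not a substitute for ploidy-aware centering in aneuploid cohorts).

```python
import csv
import io
from collections import defaultdict
from statistics import median


def check_seg(handle, tolerance=0.2):
    """Return {sample: median seg.mean} for samples whose median is far from 0.

    Expects tab-separated rows with 6 columns; a non-numeric first row is
    treated as a header and skipped. The tolerance is a hypothetical cutoff.
    """
    per_sample = defaultdict(list)
    for i, row in enumerate(csv.reader(handle, delimiter="\t")):
        if len(row) != 6:
            raise ValueError(f"expected 6 columns, got {len(row)}: {row}")
        try:
            seg_mean = float(row[5])
        except ValueError:
            if i == 0:
                continue  # header line
            raise
        per_sample[row[0]].append(seg_mean)
    medians = {s: median(v) for s, v in per_sample.items()}
    return {s: m for s, m in medians.items() if abs(m) > tolerance}


# Toy seg content: S1 is centered near 0, S2 is shifted downward.
toy = (
    "ID\tchrom\tloc.start\tloc.end\tnum.mark\tseg.mean\n"
    "S1\t1\t1\t1000\t10\t0.0\n"
    "S1\t1\t1001\t2000\t10\t0.05\n"
    "S1\t2\t1\t3000\t30\t-0.02\n"
    "S2\t1\t1\t1000\t10\t-0.6\n"
    "S2\t1\t1001\t2000\t10\t-0.5\n"
    "S2\t2\t1\t3000\t30\t0.0\n"
)
flagged = check_seg(io.StringIO(toy))
```

A flagged sample is a prompt to inspect its baseline, not proof of a centering error.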
Key flags: -brlen 0.7 sets the focal/broad cutoff at 70% of a chromosome arm; -conf 0.99 is the peak-boundary confidence (raising it above the 0.75 default yields a wider, more conservative peak with higher confidence that the true driver gene lies inside it, at the cost of more genes per peak); -armpeel 1 peels arm-level events before focal testing; -genegistic 1 runs the gene-level test; -savegene 1 saves the gene tables; -gcm extreme collapses markers to gene level using whichever of min or max is furthest from diploid; -rx 0 keeps sex chromosomes. Output amp_genes.txt / del_genes.txt and all_lesions.txt list peaks, q-values, and genes; GISTIC2 writes these with the confidence level in the file name (e.g. amp_genes.conf_99.txt for -conf 0.99).
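Peak coordinates in the lesions table are stored as text, so a small parser helps when measuring peak width. This sketch assumes the limits look like `chr8:127000000-128000000(probes 100:200)`; that format, the function name `parse_limits`, and the example coordinates are assumptions for illustration and should be checked against the actual output file.

```python
import re

# Assumed layout: "<chrom>:<start>-<end>" optionally followed by "(probes a:b)".
_LIMITS = re.compile(r"^\s*(chr[^:]+):(\d+)-(\d+)")


def parse_limits(text):
    """Return (chrom, start, end, width_bp) parsed from a peak-limits string."""
    match = _LIMITS.match(text)
    if match is None:
        raise ValueError(f"unrecognized peak limits: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    return chrom, start, end, end - start + 1


# Hypothetical peak string, not taken from a real run.
peak = parse_limits("chr8:127000000-128000000(probes 100:200)")
```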
Goal: Decompose the genome-wide copy-number pattern into mutational processes (HRD, chromothripsis, tandem duplication, ecDNA, whole-genome doubling).
Approach: Two competing 2022 frameworks exist. Steele et al (Nature 2022) defined 21 pan-cancer CN signatures from a 48-channel feature matrix, now in COSMIC; Drews et al (Nature 2022) defined 17 signatures via the CINSignatures feature set. Quantify against one framework consistently; signatures require absolute (allele-specific) copy number.
    library(CINSignatureQuantification)
    # segments: data frame with columns chromosome, start, end, segVal (total CN),
    # sample -- absolute copy number from ASCAT/Sequenza/FACETS, NOT relative log2.
    res <- quantifyCNSignatures(segments, experimentName = 'cohort',
                                method = 'drews')
    activities <- getActivities(res)  # samples x signatures exposure matrix
The critical caveat (Steele 2022): three signatures had to be discarded as oversegmentation artifacts and ten were linear combinations needing manual filtering. Signatures are sensitive to the upstream caller — Steele prescribes the caller per platform (SNP6 -> ASCAT penalty 70; shallow WGS -> ASCAT.sc) precisely for this reason.
Trigger: Stating that cohort A has "more significant" peaks than cohort B when the cohorts differ in N.
Mechanism: GISTIC q-values fall as N rises — the same recurrence frequency clears significance in a larger cohort.
Symptom: A larger cohort appears to have more drivers purely because it is larger; peak lists do not replicate.
Fix: Compare recurrence frequency (fraction of samples altered), not q-value, across cohorts. Re-run GISTIC at matched N (subsample) if a significance comparison is unavoidable.
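The fix above can be sketched as follows: compute the fraction of samples altered in each cohort, and draw a matched-N subsample if GISTIC must be re-run for a significance comparison. The function names, the seed, and the toy cohorts are hypothetical.

```python
import random


def recurrence_frequency(altered_flags):
    """Fraction of samples carrying the alteration (one boolean per sample)."""
    return sum(altered_flags) / len(altered_flags)


def subsample(sample_ids, n, seed=0):
    """Draw n sample IDs without replacement so cohorts can be run at matched N."""
    return random.Random(seed).sample(sample_ids, n)


# Toy cohorts: same recurrence frequency, very different size.
cohort_a = [True] * 30 + [False] * 70   # 100 tumors, 30 altered
cohort_b = [True] * 6 + [False] * 14    # 20 tumors, 6 altered
freq_a = recurrence_frequency(cohort_a)
freq_b = recurrence_frequency(cohort_b)

# IDs from the larger cohort to pool into a seg file of the same size as cohort B.
matched_ids = subsample([f"A{i}" for i in range(100)], n=len(cohort_b))
```

Both toy cohorts carry the alteration in 30% of samples, so any difference in q-value between them would reflect N rather than biology.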
Trigger: Feeding GISTIC a seg file from a noisy or over-fragmented segmentation.
Mechanism: GISTIC interprets every segment edge as a potential focal event boundary; fragmentation creates many narrow false peaks.
Symptom: Numerous tiny significant peaks at no known driver; peaks not replicated with a cleaner segmentation.
Fix: Quality-control the segmentation first (see copy-ratio-segmentation); merge over-fragmented segments before pooling the cohort seg file.
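One crude way to merge over-fragmented segments is to join adjacent segments on the same chromosome whose seg.mean values are nearly identical. This is a sketch only: the function name `merge_segments` and the 0.1 difference cutoff are assumptions, and re-running the segmentation with better parameters is usually preferable to post-hoc merging.

```python
def merge_segments(segments, max_diff=0.1):
    """Merge adjacent same-chromosome segments with similar seg.mean.

    segments: list of (chrom, start, end, num_markers, seg_mean), sorted by
    chromosome and position. The merged seg.mean is the marker-weighted mean.
    max_diff is a hypothetical cutoff on the log2 difference.
    """
    merged = []
    for chrom, start, end, n, mean in segments:
        if merged and merged[-1][0] == chrom and abs(merged[-1][4] - mean) <= max_diff:
            _, prev_start, _, prev_n, prev_mean = merged[-1]
            total = prev_n + n
            weighted = (prev_mean * prev_n + mean * n) / total
            merged[-1] = (chrom, prev_start, end, total, weighted)
        else:
            merged.append((chrom, start, end, n, mean))
    return merged


# Toy profile: the first two segments are a spurious split of one event.
toy = [
    ("1", 1, 100, 10, 0.5),
    ("1", 101, 200, 10, 0.5),
    ("1", 201, 300, 20, 1.5),
    ("2", 1, 100, 10, 1.5),
]
result = merge_segments(toy)
```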
Trigger: Pooling seg files that are not diploid-centered (e.g. WGD tumors centered on tetraploid).
Mechanism: GISTIC assumes seg.mean ~ 0 is diploid; a shifted baseline turns gains into neutral and neutral into losses before any statistics run.
Symptom: Amplification and deletion peaks swapped relative to known biology; genome-wide deletion bias.
Fix: Center each sample's seg file on its true diploid baseline (anchor with allele-specific ploidy) before pooling. Do not rely on per-sample median centering for aneuploid cohorts.
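A minimal sketch of anchoring on the true diploid level, assuming each segment has already been annotated with an absolute total copy number from an allele-specific caller. The field names (`seg_mean`, `num_markers`, `total_cn`) and the function name are hypothetical; the idea is to shift the log2 values so that segments called as 2 copies sit at 0.

```python
def recenter_on_diploid(segments):
    """Shift seg_mean so that segments called as 2 copies average to 0.

    segments: list of dicts with keys seg_mean (log2), num_markers, and
    total_cn (absolute total copy number from an allele-specific caller).
    """
    diploid = [s for s in segments if s["total_cn"] == 2]
    if not diploid:
        raise ValueError("no 2-copy segments to anchor on")
    weight = sum(s["num_markers"] for s in diploid)
    baseline = sum(s["seg_mean"] * s["num_markers"] for s in diploid) / weight
    return [{**s, "seg_mean": s["seg_mean"] - baseline} for s in segments]


# Toy whole-genome-doubled sample whose log2 values are centered on 4 copies.
wgd_sample = [
    {"seg_mean": -1.0, "num_markers": 100, "total_cn": 2},
    {"seg_mean": 0.0, "num_markers": 300, "total_cn": 4},
    {"seg_mean": 1.0, "num_markers": 50, "total_cn": 8},
]
recentered = recenter_on_diploid(wgd_sample)
centered_means = [s["seg_mean"] for s in recentered]
```

After the shift, the 4-copy segments read as gains (log2 of 1) instead of neutral, which is the behavior GISTIC assumes.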
Trigger: Reporting every gene inside a wide significant peak, or assuming the peak gene is the driver.
Mechanism: Peak width reflects breakpoint heterogeneity across the cohort; a wide peak may contain dozens of genes, and the statistical peak need not coincide with the functional driver.
Symptom: A multi-gene peak reported as one driver; the named gene is a passenger.
Fix: Intersect peaks with known drivers (COSMIC CGC, OncoKB), expression, and dependency data. Raising -conf widens the peak (it does not narrow it) — peak width is set by cohort breakpoint heterogeneity, not a tunable. Wide peaks require orthogonal driver evidence — GISTIC localizes, it does not nominate.
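The intersection step can be sketched as a simple set operation. The function name and the hand-typed driver set below are stand-ins for illustration; in practice the driver list would come from a curated resource such as COSMIC CGC or OncoKB, and expression and dependency evidence would be layered on top.

```python
def annotate_peak(peak_genes, known_drivers):
    """Split the genes in a peak into known drivers and unexplained genes."""
    genes = set(peak_genes)
    hits = sorted(genes & set(known_drivers))
    others = sorted(genes - set(known_drivers))
    return {"known_driver_hits": hits, "other_genes": others}


# Hypothetical peak gene list and a tiny stand-in driver set.
peak_genes = ["MYC", "PVT1", "CCDC26"]
known_drivers = {"MYC", "EGFR", "CDKN2A"}
annotation = annotate_peak(peak_genes, known_drivers)
```

A peak with no hits is still a candidate locus; it simply lacks independent support from this particular list.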
Trigger: Running CN signatures on log2 ratios or relative segments.
Mechanism: Signature features (segment size, copy-number state, change-point) are defined on absolute copy number; relative input gives meaningless states.
Symptom: Implausible signature exposures; ploidy/WGD signatures fire spuriously.
Fix: Use absolute allele-specific copy number from ASCAT/Sequenza/FACETS as input. Apply the framework's prescribed caller for the platform.
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| GISTIC peak with no known driver | Wide peak, passenger locus, or fragile site | Cross-check expression/dependency; treat as candidate |
| Focal peak inside a broad event | Arm-level event not peeled | Confirm -armpeel 1; inspect Ziggurat output |
| Drews vs Steele signatures disagree | Different feature definitions and reference sets | Pick one framework; do not mix exposures |
| Peaks change with segmentation | Input over/under-segmented | Stabilize segmentation; re-run |
Operational rule: Report a GISTIC peak as a candidate driver locus only when (1) the input segmentation is QC-passed and diploid-centered, (2) recurrence frequency (not just q-value) is substantial, and (3) the peak contains a gene with independent driver evidence. Signatures are reportable only from absolute CN with a single, platform-matched framework.
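The operational rule can be written as an explicit check so that every reported peak is gated the same way. This is a sketch: the function name is hypothetical, and the 0.05 minimum frequency is an assumed placeholder for "substantial", which the rule itself does not quantify.

```python
def reportable_peak(seg_qc_passed, diploid_centered, recurrence_freq,
                    has_independent_driver, min_freq=0.05):
    """Apply the three conditions of the operational rule to one peak.

    min_freq is a hypothetical cutoff; choose it per cohort and tumor type.
    """
    input_ok = seg_qc_passed and diploid_centered
    recurrent = recurrence_freq >= min_freq
    return input_ok and recurrent and has_independent_driver


# Toy examples: only the first peak satisfies all three conditions.
peak_ok = reportable_peak(True, True, 0.20, True)
peak_rare = reportable_peak(True, True, 0.01, True)
peak_uncentered = reportable_peak(True, False, 0.20, True)
```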
| Threshold | Value | Source / Rationale |
|-----------|-------|--------------------|
| GISTIC significance | q < 0.25 | GISTIC2 default residual-q cutoff for peaks |
| Peak-boundary confidence | -conf 0.99 | Wider, conservative peak; higher confidence the true driver is inside (default 0.75) |
| Focal/broad cutoff | -brlen 0.7 | Events > 70% of an arm are treated as broad |
| Cohort size for stable peaks | tens to hundreds | Mehta-style: too few samples gives unstable peaks; q is N-dependent |
| CN signatures input | absolute (allele-specific) CN | Steele 2022 / Drews 2022; relative log2 is invalid |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| GISTIC2 will not start | MATLAB Compiler Runtime missing | Install the MCR version GISTIC was built against |
| Amp/del peaks swapped vs biology | Seg file not diploid-centered | Center on true ploidy before pooling |
| Many tiny spurious peaks | Oversegmented input | QC and merge segmentation first |
| -refgene errors | Build mismatch (hg19 vs hg38 .mat) | Use the matching reference .mat |
| Implausible signature exposures | Relative CN used as input | Use absolute allele-specific CN |
| Peak lists do not replicate | q-value compared across different N | Compare recurrence frequency |