copy-number/cnv-annotation/SKILL.md
Annotate copy number variant segments with overlapping genes, dosage-sensitivity scores, cancer driver databases, population frequencies, and clinical-variant content. Covers bedtools/pybedtools interval intersection, AnnotSV comprehensive annotation and ranking, ClinGen haploinsufficiency/triplosensitivity scoring, gnomAD-SV/DGV frequency filtering, COSMIC Cancer Gene Census, and ClinVar overlap. Use when interpreting which genes a CNV affects, distinguishing the driver gene of a focal event from passengers, filtering against population CNVs, separating whole-gene from partial-gene overlap, or preparing CNVs for clinical classification.
npx skillsauth add GPTomics/bioSkills bio-copy-number-cnv-annotationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bedtools 2.31+, AnnotSV 3.4+, Python 3.10+ with pybedtools 0.9+, pandas 2.2+, pysam 0.22+; R 4.3+ with clusterProfiler 4.10+.
Before using code patterns, verify installed versions match. If versions differ:
bedtools --version, AnnotSV --versionpip show pybedtools pandas pysampackageVersion('clusterProfiler')If code throws an error, introspect the installed package and adapt the example. AnnotSV output column names change between major versions — verify against the installed version.
"Annotate my CNV calls with the genes they affect" -> Overlap CNV segments with gene models, dosage-sensitivity maps, and clinical databases. The hard part is not the intersection — it is deciding which genes matter. A focal amplification overlapping 30 genes usually has one driver (the peak gene); a deletion's consequence depends on whether each gene is dosage-sensitive and whether the whole gene or only part is removed.
bedtools intersect -a cnvs.bed -b genes.bed -wa -wb; AnnotSV for full annotationpybedtools for interval logic; pysam for VCF database queries| Question | Resource | What it answers | |----------|----------|-----------------| | Which genes does this CNV span? | RefSeq/GENCODE gene BED | Raw overlap (not yet consequence) | | Is loss of this gene damaging? | ClinGen haploinsufficiency (HI) score | Dosage sensitivity to deletion | | Is gain of this gene damaging? | ClinGen triplosensitivity (TS) score | Dosage sensitivity to duplication | | Is this CNV common in the population? | gnomAD-SV, DGV, 1000G CNV | Benign-frequency filtering | | Is this a known recurrent disorder locus? | ClinGen dosage regions, DECIPHER | Genomic-disorder context | | Is this a cancer driver? | COSMIC Cancer Gene Census, OncoKB | Oncogene vs tumor-suppressor role | | Is there pathogenic small-variant content? | ClinVar | Coincident SNV/indel pathogenicity | | One-shot comprehensive annotation + ranking | AnnotSV | Aggregates most of the above |
For constitutional CNV classification (assigning pathogenic/VUS/benign), the annotated output feeds the ACMG/ClinGen points framework — see germline-cnv-interpretation. For cohort-level recurrence and driver-peak identification, see recurrent-cnv.
A CNV overlapping a gene does not necessarily change that gene's dosage in a way that matters. Three refinements separate annotation from interpretation:
Goal: Find genes overlapping each CNV segment, recording overlap extent.
Approach: Convert segments to BED, intersect with a gene model, keep both feature sets (-wo reports the overlap length) so partial vs whole-gene overlap is recoverable.
# Segments to BED (CNVkit .cns example; columns chrom/start/end/log2)
awk 'NR>1 {print $1"\t"$2"\t"$3"\t"$5}' sample.cns > sample.cnv.bed
# Intersect; -wo appends the number of overlapping bases
bedtools intersect -a sample.cnv.bed -b gencode.genes.bed -wo > cnv_gene_overlap.txt
Goal: Annotate CNVs against genes, dosage maps, population frequency, and clinical databases in one pass, with a built-in pathogenicity ranking.
Approach: Export CNVs to VCF or BED and run AnnotSV; it returns a "full" line per gene plus a "split" summary, with an ACMG-aligned rank (1-5).
AnnotSV \
-SVinputFile sample.cnv.vcf \
-genomeBuild GRCh38 \
-annotationMode both \
-outputFile sample_annotated.tsv
# Output includes: overlapped genes, ClinGen HI/TS, gnomAD-SV/DGV frequency, OMIM,
# ClinVar, DECIPHER, and an ACMG-class rank per SV.
AnnotSV's rank is a useful triage signal, not a final classification — confirm against the ClinGen points framework for clinical reporting.
Goal: Tag each affected gene with its dosage sensitivity and, for tumors, its driver role, so passengers can be separated from drivers.
Approach: Join the gene-overlap table to the ClinGen dosage map (HI/TS scores) and to the COSMIC Cancer Gene Census; flag CNVs whose direction matches a known mechanism (oncogene amplified, tumor suppressor deleted).
import pandas as pd
def annotate_dosage_and_drivers(overlap_tsv, clingen_dosage, cgc_file):
'''Tag overlapped genes with ClinGen HI/TS and COSMIC driver role.'''
cols = ['cnv_chrom', 'cnv_start', 'cnv_end', 'log2',
'gene_chrom', 'gene_start', 'gene_end', 'gene', 'overlap_bp']
df = pd.read_csv(overlap_tsv, sep='\t', names=cols)
df['gene_len'] = df['gene_end'] - df['gene_start']
df['gene_frac_covered'] = (df['overlap_bp'] / df['gene_len']).clip(upper=1.0)
df['whole_gene'] = df['gene_frac_covered'] >= 0.99
dosage = pd.read_csv(clingen_dosage, sep='\t') # gene, HI_score, TS_score
df = df.merge(dosage, on='gene', how='left')
cgc = pd.read_csv(cgc_file, sep='\t')
role = dict(zip(cgc['Gene Symbol'], cgc['Role in Cancer']))
df['driver_role'] = df['gene'].map(role)
# Direction-consistent driver hits: oncogene amplified or TSG deleted.
df['driver_hit'] = (
((df['log2'] > 0.3) & df['driver_role'].fillna('').str.contains('oncogene')) |
((df['log2'] < -0.3) & df['driver_role'].fillna('').str.contains('TSG')))
return df
Goal: Remove common, presumed-benign CNVs before clinical interpretation.
Approach: Reciprocal-overlap match each CNV against a population SV catalog (gnomAD-SV, DGV); a CNV with high reciprocal overlap to a common population CNV of the same type is likely benign.
# 50% reciprocal overlap (-f 0.5 -r): same-type, similar-extent population match.
# Reciprocal overlap, not one-sided, prevents a tiny CNV inside a huge population CNV
# (or vice versa) from being wrongly matched.
bedtools intersect -a sample.cnv.bed -b gnomad_sv.bed -f 0.5 -r -wa -wb \
> cnv_population_match.txt
Goal: Test whether genes in amplified (or deleted) regions are enriched for pathways.
Approach: Extract genes by CNV direction, map to Entrez IDs, run GO/KEGG enrichment. Caveat: CNVs are large and gene-dense, so enrichment is biased toward whatever pathways cluster in CNV-prone genomic regions — interpret as hypothesis-generating.
library(clusterProfiler)
library(org.Hs.eg.db)
amp_genes <- unique(cnv_annot$gene[cnv_annot$log2 > 0.3])
entrez <- na.omit(mapIds(org.Hs.eg.db, keys = amp_genes,
keytype = 'SYMBOL', column = 'ENTREZID'))
go_bp <- enrichGO(gene = entrez, OrgDb = org.Hs.eg.db, ont = 'BP',
pAdjustMethod = 'BH', qvalueCutoff = 0.05)
Trigger: CNV coordinates on GRCh37 intersected with a GRCh38 gene model (or vice versa).
Mechanism: Coordinates silently shift; the intersection succeeds and returns wrong genes.
Symptom: Implausible gene assignments; a known driver locus annotated with the wrong gene; systematic offset.
Fix: Confirm both inputs are the same build. If not, liftOver the CNVs (note that liftOver can split or drop segments across assembly gaps) and verify a known landmark.
Trigger: Reporting every gene a focal amplification spans as amplified/driver.
Mechanism: Focal events are megabases wide and gene-dense; only the selected gene is the driver.
Symptom: A 2 Mb amplicon "amplifies" 40 genes; the report cannot distinguish ERBB2 from its passengers.
Fix: For focal events, prioritize the gene at the cohort recurrence peak (GISTIC, see recurrent-cnv) and known drivers (CGC/OncoKB). Report passengers separately or not at all.
Trigger: Naive string matching on the ClinVar CLNSIG INFO field.
Mechanism: CLNSIG is multi-valued, mixes terms ("Conflicting_classifications", "Pathogenic/Likely_pathogenic", "Benign/Likely_benign"), and is per-small-variant — not per-CNV. A substring match for "pathogenic" silently captures "Likely_pathogenic" (intended) but a careless match also fires on records that are conflicting or benign once underscores and slashes are involved.
Symptom: Benign or conflicting variants reported as pathogenic; CNV flagged on incidental nearby SNVs.
Fix: Parse CLNSIG against the controlled vocabulary; exclude "Conflicting" and benign terms explicitly. Remember ClinVar SNV/indel pathogenicity does not transfer to a CNV — use it as context, and use ClinVar's own CNV records or ClinGen dosage regions for the CNV itself.
Trigger: Treating any gene-overlapping CNV as functionally significant.
Mechanism: Most single-copy losses are tolerated; partial overlaps may hit only introns/UTRs.
Symptom: Long lists of "affected" dosage-insensitive genes; benign CNVs over-called as significant.
Fix: Require dosage evidence (ClinGen HI/TS) and record coding-exon overlap and whole-gene-vs-partial status before calling a gene affected.
| Threshold | Value | Source / Rationale |
|-----------|-------|--------------------|
| Population-CNV reciprocal overlap | >= 50% (-f 0.5 -r) | Standard reciprocal-overlap match for benign filtering (convention traces to gnomAD-SV / DGV workflows; Collins SR et al 2020 Nature 581:444 uses comparable reciprocal-overlap thresholds for benign-population matching) |
| Common-CNV benign frequency | > 1% population frequency | ACMG/ClinGen: high frequency supports benign |
| ClinGen HI/TS dosage-sensitive | score = 3 | ClinGen: sufficient evidence for dosage sensitivity |
| Whole-gene overlap | >= 99% gene length covered | Distinguishes clean haploinsufficiency from partial/truncating |
| AnnotSV pathogenic ranks | rank 4-5 | AnnotSV ACMG-aligned ranking (1 benign - 5 pathogenic) |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Wrong genes assigned to a CNV | hg19/hg38 build mismatch | Match builds; liftOver and verify a landmark |
| 40 genes called "amplified" for one amplicon | All overlapped genes reported as drivers | Prioritize recurrence-peak + known-driver genes |
| Benign variants flagged pathogenic | Substring match on ClinVar CLNSIG | Parse the controlled vocabulary explicitly |
| Tiny CNV matched to a huge population CNV | One-sided overlap used for frequency filter | Use reciprocal overlap (-f 0.5 -r) |
| AnnotSV columns not found | Column names differ across AnnotSV versions | Check head of the output for the installed version |
| Enrichment dominated by gene-dense loci | CNVs span gene clusters | Treat CNV-gene enrichment as hypothesis-generating |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.