scientific-skills/Data Analysis/cnv-caller-plotter/SKILL.md
Detect copy number variations from whole genome sequencing data and generate publication-quality genome-wide CNV plots. Supports CNV calling, segmentation, and visualization for cancer genomics and rare disease analysis.
npx skillsauth add aipoch/medical-research-skills cnv-caller-plotter
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Detect copy number variations (CNVs) from whole genome sequencing (WGS) data and generate genome-wide visualization plots for cancer genomics, rare disease analysis, and population genetics studies. Provides CNV calling, segmentation analysis, and publication-ready visualization.
Key Capabilities:
scripts/main.py.
references/ for task-specific guidance.
See ## Prerequisites above for related details.
Python: 3.10+. Repository baseline for current packaged skills.
Third-party packages: not explicitly version-pinned in this skill package. Add pinned versions if this skill needs stricter environment control.
See ## Usage above for related details.
cd "20260318/scientific-skills/Data Analytics/cnv-caller-plotter"
python -m py_compile scripts/main.py
python scripts/main.py --help
Example run plan:
CONFIG block or documented parameters if the script uses fixed settings.
python scripts/main.py with the validated inputs.
See ## Workflow above for related details.
scripts/main.py.
references/ contains supporting rules, prompts, or checklists.
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
python -m py_compile scripts/main.py
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
python -m py_compile scripts/main.py
# Example invocation: python scripts/main.py --help
# Example invocation: python scripts/main.py --input sample.bam --reference hg38.fa
Upstream Skills:
fastqc-report-interpreter: Assess sequencing quality before CNV calling; low-quality data may produce unreliable CNVs
alignment-quality-checker: Verify BAM file quality and coverage uniformity; uneven coverage causes CNV artifacts
variant-caller: Generate SNV/indel calls for combined CNV-SNV analysis in cancer samples
Downstream Skills:
circos-plot-generator: Create circular genome plots integrating CNVs with other genomic features
go-kegg-enrichment: Perform pathway enrichment on genes within CNV regions
heatmap-beautifier: Visualize CNV profiles across multiple samples
Complete Workflow:
Raw WGS Data → fastqc-report-interpreter → alignment-quality-checker → cnv-caller-plotter → circos-plot-generator → Publication Figures
Identify genomic regions with copy number gains (amplifications) or losses (deletions) from WGS data by analyzing read depth patterns.
from scripts.main import CNVCaller
# Initialize CNV caller with bin size
caller = CNVCaller(bin_size=1000)
# Call CNVs from BAM file
cnv_calls = caller.call_cnvs(
input_file="sample.bam",
reference="hg38.fa"
)
# Review detected CNVs
for cnv in cnv_calls:
    print(f"{cnv['chrom']}:{cnv['start']}-{cnv['end']}")
    print(f"  Copy Number: {cnv['cn']}")
    if cnv['cn'] > 2:
        print("  Type: Amplification (gain)")
    elif cnv['cn'] < 2:
        print("  Type: Deletion (loss)")
Parameters:
| Parameter | Type | Required | Description | Default |
|-----------|------|----------|-------------|---------|
| input_file | str | Yes | Path to input BAM or VCF file | None |
| reference | str | Yes | Path to reference genome FASTA | None |
| bin_size | int | No | Size of genomic bins for segmentation (bp) | 1000 |
CNV Calling Strategy:
| Approach | Best For | Sensitivity | Specificity |
|----------|----------|-------------|-------------|
| Read Depth Analysis | Large CNVs (>10kb) | High | Medium |
| Paired-end Mapping | Medium CNVs (1-10kb) | Medium | High |
| Split-read Analysis | Small CNVs (<1kb) | Medium | High |
| Combined Approach | Comprehensive detection | High | High |
Best Practices:
Common Issues and Solutions:
Issue: False positive CNVs in repetitive regions
Issue: Low sensitivity for small CNVs
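The read-depth idea described in this section can be sketched in a few lines. This is a minimal illustration, not the logic inside scripts/main.py: the function name `depth_to_copy_number` and the `baseline_depth` input (for example, a genome-wide median bin depth) are assumptions introduced here, and real callers also correct for GC content and mappability before this step.

```python
import math


def depth_to_copy_number(bin_depth, baseline_depth, ploidy=2):
    """Estimate copy number for one bin from its read depth.

    baseline_depth is the depth expected at normal ploidy
    (hypothetical input, e.g. the genome-wide median bin depth).
    Returns (integer copy number, log2 ratio).
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    ratio = bin_depth / baseline_depth
    # A bin with no reads has no finite log2 ratio.
    log2_ratio = math.log2(ratio) if ratio > 0 else float("-inf")
    copy_number = round(ploidy * ratio)
    return copy_number, log2_ratio


# A bin at 1.5x the baseline depth suggests a single-copy gain.
gain_cn, gain_log2 = depth_to_copy_number(45.0, 30.0)
# A bin at half the baseline depth suggests a single-copy loss.
loss_cn, loss_log2 = depth_to_copy_number(15.0, 30.0)
```

Under these assumptions, a diploid region sits at a log2 ratio near 0, and gains and losses move it above or below that line.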
Divide the genome into windows/bins for copy number estimation, enabling systematic analysis of the entire genome.
from scripts.main import CNVCaller
# Different bin sizes for different applications
bin_configs = {
"high_resolution": 100, # For small CNV detection
"standard": 1000, # Default for WGS
"low_resolution": 10000 # For large-scale alterations
}
for config_name, bin_size in bin_configs.items():
    caller = CNVCaller(bin_size=bin_size)
    print(f"\n{config_name} (bin_size={bin_size}bp):")
    # Calculate approximate number of bins for human genome
    genome_size = 3_000_000_000  # 3 Gb
    num_bins = genome_size // bin_size
    print(f"  Estimated bins: ~{num_bins:,}")
    print(f"  Resolution: {bin_size}bp")
Bin Size Selection Guide:
| Bin Size | Resolution | Use Case | Coverage Required |
|----------|------------|----------|-------------------|
| 100 bp | High | Small CNVs (<5kb) | >30x |
| 1000 bp | Standard | General WGS analysis | >15x |
| 10000 bp | Low | Large chromosomal alterations | >5x |
| Variable | Adaptive | Mixed resolution | >20x |
Best Practices:
Common Issues and Solutions:
Issue: Noisy segmentation due to small bins
Issue: Missing large CNVs with large bins
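To make the binning step concrete, here is a small pure-Python sketch of counting read start positions per fixed-size bin and then merging adjacent bins to trade resolution for lower noise. Both helper names are hypothetical and are not part of the CNVCaller API; a real pipeline would read positions from a BAM file rather than a list.

```python
def count_reads_per_bin(read_starts, chrom_length, bin_size=1000):
    """Count 0-based read start positions in fixed-size bins."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // bin_size] += 1
    return counts


def merge_bins(counts, factor):
    """Sum adjacent bins in groups of `factor` (last group may be partial)."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


# Toy example: six reads on a 5 kb chromosome, 1 kb bins.
counts_1kb = count_reads_per_bin([0, 10, 999, 1000, 2500, 4999], 5000, 1000)
# Merging pairs of bins gives an effective 2 kb resolution.
counts_2kb = merge_bins(counts_1kb, 2)
```

Merging bins after counting is one way to compare resolutions on the same data without recounting reads.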
Generate publication-quality plots showing copy number profiles across all chromosomes for visual interpretation and presentation.
from scripts.main import CNVCaller
caller = CNVCaller(bin_size=1000)
# Example CNV calls for plotting
cnv_calls = [
{"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3}, # Gain
{"chrom": "chr7", "start": 50000000, "end": 55000000, "cn": 1}, # Loss
{"chrom": "chr17", "start": 35000000, "end": 36000000, "cn": 4} # High-level amplification
]
# Generate plots in different formats
output_dir = "./cnv_results"
for fmt in ["png", "pdf", "svg"]:
    plot_file = caller.plot_genome_wide(
        cnv_calls=cnv_calls,
        output_path=output_dir,
        fmt=fmt
    )
    print(f"Generated: {plot_file}")
# Plot features:
# - Genome-wide view with all chromosomes
# - Copy number on Y-axis (0-6 typical range)
# - Chromosomal position on X-axis
# - Color coding: red=loss, blue=gain, black=neutral
Output Formats:
| Format | Extension | Best For | File Size |
|--------|-----------|----------|-----------|
| PNG | .png | Web, presentations, quick viewing | Medium |
| PDF | .pdf | Publications, high-quality printing | Large |
| SVG | .svg | Vector editing, scalable graphics | Small |
Best Practices:
Common Issues and Solutions:
Issue: Plot too crowded with many CNVs
Issue: ChrY not displayed for female samples
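A genome-wide plot places every chromosome on one shared X-axis, which requires converting per-chromosome positions into cumulative genome coordinates. The sketch below shows one way to do that. The helper names and the toy chromosome lengths are illustrative assumptions, not values or functions taken from scripts/main.py; real lengths would come from the reference .fai index.

```python
def genome_offsets(chrom_lengths):
    """Map each chromosome to the genome coordinate where it starts.

    chrom_lengths is an ordered mapping of chromosome name -> length.
    """
    offsets = {}
    running_total = 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = running_total
        running_total += length
    return offsets


def to_genome_x(cnv, offsets):
    """Convert a CNV's chromosome-local start/end to genome-wide X values."""
    offset = offsets[cnv["chrom"]]
    return offset + cnv["start"], offset + cnv["end"]


# Toy lengths for illustration only (not real chromosome sizes).
TOY_LENGTHS = {"chr1": 1000, "chr2": 800, "chr3": 600}
offsets = genome_offsets(TOY_LENGTHS)
x_start, x_end = to_genome_x({"chrom": "chr2", "start": 100, "end": 300, "cn": 3}, offsets)
```

Leaving a chromosome such as chrY out of the length mapping is a simple way to omit it from the axis for samples where it is not expected.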
Export CNV calls in standard BED format for compatibility with genome browsers and downstream analysis tools.
from scripts.main import CNVCaller
caller = CNVCaller()
# Example CNV calls
cnv_calls = [
{"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3},
{"chrom": "chr7", "start": 50000000, "end": 55000000, "cn": 1},
]
# Export to BED format
bed_file = caller.save_bed(cnv_calls, "./output")
# BED format structure:
# chrom start end name score strand
# chr1 1000000 2000000 CN=3 . .
# chr7 50000000 55000000 CN=1 . .
print(f"BED file saved: {bed_file}")
# Read and display BED content
with open(bed_file, 'r') as f:
    print("\nBED file content:")
    for line in f:
        print(line.strip())
BED Format Specification:
| Column | Field | Description | Example |
|--------|-------|-------------|---------|
| 1 | chrom | Chromosome name | chr1, chrX |
| 2 | start | Start position (0-based, inclusive) | 1000000 |
| 3 | end | End position (exclusive; equals the 1-based inclusive end) | 2000000 |
| 4 | name | CNV annotation | CN=3 |
| 5 | score | Optional quality score | . |
| 6 | strand | Strand info (usually .) | . |
Best Practices:
bedtools or genome browser before distribution
Common Issues and Solutions:
Issue: BED file rejected by genome browser
Issue: Coordinate system confusion
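The coordinate confusion mentioned above usually comes from mixing 1-based inclusive positions (as used in VCF and most genome browsers' displays) with BED's 0-based, half-open intervals. The following sketch shows the conversion and a six-column line in the layout used in this section. The function names are hypothetical and this is not the implementation of `save_bed`.

```python
def to_bed_interval(start_1based, end_1based):
    """Convert a 1-based inclusive interval to BED's 0-based half-open form."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the exclusive end equals the inclusive 1-based end.
    return start_1based - 1, end_1based


def format_bed_line(chrom, start, end, copy_number):
    """Build one tab-separated line: chrom, start, end, name, score, strand."""
    return "\t".join([chrom, str(start), str(end), f"CN={copy_number}", ".", "."])


bed_start, bed_end = to_bed_interval(1000001, 2000000)
bed_line = format_bed_line("chr1", bed_start, bed_end, 3)
```

With this convention the interval length is simply `end - start`, which is why size filters elsewhere in this document can subtract the two fields directly.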
Compare CNV profiles between tumor and matched normal samples to identify somatic copy number alterations (SCNAs).
from scripts.main import CNVCaller
caller = CNVCaller(bin_size=1000)
# Call CNVs in tumor and normal samples
tumor_cnvs = caller.call_cnvs("tumor.bam", "hg38.fa")
normal_cnvs = caller.call_cnvs("normal.bam", "hg38.fa")
# Identify somatic CNVs (present in tumor, not in normal)
def find_somatic_cnvs(tumor_calls, normal_calls):
    """Identify CNVs present in tumor but not normal."""
    somatic_cnvs = []
    for t_cnv in tumor_calls:
        is_somatic = True
        # Check if similar CNV exists in normal
        for n_cnv in normal_calls:
            if (t_cnv['chrom'] == n_cnv['chrom'] and
                    abs(t_cnv['start'] - n_cnv['start']) < 10000 and
                    abs(t_cnv['end'] - n_cnv['end']) < 10000 and
                    t_cnv['cn'] == n_cnv['cn']):
                is_somatic = False
                break
        if is_somatic:
            somatic_cnvs.append(t_cnv)
    return somatic_cnvs
somatic_cnvs = find_somatic_cnvs(tumor_cnvs, normal_cnvs)
print(f"Total tumor CNVs: {len(tumor_cnvs)}")
print(f"Somatic CNVs: {len(somatic_cnvs)}")
# Categorize somatic alterations
amplifications = [c for c in somatic_cnvs if c['cn'] > 2]
deletions = [c for c in somatic_cnvs if c['cn'] < 2]
print(f" Amplifications: {len(amplifications)}")
print(f" Deletions: {len(deletions)}")
Somatic vs Germline Classification:
| Category | Tumor CN | Normal CN | Interpretation |
|----------|----------|-----------|----------------|
| Somatic Amplification | >2 | 2 | Tumor-specific gain |
| Somatic Deletion | <2 | 2 | Tumor-specific loss |
| Germline CNV | ≠2 | ≠2 | Inherited CNV |
| LOH | 1 | 2 | Loss of heterozygosity |
Best Practices:
Common Issues and Solutions:
Issue: Normal sample contamination in tumor
Issue: Germline CNVs misclassified as somatic
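The comparison above matches tumor and normal calls by a fixed 10 kb breakpoint tolerance. A commonly used alternative is reciprocal overlap, which scales with CNV size. The sketch below is a swapped-in technique offered for comparison, not the method used by this skill; the function names and the 50% default threshold are assumptions.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions between CNVs a and b."""
    if a["chrom"] != b["chrom"]:
        return 0.0
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a["end"] - a["start"]),
               overlap / (b["end"] - b["start"]))


def find_somatic_by_overlap(tumor_calls, normal_calls, min_overlap=0.5):
    """Keep tumor CNVs with no same-copy-number normal CNV overlapping enough."""
    return [
        t for t in tumor_calls
        if not any(
            t["cn"] == n["cn"] and reciprocal_overlap(t, n) >= min_overlap
            for n in normal_calls
        )
    ]


tumor = [
    {"chrom": "chr1", "start": 1000, "end": 2000, "cn": 3},
    {"chrom": "chr7", "start": 5000, "end": 9000, "cn": 1},
]
normal = [{"chrom": "chr1", "start": 1100, "end": 2100, "cn": 3}]
somatic = find_somatic_by_overlap(tumor, normal)
```

In this toy example the chr1 gain is shared with the normal sample at 90% reciprocal overlap, so only the chr7 loss is reported as somatic.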
Apply quality filters to remove artifactual CNV calls and improve result reliability.
from scripts.main import CNVCaller
caller = CNVCaller()
# Example raw CNV calls with QC metrics
cnv_calls = [
{
"chrom": "chr1", "start": 1000000, "end": 2000000, "cn": 3,
"quality_score": 50, "supporting_reads": 150
},
{
"chrom": "chr7", "start": 50000000, "end": 50001000, "cn": 0,
"quality_score": 10, "supporting_reads": 5 # Likely artifact
},
]
# Apply quality filters
def filter_cnvs(cnv_list, min_quality=20, min_size=1000, min_support=20):
    """Filter CNVs based on quality metrics."""
    filtered = []
    for cnv in cnv_list:
        size = cnv['end'] - cnv['start']
        quality = cnv.get('quality_score', 0)
        support = cnv.get('supporting_reads', 0)
        # Apply filters
        if quality < min_quality:
            continue
        if size < min_size:
            continue
        if support < min_support:
            continue
        filtered.append(cnv)
    return filtered
# Filter with different stringencies
for min_q in [10, 20, 30]:
    filtered = filter_cnvs(cnv_calls, min_quality=min_q)
    print(f"Quality >= {min_q}: {len(filtered)} CNVs retained")
# Additional filters to consider:
# - Exclude segmental duplications
# - Exclude centromeres and telomeres
# - Minimum number of supporting bins
# - Concordance with paired-end or split-read signals
Quality Metrics:
| Metric | Threshold | Purpose |
|--------|-----------|---------|
| Quality Score | >20 | Overall confidence in CNV call |
| Size | >1kb | Remove small artifactual calls |
| Supporting Reads | >20 | Sufficient evidence depth |
| Log2 Ratio | absolute value >0.3 | Significant deviation from diploid |
| Mappability | >0.8 | Reliable unique mapping |
Best Practices:
Common Issues and Solutions:
Issue: Too many low-quality CNV calls
Issue: True CNVs filtered out
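One of the additional filters listed above, excluding calls in problematic regions such as segmental duplications or centromeres, can be sketched as a simple interval-overlap check. The helper names are hypothetical, and the excluded region below is a placeholder rather than a real annotation; in practice the regions would be loaded from a curated BED file for the matching reference build.

```python
def overlaps(cnv, region):
    """True if two half-open intervals on the same chromosome intersect."""
    return (cnv["chrom"] == region["chrom"]
            and cnv["start"] < region["end"]
            and region["start"] < cnv["end"])


def exclude_regions(cnv_list, excluded_regions):
    """Drop CNV calls that touch any excluded region."""
    return [
        cnv for cnv in cnv_list
        if not any(overlaps(cnv, region) for region in excluded_regions)
    ]


calls = [
    {"chrom": "chr1", "start": 1000, "end": 2000, "cn": 3},
    {"chrom": "chr1", "start": 5000, "end": 6000, "cn": 1},
    {"chrom": "chr2", "start": 1000, "end": 2000, "cn": 3},
]
# Placeholder region for illustration only.
excluded = [{"chrom": "chr1", "start": 1500, "end": 1600}]
kept = exclude_regions(calls, excluded)
```

This drops any call that touches an excluded region at all; a softer variant could require a minimum overlap fraction so that large true CNVs spanning a small excluded region are retained.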
From WGS data to CNV visualization:
# Step 1: Call CNVs from tumor sample
# Example invocation: python scripts/main.py --input tumor_sample.bam --reference hg38.fa --output tumor_cnv/ --bin-size 1000 --plot-format pdf
# Step 2: Call CNVs from matched normal
# Example invocation: python scripts/main.py --input normal_sample.bam --reference hg38.fa --output normal_cnv/ --bin-size 1000
# Step 3: Compare and identify somatic CNVs
# (Use Python API for comparison logic)
# Step 4: Generate final plots
# Example invocation: python scripts/main.py --input tumor_sample.bam --reference hg38.fa --output final_results/ --plot-format pdf
Python API Usage:
from scripts.main import CNVCaller
from pathlib import Path
def analyze_cancer_genome(
    tumor_bam: str,
    normal_bam: str,
    reference: str,
    output_dir: str
) -> dict:
    """
    Complete cancer genome CNV analysis workflow.
    """
    caller = CNVCaller(bin_size=1000)
    # Create output directory
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Call CNVs in both samples
    print("Calling CNVs in tumor sample...")
    tumor_cnvs = caller.call_cnvs(tumor_bam, reference)
    print("Calling CNVs in normal sample...")
    normal_cnvs = caller.call_cnvs(normal_bam, reference)
    # Identify somatic alterations
    # (find_somatic_cnvs is the helper defined in the tumor-normal comparison example)
    somatic_cnvs = find_somatic_cnvs(tumor_cnvs, normal_cnvs)
    # Generate outputs
    tumor_bed = caller.save_bed(tumor_cnvs, output_dir)
    somatic_bed = caller.save_bed(somatic_cnvs, f"{output_dir}/somatic")
    plot_file = caller.plot_genome_wide(tumor_cnvs, output_dir, "pdf")
    # Calculate statistics
    stats = {
        "total_tumor_cnvs": len(tumor_cnvs),
        "somatic_cnvs": len(somatic_cnvs),
        "amplifications": len([c for c in somatic_cnvs if c['cn'] > 2]),
        "deletions": len([c for c in somatic_cnvs if c['cn'] < 2]),
        "output_files": {
            "tumor_bed": tumor_bed,
            "somatic_bed": somatic_bed,
            "genome_plot": plot_file
        }
    }
    return stats
# Execute workflow
results = analyze_cancer_genome(
tumor_bam="tumor.bam",
normal_bam="normal.bam",
reference="hg38.fa",
output_dir="./cnv_analysis"
)
print(f"\nAnalysis complete!")
print(f"Total tumor CNVs: {results['total_tumor_cnvs']}")
print(f"Somatic CNVs: {results['somatic_cnvs']}")
print(f" Amplifications: {results['amplifications']}")
print(f" Deletions: {results['deletions']}")
Expected Output Files:
cnv_analysis/
├── cnv_calls.bed # All CNV calls in BED format
├── somatic/
│ └── cnv_calls.bed # Somatic CNVs only
├── cnv_plot.pdf # Genome-wide visualization
└── analysis_summary.json # Statistics and metadata
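The file listing above includes analysis_summary.json, but the workflow function shown earlier only returns the statistics dictionary. If the summary needs to be written explicitly, a minimal sketch could look like this. The `write_summary` helper is hypothetical, and whether scripts/main.py already produces this file is not confirmed here.

```python
import json
from pathlib import Path


def write_summary(stats, output_dir):
    """Write a statistics dictionary to <output_dir>/analysis_summary.json."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    summary_path = out_dir / "analysis_summary.json"
    # default=str keeps Path objects and similar values serializable.
    summary_path.write_text(json.dumps(stats, indent=2, default=str))
    return summary_path


# Usage, assuming `results` is the dict returned by analyze_cancer_genome:
# write_summary(results, "./cnv_analysis")
```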
Scenario: Identify somatic copy number alterations in a cancer sample compared to matched normal tissue.
{
"analysis_type": "cancer_genome",
"samples": {
"tumor": "tumor_wgs.bam",
"normal": "blood_normal.bam"
},
"reference": "hg38.fa",
"parameters": {
"bin_size": 1000,
"min_cnv_size": 10000,
"plot_format": "pdf"
},
"expected_outputs": [
"Somatic CNV calls (BED format)",
"Genome-wide CNV profile plot",
"CNV statistics and summary"
]
}
Workflow:
Output Example:
Somatic CNV Summary:
Total alterations: 47
Amplifications: 12 (including MYC, EGFR)
Deletions: 35 (including TP53, PTEN)
High-impact alterations:
chr8:128000000-129000000 CN=8 (MYC amplification)
chr17:7000000-8000000 CN=0 (TP53 deletion)
Scenario: Detect pathogenic CNVs in a patient with suspected genomic disorder.
{
"analysis_type": "rare_disease",
"sample": "patient.bam",
"reference": "hg38.fa",
"parameters": {
"bin_size": 500,
"min_cnv_size": 1000,
"max_frequency": 0.01
},
"annotation": [
"OMIM genes",
"ClinVar pathogenic variants",
"Decipher syndromes"
]
}
Workflow:
Output Example:
Rare CNV Findings:
chr22:19000000-21000000 CN=1 (22q11.2 deletion syndrome)
Size: 2.0 Mb
Genes: TBX1, COMT, etc.
Frequency: <0.1% in population
Phenotype match: Cardiac, thymic, facial anomalies
Classification: Pathogenic
Scenario: Compare CNV profiles across multiple samples to identify recurrent alterations.
{
"analysis_type": "population",
"samples": [
"sample1.bam", "sample2.bam", "sample3.bam",
...
],
"cohorts": {
"cases": 50,
"controls": 50
},
"parameters": {
"bin_size": 1000,
"plot_format": "png"
},
"analysis": [
"Recurrent CNV detection",
"Burden analysis",
"Association testing"
]
}
Workflow:
Output Example:
Population CNV Analysis:
Samples analyzed: 100
Total CNVs detected: 2,847
Recurrent alterations:
chr1:1000000-2000000: 23% frequency
chr16:15000000-16000000: 18% frequency
Case vs Control association:
Significant enrichment: 3 CNV regions
Most significant: chr8:128000000-129000000 (p=0.001)
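The recurrence frequencies in the example output can be understood as the fraction of samples carrying a CNV that overlaps a given region. The sketch below computes that fraction under the assumption that calls are grouped per sample in a dictionary; the function names and the toy cohort are illustrative and do not reproduce the numbers shown above.

```python
def overlaps_region(cnv, region):
    """True if a CNV intersects a region on the same chromosome."""
    return (cnv["chrom"] == region["chrom"]
            and cnv["start"] < region["end"]
            and region["start"] < cnv["end"])


def region_frequency(calls_by_sample, region):
    """Fraction of samples with at least one CNV overlapping the region."""
    if not calls_by_sample:
        return 0.0
    carriers = sum(
        1 for calls in calls_by_sample.values()
        if any(overlaps_region(cnv, region) for cnv in calls)
    )
    return carriers / len(calls_by_sample)


# Toy cohort of four samples.
cohort = {
    "sample1": [{"chrom": "chr1", "start": 1000, "end": 2000, "cn": 3}],
    "sample2": [{"chrom": "chr1", "start": 1500, "end": 2500, "cn": 3}],
    "sample3": [{"chrom": "chr2", "start": 1000, "end": 2000, "cn": 1}],
    "sample4": [],
}
freq = region_frequency(cohort, {"chrom": "chr1", "start": 1800, "end": 1900})
```

Each sample is counted at most once per region, so a sample with several overlapping calls does not inflate the frequency.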
Scenario: Characterize CNV profile of a cancer cell line for research or quality control.
{
"analysis_type": "cell_line",
"sample": "mcf7_cell_line.bam",
"reference": "hg38.fa",
"parameters": {
"bin_size": 1000,
"plot_format": "pdf"
},
"comparison": {
"reference_profile": "mcf7_ccle_cnvs.bed",
"expected_alterations": ["chr8_MYC_amp", "chr20_ZNF217_amp"]
}
}
Workflow:
Output Example:
Cell Line: MCF-7
Identity confirmed: Yes (99.2% match to reference)
Expected alterations detected:
chr8:128000000-129000000: CN=8 (MYC) ✓
chr20:50000000-52000000: CN=6 (ZNF217) ✓
Additional alterations:
chr17:35000000-37000000: CN=3 (ERBB2) ✓
Ploidy: 2.8 (aneuploid)
Genome instability score: High
Pre-analysis Checks:
During Analysis:
Post-analysis Verification:
Before Clinical or Publication Use:
Input Data Issues:
❌ Using low coverage data → Noisy CNV calls with many false positives
❌ Mismatched reference genomes → CNVs called in wrong coordinates
❌ Not using matched normal for tumors → Cannot distinguish somatic vs germline
❌ Poor coverage uniformity → GC bias causes false CNVs
Analysis Parameter Issues:
❌ Bin size too large → Miss small CNVs (<10kb)
❌ Bin size too small → Excessive noise in low coverage regions
❌ Inadequate quality filtering → Too many false positive CNVs
❌ Not filtering common CNVs → Report common polymorphisms as pathogenic
Interpretation Issues:
❌ Ignoring tumor purity → Misinterpret subclonal CNVs
❌ Not validating key findings → Report false positive driver alterations
❌ Over-interpreting small CNVs → Single-exon deletions are often artifacts
❌ Ignoring parental data → Cannot determine inheritance in rare disease
Output and Reporting Issues:
❌ Unclear coordinate system → Confusion between 0-based and 1-based
❌ Missing quality metrics → Cannot assess confidence in CNV calls
❌ Not archiving raw data → Results cannot be reproduced
❌ Inadequate documentation → Others cannot interpret results
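The tumor purity pitfall listed under Interpretation Issues can be illustrated with the standard mixture model: the observed copy number is a purity-weighted average of tumor and normal cells. The sketch below inverts that relation. It assumes a diploid normal component and a known purity estimate, and the function name is hypothetical rather than part of this skill.

```python
def purity_adjusted_copy_number(observed_cn, purity, normal_cn=2.0):
    """Estimate tumor-cell copy number from a mixed tumor/normal signal.

    Model: observed_cn = purity * tumor_cn + (1 - purity) * normal_cn
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    return (observed_cn - (1 - purity) * normal_cn) / purity


# At 50% purity, an observed copy number of 3 implies 4 copies in tumor cells.
adjusted = purity_adjusted_copy_number(3.0, 0.5)
```

This is why gains and losses appear compressed toward 2 in low-purity samples, and why weak signals may still correspond to real alterations.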
Problem: No CNVs detected
Problem: Too many CNV calls (hundreds or thousands)
Problem: False positives in repetitive regions
Problem: CNV signals too weak in tumor samples
Problem: Sex chromosomes have unexpected copy numbers
Problem: Batch effects in multi-sample analysis
Problem: Cannot install or run tool
pip install pysam numpy matplotlib pandas
samtools faidx reference.fa
sample.bam.bai
Available in references/ directory:
External Resources:
Located in scripts/ directory:
main.py - Main CNV calling and plotting engine
| Method | Input | Sensitivity | Resolution | Best For |
|--------|-------|-------------|------------|----------|
| Read Depth (this tool) | BAM | Medium | 1-10 kb | Large CNVs, WGS |
| Paired-end Mapping | BAM | Medium | 100bp-10kb | Deletions, insertions |
| Split-read Analysis | BAM | High | 1bp-1kb | Breakpoint detection |
| SNP Array | CEL/IDAT | High | 5-25kb | Cost-effective screening |
| Optical Mapping | Bionano | High | 500bp+ | Very large SVs |
| Parameter | Type | Default | Required | Description |
|-----------|------|---------|----------|-------------|
| --input, -i | string | - | Yes | Input BAM/VCF file |
| --reference, -r | string | - | Yes | Reference genome FASTA |
| --output, -o | string | ./cnv_output | No | Output directory |
| --bin-size | int | 1000 | No | Bin size for analysis |
| --plot-format | string | png | No | Plot format (png, pdf, svg) |
# Call CNVs from BAM file
# Example invocation: python scripts/main.py --input sample.bam --reference hg38.fa
# Custom output directory and bin size
# Example invocation: python scripts/main.py --input sample.bam --reference hg38.fa --output ./results --bin-size 500
# Generate PDF plots
# Example invocation: python scripts/main.py --input sample.bam --reference hg38.fa --plot-format pdf
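For readers who want to see how the flags in the parameter table map onto an argument parser, here is a sketch using the standard library's argparse. It mirrors the documented names and defaults only; it is an assumption about structure, not the actual parser in scripts/main.py.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line parameters."""
    parser = argparse.ArgumentParser(
        description="Call CNVs and plot genome-wide profiles (illustrative sketch)."
    )
    parser.add_argument("--input", "-i", required=True, help="Input BAM/VCF file")
    parser.add_argument("--reference", "-r", required=True, help="Reference genome FASTA")
    parser.add_argument("--output", "-o", default="./cnv_output", help="Output directory")
    parser.add_argument("--bin-size", type=int, default=1000, help="Bin size for analysis (bp)")
    parser.add_argument("--plot-format", choices=["png", "pdf", "svg"], default="png",
                        help="Plot format")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-i", "sample.bam", "-r", "hg38.fa", "--bin-size", "500"]
)
```

Restricting `--plot-format` with `choices` makes an unsupported format fail at parse time instead of during plotting.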
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python script executed locally | Low |
| Network Access | No external API calls | Low |
| File System Access | Read BAM/VCF, write results | Low |
| Data Exposure | Processes genomic data | Medium |
| PHI Risk | May process patient genetic data | High |
# Python 3.7+
# No additional packages required (uses standard library)
Last Updated: 2026-02-09
Skill ID: 162
Version: 2.0 (K-Dense Standard)
Every final response should make these items explicit when they are relevant:
If scripts/main.py fails, report the failure point, summarize what still can be completed safely, and provide a manual fallback.
This skill accepts requests that match the documented purpose of cnv-caller-plotter and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
cnv-caller-plotter only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
Use the following fixed structure for non-trivial requests:
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.