data-visualization/oncoprint-mutation-matrices/SKILL.md
Build OncoPrint and co-mutation matrix plots from somatic-variant cohorts using ComplexHeatmap, maftools, and comut.py with alteration-type stacking, sample ordering by mutational burden, mutual-exclusivity overlays, and clinical annotation tracks. Use when visualizing per-sample mutation patterns across recurrent driver genes, comparing alteration classes, or identifying mutually-exclusive / co-occurring driver pairs.
npx skillsauth add GPTomics/bioSkills bio-data-visualization-oncoprint-mutation-matricesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: ComplexHeatmap 2.18+, maftools 2.18+, comut 0.0.3+, MAFtools requires R 4.0+; comut.py requires pandas 2.0+, matplotlib 3.8+.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_nameIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Plot mutations across a cohort" -> Render a gene-by-sample matrix where each cell stacks colored rectangles encoding alteration class (missense, truncating, splice, copy-gain, copy-loss, fusion). Sort samples by burden, optionally split by clinical group, and overlay co-mutation / mutual-exclusivity annotations. OncoPrint (Cerami 2012 Cancer Discov 2:401; canonical at cBioPortal) is the genre-defining visualization.
ComplexHeatmap::oncoPrint, maftools::oncoplotcomut.CoMut, cbioportal-style implementationsOncoPrint differs from a generic heatmap because each cell can encode multiple alterations simultaneously through stacked rectangles. A patient with both a missense and a copy-gain in TP53 shows one cell with two overlapping colored rectangles (e.g., green diamond inside red square). This stacking is the whole point — it preserves the multi-modal alteration landscape that flattening to a single category destroys.
In ComplexHeatmap's oncoPrint, the alter_fun argument is the rendering specification: a named list of functions, one per alteration class, each drawing its rectangle inside the cell. Get this right and the figure works; get it wrong and overlapping alterations are invisible.
| Question | Sort by | Display |
|----------|---------|---------|
| Which genes are most altered? | Gene frequency (default) | Bar above samples (sample TMB); bar right of genes (gene frequency) |
| Per-patient burden patterns | Sample burden | TMB bar on top; sample-name labels |
| Subtype-driver enrichment | Clinical group then burden | column_split by group; per-group frequency right bar |
| Mutual exclusivity (BRAF vs NRAS) | Custom (alphabetic-by-mutation pattern) | Memo sort; overlay log10(OR) heatmap |
| Co-occurrence (TP53 + MYC) | Custom | Same pattern; positive OR coloring |
| Driver vs passenger comparison | Two panels | Concatenate two oncoPrints horizontally |
Goal: Render a cohort mutation matrix with stacked alteration-class encoding, sample annotations, and a sample-sorted, gene-frequency-ranked layout.
Approach: Convert the MAF/variant table to a gene-by-sample matrix of ;-delimited alteration strings; define alter_fun rendering one rectangle per class; pass to oncoPrint() with column annotations.
library(ComplexHeatmap)
library(circlize)
# Input: matrix where each cell is a string like 'Missense;Amp' or '' for no alteration
# Rows = genes; columns = samples
# Color per alteration class
col <- c('Missense' = '#56B4E9',
'Truncating' = '#000000',
'Splice' = '#CC79A7',
'Amp' = '#D55E00',
'HomDel' = '#0072B2',
'Fusion' = '#009E73')
# alter_fun -- one function per class, each drawing inside the cell
alter_fun <- list(
background = function(x, y, w, h)
grid.rect(x, y, w - unit(0.5, 'mm'), h - unit(0.5, 'mm'),
gp = gpar(fill = '#EEEEEE', col = NA)),
Amp = function(x, y, w, h)
grid.rect(x, y, w - unit(0.5, 'mm'), h - unit(0.5, 'mm'),
gp = gpar(fill = col['Amp'], col = NA)),
HomDel = function(x, y, w, h)
grid.rect(x, y, w - unit(0.5, 'mm'), h - unit(0.5, 'mm'),
gp = gpar(fill = col['HomDel'], col = NA)),
Missense = function(x, y, w, h)
grid.rect(x, y, w - unit(0.5, 'mm'), h * 0.5,
gp = gpar(fill = col['Missense'], col = NA)),
Truncating = function(x, y, w, h)
grid.rect(x, y, w - unit(0.5, 'mm'), h * 0.33,
gp = gpar(fill = col['Truncating'], col = NA)),
Splice = function(x, y, w, h)
grid.rect(x, y, w - unit(0.5, 'mm'), h * 0.25,
gp = gpar(fill = col['Splice'], col = NA)),
Fusion = function(x, y, w, h)
grid.points(x, y, pch = 17, size = unit(2, 'mm'),
gp = gpar(col = col['Fusion'])))
# Clinical column annotation
ha_clin <- HeatmapAnnotation(
Subtype = clinical$subtype,
Stage = clinical$stage,
col = list(Subtype = c(Luminal='#0072B2', Basal='#D55E00', HER2='#009E73'),
Stage = c(I='#FFFFCC', II='#FED976', III='#FD8D3C', IV='#BD0026')))
oncoPrint(mat,
alter_fun = alter_fun,
col = col,
top_annotation = ha_clin,
column_title = 'TCGA-BRCA mutation landscape',
row_names_gp = gpar(fontsize = 8),
pct_gp = gpar(fontsize = 7),
show_pct = TRUE,
remove_empty_columns = FALSE,
remove_empty_rows = FALSE)
For TCGA-style MAF files, maftools::oncoplot is the lower-friction option:
library(maftools)
maf <- read.maf(maf = 'tcga.maf', clinicalData = clinical)
oncoplot(maf = maf,
top = 20, # top 20 mutated genes
clinicalFeatures = c('Subtype', 'Stage'),
annotationColor = list(Subtype = c(Luminal='#0072B2', Basal='#D55E00'),
Stage = c(I='#FFFFCC', IV='#BD0026')),
sortByAnnotation = TRUE,
removeNonMutated = FALSE)
maftools defaults handle alteration-class colors, sample sorting, and percentage bars automatically. Customization is more limited than ComplexHeatmap.
# maftools provides somaticInteractions
si <- somaticInteractions(maf = maf, top = 20,
pvalue = c(0.05, 0.01),
fontSize = 0.7)
# Plot returns a matrix of -log10(p) with sign by direction (+ co-occur, - mutex)
Mutual-exclusivity testing on small cohorts (N < 50) is underpowered; reported "significant" mutex on n=20 with 2 mutations each is uninterpretable. Aggregate to larger cohorts (TCGA + ICGC pan-cancer) or report effect size with CI rather than p-value.
Fisher exact vs DISCOVER: standard 2x2 Fisher tests sample-mutation pairs, ignoring per-gene mutation rate background. DISCOVER (Canisius 2016 Genome Biol 17:261) models per-tumor mutation probability and is preferred for pan-cancer analyses where mutation rate varies 100× across samples.
import comut
import pandas as pd
# Long-format: columns = sample, category (gene), value (alteration class)
toy_comut = comut.CoMut()
toy_comut.add_categorical_data(
data=mutation_long_df,
name='Mutations',
category_order=top_genes,
value_order=['Truncating', 'Missense', 'Splice', 'Amp', 'HomDel'],
mapping={'Truncating': '#000000', 'Missense': '#56B4E9',
'Splice': '#CC79A7', 'Amp': '#D55E00', 'HomDel': '#0072B2'})
toy_comut.add_categorical_data(
data=clinical_long_df,
name='Subtype',
mapping={'Luminal': '#0072B2', 'Basal': '#D55E00'})
toy_comut.add_continuous_data(
data=tmb_long_df,
name='TMB',
mapping='viridis',
value_range=(0, 30))
toy_comut.plot_comut(figsize=(12, 8))
toy_comut.figure.savefig('comut.pdf', dpi=300, bbox_inches='tight')
Trigger: Reducing each cell to a single most-severe alteration, losing the stack.
Mechanism: Loses the multi-alteration biology (e.g., MYC amp + missense in TP53).
Symptom: OncoPrint looks like a simple heatmap; co-occurring multi-class events invisible.
Fix: Build the cell as ;-separated alteration string; define alter_fun for each class.
Trigger: Default oncoPrint sorts samples by altered-gene-1 status; weakens the "memo sort" pattern.
Mechanism: True OncoPrint uses memoSort (Cerami 2012) which sorts by the binary altered-or-not pattern across the top genes.
Symptom: Samples with the same alteration profile are not adjacent; "staircase" pattern lost.
Fix: ComplexHeatmap oncoPrint uses memoSort by default; do NOT override column_order unless intentional.
remove_empty_columns = TRUE)Trigger: Default in some implementations.
Mechanism: Drops samples with no mutations in the displayed genes — but those samples ARE part of the cohort.
Symptom: Sample count differs from cohort N; denominator-based percentages wrong.
Fix: remove_empty_columns = FALSE to preserve all samples; percentages now reflect true cohort fraction.
Trigger: Cohort with 1-2 POLE-mutant or MSI-H samples; TMB bar saturates.
Mechanism: Hypermutator TMB is 10-100× the typical sample.
Symptom: All other samples' TMB bars are invisible; one column dominates.
Fix: Log-transform the TMB annotation: anno_barplot(log10(tmb + 1)); OR cap with ylim.
Trigger: Fisher exact test on N < 50 with low mutation counts.
Mechanism: With 2 mutations vs 3 mutations in 20 samples, all p-values are dominated by noise.
Symptom: "Significant mutex" claim from a tiny pilot.
Fix: Aggregate to ≥100 samples for credible mutex; use DISCOVER (Canisius 2016) instead of Fisher when mutation rate varies 100× across samples.
For rare-cancer cohorts where N < 50, the standard OncoPrint + Fisher mutex pipeline is statistically uninterpretable:
| Action | What to do |
|--------|------------|
| Report per-gene frequencies | Use exact-binomial CI (Clopper-Pearson via binom.test) — Wald CI is invalid at low frequency |
| Do NOT report mutex p-values | Fisher exact on 2x2 with cell counts ≤ 5 has no power; the "significant" mutex finding is noise |
| Hypothesis generation only | Pool with TCGA Pan-Cancer + ICGC for credible mutex; treat the cohort as the replication not the discovery |
| Co-occurrence reporting | OR with Haldane-Anscombe 0.5 correction for zero cells; report alongside cohort N |
Show the OncoPrint for visual transparency, but the per-gene-frequency table (with exact-binomial CIs) is the load-bearing scientific output, not the mutex test.
| Pattern | Cause | Action |
|---------|-------|--------|
| ComplexHeatmap and maftools show different sample orders | Different memoSort defaults | Specify sortByAnnotation explicitly; report sort criterion in caption |
| Percentage labels differ | remove_empty_columns = TRUE vs FALSE | Document denominator (cohort-N vs altered-N) |
| Some alterations missing from a sample | Filtering: silent SNVs, low VAF | Document filtering criteria upstream |
| Threshold | Value | Source | |-----------|-------|--------| | Cohort N for valid mutex | ≥100 (pan-cancer); ≥50 (single-cohort with effect-size focus) | Common practice | | Display top genes | 10-25 in single panel | More creates visual clutter | | Sample N for OncoPrint | 50-1000 (above: switch to summary panel) | Visualization practical |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Co-occurring multi-class events invisible | Single-class flattening | Use ;-separated cells + alter_fun list |
| Sample order doesn't show staircase | column_order override | Trust default memoSort |
| Sample count differs from cohort | remove_empty_columns = TRUE | Set to FALSE |
| TMB bar dominated by 1-2 samples | Hypermutators on linear scale | log10 + 1 transform |
| Mutex p-values on N=20 | Underpowered | Aggregate cohorts; use DISCOVER |
| Gene frequency right-bar mismatches percentages | Denominator definition | Document cohort-N vs altered-N |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.