data-visualization/lollipop-protein-maps/SKILL.md
Plot per-gene mutation distributions on a protein-domain map (lollipop / needle plots) showing mutation position, recurrence count, and variant classification with maftools, g3-lollipop, trackViewer, and ProteinPaint. Use when visualizing recurrent mutation hotspots on a single gene's protein, marking domain boundaries from UniProt/Pfam, comparing missense vs truncating distributions, or contrasting two cohorts on the same lollipop.
npx skillsauth add GPTomics/bioSkills bio-data-visualization-lollipop-protein-mapsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: maftools 2.18+, trackViewer 1.38+, g3-lollipop (JavaScript via R g3viz 1.2+), Bio.PDB 1.83+ (for domain coordinates). ProteinPaint is a hosted service.
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_namepip show <package> then help(module.function)If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Plot mutations on a gene's protein" -> Render a horizontal protein backbone with colored domain rectangles (from UniProt/Pfam/InterPro), then stack vertical lines ("stems") at mutated amino-acid positions, capped with circles ("lollipops") whose size reflects mutation count and whose color encodes variant class. The biological story is hotspot identification — a tall stack of recurrences at a single residue (e.g., KRAS G12, PIK3CA E545/H1047) is the visual signature of a driver mutation.
maftools::lollipopPlot, trackViewer::lolliplot, g3viz::g3LollipoppyLollipop (limited maintenance); ProteinPaint via APIA lollipop plot exists to identify hotspots — residues with disproportionate recurrence. The MutSig hotspot test (Lawrence 2014 Nature 505:495) and statisticalhotspot methods (Chang 2016 Nat Biotechnol 34:155) formalize this: a residue's mutation count should exceed the gene-wide background rate × residue count. Visualizing this on a domain map IS the diagnostic.
Key practical consequences:
| Question | Approach | |----------|----------| | Where are the hotspots? | Lollipop with size = count; label top 5 recurrent residues | | Missense vs truncating distribution? | Color stems by class; tumor suppressors show truncating spread; oncogenes show missense hotspots | | Compare two cohorts | Stacked lollipops (one cohort up, one down) on shared domain map | | 3D-cluster hotspot detection? | Use HotMAPS / 3D Hotspots — beyond linear lollipop | | Druggable position? | Add ClinVar / OncoKB level annotation at the residue |
Goal: Render per-gene mutation distribution on Pfam domain map with count-sized lollipops and class-colored stems.
Approach: Pass MAF and gene to lollipopPlot; maftools queries Pfam for domain coordinates automatically; outputs ggplot2 object.
library(maftools)
maf <- read.maf(maf = 'cohort.maf')
# Default lollipop
lollipopPlot(maf = maf, gene = 'TP53',
AACol = 'HGVSp_Short',
labelPos = c(175, 248, 273), # mark canonical hotspots
labPosSize = 1.0,
showMutationRate = TRUE,
domainLabelSize = 1,
printCount = TRUE,
colors = c(Missense_Mutation = '#D55E00',
Nonsense_Mutation = '#000000',
Frame_Shift_Del = '#0072B2',
Frame_Shift_Ins = '#56B4E9',
Splice_Site = '#CC79A7',
In_Frame_Del = '#009E73'))
# Compare two cohorts -- one up, one down
lollipopPlot2(m1 = cohort_a, m2 = cohort_b,
gene = 'TP53',
m1_name = 'Cohort A',
m2_name = 'Cohort B',
AACol1 = 'HGVSp_Short', AACol2 = 'HGVSp_Short',
colors = my_palette)
library(trackViewer)
library(GenomicRanges)
# Build SNP (lollipop) and feature (domain) GRanges
snps <- GRanges('chr17', IRanges(c(175, 248, 273), width = 1, names = c('R175H', 'R248Q', 'R273H')),
color = c('#D55E00', '#D55E00', '#D55E00'),
score = c(45, 38, 29)) # mutation count
features <- GRanges('chr17',
IRanges(c(102, 323, 363), width = c(190, 30, 30),
names = c('DNA-binding', 'Tetramerization', 'Regulatory')),
fill = c('#0072B2', '#009E73', '#CC79A7'),
height = 0.04)
lolliplot(snps, features, ylab = 'Mutation count',
xaxis = TRUE, yaxis = TRUE)
trackViewer is more flexible than maftools for non-standard layouts (custom domain sources, multi-protein stacking, integration with genome coordinates).
library(g3viz)
mutation_data <- hgvspChange2protein(maf, gene = 'TP53')
g3Lollipop(mutation_data,
gene.symbol = 'TP53',
protein.change.col = 'AA_Change',
plot.options = g3Lollipop.theme(theme.name = 'nature'),
output.filename = 'TP53_lollipop.html')
g3-lollipop produces an interactive HTML — hover tooltips, click-to-filter, exportable. Suitable for supplementary HTML supplement; not for journal figure submission directly.
| Source | Format | Stability | Caveat | |--------|--------|-----------|--------| | Pfam (via maftools) | Pfam-A domain coordinates | Updated occasionally | maftools caches local; may lag Pfam release | | UniProt | Domain + Region features (varied types) | Daily updates | API-driven; rate limits | | InterPro | Integrated multi-database | More inclusive than Pfam | Different sub-classifications | | Custom | Hand-curated for specific paper | Reproducible | Cite source |
For canonical isoform: maftools uses the canonical UniProt isoform by default. For specific isoform: pass refSeqID or proteinID explicitly. Mutations annotated against a different isoform will be off-by-residue.
Trigger: MAF column HGVSp_Short missing or malformed.
Mechanism: maftools expects HGVSp_Short (e.g., 'p.R175H'); falls back to other columns inconsistently.
Symptom: "No mutations to plot" or wrong positions.
Fix: Verify HGVSp_Short column exists; reformat from HGVSp if needed. Use AACol argument to specify which column.
Trigger: Mutations called against ENST00000269305 but plotted against canonical ENST00000288602 (TP53).
Mechanism: Residue numbering differs across isoforms.
Symptom: Known R175H plotted at R177H or in a different domain.
Fix: Annotate the isoform in the figure caption; pass proteinID to lollipopPlot to force a specific isoform.
Trigger: maftools' cached Pfam annotation is older than the protein's current Pfam release.
Mechanism: Domain coordinates can shift across Pfam versions.
Symptom: Domain boundaries off by a few residues; published-figure mismatch.
Fix: Pull domain coordinates from UniProt directly (current); pass via trackViewer::lolliplot features.
Trigger: "Hotspot" identified at a residue with high coverage variance — looks recurrent but is a sequencing artifact.
Mechanism: Capture-bait coverage variability; some residues sequenced more deeply.
Symptom: "Hotspot" in untargeted region; not validated in WGS.
Fix: Verify recurrence in independent cohort (TCGA Pan-Cancer + ICGC); use MutSig hotspot test (Lawrence 2014) for formal hotspot calling.
Trigger: Default printCount = FALSE.
Mechanism: Size-encoded counts beyond ~10 saturate visually.
Symptom: Reader cannot tell whether the top lollipop is 30 vs 300 mutations.
Fix: printCount = TRUE annotates each lollipop with its count.
Trigger: Default rainbow domain colors.
Mechanism: Domains colored by accident, not by function class.
Symptom: Reader cannot quickly identify which domain is the kinase.
Fix: Manually map domain colors by functional class (kinase = blue, binding = green, regulatory = purple).
| Pattern | Cause | Action | |---------|-------|--------| | Hotspot in cohort A absent in B | Cohort A enriched for a subtype OR small N | Stratify by subtype; cite both N | | 3D hotspot test calls residues not on lollipop | Linear adjacency misses 3D proximity | Use HotMAPS / 3D Hotspots for spatial clusters | | Recurrent residue lacks OncoKB evidence | Novel hotspot OR sequencing artifact | Confirm via independent cohort + WGS | | Frame-shift indels not aligned to expected codon | Different annotation tool (VEP vs SnpEff) | Standardize annotation; verify HGVSp |
Operational rule: annotate the isoform; show absolute counts on lollipops; verify hotspots against TCGA Pan-Cancer + ICGC before novel-hotspot claims.
| Threshold | Value | Source | |-----------|-------|--------| | Hotspot recurrence cutoff | depends on gene length + cohort size | Lawrence 2014 — formal MutSig test | | Display all mutations vs filter | Recurrent (count ≥ 2) for clarity; show all in supplement | Visualization practical | | Domain source default | Pfam (maftools default); UniProt for current | Tool-specific | | Cohort N for credible hotspot | ≥200 for a single gene; pan-cancer for novel | Standard practice |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| No mutations on plot | HGVSp_Short column missing | Verify / reformat from HGVSp |
| Mutations at wrong position | Isoform mismatch | Specify proteinID; document isoform |
| Domain boundaries slightly off | maftools Pfam cache outdated | Pull from UniProt; use trackViewer |
| Hotspot size saturates | Counts >10 indistinguishable by size | printCount = TRUE to annotate numbers |
| Random domain colors | Default rainbow | Manual mapping by functional class |
| Novel hotspot from one cohort | Insufficient N | Verify in TCGA + ICGC |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.