data-visualization/distribution-plots/SKILL.md
Plot per-group distributions of continuous data using boxplots, violins, beeswarms, quasirandom jitter, and raincloud plots with sample-size honesty (Weissgerber 2015), KDE-bandwidth awareness, and N-aware encoding choices. Use when comparing distributions across a small number of groups — expression per cluster, biomarker per arm, scores per condition — and the bar-of-mean default is misleading.
npx skillsauth add GPTomics/bioSkills bio-data-visualization-distribution-plotsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: ggplot2 3.5+, ggbeeswarm 0.7+, ggdist 3.3+, gghalves 0.1.4+, seaborn 0.13+, matplotlib 3.8+, ptitprince 0.3+ (Python raincloud).
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Plot the distribution per group" -> Render boxplot, violin, beeswarm, or raincloud calibrated to N per group, the underlying distribution shape, and the audience's ability to read each encoding. The default geom_bar(stat='summary') is the canonical misleading choice — Weissgerber 2015 PLOS Biol documented that 703 top physiology papers use bar-of-mean despite multiple distinct distributions producing identical bars.
ggplot2::geom_boxplot, ggplot2::geom_violin, ggbeeswarm::geom_quasirandom, ggdist::stat_halfeye, gghalves::geom_half_violinseaborn.boxplot/violinplot/swarmplot/stripplot, ptitprince.RainCloudWeissgerber, Milic, Winham & Garovic 2015 PLOS Biol 13:e1002128 surveyed 703 papers in top physiology journals and found that bar-and-line graphs of means dominate, despite many distinct distributions producing identical bar plots. Bimodal data, skewed data, and data with outliers all collapse to the same bar height and error bar. The bar plot is a hypothesis test result rendered as visualization; the visualization should show the data.
The modern alternative is to show every point for n < 30, layer summary on top, and reserve summary-only plots for large N where points would overplot.
| N per group | Recommended | Avoid | |-------------|-------------|-------| | 3-10 | Dot plot or jittered raw points + median bar | Bar of mean | | 10-30 | Beeswarm OR quasirandom + box overlay | Bare boxplot (hides bimodality) | | 30-200 | Raincloud (Allen 2019) OR box + jitter | Bare violin (default KDE bandwidth oversmooths) | | 200-1000 | Letter-value plot (Hofmann 2017) OR violin with explicit bandwidth | Box alone (collapses tails) | | >1000 | Density (KDE) or histogram + summary stats | Individual points (overplot) |
Always annotate N somewhere on the plot (caption, x-axis tick label, or stratum count).
ggplot(df, aes(group, value, fill = group)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7, width = 0.5) +
geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
scale_fill_manual(values = c('#0072B2', '#D55E00')) +
labs(x = NULL, y = 'Expression') +
theme_classic()
Box shows: median, IQR, 1.5×IQR whiskers, outliers. Hides: bimodality, sample size, density.
Notched boxplot (notch = TRUE): notches show 95% CI for median (±1.58·IQR/√n); non-overlapping notches roughly indicate distinct medians. Use with N ≥ 15.
ggplot(df, aes(group, value, fill = group)) +
geom_violin(alpha = 0.7, trim = FALSE,
bw = 'SJ') + # Sheather-Jones bandwidth
geom_boxplot(width = 0.1, fill = 'white', outlier.shape = NA) +
scale_fill_manual(values = c('#0072B2', '#D55E00'))
KDE bandwidth pitfall: ggplot's default is Silverman's rule of thumb, which oversmooths bimodal data into a single mode. Use bw = 'SJ' (Sheather-Jones plug-in) for honest representation of multimodality.
trim = TRUE (default) cuts the violin at the data range — visually misleading because the violin's tails imply density extending beyond the data. trim = FALSE lets the KDE extend.
library(ggbeeswarm)
ggplot(df, aes(group, value, color = group)) +
geom_quasirandom(method = 'quasirandom', width = 0.3, alpha = 0.7) +
scale_color_manual(values = c('#0072B2', '#D55E00')) +
stat_summary(fun = median, geom = 'crossbar', width = 0.5, color = 'black')
Quasirandom (van der Corput sequence; Bostock implementation) gives reproducible jitter that fills space without random scatter. Beeswarm is similar but with collision avoidance. Both are deterministic — reruns produce identical layouts.
Goal: Show distribution (half-violin), summary (boxplot), and raw observations (jittered points) in a single per-group panel without occlusion.
Approach: Place a half-violin on one side, a thin boxplot in the middle, and jittered points on the other side via gghalves::geom_half_violin + geom_boxplot + geom_half_point with position_nudge offsets; flip to horizontal so the visual reads as a literal raincloud.
library(gghalves)
ggplot(df, aes(group, value, fill = group, color = group)) +
geom_half_violin(side = 'r', alpha = 0.7, position = position_nudge(x = 0.15)) +
geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.7,
position = position_nudge(x = -0.05)) +
geom_half_point(side = 'l', alpha = 0.5, size = 1.5, range_scale = 0.4,
position = position_nudge(x = -0.2)) +
scale_fill_manual(values = c('#0072B2', '#D55E00')) +
scale_color_manual(values = c('#0072B2', '#D55E00')) +
coord_flip() # horizontal "raincloud"
import ptitprince as pt
import seaborn as sns
pt.RainCloud(x='group', y='value', data=df,
palette=['#0072B2', '#D55E00'],
bw='scott', cut=0, # bandwidth + trim
width_viol=0.6, orient='h')
Raincloud = half-violin (distribution) + boxplot (summary) + jittered raw points. Allen 2019 Wellcome Open Res 4:63 — modern publication default for N 30-200.
library(lvplot)
ggplot(df, aes(group, value, fill = group)) +
geom_lv(k = 5, alpha = 0.7) +
scale_fill_manual(values = c('#0072B2', '#D55E00'))
Extends Tukey's boxplot via additional letter-value quantiles (Hofmann, Wickham, Kafadar 2017 J Comput Graph Stat 26:469). For large N, the standard boxplot collapses tail structure; letter-value preserves it.
sns.boxenplot(x='group', y='value', data=df,
palette=['#0072B2', '#D55E00']) # seaborn calls it boxenplot
library(introdataviz) # split-violin geom
ggplot(df, aes(group, value, fill = condition)) +
geom_split_violin(alpha = 0.7) +
geom_boxplot(width = 0.15, position = position_dodge(0.5), outlier.shape = NA)
For 2-condition comparison within each group, split-violin shows both densities back-to-back. More compact than dodged violins.
Trigger: geom_bar(stat = 'summary') + geom_errorbar(stat = 'summary', fun.data = mean_se).
Mechanism: Mean ± SEM collapses all distributional information; reader cannot assess bimodality, skew, or N.
Symptom: Reviewer asks to "show the data"; the figure must be redone.
Fix: Replace with raincloud, beeswarm, or boxplot+jitter. Show points for N < 30.
Trigger: geom_violin() without specifying bw.
Mechanism: Silverman's rule of thumb assumes unimodal Gaussian; oversmooths bimodal data into a single peak.
Symptom: Single-cell expression bimodality (off / on) renders as a unimodal violin; biologically false.
Fix: bw = 'SJ' (Sheather-Jones plug-in) for honest bimodality. Note: nrd0 IS Silverman; nrd (Scott) oversmooths less than Silverman but Sheather-Jones is preferred.
Trigger: notch = TRUE with N < 15 per group.
Mechanism: Notch can extend beyond Q1/Q3, producing visually-misleading "inside-out" notches.
Symptom: ggplot warning ("notch went outside hinges"); notches look weird.
Fix: Use notches only with N ≥ 15. For smaller N, show raw points instead.
Trigger: geom_boxplot() + geom_jitter() with default outlier.shape = 19.
Mechanism: Outliers render twice — once from boxplot (large dots), once from jitter (smaller dots) — visually duplicated.
Symptom: Some points appear bigger than others without reason.
Fix: geom_boxplot(outlier.shape = NA) when overlaying raw points.
Trigger: geom_violin() default trim = TRUE.
Mechanism: Default trims violin at the data range; the visual still shows narrowing "tails" implying density extends slightly beyond the data.
Symptom: Reader infers density beyond observed range.
Fix: trim = FALSE to let KDE extend, OR explicitly cap with coord_cartesian. Document the choice.
Trigger: Boxplot with no N per group reported.
Mechanism: Reader cannot assess statistical power; tiny N looks identical to large N at this encoding.
Symptom: Reviewer requests "show N per group."
Fix: Add N to x-axis tick label (Control (n=12)) or use stat_summary(geom='text', fun.data = function(x) data.frame(label = paste('n=', length(x)))).
Trigger: Raincloud applied with N = 5 per group.
Mechanism: KDE with N=5 is meaningless; violin shape is artifact of bandwidth.
Symptom: Smooth violin from 5 points; misleads about underlying distribution.
Fix: For N < 30, drop the violin half; use box + raw points only.
| Pattern | Cause | Action |
|---------|-------|--------|
| Bar of mean shows clear separation; raincloud shows overlapping distributions | Bars hide overlap | Use raincloud; bars exaggerate effect |
| Violin shows unimodal; histogram shows bimodal | Default Silverman bandwidth oversmooths | Re-render with bw = 'SJ' |
| Boxplot medians look distinct; t-test n.s. | Boxplot of small N is unreliable | Show raw points; rerun with appropriate non-parametric test |
| Notched boxplot notches non-overlap; rank test n.s. | Notch is an approximation, not a hypothesis test | Notches are heuristic only; use formal test |
Operational rule: for N < 30, show every point. For N 30-200, raincloud. For N > 200, letter-value or violin with explicit bandwidth. Always annotate N.
| Threshold | Value | Source | |-----------|-------|--------| | N for valid notched boxplot | ≥15 | Common practice | | N to show raw points | <30 | Weissgerber 2015 | | N where violin > box | >30 (with explicit bandwidth) | Visualization guidance | | Whisker length (Tukey) | 1.5 × IQR | Tukey 1977 | | Notch length (McGill 1978) | ±1.58 × IQR / sqrt(N) | McGill 1978 | | KDE bandwidth (Sheather-Jones) | plug-in selector | Sheather-Jones 1991 |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Bimodal data shown as unimodal violin | Default Silverman bandwidth | bw = 'SJ' |
| Duplicate large points on box + jitter | outlier.shape not suppressed | geom_boxplot(outlier.shape = NA) |
| Notches "inside-out" | N too small | Show raw points; remove notch |
| Raincloud looks smooth at N=5 | KDE meaningless at small N | Drop violin half; box + points only |
| No N visible | Default boxplot | Add n=... to x label or stat_summary text |
| Violin tails extend beyond data | trim = TRUE default + KDE bandwidth | trim = FALSE and cap with coord_cartesian |
| Bar of mean criticized in review | Weissgerber 2015 default failure | Replace with raincloud or box+jitter |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.