imaging-mass-cytometry/data-preprocessing/SKILL.md
Load and preprocess imaging mass cytometry (IMC) and MIBI data from raw MCD/TXT through hot-pixel removal, spillover compensation, and variance-stabilizing transformation, covering readimc/steinbock ingestion, NNLS spillover compensation (CATALYST), IMC-Denoise, and the IMC arcsinh-cofactor question. Use when starting analysis from raw MCD files, building per-channel TIFF stacks, compensating channel spillover, choosing an arcsinh cofactor, or preparing single-cell intensities for phenotyping.
npx skillsauth add GPTomics/bioSkills bio-imaging-mass-cytometry-data-preprocessingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: steinbock 0.16+, readimc 0.7+, numpy 1.26+, CATALYST 1.28+ (R/Bioconductor), cytomapper 1.16+ (R)
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parameters<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Notes specific to this skill: IMC pixels are integer ion COUNTS, not fluorescence intensities. CATALYST::compCytof defaults to method='nnls' (non-negativity preserved); pass cofactor=1 explicitly for IMC single-cell means (the compCytof default is NULL/5, the suspension value). The spillover-matrix and SCE channel names must both be (metal)(mass)Di (e.g. Sm152Di) or compensation silently no-ops. steinbock's hot-pixel filter is steinbock preprocess imc images --hpf 50 (a signed 8-neighbor difference, not a median filter).
"Preprocess my imaging mass cytometry data" -> Ingest raw acquisitions, suppress acquisition noise, compensate channel spillover, and variance-stabilize counts so each cell's measured intensity reflects real antigen abundance.
readimc.MCDFile, steinbock preprocess imc images --hpf 50CATALYST::compCytof, cytomapper::compImage for spillover compensationEvery IMC/MIBI pixel is an integer number of detected metal ions from one ~1 um laser shot, drawn from a low-count, zero-inflated, near-Poisson regime where most pixels read 0-2 counts and the limit of detection is ~6 counts. Four consequences govern every preprocessing decision, and importing fluorescence-microscopy habits violates all four. (1) There is no continuous Gaussian background to subtract -- the floor is a count floor, and a true-negative pixel still reads 1-2 counts by Poisson chance. (2) Negative values are physically meaningless, so flow-style spillover compensation (exact matrix inverse) is wrong because it manufactures negatives; non-negative least squares (NNLS) is mandatory (Chevrier 2018 Cell Syst 6:612). (3) Per-pixel "expression" is mostly shot noise -- signal emerges only after segmentation sums a cell's pixels, so pixel maps are for localization and QC, never quantification. (4) Spillover is SPATIAL: a bright cell bleeding into a neighboring mass channel contaminates adjacent pixels, fabricating co-localization and false marker positivity at cell borders -- which means uncompensated spillover manufactures the exact cell-cell interactions IMC exists to measure. The corollary that trips up suspension-CyTOF veterans: the arcsinh cofactor is NOT 5 (cofactor 1 is the modern IMC default, Hunter 2024 Cytometry A 105:36), and there are no in-stream calibration beads in ablated tissue.
read .mcd (readimc) -> panel filter+sort (keep column) -> hot-pixel removal (DIMR or --hpf 50)
-> [optional] DeepSNiF on low-SNR channels only -> spillover compensation (NNLS)
-> segmentation (on compensated nuclear/membrane channels) -> per-cell aggregation
-> arcsinh(cofactor 1) -> z-score / cohort-anchored normalization
Order is not cosmetic: denoise operates on RAW counts (the Poisson noise model is defined there), compensate BEFORE segmentation when spatial fidelity matters (corrupted membrane channels yield wrong boundaries that no later step recovers), and transform/normalize LAST.
| Method | Targets | Risk | When to use | Fails when |
|--------|---------|------|-------------|------------|
| steinbock --hpf 50 | hot pixels | low | default fast hot-pixel pass | absolute-count threshold over-clips bright channels, under-cleans dim ones |
| IMC-Denoise DIMR (Lu 2023) | hot pixels | low | self-calibrating hot-pixel removal | misclassifies large multi-pixel hot-pixel clusters as signal |
| IMC-Denoise DeepSNiF (Lu 2023) | shot noise | HIGH | only channels with mean positive intensity < ~7 | over-smooths sparse/punctate markers and sub-1-2 um structure; biases extreme-low-count regions |
| 3x3 median filter | (do not use) | severe | never | returns 0 for isolated real single-positive pixels -- erases sparse biology |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Single-stain QC shows >~2% off-target spillover at relevant masses | Compensate (NNLS) before phenotyping | leak corrupts type calls and (spatially) neighborhood stats |
| Spatial neighborhood / interaction analysis is the endpoint | Pixel-level compensation (compImage) before segmentation | spillover is spatial and fakes interactions; cell-mean compensation comes too late |
| Means-only phenotyping, segmentation channels uncontaminated | Cell-level compCytof on SCE means | cheaper, less per-pixel Poisson noise |
| Channel driven above ~5,000 dual counts | Re-titrate at acquisition; do not compensate | linearity (and thus the matrix) breaks above saturation |
| Panel pre-designed to avoid bright/dim mass adjacencies, QC near-clean | Compensation ~ identity; skipping is defensible | avoids NNLS noise on near-zero pixels |
| Channel mean positive intensity < ~7, cannot phenotype on it | DIMR + DeepSNiF | shot noise dominates; denoising is the only way to use it |
| Channel clean, or punctate, or the one segmentation runs on | DIMR / --hpf only; skip DeepSNiF | DeepSNiF over-smooths real isolated signal and blurs boundaries |
Goal: Ingest the multi-ROI MCD (the canonical source) and keep the metal-to-target panel mapping intact.
Approach: Open the MCD with readimc, iterate slides and acquisitions, and carry both channel_names (metals) and channel_labels (targets) -- conflating them is the most common ingestion bug. Prefer MCD over TXT (one ROI per file, and absent on Hyperion XTi).
from readimc import MCDFile
with MCDFile('slide.mcd') as f:
slide = f.slides[0]
for acq in slide.acquisitions:
img = f.read_acquisition(acq) # (channels, y, x) float32 ion counts
metals = acq.channel_names # e.g. 'Sm152' -- the mass channel
targets = acq.channel_labels # e.g. 'CD3' -- the antibody target
print(acq.id, img.shape, dict(zip(metals, targets)))
Goal: Build per-channel TIFF stacks, filtered to the analysis panel and de-spiked.
Approach: The panel CSV is both a filter and a sort key -- only keep==1 rows are written and their row order defines channel order in the stack, so it must be pinned as a versioned artifact. The --hpf filter compares each pixel to its 8 neighbors with a signed difference and replaces spikes with the neighbor maximum (a conservative, valley-preserving operation), not a median.
# generate the panel template (edit the keep column before extracting)
steinbock preprocess imc panel
# extract TIFFs (keep-filtered, panel-ordered) with hot-pixel removal; 50 is a count
# difference, not a universal constant -- raise it for high-dynamic-range markers
steinbock preprocess imc images --hpf 50
Goal: Remove channel crosstalk (oxide M+16, abundance-sensitivity M+-1, isotopic impurity) without introducing negative counts.
Approach: Estimate the spillover matrix from single-stain controls per positive EVENT then take the median (population-summary estimation overcompensates because IMC's zero background biases ratios), then apply NNLS. Compensate pixels (compImage) before segmentation for spatial work, or cell means (compCytof) after segmentation otherwise. Channel names must be (metal)(mass)Di.
library(CATALYST)
library(imcRtools)
# estimate the matrix from spotted single-stain TXTs (filenames carry the metal)
sce <- readSCEfromTXT('spillover/')
sce <- prepData(sce, transform = TRUE, cofactor = 5) # cofactor 5 for PIXEL spot data
sce <- assignPrelim(sce); sce <- estCutoffs(sce); sce <- applyCutoffs(sce)
sm <- computeSpillmat(sce) # the spillover matrix
# cell-level compensation on segmented single-cell means (NNLS is the default)
sce_cells <- compCytof(sce_cells, sm, method = 'nnls', cofactor = 1, overwrite = FALSE)
# OR pixel-level compensation on the image stack, before segmentation (spatial fidelity)
library(cytomapper)
images <- compImage(images, adaptSpillmat(sm, channelNames(images)))
Goal: Variance-stabilize counts and remove batch offset without destroying cross-sample comparability.
Approach: Apply arcsinh with cofactor 1 on single-cell means (the OPTIMAL-derived IMC default, not the suspension-CyTOF 5), then z-score per channel against cohort-wide statistics. Per-image percentile or min-max scaling is a one-way door that makes equal biology look unequal across samples -- reserve it for visualization and always retain raw/compensated counts.
import numpy as np
def arcsinh_cofactor1(cell_means):
# cofactor 1 for IMC single-cell means (Hunter 2024); state the cofactor explicitly --
# no field-wide standard exists, so reproducibility requires reporting it
return np.arcsinh(cell_means / 1.0)
def zscore_per_channel(expr, mean, std):
# mean/std computed COHORT-WIDE (not per-image) so scales stay comparable across samples
return (expr - mean) / std
Trigger: exact matrix inversion (method='flow' or generic linear unmixing). Mechanism: the inverse violates non-negativity and produces negative ion counts. Symptom: negative compensated values, downstream stats corrupted. Fix: compCytof(..., method='nnls') (the default); negatives are a solver artifact, not evidence compensation is wrong.
Trigger: SCE channel names are Sm152 but the spillover matrix uses Sm152Di (or vice versa). Mechanism: name-based mapping finds no match. Symptom: compensation runs without error but changes nothing. Fix: enforce (metal)(mass)Di; reconcile with adaptSpillmat().
Trigger: blanket denoising "to clean things up." Mechanism: the Hessian continuity prior imposes spatial smoothness on genuinely sparse/punctate markers. Symptom: rare-population or punctate signal blurred into neighbors; biased low-count regions. Fix: denoise only channels with mean positive intensity < ~7; validate against the un-denoised image; never denoise the segmentation channel without checking boundary integrity.
Trigger: independent per-image 99th-percentile or min-max scaling, then comparing samples. Mechanism: image A's 99th percentile (40 counts) and image B's (400 counts) map to the same [0,1]. Symptom: a dim positive in A reads like a bright positive in B; differential abundance is spurious. Fix: derive normalization from cohort-wide statistics or a shared anchor; keep raw counts to re-derive.
Trigger: ndimage.median_filter on the count image. Mechanism: a 3x3 median over sparse single-positive pixels returns 0. Symptom: real isolated membrane/punctate signal erased. Fix: use the neighbor-spike filter (--hpf) or DIMR; never blanket-median IMC.
Trigger: applying this steinbock/CATALYST flow to MIBI-TOF data as if it were IMC. Mechanism: MIBI is SIMS on Au/Ta conductive slides, so it carries a 197Au slide background and crosstalk classes the CATALYST single-stain-bead model does not target. Symptom: gold-background contamination and uncorrected MIBI crosstalk. Fix: remove the Au/native-background channels and run MAUI (Baranski 2021) before this pipeline; treat the CATALYST spillover step as IMC-specific.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| arcsinh cofactor 1 (single-cell means) | Hunter 2024 Cytometry A 105:36 | maximizes positive/negative separation (Fisher ratio) for IMC counts; 5 over-compresses |
| arcsinh cofactor 5 (pixel/spot data) | CATALYST IMC workflow | pixel spot counts are higher-scale than cell means |
| Linearity ceiling ~5,000 dual counts | Chevrier 2018 Cell Syst 6:612 | above it count->abundance bends and the spillover matrix is invalid |
| --hpf 50 (count difference) | steinbock convention | a per-experiment heuristic, not a universal constant -- tune to dynamic range |
| DeepSNiF only if mean positive intensity < ~7 | Lu 2023 Nat Commun 14:1601 | above ~7 the channel is effectively noise-immune; denoising is pure risk |
| Detection limit ~6 ion counts | Lu 2023 Nat Commun 14:1601 | below this, signal and shot noise are indistinguishable per pixel |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| compCytof runs but values unchanged | channel-name format mismatch | enforce Sm152Di; adaptSpillmat() |
| Negative compensated counts | flow-style inversion | use method='nnls' |
| "Channel 12 is a different antibody for collaborators" | panel keep/order changed after segmentation | pin panel.csv as a versioned artifact; never reorder post-segmentation |
| Rare population vanished after denoising | DeepSNiF on a sparse channel | restrict DeepSNiF to low-SNR non-punctate channels |
| TXT loads only one ROI | analyzing TXT instead of MCD | ingest the multi-ROI .mcd with readimc |
| Cross-sample differences disappear or explode | per-image normalization | cohort-anchored normalization; keep raw counts |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.