imaging-mass-cytometry/phenotyping/SKILL.md
Assign cell types from marker expression in IMC/MIBI data using clustering (PhenoGraph/FlowSOM/Leiden/Pixie), marker-based probabilistic classifiers (Astir), or image-context CNNs (CellSighter), covering the double-positive segmentation artifact, lineage-vs-state markers, the two spillover types, and why a "cell type" in imaging is conditioned on a segmentation guess. Use when phenotyping segmented IMC cells, choosing clustering vs classification, diagnosing implausible double-positive populations, separating lineage from functional markers, or transferring labels across a cohort.
Reference examples tested with: scanpy 1.10+, anndata 0.10+, astir 0.1.4+, numpy 1.26+, scikit-learn 1.4+, FlowSOM 2.10+ (R)
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Notes specific to this skill: arcsinh cofactor for IMC single-cell means is ~1, not the suspension-CyTOF 5 -- do not hard-code 5. Astir assigns a per-cell probability and routes below-threshold cells (default 0.7) to "Unknown" rather than forcing a call. CellSighter consumes raw multi-channel image crops + masks (not a mean matrix). FlowSOM consensus metaclustering can override set.seed() via ConsensusClusterPlus.
"Assign cell types to my segmented IMC cells" -> Map each cell's marker profile to an identity, while distinguishing real co-expression from segmentation/spillover artifacts.
- Python: `scanpy.tl.leiden` (cluster then annotate), `astir` (marker-dictionary classifier)
- R: `FlowSOM` (self-organizing-map clustering)

In suspension CyTOF each event is one physically isolated cell; in imaging, every cell-by-marker row is the integral of pixels inside a polygon a segmentation algorithm drew, and that polygon is wrong at a non-trivial fraction of cells -- so the most dangerous phenotypes are not biology but boundary artifacts.

The canonical case is the CD3+CD20+ ("T/B") double-positive, also CD3+CD68+ and panCK+CD45+. It arises by two distinct mechanisms that are indistinguishable in the mean matrix: segmentation merging (one polygon spans a T cell and a B cell) and lateral spillover (a neighbor's membrane bleeds across the boundary even with perfect masks). The diagnostic tell that separates artifact from biology is spatial: artifactual double-positives localize to cell BORDERS and to high-density regions, so a suspect population must be mapped back onto the image before it is believed (CellSighter authors state the matrix cannot separate the two).

The asymmetry that drives method choice: clustering CREATES the artifact as a named population, while a marker-dictionary classifier (Astir) REFUSES it -- a true double-positive vector matches no defined type and is quarantined as "Unknown" rather than crowned a new lineage. This is why imaging-aware groups increasingly prefer (semi-)supervised phenotyping for the lineage layer, and why mean-expression clustering imported wholesale from CyTOF inherits none of the spatial information that would let it notice the polygon was wrong.
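The segmentation-merging mechanism described above can be illustrated with a toy calculation. This is a minimal sketch, not a model of real IMC data: the pixel intensities, the 50-pixel cell size, and the `POSITIVE` cutoff are all hypothetical values chosen for illustration.

```python
# Hypothetical per-pixel intensities (arbitrary units) for two adjacent cells.
t_cell_pixels = [{"CD3": 8.0, "CD20": 0.2}] * 50  # T cell: CD3 high, CD20 near background
b_cell_pixels = [{"CD3": 0.3, "CD20": 9.0}] * 50  # B cell: CD20 high, CD3 near background


def mean_profile(pixels):
    """Average each marker over the pixels inside one segmentation polygon."""
    markers = pixels[0].keys()
    return {m: sum(p[m] for p in pixels) / len(pixels) for m in markers}


POSITIVE = 2.0  # hypothetical positivity cutoff, for illustration only


def is_double_positive(profile):
    """True when both CD3 and CD20 means exceed the cutoff."""
    return profile["CD3"] > POSITIVE and profile["CD20"] > POSITIVE


correct_t = mean_profile(t_cell_pixels)               # polygon drawn around the T cell only
merged = mean_profile(t_cell_pixels + b_cell_pixels)  # one polygon spanning both cells

print(is_double_positive(correct_t), is_double_positive(merged))
```

The merged row looks like a clean CD3+CD20+ cell in the mean matrix even though no pixel in it co-expresses both markers, which is why the check has to happen on the image.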
| Approach | Tools | Input | Robust to bad segmentation? | Failure signature |
|----------|-------|-------|-----------------------------|-------------------|
| Unsupervised clustering | PhenoGraph, FlowSOM, Leiden | cell x marker mean matrix | No -- averages spilled signal into a fake type | phantom double-positive clusters; resolution-dependent type count |
| Pixel-then-cell clustering | Pixie (ark-analysis) | pixel x marker, then cell | More -- avoids committing to a segmentation mean early | parameter-sensitive; still unsupervised |
| Marker-based probabilistic | Astir | mean matrix + marker->type YAML | Partially -- ambiguous cells -> "Unknown" | high Unknown rate if dictionary/markers wrong |
| Image-context CNN | CellSighter, MAPS | raw image crops + masks + labels | Yes -- sees where the signal sits | needs representative labels not harvested from clustering |
| Segmentation-aware mixture | STARLING | mean matrix + doublet prior | Yes -- models a cell as a mixture of two | newer; verify priors |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Can write marker->celltype rules, no training labels | Astir (lineage layer) | deterministic, fast, "Unknown" for ambiguous, separates type from state |
| Have expert-labeled cells, segmentation/spillover is a known problem | CellSighter | image context rejects border/spillover double-positives |
| Severe segmentation doubt | STARLING | explicitly models doublet/contamination mixtures |
| Annotated reference cohort, want label transfer | STELLAR | graph model using neighborhood + expression |
| Exploratory, no priors, accept manual annotation | Pixie (most robust) or Leiden/FlowSOM (least) | always run the double-positive image-diagnostic first |
| Any across-condition comparison of the resulting types | hand off to differential-analysis | phenotyping and statistical-unit choice are orthogonal |
Goal: Build the single-cell matrix on the correct count scale.
Approach: Arcsinh with cofactor ~1 for IMC means (not 5), and keep raw counts available. Treat zeros as genuine low ion counts plus Poisson noise, not technical dropout -- scRNA-style imputation hallucinates expression.
```python
import scanpy as sc
import anndata as ad
import numpy as np

adata = ad.read_h5ad('imc_segmented.h5ad')
adata.layers['counts'] = adata.X.copy()  # keep raw counts available
adata.X = np.arcsinh(adata.X / 1.0)      # cofactor ~1 for IMC single-cell means, not 5
```
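To see why the cofactor matters at IMC count levels, the separation between a negative and a positive population can be compared under both cofactors. A minimal sketch using only the standard library; the two mean counts are hypothetical, chosen to sit in the low range typical of per-cell means rather than taken from any dataset.

```python
import math


def arcsinh_scale(count, cofactor):
    """arcsinh transform of a mean ion count with the given cofactor."""
    return math.asinh(count / cofactor)


# Hypothetical per-cell mean counts for a marker-negative and a marker-positive cell.
negative_mean, positive_mean = 0.5, 5.0

gap_cofactor_1 = arcsinh_scale(positive_mean, 1.0) - arcsinh_scale(negative_mean, 1.0)
gap_cofactor_5 = arcsinh_scale(positive_mean, 5.0) - arcsinh_scale(negative_mean, 5.0)

print(f"gap at cofactor 1: {gap_cofactor_1:.2f}, at cofactor 5: {gap_cofactor_5:.2f}")
```

With these inputs the positive/negative gap is larger at cofactor 1 than at cofactor 5, because dividing small counts by 5 pushes both populations into the near-linear region of arcsinh around zero.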
Goal: Assign lineage with a principled abstention instead of a forced call.
Approach: Encode marker->celltype rules in a YAML with separate cell_type and cell_state blocks; Astir returns a per-cell probability and labels below-threshold cells "Unknown". The Unknown rate is itself QC -- 40% Unknown means the dictionary or panel is mis-specified, not that the cells are exotic.
```python
from astir.data import from_anndata_yaml

# inputs are PATHS: an .h5ad and a marker YAML with a cell_type block (CD3->T, CD20->B,
# CD68->Macrophage; no type is both) and an optional cell_state block (Ki67, PD-1)
ast = from_anndata_yaml('imc_segmented.h5ad', 'markers.yaml')
ast.fit_type()
celltypes = ast.get_celltypes(threshold=0.7)  # per-cell labels; < 0.7 -> 'Unknown' (information, not failure)
```
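The abstention rule itself is simple enough to sketch without the package. The following is an illustration of thresholded assignment and the Unknown-rate QC metric, not Astir's internal code: the function name `call_with_abstention`, the probability values, and the choice of `>=` at the boundary are assumptions.

```python
def call_with_abstention(probabilities, threshold=0.7):
    """Return the most probable type, or 'Unknown' when it falls below the threshold."""
    best_type = max(probabilities, key=probabilities.get)
    return best_type if probabilities[best_type] >= threshold else "Unknown"


# Hypothetical per-cell type probabilities (each row sums to 1).
cells = {
    "cell_1": {"T": 0.95, "B": 0.03, "Macrophage": 0.02},
    "cell_2": {"T": 0.48, "B": 0.47, "Macrophage": 0.05},  # ambiguous, CD3+CD20+-like
    "cell_3": {"T": 0.10, "B": 0.85, "Macrophage": 0.05},
}

labels = {cell_id: call_with_abstention(p) for cell_id, p in cells.items()}
unknown_rate = sum(label == "Unknown" for label in labels.values()) / len(labels)

print(labels, f"Unknown rate: {unknown_rate:.2f}")
```

The ambiguous cell is set aside rather than forced into T or B, and the Unknown rate falls out as a single number to track per sample.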
Goal: Discover structure without splitting one type into activation states.
Approach: Cluster on lineage markers only; mixing continuous state markers (Ki67, PD-1) fragments one type into proliferating/resting pseudo-types. Validating clusters with the same markers used to cluster is circular -- confirm with held-out evidence (spatial context, independent markers).
```python
lineage = ['CD45', 'CD3', 'CD8', 'CD4', 'CD20', 'CD68', 'E-cadherin']  # lineage only, no Ki67/PD-1
sub = adata[:, lineage].copy()  # copy so PCA/neighbors write to a real object, not a view
# n_comps must stay below the marker count for the default ARPACK solver
sc.pp.pca(sub, n_comps=min(15, len(lineage) - 1))
sc.pp.neighbors(sub, n_neighbors=15)
sc.tl.leiden(sub, resolution=0.5)
adata.obs['leiden'] = sub.obs['leiden']
# report cluster stability across resolutions/seeds rather than one hand-picked setting
```
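One way to report stability is the adjusted Rand index (ARI) between label vectors from two runs (different seeds or resolutions). The sketch below is a dependency-free implementation for illustration; `sklearn.metrics.adjusted_rand_score` computes the same quantity and is the practical choice. The label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same cells."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    # Pairs of cells co-clustered in both runs, in run A, and in run B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)
    maximum = (in_a + in_b) / 2
    if maximum == expected:  # degenerate case, e.g. both runs put all cells in one cluster
        return 1.0
    return (both - expected) / (maximum - expected)


# Hypothetical Leiden labels for the same six cells.
run_seed_0 = ["0", "0", "1", "1", "2", "2"]
run_seed_1 = ["2", "2", "0", "0", "1", "1"]  # same partition, cluster names permuted
run_low_res = ["0", "0", "0", "1", "1", "1"]  # lower resolution merges clusters

ari_same = adjusted_rand_index(run_seed_0, run_seed_1)
ari_diff = adjusted_rand_index(run_seed_0, run_low_res)
print(ari_same, round(ari_diff, 3))
```

ARI ignores cluster names, so a permuted but identical partition scores 1.0 while a genuinely different partition scores lower; this is also why "cluster 7" cannot be assumed to mean the same thing across runs.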
Goal: Decide whether an implausible co-expressing population is biology or artifact.
Approach: A real co-expressing cell has the second marker over its own membrane/cytoplasm; an artifact has it concentrated on the border adjacent to a donor neighbor. Quantify how often the suspect cells sit next to a cell of the donor type -- border + donor-adjacency means spillover/merge, not a lineage.
```python
import squidpy as sq

# requires cell centroids in adata.obsm['spatial']
sq.gr.spatial_neighbors(adata, coord_type='generic', delaunay=True)
suspect = adata.obs['cell_type'] == 'CD3+CD20+?'
# if suspect cells are overwhelmingly adjacent to true B cells (the CD20 donor), the CD20
# is spillover/merge, not endogenous -- treat the population as a QC failure, not a discovery
```
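The donor-adjacency check can be made quantitative. This is a minimal sketch on a plain neighbor dictionary; in practice the neighbor lists would be read from the graph squidpy stores in `adata.obsp['spatial_connectivities']`. The function name, the cell IDs, and the `'B'`/`'T'` labels are hypothetical.

```python
def donor_adjacent_fraction(neighbors, labels, target_label, donor_label):
    """Fraction of target-labelled cells with at least one donor-labelled neighbor."""
    targets = [cell for cell, label in labels.items() if label == target_label]
    if not targets:
        return float("nan")
    hits = sum(
        any(labels[nb] == donor_label for nb in neighbors.get(cell, ()))
        for cell in targets
    )
    return hits / len(targets)


# Hypothetical labels and spatial neighbor graph for seven cells.
labels = {
    "a": "CD3+CD20+?", "b": "B", "c": "T", "d": "CD3+CD20+?",
    "e": "B", "f": "T", "g": "T",
}
neighbors = {
    "a": ["b", "c"], "b": ["a", "e"], "c": ["a", "f"], "d": ["e", "f"],
    "e": ["b", "d"], "f": ["c", "d", "g"], "g": ["f"],
}

suspect_fraction = donor_adjacent_fraction(neighbors, labels, "CD3+CD20+?", "B")
baseline_fraction = donor_adjacent_fraction(neighbors, labels, "T", "B")
print(suspect_fraction, baseline_fraction)
```

Comparing against the baseline for ordinary T cells matters: in a B-cell-rich region most cells touch a B cell, so a high suspect fraction is only informative when it clearly exceeds what plain T cells show in the same tissue.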
Trigger: tuning Leiden resolution / FlowSOM metacluster count until clusters match expectation. Mechanism: the resolution directly sets the type count; it is an identifiability hole, not a tuning knob. Symptom: unreproducible type counts; clusters drift across samples. Fix: fix the type set with a dictionary/classifier, or report stability across resolutions and seeds.
Trigger: set.seed() then consensus metaclustering, expecting reproducibility. Mechanism: ConsensusClusterPlus resets the seed internally. Symptom: cluster identities differ between runs. Fix: set the seed inside the consensus call; assess label stability across runs.
Trigger: running CATALYST channel compensation and assuming spatial spillover is handled. Mechanism: channel/isotope spillover and lateral/optical spillover are different physical problems. Symptom: double-positives persist after channel compensation. Fix: channel compensation early (pixel level), REDSEA boundary compensation after segmentation; neither fixes a merged segment -- improve segmentation first.
Trigger: scRNA-style dropout imputation on the count matrix. Mechanism: IMC zeros are largely genuine low counts, not a capture-dropout mechanism. Symptom: hallucinated expression, inflated positivity. Fix: model low counts as low counts; do not impute.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| arcsinh cofactor ~1 (IMC means) | Hunter 2024 Cytometry A 105:36 | preserves positive/negative separation; 5 over-compresses |
| Astir assignment threshold 0.7 (package default) | Geuenich 2021 Cell Syst 12:1173 | principled abstention; the Unknown rate is a QC metric |
| ~40 markers, no redundancy | panel design | one channel can decide a fate -- verify the load-bearing channel per type |
| CellSighter labels NOT from clustering | Amitay 2023 Nat Commun 14:4302 | clustering-derived labels re-import the double-positive artifact |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Tidy CD3+CD20+ cluster reported as a lineage | clustering legitimized a segmentation/spillover artifact | diagnose border-localization on the image; treat as QC failure |
| One T-cell type split into two clusters | state markers (Ki67) mixed into lineage clustering | cluster lineage on lineage markers; profile state within type |
| ~40% of cells "Unknown" in Astir | mis-specified dictionary or missing type | inspect Unknown cells; iterate the YAML; tune 0.7 consciously |
| Cluster identities drift between analyses | stochastic clustering / FlowSOM seed | pin seeds, assess stability; do not assume "cluster 7" is stable |
| "Disease has more Tregs" with p~0 | cell-level testing (pseudoreplication) | aggregate to per-patient proportions; see differential-analysis |
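The per-patient aggregation in the last row can be sketched as follows. This only illustrates the change of statistical unit from cell to patient; the patient IDs, labels, and function name are hypothetical, and the downstream test belongs to differential-analysis.

```python
from collections import Counter


def per_patient_proportion(cells, cell_type):
    """Collapse (patient_id, label) rows to one proportion of `cell_type` per patient."""
    totals = Counter()
    hits = Counter()
    for patient_id, label in cells:
        totals[patient_id] += 1
        if label == cell_type:
            hits[patient_id] += 1
    return {patient_id: hits[patient_id] / totals[patient_id] for patient_id in totals}


# Hypothetical phenotyped cells: one row per cell.
cells = [
    ("patient_1", "Treg"), ("patient_1", "T"), ("patient_1", "B"), ("patient_1", "T"),
    ("patient_2", "Treg"), ("patient_2", "Macrophage"),
]

treg_proportions = per_patient_proportion(cells, "Treg")
print(treg_proportions)
```

Six cells become two observations, one per patient, which is the sample size a between-condition test should see.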