multi-omics-integration/similarity-network/SKILL.md
Stratifies patients into multi-omics subtypes by building one patient-by-patient similarity network per omic, fusing them with SNF's cross-network diffusion, and spectral-clustering the fused graph - then defending the clusters with stability, survival separation, and replication. Covers why spectral clustering always returns the requested cluster count so a subtype is a claim not a discovery, why the eigengap is a graph property not a biological truth, why fusion is not automatically better than the best single omic, why SNF needs complete data while NEMO handles mosaic cohorts, and the SNFtool API gotchas (dist2 returns squared distance, affinityMatrix width is sigma, spectralClustering K is the cluster count). Use when discovering patient subtypes from multiple omics, choosing a cluster number, validating subtypes, or handling partial multi-omic data. For feature-space factors see mofa-integration; for supervised signatures see mixomics-analysis; for survival see clinical-biostatistics/survival-analysis.
Reference examples tested with: SNFtool 2.3+, igraph 2.0+, pheatmap 1.0+.
Before using code patterns, verify installed versions match. If versions differ:
- `packageVersion('SNFtool')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
SNFtool argument names are easy to misread: affinityMatrix's width argument is sigma (not alpha or mu), dist2 returns SQUARED Euclidean distance (take the square root), and spectralClustering's K is the number of CLUSTERS, a name collision with the K-neighbors argument of affinityMatrix/SNF.
"Stratify my patients using multi-omics data" -> Fuse per-omic patient-similarity networks into one graph and spectral-cluster it - because the clustering always returns the number of subtypes requested, so the subtype is a claim to defend, not a discovery.
- `SNF()` to fuse networks, `spectralClustering()` to partition, then validate

Scope: patient-similarity-space integration - per-omic affinity networks, SNF fusion, spectral clustering into candidate subtypes, cluster-number choice, and the stability/survival/replication defense, plus the integrative-clustering landscape (NEMO, PINS, iCluster, CIMLR, intNMF, consensus). Feature-space latent factors -> mofa-integration. Supervised feature signatures -> mixomics-analysis. Survival mechanics -> clinical-biostatistics/survival-analysis. Single-cell graph clustering -> single-cell/clustering.
Ask the spectral step for four clusters and it returns four, whether or not four real patient groups exist; the eigengap suggesting "four" means only that the fused graph has a roughly four-block shape, not that the disease has four subtypes. SNF's fusion is genuinely powerful - its cross-network diffusion reinforces patient pairs that multiple omics agree on and erodes omic-specific noise - but power to draw a boundary is not evidence the boundary is real. Three defenses, none of which the algorithm supplies: stability under resampling, survival separation, and replication in an independent cohort.
The honest report names the hyperparameters, shows the clusters are not knife-edge-sensitive to them, and treats the discovery survival p as supporting evidence, not the finding.
| Tool | Citation | Mechanism | When |
|------|----------|-----------|------|
| SNF (SNFtool) | Wang 2014 Nat Methods 11:333 | per-omic affinity + cross-network diffusion -> spectral | complete data; non-linear consensus reinforcing multi-omic agreement; the default |
| NEMO | Rappoport 2019 Bioinformatics 35:3348 | per-omic relative similarity, AVERAGED (no iteration) -> spectral | PARTIAL/mosaic data; fast; comparable accuracy on full data |
| PINS / PINSPlus | Nguyen 2017 Genome Res 27:2025; Nguyen 2019 Bioinformatics 35:2843 | perturbation clustering; keeps partitions robust to noise; auto-picks C | when partition stability is the priority |
| iCluster / iClusterPlus | Shen 2009 Bioinformatics 25:2906; Mo 2013 PNAS 110:4245 | model-based joint latent-variable + feature selection | want a generative model and feature selection; tends to pick few clusters |
| CIMLR | Ramazzotti 2018 Nat Commun 9:4453 | multiple-kernel learning per omic -> k-means | one Gaussian kernel per omic too rigid; strong survival results |
| intNMF | Chalise 2017 PLoS One 12:e0176278 | joint non-negative matrix factorization | non-negative data; parts-based factorization + clustering |
| Consensus / COCA | Monti 2003 Mach Learn 52:91; Hoadley 2014 Cell 158:929 | cluster each omic, then cluster the matrix-of-clusters | late integration; want each omic's clustering visible and a consensus |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Complete multi-omics, want a non-linear consensus partition | SNF + spectralClustering | cross-network diffusion reinforces multi-omic agreement |
| Some patients missing an omic (mosaic/partial) | NEMO | built for partial data; no imputation, no patient loss |
| Partition robustness/auto cluster number is the priority | PINSPlus | perturbation clustering builds stability in |
| Want a generative model and integrated feature selection | iCluster / iClusterBayes | model-based joint latent variable |
| Need feature-level interpretation (which genes define a subtype) | -> mofa-integration / mixomics-analysis | SNF has no feature model; feature-space tools live there |
| Validate subtypes against outcome | -> clinical-biostatistics/survival-analysis | Cox / log-rank / KM mechanics |
| SNF underperforms on survival in a benchmark | MCCA (survival) or rMKL-LPP (clinical enrichment) | topped those criteria in Rappoport and Shamir 2018 |
| Which method at all / paired vs mosaic | -> integration-design | the correspondence and method decision |
Goal: Turn each omic into a patient-by-patient similarity network on a common scale, collapsing the high-dimensional feature space into an n-by-n object so feature count buys no votes.
Approach: Standardize each continuous omic per feature, compute the (square-rooted) Euclidean distance, then apply the local-scaled Gaussian kernel. SNF requires every patient to have every omic, so intersect to common samples first and report how many that drops.
library(SNFtool)
K <- 20 # neighbors defining the local kernel bandwidth; SNFtool guidance 10-30; changes cluster count
sigma <- 0.5 # kernel width multiplier (affinityMatrix's third arg, named sigma not alpha); guidance 0.3-0.8
t_iter <- 20 # cross-diffusion iterations; converges by ~10-20
norm_views <- lapply(list(rna=rna, meth=meth, mirna=mirna), standardNormalization) # per-feature z-score before distance
dists <- lapply(norm_views, function(x) dist2(x, x)^(1/2)) # dist2 returns SQUARED distance
affinities <- lapply(dists, function(d) affinityMatrix(d, K, sigma))
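The squared-distance gotcha can be checked without SNFtool. The sketch below uses base R only: `sq_dist` is a hypothetical helper built on the usual expansion |x|^2 + |y|^2 - 2x.y (an assumption about how such a function is commonly written, not a copy of the package source), and the toy matrix is made up. It compares both the raw and the square-rooted output against base R's `dist()`, which returns true Euclidean distance.

```r
# Toy data: 5 patients (rows) x 3 features (columns); values are illustrative only.
set.seed(1)
x <- matrix(rnorm(15), nrow = 5)

# Per-feature z-score, the role standardNormalization plays in the pipeline above.
z <- scale(x)

# Hypothetical helper: pairwise SQUARED Euclidean distance between rows.
sq_dist <- function(a, b) {
  d <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  d[d < 0] <- 0  # clamp tiny negatives caused by floating-point error
  d
}

d_sq <- sq_dist(z, z)     # squared distances
d_euclid <- sqrt(d_sq)    # Euclidean distances after the square root

# Reference Euclidean distances from base R.
ref <- as.matrix(dist(z))
max_err_rooted  <- max(abs(d_euclid - ref))  # close to zero
max_err_squared <- max(abs(d_sq - ref))      # clearly non-zero
```

Feeding the un-rooted matrix into a Gaussian kernel effectively changes the kernel's scale, which is why the distortion shows up later as shifted affinities rather than as an error message.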
Goal: Fuse the per-omic networks into one graph and partition it, treating the cluster number as the central claim rather than a nuisance parameter.
Approach: Run SNF's cross-diffusion, read the four cluster-number estimates the package returns as plausibility checks rather than truth, then spectral-cluster. The package itself warns that the estimates cannot guarantee accuracy.
fused <- SNF(affinities, K, t_iter)
estimateNumberOfClustersGivenGraph(fused, NUMC=2:8) # returns FOUR estimates: K1/K12 (eigengap), K2/K22 (rotation cost)
clusters <- spectralClustering(fused, K=4, type=3) # here K is the CLUSTER COUNT (not neighbors); type 3 = Ng-Jordan-Weiss default
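Both cautions about the cluster number can be reproduced on toy graphs. The sketch below is base R and deliberately simplified: `norm_laplacian`, `spectral_sketch`, and `eigengap_sketch` are hypothetical helpers (a plain normalized-Laplacian embedding plus k-means), not SNFtool's `spectralClustering` or `estimateNumberOfClustersGivenGraph`, and the data are synthetic. It asks for four clusters on patients drawn from a single Gaussian, then applies an eigengap heuristic to a graph with three planted blocks.

```r
# Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2} for a symmetric affinity W.
norm_laplacian <- function(W) {
  d_inv_sqrt <- 1 / sqrt(rowSums(W))
  diag(nrow(W)) - W * outer(d_inv_sqrt, d_inv_sqrt)
}

# Hypothetical helper: embed with the k smallest eigenvectors, then run k-means.
spectral_sketch <- function(W, k) {
  n <- nrow(W)
  ev <- eigen(norm_laplacian(W), symmetric = TRUE)  # eigenvalues in decreasing order
  U <- ev$vectors[, (n - k + 1):n, drop = FALSE]    # vectors of the k smallest eigenvalues
  U <- U / sqrt(rowSums(U^2))                       # row-normalize the embedding
  kmeans(U, centers = k, nstart = 20, iter.max = 100)$cluster
}

# Hypothetical eigengap heuristic: the k with the largest jump lambda_{k+1} - lambda_k.
eigengap_sketch <- function(W, candidates = 2:8) {
  lam <- sort(eigen(norm_laplacian(W), symmetric = TRUE, only.values = TRUE)$values)
  candidates[which.max(lam[candidates + 1] - lam[candidates])]
}

set.seed(42)
# Structureless cohort: 60 "patients" from ONE Gaussian, so no real subtypes exist.
x_noise <- matrix(rnorm(60 * 5), nrow = 60)
W_noise <- exp(-as.matrix(dist(x_noise))^2 / 8)
labels_noise <- spectral_sketch(W_noise, k = 4)
n_clusters_noise <- length(unique(labels_noise))  # four labels come back anyway

# Planted graph: three blocks of 10 with strong within-block affinity.
block <- rep(1:3, each = 10)
W_block <- ifelse(outer(block, block, "=="), 1, 0.01)
k_gap <- eigengap_sketch(W_block)  # reflects the block shape of THIS graph only
```

The heuristic recovers the planted block count because the graph was constructed that way; on a fused patient network the same number says how the graph is shaped, which is why the text treats it as plausibility rather than proof.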
Goal: Show the clusters are stable, separate outcome, and are not just the best single omic before calling them subtypes.
Approach: Compare the fused clustering against each single-omic clustering, assess stability under resampling, and rank the post-hoc feature attribution with the package function rather than a hand-rolled test. Survival mechanics are routed out.
concordanceNetworkNMI(c(affinities, list(fused)), C=4) # NMI among per-omic and fused clusterings: did fusion beat the best single omic?
feat_rank <- rankFeaturesByNMI(norm_views, fused) # POST-HOC attribution: features that track the clusters, not a model that made them
Validate survival separation in a covariate-adjusted Cox model and report events per arm (clinical-biostatistics/survival-analysis owns the mechanics); assess stability by resampling patients and re-clustering, reporting agreement across subsamples. To assign a new patient to an existing subtype without re-clustering, use groupPredict(train_views, test_views, groups, K=20, method=1) (label propagation). For a mosaic cohort, switch to NEMO rather than dropping the incomplete patients.
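The resampling check described above can be sketched in base R. Everything named here is illustrative: `adjusted_rand` and `subsample_stability` are hypothetical helpers, the agreement measure (adjusted Rand index) is one reasonable choice among several, and k-means on toy data stands in for the real pipeline. In an SNF workflow, `cluster_fn` would rebuild the affinities, fuse them, and spectral-cluster each subsample.

```r
# Adjusted Rand index between two labelings of the same patients (hypothetical helper).
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n_pairs <- function(v) sum(choose(as.vector(v), 2))
  row_pairs <- n_pairs(rowSums(tab))
  col_pairs <- n_pairs(colSums(tab))
  expected <- row_pairs * col_pairs / choose(length(a), 2)
  maximum <- (row_pairs + col_pairs) / 2
  (n_pairs(tab) - expected) / (maximum - expected)
}

# Re-cluster random patient subsamples and compare with the full-cohort labels.
# cluster_fn is any function(data, k) that returns one label per row.
subsample_stability <- function(data, k, cluster_fn, n_rep = 50, frac = 0.8) {
  full <- cluster_fn(data, k)
  replicate(n_rep, {
    idx <- sample(nrow(data), floor(frac * nrow(data)))
    adjusted_rand(full[idx], cluster_fn(data[idx, , drop = FALSE], k))
  })
}

# Stand-in clusterer on toy data with three well-separated groups of 20 patients.
set.seed(7)
toy <- rbind(matrix(rnorm(40, 0), 20), matrix(rnorm(40, 10), 20), matrix(rnorm(40, 20), 20))
kmeans_fn <- function(data, k) kmeans(data, centers = k, nstart = 10)$cluster
ari <- subsample_stability(toy, k = 3, cluster_fn = kmeans_fn)
mean_ari <- mean(ari)  # near 1 here because the toy groups are far apart
```

Reporting the whole distribution of `ari` rather than only its mean makes it visible when a minority of subsamples collapse into a different partition.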
Trigger: "I found 4 subtypes" with no defense. Mechanism: spectral clustering returns the requested C regardless of structure. Symptom: a clean-looking partition that does not replicate. Fix: require eigengap plausibility, resampling stability, covariate-adjusted survival, and external replication before claiming a subtype count.
Trigger: grid-searching K and sigma to maximize NMI against known labels. Mechanism: in real discovery there are no labels, so tuning to NMI is circular. Symptom: a result that only holds at the chosen (K, sigma). Fix: fix or pre-register a small (K, sigma) grid and show the clustering is stable across it; report sensitivity as a result.
Trigger: Reduce(intersect, ...) silently dropping patients missing an omic. Mechanism: SNF's cross-diffusion multiplies aligned n-by-n matrices, so it needs complete data. Symptom: a decimated, biased cohort. Fix: report the dropped count and bias; use NEMO for partial data; do not impute a whole omic to keep a patient.
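A minimal base R sketch of making the intersection loud instead of silent. The patient IDs and the `views` list are made up, and patients are assumed to be rows (matching the `dist2` usage earlier in this document).

```r
# Toy views with patients as rows; IDs and values are illustrative only.
views <- list(
  rna   = matrix(0, 5, 2, dimnames = list(c("P1", "P2", "P3", "P4", "P5"), NULL)),
  meth  = matrix(0, 4, 2, dimnames = list(c("P1", "P2", "P3", "P5"), NULL)),
  mirna = matrix(0, 4, 2, dimnames = list(c("P2", "P3", "P4", "P5"), NULL))
)

all_ids     <- Reduce(union, lapply(views, rownames))      # anyone with at least one omic
common_ids  <- Reduce(intersect, lapply(views, rownames))  # complete cases only
dropped_ids <- setdiff(all_ids, common_ids)

# Per-omic coverage shows which layer is responsible for the loss.
coverage <- sapply(views, function(v) sum(all_ids %in% rownames(v)))

# Align every view to the same patient order before building affinities.
aligned <- lapply(views, function(v) v[common_ids, , drop = FALSE])

message(length(dropped_ids), " of ", length(all_ids), " patients dropped: ",
        paste(dropped_ids, collapse = ", "))
```

Comparing clinical covariates between `dropped_ids` and `common_ids` is the natural next step for the bias report, since missingness in an omic is rarely random.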
Trigger: reporting the fused clustering without a single-omic comparison. Mechanism: fusion's diffusion can dilute a signal carried by one omic; multi-omics is not consistently better (Rappoport and Shamir). Symptom: a fused result no better than the best layer. Fix: benchmark fused vs each single omic with concordanceNetworkNMI and per-omic survival.
Trigger: a log-rank p on discovery clusters. Mechanism: C, K, sigma were chosen partly to get separable groups, and arms are small. Symptom: an over-optimistic p that does not replicate. Fix: adjust for prognostic covariates, report events per arm, use a permutation p, and require replication.
Trigger: presenting ranked features as what generated the clusters. Mechanism: SNF has no feature-level model; attribution is post-hoc. Symptom: causal claims SNF cannot support. Fix: use rankFeaturesByNMI and present it as post-hoc characterization; for a feature model use MOFA/DIABLO.
Trigger: treating dist2 output as Euclidean, passing alpha to affinityMatrix, or reading spectralClustering's K as neighbors. Mechanism: dist2 is squared, the width arg is sigma, and that K is the cluster count. Symptom: distorted affinities or the wrong number of clusters. Fix: take dist2(...)^(1/2), pass sigma, and read the K name in context.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| K neighbors ~20 (range 10-30) | Wang 2014 Nat Methods 11:333; SNFtool docs | sets the local kernel bandwidth; small K fragments, large K over-smooths and changes cluster count |
| sigma ~0.5 (range 0.3-0.8) | SNFtool docs | kernel width multiplier; wider blurs clusters, narrower sharpens noise |
| t iterations ~10-20 | Wang 2014 Nat Methods 11:333 | cross-diffusion converges; more iterations do little past convergence |
| estimateNumberOfClustersGivenGraph returns FOUR estimates | SNFtool docs | eigengap (K1/K12) and rotation cost (K2/K22); plausibility not proof |
| Fused must beat the best single omic to justify fusion | Rappoport and Shamir 2018 Nucleic Acids Res 46:10546 | multi-omics is not consistently better; check explicitly |
| Stability across a fixed (K, sigma) grid + permutation survival p | Monti 2003 Mach Learn 52:91 | resampling stability and a permutation p guard against artifact subtypes |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Distorted affinities / wrong-scale distances | dist2 output used as Euclidean | take dist2(...)^(1/2) |
| Clusters change unexpectedly | affinityMatrix width passed as alpha or wrong arg | the third argument is sigma |
| Wrong number of clusters | spectralClustering's K read as neighbors | that K is the cluster count |
| Function dist2 not found after first call | a variable named dist2 shadowing the function | name distance variables differently (d1, d2) |
| Many patients silently dropped | complete-data intersection on a mosaic cohort | report the drop; use NEMO for partial data |
| Hand-rolled feature ranking | reimplementing attribution with aov | use rankFeaturesByNMI(list_of_views, fused) |