gene-regulatory-networks/perturbation-simulation/SKILL.md
Simulate transcription factor perturbation effects on cell state in silico with CellOracle and Dynamo, and predict transcriptional responses to genetic perturbations with GEARS, scGen, and CPA. Covers the direction-not-magnitude principle, local-linear validity, the GRN/velocity error it inherits, baseline discipline (mean and additive baselines), and the validation gap. Use when predicting TF knockout or overexpression effects, ranking driver TFs for fate transitions, or planning perturbation experiments. For GRN construction see multiomics-grn; for experimental Perturb-seq see single-cell/perturb-seq.
npx skillsauth add GPTomics/bioSkills bio-gene-regulatory-networks-perturbation-simulationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: CellOracle 0.18+, scanpy 1.10+, anndata 0.10+; Dynamo (dynamo-release 1.4+), GEARS, CPA (scvi-tools ecosystem) where used.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Predict what happens if I knock out this transcription factor" -> Propagate a forced expression change through a learned GRN (or a learned velocity vector field), then project the shift onto the cell-state manifold to predict the DIRECTION cells move.
celloracle.Oracle for GRN-based in silico KO/overexpressiondynamo for vector-field (Jacobian) perturbation; GEARS/CPA for response predictionIn silico perturbation outputs a shift vector whose orientation is the deliverable; its absolute magnitude is uncalibrated. CellOracle propagates an expression shift (not absolute counts) and is explicit that it is for hypothesis generation, predicting direction not magnitude. The machinery is a local linear approximation: CellOracle's per-cluster GRN is a regularized linear model iterated a small fixed number of steps (n_propagation default 3), and Dynamo's perturbation is a first-order Taylor expansion of the learned vector field via its Jacobian. So predictions are valid only for small perturbations near the observed manifold, for one or a few steps -- they cannot capture long-range feedback, trans effects, or transitions to attractors not already in the data (CellOracle moves probability mass among observed states on the KNN graph; it cannot invent a new cell type). And every prediction inherits the errors of its substrate: CellOracle is only as good as its GRN, Dynamo only as good as its RNA-velocity estimate (which has documented reliability problems, Bergen 2021; Gorin 2022).
The 2025 baseline discipline is non-negotiable: for perturbation response prediction, deep and foundation models do not consistently beat trivial baselines -- Ahlmann-Eltze, Huber & Anders 2025 (Nat Methods 22:1657) show that for unseen single perturbations no DL model beats predicting the training mean response, and for combinations none beats the additive baseline (summing single-gene effects). Always report the mean baseline (unseen singles) and additive baseline (combinations); a method that does not beat them adds no value. And ask the validation question of any result: how many predictions were tested against a real knockout, and were the failures reported?
| Method | Citation | Mechanism | Predicts | Inherits errors of | |--------|----------|-----------|----------|--------------------| | CellOracle | Kamimoto 2023 Nature | shift propagated through per-cluster linear GRN, projected on KNN graph | direction of cell-state shift | the GRN (base + regression) | | Dynamo | Qiu 2022 Cell | Jacobian of a learned analytical vector field (Df = J*Dx) | redirected fate / least-action path | RNA-velocity estimate | | GEARS | Roohani 2024 Nat Biotechnol | GNN over gene + GO graphs | unseen single & combinatorial responses, GI type | training Perturb-seq + graphs | | scGen / CPA | Lotfollahi 2019/2023 | latent-space arithmetic / disentangled composable embeddings | response to known/composed perturbations | training distribution | | Boolean / ODE (BoolNet, BoolODE) | Müssel 2010 | mechanistic logic/kinetic simulation; attractors | extrapolative fate, multistability | the hand-curated wiring |
Data-driven methods (CellOracle/Dynamo/DL) are genome-scale, local, linear, direction-only; mechanistic Boolean/ODE models are hand-curated and small but nonlinear and extrapolative. State which regime a result belongs to.
| Scenario | Recommended | Why | |----------|-------------|-----| | Predict TF KO/OE effect on cell fate from scRNA + accessibility prior | CellOracle | GRN-based shift onto the developmental flow; direction + perturbation score | | Have high-quality (labeling-based) velocity, want fate redirection | Dynamo | learned vector field + Jacobian; no GRN prior needed | | Predict response to unseen perturbations / combinations | GEARS or CPA | generalize via gene/GO graphs or composable embeddings -- but check baselines | | Need mechanistic attractor/multistability reasoning | Boolean/ODE (BoolNet) | only family that extrapolates beyond observed states | | Build the base GRN that CellOracle perturbs | -> multiomics-grn | base GRN construction lives there | | Validate predictions experimentally | -> single-cell/perturb-seq | Perturb-seq is the interventional ground truth |
Goal: Learn context-specific (per-cluster) regulatory weights on top of a base-GRN prior.
Approach: Import scRNA-seq and the base GRN, impute, then fit a regularized linear model per cluster and filter links by significance.
import scanpy as sc
import celloracle as co
import pandas as pd
adata = sc.read_h5ad('clustered.h5ad') # normalized, log, PCA, clustered
oracle = co.Oracle()
oracle.import_anndata_as_raw_count(adata=adata, cluster_column_name='cell_type',
embedding_name='X_umap')
oracle.import_TF_data(TF_info_matrix=pd.read_parquet('base_grn.parquet')) # prior; see multiomics-grn
oracle.perform_PCA()
oracle.knn_imputation(n_pca_dims=50, k=None, balanced=True, b_sight=3000, b_maxl=1500)
links = oracle.get_links(cluster_name_for_GRN_unit='cell_type', alpha=10)
links.filter_links(p=0.001, weight='coef_abs', threshold_number=2000)
oracle.get_cluster_specific_TFdict_from_Links(links_object=links)
oracle.fit_GRN_for_simulation(alpha=10, use_cluster_specific_TFdict=True)
Goal: Predict the direction cells move under a TF knockout or overexpression.
Approach: Set the gene of interest to 0 (KO) or above-max (OE), propagate the shift a few steps, estimate KNN transition probabilities, compute the per-cell embedding shift, and score it against the developmental vector field by inner product.
import numpy as np
# perturb_condition sets the clamped value: 0.0 = knockout; above observed max = overexpression.
# n_propagation is small BY DESIGN (3); increasing it amplifies linear-approximation error.
oracle.simulate_shift(perturb_condition={'GATA1': 0.0}, n_propagation=3)
oracle.estimate_transition_prob(n_neighbors=200, knn_random=True, sampled_fraction=1)
oracle.calculate_embedding_shift(sigma_corr=0.05)
# Perturbation score = inner product of the simulated shift with the developmental flow.
# It is meaningless without a defined development vector field (Gradient_calculator below).
from celloracle.applications import Gradient_calculator
grad = Gradient_calculator(oracle_object=oracle, pseudotime_key='pseudotime')
grad.calculate_p_mass(smooth=0.8, n_grid=40, n_neighbors=200)
grad.calculate_mass_filter(min_mass=0.01, plot=False) # required before calculate_gradient
grad.calculate_gradient()
shift = np.sqrt((oracle.adata.obsm['delta_embedding'] ** 2).sum(axis=1)) # magnitude is uncalibrated
Goal: Predict fate redirection from a learned dynamical system rather than a GRN prior.
Approach: Reconstruct the analytical vector field, compute its Jacobian, then apply a genetic perturbation as a local linear response.
import dynamo as dyn
dyn.vf.VectorField(adata, basis='umap') # learn the analytical vector field first
dyn.vf.jacobian(adata, regulators=['GATA1'], effectors=['SPI1'])
# expression is a LIST aligned to genes (negative = suppress); perturbation writes a new basis.
dyn.pd.perturbation(adata, 'GATA1', [-100], emb_basis='umap')
dyn.pl.streamline_plot(adata, color='cell_type', basis='umap_perturbation')
Goal: Establish whether a perturbation-prediction model beats triviality before trusting it.
Approach: Compare predictions against the training-mean response (unseen single perturbations) and the additive baseline (combinations).
import numpy as np
# Mean baseline: predict the average perturbed profile across the training perturbations.
mean_baseline = train_perturbed.mean(axis=0)
# Additive baseline for a double perturbation: sum the two single-gene log-fold-changes.
additive_pred = lfc_singleA + lfc_singleB
# A model only "works" if its error is below these baselines (Ahlmann-Eltze 2025).
Trigger: citing a predicted fold-change as an effect size. Mechanism: the methods are direction-only; amplitude is uncalibrated. Symptom: quantitative KO transcriptome claims. Fix: report direction / perturbation score; validate magnitude experimentally.
Trigger: a DL/foundation-model perturbation predictor reported without baselines. Mechanism: mean (singles) and additive (combos) baselines are often competitive. Symptom: "state of the art" with no mean/additive comparison. Fix: report both baselines; require the model to beat them.
Trigger: extreme overexpression, knocking out a gene not expressed in the cluster, or expecting a brand-new cell type. Mechanism: local linear validity is violated; CellOracle can only move mass among observed states. Symptom: implausible jumps. Fix: keep perturbations small and near observed states; interpret one step.
Trigger: raising n_propagation to get "long-range" effects. Mechanism: iterating a linear model amplifies approximation error. Symptom: runaway shifts. Fix: keep the small default; treat it as a local estimate.
Trigger: treating the CellOracle GRN as ground truth, or running Dynamo on noisy splicing-based velocity. Mechanism: predictions inherit GRN/velocity error (Bergen 2021; Gorin 2022; and a 2024 preprint critiques CellOracle's GRN for ignoring distal interactions). Symptom: confident predictions on a poorly-validated network/field. Fix: sanity-check the GRN/velocity; present results as hypotheses.
| Threshold | Source | Rationale | |-----------|--------|-----------| | n_propagation = 3 | CellOracle default | small by design; local linear validity | | KO value = 0.0; OE = above observed max | CellOracle convention | clamped perturbation input | | link filter p<0.001, top ~2000 by coef_abs | CellOracle tutorial | sparsify the GRN before simulation | | mean baseline (unseen singles), additive (combos) | Ahlmann-Eltze 2025 | the bar any predictor must clear | | sigma_corr ~0.05 | CellOracle default | embedding-shift smoothing |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| perturbation score all ~0 / meaningless | no development vector field defined | run Gradient_calculator before scoring |
| KeyError on gene in simulate_shift | gene absent from the fitted GRN | confirm the gene is in the base GRN and expressed |
| Dynamo perturbation errors | vector field/Jacobian not computed | run dyn.vf.VectorField then dyn.vf.jacobian first |
| huge predicted shift | n_propagation too high / out-of-manifold perturbation | keep n_propagation small; perturb near observed states |
| "model beats prior methods" but no baseline | missing mean/additive comparison | add baselines (Ahlmann-Eltze 2025) |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.