gene-regulatory-networks/scenic-regulons/SKILL.md
Infer transcription factor regulons from single-cell RNA-seq with pySCENIC by combining GRNBoost2 co-expression, cisTarget motif-enrichment pruning, and AUCell per-cell activity scoring. Covers the motif-pruning-as-directionality principle, regulon specificity scoring, run-to-run stability, and database/species matching. Use when identifying TF regulons, scoring TF activity per cell, finding master regulators of cell identity, or comparing regulon activity across conditions. For enhancer-driven multiomic GRNs see multiomics-grn; for bulk inference and VIPER protein-activity see grn-inference.
npx skillsauth add GPTomics/bioSkills bio-gene-regulatory-networks-scenic-regulonsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pySCENIC 0.12+, ctxcore 0.2+, arboreto 0.1.6+, scanpy 1.10+, loompy 3.0+.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
The motif-DB machinery lives in ctxcore; a ctxcore/feather-format version mismatch is the most common silent failure. pySCENIC is most reliable on a dedicated Python 3.10 environment.
"Identify transcription factor regulons and score TF activity from my scRNA-seq data" -> Run the pySCENIC three-step pipeline: infer TF-target co-expression with GRNBoost2, prune to direct targets by cis-regulatory motif enrichment with cisTarget, then score per-cell regulon activity with AUCell.
pyscenic grn -> pyscenic ctx -> pyscenic aucellarboreto_with_multiprocessing.py for the GRN step (avoids the dask breakage)Step 1 (GRNBoost2) produces undirected co-expression only -- it is no better than WGCNA and inherits all of co-expression's confounding (indirect edges, batch, cell-cycle). The entire conceptual payload of SCENIC is Step 2 (cisTarget): for each module it asks whether the candidate TF's binding motif is significantly enriched (NES >= 3.0) in the cis-regulatory space of the module's targets, and keeps only the targets in the motif's leading edge. This (a) imposes a mechanistic prior -- the TF can physically bind near its retained targets, (b) breaks the symmetry of co-expression into a TF -> target direction, and (c) discards indirect targets. A "regulon" is by definition only the post-cisTarget TF plus its direct targets. Modules that were never pruned are co-expression modules, and calling them regulons misuses the word.
The second non-obvious consequence is AUCell: regulon activity is not TF expression. AUCell ranks genes within each cell and computes the area under the recovery curve for the regulon's gene set, so activity can be high even when the TF's own mRNA is dropout-zero (TF transcripts are sparse). Showing TF expression in place of regulon AUC -- or "validating" activity by its correlation with TF expression -- misses the method's point and is circular. SCENIC regulons remain motif-supported co-expression: a strong, directed hypothesis worth a knockdown, not proof of causal regulation.
| Step | Tool | Produces | Key parameter | Watch out for |
|------|------|----------|---------------|---------------|
| 1. GRN | GRNBoost2 (or GENIE3) | TF-target co-expression adjacencies | --seed, --num_workers | stochastic; not reproducible without a fixed seed |
| 2. Prune | cisTarget (ctxcore) | regulons (direct targets) | --nes_threshold 3.0, --rank_threshold 5000 | feather DB + motif2TF version must match |
| 3. Score | AUCell | per-cell regulon activity (AUC) | --auc_threshold 0.05 | this is the top-fraction, NOT the binarization cut |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| scRNA-seq, want TF regulons + per-cell activity | pySCENIC grn/ctx/aucell | the canonical workflow |
| GRN step hangs / KilledWorker | arboreto_with_multiprocessing.py | arboreto's dask backend breaks on newer dask |
| Need reproducible regulons | run GRN 10-100x, keep links recurring >80% | GRNBoost2/GENIE3 are stochastic |
| Which regulons mark a cell type | Regulon Specificity Score (RSS) | JSD-based specificity, not just magnitude |
| Paired scRNA + scATAC available | -> multiomics-grn (SCENIC+) | accessibility defines enhancers; eRegulons add the region layer |
| Bulk RNA-seq / want protein activity | -> grn-inference (ARACNe + VIPER) | SCENIC is single-cell; VIPER reads TF activity from bulk |
| Compare activity across conditions | run SCENIC once on the integrated object | raw AUC is population-relative; batch survives into regulons |
cisTarget needs three matched resources: ranking database(s), motif-to-TF annotations, and the TF list -- all the same species/assembly/symbol namespace. Download from resources.aertslab.org/cistarget/.
# Human hg38 gene-based rankings (~1.5 GB each). Run ctx with BOTH search-space DBs
# (500bp+100bp around TSS, and TSS +/-10kb) so the leading-edge logic pools them.
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.genes_vs_motifs.rankings.feather
wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl
# The ranking-DB version (mc9nr / v10) and the motif2tf annotation version MUST match.
Goal: Infer TF-target co-expression adjacencies as candidate regulatory modules.
Approach: Run GRNBoost2 via the bundled multiprocessing script (single-node, stable) rather than the dask backend, and fix the seed so the stochastic boosting is reproducible.
# arboreto's dask backend breaks on dask>=2.x (silent hangs, KilledWorker).
# The bundled multiprocessing wrapper is the supported workaround.
python arboreto_with_multiprocessing.py \
filtered.loom allTFs_hg38.txt \
--method grnboost2 --output adj.tsv \
--num_workers 8 --seed 42
Goal: Keep only TF-target links whose target genes are enriched for the TF's binding motif -- the step that confers directness and direction.
Approach: Load the ranking databases and motif2TF annotations, build candidate modules from the adjacencies, and run cisTarget pruning; targets surviving motif enrichment (NES >= 3.0) form the regulon.
import glob, pickle, pandas as pd
from pyscenic.utils import modules_from_adjacencies
from pyscenic.prune import prune2df, df2regulons
from ctxcore.rnkdb import FeatherRankingDatabase
adjacencies = pd.read_csv('adj.tsv', sep='\t')
expr = pd.read_csv('expr.csv', index_col=0) # cells x genes
modules = list(modules_from_adjacencies(adjacencies, expr))
dbs = [FeatherRankingDatabase(f, name=f) for f in glob.glob('*.genes_vs_motifs.rankings.feather')]
# rank_threshold=5000 matches the CLI default (the prune2df Python default is 1500).
df = prune2df(dbs, modules, 'motifs-v9-nr.hgnc-m0.001-o0.0.tbl', rank_threshold=5000)
regulons = df2regulons(df) # TF + direct targets only
with open('regulons.pkl', 'wb') as fh:
pickle.dump(regulons, fh)
CLI equivalent for steps 1-2 (pyscenic grn, then pyscenic ctx adj.tsv DB.feather --annotations_fname motifs.tbl --expression_mtx_fname filtered.loom -o reg.csv). ctx verified defaults: --rank_threshold 5000, --auc_threshold 0.05, --nes_threshold 3.0, --min_genes 20. --mask_dropouts now defaults to False (matching R SCENIC); it changes the TF-target correlation sign that splits activating (+) from repressing (-) regulons, so report the setting used.
Goal: Score each regulon's activity in every cell, robustly to dropout.
Approach: Rank genes within each cell, integrate the recovery curve over the top fraction (auc_threshold, default 0.05 = top 5%), and emit a cell-by-regulon AUC matrix.
from pyscenic.aucell import aucell
# auc_threshold = top 5% of the ranking integrated for the AUC -- NOT a binarization cut.
auc_mtx = aucell(expr, regulons, auc_threshold=0.05, num_workers=8)
auc_mtx.to_csv('auc_matrix.csv')
Goal: Surface the regulons that define each cell type and convert activity to on/off states for clustering.
Approach: Use the Regulon Specificity Score (Jensen-Shannon divergence vs an idealized cell-type-specific distribution) for identity regulators, and binarize the AUC distribution (bimodal -> density threshold) for state heatmaps.
from pyscenic.rss import regulon_specificity_scores
from pyscenic.binarization import binarize
cell_types = pd.read_csv('cell_types.csv', index_col=0)['cell_type']
rss = regulon_specificity_scores(auc_mtx, cell_types) # high RSS = identity regulator
binary_mtx, thresholds = binarize(auc_mtx) # per-regulon on/off
RSS (rewards specificity) and a per-cluster AUC z-score (rewards magnitude) can disagree; prefer RSS for "which regulon marks this cluster."
Trigger: skipping ctx, or dropping the NES threshold to admit everything. Mechanism: without motif enrichment the output is co-expression, not direct regulation. Symptom: no motif DB/version reported; implausibly large "regulons." Fix: always run cisTarget; report DB + motif2TF versions and the search-space windows.
Trigger: native arboreto on dask>=2.x. Mechanism: scheduler incompatibility. Symptom: silent hang or KilledWorker. Fix: use arboreto_with_multiprocessing.py (single-node, stable).
Trigger: mouse genes against an hg38 ranking DB, or HGNC vs MGI symbol mismatch. Mechanism: gene IDs do not map into the database. Symptom: near-empty regulon set. Fix: match expression IDs, ranking DB, and motif2TF to one species/assembly/namespace.
Trigger: comparing raw AUC across separately-run SCENIC analyses or strong batches. Mechanism: AUC is relative to the population it was ranked within; batch-driven co-expression can pass motif enrichment by chance. Symptom: a "condition-specific regulator" that tracks the batch. Fix: run SCENIC once on the integrated object; sanity-check condition regulons against batch.
Trigger: using _extended regulons for direct-binding claims, or building a story on (-) repressor activity. Mechanism: _extended adds orthology/similarity-inferred (low-confidence) motif annotations; negative regulons are sparse and weakly enriched. Symptom: direct-regulation claims from low-confidence edges. Fix: default to high-confidence positive regulons; treat _extended/(-) as hypotheses.
| Threshold | Source | Rationale | |-----------|--------|-----------| | NES >= 3.0 (motif enrichment) | Aibar 2017 / iRegulon (Janky 2014) | recovery-curve enrichment cutoff defining a supported motif | | auc_threshold = 0.05 (top 5%) | pySCENIC default | fraction of the ranking integrated for the AUC | | GRN reruns: keep links recurring >80% of runs | Van de Sande 2020 | GRNBoost2/GENIE3 are stochastic; recurrence = high confidence | | min_genes = 20 per regulon | pySCENIC default | smaller target sets give unstable AUC | | >= a few hundred cells per cell type | practical | rare clusters and doublets inflate spurious regulons |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| "not a cisTarget Feather database in v1 or v2 format" | ctxcore/DB version mismatch | download current DB; align ctxcore version |
| empty regulon set | species/assembly or symbol mismatch | match gene IDs to the DB namespace |
| different regulons each run | unset seed in GRN step | fix --seed; run multiple seeds and intersect |
| activity != TF expression confuses the reader | conflating regulon AUC with TF mRNA | report AUCell activity; that independence is the point |
| ctx returns nothing | missing/mismatched --annotations_fname | supply matching motif2TF; check DB is gene-based (not region-based) |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.