gene-regulatory-networks/multiomics-grn/SKILL.md
Build enhancer-driven gene regulatory networks (eGRNs) by integrating single-cell RNA-seq and ATAC-seq using SCENIC+, CellOracle base GRNs, Pando, FigR, DIRECT-NET, TRIPOD, and scMEGA. Covers the accessibility-defines-enhancers principle, peak-to-gene linking and its cell-composition confound, the paired-vs-unpaired decision, and TF-region-gene eRegulon triplets. Use when analyzing 10x multiome or paired/unpaired scRNA+scATAC to infer cis-regulatory GRNs. For RNA-only regulons see scenic-regulons; for in silico TF perturbation see perturbation-simulation.
npx skillsauth add GPTomics/bioSkills bio-gene-regulatory-networks-multiomics-grnInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: SCENIC+ (current Snakemake workflow), pycisTopic 2.0+, pycistarget 1.0+, scanpy 1.10+, MACS3 3.0+; FigR/Signac/ArchR (R).
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
SCENIC+ has undergone major API churn: the manual-object API (cisTopicObject, separate pycistarget, SCENICPLUS objects) is superseded by a Snakemake workflow (scenicplus init_snakemake). Pre-2024 tutorials are stale; verify against scenicplus.readthedocs.io before coding.
"Build an enhancer-driven gene regulatory network from my multiome data" -> Integrate scRNA-seq and scATAC-seq to identify eRegulons: transcription-factor -> enhancer-region -> target-gene triplets that link TF motif occupancy in accessible chromatin to gene expression.
The advance of multiomic GRN inference over expression-only methods is where the candidate regulatory regions come from: expression-only tools can only search promoter-proximal motifs, while scATAC nominates the actual distal enhancers active in these cells -- and most cell-type-specific regulatory information is distal. So an eGRN edge is a triplet (TF -> region -> gene), with the region as the mechanistic anchor available for validation (ChIP/CUT&Tag, CRISPRi, reporter). But the reasoning chain leaks at every step: motif-present does not mean the TF binds (motifs are short, degenerate, and shared across a family -- the model cannot tell GATA1 from GATA2); accessible does not mean this TF holds the peak open; and peak-gene correlation does not mean the peak controls the gene. Treat an eGRN as a prioritized hypothesis list, not a wiring diagram.
The hardest and least-appreciated step is peak-to-gene linking, which is confounded by cell-type composition: across a heterogeneous dataset, any peak open in a cell type correlates with any gene expressed in that same type, whether or not the peak regulates it -- cell identity is a massive shared latent factor. Genome-wide peak-gene correlation therefore mostly recovers co-marker pairs. Mitigations (distance window, within-cell-type or GC/accessibility-matched null, requiring a motif, requiring the TF to be co-expressed) reduce but never eliminate it. Metacell aggregation, needed to beat scATAC sparsity, then introduces pseudo-replication: metacell-derived p-values are not calibrated significance and should be treated as ranking scores.
| Method | Citation | Approach | Data regime | Note |
|--------|----------|----------|-------------|------|
| SCENIC+ | Bravo Gonzalez-Blas 2023 Nat Methods | topics -> motif enrichment -> region-to-gene & TF-to-gene GBM -> eRegulons | paired or separate | reference eGRN method; heavy (Snakemake, cluster job) |
| CellOracle base GRN | Kamimoto 2023 Nature | motif scan in Cicero co-accessible regions; prebuilt base GRNs | scRNA alone OK | base GRN is a prior; feeds perturbation-simulation |
| Pando | Fleck 2023 Nature | regression with TF x peak-accessibility interaction term | paired (Seurat) | regions = peaks intersect conserved/annotated CREs |
| FigR | Kartha 2022 Cell Genomics | DORCs (genes with many correlated peaks) -> TF-DORC scores | SHARE-seq/paired | pairCells for unpaired (pairing caps quality) |
| DIRECT-NET | Zhang 2022 Sci Adv | XGBoost CRE-gene importance -> TF via motif | paired or scATAC alone | |
| TRIPOD | Jiang 2022 Cell Syst | nonparametric TF-peak-gene trio test, matched/conditional | paired | strong false-positive control |
| scMEGA | Li 2023 Bioinform Adv | integrate -> trajectory -> TF-gene network | paired, trajectory | lighter-weight |
| GLUE | Cao & Gao 2022 Nat Biotechnol | graph-linked latent embedding | unpaired/diagonal | run first to integrate, then a paired method |
| Scenario | Recommended | Why | |----------|-------------|-----| | Paired 10x Multiome / SHARE-seq, want full eGRN | SCENIC+ | reference method; eRegulon triplets + activity | | Paired data, lighter/faster | Pando or DIRECT-NET | single-regression / XGBoost, less infrastructure | | Rigorous false-positive control on trios | TRIPOD | conditional/matched testing removes the composition confound | | scRNA-seq only (no ATAC) | -> CellOracle prebuilt base GRN | base GRN is a prior; no paired data needed | | Unpaired scRNA + scATAC (separate experiments) | GLUE to integrate first, then FigR/SCENIC+ | computed pairing caps all downstream link confidence | | RNA-only regulons, no enhancers needed | -> scenic-regulons | promoter-proximal motif pruning suffices | | Goal is in silico TF perturbation | -> perturbation-simulation | CellOracle/Dynamo simulate; build the base GRN here |
Goal: Assemble eRegulons (TF -> enhancer -> gene) from paired or separate scRNA + scATAC.
Approach: Topic-model the ATAC with pycisTopic, call consensus peaks from per-cell-type pseudobulk (so cell-type labels are needed before peak calling), run pycistarget motif enrichment, then region-to-gene and TF-to-gene GBM regression to build triplets; orchestrate with Snakemake.
# Initialize and run the Snakemake workflow (the supported modern entry point).
# init_snakemake scaffolds a config.yaml pointing at the scATAC fragments, the scRNA
# AnnData, the motif/cisTarget databases, and the cell-type annotation used for pseudobulk.
# scenicplus init_snakemake --out_dir scenicplus_run
# (edit scenicplus_run/Snakemake/config/config.yaml, then run from inside that dir:)
# cd scenicplus_run/Snakemake && snakemake --cores 16
# Region-to-gene search space defaults to min 1kb / max 150kb from the gene, capped at the
# nearest neighboring gene's promoter -- narrower than the +/-500kb used by ArchR/Signac.
# Inspect the resulting eRegulons (TF -> region -> gene triplets). The output filename and
# directory are set in config.yaml (output_data); the direct (high-confidence) and extended
# (motif-similarity-inferred) tables are written there. The exact spelling has varied across
# versions, so resolve it by glob rather than hard-coding.
import glob, pandas as pd
ereg_file = glob.glob('scenicplus_run/**/eRegulon*direct*.tsv', recursive=True)[0]
eregulons = pd.read_csv(ereg_file, sep='\t')
summary = (eregulons.groupby('TF')
.agg(n_regions=('Region', 'nunique'), n_genes=('Gene', 'nunique'))
.sort_values('n_genes', ascending=False))
# Direct vs extended is a motif-to-TF annotation CONFIDENCE distinction, not topology:
# direct = curated/orthology; _extended adds motif-similarity-inferred (larger, noisier).
Goal: Build a base GRN -- the candidate TF -> gene scaffold -- from accessibility, as a prior for context-specific modeling.
Approach: Define active regions by Cicero co-accessibility (in R), scan them for TF motifs, and format the result as the base GRN; or load a prebuilt base GRN and skip ATAC entirely.
import celloracle as co
import pandas as pd
# Custom base GRN: peaks already filtered to Cicero co-accessible, promoter-linked regions.
peaks = pd.read_parquet('processed_peak_file.parquet') # columns: peak_id, gene_short_name
tfi = co.motif_analysis.TFinfo(peak_data_frame=peaks, ref_genome='hg38')
tfi.scan(fpr=0.02) # motif FPR
tfi.filter_motifs_by_score(threshold=10)
tfi.make_TFinfo_dataframe_and_dictionary()
base_grn = tfi.to_dataframe()
# Or skip ATAC: prebuilt base GRN as a prior (CellOracle ships ~10 species + mouse atlas).
# base_grn = co.data.load_mouse_scATAC_atlas_base_GRN()
The base GRN constrains which TF -> gene edges are allowed; the context-specific weights are then learned per cluster in perturbation-simulation.
Goal: Identify domains of regulatory chromatin (DORCs) and the TFs that regulate them.
Approach: Correlate peaks to genes, call DORCs (genes with an unusually large number of significant peaks), then score TF-DORC associations from motif enrichment plus TF expression correlation.
library(FigR)
# Step 1: peak-gene correlations (smoothed over KNN metacells for sparsity)
cisCor <- runGenePeakcorr(ATAC.se = atac_se, RNAmat = rna_mat,
genome = 'hg38', nCores = 8, p.cut = 0.05)
# Step 2: DORC scores then TF-DORC regulation scores (signed: activator/repressor)
dorcMat <- getDORCScores(atac_se, cisCor, geneList = unique(cisCor$Gene), nCores = 8)
figR <- runFigRGRN(ATAC.se = atac_se, dorcTab = cisCor, dorcMat = dorcMat,
rnaMat = rna_mat, genome = 'hg38', nCores = 8)
Trigger: genome-wide peak-gene correlation with no within-type control. Mechanism: any peak open in a cell type correlates with any gene expressed in it. Symptom: "links" that are just cell-type co-markers. Fix: restrict to the distance window, use a matched null (Signac) or within-type correlation, and require a motif.
Trigger: reporting astronomically small p-values from KNN-smoothed metacells. Mechanism: metacells are not independent (overlapping cells). Symptom: millions of "significant" links. Fix: treat metacell p-values as ranking scores; apply FDR honest about pseudo-replication.
Trigger: naming a specific TF (GATA1) from a family motif. Mechanism: paralogs share near-identical motifs. Symptom: confident single-TF claims where only a family is detectable. Fix: report the motif/family and require orthogonal evidence to single out a member.
Trigger: computing "joint" correlations on separately-measured scRNA and scATAC. Mechanism: cells were computationally matched, not co-measured. Symptom: no pairing method or accuracy reported. Fix: integrate with GLUE/anchors first; report pairing quality; treat links as upper-bounded by it.
Trigger: a published wiring diagram with no orthogonal validation. Mechanism: accessibility != binding != regulation, and motif/proximity validation is circular. Symptom: no ChIP/CRISPRi/perturbation check; validation uses the same motif DB used to build the net. Fix: validate against independent ChIP-seq or perturbation data.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Region-to-gene window: 1kb-150kb (SCENIC+) | Bravo Gonzalez-Blas 2023 | capped at nearest gene promoter; narrower than +/-500kb conventions |
| Peak-gene window +/-500kb (ArchR/Signac/Cicero) | tool defaults | distal enhancer reach; pair with a matched null |
| MACS3 --keep-dup all, BEDPE/shift-extsize | ATAC convention | fragment-based peak calling for the region universe |
| Motif FPR ~0.02 (CellOracle scan) | CellOracle default | motif-match false-positive rate |
| eRegulon: direct vs _extended | SCENIC+ | direct = curated motif2TF; extended adds inferred (noisier) |
| Error / symptom | Cause | Solution | |-----------------|-------|----------| | stale SCENIC+ API errors | following pre-2024 manual-object tutorials | use the Snakemake workflow; check scenicplus.readthedocs.io | | consensus peak step fails | no cell-type labels for pseudobulk | annotate cell types before peak calling | | feather DB format error | region-based vs gene-based DB mismatch / old ctxcore | use region-based DBs for ATAC; align versions | | huge spurious link set | composition confound + no matched null | window + within-type/matched-null + motif requirement | | empty eRegulons | species/assembly/motif-collection mismatch | match genome, motif DB, and gene IDs |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.