alternative-splicing/single-cell-splicing/SKILL.md
Analyzes alternative splicing at single-cell resolution. The first decision is library chemistry — 10X 3' is fundamentally limited (RT primes from poly-A, R2 falls in 3' UTR, <0.1 junction read per cell per AS event). Plate-based full-length methods (Smart-seq3, FLASH-seq, VASA-seq, STORM-seq) and single-cell long-read (MAS-Iso-seq, scISOr-Seq2) are the chemistries that give per-cell isoform structure. Tools include MARVEL (R, Smart-seq integrated), BRIE2 (Bayesian PSI with regulatory features and ELBO_gain test), scQuint (junction-cluster, plate-based; not for 10X), SpliZ (annotation-free Z-score), Psix (graph-smoothness regulated AS), and Sierra (alternative polyadenylation, often confused with AS). Use when analyzing isoform usage in scRNA-seq, identifying cell-type-specific splicing, or determining whether scRNA-seq chemistry supports splicing analysis at all.
npx skillsauth add GPTomics/bioSkills bio-single-cell-splicingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: MARVEL 2.0+, BRIE2 0.2.4+, scQuint 0.1+, SpliZ 0.0.1+, Sierra 1.0+, Psix 0.1+, anndata 0.10+, scanpy 1.10+, pandas 2.2+, scipy 1.13+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parameters<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
The fundamental decision is chemistry, not tool. Most droplet 3' scRNA-seq cannot support transcriptome-wide splicing inference because reverse transcription primes from the poly(A) tail and most reads land in the 3' UTR — far from CDS-region splicing events. Plate-based full-length methods and single-cell long-read sequencing are the chemistries that give per-cell isoform structure across the gene body.
Three compounding mechanisms make 10X Chromium 3' (v3.1, GEM-X, v4) hostile to splicing:
Quantitative estimate: Only a small fraction of cassette exons sit close enough to the polyA site to be sampled by 3' chemistry (empirical estimates from APA/3'-end atlases — see Tian & Manley 2017 Nat Rev Mol Cell Biol for the 3' UTR isoform landscape). Effective junction read yield from 10X 3' is <0.1 per cell per AS event — vs the 5-10 needed for stable per-cell PSI. Most splicing analyses on 10X 3' data report artifacts.
The 5' kit (10X 5' GEX) does not solve this — it shifts capture from 3' UTR to 5' UTR / TSS-proximal regions. Marginal improvement; not a transcriptome-wide solution. Note that V(D)J recovery requires the 10X Chromium Single Cell Immune Profiling kit (with TCR/BCR-specific enrichment), not 5' GEX alone — postdocs designing immune-repertoire experiments must use the dedicated V(D)J kit.
| Chemistry | Splicing analysis viable? | Best alternative if no | |-----------|----------------------------|--------------------------| | 10X 3' (Chromium v3, GEM-X, v4, Flex) | No (transcriptome-wide); maybe near-3'-end events | Sierra for APA | | 10X 5' GEX | Limited; near-5'-end events only | Sierra for alternative TSS; switch to MAS-Iso-seq | | Smart-seq2 | Yes (full transcript) | MARVEL or BRIE2 | | Smart-seq3 / Smart-seq3xpress | Yes + UMI molecule counting | MARVEL or BRIE2 | | FLASH-seq | Yes (faster, cheaper Smart-seq3) | MARVEL or BRIE2 | | VASA-seq | Yes + total RNA (incl. nascent, IR) | MARVEL with IR analysis | | STORM-seq | Yes + total RNA + ribodepletion | MARVEL with IR analysis | | MAS-Iso-seq + 10X 5' (PacBio Kinnex) | Yes — full isoforms per cell | FLAMES, scNanoGPS, IsoQuant, see long-read-splicing | | scISOr-Seq2 (PacBio + 10X) | Yes — full isoforms with cell-typing | FLAMES, IsoQuant | | ONT direct cDNA scRNA | Yes | FLAMES | | ONT direct RNA scRNA | Yes + native modifications | FLAMES |
| Tool | Best for | Input | Strengths | Fails when |
|------|----------|-------|-----------|------------|
| MARVEL | Smart-seq plate-based and (v2+) 10X droplet unified workflow | Plate or droplet BAMs + Seurat | SE/A5SS/A3SS/MXE/RI/AFE/ALE; modality classification; native Seurat integration; v2 droplet support | R-only |
| BRIE2 | Plate-based with regulatory feature prior | Plate BAM + GFF3 events | Bayesian variational PSI + ELBO_gain test; principled uncertainty; CLI-driven (brie-count, brie-quant) | TensorFlow dependency; slow at scale |
| scQuint | Plate-based annotation-free junction-cluster quantification (validated on Smart-seq2) | STAR junctions across cells | Cluster-level junction usage; latent Dirichlet | Authors recommend AGAINST use on 10X 3'/5' data (3'-bias confounds); plate-based only |
| SpliZ | Annotation-free discovery of cell-state-associated splicing | STAR-aligned BAMs | Per-gene Z-score; no event database needed | Annotation-free = power tradeoff |
| Psix | Regulated AS along trajectories | PSI matrix + kNN graph | Tests graph smoothness; robust to dropout | Needs cell-state graph upstream |
| Sierra | APA in 10X 3' (NOT splicing) | 10X BAM + GTF | Peak-calling 3' ends; DEXSeq DTU on UTR isoforms | APA only; not for cassette exons |
| pseudobulk leafcutter / rMATS | Between-cell-type differential splicing | Aggregated BAMs | Bulk-level statistical power | Loses within-cluster heterogeneity |
| MAS-Iso-seq + FLAMES | Full-length single-cell isoforms | 10X 5' + PacBio Kinnex | Full isoforms per cell at scale | Cost; complex pipeline |
| Goal | Recommended approach |
|------|----------------------|
| "Will my 10X 3' data support splicing?" | No transcriptome-wide; consider Sierra for APA. Note: scQuint authors recommend against use on 10X data |
| Cassette exon analysis in cell types from Smart-seq2 | MARVEL with ComputePSI + AssignModality + CompareValues |
| Discover cell-state-associated splicing without an event database | SpliZ |
| Test regulated AS along developmental pseudotime | Psix |
| Per-cell PSI with uncertainty in low-coverage cells | BRIE2 |
| Differential splicing between two well-defined cell types | Pseudobulk leafcutter or rMATS on aggregated BAMs |
| APA (alternative polyadenylation, often confused with AS) | Sierra |
| Full-length single-cell isoforms at scale | MAS-Iso-seq + FLAMES (long-read) |
| Microexons (3-27 nt) | Long-read or aligner with low overhang (uLTRA, deSALT) |
| snRNA-seq (nuclei) — IR question | Library captures nuclear RNA enriched for incomplete splicing — interpret IR cautiously |
Goal: Run a unified workflow from STAR junctions to cell-type-specific splicing calls.
Approach: Build a wide splice-junction count matrix (rows = junctions keyed by coord.intron, columns = cells), assemble per-event feature tables, then construct MARVEL object with named slots (SpliceJunction, SplicePheno, SpliceFeature, IntronCounts, GeneFeature, Exp, GTF). Quantify PSI per event class, classify modality, test differential splicing.
library(MARVEL); library(Seurat); library(data.table)
seurat_obj <- readRDS('cells.rds')
# Build wide SJ matrix: first column 'coord.intron' (e.g. 'chr1:100007082:100022621'),
# subsequent columns are per-cell sample IDs with junction counts as values.
# This is constructed from STAR SJ.out.tab files (one per cell) merged on intron coord.
sj_files <- list.files('star_pass2/', pattern='SJ.out.tab$', full.names=TRUE)
sj_long <- rbindlist(lapply(sj_files, function(f) {
d <- fread(f, sep='\t', header=FALSE,
col.names=c('chr','start','end','strand','motif','annot','unique','multi','overhang'))
d$coord.intron <- paste(d$chr, d$start, d$end, sep=':')
d$sample <- gsub('_SJ.out.tab$', '', basename(f))
d[, .(coord.intron, sample, unique)]
}))
sj <- dcast(sj_long, coord.intron ~ sample, value.var='unique', fill=0)
# SpliceFeature is a NAMED LIST keyed by event class
df.feature.list <- list(
SE = read.table('events_SE.txt', header=TRUE, sep='\t'),
A5SS = read.table('events_A5SS.txt', header=TRUE, sep='\t'),
A3SS = read.table('events_A3SS.txt', header=TRUE, sep='\t'),
MXE = read.table('events_MXE.txt', header=TRUE, sep='\t'),
RI = read.table('events_RI.txt', header=TRUE, sep='\t')
)
# SplicePheno: per-cell metadata; sample.id column maps to SpliceJunction column names
df.pheno <- [email protected]
df.pheno$sample.id <- rownames(df.pheno)
marvel <- CreateMarvelObject(
SpliceJunction = sj,
SplicePheno = df.pheno,
SpliceFeature = df.feature.list,
GeneFeature = read.table('gene_features.tsv', header=TRUE, sep='\t'),
Exp = read.table('tpm.tsv', header=TRUE, sep='\t', row.names=1),
GTF = rtracklayer::import('annotation.gtf')
)
marvel <- ComputePSI(marvel, CoverageThreshold=10, EventType='SE')
marvel <- AssignModality(marvel, EventType='SE')
marvel <- CompareValues(
marvel,
cell.group.g1 = neurons, cell.group.g2 = glia,
method = 'wilcox', n.cells = 25, psi.delta = 0.1
)
For 10X droplet data, MARVEL v2+ provides CreateMarvelObject.10x() and AnnotateSJ.10x() constructors. Verify the exact API via ?CreateMarvelObject.10x in installed MARVEL.
MARVEL classifies events into modalities (Song 2017 Mol Cell): included (PSI~1), excluded (PSI~0), bimodal (mixture at 0/1), middle (peaked ~0.5), multimodal. Bimodality usually reflects mixed cell states or stochastic monoallelic-like bursting. Mid-modality (peaked at 0.5) can be technical (mixed cells in a droplet) — confirm with full-length data.
Goal: Estimate per-cell PSI with informative regulatory-feature prior; test cell-state association via likelihood-ratio testing on covariate effects.
Approach: BRIE2 is a CLI-driven workflow (brie-count for read counting, brie-quant for variational inference + LRT). Prepare a GFF3 of splicing events, count cell-barcoded junction reads, then fit the model with covariate testing.
# 1. Count splicing events per cell
brie-count \
-a splicing_events.gff3 \
-S sample_list.tsv \
-o brie_counts/ \
-p 16
# 2. Fit BRIE2 with LRT against the cell-type covariate
brie-quant \
-i brie_counts/brie_count.h5ad \
-c cell_metadata.tsv \
-o brie_quant.h5ad \
--interceptMode gene \
--LRTindex All \
--testBase null \
--MCsize 3 \
--batchSize 1000000 \
-p 16
--interceptMode gene fits a gene-specific intercept (recommended); --LRTindex All tests all covariates; --testBase null uses the null model as the LRT reference. Verify exact flag set via brie-quant -h in installed BRIE2.
import scanpy as sc
adata_splice = sc.read_h5ad('brie_quant.h5ad')
# Per-event covariate effects, ELBO values, and LRT statistics live in
# adata_splice.varm and adata_splice.var; column names depend on BRIE2 version.
# Inspect with: print(adata_splice); print(adata_splice.varm.keys())
# Per-event significance is typically derived from LRT delta-ELBO.
BRIE2 (Huang & Sanguinetti 2021 Genome Biol) uses a sequence-derived feature prior (exon length, GC content, splice site strength, motif counts) to regularize PSI estimates in low-coverage cells. The LRT-based covariate test answers "is this event associated with cell state?" without requiring per-cell PSI accuracy. Threshold the delta-ELBO at ~3 (analogous to log-Bayes-factor); confirm against version-specific output keys via the brie-tutorials repo.
Goal: Identify splicing-defined cell populations without an event database.
Approach: Compute per-gene splicing Z-score across cells; test for cell-state association via permutation.
spliz \
--bams sample1.bam sample2.bam \
--metadata cell_metadata.tsv \
--gtf annotation.gtf \
--output spliz_output/ \
--threads 8
SpliZ (Olivieri 2022 Nat Methods) is robust to dropout because it pools junction information across the gene; particularly useful for discovering splicing diversity in heterogeneous tumor samples.
Goal: Detect AS that varies coherently with cell state along a developmental trajectory, robust to dropout.
Approach: Score whether observed PSI is smooth on the cell-cell kNN graph from expression-space embedding.
import psix
import scanpy as sc
adata = sc.read_h5ad('cells.h5ad')
sc.pp.neighbors(adata, n_neighbors=30, use_rep='X_pca')
psix_obj = psix.Psix(adata, psi_matrix_path='psi_matrix.tsv')
psix_obj.run_psix()
regulated = psix_obj.psix_results.query('psix_score > 1.5 and pvalue < 0.05')
Psix (Buen Abad Najar 2022 Genome Res 32:1385) is the principled alternative to imputing PSI: do not impute (it obliterates heterogeneity); test for graph smoothness instead.
Goal: Detect alternative polyadenylation in 10X 3' data — frequently confounded with AS.
Approach: Peak-call read pile-ups at 3' ends, then DEXSeq-style DTU on 3' UTR isoforms.
library(Sierra)
peak_file <- FindPeaks(
output.file = 'peaks.txt',
gtf.file = 'annotation.gtf',
bam.file = 'possorted_genome_bam.bam'
)
counts <- CountPeaks(
peak.sites.file = 'peaks.txt',
gtf.file = 'annotation.gtf',
bamfile = 'possorted_genome_bam.bam',
whitelist.file = 'barcodes.tsv'
)
apa_results <- DUTest(counts, group1 = ctrl_cells, group2 = trt_cells)
If only 10X 3' data is available, this is often what is actually wanted. Distinct UTRs change miRNA targeting, RBP binding, and stability — biologically meaningful but not splicing.
Goal: Recover bulk-level statistical power for differential splicing between cell types.
Approach: Sum junction counts across cells of the same cluster, then run leafcutter / rMATS on aggregated counts.
import pandas as pd
import numpy as np
def pseudobulk_junctions(junction_counts, cell_metadata, groupby='cell_type'):
out = {}
for group, cells in cell_metadata.groupby(groupby).groups.items():
mask = junction_counts.columns.isin(cells)
out[group] = junction_counts.loc[:, mask].sum(axis=1)
return pd.DataFrame(out)
Use pseudobulk for differential splicing between well-defined cell types; use per-cell methods for within-population heterogeneity (graded splicing along pseudotime, bimodal cell-state mixtures).
In 2024-2026, full-length single-cell long-read sequencing has become practical and is the recommended chemistry for splicing-focused single-cell experiments:
For splicing-specific full-length single-cell analysis, see long-read-splicing skill.
Trigger: Building the SpliceJunction matrix from STAR SJ.out.tab incorrectly (e.g. long-format instead of wide).
Mechanism: MARVEL plate-based CreateMarvelObject(SpliceJunction = ...) expects a wide matrix with first column coord.intron (formatted chr:start:end) and subsequent columns being per-cell sample IDs with integer junction counts. Long-format data.frames or missing coord.intron column cause runtime errors.
Symptom: "no coord.intron column found" errors; or empty PSI tables despite junction reads being present.
Fix: Verify wide-matrix structure; ensure SJ.out.tabs are merged on the chr:start:end key with cells as columns. Use data.table::dcast for the long->wide reshape.
Trigger: Large cohort (>10k cells) with deep coverage.
Mechanism: Variational inference loads full count matrix; TensorFlow allocates GPU memory aggressively.
Symptom: OOM kills; training stalls.
Fix: Reduce --batchSize from default (500000) to 100000 or 50000; train per-chromosome batch; use CPU mode for very small cohorts. Note flag is camelCase --batchSize, not --batch_size.
Trigger: Running scQuint on 10X 3' v3 data hoping for splicing signal.
Mechanism: scQuint's latent Dirichlet model needs junction counts; 10X 3' yields too few junction reads to fit the model robustly.
Symptom: All cells assign to one cluster; no informative splicing signal.
Fix: Pivot to APA analysis with Sierra; or upgrade chemistry to MAS-Iso-seq.
Trigger: Running Psix without precomputed cell-cell graph.
Mechanism: Psix tests PSI smoothness on a pre-existing cell-cell graph; without one, no smoothness statistic.
Symptom: Empty results or error about missing connectivities.
Fix: Run sc.pp.neighbors(adata) before Psix; ensure connectivities is in adata.obsp.
Trigger: GTF missing 3'UTR annotations.
Mechanism: Sierra peak-calls within annotated 3'UTRs; missing annotations mean missed peaks.
Symptom: Few peaks detected; gene-level coverage but no APA calls.
Fix: Use comprehensive GENCODE annotation; or run de-novo peak calling first.
| Pattern | Likely cause | Action | |---------|--------------|--------| | MARVEL sig, BRIE2 not | Per-cell PSI noise (BRIE2 conservative); MARVEL pseudobulk-like | Trust MARVEL for cell-type comparisons; BRIE2 for within-cluster | | BRIE2 sig, MARVEL not | Cell-state effect smoother than cell-type boundary | Test along trajectory with Psix | | SpliZ sig, MARVEL not | Annotation-free SpliZ catches novel events | Investigate junction structure manually | | Sierra sig, MARVEL not | Sierra is APA, MARVEL is splicing — different biology | Distinguish in interpretation | | Pseudobulk sig, per-cell not | Power issue; effect averaged out per-cell | Report at cluster level, not per-cell |
Per-cell PSI vs pseudobulk PSI:
Modality detection in PSI distributions (Song 2017 Mol Cell): | Modality | PSI distribution | Biology | |----------|------------------|---------| | Included | Peaked at 1 | Constitutive inclusion | | Excluded | Peaked at 0 | Constitutive skipping | | Bimodal | Mixture at 0 and 1 | Mixed cell states or monoallelic-like bursting | | Middle | Peaked ~0.5 | Often technical (well-contamination, doublets, or low-coverage shrinkage to prior); confirm with full-length | | Multimodal | Multiple peaks | Complex regulation; deserves follow-up |
Beta-binomial vs binomial models: with sparse counts, binomial PSI is overdispersed. Beta-binomial models (BRIE2; leafcutter2 as Dirichlet-multinomial cluster-level) handle this. For very sparse droplet data, even beta-binomial fits poorly per cell — collapse to pseudobulk.
Imputation pitfalls: naive imputation (MAGIC, scImpute, ALRA) of expression matrices is not appropriate for PSI: imputing missing junction counts averages over neighboring cells and obliterates the very heterogeneity under study. Psix's approach — testing smoothness of observed PSI on the kNN graph — is the principled alternative.
| System | Event | Regulator | |--------|-------|-----------| | Neural microexons | 3-27 nt exons enriched in brain | SRRM4 (nSR100); SRRM3 in retina (Irimia 2014 Cell) | | Neural differentiation | PTBP1 -> PTBP2 switch | miR-124 represses PTBP1; derepresses neural exons (Boutz 2007 Genes Dev) | | T-cell activation | CD45 RA -> RO | hnRNP-L, ESRP-mediated | | Erythropoiesis | EPB41 exon 16 | Splicing factor switching during maturation | | Cardiac development | TTN N2BA -> N2B | MBNL1/CELF1 antagonism | | EMT | FGFR2 IIIb -> IIIc, ENAH exon 11a | ESRP1/2 loss in mesenchymal state (Warzecha 2009 Mol Cell) | | Activated T cell | CD45 isoform shift | Multiple SR/hnRNP regulators |
| Metric | Recommendation | |--------|----------------| | Cells per event with reads | >=50 (per-cell PSI); >=200 cells per cluster (pseudobulk) | | Junction reads per event per cell | >=5 with coverage; <=1 = unreliable | | PSI variance for cell-type call | <0.1 within cluster, >0.2 between clusters | | Library | full-length plate or long-read for transcriptome-wide; 3' for APA only | | Doublet filtering | Required before splicing analysis (DoubletFinder, Scrublet) | | Cells per cluster (pseudobulk) | >=100 ideal; >=50 minimum | | nuclear vs whole-cell | snRNA-seq enriches IR; treat with caution |
| Error | Cause | Solution |
|-------|-------|----------|
| MARVEL: ComputePSI returns empty | STAR SJ.out.tab missing strand info | Re-run STAR with --outSJtype Standard |
| brie.tl.fit: NaN loss | Insufficient junction reads per cell | Filter cells with min_reads=20; raise threshold |
| scQuint: convergence not reached | LDA model fit on too-few junctions | Aggregate by chromosome; or switch chemistry |
| Psix: missing connectivities | Neighbors graph not computed | Run sc.pp.neighbors(adata) first |
| Sierra: no peaks called | GTF missing 3'UTR annotations | Use comprehensive GENCODE; or de-novo peak-call |
| MARVEL: ggplot error | Seurat version mismatch | Match MARVEL and Seurat versions |
| FLAMES: barcode rescue failed | Short-read 10X output not in expected directory | Verify cellranger output structure |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.