alternative-splicing/isoform-switching/SKILL.md
Analyzes differential transcript usage (DTU) and isoform switches with functional consequence prediction (NMD via 50nt rule, ORF disruption, protein domain loss/gain, signal peptide changes, IDR alterations, coding-potential shifts). Tools include IsoformSwitchAnalyzeR v2 (auto-selects satuRn for >5 reps else DEXSeq), the manual DRIMSeq -> DEXSeq/satuRn -> stageR DTU pipeline, and fishpond/swish for inferential-uncertainty-aware DTE. Distinguishes DTU from DGE and DTE; integrates external annotators (CPC2, Pfam, SignalP, IUPred2A or DeepTMHMM). Use when investigating how splicing differences alter protein function or trigger NMD-mediated degradation.
npx skillsauth add GPTomics/bioSkills bio-isoform-switchingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: IsoformSwitchAnalyzeR 2.11+, DRIMSeq 1.34+, DEXSeq 1.52+, satuRn 1.14+, stageR 1.28+, fishpond 2.14+, tximport 1.34+, tximeta 1.24+, Salmon 1.10+
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parameters<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Identify shifts in which transcript a gene predominantly uses between conditions, and predict functional consequences. Statistically distinct from DGE and DTE; biologically distinct because the same gene-level expression can hide a complete isoform switch with major protein-level consequences.
| Question | Statistic | Tool | Example claim | |----------|-----------|------|----------------| | DGE Does the gene total change? | Sum of transcript counts | DESeq2, edgeR, limma-voom | "Gene X is upregulated 2-fold" | | DTE Does this transcript change in absolute abundance? | Per-transcript count | swish (fishpond), DESeq2 on transcripts, sleuth | "Transcript X-201 is upregulated 2-fold" | | DTU Do proportions of transcripts within the gene shift? | Vector of per-transcript proportions | DRIMSeq, DEXSeq, satuRn (+ stageR) | "Gene X switches from isoform 201 (50% -> 10%) to 202 (50% -> 90%)" |
DTU is statistically harder than DGE because:
DTU and event-level differential splicing answer related but distinct questions: rMATS' IncLevelDifference is essentially a 1-D projection of a DTU shift onto a single event coordinate. The pragmatic 2026 default: run both an event-level tool (rMATS or leafcutter) and a DTU pipeline; reconcile.
| Tool | Model | When to use | Fails when | |------|-------|-------------|------------| | IsoformSwitchAnalyzeR v2 | Wraps DEXSeq or satuRn + functional consequence annotation | Standard interpretation workflow with NMD/domain output | Manual DTU control needed; very large cohorts (>200) | | DRIMSeq | Dirichlet-multinomial on transcript counts; gene-level DTU | Pre-filter step before DEXSeq/satuRn | Cannot annotate functional consequences alone | | DEXSeq | Negative-binomial GLM on exon-bin or transcript counts | Classic DTU; conservative; <=5 replicates per condition | Slow at scale; uses bins not transcripts in default mode | | satuRn | Quasi-binomial GLM with empirical-Bayes shrinkage | DTU at scale (single-cell, large bulk cohorts) | Newer; less battle-tested than DEXSeq | | swish (fishpond) | Non-parametric SAMseq across Salmon Gibbs samples | DTE/DGE incorporating quantification uncertainty | Requires Gibbs samples; not strictly DTU | | stageR | Two-stage testing framework | Required for proper OFDR control on top of DRIMSeq/DEXSeq/satuRn | Standalone — wraps another tool's output | | sleuth | Bootstrap-based DTE on kallisto | When committed to kallisto pipeline | Less active development; superseded by fishpond+swish |
The IsoformSwitchAnalyzeR v2 default rule is: satuRn if any condition has >5 replicates; else DEXSeq. For exactly 5 replicates per condition (boundary), explicitly choose; results may differ.
| Question | Recommended approach |
|----------|----------------------|
| Functional consequences of switches (domains, NMD, signal peptide) | IsoformSwitchAnalyzeR v2 with full external annotator pipeline |
| Pure statistical DTU (gene-level + transcript-level OFDR) | DRIMSeq (filter) -> DEXSeq -> stageR; or -> satuRn -> stageR for n>5 |
| DTU with proper quantification uncertainty | Salmon --numGibbsSamples 20 -> tximeta -> swish for DTE; concurrent DTU |
| Single-cell DTU | satuRn (DEXSeq doesn't scale to scRNA-seq) |
| Long-read DTU (PacBio Iso-Seq, ONT) | IsoformSwitchAnalyzeR v2 long-read input mode (no Salmon EM uncertainty) |
| Time-course DTU | DEXSeq with time as factor + interaction; or limma::lmFit on logit-prop matrix |
| Cancer / disease — switch hits -> mechanism | Standard pipeline + cross-reference with eCLIP, ClinVar, COSMIC |
| Therapeutic ASO target identification | Standard pipeline + sashimi visualization + SpliceAI design |
Goal: Identify isoform switches with functional consequences in one integrated workflow.
Approach: Import Salmon, pre-filter, run statistical test (satuRn auto-selected if any condition has >5 replicates, else DEXSeq), annotate switches with external tools (CPC2, Pfam, SignalP, IUPred2A or DeepTMHMM), then summarize consequences.
library(IsoformSwitchAnalyzeR)
salmonQuant <- importIsoformExpression(
parentDir = 'salmon_quant/',
addIsofomIdAsColumn = TRUE
)
design <- data.frame(
sampleID = colnames(salmonQuant$counts)[-1],
condition = c('control', 'control', 'control', 'treatment', 'treatment', 'treatment')
)
aSwitchList <- importRdata(
isoformCountMatrix = salmonQuant$counts,
isoformRepExpression = salmonQuant$abundance,
designMatrix = design,
isoformExonAnnoation = 'annotation.gtf',
isoformNtFasta = 'transcripts.fa',
showProgress = TRUE
)
aSwitchList <- preFilter(
aSwitchList,
geneExpressionCutoff = 1,
isoformExpressionCutoff = 0,
IFcutoff = 0.01,
removeSingleIsoformGenes = TRUE,
keepIsoformInAllConditions = TRUE
)
aSwitchList <- isoformSwitchTestSatuRn(
aSwitchList,
reduceToSwitchingGenes = TRUE,
alpha = 0.05,
dIFcutoff = 0.1
)
preFilter parameters:
geneExpressionCutoff = 1 — minimum TPM for gene to be tested (raise for stricter)isoformExpressionCutoff = 0 — minimum TPM per isoform (set to 1 for stricter)IFcutoff = 0.01 — minimum isoform fraction; below = noiseremoveSingleIsoformGenes = TRUE — drop genes with only one detectable isoform (cannot have DTU)keepIsoformInAllConditions = TRUE — require expression across all conditionsFor long-read input, use importRdata with long-read transcript counts directly — bypasses Salmon EM uncertainty entirely.
Goal: Predict how each switch alters protein structure, function, and stability.
Approach: Extract sequences, run external annotators outside R, then re-import results into the switchAnalyzeRlist.
aSwitchList <- extractSequence(
aSwitchList,
pathToOutput = 'sequences/',
writeToFile = TRUE
)
# Run external tools on sequences/isoformSwitchAnalyzeR_isoform_*.fasta
# Then import:
# IMPORTANT: ORF analysis must run BEFORE analyzeSwitchConsequences for NMD_status,
# ORF_seq_similarity, and coding_potential consequences to be computed.
aSwitchList <- analyzeORF(aSwitchList, orfMethod = 'longest', genomeObject = NULL)
aSwitchList <- analyzeCPC2(aSwitchList, pathToCPC2resultFile = 'cpc2_results.txt', removeNoncodinORFs = TRUE)
aSwitchList <- analyzePFAM(aSwitchList, pathToPFAMresultFile = 'pfam_results.txt')
aSwitchList <- analyzeSignalP(aSwitchList, pathToSignalPresultFile = 'signalp_results.txt')
aSwitchList <- analyzeIUPred2A(aSwitchList, pathToIUPred2AresultFile = 'iupred2_results.txt')
aSwitchList <- analyzeAlternativeSplicing(aSwitchList, onlySwitchingGenes = TRUE)
aSwitchList <- analyzeSwitchConsequences(
aSwitchList,
consequencesToAnalyze = c(
'intron_retention',
'coding_potential',
'ORF_seq_similarity',
'NMD_status',
'domains_identified',
'IDR_identified',
'IDR_type',
'signal_peptide_identified'
),
dIFcutoff = 0.1
)
| External tool | Purpose | Required for |
|----------------|---------|--------------|
| CPC2 (Coding Potential Calculator 2) | Coding vs non-coding classification | coding_potential consequence |
| Pfam (HMMER hmmscan against Pfam-A) | Protein domain identification | domains_identified |
| SignalP 6.0+ | Signal peptide prediction | signal_peptide_identified |
| IUPred2A or DeepTMHMM | Intrinsic disorder regions / TM domains | IDR_identified, IDR_type |
| NetSurfP-3 | Surface accessibility (optional) | Extended IDR analysis |
The external tools must be run outside R; IsoformSwitchAnalyzeR provides FASTA outputs and re-imports the parsed results. Plan for ~30-60 minutes of external compute on typical mammalian transcriptomes.
A transcript is predicted NMD-sensitive if its premature termination codon (PTC) lies >50-55 nt upstream of the last exon-exon junction (Maquat 2004 Nat Rev Mol Cell Biol; Lykke-Andersen & Jensen 2015 Nat Rev Mol Cell Biol).
Mechanism: Spliceosome deposits the Exon Junction Complex (EJC) ~20-24 nt upstream of every exon-exon junction. During the pioneer round of translation, ribosome reading through removes EJCs upstream of the stop codon. If a stop codon precedes the last EJC by >50 nt, the EJC remains, recruits UPF1 -> SMG1 phosphorylation -> SMG6/SMG7 -> mRNA decay.
Caveats and exceptions:
IsoformSwitchAnalyzeR's analyzeSwitchConsequences with 'NMD_status' evaluates this from the predicted ORF + transcript model.
A large class of conserved alternative splicing events is deliberately PTC-introducing to titrate functional protein levels:
Functional implication: an increase in PSI of a poison exon decreases functional protein. Sign-of-effect in DTU output is opposite from intuition for these genes. Always check whether the alternative form is PTC-bearing before interpreting direction.
Disease examples:
The canonical reference is the F1000Research "Swimming downstream" workflow (Love, Soneson, Patro 2018; Bioconductor rnaseqDTU).
library(tximeta); library(DRIMSeq); library(DEXSeq); library(stageR)
se <- tximeta(coldata)
counts <- assays(se)$counts
samples <- data.frame(
sample_id = colnames(counts),
condition = c('control', 'control', 'control', 'treatment', 'treatment', 'treatment')
)
txdf <- data.frame(
gene_id = rowData(se)$gene_id,
feature_id = rowData(se)$tx_id,
counts
)
d <- dmDSdata(counts = txdf, samples = samples)
d <- dmFilter(d, min_samps_feature_expr = 3, min_feature_expr = 10,
min_samps_feature_prop = 3, min_feature_prop = 0.1,
min_samps_gene_expr = 6, min_gene_expr = 10)
design_full <- model.matrix(~ condition, data = samples(d))
dxd <- DEXSeqDataSet(
countData = round(as.matrix(counts(d)[, -c(1, 2)])),
sampleData = samples(d),
design = ~ sample + exon + condition:exon,
featureID = counts(d)$feature_id,
groupID = counts(d)$gene_id
)
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd, quiet = TRUE)
dxd <- testForDEU(dxd, reducedModel = ~ sample + exon)
qval <- perGeneQValue(DEXSeqResults(dxd))
dxr <- DEXSeqResults(dxd, independentFiltering = FALSE)
pConfirmation <- matrix(dxr$pvalue, ncol = 1)
rownames(pConfirmation) <- dxr$featureID
tx2gene <- as.data.frame(dxr[, c('featureID', 'groupID')])
stageRObj <- stageRTx(
pScreen = qval,
pConfirmation = pConfirmation,
pScreenAdjusted = TRUE,
tx2gene = tx2gene
)
stageRObj <- stageWiseAdjustment(stageRObj, method = 'dtu', alpha = 0.05)
results <- getAdjustedPValues(stageRObj, order = FALSE, onlySignificantGenes = FALSE)
stageR semantics:
perGeneQValue from DEXSeq, or DRIMSeq's gene-level p) is filtered at the desired Overall FDR.Goal: Test DTE while propagating quantification uncertainty from Salmon's Gibbs samples.
Approach: Run Salmon with --numGibbsSamples 20, import with tximeta, then use swish to average a non-parametric SAMseq-style test across inferential replicates.
library(fishpond); library(tximeta)
se <- tximeta(coldata)
y <- scaleInfReps(se)
y <- labelKeep(y)
y <- y[mcols(y)$keep, ]
set.seed(1)
y <- swish(y, x = 'condition')
dte_results <- as.data.frame(mcols(y))
sig <- subset(dte_results, qvalue < 0.05)
infRV (inferential relative variance) is a per-feature uncertainty diagnostic; high-infRV transcripts are unreliable and can be filtered before testing. Critical for genes with many similar isoforms (TTN, MAPT, NEFM) where Salmon's EM is uncertain.
Trigger: Bulk cohort with >50 samples or single-cell DTU.
Mechanism: DEXSeq fits a NB GLM per exon-bin per gene; computational cost scales linearly with samples × bins.
Symptom: estimateDispersions takes hours; testForDEU exhausts memory.
Fix: Switch to satuRn (designed for scale, including scRNA-seq); run with parallelization (BPPARAM = MulticoreParam(8)).
Trigger: Default dmFilter parameters too strict for low-expression cohort.
Mechanism: min_samps_feature_expr = 3, min_feature_expr = 10 drops transcripts seen in <=2 samples or with <10 counts.
Symptom: Most candidate genes filtered out; few testable genes.
Fix: Tune to dataset: lower thresholds for low-coverage data, raise for high-coverage. Document choice.
Trigger: Very small cohort (n=2 vs n=2) or very heterogeneous.
Mechanism: Empirical-Bayes shrinkage assumes shared dispersion across genes; collapses with too-few or too-heterogeneous samples.
Symptom: Inflated p-values; few discoveries despite real effects.
Fix: Aggregate replicates (pseudobulk), or switch to DEXSeq for small cohorts; use larger cohorts when possible.
Trigger: Running swish on Salmon output without Gibbs samples.
Mechanism: swish averages over inferential replicates from Salmon's Gibbs sampler; requires --numGibbsSamples 20 (or bootstrap with --numBootstraps) at Salmon time.
Symptom: scaleInfReps errors about missing inferential replicates.
Fix: Re-run Salmon with --numGibbsSamples 20; this triples Salmon runtime but enables uncertainty-aware testing.
Trigger: Forgetting to run all 4 external annotators (CPC2, Pfam, SignalP, IUPred2A).
Mechanism: analyzeSwitchConsequences silently drops consequence types for which annotation wasn't imported.
Symptom: extractConsequenceSummary shows fewer types than requested; specific consequence reports missing.
Fix: Verify all 4 result files exist before analyzeSwitchConsequences; check aSwitchList$AlternativeSplicingAnalysis slot for completeness.
| Pattern | Likely cause | Action | |---------|--------------|--------| | Significant DTU, no rMATS hit | DTU shift across many transcripts; no single canonical event captures it | Examine isoform structure in switchPlot; report at gene level | | rMATS sig, no significant DTU | Single event in single isoform; not a gene-level DTU | Report as event-level result; DTU not the right framing | | Both sig, same gene, different "main" isoforms | Annotation differs (rMATS uses GENCODE basic; ISA uses comprehensive) | Standardize annotation; re-run | | DTU shows poison-exon switch, gene-level DGE shows decrease | NMD-coupled regulation: AS-NMD reducing protein on top of transcription | Mechanism: AS-NMD; report direction carefully |
For high-confidence reporting: concordant DTU + event-level + sashimi visualization.
For scRNA-seq, satuRn scales where DEXSeq does not. IsoformSwitchAnalyzeR v2 supports single-cell input via importRdata with single-cell count matrices, and the underlying satuRn test has explicit single-cell calibration (Gilis 2022 F1000Research).
Strong recommendation: pseudobulk by cell type first; per-cell DTU is rarely powered with droplet 3' chemistry. See single-cell-splicing for chemistry-specific limitations.
extractTopSwitches(
aSwitchList,
filterForConsequences = TRUE,
n = 25,
sortByQvals = TRUE
)
switchPlot(
aSwitchList,
gene = 'TARGET_GENE',
condition1 = 'control',
condition2 = 'treatment',
localTheme = theme_bw(base_size = 12)
)
extractConsequenceSummary(aSwitchList, consequencesToAnalyze = 'all', plotGenes = FALSE)
extractConsequenceEnrichment(aSwitchList, consequencesToAnalyze = 'all')
extractSplicingSummary(aSwitchList, asFractionTotal = FALSE)
| Parameter | Default | Notes | |-----------|---------|-------| | isoform_switch_q_value | < 0.05 | Switch significance | | dIF (delta isoform fraction) | > 0.1 | Minimum biological effect | | Consequence q-value | < 0.05 | Significance per consequence type | | Gene-level OFDR (stageR) | < 0.05 | Gene-level screening FDR | | satuRn alpha | 0.05 | Empirical-Bayes alpha | | swish qvalue | < 0.05 | Local FDR from qvalue package |
| Error | Cause | Solution |
|-------|-------|----------|
| Error in importRdata: ... transcript_ids do not match | Salmon index built from different annotation than provided GTF | Rebuild Salmon index with matching transcripts.fa |
| analyzeSwitchConsequences: not enough switching genes | preFilter too strict; few candidate switches | Lower geneExpressionCutoff, dIFcutoff in test step |
| dmFilter: empty result | Filter parameters too strict | Reduce min_feature_expr, min_samps_feature_expr |
| satuRn: rank-deficient design matrix | Confounder perfectly correlated with condition | Drop the confounder or stratify analysis |
| swish: no inferential replicates found | Salmon run without --numGibbsSamples | Re-run Salmon with --numGibbsSamples 20 |
| analyzeIUPred2A: file not found | External annotator output missing or wrong path | Verify CPC2/Pfam/SignalP/IUPred2A all completed and paths match |
gene_p should be passed as pScreen, not raw transcript p-values.--numGibbsSamples is upstreamtools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.