alternative-splicing/splicing-qc/SKILL.md
Assesses RNA-seq data quality specifically for alternative splicing analysis. QC layers include experimental design audit (library prep, read length, depth, replicates), STAR 2-pass cohort-style alignment, junction saturation curves and discovery plateau detection, novel-vs-known junction ratio diagnostics, junction-overhang distribution, splice-site strength scoring (MaxEntScan intrinsic + SpliceAI context-aware), strandedness verification, GENCODE basic vs comprehensive choice, and rRNA contamination screening. Splicing analysis is more demanding than DGE on read length, depth, library prep, alignment strategy, and annotation choice — failures silently bias PSI estimates and inflate novel-junction false positives. Use when evaluating data suitability for splicing analysis, troubleshooting low event detection, or designing sequencing experiments where AS is a primary endpoint.
npx skillsauth add GPTomics/bioSkills bio-splicing-qcInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: RSeQC 5.0+, STAR 2.7.11+, samtools 1.19+, pysam 0.22+, regtools 1.0+, maxentpy 0.0.1+, spliceai 1.3+, matplotlib 3.8+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Splicing analysis is more demanding than DGE on read length, depth, library prep, alignment strategy, and annotation choice. Failures in any of these silently bias PSI estimates and inflate novel-junction false positives. The decision sequence is: experimental design -> library prep -> alignment strategy -> annotation -> diagnostic metrics. Each layer's failure mode is distinct.
| Layer | Target | Tool | Fails when |
|-------|--------|------|------------|
| Experimental design | Read length, depth, replicates, library type | Pre-sequencing review | <PE 75nt; n<3 vs n<3; <30M reads/sample |
| Library prep | poly(A) vs rRNA depletion | Pre-sequencing review | poly(A) library used for IR analysis |
| Alignment | STAR 2-pass cohort-style | STAR | 1-pass loses 14% novel junctions; per-sample 2-pass introduces inconsistency |
| Junction discovery | Saturation, novelty | RSeQC junction_saturation, junction_annotation | Curve still rising = under-sequenced; novel% >40% suggests biology or artifact |
| Strand specificity | Library protocol consistency | RSeQC infer_experiment | Wrong --libType halves usable junctions |
| Splice site strength | Cryptic vs canonical | MaxEntScan, SpliceAI | Weak splice sites (MaxEnt<5) may indicate cryptic, regulated, or annotation error |
| Junction overhang | Read-junction support quality | pysam CIGAR parsing | Overhang <8nt = high false-positive rate |
| Contamination | rRNA, adapters | fastq_screen | >20% rRNA in "depleted" library = failed depletion |
| Annotation | GENCODE basic vs comprehensive | Annotation choice | Basic for canonical events; comprehensive for DTU |
| Question | Recommended QC | |----------|-----------------| | Will my planned RNA-seq design support AS analysis? | Pre-sequencing audit: library type, read length, depth, replicates | | Is my data suitable for cassette exon analysis? | Junction saturation + known/novel ratio + read length | | Why does my AS analysis call so few events? | Saturation curve, depth, library type, alignment 2-pass | | Why does my AS analysis call so many novel junctions? | Annotation completeness + novel% + biology check (TDP-43, SF3B1) | | Are my SpliceAI predictions calibrated for my tissue? | MaxEntScan + SpliceAI concordance for known sites | | Did STAR 2-pass actually run cohort-style? | Verify SJ.out.tab merging across samples | | Is intron retention detectable in my data? | Library type (must be rRNA-depleted); strand-specific | | Are my microexons detectable? | Read length >=100; aligner anchor settings; consider VAST-TOOLS |
| Decision | For splicing analysis | Rationale |
|----------|------------------------|-----------|
| Library prep | rRNA depletion (Ribo-Zero, RiboCop) | poly(A) selection loses pre-mRNA, nascent transcripts, and detained introns; for IR analysis rRNA depletion is mandatory (Sims 2014 Genome Res) |
| Read length | PE 100-150 nt (PE 150 strongly preferred) | Junction-spanning reads need >=8 nt overhang on each exon; 50 nt SE biases toward shorter exons (Wang 2008 Nature) |
| Pairing | Paired-end | Single-end loses fragment-level disambiguation of junctions |
| Depth | 50-100M reads/sample | DGE-grade 30M misses low-PSI events; 100M for low-abundance event discovery |
| Strandedness | Stranded library (Illumina TruSeq stranded) | Distinguishes overlapping antisense; some tools double-count unstranded junctions |
| Replicates | n>=3 per condition | n=2 vs n=2 has poor calibration in most tools (especially SUPPA2) |
| Annotation | GENCODE basic for canonical, comprehensive for DTU/discovery | basic = high-confidence; comprehensive includes putative — affects FDR control |
| Microexons | PE 100+ with --alignSJoverhangMin 8; VAST-TOOLS | Default aligners miss 3-27nt exons |
| Long-intron genes (TTN, brain) | Increased --alignIntronMax | Default 1Mb may miss >1Mb introns |
Goal: Maximize novel-junction sensitivity for downstream AS analysis.
Approach: Run STAR once per sample to discover novel junctions (pass 1), merge novel junctions across cohort, then re-align with the augmented junction set (pass 2). Cohort-style 2-pass beats per-sample basic 2-pass for differential splicing because all samples use the same junction reference.
# Pass 1: per-sample
STAR --runMode alignReads \
--runThreadN 8 \
--genomeDir genome_index \
--sjdbGTFfile gencode.v45.basic.gtf \
--sjdbOverhang 149 \
--readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix pass1_${sample}_ \
--outSJtype Standard \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1
# Cohort-style 2-pass: collect all SJ.out.tab from pass 1
cat pass1_*_SJ.out.tab | awk '$5 > 0 && $7 >= 3' | sort -u > cohort_novel_SJ.tab
# Pass 2: re-align with augmented junctions
STAR --runMode alignReads \
--runThreadN 8 \
--genomeDir genome_index \
--sjdbGTFfile gencode.v45.basic.gtf \
--sjdbFileChrStartEnd cohort_novel_SJ.tab \
--sjdbOverhang 149 \
--readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix pass2_${sample}_ \
--outSJtype Standard \
--twopassMode None \
--quantMode GeneCounts \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 3
| Approach | Novel-junction recovery | Cohort consistency |
|----------|-------------------------|--------------------|
| 1-pass with annotation | ~80-86% (depends on GENCODE completeness) | High (annotation-based) |
| Per-sample basic 2-pass (--twopassMode Basic) | >=94% | Variable (each sample has its own junction set) |
| Cohort-style 2-pass (manual merge) | >=94% | High (shared junction reference) |
Per-sample 2-pass (--twopassMode Basic) is simpler but produces inconsistent junction sets across samples; for differential splicing the cohort-style version is preferred (Veeneman 2016 Bioinformatics).
The pass-1 filter awk '$5 > 0 && $7 >= 3' keeps junctions with strand info AND >=3 unique reads — adjust threshold to balance discovery vs noise.
Goal: Determine whether sequencing depth is sufficient for comprehensive splicing detection.
Approach: Run RSeQC junction saturation; check whether the discovery curve plateaus.
junction_saturation.py \
-i sample.bam \
-r gencode_v45.bed \
-o sample_junc_sat \
-m 50 \
-v 100000
import subprocess
import pandas as pd
samples = ['s1.bam', 's2.bam', 's3.bam']
for sample in samples:
subprocess.run([
'junction_saturation.py',
'-i', sample,
'-r', 'gencode_v45.bed',
'-o', sample.replace('.bam', '_junc_sat')
], check=True)
The output *.junctionSaturation_plot.r plots known + novel junctions vs subsampled reads.
Plateau detection rule: if from 80% to 100% of reads, the junction count rises by <2%, consider it plateaued. Still rising means more sequencing would yield more junctions.
For AS analysis, plateau on the known junction curve is the requirement; novel-junction curves often don't plateau even at deep coverage (which is biologically informative — novel junctions are inherently rarer events).
Goal: Detect annotation/mapping issues or biologically interesting cryptic splicing.
Approach: Classify junctions with RSeQC and compute the novel:known ratio.
junction_annotation.py -i sample.bam -r gencode_v45.bed -o sample_junc_annot
import pandas as pd
junc = pd.read_csv('sample_junc_annot.junction.xls', sep='\t')
total = junc['total_splicing_events'].sum()
by_class = junc.groupby('annotation')['total_splicing_events'].sum()
known_frac = by_class.get('known', 0) / total
novel_frac = (by_class.get('partial_novel', 0) + by_class.get('novel', 0)) / total
print(f'known: {known_frac:.1%}, novel: {novel_frac:.1%}')
| Known fraction | Status | Interpretation | |----------------|--------|----------------| | >=80% | Healthy | Comprehensive annotation, good alignment | | 60-80% | Acceptable | Check annotation completeness or organism | | <60% | Suspect or interesting | Mapping artifacts, contamination, OR biologically informative |
High novel-junction rate may be biology, not artifact:
If novel% >40%, drill down: check organism, check spliceosomal mutation status, check known disease signatures.
Goal: Profile per-junction read counts and overhang distribution to identify weakly-supported events.
Approach: Parse CIGAR for N (intron) operations; tally per-junction reads and minimum exon overhangs.
import pysam
from collections import defaultdict
def junction_stats(bam_path):
bam = pysam.AlignmentFile(bam_path, 'rb')
counts = defaultdict(int)
min_overhang = defaultdict(lambda: float('inf'))
for read in bam.fetch():
if read.is_unmapped or read.is_secondary:
continue
ref_pos = read.reference_start
cumulative_query = 0
cigar = read.cigartuples
for i, (op, length) in enumerate(cigar):
if op == 3:
left_match = sum(l for o, l in cigar[:i] if o in (0, 7, 8))
right_match = sum(l for o, l in cigar[i+1:] if o in (0, 7, 8))
overhang = min(left_match, right_match)
key = (read.reference_name, ref_pos, ref_pos + length)
counts[key] += 1
min_overhang[key] = min(min_overhang[key], overhang)
if op in (0, 2, 3, 7, 8):
ref_pos += length
bam.close()
return counts, dict(min_overhang)
counts, overhang = junction_stats('sample.bam')
print(f'total junctions: {len(counts)}')
print(f'>= 10 reads: {sum(1 for c in counts.values() if c >= 10)}')
print(f'overhang >= 8 nt: {sum(1 for k, c in counts.items() if overhang[k] >= 8)}')
Junction reads with overhang <8 nt are common false positives, especially for novel sites. Most callers default to >=8 nt anchor for this reason. Microexon-aware aligners use overhang as low as 6 nt with explicit configuration.
Goal: Score donor and acceptor splice sites to flag weak / cryptic sites and to predict variant impact on splicing.
Approach: Use MaxEntScan (sequence information content) and SpliceAI (context-aware deep-learning) — they answer different questions.
from maxentpy.maxent import score5, score3
donor = 'CAGGTAAGT'
acceptor = 'TTTTTTTTTTTTTTTTTTTTCAG'
print(f"5'ss MaxEnt: {score5(donor):.2f}")
print(f"3'ss MaxEnt: {score3(acceptor):.2f}")
| Score | Interpretation | Source | |-------|----------------|--------| | 5'ss MaxEnt > 8 | Strong donor | Yeo & Burge 2004 J Comput Biol | | 5'ss MaxEnt 5-8 | Moderate | | | 5'ss MaxEnt < 5 | Weak / cryptic | | | 3'ss MaxEnt > 8 | Strong acceptor | | | 3'ss MaxEnt < 5 | Weak / cryptic | | | SpliceAI delta > 0.2 | PP3 supporting (ClinGen SVI 2023) | Walker 2023 AJHG | | SpliceAI delta > 0.5 | PP3 moderate | | | SpliceAI delta > 0.8 | PP3 strong | |
MaxEntScan vs SpliceAI:
splice-variant-prediction.Goal: Get integrated RNA-seq QC including intronic / exonic / intergenic mapping rates and gene-body coverage uniformity.
Approach: Run picard CollectRnaSeqMetrics for mapping distribution; RSeQC geneBody_coverage.py for 5'-3' bias.
picard CollectRnaSeqMetrics \
I=sample.bam \
O=sample.rna_metrics.txt \
REF_FLAT=refFlat.txt \
STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND \
RIBOSOMAL_INTERVALS=rRNA_intervals.interval_list
# Strandedness conversion (foot-gun):
# Reverse-stranded (Illumina TruSeq Stranded; NEB Ultra II Directional — both dUTP):
# rMATS --libType fr-firststrand
# featureCounts -s 2
# Picard STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND
# Forward-stranded (Lexogen QuantSeq FWD, certain ligation-based kits):
# rMATS --libType fr-secondstrand
# featureCounts -s 1
# Picard STRAND_SPECIFICITY=FIRST_READ_TRANSCRIPTION_STRAND
# STAR has no library-strand flag; pass --outSAMstrandField intronMotif
# (works for any library) so downstream tools can read XS tags.
geneBody_coverage.py \
-i sample.bam \
-r gencode_v45.bed \
-o sample_geneBody
| Metric | Healthy | Concerning | |--------|---------|------------| | PCT_CODING_BASES | >=50% | <30% (suggests degradation or mis-priming) | | PCT_UTR_BASES | 20-40% | >>50% (3' bias) | | PCT_INTRONIC_BASES | <30% (poly(A)); <60% (rRNA-depleted) | >50% (poly(A)) suggests pre-mRNA contamination | | PCT_INTERGENIC_BASES | <10% | >20% (genomic DNA contamination) | | MEDIAN_5PRIME_TO_3PRIME_BIAS | 0.7-1.3 | >2 or <0.5 (severe degradation) | | Gene body coverage curve | Flat | Strong 3' skew = RIN low or library mis-prep |
3' bias (degraded RNA) directly reduces splicing-event detection because junction reads scatter across the gene body; with 3' bias they concentrate near the 3' end and miss CDS junctions.
infer_experiment.py -i sample.bam -r gencode_v45.bed -s 200000
Output reports the fraction of reads consistent with each library type:
| Output pattern | Library type | rMATS --libType |
|----------------|---------------|---------------------|
| ~50% / ~50% | Unstranded | fr-unstranded |
| >=90% "++ , --" | Forward-stranded | fr-secondstrand |
| >=90% "+- , -+" | Reverse-stranded (Illumina TruSeq stranded) | fr-firststrand |
Wrong strand setting halves usable junction reads — always verify before quantification. RSeQC infer_experiment.py is fast and authoritative.
| GENCODE level | Contents | Use for | |---------------|----------|---------| | Basic | High-confidence canonical isoforms | Standard rMATS, leafcutter, SUPPA2 | | Comprehensive | All transcripts including putative/predicted | DTU pipelines (DRIMSeq+DEXSeq, satuRn), isoform discovery | | RefSeq | NCBI curated | Less complete than GENCODE; legacy use | | Ensembl | Same content as GENCODE in vertebrates | Different attribute conventions |
Comprehensive captures more biology but inflates DTU multiple-testing burden and includes annotation noise. For event-level (rMATS) AS, basic is usually adequate; for transcript-level DTU (DRIMSeq, satuRn), comprehensive may be necessary to capture rare isoforms.
fastq_screen --conf fastq_screen.conf --threads 8 sample_R1.fq.gz
Or post-alignment:
samtools view -c sample.bam | awk '{print "total:",$0}'
samtools view -c -L rRNA_intervals.bed sample.bam | awk '{print "rRNA:",$0}'
| rRNA fraction | Library type | Status | |----------------|---------------|--------| | >=20% | "depleted" | Failed depletion; redo | | 5-20% | "depleted" | Acceptable; some rRNA leakage | | <5% | poly(A) | Healthy | | <5% | "depleted" | Excellent depletion | | 1-3% | poly(A) | Suggests RNA degradation |
5% rRNA in a poly(A) library suggests degraded RNA; >20% in a "depleted" library indicates failed depletion.
junction_saturation: Subsampling BehaviorTrigger: Running on extremely deep BAM (>200M reads).
Mechanism: RSeQC subsamples at 5%, 10%, ..., 100%; with very deep BAMs, the early subsamples are still tens of millions of reads, masking saturation behavior.
Symptom: Curve appears flat throughout; uninformative.
Fix: Subsample BAM with samtools view -s 0.1 before running junction_saturation; or use -s flag to set custom step intervals.
Trigger: Using --twopassMode Basic on differential splicing cohorts.
Mechanism: Per-sample 2-pass means each sample has its own SJ.out.tab; samples may differ in which novel junctions they re-align against.
Symptom: Inconsistent novel junction calls across replicates; rMATS --novelSS differential calls don't replicate.
Fix: Switch to cohort-style 2-pass (collect all pass-1 SJ.out.tabs, merge, re-align all samples with merged set).
Trigger: Sequences with N bases or wrong length.
Mechanism: score5 expects exactly 9 nt (3 exon + 6 intron); score3 expects 23 nt (20 intron + 3 exon).
Symptom: ValueError or silently incorrect score.
Fix: Pre-validate sequence length and N-content; use a wrapper that returns NaN for invalid inputs.
Trigger: Running spliceai on large VCF without GPU.
Mechanism: TensorFlow CPU mode is slow; default batch size may exceed memory.
Symptom: OOM kill; very slow runtime (hours per chromosome).
Fix: Use -D 50 for screening (fastest); split VCF by chromosome; use GPU when available.
infer_experiment.py: Sample SizeTrigger: Running on very low-coverage region or small subsample (-s).
Mechanism: Default sample size is 200,000 reads; with low coverage, this isn't met.
Symptom: "0 of 200000 reads" output; cannot infer strand.
Fix: Lower -s to actual available reads; or use -q 30 to filter by quality.
| Error | Cause | Solution |
|-------|-------|----------|
| STAR: SJDBoverhang differs from genome | Index built with different overhang than current run | Rebuild index with --sjdbOverhang matching read length - 1 |
| RSeQC: BED format error | Annotation BED has wrong column order | Convert with awk or bedtools |
| MaxEntScan: invalid sequence character N | N in input | Filter or replace; document |
| samtools view: missing index | BAM not indexed | samtools index sample.bam |
| STAR: too many SJs in cohort merge | Cohort SJ.out.tab too large after merge | Filter to junctions in >=3 samples or with >=3 unique reads |
| regtools: invalid CIGAR | Non-spec read in BAM | Filter with samtools view -h -F 0x100 -F 0x800 |
| Metric | Good | Acceptable | Poor | Source | |--------|------|------------|------|--------| | Read length (PE) | 150 nt | 100 nt | <75 nt | Wang 2008 Nature | | Sequencing depth | >=100M | 50-100M | <30M | DGE-grade insufficient | | Junction saturation | Plateau (<2% growth in last 20%) | Near plateau | Still rising | RSeQC convention | | Known-junction fraction | >=80% | 60-80% | <60% (suspect or interesting) | RSeQC convention | | Junctions >=10 reads | >=50% | 30-50% | <30% | rMATS reliability cutoff | | 5'ss / 3'ss MaxEnt | >8 | 5-8 | <5 | Yeo & Burge 2004 | | Strandedness | >90% one direction | 70-90% | <70% | RSeQC convention | | rRNA in depleted library | <5% | 5-20% | >20% | Sims 2014 Genome Res | | 2-pass STAR | Cohort-style | Per-sample basic | 1-pass only | Veeneman 2016 Bioinformatics |
| Issue | Possible causes | Solutions | |-------|-----------------|-----------| | Few events called | Low depth; short reads; SE; wrong strand | Increase depth; use PE150; verify libType | | High novel junctions | Annotation gaps; mapping artifacts; biology (TDP-43, SF3B1) | Update annotation; check 2-pass; consider biology | | Low IR detection | poly(A) library | Use rRNA depletion | | Microexons missing | Default aligner anchors too long | VAST-TOOLS, MicroExonator, or long-read | | Many weak splice sites | Cryptic splicing | Validate with MaxEnt + SpliceAI; consider RNA-seq from secondary tissue | | FDR uncalibrated at low n | n=2 vs n=2 | Use leafcutter or Shiba; avoid SUPPA2 alone | | PSI variance high across replicates | Library prep / RIN inconsistency | Check RIN; consider RNA degradation | | Sashimi plot mismatch with PSI | Junction-imbalance bias in rMATS | Run Shiba; or filter by overhang distribution |
--libType — halves usable junctions; always verify with infer_experiment.py.development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.