read-qc/rnaseq-qc/SKILL.md
RNA-seq specific quality control including rRNA contamination detection, strandedness verification, gene body coverage, and transcript integrity metrics. Use when validating RNA-seq libraries before differential expression analysis.
npx skillsauth add GPTomics/bioSkills bio-rnaseq-qcInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: NCBI BLAST+ 2.15+, numpy 1.26+, picard 3.1+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
RNA-seq specific QC metrics beyond general read quality.
"Check RNA-seq alignment quality" -> Assess gene body coverage, read distribution (exonic/intronic/intergenic), strand specificity, and rRNA contamination rate.
infer_experiment.py, read_distribution.py (RSeQC)picard CollectRnaSeqMetricsHigh rRNA content indicates failed rRNA depletion or polyA selection.
sortmerna \
--ref rRNA_databases/smr_v4.3_default_db.fasta \
--reads sample.fastq.gz \
--aligned rRNA_reads \
--other non_rRNA_reads \
--fastx \
--threads 8
rrna_count=$(grep -c "^@" rRNA_reads.fastq 2>/dev/null || echo 0)
total_count=$(zcat sample.fastq.gz | grep -c "^@")
rrna_pct=$(echo "scale=2; $rrna_count / $total_count * 100" | bc)
echo "rRNA: ${rrna_pct}%"
seqkit sample -n 10000 sample.fastq.gz | seqkit fq2fa > sample_10k.fasta
blastn -query sample_10k.fasta -db rrna_db -outfmt 6 -evalue 1e-10 -max_target_seqs 1 | wc -l
| Library Type | Expected rRNA | |--------------|---------------| | PolyA selected | < 5% | | rRNA depleted | < 10% | | Total RNA | 50-80% |
infer_experiment.py -i aligned.bam -r genes.bed
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9856 # Forward stranded
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0144 # Reverse (should be low)
| Tool Setting | 1++,1--,2+-,2-+ | 1+-,1-+,2++,2-- | |--------------|-----------------|-----------------| | Forward (dUTP) | ~0 | ~1 | | Reverse (Illumina) | ~1 | ~0 | | Unstranded | ~0.5 | ~0.5 |
salmon quant -i index -l A -r sample.fastq.gz -o quant/
grep "library_types" quant/lib_format_counts.json
Check for 3' or 5' bias indicating RNA degradation.
geneBody_coverage.py \
-i aligned.bam \
-r housekeeping_genes.bed \
-o coverage
| Pattern | Indicates | |---------|-----------| | Even coverage | Good quality | | 3' bias | Degradation or polyA artifacts | | 5' bias | Incomplete reverse transcription | | Steep drop | Severe degradation |
read_distribution.py -i aligned.bam -r genes.bed > distribution.txt
| Region | Good Library | |--------|--------------| | CDS_Exons | 60-80% | | UTRs | 10-20% | | Introns | 5-20% | | Intergenic | < 10% |
Measure of RNA degradation per transcript.
tin.py -i aligned.bam -r genes.bed > tin_scores.txt
| TIN Score | Quality | |-----------|---------| | > 70 | Good | | 50-70 | Moderate | | < 50 | Poor |
java -jar picard.jar MarkDuplicates \
I=aligned.bam \
O=marked.bam \
M=dup_metrics.txt \
REMOVE_DUPLICATES=false
grep -A 1 "LIBRARY" dup_metrics.txt | tail -1 | cut -f9
| Library | Expected | |---------|----------| | High complexity | < 20% | | Low input | 20-50% | | Concerning | > 50% |
java -jar picard.jar CollectInsertSizeMetrics \
I=aligned.bam \
O=insert_metrics.txt \
H=insert_histogram.pdf
for frac in 0.1 0.25 0.5 0.75 1.0; do
samtools view -bs $frac aligned.bam > sub_${frac}.bam
featureCounts -a genes.gtf -o counts_${frac}.txt sub_${frac}.bam
detected=$(awk '$7 > 0' counts_${frac}.txt | wc -l)
echo "$frac: $detected genes"
done
Comprehensive RNA-seq metrics from Picard.
java -jar picard.jar CollectRnaSeqMetrics \
I=aligned.bam \
O=rnaseq_metrics.txt \
REF_FLAT=refFlat.txt \
STRAND=SECOND_READ_TRANSCRIPTION_STRAND \
RIBOSOMAL_INTERVALS=rRNA.interval_list
| Metric | Description | |--------|-------------| | PCT_CODING_BASES | % in coding regions | | PCT_UTR_BASES | % in UTRs | | PCT_INTRONIC_BASES | % in introns | | PCT_INTERGENIC_BASES | % intergenic | | PCT_RIBOSOMAL_BASES | % rRNA | | MEDIAN_5PRIME_TO_3PRIME_BIAS | 3' bias |
Aggregate all QC metrics.
multiqc fastqc/ star_output/ featurecounts/ -o multiqc_report/
Goal: Generate a comprehensive RNA-seq QC report covering strandedness, read distribution, gene body coverage, transcript integrity, duplication, and RNA-seq metrics.
Approach: Run RSeQC tools (infer_experiment, read_distribution, geneBody_coverage, TIN) and Picard (MarkDuplicates, CollectRnaSeqMetrics) sequentially, appending all results to a single summary report file.
#!/bin/bash
SAMPLE=$1
BAM=$2
GENES_BED=$3
REF_FLAT=$4
echo "=== RNA-seq QC: $SAMPLE ===" > qc_report.txt
echo -e "\n--- Strandedness ---" >> qc_report.txt
infer_experiment.py -i $BAM -r $GENES_BED >> qc_report.txt
echo -e "\n--- Read Distribution ---" >> qc_report.txt
read_distribution.py -i $BAM -r $GENES_BED >> qc_report.txt
echo -e "\n--- Gene Body Coverage ---" >> qc_report.txt
geneBody_coverage.py -i $BAM -r $GENES_BED -o coverage
echo -e "\n--- TIN Scores ---" >> qc_report.txt
tin.py -i $BAM -r $GENES_BED > tin.txt
awk '{sum+=$3; count++} END {print "Mean TIN:", sum/count}' tin.txt >> qc_report.txt
echo -e "\n--- Duplication ---" >> qc_report.txt
java -jar picard.jar MarkDuplicates I=$BAM O=/dev/null M=dup.txt 2>/dev/null
grep -A 1 "LIBRARY" dup.txt | tail -1 | awk '{print "Duplication rate:", $9}' >> qc_report.txt
echo -e "\n--- RNA-seq Metrics ---" >> qc_report.txt
java -jar picard.jar CollectRnaSeqMetrics I=$BAM O=rnaseq.txt REF_FLAT=$REF_FLAT STRAND=SECOND_READ_TRANSCRIPTION_STRAND 2>/dev/null
grep -A 2 "## METRICS CLASS" rnaseq.txt >> qc_report.txt
cat qc_report.txt
import pysam
import numpy as np
from collections import Counter
def rnaseq_qc(bam_file, sample_size=100000):
bam = pysam.AlignmentFile(bam_file, 'rb')
strand_counts = Counter()
insert_sizes = []
for i, read in enumerate(bam.fetch()):
if i >= sample_size:
break
if not read.is_unmapped:
if read.is_read1:
if read.is_reverse:
strand_counts['1-'] += 1
else:
strand_counts['1+'] += 1
if read.is_proper_pair and read.template_length > 0:
insert_sizes.append(read.template_length)
bam.close()
total = sum(strand_counts.values())
print(f'Read 1 forward: {strand_counts["1+"]/total:.2%}')
print(f'Read 1 reverse: {strand_counts["1-"]/total:.2%}')
if insert_sizes:
print(f'Median insert: {np.median(insert_sizes):.0f}')
rnaseq_qc('aligned.bam')
| Metric | Good | Warning | Fail | |--------|------|---------|------| | Mapping rate | > 85% | 70-85% | < 70% | | rRNA % | < 10% | 10-20% | > 20% | | Exonic % | > 60% | 40-60% | < 40% | | Duplication | < 20% | 20-40% | > 40% | | Mean TIN | > 70 | 50-70 | < 50 | | 3' bias | < 1.5 | 1.5-2 | > 2 |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.