clip-seq/clip-qc/SKILL.md
Comprehensive quality control for CLIP-seq libraries (eCLIP, iCLIP, iCLIP2, PAR-CLIP) covering library complexity (preseq), FRiP, IDR replicate reproducibility, read-distribution metagene, SMInput vs IgG control rationale, rRNA / snoRNA contamination, fragment-length distribution, and ENCODE-compliance thresholds. Use when assessing whether a CLIP library passed, deciding lenient vs stringent peak thresholds, comparing replicates with IDR rescue and self-consistency ratios, or distinguishing failed IP from over-amplified library.
npx skillsauth add GPTomics/bioSkills bio-clip-seq-clip-qcInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: preseq 3.2+, picard 3.1+, samtools 1.19+, bedtools 2.31+, deeptools 3.5+, idr 2.0.4+, MultiQC 1.21+, RSeQC 5.0+, pysam 0.22+, fastp 0.23+.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws unexpected errors, introspect the installed binary and adapt the example to match the actual CLI rather than retrying.
"Did my CLIP library pass?" -> Assess preprocessing retention, alignment rate, library complexity, replicate reproducibility (IDR), fraction reads in peaks (FRiP), read-distribution metagene, rRNA/snoRNA contamination, fragment-length distribution, and SMInput vs IP enrichment. ENCODE eCLIP compliance is the canonical bar: >= 1M unique fragments per replicate, IDR rescue and self-consistency ratios both < 2, FRiP >= 0.005 (narrow-binding), library complexity rising linearly with depth on preseq lc_extrap. A library can fail at any of these stages, and the failure mode determines whether the data is salvageable.
preseq lc_extrap -B -P aligned.bam -o complexity.txtbedtools intersect -c -s -a peaks.bed -b dedup.bam | awk '{s+=$NF} END{print s}' then divide by total readsidr --samples rep1.sorted.bed rep2.sorted.bed --input-file-type bed --rank 5 --output-file idr.out --idr-threshold 0.05 --plotRSeQC read_distribution.py -i dedup.bam -r gencode.v38.bed + geneBody_coverage.py -i dedup.bam -r housekeeping.bed -o gbsamtools idxstats dedup.bam | awk '$1 ~ /rRNA|45S|18S|28S/ { sum+=$3 } END {print sum}'multiqc <run_dir> aggregates FastQC + cutadapt + STAR + umi_tools + preseq + samtools statsThe ENCODE eCLIP standards (encodeproject.org/eclip) define: >= 2 biological replicates with >= 1M unique fragments each (or saturated peak detection); IDR rescue and self-consistency ratios both < 2; narrow-binding RBPs FRiP >= 0.005. CLIP libraries should have 40-70% PCR duplication BY DESIGN - the IP enriches a small molecule pool, so high pre-dedup duplication is normal; low duplication suggests failed IP.
CLIP QC progresses through five gates; failure at an earlier gate makes later gates meaningless.
| Gate | Metric | Tool | ENCODE threshold | Failure interpretation | |------|--------|------|------------------|------------------------| | 1. Preprocessing retention | % reads retained after UMI + adapter trim | cutadapt log | >= 70% | Adapter pattern wrong; degraded RNA | | 2. Alignment rate | % reads aligned to genome (unique) | STAR Log.final.out | >= 60% for eCLIP, 70% for iCLIP/PAR-CLIP | Wrong genome; rRNA pre-map missing | | 3. Library complexity | Predicted unique fragments at sequenced depth | preseq lc_extrap | >= 1M unique | Over-amplified or under-input library | | 4. IP enrichment | log2(IP/SMInput) at expected sites; FRiP | bedtools + idr | FRiP >= 0.005; log2 >= 3 at top peaks | Failed antibody / antibody not IP-grade | | 5. Reproducibility | IDR rescue + self-consistency ratios | idr | both < 2 | Biological variation too high; or low complexity |
A library failing at Gate 3 (complexity) cannot be rescued analytically; gates 4-5 fail downstream of complexity by construction.
Goal: Determine whether the CLIP library captured enough independent molecules to support genome-wide peak calling (ENCODE: >= 1M unique fragments per replicate).
Approach: Run preseq lc_extrap on the PRE-dedup BAM (preseq counts PCR duplicates to extrapolate); also compute picard ESTIMATED_LIBRARY_SIZE at sequenced depth. Flag any library predicted to plateau below 3M unique fragments at infinite depth.
# After alignment, BEFORE UMI dedup (preseq counts PCR duplicates)
preseq lc_extrap \
-B -P \
-o sample_complexity.txt \
sample_aligned.bam
# Output columns:
# TOTAL_READS EXPECTED_DISTINCT LOWER_0.95CI UPPER_0.95CI
# At 100M reads, EXPECTED_DISTINCT:
# >= 10M = excellent complexity
# 3-10M = acceptable; restrict to high-expression transcripts
# < 3M = library failed; cannot rescue analytically
# picard direct estimate at current depth
picard EstimateLibraryComplexity \
I=sample_aligned.bam \
O=picard_complexity.txt
# ESTIMATED_LIBRARY_SIZE > 5M = healthy CLIP library
A linear plateau on preseq's curve at low depth indicates over-amplification; a curve still climbing at sequenced depth means more sequencing would yield more unique fragments.
FRiP measures how much of the IP signal falls into the called peak set. ENCODE eCLIP narrow-binding RBP minimum: FRiP >= 0.005. Atypical-binding RBPs (rare-transcript binders like TROVE2 on Y RNAs) are exempt.
# Reads in peaks (use stringent peaks: log2 FC >= 3, -log10 p >= 3)
reads_in_peaks=$(bedtools intersect -c -s -a peaks.stringent.bed -b dedup.bam | awk '{s+=$NF} END {print s}')
total_reads=$(samtools view -c -F 4 dedup.bam)
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
# Per-region FRiP breakdown
for region in three_utr exon intron; do
rip=$(bedtools intersect -c -s -a peaks_${region}.bed -b dedup.bam | awk '{s+=$NF} END {print s}')
echo "${region}: $(echo "scale=4; $rip / $total_reads" | bc)"
done
| RBP class | Expected FRiP (ENCODE eCLIP) | |-----------|------------------------------| | Splicing factors (PTBP1, U2AF2) | 0.01 - 0.10 | | 3' UTR mRNA stability (HuR, PUM2) | 0.02 - 0.20 | | Translation factors (EIF3J) | 0.01 - 0.05 | | Repeat binders (MATR3) | 0.05 - 0.30 (high; concentrated in repeats) | | Mitochondrial (FASTKD2) | 0.05 - 0.40 (very high; chrM is small) | | snoRNA binders (DKC1) | 0.10 - 0.50 (high; snoRNA is rare) | | Failed IP (any RBP) | < 0.005 |
IDR (Li et al 2011) measures peak-rank reproducibility across replicates. ENCODE eCLIP convention applies IDR identically to ChIP-seq, using CLIPper + SMInput log2 FC + -log10 p as the ranking signal. The CLIP-specific consideration: rank by signalValue (log2 FC) or p-value, NOT by score column (CLIPper score is sparse and tied).
# Sort each replicate's peaks by signal
sort -k5,5gr rep1.compressed.bed > rep1.sorted
sort -k5,5gr rep2.compressed.bed > rep2.sorted
# True-replicates IDR (threshold 0.05)
idr --samples rep1.sorted rep2.sorted \
--input-file-type bed --rank 5 \
--output-file idr_true.out \
--idr-threshold 0.05 \
--plot --log-output-file idr.log
# Pseudo-replicates from each individual replicate (split BAM in half)
samtools view -b -h -s 1.5 rep1.dedup.bam > rep1.psr1.bam # seed 1, fraction 0.5
samtools view -b -h -s 2.5 rep1.dedup.bam > rep1.psr2.bam # seed 2 (different)
# Re-run peak calling on each pseudoreplicate, then IDR at threshold 0.10
ENCODE consistency rules for eCLIP:
The eCLIP design uses a size-matched input (SMInput) from the SAME lysate, treated identically (UV, IP buffer, RNase, ligation, IP, but with NO antibody addition - just bead-only control or a non-specific control IP). This is fundamentally different from IgG controls and from RNA-seq.
| Control | What it measures | Pros | Cons | |---------|------------------|------|------| | SMInput | Background from non-specific binding + ligation/RT/gel biases at same size | Captures all CLIP-specific biases; ENCODE standard | Requires same-day prep; cannot use a previous IgG library | | IgG-IP | Non-specific antibody binding to the same RBP-naive lysate | Direct nonspecificity measure | Yields very low (~3-10x less reads); high PCR dup; hard to normalize | | Empty beads (mock) | Bead surface non-specificity only | Cleanest baseline | Misses real CLIP background (IP buffer + ligation step bias) | | RNA-seq (matched cell type) | Transcript abundance | Easy to obtain | Misses CLIP-specific biases entirely; not a real CLIP control | | Total RNA / nuclear RNA | Cellular RNA distribution | Easy | Same as RNA-seq above |
Consensus (ENCODE / Hentze / Yeo): SMInput. The bead-only IgG/mock alternatives are ill-suited for quantification because their library yields are 5-10x lower, dominated by PCR duplicates, and produce sparse read-density tracks. RNA-seq cannot replace SMInput because it does not capture the non-specific binding that occurs during IP.
# SMInput preparation - same as IP except no antibody added during incubation
# Same UV dose, same lysate aliquot, same RNase, same library prep
# Critical: same SDS-PAGE size cut from membrane as the IP
# Check SMInput vs IP enrichment globally
# Whole-genome log2(IP_RPKM / SMInput_RPKM):
# > 0.5 globally indicates IP enrichment of expressed transcripts (good)
# ~ 0 indicates failed IP (SMInput == IP)
# RSeQC read_distribution.py shows fractional read placement
read_distribution.py -i dedup.bam -r gencode.v38.bed > read_dist.txt
# Output reports:
# CDS_Exons, 5'UTR_Exons, 3'UTR_Exons, Introns, TES_down_10kb, TES_down_1kb,
# TSS_up_10kb, TSS_up_1kb, Intergenic_region
# CLIP-seq biology-specific patterns:
# Splicing factor: > 60% Introns + 5' UTR exons (containing 5' splice sites)
# 3' UTR factor: > 50% 3'UTR_Exons
# m6A reader: 3'UTR_Exons + Stop_codon region
# Failed IP: matches RNA-seq distribution (no enrichment)
# GeneBody coverage for 5' vs 3' bias
geneBody_coverage.py \
-i dedup.bam \
-r housekeeping.bed \
-o sample_gb
# Flat curve = no positional bias (normal for most RBPs)
# 3' end bias = polyA-dependent enrichment (suspect)
# 5' end bias = nascent / TSS-proximal (only sensible for some RBPs)
# Paired-end fragment-length distribution
samtools view -f 2 dedup.bam | awk '$9 > 0 && $9 < 500 {print $9}' | sort -n | uniq -c > fragment_lengths.txt
# CLIP normal range: 20-75 nt insert (rises from short trimmed reads to ~75 nt)
# Wider range (20-200) seen in high-RNase / long-fragment protocols
# Narrow peak at one length (e.g., all reads at 30 nt) = over-trimmed
# Bimodal at 30 nt and 150 nt = library preparation artifact
# rRNA reads as fraction of total
total=$(samtools view -c -F 4 dedup.bam)
rRNA=$(samtools idxstats dedup.bam | awk '$1 ~ /rRNA|45S|18S|28S|5_8S/ { sum+=$3 } END {print sum}')
rrna_frac=$(echo "scale=4; $rRNA / $total" | bc)
echo "rRNA fraction: $rrna_frac"
# eCLIP without pre-map: rRNA 5-30% normal
# After pre-map: < 2% expected
# > 30% indicates rRNA dominance - IP captured ribosomes preferentially
# Some RBPs (RPL/RPS) expected to bind ribosomes; verify against RBP biology
The antibody is the single most common point of failure in CLIP. Even commercial "IP-grade" antibodies fail at 30-50% rates in practice. Cross-check IP enrichment against expected biology:
import pandas as pd
# Load peak file
peaks = pd.read_csv('peaks.stringent.bed', sep='\t', header=None,
names=['chr','start','end','name','log2fc','strand'])
# Filter top 100 peaks
top_peaks = peaks.nlargest(100, 'log2fc')
# Check enrichment in expected biology
# 1. If RBP is splicing factor: top peaks should be intronic or splice-site flanking
# 2. If RBP is HuR: > 70% top peaks should be 3' UTR
# 3. If RBP is FASTKD2: top peaks should be chrM
# 4. If RBP is FUS: GUGGU motif should appear in top motif enrichment
# Quick check
chrom_dist = top_peaks['chr'].value_counts(normalize=True)
print(chrom_dist.head(10))
# Healthy RBP: chromosomes represented in proportion to expression
# Failed IP: top chromosome chrM (mt artifact) or chr21/22 (housekeeping bias) > 20%
GO term sanity: top-peak genes should enrich for the expected biology (e.g., HuR -> immune / inflammation / mRNA stability terms; PTBP1 -> RNA splicing terms). If the top GO term is "cellular metabolism" or "translation" for a splicing factor, the IP failed.
Trigger: cutadapt log shows > 30% reads filtered (too short or no adapter).
Mechanism: Adapter pattern wrong; or RNA was degraded before fragmentation (all reads adapter-only).
Symptom: Catastrophic loss at the adapter trim step.
Fix: Verify adapter sequence in library prep documentation; check library quality control (Bioanalyzer / TapeStation) for RIN value; re-prep if RIN < 7.
Trigger: STAR Log.final.out reports Uniquely mapped reads % < 60% for eCLIP, < 70% for iCLIP.
Mechanism: rRNA contamination dominating (no pre-map); wrong species genome; or chimeric library (contamination).
Symptom: Low alignment rate; samtools idxstats shows most reads as rRNA-aligned.
Fix: Run bowtie2 pre-map to rRNA + repeats index; verify genome species; check for sample swap with samtools view -h sample.bam | grep '^@SQ'.
Trigger: preseq lc_extrap predicts plateau < 1M unique at sequenced depth; or picard reports ESTIMATED_LIBRARY_SIZE < 1M.
Mechanism: Over-amplification (> 25 PCR cycles); low input (< 5M cells); failed IP capturing only a few molecules.
Symptom: preseq curve flattens early; UMI clusters have median size > 8.
Fix: No analytic rescue. Re-prep with more input cells (10-20M for eCLIP), fewer PCR cycles (14-18 for eCLIP, 16-20 for iCLIP2). If forced to use this library, restrict downstream analysis to high-expression transcripts and acknowledge the limitation.
Trigger: FRiP value below ENCODE threshold for the RBP class.
Mechanism: IP failed to enrich specific binding sites; antibody cross-reactive or not IP-grade.
Symptom: Top peaks dominated by abundant transcripts (GAPDH, ACTB); GO enrichment generic; motif analysis returns AU-rich background.
Fix: Re-test antibody on a knockdown lysate (siRNA against the RBP) - if WB signal does not decrease, antibody is non-specific; switch antibody. ENCODE-validated antibodies are at encodeproject.org/biosamples.
Trigger: Whole-genome log2(IP/SMInput) is near 0 instead of positive.
Mechanism: IP and SMInput are equivalent - the antibody is not enriching any RBP-bound RNA.
Symptom: Per-peak log2 FC scaled around 0; no obvious enrichment at known motif sites.
Fix: Same as above - antibody failure. Consider tagged-RBP system (Halo-CLIP, GoldCLIP, or knock-in endogenous tag).
Trigger: ENCODE IDR test reports rescue ratio > 2 OR self-consistency > 2.
Mechanism: Replicates inconsistent. Biological variation high; OR one replicate has lower library complexity; OR one replicate failed IP.
Symptom: Per-replicate peak counts differ > 2x; IDR plot shows the two replicates have very different peak rank distributions.
Fix: Run preseq and FRiP on each replicate independently; identify the failing one; re-sequence to higher depth or re-prep. ENCODE-style: down-sample both replicates to common depth, re-test IDR.
Trigger: Saw 60% PCR duplication and panicked.
Mechanism: CLIP libraries have 40-70% PCR duplication BY DESIGN - the IP enriches a small pool. UMI dedup recovers the unique molecules underneath.
Symptom: "60% duplication" headline. Actual unique count is what matters.
Fix: Confirm UMI dedup has been applied and unique fragment count >= 1M. The duplication rate metric is meaningless without UMI context for CLIP.
| Symptom | Likely failure | Action |
|---------|----------------|--------|
| > 50% reads "too short" in cutadapt | Adapter wrong or RNA degraded | Verify adapter; check RIN |
| Most reads align to rRNA | No pre-map; or IP captured ribosomes | bowtie2 pre-map; or accept if RBP is ribosomal |
| Unique frag count < 1M | Over-amplified or under-input | No rescue; re-prep |
| FRiP < 0.005 (narrow RBP) | Failed IP | Switch antibody |
| IP/SMInput global log2 ~ 0 | Antibody non-specific | Switch antibody or use tagged-RBP |
| IDR rescue > 2 | Replicate inconsistency | Identify failing replicate; re-sequence |
| Top GO terms generic | Failed IP or contamination | Check antibody on KD lysate WB |
| chrM peaks abundant for non-mt-RBP | Mitochondrial contamination | Filter chrM or accept if mt-RBP |
| Fragment length all 30 nt | Over-trimmed | Loosen quality trim from -q 20 to -q 6 |
| Strand-specific peaks lost | bedtools without -s | Add -s strand flag |
# Aggregate all QC tools into a single report
multiqc \
fastqc_output/ \
cutadapt_logs/ \
star_logs/ \
umi_tools_logs/ \
preseq_output/ \
samtools_stats/ \
-o multiqc_report/
MultiQC reads logs from FastQC, cutadapt, STAR, umi_tools, preseq, samtools stats, picard, and others, and produces a single HTML report. For CLIP-seq, additionally run RSeQC read_distribution.py and document FRiP / IDR results manually.
| Metric | Threshold | Source | |--------|-----------|--------| | Biological replicates | >= 2 | ENCODE eCLIP | | Read length | >= 50 nt (PE) | ENCODE | | Unique fragments per replicate | >= 1M (or saturated) | ENCODE | | Preprocessing retention | >= 70% | Practitioner consensus | | Genome alignment rate (unique) | >= 60% eCLIP, 70% iCLIP | Practitioner consensus | | Library complexity (preseq) | >= 10M expected unique at 100M | ENCODE | | FRiP (narrow-binding RBP) | >= 0.005 | ENCODE | | FRiP (atypical-binding RBP) | Exempt | ENCODE | | log2(IP/SMInput) at stringent peaks | >= 3 | ENCODE | | -log10 p at stringent peaks | >= 3 | ENCODE | | IDR rescue ratio | < 2 | ENCODE | | IDR self-consistency ratio | < 2 | ENCODE | | rRNA fraction post pre-map | < 2% | Practitioner consensus | | Read distribution match to RBP class | Y | Sanity check | | Top motif consistent with literature | Y | Sanity check |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| preseq "all reads have duplicates" | Pre-deduplicated BAM passed to preseq | Re-run on PRE-dedup BAM |
| FRiP value > 1.0 | Peak set overlaps reads counted twice | Use unique-reads BAM; verify peak BED is non-redundant |
| IDR plot empty | Ranking column wrong; all peaks tied at same value | Rank by --rank 5 (log2 FC); sort BED first |
| MultiQC missing modules | Logs not in expected directory structure | Verify log paths; rerun multiqc with explicit -d |
| Read distribution: > 50% intergenic for mRNA RBP | TxDb missing transcripts; rRNA contamination | Update GENCODE; pre-map rRNA |
| GeneBody coverage 3' biased | polyA-selected library; not CLIP | Verify library prep was random hexamer |
| Fragment-length distribution narrow at 30 nt | Adapter aggressively trimmed at 5' | Loosen trim parameters |
| Antibody KD WB shows no decrease | Antibody non-specific | Switch antibody; use ENCODE-validated |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.