clip-seq/binding-site-annotation/SKILL.md
Annotate CLIP-seq peaks or crosslink sites to RNA features (5'UTR, CDS, 3'UTR, intron, splice junction, snoRNA, tRNA, ncRNA, repeat elements) with ChIPseeker, RCAS, RBP-Maps (Yeo splicing regulatory maps), and bedtools, applying feature-priority hierarchies, transcript-context resolution, and metagene aggregation. Use when characterizing where in transcripts an RBP binds, comparing peak distribution across regions, generating splicing-regulatory maps relative to alternative-splicing events, or distinguishing exonic vs intronic vs UTR binding.
npx skillsauth add GPTomics/bioSkills bio-clip-seq-binding-site-annotationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: ChIPseeker 1.40+, RCAS 1.30+, GenomicFeatures 1.56+, GenomicRanges 1.56+, rbp-maps (Yeo github), bedtools 2.31+, pybedtools 0.10+, pyranges 0.0.129+.
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parameterspip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws unexpected errors, introspect the installed package and adapt the example to match the actual API rather than retrying. ChIPseeker 1.40+ changed the priority defaults; verify the priority vector before pipelining.
"Annotate where in transcripts my RBP binds" -> Map CLIP peaks or single-nucleotide crosslink sites to RNA features and report the per-feature distribution. The interpretation is RBP-class-specific: splicing factors (PTBP1, U2AF2, RBFOX) bind intron-exon junctions; mRNA-stability regulators (HuR, PUM2) bind 3' UTRs; translation factors (EIF3J, RPS19) bind 5' UTRs and CDS; and small-ncRNA-binding RBPs (NSUN2 tRNAs, LARP7 7SK, TROVE2 Y-RNAs) bind specific non-coding transcripts. A correct annotation pipeline (a) resolves overlapping features by priority, (b) preserves transcript-isoform context, (c) generates metagene distributions, and (d) flags repeat-element overlap separately.
ChIPseeker::annotatePeak(peaks, TxDb=txdb, level='gene', tssRegion=c(-100,100)) then plotAnnoPie(anno)runReport(queryRegions=peaks, gffData=gencode_gtf, genomeVersion='hg38')bedtools intersect -s -wa -wb -a peaks.bed -b features.bed (manual hierarchy)yeolab/rbp-maps) generates the 1400 nt vectorized cassette-exon map used in Yeo lab ENCODE papersdeepTools computeMatrix or RSeQC geneBody_coverage.py for read profilesThe Yeo lab convention for ENCODE eCLIP: ChIPseeker for global region distribution + RBP-Maps for position-resolved splicing context + custom analysis for repeat-element overlap. ChIPseeker alone over-counts intronic binding because TSS-region default extends too far for RNA features.
| Tool | Input | Annotation level | Feature priority | Output | Strength | Fails when |
|------|-------|------------------|------------------|--------|----------|------------|
| ChIPseeker (R) | Peak BED + TxDb | Gene-level or transcript-level | Promoter > 5' UTR > 3' UTR > Exon > Intron > Intergenic | Per-peak annotation + pie chart + distance-to-TSS | Mature, widely used, fast | TSS-region default c(-3000, 3000) over-extends; promoter category meaningless for RNA |
| RCAS (R) | Peak BED + GFF | Transcript-level | Customizable | HTML report + per-region table | RNA-specific design; ncRNA-aware | Slow; HTML-heavy; less actively maintained |
| RBP-Maps (Yeo) | eCLIP BAM + alternative splicing event tables | Splice junction-level | NA (positional metagene) | Splicing regulatory map (1400nt vector around cassette exon) | The standard for splicing-factor CLIP analysis | Splicing-only; not for 3' UTR or ncRNA binders |
| ASTERIX (Yeo) | Peak BED | Region | NA | Region distribution + region-specific motifs | Companion to RBP-Maps for non-splicing regions | Less granular than ChIPseeker |
| HOMER annotatePeaks.pl | Peak BED | Gene-level | Promoter > UTR > Exon > Intron > Intergenic | Annotation TSV | Familiar from ChIP-seq | Same TSS issue as ChIPseeker; less RNA-aware |
| bedtools intersect | Peak BED + feature BED | User-defined | User-defined | Per-feature overlap | Most flexible | Manual hierarchy logic must be written |
| pyranges (Python) | Peak BED + GTF | Customizable | User-defined | DataFrame | Fast, pythonic, scriptable | Hand-written priority logic |
| pybedtools | Peak BED + GTF | Customizable | User-defined | iterators | Flexible, scripted | Same as pyranges |
| Peakhood (Reuter 2022) | Peak BED + transcripts | Transcript context | NA | Per-peak transcript context | Resolves transcript-isoform ambiguity | Specialized; not a global region tool |
Methodology evolves; verify the GTF source (GENCODE preferred over Ensembl for human; both have feature differences) and the priority hierarchy against published RBP literature.
RNA features are nested and overlapping. A 3' UTR peak is ALSO in an exon; an intronic peak near a splice site is ALSO in a transcribed-but-spliced region. Without a priority hierarchy the same peak gets counted multiple times.
Standard CLIP hierarchy (most-specific first):
ChIPseeker's default level='gene' reports the FIRST hit by priority. For CLIP, set level='transcript' and explicitly customize tssRegion=c(-100, 100) (default c(-3000, 3000) is for ChIP).
Why CLIP needs the tight TSS window: ChIPseeker was designed for ChIP-seq where the "Promoter" category captures transcription-factor binding within 3 kb upstream of a gene. For CLIP-seq, the binding substrate is RNA - the RBP cannot bind upstream of the TSS because there is no RNA upstream of the TSS. A 6 kb TSS window catches 5'-UTR peaks and unrelated upstream sequence, mislabeling 30-50% of CLIP peaks as "Promoter (2-3kb)" - implausible for an RNA-binding study. Tightening to c(-100, 100) constrains the "Promoter" category to the mRNA 5' end immediate vicinity and pushes deeper-5'-UTR peaks into the "5UTR" category where they belong.
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
peaks <- readPeakFile('peaks.stringent.bed')
# CLIP-appropriate annotation
anno <- annotatePeak(
peaks,
TxDb = txdb,
level = 'transcript', # RNA-isoform-aware
tssRegion = c(-100, 100), # tight; not a meaningful promoter for RNA binding
genomicAnnotationPriority = c(
'Promoter', '5UTR', '3UTR', 'Exon',
'Intron', 'Downstream', 'Intergenic'
)
)
print(anno)
plotAnnoPie(anno)
plotAnnoBar(anno)
plotDistToTSS(anno) # CLIP-specific: dist to TSS is dist to mRNA 5' end
| RBP class | Expected dominant region | RBP examples | |-----------|--------------------------|--------------| | Splicing factors | Intron (within 500 nt of splice junction) | PTBP1, U2AF2, RBFOX1/2, SRSF1-9, HNRNPC, MBNL1 | | 3' UTR mRNA stability/decay | 3' UTR | HuR (ELAVL1), PUM2, AUF1, ZFP36, TIA1, NUDT21 | | 5' UTR translation regulation | 5' UTR, CDS start | EIF4A3, EIF3J, eIF2 subunits, LARP1, PCBP1/2 | | Ribosome-associated | CDS | RPL/RPS proteins, RACK1 | | Nuclear export | CDS, 3' UTR | NXF1, ALYREF, SRSF3 | | m6A reader | 3' UTR, stop codon region | YTHDF1/2/3, IGF2BP1/2/3 | | miRNA effector | 3' UTR (seed-matched sites) | AGO1/2/3/4, DICER1, TNRC6A/B/C | | snoRNA-targeting RBP | snoRNA, rRNA | DKC1, FBL, NOP56, NHP2 | | ncRNA-binding | Specific ncRNA target | TROVE2 (Y RNA), LARP7 (7SK), SSB (vault) | | Mitochondrial | chrM (mt-mRNA + 12S/16S rRNA) | FASTKD2, LRPPRC, TFAM, MTPAP | | Histone mRNA | Histone 3' end stem-loop | SLBP | | Repeat-binding (TE) | Alu, LINE-1, LTR, SINE | MATR3, ZFP36, HNRNPK, HNRNPA2B1, FUS (LINE-1) | | LCD/IDR / phase separation | Diverse (Alu, intron, 3' UTR) | FUS, TDP-43, EWSR1, TAF15 |
Mismatch between observed and expected suggests: (a) failed IP (cross-reactive antibody binding wrong RBP), (b) artifact (high-expression contamination), (c) genuinely new biology - investigate.
For splicing factors (PTBP1, U2AF2, RBFOX, SRSF1, etc.), the Yeo lab RBP-Maps tool aggregates eCLIP signal in a 350 nt window flanking the upstream, cassette, and downstream exons, producing a 1400 nt vectorized regulatory map. This map shows the position-dependent enrichment of RBP binding relative to alternative splicing events and is THE standard for splicing-CLIP analysis.
# Yeo lab rbp-maps Snakemake workflow
git clone https://github.com/YeoLab/rbp-maps
cd rbp-maps
# Inputs:
# 1. eCLIP BAM (rep1 + rep2)
# 2. Cassette exon BED (from RNA-seq KD experiment; see clip-seq/differential-clip)
# 3. Background "native" exons (constitutively spliced)
python rbpmaps/RBPMaps.py \
--ip rep1.bam rep2.bam \
--input sminput.bam \
--positive cassette_inc.bed \
--negative cassette_exc.bed \
--background native_constitutive.bed \
--output rbpmaps_out/ \
--normalization input_normalization
Output: 1400 nt vector with eCLIP signal at every base, plotted as a metagene for cassette inclusion vs exclusion vs constitutive. A peak in the upstream-intron region (right of the cassette exon's 5' splice site) indicates intronic regulation; a peak in the cassette exon body indicates exonic regulation.
Trigger: Used ChIPseeker default tssRegion=c(-3000, 3000) for CLIP peaks.
Mechanism: The 6 kb TSS window catches CLIP peaks deep in the 5' UTR or upstream introns and labels them "Promoter (2-3kb)". For RNA biology, the "promoter" category is meaningless - the RBP cannot bind pre-mRNA upstream of the TSS.
Symptom: 30-50% of peaks labeled "Promoter (2-3kb)" - implausible for an RNA-binding study.
Fix: Set tssRegion=c(-100, 100) or c(-50, 50). Remove the "Promoter" category from the priority vector entirely if it gives 0 hits.
Trigger: Used level='gene' (default) on peaks that should be transcript-specific.
Mechanism: Gene-level annotation picks the canonical/longest transcript and assigns the peak to that transcript's features. A peak in an alternatively-spliced exon may be labeled "intron" because the gene-level reference picks an isoform where the exon is skipped.
Symptom: Splicing factor peaks heavily "intronic" when they should be exonic at alternative-cassette exons.
Fix: Use level='transcript'. Match isoforms to RBP-Maps cassette-exon coordinates for splicing factors.
Trigger: RCAS report on > 100k peaks.
Mechanism: RCAS generates an HTML report with many plots; rendering is slow.
Symptom: Run time > 1 h; out-of-memory; HTML failing to render.
Fix: Sub-sample peaks to top 10k; or use ChIPseeker for the global view and reserve RCAS for per-region deep dives.
Trigger: Splicing-factor CLIP with no companion RNA-seq KD or no SE event BED.
Mechanism: RBP-Maps's regulatory metagene only makes sense relative to cassette exons differentially regulated by the RBP. Without the SE event table, the tool falls back to plotting at all native exons (no differential signal).
Symptom: Flat metagene; no position-dependent enrichment.
Fix: Generate the cassette-exon table from companion RNA-seq KD experiments (rMATS / MAJIQ / LeafCutter; see alternative-splicing skills) OR use a published table from ENCODE shRNA RNA-seq.
Trigger: Hand-written bedtools intersect chain for region annotation.
Mechanism: Forgetting to enforce priority leads to double-counting: a 3' UTR peak intersects 3UTR.bed AND exon.bed.
Symptom: Per-region counts sum to > total peak count by 20-50%.
Fix: Process in priority order; remove already-annotated peaks from the remaining set with -v before next intersect. Or use ChIPseeker / RCAS which encode the hierarchy internally.
Trigger: Repeat-binding RBP (MATR3, HNRNPK) annotated without RepeatMasker BED.
Mechanism: Genomic feature BEDs (UTR/CDS/intron) do not flag repeat instances within them. An MATR3 peak in an intronic LINE-1 is labeled "intron" when "LINE-1 within intron" is the biology.
Symptom: Repeat-binding RBP looks like an intronic binder; functional analysis (GO terms) returns generic "RNA processing".
Fix: Pass RepeatMasker BED (UCSC) as a separate axis; annotate each peak with BOTH region AND repeat-class. Cross-check repeat-class fraction against literature: > 15% Alu / LINE / LTR overlap indicates a repeat binder.
Trigger: Standard TxDb annotation drops chrM transcripts; FASTKD2 / LRPPRC peaks all "intergenic".
Mechanism: Some TxDb objects (older releases) exclude mitochondrial chromosome; ChIPseeker reports those peaks as having no annotation.
Symptom: chrM peaks labeled "Distal Intergenic" or simply missing from the annotation table.
Fix: Verify TxDb includes chrM (genes(txdb, filter=list(tx_chrom='chrM'))); add custom mt-mRNA + mt-rRNA + mt-tRNA BED if missing.
Trigger: bedtools intersect -wa -wb without -s flag.
Mechanism: CLIP is strand-specific. Without -s, peaks on one strand can be annotated to features on the other strand - especially for overlapping transcripts (sense/antisense pairs).
Symptom: Antisense transcript peaks contaminate the sense annotation; per-region distribution shifted.
Fix: Always pass -s to bedtools intersect for CLIP. ChIPseeker handles strand automatically when peaks have BED column 6.
Metagene = aggregated profile of CLIP signal at each base of a normalized feature (e.g., 5' UTR, CDS, 3' UTR scaled to common length). Reveals position-dependent binding.
Goal: Generate a position-resolved CLIP-signal profile aggregated across all 3' UTRs (or other features) to identify position-dependent binding modes (e.g., stop-codon-proximal m6A readers, polyA-proximal CPEB).
Approach: Convert dedup BAM to strand-stranded bigWig, then use deepTools computeMatrix scale-regions to align signal across feature instances of variable length, and plotProfile / plotHeatmap for visualization.
# deepTools computeMatrix for metagene
# Step 1: peaks BED -> bedgraph
bedtools genomecov -bg -strand + -ibam dedup.bam > clip_plus.bg
bedtools genomecov -bg -strand - -ibam dedup.bam > clip_minus.bg
# combine to bigwig
bedGraphToBigWig clip_plus.bg chrom.sizes clip_plus.bw
# Step 2: metagene over 3' UTRs
computeMatrix scale-regions \
-S clip_plus.bw clip_minus.bw \
-R three_prime_UTRs.bed \
--regionBodyLength 500 \
--beforeRegionStartLength 100 \
--afterRegionStartLength 100 \
-o matrix.gz
plotProfile -m matrix.gz -o metagene.png --perGroup
plotHeatmap -m matrix.gz -o metagene_heatmap.png
| Metagene shape | Biology | |----------------|---------| | Peak at stop codon region | m6A reader (YTHDF1/2/3); IGF2BP1; some 3' UTR factors | | Peak in proximal 3' UTR (within 200 nt of stop) | PUM2, IGF2BP, miRNA effectors | | Peak in distal 3' UTR (within 200 nt of poly-A) | NUDT21, CPEB1, polyadenylation regulators | | Peak at 5' splice site flanking intron | RBFOX, MBNL, HNRNPC | | Peak at 3' splice site / branch point | U2AF2, PTBP1 | | Peak at 5' UTR (within 50 nt of TSS) | EIF3, EIF4A3, LARP1 | | Flat across CDS | Cytoplasmic ribosome-associated; less position-specific |
| Scenario | Tool + parameters | Why |
|----------|-------------------|-----|
| Global region distribution (3' UTR / CDS / intron) | ChIPseeker tssRegion=c(-100,100) level='transcript' | Mature, fast, ENCODE-comparable |
| Splicing factor regulatory map | RBP-Maps (Yeo) with RNA-seq KD cassette table | The standard for splicing-CLIP |
| RNA-feature aware (snoRNA, lncRNA-specific) | RCAS or custom GTF + bedtools | RCAS handles ncRNA features |
| Transcript-isoform context resolved | Peakhood + ChIPseeker level='transcript' | Resolves which isoform the peak supports |
| Metagene 3' UTR / 5' UTR | deepTools computeMatrix + plotProfile | Standard metagene visualization |
| Repeat-element overlap (MATR3, HNRNPK) | bedtools intersect with RepeatMasker | Repeat axis is independent of region axis |
| Mitochondrial RBP (FASTKD2) | Custom chrM-aware annotation | Standard TxDb may drop chrM |
| Bulk RBP overview | ChIPseeker plotAnnoPie + plotDistToTSS | Quick orienting plots |
| Compare two RBPs' region distributions | ChIPseeker annotatePeakList + side-by-side bar | Direct comparability |
| miRNA target site (AGO-CLIP) | bedtools intersect with 3' UTR + TargetScan seed scan | miRNA biology specific |
| m6A reader (YTHDF, IGF2BP) | ChIPseeker + custom stop-codon-distance plot | m6A biology stops-codon-proximal |
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| ChIPseeker "Promoter" >> 5' UTR for CLIP | TSS region too wide | Tighten tssRegion=c(-100,100) |
| ChIPseeker gene-level intronic; RBP-Maps shows exonic at cassettes | Gene-level picks canonical isoform; RBP-Maps uses cassettes | Trust RBP-Maps for splicing factors |
| Many peaks "intergenic" near a known gene | TxDb missing the transcript | Verify GENCODE vs Ensembl GTF; add lincRNA / unannotated transcripts |
| chrM peaks missing | TxDb excluded chrM | Add custom mt-mRNA BED |
| Sum of per-region counts > total peaks | Hierarchy not enforced | Use ChIPseeker; or bedtools intersect -v chain |
| Repeat-binding RBP looks "intronic" | RepeatMasker axis not added | Annotate peaks with RepMask intersect separately |
| Metagene flat for known position-specific RBP | Wrong feature BED (e.g., gene-body instead of transcript-isoform) | Use isoform-aware feature BED |
| RBP-Maps signal flat at cassette exons | No RNA-seq KD reference table | Generate cassettes from companion KD; use ENCODE shRNA tables |
Operational rule: Report (a) ChIPseeker global region pie, (b) RBP-Maps metagene for splicing factors, (c) repeat-element fraction separately, (d) transcript-isoform context with Peakhood if ambiguous. Three views together prevent misinterpretation.
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| "Promoter" >> 30% of peaks in CLIP | TSS region default too wide | tssRegion=c(-100,100) |
| Strand-mixing artifacts | Bedtools intersect without -s | Always pass -s for CLIP |
| 3' UTR fraction < 10% for HuR | Wrong txdb version (UTR poorly annotated) | Use GENCODE v38+ TxDb |
| chrM peaks lost | TxDb excludes mitochondrial | Add mt-transcript BED |
| RBP-Maps flat metagene | No cassette exon BED | Generate from RNA-seq KD; or use ENCODE table |
| Repeat fraction over-counted | Same repeat instance counted multiple times | bedtools merge repeat BED first |
| ChIPseeker level='gene' collapses isoforms | Default level=gene | Set level='transcript' |
| Annotation runs slow for large peak set | RCAS HTML generation | Use ChIPseeker for global; RCAS for top-K |
| GO enrichment yields generic terms | Repeat-binding RBP annotated as intronic | Re-annotate with RepeatMasker axis |
| Peak in unannotated lincRNA labeled intergenic | Older GENCODE missing the lincRNA | Update GENCODE; add manual transcript BED |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.