clip-seq/clip-motif-analysis/SKILL.md
Discover RBP binding motifs from CLIP-seq peaks or single-nucleotide crosslink sites using HOMER, MEME/STREME, kpLogo, mCross (CL-position-registered motifs), PEKA (positional k-mer enrichment), RBPamp (affinity), and RNA Bind-n-Seq (RBNS) cross-validation. Use when characterizing RBP sequence specificity, registering motifs to crosslink positions, validating in vivo CLIP motifs against in vitro RBNS Kd, reconciling motif disagreements across tools, or correcting for the uracil crosslinking bias that contaminates raw CLIP motif logos.
npx skillsauth add GPTomics/bioSkills bio-clip-seq-clip-motif-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: HOMER 4.11+, MEME Suite 5.5+ (STREME, MEME-ChIP, FIMO), bedtools 2.31+, kpLogo 1.1+, mCross v1+, PEKA v1+, RBPamp 0.9+, ggseqlogo 0.1+, biopython 1.83+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesIf code throws unexpected errors, introspect the installed tool and adapt the example to match the actual API rather than retrying.
"Find enriched RNA motifs at my RBP binding sites" -> Discover the in vivo sequence preference of an RNA-binding protein from CLIP-seq peaks or single-nucleotide crosslink sites. The fundamental confound is the uracil bias of UV254 crosslinking: U is the most-crosslinked base (>50% of CL events), so naive motif logos centered on CL positions are U-enriched even for non-U-binding RBPs. Modern tools (mCross, PEKA) register motifs relative to the CL position and correct for this bias; legacy tools (HOMER, MEME) need careful background selection.
findMotifs.pl peaks.fa fasta motif_out -rna -len 5,6,7,8 -p 4streme --rna --oc streme_out -p peaks.fa -n background.fa --minw 5 --maxw 10mCross -i crosslinks.bed -g genome.fa -k 7 -o mcross_out (jointly models motif + CL position)peka --peak_file_name peaks.bed --crosslinks_file_name crosslinks.bed --genome_file_name genome.fa --regions_file_name regions.bed --kmer_length 5 --percentile 30 --outpath peka_out (the short flags -i/-x/-g/-r/-k/-p shown in earlier docs are not all stable; use the long forms or check peka --help)rbpamp run -i peaks.fa -k 7 -o rbpamp_out (joint affinity + motif model)kpLogo crosslinks_with_kmer_scores.txt -o kplogo_out (position-specific significance)ENCODE eCLIP motif convention: extract sequences from peaks (CLIPper + SMInput stringent), use shuffled or expression-matched background, run HOMER -rna -len 5,6,7,8. The Yeo lab eCLIP papers typically report the top 1-3 HOMER hits per RBP and validate with mCross for CL-registered position. For ASB and variant-effect work, mCross is mandatory.
| Tool | Input | Background | CL-position aware | Output | Strength | Fails when | |------|-------|------------|-------------------|--------|----------|------------| | HOMER findMotifs.pl | Peak FASTA | Auto-shuffled or user-provided | No | PWM + de novo + known scan | Mature, fast, RNA mode (U instead of T) | Background generation is heuristic; uracil bias inflates U-content motifs | | MEME (MEME Suite) | Peak FASTA | Optional Markov model | No | PWM | Statistically rigorous; broadly cited | Slow on large peak sets (> 1000 sequences); single-motif default | | STREME (MEME Suite) | Peak FASTA | Shuffled or user-provided | No | PWM | Fast successor to MEME-ChIP; tolerant of large sets | Same uracil bias issue as HOMER | | MEME-ChIP | Peak FASTA | Markov order 1-3 | No | Multiple motifs | Wrapper for MEME + CentriMo + Tomtom comparison | Designed for ChIP; for CLIP use STREME directly | | kpLogo | k-mer scores with positions | NA (significance from kpLogo internals) | Yes (positional) | Position-specific logo | Visualizes position-specific signal; complements PWM | Requires k-mer-score input file; not a peak-to-motif tool by itself | | mCross (Feng 2019) | Single-nucleotide CL sites + genome | Generated internally | Yes (jointly models motif + CL position) | Registered PWM + CL offset | The standard for in vivo RBP motif registration; SRSF1 example showed clustered GGA half-sites | Requires high-quality single-nt CL sites (PureCLIP or CTK CITS) | | PEKA (Mihai 2022) | Peaks + crosslink BED | Low-count crosslinks within same dataset | Yes (positional k-mer) | Position-enriched k-mers with significance | No external input required; cross-validates with mCross | Less granular than mCross's jointly-modeled PWM | | RBPamp (Jens 2020) | Peak FASTA or CL sites | Joint affinity model | No (affinity-based) | Affinity-weighted motif + Kd estimate | Provides Kd; compatible with RBNS calibration | Slow; small community; needs validation | | RCAS | Peak BED + GTF | Auto | No | Motif + region annotation | One-stop for motif + region distribution | Less granular than per-tool dedicated runs | | MEME FIMO (motif scan) | PWM + FASTA | Markov bg | No | Match locations and scores | Scan known motifs across genome | Not a de novo tool; complements HOMER known scan | | RBPNet (Jens & Gagneur 2024) | Raw sequence | Trained model | Yes (sequence-to-signal at single nt) | Predicted CL count distribution | Deep learning; predicts per-base affinity profile | See clip-seq/clip-deep-learning |
Methodology evolves; verify motif comparison databases (CISBP-RNA, ATtRACT, oRNAment, mCrossBase, RBPDB) for known motif validation. RBNS in vitro Kd values (Lambert 2014; Dominguez 2018 for 78 RBPs) are the most reliable orthogonal reference for in vivo CLIP motifs.
Peak-based (HOMER, MEME/STREME, RBPamp): Extract sequences from peak BED, find enriched k-mers / PWMs. Produces classical motif logos. Fast and intuitive. Captures broad binding zones.
Crosslink-site-based (mCross, PEKA, kpLogo): Use single-nucleotide CL positions; analyze sequence in a window around each CL position. Produces motif logos REGISTERED to the CL site. Required for understanding the structural relationship between motif and crosslink.
The fundamental difference: peak-based logos show "what sequence is enriched near binding"; CL-registered logos show "what sequence the RBP contacts at the crosslink position." For most RBPs they agree on the core motif but diverge on flanking context.
UV254 crosslinking is U-biased: ~60-80% of single-nt CL events occur at U residues (Konig 2010; Sugimoto 2017). Consequence: motif logos centered on raw CL positions are U-enriched at the center even for non-U-binding RBPs (e.g., PUM2 binds UGUANAUA but peak-center is shifted; HNRNPK binds C-rich tracts but centers can show spurious U).
Mitigations:
--markov-order 1 does).For naive HOMER/MEME output, examine the central column of the PWM: if it is U-skewed, suspect CL bias rather than true U preference. Cross-check against published RBNS Kd if available.
Trigger: HOMER auto-generates a shuffled background that preserves only nucleotide frequency, not GC distribution per region; CLIP peaks come disproportionately from 3' UTRs which are AU-rich.
Mechanism: AU-rich foreground vs GC-shuffled background produces "AU-rich motif" as the top hit for any 3' UTR-binding RBP, regardless of biology.
Symptom: Top HOMER motif looks like the 3' UTR mean composition (AU-rich, no clear specificity); known motif scan (-known) finds the RBP's canonical motif at lower rank than the spurious "AU" motif.
Fix: Provide a GC-matched background: extract sequences from random 3' UTR regions of expressed transcripts matched by length distribution. Use -bg matched_bg.fa flag. Or use STREME with -n matched_bg.fa.
Trigger: MEME run on > 1000 peak sequences.
Mechanism: MEME is O(N^2) in sequence count; classical OOPS/ZOOPS/ANR models do not scale.
Symptom: Runtime > 12h; out-of-memory on large machines.
Fix: Use STREME (MEME Suite >= 5.0); it is the modern fast successor (sub-quadratic). Or restrict to top-1000 peaks by score.
Trigger: mCross run on a peak BED instead of crosslink BED.
Mechanism: mCross's model is motif_PWM x CL_position_offset_PMF. Without single-nt CL positions there is no offset axis; the model degenerates.
Symptom: mCross output looks identical to a peak-centered logo; the diagnostic CL-offset histogram is flat.
Fix: Provide CL sites from PureCLIP, CTK CITS, or PARalyzer. The Yeo lab and Zhang lab use the convention of PureCLIP CL sites as mCross input.
Trigger: Concerned that PEKA's "low-count crosslinks within same dataset" background is biased.
Mechanism: PEKA does NOT need external input. It uses low-count crosslinks (below a quantile threshold) as the background and high-count crosslinks (above) as the foreground. This is by design (Mihai 2022) and cross-validated against mCross.
Symptom: False positive concern; user wants to verify.
Fix: Run BOTH PEKA and mCross; cross-validate. PEKA paper shows >90% concordance with mCross for ENCODE RBPs; treat them as orthogonal confirmation.
Trigger: Large peak set (> 5000); RBPamp jointly optimizes affinity + motif iteratively.
Mechanism: Joint optimization is slow; some RBPs have multimodal binding (multiple motifs) that confuse single-PWM RBPamp.
Symptom: Runtime > 24h; output PWM disagrees with HOMER top motif.
Fix: Down-sample to top-2000 peaks; or run HOMER for initial PWM and use it as RBPamp seed.
Trigger: Comparing CLIP-derived motif to RBNS Kd-ranked motif (Dominguez 2018) and finding divergence (e.g., CLIP top motif rank 3 in RBNS; RBNS top motif weakly enriched in CLIP).
Mechanism: In vivo binding is shaped by structure accessibility (RNA secondary structure), cooperativity, and competing factors. RBNS measures intrinsic affinity in vitro. They diverge for:
Symptom: CLIP motif logo plausible but rank-3 in RBNS; or RBNS top motif missing from CLIP peaks.
Fix: Accept the divergence as informative. Cite both. Use SHAPE-eCLIP or icSHAPE-MaP to test structural-accessibility hypothesis. For variant-effect studies, prefer RBNS-derived PWM (RBPamp Kd) over CLIP motif for absolute affinity prediction.
Trigger: Used FIMO with default -thresh 1e-4 and got matches everywhere.
Mechanism: Default threshold is too lenient for in vivo binding; CLIP background sequences contain many low-affinity matches.
Symptom: FIMO match rate > 50% of peaks AND > 20% of random background; no enrichment signal.
Fix: Use -thresh 1e-5 or -thresh 1e-6; or use CentriMo for local enrichment around peak center (testing positional enrichment, not threshold).
| Scenario | Tool | Why |
|----------|------|-----|
| Quick de novo motif from CLIPper peaks, ENCODE-comparable | HOMER -rna -len 5,6,7,8 with GC-matched bg | Fast, mature, ENCODE convention |
| Motif registered to crosslink position (for variant effect) | mCross on PureCLIP single-nt sites | The 2019 Feng paper is the canonical reference |
| Confirm mCross result with independent method | PEKA on same peaks + crosslinks | Cross-validation; PEKA does not need input control |
| Affinity-weighted motif with Kd estimate | RBPamp | Provides quantitative Kd; integrates with RBNS comparison |
| Compare to in vitro RBNS Kd | RBPamp + Dominguez 2018 / Lambert 2014 tables | RBNS is the orthogonal in vitro standard |
| RBP with known motif - validation scan | FIMO -thresh 1e-5 against CISBP-RNA / ATtRACT PWM | Lower-confidence motifs need tight threshold |
| Multiple binding modes (SRSF1 GGA clusters) | mCross + manual inspection of CL-offset histogram | mCross reveals the multimodal structure |
| Position-specific logo (5' vs 3' of CL) | kpLogo with k-mer scores | Position-specific significance visualization |
| AU-rich 3' UTR binders (HuR, AUF1) - validate U bias is real | Compare HOMER motif center vs PUM2-style register | RBP-specific tradition |
| Allele-specific motif (variant effect) | mCross PWM + DeepRiPe/RBPNet for variant scoring | See clip-seq/clip-deep-learning |
| snoRNA / structural RNA - sequence motif less meaningful | Skip motif analysis; emphasize structural context (icSHAPE) | snoRNA binders read structure, not linear sequence |
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| HOMER and STREME agree; mCross disagrees on flanking | mCross's CL-offset registers a flanking base that's identical in peak; HOMER sees both flanking bases as motif | Trust mCross for CL-registered position; HOMER is "what's nearby" |
| Top HOMER motif AU-rich; RBP knows to bind structure | GC-matched background not used | Re-run HOMER with -bg matched_bg.fa |
| HOMER finds canonical motif at rank 1; RBNS Kd ranks at rank 3 | In vitro vs in vivo specificity divergence | Both valid; cite RBNS for absolute affinity, CLIP for in vivo prevalence |
| mCross PWM has multiple CL-position peaks | Multiple binding modes (SRSF1 GGA half-sites; PTBP1 multimer) | Treat as biologically meaningful; report both |
| Skipper window-level motifs differ from CLIPper peak-level | Window size dilutes signal; peak boundaries sharpen | Both valid; report at the level of the downstream analysis |
| PAR-CLIP motifs cleaner than iCLIP for same RBP | PAR-CLIP T->C registers exactly; iCLIP truncates -1 from CL | Both correct; PAR-CLIP is the sharper single-nt method when 4SU labeling tolerated |
| HOMER finds different motif on top vs bottom strand | Strand information lost in upstream BED | Verify BED column 6 (strand) is preserved; HOMER respects strand |
Operational rule for high-confidence motif reporting: Report (a) HOMER top de novo motif with GC-matched background, (b) mCross PWM with CL-position offset histogram, (c) PEKA k-mer enrichment for orthogonal confirmation, (d) FIMO scan against published RBP PWM (CISBP-RNA / ATtRACT) for known-motif validation, (e) RBNS Kd ranking from Dominguez 2018 if available. Three independent methods agreeing on the core motif is the bar.
Goal: Produce a publication-grade motif report combining peak-based de novo discovery, CL-position-registered motif (mCross), and known-motif validation.
Approach: Extract sequences with strand preservation, build a GC-matched 3' UTR background, run HOMER + STREME for de novo, mCross for CL-registered, FIMO scan against CISBP-RNA, and report all three views with information-content QC.
# Step 1: Extract peak sequences (use stringent peak set: log2 FC >= 3, -log10 p >= 3)
bedtools getfasta -fi genome.fa -bed peaks.stringent.bed -s -fo peaks.fa
# Step 2: GC-matched background (random regions from expressed transcripts)
# expressed.bed = transcripts with TPM >= 1 in the same cell type
shuffleBed -i peaks.stringent.bed -g chrom.sizes -incl expressed.bed -seed 42 > shuffled.bed
bedtools getfasta -fi genome.fa -bed shuffled.bed -s -fo background.fa
# Step 3: HOMER de novo + known
findMotifs.pl peaks.fa fasta homer_out \
-rna -len 5,6,7,8 -p 8 \
-fasta background.fa
# Step 4: STREME for cross-validation
streme --rna --oc streme_out -p peaks.fa -n background.fa --minw 5 --maxw 10
# Step 5: mCross requires single-nt CL sites (PureCLIP output)
# crosslinks.bed = PureCLIP -o sites.bed (single-nt CL positions, see clip-seq/crosslink-site-detection)
mCross \
-i crosslinks.bed -g genome.fa -k 7 -n 5 -o mcross_out
# Step 6: PEKA orthogonal positional k-mer
peka --peak_file_name peaks.stringent.bed --crosslinks_file_name crosslinks.bed --genome_file_name genome.fa --regions_file_name regions.bed --kmer_length 5 --percentile 30 --outpath peka_out
# Step 7: Compare to RBNS in vitro Kd (Dominguez 2018 supplementary)
# Manual cross-reference to RBNS Kd-ranked top-5 motifs for the target RBP
import numpy as np
import pandas as pd
from Bio import motifs
def parse_homer_motif(motif_file):
'''Parse HOMER motif PWM and return information content'''
pwm = []
with open(motif_file) as f:
for line in f:
if line.startswith('>'):
continue
pwm.append([float(x) for x in line.strip().split()])
pwm = np.array(pwm)
# Information content per position (bits)
epsilon = 1e-10
ic = np.sum(pwm * np.log2(pwm / 0.25 + epsilon), axis=1)
return {
'length': pwm.shape[0],
'mean_IC_per_position': np.mean(ic),
'max_IC': np.max(ic),
'min_IC': np.min(ic),
'is_low_complexity': np.mean(ic) < 0.5
}
# A high-quality CLIP motif has mean IC per position > 1.0 bit
# Mean IC < 0.5 suggests no real motif - check for background mismatch
| Metric | Good | Investigate | |--------|------|-------------| | Mean IC per position (HOMER PWM) | > 1.0 bit | < 0.5 = no real motif or background mismatch | | Motif center U content | matches RBP biology | > 80% U in central position = U-CL bias suspect | | FIMO scan rate in peaks vs background | peaks 3-10x bg | < 2x = motif too weak; > 50x = motif too narrow | | mCross CL-offset histogram | single mode at offset N | flat = mCross unable to register | | RBNS rank of CLIP top motif | top 3 | > 10 = strong in vivo / in vitro divergence; investigate | | Replicate motif overlap | top motif identical in 2/2 reps | discordance = under-powered or one rep failed |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| HOMER top motif is "ATCC" or similar TruSeq-adapter-like | Adapter trim incomplete in preprocessing | Re-run cutadapt with correct adapter; verify trimming completed before peak calling |
| mCross "no crosslinks found" or output identical to peak-centered logo | Passed peak BED instead of single-nt crosslink-site BED | mCross requires PureCLIP or CTK CITS single-nt site BED; not a CLIPper peak BED. See clip-seq/crosslink-site-detection |
| mCross CL-offset histogram is flat | Strand information missing OR alignment soft-clipped 5' base | Verify BED column 6 (strand); confirm upstream --alignEndsType EndToEnd |
| All HOMER motifs are AU-rich, no specificity | Default shuffled background AU-rich for 3' UTR peaks | Provide GC-matched background from expressed 3' UTRs |
| mCross input "no crosslinks found" | Provided peak BED instead of single-nt CL BED | Pass PureCLIP / CTK CITS single-nt BED instead |
| MEME "out of memory" or > 12h runtime | Peak set too large | Switch to STREME; or down-sample to top-1000 peaks |
| FIMO match in nearly every peak | Threshold too lenient | Tighten to 1e-5 or 1e-6 |
| Motif width > 10 nt looks chimeric | HOMER reported overlapping co-motifs as one | Re-run with -len 5,6,7,8 (force narrower); inspect rank 2-5 motifs |
| PWM info content < 0.5 bits | Background mismatch or peaks have no specificity | Verify peak set is stringent (log2 FC >= 3); regenerate background |
| Strand-specific motif different on +/- | BED column 6 missing or wrong | Verify strand encoding; re-extract with bedtools getfasta -s |
| Multiple distinct motifs per RBP | Biological (multimodal RBPs) OR contaminating multi-RBP CLIP | Validate with mCross multimode + compare to literature |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.