clip-seq/crosslink-site-detection/SKILL.md
Detect single-nucleotide crosslink (CL) sites in CLIP-seq data using truncation patterns (iCLIP/eCLIP CITS), crosslink-induced mutations (HITS-CLIP CIMS deletions, PAR-CLIP T-to-C), or HMM/kernel-density methods (PureCLIP, PARalyzer, CTK). Use when single-nucleotide resolution is required for motif registration (mCross), allele-specific binding (BEAPR), variant-effect prediction, or comparing crosslink chemistry across CLIP variants.
npx skillsauth add GPTomics/bioSkills bio-clip-seq-crosslink-site-detectionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: PureCLIP 1.3.1+, CTK 1.1.4+, PARalyzer 1.5+, wavClusteR 2.34+, pyCRAC 1.5+, samtools 1.19+, bedtools 2.31+, pysam 0.22+, R 4.3+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parametersIf code throws unexpected errors, introspect the installed tool and adapt the example to match the actual CLI rather than retrying.
"Detect single-nucleotide crosslink sites in my CLIP data" -> Identify the exact base where the protein-RNA UV adduct caused the reverse transcriptase to stop (truncation, in iCLIP/eCLIP), to read through with a mutation (deletion in HITS-CLIP, T->C in PAR-CLIP), or to leave a multi-base signature (PARalyzer kernel density for PAR-CLIP). Single-nucleotide resolution is the foundation of motif registration (mCross), allele-specific binding (BEAPR/ASPRIN), variant-effect prediction, and the most rigorous comparisons across CLIP variants. The detection chemistry differs by CLIP type: iCLIP/eCLIP read 5' end is one nucleotide downstream of the crosslink (CITS); PAR-CLIP reads contain T->C transitions at the crosslink (CIMS substitution); HITS-CLIP reads contain deletions at the crosslink (CIMS deletion).
pureclip -i dedup.bam -bai dedup.bam.bai -g genome.fa -ibam sminput.bam -ibai sminput.bam.bai -o crosslinks.bed -or regions.bed -nt 8 -dm 8parseAlignment.pl --map-qual 1 --min-len 18 --mutation-file mut.txt dedup.bam dedup.bed then tag2cluster.pl ... -cs5 5 -m 1 for truncation-clustergetMutationType.pl dedup.bed mut.txt -type del then CIMS.pl dedup.bed mut.txt -big -c -p 0.01 cims.bedgetMutationType.pl dedup.bed mut.txt -type sub -nuc t -mut c then CIMS.pl dedup.bed t2c.mut -p 0.001 t2c_cims.bedPARalyzer params.ini (parameters file defines read length, min reads per cluster, mutation rate threshold)wavClusteR::filterClusters(cl, snps=NULL, filterFC=FALSE) after wavelet clusteringSingle-nt CL sites are the input to mCross motif registration (see clip-seq/clip-motif-analysis), to BEAPR/ASPRIN allele-specific binding analyses, and to RBPNet/DeepRiPe deep-learning models (see clip-seq/clip-deep-learning). They are NOT a replacement for broad peak calls; the two outputs are complementary. A common downstream error is passing a CLIPper peak BED to mCross instead of a PureCLIP single-nt BED - mCross requires the single-nt resolution to register motif position relative to the crosslink offset.
The detection method must match the underlying chemistry of how the reverse transcriptase encountered the protein-RNA adduct:
| CLIP variant | RT behavior at CL | Detection signature | Sensitivity (CL captured per read) | Sequence bias | |--------------|-------------------|---------------------|-------------------------------------|---------------| | iCLIP / iCLIP2 / iCLIP3 / eCLIP / irCLIP / FLASH | Truncates at adduct | 5' end of cDNA = CL site - 1 | ~80% of reads truncate, 20% read through with deletion | Strong U bias (~60-80% of CL events at U) | | HITS-CLIP | Reads through with deletion (3-7% of crosslinked positions) | Single-base deletion in read | ~3-7% of reads contain deletion at CL site | U bias plus sequence-specific deletion rate | | PAR-CLIP (4SU) | Reads through with T->C (20-50% of T positions in crosslinked reads) | T->C transition (or G->A on reverse strand) | Per-T rate 20-50%; per-read 1-5 conversions | Restricted to T positions; depends on 4SU incorporation rate | | PAR-CLIP (6SG) | Reads through with G->A | G->A transition | Lower rate than 4SU T->C | Restricted to G positions | | miCLIP / miCLIP2 (m6A) | Primarily truncation at the m6A site (RT stops at antibody-trapped m6A); secondary C->T transition observed in a subset of reads at m6A | 5' end (CL -1); C->T rate is a feature used by m6Aboost ML in addition to truncation, not a stand-alone primary signal | Mixed; m6Aboost ML integrates truncation + sequence context to score | DRACH motif context required | | STAMP | C->T editing on target (not crosslink) | C->T editing in mRNA reads | NA (editing-based; not crosslink-based) | N/A | | TRIBE | A->I editing on target | A->G in cDNA (I read as G) | NA | Adjacent to ADAR consensus |
| Tool | CLIP variant | Detection signal | Statistical model | Output resolution | Strength | Fails when | |------|--------------|------------------|-------------------|-------------------|----------|------------| | PureCLIP (Krakau 2017) | iCLIP/eCLIP/PAR-CLIP | Truncation + enrichment + CL motif | Non-homogeneous HMM | Single-nucleotide | The most comprehensive HMM model | F1 ~0.2 on broad-binding RBPs (only ~4 sites per CLIP on benchmark) | | CTK CITS (Shah 2017) | iCLIP/eCLIP | Truncation only | Empirical FDR vs background | Single-nucleotide | Simple, well-validated for iCLIP | Less granular than PureCLIP; no HMM smoothing | | CTK CIMS deletion (Shah 2017) | HITS-CLIP | Single-base deletions | Empirical FDR | Single-nucleotide | Standard for HITS-CLIP | Requires deletion-tolerant aligner (BWA -e) | | CTK CIMS T->C (Shah 2017) | PAR-CLIP | T->C substitutions | Empirical FDR | Single-nucleotide | Alternative to PARalyzer | Less popular than PARalyzer | | PARalyzer (Corcoran 2011) | PAR-CLIP | T->C transitions, kernel density | Kernel density estimation | Cluster (10-50 nt) | Field standard for PAR-CLIP | Cluster-level, not single-nt; needs careful parameter tuning | | wavClusteR (Comoglio 2015) | PAR-CLIP | T->C with wavelet smoothing | Wavelet transform + clustering | Cluster | Robust to sequencing depth variability | Less single-nt; legacy R package | | pyCRAC / kPLogo | CRAC (yeast) | Read truncation + deletion | Empirical | Single-nucleotide | Original CLIP analytics tool | Yeast-focused; perl/python legacy | | Piranha bins | Any | Coverage in bins | ZTNB | Bin-level (50-200 nt) | Not crosslink-specific | Bin width too coarse for single-nt | | HOMER tag2pos | Any | 5' end position | None | Read-end positions only | Quick truncation site dump | No statistical filtering | | omniCLIP | Any | Coverage + variant pattern | Dirichlet-multinomial HMM | Region | Not focal | Too broad for single-nt | | iCount (paths) | iCLIP | Crosslink site cluster | Empirical | Cluster | Used in nf-core/clipseq | Less single-nt than PureCLIP |
Methodology evolves; PureCLIP 2.0 and CTK 1.2 have minor flag changes; PARalyzer parameters need careful tuning per-RBP. Cross-validate single-nt sites with at least two tools when high-confidence reporting is required.
Three orthogonal approaches:
Truncation-based (CITS, PureCLIP) -- Find positions enriched in cDNA 5' ends. The RT enzyme stops at the adduct; the 5' end of the read maps to CL site - 1. Used for iCLIP/eCLIP. Pro: high yield (~80% of reads truncate). Con: U bias of crosslinking inflates U positions.
Mutation-based (CIMS for deletions/substitutions; PARalyzer for T->C) -- Find positions with crosslink-induced mutations. RT reads through the adduct with high mutation rate at the CL position. Used for HITS-CLIP (deletions, 3-7%) and PAR-CLIP (T->C, 20-50%). Pro: less U bias; PAR-CLIP signal is restricted to T positions. Con: lower yield (only mutated reads count).
HMM-based (PureCLIP, omniCLIP) -- Joint model of enrichment + CL signature + sequence context. State the most-likely posterior probability of CL state per nucleotide. Used across CLIP variants. Pro: integrates multiple signals; explicit input normalization. Con: F1 ~0.2 on bulk RBPs (very focal); single-nt resolution at the cost of breadth.
| Goal | Method | Tool | |------|--------|------| | Single-nt CL map for motif registration (mCross) | HMM | PureCLIP | | Truncation-based ENCODE-compatible iCLIP | Truncation | CTK CITS | | PAR-CLIP T->C signal | Mutation | PARalyzer (cluster) or CTK CIMS sub T->C (single-nt) | | HITS-CLIP deletion signal | Mutation | CTK CIMS deletion | | Cross-CLIP-variant comparable single-nt | HMM | PureCLIP (all variants) | | Allele-specific CLIP (BEAPR) | Truncation + ASB | PureCLIP CL sites + WASP-filtered BAM | | Variant effect (computational from CL) | HMM | PureCLIP + downstream DeepRiPe (see clip-deep-learning) | | Yeast / CRAC | Truncation + deletion | pyCRAC |
Trigger: RBP with low IP enrichment; SMInput depth uneven; PureCLIP run on full genome without -iv interval.
Mechanism: PureCLIP HMM convergence is sensitive to coverage depth. On regions with < 5 reads/100 bp the HMM defaults to background state; if half the genome is sparse, the state distributions become bimodal and the entire run takes > 24 h or runs out of memory.
Symptom: PureCLIP "Convergence not reached" warning; or output has 0 CL sites; or runtime > 24 h on standard server.
Fix: Restrict analysis to expressed transcripts: -iv expressed_tx.bed. Or filter the BAM to high-coverage regions first. Test on chr22 first to verify parameters before genome-wide run.
Trigger: RBP with broad binding (PUM2 3' UTRs, SR proteins exonic enhancers); using PureCLIP exclusively.
Mechanism: PureCLIP HMM emits the CL state at very high stringency. Skipper 2023 benchmark reports F1 ~0.2 on bulk RBPs - only ~4 sites per CLIP on test data.
Symptom: PureCLIP CL site count is 100x lower than CLIPper / Skipper peak count.
Fix: Use PureCLIP for single-nt CL maps AND CLIPper/Skipper for broad-zone peak calls. They are complementary, not substitutes. PureCLIP -or regions output is broader than -o sites but still focal.
Trigger: CTK CITS run on PAR-CLIP data.
Mechanism: PAR-CLIP RT reads through the adduct (T->C mutation), not truncates. The 5' end of PAR-CLIP reads is at the fragment 5' end, not the CL site.
Symptom: CITS returns few sites; the sites that ARE returned do not align with the T->C mutation positions.
Fix: For PAR-CLIP use PARalyzer or CTK CIMS substitution mode (-type sub -nuc t -mut c). CITS is iCLIP/eCLIP only.
Trigger: CTK CIMS run with -type del on iCLIP data.
Mechanism: iCLIP RT truncates, doesn't delete. The deletion rate is < 1% in iCLIP (background level), so CIMS finds nothing significant.
Symptom: CIMS output BED has < 100 sites genome-wide.
Fix: Use CTK CITS for iCLIP (truncation-based) or PureCLIP. CIMS deletion is HITS-CLIP / older PAR-CLIP only.
Trigger: PARalyzer default parameters on a new RBP; cluster count unexpected.
Mechanism: PARalyzer's kernel-density approach has 5+ parameters: minimum read depth, minimum mutation read count, mutation rate threshold, bandwidth, kernel type. Different combinations produce 5-100x different cluster counts.
Symptom: PARalyzer cluster count 5-100x off from literature reports for the same RBP.
Fix: Use the published PARalyzer parameter file for that RBP class (Hafner 2010 supplementary). Validate on a known-binding region with a reference dataset before publication.
Trigger: HITS-CLIP aligned with STAR or bowtie2 in standard mode; CIMS finds few deletion sites.
Mechanism: Standard STAR/bowtie2 are aggressive about handling deletions; CIMS expects the deletions to be visible as CIGAR D operations in BAM. STAR/bowtie2 may soft-clip or fail to align deletion-bearing reads.
Symptom: Few CIGAR D operations in BAM (samtools view dedup.bam | awk '$6 ~ /D/' | wc -l); CIMS deletion site count low.
Fix: Re-align HITS-CLIP with BWA-aln (Yeo lab convention) or STAR --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 to permit deletions. Or use Novoalign which is more deletion-tolerant.
Trigger: PAR-CLIP aligned with STAR --outFilterMismatchNoverReadLmax 0.04; T->C reads discarded.
Mechanism: PAR-CLIP per-T conversion rate 20-50% means a 30 nt read with 8 Ts may have 4 T->C events = 13% mismatch rate, exceeding the 4% ceiling.
Symptom: 40-70% read loss at alignment for PAR-CLIP; downstream T->C site detection finds nothing.
Fix: Raise to 0.07 for PAR-CLIP only (see clip-seq/clip-alignment).
Trigger: Single-end CLIP; using read 5' end as CL site without strand consideration.
Mechanism: For plus-strand RNA, the cDNA 5' end is downstream of the CL position. For minus-strand RNA, it is upstream. Without strand-aware coordinate conversion, half of CL sites are mis-positioned by ~1 nt.
Symptom: mCross / motif analysis shows the motif positioned 1 nt off from expected.
Fix: Verify the BED output handles strand correctly. CTK CITS, PureCLIP, and most modern tools do this automatically; custom scripts must add if strand == '-': cl_position = read_end_5p + 1.
| Scenario | Tool | Parameters |
|----------|------|-----------|
| Single-nt CL map for mCross motif registration | PureCLIP | -i dedup.bam -ibam sminput.bam -dm 8 |
| iCLIP/eCLIP CITS truncation sites | CTK CITS | tag2cluster.pl -cs5 5 -m 1 |
| HITS-CLIP deletion CIMS | CTK CIMS | CIMS.pl -big -c -p 0.01 |
| PAR-CLIP T->C clusters | PARalyzer | Per-RBP params from Hafner 2010 |
| PAR-CLIP T->C single-nt | CTK CIMS substitution | CIMS.pl -type sub -nuc t -mut c -p 0.001 |
| Yeast CRAC | pyCRAC | pyCRAC.py with HTP-tagged protein |
| Variant effect at CL sites | PureCLIP + DeepRiPe | See clip-seq/clip-deep-learning |
| Allele-specific binding | WASP-filtered BAM + PureCLIP | See clip-seq/clip-alignment for WASP |
| Cross-CLIP comparison | PureCLIP (consistent across variants) | Same parameters |
| m6A miCLIP2 sites | miCLIP2-specific (m6Aboost) | See clip-seq/m6a-clip |
| STAMP edit sites | Not crosslink-based | See clip-seq/stamp-antibody-free |
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| PureCLIP << CTK CITS site count | PureCLIP HMM very focal; CITS empirical broader | Both correct; report which threshold |
| PARalyzer clusters << CIMS T->C sites for PAR-CLIP | PARalyzer is cluster-level; CIMS is single-nt | Aggregate CIMS to clusters for comparison |
| PureCLIP CL sites at unexpected positions | Strand mis-assignment OR aligner soft-clip | Check BED strand column; verify --alignEndsType EndToEnd |
| HITS-CLIP CIMS empty | Aligner not deletion-tolerant | Re-align with BWA-aln or STAR with adjusted scoring |
| PAR-CLIP CIMS << expected | Reads lost at mismatch ceiling | Raise STAR --outFilterMismatchNoverReadLmax to 0.07 |
| eCLIP CITS positions shifted by 1 nt | Strand handling off-by-one | Verify R2 5' end position handling |
| Multiple CITS sites within 5 nt | RT stops near each other; same CL event | Cluster within 10 nt window post-detection |
| Motif registers 5 nt off in mCross | CL positions wrong by RT-stop offset | Confirm tool reports "CL - 1" position consistently |
Operational rule: For motif registration and ASB, run PureCLIP with SMInput. For ENCODE-comparable truncation-based output, also run CTK CITS. Cross-validate by checking that mCross motif PWM is the same when fed PureCLIP sites vs CTK CITS sites. If they diverge by > 2 nt in motif position, suspect strand or off-by-one issue.
# Step 1: PureCLIP single-nt sites
pureclip \
-i sample.dedup.bam -bai sample.dedup.bam.bai \
-g genome.fa \
-ibam sminput.dedup.bam -ibai sminput.dedup.bam.bai \
-o sample.crosslinks.bed \
-or sample.regions.bed \
-nt 8 -dm 8 -iv expressed_tx.bed
# Step 2: mCross motif registration
mCross -i sample.crosslinks.bed -g genome.fa -k 7 -n 5 -o mcross_out
# Step 3: Intersect with heterozygous SNPs for allele-specific binding
bedtools intersect -wa -wb -s -a sample.crosslinks.bed -b het_snps.vcf > cl_at_hets.bed
# Step 4: Test allele bias with BEAPR or ASPRIN
# (see allele-specific CLIP literature; not in this skill)
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| PureCLIP "Convergence not reached" | HMM didn't converge on sparse coverage | Restrict to expressed transcripts with -iv expressed.bed |
| PureCLIP > 24h runtime | Full genome on large library | Test on chr22 first; add -iv filter |
| CTK CIMS empty for HITS-CLIP | Aligner not deletion-tolerant | Re-align with BWA-aln |
| CITS truncation sites not single-nt | CIGAR S operations from soft-clip | Verify --alignEndsType EndToEnd upstream |
| PAR-CLIP CIMS empty | Reads lost at mismatch ceiling | Raise STAR mismatch tolerance |
| Off-by-one motif position | Strand handling differs | Verify tool documents whether output is CL or CL-1 |
| Single-nt motif lost in HOMER on PureCLIP sites | Site BED too narrow; need flanking | Extend bedtools slop -b 15 before motif analysis |
| mCross "no crosslinks found" | Passed peak BED instead of CL BED | Use single-nt site output |
| PureCLIP output looks like uniform distribution | R2 5' end trimmed in preprocessing | Re-preprocess; -g and --trim_front2 banned for CLIP |
| Strand-specific motif inverted | BED column 6 wrong | Verify strand encoding; PureCLIP preserves correctly |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.