epitranscriptomics/merip-preprocessing/SKILL.md
Aligns and QCs methylated-RNA-immunoprecipitation (MeRIP / m6A-seq) IP and input libraries using STAR or HISAT2 splice-aware mapping, samtools sort/index, IP/input matched-pair tracking, antibody-lot metadata recording, replicate concordance via deepTools multiBamSummary + plotCorrelation, IP enrichment QC via plotFingerprint and per-transcript IP/input ratio distributions, library-complexity saturation curves via PreSeq c_curve / lc_extrap, and the explicit do-NOT-deduplicate convention for standard non-UMI MeRIP. Use when preparing paired IP and input BAM files for exomePeak2 / MeTPeak / MACS3 peak calling, evaluating MeRIP replicate concordance and IP enrichment, deciding whether to deduplicate (standard MeRIP typically NOT — see Tips), choosing genome-vs-transcriptome alignment for downstream peak vs m6Anet workflows, recording antibody clone and lot metadata for downstream cross-batch reconciliation, detecting failed IPs via saturation curves and IP/input distribution shape, or generating IP-over-Input bigWig tracks for downstream visualisation.
npx skillsauth add GPTomics/bioSkills bio-epitranscriptomics-merip-preprocessingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: STAR 2.7.11+, HISAT2 2.2.1+, samtools 1.19+, deepTools 3.5+, PreSeq 3.2+, fastp 0.23+, Trim Galore 0.6.10+, Picard 3.1+, MultiQC 1.25+, bowtie2 2.5+, BWA-MEM2 2.2+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function)If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
STAR --outSAMtype accepts BAM SortedByCoordinate since 2.5.x; check the Log.final.out file for input/output statistics. deepTools bamCompare --operation log2 is the modern syntax (older --ratio log2 still works but is being phased out). PreSeq c_curve and lc_extrap have stable interfaces; HISAT2 reports unique vs multi-mapped in the summary log.
"Get my MeRIP IP and input libraries ready for peak calling" -> Trim adapters with MeRIP-appropriate defaults (do NOT trim UMIs unless the library is UMI-MeRIP — most are not), splice-aware-align IP and input to the GENOME (not transcriptome) with STAR / HISAT2, sort and index, evaluate replicate concordance and IP enrichment with deepTools, build a saturation curve per library so peak counts can be honestly compared across libraries, record antibody clone and lot metadata so cross-batch comparison is later auditable, and produce IP-over-Input log2 bigWig tracks for downstream visualisation. Crucially, do NOT deduplicate non-UMI MeRIP — see the failure-modes section.
STAR --runMode alignReads -- splice-aware genome alignment, the field defaulthisat2 -x index -1 R1.fq.gz -2 R2.fq.gz -- graph-based alternative; lighter memory footprintsamtools sort -@ 8 -o sorted.bam in.bam && samtools index sorted.bam -- post-alignment mechanicsdeeptools multiBamSummary bins -b *.bam -o cov.npz + plotCorrelation -- replicate Spearmandeeptools plotFingerprint -b IP.bam Input.bam -o fp.pdf -- IP enrichment QC (ChIP-seq term; transfers cleanly)preseq lc_extrap -B -o curve.txt sorted.bam -- library complexity / saturationdeeptools bamCompare -b1 IP.bam -b2 Input.bam --operation log2 -o IP_over_Input.bw -- downstream-ready coverage trackA MeRIP library sequenced to 20 million unique reads finds substantially fewer peaks than the same biology at 60 million reads. Per-sample peak counts reported without saturation curves (PreSeq c_curve / lc_extrap; Daley & Smith 2013 Nat Methods 10:325) are uninterpretable across studies and often across replicates within a study. Subsample BAMs to a common unique-read depth before peak calling for any cross-condition peak-count comparison, OR report peaks alongside the saturation curve. A corollary: do NOT deduplicate standard MeRIP — the typical MeRIP protocol (Synaptic Systems 202-003 / Abcam ab151230 / NEB EpiMark E1610 antibody pull-down on fragmented poly(A)-selected RNA) has NO unique molecular identifiers, and picard MarkDuplicates on such libraries collapses real biological replicates of high-coverage transcripts (the opposite of what dedup achieves in DNA ChIP-seq). UMI-MeRIP is the only exception — and most MeRIP libraries in print are NOT UMI. McIntyre et al. 2020 Sci Rep 10:6590 demonstrated that replicate-to-replicate peak overlap is ~80% within a single lab but drops to a median 45% between labs using nominally identical conditions; this irreducible technical noise constrains how strongly any single MeRIP study can support biological claims, and the preprocessing pipeline is where the variance is set.
| Tool / step | Mechanism | Output | Strength | Fails when |
|-------------|-----------|--------|----------|------------|
| STAR 2.7+ (Dobin 2013 Bioinformatics 29:15) | Two-pass splice-aware alignment with on-the-fly splice junction database | Sorted BAM + splice-junction TSV | Field default; multi-mapper retention configurable; STAR splice-junction DB | Memory-heavy (~30 GB human); slower than HISAT2 |
| HISAT2 2.2.1+ (Kim 2019 Nat Biotechnol 37:907) | Hierarchical graph FM-index; splice-aware | Sorted BAM | ~5x lighter memory than STAR; comparable accuracy | Less mature splice-junction handling for novel introns |
| BWA-MEM2 (Vasimuddin 2019 IPDPS 314) | DNA-style local alignment; NO splice awareness | Sorted BAM | Use ONLY for transcriptome-aligned MeRIP (rare) | Splits reads across exon junctions if used on genome |
| fastp 0.23+ (Chen 2018 Bioinformatics 34:i884) | Streaming adapter detection + quality trim | Trimmed FASTQ + JSON QC | Fast; JSON-readable QC output | UMI handling disabled by default; do NOT pass --umi for standard non-UMI MeRIP (the opposite of the failure direction in some other library types) |
| Trim Galore | Wrapper over cutadapt with paired-end auto-detect | Trimmed FASTQ | Conservative defaults; widely cited | Slower than fastp on large datasets |
| samtools sort / index | BAM coordinate sort + .bai index | Sorted BAM + index | Standard | None at default |
| Picard MarkDuplicates | Identifies PCR duplicates by 5' alignment start | Marked / removed BAM | Standard in DNA / ChIP | The dominant MeRIP convention is to SKIP dedup for non-UMI libraries (collapses real biology at high-coverage transcripts); a minority of pipelines dedup MeRIP — record the choice in metadata |
| deepTools multiBamSummary + plotCorrelation | Per-bin read counts; Spearman / Pearson matrix | Heatmap + clustering | Standard replicate-concordance plot | Bin size sensitive (use 10 kb for transcriptome-genome) |
| deepTools plotFingerprint (Diaz 2012 Stat Appl Genet Mol Biol 11:9) | Cumulative read-fraction vs cumulative-bin-fraction Lorenz curve | PDF + raw counts | Direct IP-vs-input enrichment QC; "good" IP has steep tail | A flat fingerprint = failed IP (or mock IgG) |
| deepTools bamCompare --operation log2 | Per-bin log2 (IP/input) bigWig | bigWig | Ready for downstream visualisation | Pseudocount choice matters at low-coverage bins |
| PreSeq c_curve / lc_extrap (Daley & Smith 2013 Nat Methods 10:325) | Capture-recapture; rational-function extrapolation | Curve TSV | The only honest library-complexity estimate | Requires uniquely-mapped reads to be reliable |
| MultiQC (Ewels 2016 Bioinformatics 32:3047) | Aggregator across tools | HTML report | Consolidates STAR + HISAT2 + samtools + deepTools + PreSeq into one report | None |
| Scenario | Recommended | Why wrong choices fail |
|----------|-------------|------------------------|
| Standard mammalian MeRIP, downstream exomePeak2 | STAR splice-aware -> genome BAM; do NOT deduplicate; build saturation curve | Transcriptome alignment breaks exomePeak2 (expects genome BAM + GTF); dedup collapses biology |
| Downstream m6Anet (ONT direct RNA) | Defer to m6anet-analysis -- alignment is to TRANSCRIPTOME with minimap2 -ax map-ont -uf -k14 | Genome-aligned ONT input breaks m6Anet entirely (signal-level dataprep requires per-transcript coordinates) |
| Limited memory (<16 GB) | HISAT2 instead of STAR | STAR human genome index requires ~30 GB |
| UMI-MeRIP (rare) | Trim UMI to read header (umi_tools / fastp --umi_loc), align, then dedup with umi_tools dedup | Standard picard MarkDuplicates ignores UMI; effective dedup rate wrong |
| Cross-batch comparison (different antibody lots) | Record antibody clone + lot in sample-sheet metadata; include batch factor in downstream design | Pooling cross-batch counts without batch term inflates false-positive differential peaks |
| Spike-in normalisation (NEB EpiMark control oligos) | Align separately to spike-in reference; report IP-spike-in / Input-spike-in ratio per sample; use for absolute normalisation | Most users discard the NEB EpiMark Gluc / Cluc controls; they are the per-sample IP-efficiency QC anchor |
| Cross-library peak-count comparison | Subsample BAMs to common unique-read depth with samtools view -s BEFORE downstream peak calling, OR fit saturation curves and compare at common depth | Raw peak counts are sequencing-depth-dependent and not biologically interpretable |
| Suspect failed IP | Inspect deepTools plotFingerprint AND per-transcript IP/input ratio distribution; failed IP shows shallow Lorenz tail AND median IP/input ~1.0 | Trusting raw peak count alone — failed IPs still produce peaks |
| Viral / contamination-suspect samples | Build a combined host + viral index (and rRNA index) and check unmapped reads | Single-organism indexes hide systematic contamination |
| Aligning to transcriptome (rare; specific downstream tools) | BWA-MEM2 or bowtie2; defer to read-alignment/ | STAR splice-aware on transcriptome causes spurious splice calls inside transcripts |
Methodology evolves; before any high-stakes preprocessing pipeline, web-search "STAR vs HISAT2 MeRIP 2024" and "MeRIP saturation curve preseq" for current consensus parameters.
Goal: Remove sequencing adapters and low-quality 3' ends WITHOUT removing biological signal (UMI-MeRIP must keep UMI sequences; standard MeRIP does not have UMIs and trimming should be minimal).
Approach: Use fastp or Trim Galore with adapter auto-detection; require minimum read length 25-30 nt (shorter reads multi-map and confound exomePeak2); for standard non-UMI MeRIP, do NOT pass --umi flags; preserve random-hexamer-priming artifacts ONLY if downstream pipeline expects them (most do not).
mkdir -p trimmed
for sample in IP_rep1 IP_rep2 IP_rep3 Input_rep1 Input_rep2 Input_rep3; do
fastp \
--in1 raw/${sample}_R1.fastq.gz \
--in2 raw/${sample}_R2.fastq.gz \
--out1 trimmed/${sample}_R1.fq.gz \
--out2 trimmed/${sample}_R2.fq.gz \
--html qc/${sample}_fastp.html \
--json qc/${sample}_fastp.json \
--length_required 25 \
--detect_adapter_for_pe \
--thread 8
done
For UMI-MeRIP (rare), insert --umi --umi_loc read1 --umi_len 8 BEFORE the alignment step. Default fastp output preserves base quality information needed by downstream variant-aware tools; do NOT pass --disable_quality_filtering for MeRIP libraries.
Goal: Produce coordinate-sorted, indexed GENOME BAM files for each IP and input library with splice-junction-aware mapping, retaining a moderate number of multi-mappers for accurate per-window read counts at multi-isoform loci.
Approach: Build STAR genome index once with the matched GENCODE / Ensembl GTF used downstream; loop IP and input samples with identical parameters; retain up to 20 multi-mappers per read (MeRIP read counts at multi-isoform genes need this); request explicit BAM SortedByCoordinate; emit splice-junction tables for QC.
STAR \
--runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles genome.fa \
--sjdbGTFfile annotation.gtf \
--sjdbOverhang 100 \
--runThreadN 12
mkdir -p aligned
for sample in IP_rep1 IP_rep2 IP_rep3 Input_rep1 Input_rep2 Input_rep3; do
STAR \
--runMode alignReads \
--genomeDir star_index \
--readFilesIn trimmed/${sample}_R1.fq.gz trimmed/${sample}_R2.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 20 \
--outSAMattributes NH HI AS nM NM MD \
--outFileNamePrefix aligned/${sample}_ \
--runThreadN 12
samtools index -@ 4 aligned/${sample}_Aligned.sortedByCoord.out.bam
done
--outFilterMultimapNmax 20 is intentional: MeRIP at rRNA / snoRNA / pseudogene-rich loci needs multi-mapper retention. Reduce to 1 only if downstream analysis explicitly cannot tolerate multi-mappers. --sjdbOverhang should equal (read length - 1) but 100 is the common-enough default for 100-150 bp reads.
Goal: Achieve splice-aware alignment in ~5x less memory than STAR (12-16 GB suffices for human), with comparable accuracy for MeRIP applications.
Approach: Build HISAT2 graph index; align with --dta for downstream-transcript-assembly compatibility; pipe directly to samtools sort.
hisat2-build genome.fa hisat2_index/genome
for sample in IP_rep1 IP_rep2 IP_rep3 Input_rep1 Input_rep2 Input_rep3; do
hisat2 \
-x hisat2_index/genome \
-1 trimmed/${sample}_R1.fq.gz \
-2 trimmed/${sample}_R2.fq.gz \
--dta \
--summary-file qc/${sample}_hisat2.log \
-p 12 | \
samtools sort -@ 8 -o aligned/${sample}.sorted.bam -
samtools index -@ 4 aligned/${sample}.sorted.bam
done
HISAT2 multi-mapper handling is governed by -k; the default reports the primary alignment only. For MeRIP, pass -k 5 if multi-mapper-aware downstream counting is required.
mkdir -p qc
for bam in aligned/*sortedByCoord.out.bam aligned/*sorted.bam; do
name=$(basename ${bam} .bam)
samtools flagstat ${bam} > qc/${name}.flagstat
samtools idxstats ${bam} > qc/${name}.idxstats
done
Inspect flagstat for properly-paired rate (>=85% indicates good pairing); inspect idxstats for unexpected chromosome-level read piles (rRNA bleed-through, mitochondrial domination — both are MeRIP red flags).
Goal: Quantify how similar replicate IP libraries are to each other (and likewise input libraries) using a Spearman correlation matrix; flag a divergent replicate before it propagates into peak calling.
Approach: Compute genome-wide per-bin read counts at 10 kb resolution across all IP and input BAMs; convert to a clustered Spearman heatmap with deepTools plotCorrelation.
multiBamSummary bins \
--bamfiles aligned/IP_rep1*.bam aligned/IP_rep2*.bam aligned/IP_rep3*.bam \
aligned/Input_rep1*.bam aligned/Input_rep2*.bam aligned/Input_rep3*.bam \
--binSize 10000 \
--numberOfProcessors 8 \
--outRawCounts qc/raw_bin_counts.tab \
-o qc/cov.npz
plotCorrelation \
--corData qc/cov.npz \
--corMethod spearman \
--skipZeros \
--whatToPlot heatmap \
--colorMap RdYlBu_r \
--plotNumbers \
-o qc/replicate_correlation.pdf
IP replicates within a condition should cluster (Spearman >= 0.85 typical); input replicates should cluster with each other; IP and input should NOT cluster together. A failed IP looks like input.
Goal: Confirm IP libraries are enriched (a few transcripts have many reads) and input libraries are uniform (reads spread across transcripts); fail-fast on poor IP before peak calling.
Approach: deepTools plotFingerprint builds a cumulative Lorenz-style curve; a steep tail = signal concentrated in few regions (good IP); a diagonal = uniform coverage (input or failed IP). The framework is from ChIP-seq (Diaz 2012 Stat Appl Genet Mol Biol 11:9) and transfers cleanly to MeRIP.
plotFingerprint \
--bamfiles aligned/IP_rep1*.bam aligned/IP_rep2*.bam aligned/IP_rep3*.bam \
aligned/Input_rep1*.bam aligned/Input_rep2*.bam aligned/Input_rep3*.bam \
--labels IP1 IP2 IP3 In1 In2 In3 \
--numberOfProcessors 8 \
--skipZeros \
--outQualityMetrics qc/fingerprint_metrics.tab \
-o qc/fingerprint.pdf
Good MeRIP IP: cumulative-fraction-of-reads vs cumulative-fraction-of-bins curve sits well below the diagonal in the right half (top-X% of bins capture >50% of reads). Input: near-diagonal. The --outQualityMetrics file reports JS distance and synthetic JS distance; the IP-vs-Input JS distance is a single-number IP-quality summary (higher = more concentrated signal).
Goal: Compute per-library complexity so peak counts can be honestly compared across libraries and conditions of different sequencing depth.
Approach: PreSeq c_curve (interpolation up to observed depth) and lc_extrap (extrapolation beyond observed) on the sorted BAM. Daley & Smith 2013 Nat Methods 10:325 capture-recapture model.
mkdir -p complexity
for bam in aligned/*.bam; do
name=$(basename ${bam} .bam)
preseq c_curve -B -o complexity/${name}_c_curve.txt ${bam}
preseq lc_extrap -B -o complexity/${name}_lc_extrap.txt ${bam}
done
Inspect: the lc_extrap curve plots distinct molecules vs total reads; a plateau indicates saturation. For cross-condition peak-count comparison: pick a common depth (often 30M unique reads), subsample with samtools view -s 0.<frac> to that depth, THEN call peaks.
Goal: Produce a per-bin log2 (IP / Input) coverage track per replicate, ready for downstream metagene / browser plots.
Approach: deepTools bamCompare with --operation log2; choose a sensible pseudocount to avoid divide-by-zero at low-coverage bins.
mkdir -p tracks
paste -d ' ' \
<(printf '%s\n' IP_rep1 IP_rep2 IP_rep3) \
<(printf '%s\n' Input_rep1 Input_rep2 Input_rep3) | \
while read ip input; do
bamCompare \
-b1 aligned/${ip}_Aligned.sortedByCoord.out.bam \
-b2 aligned/${input}_Aligned.sortedByCoord.out.bam \
--operation log2 \
--pseudocount 1 \
--binSize 25 \
--normalizeUsing CPM \
--numberOfProcessors 8 \
-o tracks/${ip}_over_${input}.bw
done
--pseudocount 1 prevents division-by-zero at zero-coverage bins; --binSize 25 is fine-grained enough to preserve peak topology while keeping bigWig files reasonably sized.
Trigger: picard MarkDuplicates REMOVE_DUPLICATES=true invoked on a standard MeRIP BAM that has no UMI.
Mechanism: Standard MeRIP libraries have no unique molecular identifiers. PCR duplicates and biological re-sampling at high-coverage transcripts look identical at the alignment level. Dedup removes both, collapsing real coverage at the most-abundant transcripts to an artificially flat profile. This is the opposite of dedup's intent in DNA ChIP-seq.
Symptom: Coverage at housekeeping mRNAs (e.g., GAPDH, ACTB) drops 5-20x after dedup; downstream peak counts at highly-expressed transcripts collapse; volcano plot of differential peaks shows expression-driven false positives.
Fix: Skip dedup for standard non-UMI MeRIP. If the library is UMI-MeRIP, use umi_tools dedup (Smith 2017 Genome Res 27:491) which respects UMI rather than alignment position alone. Record dedup status in sample-sheet metadata.
Trigger: STAR or bowtie2 aligned to transcriptome FASTA, then BAM passed to exomePeak2 / MeTPeak / MACS3.
Mechanism: exomePeak2 and MeTPeak expect a GENOME BAM plus GTF; they project peaks back to transcript features internally. A transcriptome BAM has reads in per-transcript coordinates which the GTF cannot resolve back to genome coordinates without re-alignment.
Symptom: exomePeak2 throws errors on TxDb-genome consistency; MeTPeak returns zero peaks; MACS3 calls peaks on transcript IDs as if they were chromosomes.
Fix: Align to GENOME with STAR / HISAT2 for downstream MeRIP peak calling. Transcriptome alignment is correct only for m6anet-analysis (ONT DRS) and rare quantification-only downstream tools.
Trigger: A single replicate IP library has IP/input ratio distribution centred at 1.0 across all transcripts (no enrichment); fingerprint Lorenz curve sits at the diagonal.
Mechanism: Failed IP — antibody-RNA binding did not enrich m6A-containing fragments. Causes include antibody-batch defect, insufficient pulldown wash, RNA degradation during IP, or accidental mock IgG IP.
Symptom: plotFingerprint shows IP overlaying input on the Lorenz plot; per-transcript IP/input ratio histogram is centred at 1.0; downstream peak callers find few or no peaks AT THE FAILED REPLICATE while other replicates produce normal counts.
Fix: Identify the failed replicate via plotFingerprint AND IP/input ratio distribution BEFORE peak calling; exclude or re-do. Single failed IP in a 3-replicate design routinely produces "differential" peaks driven entirely by the failure.
Trigger: A multi-condition MeRIP study uses Synaptic Systems 202-003 antibody lot A for the control IPs and lot B for the treatment IPs (because lot A ran out mid-study).
Mechanism: Anti-m6A polyclonals (Synaptic Systems 202-003, Abcam ab151230, NEB EpiMark E1610, Cell Signaling 56593, Active Motif 61755) have batch-to-batch variability in pulldown efficiency and m6A-vs-m6Am cross-reactivity. Pooling lot-A and lot-B counts in a downstream differential model attributes lot-effect to condition.
Symptom: "Differential" peaks at high abundance transcripts; effect sizes track antibody lot rather than condition; reanalysis with lot in the design matrix removes most differential peaks.
Fix: Record antibody_clone and antibody_lot per sample in metadata; include lot as a fixed effect in downstream differential analysis. Within a single study, ideally use a single lot for ALL replicates and ALL conditions.
Trigger: "Condition A has 14,000 peaks; condition B has 22,000 peaks; condition B has more m6A."
Mechanism: Peak count is library-size-dependent. A library at 60M unique reads finds more peaks than 30M. Without rarefaction or saturation correction, peak-count comparisons across libraries are dominated by sequencing depth.
Symptom: Peak counts track total mapped reads more closely than they track biological condition; downstream "biological m6A change" claims do not survive rarefaction-to-common-depth.
Fix: Either rarefy all BAMs to common unique-read depth before peak calling, OR fit saturation curves with PreSeq lc_extrap and compare at matched depth, OR report peak count alongside the saturation curve.
Trigger: Aggressive 5' trimming of the first 6-12 nt to remove "random hexamer priming bias" applied to MeRIP libraries.
Mechanism: Random hexamer priming bias affects the 5' nucleotide composition of reads but does NOT degrade downstream peak-calling accuracy. Over-trimming removes biological signal and shortens reads enough to inflate multi-mapper fraction.
Fix: Standard adapter trimming with --length_required 25 is sufficient; do not 5'-trim for hexamer bias unless downstream tooling explicitly requires unbiased 5' ends (most do not). The bias is a known artifact in the RNA-seq community and is robust to standard analytical pipelines.
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| plotFingerprint diagonal but IP/input ratio shows enrichment | Mismatched chromosome naming (chr1 vs 1) between samples | Verify samtools view -H bam | grep '@SQ' matches across samples |
| Replicate Spearman 0.95 but plotFingerprint diverges | Replicates correlate in bulk but differ in IP enrichment depth | Check per-sample sequencing depth; reduce to common depth |
| Saturation curve plateaus early but peak count low | Library complexity exhausted (e.g., over-amplified PCR) | Inspect duplicate rate; re-prep library if possible |
| MultiQC reports input has higher mapping rate than IP | IP enriches non-canonical sequences (m6A on intronic RNA, mt-RNA) that map differently | Acceptable if STAR multi-mapper retention is on; verify with idxstats |
| Properly-paired rate < 60% | Insert size distribution off (RNA degradation; library prep failure) | Inspect samtools view -f 0x2 count; re-prep if severe |
| HISAT2 reports many discordant pairs | Splice-junction not captured in index | Re-build with --dta and confirm GTF matches genome |
| plotFingerprint synthetic JS distance < 0.5 | Marginal IP enrichment; borderline failed | Inspect per-transcript IP/input ratio distribution; consider exclusion |
| Quantity | Threshold | Source / rationale |
|----------|-----------|--------------------|
| Minimum read length after trimming | 25 nt | Below this, multi-mapping fraction inflates; downstream peak callers lose specificity |
| STAR --outFilterMultimapNmax for MeRIP | 20 | Retains multi-isoform mapping; tighten to 1 only when downstream cannot tolerate |
| --sjdbOverhang | read length - 1 | STAR convention; 100 is common for 100-150 bp reads |
| Properly-paired rate (samtools flagstat) | >=85% | Below indicates degraded RNA or library-prep failure |
| Replicate Spearman correlation (multiBamSummary 10 kb bins) | >=0.85 (IP-vs-IP within condition) | Below suggests one replicate is anomalous |
| plotFingerprint IP-vs-input JS distance | >=0.5 | Higher indicates better IP enrichment; <0.3 suggests failed IP |
| Saturation curve plateau depth | ~30-60M unique reads typical | Below this, peak calling under-samples; depth depends on cell type / antibody |
| Per-transcript IP/input ratio median | >1.5 (genome-wide median) | Lower suggests failed IP; conditions / cell lines vary |
| Minimum biological replicates | 3 (4-5 preferred) per condition | McIntyre 2020 Sci Rep 10:6590 — N=2 routinely under-powered |
| Dedup status for non-UMI MeRIP | OFF | Standard convention; UMI-MeRIP is the only exception |
| BAM sort order for downstream tools | Coordinate (SortedByCoordinate) | exomePeak2, MeTPeak, MACS3, deepTools all expect coordinate sort |
| bamCompare --binSize for downstream metagene | 25 | Fine enough to preserve peak topology; coarser only for whole-chromosome browser views |
| bamCompare --pseudocount | 1 | Prevents divide-by-zero at zero-coverage bins; larger values flatten signal |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| STAR runs out of memory on human genome | --genomeDir build needs ~30 GB RAM | Use HISAT2 (~12 GB) or run STAR on a high-memory node |
| samtools index fails with "is not coordinate sorted" | BAM is name-sorted or unsorted | Re-run samtools sort (not sort -n) |
| exomePeak2 errors on TxDb chromosome mismatch | BAM uses chr1, GTF uses 1 (or reverse) | Verify with samtools view -H bam | head and head genes.gtf; reconcile with seqlevelsStyle() in R or rename in shell |
| deepTools bamCompare --ratio log2 deprecation warning | Newer deepTools uses --operation log2 | Switch to --operation log2 |
| PreSeq lc_extrap rejects with "low complexity" | Library too shallow OR genome too small (BAM under 1M unique reads) | Use c_curve only; or sequence deeper |
| MultiQC misses STAR Log.final.out | STAR output naming non-standard | Re-run with --outFileNamePrefix and rerun MultiQC; check multiqc_config.yaml search patterns |
| picard MarkDuplicates collapses all reads to 1 per position | Tiny BAM or single read pair per fragment | Verify BAM has many properly-paired reads; do NOT dedup non-UMI MeRIP regardless |
| Empty fingerprint output | All BAMs have identical bin coverage | Verify BAMs are different files; check multiBamSummary --outRawCounts |
| bigWig file size too large | Bin size too small at deep coverage | Increase --binSize from 25 to 50; bigWig is lossy at large bin sizes |
| Saturation curve never plateaus | Library deeply under-sampled | Sequence deeper OR accept curve does not plateau and report accordingly |
| fastp --umi errors on non-UMI library | UMI flag passed but library has no UMI | Drop --umi flag for standard non-UMI MeRIP |
| Pushback | Response | |----------|----------| | "Was deduplication applied?" | No — standard non-UMI MeRIP protocol; PCR duplicate vs biological resampling indistinguishable without UMI; dedup collapses real coverage at high-expression transcripts | | "What is the IP enrichment QC?" | deepTools plotFingerprint reported per replicate; JS distance >=0.5 vs input | | "Are the replicates concordant?" | Spearman correlation matrix reported via deepTools plotCorrelation on 10 kb bins; IP-IP within condition >=0.85 | | "Saturation curve?" | PreSeq lc_extrap per library; libraries rarefied to common depth before downstream peak calling | | "What antibody clone and lot?" | Recorded per sample in metadata; same lot for all replicates within study | | "Why STAR instead of HISAT2?" | STAR splice-junction-DB-based vs HISAT2 graph-based; both valid for MeRIP; choice driven by memory budget | | "How many biological replicates?" | N >=3 per condition (per McIntyre 2020); N=2 is under-powered for differential downstream | | "Was alignment to genome or transcriptome?" | Genome (required for exomePeak2 / MeTPeak / MACS3 downstream); transcriptome alignment is for m6anet-analysis only |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.