crispr-screens/prime-editing-screens/SKILL.md
Designs and analyzes pooled prime-editor (PE) screens for installing precise genetic variants without bystander confounding. Covers pegRNA design with PRIDICT and PRIDICT2 (Mathis 2023/2024) for predicting per-pegRNA editing efficiency, pegRNA architecture (spacer + scaffold + PBS + RTT), PE2 / PE3 / PE3b / PEmax / PEAR variants, MOSAIC in situ saturation mutagenesis (Hsu JY et al 2024 bioRxiv), the PRIME pooled-screen methodology (Erwood/Doman 2023 Nat Biotechnol 41:885; ~3,699 ClinVar variant screens), chromatin context as a primary determinant of PE efficiency, scaffold-incorporation and indel byproduct quantification with CRISPResso2, and the cross-modal validation strategy of PE + base-editor screens for variant function. Use when designing a pegRNA library for variant installation, choosing between BE and PE for a specific edit, predicting pegRNA efficiency before library synthesis, analyzing PE screen output, distinguishing intended-edit from scaffold-incorporation, or scaling PE screens to thousands of variants.
Reference examples tested with: PRIDICT2 v1.0+ (https://github.com/uzh-dqbm-cmi/PRIDICT2), CRISPResso2 2.2.14+, pandas 2.2+, biopython 1.83+, numpy 1.26+.
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `python pridict2_pegRNA_design.py single --help`; `python pridict2_pegRNA_design.py batch --help`

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Design or analyze a pooled prime-editor screen" -> Design pegRNAs (spacer + scaffold + PBS + RTT) for intended edits, predict efficiency with PRIDICT2, filter pre-synthesis to efficient candidates, install variants in the screen, quantify intended-edit vs scaffold-incorporation vs indel via CRISPResso2, and aggregate to per-variant fitness scores.
- PRIDICT2 for pegRNA efficiency prediction
- `CRISPResso --prime_editing_pegRNA_*` for amplicon-level analysis

| Editor | Year | Mechanism | Indel rate | Use when |
|--------|------|-----------|------------|----------|
| PE2 (Anzalone 2019) | 2019 | nCas9-RT fusion + pegRNA | 1-3% | Standard PE; lowest indel rate |
| PE3 | 2019 | PE2 + nick of opposite strand by additional sgRNA | 2-5% | Higher editing efficiency, slightly more indels |
| PE3b | 2019 | PE3 whose nicking sgRNA matches only the edited sequence | 1-3% | When PE3's added nick risks unwanted indels |
| PEmax (Chen 2021) | 2021 | Engineered RT + nCas9 | 1-2% | Higher editing rate per pegRNA |
| PEAR (Erwood 2023) | 2023 | PE with optimal pegRNA scaffold | 1-2% | Improved PE scaffold |
| PE5max (Chen 2021) | 2021 | PEmax + nicking sgRNA + dominant-negative MLH1 (MLH1dn) to suppress mismatch repair | 1% | Highest efficiency at favorable sites |
| Dual-pegRNA / PE6 (2024) | 2024 | Twin pegRNA system | Variable | Specific applications |
Decision rule: For pooled screens at scale, PE2 or PEmax (one pegRNA per construct) is preferred over PE3, whose additional nicking sgRNA complicates library architecture. For specific high-efficiency edits, use PEmax with a PRIDICT2-optimized pegRNA.
A pegRNA contains four critical elements that determine efficiency:
5' SPACER (20 nt) -- standard sgRNA spacer; defines target locus via NGG PAM
+
SCAFFOLD (~80 nt) -- canonical or engineered scaffold (Chen 2021 has improved scaffold)
+
RTT (Reverse Transcription Template, 10-30 nt) -- encodes intended edit; copied by RT
+
PBS (Primer Binding Site, 8-15 nt) -- complementary to the nicked strand immediately 5' of the nick; sits at the 3' end of the pegRNA
3'
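The architecture above can be sketched in a few lines of Python. This is a minimal illustration, not PRIDICT2's design logic: it assumes the target is on the given (PAM-containing) strand, an SpCas9 nick 3 nt 5' of the NGG PAM, a single-base substitution, and a 3' extension written 5'->3' as RTT followed by PBS. The function names `reverse_complement` and `design_extension` and all inputs are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def design_extension(target: str, pam_start: int, edit_pos: int, alt: str,
                     pbs_len: int = 13, rtt_len: int = 15) -> dict:
    """Derive spacer, PBS and RTT for one substitution (illustrative only).

    target    : PAM-strand sequence, 5'->3'
    pam_start : 0-based index of the N in the NGG PAM
    edit_pos  : 0-based index of the base to substitute (3' of the nick)
    alt       : replacement base
    """
    target = target.upper()
    if pam_start < 20 or target[pam_start + 1:pam_start + 3] != "GG":
        raise ValueError("need a 20-nt protospacer followed by an NGG PAM")
    nick = pam_start - 3  # index of the first base 3' of the nick
    if pbs_len > nick or nick + rtt_len > len(target):
        raise ValueError("target too short for the requested PBS/RTT")
    if not nick <= edit_pos < nick + rtt_len:
        raise ValueError("edit lies outside the RTT window")

    spacer = target[pam_start - 20:pam_start]
    edited = target[:edit_pos] + alt.upper() + target[edit_pos + 1:]
    # PBS pairs with the nicked strand just 5' of the nick.
    pbs = reverse_complement(target[nick - pbs_len:nick])
    # RTT templates the edited sequence 3' of the nick.
    rtt = reverse_complement(edited[nick:nick + rtt_len])
    return {"spacer": spacer, "pbs": pbs, "rtt": rtt, "extension": rtt + pbs}
```

The full pegRNA would then be `spacer + scaffold + extension`; the scaffold sequence is left out here because it depends on the editor and scaffold variant chosen.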
Key design parameters:
Mathis N et al 2023 Nat Biotechnol 41:1151 (PRIDICT v1) / 2025 Nat Biotechnol 43(5):712 (PRIDICT2; published online June 2024) developed deep-learning predictors of per-pegRNA editing efficiency. PRIDICT2 is the current state of the art.
# PRIDICT2 is invoked via CLI: pridict2_pegRNA_design.py
# Single sequence input:
# Parentheses mark the intended edit as (REF/ALT); --use_5folds averages the 5-fold ensemble.
python pridict2_pegRNA_design.py single \
  --sequence-name BRCA1_c5135 \
  --sequence "AGCAGCCT(C/T)CTGAATGCCC...60nt_context" \
  --output-dir predictions/ \
  --use_5folds
# Batch input from CSV (columns: sequence_name, sequence); --summarize generates a summary table.
python pridict2_pegRNA_design.py batch \
  --input-fname variants_to_design.csv \
  --output-dir predictions/ \
  --cores 4 \
  --summarize
# Output: per-pegRNA predictions in predictions/<sequence_name>/
# Columns: PBS_sequence, PBS_length, RTT_sequence, RTT_length, predicted_editing_efficiency,
# predicted_indel_rate, deep_ensemble_score, etc.
Loading PRIDICT2 results in Python:
import pandas as pd
from pathlib import Path
def load_pridict2_predictions(prediction_dir):
    '''Load PRIDICT2 batch outputs from prediction_dir/'''
    summary = pd.read_csv(Path(prediction_dir) / 'pridict2_summary.csv')
    # summary has columns: sequence_name, PBS, RTT, predicted_efficiency, predicted_indel, etc.
    return summary
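Output column names can differ between PRIDICT2 releases, so it may help to locate the efficiency column by pattern rather than hard-coding it. The sketch below is an assumption-laden helper: `find_efficiency_column` is a hypothetical name, and the substrings it searches for are guesses that should be checked against the actual output header.

```python
import pandas as pd


def find_efficiency_column(df: pd.DataFrame) -> str:
    """Return the first column whose name looks like an editing-efficiency score.

    The substrings below are assumptions, not a documented PRIDICT2 schema.
    """
    patterns = ("effic", "editing_score")
    for col in df.columns:
        if any(p in col.lower() for p in patterns):
            return col
    raise KeyError(f"no efficiency-like column among {list(df.columns)}")
```

Raising a `KeyError` with the observed header makes a schema mismatch visible immediately instead of failing later in a filter step.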
Key determinants of PE efficiency (Mathis 2024 PRIDICT2):
| Feature | Effect on efficiency |
|---------|----------------------|
| PBS GC content | 40-55% optimal; high GC slows annealing |
| PBS length | 11-13 nt optimal; longer for high-GC PBS |
| RTT length | 10-20 nt typical; trade-off between coverage and processivity |
| Edit position in RTT | Closest to PBS = highest efficiency |
| Chromatin context | Open chromatin = 2-5x higher efficiency than closed |
| Cell line / Cas9 expression | Variable; piloting required |
| Cell cycle phase | S/G2 = higher efficiency |
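The sequence-level ranges in the table can be turned into a quick pre-filter. The following is a minimal sketch that only flags pegRNAs falling outside the stated PBS GC, PBS length, and RTT length ranges; `gc_fraction` and `check_pegrna` are hypothetical helper names, and the check does not replace a PRIDICT2 score.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def check_pegrna(pbs: str, rtt: str) -> list:
    """Return human-readable warnings for out-of-range design parameters."""
    warnings = []
    if not 11 <= len(pbs) <= 13:
        warnings.append(f"PBS length {len(pbs)} nt outside 11-13 nt")
    gc = gc_fraction(pbs)
    if not 0.40 <= gc <= 0.55:
        warnings.append(f"PBS GC {gc:.0%} outside 40-55%")
    if not 10 <= len(rtt) <= 20:
        warnings.append(f"RTT length {len(rtt)} nt outside 10-20 nt")
    return warnings
```

An empty list means the pegRNA sits inside all three ranges; warnings are returned rather than raised so a whole library can be annotated in one pass.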
Critical insight from Mathis 2024: Chromatin context is the dominant determinant. Sequence-based predictors like PRIDICT over-predict at silenced (closed-chromatin) loci and can under-predict at open chromatin, because they see only sequence. For genome-scale screens, validate predictions empirically at representative loci.
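One way to act on the chromatin caveat is to annotate each pegRNA target with whether it falls inside an accessibility peak from the screening cell line. The sketch below assumes a pegRNA table with `chrom` and `pos` columns and a BED-like peak table with `chrom`, `start`, `end` (0-based, half-open); these column names and the function `flag_open_chromatin` are hypothetical, and the row-wise scan is meant for small tables rather than genome-scale libraries.

```python
import pandas as pd


def flag_open_chromatin(pegs: pd.DataFrame, peaks: pd.DataFrame) -> pd.DataFrame:
    """Add a boolean 'open_chromatin' column marking targets inside a peak."""

    def in_peak(row) -> bool:
        hits = peaks[(peaks["chrom"] == row["chrom"])
                     & (peaks["start"] <= row["pos"])
                     & (row["pos"] < peaks["end"])]
        return not hits.empty

    out = pegs.copy()
    out["open_chromatin"] = [in_peak(row) for _, row in out.iterrows()]
    return out
```

pegRNAs with `open_chromatin == False` are candidates for piloting before synthesis, since their predicted efficiency is the least likely to hold.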
Erwood S, Doman JL et al 2023 Nat Biotechnol 41:885 established the PRIME pooled-screen methodology (earlier 2022 bioRxiv preprint):
Quantified scale: ~3,699 ClinVar variants installed in a single PRIME screen (Erwood/Doman 2023 Nat Biotechnol 41:885), with editing efficiency >5% at >50% of pegRNAs (validation cohort).
MOSAIC (Hsu JY, Lam KC, Shih J, Pinello L, Joung JK 2024 bioRxiv 10.1101/2024.04.25.591078) is a higher-throughput variant of PRIME with multiplexed read-out:
Use case: Cancer-drug-resistance variant scanning; protein-domain function mapping.
Goal: Predict editing efficiency for thousands of pegRNAs before library synthesis.
Approach: Build a CSV with one row per intended edit (sequence + edit notation), run PRIDICT2 in batch mode, parse the per-pegRNA efficiency summary, and filter to candidates above the chosen efficiency threshold.
# Step 1: prepare batch input CSV (sequence_name, sequence with (REF/ALT) edit notation)
cat > variants.csv <<EOF
sequence_name,sequence
BRCA1_R71X,AGCAGCCT(C/T)CTGAATGCCC...
MLH1_c677,GAGCTGAGC(A/G)GAGGCTCTTGAAGC...
EOF
# Step 2: run PRIDICT2 batch
python pridict2_pegRNA_design.py batch \
--input-fname variants.csv \
--output-dir predictions/ \
--cores 8 \
--summarize
# Step 3: parse and filter
import pandas as pd
predictions = pd.read_csv('predictions/pridict2_summary.csv')
# Filter to pegRNAs with predicted efficiency > 50% (Mathis 2024 threshold)
filtered = predictions[predictions['predicted_editing_efficiency'] > 50]
print(f'pegRNAs passing PRIDICT2 >50%: {len(filtered)} / {len(predictions)}')
# Pick top 3 per intended edit
top3 = (filtered.sort_values(['sequence_name', 'predicted_editing_efficiency'],
                             ascending=[True, False])
        .groupby('sequence_name').head(3))
top3.to_csv('peg_library_filtered.csv', index=False)
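A hard efficiency cutoff silently drops every variant whose pegRNAs all score below it. The sketch below lists those variants and retrieves their best available pegRNA so the decision to exclude or pilot them is explicit. It reuses the column names from the filtering step; the functions `uncovered_variants` and `best_available` are hypothetical helpers.

```python
import pandas as pd


def uncovered_variants(predictions: pd.DataFrame, filtered: pd.DataFrame,
                       key: str = "sequence_name") -> list:
    """Variants present in the design table but absent after filtering."""
    return sorted(set(predictions[key]) - set(filtered[key]))


def best_available(predictions: pd.DataFrame, variants: list,
                   key: str = "sequence_name",
                   score: str = "predicted_editing_efficiency") -> pd.DataFrame:
    """Highest-scoring pegRNA for each listed variant, regardless of cutoff."""
    subset = predictions[predictions[key].isin(variants)]
    return subset.sort_values(score, ascending=False).groupby(key).head(1)
```

Reporting the uncovered set alongside the library makes it clear which variants the screen cannot speak to.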
Goal: Confirm variant-function calls from PE with orthogonal BE screens.
Approach: Design parallel BE library for the same variants; run both screens; intersect hits.
import numpy as np
import pandas as pd

# BE screen output (target conversion + bystander)
be_hits = pd.read_csv('be_screen_hits.tsv', sep='\t')
# PE screen output (intended edit + scaffold-incorp + indel)
pe_hits = pd.read_csv('pe_screen_hits.tsv', sep='\t')
# Intersect on intended variant
concordant = be_hits.merge(pe_hits, on='variant_id', suffixes=('_be', '_pe'))
# Filter to high-confidence: both methods call variant + same direction
concordant['high_confidence'] = (
    (concordant['be_fdr'] < 0.05)
    & (concordant['pe_fdr'] < 0.05)
    & (np.sign(concordant['be_lfc']) == np.sign(concordant['pe_lfc']))
)
Critical: PE-only hits in BE-coverable variants are suspect (BE should detect them). PE-only hits in non-BE-coverable variants (e.g., transversions) are genuinely PE-unique.
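The triage rule above can be sketched by classifying each PE-only hit by substitution type. This assumes the PE hit table carries hypothetical `ref` and `alt` columns, and it treats every transition as BE-coverable (C>T or G>A via a cytosine base editor, A>G or T>C via an adenine base editor); it deliberately ignores whether a PAM and editing window actually exist at the site, which a real check would also need.

```python
import pandas as pd

# Substitutions reachable by CBE or ABE on either strand (all transitions).
TRANSITIONS = {("C", "T"), ("G", "A"), ("A", "G"), ("T", "C")}


def is_be_coverable(ref: str, alt: str) -> bool:
    """True if the substitution is a transition (sequence context ignored)."""
    return (ref.upper(), alt.upper()) in TRANSITIONS


def triage_pe_only(pe_hits: pd.DataFrame, be_hits: pd.DataFrame) -> pd.DataFrame:
    """Label PE hits that have no matching BE hit as 'suspect' or 'PE-unique'."""
    pe_only = pe_hits[~pe_hits["variant_id"].isin(be_hits["variant_id"])].copy()
    pe_only["be_coverable"] = [
        is_be_coverable(r, a) for r, a in zip(pe_only["ref"], pe_only["alt"])
    ]
    pe_only["interpretation"] = pe_only["be_coverable"].map(
        {True: "suspect", False: "PE-unique"}
    )
    return pe_only
```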
# The extension is given 5'->3' as RTT followed by PBS; the quantification window is widened to cover the edit.
CRISPResso \
  --fastq_r1 pe_sample.fq.gz \
  --amplicon_seq <amplicon_seq> \
  --guide_seq <20nt_spacer> \
  --prime_editing_pegRNA_spacer_seq <spacer> \
  --prime_editing_pegRNA_extension_seq <RTT+PBS> \
  --prime_editing_pegRNA_scaffold_seq <scaffold> \
  --quantification_window_size 25 \
  --output_folder pe_results \
  --name sample_id
# Output: Prime_editing_outcomes.txt
# Columns: intended_edit_pct, scaffold_incorp_pct, indel_pct, unmodified_pct
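Once outcome percentages are collected per sample, a short QC pass can flag samples whose byproducts dominate. The sketch below assumes a table with the four columns named above, one row per sample, and uses cutoffs of at least 5% intended edit, at most 5% scaffold incorporation, and at most 3% indels as illustrative defaults. `qc_outcomes` and the derived `edit_purity` column are hypothetical names.

```python
import pandas as pd


def qc_outcomes(df: pd.DataFrame, min_edit: float = 5.0,
                max_scaffold: float = 5.0, max_indel: float = 3.0) -> pd.DataFrame:
    """Add 'edit_purity' (intended / all edited reads) and a 'pass_qc' flag."""
    out = df.copy()
    edited = out[["intended_edit_pct", "scaffold_incorp_pct", "indel_pct"]].sum(axis=1)
    # Purity is undefined when no reads are edited, so leave it as NaN there.
    out["edit_purity"] = (out["intended_edit_pct"] / edited).where(edited > 0)
    out["pass_qc"] = (
        (out["intended_edit_pct"] >= min_edit)
        & (out["scaffold_incorp_pct"] <= max_scaffold)
        & (out["indel_pct"] <= max_indel)
    )
    return out
```

Purity separates a site that edits poorly from one that edits well but mostly produces byproducts, which call for different fixes.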
Trigger: Sequence-only prediction missed chromatin context. Mechanism: Closed chromatin reduces Cas9 binding and RT activity; PRIDICT2 only sees sequence. Symptom: PRIDICT2 predicts 60% efficiency; observed is 5%. Fix: Cross-reference target with chromatin accessibility data (ATAC-seq) in the cell line; flag pegRNAs at silenced loci; pilot before screen.
Trigger: RTT too short relative to PBS, or RT processivity issue. Mechanism: RT reads past the 5' end of the RTT into the scaffold; the resulting product is detectable but undesired. Symptom: Scaffold incorporation >5%; intended edit efficiency low. Fix: Re-design pegRNA with longer RTT; verify with PRIDICT2 score for scaffold_incorp; pilot at representative loci.
Trigger: PE2 construct expressed at low level; insufficient RT for productive editing. Mechanism: PE2 requires high RT expression; some cell lines down-regulate. Symptom: Library-wide editing <10%; not locus-specific. Fix: Verify PE2 expression by Western blot; consider PEmax (higher activity); use better-validated cell lines (K562, HEK293T, U2OS).
Trigger: Long RTT designed for multi-base edit; RT prematurely terminates. Mechanism: RT processivity drops with longer RTT; multi-base edits often incomplete. Symptom: Allele table shows partial-edit alleles (some bases installed, not all). Fix: Re-design with shorter RTT covering only the closest edits; or use PE3 to nick opposite strand and force longer RT processivity.
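Partial-edit alleles of this kind can be counted by checking each intended substitution separately. The sketch below assumes reads already aligned without gaps to the amplicon, so read index equals amplicon index, and an `intended` mapping from 0-based position to the expected base. `classify_allele` is a hypothetical helper, not a CRISPResso2 function.

```python
from collections import Counter


def classify_allele(read: str, intended: dict) -> str:
    """Return 'full', 'partial' or 'none' for one gap-free aligned read."""
    installed = sum(1 for pos, alt in intended.items()
                    if pos < len(read) and read[pos].upper() == alt.upper())
    if installed == len(intended):
        return "full"
    return "partial" if installed else "none"


def summarize_alleles(reads: list, intended: dict) -> Counter:
    """Count reads in each class across a sample."""
    return Counter(classify_allele(r, intended) for r in reads)
```

A high partial-to-full ratio points at the RT terminating before the distal edits, which is the case the shorter-RTT redesign targets.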
Trigger: No suitable PAM/PBS/RTT combination for the intended edit. Mechanism: PE requires NGG PAM within 30 nt of edit; rare edits cannot be installed. Symptom: Specific variants absent from library. Fix: Use SpRY-PE for relaxed PAM; accept that some variants cannot be PE-installed; consider BE if applicable.
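Whether a variant is installable at all can be checked before design by scanning both strands for an NGG PAM whose nick lies 5' of the edit and within reach. The sketch below assumes SpCas9 (nick 3 nt 5' of the PAM), requires a full 20-nt protospacer inside the supplied sequence, and reads "within 30 nt" as a nick-to-edit distance below 30. `find_pe_pams` is a hypothetical helper.

```python
def find_pe_pams(seq: str, edit_pos: int, max_dist: int = 30) -> list:
    """List (strand, pam_index, nick_to_edit_distance) for usable NGG PAMs.

    pam_index is the 0-based top-strand index where the PAM motif starts
    (NGG for '+', CCN for '-'). Distance 0 is the first base 3' of the nick.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        # Top strand: NGG at i, protospacer at i-20..i-1, nick before i-3.
        if tri[1:] == "GG" and i >= 20:
            dist = edit_pos - (i - 3)
            if 0 <= dist < max_dist:
                hits.append(("+", i, dist))
        # Bottom strand: seen as CCN at i, protospacer at i+3..i+22,
        # first base 3' of the nick (bottom-strand sense) is top index i+5.
        if tri[:2] == "CC" and i + 23 <= len(seq):
            dist = (i + 5) - edit_pos
            if 0 <= dist < max_dist:
                hits.append(("-", i, dist))
    return hits
```

An empty result is the signal to consider a relaxed-PAM editor or a base editor for that variant.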
| Approach | Bystander | Indels | Coverage | When to use |
|----------|-----------|--------|----------|-------------|
| Cas9 + HDR | None | High | Variable (depends on template integration) | Precise edits at scale; high indel byproduct |
| Base editor | Yes | Low (<5%) | Limited by editing window | C->T or A->G at editable position |
| Prime editor | None | Low (<3%) | NGG-PAM within 30 nt of edit | Precise variants; multi-base; transversions |
| Cas9 (no template) | None | 70%+ | Anywhere with NGG | LoF only; not variant-specific |
Decision tree:
| Threshold | Value | Source / Rationale |
|-----------|-------|--------------------|
| PRIDICT2 efficiency for library inclusion | >50% | Mathis 2024 |
| Intended edit % for screen power | >5% (per Anzalone 2019); >20% at favorable sites | Anzalone 2019 |
| Scaffold incorporation | <2% (clean PE); <5% acceptable | Empirical |
| Indel byproduct | <3% (PE2); <5% (PE3) | Anzalone 2019; Chen 2021 |
| PBS GC content | 40-55% | PRIDICT2 |
| PBS length | 11-13 nt | PRIDICT2 |
| RTT length | 10-20 nt | PRIDICT2 |
| Edit position from cut | 1-30 nt | Anzalone 2019 |
| Cell line for PE | K562, HEK293T, U2OS validated | High RT expression |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Low editing across library | Cell-line RT inactivity | Verify PE2 expression; switch to validated line |
| Scaffold incorporation >10% | RTT too short | Re-design with longer RTT |
| Partial multi-base edits | RT processivity limit | Shorter RTT or PE3 |
| PRIDICT predicts but observes much lower | Chromatin context | Pilot at chromatin-aware sites |
| Library missing variants | No NGG PAM | SpRY-PE; BE alternative |
| PE concordant with BE on transitions, disagrees on transversions | PE handles transversions BE doesn't | Expected; trust PE |