crispr-screens/base-editing-analysis/SKILL.md
Analyzes base-editing screens for variant function. Covers library design (Sanson 2020 GRACE, Hanna 2021 BRCA1/2 SNV scanning, Cuella-Martin 2021), CBE vs ABE chemistry choice (BE3/BE4 vs ABE7.10/ABE8.20/ABE8e), editing-window math (positions 4-8 from PAM-distal end, wider for ABE8e), bystander-edit quantification and the variant-call ambiguity it creates, sgRNA-efficiency filtering before hit calling, indel byproduct interpretation, the substitution-vs-indel diagnostic, variant annotation against ClinVar / COSMIC, and the Broad be-validation-pipeline. Use when designing a BE variant screen, choosing CBE vs ABE for a specific edit, interpreting bystander-confounded hits, distinguishing functional signal from indel artifact, integrating CRISPResso2 output with screen scoring, or deciding BE vs PE for SNV installation.
npx skillsauth add GPTomics/bioSkills bio-crispr-screens-base-editing-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: CRISPResso2 2.2.14+, BE-Hive 1.0+ (BE prediction), pandas 2.2+, biopython 1.83+, numpy 1.26+, scipy 1.12+, scikit-learn 1.4+, Broad be-validation-pipeline 1.0+ (Python).
Before using code patterns, verify installed versions match. If versions differ:
CRISPResso --version; be-validation-pipeline --helppip show CRISPResso2 be-hiveIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Analyze my base-editor variant-function screen" -> Quantify per-sgRNA target-base conversion, bystander rate, and indel byproducts from amplicon sequencing; filter on editing efficiency; map each sgRNA to its intended SNV (target + bystander pattern); compute per-variant fitness from the screen log-fold change; reconcile target vs bystander variant attribution; annotate against ClinVar / COSMIC.
CRISPResso --base_editor_output for per-amplicon BE quantificationbe-validation-pipeline for end-to-end pooled-screen analysis with editing-efficiency filteringBE-Hive (Arbab 2020) for editing-efficiency predictionBE-Designer (Hwang 2019) for variant-encoding sgRNA design| Editor | Reaction | Editing window | Indel byproduct rate | When to use | |--------|----------|----------------|----------------------|-------------| | BE3 (Komor 2016) | C->T (also G->A on opposite strand) | Pos 4-8 from PAM-distal end | 5-10% | Original; superseded | | BE4 / BE4max (Koblan 2018) | C->T | Pos 4-8 | <5% | CBE standard | | eA3A-BE3 | C->T narrow specificity | Pos 5-7 | <5% | Specifically TC contexts (eA3A prefers TC) | | ABE7.10 (Gaudelli 2017) | A->G (T->C opposite strand) | Pos 4-7 | <2% | First ABE; slow at non-TA contexts | | ABE8.20 (Richter 2020) | A->G | Pos 4-8 | <2% | Modern ABE; high activity | | ABE8e (Lapinaite 2020) | A->G | Pos 4-8 | <2% | Highest editing activity; broader window | | evoCDA-BE | C->T (broader) | Pos 1-9 | 5-10% | Larger editing window; more bystander | | CGBE1 (Kurt 2021) | C->G | Pos 5-7 | 5-10% | C-to-G transversion; rare use | | GBE (Zhao 2021) | C->G or C->A | Pos 4-7 | 5-10% | Transversions; less mature |
Decision rule: For a target SNV at position 4-8 of a candidate spacer with no bystander Cs/As in the same window, BE3-BE4 or ABE7.10 is sufficient. For high-throughput variant scanning where bystander tolerance must be minimized, use eA3A-BE3 (TC contexts only) or ABE8e (narrower effective window).
Why this matters for postdoc-level use: Base editors are tethered to dCas9 (or nCas9) and the deaminase acts on the displaced ssDNA "R-loop" formed when Cas9 binds. The deaminase has a fixed reach -- positions 4-8 from the PAM-distal end of the protospacer for canonical BE3/BE4/ABE7.10. Outside this window, editing efficiency drops by 10-50x.
PAM-distal end PAM-proximal
| |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 NGG
^^^^^^^^^^^
Canonical editing window (positions 4-8)
For BE4max / ABE7.10: positions 4-8 are 5-50x more efficient than positions 1-3 or 9-13
For ABE8e: window extends to positions 4-10 due to enhanced TadA8e activity
For evoCDA-BE: window 1-9 (broader; more bystander)
Critical implication for variant interpretation: If the intended edit is at position 5 and there is an additional editable C/A at position 7, both will be edited in the same molecule. The screen scores the combination of edits, not the intended one alone. This is bystander confounding.
Goal: Tile editing-window-positioned spacers across a protein region of interest to enable variant scanning.
Approach: For each amino acid in the target region, find NGG-adjacent spacers where the SNV-of-interest base falls in editing positions 4-8 with minimal bystander C/A in the same window. Annotate each spacer with the predicted amino acid changes (target + bystander).
import pandas as pd
import re
from Bio.Seq import Seq
def find_be_spacers(cds_sequence, cds_protein_start, target_aa, target_base='C', editor='BE4max'):
'''Find sgRNAs that place target_base in editor-specific window at target_aa.
Returns spacers with bystander annotation.
Args:
cds_sequence: nucleotide CDS (translated frame 1)
cds_protein_start: amino acid number of CDS start (usually 1)
target_aa: amino acid number to install variant (e.g., 130 for residue 130)
target_base: 'C' (CBE) or 'A' (ABE)
editor: 'BE3', 'BE4max', 'eA3A-BE3', 'ABE7.10', 'ABE8.20', 'ABE8e', 'evoCDA-BE'
Returns: DataFrame with spacer, position-in-cds, target-base-position-in-spacer,
bystander_positions, predicted_aa_changes
'''
# Editor-specific editing window (positions from PAM-distal end of spacer)
window_by_editor = {
'BE3': (4, 8), 'BE4max': (4, 8), 'eA3A-BE3': (5, 7),
'ABE7.10': (4, 7), 'ABE8.20': (4, 8), 'ABE8e': (4, 10), # ABE8e wider!
'evoCDA-BE': (1, 9),
}
window_lo, window_hi = window_by_editor[editor]
aa_index = target_aa - cds_protein_start # 0-indexed in protein
aa_start_nt = aa_index * 3 # nt offset in cds
candidates = []
spacer_len = 20
pam_pattern = re.compile(r'(?=([ACGT]GG))')
for strand, seq in [('+', cds_sequence), ('-', str(Seq(cds_sequence).reverse_complement()))]:
for pam_match in pam_pattern.finditer(seq):
pam_pos = pam_match.start()
spacer_start = pam_pos - spacer_len
if spacer_start < 0:
continue
spacer = seq[spacer_start:pam_pos]
# Editor-specific window from PAM-distal end (1-indexed)
# Find all editable bases in window
edit_bases_in_window = []
for i, b in enumerate(spacer[window_lo-1:window_hi], start=window_lo):
if b == target_base:
edit_bases_in_window.append(i)
if not edit_bases_in_window:
continue
# Annotate which edits hit the target_aa codon
target_codon_start = aa_start_nt
target_codon_end = target_codon_start + 3
target_position_in_spacer = []
for i in edit_bases_in_window:
genomic_pos = spacer_start + i - 1
if target_codon_start <= genomic_pos < target_codon_end:
target_position_in_spacer.append(i)
bystander_positions = [i for i in edit_bases_in_window if i not in target_position_in_spacer]
candidates.append({
'spacer': spacer,
'strand': strand,
'spacer_start': spacer_start,
'target_positions': target_position_in_spacer,
'bystander_positions': bystander_positions,
'n_bystanders': len(bystander_positions),
})
return pd.DataFrame(candidates).sort_values('n_bystanders')
Decision rule: Select spacers with target_positions != empty AND n_bystanders minimized. For variant-by-variant scanning, accept up to 1-2 bystanders if biology of those positions is interpretable; flag for downstream variant attribution.
Goal: Drop sgRNAs that do not edit efficiently, since unedited reads represent no biological perturbation.
Approach: From CRISPResso2 output, compute target-base-conversion percentage per sgRNA; filter library to sgRNAs with >50% target editing in a pilot or co-screened control.
def filter_by_editing_efficiency(crispresso_outputs_dir, target_pos, target_base, efficiency_threshold=0.5):
'''Drop sgRNAs that edit <efficiency_threshold of reads at target position.
crispresso_outputs_dir: directory containing CRISPResso per-sample outputs.'''
from pathlib import Path
results = []
for sample_dir in Path(crispresso_outputs_dir).glob('CRISPResso_on_*'):
sgrna_id = sample_dir.name.replace('CRISPResso_on_', '')
quant_file = sample_dir / 'Quantification_window_nucleotide_percentage_table.txt'
if not quant_file.exists():
continue
df = pd.read_csv(quant_file, sep='\t')
# Find target position in the quantification window
target_row = df[df['Position'] == target_pos]
if target_row.empty:
continue
# Editing = sum of non-original bases at target position
original_pct = target_row[target_base].values[0]
editing_pct = (100 - original_pct) / 100
results.append({'sgrna_id': sgrna_id, 'editing_pct': editing_pct,
'pass_filter': editing_pct >= efficiency_threshold})
return pd.DataFrame(results)
Convention: Drop sgRNAs below 50% editing for variant-function screens. Hanna 2021 used a 30% threshold for primary screen and a 50% threshold for confirmed hits. Below 30%, the screen has insufficient power; above 70%, results approach saturation editing.
Why this matters: When a sgRNA's editing window contains the target base AND a bystander base, the screen scores the combination. To attribute screen signal to the target variant alone, either (a) include sgRNAs that edit only the target (no bystander) -- often impossible -- or (b) deconvolute via parallel measurements.
Strategies for variant-by-variant attribution:
Tile multiple sgRNAs with different bystander patterns: If 5 different sgRNAs all hit the target base but have different bystanders, common signal across them is target-attributable (Hanna 2021 approach).
Use orthogonal chemistry: Run the same variant scan with prime editor (no bystanders); cross-validate. See [[prime-editing-screens]].
Bystander stratification: From CRISPResso2 allele table, partition reads by exact edit pattern (target only, target+bystander_1, target+bystander_2, etc.); separately score each pattern's contribution to the phenotype.
Restrict library: Use only sgRNAs with zero bystanders in the editing window (rare; may exclude most candidate spacers).
def deconvolute_bystander(allele_table_path, target_pos, bystander_pos_list):
'''From CRISPResso2 allele table, partition reads by edit pattern at target + bystanders.
Returns: per-pattern frequency for each combination of target/bystander edits.'''
alleles = pd.read_csv(allele_table_path, sep='\t', compression='zip')
# Mark target_edited and per-bystander_edited
alleles['target_edited'] = alleles['Aligned_Sequence'].str[target_pos-1] != alleles['Reference_Sequence'].str[target_pos-1]
for bp in bystander_pos_list:
alleles[f'bystander_{bp}_edited'] = alleles['Aligned_Sequence'].str[bp-1] != alleles['Reference_Sequence'].str[bp-1]
return alleles.groupby(['target_edited'] + [f'bystander_{bp}_edited' for bp in bystander_pos_list])['Reference_pct'].sum().reset_index()
Goal: Score per-variant fitness from a base-editor screen.
Approach: Filter library to efficiency-passing sgRNAs (>50% editing), then run MAGeCK MLE or drugZ on the sgRNA-level counts; map each significant sgRNA to its predicted variant + bystander pattern; aggregate to per-variant scores.
def aggregate_variant_scores(mageck_sgrna_summary, variant_annotation_df):
'''Aggregate sgRNA-level scores to per-variant scores.
variant_annotation_df: per-sgRNA -> predicted variants (target + bystanders).'''
df = mageck_sgrna_summary.merge(variant_annotation_df, on='sgRNA')
# Target-only contribution: sgRNAs with no bystanders
target_only = df[df['n_bystanders'] == 0]
target_only_scores = target_only.groupby('target_variant')['LFC'].agg(['mean', 'std', 'count'])
# Mixed signal: sgRNAs with bystanders
mixed = df[df['n_bystanders'] > 0]
return target_only_scores, mixed
Hanna et al 2021 Cell 184:1066 established the gold-standard methodology for BE variant scanning:
Quantified result: Identified novel loss-of-function variants in BRCA1 RING and BRCT domains, MCL1 and BCL2L1, and PARP1 resistance variants. The approach validated for clinical variant interpretation.
Cuella-Martin et al 2021 Cell 184:1081-1097 screened ~86 DNA-damage-response (DDR) genes (including BRCA1/2) with CBE saturation mutagenesis:
Reconciliation with Hanna 2021: Both studies converged on similar functional variant calls in BRCA1 RING and BRCT domains; the orthogonal methodology gave confidence in clinical interpretation.
| Approach | What it does | Bystander | Indels | When to use | |----------|--------------|-----------|--------|-------------| | Cas9 + HDR template | Installs precise edit + template | None | High (NHEJ competition) | When precise edit needed; high indel byproduct | | Cas9 (no template) | Random indels at cut site | None | 70%+ | Loss-of-function; not variant-specific | | CBE (BE3/BE4) | C->T at editing window | Yes (multiple Cs) | <5% | C->T variants with manageable bystanders | | ABE (ABE7.10/ABE8e) | A->G at editing window | Yes (multiple As) | <2% | A->G variants; clean for single-A spacers | | CGBE / GBE | C->G or C->A | Yes | 5-10% | Transversions; rare use cases | | Prime editor (PE2/PE3) | Templated edit; any base change | None | 1-3% | Precise variants; lower efficiency |
Decision: For C->T or A->G with available editing window: base editor is preferred (higher efficiency than PE). For other transitions/transversions, multi-base edits, or zero-bystander requirements: prime editor.
The Broad Institute's be-validation-pipeline (https://broadinstitute.github.io/be-validation-pipeline/) is an end-to-end snakemake workflow for BE variant-function screens:
# Install
git clone https://github.com/broadinstitute/be-validation-pipeline
cd be-validation-pipeline
conda env create -f environment.yaml
conda activate bevalidation
# Configure
edit config.yaml # Specify library, FASTQ paths, reference, target genes
# Run
snakemake --use-conda --cores 16
# Outputs:
# results/per_sgrna_editing_efficiency.tsv
# results/per_variant_attribution.tsv
# results/hit_calls.tsv
The pipeline handles editing-efficiency filtering, bystander attribution, and variant calling in a single end-to-end workflow optimized for the GRACE library (Sanson 2020) but adaptable.
Trigger: Cas9 contamination, wrong vector (e.g., used pCas9-BE3 plasmid but selected on Cas9 line), or evoCDA-BE / broader-window chemistry. Mechanism: Cas9 cuts dsDNA; BE relies on nicked-ssDNA deamination. Cas9 expression in the same cell creates indels. Symptom: Substitution-vs-indel ratio <3 in CRISPResso output. Fix: Verify vector (nCas9-BE3 not Cas9-BE3); confirm cell line lacks Cas9 background; restrict to specifically engineered BE-cell lines.
Trigger: Bystander C/A is dominating; intended variant is not the perturbation driving phenotype. Mechanism: When target is at position 5 and bystander is at position 7, the molecule carries both; phenotype is from the bystander. Symptom: Strong screen signal but variant attribution unclear. Fix: Run orthogonal prime-editor scan of the same intended variants; restrict library to bystander-free spacers when possible; deconvolute via allele-frequency table.
Trigger: Intended variant is silent or compensatory; the protein function is unchanged. Mechanism: Variants can be tolerated; not all variants are LoF or GoF. Symptom: High editing efficiency (>70%) but per-sgRNA LFC near zero. Fix: Expected outcome for many variants; flag silent / compensatory variants in the report.
Trigger: Wrong cell line for the BE; cell line has poor BE activity (some lines lack APOBEC or have low expression). Mechanism: BE efficiency depends on cell-line expression of TadA or APOBEC components. Symptom: Median editing <30% across library. Fix: Test in a BE-validated cell line (HEK293T, U2OS, K562 generally work); pilot before full screen.
Trigger: No NGG-adjacent spacer places target base in editing window for that codon. Mechanism: Editor window is fixed; some codons cannot be targeted with given chemistry. Symptom: Specific variants absent from screen. Fix: Use PAM-relaxed BE variants (SpRY-CBE, SpRY-ABE); use prime editor for variants outside BE accessibility; accept that some variants cannot be installed.
| Threshold | Value | Source / Rationale | |-----------|-------|--------------------| | Editing window | Positions 4-8 from PAM-distal end (BE3/BE4); 4-10 (ABE8e) | Komor 2016; Gaudelli 2017; Lapinaite 2020 | | Editing efficiency for screen power | >30% (primary); >50% (validation) | Hanna 2021 conventions | | Indel byproduct (clean BE) | <5%; <2% for ABE | Koblan 2018 (BE4max); Gaudelli 2017 (ABE) | | Substitution-vs-indel ratio | >10 (clean BE); <3 (Cas9-like) | CRISPResso2 diagnostic | | Target editing % for variant inclusion | >30% Hanna 2021; >50% strict | Empirical | | Bystander rate (target attribution) | <10% acceptable; <5% ideal for clean attribution | Application-dependent | | Cell-line BE activity (pilot) | >30% editing at validated target | Below = wrong cell line for BE | | Per-amino-acid sgRNA density | 10-15 (Hanna 2021); 5-8 (smaller screens) | Tradeoff with library size |
| Error / symptom | Cause | Solution | |-----------------|-------|----------| | Substitution-vs-indel ratio <3 | Cas9 contamination or wrong BE | Verify vector / cell line; pilot first | | All edits at bystander positions | Target base outside window | Re-design spacer with target at pos 4-8 | | Variant attribution unclear | Bystander confounding | Run orthogonal PE; restrict library | | Library lacks intended variant | No NGG-PAM accessibility | SpRY-CBE; prime editor; accept exclusion | | Editing <30% library-wide | Cell-line BE inactivity | Re-validate cell line | | Hit list dominated by single sgRNA | Bystander-driven phenotype | Cross-check with bystander-free sgRNAs |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.