genome-engineering/off-target-prediction/SKILL.md
Nominates and assesses CRISPR off-target sites genome-wide. Enumerates candidate sites by mismatch and bulge tolerance with Cas-OFFinder/CRISPRitz, ranks them with the published CFD score (SpCas9-only, relative ranker) or MIT/CRISTA/energy models, runs variant-aware screening against gnomAD/individual genomes (CRISPRme), and frames the empirical genome-wide discovery assays (GUIDE-seq, CIRCLE-seq, CHANGE-seq, DISCOVER-seq, Digenome-seq) and high-fidelity nuclease choice (HiFi Cas9, Sniper-Cas9, eSpCas9, SpCas9-HF1). Use when assessing guide RNA specificity, choosing among candidate guides, screening a therapeutic guide against population variation, or planning empirical off-target validation. Distinguishes predicted vs detected vs validated. On-target activity scoring and deaminase (Cas-independent) base/prime-editor off-targets are separate skills.
Reference examples tested with: Cas-OFFinder 3.0+, pandas 2.2+, Python 3.10+.
Before using code patterns, verify installed versions match. If versions differ:
- `pip show <package>` then `help(module.function)` to check signatures
- `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Results depend on inputs far more than tool versions: the candidate list is bounded by the reference genome build, the mismatch/bulge tolerance, and the PAM pattern searched, not by the Cas-OFFinder version. The CFD matrix is SpCas9/NGG-specific and a relative ranker, not a calibrated cutting probability. Load the published CFD tables (Doench 2016 / CRISPOR distribution) rather than hand-typing values. Cas-OFFinder is the maintained snugel/cas-offinder repository (native DNA/RNA bulge support from v3.0.0).
"Check my guide for off-targets" -> Enumerate candidate sites genome-wide by mismatch/bulge tolerance, rank them by a per-site score, decide whether in-silico is sufficient or empirical discovery is required, and report each claim at the right rung: predicted, detected, or validated.
- `cas-offinder input.txt G output.txt` enumerates sites (no ranking)
- CRISPRme for variant-aware (gnomAD + individual) nomination; CRISPOR to aggregate

The naive model -- "search the genome within N mismatches, score by CFD, the high scorers are my off-targets" -- is wrong in three structural ways no better scoring fixes:
The corollary, and the central professor-level point: in-silico lists overlap only partially with empirically validated off-targets, and the empirical genome-wide assays disagree with each other too. No single method is authoritative. Off-target evidence escalates: predicted -> detected by an unbiased assay -> validated by targeted amplicon deep-seq. Conflating these rungs is the field's most common error. Therapeutic-grade assessment is triangulation (variant-aware in-silico + >=2 orthogonal empirical assays + amplicon validation + a structural readout), never one tool's output.
| Layer | Tool | Citation | Role / caveat |
|-------|------|----------|---------------|
| Enumerate | Cas-OFFinder | Bae 2014 Bioinformatics 30:1473 | exhaustive, alignment-free, GPU; DNA/RNA bulges (native v3.0.0); returns sites, no ranking |
| Enumerate (variant) | CRISPRitz | Cancellieri 2020 Bioinformatics 36:2001 | enumerates against genome + a VCF of variants, with bulges; backend of CRISPRme |
| Enumerate (scale) | GuideScan2 | Schmidt 2025 Genome Biol 26:41 | genome-wide specificity databases (NOT Nat Biotechnol) |
| Score (per-site) | CFD | Doench 2016 Nat Biotechnol 34:184 | position x mismatch-type matrix x PAM penalty; SpCas9/NGG only, poor on bulges; de facto standard |
| Score (legacy) | MIT/Hsu | Hsu 2013 Nat Biotechnol 31:827 | original; deprecated/flawed -- report, don't lead with it |
| Score (ML) | CRISTA; Elevation | Abadi 2017; Listgarten 2018 | Elevation folds in chromatin accessibility |
| Aggregate | CRISPOR; CRISPRme | Concordet 2018 NAR 46:W242; Cancellieri 2023 Nat Genet 55:34 | CRISPOR = research one-stop; CRISPRme = variant-aware therapeutic nominator |
The unifying caveat: every score is bounded by the enumerator's coverage -- if the enumerator didn't propose a site (bulge, distal PAM, beyond the mismatch cutoff), no scorer will ever flag it.
| Assay | Citation | Class | Bias |
|-------|----------|-------|------|
| CIRCLE-seq | Tsai 2017 Nat Methods 14:607 | in-vitro (cell-free) | over-calls (no chromatin); most sensitive candidate generator |
| CHANGE-seq | Lazzarotto 2020 Nat Biotechnol 38:1317 | in-vitro | scalable CIRCLE-seq; same over-call caveat |
| Digenome-seq | Kim 2015 Nat Methods 12:237 | in-vitro (WGS) | unbiased but depth-limited, expensive |
| SITE-seq | Cameron 2017 Nat Methods 14:600 | in-vitro | concentration series ranks sensitivity |
| GUIDE-seq | Tsai 2015 Nat Biotechnol 33:187 | cell-based (dsODN tag) | physiological; misses rare sites, cell-type-specific, hard in primary/RNP |
| DISCOVER-seq | Wienert 2019 Science 364:286 | cell-based (MRE11 ChIP, in situ) | tag-free, works in vivo; depends on transient MRE11 occupancy |
| TTISS | Schmid-Burgk 2020 Mol Cell 78:794 | cell-based | high-throughput; benchmarks fidelity variants |
The load-bearing reality: in-vitro assays over-call (high sensitivity, low cellular specificity); cell-based assays under-call rare sites and are cell-type-dependent (K562 yields far more hits than HEK293 for the same guide). Cross-method discordance is information, not noise -- sites found by both are high-confidence; in-vitro-only sites are likely chromatin-protected. The defensible workflow is the VIVO logic (Akcakaya 2018): sensitive in-vitro generator -> cell-based assay in the relevant cell type -> amplicon validation.
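The overlap logic above can be sketched with plain set operations. This is a minimal illustration, assuming each assay's calls have already been reduced to comparable site keys; the `(chrom, position, strand)` tuples and the `triage_sites` name are hypothetical, and real calls usually need a positional tolerance window rather than exact matching.

```python
def triage_sites(in_vitro_sites, cell_based_sites):
    """Split off-target calls by which assay class reported them.

    Both arguments are iterables of hashable site keys, e.g.
    (chrom, position, strand) tuples (a hypothetical key format).
    """
    in_vitro = set(in_vitro_sites)
    cell_based = set(cell_based_sites)
    return {
        # reported by both assay classes: highest-confidence candidates
        "both": in_vitro & cell_based,
        # cut on naked DNA only: possibly chromatin-protected in cells
        "in_vitro_only": in_vitro - cell_based,
        # seen in cells but missed in vitro: worth a closer look
        "cell_based_only": cell_based - in_vitro,
    }


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    circle = [("chr1", 1000, "+"), ("chr2", 2000, "-"), ("chr3", 3000, "+")]
    guide = [("chr2", 2000, "-"), ("chr4", 4000, "+")]
    for label, sites in triage_sites(circle, guide).items():
        print(label, sorted(sites))
```

All three bins feed the amplicon validation panel; the bin only sets the prior, it does not replace validation.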
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Research knockout / screen (some off-target tolerable) | CRISPOR or GuideScan2 to pick the most specific guide; Cas-OFFinder (<=4 mm + bulges) to eyeball top sites | in-silico is sufficient when being wrong is cheap |
| Choosing among candidate guides | rank by aggregate CFD specificity (compare guides, not absolute safety) | specificity is a separate axis from on-target activity (-> grna-design) |
| Human therapeutic guide | variant-aware CRISPRme vs gnomAD (+ patient genome), bulges on | a common ancestry-enriched SNP can create a real off-target (rs114518452 / BCL11A) |
| Therapeutic, choosing the nuclease | high-fidelity variant in the delivery format actually used | RNP -> HiFi Cas9 (R691A) or Sniper-Cas9; plasmid-tuned variants can lose their edge as RNP |
| Therapeutic validation | >=2 orthogonal empirical assays -> amplicon deep-seq with stated LoD -> structural readout | predicted != detected != validated; amplicons miss large deletions/translocations |
| Base/prime-editor off-targets | this skill covers Cas-dependent only | deaminase (Cas-independent) DNA/RNA off-targets -> base-editing-design / prime-editing-design |
| Variant | Citation | Note |
|---------|----------|------|
| eSpCas9(1.1) | Slaymaker 2016 Science 351:84 | neutralizes non-target-strand contacts; characterized mostly as plasmid |
| SpCas9-HF1 | Kleinstiver 2016 Nature 529:490 | weakens 4 Cas9-DNA H-bonds; plasmid-characterized |
| HypaCas9 | Chen 2017 Nature 550:407 | conformational proofreading gate |
| evoCas9 | Casini 2018 Nat Biotechnol 36:265 | ~79x fidelity; ~90% residual on-target |
| Sniper-Cas9 | Lee 2018 Nat Commun 9:3048 | high specificity and works as RNP |
| HiFi Cas9 (R691A) | Vakulskas 2018 Nat Med 24:1216 | single mutation; the RNP-favored therapeutic variant |
Two tacit points: (1) delivery format matters -- eSpCas9/HF1 can lose their fidelity advantage delivered as a high transient RNP bolus; HiFi Cas9 and Sniper-Cas9 stay specific and active as RNP. (2) Fidelity has a guide-dependent on-target tax -- a variant clean and active on guide A can be nearly dead on guide B. Pick the variant, then test it on the target guide in the intended delivery format; transferability is not assumable.
A patient is not GRCh38. A common SNP can restore a PAM or remove the protective mismatch at a near-target site, creating an off-target that exists only in some individuals -- and because variant frequencies differ by ancestry, reference-only screening systematically misses off-targets common in under-represented populations. For a human therapeutic guide, an off-target check must expand from "checked off-targets" to "checked off-targets variant-aware, across ancestries" -- run CRISPRme against gnomAD (and the treated individual's genome).
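To make the PAM-restoring case concrete, here is a toy check of whether a single-nucleotide variant turns a non-PAM into a PAM at a candidate site. It is only a sketch of the idea, not how CRISPRme works internally: the function names, the site layout (20-nt protospacer followed by a 3-nt PAM on the same strand), and the example sequence are all assumptions for illustration.

```python
# IUPAC codes needed for SpCas9 PAM patterns such as NGG and NRG.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}


def pam_matches(pam_seq, pattern="NGG"):
    """True if every base of pam_seq is allowed by the IUPAC pattern."""
    pam_seq, pattern = pam_seq.upper(), pattern.upper()
    return len(pam_seq) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(pam_seq, pattern)
    )


def variant_creates_pam(site, pam_start, snv_offset, alt, pattern="NGG"):
    """True if the reference PAM fails the pattern but the variant PAM passes.

    site       : protospacer + PAM on one strand (hypothetical layout)
    pam_start  : 0-based index where the PAM begins within site
    snv_offset : 0-based index of the substituted base within site
    alt        : alternate allele (single base)
    """
    end = pam_start + len(pattern)
    ref_pam = site[pam_start:end]
    alt_site = site[:snv_offset] + alt + site[snv_offset + 1:]
    alt_pam = alt_site[pam_start:end]
    return not pam_matches(ref_pam, pattern) and pam_matches(alt_pam, pattern)


if __name__ == "__main__":
    # Reference carries TGC (not a PAM); a C>G variant makes it TGG.
    site = "GGCCGACCTGTCGCTGACGC" + "TGC"
    print(variant_creates_pam(site, pam_start=20, snv_offset=22, alt="G"))
```

A reference-only search never proposes such a site, which is why the variant-aware enumeration has to happen before scoring.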
Goal: Generate the genome-wide candidate-site list for one or more guides, including bulges and relaxed PAMs.
Approach: Write the Cas-OFFinder input file -- genome path, an optional DNA/RNA bulge line (v3.0.0+), a pattern with N's at guide positions and the PAM (use NRG to also catch NAG/NGG), then one query line per guide (guide bases + N's for the PAM positions, same length as the pattern) with its mismatch tolerance. Run on GPU if available. The output is a flat site list with mismatch counts -- it is a hypothesis set to score downstream, not a verdict.
```bash
# input.txt
# /path/to/genome_dir          # directory of FASTA (Cas-OFFinder indexes it)
# 2 2                          # DNA bulge, RNA bulge (omit this line for no-bulge search)
# NNNNNNNNNNNNNNNNNNNNNRG      # 20 N (guide) + NRG PAM -> also catches NAG
# GGCCGACCTGTCGCTGACGCNNN 4    # query: 20 guide bases + NNN (PAM positions), <=4 mismatches
cas-offinder input.txt G output.txt   # G=GPU, C=CPU, A=auto
```
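Because a pattern/query length mismatch is an easy way to get an empty result, it can help to generate query lines programmatically. The helper below is a hypothetical convenience (the name `make_query_line` is not part of Cas-OFFinder); it assumes a 3' PAM as for SpCas9, so the guide is padded with N's on the right up to the pattern length.

```python
def make_query_line(guide, pattern, max_mismatches):
    """Build one query line: guide padded with N's to the pattern length,
    followed by the mismatch tolerance.

    Assumes a 3' PAM, so padding goes at the 3' end (an SpCas9 assumption).
    """
    guide = guide.upper()
    if set(guide) - set("ACGT"):
        raise ValueError("guide must contain only A, C, G, T")
    pad = len(pattern) - len(guide)
    if pad < 0:
        raise ValueError("guide is longer than the search pattern")
    if max_mismatches < 0:
        raise ValueError("mismatch tolerance must be non-negative")
    return f"{guide}{'N' * pad} {max_mismatches}"


if __name__ == "__main__":
    pattern = "N" * 21 + "RG"  # 20 guide positions + NRG PAM
    print(make_query_line("GGCCGACCTGTCGCTGACGC", pattern, 4))
```

The exact layout of the rest of the input file (genome path, bulge sizes, pattern) should be checked against the README of the installed Cas-OFFinder version.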
Goal: Rank candidate sites by relative cleavage propensity and compute an aggregate guide-specificity score for comparing guides.
Approach: Do NOT hand-type the CFD matrix. Load the published Doench 2016 tables (mismatch_score.pkl, pam_scores.pkl -- they ship with CRISPOR and the Doench code), take the product of per-position mismatch penalties x the PAM penalty for each site, and aggregate as 100/(1 + sum(CFD)) with per-site CFDs on a 0-1 scale (the CRISPOR specificity formulation; equivalently 10000/(100 + 100*sum)). Compare aggregate scores among candidate guides; never read an absolute CFD as a safety guarantee. (See examples/off_target_analysis.py.)
```python
import pickle


def load_cfd_tables(mismatch_pkl, pam_pkl):
    '''Load the published Doench 2016 CFD tables (distributed with CRISPOR) -- do not fabricate.'''
    with open(mismatch_pkl, 'rb') as f:
        mismatch = pickle.load(f)   # keys like 'rA:dG,3' -> penalty
    with open(pam_pkl, 'rb') as f:
        pam = pickle.load(f)        # keys like 'AG' -> penalty
    return mismatch, pam
```
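With the tables loaded, the per-site product and the aggregate can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference implementation: the key construction (`r<guide RNA base>:d<complement of the off-target DNA base>,<position>`) follows the key format shown above and my reading of the Doench 2016 script, so verify it against the keys in the loaded tables; `cfd_site_score` and `aggregate_specificity` are hypothetical names; and the sum is taken over off-target sites only, with the on-target excluded.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def cfd_site_score(guide, off_target, pam, mismatch, pam_scores):
    """Product of per-position mismatch penalties times the PAM penalty.

    guide, off_target : equal-length protospacers, DNA alphabet, 5'->3'
    pam               : last two PAM bases of the site, e.g. 'GG' for NGG
    mismatch          : dict with keys like 'rA:dG,3' (loaded, not typed)
    pam_scores        : dict with keys like 'AG'
    """
    guide, off_target = guide.upper(), off_target.upper()
    if len(guide) != len(off_target):
        raise ValueError("guide and site must have equal length (no bulges)")
    score = 1.0
    for pos, (g, o) in enumerate(zip(guide, off_target), start=1):
        if g == o:
            continue  # matched positions carry no penalty
        rna_base = "U" if g == "T" else g
        key = f"r{rna_base}:d{COMPLEMENT[o]},{pos}"  # assumed key layout
        score *= mismatch[key]
    return score * pam_scores[pam.upper()]


def aggregate_specificity(off_target_cfds):
    """100 / (1 + sum of per-site CFDs), per-site CFDs on a 0-1 scale."""
    return 100.0 / (1.0 + sum(off_target_cfds))
```

A guide with no enumerated off-targets scores 100 by construction, which says only that the enumerator proposed nothing, not that the guide is safe. Sites found through a bulge alignment do not fit this ungapped comparison, consistent with CFD being poor on bulges.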
Validating only with a short amplicon at each predicted site systematically misses the large-scale outcomes that are often the real safety concern:
| Rung | Meaning | Method | Floor |
|------|---------|--------|-------|
| Predicted | sequence-similar candidate | Cas-OFFinder/CRISPOR/CRISPRme | n/a |
| Detected | nuclease acts there (unbiased) | GUIDE-/CIRCLE-/DISCOVER-/CHANGE-seq | assay-dependent |
| Validated | confirmed editing + allele frequency | targeted amplicon deep-seq (rhAmpSeq) + CRISPResso2 | ~0.1-0.5% (~0.1% with UMI/duplex) |
"Not detected" means "below the LoD," never "zero." State the LoD: 0.05% editing is irrelevant for a research knockout but is ~50,000 mis-edited cells in a 10^8-cell therapy.
Trigger: treating an in-silico mismatch list as a verdict. Mechanism: the search sees sequence homology, not cellular cutting; bulges/chromatin/sub-LoD editing are invisible. Symptom: clean report, real off-targets later. Fix: in-silico chooses which guide to try; validate empirically when being wrong matters.
Trigger: amplicon-seq only at predicted sites. Mechanism: large deletions drop out of PCR (Kosicki 2018); the panel can't discover sites the in-silico search missed. Symptom: falsely clean. Fix: feed the panel from an unbiased discovery assay; add a structural/translocation readout; state the LoD.
Trigger: "CIRCLE-seq is the gold standard." Mechanism: in-vitro over-calls, cell-based under-calls rare/cell-type-specific sites; they disagree by design. Symptom: over- or under-stated risk. Fix: triangulate (in-vitro generator + cell-based in the relevant cell type + validation).
Trigger: "use eSpCas9 for specificity." Mechanism: plasmid-tuned variants can lose the advantage as RNP; the on-target tax is guide-dependent. Symptom: lost activity or lost specificity. Fix: RNP -> HiFi Cas9/Sniper-Cas9; test the variant on the target guide in the intended format.
Trigger: searching GRCh38 only. Mechanism: ancestry-enriched SNPs create/destroy off-targets. Symptom: a real, population-specific off-target missed. Fix: CRISPRme vs gnomAD + the individual's genome.
Trigger: mismatch-only search at NGG. Mechanism: the mismatch-count abstraction can't represent a 1 nt bulge or an NAG site. Symptom: assay finds an off-target the search "missed." Fix: enable bulges (Cas-OFFinder v3) and search a relaxed PAM (NRG).
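The bulge blind spot is easy to demonstrate. The sketch below, using hypothetical helper names and a toy site built by inserting one extra base into a perfect target, compares an ungapped mismatch count (anchored at the PAM-proximal end) with the count obtained when one DNA bulge is allowed. It is an illustration of the abstraction failing, not a replacement for Cas-OFFinder's bulge search.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ungapped_mismatches(guide, site):
    """Compare the guide to the PAM-proximal len(guide) bases of the site."""
    return hamming(guide, site[-len(guide):])


def mismatches_with_one_dna_bulge(guide, site):
    """Best mismatch count after looping out any single site base.

    Expects a site exactly one base longer than the guide.
    """
    if len(site) != len(guide) + 1:
        raise ValueError("site must be one base longer than the guide")
    return min(hamming(guide, site[:i] + site[i + 1:]) for i in range(len(site)))


if __name__ == "__main__":
    guide = "GGCCGACCTGTCGCTGACGC"
    # Toy site: perfect target with one extra A inserted mid-protospacer.
    site = guide[:10] + "A" + guide[10:]
    print("ungapped:", ungapped_mismatches(guide, site))
    print("one DNA bulge:", mismatches_with_one_dna_bulge(guide, site))
```

A single extra base shifts every PAM-distal position out of register, so the ungapped count exceeds a typical <=4 cutoff even though the site is a perfect match once the bulge is modeled.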
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Mismatch cutoff | <=4 typical (CRISPOR default); up to 6 for thorough | meaningful cutting rare beyond 4-5 mm, but bulges/variants rescue more-distant sites |
| Bulge size | up to ~2 (DNA + RNA) | real validated off-targets occur with 1-2 nt bulges |
| CFD per-site | relative ranker; attention >~0.1-0.2; high-risk near on-target | not a calibrated probability |
| Aggregate specificity (CRISPOR) | higher better; >~80 commonly "good" for choosing guides | research heuristic, NOT a clinical pass/fail |
| Amplicon LoD | ~0.1-0.5% (~0.1% with UMI/duplex) | below this, PCR/sequencer error dominates |
| High-fidelity on-target tax | guide- and format-dependent | always test the variant on the target guide |
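To see what an amplicon LoD means in absolute terms, the editing frequency can be converted into a cell count for a given dose. This is plain arithmetic with a hypothetical function name; it assumes the frequency is expressed in percent and applies uniformly across the dosed cells.

```python
def cells_at_frequency(editing_percent, n_cells):
    """Number of cells carrying an edit at the given percent frequency."""
    if editing_percent < 0 or n_cells < 0:
        raise ValueError("inputs must be non-negative")
    return round(editing_percent / 100 * n_cells)


if __name__ == "__main__":
    # 0.05% editing in a 10^8-cell product
    print(cells_at_frequency(0.05, 1e8))  # 50000
```

An off-target sitting just under the stated LoD can therefore still correspond to tens of thousands of cells in a therapeutic dose, which is why the LoD belongs in the report next to every "not detected".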
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Cas-OFFinder returns nothing | wrong genome path / pattern-query length mismatch | query length must equal pattern length; check the genome dir/FASTA |
| CFD scores look fabricated/wrong | hand-typed matrix | load the published mismatch_score.pkl/pam_scores.pkl |
| Assay finds an off-target the search missed | mismatch-only, NGG-only search | enable bulges; search NRG |
| "No detectable off-targets" claimed as zero | LoD not stated | report the limit of detection; absence is bounded, not absolute |