workflows/crispr-screen-pipeline/SKILL.md
End-to-end pooled and single-cell CRISPR screen analysis from FASTQ to hit genes. Orchestrates library design QC, guide counting, six-stage screen QC (plasmid Gini, replicate Pearson, CEGv2 PR-AUC, copy-number artifact), method-appropriate hit calling across MAGeCK RRA/MLE, BAGEL2, drugZ, JACKS, and Chronos, cancer-cell-line copy-number correction (CRISPRcleanR / Chronos), batch correction for multi-batch screens, and the specialized branches for combinatorial paralog screens, single-cell Perturb-seq, base-editor variant-function screens, prime-editor screens, and in vivo bottleneck-aware screens. Use when analyzing any pooled CRISPR screen end-to-end, choosing the correct hit-calling method by experimental design, integrating copy-number correction into the pipeline, or branching the workflow for single-cell, combinatorial, base-editor, prime-editor, or in vivo variants.
npx skillsauth add GPTomics/bioSkills bio-workflows-crispr-screen-pipelineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: MAGeCK 0.5.9+, BAGEL2 1.0.5+, drugZ Aug 2019+, JACKS 0.2.0+, Chronos 2.0+, CRISPRcleanR 3.0+ (R), Pertpy 0.6+, PRIDICT2, CRISPResso2 2.2.14+, MAGeCKFlute 2.0+, pandas 2.2+, numpy 1.26+, matplotlib 3.8+.
Before using code patterns, verify installed versions match. If versions differ:
mageck --version, BAGEL.py fc --help, drugz -h, CRISPResso --versionpip show pertpy mageck-vispr jacks chronos-cnpackageVersion('CRISPRcleanR'), packageVersion('MAGeCKFlute')If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Analyze my pooled or single-cell CRISPR screen end-to-end" -> Pick the screen design branch, run guide counting, audit six QC stages, apply copy-number and batch correction as needed, run the design-matched hit-calling method, and consolidate across methods for high-confidence hits.
Library Design ([[library-design]])
|
v
FASTQ Files -> mageck count -> count matrix
|
v
Six-Stage QC ([[screen-qc]])
|
+---------------------+---------------------+
| |
v v
Cancer cell line? Non-cancer?
Apply CN correction No CN correction needed
([[copy-number-correction]])
| |
+---------------------+---------------------+
v
Multi-batch? Apply batch covariate
([[batch-correction]])
|
v
Pick hit-calling method by design ([[hit-calling]])
|
+-----------+---------+---------+-----------+-----------+
| | | | | |
v v v v v v
2-cond Time Drug Essential Multi- Specialized
MAGeCK RRA MAGeCK drugZ BAGEL2 screen (PE/BE/SC/
MLE JACKS or in vivo/
Chronos combinat)
| | | | | |
+-----------+---------+---------+-----------+-----------+
v
Tier-based consensus
v
Orthogonal validation
Reference [[library-design]] for full library composition. Verify before sequencing:
=99% guides detected at >25 reads/guide
mageck count \
--list-seq library.csv \
--sample-label Plasmid,Day0,Veh_r1,Veh_r2,Drug_r1,Drug_r2 \
--fastq Plasmid.fq.gz Day0.fq.gz Veh_r1.fq.gz Veh_r2.fq.gz Drug_r1.fq.gz Drug_r2.fq.gz \
--norm-method median \
--output-prefix experiment \
--trim-5 CACCG
For Cas12a libraries (Inzolia, in4mer): see [[combinatorial-screens]]. For 10X single-cell direct capture: use cellranger-arc or pertpy-aware counting; see [[perturb-seq-analysis]].
import pandas as pd
import numpy as np
from sklearn.metrics import precision_recall_curve, auc
counts = pd.read_csv('experiment.count.txt', sep='\t', index_col=0)
genes = counts['Gene']
count_matrix = counts.drop('Gene', axis=1)
def gini(x):
x = np.sort(x[x > 0].astype(float))
if x.size == 0:
return np.nan
n = x.size
cumx = np.cumsum(x)
return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n
per_sample = pd.DataFrame({
'pct_zero': (count_matrix == 0).sum() / len(count_matrix) * 100,
'gini': count_matrix.apply(gini),
'reads_per_sgrna': count_matrix.sum() / len(count_matrix),
})
log_counts = np.log10(count_matrix + 1)
pearson = log_counts.corr()
print(per_sample)
print('Replicate Pearson:', pearson.values[pearson.values < 1].mean())
Hard gates from [[screen-qc]]:
If screening in a cancer cell line, apply CRISPRcleanR (unsupervised, no CN profile needed) or Chronos (joint with CN profile). Required to remove Aguirre 2016 / Munoz 2016 amplicon artifact.
library(CRISPRcleanR)
data(KY_Library_v1.0)
norm <- ccr.NormfoldChanges(read.table('experiment.count.txt', header=TRUE, sep='\t'),
min_reads = 30, EXPname = 'screen',
libraryAnnotation = KY_Library_v1.0)
gw_lfc <- ccr.logFCs2chromPos(norm$norm_fold_changes, KY_Library_v1.0)
cleaned <- ccr.GWclean(gw_lfc, display = TRUE, label = 'screen')
corrected_counts <- ccr.correctCounts(my_screen = norm, correction = cleaned,
outprefix = 'screen_cleanr',
libraryAnnotation = KY_Library_v1.0)
# Feed corrected counts into MAGeCK / BAGEL2 / drugZ downstream
For DepMap-scale panels with longitudinal data + matched CN, use Chronos. See [[copy-number-correction]].
For multi-batch screens, add batch as a covariate in MAGeCK MLE rather than pre-correcting with ComBat. See [[batch-correction]] for full decision tree.
mageck test \
--count-table experiment.count.txt \
--treatment-id Day14_r1,Day14_r2,Day14_r3 \
--control-id Day0 \
--norm-method median \
--output-prefix essentiality_rra
BAGEL.py fc -i experiment.count.txt -o foldchange.txt -c Day0 --min-reads 30
BAGEL.py bf -i foldchange.txt -o bayes_factor.txt -e CEGv2.txt -n NEGv1.txt \
-c Day14_r1,Day14_r2,Day14_r3 -k 1000
mageck mle --count-table experiment.count.txt --design-matrix design.txt \
--output-prefix timecourse_mle --norm-method median
python drugz.py \
-i experiment.count.txt \
-o drugz_output.txt \
-c Veh_r1,Veh_r2,Veh_r3 \
-x Drug_r1,Drug_r2,Drug_r3 \
-p 5
drugZ requires vehicle as control, not Day-0. See [[drugz-chemogenomic]].
python run_JACKS.py experiment.count.txt replicatemap.txt guidemap.txt \
--rep_hdr Replicate --sample_hdr Sample --ctrl_sample_hdr Control \
--sgrna_hdr sgRNA --gene_hdr Gene --outprefix jacks_out --apply_w_hp
import chronos
model = chronos.Chronos(sequence_map=sequence_map, guide_gene_map=guide_gene_map,
reads=counts_df, copy_number=cn_df)
model.train(n_steps=2000)
gene_effects = model.gene_effect()
DepMap quarterly standard; handles CN bias + screen quality + longitudinal jointly.
mageck = pd.read_csv('essentiality_rra.gene_summary.txt', sep='\t')[['id', 'neg|fdr']].rename(
columns={'id': 'gene', 'neg|fdr': 'mageck_neg_fdr'})
bagel = pd.read_csv('bayes_factor.txt', sep='\t')[['GENE', 'BF']].rename(
columns={'GENE': 'gene', 'BF': 'bagel_bf'})
drugz_df = pd.read_csv('drugz_output.txt', sep='\t')[['GENE', 'fdr_synth']].rename(
columns={'GENE': 'gene', 'fdr_synth': 'drugz_synth_fdr'})
merged = mageck.merge(bagel, on='gene', how='outer').merge(drugz_df, on='gene', how='outer')
merged['mageck_hit'] = merged['mageck_neg_fdr'] < 0.05
merged['bagel_hit'] = merged['bagel_bf'] > 6
merged['drugz_hit'] = merged['drugz_synth_fdr'] < 0.05
merged['tier'] = merged[['mageck_hit', 'bagel_hit', 'drugz_hit']].astype(int).sum(axis=1)
tier1 = merged[merged['tier'] >= 3]
tier2 = merged[merged['tier'] == 2]
| Screen design | Specialized workflow | |---------------|----------------------| | Single-cell Perturb-seq / CROP-seq / Multiome | [[perturb-seq-analysis]] -- Pertpy + Mixscape + SCEPTRE | | Combinatorial paralog (Cas12a Inzolia / Big Papi) | [[combinatorial-screens]] -- GI scoring; synthetic-lethal identification | | Base-editor variant-function (Hanna 2021 style) | [[base-editing-analysis]] + [[crispresso-editing]] | | Prime-editor variant installation | [[prime-editing-screens]] -- PRIDICT2 pegRNA design | | In vivo tumor / immune screens | [[in-vivo-screens]] -- focused library; per-animal meta-analysis |
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 8))
gene_summary = pd.read_csv('essentiality_rra.gene_summary.txt', sep='\t')
sig = gene_summary['neg|fdr'] < 0.05
ax.scatter(gene_summary.loc[~sig, 'neg|lfc'],
-np.log10(gene_summary.loc[~sig, 'neg|fdr'].clip(lower=1e-10)),
c='lightgray', alpha=0.5, s=10)
ax.scatter(gene_summary.loc[sig, 'neg|lfc'],
-np.log10(gene_summary.loc[sig, 'neg|fdr'].clip(lower=1e-10)),
c='red', alpha=0.7, s=18)
ax.axhline(-np.log10(0.05), ls='--', c='black', lw=0.5)
ax.set_xlabel('Log2 Fold Change')
ax.set_ylabel('-Log10(FDR)')
plt.savefig('volcano.png', dpi=150)
MAGeCKFlute R package provides one-shot FluteRRA / FluteMLE dashboards with KEGG/Reactome enrichment.
| File | Source step | Description | |------|-------------|-------------| | experiment.count.txt | mageck count | Raw count matrix | | experiment.countsummary.txt | mageck count | Per-sample Gini, mapping, % zero | | screen_cleanr_corrected_counts.txt | CRISPRcleanR | CN-corrected counts (cancer lines) | | essentiality_rra.gene_summary.txt | mageck test | Gene-level RRA scores | | bayes_factor.txt | BAGEL2 | Per-gene Bayes factors | | drugz_output.txt | drugZ | sumZ, normZ, per-direction FDR | | jacks_out_gene_JACKS_results.txt | JACKS | Gene effect + sgRNA efficacy | | tier_consensus.csv | Custom aggregation | Tier-1/2/3 hits across methods |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.