Version Compatibility

Reference examples tested with: MAGeCK 0.5+ (count + VISPR), MAGeCKFlute 2.0+ (R), pandas 2.2+, numpy 1.26+, scikit-learn 1.4+, matplotlib 3.8+, seaborn 0.13+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show mageck then mageck count --help; pip show mageckflute
R: packageVersion('MAGeCKFlute') then ?BatchRemove / ?FluteRRA

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

CRISPR Screen Quality Control

"Audit my CRISPR screen quality before hit calling" -> Assess library representation, replicate concordance, depth, drift, and biological signal recovery using DepMap-grade metrics, then decide whether the screen is usable, salvageable, or must be repeated.

Python: pandas + scikit-learn for Gini, AUC, PCA; MAGeCKFlute (R) for one-shot QC dashboard
CLI: mageck count --gini, --mapping_summary flags; MAGeCK-VISPR for interactive dashboard

QC Stage Hierarchy

A pooled screen has six distinct bottlenecks where complexity can collapse. Audit each:

| Stage | Metric | Acceptable threshold | Failure consequence | |-------|--------|----------------------|----------------------| | Plasmid pool | Gini, skew, % zero-count guides | Gini <0.1, skew <2, zero <0.5% | Missing guides cannot be screened; dropout indistinguishable from non-coverage | | Day-0 infection | Library coverage, MOI verification | ≥99% guide detection at 500x cells/sgRNA; MOI 0.3 | Founder effects; polyclonality with high MOI | | Selection (puro/blast) | % cells surviving, time-course Gini | 30-40% survival at 5-7 days; Gini drift <0.05 | Selection artifact; fast-growers enriched | | Endpoint | Replicate correlation, depth | Pearson >0.9 on log-counts (MAGeCK-VISPR), Spearman >0.7-0.8, ≥200 reads/sgRNA | Noise dominates; FDR inflates | | Biological signal | CEGv2 PR-AUC, NEGv1 false-positive rate | PR-AUC >0.7 at FDR 5% (DepMap "passing"); CEGv2 enrichment in top 1k | Screen lacks essentiality signal; hits not credible | | Copy-number artifact | Amplified-region enrichment, sgRNA-cut-count correlation | No correlation between sgRNA off-target count and depletion | False-positive essentiality at amplicons; ERBB2 in HER2+ etc. |

Each metric below quantifies one of these stages.

Library Representation Metrics

Goal: Detect dropout, oversaturation, and library bottlenecks at each sequencing stage.

Approach: Compute per-sample zero-count fraction, low-count fraction (<30 reads, Joung 2017 convention), and percentile-based skew, then track how these change between plasmid -> Day-0 -> endpoint to localize the bottleneck.

import pandas as pd
import numpy as np

def library_representation(counts_df):
    '''Per-sample library coverage diagnostics.
    counts_df: rows = sgRNAs, columns = samples (numeric counts).'''
    out = pd.DataFrame(index=counts_df.columns)
    out['n_sgrnas_detected'] = (counts_df > 0).sum()
    out['pct_zero'] = (counts_df == 0).sum() / len(counts_df) * 100
    out['pct_lowcount'] = (counts_df < 30).sum() / len(counts_df) * 100
    out['median_count'] = counts_df.median()
    out['p10_count'] = counts_df.quantile(0.10)
    out['p90_count'] = counts_df.quantile(0.90)
    out['skew_ratio'] = out['p90_count'] / out['p10_count'].replace(0, np.nan)
    return out

def stage_specific_thresholds():
    '''Joung 2017 + DepMap conventions per stage.'''
    return {
        'plasmid':  {'pct_zero_max': 0.5, 'skew_max': 2.0, 'gini_max': 0.10},
        'day_0':    {'pct_zero_max': 1.0, 'skew_max': 2.5, 'gini_max': 0.12},
        'endpoint': {'pct_zero_max': 5.0, 'skew_max': 10.0, 'gini_max': 0.30},
    }

Interpretation: Plasmid pool failing Gini <0.1 indicates synthesis or amplification bias; the screen is unfit for use. Endpoint Gini drifting above 0.30 indicates either heavy biological selection (acceptable for strong-phenotype drug screens) or a bottleneck (must be diagnosed). The Day-0 vs plasmid delta isolates whether the issue arose during infection (cloning is unlikely to lose specific guides between extraction and infection -- the change happens in cells).

Gini Coefficient

Goal: Quantify how unevenly reads are distributed across sgRNAs in a single sample.

Approach: Sort non-zero counts ascending, compute Gini via the cumulative-fraction formula. Compare against stage-specific thresholds.

def gini(x):
    '''Gini coefficient: 0 = perfect equality, 1 = maximal inequality.
    Uses non-zero counts only; zero-count sgRNAs handled separately by % zero.'''
    x = np.sort(x[x > 0].astype(float))
    if x.size == 0:
        return np.nan
    n = x.size
    cumx = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n

Stage-specific thresholds (Joung 2017 Nat Protoc; DepMap quality grades):

| Stage | Excellent | Acceptable | Concerning | Failure | |-------|-----------|------------|------------|---------| | Plasmid pool | <0.10 | <0.15 | 0.15-0.20 | >0.20 | | Day 0 (post-infection) | <0.12 | <0.18 | 0.18-0.25 | >0.25 | | Endpoint (post-selection) | <0.30 | <0.40 | 0.40-0.55 | >0.55 |

A Gini that climbs from 0.10 (plasmid) to 0.45 (endpoint) is expected when the screen exerts strong selection (drug, lethal-condition). A Gini that climbs to 0.45 without any biological selection (e.g., a control-vs-control timepoint comparison) indicates technical drift.

Replicate Concordance

Goal: Verify that biological/technical replicates agree before testing for between-condition differences.

Approach: Compute pairwise Pearson on log10(counts+1) (MAGeCK-VISPR convention) and Spearman ρ on raw rank, between every replicate pair within a condition. Flag any pair below 0.8 Pearson on log-scale.

def replicate_concordance(counts_df, condition_map):
    '''condition_map: {condition_name: [sample_col1, sample_col2, ...]}.'''
    log_counts = np.log10(counts_df + 1)
    rows = []
    for cond, samples in condition_map.items():
        if len(samples) < 2:
            continue
        for i in range(len(samples)):
            for j in range(i+1, len(samples)):
                r_pearson = log_counts[[samples[i], samples[j]]].corr().iloc[0, 1]
                r_spearman = counts_df[[samples[i], samples[j]]].corr(method='spearman').iloc[0, 1]
                rows.append({'condition': cond, 'rep1': samples[i], 'rep2': samples[j],
                             'pearson_log': r_pearson, 'spearman': r_spearman})
    return pd.DataFrame(rows)

Thresholds (MAGeCK-VISPR + DepMap):

| Metric | Excellent | Acceptable | Failure | |--------|-----------|------------|---------| | Pearson on log10(counts+1) | >0.95 | >0.85 | <0.80 | | Spearman on raw ranks | >0.85 | >0.70 | <0.60 |

When Pearson is high but Spearman is low, a few outlier sgRNAs are driving correlation (one extreme guide dominates). Inspect the scatterplot; typically caused by PCR jackpotting at a single guide. Hit calling should use a method that ranks (RRA, drugZ) rather than one that fits per-sgRNA fold change directly.

Essentialome Recovery (CEGv2 PR-AUC)

Goal: Verify the screen has detectable biological essentiality signal by checking whether known essentials (Hart 2017 CEGv2) drop out faster than known non-essentials (NEGv1).

Approach: Compute precision-recall AUC where positives are CEGv2 genes and negatives are NEGv1; the screen "passes" if PR-AUC >0.7 (Hart 2017 calibration).

from sklearn.metrics import precision_recall_curve, auc, roc_auc_score

def essentialome_recovery(gene_lfc_df, cegv2_set, negv1_set):
    '''gene_lfc_df: must have ["gene", "lfc"] columns (gene-level mean LFC, negative = depleted).
    cegv2_set, negv1_set: sets of gene symbols from Hart 2017.'''
    labeled = gene_lfc_df[gene_lfc_df['gene'].isin(cegv2_set | negv1_set)].copy()
    labeled['is_essential'] = labeled['gene'].isin(cegv2_set).astype(int)
    y_score = -labeled['lfc']  # negative LFC = depleted = more essential -> higher score
    precision, recall, _ = precision_recall_curve(labeled['is_essential'], y_score)
    return {
        'pr_auc': auc(recall, precision),
        'roc_auc': roc_auc_score(labeled['is_essential'], y_score),
        'n_essential_detected': labeled['is_essential'].sum(),
        'n_nonessential_detected': (1 - labeled['is_essential']).sum(),
    }

Source / threshold: Hart 2017 G3 7:2719 defines CEGv2 (~684 core essentials) and NEGv1 (~927 non-essentials); both lists at https://github.com/hart-lab/bagel/blob/master/CEGv2.txt and NEGv1.txt. DepMap convention: PR-AUC >0.7 at FDR 5% is the "passing" threshold; <0.5 means the screen has no essentiality signal and is not interpretable.

When PR-AUC is low despite good Gini and Pearson: cause is usually one of (a) Cas9 was not selected for before screen start (lots of Cas9-negative cells in the pool diluting signal), (b) puromycin selection truncated too aggressively (over-bottleneck), (c) the timepoint is too early (need 14-21 days for KO + decay + selection to manifest). Each has a different remediation.

Copy-Number Amplicon Bias Diagnostic

Goal: Detect the Aguirre 2016 / Munoz 2016 copy-number artifact where sgRNAs targeting amplified loci appear "essential" purely from DNA-damage burden.

Approach: Bin genes by copy number (if known from matched WGS/SNP-array) and check whether mean LFC correlates with CN. Alternatively, count off-target cut sites per sgRNA and check correlation with depletion -- amplified loci share many identical cut sites.

def cn_bias_diagnostic(gene_lfc_df, cn_df):
    '''cn_df: per-gene copy number (from WGS/SNP-array/matched ASCAT).
    Tests whether amplified genes show systematically lower LFC.'''
    merged = gene_lfc_df.merge(cn_df, on='gene')
    bins = pd.qcut(merged['copy_number'], q=5, duplicates='drop')
    bin_lfc = merged.groupby(bins, observed=True)['lfc'].agg(['mean', 'median', 'std', 'count'])
    from scipy.stats import spearmanr
    rho, p = spearmanr(merged['copy_number'], merged['lfc'])
    return {'cn_vs_lfc_rho': rho, 'cn_vs_lfc_p': p,
            'amplified_mean_lfc': merged[merged['copy_number'] > 4]['lfc'].mean(),
            'diploid_mean_lfc': merged[(merged['copy_number'] >= 1.5) & (merged['copy_number'] <= 2.5)]['lfc'].mean(),
            'per_bin': bin_lfc}

Interpretation: A Spearman ρ < -0.1 between copy number and LFC indicates copy-number artifact. The diagnostic threshold is conservative -- Aguirre 2016 showed that even 4-copy amplifications generate detectable artifact. Remediation: use CRISPRcleanR, CERES, or Chronos (see [[copy-number-correction]]) before hit calling.

Sequencing Depth Audit

Goal: Verify that sequencing depth is sufficient to resolve fold changes at the smallest interesting effect size.

Approach: Compute reads/sgRNA per sample and the coefficient of variation (CV) of total reads across samples. Compare against the 200x minimum (Joung 2017) or 300x DepMap convention.

def depth_audit(counts_df):
    '''Verify depth: 200x per sgRNA per sample is Joung 2017 minimum;
    300x is DepMap convention; 500x for low-confidence-hit recovery.'''
    total = counts_df.sum()
    n_sgrnas = len(counts_df)
    depth = total / n_sgrnas
    cv = total.std() / total.mean()
    return pd.DataFrame({'total_reads': total, 'reads_per_sgrna': depth,
                          'depth_grade': np.where(depth < 200, 'FAIL',
                                          np.where(depth < 300, 'CAUTION',
                                          np.where(depth < 500, 'OK', 'EXCELLENT')))}).assign(across_sample_cv=cv)

CV interpretation: CV >0.5 across samples in total reads indicates demultiplexing imbalance or library-pooling error; even if individual samples pass depth thresholds, the relative count is then biased.

MOI Verification

Goal: Confirm that infection occurred at MOI 0.3-0.5 so that ≤1 sgRNA/cell predominates.

Approach: From titration plate (control wells with serial-diluted virus), compute infection efficiency, then verify by qPCR of integrated proviral copy number in the screen pool.

| MOI | P(≥1 sgRNA/cell) | P(≥2 sgRNAs/cell) | Cells with 2+ guides as fraction of infected | |-----|------------------|--------------------|-----------------------------------------------| | 0.3 | 26% | 4% | 14% | | 0.5 | 39% | 9% | 23% | | 1.0 | 63% | 26% | 41% |

Decision rule: Always titrate to 0.3. At 0.5, 14-23% of "perturbed" cells carry combinatorial perturbations that confound single-gene scoring. The Poisson math is non-negotiable -- there is no analytical correction for high-MOI confounding.

PCA and Batch Effect Detection

Goal: Visualize whether samples cluster by biology or by batch.

Approach: PCA on log10(counts+1); samples should cluster by condition, not by batch/replicate-day/library-lot.

from sklearn.decomposition import PCA

def screen_pca(counts_df, metadata_df, condition_col='condition'):
    '''metadata_df: rows = samples, columns include condition_col, batch (optional).'''
    log_counts = np.log10(counts_df + 1).T  # samples as rows for PCA
    pca = PCA(n_components=3)
    pcs = pca.fit_transform(log_counts)
    out = pd.DataFrame(pcs, columns=['PC1', 'PC2', 'PC3'], index=counts_df.columns)
    out = out.join(metadata_df)
    return out, pca.explained_variance_ratio_

Interpretation: If PC1 separates batches, see [[batch-correction]]. If PC1 separates conditions cleanly, the screen has interpretable biology. If neither separates anything, the screen has no signal (failed) or is dominated by technical noise.

Composite DepMap-Style Quality Score

Goal: Generate a single quality grade combining all metrics for pipeline gating.

Approach: Z-normalize each metric against DepMap's distribution (Pacini 2024) and aggregate to a "screen quality score." Screens scoring <-1 SD are typically excluded from DepMap.

def composite_qc_score(per_sample_qc):
    '''per_sample_qc: one row per sample with columns from library_representation()
    plus pearson, spearman, pr_auc, depth.'''
    metrics = {
        'gini_inv': 1 - per_sample_qc['gini'],
        'pearson': per_sample_qc['pearson_min_replicate'],
        'pr_auc': per_sample_qc['pr_auc'],
        'depth_log': np.log10(per_sample_qc['reads_per_sgrna']),
        'detected_frac': per_sample_qc['n_sgrnas_detected'] / per_sample_qc['n_sgrnas_total'],
    }
    return pd.DataFrame(metrics).mean(axis=1)

This is a pipeline gate, not a publication metric -- DepMap reports separate scores for gene effect score quality (Chronos-derived) and screen quality (Gini-based). See Pacini 2024 Nat Commun for the canonical implementation.

Failure Modes

High Gini in plasmid pool despite passing all design rules

Trigger: Library was cloned and amplified through too many PCR cycles (>20) or used a high-GC-bias polymerase. Mechanism: Each PCR cycle compounds GC bias by ~5%; high-GC and low-GC guides become non-linear functions of starting abundance. Symptom: Gini >0.15 in plasmid, GC-content stratification of dropout. Fix: Cap PCR at 15 cycles for amplification; use Q5 / NEBNext Ultra II / KAPA HiFi (low-bias); re-sequence post-amp; if still bad, re-clone from glycerol stock.

Falling PR-AUC across timepoints despite stable Gini

Trigger: Cas9 was not selected for before screen start; Cas9-negative cells in the pool dilute essentiality signal. Mechanism: Each Cas9-negative cell carries a sgRNA but no editing; its sgRNA persists despite biological essentiality of the target. Symptom: PR-AUC declines from 0.7 at week 1 to 0.4 at week 3; Gini and Pearson both pass. Fix: Always select Cas9-positive cells (FACS or blast) before infection. For a salvage of an already-run screen, model Cas9-expression heterogeneity as a noise floor and accept reduced sensitivity.

Apparent essentiality of amplified loci

Trigger: Cancer cell line with focal amplification (ERBB2 in SK-BR-3, MYC in colorectal, FGFR1 in head and neck). Mechanism: Aguirre 2016 / Munoz 2016: many simultaneous Cas9 cuts trigger p53-dependent DNA-damage response and G2 arrest; sgRNAs at amplified loci appear depleted independently of target essentiality. Symptom: Hits include genes within known amplicons; sgRNAs with more genome-wide cut sites are more depleted. Fix: Apply CRISPRcleanR pre-hoc or use Chronos/CERES with matched CN profile (see [[copy-number-correction]]). Always required for cancer-cell-line screens, not optional.

Outlier replicate dragging Pearson down

Trigger: One technical replicate had a library-prep failure (low input, PCR jackpot, sequencing-lane swap). Mechanism: Outlier sample has different total reads or different per-sgRNA distribution but passes individual sample QC. Symptom: Pearson between replicates 0.85-0.90 with one pair as outlier; condition-level means look fine. Fix: Drop the outlier replicate; re-derive Pearson on the remaining pair. If only two replicates and one is outlier, the condition lacks replication and must be re-run.

Low Day-0 coverage from high MOI

Trigger: Infection at MOI >0.5. Mechanism: Poisson: at MOI 0.5, 23% of infected cells carry multiple sgRNAs; the "single-perturbation" assumption underlying every analysis method is violated. Symptom: Apparent gene-gene interactions in single-gene screens; gene-level z-scores noisy; Pearson lower than expected for high-quality counts. Fix: No analytical correction. Re-titrate, re-infect at MOI 0.3, re-run screen.

CRISPRi/a screen with no signal on validated essentials

Trigger: Library targets wrong TSS (Ensembl canonical vs FANTOM5 highest-rank). Mechanism: dCas9-KRAB knockdown is maximal within ±100 bp of the actual Pol II loading site; canonical annotation can be off 1-10 kb. Symptom: RPS/RPL/EIF families dropping out as expected (these have clean canonical TSSs) but downstream genes failing; PR-AUC on broader CEGv2 panel drops. Fix: Re-design library against FANTOM5 highest-CAGE-peak TSS (Sanson 2018); for tissue-specific lines, use matched CAGE / GRO-seq.

Quantitative Thresholds

| Threshold | Value | Source / Rationale | |-----------|-------|--------------------| | Plasmid Gini | <0.10 | Joung 2017 Nat Protoc 12:828 | | Plasmid skew ratio (p90/p10) | <2 | Joung 2017 | | % zero-count sgRNAs (plasmid) | <0.5% | DepMap convention | | % zero-count sgRNAs (endpoint) | <5% | MAGeCK-VISPR; >5% impairs FDR | | Replicate Pearson on log10(counts+1) | >0.85 acceptable, >0.95 ideal | Li et al 2014 MAGeCK paper; MAGeCK-VISPR | | Replicate Spearman | >0.70 | MAGeCKFlute (Wang 2019 Nat Protoc) | | CEGv2 PR-AUC at FDR 5% | >0.70 passing; >0.85 high quality | Hart 2017 G3 7:2719 | | Reads per sgRNA per sample | ≥200 minimum, 300+ DepMap, 500+ low-effect screens | Joung 2017; Pacini 2024 DepMap | | Library coverage at infection | 500x cells/sgRNA | Joung 2017; DepMap | | In-vivo coverage | 200-1000x bottleneck-adjusted | Chen 2015 / Manguso 2017; see [[in-vivo-screens]] | | MOI at infection | 0.3 strict | Poisson: P(≥2)=4% at 0.3 vs 9% at 0.5 | | CN-bias Spearman ρ (LFC vs copy number) | abs(ρ) <0.10 | Aguirre 2016 Cancer Discov |

Common Errors

| Error / symptom | Cause | Solution | |-----------------|-------|----------| | Plasmid Gini >0.2 | PCR over-amplification | Re-sequence; cap PCR at 15 cycles | | Endpoint PR-AUC <0.5 | Cas9 not selected pre-screen | Select Cas9+; redo if mid-screen | | Pearson high, Spearman low | A few outlier sgRNAs dominate | Use RRA / rank-based hit calling | | Replicate Pearson <0.8 | Library-prep failure on one rep | Drop outlier; re-run if singleton | | % zero increases dramatically Day-0 -> endpoint | Selection bottleneck | Reduce selection pressure; or increase coverage | | Top hits include amplified-region genes | CN bias | CRISPRcleanR or Chronos | | MOI verification shows 0.6+ | Over-infected | Re-run at lower MOI; no rescue | | Detected fraction <90% in Day 0 | Coverage too low | Increase cells; expect drift |

References

Joung J et al. 2017. Nat Protoc 12:828. Genome-wide screen protocol; coverage and depth conventions.
Li W et al. 2014. Genome Biol 15:554. MAGeCK; original Pearson cutoffs.
Wang B et al. 2019. Nat Protoc 14:756. MAGeCKFlute; QC dashboard.
Hart T et al. 2017. G3 7:2719. CEGv2 / NEGv1 reference essentiality gene sets.
Aguirre AJ et al. 2016. Cancer Discov 6:914. Copy-number amplicon false-essentiality.
Munoz DM et al. 2016. Cancer Discov 6:900. Copy-number gene-independent toxicity.
Pacini C et al. 2024. Nat Commun 15:1230. DepMap screen-quality scoring.
Meyers RM et al. 2017. Nat Genet 49:1779. CERES; mechanism of CN bias.
Sanson KR et al. 2018. Nat Commun 9:5416. Dolcetto/Calabrese TSS rules.
Dempster JM et al. 2021. Genome Biol 22:343. Chronos screen-quality model.

Related Skills

crispr-screens/library-design - Compose libraries that pass plasmid QC
crispr-screens/mageck-analysis - Run MAGeCK count to generate QC inputs
crispr-screens/copy-number-correction - Remediate Aguirre / Munoz CN artifact
crispr-screens/batch-correction - Address inter-batch / cell-line confounding
crispr-screens/hit-calling - Pick method by QC grade
crispr-screens/in-vivo-screens - In-vivo-specific bottleneck QC

Version Compatibility

Reference examples tested with: MAGeCK 0.5+ (count + VISPR), MAGeCKFlute 2.0+ (R), pandas 2.2+, numpy 1.26+, scikit-learn 1.4+, matplotlib 3.8+, seaborn 0.13+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show mageck then mageck count --help; pip show mageckflute
R: packageVersion('MAGeCKFlute') then ?BatchRemove / ?FluteRRA

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

CRISPR Screen Quality Control

Python: pandas + scikit-learn for Gini, AUC, PCA; MAGeCKFlute (R) for one-shot QC dashboard
CLI: mageck count --gini, --mapping_summary flags; MAGeCK-VISPR for interactive dashboard

QC Stage Hierarchy

A pooled screen has six distinct bottlenecks where complexity can collapse. Audit each:

Each metric below quantifies one of these stages.

Library Representation Metrics

Goal: Detect dropout, oversaturation, and library bottlenecks at each sequencing stage.

import pandas as pd
import numpy as np

def library_representation(counts_df):
    '''Per-sample library coverage diagnostics.
    counts_df: rows = sgRNAs, columns = samples (numeric counts).'''
    out = pd.DataFrame(index=counts_df.columns)
    out['n_sgrnas_detected'] = (counts_df > 0).sum()
    out['pct_zero'] = (counts_df == 0).sum() / len(counts_df) * 100
    out['pct_lowcount'] = (counts_df < 30).sum() / len(counts_df) * 100
    out['median_count'] = counts_df.median()
    out['p10_count'] = counts_df.quantile(0.10)
    out['p90_count'] = counts_df.quantile(0.90)
    out['skew_ratio'] = out['p90_count'] / out['p10_count'].replace(0, np.nan)
    return out

def stage_specific_thresholds():
    '''Joung 2017 + DepMap conventions per stage.'''
    return {
        'plasmid':  {'pct_zero_max': 0.5, 'skew_max': 2.0, 'gini_max': 0.10},
        'day_0':    {'pct_zero_max': 1.0, 'skew_max': 2.5, 'gini_max': 0.12},
        'endpoint': {'pct_zero_max': 5.0, 'skew_max': 10.0, 'gini_max': 0.30},
    }

Gini Coefficient

Goal: Quantify how unevenly reads are distributed across sgRNAs in a single sample.

Approach: Sort non-zero counts ascending, compute Gini via the cumulative-fraction formula. Compare against stage-specific thresholds.

def gini(x):
    '''Gini coefficient: 0 = perfect equality, 1 = maximal inequality.
    Uses non-zero counts only; zero-count sgRNAs handled separately by % zero.'''
    x = np.sort(x[x > 0].astype(float))
    if x.size == 0:
        return np.nan
    n = x.size
    cumx = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n

Stage-specific thresholds (Joung 2017 Nat Protoc; DepMap quality grades):

Replicate Concordance

Goal: Verify that biological/technical replicates agree before testing for between-condition differences.

def replicate_concordance(counts_df, condition_map):
    '''condition_map: {condition_name: [sample_col1, sample_col2, ...]}.'''
    log_counts = np.log10(counts_df + 1)
    rows = []
    for cond, samples in condition_map.items():
        if len(samples) < 2:
            continue
        for i in range(len(samples)):
            for j in range(i+1, len(samples)):
                r_pearson = log_counts[[samples[i], samples[j]]].corr().iloc[0, 1]
                r_spearman = counts_df[[samples[i], samples[j]]].corr(method='spearman').iloc[0, 1]
                rows.append({'condition': cond, 'rep1': samples[i], 'rep2': samples[j],
                             'pearson_log': r_pearson, 'spearman': r_spearman})
    return pd.DataFrame(rows)

Thresholds (MAGeCK-VISPR + DepMap):

| Metric | Excellent | Acceptable | Failure | |--------|-----------|------------|---------| | Pearson on log10(counts+1) | >0.95 | >0.85 | <0.80 | | Spearman on raw ranks | >0.85 | >0.70 | <0.60 |

Essentialome Recovery (CEGv2 PR-AUC)

Goal: Verify the screen has detectable biological essentiality signal by checking whether known essentials (Hart 2017 CEGv2) drop out faster than known non-essentials (NEGv1).

Approach: Compute precision-recall AUC where positives are CEGv2 genes and negatives are NEGv1; the screen "passes" if PR-AUC >0.7 (Hart 2017 calibration).

from sklearn.metrics import precision_recall_curve, auc, roc_auc_score

def essentialome_recovery(gene_lfc_df, cegv2_set, negv1_set):
    '''gene_lfc_df: must have ["gene", "lfc"] columns (gene-level mean LFC, negative = depleted).
    cegv2_set, negv1_set: sets of gene symbols from Hart 2017.'''
    labeled = gene_lfc_df[gene_lfc_df['gene'].isin(cegv2_set | negv1_set)].copy()
    labeled['is_essential'] = labeled['gene'].isin(cegv2_set).astype(int)
    y_score = -labeled['lfc']  # negative LFC = depleted = more essential -> higher score
    precision, recall, _ = precision_recall_curve(labeled['is_essential'], y_score)
    return {
        'pr_auc': auc(recall, precision),
        'roc_auc': roc_auc_score(labeled['is_essential'], y_score),
        'n_essential_detected': labeled['is_essential'].sum(),
        'n_nonessential_detected': (1 - labeled['is_essential']).sum(),
    }

Copy-Number Amplicon Bias Diagnostic

Goal: Detect the Aguirre 2016 / Munoz 2016 copy-number artifact where sgRNAs targeting amplified loci appear "essential" purely from DNA-damage burden.

def cn_bias_diagnostic(gene_lfc_df, cn_df):
    '''cn_df: per-gene copy number (from WGS/SNP-array/matched ASCAT).
    Tests whether amplified genes show systematically lower LFC.'''
    merged = gene_lfc_df.merge(cn_df, on='gene')
    bins = pd.qcut(merged['copy_number'], q=5, duplicates='drop')
    bin_lfc = merged.groupby(bins, observed=True)['lfc'].agg(['mean', 'median', 'std', 'count'])
    from scipy.stats import spearmanr
    rho, p = spearmanr(merged['copy_number'], merged['lfc'])
    return {'cn_vs_lfc_rho': rho, 'cn_vs_lfc_p': p,
            'amplified_mean_lfc': merged[merged['copy_number'] > 4]['lfc'].mean(),
            'diploid_mean_lfc': merged[(merged['copy_number'] >= 1.5) & (merged['copy_number'] <= 2.5)]['lfc'].mean(),
            'per_bin': bin_lfc}

Sequencing Depth Audit

Goal: Verify that sequencing depth is sufficient to resolve fold changes at the smallest interesting effect size.

Approach: Compute reads/sgRNA per sample and the coefficient of variation (CV) of total reads across samples. Compare against the 200x minimum (Joung 2017) or 300x DepMap convention.

def depth_audit(counts_df):
    '''Verify depth: 200x per sgRNA per sample is Joung 2017 minimum;
    300x is DepMap convention; 500x for low-confidence-hit recovery.'''
    total = counts_df.sum()
    n_sgrnas = len(counts_df)
    depth = total / n_sgrnas
    cv = total.std() / total.mean()
    return pd.DataFrame({'total_reads': total, 'reads_per_sgrna': depth,
                          'depth_grade': np.where(depth < 200, 'FAIL',
                                          np.where(depth < 300, 'CAUTION',
                                          np.where(depth < 500, 'OK', 'EXCELLENT')))}).assign(across_sample_cv=cv)

MOI Verification

Goal: Confirm that infection occurred at MOI 0.3-0.5 so that ≤1 sgRNA/cell predominates.

Approach: From titration plate (control wells with serial-diluted virus), compute infection efficiency, then verify by qPCR of integrated proviral copy number in the screen pool.

PCA and Batch Effect Detection

Goal: Visualize whether samples cluster by biology or by batch.

Approach: PCA on log10(counts+1); samples should cluster by condition, not by batch/replicate-day/library-lot.

from sklearn.decomposition import PCA

def screen_pca(counts_df, metadata_df, condition_col='condition'):
    '''metadata_df: rows = samples, columns include condition_col, batch (optional).'''
    log_counts = np.log10(counts_df + 1).T  # samples as rows for PCA
    pca = PCA(n_components=3)
    pcs = pca.fit_transform(log_counts)
    out = pd.DataFrame(pcs, columns=['PC1', 'PC2', 'PC3'], index=counts_df.columns)
    out = out.join(metadata_df)
    return out, pca.explained_variance_ratio_

Composite DepMap-Style Quality Score

Goal: Generate a single quality grade combining all metrics for pipeline gating.

Approach: Z-normalize each metric against DepMap's distribution (Pacini 2024) and aggregate to a "screen quality score." Screens scoring <-1 SD are typically excluded from DepMap.

def composite_qc_score(per_sample_qc):
    '''per_sample_qc: one row per sample with columns from library_representation()
    plus pearson, spearman, pr_auc, depth.'''
    metrics = {
        'gini_inv': 1 - per_sample_qc['gini'],
        'pearson': per_sample_qc['pearson_min_replicate'],
        'pr_auc': per_sample_qc['pr_auc'],
        'depth_log': np.log10(per_sample_qc['reads_per_sgrna']),
        'detected_frac': per_sample_qc['n_sgrnas_detected'] / per_sample_qc['n_sgrnas_total'],
    }
    return pd.DataFrame(metrics).mean(axis=1)

Failure Modes

High Gini in plasmid pool despite passing all design rules

Falling PR-AUC across timepoints despite stable Gini

Apparent essentiality of amplified loci

Outlier replicate dragging Pearson down

Low Day-0 coverage from high MOI

CRISPRi/a screen with no signal on validated essentials

Quantitative Thresholds

Common Errors

References

Joung J et al. 2017. Nat Protoc 12:828. Genome-wide screen protocol; coverage and depth conventions.
Li W et al. 2014. Genome Biol 15:554. MAGeCK; original Pearson cutoffs.
Wang B et al. 2019. Nat Protoc 14:756. MAGeCKFlute; QC dashboard.
Hart T et al. 2017. G3 7:2719. CEGv2 / NEGv1 reference essentiality gene sets.
Aguirre AJ et al. 2016. Cancer Discov 6:914. Copy-number amplicon false-essentiality.
Munoz DM et al. 2016. Cancer Discov 6:900. Copy-number gene-independent toxicity.
Pacini C et al. 2024. Nat Commun 15:1230. DepMap screen-quality scoring.
Meyers RM et al. 2017. Nat Genet 49:1779. CERES; mechanism of CN bias.
Sanson KR et al. 2018. Nat Commun 9:5416. Dolcetto/Calabrese TSS rules.
Dempster JM et al. 2021. Genome Biol 22:343. Chronos screen-quality model.

Related Skills

crispr-screens/library-design - Compose libraries that pass plasmid QC
crispr-screens/mageck-analysis - Run MAGeCK count to generate QC inputs
crispr-screens/copy-number-correction - Remediate Aguirre / Munoz CN artifact
crispr-screens/batch-correction - Address inter-batch / cell-line confounding
crispr-screens/hit-calling - Pick method by QC grade
crispr-screens/in-vivo-screens - In-vivo-specific bottleneck QC

Adoption

GPTomics/bio-crispr-screens-screen-qc

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

CRISPR Screen Quality Control

QC Stage Hierarchy

Library Representation Metrics

Gini Coefficient

Replicate Concordance

Essentialome Recovery (CEGv2 PR-AUC)

Copy-Number Amplicon Bias Diagnostic

Sequencing Depth Audit

MOI Verification

PCA and Batch Effect Detection

Composite DepMap-Style Quality Score

Failure Modes

High Gini in plasmid pool despite passing all design rules

Falling PR-AUC across timepoints despite stable Gini

Apparent essentiality of amplified loci

Outlier replicate dragging Pearson down

Low Day-0 coverage from high MOI

CRISPRi/a screen with no signal on validated essentials

Quantitative Thresholds

Common Errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

GPTomics/bio-crispr-screens-screen-qc

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

CRISPR Screen Quality Control

QC Stage Hierarchy

Library Representation Metrics

Gini Coefficient

Replicate Concordance

Essentialome Recovery (CEGv2 PR-AUC)

Copy-Number Amplicon Bias Diagnostic

Sequencing Depth Audit

MOI Verification

PCA and Batch Effect Detection

Composite DepMap-Style Quality Score

Failure Modes

High Gini in plasmid pool despite passing all design rules

Falling PR-AUC across timepoints despite stable Gini

Apparent essentiality of amplified loci

Outlier replicate dragging Pearson down

Low Day-0 coverage from high MOI

CRISPRi/a screen with no signal on validated essentials

Quantitative Thresholds

Common Errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis