skills/genomics-bioinformatics/pydeseq2-differential-expression/SKILL.md
Bulk RNA-seq DE with PyDESeq2: load counts, normalize, fit negative binomial models, Wald test (BH-FDR), LFC shrinkage, volcano/MA plots. Use for two-group comparisons, multi-factor designs with batch correction, multiple contrasts.
npx skillsauth add jaechang-hits/sciagent-skills pydeseq2-differential-expressionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
PyDESeq2 is a Python reimplementation of the R DESeq2 package for differential gene expression analysis from bulk RNA-seq count data. It fits negative binomial generalized linear models per gene, estimates dispersion with empirical Bayes shrinkage, and performs Wald tests with Benjamini-Hochberg FDR correction. This skill covers the full pipeline from raw counts to publication-ready result tables and visualizations.
~batch + condition)pydeseq2>=0.4, pandas>=1.4, numpy>=1.23, scipy>=1.11, scikit-learn>=1.1, anndata>=0.8matplotlib, seaborn for visualizationpip install pydeseq2 matplotlib seaborn
Load the count matrix and metadata. PyDESeq2 expects counts as a samples x genes DataFrame with non-negative integers, and metadata as a samples x variables DataFrame with matching indices.
import pandas as pd
# Load data — typical CSV has genes as rows, samples as columns
counts_raw = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)
# Transpose if needed: PyDESeq2 requires samples x genes
if counts_raw.shape[0] > counts_raw.shape[1]:
counts_df = counts_raw.T # genes x samples → samples x genes
else:
counts_df = counts_raw
# Validate alignment
common_samples = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common_samples]
metadata = metadata.loc[common_samples]
print(f"Samples: {counts_df.shape[0]}, Genes: {counts_df.shape[1]}")
print(f"Metadata columns: {list(metadata.columns)}")
print(f"Condition counts:\n{metadata['condition'].value_counts()}")
Remove lowly expressed genes to improve statistical power and reduce multiple testing burden.
# Filter genes with total counts below threshold
min_total_counts = 10
gene_counts = counts_df.sum(axis=0)
genes_to_keep = gene_counts[gene_counts >= min_total_counts].index
counts_df = counts_df[genes_to_keep]
# Optional: require minimum counts in a minimum number of samples
min_count_per_sample = 5
min_samples = 3
genes_expressed = (counts_df >= min_count_per_sample).sum(axis=0) >= min_samples
counts_df = counts_df.loc[:, genes_expressed]
print(f"Genes after filtering: {counts_df.shape[1]}")
Create the DESeq dataset object, specify the design formula, and run the full pipeline (size factor estimation, dispersion estimation, model fitting).
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition", # Wilkinson-style formula
refit_cooks=True, # Refit after Cook's outlier removal
n_cpus=4 # Parallel threads
)
# Run: size factors → dispersions → trend → MAP shrinkage → LFC fitting
dds.deseq2()
# Inspect normalization
print(f"Size factors (first 5): {dds.obsm['size_factors'][:5]}")
print(f"Size factor range: {dds.obsm['size_factors'].min():.2f} - {dds.obsm['size_factors'].max():.2f}")
Perform Wald tests to identify differentially expressed genes. Specify the contrast as [variable, test_level, reference_level].
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"],
alpha=0.05, # FDR threshold
cooks_filter=True, # Filter Cook's outliers
independent_filter=True # Independent filtering for power
)
ds.summary()
# Access full results
results = ds.results_df
print(f"Total genes tested: {len(results)}")
print(f"Significant (padj < 0.05): {(results.padj < 0.05).sum()}")
Apply apeGLM shrinkage to reduce noise in log2 fold change estimates. Use shrunk values for visualization and ranking, not for significance calls.
# Apply shrinkage — modifies results_df.log2FoldChange in place
ds.lfc_shrink()
# Compare pre/post shrinkage effect
print(f"Max |LFC| after shrinkage: {results.log2FoldChange.abs().max():.2f}")
print(f"Genes with |LFC| > 2: {(results.log2FoldChange.abs() > 2).sum()}")
Filter significant genes and export results for downstream analysis.
import numpy as np
# Significance + effect size filter
significant = results[
(results.padj < 0.05) &
(results.log2FoldChange.abs() > 1.0)
].copy()
# Separate up/down-regulated
up = significant[significant.log2FoldChange > 0].sort_values("padj")
down = significant[significant.log2FoldChange < 0].sort_values("padj")
print(f"Upregulated: {len(up)}, Downregulated: {len(down)}")
# Export
results.to_csv("deseq2_all_results.csv")
significant.to_csv("deseq2_significant.csv")
up.to_csv("deseq2_upregulated.csv")
down.to_csv("deseq2_downregulated.csv")
print("Results exported to CSV files")
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(10, 7))
res = results.dropna(subset=["padj"]).copy()
res["-log10_padj"] = -np.log10(res.padj)
# Color categories
is_sig = (res.padj < 0.05) & (res.log2FoldChange.abs() > 1.0)
is_up = is_sig & (res.log2FoldChange > 0)
is_down = is_sig & (res.log2FoldChange < 0)
ax.scatter(res.loc[~is_sig, "log2FoldChange"], res.loc[~is_sig, "-log10_padj"],
c="grey", alpha=0.3, s=8, label="Not significant")
ax.scatter(res.loc[is_up, "log2FoldChange"], res.loc[is_up, "-log10_padj"],
c="firebrick", alpha=0.6, s=12, label=f"Up ({is_up.sum()})")
ax.scatter(res.loc[is_down, "log2FoldChange"], res.loc[is_down, "-log10_padj"],
c="steelblue", alpha=0.6, s=12, label=f"Down ({is_down.sum()})")
ax.axhline(-np.log10(0.05), ls="--", c="black", alpha=0.4)
ax.axvline(-1, ls="--", c="black", alpha=0.4)
ax.axvline(1, ls="--", c="black", alpha=0.4)
ax.set_xlabel("Log2 Fold Change")
ax.set_ylabel("-Log10(Adjusted P-value)")
ax.set_title("Volcano Plot — Treated vs Control")
ax.legend()
plt.tight_layout()
plt.savefig("volcano_plot.png", dpi=300)
print("Saved volcano_plot.png")
fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(np.log10(res.loc[~is_sig, "baseMean"] + 1),
res.loc[~is_sig, "log2FoldChange"],
c="grey", alpha=0.3, s=8)
ax.scatter(np.log10(res.loc[is_sig, "baseMean"] + 1),
res.loc[is_sig, "log2FoldChange"],
c="firebrick", alpha=0.6, s=12)
ax.axhline(0, ls="--", c="black", alpha=0.5)
ax.set_xlabel("Log10(Mean Normalized Count + 1)")
ax.set_ylabel("Log2 Fold Change")
ax.set_title("MA Plot")
plt.tight_layout()
plt.savefig("ma_plot.png", dpi=300)
print("Saved ma_plot.png")
| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| design | (required) | Wilkinson formula | Model formula; put covariates before variable of interest |
| contrast | None | [var, test, ref] | Which comparison to test; None uses last coefficient |
| alpha | 0.05 | 0.01–0.10 | FDR threshold for significance calling |
| refit_cooks | True | True/False | Refit model after removing Cook's distance outliers |
| cooks_filter | True | True/False | Apply Cook's distance filtering during testing |
| independent_filter | True | True/False | Independent filtering to optimize detection power |
| n_cpus | 1 | 1–N | Number of parallel threads for dispersion fitting |
| min_total_counts (user) | 10 | 5–50 | Gene filtering: minimum total reads across all samples |
| lfc_shrink() | off | call after summary() | apeGLM shrinkage; reduces noisy LFC estimates |
When comparing multiple treatment groups against a shared control.
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
contrasts = {
"A_vs_ctrl": ["condition", "treatment_A", "control"],
"B_vs_ctrl": ["condition", "treatment_B", "control"],
"C_vs_ctrl": ["condition", "treatment_C", "control"],
}
for name, contrast in contrasts.items():
ds = DeseqStats(dds, contrast=contrast, alpha=0.05)
ds.summary()
n_sig = (ds.results_df.padj < 0.05).sum()
ds.results_df.to_csv(f"results_{name}.csv")
print(f"{name}: {n_sig} significant genes")
When samples come from multiple batches or sequencing runs.
# Verify batch is not confounded with condition
print(pd.crosstab(metadata["batch"], metadata["condition"]))
dds = DeseqDataSet(
counts=counts_df, metadata=metadata,
design="~batch + condition" # batch first, then variable of interest
)
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
print(f"Significant genes (batch-corrected): {(ds.results_df.padj < 0.05).sum()}")
Diagnostic check — a healthy analysis shows a flat histogram with a spike near 0.
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# P-value histogram
axes[0].hist(ds.results_df.pvalue.dropna(), bins=50, edgecolor="black")
axes[0].set_xlabel("P-value")
axes[0].set_ylabel("Frequency")
axes[0].set_title("P-value Distribution")
# Dispersion plot
axes[1].scatter(
np.log10(dds.varm["_normed_means"] + 1),
np.log10(dds.varm["dispersions"]),
alpha=0.3, s=5
)
axes[1].set_xlabel("Log10(Mean Expression + 1)")
axes[1].set_ylabel("Log10(Dispersion)")
axes[1].set_title("Dispersion vs Mean")
plt.tight_layout()
plt.savefig("qc_diagnostics.png", dpi=300)
print("Saved qc_diagnostics.png")
For resuming analysis without re-running the expensive fitting step.
import pickle
# Save
with open("dds_fitted.pkl", "wb") as f:
pickle.dump(dds.to_picklable_anndata(), f)
print("Saved fitted DESeqDataSet")
# Reload
with open("dds_fitted.pkl", "rb") as f:
adata = pickle.load(f)
# Note: reload requires re-constructing DeseqDataSet from the AnnData
deseq2_all_results.csv — Full results table (baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) for all tested genesdeseq2_significant.csv — Filtered results (padj < 0.05 and |LFC| > 1)deseq2_upregulated.csv — Significant upregulated genes sorted by padjdeseq2_downregulated.csv — Significant downregulated genes sorted by padjvolcano_plot.png — Volcano plot with significance and fold change thresholdsma_plot.png — MA plot showing fold change vs mean expressionqc_diagnostics.png — P-value distribution and dispersion plot| Problem | Cause | Solution |
|---------|-------|----------|
| ValueError: Index mismatch | Sample names differ between counts and metadata | Use counts_df.index.intersection(metadata.index) to align |
| All genes have padj = NaN | Genes have zero variance or all zero counts | Apply stricter gene filtering (increase min_total_counts) |
| Design matrix is not full rank | Confounded variables (e.g., all treated in one batch) | Check with pd.crosstab(); simplify design or remove confounded variable |
| No significant genes found | Small effect size, high variability, or low sample size | Check p-value distribution; relax alpha or |LFC| threshold; inspect QC plots |
| MemoryError during fitting | Too many genes or very large dataset | Pre-filter more aggressively; reduce n_cpus; use machine with more RAM |
| Very large size factors (>5) | Extreme library size differences | Verify raw counts are unnormalized; check for contamination or failed libraries |
| Shrinkage produces unexpected LFCs | Calling lfc_shrink() before summary() | Always call ds.summary() first, then ds.lfc_shrink() |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.