chemoinformatics/virtual-screening/SKILL.md
Performs structure-based virtual screening using AutoDock Vina, SMINA, GNINA (CNN scoring), and DiffDock-L hybrid workflows with explicit choice rules across rigid vs flexible docking, cross-docking vs self-docking, binding-site detection (P2Rank, fpocket), receptor preparation (PDB2PQR, PROPKA), ligand preparation (meeko, OpenBabel), and ultralarge-library screening (ZINC22, Enamine REAL). Use when screening chemical libraries against a protein target to find candidate binders, ranking docking poses, or selecting a docking workflow for a specific scenario.
npx skillsauth add GPTomics/bioSkills bio-virtual-screeningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: AutoDock Vina 1.2.5+, SMINA 2020-12+, GNINA 1.1+, RDKit 2024.09+, meeko 0.5+, P2Rank 2.4+, ProDy 2.4+, pdb2pqr 3.6+.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesvina --version; gnina --version; smina --versionIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Screen chemical libraries against protein targets via molecular docking. Vina is the de-facto default, SMINA adds flexibility (Vinardo scoring, custom scoring), and GNINA adds CNN-based pose scoring (Top-1 redock 58%->73% over Vina, cross-dock 27%->37%). Deep-learning docking (DiffDock-L, EquiBind, NeuralPLexer) competes in pose accuracy but fails physical-plausibility tests (PoseBusters) ~50% of the time; the postdoc-grade workflow combines ML pose sampling with classical scoring and physical validation. For ultralarge libraries (>1M), library preparation, hierarchical filtering, and HPC orchestration become the limiting steps.
For pose physical-validity QC, see chemoinformatics/pose-validation. For ML-driven docking + rescoring, see chemoinformatics/ml-docking-rescoring. For covalent docking, see chemoinformatics/covalent-design. For affinity calculations (FEP), see chemoinformatics/free-energy-calculations.
| Tool | Scoring | Speed (sec/lig) | Best at | Fails when | |------|---------|-----------------|---------|------------| | AutoDock Vina 1.2 | Vina (empirical) | 5-30 | Default, well-validated | Cross-dock; cryptic pockets; metal centers | | SMINA | Vina + flexible + custom | 5-30 | Custom scoring; flexible side chains | Same Vina-scoring caveats | | Vinardo | Vinardo (improved Vina) | 5-30 | Better scoring than Vina on DUD-E | Limited adoption | | GNINA 1.1 | CNN (default) or Vina | 30-120 | Pose ranking, redocking | Slower; GPU recommended | | AutoDock 4 | AD4 + grid maps | 30-60 | Legacy reference | Slower than Vina | | DOCK 6/7 | DOCK + Amber | 60-300 | UCSF DOCK ecosystem | Steep learning curve | | Glide (Schrodinger) | GlideScore | 5-30 | Commercial SOTA | License cost | | GOLD (CCDC) | GOLDScore / ChemScore | 60-180 | Commercial; metal coordination | License cost | | FlexX (BioSolveIT) | FlexX | 30-60 | Fragment-based | License cost | | rDock | rDock | 30-60 | Open-source alternative | Slower than Vina | | DiffDock-L | Diffusion-generative | 5 (GPU) | Pose sampling for cross-docking | High PB-invalid rate (~50%); see ml-docking-rescoring | | EquiBind | Equivariant NN | <1 (GPU) | Single-shot pose | Lowest accuracy in PoseBuster benchmarks | | Boltz-2 + GNINA rescore | Foundation model + CNN | 10 (GPU) | Modern SOTA hybrid | High GPU; not all proteins |
Decision: For most screens, GNINA 1.1 with CNN scoring is the modern default (better than Vina on every benchmark; 30s/ligand on GPU). For >1M library scale, hierarchical Vina -> GNINA -> MM/GBSA. For cross-docking (predicted target structure or apo-holo), GNINA's CNN scoring transfers better than Vina.
| Scenario | Recommended workflow |
|----------|---------------------|
| Self-dock against known ligand pocket | GNINA gnina --cnn_scoring rescore |
| Cross-dock to apo or related-target structure | DiffDock-L pose + GNINA rescore + PoseBusters |
| Ultralarge library (10M+) | Vina hierarchical: pre-filter (Lipinski, PAINS) -> Vina dock -> top 1% GNINA rescore -> top 0.1% MM/GBSA |
| Cryptic pocket / induced fit | Ensemble docking (10 conformer snapshots) + AlphaFold3 or Boltz-1 holo prediction |
| Allosteric / undefined site | P2Rank for pocket detection -> ensemble dock all pockets |
| Metal-coordinated ligand | GOLD (commercial) or manually parameterize Vina metal scoring |
| Covalent inhibitor | See chemoinformatics/covalent-design: DOCKovalent, HCovDock |
| Fragment screen (<300 Da) | rDock or constrained Vina with seed atoms |
| Hit-to-lead refinement | Use co-crystal structure if available; MD-relaxed receptor; FEP for affinity |
Goal: Convert a protein PDB into a docking-ready format with correct protonation, missing atoms, and removed waters.
Approach: Strip ligands and waters -> fill missing atoms (PROPKA or pdbfixer) -> add hydrogens at pH 7.4 (PDB2PQR / Reduce) -> assign partial charges -> convert to PDBQT (Open Babel).
import subprocess
def prepare_receptor(pdb_in, pdbqt_out, pH=7.4, remove_het=True):
base = pdb_in.replace('.pdb', '')
fixed = f'{base}_fixed.pdb'
propka_pdb = f'{base}_propka.pdb'
if remove_het:
with open(pdb_in) as fin, open(f'{base}_clean.pdb', 'w') as fout:
for line in fin:
if line.startswith('HETATM') and 'HOH' in line:
continue
if line.startswith('HETATM'):
continue
fout.write(line)
subprocess.run(['pdb2pqr', '--ff=AMBER', f'--with-ph={pH}',
f'{base}_clean.pdb', propka_pdb], check=True)
subprocess.run(['obabel', propka_pdb, '-O', pdbqt_out,
'-xr', '--partialcharge', 'gasteiger'], check=True)
return pdbqt_out
Common pitfall: Forgetting to add hydrogens at protein pH (7.4) but using pH 7.0 ligand charges. Hist mistakenly protonated. Use PROPKA + manual review of catalytic residues.
Goal: Generate a 3D, docking-ready ligand file from SMILES with appropriate protonation and conformation.
Approach: Protonate at pH 7.4 with Chem.MolFromSmiles then rdMolStandardize.Uncharger -> embed 3D with ETKDGv3 -> minimize with MMFF94 -> assign Gasteiger charges -> write PDBQT with meeko.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize
from meeko import MoleculePreparation, PDBQTWriterLegacy
def prepare_ligand(smiles, pdbqt_out):
mol = Chem.MolFromSmiles(smiles)
uncharger = rdMolStandardize.Uncharger()
mol = uncharger.uncharge(mol)
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
AllChem.MMFFOptimizeMolecule(mol)
# meeko 0.5+ API: prepare() returns a list of MoleculeSetup objects;
# use PDBQTWriterLegacy.write_string() to materialize the PDBQT block.
mk_prep = MoleculePreparation()
setups = mk_prep.prepare(mol)
pdbqt_text, is_ok, err = PDBQTWriterLegacy.write_string(setups[0])
if not is_ok:
raise RuntimeError(f'meeko PDBQT export failed: {err}')
with open(pdbqt_out, 'w') as f:
f.write(pdbqt_text)
return pdbqt_out
meeko (AutoDock developers' tool) handles torsion tree creation, rotamer flagging, and PDBQT writing -- preferred over Open Babel's PDBQT writer. Note: meeko 0.5+ separated the writer (PDBQTWriterLegacy) from MoleculePreparation; older code using prep.write_pdbqt_file() is deprecated.
When the binding pocket is not known (apo target, novel allosteric site):
| Tool | Approach | Output | |------|----------|--------| | P2Rank (Krivak 2018) | ML on protein surface descriptors | Ranked pocket list with center coords | | fpocket (Le Guilloux 2009) | Voronoi tessellation | Pocket descriptor list | | DoGSiteScorer | Geometric + drugability | Pocket list with score | | AutoSite (Vina) | Affinity map clustering | Pocket centers | | AlphaFill | Co-fold known ligand | Plausible binding site |
prank predict -f receptor.pdb -o pockets/
P2Rank output <receptor>_predictions.csv lists pocket centers with scores. Pocket 1 (highest score) is typically the orthosteric site for known protein families.
# AutoDock Vina Python API requires Vina 1.2+; for Vina 1.1 use subprocess CLI:
# subprocess.run(['vina', '--receptor', ..., '--ligand', ..., '--center_x', ...], check=True)
from vina import Vina
def dock_single(receptor_pdbqt, ligand_pdbqt, center, box_size,
exhaustiveness=8, n_poses=10):
v = Vina(sf_name='vina')
v.set_receptor(receptor_pdbqt)
v.set_ligand_from_file(ligand_pdbqt)
v.compute_vina_maps(center=center, box_size=box_size)
v.dock(exhaustiveness=exhaustiveness, n_poses=n_poses)
return v.energies(), v.poses()
Exhaustiveness:
Vina pose RMSD lower bound rmsd_lb and upper bound rmsd_ub are NOT pose-vs-reference RMSDs; they are bounds on the conformational space sampled. Use external RMSD to a reference pose for accuracy QC.
gnina -r receptor.pdb -l ligand.sdf \
--autobox_ligand reference_ligand.sdf \
--cnn_scoring rescore \
-o poses.sdf.gz \
--num_modes 9 --exhaustiveness 8
--cnn_scoring:
rescore (default): Vina sampling + CNN scoring (best validated)refinement: CNN refines top Vina posemetrorefine: Metropolis-style CNN refinement (slower, ~10% better pose accuracy)none: Vina-only (use SMINA equivalently)--autobox_ligand: define box from reference ligand SDF/PDB. Otherwise specify --center_x/y/z + --size_x/y/z.
Critical: GNINA's CNN was trained on PDBbind 2019; performance on novel chemotypes outside this distribution is reduced. Validate with a known co-crystal redock if possible.
Goal: Screen 10M-compound library down to top-1k candidates for follow-up.
Approach: Three-stage filter:
Pseudo-code skeleton (orchestrator). Each helper function delegates to a dedicated skill: drug-likeness filter to chemoinformatics/admet-prediction, single-ligand Vina/GNINA to dock_single defined earlier in this skill, PoseBusters QC to chemoinformatics/pose-validation.
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
# Stub helpers to be implemented per project; see the cross-referenced skills.
def drug_like_filter(df):
raise NotImplementedError('Implement via chemoinformatics/admet-prediction (Lipinski+Veber+PAINS)')
def vina_dock(smi, receptor_pdbqt, center, box):
raise NotImplementedError('Wrap dock_single() above; return best affinity')
def gnina_rescore(smi, receptor_pdbqt, center, box):
raise NotImplementedError('Wrap gnina --cnn_scoring rescore subprocess call')
def pose_validate(df):
raise NotImplementedError('Implement via chemoinformatics/pose-validation (PoseBusters)')
def vs_pipeline(library_smi, receptor_pdbqt, center, box, output_dir, n_workers=16):
df = pd.read_csv(library_smi)
df_stage1 = drug_like_filter(df) # ~10x reduction
with ProcessPoolExecutor(max_workers=n_workers) as ex:
affinities = list(ex.map(
lambda smi: vina_dock(smi, receptor_pdbqt, center, box),
df_stage1['smiles']))
df_stage1['vina_affinity'] = affinities
df_stage2 = df_stage1.nsmallest(int(len(df_stage1) * 0.01), 'vina_affinity')
df_stage2['gnina_affinity'] = df_stage2['smiles'].apply(
lambda smi: gnina_rescore(smi, receptor_pdbqt, center, box))
df_stage3 = df_stage2.nsmallest(1000, 'gnina_affinity')
return pose_validate(df_stage3)
For 10M-compound libraries (ZINC22, Enamine REAL), use slurm-orchestrated parallel docking. Single-node GPU GNINA: ~3000 ligands/day. SLURM cluster (50 nodes): ~150k ligands/day.
| Library | Size (2024) | Format | Notes | |---------|-------------|--------|-------| | ZINC22 | 4.1B | 3D SMILES + xyz | Pre-built; 37GB tranches | | Enamine REAL | 29B | SMILES | Make-on-demand; via Enamine | | Enamine HTS | 4M | SMILES + SDF | Stock for HTS | | Mcule | 25M | SMILES | Real, in-stock | | ChEMBL | 2.5M | SMILES + bioactivity | Activity-labeled |
For ultralarge VS:
Lyu et al. 2019 ZINC15 ultralarge VS gold standard: 138M docked, top 96 hits picked, 13 nM-µM actives.
Trigger: Receptor structure not the holo (co-crystal with ligand from another binder).
Mechanism: Vina pose accuracy degrades 2-3x from self-dock (RMSD<=2A 70%) to cross-dock (RMSD<=2A 30%).
Symptom: Top-ranked pose makes no geometric sense; key contacts missing.
Fix: GNINA CNN scoring or ensemble docking. For genuine apo, predict holo with AlphaFold3 / Boltz-1 then dock.
Trigger: Ligand chemotype not in PDBbind training.
Mechanism: CNN scoring overfits to PDBbind chemotypes; novel macrocycle / peptide / PROTAC scores poorly.
Symptom: Affinity prediction far worse than Vina alone.
Fix: Use --cnn_scoring rescore (sampling still by Vina) rather than CNN sampling. Validate against co-crystal of close analog.
Trigger: Binding box defined tightly around small ligand reference.
Mechanism: Vina explores only within the box; large analogs cannot fit.
Symptom: Many ligands report "no valid pose"; chemotype-biased hits.
Fix: Add padding 5-10 A beyond reference ligand bounding box. For unknown ligand size, use 25x25x25 A.
Trigger: Protein has multiple binding sites (orthosteric + allosteric).
Mechanism: P2Rank or AutoBox picks the most "drugable" pocket; not always the desired one.
Symptom: Hits dock in wrong pocket; SAR confusing.
Fix: Verify pocket from co-crystal data; explicitly set center_x/y/z from known ligand centroid.
Trigger: Default DiffDock-L output for any receptor.
Mechanism: Diffusion generates poses without physical-validity loss; ~50% fail planarity / stereochemistry / vdW overlap tests.
Symptom: Poses look reasonable but fail PoseBusters checks.
Fix: Filter to PB-valid (PoseBusters); rescore with GNINA. See chemoinformatics/pose-validation.
Trigger: Ligand or receptor residues protonated incorrectly at pH 7.4.
Mechanism: Aspartate/glutamate/histidine protonation depends on local environment; default protonation may be wrong.
Symptom: Salt bridges missing; poses misranked.
Fix: Run PROPKA on receptor (cabinet residue pKas); for catalytic His, manually inspect rotamer state.
| Vina top pose | GNINA top pose | Action | |---------------|----------------|--------| | Same pose, similar score | Same pose, similar score | Trust pose; use GNINA score for ranking | | Vina top pose ≠ GNINA top pose | Same pocket, different orientation | Use GNINA (better trained for pose ranking) | | Vina excellent, GNINA mediocre | Different pose, very different score | GNINA found pose Vina missed; trust GNINA | | Both poor scores | Many ligands score similarly poor | Wrong pocket / protein conformation; reconsider receptor |
| Symptom | Cause | Fix |
|---------|-------|-----|
| Vina segfault | PDBQT corrupted (atom names) | Re-prep with meeko |
| GNINA hangs | GPU OOM | Reduce batch / -num_modes 5 |
| All affinities very poor (-3 to -5) | Wrong protonation; ligand too large for box | Re-check pKa; expand box |
| Identical affinity across ligands | Receptor grid not computed | Call v.compute_vina_maps() before dock |
| Pose poses make no sense | Receptor and ligand in different frames | Ensure same coordinate origin |
| AutoDock 4 needed | Vina cannot handle Zn / Fe metals | Use AutoDock 4 with metal scoring or GOLD |
| GPU mode slow | Vina is CPU-only; only GNINA is GPU | Use GNINA for GPU; Vina is multi-core CPU |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.