skills/structural-biology-drug-discovery/zinc-database/SKILL.md
Query ZINC15/ZINC22 virtual compound libraries (1.4B compounds, 750M purchasable). Search lead/fragment/drug-like compounds by MW, logP, reactivity, or SMILES similarity; download 3D sets for docking. For bioactivity use chembl-database-bioactivity; for approved drugs use drugbank-database-access.
ZINC (ZINC Is Not Commercial) is a free database of commercially available compounds curated for virtual screening. ZINC22 contains over 1.4 billion compounds (ZINC20: 1.4B), including purchasable compounds with 3D conformers, organized by molecular property filters (lead-like, fragment-like, drug-like) and reactivity class. The REST API supports SMILES-based searches, property-filtered downloads, and compound subset exports for docking campaigns.
For bioactivity data use chembl-database-bioactivity; for approved drug structures use drugbank-database-access; for RDKit property calculation use rdkit-cheminformatics.
Python packages: requests, pandas
pip install requests pandas
import requests
# Search the ZINC15 REST API for drug-like compounds
BASE = "https://zinc15.docking.org"
r = requests.get(f"{BASE}/substances.json",
                 params={"mwt__gte": 250, "mwt__lte": 350,
                         "logp__gte": 0, "logp__lte": 3,
                         "availability": "for-sale", "count": 5})
r.raise_for_status()
compounds = r.json()
print(f"Returned {len(compounds)} compounds")
for c in compounds[:3]:
    print(f" ZINC: {c['zinc_id']:20s} MW: {c['mwt']:.1f} logP: {c['logp']:.2f} SMILES: {c['smiles'][:40]}")
Search ZINC15 by molecular property ranges (Lipinski, lead-like, fragment-like criteria).
import requests, pandas as pd
BASE = "https://zinc15.docking.org"
def zinc_search(params, max_results=500):
    """Search ZINC15 with property filters. Returns a DataFrame.

    Issues a single request, so at most 100 rows come back regardless of max_results.
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    params["count"] = min(100, max_results)
    r = requests.get(f"{BASE}/substances.json", params=params)
    r.raise_for_status()
    return pd.DataFrame(r.json())
# Lead-like set: MW 250-350, logP 1-3, HBD ≤ 3, HBA ≤ 7
df_leads = zinc_search({
    "mwt__gte": 250, "mwt__lte": 350,
    "logp__gte": 1, "logp__lte": 3,
    "hbd__lte": 3, "hba__lte": 7,
    "availability": "for-sale",
})
print(f"Lead-like compounds: {len(df_leads)}")
print(df_leads[["zinc_id", "mwt", "logp", "smiles"]].head())
# Fragment-like set: MW ≤ 300, logP ≤ 3, HBD ≤ 3 (a subset of the Rule of Three criteria)
df_frags = zinc_search({
    "mwt__lte": 300,
    "logp__lte": 3,
    "hbd__lte": 3,
    "availability": "for-sale",
})
print(f"\nFragment-like compounds: {len(df_frags)}")
print(df_frags[["zinc_id", "mwt", "logp", "smiles"]].head())
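Because zinc_search makes one request, it never returns more than a single page. The sketch below shows one way to gather several pages. It assumes the API accepts a `page` parameter alongside `count`, which this document does not confirm, so check the ZINC15 documentation before relying on it. `collect_pages` and `make_zinc_fetcher` are hypothetical helper names; the fetch step is passed in as a callable so the paging logic can be exercised without network access.

```python
def collect_pages(fetch_page, max_results=500, page_size=100):
    """Call fetch_page(page, count) until max_results rows are gathered
    or a short (or empty) page signals that nothing is left."""
    results = []
    page = 1
    while len(results) < max_results:
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:  # short page: no further results
            break
        page += 1
    return results[:max_results]


def make_zinc_fetcher(base_params, base_url="https://zinc15.docking.org"):
    """Build a fetch_page callable for ZINC15.

    ASSUMPTION: the 'page' query parameter is not documented in this skill.
    """
    import requests  # imported lazily so collect_pages has no hard dependency

    def fetch_page(page, count):
        params = dict(base_params, page=page, count=count)
        r = requests.get(f"{base_url}/substances.json", params=params, timeout=60)
        r.raise_for_status()
        return r.json()

    return fetch_page
```

A call would then look like `collect_pages(make_zinc_fetcher({"mwt__lte": 300, "availability": "for-sale"}), max_results=300)`. For very large sets, bulk tranche downloads are likely a better fit than paging.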
Fetch full compound data for a known ZINC identifier.
import requests
BASE = "https://zinc15.docking.org"
zinc_id = "ZINC000000029632"
r = requests.get(f"{BASE}/substances/{zinc_id}.json")
r.raise_for_status()
c = r.json()
print(f"ZINC ID : {c['zinc_id']}")
print(f"SMILES : {c['smiles']}")
print(f"MW : {c['mwt']:.2f}")
print(f"logP : {c['logp']:.2f}")
print(f"HBD : {c['hbd']}")
print(f"HBA : {c['hba']}")
print(f"TPSA : {c.get('tpsa', 'n/a')}")
print(f"Rotatable: {c.get('rotatable_bonds', 'n/a')}")
print(f"Suppliers: {len(c.get('suppliers', []))}")
ZINC organizes compounds into "tranches" by MW and logP. Download pre-built SDF/SMILES files.
import requests
# ZINC15 tranche download (MW 200-250, logP 1-2 range)
# Tranche naming: letters encode MW range (A-K) and logP range (A-J)
# See http://zinc15.docking.org/tranches/home
def download_zinc_tranche(tranche_name, dest_file, fmt="smi"):
    """Download a ZINC tranche file (SMILES by default)."""
    url = f"https://zinc15.docking.org/tranches/{tranche_name}.{fmt}"
    r = requests.get(url, stream=True)
    r.raise_for_status()
    # Stream to disk in chunks so large tranches are not held in memory
    with open(dest_file, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Downloaded {dest_file}")
# Download one tranche as SMILES
download_zinc_tranche("AABA", "zinc_AABA.smi", fmt="smi")
Find ZINC compounds similar to a query molecule.
import requests, pandas as pd
BASE = "https://zinc15.docking.org"
query_smiles = "c1ccc(NC(=O)c2ccccc2)cc1"  # benzanilide
r = requests.get(f"{BASE}/substances.json",
                 params={
                     "smiles": query_smiles,
                     "similarity": 0.6,  # Tanimoto similarity threshold
                     "count": 20,
                     "availability": "for-sale",
                 })
r.raise_for_status()
results = r.json()
print(f"Similar compounds found: {len(results)}")
df = pd.DataFrame(results)[["zinc_id", "smiles", "mwt", "logp"]]
print(df.head())
Retrieve purchasability and supplier catalog data for compounds.
import requests
BASE = "https://zinc15.docking.org"
# Check purchasability and catalog info
zinc_id = "ZINC000000029632"
r = requests.get(f"{BASE}/substances/{zinc_id}/suppliers.json")
r.raise_for_status()
suppliers = r.json()
print(f"Suppliers for {zinc_id}: {len(suppliers)}")
for sup in suppliers[:5]:
    print(f" {sup.get('name', 'n/a'):30s} | Catalog: {sup.get('catalognum', 'n/a')}")
For large-scale virtual screening, download entire ZINC subsets as compressed SMILES.
import requests, gzip, pandas as pd
# ZINC15 drug-like purchasable slice (public URL pattern)
# Full drug-like: https://zinc15.docking.org/substances/subsets/drug-like.smi.gz
def download_zinc_subset(subset_name, max_lines=1000):
    """Download a ZINC subset SMILES file and return a DataFrame sample."""
    url = f"https://zinc15.docking.org/substances/subsets/{subset_name}.smi.gz"
    r = requests.get(url, stream=True)
    r.raise_for_status()
    rows = []
    # Decompress the stream on the fly and stop after max_lines
    with gzip.open(r.raw, "rt") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            fields = line.split()
            if len(fields) >= 2:  # keep SMILES and ID; ignore extra columns and blank lines
                rows.append(fields[:2])
    return pd.DataFrame(rows, columns=["smiles", "zinc_id"])
# Load first 1000 from lead-like subset
df_sample = download_zinc_subset("lead-like", max_lines=1000)
print(f"Loaded {len(df_sample)} compounds from lead-like subset")
print(df_sample.head())
Compounds are organized into a 2D grid of "tranches" based on MW (rows A–K: <200 to >600 Da) and logP (columns A–J: <-1 to >5). Each tranche can be downloaded as a SMILES or SDF file. This tranching enables targeted downloads of specific property spaces for docking.
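The MW/logP grid described above can be enumerated programmatically when a docking campaign targets a rectangular block of property space. The sketch below is illustrative only: it assumes each grid cell is identified by an MW letter followed by a logP letter, and `tranche_prefixes` is a hypothetical helper. Full tranche names such as AABA carry additional letters that this document does not describe, so treat the output as grid-cell prefixes, not as complete file names.

```python
from itertools import product

MW_ROWS = "ABCDEFGHIJK"   # MW rows A-K, as described in this document
LOGP_COLS = "ABCDEFGHIJ"  # logP columns A-J, as described in this document


def _span(letters, first, last):
    """Return the inclusive slice of letters from first to last."""
    i, j = letters.index(first), letters.index(last)  # ValueError if outside the grid
    if i > j:
        raise ValueError(f"range {first}-{last} is inverted")
    return letters[i:j + 1]


def tranche_prefixes(mw_first, mw_last, logp_first, logp_last):
    """List two-letter MW/logP grid cells covering the requested block."""
    rows = _span(MW_ROWS, mw_first, mw_last)
    cols = _span(LOGP_COLS, logp_first, logp_last)
    return [mw + logp for mw, logp in product(rows, cols)]


print(tranche_prefixes("A", "B", "C", "D"))  # ['AC', 'AD', 'BC', 'BD']
```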
Goal: Curate a purchasable, lead-like compound library within specific property ranges, deduplicate, and export for docking.
import requests, pandas as pd
BASE = "https://zinc15.docking.org"
# Fetch lead-like purchasable compounds with Lipinski compliance
params = {
    "mwt__gte": 200, "mwt__lte": 500,
    "logp__gte": -1, "logp__lte": 5,
    "hbd__lte": 5, "hba__lte": 10,
    "rotatable_bonds__lte": 10,
    "availability": "for-sale",
    "count": 200,
}
r = requests.get(f"{BASE}/substances.json", params=params)
r.raise_for_status()
compounds = r.json()
df = pd.DataFrame(compounds)[["zinc_id", "smiles", "mwt", "logp", "hbd", "hba"]]
df = df.drop_duplicates(subset=["smiles"])
print(f"Curated library: {len(df)} unique compounds")
# Export as SMILES for docking input
df[["smiles", "zinc_id"]].to_csv("docking_library.smi", sep=" ", index=False, header=False)
print("Saved: docking_library.smi")
print(df.head())
Goal: Download fragment-like (Rule of Three) compounds for fragment-based drug discovery.
import requests, pandas as pd
BASE = "https://zinc15.docking.org"
# Rule of Three: MW ≤ 300, logP ≤ 3, HBD ≤ 3, HBA ≤ 3, RotB ≤ 3
params = {
    "mwt__lte": 300,
    "logp__lte": 3,
    "hbd__lte": 3,
    "hba__lte": 3,
    "rotatable_bonds__lte": 3,
    "availability": "for-sale",
    "count": 200,
}
r = requests.get(f"{BASE}/substances.json", params=params)
r.raise_for_status()
fragments = r.json()
df = pd.DataFrame(fragments)[["zinc_id", "smiles", "mwt", "logp"]]
print(f"Fragment library: {len(df)} compounds (Rule of Three)")
# Write SMILES first, then ID, matching the docking_library.smi export above
df[["smiles", "zinc_id"]].to_csv("fragment_library.smi", sep=" ", index=False, header=False)
print("Saved: fragment_library.smi")
print(df.describe())
| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| mwt__gte / mwt__lte | Search | — | numeric (Da) | Molecular weight lower/upper bound |
| logp__gte / logp__lte | Search | — | numeric | logP (lipophilicity) range |
| hbd__lte | Search | — | integer | Max hydrogen bond donors |
| hba__lte | Search | — | integer | Max hydrogen bond acceptors |
| rotatable_bonds__lte | Search | — | integer | Max rotatable bonds |
| availability | Search | all | "for-sale", "in-stock", "on-demand" | Purchasability filter |
| count | Search | 10 | 1–1000 | Max compounds returned per request |
| similarity | Similarity | — | 0.0–1.0 | Tanimoto similarity threshold |
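The parameters in the table can be assembled and checked before a request is sent, which catches inverted ranges and out-of-range counts early. The following is a minimal sketch, not part of any ZINC client library: `build_zinc_params` is a hypothetical helper, and the parameter names and allowed values are taken from the table above.

```python
ALLOWED_AVAILABILITY = {"for-sale", "in-stock", "on-demand"}


def build_zinc_params(mw=None, logp=None, hbd_max=None, hba_max=None,
                      rotb_max=None, availability=None, count=10):
    """Build a query-parameter dict from (low, high) ranges and upper limits."""
    params = {}
    for name, bounds in (("mwt", mw), ("logp", logp)):
        if bounds is None:
            continue
        low, high = bounds
        if low > high:
            raise ValueError(f"{name} range is inverted: {low} > {high}")
        params[f"{name}__gte"] = low
        params[f"{name}__lte"] = high
    for key, value in (("hbd__lte", hbd_max),
                       ("hba__lte", hba_max),
                       ("rotatable_bonds__lte", rotb_max)):
        if value is not None:
            params[key] = value
    if availability is not None:
        if availability not in ALLOWED_AVAILABILITY:
            raise ValueError(f"unknown availability: {availability!r}")
        params["availability"] = availability
    if not 1 <= count <= 1000:
        raise ValueError("count must be between 1 and 1000")
    params["count"] = count
    return params
```

The resulting dict can be passed straight to `requests.get(..., params=...)`, for example `build_zinc_params(mw=(250, 350), logp=(1, 3), hbd_max=3, availability="for-sale", count=100)`.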
Use tranches for large docking campaigns: Downloading entire MW/logP tranches as pre-built SDF files is faster than paginating the API. Use the ZINC tranches page to identify the subset of property space you need.
Apply reactivity filters: ZINC flags reactive compounds. For cell-based assays, exclude compounds with reactive groups by restricting searches to the "clean" reactivity class.
Deduplicate by SMILES: API results may contain duplicates across supplier catalog entries. Canonicalize SMILES with RDKit (Chem.MolToSmiles(Chem.MolFromSmiles(smi))) and drop duplicates before docking.
Combine with RDKit filtering: After downloading, apply additional filters (PAINS, Brenk alerts) using rdkit-cheminformatics or medchem before investing compute in docking.
Cache SMILES downloads: ZINC data is updated periodically. Cache downloads with a date-stamped filename and avoid re-downloading within a project.
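The caching advice above can be sketched with the standard library alone. `find_cached` is a hypothetical helper, and both the `name_YYYY-MM-DD.ext` naming scheme and the 30-day freshness window are assumptions chosen for illustration. It returns the newest date-stamped copy if it is still fresh; otherwise it returns the path a new download should be written to.

```python
from datetime import date, timedelta
from pathlib import Path


def find_cached(cache_dir, name, max_age_days=30, today=None):
    """Return (path, is_fresh) for a date-stamped cache file.

    Files are named '<stem>_<YYYY-MM-DD>.<ext>', e.g. lead-like_2024-05-10.smi.gz.
    """
    today = today or date.today()
    stem, dot, ext = name.partition(".")  # split at the first dot to keep '.smi.gz' whole
    cache_dir = Path(cache_dir)
    newest = None
    for path in cache_dir.glob(f"{stem}_*{dot}{ext}"):
        stamp = path.name[len(stem) + 1:len(stem) + 11]
        try:
            stamped = date.fromisoformat(stamp)
        except ValueError:
            continue  # not one of our date-stamped files
        if newest is None or stamped > newest[0]:
            newest = (stamped, path)
    if newest and today - newest[0] <= timedelta(days=max_age_days):
        return newest[1], True
    return cache_dir / f"{stem}_{today.isoformat()}{dot}{ext}", False
```

A download routine would call `find_cached("cache", "lead-like.smi.gz")`, reuse the file when `is_fresh` is true, and otherwise fetch the subset and write it to the returned path.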
When to use: Find the ZINC ID for a known compound to check purchasability.
import requests
BASE = "https://zinc15.docking.org"
smiles = "CC(=O)Nc1ccc(O)cc1" # paracetamol / acetaminophen
r = requests.get(f"{BASE}/substances.json",
                 params={"smiles": smiles, "count": 3})
r.raise_for_status()
for c in r.json():
    print(f"ZINC: {c['zinc_id']} | MW: {c['mwt']:.1f} | Availability: {c.get('availability')}")
When to use: Download 3D SDF conformers for a list of ZINC IDs for use in docking software.
import requests
BASE = "https://zinc15.docking.org"
zinc_ids = ["ZINC000000029632", "ZINC000001532592"]
for zid in zinc_ids:
    r = requests.get(f"{BASE}/substances/{zid}.sdf")
    if r.ok:
        with open(f"{zid}.sdf", "w") as f:
            f.write(r.text)
        print(f"Downloaded {zid}.sdf")
    else:
        print(f"Not available: {zid}")
When to use: Quickly assess the property coverage of a downloaded compound set.
import pandas as pd
df = pd.read_csv("docking_library.smi", sep=" ", names=["smiles", "zinc_id"])
print(f"Library size: {len(df)}")
# If you have the full ZINC metadata:
# df = pd.DataFrame(compounds)[["mwt", "logp", "hbd", "hba"]]
# print(df.describe())
# import matplotlib.pyplot as plt
# df[["mwt", "logp"]].hist(bins=30, figsize=(10, 4)); plt.show()
| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 404 for compound ID | ZINC ID format incorrect | Use full 12-digit ZINC ID (e.g., ZINC000000029632) |
| Empty results for property search | Filters too restrictive | Relax ranges; check that mwt__gte is not greater than mwt__lte |
| Similarity search returns nothing | SMILES invalid or unusual scaffold | Validate SMILES with RDKit first; try lower similarity threshold |
| Tranche file download fails | Tranche code wrong | Verify tranche naming at zinc15.docking.org/tranches/home |
| API returns HTML error page | Server maintenance | Retry after a few minutes; check ZINC status |
| Slow large downloads | Large compound sets | Download tranche files via FTP/HTTP bulk download instead of API pagination |
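The first row of the table, on malformed IDs, can be guarded against by normalizing identifiers before building URLs. This sketch handles only numeric IDs of the style shown in this document (the prefix ZINC followed by 12 digits); `normalize_zinc_id` is a hypothetical helper name.

```python
import re


def normalize_zinc_id(raw):
    """Zero-pad a numeric ZINC ID to the full 12-digit form.

    Accepts '29632', 'ZINC29632', or 'zinc000000029632'; raises ValueError otherwise.
    """
    match = re.fullmatch(r"(?:ZINC)?(\d{1,12})", raw.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a numeric ZINC ID: {raw!r}")
    return f"ZINC{int(match.group(1)):012d}"


print(normalize_zinc_id("ZINC29632"))  # ZINC000000029632
```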
rdkit-cheminformatics: Compute additional properties and apply PAINS filters on downloaded ZINC compounds
autodock-vina-docking: Use downloaded ZINC SMILES/SDF files for molecular docking campaigns
chembl-database-bioactivity: Bioactivity data for compounds identified in ZINC virtual screens
medchem: Apply medicinal chemistry filters (Lipinski, PAINS, NIBR) on ZINC libraries