plugin/skills/tooluniverse-binder-discovery/SKILL.md
Discover novel small-molecule binders for protein targets using structure-based and ligand-based screening. Covers druggability assessment, known-ligand mining (ChEMBL, BindingDB), similarity expansion, ADMET filtering, and synthesis feasibility. Use for hit identification, virtual screening, target-to-compounds workflows, and lead-finding before commit-to-medchem.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-binder-discovery
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.
LOOK UP DON'T GUESS - Always retrieve actual data from tools before drawing conclusions. Do not assume druggability, binding sites, or compound properties based on target class alone.
KEY PRINCIPLES:
Before any tool call, reason about the target's structural biology:
Is the binding site a well-defined pocket (small molecule accessible) or a flat protein-protein interface (needs peptide/macrocycle)? This determines your screening strategy.
Use this reasoning to select phases and warn the user about challenges before executing a full workflow.
DO NOT show search process or tool outputs to the user. Instead:
Create the report file FIRST - Before any data collection:
- [TARGET]_binder_discovery_report.md
- [Researching...] placeholder in each section

Progressively update the report - As you gather data, update each section immediately.
Output separate data files:
- [TARGET]_candidate_compounds.csv - Prioritized compounds with SMILES, scores
- [TARGET]_bibliography.json - Literature references (optional)

Every piece of information MUST include its source:
Example: *Source: ChEMBL via ChEMBL_get_target_activities (CHEMBL203)*
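A small helper can keep attribution lines uniform across the report. This is only a sketch: `format_source` is a hypothetical name, not a ToolUniverse function, and it simply reproduces the layout of the example above.

```python
def format_source(database: str, tool: str, identifier: str) -> str:
    """Build an italic Markdown attribution line (hypothetical helper)."""
    return f"*Source: {database} via {tool} ({identifier})*"


# Reproduces the example given in the text.
line = format_source("ChEMBL", "ChEMBL_get_target_activities", "CHEMBL203")
print(line)
```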
Phases in order:
CRITICAL: Verify tool parameters with get_tool_info before calling unfamiliar tools.
tool_info = tu.tools.get_tool_info(tool_name="ChEMBL_get_target_activities")
Common parameter corrections (verify with get_tool_info if uncertain):
- OpenTargets_*: ensemblId (camelCase); ADMETAI_*: smiles must be a list
- NvidiaNIM_alphafold2 (requires NVIDIA_API_KEY env var; free key at build.nvidia.com): sequence not seq
- NvidiaNIM_genmol (requires NVIDIA_API_KEY env var; free key at build.nvidia.com): SMILES must contain [*{min-max}]
- NvidiaNIM_boltz2 (requires NVIDIA_API_KEY env var; free key at build.nvidia.com): polymers=[{"molecule_type": "protein", "sequence": "..."}]

Resolve all IDs upfront and store for downstream queries:
1. UniProt_search(query=target_name, organism="human") -> UniProt accession
2. MyGene_query_genes(q=gene_symbol, species="human") -> Ensembl gene ID
3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens") -> ChEMBL target ID
4. GtoPdb_search_targets(query=target_name) -> GtoPdb ID (if GPCR/channel/enzyme)
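One way to hold the resolved identifiers is a single record that later phases read from. The sketch below is illustrative only: `TargetIDs` and its field names are hypothetical, and the values are typed in by hand rather than taken from tool output. In a real run each field would be filled from the four lookups above.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class TargetIDs:
    """Hypothetical container for IDs resolved once and reused downstream."""
    gene_symbol: str
    uniprot: Optional[str] = None   # from UniProt_search
    ensembl: Optional[str] = None   # from MyGene_query_genes
    chembl: Optional[str] = None    # from ChEMBL_search_targets
    gtopdb: Optional[str] = None    # from GtoPdb_search_targets (may not apply)

    def missing(self) -> List[str]:
        """Names of identifiers that are still unresolved."""
        return [name for name, value in asdict(self).items() if value is None]


# EGFR identifiers entered manually for illustration.
egfr = TargetIDs(
    gene_symbol="EGFR",
    uniprot="P00533",
    ensembl="ENSG00000146648",
    chembl="CHEMBL203",
)
print(egfr.missing())
```

Checking `missing()` before moving on makes it obvious which downstream tools cannot be called yet.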
Use multi-source triangulation:
- OpenTargets_get_target_tractability_by_ensemblID(ensemblId) - tractability bucket
- DGIdb_get_gene_druggability(genes=[gene_symbol]) - druggability categories
- OpenTargets_get_target_classes_by_ensemblID(ensemblId) - target class
- GPCRdb_get_protein + GPCRdb_get_ligands + GPCRdb_get_structures
- TheraSAbDab_search_by_target(target=target_name)

Decision Point: If no tractability data and binding site reasoning suggests PPI or disordered region, explicitly warn the user before proceeding.
- ChEMBL_search_binding_sites(target_chembl_id)
- get_binding_affinity_by_pdb_id(pdb_id) for co-crystallized ligands
- InterPro_get_protein_domains(accession) for domain architecture

Requires NVIDIA_API_KEY. Two options:
- NvidiaNIM_alphafold2(sequence, algorithm="mmseqs2") - high accuracy, 5-15 min
- ESMFold_predict_structure(sequence) - fast (~30s), max 1024 AA

pLDDT guidance: >=90 very high confidence, 70-90 confident, <70 use with caution. Low pLDDT in the putative binding region undermines docking reliability.
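The pLDDT bands above can be turned into a quick check on the putative pocket. This is a minimal sketch: the function names are hypothetical, the per-residue scores are made up, and judging a pocket by its lowest-scoring residue is an assumption chosen for caution, not a rule from the tools.

```python
from typing import Dict, Iterable


def plddt_band(score: float) -> str:
    """Map a pLDDT value (0-100) to the confidence bands quoted in the text."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be within 0-100, got {score}")
    if score >= 90.0:
        return "very high confidence"
    if score >= 70.0:
        return "confident"
    return "use with caution"


def weakest_pocket_band(per_residue: Dict[int, float], pocket: Iterable[int]) -> str:
    """Band of the lowest-scoring pocket residue (assumed conservative criterion)."""
    return plddt_band(min(per_residue[i] for i in pocket))


# Made-up per-residue scores keyed by residue number.
scores = {10: 95.0, 11: 88.0, 12: 64.5}
print(weakest_pocket_band(scores, [10, 11, 12]))
```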
Priority order for bioactivity data:
- ChEMBL_get_target_activities - curated, SAR-ready
- BindingDB_get_ligands_by_uniprot - direct Ki/Kd with literature links
- GtoPdb_search_ligands - pharmacology focus (GPCRs, channels)
- PubChem_search_assays_by_target_gene - HTS screens, novel scaffolds
- OpenTargets_get_chemical_probes_by_target_ensemblID - validated probes

Key steps:
- BindingDB_get_targets_by_compound

Tools:
- PDB_search_similar_structures(query=uniprot, type="sequence") - find PDB entries
- get_protein_metadata_by_pdb_id(pdb_id) - resolution, method
- get_binding_affinity_by_pdb_id(pdb_id) - co-crystal ligand affinities
- get_ligand_smiles_by_chem_comp_id(chem_comp_id) - ligand SMILES from PDB
- EMDB_search_structures(query) - cryo-EM structures (prefer for GPCRs, ion channels)
- alphafold_get_prediction(qualifier) - AlphaFold DB fallback

If PDB + SDF available: use get_diffdock_info(protein=PDB, ligand=SDF, num_poses=10).
If only sequence + SMILES: use NvidiaNIM_boltz2(polymers=[...], ligands=[...]).
Dock a known reference inhibitor first to validate the binding pocket geometry before running candidates.
Use 3-5 diverse actives as seeds, similarity threshold 70-85%:
- ChEMBL_search_similar_molecules(molecule=SMILES, similarity=70)
- PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
- ChEMBL_search_substructure(smiles=core_scaffold)
- STITCH_get_chemical_protein_interactions(identifier=gene, species=9606)

GenMol - scaffold hopping with masked regions:
NvidiaNIM_genmol(smiles="...core...[*{3-8}]...tail...[*{1-3}]...", num_molecules=100, temperature=2.0, scoring="QED")
MolMIM - controlled analog generation:
NvidiaNIM_molmim(smi=reference_smiles, num_molecules=50, algorithm="CMA-ES")
Apply sequentially (all tools accept smiles=[list]):
- ADMETAI_predict_physicochemical_properties - Lipinski violations <= 1, QED > 0.3, MW 200-600
- ADMETAI_predict_bioavailability - oral bioavailability > 0.3
- ADMETAI_predict_toxicity - AMES < 0.5, hERG < 0.5, DILI < 0.5
- ADMETAI_predict_CYP_interactions - flag CYP3A4 inhibitors
- ChEMBL_search_compound_structural_alerts - no PAINS

Include a filter funnel summary in the report showing pass/fail counts at each stage.
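The sequential filters and the funnel summary might be wired together roughly as follows. This sketch assumes the predictions have already been collected into plain dictionaries. The key names (`lipinski_violations`, `qed`, `mw`, and so on) are hypothetical and do not reflect the actual ADMETAI output schema, and the three compounds are invented. CYP3A4 is left out of the funnel because the text says to flag it rather than filter on it.

```python
from typing import Callable, Dict, List, Tuple

Compound = Dict[str, float]

# Thresholds copied from the list above; key names are assumptions.
STAGES: List[Tuple[str, Callable[[Compound], bool]]] = [
    ("physicochemical",
     lambda c: c["lipinski_violations"] <= 1 and c["qed"] > 0.3 and 200 <= c["mw"] <= 600),
    ("bioavailability", lambda c: c["bioavailability"] > 0.3),
    ("toxicity", lambda c: c["ames"] < 0.5 and c["herg"] < 0.5 and c["dili"] < 0.5),
    ("structural_alerts", lambda c: not c["pains"]),
]


def run_funnel(compounds: List[Compound]) -> Tuple[List[Compound], List[Tuple[str, int]]]:
    """Apply each stage in order and record how many compounds survive it."""
    survivors = list(compounds)
    funnel = [("input", len(survivors))]
    for name, keep in STAGES:
        survivors = [c for c in survivors if keep(c)]
        funnel.append((name, len(survivors)))
    return survivors, funnel


# Invented example values.
library = [
    {"lipinski_violations": 0, "qed": 0.6, "mw": 350, "bioavailability": 0.7,
     "ames": 0.1, "herg": 0.2, "dili": 0.3, "pains": False},
    {"lipinski_violations": 2, "qed": 0.5, "mw": 640, "bioavailability": 0.6,
     "ames": 0.1, "herg": 0.1, "dili": 0.1, "pains": False},
    {"lipinski_violations": 0, "qed": 0.5, "mw": 420, "bioavailability": 0.8,
     "ames": 0.1, "herg": 0.7, "dili": 0.2, "pains": False},
]
passed, funnel = run_funnel(library)
for stage, count in funnel:
    print(f"{stage}: {count}")
```

The `funnel` list can be written into the report as-is to give the pass counts per stage.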
Composite score: docking confidence (40%) + ADMET score (30%) + similarity to known active (20%) + novelty (10%, not in ChEMBL + novel scaffold bonus).
Evidence tiers for candidates:
Deliver top 20 candidates with: Rank, ID, SMILES, docking score, ADMET score, overall score, source, evidence tier.
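The weighting and the top-20 cut can be sketched as below. The code assumes every component has already been normalized to the 0-1 range, which the text does not specify. The field names and the two candidates are hypothetical.

```python
from typing import Any, Dict, List

# Weights taken from the composite score definition above.
WEIGHTS = {"docking": 0.40, "admet": 0.30, "similarity": 0.20, "novelty": 0.10}


def composite_score(candidate: Dict[str, Any]) -> float:
    """Weighted sum of the four components (each assumed to lie in 0-1)."""
    return sum(weight * candidate[name] for name, weight in WEIGHTS.items())


def top_candidates(candidates: List[Dict[str, Any]], n: int = 20) -> List[Dict[str, Any]]:
    """Rank by composite score, highest first, and keep the first n."""
    ranked = sorted(candidates, key=composite_score, reverse=True)
    return [
        {"rank": i, "id": c["id"], "overall": round(composite_score(c), 3)}
        for i, c in enumerate(ranked[:n], start=1)
    ]


# Invented candidates for illustration.
pool = [
    {"id": "cand_B", "docking": 0.9, "admet": 0.4, "similarity": 0.9, "novelty": 0.0},
    {"id": "cand_A", "docking": 0.8, "admet": 0.7, "similarity": 0.6, "novelty": 1.0},
]
ranking = top_candidates(pool)
for row in ranking:
    print(row)
```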
- PubMed_search_articles(query="[TARGET] inhibitor SAR") - peer-reviewed
- EuropePMC_search_articles(query, source="PPR") - preprints (not peer-reviewed)
- openalex_search_works(query) - citation analysis

Target ID: ChEMBL_search_targets -> GtoPdb_search_targets -> "Not in databases"
Druggability: OpenTargets tractability -> DGIdb druggability -> target class proxy
Bioactivity: ChEMBL -> BindingDB -> GtoPdb -> PubChem BioAssay -> "No data"
Structure: PDB -> EMDB (membrane) -> alphafold_get_prediction -> NvidiaNIM_esmfold -> AlphaFold DB -> "None"
Similarity: ChEMBL similar -> PubChem similar -> "Search failed"
Docking: get_diffdock_info -> NvidiaNIM_boltz2 -> similarity-based scoring
Generation: NvidiaNIM_genmol -> NvidiaNIM_molmim -> similarity search only
Literature: PubMed -> EuropePMC (preprints) -> OpenAlex
GPCR data: GPCRdb_get_protein -> GtoPdb_search_targets
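Each chain above follows the same pattern: try sources in order and stop at the first one that returns data. A generic sketch of that pattern follows. `first_success` is a hypothetical helper, and the three lookups are stand-in stubs rather than real tool calls.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def first_success(
    chain: Sequence[Tuple[str, Callable[[], Any]]],
    default: str,
) -> Tuple[str, Optional[Any]]:
    """Return (source name, result) for the first call that yields non-empty data.

    Errors and empty results both fall through to the next source; if nothing
    works, the chain's terminal label (for example "No data") is returned.
    """
    for name, call in chain:
        try:
            result = call()
        except Exception:
            continue  # treat a failing source like an empty one
        if result:
            return name, result
    return default, None


# Stub lookups standing in for real tool calls.
def pdb_lookup() -> list:
    return []  # no experimental structure found


def emdb_lookup() -> list:
    raise RuntimeError("service unavailable")


def alphafold_lookup() -> dict:
    return {"model": "predicted"}


source, structure = first_success(
    [("PDB", pdb_lookup), ("EMDB", emdb_lookup), ("alphafold_get_prediction", alphafold_lookup)],
    default="None",
)
print(source)
```

Returning the source name alongside the result keeps the attribution requirement easy to satisfy in the report.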
When ToolUniverse tools return limited compound sets, access chemical databases directly:
import requests, pandas as pd
# PubChem batch property retrieval (up to 100 CIDs per call)
cids = "2244,5988,3672"
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cids}/property/MolecularWeight,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount/JSON"
props = pd.DataFrame(requests.get(url).json()["PropertyTable"]["Properties"])
# ChEMBL bioactivity bulk download for a target
target_id = "CHEMBL203" # EGFR
url = f"https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id={target_id}&pchembl_value__gte=5&limit=1000"
activities = requests.get(url).json()["activities"]
df = pd.DataFrame(activities)[["molecule_chembl_id", "canonical_smiles", "pchembl_value", "standard_type"]]
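# Lipinski Rule of 5 filtering (no RDKit needed)
# PubChem may return MolecularWeight as a string, so coerce it to a number before comparing
props["MolecularWeight"] = pd.to_numeric(props["MolecularWeight"], errors="coerce")
lipinski = props[(props["MolecularWeight"] <= 500) & (props["XLogP"] <= 5) &
                 (props["HBondDonorCount"] <= 5) & (props["HBondAcceptorCount"] <= 10)]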
# SDF download from PubChem (for docking input)
sdf_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cids}/SDF"
sdf_content = requests.get(sdf_url).text
See tooluniverse-data-wrangling skill for format cookbook and pagination patterns.
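For targets with more than 1000 qualifying activities, a single ChEMBL request like the one above will not return everything. The sketch below walks the result pages. It assumes the response carries a `page_meta.next` field holding a relative path to the next page, which should be checked against the live API. `iter_activities` and the injected `get_json` callable are hypothetical names, and the two-page response is faked so the example runs offline.

```python
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urljoin

BASE = "https://www.ebi.ac.uk"


def iter_activities(
    first_url: str,
    get_json: Callable[[str], Dict],
    max_pages: int = 50,
) -> Iterator[Dict]:
    """Yield activity records page by page until no next page is advertised."""
    url: Optional[str] = first_url
    pages = 0
    while url and pages < max_pages:
        page = get_json(url)
        yield from page.get("activities", [])
        next_path = page.get("page_meta", {}).get("next")  # assumed field name
        url = urljoin(BASE, next_path) if next_path else None
        pages += 1


# Fake two-page response so the sketch runs without network access.
FAKE = {
    BASE + "/chembl/api/data/activity.json?limit=2": {
        "activities": [{"molecule_chembl_id": "M1"}, {"molecule_chembl_id": "M2"}],
        "page_meta": {"next": "/chembl/api/data/activity.json?limit=2&offset=2"},
    },
    BASE + "/chembl/api/data/activity.json?limit=2&offset=2": {
        "activities": [{"molecule_chembl_id": "M3"}],
        "page_meta": {"next": None},
    },
}
records = list(iter_activities(BASE + "/chembl/api/data/activity.json?limit=2", FAKE.__getitem__))
print(len(records))
```

In real use, `get_json` could be `lambda u: requests.get(u, timeout=30).json()`.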
AlphaFold2: 5-15 min (async, max ~2000 AA). ESMFold: ~30 sec (max 1024 AA). DiffDock: ~1-2 min/ligand. Boltz2: ~2-5 min. GenMol/MolMIM: ~1-3 min.
Always check: import os; nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
For large expansions (>500 compounds): batch in chunks of 100, prioritize top candidates for docking.
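Chunking a large SMILES list into batches of 100 needs only a few lines. This is a minimal sketch: `chunked` is a hypothetical helper and the placeholder strings stand in for real SMILES.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 250 placeholder entries standing in for an expanded compound set.
smiles_list = [f"compound_{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(smiles_list)]
print(batch_sizes)
```

Each batch can then be passed to a tool that accepts `smiles=[list]`.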