skills/binder-discovery/SKILL.md
ToolUniverse workflow — Binder Discovery
npx skillsauth add lamm-mit/scienceclaw binder-discovery
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.
KEY PRINCIPLES:
DO NOT show search process or tool outputs to the user. Instead:
1. Create the report file FIRST - Before any data collection:
   - `[TARGET]_binder_discovery_report.md`
   - `[Researching...]` in each section
2. Progressively update the report as you gather data.
3. Output separate data files:
   - `[TARGET]_candidate_compounds.csv` - Prioritized compounds with SMILES, scores
   - `[TARGET]_bibliography.json` - Literature references (optional)

Every piece of information MUST include its source:
### 3.2 Known Inhibitors
| Compound | ChEMBL ID | IC50 (nM) | Selectivity | Source |
|----------|-----------|-----------|-------------|--------|
| Imatinib | CHEMBL941 | 38 | ABL-selective | ChEMBL |
| Dasatinib | CHEMBL1421 | 0.5 | Multi-kinase | ChEMBL |
*Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL1862)*
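The report-first rule can be sketched as a small helper that writes the skeleton before any tool is called. This is a minimal illustration, not part of ToolUniverse: the function name `create_report_skeleton` and the `out_dir` argument are hypothetical, and the section titles are assumed to follow the report template used by this skill.

```python
from pathlib import Path

# Section titles are an assumption taken from this skill's report template.
SECTIONS = [
    "Executive Summary",
    "1. Target Validation",
    "2. Known Ligand Landscape",
    "3. Structural Information",
    "4. Compound Expansion",
    "5. ADMET Filtering",
    "6. Candidate Prioritization",
    "7. Recommendations",
    "8. Data Gaps & Limitations",
    "9. Data Sources",
]


def create_report_skeleton(target: str, out_dir: str = ".") -> Path:
    """Write [TARGET]_binder_discovery_report.md with one placeholder per section."""
    path = Path(out_dir) / f"{target}_binder_discovery_report.md"
    lines = [f"# Small Molecule Binder Discovery: {target}", ""]
    for title in SECTIONS:
        lines += [f"## {title}", "[Researching...]", ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path
```

Each `[Researching...]` placeholder is then overwritten in place as the corresponding phase completes.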
Phase 0: Tool Verification (check parameter names)
↓
Phase 1: Target Validation
├─ 1.1 Resolve identifiers (UniProt, Ensembl, ChEMBL target ID)
├─ 1.2 Assess druggability/tractability
│ └─ 1.2.5 Check therapeutic antibodies (Thera-SAbDab) [NEW]
├─ 1.3 Identify binding sites
└─ 1.4 Predict structure (NvidiaNIM_alphafold2/esmfold)
↓
Phase 2: Known Ligand Mining
├─ Extract ChEMBL bioactivity data
├─ Get GtoPdb interactions
├─ Identify chemical probes
├─ BindingDB affinity data (NEW - Ki/IC50/Kd)
├─ PubChem BioAssay HTS data (NEW - screening hits)
└─ Analyze SAR from known actives
↓
Phase 3: Structure Analysis
├─ Get PDB structures with ligands
├─ Check EMDB for cryo-EM structures (NEW - for membrane targets)
├─ Analyze binding pocket
└─ Identify key interactions
↓
Phase 3.5: Docking Validation (NvidiaNIM_diffdock/boltz2) [NEW]
├─ Dock reference inhibitor
└─ Validate binding pocket geometry
↓
Phase 4: Compound Expansion
├─ 4.1-4.3 Similarity/substructure search
└─ 4.4 De novo generation (NvidiaNIM_genmol/molmim) [NEW]
↓
Phase 5: ADMET Filtering
├─ Predict physicochemical properties
├─ Predict ADMET endpoints
└─ Flag liabilities
↓
Phase 6: Candidate Docking & Prioritization
├─ Dock all candidates (NvidiaNIM_diffdock/boltz2) [UPDATED]
├─ Score by docking + ADMET + novelty
├─ Assess synthesis feasibility
└─ Generate final ranked list
↓
Phase 7: Report Synthesis
CRITICAL: Verify tool parameters before calling unfamiliar tools.
# Check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="ChEMBL_get_target_activities")
| Tool | WRONG Parameter | CORRECT Parameter |
|------|-----------------|-------------------|
| OpenTargets_get_target_tractability_by_ensemblID | ensembl_id | ensemblId |
| ChEMBL_get_target_activities | chembl_target_id | target_chembl_id |
| ChEMBL_search_similar_molecules | smiles | molecule (accepts SMILES, ChEMBL ID, or name) |
| alphafold_get_prediction | uniprot | accession |
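One way to guard against these mistakes is to rename known-wrong keys before a call is made. The sketch below only encodes the four corrections from the table; `PARAM_FIXES` and `normalize_params` are hypothetical names, not ToolUniverse functions.

```python
# Wrong -> correct parameter names, copied from the table above.
PARAM_FIXES = {
    "OpenTargets_get_target_tractability_by_ensemblID": {"ensembl_id": "ensemblId"},
    "ChEMBL_get_target_activities": {"chembl_target_id": "target_chembl_id"},
    "ChEMBL_search_similar_molecules": {"smiles": "molecule"},
    "alphafold_get_prediction": {"uniprot": "accession"},
}


def normalize_params(tool_name: str, params: dict) -> dict:
    """Return a copy of params with known-wrong keys renamed for tool_name."""
    fixes = PARAM_FIXES.get(tool_name, {})
    return {fixes.get(key, key): value for key, value in params.items()}
```

For example, `normalize_params("alphafold_get_prediction", {"uniprot": "P00533"})` returns `{"accession": "P00533"}`; parameters of tools not listed in the table pass through unchanged.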
1. UniProt_search(query=target_name, organism="human")
└─ Extract: UniProt accession, gene name, protein name
2. MyGene_query_genes(q=gene_symbol, species="human")
└─ Extract: Ensembl gene ID, NCBI gene ID
3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens")
└─ Extract: ChEMBL target ID, target type
4. GtoPdb_get_targets(query=target_name)
└─ Extract: GtoPdb target ID (if GPCR/ion channel/enzyme)
Store all IDs for downstream queries:
ids = {
    'uniprot': 'P00533',
    'ensembl': 'ENSG00000146648',
    'chembl_target': 'CHEMBL203',
    'gene_symbol': 'EGFR',
    'gtopdb': '1797'  # if available
}
Multi-Source Triangulation:
1. OpenTargets_get_target_tractability_by_ensemblID(ensemblId)
└─ Extract: Small molecule tractability score, bucket
2. DGIdb_get_gene_druggability(genes=[gene_symbol])
└─ Extract: Druggability categories, known drug count
3. OpenTargets_get_target_classes_by_ensemblID(ensemblId)
└─ Extract: Target class (kinase, GPCR, etc.)
4. GPCRdb_get_protein(protein=entry_name) # NEW - for GPCRs
└─ Extract: GPCR family, receptor state, ligand binding data
~35% of all approved drugs target GPCRs. For GPCR targets, use specialized data:
def check_if_gpcr_and_enrich(tu, target_name, uniprot_id):
    """Check if target is GPCR and get specialized data."""
    # Build GPCRdb entry name (e.g., "adrb2_human")
    entry_name = f"{target_name.lower()}_human"

    # Check if it's a GPCR
    gpcr_info = tu.tools.GPCRdb_get_protein(
        operation="get_protein",
        protein=entry_name
    )

    if gpcr_info.get('status') == 'success':
        # It's a GPCR - get specialized data

        # Get known structures (active/inactive states)
        structures = tu.tools.GPCRdb_get_structures(
            operation="get_structures",
            protein=entry_name
        )

        # Get known ligands
        ligands = tu.tools.GPCRdb_get_ligands(
            operation="get_ligands",
            protein=entry_name
        )

        # Get mutation data (important for SAR)
        mutations = tu.tools.GPCRdb_get_mutations(
            operation="get_mutations",
            protein=entry_name
        )

        return {
            'is_gpcr': True,
            'gpcr_family': gpcr_info['data'].get('family'),
            'gpcr_class': gpcr_info['data'].get('receptor_class'),
            'structures': structures['data'].get('structures', []),
            'ligands': ligands['data'].get('ligands', []),
            'mutation_data': mutations['data'].get('mutations', [])
        }

    return {'is_gpcr': False}
GPCRdb Advantages:
Druggability Scorecard:
| Factor | Assessment | Score |
|--------|------------|-------|
| Known small molecule drugs | Yes (3+) | ★★★ |
| Tractability bucket | 1-3 | ★★☆-★★★ |
| Target class | Enzyme/GPCR/Ion channel | ★★★ |
| Binding site known | Yes (X-ray) | ★★★ |
| GPCRdb ligands available | Yes (10+) | ★★★ (GPCR only) |
| Therapeutic antibodies exist | Check Thera-SAbDab | See 1.2.5 |
Decision Point: If druggability score < ★★☆, warn user about challenges.
Check if therapeutic antibodies already target this protein - important for:
def check_therapeutic_antibodies(tu, target_name):
    """
    Check Thera-SAbDab for therapeutic antibodies against target.
    """
    # Search by target name
    results = tu.tools.TheraSAbDab_search_by_target(
        target=target_name
    )

    if results.get('status') == 'success':
        antibodies = results['data'].get('therapeutics', [])

        # Categorize by clinical stage
        by_phase = {'Approved': [], 'Phase 3': [], 'Phase 2': [], 'Phase 1': [], 'Preclinical': []}
        for ab in antibodies:
            phase = ab.get('phase') or 'Unknown'
            for key in by_phase:
                if key.lower() in phase.lower():
                    by_phase[key].append(ab)
                    break

        return {
            'total_antibodies': len(antibodies),
            'by_phase': by_phase,
            'antibodies': antibodies[:10],  # Top 10
            'competitive_alert': len(by_phase['Approved']) > 0
        }
    return None


def get_antibody_landscape(tu, target_name, uniprot_id=None):
    """
    Comprehensive antibody competitive landscape.
    """
    # Thera-SAbDab search by target name; keep its hits instead of discarding them
    therasabdab = check_therapeutic_antibodies(tu, target_name)
    all_antibodies = []
    if therasabdab:
        all_antibodies.extend(therasabdab['antibodies'])

    # Also search by common synonyms (skip the UniProt ID when it is missing)
    synonyms = [target_name]
    if uniprot_id and uniprot_id != target_name:
        synonyms.append(uniprot_id)

    for synonym in synonyms:
        results = tu.tools.TheraSAbDab_search_therapeutics(query=synonym)
        if results.get('status') == 'success':
            all_antibodies.extend(results['data'].get('therapeutics', []))

    # Deduplicate by INN name
    seen = set()
    unique = []
    for ab in all_antibodies:
        inn = ab.get('inn_name')
        if inn and inn not in seen:
            seen.add(inn)
            unique.append(ab)

    return {
        'antibodies': unique,
        'count': len(unique),
        'has_approved': any((ab.get('phase') or '').lower() == 'approved' for ab in unique),
        'source': 'Thera-SAbDab'
    }
Report Output:
### 1.2.5 Therapeutic Antibody Landscape (NEW)
**Thera-SAbDab Search Results**:
| Antibody (INN) | Target | Format | Phase | PDB |
|----------------|--------|--------|-------|-----|
| Pembrolizumab | PD-1 | IgG4 | Approved | 5DK3 |
| Nivolumab | PD-1 | IgG4 | Approved | 5WT9 |
| Cemiplimab | PD-1 | IgG4 | Approved | 7WVM |
**Competitive Landscape**: ⚠️ 3 approved antibodies target this protein
**Strategic Implication**: Small molecule approach offers differentiation (oral dosing, CNS penetration, cost)
*Source: Thera-SAbDab via `TheraSAbDab_search_by_target`*
Why Include Antibody Landscape:
1. ChEMBL_search_binding_sites(target_chembl_id)
└─ Extract: Binding site names, types
2. get_binding_affinity_by_pdb_id(pdb_id) # For each PDB with ligand
└─ Extract: Kd, Ki, IC50 values for co-crystallized ligands
3. InterPro_get_protein_domains(uniprot_accession)
└─ Extract: Domain architecture, active sites
Output for Report:
### 1.3 Binding Site Assessment
**Known Binding Sites**:
| Site | Type | Evidence | Key Residues | Source |
|------|------|----------|--------------|--------|
| ATP pocket | Orthosteric | X-ray (23 PDBs) | K745, E762, M793 | PDB/ChEMBL |
| Allosteric pocket | Allosteric | X-ray (3 PDBs) | T790, C797 | PDB |
**Binding Site Druggability**: ★★★ (well-defined pocket, multiple co-crystal structures)
*Source: ChEMBL via `ChEMBL_search_binding_sites`, PDB structures*
Use structure prediction when no experimental structure is available, or for custom domain predictions.
Requires: NVIDIA_API_KEY environment variable
Option A: AlphaFold2 (High accuracy, async)
NvidiaNIM_alphafold2(
sequence=kinase_domain_sequence,
algorithm="mmseqs2",
relax_prediction=False
)
└─ Returns: PDB structure with pLDDT confidence scores
└─ Use when: Accuracy is critical, time is available (~5-15 min)
Option B: ESMFold (Fast, synchronous)
NvidiaNIM_esmfold(sequence=kinase_domain_sequence)
└─ Returns: PDB structure (max 1024 AA)
└─ Use when: Quick assessment needed (~30 sec)
Report pLDDT Confidence:
### 1.4 Structure Prediction Quality
**Method**: AlphaFold2 via NVIDIA NIM
**Mean pLDDT**: 90.94 (very high confidence)
| Confidence Level | Range | Fraction | Interpretation |
|------------------|-------|----------|----------------|
| Very High | ≥90 | 74.3% | Highly reliable |
| Confident | 70-90 | 16.0% | Reliable |
| Low | 50-70 | 9.0% | Use caution |
| Very Low | <50 | 0.7% | Unreliable |
**Key Binding Residue Confidence**:
| Residue | Function | pLDDT |
|---------|----------|-------|
| K745 | ATP binding | 90.0 |
| T790 | Gatekeeper | 92.3 |
| M793 | Hinge region | 95.3 |
| D855 | DFG motif | 89.5 |
*Source: NVIDIA NIM via `NvidiaNIM_alphafold2`*
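The confidence table in this report section can be computed from per-residue pLDDT values. The sketch below assumes the scores have already been extracted into a plain list (how they are extracted from the tool output is not covered here); the function names are hypothetical, and the band boundaries are the ones in the table above.

```python
from collections import Counter

BANDS = ("Very High", "Confident", "Low", "Very Low")


def plddt_band(score: float) -> str:
    """Map one pLDDT value to the confidence band used in the report table."""
    if score >= 90:
        return "Very High"
    if score >= 70:
        return "Confident"
    if score >= 50:
        return "Low"
    return "Very Low"


def summarize_plddt(scores):
    """Return the mean pLDDT and the fraction of residues in each band."""
    if not scores:
        raise ValueError("no pLDDT scores supplied")
    counts = Counter(plddt_band(s) for s in scores)
    total = len(scores)
    return {
        "mean": sum(scores) / total,
        "fractions": {band: counts.get(band, 0) / total for band in BANDS},
    }
```

The same `plddt_band` helper can be applied to the key binding residues individually, so that a low-confidence pocket is flagged even when the global mean is high.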
1. ChEMBL_get_target_activities(target_chembl_id, limit=500)
└─ Filter: standard_type in ["IC50", "Ki", "Kd", "EC50"]
└─ Filter: standard_value < 10000 nM
└─ Extract: ChEMBL molecule IDs, SMILES, potency values
2. ChEMBL_get_molecule(molecule_chembl_id) # For top actives
└─ Extract: Full molecular data, max_phase, oral flag
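The two filters in step 1 can be sketched as a plain function over activity records. `standard_type`, `standard_value`, and `standard_units` are ChEMBL activity field names, but the assumption that the tool returns a flat list of dicts with exactly these keys should be checked against the real output; `filter_activities` is a hypothetical helper name.

```python
POTENCY_TYPES = {"IC50", "Ki", "Kd", "EC50"}


def filter_activities(records, max_nm=10000.0):
    """Keep potency measurements below max_nm (nM), sorted most potent first."""
    kept = []
    for rec in records:
        if rec.get("standard_type") not in POTENCY_TYPES:
            continue
        if rec.get("standard_units") != "nM":
            continue  # skip values that are not directly comparable
        try:
            value = float(rec.get("standard_value"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric value
        if value < max_nm:
            kept.append({**rec, "standard_value": value})
    return sorted(kept, key=lambda r: r["standard_value"])
```

Restricting to records already expressed in nM is a deliberate simplification: it drops percent-inhibition and other non-potency readouts rather than attempting unit conversion.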
Activity Summary Table:
### 2.1 Known Active Compounds (ChEMBL)
**Total Bioactivity Points**: 2,847 (IC50: 1,234 | Ki: 892 | Kd: 456 | EC50: 265)
**Compounds with IC50 < 100 nM**: 156
**Approved Drugs for This Target**: 5
| Compound | ChEMBL ID | IC50 (nM) | Max Phase | SMILES (truncated) |
|----------|-----------|-----------|-----------|-------------------|
| Erlotinib | CHEMBL553 | 2 | 4 | COc1cc2ncnc(Nc3ccc... |
| Gefitinib | CHEMBL939 | 5 | 4 | COc1cc2ncnc(Nc3ccc... |
| [Novel] | CHEMBL123 | 12 | 0 | c1ccc(NC(=O)c2ccc... |
*Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL203)*
GtoPdb_get_target_interactions(target_id)
└─ Extract: Ligands with pKi/pIC50, selectivity data
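GtoPdb reports potencies on the negative-log scale (pKi, pIC50), while the ChEMBL filters above work in nM, so the two need a common unit before they are merged. The conversion is pX = 9 - log10(value in nM); the helper names below are hypothetical.

```python
import math


def nm_to_p(value_nm: float) -> float:
    """Convert a potency in nM to the negative-log molar scale (pKi, pIC50, ...)."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    return 9.0 - math.log10(value_nm)


def p_to_nm(p_value: float) -> float:
    """Convert a pKi/pIC50-style value back to nM."""
    return 10 ** (9.0 - p_value)
```

Under this conversion the 10000 nM cutoff used for ChEMBL corresponds to a p-value of 5, and a 100 nM compound to a p-value of 7.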
OpenTargets_get_chemical_probes_by_target_ensemblID(ensemblId)
└─ Extract: Validated chemical probes with ratings
Output for Report:
### 2.3 Chemical Probes
| Probe | Target | Rating | Use | Caveat | Source |
|-------|--------|--------|-----|--------|--------|
| Probe-X | EGFR | ★★★★ | In vivo | None | Chemical Probes Portal |
| Probe-Y | EGFR | ★★★☆ | In vitro | Off-target kinase activity | Open Targets |
**Recommended Probe for Target Validation**: Probe-X (highest rating, validated in vivo)
Identify common scaffolds and SAR trends:
### 2.4 Structure-Activity Relationships
**Core Scaffolds Identified**:
1. **4-Anilinoquinazoline** (34 compounds, IC50 range: 2-500 nM)
- N1 position: Aryl preferred
- C6/C7: Methoxy groups improve potency
2. **Pyrimidine-amine** (12 compounds, IC50 range: 15-800 nM)
- Less potent than quinazolines
- Better selectivity profile
**Key SAR Insights**:
- Halogen at meta position of aniline increases potency 3-5x
- C7 ethoxy group critical for binding (H-bond to M793)
BindingDB provides experimental binding affinity data complementary to ChEMBL:
def get_bindingdb_ligands(tu, uniprot_id, affinity_cutoff=10000):
    """
    Get ligands from BindingDB with measured affinities.

    BindingDB advantages:
    - May have compounds not in ChEMBL
    - Different affinity types (Ki, IC50, Kd)
    - Direct literature links
    """
    result = tu.tools.BindingDB_get_ligands_by_uniprot(
        uniprot=uniprot_id,
        affinity_cutoff=affinity_cutoff  # nM
    )

    if result:
        ligands = []
        for entry in result:
            ligands.append({
                'smiles': entry.get('smile'),
                'affinity_type': entry.get('affinity_type'),
                'affinity_nM': entry.get('affinity'),
                'pmid': entry.get('pmid'),
                'monomer_id': entry.get('monomerid')
            })

        def potency(ligand):
            # Missing or non-numeric affinities (e.g. ">10000") sort last
            try:
                return float(ligand['affinity_nM'])
            except (TypeError, ValueError):
                return 1e6

        # Sort by potency
        ligands.sort(key=potency)
        return ligands[:50]  # Top 50

    return []


def find_compound_polypharmacology(tu, smiles, similarity_cutoff=0.85):
    """Find off-target interactions for selectivity analysis."""
    targets = tu.tools.BindingDB_get_targets_by_compound(
        smiles=smiles,
        similarity_cutoff=similarity_cutoff
    )
    return targets  # Other proteins this compound may bind
BindingDB Output for Report:
### 2.5 Additional Ligands (BindingDB) (NEW)
**Total Unique Ligands**: 89 (non-overlapping with ChEMBL)
**Most Potent**: 0.3 nM Ki
| SMILES | Affinity Type | Value (nM) | PMID | BindingDB ID |
|--------|---------------|------------|------|--------------|
| CC(C)Cc1ccc... | Ki | 0.3 | 15737014 | 12345 |
| COc1cc2ncnc... | IC50 | 2.1 | 16460808 | 12346 |
**Novel Scaffolds from BindingDB**: 3 scaffolds not seen in ChEMBL data
*Source: BindingDB via `BindingDB_get_ligands_by_uniprot`*
PubChem BioAssay provides HTS screening results and dose-response data:
def get_pubchem_assays_for_target(tu, gene_symbol):
    """
    Get bioassays and active compounds from PubChem.

    Advantages:
    - HTS data not in ChEMBL
    - NIH-funded screening programs (MLPCN)
    - Dose-response curves for IC50 calculation
    """
    # Search assays targeting this gene
    assays = tu.tools.PubChem_search_assays_by_target_gene(
        gene_symbol=gene_symbol
    )

    results = {
        'assays': [],
        'total_active_compounds': 0
    }

    if assays.get('data', {}).get('aids'):
        for aid in assays['data']['aids'][:10]:  # Top 10 assays
            # Get assay summary
            summary = tu.tools.PubChem_get_assay_summary(aid=aid)

            # Get active compounds
            actives = tu.tools.PubChem_get_assay_active_compounds(aid=aid)
            active_cids = actives.get('data', {}).get('cids', [])

            results['assays'].append({
                'aid': aid,
                'summary': summary.get('data', {}),
                'active_count': len(active_cids)
            })
            results['total_active_compounds'] += len(active_cids)

    return results


def get_dose_response_data(tu, aid):
    """Get dose-response curves for IC50/EC50 determination."""
    dr_data = tu.tools.PubChem_get_assay_dose_response(aid=aid)
    return dr_data


def get_compound_bioactivity_profile(tu, cid):
    """Get all bioactivity data for a compound."""
    profile = tu.tools.PubChem_get_compound_bioactivity(cid=cid)
    return profile
PubChem BioAssay Output for Report:
### 2.6 PubChem HTS Screening Data (NEW)
**Assays Found**: 45
**Total Active Compounds Across Assays**: ~1,200
| AID | Assay Type | Active Compounds | Target | Description |
|-----|------------|------------------|--------|-------------|
| 504526 | HTS | 234 | EGFR | qHTS inhibition screen |
| 1053104 | Dose-response | 12 | EGFR kinase | Confirmatory IC50 |
| 651564 | Cellular | 8 | EGFR | Cell proliferation assay |
**Novel Actives** (not in ChEMBL/BindingDB):
- CID 12345678: Active in AID 504526, IC50 = 45 nM
- CID 23456789: Active in AID 1053104, IC50 = 120 nM
*Source: PubChem via `PubChem_search_assays_by_target_gene`, `PubChem_get_assay_active_compounds`*
Why Use Both BindingDB and PubChem:
| Source | Strengths | Best For |
|--------|-----------|----------|
| ChEMBL | Curated, standardized, SAR data | Primary ligand source |
| BindingDB | Direct affinity measurements | Ki/Kd values, PMIDs |
| PubChem BioAssay | HTS data, NIH screens | Novel scaffolds, broad coverage |
1. PDB_search_similar_structures(query=uniprot_accession, type="sequence")
└─ Extract: PDB IDs with ligands
2. get_protein_metadata_by_pdb_id(pdb_id)
└─ Extract: Resolution, method, ligand codes
3. alphafold_get_prediction(accession=uniprot_accession)
└─ Extract: Predicted structure (if no experimental)
Prioritize EMDB for: Membrane proteins (GPCRs, ion channels), large complexes, targets with multiple conformational states.
def get_cryoem_structures(tu, target_name, uniprot_accession):
    """Get cryo-EM structures for membrane targets."""
    # Search EMDB
    emdb_results = tu.tools.emdb_search(
        query=f"{target_name} membrane receptor"
    )

    structures = []
    for entry in emdb_results[:5]:
        details = tu.tools.emdb_get_entry(entry_id=entry['emdb_id'])

        # Get associated PDB model (essential for docking)
        pdb_models = details.get('pdb_ids', [])

        structures.append({
            'emdb_id': entry['emdb_id'],
            'resolution': entry.get('resolution', 'N/A'),
            'title': entry.get('title', 'N/A'),
            'conformational_state': details.get('state', 'Unknown'),
            'pdb_models': pdb_models
        })

    return structures
When to use cryo-EM over X-ray:
| Target Type | Prefer cryo-EM? | Reason |
|-------------|-----------------|--------|
| GPCR | Yes | Native membrane conformation |
| Ion channel | Yes | Multiple functional states |
| Receptor-ligand complex | Yes | Physiological state |
| Kinase | Usually X-ray | Higher resolution typically |
Structure Summary:
### 3.1 Available Structures
| PDB ID | Resolution | Method | Ligand | Affinity | State |
|--------|------------|--------|--------|----------|-------|
| 1M17 | 2.6 Å | X-ray | Erlotinib | Ki=0.4 nM | Active |
| 4HJO | 2.1 Å | X-ray | Lapatinib | Ki=3 nM | Inactive |
| AF-P00533 | - | Predicted | None | - | - |
### 3.1b Cryo-EM Structures (EMDB)
| EMDB ID | Resolution | PDB Model | Conformation | Ligand |
|---------|------------|-----------|--------------|--------|
| EMD-12345 | 3.2 Å | 7ABC | Active | Agonist |
| EMD-23456 | 3.5 Å | 8DEF | Inactive | Antagonist |
**Best Structure for Docking**: 1M17 (high resolution, relevant ligand)
*Source: RCSB PDB, EMDB, AlphaFold DB*
get_binding_affinity_by_pdb_id(pdb_id)
└─ Extract: Binding affinities for co-crystallized ligands
Output for Report:
### 3.2 Binding Pocket Characterization
**Pocket Volume**: ~850 ų (well-defined)
**Key Interaction Residues**:
- **Hinge region**: M793 (backbone H-bond donor/acceptor)
- **Gatekeeper**: T790 (small residue, allows access)
- **DFG motif**: D855 (active conformation)
- **Selectivity pocket**: L788, G796 (unique to EGFR)
**Druggability Assessment**: High (enclosed pocket, conserved interactions)
Validate structure and score compounds using molecular docking.
Requires: NVIDIA_API_KEY environment variable
Dock a known inhibitor to validate the structure captures the binding pocket correctly.
Option A: DiffDock (Blind docking, PDB + SDF input)
NvidiaNIM_diffdock(
protein=pdb_content, # PDB text content
ligand=reference_sdf, # SDF/MOL2 content
num_poses=10
)
└─ Returns: Docking poses with confidence scores
└─ Use: When you have PDB structure and ligand SDF file
Option B: Boltz2 (From sequence + SMILES)
NvidiaNIM_boltz2(
polymers=[{"molecule_type": "protein", "sequence": kinase_sequence}],
ligands=[{"smiles": "COc1cc2ncnc(Nc3ccc(C#C)cc3)c2cc1OCCOC"}],
sampling_steps=50,
diffusion_samples=1
)
└─ Returns: Protein-ligand complex structure
└─ Use: When starting from SMILES, no SDF needed
| Score vs Reference | Priority | Symbol |
|--------------------|----------|--------|
| Higher than reference | Top priority | ★★★★ |
| Within 5% of reference | High priority | ★★★ |
| Within 20% of reference | Moderate priority | ★★☆ |
| >20% lower | Low priority | ★☆☆ |
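The score-versus-reference table can be applied mechanically once both scores are on the same scale. The sketch below assumes that a higher score means a better pose (as with a confidence score) and that the percentage is taken relative to the reference score; both are assumptions, and `docking_priority` is a hypothetical name.

```python
def docking_priority(score: float, reference: float):
    """Classify a candidate's docking score relative to the reference compound.

    Assumes higher is better. Returns (priority label, star symbol).
    """
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    rel = (score - reference) / abs(reference)
    if rel > 0:
        return ("Top priority", "★★★★")
    if rel >= -0.05:
        return ("High priority", "★★★")
    if rel >= -0.20:
        return ("Moderate priority", "★★☆")
    return ("Low priority", "★☆☆")
```

For scoring functions where lower is better (for example, energies in kcal/mol), the sign of `rel` would need to be flipped before using these thresholds.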
Report Format:
### 3.5 Docking Validation Results
**Reference Compound**: Erlotinib
**Method**: DiffDock via NVIDIA NIM
| Metric | Value | Interpretation |
|--------|-------|----------------|
| Best Pose Confidence | 0.906 | Excellent |
| Steric Clashes | None | Clean binding pose |
**Validation Status**: ✓ Structure captures binding pocket correctly
*Source: NVIDIA NIM via `NvidiaNIM_diffdock`*
Starting from top actives, expand chemical space:
1. ChEMBL_search_similar_molecules(molecule=top_active_smiles, similarity=70)
└─ Extract: Similar compounds not yet tested on target
2. PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
└─ Extract: PubChem CIDs with similar structures
Strategy:
1. ChEMBL_search_substructure(smiles=core_scaffold)
└─ Extract: Compounds containing scaffold
2. PubChem_search_compounds_by_substructure(smiles=core_scaffold)
└─ Extract: Additional scaffold-containing compounds
1. STITCH_get_chemical_protein_interactions(identifier=target_gene)
└─ Extract: Additional chemical-protein links
2. DGIdb_get_drug_gene_interactions(genes=[gene_symbol])
└─ Extract: Approved/investigational drugs
Output for Report:
### 4. Compound Expansion Results
**Starting Seeds**: 5 diverse actives (IC50 < 100 nM)
**Similarity Expansion**: 847 compounds (70% threshold)
**Substructure Search**: 234 scaffold matches
**Cross-Database**: 45 additional hits
**After Deduplication**: 923 unique candidate compounds
| Source | Compounds | Already Tested | Novel Candidates |
|--------|-----------|----------------|------------------|
| ChEMBL similarity | 456 | 234 | 222 |
| PubChem similarity | 391 | 156 | 235 |
| ChEMBL substructure | 178 | 89 | 89 |
| STITCH | 45 | 23 | 22 |
| **Total Unique** | **923** | **355** | **568** |
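The deduplication step behind the "Total Unique" row can be sketched as a merge keyed on a structure identifier. The example assumes each compound record carries an `inchikey` field to serve as that identifier; this field name and the `merge_candidates` helper are illustrative assumptions, and raw SMILES strings would need canonicalization before they could play the same role.

```python
def merge_candidates(sources: dict, tested_keys: set):
    """Merge per-source compound lists and split off untested compounds.

    sources maps a source name to a list of dicts with an 'inchikey' field.
    Returns (merged dict keyed by inchikey, list of novel compounds).
    """
    merged = {}
    for source, compounds in sources.items():
        for cpd in compounds:
            key = cpd["inchikey"]
            if key not in merged:
                merged[key] = {**cpd, "sources": []}
            if source not in merged[key]["sources"]:
                merged[key]["sources"].append(source)
    novel = [cpd for key, cpd in merged.items() if key not in tested_keys]
    return merged, novel
```

Recording every contributing source per compound keeps the provenance needed for the report's source citations after duplicates are collapsed.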
When database mining yields insufficient candidates, generate novel molecules.
Requires: NVIDIA_API_KEY environment variable
Option A: GenMol (Scaffold Hopping with Masked Regions)
NvidiaNIM_genmol(
smiles="COc1cc2ncnc(Nc3ccc([*{3-8}])c([*{1-3}])c3)c2cc1OCCCN1CCOCC1",
num_molecules=100,
temperature=2.0,
scoring="QED"
)
└─ Input: SMILES with [*{min-max}] masked regions
└─ Output: Generated molecules with QED/LogP scores
└─ Use: Explore specific positions while keeping scaffold
Mask Design Strategy:
| Position | Mask | Purpose |
|----------|------|---------|
| Aniline substituent | [*{1-3}] | Small groups (halogen, methyl) |
| Solubilizing group | [*{5-10}] | Morpholine, piperazine variants |
| Linker region | [*{3-6}] | Spacer variations |
Example Masked SMILES for EGFR:
# Keep quinazoline core, vary aniline and tail
COc1cc2ncnc(Nc3ccc([*{1-3}])c([*{1-3}])c3)c2cc1[*{5-12}]
Option B: MolMIM (Controlled Generation from Reference)
NvidiaNIM_molmim(
smi="COc1cc2ncnc(Nc3ccc(Cl)cc3)c2cc1OCCN1CCOCC1",
num_molecules=50,
algorithm="CMA-ES"
)
└─ Input: Reference SMILES (known active)
└─ Output: Optimized analogs with property scores
└─ Use: Generate close analogs of top actives
Generation Workflow:
Report Format:
### 4.4 De Novo Generation Results
**Method**: GenMol via NVIDIA NIM
**Seed Scaffold**: 4-anilinoquinazoline (from erlotinib)
**Masked Positions**: Aniline (3,4), solubilizing tail
| Metric | Value |
|--------|-------|
| Molecules Generated | 100 |
| Passing Lipinski | 95 (95%) |
| Mean QED Score | 0.72 |
| Unique Scaffolds | 12 |
**Top Generated Compounds**:
| ID | SMILES | QED | LogP | Novelty |
|----|--------|-----|------|---------|
| GEN-001 | COc1cc2ncnc(Nc3ccc(Cl)c(Cl)c3)c2cc1OCCCN1CCOCC1 | 0.81 | 4.2 | Novel substitution |
| GEN-002 | COc1cc2ncnc(Nc3ccc(C#N)c(F)c3)c2cc1OCCCN1CCOCC1 | 0.78 | 3.8 | Novel substitution |
*Source: NVIDIA NIM via `NvidiaNIM_genmol`*
ADMETAI_predict_physicochemical_properties(smiles=[compound_list])
└─ Filter: Lipinski violations ≤ 1
└─ Filter: QED > 0.3
└─ Filter: MW 200-600
1. ADMETAI_predict_bioavailability(smiles=[compound_list])
└─ Filter: Oral bioavailability > 0.3
2. ADMETAI_predict_toxicity(smiles=[compound_list])
└─ Filter: AMES < 0.5, hERG < 0.5, DILI < 0.5
3. ADMETAI_predict_CYP_interactions(smiles=[compound_list])
└─ Flag: CYP3A4 inhibitors (drug interaction risk)
ChEMBL_search_compound_structural_alerts(smiles=compound_smiles)
└─ Flag: PAINS, reactive groups, toxicophores
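The thresholds listed above can be combined into a single pass/fail check per compound. The sketch assumes the predictions have already been flattened into one dict per compound with the hypothetical keys shown; the real ADMET-AI output uses its own column names, which would need to be mapped first.

```python
def admet_failures(cpd: dict) -> list:
    """Return the names of the filters a compound fails (empty list = passes)."""
    checks = [
        ("lipinski", cpd["lipinski_violations"] <= 1),
        ("qed", cpd["qed"] > 0.3),
        ("mw", 200 <= cpd["mw"] <= 600),
        ("bioavailability", cpd["bioavailability"] > 0.3),
        ("ames", cpd["ames"] < 0.5),
        ("herg", cpd["herg"] < 0.5),
        ("dili", cpd["dili"] < 0.5),
    ]
    return [name for name, passed in checks if not passed]
```

Returning the list of failed filters, rather than a single boolean, makes it straightforward to tally the common failure reasons for the filter summary.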
ADMET Filter Summary:
### 5. ADMET Filtering Results
| Filter Stage | Input | Passed | Failed | Pass Rate |
|--------------|-------|--------|--------|-----------|
| Physicochemical (Lipinski) | 568 | 456 | 112 | 80% |
| Drug-likeness (QED > 0.3) | 456 | 398 | 58 | 87% |
| Bioavailability (> 0.3) | 398 | 312 | 86 | 78% |
| Toxicity filters | 312 | 267 | 45 | 86% |
| Structural alerts | 267 | 234 | 33 | 88% |
| **Final Candidates** | **568** | **234** | **334** | **41%** |
**Common Failure Reasons**:
1. High molecular weight (>600): 67 compounds
2. Low predicted bioavailability: 86 compounds
3. hERG liability: 28 compounds
4. PAINS alerts: 18 compounds
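The rows of the filter summary follow directly from the number of compounds surviving each stage. A minimal sketch, with the hypothetical helper `filter_funnel`, that reproduces the pass rates in the example table:

```python
def filter_funnel(initial: int, stages):
    """Build funnel rows from an initial count and (stage, n_passed) pairs.

    Each row is (stage, n_input, n_passed, n_failed, pass_rate_percent).
    A final row summarizes the whole funnel against the initial count.
    """
    if initial <= 0:
        raise ValueError("initial count must be positive")
    rows, n_input = [], initial
    for stage, n_passed in stages:
        if not 0 <= n_passed <= n_input:
            raise ValueError(f"{stage}: invalid passed count {n_passed}")
        rate = round(100 * n_passed / n_input) if n_input else 0
        rows.append((stage, n_input, n_passed, n_input - n_passed, rate))
        n_input = n_passed
    rows.append(("Final Candidates", initial, n_input, initial - n_input,
                 round(100 * n_input / initial)))
    return rows
```

Called with 568 input compounds and stage survivors of 456, 398, 312, 267, and 234, it yields pass rates of 80, 87, 78, 86, and 88 percent, and an overall rate of 41 percent.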
Score each candidate on multiple dimensions:
| Dimension | Weight | Scoring Criteria |
|-----------|--------|------------------|
| Structural Similarity | 25% | Tanimoto to actives (0.7-1.0 → 1-5) |
| Novelty | 20% | Not in ChEMBL bioactivity = +2; Novel scaffold = +3 |
| ADMET Score | 25% | Composite of property predictions |
| Synthesis Feasibility | 15% | SA score (1-10), commercial availability |
| Scaffold Diversity | 15% | Cluster representative bonus |
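The weights above combine into an overall score as a weighted sum. The sketch assumes every dimension has first been converted to a common 1-5 point scale; only the Tanimoto mapping is spelled out in the table, so the linear interpolation below and the handling of the other dimensions are assumptions, and the function names are hypothetical.

```python
WEIGHTS = {
    "similarity": 0.25,
    "novelty": 0.20,
    "admet": 0.25,
    "synthesis": 0.15,
    "diversity": 0.15,
}


def similarity_points(tanimoto: float) -> float:
    """Linearly map Tanimoto 0.7-1.0 onto 1-5 points, clamping outside that range."""
    clamped = min(max(tanimoto, 0.7), 1.0)
    return 1.0 + 4.0 * (clamped - 0.7) / 0.3


def overall_score(points: dict) -> float:
    """Weighted sum of per-dimension points (each assumed to be on a 1-5 scale)."""
    missing = set(WEIGHTS) - set(points)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * points[dim] for dim in WEIGHTS)
```

Because the weights sum to 1, the overall score stays on the same 1-5 scale as its inputs, which keeps it comparable with the per-dimension columns in the ranked candidate table.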
### 6.2 Synthesis Feasibility Assessment
| Candidate | SA Score | Commercial | Estimated Steps | Flag |
|-----------|----------|------------|-----------------|------|
| Compound-1 | 2.3 | Yes (Enamine) | 0 | ★★★ |
| Compound-2 | 3.5 | Building block | 2-3 | ★★☆ |
| Compound-3 | 5.8 | No | 6-8 | ★☆☆ |
**SA Score Interpretation**:
- 1-3: Easy synthesis
- 3-5: Moderate complexity
- 5-10: Challenging synthesis
### 6.3 Top 20 Candidate Compounds
| Rank | ID | SMILES | Sim. Score | ADMET | Novelty | Overall | Rationale |
|------|-----|--------|------------|-------|---------|---------|-----------|
| 1 | CPD-001 | Cc1ccc... | 0.82 | 4.5 | Novel scaffold | 4.2 | High similarity, clean ADMET |
| 2 | CPD-002 | COc1cc... | 0.78 | 4.3 | Not tested | 4.0 | Quinazoline analog |
| 3 | CPD-003 | Nc1ccc... | 0.75 | 4.1 | Novel core | 3.9 | New chemotype |
| ... | ... | ... | ... | ... | ... | ... | ... |
**Scaffold Diversity**: 7 distinct scaffolds in top 20
**Commercial Availability**: 12/20 available for purchase
**Estimated Hit Rate**: 15-25% (based on similarity to actives)
Search literature to validate candidate compounds and understand target context.
def search_binder_literature(tu, target_name, compound_scaffolds):
    """Search literature for compound and target evidence."""
    # PubMed: Published SAR studies
    sar_papers = tu.tools.PubMed_search_articles(
        query=f"{target_name} inhibitor SAR structure-activity",
        limit=30
    )

    # BioRxiv: Latest unpublished findings
    preprints = tu.tools.BioRxiv_search_preprints(
        query=f"{target_name} small molecule discovery",
        limit=15
    )

    # MedRxiv: Clinical data on inhibitors
    clinical = tu.tools.MedRxiv_search_preprints(
        query=f"{target_name} inhibitor clinical trial",
        limit=10
    )

    # Citation analysis for key papers
    key_papers = sar_papers[:10]
    for paper in key_papers:
        citation = tu.tools.openalex_search_works(
            query=paper['title'],
            limit=1
        )
        paper['citations'] = citation[0].get('cited_by_count', 0) if citation else 0

    return {
        'published_sar': sar_papers,
        'preprints': preprints,
        'clinical_preprints': clinical,
        'high_impact_papers': sorted(key_papers, key=lambda x: x.get('citations', 0), reverse=True)
    }
## 6.5 Literature Evidence
### Published SAR Studies
| PMID | Title | Year | Key Insight |
|------|-------|------|-------------|
| 34567890 | Discovery of novel EGFR inhibitors... | 2024 | C7 substitution critical |
| 33456789 | Structure-activity relationship of... | 2023 | Fluorine improves potency |
### Recent Preprints (⚠️ Not Peer-Reviewed)
| Source | Title | Posted | Relevance |
|--------|-------|--------|-----------|
| BioRxiv | Novel scaffolds for EGFR... | 2024-02 | New chemotype discovery |
| MedRxiv | Clinical activity of... | 2024-01 | Phase 2 results |
### High-Impact References
| PMID | Citations | Title |
|------|-----------|-------|
| 32123456 | 523 | Landmark EGFR inhibitor study... |
| 31234567 | 312 | Comprehensive SAR analysis... |
*Source: PubMed, BioRxiv, MedRxiv, OpenAlex*
File: [TARGET]_binder_discovery_report.md
# Small Molecule Binder Discovery: [TARGET]
**Generated**: [Date] | **Query**: [Original query] | **Status**: In Progress
---
## Executive Summary
[Researching...]
---
## 1. Target Validation
### 1.1 Target Identifiers
[Researching...]
### 1.2 Druggability Assessment
[Researching...]
### 1.3 Binding Site Analysis
[Researching...]
---
## 2. Known Ligand Landscape
### 2.1 ChEMBL Bioactivity Summary
[Researching...]
### 2.2 Approved Drugs & Clinical Compounds
[Researching...]
### 2.3 Chemical Probes
[Researching...]
### 2.4 SAR Insights
[Researching...]
---
## 3. Structural Information
### 3.1 Available Structures
[Researching...]
### 3.2 Binding Pocket Analysis
[Researching...]
### 3.3 Key Interactions
[Researching...]
---
## 4. Compound Expansion
### 4.1 Similarity Search Results
[Researching...]
### 4.2 Substructure Search Results
[Researching...]
### 4.3 Cross-Database Mining
[Researching...]
---
## 5. ADMET Filtering
### 5.1 Physicochemical Filters
[Researching...]
### 5.2 ADMET Predictions
[Researching...]
### 5.3 Structural Alerts
[Researching...]
### 5.4 Filter Summary
[Researching...]
---
## 6. Candidate Prioritization
### 6.1 Scoring Methodology
[Researching...]
### 6.2 Synthesis Feasibility
[Researching...]
### 6.3 Top 20 Candidates
[Researching...]
---
## 7. Recommendations
### 7.1 Immediate Actions
[Researching...]
### 7.2 Experimental Validation Plan
[Researching...]
### 7.3 Backup Strategies
[Researching...]
---
## 8. Data Gaps & Limitations
[Researching...]
---
## 9. Data Sources
[Will be populated as research progresses...]
---
## 10. Methods Summary
| Step | Tool | Purpose |
|------|------|---------|
| Sequence retrieval | UniProt_search | Get protein sequence |
| Structure prediction | NvidiaNIM_alphafold2 / NvidiaNIM_esmfold | 3D structure with pLDDT |
| Docking validation | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Validate binding pocket |
| Known ligands | ChEMBL_get_target_activities | Bioactivity data |
| Similarity search | ChEMBL_search_similar_molecules | Expand chemical space |
| De novo generation | NvidiaNIM_genmol / NvidiaNIM_molmim | Novel molecule design |
| ADMET filtering | ADMETAI_predict_* | Drug-likeness assessment |
| Candidate docking | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Final scoring |
| Tier | Symbol | Description | Example |
|------|--------|-------------|---------|
| T0 | ★★★★ | Docking score > reference inhibitor | Better than erlotinib |
| T1 | ★★★ | Experimental IC50/Ki < 100 nM | ChEMBL bioactivity |
| T2 | ★★☆ | Docking within 5% of reference OR IC50 100-1000 nM | High priority |
| T3 | ★☆☆ | Structural similarity > 80% to T1 | Predicted active |
| T4 | ☆☆☆ | Similarity 70-80%, scaffold match | Lower confidence |
| T5 | ○○○ | Generated molecule, ADMET-passed, no docking | Speculative |
Docking-Enhanced Grading: When NVIDIA NIM docking is available, compounds gain evidence:
Apply to all candidate compounds:
| Compound | Evidence | Docking vs Ref | Rationale |
|----------|----------|----------------|-----------|
| CPD-001 | ★★★★ | +8.3% | 85% similar, docking > erlotinib |
| CPD-002 | ★★★ | -2.1% | IC50=45nM, validated by docking |
| CPD-003 | ★★☆ | -4.5% | 78% similar, good docking |
| GEN-001 | ★☆☆ | -15% | Generated, ADMET-passed |
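The tier definitions can be turned into a small decision function. This is a sketch under stated assumptions: tiers are checked from strongest to weakest evidence (the precedence is not specified in the tier table), docking is expressed as a fractional difference from the reference score with higher meaning better, and `evidence_tier` is a hypothetical name.

```python
def evidence_tier(docking_rel=None, ic50_nm=None, similarity=None, generated=False):
    """Assign an evidence tier (T0-T5) from whatever evidence is available.

    docking_rel: (score - reference) / |reference|, or None if not docked.
    ic50_nm:     experimental IC50/Ki in nM, or None.
    similarity:  Tanimoto similarity (0-1) to a T1 compound, or None.
    generated:   True for de novo molecules that passed ADMET filtering.
    Returns None when no evidence applies.
    """
    if docking_rel is not None and docking_rel > 0:
        return "T0"
    if ic50_nm is not None and ic50_nm < 100:
        return "T1"
    if docking_rel is not None and docking_rel >= -0.05:
        return "T2"
    if ic50_nm is not None and 100 <= ic50_nm <= 1000:
        return "T2"
    if similarity is not None and similarity > 0.80:
        return "T3"
    if similarity is not None and similarity >= 0.70:
        return "T4"
    return "T5" if generated else None
```

With the first three example rows, CPD-001 (+8.3% docking) maps to T0, CPD-002 (IC50 = 45 nM) to T1, and CPD-003 (-4.5% docking) to T2.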
| Tool | Purpose |
|------|---------|
| UniProt_search | Resolve UniProt accession |
| MyGene_query_genes | Get Ensembl/NCBI IDs |
| ChEMBL_search_targets | Get ChEMBL target ID |
| OpenTargets_get_target_tractability_by_ensemblID | Tractability assessment |
| DGIdb_get_gene_druggability | Druggability categories |
| ChEMBL_search_binding_sites | Binding site info |
| InterPro_get_protein_domains | Domain architecture |
| Tool | Purpose |
|------|---------|
| ChEMBL_get_target_activities | Bioactivity data |
| ChEMBL_get_molecule | Molecule details |
| GtoPdb_get_target_interactions | Pharmacology data |
| OpenTargets_get_chemical_probes_by_target_ensemblID | Chemical probes |
| OpenTargets_get_associated_drugs_by_target_ensemblID | Known drugs |
| Tool | Purpose |
|------|---------|
| NvidiaNIM_alphafold2 | High-accuracy structure prediction with pLDDT |
| NvidiaNIM_esmfold | Fast structure prediction (max 1024 AA) |
| NvidiaNIM_msa_search | MSA generation for AlphaFold |
| Tool | Purpose |
|------|---------|
| PDB_search_similar_structures | Find PDB structures |
| get_protein_metadata_by_pdb_id | Structure metadata |
| get_binding_affinity_by_pdb_id | Ligand affinities |
| alphafold_get_prediction | Predicted structure (AlphaFold DB) |
| get_ligand_smiles_by_chem_comp_id | Ligand structures |
| emdb_search | Search cryo-EM structures (NEW) |
| emdb_get_entry | Get EMDB entry details (NEW) |
| Tool | Purpose |
|------|---------|
| NvidiaNIM_diffdock | Blind molecular docking (PDB + SDF) |
| NvidiaNIM_boltz2 | Protein-ligand complex (sequence + SMILES) |
| Tool | Purpose |
|------|---------|
| ChEMBL_search_similar_molecules | Similarity search |
| PubChem_search_compounds_by_similarity | PubChem similarity |
| ChEMBL_search_substructure | Substructure search |
| PubChem_search_compounds_by_substructure | PubChem substructure |
| STITCH_get_chemical_protein_interactions | Cross-database |
| Tool | Purpose |
|------|---------|
| NvidiaNIM_genmol | Scaffold hopping with masked regions |
| NvidiaNIM_molmim | Controlled generation from reference |
| Tool | Purpose |
|------|---------|
| ADMETAI_predict_physicochemical_properties | Drug-likeness |
| ADMETAI_predict_bioavailability | Oral absorption |
| ADMETAI_predict_toxicity | Toxicity flags |
| ADMETAI_predict_CYP_interactions | CYP liabilities |
| ChEMBL_search_compound_structural_alerts | PAINS, alerts |
| Tool | Purpose |
|------|---------|
| NvidiaNIM_diffdock | Score all candidates by docking |
| NvidiaNIM_boltz2 | Alternative docking from SMILES |
| Tool | Purpose |
|------|---------|
| PubMed_search_articles | Published SAR studies |
| BioRxiv_search_preprints | Latest biology preprints |
| MedRxiv_search_preprints | Clinical preprints |
| openalex_search_works | Citation analysis |
| SemanticScholar_search | AI-ranked papers |
| Primary Tool | Fallback 1 | Fallback 2 | Use When |
|--------------|------------|------------|----------|
| ChEMBL_get_target_activities | GtoPdb_get_target_interactions | PubChem_search_assays | No ChEMBL data |
| ChEMBL_search_similar_molecules | PubChem_search_compounds_by_similarity | STITCH_get_chemical_protein_interactions | ChEMBL exhausted |
| PDB_search_similar_structures | NvidiaNIM_alphafold2 | alphafold_get_prediction | No PDB structure |
| alphafold_get_prediction | NvidiaNIM_alphafold2 | NvidiaNIM_esmfold | AlphaFold DB unavailable |
| NvidiaNIM_alphafold2 | NvidiaNIM_esmfold | alphafold_get_prediction | AlphaFold2 NIM error |
| NvidiaNIM_diffdock | NvidiaNIM_boltz2 | Skip docking, use similarity | Docking error |
| NvidiaNIM_genmol | NvidiaNIM_molmim | Manual scaffold hopping | Generation error |
| OpenTargets_get_target_tractability_by_ensemblID | DGIdb_get_gene_druggability | Document "Unknown" | Open Targets error |
| ADMETAI_* | SwissADME tools | Basic Lipinski | Invalid SMILES |
| PDB_search_similar_structures | emdb_search + PDB | NvidiaNIM_alphafold2 | Membrane proteins |
| PubMed_search_articles | openalex_search_works | SemanticScholar_search | Literature search |
| BioRxiv_search_preprints | MedRxiv_search_preprints | Skip preprints | Preprint sources |
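
One way to apply a fallback chain is to try each tool in order and keep the first non-empty result, recording why earlier tools were skipped. The sketch below is illustrative only: `run_with_fallbacks` and the three stub functions are hypothetical stand-ins and do not call ToolUniverse; real tool calls would be wrapped in zero-argument callables in the same way.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# A step pairs a tool name with a zero-argument callable that invokes it.
Step = Tuple[str, Callable[[], Any]]


def run_with_fallbacks(steps: Sequence[Step]) -> Tuple[Optional[str], Any, List[str]]:
    """Try each step in order; return (tool_name, result, notes) for the first non-empty result."""
    notes: List[str] = []
    for name, call in steps:
        try:
            result = call()
        except Exception as exc:  # any tool error moves on to the next fallback
            notes.append(f"{name}: error ({exc})")
            continue
        if result:
            return name, result, notes
        notes.append(f"{name}: no data")
    return None, None, notes


# Placeholder stubs standing in for real tool calls (assumptions, not real APIs).
def primary_stub() -> list:
    raise RuntimeError("service unavailable")


def fallback1_stub() -> list:
    return []  # tool responded but had no data


def fallback2_stub() -> list:
    return [{"placeholder": True}]


source, data, notes = run_with_fallbacks([
    ("ChEMBL_get_target_activities", primary_stub),
    ("GtoPdb_get_target_interactions", fallback1_stub),
    ("PubChem_search_assays", fallback2_stub),
])
print(source, notes)
```

The returned tool name can go into the report's source attribution, and the notes give a record of which tools failed or returned nothing.
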
NVIDIA NIM API Key Required: Tools with the `NvidiaNIM_` prefix require the `NVIDIA_API_KEY` environment variable. Check availability at the start of the workflow:
```python
import os

nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
# If not available, fall back to non-NIM alternatives
```
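
The key check can feed directly into tool selection. The sketch below, under stated assumptions, orders the structure prediction tools by availability: the function name `structure_tool_order` and its `env` parameter are hypothetical, while the 1024-residue ESMFold limit and the tool order come from the tool and fallback tables.

```python
import os
from typing import List, Mapping, Optional

ESMFOLD_MAX_LENGTH = 1024  # residue limit listed for NvidiaNIM_esmfold


def structure_tool_order(sequence: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return structure tools to try, in order, given key availability and sequence length."""
    env = os.environ if env is None else env
    if not env.get("NVIDIA_API_KEY"):
        # No NIM access: only the AlphaFold DB lookup is usable.
        return ["alphafold_get_prediction"]
    order = ["NvidiaNIM_alphafold2"]
    if len(sequence) <= ESMFOLD_MAX_LENGTH:
        order.append("NvidiaNIM_esmfold")
    order.append("alphafold_get_prediction")
    return order


# Example with a dummy 1500-residue sequence and an explicit environment mapping.
print(structure_tool_order("M" * 1500, env={"NVIDIA_API_KEY": "dummy"}))
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.
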
User: "Find novel binders for EGFR" → Rich ChEMBL data; focus on novel scaffolds, selectivity, ADMET
User: "Find small molecules for [new target with no known ligands]" → Limited bioactivity; rely on structure-based assessment, similar target ligands
User: "Find analogs of compound X for target Y" → Deep similarity search around specific compound; focus on SAR
User: "Find selective inhibitors for kinase X vs kinase Y" → Include selectivity analysis; filter by off-target predictions
Use this skill for discovering new compounds for a protein target.