skills/05-kthorn-research-superpower/research/checking-chembl/SKILL.md
<!--
This file is the original documentation of an open-source Skill, included for study and research reference only.
Collected and organized by CoPaper.AI | https://copaper.ai
Source repository: https://github.com/kthorn/research-superpower
Project name: research-superpower
License: MIT License
Date collected: 2026-04-02
Notice: Copyright of this file belongs to the original author. It is included here to give empirical social science researchers a central reference for AI Agent Skills. If it infringes your rights, please contact us for removal.
-->
--- name: Checking ChEMBL for Structured SAR Data desc
ChEMBL is a manually curated database of ~99,000 medicinal chemistry papers with extracted, standardized bioactivity data. If a paper is in ChEMBL, you can access structured data without parsing PDFs.
Core principle: Check ChEMBL first for medicinal chemistry papers. Curated data is more reliable than table parsing.
Use this skill when:
When NOT to use:
Base URL: https://www.ebi.ac.uk/chembl/api/data/
No authentication required
CRITICAL: ChEMBL can ONLY be queried by DOI, NOT by PMID
?doi=10.1234/example
Two-step process:
Query by DOI (ONLY method that works):
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI"
⚠️ IMPORTANT: Must use DOI, not PMID
# ✅ CORRECT - Use DOI
doi="10.1021/jm401507s"
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi"
# ❌ WRONG - PMID won't work (will return 0 results)
pmid="24446688"
curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?pubmed_id=$pmid" # Does NOT work!
If you only have PMID: Fetch DOI from PubMed first, then query ChEMBL with the DOI.
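One way to do that lookup is NCBI's E-utilities `esummary` endpoint, which lists a record's identifiers under `articleids`. The sketch below is an assumption about how the PMID-to-DOI step could be done, not part of the original skill; the helper names `doi_from_esummary` and `fetch_doi_for_pmid` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def doi_from_esummary(payload, pmid):
    """Pull the DOI out of a parsed esummary JSON payload, or return None."""
    record = payload.get("result", {}).get(str(pmid), {})
    for article_id in record.get("articleids", []):
        if article_id.get("idtype") == "doi":
            return article_id.get("value")
    return None


def fetch_doi_for_pmid(pmid, timeout=10):
    """Ask PubMed for one record and return its DOI (None if it has none)."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    with urllib.request.urlopen(f"{ESUMMARY_URL}?{query}", timeout=timeout) as resp:
        return doi_from_esummary(json.load(resp), pmid)
```

The parsing is kept separate from the network call so it can be checked against a saved response; the returned DOI then goes into the `?doi=` query shown earlier.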
Response structure:
{
  "documents": [
    {
      "document_chembl_id": "CHEMBL3120156",
      "doi": "10.1021/jm401507s",
      "title": "Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor.",
      "abstract": "Hepatitis C virus is a blood-borne infection...",
      "pubmed_id": 24446688,
      "journal": "J Med Chem",
      "year": 2014,
      "doc_type": "PUBLICATION"
    }
  ],
  "page_meta": {
    "total_count": 1
  }
}
Key fields:
- document_chembl_id - Use this to retrieve activity data
- doc_type - "PUBLICATION" (from literature) or "DATASET" (deposited)
- pubmed_id - PMID is in the response, but cannot be used to query ChEMBL
- If total_count = 0, the paper is not in ChEMBL
Parse response:
response=$(curl -s "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=$doi")
# Default to 0 so an empty or malformed response is treated as "not found"
total=$(echo "$response" | jq -r '.page_meta.total_count // 0' 2>/dev/null)
if [ "${total:-0}" -gt 0 ]; then
  chembl_id=$(echo "$response" | jq -r '.documents[0].document_chembl_id')
  echo "✓ Found in ChEMBL: $chembl_id"
else
  echo "✗ Not in ChEMBL"
fi
Query activity endpoint:
curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&limit=1"
Extract total count:
activity_url="https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=$chembl_id&limit=1"
activity_count=$(curl -s "$activity_url" | jq -r '.page_meta.total_count')
echo "→ $activity_count bioactivity data points"
Report immediately:
📄 [15/127] Screening: "Discovery and development of simeprevir"
Abstract score: 9 → Fetching full text...
✓ ChEMBL: CHEMBL3120156 (101 activity data points)
→ IC50 data for HCV NS3 protease inhibitors available
Add to SUMMARY.md:
### [Discovery and development of simeprevir (TMC435), a HCV NS3/4A protease inhibitor](https://doi.org/10.1021/jm401507s) (Score: 9)
**DOI:** [10.1021/jm401507s](https://doi.org/10.1021/jm401507s)
**PMID:** [24446688](https://pubmed.ncbi.nlm.nih.gov/24446688/)
**ChEMBL:** [CHEMBL3120156](https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3120156/) (101 data points)
**Key Findings:**
- IC50 data for HCV NS3/4A protease inhibitors (from ChEMBL)
- Lead compound simeprevir (TMC435) approved for HCV treatment
- Structures and full activity data: [ChEMBL API](https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156)
**ChEMBL Activity Summary:**
- IC50 values for HCV NS3/4A protease
- PK parameters (AUC, Cmax, clearance)
- DMPK assays (metabolic stability, permeability)
Always include ChEMBL status:
Add to papers-reviewed.json:
{
  "10.1021/jm401507s": {
    "pmid": "24446688",
    "status": "relevant",
    "score": 9,
    "chembl_id": "CHEMBL3120156",
    "chembl_activities": 101,
    "has_structured_data": true
  }
}
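Recording the result could look like the sketch below. It assumes papers-reviewed.json is a single JSON object keyed by DOI, as in the example above, and that a paper missing from ChEMBL is stored with a null `chembl_id`; the function name `record_chembl_status` is hypothetical.

```python
import json
from pathlib import Path


def record_chembl_status(path, doi, chembl_id=None, activity_count=0):
    """Merge ChEMBL fields into the entry for `doi`, creating the file if needed."""
    path = Path(path)
    papers = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Keep fields that are already there (pmid, status, score) and add ours
    entry = papers.setdefault(doi, {})
    entry["chembl_id"] = chembl_id
    entry["chembl_activities"] = activity_count
    # Assumption: "structured data" means a ChEMBL hit with at least one activity
    entry["has_structured_data"] = chembl_id is not None and activity_count > 0
    path.write_text(json.dumps(papers, indent=2) + "\n", encoding="utf-8")
    return entry
```

Because the entry is merged rather than replaced, the call can run before or after the relevance score is written.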
For papers with rich ChEMBL data (>20 activities), consider extracting:
# Get IC50 data (first 100 records)
curl -s "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=CHEMBL3120156&standard_type=IC50&limit=100" > chembl_data.json
# Summary statistics (records with a null standard_value are skipped)
jq -r '[.activities[] | .standard_value | select(. != null) | tonumber] | "Min: \(min), Max: \(max), Count: \(length)"' chembl_data.json
Report to user:
📊 ChEMBL data extracted:
- IC50 values for HCV NS3/4A protease
- All structures downloaded
- Data saved to: chembl_CHEMBL3120156_ic50.json
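A single request returns only one page of records, so a fuller extraction can step `offset` until `page_meta.total_count` is reached. The sketch below assumes that paging scheme and skips records whose `standard_value` is null. `fetch_page` is a hypothetical injectable callable so the loop can be exercised without network access, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

ACTIVITY_URL = "https://www.ebi.ac.uk/chembl/api/data/activity.json"


def http_fetch_page(params, timeout=30):
    """Fetch one page of activity records from ChEMBL as a parsed dict."""
    url = f"{ACTIVITY_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_all_activities(chembl_id, standard_type=None, page_size=100,
                         fetch_page=http_fetch_page):
    """Collect every activity record for a document by stepping `offset`."""
    params = {"document_chembl_id": chembl_id, "limit": page_size, "offset": 0}
    if standard_type:
        params["standard_type"] = standard_type
    activities = []
    while True:
        page = fetch_page(dict(params))
        batch = page.get("activities", [])
        activities.extend(batch)
        total = page.get("page_meta", {}).get("total_count", 0)
        params["offset"] += page_size
        # Stop on an empty page or once the reported total has been covered
        if not batch or params["offset"] >= total:
            return activities


def summarize_values(activities):
    """Min, max, and count of the non-null standard_value entries."""
    values = [float(a["standard_value"]) for a in activities
              if a.get("standard_value") is not None]
    if not values:
        return {"count": 0, "min": None, "max": None}
    return {"count": len(values), "min": min(values), "max": max(values)}
```

Calling `summarize_values(fetch_all_activities("CHEMBL3120156", standard_type="IC50"))` would cover the same ground as the jq summary, without the 100-record cap.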
During evaluating-paper-relevance workflow:
Workflow integration point:
Stage 2: Deep Dive
├─ 1. Fetch Full Text (PMC → DOI → Unpaywall)
├─ 1.5. Check ChEMBL ← ADD THIS STEP
│ ├─ Query by DOI
│ ├─ If found: note ChEMBL ID + activity count
│ └─ Report to user
├─ 2. Scan for Relevant Content
└─ 3. Extract Findings
| Type | Description | Units |
|------|-------------|-------|
| IC50 | Half-maximal inhibitory concentration | nM, µM |
| MIC | Minimum inhibitory concentration | µg/mL, nM |
| Ki | Inhibition constant | nM, µM |
| EC50 | Half-maximal effective concentration | nM, µM |
| Kd | Dissociation constant | nM, µM |
| Potency | General potency measurement | Various |
Filter by activity type:
curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&standard_type=MIC"
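To see which activity types a document actually carries before filtering, the records can be tallied by `standard_type` and `standard_units`. A minimal sketch, assuming `activities` is the list under the `activities` key of an activity.json response; `count_activity_types` is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import Counter


def count_activity_types(activities):
    """Tally records by (standard_type, standard_units), most common first."""
    counts = Counter(
        (a.get("standard_type") or "unknown", a.get("standard_units") or "no units")
        for a in activities
    )
    return counts.most_common()


# Made-up sample records, shaped like entries in an activity.json response
sample = [
    {"standard_type": "IC50", "standard_units": "nM"},
    {"standard_type": "IC50", "standard_units": "nM"},
    {"standard_type": "MIC", "standard_units": "ug.mL-1"},
    {"standard_type": "Cmax", "standard_units": None},
]
for (activity_type, units), n in count_activity_types(sample):
    print(f"{activity_type} ({units}): {n}")
```

The tally also gives a quick basis for the activity summary lines written to SUMMARY.md.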
~99,000 documents (as of 2025)
Well represented:
Poorly represented:
Typical hit rate:
vs. PDF table parsing:
When to still use PDF:
CRITICAL: Report ChEMBL check for every relevant paper
Example workflow report:
📄 [15/50] Screening: "Novel MmpL3 inhibitors..."
Abstract score: 8 → Checking ChEMBL...
✓ ChEMBL: CHEMBL3456789 (34 data points)
→ Fetching full text...
→ Added to SUMMARY.md with ChEMBL link
For papers not in ChEMBL:
📄 [16/50] Screening: "Another paper..."
Abstract score: 9 → Checking ChEMBL...
✗ Not in ChEMBL (likely too recent or review paper)
→ Fetching full text via Unpaywall...
For research sessions with many medicinal chemistry papers:
Create check_chembl.py:
#!/usr/bin/env python3
"""Check whether a paper (by DOI) is in ChEMBL and count its activity records."""
import sys

import requests

BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def check_chembl(doi):
    """Check if DOI is in ChEMBL and return summary

    IMPORTANT: Must use DOI, not PMID. ChEMBL API does not accept PMID queries.
    Returns None if the document query itself fails.
    """
    # Query document (ONLY works with DOI); params= URL-encodes the DOI
    try:
        doc_response = requests.get(
            f"{BASE_URL}/document.json", params={"doi": doi}, timeout=10
        ).json()
    except (requests.RequestException, ValueError):
        return None

    # Check if found
    if doc_response.get('page_meta', {}).get('total_count', 0) == 0:
        return {'in_chembl': False}

    doc = doc_response['documents'][0]
    chembl_id = doc['document_chembl_id']

    # Get activity count (limit=1: only page_meta.total_count is needed)
    try:
        act_response = requests.get(
            f"{BASE_URL}/activity.json",
            params={"document_chembl_id": chembl_id, "limit": 1},
            timeout=10,
        ).json()
        activity_count = act_response.get('page_meta', {}).get('total_count', 0)
    except (requests.RequestException, ValueError):
        activity_count = 0

    return {
        'in_chembl': True,
        'chembl_id': chembl_id,
        'activity_count': activity_count,
        'doc_type': doc.get('doc_type'),
        'title': doc.get('title')
    }


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: check_chembl.py DOI")
    result = check_chembl(sys.argv[1])
    if result is None:
        # A failed request is not the same as "not in ChEMBL"
        print("✗ ChEMBL query failed")
    elif result['in_chembl']:
        print(f"✓ {result['chembl_id']} ({result['activity_count']} activities)")
    else:
        print("✗ Not in ChEMBL")
Usage:
python3 check_chembl.py "10.1021/jm401507s"
# Output: ✓ CHEMBL3120156 (101 activities)
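For a batch of DOIs, the per-paper check can be wrapped in a loop that prints one status line each, in the spirit of the workflow reports above. This is a sketch only: `report_batch` and `status_line` are hypothetical names, `check` stands for any callable with the same return shape as `check_chembl`, and the one-second pause is an assumed courtesy delay rather than a documented ChEMBL limit.

```python
import time


def status_line(index, total, doi, result):
    """Format one report line from a check_chembl-style result dict."""
    prefix = f"[{index}/{total}] {doi}:"
    if result is None:
        return f"{prefix} ? ChEMBL query failed"
    if result.get("in_chembl"):
        return (f"{prefix} ✓ ChEMBL: {result['chembl_id']} "
                f"({result['activity_count']} data points)")
    return f"{prefix} ✗ Not in ChEMBL"


def report_batch(dois, check, pause=1.0):
    """Check each DOI in turn, print its status, and return results by DOI."""
    dois = list(dois)
    results = {}
    for i, doi in enumerate(dois, start=1):
        results[doi] = check(doi)
        print(status_line(i, len(dois), doi, results[doi]))
        if i < len(dois):
            time.sleep(pause)  # assumed pause between requests
    return results
```

With check_chembl.py importable, `report_batch(dois, check_chembl)` would report every paper, so no check stays silent.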
- Querying by PMID: Using PMID instead of DOI → Always returns 0 results; ChEMBL only accepts DOI queries
- Skipping ChEMBL check: Not checking medicinal chemistry papers → Missing structured data that's already extracted
- Checking non-medchem papers: Checking genomics/cell biology papers → Wasting time, won't be in ChEMBL
- Not reporting status: Silent ChEMBL checks → User can't see what's happening
- Not adding to SUMMARY.md: Forgetting to include ChEMBL ID → Harder for user to access data later
- Only using ChEMBL: Not fetching full text when paper in ChEMBL → Missing context, methods, discussion
- Parsing PDFs when in ChEMBL: Manually extracting tables when structured data available → Wasting time and introducing errors
| Task | Command |
|------|---------|
| Check if DOI in ChEMBL | curl "https://www.ebi.ac.uk/chembl/api/data/document.json?doi=DOI" |
| Get activity count | curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1" |
| Get all activities | curl "https://www.ebi.ac.uk/chembl/api/data/activity.json?document_chembl_id=ID&limit=1000" |
| Filter by activity type | curl "...activity.json?document_chembl_id=ID&standard_type=MIC" |
| ChEMBL paper page | https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL_ID/ |
Add to .claude/settings.local.json.template:
"Bash(curl*https://www.ebi.ac.uk/chembl/api/data/*)",
"WebFetch(domain:www.ebi.ac.uk)"
ChEMBL check successful when:
After checking ChEMBL:
docs/CHEMBL_INTEGRATION.md