skills/tooluniverse-natural-product-dereplication/SKILL.md
Dereplicate a putative natural product and assign its chemical taxonomy. Use to answer "is [compound] a known natural product", "what microbe/organism produces [compound]", "what chemical class is [compound]", "dereplicate this metabolite (by formula/exact mass/InChIKey/SMILES)", or "classify this molecule into ChemOnt". Searches NPAtlas for known microbial natural products (producing organism + literature reference), assigns the ChemOnt kingdom→superclass→class→subclass hierarchy via ClassyFire, resolves systematic IUPAC names to structure via OPSIN, and cross-references identity in PubChem. NOT for general drug/compound identity or ADMET (use tooluniverse-chemical-compound-retrieval / tooluniverse-small-molecule-discovery) and NOT for metabolomics pathway/enrichment analysis (use tooluniverse-metabolomics skills).
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-natural-product-dereplication
Decide whether a putative natural product is already known, identify the microbe that produces it, attach the literature reference, and assign its ChemOnt chemical class. This is the dereplication question every NP chemist and metabolomics analyst asks of a new feature: "have we seen this before, and what makes it?"
LOOK UP DON'T GUESS: Never assume an NPAID, producing organism, exact mass, or chemical class. Every identity, provenance, and taxonomy claim must come from a live tool call.
Scope (microbial NPs only): NPAtlas covers natural products from bacteria and fungi. It does NOT cover plant, animal, or marine-invertebrate metabolites unless a microbial producer was reported. A "no NPAtlas hit" therefore means not a known microbial NP — it does not prove the molecule is novel in an absolute sense.
| Tool | Input | Returns |
|------|-------|---------|
| NPAtlas_search_compounds | name / inchikey / formula / smiles, limit | list of {npaid, name, molecular_formula, molecular_weight, exact_mass, inchikey, smiles}. (origin_organism is null here — fetch the full record for provenance) |
| NPAtlas_get_compound | npaid (e.g. NPA014588) | full record incl. origin_organism (producing microbe + taxonomic lineage) and origin_reference (title/doi/journal/year) |
| ClassyFire_classify_by_inchikey | inchikey (full 27-char) | ChemOnt kingdom→superclass→class→subclass→direct_parent, molecular_framework, substituents. classified:false if not in cache |
| OPSIN_name_to_structure | name (systematic IUPAC) | smiles / inchi / inchikey. parsed:false for trade/trivial names |
| PubChem_get_CID_by_compound_name | name | {IdentifierList:{CID:[...]}} |
| PubChem_get_compound_properties_by_CID | cid, properties (e.g. ["MolecularFormula","MolecularWeight","InChIKey","IUPACName"]) | property table — use to obtain an InChIKey for arbitrary compounds |
Phase 0: Classify input — name / formula / exact mass / InChIKey / SMILES?
Phase 1: Obtain an InChIKey (the universal key for ClassyFire & precise NPAtlas match)
Phase 2: Dereplicate against NPAtlas (known microbial NP? which organism? which paper?)
Phase 3: Assign ChemOnt chemical class via ClassyFire
Phase 4: Cross-reference identity in PubChem
Phase 5: Report — known/novel call + provenance + class hierarchy + interpretation note
- InChIKey (XXXXXXXXXXXXXX-XXXXXXXXXX-X) → skip to Phase 2; it is already the universal key.
- SMILES → NPAtlas_search_compounds(smiles=...); also feed to PubChem for an InChIKey.
- Systematic IUPAC name (2-acetyloxybenzoic acid) → Phase 1 via OPSIN.
- Trivial/common name (staurosporine, penicillin) → Phase 1 via PubChem (OPSIN will return parsed:false for these).

```python
# `tu` is assumed to be an initialized ToolUniverse client with its tools loaded.
# Systematic IUPAC name → structure (OPSIN). parsed:false ⇒ fall through to PubChem.
op = tu.tools.OPSIN_name_to_structure(name="2-acetyloxybenzoic acid")
inchikey = op["data"]["inchikey"] if op["data"]["parsed"] else None

# Trivial/common name → PubChem CID → properties (incl. InChIKey)
cid = tu.tools.PubChem_get_CID_by_compound_name(name="staurosporine")["data"]["IdentifierList"]["CID"][0]
props = tu.tools.PubChem_get_compound_properties_by_CID(
    cid=cid, properties=["MolecularFormula", "MolecularWeight", "InChIKey", "IUPACName"])
inchikey = props["data"]["PropertyTable"]["Properties"][0]["InChIKey"]
```
The InChIKey is what makes dereplication exact: an InChIKey match is a structure match; a name match is not (synonyms, analogs, and salts share names).
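Because everything downstream keys on the InChIKey, it can help to validate its shape before spending tool calls on it. The following is a minimal sketch, not part of the skill itself: the helper names `is_full_inchikey`, `connectivity_block`, and `same_connectivity` are hypothetical. It relies only on the standard 27-character InChIKey layout (14 letters, 10 letters, 1 letter), in which the first block hashes the connectivity and the later blocks carry stereo and protonation information.

```python
import re

# Standard InChIKey layout: 14 letters - 10 letters - 1 letter (27 characters in total).
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_full_inchikey(text: str) -> bool:
    """True if `text` has the shape of a complete InChIKey (format check only)."""
    return bool(INCHIKEY_RE.match(text.strip()))


def connectivity_block(inchikey: str) -> str:
    """Return the first 14-letter block, which hashes the molecular skeleton."""
    key = inchikey.strip()
    if not is_full_inchikey(key):
        raise ValueError(f"not a full InChIKey: {inchikey!r}")
    return key.split("-")[0]


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True if two keys share a skeleton but may differ in stereo or protonation."""
    return connectivity_block(key_a) == connectivity_block(key_b)
```

A truncated key such as `BSYNRYMUTXBXSQ-...` fails the format check, so it should be completed via OPSIN or PubChem before it is passed to ClassyFire. Two keys that share a connectivity block but are not identical differ only in the stereo or protonation layers.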
Search by the most specific key available. Prefer InChIKey (exact structure), then formula (catches isomers — useful for an MS feature with only a formula), then name (loosest — returns analogs).
```python
# Exact, structure-level
hits = tu.tools.NPAtlas_search_compounds(inchikey="HKSZLNNOFSGOKW-FYTWVXJKSA-N", limit=5)

# MS-feature style (formula or exact mass): expect multiple isomeric hits
hits = tu.tools.NPAtlas_search_compounds(formula="C28H26N4O3", limit=10)
```
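The "most specific key first" rule can be written down as a small helper. This is a sketch under assumptions: `npatlas_search_kwargs` is a hypothetical name, and it only builds the keyword arguments (`inchikey`, `formula`, `name`, `limit`) that the tool table lists for NPAtlas_search_compounds; it does not call the tool.

```python
from typing import Optional


def npatlas_search_kwargs(
    inchikey: Optional[str] = None,
    formula: Optional[str] = None,
    name: Optional[str] = None,
    limit: int = 10,
) -> dict:
    """Pick the most specific available key: InChIKey, then formula, then name."""
    for key, value in (("inchikey", inchikey), ("formula", formula), ("name", name)):
        if value:
            return {key: value, "limit": limit}
    raise ValueError("need at least one of inchikey, formula, or name")
```

The result can be splatted into the call, for example `tu.tools.NPAtlas_search_compounds(**npatlas_search_kwargs(formula="C28H26N4O3"))`. Keeping the choice in one place makes it harder to fall back to a loose name search when a stricter key is already in hand.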
For each candidate NPAID, fetch the full record to get the producing organism and reference (search results carry origin_organism: null):
```python
rec = tu.tools.NPAtlas_get_compound(npaid="NPA014588")["data"]
organism = rec["origin_organism"]["name"]       # e.g. "Streptomyces"
lineage = rec["origin_organism"]["ancestors"]   # domain→...→family
reference = rec["origin_reference"]             # title, doi, journal, year
```
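Since search hits carry `origin_organism: null`, provenance extraction has to tolerate missing fields. A minimal sketch follows, assuming only the field names shown above (`origin_organism.name`, `origin_organism.ancestors`, `origin_reference.doi`, `origin_reference.year`). The helper name `summarize_provenance` is hypothetical, and the shape of the `ancestors` entries is not specified in the source, so they are passed through untouched.

```python
def summarize_provenance(rec: dict) -> dict:
    """Flatten an NPAtlas record into the provenance fields a report needs.

    Works on both search hits (origin_organism is None) and full records.
    """
    organism = rec.get("origin_organism") or {}
    reference = rec.get("origin_reference") or {}
    return {
        "npaid": rec.get("npaid"),
        "organism": organism.get("name"),
        "lineage": organism.get("ancestors") or [],  # passed through as-is
        "doi": reference.get("doi"),
        "year": reference.get("year"),
        # False signals that the full record still has to be fetched.
        "has_provenance": bool(organism.get("name")),
    }
```

When `has_provenance` is false, the record came from the search endpoint, and NPAtlas_get_compound should be called with the `npaid` before any producing-organism claim is made.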
Phase 3: classify the InChIKey with ClassyFire to obtain the ChemOnt hierarchy.

```python
cf = tu.tools.ClassyFire_classify_by_inchikey(inchikey=inchikey)["data"]
# cf["kingdom"], cf["superclass"], cf["class"], cf["subclass"], cf["direct_parent"]
# cf["molecular_framework"], cf["substituents"]
```
If classified:false, the InChIKey is not in the ClassyFire cache — report the class as unavailable (do not invent one). A correct InChIKey is required; a wrong stereo/protonation layer will miss the cache.
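The "report as unavailable, never invent" rule can be enforced in the formatting step. The sketch below is illustrative: `chemont_path` is a hypothetical helper, and because the source does not say whether each hierarchy level is a plain string or an object with a `name` field, the code accepts both as an explicit assumption.

```python
LEVELS = ("kingdom", "superclass", "class", "subclass", "direct_parent")
UNAVAILABLE = "class unavailable (not in ClassyFire cache)"


def _label(node):
    """Accept either a plain string or a dict with a 'name' field (assumed shapes)."""
    if isinstance(node, dict):
        return node.get("name")
    return node


def chemont_path(cf: dict) -> str:
    """Render the ChemOnt hierarchy, or an explicit 'unavailable' marker."""
    if cf.get("classified") is False:
        return UNAVAILABLE
    labels = [_label(cf.get(level)) for level in LEVELS]
    labels = [label for label in labels if label]
    if not labels:
        return UNAVAILABLE
    return " → ".join(labels)
```

Returning a fixed marker string instead of an empty value keeps a missing classification visible in the final report rather than leaving a gap that might be filled by guesswork.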
Confirm the same molecule exists in PubChem (CID, IUPAC name, formula, MW) so the identity is anchored to a second independent database. Disagreement in molecular formula between NPAtlas and PubChem is a red flag that the name/structure resolution went astray.
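The formula cross-check can be made robust to element ordering by comparing element counts rather than raw strings. This is a minimal sketch with hypothetical helper names (`parse_formula`, `formulas_agree`); it handles only simple formulas such as C28H26N4O3 and does not parse parentheses, charges, or isotope labels.

```python
import re
from collections import Counter

# One element symbol (capital letter plus optional lowercase letter) and an optional count.
TOKEN_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms per element in a simple molecular formula."""
    counts = Counter()
    for element, number in TOKEN_RE.findall(formula.strip()):
        counts[element] += int(number) if number else 1
    return counts


def formulas_agree(npatlas_formula: str, pubchem_formula: str) -> bool:
    """True if both databases report the same element counts."""
    return parse_formula(npatlas_formula) == parse_formula(pubchem_formula)
```

A `False` result is the red flag described above: re-check which name or structure was resolved before reporting an identity.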
Deliver the known/novel call, the provenance (producing organism and reference), the ChemOnt class hierarchy, and an interpretation note. When interpreting results:

- A formula-only search is not unique (for example C28H26N4O3, 466.2005 Da, returns both staurosporine and an ardeemin derivative). Treat these as a ranked candidate list, not an identification; confirm with InChIKey, MS/MS, or NMR before claiming identity.
- classified:false = ChemOnt has no cached classification for that exact InChIKey (often because the InChIKey's stereo/charge layer differs from the cached entry, or the compound is new). Report class as unavailable rather than guessing.
- parsed:false = the name was not systematic IUPAC (trade/trivial name); route to PubChem for an InChIKey instead.

Input: staurosporine (a trivial name).
1. OPSIN: name=staurosporine → parsed:false (not systematic IUPAC) → fall through to PubChem.
2. PubChem: CID 44259; properties → MolecularFormula C28H26N4O3, MW 466.5, InChIKey HKSZLNNOFSGOKW-FYTWVXJKSA-N.
3. NPAtlas search: inchikey=HKSZLNNOFSGOKW-FYTWVXJKSA-N → 1 exact hit: NPA014588 Staurosporine, exact_mass 466.2005. → Known microbial NP.
4. NPAtlas get_compound NPA014588 → producing organism Streptomyces (genus; lineage Bacteria → Actinobacteria → Actinobacteria → Streptomycetales → Streptomycetaceae); reference "X-Ray crystal structure of staurosporine: a new alkaloid from a Streptomyces…", DOI 10.1039/C39780000800, 1978.
5. ClassyFire: inchikey=HKSZLNNOFSGOKW-FYTWVXJKSA-N → Organic compounds → Organoheterocyclic compounds → Indoles and derivatives → Carbazoles → direct parent Indolocarbazoles; molecular framework: aromatic heteropolycyclic.

Call: Known microbial natural product (NPA014588), an indolocarbazole alkaloid produced by Streptomyces, first reported 1978.
Dereplication-logic footnote: the formula C28H26N4O3 (exact mass 466.2005) alone is not unique — NPAtlas formula search returns 2 isomers (staurosporine and 5-N-acetyl-15b-didehydroardeemin). The InChIKey is what pins the identity to staurosporine specifically. An MS feature with only this formula would need MS/MS or NMR to choose between the isomers.
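The link between the formula and the quoted exact mass can be checked by summing monoisotopic atomic masses. The sketch below is a worked illustration, not part of the skill: `monoisotopic_mass` and `ppm_error` are hypothetical helpers, the mass table covers only C, H, N, O, and S, and the result is the neutral monoisotopic mass (no adduct or charge correction).

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes; deliberately minimal table.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}
TOKEN_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def monoisotopic_mass(formula: str) -> float:
    """Neutral monoisotopic mass of a simple formula (no parentheses or charges)."""
    total = 0.0
    for element, number in TOKEN_RE.findall(formula.strip()):
        if element not in MONOISOTOPIC:
            raise KeyError(f"element not in the sketch table: {element}")
        total += MONOISOTOPIC[element] * (int(number) if number else 1)
    return total


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


mass = monoisotopic_mass("C28H26N4O3")
print(f"{mass:.4f}")  # 466.2005
```

Both isomers named in the footnote share this mass exactly, which is why no ppm tolerance, however tight, can separate them; only structure-level evidence (InChIKey, MS/MS, NMR) can.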
Input: 2-acetyloxybenzoic acid (a systematic IUPAC name).

1. OPSIN: parsed:true, InChIKey BSYNRYMUTXBXSQ-UHFFFAOYSA-N.
2. NPAtlas search: inchikey=BSYNRYMUTXBXSQ-... → no microbial NP record ⇒ not a known microbial natural product (it is a semisynthetic drug, consistent with NPAtlas scope).

Call: Classified into ChemOnt (acylsalicylic acid), but no NPAtlas microbial-NP provenance, illustrating a legitimate "no hit" that is not a novel NP.
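The two worked examples differ mainly in what the NPAtlas search returned, and that decision can be sketched as a small function. This is an illustrative assumption, not the skill's own logic: `dereplication_call` is a hypothetical name, `hits` is taken to be the list returned by NPAtlas_search_compounds, and the wording of the returned strings is invented for the sketch.

```python
def dereplication_call(search_key: str, hits: list) -> str:
    """Turn an NPAtlas search outcome into a cautious known/novel statement.

    search_key is the kind of key that was searched: 'inchikey', 'formula', or 'name'.
    """
    if not hits:
        # NPAtlas is microbial-only, so an empty result cannot prove novelty.
        return "no NPAtlas record: not a known microbial NP (novelty not proven)"
    if search_key == "inchikey" and len(hits) == 1:
        # An InChIKey match is a structure match.
        return f"known microbial NP ({hits[0]['npaid']})"
    # Formula and name searches, or multiple hits, only yield candidates.
    return f"{len(hits)} candidate(s): confirm by InChIKey, MS/MS, or NMR"
```

Only the single InChIKey hit is reported as an identification; every other non-empty outcome stays a candidate list, mirroring the interpretation rules above.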
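- classified:false means "not cached for this exact InChIKey", not "unclassifiable". Wrong stereo/charge layers miss the cache.
- OPSIN parses only systematic IUPAC names; trade and trivial names return parsed:false; route them through PubChem.
- NPAtlas search results carry origin_organism: null; provenance requires the NPAtlas_get_compound full record.
- Do not invent a chemical class when ClassyFire returns classified:false.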