metabolomics/metabolite-annotation/SKILL.md
Turns untargeted LC-MS/MS features (m/z, RT, MS/MS) into confidence-stratified metabolite annotations using spectral-library matching (matchms), in-silico tools (SIRIUS/CSI:FingerID, MetFrag) and molecular networking, and assigns a defensible MSI/Schymanski confidence level to each. Use when naming detected features, scoring MS/MS against a reference library, running SIRIUS, or deciding what confidence level an evidence set actually supports. For upstream feature extraction see metabolomics/xcms-preprocessing and metabolomics/msdial-preprocessing; for downstream enrichment that must respect these levels see metabolomics/pathway-mapping; for lipid-specific structural annotation see metabolomics/lipidomics.
npx skillsauth add GPTomics/bioSkills bio-metabolomics-metabolite-annotation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Reference examples tested with: matchms 0.33+, SIRIUS 6.x, MetFrag 2.5+
Before using code patterns, verify installed versions match. If versions differ:
- pip show <package> then help(module.function) to check signatures
- <tool> --version then <tool> --help to confirm flags

Spectral matching needs precursor m/z on every MS/MS spectrum (add_precursor_mz filter) or ModifiedCosine silently returns zeros. Level 1 needs an authentic standard run in the same lab under the same method; no software output can substitute for it.
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
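The version and signature checks described above can also be scripted. This is a minimal sketch using only the Python standard library; the helper names `installed_version` and `describe_callable` are hypothetical and not part of matchms or any tool named here.

```python
import importlib
import inspect
from importlib import metadata


def installed_version(package):
    """Return the installed distribution version, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_callable(module_name, attr_name):
    """Return 'module.attr(signature)' if the attribute exists, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    obj = getattr(module, attr_name, None)
    if obj is None:
        return None
    try:
        return f"{module_name}.{attr_name}{inspect.signature(obj)}"
    except (TypeError, ValueError):  # some builtins expose no signature
        return f"{module_name}.{attr_name}(...)"


# matchms may or may not be installed; None means "not found", not an error.
print(installed_version("matchms"))
# A stdlib function stands in here for a real target such as a matchms filter.
print(describe_callable("json", "dumps"))
```

A `None` from either helper is the cue to adapt the example to the installed API rather than retry it unchanged.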
"Annotate my metabolomics features with compound identities" -> Map each feature's m/z and MS/MS to candidate structures, then attach an explicit confidence level to every name.
- matchms.calculate_scores() for library matching (matchms)
- sirius ... formulas fingerprints structures canopus for in-silico formula/structure/class (SIRIUS)

A metabolite name without a stated MSI/Schymanski level is scientifically incomplete. The inference chain m/z -> formula -> structure -> isomer-resolved identity is three separate lossy steps, each needing its own orthogonal evidence axis. A database hit supplies a name, not evidence: with no MS/MS or RT to back it, it is Schymanski Level 4 (formula) at best, often Level 5 (a feature of interest). A high cosine score ranks candidates; it never proves one. Only an in-house authentic standard, same method, with MS, MS/MS, and RT all matching reaches Level 1 ("identification") -- everything else is an honest hypothesis. The field's recurring sin is laundering Level 2/3 hypotheses into Level-1 prose; the canonical worked example is phenylacetylglutamine being reported as phenylacetylglycine in nearly half of NMR studies (Theodoridis 2023). Assign the lowest level the evidence honestly supports and report which database/version was searched.
| Schymanski | MSI | Name | Evidence required |
|---|---|---|---|
| Level 1 | 1 | Confirmed structure | In-house authentic standard, same method: MS + MS/MS + RT all match. The only "identification". |
| Level 2a | 2 | Probable structure (library) | MS/MS matches a reference library spectrum; no in-house standard. |
| Level 2b | 2 | Probable structure (diagnostic) | Diagnostic fragments / RT / ionization consistent with exactly one structure; no reference spectrum. |
| Level 3 | 3 | Tentative candidate(s) | Evidence narrows to a structure class or candidate set but isomers remain unresolved. |
| Level 4 | -- | Unequivocal formula | MS1 accurate mass + isotope pattern + adduct logic assign one formula; no structure. |
| Level 5 | 4 | Exact mass | A feature of interest; nothing assigned. |
Promote one level per orthogonal evidence axis that survives scrutiny; cap at Level 2 unless an in-house standard exists. CSI:FingerID and library matching recover constitution only -- no stereochemistry, so enantiomer/regiochemistry claims cannot come from MS/MS.
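The "adduct logic" behind a Level 4 formula can be illustrated with a small check: two co-eluting positive-mode m/z values support each other when different adduct assumptions lead to the same neutral mass. This is a minimal sketch, not taken from any tool named here. The adduct shifts are standard monoisotopic values, but the choice of three adducts, the 5 ppm window, and the function names are assumptions for illustration.

```python
from itertools import combinations, product

# Monoisotopic m/z shift (Da) of common singly charged positive adducts.
# Illustrative subset only; real adduct lists are longer.
POSITIVE_ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+NH4]+": 18.033823,
}


def neutral_mass(mz, adduct):
    """Neutral monoisotopic mass implied by an m/z under one adduct."""
    return mz - POSITIVE_ADDUCT_SHIFT[adduct]


def shared_neutral_mass(mz_values, ppm=5.0):
    """List (mz_a, adduct_a, mz_b, adduct_b, mass) tuples where two m/z
    values imply the same neutral mass under different adducts."""
    hits = []
    for mz_a, mz_b in combinations(mz_values, 2):
        for add_a, add_b in product(POSITIVE_ADDUCT_SHIFT, repeat=2):
            if add_a == add_b:
                continue
            mass_a = neutral_mass(mz_a, add_a)
            mass_b = neutral_mass(mz_b, add_b)
            if abs(mass_a - mass_b) <= ppm * 1e-6 * abs(mass_a):
                hits.append((mz_a, add_a, mz_b, add_b, round(mass_a, 4)))
    return hits


# Glucose (C6H12O6) as [M+H]+ and [M+Na]+; the values assume co-elution.
for hit in shared_neutral_mass([181.070664, 203.052606]):
    print(hit)
```

A consistent pair narrows the neutral mass; it still says nothing about structure, so the result stays at Level 4 at best.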
| Tool | Core idea | Output | Best for |
|---|---|---|---|
| matchms (CosineGreedy / ModifiedCosine / spectral entropy) | Score query MS/MS against library spectra | Ranked library hits + matched-peak count | Level 2a when a library spectrum exists |
| SIRIUS + ZODIAC | Fragmentation trees + isotope pattern, dataset-wide formula re-ranking | Ranked molecular formula | Formula (Level 4); the reliable part of SIRIUS |
| CSI:FingerID + COSMIC | Predict fingerprint, search structure DB, calibrated confidence | Ranked structures + FDR-controllable score | Level 2b/3 structure when COSMIC FDR is set |
| CANOPUS | Predict compound class directly from MS2 | ClassyFire + NPClassifier class | Level 3 class for unknowns; often the most honest output |
| MetFrag | Bond-disconnection scoring of candidate list | Explainable fragment-supported ranks | Transparent, scriptable, custom DBs, RT term |
| FBMN (GNPS2) + MS2Query | Modified-cosine network / ML analogue search | Edges = "related to" | Analogue propagation (Level 3 scaffold hypothesis) |
| Situation | Do | Achievable level |
|---|---|---|
| In-house authentic standard, same method, MS+MS/MS+RT match | Confirm against standard | Level 1 |
| MS/MS available, library spectrum likely exists | matchms library match (entropy or modified cosine) | Level 2a |
| MS/MS available, no library spectrum | SIRIUS formulas + CSI:FingerID + CANOPUS, or MetFrag | Level 2b/3 (formula Level 4) |
| Need class only / compound absent from all DBs | CANOPUS (class); MSNovelist (de novo SMILES) | Level 3 |
| Find analogues / propagate across a network | FBMN on GNPS2 + MS2Query | Level 3 (scaffold hypothesis) |
| Only MS1 m/z + isotopes + clean adduct | Formula assignment (SIRIUS / seven golden rules) | Level 4 |
| Bare m/z, no orthogonal evidence | Report as a feature | Level 5 |
| Biology hinges on a specific isomer / stereocenter | Demand a standard or orthogonal method (NMR, chiral assay) | MS alone insufficient |
Goal: Rank library candidates for each query spectrum and attach the matched-peak count, not just the score.
Approach: Harmonize metadata, normalize intensities, add precursor m/z, score with ModifiedCosine (analogue-aware) or spectral entropy (identity), then keep only hits above both a score and a matched-peak floor.
from matchms import calculate_scores
from matchms.filtering import default_filters, normalize_intensities, add_precursor_mz

try:
    from matchms.similarity import ModifiedCosineGreedy as ModifiedCosine  # matchms 0.33+
except ImportError:
    from matchms.similarity import ModifiedCosine  # matchms <= 0.32


def prepare(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = add_precursor_mz(spectrum)  # required for ModifiedCosine or scores are zero
    return normalize_intensities(spectrum)


# queries_raw / references_raw: lists of matchms Spectrum objects loaded beforehand.
# Filters can return None for unusable spectra, so drop those before scoring.
queries = [s for s in map(prepare, queries_raw) if s is not None]
references = [s for s in map(prepare, references_raw) if s is not None]
scores = calculate_scores(references, queries, ModifiedCosine(tolerance=0.005))

# CosineGreedy/ModifiedCosine return a structured array; the field names are
# class-prefixed and version-dependent (e.g. 'ModifiedCosineGreedy_score' in 0.33),
# so derive them from the dtype rather than hard-coding.
for query in queries:
    pairs = scores.scores_by_query(query)
    if not pairs:  # nothing was scored against this query
        continue
    score_field, match_field = pairs[0][1].dtype.names
    ref, hit = max(pairs, key=lambda pair: pair[1][score_field])
    if hit[score_field] >= 0.7 and hit[match_field] >= 6:  # score floor + peak-count floor (GNPS defaults)
        print(ref.get('compound_name'), hit[score_field], hit[match_field])  # Level 2a candidate
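Spectral entropy is named above as the identity-oriented alternative to cosine. The following is a minimal pure-Python sketch of the unweighted entropy similarity, assuming centroided (m/z, intensity) peak lists. It is not the matchms implementation and it omits the intensity re-weighting step of the published method. The function names, the greedy nearest-peak matching, and the 0.01 Da tolerance are illustrative assumptions.

```python
import math


def _normalize(peaks):
    """Sort peaks by m/z and scale positive intensities to sum to 1."""
    kept = sorted((mz, inten) for mz, inten in peaks if inten > 0)
    total = sum(inten for _, inten in kept)
    return [(mz, inten / total) for mz, inten in kept]


def _entropy(probabilities):
    """Shannon entropy (natural log) of a normalized intensity distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def entropy_similarity(peaks_a, peaks_b, tolerance=0.01):
    """Unweighted entropy similarity: 1 - (2*S_AB - S_A - S_B) / ln(4)."""
    a, b = _normalize(peaks_a), _normalize(peaks_b)
    if not a or not b:
        return 0.0
    merged, used_b = [], set()
    for mz_a, int_a in a:
        # Greedily pair each peak in A with the closest unused peak in B.
        best = None
        for j, (mz_b, _) in enumerate(b):
            delta = abs(mz_a - mz_b)
            if j not in used_b and delta <= tolerance:
                if best is None or delta < best[0]:
                    best = (delta, j)
        if best is None:
            merged.append(int_a / 2)
        else:
            used_b.add(best[1])
            merged.append((int_a + b[best[1]][1]) / 2)
    # Unmatched B peaks enter the 1:1 mixed spectrum at half weight.
    merged.extend(int_b / 2 for j, (_, int_b) in enumerate(b) if j not in used_b)
    s_ab = _entropy(merged)
    s_a = _entropy(inten for _, inten in a)
    s_b = _entropy(inten for _, inten in b)
    return 1 - (2 * s_ab - s_a - s_b) / math.log(4)


query = [(100.0, 50.0), (150.0, 30.0), (200.0, 20.0)]
library = [(100.0, 45.0), (150.0, 35.0), (250.0, 20.0)]
print(entropy_similarity(query, library))
```

Identical spectra score 1 and spectra with no shared peaks score 0. As with cosine, the score ranks candidates and should be read alongside the matched-peak count.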
Goal: Annotate features that have no library spectrum, reporting formula and class with more trust than top-1 structure.
Approach: Run the SIRIUS subcommand chain on one project space; trust ZODIAC-refined formula over CSI:FingerID structure, and only report a structure as confident when a COSMIC FDR threshold is set.
# SIRIUS 6 is a multi-command pipeline on one line. A free academic account/license
# is required (since v5); log in once, then the project space persists across runs.
# Credential flags vary by version; run `sirius login --help` to confirm (commonly `-u <email>`).
sirius login -u "$SIRIUS_USER"
sirius --input features.mgf --project ./sirius_project \
  formulas --profile orbitrap \
  fingerprints \
  structures --database bio \
  canopus \
  write-summaries --output ./sirius_summary
# Verify exact subcommand spelling with `sirius <command> --help`: formulas/fingerprints/
# structures/canopus changed plural/singular and options between v5 and v6.
# --database (on structures) is a scientific choice: 'bio' raises plausibility but cannot
# return a novel metabolite; 'pubchem' maximizes recall but floods implausible isomers.
Goal: Collapse a feature's evidence set into a single defensible confidence level.
Approach: Start at Level 5 and promote per surviving orthogonal axis; an authentic standard is the only path to Level 1.
def assign_level(evidence):
    if evidence.get('authentic_standard_same_method'):
        return 1
    if evidence.get('library_match') and evidence['library_match']['score'] >= 0.7 and evidence['library_match']['matches'] >= 6:
        return '2a'  # reference library spectrum, no in-house standard
    if evidence.get('diagnostic_fragments') and evidence.get('single_structure_consistent'):
        return '2b'
    if evidence.get('candidate_set') or evidence.get('canopus_class') or evidence.get('network_propagated'):
        return 3  # isomers unresolved, class only, or "related to" an annotated node
    if evidence.get('unambiguous_formula'):
        return 4  # MS1 + isotopes + adduct logic, no structure
    return 5
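Once a level is assigned, it helps to make the level and the searched database travel with the name, so a bare name cannot be reported by accident. This is a minimal sketch of such a record. The `Annotation` class, its field names, and the "ExampleLib" / "v0" database strings are hypothetical placeholders, not part of any tool named here.

```python
from dataclasses import dataclass

VALID_LEVELS = {"1", "2a", "2b", "3", "4", "5"}


@dataclass
class Annotation:
    feature_id: str
    name: str
    level: str  # Schymanski level; ints such as 3 are accepted and stored as "3"
    database: str
    database_version: str

    def __post_init__(self):
        self.level = str(self.level)
        if self.level not in VALID_LEVELS:
            raise ValueError(f"unknown confidence level: {self.level!r}")
        if not self.database or not self.database_version:
            raise ValueError("database and version must be stated")

    def report(self):
        """One-line report that always carries the level and the source."""
        return (f"{self.feature_id}: {self.name} "
                f"[Schymanski Level {self.level}; "
                f"{self.database} {self.database_version}]")


# Placeholder values for illustration only.
hit = Annotation("F001", "phenylacetylglutamine", "2a", "ExampleLib", "v0")
print(hit.report())
```

Rejecting an unqualified "2" is deliberate in this sketch: the 2a/2b split records whether the evidence was a library spectrum or diagnostic fragments.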
| Threshold | Source | Rationale |
|---|---|---|
| Cosine/modified-cosine >= 0.7 AND >= 6 matched peaks | GNPS defaults (Wang 2016) | Suppresses promiscuous low-complexity spectra that hairball the network. |
| Spectral entropy >= 0.75 -> FDR < 10% | Li 2021 (natural-products benchmark) | Dataset-dependent, NOT a universal constant; entropy beats dot product for identity. |
| MS1 mass error <= 5 ppm (HRMS) | HRMS convention | Tighter than the 10 ppm older default; pairs with isotope-pattern filter. |
| Isotope-pattern ~2% abundance accuracy | Kind & Fiehn 2006 | Removes >95% of false formula candidates even at 3 ppm -- orthogonal info, not better mass accuracy, fixes formula. |
| COSMIC 0.94 / 0.64 / 0.34 ~ 5 / 10 / 20% FDR | Hoffmann 2022 | Calibrated confidence on CSI:FingerID structures; raw top-1 with no COSMIC is Level 3. |
| Predicted CCS within ~3-5% of measured | AllCCS / IMS benchmarks (Zhou 2020) | Use CCS as a falsifier (rejects candidates), not as positive proof of identity. |
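Three of the thresholds above reduce to simple arithmetic. This is a minimal sketch with hypothetical helper names; the cut-offs are copied from the table, and the 5% CCS window is one choice within the quoted ~3-5% range.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# COSMIC score cut-offs and their approximate FDR, as listed in the table.
COSMIC_FDR = [(0.94, 0.05), (0.64, 0.10), (0.34, 0.20)]


def cosmic_fdr_bin(score):
    """Tightest approximate FDR bin a COSMIC score reaches, or None."""
    for cutoff, fdr in COSMIC_FDR:
        if score >= cutoff:
            return fdr
    return None


def ccs_rejects(measured_ccs, predicted_ccs, max_rel_dev=0.05):
    """True when predicted CCS deviates enough to reject the candidate.

    A False result does not confirm the candidate; it only fails to reject it.
    """
    return abs(predicted_ccs - measured_ccs) / measured_ccs > max_rel_dev


print(abs(ppm_error(181.0712, 181.070664)) <= 5.0)  # within the 5 ppm window?
print(cosmic_fdr_bin(0.70))
print(ccs_rejects(measured_ccs=200.0, predicted_ccs=212.0))
```

Like the table itself, these cut-offs are dataset-dependent starting points rather than universal constants.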
| Error / symptom | Cause | Solution |
|---|---|---|
| ModifiedCosine scores all zero | Missing precursor m/z on spectra | Apply add_precursor_mz filter to both references and queries first. |
| AttributeError: 'Scores' has no attribute 'scores' | Indexing scores.scores[...] (old tutorials) | Use scores.scores_by_query(query) or scores.to_array(name=...). |
| ValueError: no field of name <X>_score | Field names are class-prefixed and version-dependent | Read pair[1].dtype.names for the score/matches field names rather than hard-coding. |
| ImportError: cannot import name 'ModifiedCosine' | Renamed to ModifiedCosineGreedy in matchms 0.33 | Try the new name with an ImportError fallback to the old. |
| sirius formula not found | v5 used singular subcommands; v6 uses formulas | Run sirius --help; verify plural/singular per installed version. |
| SIRIUS exits at login | Account/license required since v5 | sirius login once with a free academic account before the chain. |
| Pathway enrichment lights up everywhere | Ambiguous features mapped to many DB IDs | Collapse ion families and carry levels into enrichment (metabolomics/pathway-mapping). |