metabolomics/msdial-preprocessing/SKILL.md
Runs the MS-DIAL preprocessing workflow (peak picking, MS2Dec spectral deconvolution, alignment, gap-filling) and imports the alignment-result table into R or Python with honest filtering. Use when preprocessing LC-MS DDA/DIA (SWATH) raw data with MS-DIAL, deciding MS-DIAL vs XCMS, configuring the MsdialConsoleApp console run, or parsing an MS-DIAL export into a clean feature matrix. For programmatic R peak detection and the feature-table-as-artifact framing see metabolomics/xcms-preprocessing; for lipid annotation mode see metabolomics/lipidomics; for MSI-level confidence honesty see metabolomics/metabolite-annotation; for drift correction and QC see metabolomics/normalization-qc.
Reference examples tested with: MS-DIAL 5.x (LC-MS) / MS-DIAL 4.x (GC-MS), pandas 2.2+, R 4.3+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: run MsdialConsoleApp with no arguments to print the current subcommand/flag list
- R: packageVersion('<pkg>') then ?function_name to verify parameters
- Python: pip show pandas then help(module.function) to check signatures

The MS-DIAL GUI runs only on Windows; the console (MsdialConsoleApp) is the cross-platform headless entry. Which build supports a task is itself a constraint: MS-DIAL 5-alpha covers DI-MS, IM-MS, LC-MS, LC-IM-MS but NOT GC-MS; GC-EI stays in the MS-DIAL 4 lineage. If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt rather than retrying.
"Process my LC-MS run with MS-DIAL and give me a feature table" -> Pick peaks per file, deconvolve chimeric MS/MS into clean component spectra (MS2Dec), align across samples, gap-fill, then import the alignment result and filter it honestly.
- CLI: MsdialConsoleApp lcmsdda|lcmsdia|gcms -i <in> -o <out> -m <param.txt>
- R: read.csv(..., skip = 4, check.names = FALSE) to parse the alignment export
- Python: pandas.read_csv(..., skiprows=4) for the same export

The same raw files through MS-DIAL versus XCMS yield different feature tables and different marker lists. Li 2018 benchmarked five tools on a 1,100-compound standard and found that while feature detection was broadly similar, quantification and the set of selected discriminating markers differed by tool. A metabolomics "hit" is conditional on (raw data + software + version + every parameter + fill/filter order), not on the raw files alone. MS-DIAL's specific differentiator is MS2Dec deconvolution: it reconstructs clean, library-matchable MS/MS spectra from chimeric DDA/DIA fragment data, which is what makes wide-window DIA (SWATH) tractable at all. Report the full processing specification as part of the result, and treat a finding that survives only one pipeline as a candidate, not a result.
| Axis | MS-DIAL | XCMS |
|---|---|---|
| Interface | Windows GUI + cross-platform console | R package (scriptable everywhere) |
| Core differentiator | MS2Dec MS/MS deconvolution (DDA + DIA) | centWave peak picking, full programmatic control |
| Annotation | Built-in (library + MS-FINDER + LipidBlast) | Separate (CAMERA, downstream tools) |
| Lipidomics | Strong (predicted-CCS / EAD structural elucidation in v5) | Manual |
| Reproducibility unit | Param file + GUI choices | Versioned R script |
| Best when | DIA data, lipidomics, GUI workflow, built-in IDs | Scripted pipelines, custom parameters, cohort scale |
Use MS-DIAL when DIA deconvolution or built-in lipid annotation is the point; use metabolomics/xcms-preprocessing for fully scripted, version-pinned cohort processing. The strongest untargeted claims replicate across both.
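Since a result is conditional on software, version, parameters and step order, one way to keep that specification attached to the feature table is a small provenance record. The Python sketch below is illustrative only: `processing_spec`, its field names, the version string and the step labels are hypothetical choices, not part of MS-DIAL.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def processing_spec(param_file, software, version, steps):
    """Collect what a feature table is conditional on into one record."""
    param_path = Path(param_file)
    return {
        "software": software,
        "version": version,
        "param_file": param_path.name,
        # Hash the parameter file so any edited value changes the record.
        "param_sha256": hashlib.sha256(param_path.read_bytes()).hexdigest(),
        # Step order is part of the spec: fill-then-filter differs from filter-then-fill.
        "steps": list(steps),
    }


# Toy parameter file so the sketch runs anywhere; its content is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    param = Path(tmp) / "Msdial-lcms-dda-Param.txt"
    param.write_bytes(b"placeholder parameters\n")
    spec = processing_spec(
        param, "MS-DIAL", "5.x",
        ["peak picking", "MS2Dec", "alignment", "gap-filling", "Fill% filter"],
    )

print(json.dumps(spec, indent=2))
```

Writing this record next to the exported table makes it possible to tell later whether two feature tables came from the same parameter file.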
| Situation | Do | Why |
|---|---|---|
| LC-MS, top-N MS/MS (DDA) | lcmsdda console / GUI LC-MS DDA | Cleaner per-precursor MS2, but intensity-biased, stochastic coverage |
| LC-MS, wide-window MS/MS (DIA / SWATH) | lcmsdia (ABF input only) | Complete MS2 coverage; chimeric spectra REQUIRE MS2Dec to be usable |
| GC-EI run | gcms (MS-DIAL 4 build), or AMDIS/eRah | EI fragments every co-eluting compound; deconvolution IS detection (see below) |
| Headless / Linux cluster | MsdialConsoleApp with a -m param file | GUI is Windows-only; console is the reproducible batch path |
| Lipid-focused study | MS-DIAL + LipidBlast | -> metabolomics/lipidomics for lipid annotation mode |
| Already have an alignment CSV | skip processing, parse + filter | See import + honest-filter sections below |
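The routing in this table can be sketched as a small Python helper that assembles the console command and refuses input the chosen mode does not accept. `build_command` and `ACCEPTED_INPUT` are hypothetical names, the `.cdf` extension for netCDF is an assumption, and GC-MS input formats are not checked because this document does not state them.

```python
from pathlib import Path

# Input formats per analysis token, as described in this document.
# '.cdf' for netCDF is an assumed extension; adjust to your converter's output.
ACCEPTED_INPUT = {
    "lcmsdda": {".cdf", ".mzml", ".abf"},
    "lcmsdia": {".abf"},
}


def build_command(mode, in_dir, out_dir, param_file, input_files=(), keep_project=False):
    """Return the MsdialConsoleApp argument list for one analysis type."""
    if mode not in ("lcmsdda", "lcmsdia", "gcms"):
        raise ValueError(f"unknown analysis token: {mode}")
    allowed = ACCEPTED_INPUT.get(mode)  # None means 'not checked here'
    if allowed is not None:
        bad = [f for f in input_files if Path(f).suffix.lower() not in allowed]
        if bad:
            raise ValueError(f"{mode} does not accept: {bad}")
    cmd = ["MsdialConsoleApp", mode, "-i", str(in_dir), "-o", str(out_dir), "-m", str(param_file)]
    if keep_project:
        cmd.append("-p")  # keep the project so it can be reopened in the GUI
    return cmd


cmd = build_command("lcmsdda", "./LCMS_DDA/", "./LCMS_DDA_out/",
                    "./Msdial-lcms-dda-Param.txt", input_files=["a.mzML", "b.abf"])
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; building it as a list rather than a string avoids shell quoting problems with paths.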
In GC-EI, 70 eV ionization fragments every compound reproducibly, so the trace at any retention time is a superposition of fragments from several co-eluting molecules. Naive peak picking conflates them; deconvolution into component spectra IS the feature-detection step, then each component is matched against EI+RI libraries (NIST, FiehnLib). Cross-run/cross-lab alignment uses retention index (Kovats n-alkanes, or Fiehn FAME markers giving diagnostic m/z 74/87) rather than raw RT, because RT drifts with column aging. MS-DIAL 5-alpha explicitly excludes GC-MS; use the gcms token in an MS-DIAL 4 build, or AMDIS/eRah, for GC-EI work.
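To make the retention-index idea concrete, the Python sketch below converts a retention time to an index by interpolating between the two n-alkanes that bracket it. It uses the linear form common for temperature-programmed runs, RI = 100 * (n + (t - t_n) / (t_{n+1} - t_n)), not the logarithmic isothermal Kovats formula. The function name and the ladder values are made up for illustration.

```python
from bisect import bisect_right


def linear_retention_index(rt, alkanes):
    """Interpolate a retention index from an n-alkane ladder.

    alkanes: list of (carbon_number, retention_time) sorted by retention time.
    """
    rts = [t for _, t in alkanes]
    if not rts[0] <= rt <= rts[-1]:
        raise ValueError("rt lies outside the alkane ladder; cannot interpolate")
    # Index of the alkane eluting at or just before rt (clamped to the last pair).
    i = min(bisect_right(rts, rt) - 1, len(rts) - 2)
    (n, t_n), (m, t_m) = alkanes[i], alkanes[i + 1]
    return 100.0 * (n + (m - n) * (rt - t_n) / (t_m - t_n))


# Toy ladder: C10, C11, C12 at invented retention times (minutes).
ladder = [(10, 5.0), (11, 7.0), (12, 10.0)]
ri = linear_retention_index(6.0, ladder)
print(ri)
```

Because the index is anchored to the markers run on the same column, two runs with shifted raw RT can still agree on the index as long as the ladder shifts with the analytes.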
Goal: Process a folder of converted spectra into an alignment table without the GUI.
Approach: Pick the analysis-type token, point -i/-o/-m at input dir, output dir, and a method (parameter) file; keep -p only if the project should reopen in the GUI.
# DDA LC-MS: accepts netCDF/mzML/ABF. Output is *.msdial in the output dir.
MsdialConsoleApp lcmsdda -i ./LCMS_DDA/ -o ./LCMS_DDA_out/ -m ./Msdial-lcms-dda-Param.txt
# DIA/SWATH LC-MS: accepts ABF ONLY (convert vendor raw -> ABF first). MS2Dec is the point.
MsdialConsoleApp lcmsdia -i ./LCMS_DIA/ -o ./LCMS_DIA_out/ -m ./Msdial-lcms-dia-Param.txt
# GC-EI (MS-DIAL 4 build): retention-index alignment, quant-mass quantification.
MsdialConsoleApp gcms -i ./GCMS/ -o ./GCMS_out/ -m ./Msdial-GCMS-Param.txt -p
The parameter file is plain text with one setting per line, written as Key: Value (confirm against the template shipped with your build). The Minimum peak height key is the direct analog of an intensity floor and is instrument-dependent: the GUI default is tuned for a TOF and is often far too high, or built on the wrong baseline assumption, for an Orbitrap. Set the alignment reference to a pooled QC, never to file #1 by default.
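Because the parameter file is plain key/value text, an instrument-specific override can be scripted instead of hand-edited. The Python sketch below replaces one value and keeps everything else untouched. `set_param` is a hypothetical helper, the separator is matched as either `:` or `=` because templates may differ, and the toy text and the values 1000 and 300 are placeholders, not recommended settings.

```python
import re


def set_param(text, key, value):
    """Replace the value of `key` in key/value parameter text, keeping the separator."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*[:=][ \t]*)[^\r\n]*",
        re.MULTILINE,
    )
    new_text, n = pattern.subn(lambda m: m.group(1) + str(value), text)
    if n != 1:
        # Fail loudly: a silently ignored override is worse than an error.
        raise KeyError(f"expected exactly one '{key}' line, found {n}")
    return new_text


# Toy template; 'Example key' is invented, 'Minimum peak height' is the key named above.
template = "Example key: 1\nMinimum peak height: 1000\n"
edited = set_param(template, "Minimum peak height", 300)
print(edited)
```

Writing the edited text to a new file per instrument, rather than overwriting the template, keeps each run's parameters recoverable.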
Goal: Split the MS-DIAL alignment export into a feature-metadata frame and an intensity matrix.
Approach: The export carries four header rows above the real column header (sample class / file type / injection order / batch), so skip them; metadata columns precede the per-sample Area columns.
# MS-DIAL alignment export: real column header is on row 5, so skip the first 4 rows.
msdial <- read.csv('AlignResult.txt', sep = '\t', skip = 4, check.names = FALSE)
# Metadata columns appear before the per-sample intensity columns. Common ones:
# 'Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name', 'Adduct type',
# 'Fill %', 'MS/MS assigned', 'Reference RT', 'Formula', 'Ontology', 'INCHIKEY',
# 'SMILES', 'Annotation tag (VS1.0)'. Sample columns follow the last metadata column;
# confirm with colnames(msdial), since an export can carry metadata columns not listed here.
meta_cols <- c('Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name',
               'Adduct type', 'Fill %', 'MS/MS assigned', 'Annotation tag (VS1.0)')
meta_cols <- intersect(meta_cols, colnames(msdial))
sample_cols <- setdiff(colnames(msdial), colnames(msdial)[seq_len(max(match(meta_cols, colnames(msdial))))])
feature_info <- msdial[, meta_cols]
intensity <- as.matrix(msdial[, sample_cols])
rownames(intensity) <- msdial[['Alignment ID']]
Goal: Same split, in pandas.
Approach: skiprows=4 to land on the real header; slice metadata vs sample columns by position after the last known metadata column.
import pandas as pd
msdial = pd.read_csv('AlignResult.txt', sep='\t', skiprows=4)
meta_cols = ['Alignment ID', 'Average Rt(min)', 'Average Mz', 'Metabolite name', 'Adduct type', 'Fill %', 'MS/MS assigned', 'Annotation tag (VS1.0)']
meta_cols = [c for c in meta_cols if c in msdial.columns]
last_meta = max(msdial.columns.get_loc(c) for c in meta_cols)
sample_cols = msdial.columns[last_meta + 1:]
feature_info = msdial[meta_cols].copy()
intensity = msdial[sample_cols].copy()
intensity.index = msdial['Alignment ID']
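The four rows skipped on import are not noise: they hold sample class, file type, injection order and batch, which later blank and QC filters need. The Python sketch below recovers them as a per-sample table. It assumes the layout described above (four rows, then the column header on row 5, with values sitting above each sample column); `read_sample_sheet` and the toy export are hypothetical.

```python
import io

import pandas as pd


def read_sample_sheet(buf, sample_cols, n_header_rows=4):
    """Build a per-sample table from the rows above the real column header."""
    head = pd.read_csv(buf, sep="\t", header=None, nrows=n_header_rows + 1, dtype=str)
    names = head.iloc[n_header_rows].tolist()   # the real header (row 5)
    head = head.iloc[:n_header_rows]
    head.columns = names
    sheet = head[list(sample_cols)].T           # one row per sample
    sheet.columns = ["class", "file_type", "injection_order", "batch"]
    sheet["injection_order"] = pd.to_numeric(sheet["injection_order"], errors="coerce")
    return sheet


# Toy export: three metadata columns, then three sample columns (all values invented).
toy = (
    "\t\tClass\tQC\tcase\tcontrol\n"
    "\t\tFile type\tQC\tSample\tSample\n"
    "\t\tInjection order\t1\t2\t3\n"
    "\t\tBatch ID\t1\t1\t1\n"
    "Alignment ID\tAverage Mz\tFill %\tqc1\ts1\ts2\n"
    "0\t100.1\t1\t10\t20\t30\n"
)
sheet = read_sample_sheet(io.StringIO(toy), ["qc1", "s1", "s2"])
print(sheet)
```

Any column whose header-row cells are empty is probably not a sample, which gives a cross-check on the positional slice used to pick sample columns.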
Goal: Keep features supported by real signal and known confidence, without overtrusting annotation tags.
Approach: Filter on Fill% (cross-sample presence), require MS/MS support for any feature called identified, and tie the annotation tag to a real MSI confidence level rather than treating a name as proof.
# Fill% is the fraction of samples with a DETECTED (not gap-filled) peak. Low Fill% means
# the feature exists mostly as gap-filled noise-floor integrals, which fabricate intensity
# (an honest 'below detection' becomes a positive number). 70% is a common floor.
keep_fill <- feature_info[['Fill %']] >= 70
# An annotated name without MS/MS is at best an MSI Level 3 putative ID (accurate mass
# only). Require 'MS/MS assigned == TRUE' before trusting any identity downstream.
# as.character() keeps the test valid whether the column parsed as logical or as text.
has_msms <- toupper(as.character(feature_info[['MS/MS assigned']])) == 'TRUE'
# Annotation tag confidence (do NOT treat a name as an identification). The exact tag
# vocabulary is MS-DIAL-version-dependent, so inspect unique(feature_info[['Annotation tag (VS1.0)']])
# and map the strings the build actually emits rather than hard-coding them:
# Metabolite / Lipid with MS/MS -> MSI Level 2 (spectral library match)
# Suggested* mass-only -> MSI Level 3 (putative, no MS/MS)
# Unknown -> unannotated feature
feature_info$msi_level <- ifelse(feature_info[['Annotation tag (VS1.0)']] %in% c('Metabolite', 'Lipid') & has_msms, 2,
ifelse(grepl('^Suggested', feature_info[['Annotation tag (VS1.0)']]), 3, NA))
filtered <- intensity[keep_fill, ]
Confidence-level honesty and orthogonal-evidence identification belong to metabolomics/metabolite-annotation; this skill only routes the tag to the right level. Fill% / blank / drift filtering interacts with normalization-qc - process blanks and pooled QCs through the SAME run, then filter the aligned table.
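For pandas users, the same filter can be sketched as below. It mirrors the R logic with two hedges that are assumptions of this sketch, not documented MS-DIAL behavior: the Fill % scale is inferred (values above 1 are treated as 0-100, otherwise as a 0-1 fraction), and the annotation-tag column is located by prefix. The tag strings follow the mapping listed above and should be checked against what the installed build emits. `honest_filter` is a hypothetical name.

```python
import pandas as pd


def honest_filter(feature_info, intensity, min_fill=0.70):
    """Keep features with enough detected peaks and attach an MSI level."""
    fill = pd.to_numeric(feature_info["Fill %"], errors="coerce")
    if fill.max() > 1:  # assumption: values above 1 mean a 0-100 scale
        fill = fill / 100.0
    keep = (fill >= min_fill).to_numpy()

    # Works whether the column parsed as booleans or as 'True'/'TRUE' strings.
    has_msms = feature_info["MS/MS assigned"].astype(str).str.upper().eq("TRUE")

    # Locate the tag column by prefix; the suffix changes across versions.
    tag_col = next(c for c in feature_info.columns if c.startswith("Annotation tag"))
    tag = feature_info[tag_col].astype(str)

    msi = pd.Series(pd.NA, index=feature_info.index, dtype="Int64")
    msi[tag.str.startswith("Suggested")] = 3                  # mass-only, putative
    msi[tag.isin(["Metabolite", "Lipid"]) & has_msms] = 2     # library match with MS/MS

    return feature_info.assign(msi_level=msi).loc[keep], intensity.loc[keep]


# Toy table; the tag strings are illustrative stand-ins.
feature_info = pd.DataFrame({
    "Alignment ID": [0, 1, 2],
    "Fill %": [100, 50, 80],
    "MS/MS assigned": [True, False, False],
    "Annotation tag (VS1.0)": ["Metabolite", "Suggested (toy)", "Unknown"],
})
intensity = pd.DataFrame({"s1": [1.0, 2.0, 3.0], "s2": [4.0, 5.0, 6.0]}, index=[0, 1, 2])
kept_info, kept_intensity = honest_filter(feature_info, intensity)
print(kept_info[["Alignment ID", "msi_level"]])
```

Features with a missing `msi_level` stay in the matrix as unannotated signals; the level only restricts what may be called an identification.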
- Processing DIA/SWATH data with lcmsdda. lcmsdda does not deconvolve wide-isolation chimeric MS/MS, so fragments from co-isolated precursors stay mixed. Use lcmsdia (ABF input only); MS2Dec deconvolution is the entire reason to run DIA in MS-DIAL.
- Filtering on Annotation tag != Unknown and calling the survivors "identified." A Suggested* tag is an accurate-mass guess with no MS/MS; a named hit without MS/MS is MSI Level 3. Require MS/MS assigned == TRUE for any identity claim; map tags to MSI levels (see filtering section) and defer to metabolomics/metabolite-annotation.
- Running GC-EI data through MS-DIAL 5. Use the gcms token in an MS-DIAL 4 build (or AMDIS/eRah); align on Kovats/FAME retention index.

| Threshold | Source | Rationale |
|---|---|---|
| Fill% >= 70% | Common untargeted practice | Below this, the feature is mostly gap-filled noise-floor integrals, not measurements |
| QC CV (RSD) < 20-30% | Broadhurst 2018 | Technical reproducibility floor; drop features noisier than this in pooled QCs |
| D-ratio (sd_QC/sd_sample) < 0.5 | Broadhurst 2018 | Keeps features whose technical variance is well below biological variance |
| Blank filter: sample mean > 3-5x blank mean | Broadhurst 2018 | Removes background/contaminant features present in process blanks |
| ~10x more features than compounds | Mahieu 2017 | One metabolite makes adducts/isotopes/fragments; counting features over-counts hypotheses |
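The QC thresholds above translate directly into per-feature statistics. The Python sketch below computes QC RSD, D-ratio and the sample-to-blank ratio from an intensity matrix, then combines them into a pass flag. `qc_flags`, the column groupings and the toy numbers are hypothetical; the default cut-offs are the looser ends of the ranges quoted in the table and should be tuned per study.

```python
import pandas as pd


def qc_flags(intensity, qc_cols, sample_cols, blank_cols,
             max_rsd=0.30, max_d_ratio=0.5, min_blank_ratio=3.0):
    """Per-feature QC statistics (rows = features, columns = injections)."""
    qc, smp, blk = intensity[qc_cols], intensity[sample_cols], intensity[blank_cols]
    out = pd.DataFrame(index=intensity.index)
    out["qc_rsd"] = qc.std(axis=1) / qc.mean(axis=1)         # technical CV in pooled QCs
    out["d_ratio"] = qc.std(axis=1) / smp.std(axis=1)        # technical vs biological spread
    out["blank_ratio"] = smp.mean(axis=1) / blk.mean(axis=1)  # signal over background
    out["pass"] = (
        (out["qc_rsd"] < max_rsd)
        & (out["d_ratio"] < max_d_ratio)
        & (out["blank_ratio"] > min_blank_ratio)
    )
    return out


# Toy matrix: f1 is stable and well above blank, f2 is noisy and blank-level.
toy_intensity = pd.DataFrame(
    {
        "qc1": [100.0, 10.0], "qc2": [105.0, 100.0], "qc3": [95.0, 190.0],
        "s1": [50.0, 100.0], "s2": [150.0, 110.0], "s3": [250.0, 90.0],
        "b1": [1.0, 90.0], "b2": [3.0, 110.0],
    },
    index=["f1", "f2"],
)
flags = qc_flags(toy_intensity, ["qc1", "qc2", "qc3"], ["s1", "s2", "s3"], ["b1", "b2"])
print(flags)
```

Keeping the three statistics as columns, not only the combined flag, lets a reader see which criterion removed a feature.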
| Error / symptom | Cause | Solution |
|---|---|---|
| All columns land in one field on import | Header offset wrong; tab-separated export read as CSV | skip=4 (R) / skiprows=4 (Python), set sep='\t' |
| lcmsdia rejects mzML input | DIA mode accepts ABF only | Convert vendor raw to ABF (Reifycs ABF converter) before lcmsdia |
| Annotation tag column not found | Header changes across versions (e.g. Annotation tag (VS1.0)) | Match by prefix / inspect colnames(); do not hard-code the suffix |
| No GC-MS option in MS-DIAL 5 | 5-alpha excludes GC-MS | Use the gcms token in an MS-DIAL 4 build, or AMDIS/eRah |
| Console command not found on Linux | Expecting the GUI executable | The GUI is Windows-only; run MsdialConsoleApp (cross-platform) |
| Few features detected | Minimum peak height default too high for the instrument | Lower it toward the real baseline; defaults are TOF-tuned |
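The first row of this table, a wrong separator or header offset, can be caught at load time instead of surfacing later as a strange matrix. The Python sketch below wraps the import in two sanity checks. `load_alignment` is a hypothetical helper and the toy text stands in for a real export.

```python
import io

import pandas as pd


def load_alignment(buf, skiprows=4):
    """Read an alignment export and fail early on the two common import mistakes."""
    df = pd.read_csv(buf, sep="\t", skiprows=skiprows)
    if df.shape[1] == 1:
        raise ValueError("only one column parsed: the file is probably not tab-separated")
    if "Alignment ID" not in df.columns:
        raise ValueError("'Alignment ID' not in header: check the skiprows offset")
    return df


# Toy export with four rows above the real header (all values invented).
good = (
    "\t\tClass\tcase\n"
    "\t\tFile type\tSample\n"
    "\t\tInjection order\t1\n"
    "\t\tBatch ID\t1\n"
    "Alignment ID\tAverage Mz\tFill %\ts1\n"
    "0\t100.1\t1\t10\n"
)
table = load_alignment(io.StringIO(good))
print(table.columns.tolist())
```

Raising at import keeps a mis-parsed table from reaching the filtering step, where the error would be much harder to trace.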