proteomics/data-import/SKILL.md
Load and parse mass spectrometry data formats including mzML, mzXML, and quantification tool outputs like MaxQuant proteinGroups.txt. Use when starting a proteomics analysis with raw or processed MS data. Handles contaminant filtering and missing value assessment.
npx skillsauth add GPTomics/bioSkills bio-proteomics-data-import
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Reference examples tested with: MSnbase 2.28+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
Python: pip show <package> then help(module.function) to check signatures
R: packageVersion('<pkg>') then ?function_name to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
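The same check can be scripted from inside Python. The following is a minimal sketch using only the standard library; the helper names (`installed_version`, `version_tuple`, `accepts_parameter`) are hypothetical and not part of this skill, and the version parsing is deliberately crude (leading digits of the first two components only).

```python
import inspect
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version, parts=2):
    """Turn '2.2.3' into (2, 2), keeping only the leading digits of each part."""
    numbers = []
    for piece in version.split('.')[:parts]:
        match = re.match(r'\d+', piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def accepts_parameter(func, name):
    """Check whether a callable exposes a parameter before passing it."""
    return name in inspect.signature(func).parameters


# Example: compare the installed pandas against the 2.2 baseline stated above.
found = installed_version('pandas')
if found is not None and version_tuple(found) < (2, 2):
    print(f'pandas {found} is older than the tested 2.2 baseline')
```

Calling `accepts_parameter(some_function, 'some_keyword')` before passing a keyword argument is one way to adapt an example to the installed API instead of retrying a failing call.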
"Load my mass spec data into Python" -> Parse mzML/mzXML raw files or MaxQuant proteinGroups.txt into data structures for programmatic access and downstream analysis.
Python: pyopenms.MzMLFile().load() for raw spectra, pandas.read_csv() for search engine outputs
R: MSnbase::readMSData() for raw, read.delim() for MaxQuant/Proteome Discoverer
Goal: Parse raw mass spectrometry data files into memory for programmatic access.
Approach: Load mzML/mzXML into an MSExperiment object, then iterate spectra by MS level to access peaks and precursor info.
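from pyopenms import MSExperiment, MzMLFile, MzXMLFile

exp = MSExperiment()
MzMLFile().load('sample.mzML', exp)  # use MzXMLFile().load(...) for mzXML input

for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        # MS1 survey scan: parallel arrays of m/z and intensity
        mz, intensity = spectrum.get_peaks()
    elif spectrum.getMSLevel() == 2:
        # MS2 scan: check for precursor records before indexing
        precursors = spectrum.getPrecursors()
        if precursors:
            precursor_mz = precursors[0].getMZ()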
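The import above includes MzXMLFile, although only MzMLFile is called. One way to cover both formats is to pick the loader from the file extension. This is a sketch under assumptions: `detect_ms_format` and `load_experiment` are hypothetical helper names, and dispatching on the extension (rather than inspecting file contents) is a simplification.

```python
from pathlib import Path


def detect_ms_format(path):
    """Map a file extension to 'mzML' or 'mzXML' (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix == '.mzml':
        return 'mzML'
    if suffix == '.mzxml':
        return 'mzXML'
    raise ValueError(f'Unsupported file extension: {suffix!r}')


def load_experiment(path):
    """Load an mzML or mzXML file into a pyopenms MSExperiment."""
    # Imported lazily so detect_ms_format works without pyopenms installed.
    from pyopenms import MSExperiment, MzMLFile, MzXMLFile

    exp = MSExperiment()
    loader = MzMLFile() if detect_ms_format(path) == 'mzML' else MzXMLFile()
    loader.load(str(path), exp)
    return exp
```

With this in place, `load_experiment('sample.mzML')` and `load_experiment('sample.mzXML')` would return the same kind of object, so the spectrum loop does not need to change.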
Goal: Import MaxQuant proteinGroups.txt with contaminant and decoy filtering.
Approach: Read the TSV file, remove reverse hits, contaminants, and site-only identifications, then extract intensity columns.
import pandas as pd

protein_groups = pd.read_csv('proteinGroups.txt', sep='\t', low_memory=False)

# Filter contaminants and reverse hits
contam_col = 'Potential contaminant' if 'Potential contaminant' in protein_groups.columns else 'Contaminant'
protein_groups = protein_groups[
    (protein_groups.get(contam_col, '') != '+') &
    (protein_groups.get('Reverse', '') != '+') &
    (protein_groups.get('Only identified by site', '') != '+')
]

# Extract intensity columns (LFQ or iBAQ)
intensity_cols = [c for c in protein_groups.columns if c.startswith('LFQ intensity') or c.startswith('iBAQ ')]
if not intensity_cols:
    intensity_cols = [c for c in protein_groups.columns if c.startswith('Intensity ') and 'Intensity L' not in c]

intensities = protein_groups[['Protein IDs', 'Gene names'] + intensity_cols]
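To see what the filter does, it helps to run it on a tiny table. The sketch below uses made-up protein IDs and values in place of a real proteinGroups.txt. The helper `drop_flagged_rows` is a hypothetical variant that builds the mask column by column, so it also tolerates a table in which none of the flag columns exists.

```python
import io

import pandas as pd

# Toy stand-in for proteinGroups.txt; IDs and intensities are invented.
toy_tsv = (
    "Protein IDs\tPotential contaminant\tReverse\tLFQ intensity A\tLFQ intensity B\n"
    "P1\t\t\t1000\t0\n"
    "CON__P2\t+\t\t500\t400\n"
    "REV__P3\t\t+\t300\t0\n"
    "P4\t\t\t0\t250\n"
)

FLAG_COLS = ('Potential contaminant', 'Contaminant',
             'Reverse', 'Only identified by site')


def drop_flagged_rows(df, flag_cols=FLAG_COLS):
    """Keep rows where none of the flag columns present in df is marked '+'."""
    keep = pd.Series(True, index=df.index)
    for col in flag_cols:
        if col in df.columns:
            keep &= df[col] != '+'
    return df[keep]


toy = pd.read_csv(io.StringIO(toy_tsv), sep='\t')
filtered = drop_flagged_rows(toy)
print(filtered['Protein IDs'].tolist())
```

Empty flag cells are read as NaN, and NaN compares as not equal to '+', so unflagged rows are kept.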
Goal: Import DIA-NN long-format report and reshape into a protein-by-sample quantification matrix.
Approach: Pivot the report table on protein group and run columns, using MaxLFQ values.
import pandas as pd

diann_report = pd.read_csv('report.tsv', sep='\t')

# Pivot to protein-level matrix (rows: protein groups, columns: runs)
protein_matrix = diann_report.pivot_table(
    index='Protein.Group', columns='Run', values='PG.MaxLFQ', aggfunc='first'
)
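The long-format report has one row per precursor, so a protein group appears several times per run with its protein-level value repeated. That is why taking the first value per cell is enough. The toy example below illustrates this with invented runs, protein groups, and precursor IDs. The consistency check before the pivot is an added precaution, not something the report format requires.

```python
import pandas as pd

# Toy long-format report; every value here is made up for illustration.
toy_report = pd.DataFrame({
    'Run': ['run1', 'run1', 'run2', 'run1'],
    'Protein.Group': ['P1', 'P1', 'P1', 'P2'],
    'Precursor.Id': ['AAK2', 'CCR2', 'AAK2', 'DDK3'],
    'PG.MaxLFQ': [100.0, 100.0, 80.0, 50.0],
})

# Sanity check: the protein-level value should not vary within a (group, run) pair.
values_per_cell = toy_report.groupby(['Protein.Group', 'Run'])['PG.MaxLFQ'].nunique()
consistent = bool((values_per_cell <= 1).all())

toy_matrix = toy_report.pivot_table(
    index='Protein.Group', columns='Run', values='PG.MaxLFQ', aggfunc='first'
)
print(toy_matrix)
```

Protein groups not reported in a run (P2 in run2 here) come out as NaN in the matrix, which feeds directly into a missing value assessment.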
Goal: Load raw MS data in R for interactive exploration of spectra and metadata.
Approach: Use MSnbase's on-disk reading mode to access spectra and feature metadata without loading all data into memory.
library(MSnbase)
raw_data <- readMSData('sample.mzML', mode = 'onDisk')
spectra <- spectra(raw_data)
header_info <- fData(raw_data)
Goal: Quantify missing value patterns across proteins and samples in an intensity matrix.
Approach: Count NaN values per protein and per sample, then compute overall missing percentage.
def assess_missing_values(df, intensity_cols):
    missing_per_protein = df[intensity_cols].isna().sum(axis=1)
    missing_per_sample = df[intensity_cols].isna().sum(axis=0)
    total_missing = df[intensity_cols].isna().sum().sum()
    total_values = df[intensity_cols].size
    missing_pct = 100 * total_missing / total_values
    return {'per_protein': missing_per_protein, 'per_sample': missing_per_sample, 'total_pct': missing_pct}
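Counting NaN only works if missing values are actually stored as NaN. MaxQuant intensity columns typically hold 0 for proteins that were not quantified in a sample, so a conversion step is usually needed first. The sketch below shows that step on invented data; `zeros_to_nan` is a hypothetical helper, and treating every 0 as missing is an assumption to verify for your own data.

```python
import numpy as np
import pandas as pd

# Toy intensity table; zeros stand for "not quantified" (assumption).
toy = pd.DataFrame({
    'Protein IDs': ['P1', 'P2', 'P3'],
    'LFQ intensity A': [1000.0, 0.0, 300.0],
    'LFQ intensity B': [0.0, 0.0, 250.0],
})
cols = ['LFQ intensity A', 'LFQ intensity B']


def zeros_to_nan(df, intensity_cols):
    """Return a copy of df with zeros in the intensity columns replaced by NaN."""
    out = df.copy()
    out[intensity_cols] = out[intensity_cols].replace(0, np.nan)
    return out


toy_nan = zeros_to_nan(toy, cols)

# Same counts as in the function above, computed on the converted table.
missing_per_protein = toy_nan[cols].isna().sum(axis=1)
missing_pct = 100 * toy_nan[cols].isna().sum().sum() / toy_nan[cols].size
print(missing_per_protein.tolist(), missing_pct)
```

Without the conversion, the same table would report no missing values at all.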