plugin/skills/tooluniverse-proteomics-data-retrieval/SKILL.md
Find and retrieve proteomics datasets from MassIVE and ProteomeXchange. Search by species, keyword, or accession; retrieve detailed metadata (instruments, publications, species, PTMs studied). Use for locating public proteomics datasets to reanalyze, comparing instrument/protocol coverage across studies, and pre-download dataset evaluation.
Find and retrieve metadata for publicly available proteomics datasets from MassIVE and ProteomeXchange repositories. Supports searching by species, keyword, or accession, and returns detailed dataset metadata including instruments, publications, species, and post-translational modifications.
Triggers:
Use Cases:
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.
Dataset quality depends on instrument, sample preparation, and quantification method. TMT/iTRAQ (isobaric labeling) datasets have ratio compression and co-isolation interference biases that differ from label-free quantification (LFQ). DIA datasets require different analysis pipelines than DDA. Check the original publication for methods before reusing data in a meta-analysis or cross-study comparison. Instrument resolution (Orbitrap > ion trap) and acquisition mode (DIA > DDA for completeness) directly affect how many proteins are quantified and at what confidence.
| Repository | Coverage | Strengths |
|-----------|----------|-----------|
| MassIVE | 10,000+ datasets | Rich metadata (summaries, keywords, modifications, contacts), species filtering by taxonomy ID |
| ProteomeXchange | Aggregates PRIDE, MassIVE, PeptideAtlas, jPOST, iProX | Broadest coverage, standardized PXD accessions |
Query (keyword / species / accession)
|
+-- PHASE 0: Input Resolution
| Determine search type: keyword, species, or accession lookup
|
+-- PHASE 1: Repository Search
| Search MassIVE and/or ProteomeXchange based on query type
|
+-- PHASE 2: Dataset Detail Retrieval
| Get full metadata for promising hits
|
+-- PHASE 3: Result Synthesis
Compile datasets with metadata, publications, and relevance assessment
Objective: Determine the query type and prepare appropriate search parameters.
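A minimal sketch of this input-resolution step, assuming accessions follow the `PXD…` / `MSV…` patterns shown in this skill and that a bare integer is meant as an NCBI taxonomy ID. The function name `classify_query` and the returned dictionary shape are hypothetical; only the tool names come from this document.

```python
import re

# Accession patterns, assumed from the examples PXD000001 and MSV000079514.
PXD_RE = re.compile(r"PXD\d+", re.IGNORECASE)
MSV_RE = re.compile(r"MSV\d+", re.IGNORECASE)


def classify_query(query: str) -> dict:
    """Classify a raw query and name the tool(s) this skill suggests for it."""
    q = query.strip()
    if PXD_RE.fullmatch(q):
        # PXD accessions go to ProteomeXchange first, optionally MassIVE as well.
        return {"type": "accession", "value": q.upper(),
                "tools": ["ProteomeXchange_get_dataset", "MassIVE_get_dataset"]}
    if MSV_RE.fullmatch(q):
        return {"type": "accession", "value": q.upper(),
                "tools": ["MassIVE_get_dataset"]}
    if q.isdigit():
        # Assumption: a bare integer is an NCBI taxonomy ID; keep it as a string.
        return {"type": "species", "value": q,
                "tools": ["MassIVE_search_datasets"]}
    # Anything else is treated as a free-text keyword.
    return {"type": "keyword", "value": q,
            "tools": ["ProteomeXchange_search_datasets"]}
```

For example, `classify_query("9606")` is routed to the species-filtered MassIVE search, while `classify_query("phosphoproteomics")` falls through to the ProteomeXchange keyword search.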
- Accession (e.g., PXD000001, MSV000079514):
  - PXD accession: ProteomeXchange_get_dataset and optionally MassIVE_get_dataset
  - MSV accession: MassIVE_get_dataset
- Species: MassIVE_search_datasets with species filter
- Keyword: ProteomeXchange_search_datasets with query parameter

Objective: Find relevant datasets across repositories.
MassIVE_search_datasets:
- page_size: Number of results to return (integer, max 100, default 10)
- species: NCBI taxonomy ID string to filter by species (e.g., "9606" for human)
- Returns: accessions (array), title, summary, species, instruments, keywords

ProteomeXchange_search_datasets:
- query: Optional search filter -- keyword or dataset accession (e.g., "phosphoproteomics", "PXD")
- limit: Max results (1-50, default 10)
- Returns: {data: [{accession, title, species}], metadata: {source, total_returned, query}}

For species-specific search:
- MassIVE_search_datasets(page_size=20, species="9606") for species-filtered results
- ProteomeXchange_search_datasets(limit=20) for broader listing

For keyword search:
- ProteomeXchange_search_datasets(query="keyword", limit=20)

For comprehensive discovery:
- {data: ...} wrapper)
- {data: [...], metadata: {...}}

Objective: Get full metadata for datasets of interest.
MassIVE_get_dataset:
- accession: Dataset accession -- accepts both MSV and PXD formats (e.g., "MSV000079514", "PXD003971")
- Returns: accessions, title, summary, species, instruments, keywords, contacts, publications, modifications

ProteomeXchange_get_dataset:
- px_id: ProteomeXchange identifier in PXD format (e.g., "PXD000001")
- Returns: {data: {px_id, title, species, identifiers, instruments, publications, file_count}, metadata: {...}}
- Use ProteomeXchange_get_dataset for file count; use MassIVE_get_dataset for richer summary/keywords

Objective: Compile and present dataset results in a structured format.
# Proteomics Dataset Search Results
**Query**: [original query]
**Date**: YYYY-MM-DD
**Repositories searched**: MassIVE, ProteomeXchange
## Summary
Found N datasets matching [criteria].
## Datasets
### 1. [Title]
- **Accession**: PXD/MSV number
- **Species**: [organism]
- **Instruments**: [MS platforms]
- **Publications**: [PubMed IDs / DOIs]
- **Modifications**: [PTMs if available]
- **Files**: [count if available]
- **Summary**: [brief description]
### 2. [Title]
...
## Data Gaps
[Note any limitations in search coverage]
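The template above can be filled programmatically. The following is a rough sketch, not part of the skill itself: the function name `render_report`, the `FIELDS` mapping, and the dictionary keys (`accession`, `file_count`, and so on) are assumptions loosely modeled on the field names listed for the tool responses, and each dataset is assumed to have been flattened into a plain dict beforehand.

```python
from datetime import date

# (label in the report, assumed key in the flattened dataset dict)
FIELDS = [
    ("Accession", "accession"),
    ("Species", "species"),
    ("Instruments", "instruments"),
    ("Publications", "publications"),
    ("Modifications", "modifications"),
    ("Files", "file_count"),
    ("Summary", "summary"),
]


def render_report(query, datasets, gaps="", report_date=""):
    """Render dataset dicts into the Markdown report layout of this skill."""
    lines = [
        "# Proteomics Dataset Search Results",
        f"**Query**: {query}",
        f"**Date**: {report_date or date.today().isoformat()}",
        "**Repositories searched**: MassIVE, ProteomeXchange",
        "## Summary",
        f"Found {len(datasets)} datasets matching {query}.",
        "## Datasets",
    ]
    for index, ds in enumerate(datasets, start=1):
        lines.append(f"### {index}. {ds.get('title', 'Untitled')}")
        for label, key in FIELDS:
            value = ds.get(key)
            if value is None or value == "" or value == []:
                continue  # omit fields the repository did not provide
            if isinstance(value, (list, tuple)):
                value = ", ".join(str(item) for item in value)
            lines.append(f"- **{label}**: {value}")
    lines.append("## Data Gaps")
    lines.append(gaps or "None noted.")
    return "\n".join(lines)
```

Missing fields are skipped rather than printed as empty bullets, so a sparse record stays readable; the gaps text is where limitations in search coverage would be recorded.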
| Tool | Parameter | Notes |
|------|-----------|-------|
| MassIVE_search_datasets | page_size | Integer, max 100. Default 10 |
| MassIVE_search_datasets | species | NCBI taxonomy ID as string (e.g., "9606" not 9606) |
| MassIVE_get_dataset | accession | Accepts both MSV and PXD formats |
| ProteomeXchange_search_datasets | query | Optional keyword or accession filter |
| ProteomeXchange_search_datasets | limit | Integer, 1-50 |
| ProteomeXchange_get_dataset | px_id | PXD format only (e.g., "PXD000001") |
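The constraints in the table above can be checked before a tool is called. This is a sketch under stated assumptions: the helper `build_call` is hypothetical, the `{"name": ..., "arguments": ...}` envelope is an assumed call format rather than something this document specifies, and the lower bound of 1 for `page_size` is a guess (the table only gives the maximum).

```python
import re


def build_call(tool, **args):
    """Validate arguments against the parameter table and return a call payload."""
    if tool == "MassIVE_search_datasets":
        page_size = int(args.get("page_size", 10))
        if not 1 <= page_size <= 100:  # lower bound is an assumption
            raise ValueError("page_size must be between 1 and 100")
        arguments = {"page_size": page_size}
        if args.get("species") is not None:
            # The taxonomy ID must be passed as a string ("9606", not 9606).
            arguments["species"] = str(args["species"])
    elif tool == "MassIVE_get_dataset":
        accession = str(args["accession"]).upper()
        if not re.fullmatch(r"(MSV|PXD)\d+", accession):
            raise ValueError("accession must be in MSV or PXD format")
        arguments = {"accession": accession}
    elif tool == "ProteomeXchange_search_datasets":
        limit = int(args.get("limit", 10))
        if not 1 <= limit <= 50:
            raise ValueError("limit must be between 1 and 50")
        arguments = {"limit": limit}
        if args.get("query"):
            arguments["query"] = str(args["query"])
    elif tool == "ProteomeXchange_get_dataset":
        px_id = str(args["px_id"]).upper()
        if not re.fullmatch(r"PXD\d+", px_id):
            raise ValueError("px_id must be in PXD format")
        arguments = {"px_id": px_id}
    else:
        raise ValueError(f"unknown tool: {tool}")
    return {"name": tool, "arguments": arguments}
```

Coercing `species` to a string here guards against the most likely mistake, passing the taxonomy ID as an integer.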
Response Format Notes:
- ProteomeXchange_search_datasets: {data: [...], metadata: {...}}
- ProteomeXchange_get_dataset: {data: {...}, metadata: {...}}

| Situation | Fallback |
|-----------|----------|
| MassIVE search returns empty | Use ProteomeXchange search (broader coverage) |
| ProteomeXchange search returns empty | Try broader/simpler query terms |
| MassIVE_get_dataset fails for PXD accession | Use ProteomeXchange_get_dataset instead |
| Species taxonomy ID unknown | Search ProteomeXchange by keyword (organism name) |
| No keyword search results | Try individual terms instead of multi-word queries |
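The search fallbacks in the table above can be sketched as an ordered chain of attempts. Everything here is illustrative: `unwrap` and `search_with_fallback` are hypothetical helpers, the two search functions are passed in as plain callables so the logic can be exercised without network access, and accepting either a bare list or a `{data: ...}` wrapper is an assumption about the response shapes.

```python
def unwrap(response):
    """Return a list of records from a bare list or a {data: ...} wrapper."""
    if isinstance(response, dict):
        data = response.get("data")
        if data is None:
            return []
        return data if isinstance(data, list) else [data]
    return list(response or [])


def search_with_fallback(keyword, taxonomy_id, massive_search, px_search):
    """Try progressively broader searches; return (label, hits) for the first hit."""
    attempts = []
    if taxonomy_id:
        attempts.append(("MassIVE species",
                         lambda: massive_search(species=str(taxonomy_id))))
    if keyword:
        attempts.append(("ProteomeXchange keyword",
                         lambda: px_search(query=keyword)))
        terms = keyword.split()
        if len(terms) > 1:
            # Last resort: individual terms instead of the multi-word query.
            for term in terms:
                attempts.append((f"ProteomeXchange term '{term}'",
                                 lambda t=term: px_search(query=t)))
    for label, call in attempts:
        hits = unwrap(call())
        if hits:
            return label, hits
    return "none", []
```

Returning the label alongside the hits makes it easy to record in the report which fallback actually produced the results.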
| Species | Taxonomy ID |
|---------|-------------|
| Human | 9606 |
| Mouse | 10090 |
| Rat | 10116 |
| Zebrafish | 7955 |
| Fruit fly | 7227 |
| C. elegans | 6239 |
| S. cerevisiae | 559292 |
| A. thaliana | 3702 |
| E. coli | 562 |
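For convenience, the taxonomy IDs listed above can be kept in a small lookup that always yields strings, as the `species` parameter expects. The names `TAXONOMY_IDS` and `taxonomy_id` are hypothetical; the IDs themselves are copied from the table.

```python
from typing import Optional

# Common names mapped to NCBI taxonomy IDs, stored as strings on purpose.
TAXONOMY_IDS = {
    "human": "9606",
    "mouse": "10090",
    "rat": "10116",
    "zebrafish": "7955",
    "fruit fly": "7227",
    "c. elegans": "6239",
    "s. cerevisiae": "559292",
    "a. thaliana": "3702",
    "e. coli": "562",
}


def taxonomy_id(name: str) -> Optional[str]:
    """Return the taxonomy ID string for a listed species, or None if unlisted."""
    return TAXONOMY_IDS.get(name.strip().lower())
```

A `None` result signals that the organism is not in the table, in which case searching ProteomeXchange by organism name is the fallback this skill recommends.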
| Quality Indicator | Good | Acceptable | Caution |
|-------------------|------|------------|---------|
| Instrument | Orbitrap Exploris/Eclipse, timsTOF | Q Exactive, TripleTOF 6600 | Older LTQ, ion trap only |
| Publication | Peer-reviewed with PubMed ID | Preprint or DOI only | No associated publication |
| Metadata completeness | Species + instrument + PTMs + summary | Species + instrument only | Title only, no annotations |
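One way to apply the quality indicators above consistently is a small tiering function. This is only a sketch: the function names, the substring matching on instrument names, the assumption that `publications` is a list of dicts with optional `pubmed_id` / `doi` keys, and the choice to report the worst tier as the overall rating are all illustrative decisions, not rules stated by this skill.

```python
TIERS = ["good", "acceptable", "caution"]  # ordered best to worst

GOOD_INSTRUMENTS = ("orbitrap exploris", "orbitrap eclipse", "timstof")
ACCEPTABLE_INSTRUMENTS = ("q exactive", "tripletof 6600")


def instrument_tier(instruments):
    names = [str(name).lower() for name in instruments or []]
    if any(key in name for name in names for key in GOOD_INSTRUMENTS):
        return "good"
    if any(key in name for name in names for key in ACCEPTABLE_INSTRUMENTS):
        return "acceptable"
    # Unlisted or missing instruments default to caution so a person reviews them.
    return "caution"


def publication_tier(publications):
    pubs = publications or []
    if any(pub.get("pubmed_id") for pub in pubs):
        return "good"
    if any(pub.get("doi") for pub in pubs):
        return "acceptable"
    return "caution"


def metadata_tier(dataset):
    has_core = bool(dataset.get("species")) and bool(dataset.get("instruments"))
    has_rich = bool(dataset.get("modifications")) and bool(dataset.get("summary"))
    if has_core and has_rich:
        return "good"
    if has_core:
        return "acceptable"
    return "caution"


def assess(dataset):
    """Rate each indicator, then take the worst tier as the overall rating."""
    result = {
        "instrument": instrument_tier(dataset.get("instruments")),
        "publication": publication_tier(dataset.get("publications")),
        "metadata": metadata_tier(dataset),
    }
    result["overall"] = max(result.values(), key=TIERS.index)
    return result
```

Taking the worst tier is deliberately conservative: a dataset from a modern instrument but with no associated publication is still flagged for a closer look before reuse.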
Interpreting dataset search results:
Synthesis questions to address in the report:
- species parameter
- Dataverse_get_dataset
- page_size/limit reasonable

| Skill | Relationship |
|-------|-------------|
| tooluniverse-proteomics-analysis | Use retrieved datasets as input for MS data analysis |
| tooluniverse-protein-modification-analysis | Find PTM-specific datasets to complement iPTMnet annotations |
| tooluniverse-multi-omics-integration | Discover proteomics datasets for cross-omics integration |