plugin/skills/tooluniverse-cancer-genomics-tcga/SKILL.md
TCGA/GDC cancer genomics analysis — cohort construction, clinical metadata retrieval, somatic mutation frequencies, survival analysis, and multi-omics integration. Use for TCGA-BRCA-style cohort studies, mutation prevalence by cancer type, survival-by-mutation analysis, and pan-cancer driver discovery. Always cancer-type-specific (don't use pan-cancer counts without cohort context).
npx skillsauth add mims-harvard/tooluniverse tooluniverse-cancer-genomics-tcga
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
TCGA analysis starts with: what cancer type? what data type? Build your cohort FIRST (GDC filters), then analyze. Don't query mutations without defining the cohort — pan-cancer counts from GDC_get_mutation_frequency are uninformative without cancer-type context. A mutation frequency of 10% in one cancer type may be 0.5% in another; always specify project_id. Survival analysis (Kaplan-Meier) is hypothesis-generating in retrospective TCGA data — always report sample size and p-value, and note that TCGA cohorts are not treatment-stratified.
LOOK UP DON'T GUESS: never assume TCGA project IDs, NCIt codes, or gene coordinates — use GDC_list_projects to confirm project IDs and Progenetix_list_filtering_terms for NCIt codes.
Systematic TCGA/GDC analysis: define cohorts, retrieve clinical data, profile somatic mutations, query copy number variations, run survival analysis, and interpret variants with OncoKB.
tooluniverse-precision-oncology, tooluniverse-rare-disease-genomics, tooluniverse-gwas-snp-interpretation
Input (cancer type / gene / TCGA project ID)
|
v
Phase 1: Study Selection -- GDC_list_projects, GDC_search_cases
|
v
Phase 2: Clinical Data -- GDC_get_clinical_data
|
v
Phase 3: Somatic Mutations -- GDC_get_ssm_by_gene, GDC_get_mutation_frequency
|
v
Phase 4: CNV Analysis -- Progenetix_cnv_search, Progenetix_search_biosamples
|
v
Phase 5: Survival Analysis -- GDC_get_survival
|
v
Phase 6: Variant Interpretation -- OncoKB_annotate_variant
| Data Type | Format | Example |
|-----------|--------|---------|
| GDC project | TCGA-{ABBREV} | TCGA-BRCA, TCGA-LUAD, TCGA-SKCM |
| GDC case | UUID | 3c6ef4c1-... |
| NCIt cancer code | NCIT:C###### | NCIT:C4017 (breast), NCIT:C3058 (GBM) |
| RefSeq chromosome | refseq:NC_###### | refseq:NC_000007.14 (chr7) |

| Cancer | Project ID | NCIt Code |
|--------|-----------|-----------|
| Breast | TCGA-BRCA | NCIT:C4017 |
| Lung adenocarcinoma | TCGA-LUAD | NCIT:C3512 |
| Glioblastoma | TCGA-GBM | NCIT:C3058 |
| Melanoma | TCGA-SKCM | NCIT:C3510 |
| Colorectal | TCGA-COAD | NCIT:C4349 |
| Ovarian | TCGA-OV | NCIT:C4908 |
| Prostate | TCGA-PRAD | NCIT:C7378 |
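Because the identifier formats above are easy to mix up (free text where a CURIE is expected, `chr7` where a RefSeq accession is expected), a quick shape check before calling a tool can catch mistakes early. The sketch below is only a format check: the regular expressions are assumptions inferred from the format table, not official GDC or Progenetix rules, and `looks_like` is a hypothetical helper. It does not replace confirming IDs with GDC_list_projects or Progenetix_list_filtering_terms.

```python
import re

# Patterns inferred from the format table; assumptions, not official validation rules.
ID_PATTERNS = {
    "gdc_project": re.compile(r"^TCGA-[A-Z]{2,4}$"),            # e.g. TCGA-BRCA (TCGA projects only)
    "ncit_code": re.compile(r"^NCIT:C\d+$"),                    # e.g. NCIT:C4017
    "refseq_chromosome": re.compile(r"^refseq:NC_\d{6}\.\d+$"), # e.g. refseq:NC_000007.14
}

def looks_like(kind, value):
    """Return True if `value` has the expected shape for identifier type `kind`."""
    return bool(ID_PATTERNS[kind].match(value))

print(looks_like("gdc_project", "TCGA-BRCA"))        # well-formed project ID
print(looks_like("ncit_code", "breast"))             # free text is rejected
print(looks_like("refseq_chromosome", "chr7"))       # chromosome alias is rejected
```

A passing check only means the string is well-formed; whether the project or code actually exists still has to be looked up.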
GDC_list_projects: No params required. Returns all GDC/TCGA projects with case counts.
GDC_search_cases: project_id (string, e.g., "TCGA-BRCA"), size (int, default 10), offset (int).
Returns case UUIDs and basic metadata.
GDC_get_clinical_data: project_id (string), primary_site (string, e.g., "Breast"), disease_type (string), vital_status ("Alive" or "Dead"), gender ("female"/"male"), size (int, 1-100), offset (int).
Returns {status, data: [{case_id, demographics: {gender, race, ethnicity, vital_status, age_at_index}, diagnoses: [{primary_diagnosis, tumor_stage, age_at_diagnosis, days_to_last_follow_up}], treatments: [{therapeutic_agents, treatment_type}]}]}.
- Use project_id plus optional filters to retrieve patient-level clinical attributes.
- age_at_diagnosis is in days; divide by 365.25 for years.
# Get clinical data for deceased BRCA patients
result = tu.tools.GDC_get_clinical_data(
project_id="TCGA-BRCA", vital_status="Dead", size=50
)
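Since age_at_diagnosis is reported in days, a small post-processing step is usually needed before the values are readable. The following is a minimal sketch that assumes the response shape documented above; `diagnosis_ages_in_years` is a hypothetical helper and `sample_result` is made-up illustrative data, not real TCGA output.

```python
DAYS_PER_YEAR = 365.25

def diagnosis_ages_in_years(result):
    """Flatten a GDC_get_clinical_data-style response into one row per diagnosis."""
    rows = []
    for case in result.get("data", []):
        for dx in case.get("diagnoses", []):
            days = dx.get("age_at_diagnosis")
            rows.append({
                "case_id": case.get("case_id"),
                "primary_diagnosis": dx.get("primary_diagnosis"),
                # Missing ages stay None rather than being coerced to 0.
                "age_years": round(days / DAYS_PER_YEAR, 1) if days is not None else None,
            })
    return rows

# Hypothetical records in the documented shape (not real patients).
sample_result = {"status": "success", "data": [
    {"case_id": "case-a", "diagnoses": [
        {"primary_diagnosis": "Infiltrating duct carcinoma, NOS", "age_at_diagnosis": 21915}]},
    {"case_id": "case-b", "diagnoses": [
        {"primary_diagnosis": "Lobular carcinoma, NOS", "age_at_diagnosis": None}]},
]}

ages = diagnosis_ages_in_years(sample_result)
for row in ages:
    print(row)
```

In practice the same function could be applied to the `result` returned by the tool call above.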
GDC_get_mutation_frequency: gene_symbol (string REQUIRED, alias: gene). Returns pan-cancer SSM occurrence count.
- For cancer-type-specific mutations, use GDC_get_ssm_by_gene with project_id.
GDC_get_ssm_by_gene: gene_symbol (string REQUIRED), project_id (string, optional), size (int, 1-100).
Returns {status, data: [{ssm_id, mutation_type, genomic_dna_change, aa_change, consequence_type}]}.
- mutation_type: "Single base substitution", "Insertion", "Deletion".
- aa_change: amino acid change notation (e.g., "Val600Glu").
# TP53 mutations in lung adenocarcinoma
mutations = tu.tools.GDC_get_ssm_by_gene(
gene_symbol="TP53", project_id="TCGA-LUAD", size=50
)
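One common follow-up is to see which amino acid changes recur in the returned records. The sketch below assumes the response shape documented above; `top_aa_changes` is a hypothetical helper and the records are invented for illustration. Note that it counts returned SSM records per aa_change, which is not the same as counting patients.

```python
from collections import Counter

def top_aa_changes(result, n=3):
    """Tally aa_change values across GDC_get_ssm_by_gene-style records."""
    changes = [rec["aa_change"] for rec in result.get("data", []) if rec.get("aa_change")]
    return Counter(changes).most_common(n)

# Hypothetical records (illustrative only, not real GDC output).
sample_mutations = {"status": "success", "data": [
    {"ssm_id": "ssm-1", "aa_change": "Arg273His"},
    {"ssm_id": "ssm-2", "aa_change": "Arg175His"},
    {"ssm_id": "ssm-3", "aa_change": "Arg273His"},
    {"ssm_id": "ssm-4", "aa_change": "Gly245Ser"},
    {"ssm_id": "ssm-5", "aa_change": "Arg175His"},
    {"ssm_id": "ssm-6", "aa_change": "Arg273His"},
    {"ssm_id": "ssm-7", "aa_change": None},  # e.g. a non-coding change
]}

top = top_aa_changes(sample_mutations, n=2)
print(top)
```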
Progenetix_search_biosamples: filters (string REQUIRED, NCIt code e.g., "NCIT:C4017"), limit (int), skip (int).
Returns {status, data: {biosamples: [{biosample_id, histological_diagnosis, pathological_stage, external_references}]}}.
Progenetix_cnv_search: reference_name (string REQUIRED, RefSeq accession), start (int REQUIRED, GRCh38 1-based), end (int REQUIRED), variant_type ("DUP"/"DEL"), filters (string, NCIt code), limit (int).
Returns biosamples with CNV in the specified genomic region.
- variant_type="DUP" for amplification, "DEL" for deletion.
- Use filters to restrict to a cancer type.
# EGFR amplifications (chr7:55019017-55211628) in breast cancer
result = tu.tools.Progenetix_cnv_search(
reference_name="refseq:NC_000007.14",
start=55019017, end=55211628,
variant_type="DUP", filters="NCIT:C4017", limit=10
)
Progenetix_list_filtering_terms: No params. Returns all available NCIt codes and labels.
Progenetix_list_cohorts: No params. Returns named cohorts available in Progenetix.
GDC_get_survival: project_id (string REQUIRED, e.g., "TCGA-BRCA"), gene_symbol (string, optional -- filters to mutated cases).
Returns {status, data: {donors: [{id, time, censored, survivalEstimate}], overallStats: {pValue}}}.
- Each donor record has time (days), censored (bool: False = death event, True = censored), and survivalEstimate.
- overallStats.pValue: log-rank p-value (present when gene_symbol splits the cohort).
- Without gene_symbol: returns the full-cohort survival curve.
- With gene_symbol: returns survival split by mutation status (mutated vs. wild-type).
# Survival for TCGA-BRCA split by TP53 mutation
surv = tu.tools.GDC_get_survival(project_id="TCGA-BRCA", gene_symbol="TP53")
pval = surv["data"]["overallStats"]["pValue"]
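Because sample size and event counts should be reported alongside any p-value, it can help to summarize the donor array before interpreting the curve. This is a minimal sketch that assumes the response shape documented above; `summarize_survival` is a hypothetical helper and `sample_surv` is invented data.

```python
def summarize_survival(surv):
    """Count donors, death events and censored cases in a GDC_get_survival-style response."""
    data = surv.get("data", {})
    donors = data.get("donors", [])
    events = sum(1 for d in donors if not d["censored"])  # censored=False means a death event
    return {
        "n": len(donors),
        "events": events,
        "censored": len(donors) - events,
        "max_follow_up_days": max((d["time"] for d in donors), default=None),
        "p_value": data.get("overallStats", {}).get("pValue"),  # None if the cohort was not split
    }

# Hypothetical response (illustrative values only).
sample_surv = {"status": "success", "data": {
    "donors": [
        {"id": "d1", "time": 320, "censored": False, "survivalEstimate": 0.75},
        {"id": "d2", "time": 1100, "censored": True, "survivalEstimate": 0.75},
        {"id": "d3", "time": 1450, "censored": False, "survivalEstimate": 0.38},
        {"id": "d4", "time": 2000, "censored": True, "survivalEstimate": 0.38},
    ],
    "overallStats": {"pValue": 0.03},
}}

summary = summarize_survival(sample_surv)
print(summary)
```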
OncoKB_annotate_variant: gene (string, alias gene_symbol), variant (string, alias alteration, e.g., "V600E"), tumor_type (string, OncoTree code e.g., "MEL").
Returns {status, data: {oncogenic, mutationEffect, highestSensitiveLevel, treatments: [{drugs, level, indication}]}}.
- oncogenic: "Oncogenic", "Likely Oncogenic", "Neutral", "Inconclusive", "Unknown".
- highestSensitiveLevel: highest OncoKB therapeutic evidence level ("LEVEL_1" = FDA-approved, "LEVEL_2" = standard of care, etc.).
# Annotate KRAS G12C in lung adenocarcinoma
result = tu.tools.OncoKB_annotate_variant(
gene="KRAS", variant="G12C", tumor_type="LUAD"
)
| Tool | Key Params | Returns |
|------|-----------|---------|
| GDC_list_projects | (none) | All TCGA/GDC projects with counts |
| GDC_search_cases | project_id, size, offset | Case UUIDs + metadata |
| GDC_get_clinical_data | project_id, vital_status, gender, size | Demographics + diagnoses + treatments |
| GDC_get_mutation_frequency | gene_symbol (alias: gene) | Pan-cancer SSM count |
| GDC_get_ssm_by_gene | gene_symbol, project_id, size | Per-mutation records with aa_change |
| GDC_get_survival | project_id, gene_symbol (optional) | Kaplan-Meier donor array + pValue |
| Progenetix_search_biosamples | filters (NCIt code), limit | Biosample records |
| Progenetix_cnv_search | reference_name, start, end, variant_type, filters | Biosamples with CNV in region |
| Progenetix_list_filtering_terms | (none) | All NCIt codes in Progenetix |
| OncoKB_annotate_variant | gene, variant, tumor_type | Oncogenicity + treatments |
1. GDC_get_mutation_frequency(gene_symbol="KRAS")
   -> Pan-cancer mutation count
2. GDC_get_ssm_by_gene(gene_symbol="KRAS", project_id="TCGA-LUAD", size=50)
   -> Specific amino acid changes in lung adenocarcinoma
3. GDC_get_survival(project_id="TCGA-LUAD", gene_symbol="KRAS")
   -> Survival split by KRAS mutation status + p-value
4. OncoKB_annotate_variant(gene="KRAS", variant="G12C", tumor_type="LUAD")
   -> Clinical significance + approved therapies (sotorasib)

1. GDC_list_projects() -> confirm TCGA-OV exists
2. GDC_get_clinical_data(project_id="TCGA-OV", size=100)
   -> Demographics, tumor stage, treatment history
3. GDC_get_survival(project_id="TCGA-OV")
   -> Baseline overall survival curve for the cohort

1. Progenetix_search_biosamples(filters="NCIT:C3058", limit=10)
   -> GBM biosamples with CNV data
2. Progenetix_cnv_search(
       reference_name="refseq:NC_000007.14",
       start=55019017, end=55211628,
       variant_type="DUP", filters="NCIT:C3058"
   )
   -> GBM samples with EGFR amplification
| Tier | Description | Example |
|------|-------------|---------|
| T1 | FDA-recognized biomarker with approved therapy | BRAF V600E in melanoma (vemurafenib) |
| T2 | Well-powered clinical study, standard-of-care relevance | KRAS G12C in NSCLC (sotorasib), OncoKB Level 2 |
| T3 | Preclinical/small cohort evidence, biological plausibility | Recurrent hotspot in TCGA but no approved therapy |
| T4 | Computational prediction or variant of unknown significance | Low-frequency mutation, no functional data |
Mutation frequency: A gene mutated in >10% of a TCGA cohort is likely a driver candidate (e.g., TP53 in 36% of all TCGA). Mutations at <1% frequency are typically passengers unless they occur at known hotspots. Always cross-reference with OncoKB oncogenicity annotation.
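The frequency thresholds above only make sense against an explicit cohort denominator. The sketch below shows that arithmetic; `mutation_frequency` and `frequency_flag` are hypothetical helpers, the counts are invented, and the "intermediate" label for the 1-10% range is an assumption added here, since the text does not name that band.

```python
def mutation_frequency(mutated_cases, cohort_size):
    """Fraction of cases in a defined cohort carrying a mutation in the gene."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive; define the cohort first")
    return mutated_cases / cohort_size

def frequency_flag(freq):
    """Apply the >10% / <1% rules of thumb; the middle label is an assumption."""
    if freq > 0.10:
        return "driver candidate"
    if freq < 0.01:
        return "likely passenger unless at a known hotspot"
    return "intermediate"

# Hypothetical counts for a 1,000-case cohort (not real TCGA numbers).
for mutated in (120, 50, 5):
    freq = mutation_frequency(mutated, 1000)
    print(f"{mutated}/1000 = {freq:.1%} -> {frequency_flag(freq)}")
```

Either flag is only a starting point; the OncoKB oncogenicity annotation remains the cross-check.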
Survival analysis (Kaplan-Meier): A log-rank p-value < 0.05 suggests the gene mutation is associated with differential survival. Hazard ratio (HR) > 1 indicates worse prognosis for the mutated group. Interpret cautiously: TCGA cohorts are retrospective and not treatment-stratified. Small subgroups (n < 20) produce unreliable survival estimates.
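The cautions above can be turned into a short checklist that travels with every reported result. This is a sketch under stated assumptions: `survival_caveats` is a hypothetical helper, the group sizes must be supplied by the caller (the text does not say that GDC_get_survival returns them directly), and the thresholds are simply the ones quoted in the paragraph.

```python
def survival_caveats(n_mutated, n_wildtype, p_value, alpha=0.05, min_group=20):
    """Return interpretation notes for a mutated-vs-wild-type survival comparison."""
    notes = []
    if min(n_mutated, n_wildtype) < min_group:
        notes.append("small subgroup: survival estimate unreliable")
    if p_value is None:
        notes.append("no p-value: cohort was not split")
    elif p_value < alpha:
        notes.append("association with differential survival (hypothesis-generating)")
    else:
        notes.append("no significant survival difference detected")
    # This caveat applies to every TCGA survival result.
    notes.append("TCGA cohorts are retrospective and not treatment-stratified")
    return notes

# Hypothetical comparison: 12 mutated vs. 300 wild-type cases, p = 0.01.
notes = survival_caveats(n_mutated=12, n_wildtype=300, p_value=0.01)
for note in notes:
    print("-", note)
```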
Copy number variation: Focal amplifications (narrow peaks) of oncogenes (EGFR, MYC, ERBB2) are more likely functionally relevant than broad arm-level events. Homozygous deletions of tumor suppressors (CDKN2A, PTEN, RB1) are strong loss-of-function signals. DUP count from Progenetix reflects sample frequency, not copy number magnitude.
A complete cancer genomics report should answer:
When ToolUniverse tools return truncated results or you need bulk data, use the GDC API directly:
import requests
import pandas as pd

# Bulk clinical data for a TCGA project
filters = {"op": "and", "content": [
    {"op": "=", "content": {"field": "project.project_id", "value": "TCGA-BRCA"}}
]}
all_cases = []
offset = 0
while True:
    # Page through the cases endpoint 500 records at a time
    resp = requests.post("https://api.gdc.cancer.gov/cases", json={
        "filters": filters, "size": 500, "from": offset,
        "fields": "submitter_id,demographic.vital_status,demographic.days_to_death,diagnoses.tumor_stage"
    }).json()
    hits = resp["data"]["hits"]
    if not hits:
        break
    all_cases.extend(hits)
    offset += len(hits)
df = pd.json_normalize(all_cases)

# Download MAF mutation file by UUID
file_uuid = "abc123-..."  # from GDC_list_files result
url = f"https://api.gdc.cancer.gov/data/{file_uuid}"
content = requests.get(url, headers={"Content-Type": "application/json"}).content

# Gene expression: query files endpoint for HTSeq counts
expr_filters = {"op": "and", "content": [
    {"op": "=", "content": {"field": "cases.project.project_id", "value": "TCGA-BRCA"}},
    {"op": "=", "content": {"field": "data_type", "value": "Gene Expression Quantification"}}
]}
See tooluniverse-data-wrangling skill for pagination, error handling, and format parsing patterns.
- GDC_get_survival with gene_symbol splits on mutation presence only; no multi-gene or stage-based stratification.
- GDC_get_mutation_frequency returns pan-cancer total only; per-cancer frequencies require GDC_get_ssm_by_gene per project.
- GDC_get_clinical_data returns up to 100 cases per call; use offset for pagination.
- GRCh38 (1-based) coordinates and a RefSeq accession are required for Progenetix_cnv_search.
- OncoKB_annotate_variant without ONCOKB_API_TOKEN operates in demo mode (limited to BRAF, TP53, ROS1).
- The Progenetix filters param requires NCIt CURIE format (e.g., "NCIT:C4017"), not free text.
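The 100-cases-per-call limit means full cohorts have to be collected page by page with offset. The sketch below shows one way to do that generically; `fetch_all` is a hypothetical helper, and `fake_fetch` is a stand-in over invented records so the loop can run without network access. It assumes each page comes back as `{"data": [...]}`, as documented for GDC_get_clinical_data.

```python
def fetch_all(fetch_page, page_size=100, max_records=None):
    """Collect records from a size/offset-paginated callable until a short page is returned."""
    records, offset = [], 0
    while True:
        page = fetch_page(size=page_size, offset=offset)
        batch = page.get("data", [])
        records.extend(batch)
        offset += len(batch)
        if len(batch) < page_size:
            break  # last (possibly empty) page reached
        if max_records is not None and len(records) >= max_records:
            break  # caller-imposed cap reached
    return records if max_records is None else records[:max_records]

# Stand-in for a tool call: 250 hypothetical cases served in pages.
FAKE_CASES = [{"case_id": f"case-{i}"} for i in range(250)]
calls = []

def fake_fetch(size, offset):
    calls.append(offset)
    return {"status": "success", "data": FAKE_CASES[offset:offset + size]}

cases = fetch_all(fake_fetch)
print(len(cases), "cases in", len(calls), "calls")
```

With ToolUniverse loaded, `fake_fetch` could be swapped for a lambda that forwards `size` and `offset` to `tu.tools.GDC_get_clinical_data(project_id="TCGA-BRCA", ...)`.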