skills/tooluniverse-cancer-genomics-tcga/SKILL.md
TCGA/GDC cancer genomics analysis — cohort construction, clinical metadata retrieval, somatic mutation frequencies, survival analysis, and multi-omics integration. Use for TCGA-BRCA-style cohort studies, mutation prevalence by cancer type, survival-by-mutation analysis, and pan-cancer driver discovery. Always cancer-type-specific (don't use pan-cancer counts without cohort context).
npx skillsauth add mims-harvard/tooluniverse tooluniverse-cancer-genomics-tcga
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
TCGA analysis starts with: what cancer type? what data type? Build your cohort FIRST (GDC filters), then analyze. Don't query mutations without defining the cohort — pan-cancer counts from GDC_get_mutation_frequency are uninformative without cancer-type context. A mutation frequency of 10% in one cancer type may be 0.5% in another; always specify project_id. Survival analysis (Kaplan-Meier) is hypothesis-generating in retrospective TCGA data — always report sample size and p-value, and note that TCGA cohorts are not treatment-stratified.
LOOK UP DON'T GUESS: never assume TCGA project IDs, NCIt codes, or gene coordinates — use GDC_list_projects to confirm project IDs and Progenetix_list_filtering_terms for NCIt codes.
Systematic TCGA/GDC analysis: define cohorts, retrieve clinical data, profile somatic mutations, query copy number variations, run survival analysis, and interpret variants with OncoKB.
tooluniverse-precision-oncology, tooluniverse-rare-disease-genomics, tooluniverse-gwas-snp-interpretation

Input (cancer type / gene / TCGA project ID)
|
v
Phase 1: Study Selection -- GDC_list_projects, GDC_search_cases
|
v
Phase 2: Clinical Data -- GDC_get_clinical_data
|
v
Phase 3: Somatic Mutations -- GDC_get_ssm_by_gene, GDC_get_mutation_frequency
|
v
Phase 4: CNV Analysis -- Progenetix_cnv_search, Progenetix_search_biosamples
|
v
Phase 5: Survival Analysis -- GDC_get_survival
|
v
Phase 6: Variant Interpretation -- OncoKB_annotate_variant
| Data Type | Format | Example |
|-----------|--------|---------|
| GDC project | TCGA-{ABBREV} | TCGA-BRCA, TCGA-LUAD, TCGA-SKCM |
| GDC case | UUID | 3c6ef4c1-... |
| NCIt cancer code | NCIT:C###### | NCIT:C4017 (breast), NCIT:C3058 (GBM) |
| RefSeq chromosome | refseq:NC_###### | refseq:NC_000007.14 (chr7) |

| Cancer | Project ID | NCIt Code |
|--------|-----------|-----------|
| Breast | TCGA-BRCA | NCIT:C4017 |
| Lung adenocarcinoma | TCGA-LUAD | NCIT:C3512 |
| Glioblastoma | TCGA-GBM | NCIT:C3058 |
| Melanoma | TCGA-SKCM | NCIT:C3510 |
| Colorectal | TCGA-COAD | NCIT:C4349 |
| Ovarian | TCGA-OV | NCIT:C4908 |
| Prostate | TCGA-PRAD | NCIT:C7378 |
GDC_list_projects: No params required. Returns all GDC/TCGA projects with case counts.
GDC_search_cases: project_id (string, e.g., "TCGA-BRCA"), size (int, default 10), offset (int).
Returns case UUIDs and basic metadata.
GDC_get_clinical_data: project_id (string), primary_site (string, e.g., "Breast"), disease_type (string), vital_status ("Alive" or "Dead"), gender ("female"/"male"), size (int, 1-100), offset (int).
Returns {status, data: [{case_id, demographics: {gender, race, ethnicity, vital_status, age_at_index}, diagnoses: [{primary_diagnosis, tumor_stage, age_at_diagnosis, days_to_last_follow_up}], treatments: [{therapeutic_agents, treatment_type}]}]}.
- Use project_id + optional filters to retrieve patient-level clinical attributes.
- age_at_diagnosis is in days; divide by 365.25 for years.

# Get clinical data for deceased BRCA patients
result = tu.tools.GDC_get_clinical_data(
    project_id="TCGA-BRCA", vital_status="Dead", size=50
)
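The days-to-years conversion can be sketched as below. The record follows the response shape documented above, but every value in `sample_case` is synthetic, and the helper names `days_to_years` and `flatten_case` are hypothetical.

```python
def days_to_years(days):
    """Convert a GDC day count to years (one decimal); None stays None."""
    return None if days is None else round(days / 365.25, 1)


def flatten_case(case):
    """Pull a few fields out of one GDC_get_clinical_data record."""
    demo = case.get("demographics") or {}
    diagnoses = case.get("diagnoses") or [{}]
    first_dx = diagnoses[0]  # assumption: the first diagnosis is the primary one
    return {
        "case_id": case.get("case_id"),
        "vital_status": demo.get("vital_status"),
        "tumor_stage": first_dx.get("tumor_stage"),
        "age_at_diagnosis_years": days_to_years(first_dx.get("age_at_diagnosis")),
    }


# Synthetic record in the documented shape (not real TCGA data).
sample_case = {
    "case_id": "case-0001",
    "demographics": {"gender": "female", "vital_status": "Dead"},
    "diagnoses": [{"primary_diagnosis": "placeholder",
                   "tumor_stage": "stage ii",
                   "age_at_diagnosis": 21915}],
    "treatments": [],
}

row = flatten_case(sample_case)
print(row)
```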
GDC_get_mutation_frequency: gene_symbol (string REQUIRED, alias: gene). Returns pan-cancer SSM occurrence count.
- For cancer-type-specific mutations, use GDC_get_ssm_by_gene with project_id.

GDC_get_ssm_by_gene: gene_symbol (string REQUIRED), project_id (string, optional), size (int, 1-100).
Returns {status, data: [{ssm_id, mutation_type, genomic_dna_change, aa_change, consequence_type}]}.
- mutation_type: "Single base substitution", "Insertion", "Deletion".
- aa_change: amino acid change notation (e.g., "Val600Glu").

# TP53 mutations in lung adenocarcinoma
mutations = tu.tools.GDC_get_ssm_by_gene(
    gene_symbol="TP53", project_id="TCGA-LUAD", size=50
)
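One way to summarize the returned records is to tally them by mutation_type. The sketch below assumes the documented record shape; the records themselves are synthetic placeholders, and `count_by_field` is a hypothetical helper. It counts distinct mutation records, not patients, so the numbers are not cohort frequencies.

```python
from collections import Counter


def count_by_field(records, field):
    """Count records per value of `field`, skipping records that lack it."""
    return Counter(r[field] for r in records if r.get(field))


# Synthetic records in the documented shape (not real GDC output).
sample_records = [
    {"ssm_id": "ssm-1", "mutation_type": "Single base substitution",
     "aa_change": "Arg175His", "consequence_type": "missense_variant"},
    {"ssm_id": "ssm-2", "mutation_type": "Single base substitution",
     "aa_change": "Arg273Cys", "consequence_type": "missense_variant"},
    {"ssm_id": "ssm-3", "mutation_type": "Deletion",
     "aa_change": None, "consequence_type": "frameshift_variant"},
]

type_counts = count_by_field(sample_records, "mutation_type")
consequence_counts = count_by_field(sample_records, "consequence_type")
print(type_counts.most_common())
```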
Progenetix_search_biosamples: filters (string REQUIRED, NCIt code e.g., "NCIT:C4017"), limit (int), skip (int).
Returns {status, data: {biosamples: [{biosample_id, histological_diagnosis, pathological_stage, external_references}]}}.
Progenetix_cnv_search: reference_name (string REQUIRED, RefSeq accession), start (int REQUIRED, GRCh38 1-based), end (int REQUIRED), variant_type ("DUP"/"DEL"), filters (string, NCIt code), limit (int).
Returns biosamples with CNV in the specified genomic region.
- Use variant_type="DUP" for amplification, "DEL" for deletion.
- Use filters to restrict to a cancer type.

# EGFR amplifications (chr7:55019017-55211628) in breast cancer
result = tu.tools.Progenetix_cnv_search(
    reference_name="refseq:NC_000007.14",
    start=55019017, end=55211628,
    variant_type="DUP", filters="NCIT:C4017", limit=10
)
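Because the parameter formats are strict, it can help to validate them before calling the tool. The following is a sketch under stated assumptions: `build_cnv_query` is a hypothetical helper, and the regular expressions are inferred from the examples in this document rather than taken from a Progenetix specification.

```python
import re

# Patterns inferred from the examples above (assumptions, not an official spec).
NCIT_CURIE = re.compile(r"^NCIT:C\d+$")
REFSEQ_CHROM = re.compile(r"^refseq:NC_\d{6}\.\d+$")


def build_cnv_query(reference_name, start, end, variant_type,
                    filters=None, limit=10):
    """Return keyword arguments for Progenetix_cnv_search, or raise ValueError."""
    if not REFSEQ_CHROM.match(reference_name):
        raise ValueError(f"reference_name is not a RefSeq accession: {reference_name!r}")
    if variant_type not in ("DUP", "DEL"):
        raise ValueError("variant_type must be 'DUP' or 'DEL'")
    if not (isinstance(start, int) and isinstance(end, int) and 1 <= start <= end):
        raise ValueError("start/end must be integers with 1 <= start <= end")
    if filters is not None and not NCIT_CURIE.match(filters):
        raise ValueError(f"filters must be an NCIt CURIE, got {filters!r}")
    params = {"reference_name": reference_name, "start": start, "end": end,
              "variant_type": variant_type, "limit": limit}
    if filters is not None:
        params["filters"] = filters
    return params


query = build_cnv_query("refseq:NC_000007.14", 55019017, 55211628,
                        "DUP", filters="NCIT:C4017")
print(query)
# The dict could then be passed on as: tu.tools.Progenetix_cnv_search(**query)
```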
Progenetix_list_filtering_terms: No params. Returns all available NCIt codes and labels.
Progenetix_list_cohorts: No params. Returns named cohorts available in Progenetix.
GDC_get_survival: project_id (string REQUIRED, e.g., "TCGA-BRCA"), gene_symbol (string, optional -- filters to mutated cases).
Returns {status, data: {donors: [{id, time, censored, survivalEstimate}], overallStats: {pValue}}}.
- Each donor record has time (days), censored (bool: False=death event, True=censored), and survivalEstimate.
- overallStats.pValue: log-rank p-value (present when gene_symbol splits cohort).
- Without gene_symbol: returns full-cohort survival curve.
- With gene_symbol: returns survival split by mutation status (mutated vs. wild-type).

# Survival for TCGA-BRCA split by TP53 mutation
surv = tu.tools.GDC_get_survival(project_id="TCGA-BRCA", gene_symbol="TP53")
pval = surv["data"]["overallStats"]["pValue"]
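To make the donor fields concrete, here is a minimal sketch of the standard Kaplan-Meier product-limit estimator computed from (time, censored) pairs. It illustrates what a survivalEstimate represents; it is not claimed to be how GDC computes its values. The donor records are synthetic.

```python
def kaplan_meier(donors):
    """Return [(time, survival_probability)] at each distinct death time."""
    # Each entry is (time, is_death); censored=False marks a death event.
    events = sorted((d["time"], not d["censored"]) for d in donors)
    n_at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        leaving = 0
        # Group all donors that share the same time point.
        while i < len(events) and events[i][0] == t:
            deaths += int(events[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # deaths and censored donors both leave the risk set
    return curve


# Synthetic donors in the documented shape (not real TCGA data).
example_donors = [
    {"id": "d1", "time": 100, "censored": False},
    {"id": "d2", "time": 200, "censored": True},
    {"id": "d3", "time": 300, "censored": False},
    {"id": "d4", "time": 400, "censored": True},
]

km_curve = kaplan_meier(example_donors)
print(km_curve)
```

In this example the estimate drops to 3/4 at day 100 (one death among four at risk) and to 3/4 × 1/2 at day 300 (one death among the two donors still at risk).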
OncoKB_annotate_variant: gene (string, alias gene_symbol), variant (string, alias alteration, e.g., "V600E"), tumor_type (string, OncoTree code e.g., "MEL").
Returns {status, data: {oncogenic, mutationEffect, highestSensitiveLevel, treatments: [{drugs, level, indication}]}}.
- oncogenic: "Oncogenic", "Likely Oncogenic", "Neutral", "Inconclusive", "Unknown".
- highestSensitiveLevel: FDA approval level ("LEVEL_1"=FDA-approved, "LEVEL_2"=standard of care, etc.).

# Annotate KRAS G12C in lung adenocarcinoma
result = tu.tools.OncoKB_annotate_variant(
    gene="KRAS", variant="G12C", tumor_type="LUAD"
)
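A small sketch of reading the annotation fields described above. `summarize_annotation` is a hypothetical helper; the response dict is a synthetic placeholder in the documented shape, not real OncoKB output, and the type of the `drugs` field is not specified in this document, so the helper does not inspect it.

```python
ONCOGENIC_CALLS = {"Oncogenic", "Likely Oncogenic"}


def summarize_annotation(result):
    """Reduce an OncoKB_annotate_variant response to a few summary fields."""
    data = result.get("data") or {}
    oncogenic = data.get("oncogenic", "Unknown")
    return {
        "oncogenic": oncogenic,
        "is_oncogenic": oncogenic in ONCOGENIC_CALLS,
        "highest_level": data.get("highestSensitiveLevel"),
        "n_treatments": len(data.get("treatments") or []),
    }


# Synthetic placeholder response (values are illustrative only).
sample_response = {
    "status": "success",
    "data": {
        "oncogenic": "Oncogenic",
        "mutationEffect": "placeholder",
        "highestSensitiveLevel": "LEVEL_1",
        "treatments": [
            {"drugs": "placeholder-drug", "level": "LEVEL_1",
             "indication": "placeholder-indication"},
        ],
    },
}

summary = summarize_annotation(sample_response)
print(summary)
```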
| Tool | Key Params | Returns |
|------|-----------|---------|
| GDC_list_projects | (none) | All TCGA/GDC projects with counts |
| GDC_search_cases | project_id, size, offset | Case UUIDs + metadata |
| GDC_get_clinical_data | project_id, vital_status, gender, size | Demographics + diagnoses + treatments |
| GDC_get_mutation_frequency | gene_symbol (alias: gene) | Pan-cancer SSM count |
| GDC_get_ssm_by_gene | gene_symbol, project_id, size | Per-mutation records with aa_change |
| GDC_get_survival | project_id, gene_symbol (optional) | Kaplan-Meier donor array + pValue |
| Progenetix_search_biosamples | filters (NCIt code), limit | Biosample records |
| Progenetix_cnv_search | reference_name, start, end, variant_type, filters | Biosamples with CNV in region |
| Progenetix_list_filtering_terms | (none) | All NCIt codes in Progenetix |
| OncoKB_annotate_variant | gene, variant, tumor_type | Oncogenicity + treatments |
1. GDC_get_mutation_frequency(gene_symbol="KRAS")
-> Pan-cancer mutation count
2. GDC_get_ssm_by_gene(gene_symbol="KRAS", project_id="TCGA-LUAD", size=50)
-> Specific amino acid changes in lung adenocarcinoma
3. GDC_get_survival(project_id="TCGA-LUAD", gene_symbol="KRAS")
-> Survival split by KRAS mutation status + p-value
4. OncoKB_annotate_variant(gene="KRAS", variant="G12C", tumor_type="LUAD")
-> Clinical significance + approved therapies (sotorasib)
1. GDC_list_projects() -> confirm TCGA-OV exists
2. GDC_get_clinical_data(project_id="TCGA-OV", size=100)
-> Demographics, tumor stage, treatment history
3. GDC_get_survival(project_id="TCGA-OV")
-> Baseline overall survival curve for the cohort
1. Progenetix_search_biosamples(filters="NCIT:C3058", limit=10)
-> GBM biosamples with CNV data
2. Progenetix_cnv_search(
reference_name="refseq:NC_000007.14",
start=55019017, end=55211628,
variant_type="DUP", filters="NCIT:C3058"
)
-> GBM samples with EGFR amplification
| Tier | Description | Example |
|------|-------------|---------|
| T1 | FDA-recognized biomarker with approved therapy | BRAF V600E in melanoma (vemurafenib) |
| T2 | Well-powered clinical study, standard-of-care relevance | KRAS G12C in NSCLC (sotorasib), OncoKB Level 2 |
| T3 | Preclinical/small cohort evidence, biological plausibility | Recurrent hotspot in TCGA but no approved therapy |
| T4 | Computational prediction or variant of unknown significance | Low-frequency mutation, no functional data |
Mutation frequency: A gene mutated in >10% of a TCGA cohort is likely a driver candidate (e.g., TP53 in 36% of all TCGA). Mutations at <1% frequency are typically passengers unless they occur at known hotspots. Always cross-reference with OncoKB oncogenicity annotation.
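The frequency thresholds above can be expressed as a simple rule of thumb. This is a sketch only: the function names and the returned labels are hypothetical, and the result is a triage flag that still needs the OncoKB cross-reference, not a driver call.

```python
def mutation_frequency(n_mutated_cases, n_cohort_cases):
    """Fraction of cohort cases carrying a mutation in the gene."""
    if n_cohort_cases <= 0:
        raise ValueError("cohort size must be positive")
    return n_mutated_cases / n_cohort_cases


def frequency_flag(freq, known_hotspot=False):
    """Apply the >10% / <1% heuristics; labels are illustrative."""
    if freq > 0.10:
        return "driver candidate"
    if freq < 0.01 and not known_hotspot:
        return "likely passenger"
    return "needs review"  # intermediate frequency, or rare but at a known hotspot


print(frequency_flag(mutation_frequency(36, 100)))
print(frequency_flag(0.005))
print(frequency_flag(0.005, known_hotspot=True))
```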
Survival analysis (Kaplan-Meier): A log-rank p-value < 0.05 suggests the gene mutation is associated with differential survival. Hazard ratio (HR) > 1 indicates worse prognosis for the mutated group. Interpret cautiously: TCGA cohorts are retrospective and not treatment-stratified. Small subgroups (n < 20) produce unreliable survival estimates.
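The reporting cautions above can be bundled into a helper that always states sample sizes and the p-value. The sketch below uses the thresholds from the text (p < 0.05, n < 20); `survival_caveats` and the wording of its notes are hypothetical, and the group sizes must come from the caller because this document does not show them as a field of the survival response.

```python
def survival_caveats(p_value, n_mutated, n_wild_type, alpha=0.05, min_group=20):
    """Return report notes for a mutated vs. wild-type survival comparison."""
    notes = [f"n_mutated={n_mutated}, n_wild_type={n_wild_type}, "
             f"log-rank p={p_value:.3g}"]
    if min(n_mutated, n_wild_type) < min_group:
        notes.append(f"smallest group has fewer than {min_group} cases; "
                     "survival estimates are unreliable")
    if p_value < alpha:
        notes.append("association with survival is hypothesis-generating only")
    else:
        notes.append("no significant survival difference detected")
    notes.append("TCGA cohorts are retrospective and not treatment-stratified")
    return notes


for note in survival_caveats(p_value=0.03, n_mutated=12, n_wild_type=300):
    print("-", note)
```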
Copy number variation: Focal amplifications (narrow peaks) of oncogenes (EGFR, MYC, ERBB2) are more likely functionally relevant than broad arm-level events. Homozygous deletions of tumor suppressors (CDKN2A, PTEN, RB1) are strong loss-of-function signals. DUP count from Progenetix reflects sample frequency, not copy number magnitude.
A complete cancer genomics report should answer:
When ToolUniverse tools return truncated results or you need bulk data, use the GDC API directly:
import requests, pandas as pd

# Bulk clinical data for a TCGA project
filters = {"op":"and","content":[
    {"op":"=","content":{"field":"project.project_id","value":"TCGA-BRCA"}}
]}
all_cases = []
offset = 0
while True:
    resp = requests.post("https://api.gdc.cancer.gov/cases", json={
        "filters": filters, "size": 500, "from": offset,
        "fields": "submitter_id,demographic.vital_status,demographic.days_to_death,diagnoses.tumor_stage"
    }).json()
    hits = resp["data"]["hits"]
    if not hits:
        break
    all_cases.extend(hits)
    offset += len(hits)
df = pd.json_normalize(all_cases)

# Download MAF mutation file by UUID
file_uuid = "abc123-..."  # from GDC_list_files result
url = f"https://api.gdc.cancer.gov/data/{file_uuid}"
content = requests.get(url, headers={"Content-Type": "application/json"}).content

# Gene expression: query files endpoint for HTSeq counts
expr_filters = {"op":"and","content":[
    {"op":"=","content":{"field":"cases.project.project_id","value":"TCGA-BRCA"}},
    {"op":"=","content":{"field":"data_type","value":"Gene Expression Quantification"}}
]}
See tooluniverse-data-wrangling skill for pagination, error handling, and format parsing patterns.
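A gene-expression filter like the one above still needs to be wrapped in a request body before it can be sent to the files endpoint. The sketch below only builds that payload, so it runs without network access. `build_files_query` is a hypothetical helper, and the choice of returned fields (file_id, file_name, cases.submitter_id) is an assumption about what a caller would want.

```python
def build_files_query(project_id, data_type, size=100, offset=0):
    """Build a JSON body for a POST to the GDC /files endpoint."""
    filters = {"op": "and", "content": [
        {"op": "=", "content": {"field": "cases.project.project_id",
                                "value": project_id}},
        {"op": "=", "content": {"field": "data_type", "value": data_type}},
    ]}
    return {
        "filters": filters,
        "size": size,
        "from": offset,
        # Assumed field selection; adjust to the columns you need.
        "fields": "file_id,file_name,cases.submitter_id",
    }


payload = build_files_query("TCGA-BRCA", "Gene Expression Quantification")
print(payload["fields"])
# The payload could then be sent with:
#   requests.post("https://api.gdc.cancer.gov/files", json=payload).json()
```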
- GDC_get_survival with gene_symbol splits on mutation presence only; no multi-gene or stage-based stratification.
- GDC_get_mutation_frequency returns pan-cancer total only; per-cancer frequencies require GDC_get_ssm_by_gene per project.
- GDC_get_clinical_data returns up to 100 cases per call; use offset for pagination.
- Coordinates passed to Progenetix_cnv_search are GRCh38, 1-based.
- OncoKB_annotate_variant without ONCOKB_API_TOKEN operates in demo mode (limited to BRAF, TP53, ROS1).
- The Progenetix filters param requires NCIt CURIE format (e.g., "NCIT:C4017"), not free text.