skills/genomics-bioinformatics/cosmic-database/SKILL.md
Query COSMIC for cancer somatic mutations, gene census, mutational signatures, drug resistance variants. REST API v3.1 supports gene/sample/variant queries; free registration. For germline use clinvar-database; for drug-target data use opentargets-database or chembl-database-bioactivity.
COSMIC (Catalogue Of Somatic Mutations In Cancer) is the world's largest expert-curated database of somatic mutations in cancer, covering 6.7M+ coding mutations, 40,000+ cancer samples, and 19,000+ genes across all cancer types. It includes the Cancer Gene Census (critical cancer genes), mutational signatures (SBS, DBS, ID), drug resistance variants, copy number data, gene expression, and methylation. The REST API v3.1 enables programmatic queries; most features are freely accessible after registration.
For germline variants use clinvar-database; for drug-target associations use opentargets-database.
Required Python packages: requests, pandas
pip install requests pandas
# Register at https://cancer.sanger.ac.uk/cosmic/register to obtain API credentials
import requests
import base64
# COSMIC API requires base64-encoded email:password authentication
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
# Get mutations for KRAS gene
r = requests.get(f"{BASE}/mutations",
                 headers=HEADERS,
                 params={"gene_name": "KRAS", "limit": 5})
r.raise_for_status()
data = r.json()
print(f"Total KRAS mutations: {data['meta']['total']}")
for m in data["data"][:3]:
    print(f" {m['mutation_id']:15s} AA: {m.get('mutation_aa')} | Cancer: {m.get('primary_site')}")
Retrieve all COSMIC somatic mutations for a gene, with cancer type and amino acid change.
import requests, base64, pandas as pd
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}

def get_gene_mutations(gene, limit=100, cancer_site=None):
    params = {"gene_name": gene, "limit": limit}
    if cancer_site:
        params["primary_site"] = cancer_site
    r = requests.get(f"{BASE}/mutations", headers=HEADERS, params=params)
    r.raise_for_status()
    return r.json()

data = get_gene_mutations("TP53", limit=20)
print(f"Total TP53 mutations in COSMIC: {data['meta']['total']}")
rows = []
for m in data["data"][:10]:
    rows.append({
        "mutation_id": m.get("mutation_id"),
        "mutation_aa": m.get("mutation_aa"),
        "mutation_cds": m.get("mutation_cds"),
        "primary_site": m.get("primary_site"),
        "histology": m.get("primary_histology"),
        "count": m.get("count"),
    })
df = pd.DataFrame(rows)
print(df.head())
# Filter by cancer site
data_lung = get_gene_mutations("TP53", cancer_site="lung", limit=20)
print(f"\nTP53 mutations in lung cancer: {data_lung['meta']['total']}")
Retrieve the COSMIC Cancer Gene Census — classified cancer driver genes.
import requests, base64, pandas as pd
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
r = requests.get(f"{BASE}/genes", headers=HEADERS, params={"limit": 100})
r.raise_for_status()
data = r.json()
print(f"Total genes in COSMIC: {data['meta']['total']}")
# Get Cancer Gene Census genes
r_cgc = requests.get(f"{BASE}/genes",
                     headers=HEADERS,
                     params={"cgc_tier": "1", "limit": 50})
r_cgc.raise_for_status()  # fail early instead of parsing an error response
cgc_data = r_cgc.json()
print(f"\nCGC Tier 1 genes: {cgc_data['meta']['total']}")
rows = []
for g in cgc_data["data"][:15]:
    rows.append({
        "gene": g.get("gene_name"),
        "tier": g.get("cgc_tier"),
        "role": g.get("role_in_cancer"),
        "mutation_types": g.get("mutation_types"),
        "tumour_types": str(g.get("tumour_types_somatic", []))[:80],
    })
df = pd.DataFrame(rows)
print(df.to_string(index=False))
Retrieve details for a known COSMIC mutation ID (COSM…).
import requests, base64
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
# KRAS G12D mutation
mutation_id = "COSM521"
r = requests.get(f"{BASE}/mutations/{mutation_id}", headers=HEADERS)
r.raise_for_status()
m = r.json()
print(f"Mutation ID : {m.get('mutation_id')}")
print(f"Gene : {m.get('gene_name')}")
print(f"AA change : {m.get('mutation_aa')}")
print(f"CDS change : {m.get('mutation_cds')}")
print(f"Substitution: {m.get('mutation_description')}")
print(f"Count : {m.get('count')} samples")
print(f"Cancer types: {str(m.get('cancer_types', []))[:100]}")
Retrieve all somatic mutations for a specific cancer sample.
import requests, base64, pandas as pd
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
# List a few breast cancer samples
r = requests.get(f"{BASE}/samples",
                 headers=HEADERS,
                 params={"primary_site": "breast", "limit": 5})
r.raise_for_status()
samples = r.json()["data"]
print("Example breast cancer samples:")
for s in samples[:3]:
    print(f" {s.get('sample_id')}: {s.get('sample_name')} | {s.get('primary_histology')}")
# Get mutations for the first sample returned
if samples:
    sample_id = samples[0]["sample_id"]
    r2 = requests.get(f"{BASE}/samples/{sample_id}/mutations", headers=HEADERS)
    if r2.ok:
        muts = r2.json()["data"]
        print(f"\nMutations in sample {sample_id}: {len(muts)}")
        for m in muts[:5]:
            # Fall back to "" so the width format spec does not fail on a missing gene name
            print(f" {(m.get('gene_name') or ''):10s} {m.get('mutation_aa')}")
Retrieve COSMIC mutational signature data for cancer types.
import requests, base64, pandas as pd
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
# List available mutational signatures
r = requests.get(f"{BASE}/signatures", headers=HEADERS)
r.raise_for_status()
sigs = r.json()["data"]
print(f"COSMIC mutational signatures: {len(sigs)}")
for s in sigs[:5]:
    print(f" {s.get('signature_name')}: {(s.get('aetiology') or '')[:80]}")
# Get signature attributions by cancer type
r2 = requests.get(f"{BASE}/signatures/attributions",
                  headers=HEADERS,
                  params={"cancer_type": "Breast", "limit": 10})
if r2.ok:
    attributions = r2.json()["data"]
    for a in attributions[:5]:
        # Treat a missing proportion as 0 so the percent format does not fail on None
        print(f" {a.get('signature_name')}: {(a.get('attribution_proportion') or 0):.2%} in breast cancer")
Query the COSMIC drug resistance database for variants conferring drug resistance.
import requests, base64, pandas as pd
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
# Get drug resistance variants
r = requests.get(f"{BASE}/resistance_mutations",
                 headers=HEADERS,
                 params={"gene": "EGFR", "limit": 20})
if r.ok:
    data = r.json()
    print(f"EGFR drug resistance variants: {data['meta'].get('total', 'n/a')}")
    for v in data.get("data", [])[:5]:
        # Fall back to "" so the width format spec does not fail on a missing value
        print(f" {(v.get('mutation_aa') or ''):20s} Drug: {v.get('drug')} | Resistance: {v.get('resistance_type')}")
else:
    print(f"Drug resistance API: {r.status_code} - endpoint may require specific access level")
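COSMIC's Cancer Gene Census classifies genes into two tiers, Tier 1 and Tier 2, which the genes endpoint exposes through the cgc_tier parameter ("1" or "2").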
COSMIC mutation IDs (COSM…) are stable identifiers for specific amino acid changes in a gene. The same COSM ID appears across all samples with that mutation, allowing cross-study comparison.
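As a rough illustration of the cross-study comparison described above, the sketch below groups records by COSM ID and reports which IDs were seen in more than one study. The record layout, the "study" field, and every value other than COSM521 (used later in this document for KRAS G12D) are made up for the example; they are not a COSMIC response format.

```python
from collections import defaultdict


def studies_per_mutation(records):
    """Map each mutation_id to the set of study labels in which it was observed."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec["mutation_id"]].add(rec["study"])
    return dict(seen)


# Illustrative records only: the "study" field, the study labels, and the
# placeholder ID "COSM_OTHER" are hypothetical and exist just for this sketch.
records = [
    {"mutation_id": "COSM521", "study": "study_A"},
    {"mutation_id": "COSM521", "study": "study_B"},
    {"mutation_id": "COSM_OTHER", "study": "study_A"},
]

by_id = studies_per_mutation(records)
# Keep only the IDs observed in at least two different studies
shared = sorted(mid for mid, studies in by_id.items() if len(studies) > 1)
print(shared)
```

Keying on the ID rather than on the amino acid string avoids treating differently formatted strings for the same change as different mutations.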
Goal: Identify the most frequently occurring somatic mutations in a cancer gene.
import requests, base64, pandas as pd
from collections import Counter
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}

def get_all_gene_mutations(gene, max_records=1000):
    """Paginate through COSMIC mutations for a gene, up to max_records."""
    all_muts = []
    skip = 0
    limit = 200
    while len(all_muts) < max_records:
        r = requests.get(f"{BASE}/mutations",
                         headers=HEADERS,
                         params={"gene_name": gene, "limit": limit, "skip": skip})
        r.raise_for_status()
        payload = r.json()  # parse the response body once
        batch = payload["data"]
        if not batch:
            break
        all_muts.extend(batch)
        skip += limit
        if skip >= payload["meta"]["total"]:
            break
    return all_muts[:max_records]  # trim any overshoot from the last page

# Get hotspots for KRAS
mutations = get_all_gene_mutations("KRAS", max_records=500)
print(f"Retrieved {len(mutations)} KRAS somatic mutations")
# Rank by amino acid change frequency
aa_counter = Counter(m["mutation_aa"] for m in mutations if m.get("mutation_aa"))
hotspots = pd.DataFrame(aa_counter.most_common(15), columns=["mutation_aa", "sample_count"])
print("\nKRAS hotspot mutations:")
print(hotspots.head(10).to_string(index=False))
hotspots.to_csv("KRAS_hotspots.csv", index=False)
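Because the workflow above downloads every page on each run, it can help to cache the raw records locally. The following is a minimal sketch of one way to do that with a JSON file; `load_or_fetch` and the cache file name are hypothetical helpers for illustration, not part of COSMIC or of this skill.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached records if the file exists; otherwise call fetch() and cache the result.

    cache_path: where the JSON cache lives (any writable path).
    fetch: a zero-argument callable that returns JSON-serializable records.
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the network entirely
        return json.loads(path.read_text())
    records = fetch()
    path.write_text(json.dumps(records))
    return records
```

In the hotspot workflow this could wrap the download step, for example `load_or_fetch("KRAS_mutations.json", lambda: get_all_gene_mutations("KRAS", max_records=500))`. The sketch never expires the cache, so delete the file to force a refresh after a COSMIC release.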
Goal: Export the full Cancer Gene Census as a structured table for downstream pipeline use.
import requests, base64, pandas as pd, time
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
BASE = "https://cancer.sanger.ac.uk/cosmic/api"
HEADERS = {"Authorization": f"Basic {token}"}
all_genes = []
for tier in [1, 2]:
    skip = 0
    while True:
        r = requests.get(f"{BASE}/genes",
                         headers=HEADERS,
                         params={"cgc_tier": str(tier), "limit": 100, "skip": skip})
        r.raise_for_status()
        batch = r.json()["data"]
        if not batch:
            break
        all_genes.extend(batch)
        if len(batch) < 100:
            break
        skip += 100
        time.sleep(0.1)
rows = [{
    "gene": g.get("gene_name"),
    "tier": g.get("cgc_tier"),
    "role_in_cancer": g.get("role_in_cancer"),
    "mutation_types": g.get("mutation_types"),
    "somatic_tumours": str(g.get("tumour_types_somatic", [])),
    "germline_tumours": str(g.get("tumour_types_germline", [])),
    "chr": g.get("chromosomal_location"),
} for g in all_genes]
df = pd.DataFrame(rows)
df.to_csv("COSMIC_cancer_gene_census.csv", index=False)
print(f"Exported {len(df)} Cancer Gene Census genes → COSMIC_cancer_gene_census.csv")
print(df.groupby("tier")["gene"].count())
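The export loop above spaces its requests with a fixed `time.sleep`. A small throttle object is one alternative that only sleeps when calls arrive too quickly. This is a sketch under the assumption that a minimum gap of 0.15 s between requests (the value suggested in this document's troubleshooting table) is adequate; the `Throttle` class is a hypothetical helper, not a COSMIC or requests feature.

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=0.15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Called too soon: sleep only for the time still owed
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` in a pagination loop keeps the pacing in one place, and slow responses no longer incur an extra fixed delay.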
| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| gene_name | Mutations | — | HGNC symbol | Filter mutations by gene |
| primary_site | Mutations/Samples | — | tissue type string | Filter by primary tumor site |
| limit | All | 10 | 1–200 | Records per page |
| skip | All | 0 | integer | Pagination offset |
| cgc_tier | Genes | — | "1", "2" | Cancer Gene Census tier |
| mutation_id | Mutations | — | COSM ID string | Lookup specific mutation |
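To make the defaults and ranges in the table concrete, here is a minimal sketch of a helper that assembles a query-parameter dictionary and rejects out-of-range values before any request is sent. `build_params` is a hypothetical name introduced for illustration; the parameter names and limits are taken from the table as written in this document.

```python
def build_params(limit=10, skip=0, **filters):
    """Assemble query parameters using the defaults and ranges listed in the table."""
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    if skip < 0:
        raise ValueError("skip must be a non-negative integer")
    params = {"limit": limit, "skip": skip}
    # Drop filters left as None so they are not sent as empty values
    params.update({key: value for key, value in filters.items() if value is not None})
    return params


params = build_params(limit=200, gene_name="KRAS", primary_site=None)
print(params)
```

The resulting dictionary can be passed directly as `params=` to `requests.get`.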
Authenticate via Base64: COSMIC uses HTTP Basic Auth with base64-encoded email:password. Store credentials in environment variables, not in code.
Paginate large gene queries: Popular cancer genes (TP53, KRAS) have 100,000+ mutation records; use skip/limit pagination and cache results locally.
Use COSM IDs for cross-study comparison: Amino acid change strings may have formatting variations (p.G12D vs G12D); use COSMIC mutation IDs (COSM…) for unambiguous references.
Check data license for commercial use: COSMIC data is free for academic use but requires a commercial license for industry applications. Verify at https://cancer.sanger.ac.uk/cosmic/license.
Complement with clinical data: COSMIC captures somatic mutations from cancer sequencing; complement with clinvar-database for germline pathogenicity and opentargets-database for therapeutic significance.
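The first practice above recommends keeping credentials out of source code. A minimal sketch of that, assuming the credentials are exported as `COSMIC_EMAIL` and `COSMIC_PASSWORD` (hypothetical variable names chosen for this example, not defined by COSMIC), might look like this:

```python
import base64
import os


def build_auth_headers(env=os.environ):
    """Build the Basic auth header used throughout this document from environment variables.

    COSMIC_EMAIL and COSMIC_PASSWORD are hypothetical names chosen for this sketch.
    """
    email = env.get("COSMIC_EMAIL")
    password = env.get("COSMIC_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set COSMIC_EMAIL and COSMIC_PASSWORD before querying COSMIC")
    token = base64.b64encode(f"{email}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}
```

With this in place, `HEADERS = build_auth_headers()` could replace the hard-coded `EMAIL` and `PASSWORD` lines in the examples.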
When to use: Identify frequently mutated genes in a specific cancer type.
import requests, base64, pandas as pd
from collections import Counter
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
HEADERS = {"Authorization": f"Basic {token}"}
r = requests.get("https://cancer.sanger.ac.uk/cosmic/api/mutations",
                 headers=HEADERS,
                 params={"primary_site": "lung", "limit": 200})
r.raise_for_status()  # fail early instead of parsing an error response
data = r.json()["data"]
gene_counts = Counter(m.get("gene_name") for m in data if m.get("gene_name"))
df = pd.DataFrame(gene_counts.most_common(10), columns=["gene", "mutations"])
print(df.to_string(index=False))
When to use: Look up whether a specific amino acid change is recorded in COSMIC.
import requests, base64
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
HEADERS = {"Authorization": f"Basic {token}"}
gene = "KRAS"
aa_change = "p.G12D"
# Note: only the first page (up to 200 records) is checked here
r = requests.get("https://cancer.sanger.ac.uk/cosmic/api/mutations",
                 headers=HEADERS,
                 params={"gene_name": gene, "limit": 200})
r.raise_for_status()
all_muts = r.json()["data"]
matches = [m for m in all_muts if aa_change in (m.get("mutation_aa") or "")]
print(f"{gene} {aa_change}: {'FOUND' if matches else 'NOT FOUND'} in COSMIC ({len(matches)} records)")
if matches:
    print(f" Sample count: {sum(m.get('count', 0) for m in matches)}")
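The lookup above uses a substring test, so its result depends on how the amino acid change happens to be written. One possible refinement is to normalize both sides and compare for equality. The sketch below handles only the variations mentioned in this document (`p.G12D` vs `G12D`) plus three-letter residue codes; `normalize_aa` is a hypothetical helper and does not cover full HGVS protein syntax.

```python
import re

# Three-letter to one-letter amino acid codes (standard abbreviations)
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}


def normalize_aa(change):
    """Reduce 'p.G12D', 'G12D', or 'p.Gly12Asp' to the bare one-letter form 'G12D'."""
    text = change.strip()
    if text.startswith("p."):
        text = text[2:]
    # Replace recognized three-letter codes; leave anything else untouched
    return re.sub(r"[A-Z][a-z]{2}",
                  lambda m: THREE_TO_ONE.get(m.group(0), m.group(0)),
                  text)


# Illustrative records only, shaped like the mutation_aa field used above
records = [{"mutation_aa": "p.G12D"}, {"mutation_aa": "p.G12V"}, {"mutation_aa": None}]
target = normalize_aa("G12D")
matches = [r for r in records if normalize_aa(r.get("mutation_aa") or "") == target]
print(len(matches))
```

Exact comparison after normalization avoids accidental matches on longer strings that merely contain the query.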
When to use: Get a simple list of Tier 1 cancer driver genes for filtering.
import requests, base64
EMAIL = "your_email@example.com"
PASSWORD = "your_password"
token = base64.b64encode(f"{EMAIL}:{PASSWORD}".encode()).decode()
HEADERS = {"Authorization": f"Basic {token}"}
r = requests.get("https://cancer.sanger.ac.uk/cosmic/api/genes",
                 headers=HEADERS,
                 params={"cgc_tier": "1", "limit": 200})
r.raise_for_status()  # fail early instead of parsing an error response
genes = [g["gene_name"] for g in r.json()["data"]]
print(f"CGC Tier 1 genes ({len(genes)}): {', '.join(genes[:10])}...")
with open("cosmic_tier1_genes.txt", "w") as f:
    f.write("\n".join(genes))
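Once the gene list has been written to disk, using it for filtering is mostly a set-membership test. The following sketch shows one way to do that for a list of variant records; the `gene` key and both helper names are assumptions made for this example, so adapt them to the actual variant table.

```python
from pathlib import Path


def load_gene_set(path):
    """Read one gene symbol per line into a set, skipping blank lines."""
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}


def filter_to_genes(variants, gene_set, key="gene"):
    """Keep only variant records whose gene symbol (under `key`) is in gene_set."""
    return [v for v in variants if v.get(key) in gene_set]
```

For example, `filter_to_genes(variants, load_gene_set("cosmic_tier1_genes.txt"))` would keep only the records annotated with a Tier 1 gene. A set is used so that each membership check stays constant-time even for long variant lists.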
| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 401 Unauthorized | Missing or incorrect API credentials | Check base64 encoding: base64.b64encode(f"{email}:{password}".encode()) |
| HTTP 403 Forbidden | Access requires different tier | Some endpoints need commercial license; check COSMIC license page |
| Empty data array | No records match filter | Broaden query; check spelling of gene symbol or site name |
| Very slow for large genes | TP53/KRAS have 100K+ records | Paginate with limit=200 (the maximum page size) and skip; cache results to local CSV |
| Rate limit errors | >10 req/s | Add time.sleep(0.15) between requests |
| Different AA notation format | Various mutation string formats | Normalize the strings (e.g. strip the p. prefix) or use COSM IDs for exact matching |
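The HTTP-related rows of the table can also be surfaced at run time. The sketch below maps a status code to the matching hint; `explain_status` is a hypothetical helper, and the use of status 429 for rate limiting is an assumption, since the table does not say which code COSMIC returns in that case.

```python
# Hints paraphrased from the troubleshooting table; the mapping itself is a
# convenience for this sketch and not part of any COSMIC client.
HINTS = {
    401: "Missing or incorrect credentials: check the base64-encoded email:password",
    403: "Endpoint may need a different access tier: check the COSMIC license page",
    # Assumption: rate limiting is reported as HTTP 429
    429: "Rate limited: add a short pause such as time.sleep(0.15) between requests",
}


def explain_status(status_code):
    """Return a troubleshooting hint for an HTTP status code, or None for a 2xx success."""
    if 200 <= status_code < 300:
        return None
    return HINTS.get(status_code, f"Unexpected HTTP {status_code}: inspect the response body")
```

A caller could print `explain_status(r.status_code)` when `r.ok` is false instead of raising immediately.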
- clinvar-database: Germline pathogenicity classifications complementing COSMIC's somatic focus
- opentargets-database: Drug-target associations for COSMIC cancer driver genes
- ensembl-database: Variant consequence predictions (VEP) for COSMIC variants
- gwas-database: Population-level SNP associations for cancer risk (vs. COSMIC's somatic mutations)