skills/tooluniverse-protein-sae-variant-interpretation/SKILL.md
Interpret a missense variant via ESMC-6B Sparse Autoencoder (SAE) feature activations. For a given protein + variant, computes which interpretable SAE features (catalytic, ligand-binding, PTM, structural motif, domain, etc.) are lost or gained at the mutation site. Use when standard pathogenicity scores (AlphaMissense, ClinVar) say a variant is damaging but you need a MECHANISTIC explanation — e.g. 'why is this variant LoF?' Complements (does not replace) variant-interpretation and variant-to-mechanism skills, which focus on ACMG classification or regulatory mechanism.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-protein-sae-variant-interpretationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Interpret a single missense variant by comparing reference vs mutant Sparse Autoencoder (SAE) feature activations from the ESMC-6B protein language model. SAE features are interpretable latent dimensions of the model's hidden state — many activate on biologically meaningful patterns (active sites, ligand-binding pockets, PTM sequons, structural motifs).
Apply when users:
Not for (use other skills instead):
tooluniverse-variant-interpretationtooluniverse-variant-to-mechanismtooluniverse-gene-disease-associationtooluniverse-cancer-variant-interpretation| Input | Format | Example |
|---|---|---|
| Protein identifier | UniProt accession or HGNC gene symbol | P04637 or TP53 |
| Variant | Single-letter code: {ref_aa}{position_1idx}{alt_aa} | R175H |
Optional:
pip install 'esm @ git+https://github.com/evolutionaryscale/esm@ee891c52'
The PyPI release of esm does NOT yet include SAEConfig. Install from the upstream feature branch.License note: SAE outputs from Forge are governed by the Cambrian Inference Clickthrough License — non-commercial / academic research use only unless a separate commercial agreement applies.
If the user gave a gene symbol, resolve to a reviewed human accession:
UniProt_search(
query="gene:TP53 AND organism_id:9606 AND reviewed:true",
fields=["accession", "gene_names", "protein_name"],
)
# → returns accession P04637 as the canonical reviewed human TP53
UniProt_get_sequence_by_accession(accession="P04637")
# → returns the canonical isoform 1 sequence (393 AA for TP53)
Parse the variant string (e.g. R175H):
ref_aa = "R", position = 175 (1-indexed), alt_aa = "H"ref_sequence[174] == "R" (Python is 0-indexed, position is 1-indexed)mutant_sequence = ref_sequence[:174] + "H" + ref_sequence[175:]If the reference residue does NOT match, return an explicit error — do not silently mutate the wrong position.
For the standard variant-interpretation use case, use one of two composite tools depending on how much you need:
Fullest one-call — disruption + per-feature biological category labels + mechanism summary:
ESM_explain_variant_mechanism(
sequence=ref_sequence,
position=175, ref_aa="R", alt_aa="H",
window=8,
top_k_features=5, # describe top 5 lost + top 5 gained
)
# data["mechanism_summary"] e.g. "Disrupted feature categories (lost): catalytic=2, ligand-binding=1"
# data["lost_feature_categories"] / ["gained_feature_categories"] — category counts
# data["top_features_lost"] / ["top_features_gained"] — per-feature delta + category + confidence
This is the right default for variant-mechanism reports — saves you from chaining ESM_score_variant_sae_disruption + N ESM_describe_sae_feature calls. Set include_descriptions=false to skip labeling (2 Forge calls only) when you just need the deltas.
Raw delta only (no category labels, no describe calls — faster):
ESM_score_variant_sae_disruption(
sequence=ref_sequence,
position=175, ref_aa="R", alt_aa="H",
window=8, top_k_features=10,
)
# → returns top_features_lost + top_features_gained ranked by |delta|
# plus ref / mut activation sums per feature
If ref_aa doesn't match the sequence at the given position, both tools return a clear error (you supplied the wrong isoform / mis-labeled the variant). The longer path below is for inspecting raw per-residue features.
Multiple variants at once (e.g. saturation at residue 175 — all 19 alternates):
alts = "ACDEFGHIKLMNPQRSTVWY".replace("R", "")
variants = [{"position": 175, "ref_aa": "R", "alt_aa": a} for a in alts]
ESM_score_variant_sae_batch(sequence=ref_sequence, variants=variants, top_k_features=5)
# 1 + 19 = 20 Forge calls, not 38
ref_features = ESM_get_sae_features(
sequence=ref_sequence,
position=175, # 1-indexed mutation position
window=8, # +/- 8 residues = 17-residue window
top_k_per_residue=64, # full sparsity (k=64 is the SAE's actual k)
)
mut_features = ESM_get_sae_features(
sequence=mutant_sequence,
position=175,
window=8,
top_k_per_residue=64,
)
Each call returns a list of {residue_idx_1based, active_features: [{feature_id, activation}]} for residues in the window. Typical latency: ~1-3 seconds per call (so ~2-6s total). Forge cost: 2 credits (1 per call).
Aggregate gained / lost features over the window:
# Build per-feature activation arrays across the window
def feature_to_window_sum(features_response):
sums = {} # feature_id -> sum of activations across all residues in window
for residue in features_response["data"]["activations"]:
for f in residue["active_features"]:
sums[f["feature_id"]] = sums.get(f["feature_id"], 0.0) + f["activation"]
return sums
ref_sums = feature_to_window_sum(ref_features)
mut_sums = feature_to_window_sum(mut_features)
# Delta = mut - ref. Positive = gained on mutation. Negative = lost.
all_features = set(ref_sums) | set(mut_sums)
deltas = {
f: mut_sums.get(f, 0.0) - ref_sums.get(f, 0.0)
for f in all_features
}
top_lost = sorted(deltas.items(), key=lambda x: x[1])[:10] # most negative
top_gained = sorted(deltas.items(), key=lambda x: -x[1])[:5] # most positive
The 16,384 SAE features have been categorized (by UniRef90 activation patterns) into ~6 biological types. Interpret features by their dominant category:
| Category | What it means biologically | Variant types it often catches | |---|---|---| | Catalytic function | Activates on residues at or near enzyme active sites | Variants that disrupt enzymatic activity (kinases, hydrolases, transferases) | | Ligand-binding site | Activates on residues that contact small-molecule / ion / nucleotide ligands | Variants disrupting drug binding, ATP/GTP binding, metal coordination, DNA/RNA binding | | Post-translational modification (PTM) | Activates on phospho-sites, glycosylation sequons, ubiquitin sites, acetylation sites | Variants disrupting phospho-regulation, N-glycosylation, ubiquitin-mediated degradation | | Domain / motif | Activates on classic structural domains (Zn finger, leucine zipper, EF-hand, etc.) | Variants disrupting tertiary fold within a domain | | Structural stability | Activates on residues critical to local fold | Variants destabilizing the protein → folding LoF | | Secondary structure / surface | Activates on alpha helices, beta strands, solvent-exposed residues | Lower-specificity; weak mechanism signal |
When SAE features are lost in mutation → that biological capability is plausibly disrupted.
When SAE features are gained → mutation may have introduced a non-native signal (often disorder / alternative fold).
Look up the biological category of any specific feature_id via ESM_describe_sae_feature(feature_id=...). The first call for a given feature is slow (~30s, ~10 Forge credits as the tool runs SAE on a 10-protein labeling panel + checks UniProt annotations); subsequent calls for the same feature hit a local cache and are instant.
esmc-6b-2024-12_k64_codebook16384_layer60 SAE. The smaller 600m SAE exists but has not been validated against the LoF benchmark from the source repo.A single SAE-based analysis can be misleading on its own. For high-stakes interpretation, cross-validate with structural / population evidence:
| Layer | Tool | Confirms |
|---|---|---|
| Population frequency | gnomad_get_variant / gnomad_search_variants | Is the variant rare? (LoF candidates are usually rare or absent) |
| Pathogenicity prediction | AlphaMissense_get_variant_score | Does an independent ML predictor agree it's damaging? |
| Clinical evidence | ClinVar_search_variants + ClinVar_get_variant_details | Is this variant already curated with clinical significance? |
| Structural context | alphafold_get_prediction (returns pLDDT per residue) | Is the mutated residue in a well-folded region (pLDDT > 70)? |
| Functional annotation | UniProt_get_disease_variants_by_accession, UniProt_get_function_by_accession | Is the mutated residue in a known catalytic / binding / PTM site? |
If 3+ layers agree the variant is damaging, the SAE feature analysis is the mechanistic explanation layer: "AlphaMissense says it's damaging, gnomAD confirms it's absent in population, SAE shows the catalytic feature is lost → this is a catalytic LoF variant."
After completing the workflow, summarize as:
Variant: {VARIANT_ID} (e.g. P04637_R175H = TP53 R175H)
Protein: {gene} ({uniprot_accession})
Top features LOST at position {position} ± {window}:
- Feature {id}: activation Δ = {delta:.2f} (category: {category})
{one-line biological summary}
...
Top features GAINED at position {position} ± {window}:
- Feature {id}: Δ = +{delta:.2f} (category: {category if known else "uncategorized"})
...
Mechanistic interpretation: {2-3 sentences synthesizing the SAE evidence
with cross-validation layers}
Confidence: {high|medium|low}, based on:
- SAE evidence: {how clear is the dominant category?}
- Cross-validation: {how many layers agree?}
- Limitations encountered: {any of the 6 caveats above that apply}
tools
Post-market safety surveillance and recall/adverse-event RETRIEVAL across the full spectrum of FDA-regulated products that are NOT covered by the drug-AE signal skills: medical devices, food / dietary supplements / cosmetics, veterinary drugs, and drug supply (shortages). Orchestrates openFDA endpoints (MAUDE device adverse events + device recalls + 510(k), CAERS food/supplement/ cosmetic adverse events, veterinary adverse events, drug shortages, and cross-product enforcement/recall reports). USE WHEN the user asks: "are there adverse events for [device / pacemaker / infusion pump / insulin pump]", "device recalls for [firm/product]", "supplement / vitamin / cosmetic adverse reactions", "is [drug] in shortage", "what injectables are on shortage", "veterinary / animal adverse events for [drug] in [dog/cat/horse]", "food recall for listeria", "MAUDE report for [device]", "CAERS reactions for [brand]". DO NOT USE for drug adverse-event SIGNAL detection or disproportionality (PRR / ROR / IC) or drug-AE association scoring — that is `tooluniverse-pharmacovigilance` / `tooluniverse-adverse-event-detection`. This skill is multi-product surveillance and retrieval, not drug-AE statistical signal mining.
tools
--- name: tooluniverse-phewas description: Cross-ancestry / cross-biobank phenome-wide association (PheWAS) and replication. Given ONE variant (rsID) or ONE gene, look up every phenotype it associates with across European/UK (UKB-TOPMed), Finnish (FinnGen), Japanese (BioBank Japan), and Taiwanese (TPMI) biobanks, plus exome-wide gene-burden PheWAS (Genebass), then judge whether an association replicates across ancestries or is population-specific. Use whenever the user asks "what else is this va
tools
Dereplicate a putative natural product and assign its chemical taxonomy. Use to answer "is [compound] a known natural product", "what microbe/organism produces [compound]", "what chemical class is [compound]", "dereplicate this metabolite (by formula/exact mass/InChIKey/SMILES)", or "classify this molecule into ChemOnt". Searches NPAtlas for known microbial natural products (producing organism + literature reference), assigns the ChemOnt kingdom→superclass→class→subclass hierarchy via ClassyFire, resolves systematic IUPAC names to structure via OPSIN, and cross-references identity in PubChem. NOT for general drug/compound identity or ADMET (use tooluniverse-chemical-compound-retrieval / tooluniverse-small-molecule-discovery) and NOT for metabolomics pathway/enrichment analysis (use tooluniverse-metabolomics skills).
tools
Genome-ASSEMBLY discovery, QC, and replicon mapping for any organism (bacteria, archaea, fungi, and beyond) using NCBI Datasets. Resolves an organism name or taxid to assemblies, picks the reference/representative or best-quality assembly, pulls assembly QC metrics (total length, contig/scaffold N50, contig count, GC%, assembly level, RefSeq category), enumerates chromosomes and plasmids via per-replicon sequence reports, and compares candidate assemblies on quality. Use for "what genomes are available for [organism]", "assembly stats / N50 / GC content for [GCF_/GCA_ accession]", "how many plasmids does [strain] have", "compare assemblies for [species]", "find the reference genome for [taxon]", "is this assembly Complete Genome or just contigs". NOT for gene-level orthology/synteny (use tooluniverse-comparative-genomics), plant gene structure (use tooluniverse-plant-genomics), de novo assembly from raw reads (no tool exists), or taxonomy-only name/lineage lookups.