immunoinformatics/immunogenicity-scoring/SKILL.md
Rank and prioritize neoantigen/epitope candidates by likely T-cell response using NeoFox feature annotation, PRIME2.0, BigMHC-IM, the Łuksza/Balachandran fitness model (agretopicity + foreignness), and pVACtools tiering. Encodes the field's hard truths that immunogenicity is the least-solved layer (dedicated scores ~AUROC 0.6-0.7, modest PPV), that scores are valid only for RANKING within one patient (never absolute go/no-go or cross-patient), that DAI has anchor-inflation and WT-denominator traps, and that stacking weak correlated scores into one number is a red flag. Use when ordering a candidate list for a vaccine. Binding lives in mhc-binding-prediction; calling in neoantigen-prediction.
Reference examples tested with: NeoFox 1.0+, pVACtools 4.1+, pandas 2.2+, numpy 1.26+
Before using code patterns, verify installed versions match. If versions differ:
- `pip show <package>` then `help(module.function)` to check signatures
- `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Notes specific to this skill: NeoFox annotates ~16 published features (it does not rank candidates automatically); PRIME 2.x requires MixMHCpred v3.0+ on PATH; BigMHC has separate -m el and -m im heads. ImmunoBERT is a PRESENTATION model, not an immunogenicity predictor — do not use it here. PRIME2.0 is the Cell Systems 2023 paper (the Cell Reports Medicine 2021 paper is PRIME v1). Re-verify tool versions and the supported-allele lists before scoring.
"Rank my neoantigen candidates by how likely a T cell responds" -> Annotate presentation + recognition features and order candidates within a patient; never assign an absolute immunogenicity verdict.
- NeoFox to compute the published feature panel; PRIME / BigMHC `-m im` for recognition scores

Binding/presentation is genuinely good (AUROC high-0.9s); immunogenicity is not close. Predicting whether a displayed peptide provokes a T-cell response requires knowing whether a cognate TCR exists in this patient's repertoire, whether that clone survived thymic negative selection (escaped tolerance), and whether it activates in a suppressive tumor microenvironment; none of these is observable from sequence. Dedicated immunogenicity tools land around AUROC 0.6-0.7 on their own test sets and worse on independent data; in TESLA the dedicated in-silico immunogenicity scores correlated poorly with validated immunogenicity, while presentation strength, binding stability, abundance/expression, agretopicity, and foreignness carried the signal.

Two operational rules follow. First, immunogenicity scores are calibrated within a context (a tool, an allele, often a patient's HLA), so they are legitimate for ordering one patient's candidate list and illegitimate for absolute go/no-go or cross-patient/cross-allele comparison. Second, a confident single composite number is a red flag: stacking weak, correlated, IEDB-bias-trained scores into one value launders the bias at higher apparent precision. The honest deliverable is an ordered, feature-annotated shortlist with its uncertainty stated out loud.
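The first rule can be made mechanical by converting a recognition score to a rank inside each patient before anyone reads the raw number. A minimal sketch, assuming a table with hypothetical `patient_id` and `prime_score` columns (illustrative names, not actual NeoFox or PRIME output fields):

```python
import pandas as pd


def add_within_patient_rank(df, score_col="prime_score", patient_col="patient_id"):
    """Rank candidates inside each patient; rank 1 is that patient's top score."""
    out = df.copy()
    out["rank_in_patient"] = (
        out.groupby(patient_col)[score_col]
        .rank(ascending=False, method="min")
        .astype(int)
    )
    return out


# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "patient_id": ["A", "A", "B", "B"],
    "peptide": ["pepA1", "pepA2", "pepB1", "pepB2"],
    "prime_score": [0.90, 0.20, 0.30, 0.10],
})
ranked = add_within_patient_rank(toy)
```

Patient B's best candidate scores 0.30 and still ranks first in its own list; setting that 0.30 against patient A's 0.90 would be exactly the cross-patient comparison the rule forbids.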
The best-binder heuristic fails on a tolerance argument: a peptide that binds MHC superbly but closely resembles a self-peptide the thymus presented has had its cognate T cells deleted, so display does not help. A moderate binder that looks strikingly un-self may have a full, un-tolerized repertoire. The modern requirement is conjunctive — a useful neoantigen must be both PRESENTED (binding) AND FOREIGN enough (different from self) to have escaped tolerance. The Łuksza/Balachandran fitness model formalizes this: quality = amplitude (how much better the mutant is presented than its WT, a DAI-like term) x recognition potential R (resemblance to known immunogenic foreign epitopes). This is why agretopicity and foreignness, not raw affinity, recur in every validated analysis.
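The conjunctive requirement can be illustrated with the product form quoted above. This is a sketch under stated assumptions, not the published implementation: the amplitude is taken as the plain ratio IC50_WT / IC50_MT, and the recognition potential R is assumed to arrive precomputed as a value in [0, 1]. The function name and all numbers are hypothetical.

```python
def neoantigen_quality(wt_ic50, mt_ic50, recognition_potential):
    """Sketch of quality = amplitude x recognition potential R.

    Assumes R was computed elsewhere and lies in [0, 1]; this function
    does not derive R from sequence.
    """
    if wt_ic50 <= 0 or mt_ic50 <= 0:
        raise ValueError("IC50 values must be positive")
    if not 0.0 <= recognition_potential <= 1.0:
        raise ValueError("recognition_potential is assumed to lie in [0, 1]")
    amplitude = wt_ic50 / mt_ic50  # how much better the mutant binds than its WT
    return amplitude * recognition_potential


# Invented numbers: a strong binder that looks like self vs. a moderate binder that does not.
strong_but_self_like = neoantigen_quality(wt_ic50=60.0, mt_ic50=20.0, recognition_potential=0.01)
moderate_but_foreign = neoantigen_quality(wt_ic50=2000.0, mt_ic50=200.0, recognition_potential=0.6)
```

With these toy inputs the 20 nM binder ends up far below the 200 nM binder, because a near-zero recognition term cancels a strong presentation term. That is the tolerance argument in arithmetic form.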
| Tool | Citation | What it scores | Note |
|------|----------|----------------|------|
| NeoFox | Lang 2021 | ~16 features at once (DAI, foreignness, dissimilarity, PRIME, PHBR, ...) | Annotates, does NOT rank; the right division of labor |
| pVACtools tiering | Hundal 2020 | Rule-based tiers + within-tier sort | Auditable default; quarantines anchor/subclonal traps |
| PRIME2.0 | Gfeller 2023 | Class I immunogenicity (presentation x TCR-recognition) | Strong; needs MixMHCpred v3.0+ |
| BigMHC-IM | Albert 2023 | Class I immunogenicity (transfer-learned) | High precision; pan-allelic |
| IEDB immunogenicity | Calis 2013 | Class I (AA + position) | Weak, allele-pooled, no self-comparison; one feature only |
| DeepImmuno | Li 2021 | Class I CNN | 9/10mer only; limited alleles |
| fitness model (foreignness) | Łuksza 2017; Balachandran 2017 | Quality = amplitude x recognition | The conceptual backbone |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Default: rank a patient's candidates | NeoFox features -> pVACtools tiering -> human curation | Transparent features + auditable tiers, not a black-box score |
| Need a single recognition score | PRIME2.0 or BigMHC-IM | Best-validated class I; report alongside features, not alone |
| "Is this one immunogenic, yes/no?" | Reframe to ranking | No honest tool gives an absolute verdict |
| CD4 / class II immunogenicity | Flag as a frontier (TLimmuno2 etc.) | Class II immunogenicity is even less solved |
| Final shortlist for synthesis | Feature-annotated table + expression/clonality filters | Presentation + abundance carry most real signal (TESLA) |
Goal: Order one patient's candidates without collapsing fragile features into a single over-trusted number.
Approach: Compute the feature panel (NeoFox), apply the non-negotiable expression/clonality filters first, then sort by presentation + abundance + quality features, keeping the features visible side by side for human curation. Cross-patient comparison is invalid.
```python
import pandas as pd


def rank_within_patient(df, expr_col='gene_expression', vaf_col='rna_vaf'):
    '''Filter (not score) on expression/clonality first, then order by presentation,
    abundance, and quality. Returns a feature-annotated table for human curation, not
    a verdict. Scores are within-patient only - never compare across patients/alleles.'''
    # Hard filters first: gene TPM >= 1 and RNA VAF >= 0.25.
    keep = df[(df[expr_col] >= 1.0) & (df[vaf_col] >= 0.25)].copy()
    # Lower presentation rank is better; higher expression, agretopicity, foreignness are better.
    # Use expr_col here so a non-default expression column is still part of the sort.
    sort_cols = ['presentation_rank', expr_col, 'agretopicity', 'foreignness']
    ascending = [True, False, False, False]
    # Sort only on the columns that are actually present in the table.
    cols = [c for c in sort_cols if c in keep.columns]
    asc = [a for c, a in zip(sort_cols, ascending) if c in keep.columns]
    return keep.sort_values(cols, ascending=asc)
```
Goal: Use the mutant-vs-WT binding gain without falling into its two traps.
Approach: Agretopicity (ratio, IC50_WT / IC50_MT; the DAI family — Duan 2014 uses the difference form) rewards a mutant that binds while WT does not. Trap 1: an anchor-position mutation inflates it without changing the TCR-facing surface (quarantine via the Anchor tier). Trap 2: when WT binds very poorly, the denominator explodes and the ratio is dominated by prediction noise — a value of 200 on a barely-estimable WT is not 100x more meaningful than a value of 2.
```python
def defensive_dai(df, wt='wt_ic50', mt='mt_ic50', anchor='mutation_at_anchor', wt_cap=5000):
    '''Flag anchor-inflated and denominator-unstable DAI rather than trusting the number.'''
    out = df.copy()
    out['dai'] = out[wt] / out[mt]  # ratio form: IC50_WT / IC50_MT
    # Cast to bool so the ~ below is a logical NOT even if the column holds 0/1 integers.
    out['dai_anchor_artifact'] = out[anchor].astype(bool)  # surface unchanged -> DAI is artifact
    out['dai_unstable'] = out[wt] > wt_cap  # WT barely presented -> ratio is noise
    out['dai_trustworthy'] = ~out['dai_anchor_artifact'] & ~out['dai_unstable']
    return out
```
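One optional way to bound Trap 2 is to clip the WT IC50 at a ceiling before dividing, so a barely-estimable WT cannot push the ratio arbitrarily high. This clipping step is not part of the workflow described here; it is a sketch of one possible mitigation. The 5000 nM ceiling simply mirrors the `wt_cap` default used for flagging, and `capped_agretopicity` is a hypothetical name.

```python
import pandas as pd


def capped_agretopicity(df, wt="wt_ic50", mt="mt_ic50", wt_cap=5000.0):
    """Report the raw ratio next to a version whose WT IC50 is clipped at wt_cap."""
    out = df.copy()
    out["dai_raw"] = out[wt] / out[mt]
    # Assumption: WT predictions above wt_cap are treated as "not presented"
    # and carry no further information, so they are clipped to the ceiling.
    out["dai_capped"] = out[wt].clip(upper=wt_cap) / out[mt]
    return out


# Invented IC50 values (nM) for two candidates with the same mutant binding.
toy = pd.DataFrame({
    "peptide": ["stable_wt", "unstable_wt"],
    "wt_ic50": [400.0, 40000.0],
    "mt_ic50": [200.0, 200.0],
})
capped = capped_agretopicity(toy)
```

The raw ratios are 2 and 200; after clipping they are 2 and 25. Keeping both columns visible is in line with flagging rather than silently replacing the number.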
Trigger: "score > X means immunogenic" or comparing scores across patients. Mechanism: scores are calibrated within tool/allele/patient. Symptom: false confidence; cross-patient mis-ranking. Fix: rank within a patient; state uncertainty; never threshold absolutely.
Trigger: summing/modeling DAI + foreignness + dissimilarity + hydrophobicity + PRIME into one number. Mechanism: components are weak, correlated (several measure "un-selfness"), and trained on ill-defined negatives. Symptom: an authoritative-looking 3-decimal number hiding fragile assumptions. Fix: keep features side by side; use auditable tiers; let a human weigh axes.
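Before anyone is tempted to sum features, it can help to check how redundant they are within one patient's list. A minimal sketch, assuming a table with one column per feature; the 0.7 cutoff and the column names are hypothetical choices, not published thresholds. Spearman correlation is computed here as Pearson correlation on ranks.

```python
import pandas as pd


def redundant_feature_pairs(df, features, threshold=0.7):
    """Return feature pairs whose absolute rank correlation is at least threshold."""
    corr = df[features].rank().corr()  # Pearson on ranks, i.e. Spearman
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            rho = float(corr.loc[a, b])
            if abs(rho) >= threshold:
                pairs.append((a, b, round(rho, 3)))
    return pairs


# Invented feature values for five candidates from one patient.
toy = pd.DataFrame({
    "foreignness": [0.10, 0.20, 0.30, 0.40, 0.50],
    "dissimilarity": [0.20, 0.25, 0.50, 0.60, 0.90],
    "hydrophobicity": [0.50, 0.10, 0.40, 0.20, 0.30],
})
pairs = redundant_feature_pairs(toy, ["foreignness", "dissimilarity", "hydrophobicity"])
```

In this toy table foreignness and dissimilarity order the candidates identically, so adding both to a composite would count one "un-selfness" axis twice.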
Trigger: trusting a high DAI. Mechanism: anchor mutation changes binding not TCR surface; tiny WT binding blows up the ratio. Symptom: top-ranked candidates that are anchor artifacts or noise. Fix: inspect mutation position and actual WT binding; quarantine via Anchor tier.
Trigger: trusting a new tool's headline AUROC. Mechanism: IEDB "negatives" conflate proven-non-immunogenic with untested; redrawing realistic negatives collapses performance. Symptom: great benchmark, poor real-world PPV. Fix: ask how negatives were defined before reading the number.
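One related effect is easy to show with arithmetic: even with sensitivity and specificity held fixed, PPV drops sharply when true immunogenic peptides are rarer in practice than in the benchmark. The sketch below applies Bayes' rule; the 0.7 operating point and both prevalence values are invented for illustration and are not taken from TESLA or any tool paper.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical balanced benchmark: half the peptides are labeled immunogenic.
ppv_benchmark = ppv(sensitivity=0.7, specificity=0.7, prevalence=0.5)

# Hypothetical candidate list where only 5% are truly immunogenic.
ppv_realistic = ppv(sensitivity=0.7, specificity=0.7, prevalence=0.05)
```

Under these assumed numbers the same classifier goes from a PPV of 0.70 on the balanced set to about 0.11 on the realistic list, which is why the composition of the negative set matters before the headline metric does.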
Trigger: optimizing a vaccine purely on class I immunogenicity. Mechanism: CD4 help drives durable efficacy but class II immunogenicity is a frontier. Symptom: optimizing the better-measured half of a two-armed problem. Fix: flag class II as unproven; include CD4 epitopes via mhc-class-ii-prediction.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Dedicated immunogenicity AUROC ~0.6-0.7 | TESLA; tool benchmarks | The honest performance ceiling; weak prior, not verdict |
| Gene TPM >= 1, RNA VAF >= 0.25 | pVACtools defaults | Unexpressed/low-VAF peptides are not displayed (filter first) |
| Subclonal at DNA VAF <= purity/4 | pVACtools | Clonal targets beat subclonal (McGranahan 2016) |
| Presentation + abundance carry the signal | Wells 2020 (TESLA) | Most predictive power is upstream of recognition scores |
| Rank within patient only | Score calibration | Cross-patient/allele comparison is invalid |
| Agretopicity ratio (amplitude); DAI difference | Łuksza 2017; Duan 2014 | Inspect position + WT binding; anchor inflation and denominator instability |
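The three numeric thresholds in the table can be applied as visible flags rather than silent drops. A minimal sketch, assuming hypothetical column names (`gene_expression` in TPM, `rna_vaf`, `dna_vaf`) and a tumor purity estimate supplied by the caller; it does not reproduce pVACtools' own implementation.

```python
import pandas as pd


def flag_thresholds(df, purity, tpm_col="gene_expression",
                    rna_vaf_col="rna_vaf", dna_vaf_col="dna_vaf"):
    """Add boolean flags for low expression, low RNA VAF, and subclonality."""
    out = df.copy()
    out["low_expression"] = out[tpm_col] < 1.0          # gene TPM >= 1 required
    out["low_rna_vaf"] = out[rna_vaf_col] < 0.25        # RNA VAF >= 0.25 required
    out["subclonal"] = out[dna_vaf_col] <= purity / 4.0  # DNA VAF <= purity/4
    out["passes_filters"] = ~(out["low_expression"] | out["low_rna_vaf"] | out["subclonal"])
    return out


# Invented values for three candidates in a tumor with assumed purity 0.8.
toy = pd.DataFrame({
    "peptide": ["pep1", "pep2", "pep3"],
    "gene_expression": [5.0, 0.5, 10.0],
    "rna_vaf": [0.40, 0.50, 0.10],
    "dna_vaf": [0.35, 0.30, 0.15],
})
flagged = flag_thresholds(toy, purity=0.8)
```

Keeping the individual flags in the table lets a curator see why a candidate was set aside, instead of finding only a shorter list.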
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Absolute "immunogenic: yes/no" claim | Thresholded a within-context score | Reframe as within-patient ranking |
| Over-trusted single composite | Stacked weak correlated scores | Keep features visible; audit with tiers |
| High-DAI artifacts at top | Anchor mutation / unstable WT denominator | Defensive DAI; Anchor tier |
| Used ImmunoBERT as immunogenicity | It is a presentation model | Use PRIME/BigMHC-IM/Calis for recognition |
| Great AUROC, poor validation | Ill-defined negative set | Interrogate negatives; demand functional validation |
| Class II candidates over-trusted | CD4 immunogenicity is a frontier | Flag uncertainty; treat as unproven |