clinical-databases/hla-typing/SKILL.md
Calls HLA class I and class II alleles at 2/4/6/8-field resolution from WGS/WES/RNA-seq/long-read data using OptiType, HLA-LA, T1K, Polysolver, HLA-HD, arcasHLA, StarPhase, or HIBAG imputation. Use when typing for HSCT, solid-organ transplant, neoantigen prediction, PGx screening (B*57:01, B*15:02, etc.), or disease-association studies, with reconciliation across tools and IPD-IMGT/HLA version mismatch handling.
Reference examples tested with: OptiType 1.3.5, HLA-LA 1.0.4, T1K 1.0.6 (Song 2023), Polysolver 4.0, HLA-HD 1.7.1, arcasHLA 0.6.0, StarPhase 1.0+ (PacBio), HIBAG 1.40+, samtools 1.19+, bwa-mem 0.7.17+. IPD-IMGT/HLA database release frequency is quarterly; tools must be re-bundled with the current release to capture new alleles (~38,000 alleles at Jan 2024; ~43,000+ by Jul 2025).
Before using code patterns, verify installed versions match. If versions differ:
- `pip show <package>` then `help(module.function)` to check signatures
- `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. Tool reference-bundle vintage matters more than algorithm choice for non-European cohorts; a 2022-bundled HLA-LA will silently miss thousands of post-2022 alleles dominant in African and South Asian ancestry.
'Determine HLA genotype for HSCT / neoantigen prediction / PGx screening' -> Call HLA class I (A, B, C) and class II (DRB1, DRB3/4/5, DQA1, DQB1, DPA1, DPB1) alleles at the resolution required by the downstream application.
- `t1k --preset hla -1 R1.fq -2 R2.fq -f hla_reference.fa`
- `OptiTypePipeline.py -i R1.fq R2.fq -d`
- `HLA-LA.pl --BAM input.bam --graph PRG_MHC_GRCh38_withIMGT`
- `arcasHLA extract sample.bam -o out && arcasHLA genotype out/sample.extracted.fq.gz`
- `HIBAG::predict()` with ancestry-stratified reference panel

HLA nomenclature: HLA-A*02:01:01:01 = family : protein-changing : synonymous : intronic/UTR. Expression suffixes: N (null; DNA present, no protein expressed); L (low expression); S (secreted); Q (questionable); A (aberrant). A serologically apparent DR4-positive donor carrying DRB4*01:03:01:02N is functionally DR53-negative; a classic HSCT donor-selection failure.
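The nomenclature above can be handled programmatically. The following is a minimal sketch, assuming plain allele names (not G or P group codes) and only the five expression suffixes listed above; `HlaAllele` and `parse_allele` are hypothetical helper names, not part of any typing tool.

```python
import re
from typing import NamedTuple, Optional

# Matches names such as HLA-A*02:01:01:01 or DRB4*01:03:01:02N.
# Assumption: one to four colon-separated numeric fields, optional suffix N/L/S/Q/A.
ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*"
    r"(?P<fields>\d+(?::\d+){0,3})"
    r"(?P<suffix>[NLSQA])?$"
)


class HlaAllele(NamedTuple):
    locus: str
    fields: tuple            # e.g. ("01", "03", "01", "02")
    suffix: Optional[str]    # expression suffix, or None

    @property
    def is_null(self) -> bool:
        # N = DNA present, no protein expressed
        return self.suffix == "N"

    def truncated(self, n_fields: int) -> str:
        # Keep only the first n colon-separated fields (suffix dropped here).
        return f"{self.locus}*{':'.join(self.fields[:n_fields])}"


def parse_allele(name: str) -> HlaAllele:
    match = ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised HLA allele name: {name!r}")
    return HlaAllele(match["locus"], tuple(match["fields"].split(":")), match["suffix"])


allele = parse_allele("DRB4*01:03:01:02N")
print(allele.locus, allele.fields, allele.is_null)
```

Keeping the suffix as a separate attribute means a truncated name can be produced for matching while the null status stays available for functional interpretation.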
| Application | Min resolution | Why |
|-------------|---------------|-----|
| HSCT (unrelated donor) | 6-field (12/12 match) | Null alleles + permissive DPB1 + Bw4/Bw6 + TCE3 core/non-core |
| Solid organ transplant | 4-field (2-digit:2-digit) | Eplet-level epitope match (HLAMatchmaker, PIRCHE-II) |
| ICI neoantigen prediction | 4-field class I + II | NetMHCpan-4.1 minimum |
| HLA-disease association | 4-field | Standard for GWAS HLA fine-mapping |
| HLA-B*57:01 abacavir screen | 4-field, specific | Other *57 alleles (*57:03) do NOT cause HSS |
| HLA-B*15:02 carbamazepine | 4-field, specific | *15:02 only; *15:01 (NFE-common) is not the risk allele |
DR haplotype linkage is fixed and is the canonical sanity check on any DR typing:
| DRB1 allele family | Linked DRB3/4/5 |
|---------------------|-----------------|
| DR1 (*01), DR8 (*08), DR10 (*10) | None |
| DR3 (*03), DR11 (*11), DR12 (*12), DR13 (*13), DR14 (*14) | DRB3 |
| DR4 (*04), DR7 (*07), DR9 (*09) | DRB4 |
| DR15 (*15), DR16 (*16) | DRB5 |
Any caller reporting DRB4 with DRB1*15:01 is broken or has a chimera. Use this as a routine QC check on automated pipelines.
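The linkage check lends itself to automation. Below is a minimal sketch, assuming each sample's calls arrive as a flat list of allele strings; the function name `dr_linkage_violations` is hypothetical. It only flags a DRB3/4/5 call that no reported DRB1 family explains, and deliberately does not flag a missing secondary locus, since the table above describes which locus is linked, not that a call must always be present.

```python
# Linkage table transcribed from the DRB1 family table above.
EXPECTED_LINKED_LOCUS = {
    "01": None, "08": None, "10": None,
    "03": "DRB3", "11": "DRB3", "12": "DRB3", "13": "DRB3", "14": "DRB3",
    "04": "DRB4", "07": "DRB4", "09": "DRB4",
    "15": "DRB5", "16": "DRB5",
}


def dr_linkage_violations(alleles):
    """Return DRB3/4/5 loci that are called but not explained by any DRB1 family."""
    drb1_families = set()
    called_secondary = set()
    for allele in alleles:
        name = allele[4:] if allele.startswith("HLA-") else allele
        locus, _, rest = name.partition("*")
        if locus == "DRB1":
            drb1_families.add(rest.split(":")[0])
        elif locus in {"DRB3", "DRB4", "DRB5"}:
            called_secondary.add(locus)
    allowed = {EXPECTED_LINKED_LOCUS.get(family) for family in drb1_families} - {None}
    return sorted(called_secondary - allowed)


# DRB4 reported next to a DRB1*15 homozygote: flagged for re-typing.
print(dr_linkage_violations(["DRB1*15:01", "DRB1*15:01", "DRB4*01:03"]))
```

An empty list means no unexplained secondary locus; any returned locus is a candidate for re-typing or a sample-swap check.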
| Tool | Class I | Class II | KIR | Resolution | Approach | Fails when |
|------|---------|----------|-----|-----------|----------|-----------|
| OptiType (Szolek 2014 Bioinformatics 30:3310) | Yes (~98% 4-digit) | No | No | 4-field | ILP on exons 2-3 | Class II needed; very deep contamination |
| Polysolver (Shukla 2015 Nat Biotechnol 33:1152) | Yes (~95% 4-digit) | No | No | 4-field | Allele-specific ref alignment | Class II; non-European ancestry under-typing |
| HLA-LA (Dilthey 2019 Bioinformatics 35:4054) | Yes (~94% class I) | Yes (best class II of WES tools) | No | 4-field | Graph-based PRG | High RAM/disk (~30-100 GB scratch) |
| T1K (Song 2023 Genome Res) | Yes (~99% 4-digit) | Yes (~99%) | Yes (KIR + KIR3DL2 ligand) | 4-field | EM on consensus reference | Newer; less benchmarking on edge cases |
| HLA-HD (Kawaguchi 2017 Hum Mutat 38:788) | Yes (~98%) | Yes (~95%) | No | 4-field | Bowtie2 against IPD-IMGT | License required for commercial use |
| arcasHLA (Orenbuch 2020 Bioinformatics 36:33) | Yes (~100% 2-field) | Yes (>99% 2-field) | No | 4-field from RNA-seq | EM on STAR alignment | DNA-seq; population prior bias in non-EUR |
| PHLAT, HLAforest, HLAminer, seq2HLA, HLAreporter | Yes | Some | No | Mostly 2-4 field | Various | Older; superseded |
Operational benchmark consensus (Claeys 2023 BMC Genomics; Matey-Hernandez 2018): T1K is currently the best general-purpose all-rounder; HLA-LA is the class-II reference; OptiType is the class-I anchor for WES. For full coverage of class I + II + KIR on WGS/WES, T1K is the 2024-2026 recommendation.
| Tool | Platform | Resolution | Use case |
|------|----------|-----------|----------|
| StarPhase (PacBio official 2024+) | PacBio HiFi | 8-field (full-field) | Transplant-grade typing |
| HLA*ASM | PacBio HiFi | 8-field | Assembly-based |
| FuFiHLA (2025 bioRxiv) | PacBio HiFi + ONT R10 | 8-field | Platform-agnostic |
| HLAminer streaming (Warren 2025) | ONT long-read | 4-field | Streaming nanopore |
| pbaa + StarPhase | PacBio amplicon | 8-field | Cost-effective targeted typing |
| IGenotyper (Roe 2021) | PacBio long-read | 8-field | Immunogenetics-focused |
ONT R9 was historically unreliable for null-allele discrimination due to homopolymer errors; R10.4 with duplex closes the gap for class I and is competitive with PacBio HiFi for class II. PacBio HiFi remains the gold standard for DPB1 4-field typing.
When only SNP-array genotypes are available (GWAS cohorts), use imputation:
| Tool | Approach | Reference panel | Best for |
|------|----------|----------------|----------|
| HIBAG (Zheng 2014 Pharmacogenomics J 14:192) | Random forest from SNP-array | Pre-fit per-ancestry classifiers (EUR, AS, AFR, HIS) | Population-stratified GWAS |
| HLA-TAPAS (Luo 2021 Nat Genet 53:1504) | Multi-ancestry imputation | 21,546 multi-ancestry reference | Cross-ancestry GWAS |
| HLA*IMP:02 (Dilthey 2013) | Hidden Markov | EUR-only | Legacy; EUR-only |
| SNP2HLA (Jia 2013) | Beagle-based | Type 1 Diabetes / EUR | Older; EUR-only |
| CookHLA (Cook 2021) | Hybrid SNP2HLA + supplementary | Multi-ancestry refs | Modern alternative to SNP2HLA |
| Multi-Ethnic Reference Panel (Degenhardt 2019) | Multi-ancestry imputation | Cross-population samples | Cross-ancestry GWAS |
Critical caveat: imputation panel quality is the limiting factor, NOT the imputation algorithm. EUR-trained HIBAG on East-Asian SNP-array data produces confidently wrong calls. African-ancestry imputation accuracy drops 10-20 percentage points without an ancestry-matched panel (Douillard 2024 HLA). For populations underrepresented in IPD-IMGT/HLA itself, imputation is fundamentally limited regardless of method.
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| WGS/WES, class I only, max speed | OptiType | Best class-I accuracy, ILP-based, fast |
| WGS/WES, class I + II, general-purpose | T1K | Best all-rounder; class I + II + KIR co-typing |
| WGS/WES, class II reference grade | HLA-LA | Highest class-II accuracy in benchmarks |
| RNA-seq tumor/normal for ICI | arcasHLA | RNA-seq native; expressed-allele-aware |
| Transplant 6+ field resolution | StarPhase (PacBio HiFi) | 8-field native; reference standard |
| Cost-effective targeted typing | pbaa + StarPhase amplicons | Lower cost than WGS |
| TCGA-style cancer cohort | Polysolver | TCGA convention; reproduces published values |
| SNP array (e.g., UKB) | HIBAG with population-matched panel | No sequencing data |
| Multi-ancestry GWAS | HLA-TAPAS | Cross-ancestry reference |
| Class II DPB1 4-field certainty | StarPhase or HiFi | Pre-2021 WES kits under-cover DPB1 |
| ONT-only data | T1K or HLAminer streaming for class I; ONT R10.4+ duplex for class II | R9 unreliable for nulls |
| HLA allele | Drug | Reaction | Population enrichment | OR |
|------------|------|----------|----------------------|-----|
| B*57:01 | Abacavir | Hypersensitivity syndrome | All ancestries (5-8% NFE) | ~100 |
| B*15:02 | Carbamazepine, oxcarbazepine | SJS/TEN | Han Chinese, Thai, Malay (>=5%) | ~2500 |
| B*58:01 | Allopurinol | SJS/TEN | Han Chinese, Korean, Thai | ~580 |
| A*31:01 | Carbamazepine | DRESS, MPE | Europeans, Japanese | ~12 |
| B*13:01 | Dapsone | DDS | Han Chinese, SE Asian | -- |
| B*35:02 (NOT *35:01) | Minocycline | DILI | All ancestries | -- |
| B*35:01 | TMP-SMX | DILI | Mixed | -- |
| B*14:01 | TMP-SMX | DILI | African | -- |
| A*33:01/03 | Terbinafine | DILI | Multi-ancestry | -- |
| DRB1*15:01 + DQB1*06:02 haplotype | Amoxicillin-clavulanate | DILI | Europeans | -- |
| B*15:13 | Phenytoin | SJS | Malaysian | -- |
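Operational rule: Pharmacogenomic HLA screening requires the specific allele (the "4-field" level used in this skill, e.g., B*15:02); an allele-family call (e.g., "B*15") cannot distinguish the risk allele from its non-risk relatives.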
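A minimal sketch of this rule in code, assuming calls arrive as allele strings and using only four of the associations from the table above; `PGX_RISK_ALLELES` and `screen_pgx` are hypothetical names for illustration. Calls that stop at the allele family are reported as unresolved rather than silently treated as negative.

```python
# Subset of the drug associations listed in the table above.
PGX_RISK_ALLELES = {
    "B*57:01": "abacavir",
    "B*15:02": "carbamazepine, oxcarbazepine",
    "B*58:01": "allopurinol",
    "A*31:01": "carbamazepine",
}


def screen_pgx(alleles):
    """Match calls against risk alleles using family:protein fields only."""
    flagged = {}
    unresolved = []
    for allele in alleles:
        name = allele[4:] if allele.startswith("HLA-") else allele
        locus, _, rest = name.partition("*")
        parts = rest.rstrip("NLSQA").split(":")
        if len(parts) < 2:
            # Family-level call such as "B*15": cannot confirm or exclude a risk allele.
            unresolved.append(name)
            continue
        key = f"{locus}*{parts[0]}:{parts[1]}"
        if key in PGX_RISK_ALLELES:
            flagged[key] = PGX_RISK_ALLELES[key]
    return {"flagged": flagged, "unresolved": unresolved}


print(screen_pgx(["HLA-B*15:01:01:01", "B*15:02:01", "B*57"]))
```

Here B*15:01 is correctly not flagged, B*15:02 is, and the bare B*57 call is returned as unresolved so it can be re-typed at higher resolution.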
Goal: Type HLA class I, class II, KIR from short-read sequencing with KIR3DL1 Bw4/Bw6 ligand prediction.
Approach: Extract MHC-region reads, run T1K with IPD-IMGT/HLA reference; T1K outputs allele-pair calls + class II haplotype + KIR.
```bash
# Extract chr6:28-34 Mb plus alt contigs (alt-aware alignment is critical)
samtools view -b -h input.bam chr6:28000000-34000000 chr6_GL000250v2_alt chr6_GL000251v2_alt \
    chr6_GL000252v2_alt chr6_GL000253v2_alt chr6_GL000254v2_alt \
    chr6_GL000255v2_alt chr6_GL000256v2_alt > hla_region.bam
# Name-sort so mates are adjacent before FASTQ conversion
samtools sort -n hla_region.bam -o hla_sorted.bam
samtools fastq -1 hla_R1.fq -2 hla_R2.fq -s singletons.fq -0 /dev/null hla_sorted.bam
# Run T1K (preset hla; includes class I + II).
# Some releases ship the entry point as `run-t1k` (a wrapper script) rather than `t1k`;
# verify with `which run-t1k` / `which t1k` before scripting.
# DNA input (WGS/WES) uses the *_dna_seq.fa reference; *_rna_seq.fa is for RNA-seq reads.
# The thread option is -t in the T1K usage text; confirm with --help on your install.
# KIR typing is a separate T1K run against an IPD-KIR reference.
t1k --preset hla \
    -1 hla_R1.fq -2 hla_R2.fq \
    -f hla_idx/hlaidx_dna_seq.fa \
    -o sample_hla \
    -t 8
# Output: sample_hla_genotype.tsv with HLA-A, B, C, DRB1, DRB3/4/5, DQA1, DQB1, DPA1, DPB1
```
Goal: Type HLA-A, B, C at 4-field from WES with high accuracy.
Approach: Razers3-based alignment to IMGT class-I reference; ILP optimization to assign reads to allele pairs.
```bash
# Name-sort the region reads so mates pair correctly in the FASTQ output
samtools view -b -h input.bam chr6:28000000-34000000 \
    | samtools sort -n - \
    | samtools fastq -1 R1.fq -2 R2.fq -0 /dev/null -s /dev/null -
OptiTypePipeline.py -i R1.fq R2.fq -d -o optitype_out -c config.ini
```

```ini
# config.ini
[mapping]
razers3=/usr/bin/razers3
threads=8
[ilp]
solver=glpk
threads=8
[behavior]
deletebam=true
unpaired_weight=0
use_discordant=false
```
Goal: Type both class I and class II at 4-field with the highest class-II accuracy of any WES tool.
Approach: Population reference graph (PRG) covering the MHC; HLA-LA maps reads to the PRG and infers the most likely paths.
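```bash
# The working directory must exist before the run
mkdir -p hla_la_out
HLA-LA.pl \
    --BAM input.bam \
    --graph PRG_MHC_GRCh38_withIMGT \
    --workingDir hla_la_out \
    --sampleID sample_name \
    --maxThreads 8
# Output: hla_la_out/sample_name/hla/R1_bestguess_G.txt
# Format: tab-separated, one row per locus and chromosome copy:
#   Locus, Chromosome, Allele, Q1, Q2, AverageCoverage, ... (check the header of your release)
```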
Goal: Type HLA class I + II directly from RNA-seq for ICI neoantigen prediction.
Approach: Extract HLA-mapped reads from STAR BAM, EM-based genotype call against IMGT.
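```bash
# Update reference to current IPD-IMGT/HLA release
arcasHLA reference --update
# Extract and genotype
# Paired-end input yields sample.extracted.1.fq.gz and sample.extracted.2.fq.gz;
# single-end runs (extract --single) yield sample.extracted.fq.gz instead.
arcasHLA extract sample.bam -o arcas_out --threads 8
arcasHLA genotype arcas_out/sample.extracted.1.fq.gz arcas_out/sample.extracted.2.fq.gz \
    -o arcas_out --threads 8 --population prior
# Output: arcas_out/sample.genotype.json
```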
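For the neoantigen use case, the genotype JSON usually has to be converted into the allele names a binding predictor accepts. The sketch below assumes the JSON maps each locus to a list of allele strings (for example `{"A": ["A*02:01:01", "A*03:01:01"]}`) and that the predictor takes class I names in the `HLA-A02:01` style used by NetMHCpan; verify both against your installed versions. `netmhcpan_class1_names` is a hypothetical helper name.

```python
import json


def netmhcpan_class1_names(genotype):
    """Convert class I calls to de-duplicated 'HLA-A02:01'-style names."""
    names = []
    for locus in ("A", "B", "C"):
        for allele in genotype.get(locus, []):
            gene, _, rest = allele.partition("*")
            if rest.endswith("N"):
                # Null allele: no expressed protein, so nothing to present peptides.
                continue
            fields = rest.rstrip("LSQA").split(":")
            if len(fields) < 2:
                continue  # family-level call is too coarse for binding prediction
            name = f"HLA-{gene}{fields[0]}:{fields[1]}"
            if name not in names:
                names.append(name)
    return names


# Inline example standing in for the contents of arcas_out/sample.genotype.json
example = json.loads(
    '{"A": ["A*02:01:01", "A*02:01:01"],'
    ' "B": ["B*07:02:01", "B*15:01:01"],'
    ' "C": ["C*04:09N", "C*07:02:01"],'
    ' "DRB1": ["DRB1*15:01:01"]}'
)
print(",".join(netmhcpan_class1_names(example)))
```

Dropping null alleles here is a design choice for peptide-binding prediction only; the typing report itself should keep the full name including the suffix.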
Goal: Impute HLA from SNP array genotypes when sequencing is unavailable.
Approach: HIBAG random-forest classifier with population-matched reference panel.
```r
library(HIBAG)
# Population-matched panel is critical; mismatch causes systematic errors
# Available panels: EUR, ASN, AFR, HIS (download from HIBAG release page)
model.list <- get(load('European-HLA4-hg19.RData'))
# Load PLINK genotype (.bed/.bim/.fam); the assembly must match the model build
gen <- hlaBED2Geno(bed.fn='cohort.bed', fam.fn='cohort.fam', bim.fn='cohort.bim',
                   assembly='hg19')
# Pre-fit models are stored as plain objects; rebuild the classifier before predicting
predict_locus <- function(locus) {
  model <- hlaModelFromObj(model.list[[locus]])
  hlaPredict(model, gen, type='response+prob')
}
# Predict each locus
hla_A <- predict_locus('A')
hla_B <- predict_locus('B')
hla_DRB1 <- predict_locus('DRB1')
# Filter on probability >= 0.5 for downstream use; lower for exploratory
hla_A_conf <- hla_A$value[hla_A$value$prob >= 0.5, ]
```
1. Alt-aware alignment missing
   - [...] `--alt-aware`; HLA reads are coerced to chr6 primary contigs.
2. Stale IPD-IMGT/HLA bundle
   - [...] `t1k-build`; OptiType: update `data/hla_reference_dna.fasta`).
3. EUR-trained imputation on non-EUR samples
4. Cross-mapping DRB-related loci
5. DPB1 under-coverage in pre-2021 WES kits
6. Class II expression-allele confusion
   - [...] `DRB4*01:03:01:02N` as functional DR53.
   - [...] `N`, `L`, `S`, `Q`, `A`); treat `N` as null in functional analysis; preserve full nomenclature for typing report.
7. Specific allele vs allele family confusion
8. KIR co-typing mistaken for HLA
| Pattern | Likely cause | Action |
|---------|-------------|--------|
| OptiType vs HLA-LA class I disagree | Stale reference bundle in one; non-EUR ancestry | Update both; rerun; prefer the one with current reference |
| HLA-LA vs T1K class II disagree | DRB1+DRB3/4/5 haplotype rule violated in one | Check haplotype linkage; the consistent caller is correct |
| HIBAG vs sequencing disagree | EUR-trained model on non-EUR sample | Trust sequencing; use ancestry-matched HIBAG panel |
| Tumor vs normal HLA differ | Tumor LOH at HLA locus (frequent in NSCLC, HNSCC) | Run LOHHLA / DASH to confirm somatic loss; report germline + somatic |
| DPB1 homozygous on WES, het on WGS | WES kit under-covers DPB1 exon 2 | Trust WGS; flag WES result as low confidence |
| Class I 4-field stable across tools, class II differs | Class II is fundamentally harder | Prefer HLA-LA or StarPhase for class II |
| arcasHLA vs OptiType for tumor RNA | arcasHLA returns expressed-allele only (may miss silenced allele due to LOH) | Confirm with DNA-based typing for transplant context |
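Before applying the table, disagreements have to be detected at a resolution every tool can report. The following is a minimal sketch, assuming each tool's calls have already been parsed into `{locus: [allele1, allele2]}` with plain allele names (not G or P group codes); `two_field` and `reconcile` are hypothetical helper names. Alleles are compared on their first two colon-separated fields with any expression suffix retained, so a null call never collapses onto its expressed counterpart.

```python
from collections import Counter


def two_field(allele):
    """Truncate to family:protein fields, keeping an expression suffix such as N."""
    name = allele[4:] if allele.startswith("HLA-") else allele
    locus, _, rest = name.partition("*")
    suffix = rest[-1] if rest[-1:].isalpha() else ""
    fields = rest[: len(rest) - len(suffix)].split(":")
    return f"{locus}*{':'.join(fields[:2])}{suffix}"


def reconcile(calls_by_tool):
    """Summarise per-locus agreement across tools at the truncated level."""
    report = {}
    loci = sorted({locus for calls in calls_by_tool.values() for locus in calls})
    for locus in loci:
        votes = Counter(
            tuple(sorted(two_field(a) for a in calls[locus]))
            for calls in calls_by_tool.values()
            if locus in calls
        )
        genotype, support = votes.most_common(1)[0]
        report[locus] = {
            "genotype": genotype,
            "support": support,
            "n_tools": sum(votes.values()),
            "concordant": len(votes) == 1,
        }
    return report


calls = {
    "optitype": {"A": ["A*02:01", "A*03:01"]},
    "hla_la": {"A": ["A*03:01:01:01", "A*02:01:01:01"],
               "DRB1": ["DRB1*15:01:01", "DRB1*04:01:01"]},
    "t1k": {"A": ["A*02:01:01", "A*03:02:01"],
            "DRB1": ["DRB1*04:01:01", "DRB1*15:01:01"]},
}
for locus, summary in reconcile(calls).items():
    print(locus, summary)
```

A majority vote only locates the disagreement; which caller to trust for a non-concordant locus still follows the causes and actions in the table above.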
| Threshold | Convention | Source |
|-----------|-----------|--------|
| IPD-IMGT/HLA quarterly release | Updates Jan/Apr/Jul/Oct | IPD-IMGT/HLA database |
| Current allele count | ~43,000+ at Jul 2025 | IPD-IMGT/HLA database release notes (Robinson J et al, NAR DB issue) |
| HLA region coordinates | chr6:28000000-34000000 (GRCh38) | Standard |
| HLA-LA RAM requirement | ~30-100 GB scratch | HLA-LA documentation |
| OptiType class I 4-digit accuracy | ~98% (1000G benchmark) | Claeys 2023 |
| Polysolver class I 4-digit accuracy | ~95% | Matey-Hernandez 2018 |
| HLA-LA class II accuracy | Best of WES tools | Claeys 2023 |
| T1K class I + II accuracy | ~99% / ~99% | Song 2023 |
| HIBAG probability cutoff | >=0.5 for clinical-grade; >=0.3 for exploratory | HIBAG documentation |
| 1000G allele coverage | ~60-70% of African-ancestry alleles still under-represented in IPD-IMGT/HLA | Robinson 2024 |
| HSCT matching standard | 10/10 or 12/12 at 6-field | NMDP/WMDA guidelines |
| TCE3 core alleles | DPB1*02:01, *04:01, *04:02, *23:01 | Meurer 2024 Blood 144:1659 |
CIWD 3.0.0 (Hurley 2020 HLA 95:516) was compiled from >8M unrelated HSCT donors across 7 geographic/ancestral groups. Categories at 2-field: Common (18%, n=545), Intermediate (17%, n=513), Well-Documented (65%, n=1,997). It replaces the legacy CWD 2.0 (Mack 2013), which many older pipelines still hardcode: a quiet quality failure.
DPB1 mismatch GvHD/relapse risk depends on TCE3 group:
Now operational in NMDP donor selection algorithms; legacy TCE3 frameworks (Crocchiolo 2009) lack this stratification.
| Symptom | Cause | Solution |
|---------|-------|----------|
| HLA-DRA in output (DRB1 expected) | Tool confused paralogs | Use HLA-LA or T1K which model paralog loci correctly |
| Class II reports "no call" | Pre-2021 WES kit under-covers class II | Switch to WGS or amplicon |
| Tumor and normal HLA differ | LOH at HLA locus | Confirm with LOHHLA; report germline call as ground truth |
| Imputation reports rare allele with high probability | Reference panel mismatch with cohort ancestry | Switch to ancestry-matched panel |
| 4-field call but only 2-field appears in report | Tool default truncation | Use --full-field or equivalent flag |
| Same sample gives different 4-field calls across runs | Stochastic tie-breaking | Pin random seed; report all equally-supported calls |
| DRB4 with DRB1*15 | Linkage rule violated; bug or chimera | Re-run; check for sample swap |
| Null allele not reported in summary | Tool drops N-suffix; output is misleading | Use raw 4-field output; never strip suffixes for clinical reports |
| Pushback | Standard response |
|----------|-------------------|
| "Why T1K when HLA-LA is the published reference?" | T1K matches HLA-LA accuracy on class II while also typing class I + KIR in one pass with lower RAM; we cite both. |
| "These African-ancestry samples have low confidence" | IPD-IMGT/HLA still under-represents African ancestry (~30-40% allele gap); we ran with current 2025 release; for transplant we recommend long-read confirmation. |
| "DRB1 vs DRB3/4/5 reported inconsistently" | We verified DRB1+DRB3/4/5 linkage rule on each sample as routine QC; flagged violations for re-typing. |
| "Why is HLA-B*15:01 not flagged for carbamazepine?" | *15:01 (NFE common) is not the SJS risk allele; *15:02 (Han Chinese) is. PGx requires 4-field specificity. |
| "Imputation results differ from sequencing" | Imputation panel quality is the limiting factor; EUR-trained HIBAG on non-EUR is unreliable; we used ancestry-matched panel. |
| "TCGA pipeline used Polysolver, why T1K?" | TCGA convention is Polysolver; for current analysis we use T1K which has better class-II and KIR coverage. We can reproduce Polysolver if back-comparison needed. |