genome-annotation/annotation-transfer/SKILL.md
Transfers gene annotations between genome assemblies via coordinate liftover (UCSC liftOver, CrossMap for same-species version updates) or feature/sequence projection (Liftoff for same/close species, miniprot for protein-level cross-species, TOGA/GeMoMa/CAT for distant clades). Covers the coordinate-vs-projection decision by divergence, why a successful lift is not biological confirmation, reference bias, the silent-dropping of unmapped features, build/PAR/MHC/inversion hazards, and transfer-vs-de-novo validation. Use when annotating a new assembly of a species with an existing reference, harmonizing coordinates across builds, or mapping annotations across related species.
Reference examples tested with: Liftoff 1.6.3+, LiftoffTools 0.4+, miniprot 0.13+, CrossMap 0.7+, UCSC liftOver (current), BioPython 1.83+, gffutils 0.12+.
Before using code patterns, verify installed versions match. If versions differ:
- <tool> --version then <tool> --help to confirm flags
- pip show <package> then help(module.function) to check signatures

The chain file must match the exact assembly pair (build and patch); the source and target build must be recorded with every coordinate (a coordinate without a build is unusable). If code throws an error, introspect the installed tool and adapt rather than retrying.
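One way to make "every coordinate carries its build" hard to violate is to refuse to construct a coordinate without one. The sketch below is illustrative only: the `Locus` class and its field names are hypothetical, not part of any tool listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """A genomic interval that cannot exist without its assembly build (hypothetical helper)."""
    build: str   # exact build and patch, e.g. "GRCh38.p14", not just "hg38"
    seqid: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int

    def __post_init__(self):
        if not self.build.strip():
            raise ValueError("a coordinate without a build is unusable")
        if not 1 <= self.start <= self.end:
            raise ValueError(f"invalid interval {self.start}-{self.end}")

    def __str__(self):
        return f"{self.build}:{self.seqid}:{self.start}-{self.end}"
```

Writing the build into the string form means a copied coordinate still says which assembly it belongs to.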
"Transfer annotations from a reference to my new assembly" -> Map gene models from a well-annotated reference onto a target, by coordinate liftover (same-species, fast) or by re-aligning the actual gene sequence (cross-assembly/species, structure-aware), then validate against the target.
liftoff -g ref.gff3 -o out.gff3 -u unmapped.txt target.fa reference.fa (note: target before reference), liftOver in.bed map.chain out.bed unmapped (intervals)

Coordinate liftover and feature projection answer different questions, and choosing the wrong one is the dominant failure mode:
Three load-bearing consequences:
gene/CDS feature types verbatim. The "it mapped, ship it" culture is how a lifted GFF acquires the social status of a validated annotation while no one ever re-derived a model from sequence. Treat every lifted annotation as a hypothesis until target evidence (intact ORF, identity distribution, BUSCO, RNA-seq) has touched it.

| Paradigm | Tools | Operates on | Right for |
|----------|-------|-------------|-----------|
| A. Coordinate liftover | UCSC liftOver, CrossMap, segment_liftover, paftools | pre-computed chains; intervals (BED/GFF/VCF/BAM) | same-species version updates (hg19<->hg38, mm10<->mm39); variant/peak/CNV harmonization |
| B. Feature/sequence projection | Liftoff (nt), miniprot (protein), GeMoMa, TOGA, CAT, LiftOn | re-aligned gene sequence | cross-assembly/species; full gene models; polyploid/duplicated; no reliable chain |
| Divergence | Recommended | Why |
|------------|-------------|-----|
| Same species, transfer intervals | liftOver / CrossMap | chain is dense; geometry suffices for variants/peaks |
| Same species, transfer gene models | Liftoff (-chroms) | structure-aware; per-interval liftOver fragments transcripts |
| Same genus (a few % divergence) | Liftoff + miniprot rescue for the divergent tail | nucleotide alignment robust; protein for the rest |
| Same family/order (tens-hundreds My) | TOGA or GeMoMa (multi-reference) | nucleotide saturates; orthology + gene-loss reasoning |
| Beyond family / lineage-specific content / heavy rearrangement | -> eukaryotic-gene-prediction (de novo) + transfer as evidence | reference too far; only de novo sees target-specific biology |
| Pan-genome / multi-haplotype | vg annotate onto the graph | avoids single-reference bias (tooling still maturing) |
Cross-species coordinate liftover is a methodological error (synteny fragments into thousands of short chains; most genes drop silently) - it is the wrong paradigm, not a tuning problem.
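The divergence table can be encoded as a small lookup so that a pipeline fails loudly when someone asks for coordinate liftover across species. This is a minimal sketch: the function name and the tier labels ("species", "genus", "family", "beyond") are hypothetical, and the returned strings only restate the table.

```python
def recommend_transfer(divergence, payload="gene_models"):
    """Map a divergence tier to the paradigm recommended in the table above.

    divergence: "species", "genus", "family", or "beyond" (hypothetical labels).
    payload: "intervals" (variants, peaks) or "gene_models".
    """
    if payload not in ("intervals", "gene_models"):
        raise ValueError(f"unknown payload: {payload!r}")
    if divergence == "species":
        if payload == "intervals":
            return "liftOver / CrossMap"
        return "Liftoff (-chroms)"
    if payload == "intervals":
        # Cross-species coordinate liftover is the wrong paradigm, not a tuning problem.
        raise ValueError("coordinate liftover is only appropriate within a species")
    by_tier = {
        "genus": "Liftoff + miniprot rescue",
        "family": "TOGA or GeMoMa (multi-reference)",
        "beyond": "de novo prediction, with transfer as evidence",
    }
    if divergence not in by_tier:
        raise ValueError(f"unknown divergence tier: {divergence!r}")
    return by_tier[divergence]
```

The tier still has to be chosen by a person who knows the clade; the function only keeps the choice explicit and auditable.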
liftoff -g reference.gff3 -o lifted.gff3 -u unmapped.txt -p 16 \
-chroms chrom_map.txt -polish target.fasta reference.fasta
Positional args are target first, then reference (commonly swapped - a silent error). Liftoff extracts each gene's exon sequence, aligns with minimap2, and chooses the placement maximizing identity while preserving exon-intron structure. Key flags: -a (alignment coverage, default 0.5), -s (sequence identity, default 0.5), -copies/-sc (search for extra gene copies - a per-family decision, not a default), -polish (re-align to restore intact start/stop/splice, writes *_polished.gff3), -exclude_partial, -chroms (ordered chromosome mapping; reduces false cross-chromosome placements). LiftoffTools QCs the result (variants, synteny, copy-number changes). Same-species version updates should lift ≥99% - a 97% rate is a four-alarm signal of the wrong chain or coordinate-convention mismatch, not "pretty good."
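The ≥99% expectation is easy to check mechanically. The sketch below assumes that the file written by -u lists one unmapped feature ID per line and that reference genes carry an ID= attribute; both should be confirmed against the installed Liftoff output. The function names are hypothetical.

```python
def gene_ids(gff_lines):
    """Collect the ID attribute of every gene feature in GFF3 lines."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                ids.add(field[3:])
    return ids


def transfer_rate(reference_ids, unmapped_ids, alarm=0.99):
    """Return (rate, ok): fraction of reference genes that lifted, and whether it meets `alarm`."""
    if not reference_ids:
        raise ValueError("no reference genes found")
    lost = set(reference_ids) & set(unmapped_ids)
    rate = 1 - len(lost) / len(reference_ids)
    return rate, rate >= alarm
```

With real files this could be called as `transfer_rate(gene_ids(open("reference.gff3")), (l.strip() for l in open("unmapped.txt")))`. A `False` result for a same-species update is a reason to stop and check the inputs before going further.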
miniprot -t 16 -d target.mpi target.fasta # optional index
miniprot -Iut 16 --gff target.mpi proteins.faa > out.gff
Protein conserves far deeper than nucleotide (synonymous sites saturate), so miniprot works across species where Liftoff's nucleotide alignment fails. -I auto-sets max intron from genome length; --gff emits GFF3. Frameshift and in-frame-stop tags in the output are the signal that the "gene" is pseudogenized in the target, not a clean ortholog - inspect them; do not treat a miniprot hit as a functional gene by default. For a polished multi-reference, intron-aware annotation use GeMoMa (which reasons about intron-position conservation); for the DNA+protein hybrid use LiftOn.
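To act on the frameshift and in-frame-stop tags rather than skim past them, the hits that carry them can be pulled out of the GFF. The sketch below assumes the tags appear as Frameshift= and StopCodon= attributes on mRNA lines; verify those names against the output of the installed miniprot. The function name is hypothetical.

```python
def flag_disrupted(gff_lines):
    """Yield (mRNA ID, frameshifts, in-frame stops) for miniprot hits that carry either tag."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        # Attribute names below are an assumption about miniprot's GFF output.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        frameshifts = int(attrs.get("Frameshift", 0))
        stops = int(attrs.get("StopCodon", 0))
        if frameshifts or stops:
            yield attrs.get("ID", "?"), frameshifts, stops
```

Each hit returned is a candidate pseudogene to inspect by hand. The list does not prove loss of function.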
TOGA consumes a genome-alignment chain + reference BED12 and uses ML on chain features (including intronic/intergenic flanks - orthologs share flanking context, paralogs/retrocopies do not) to classify orthology (one2one ... one2zero) and gene-loss/intactness (intact / partially intact / lost / missing). It exists precisely because across deep time an inactivated gene still aligns - a coordinate lift reports the corpse as "present." Use TOGA for whole-clade ortholog projection; it does not discover target-specific novel genes (the reference-bias caveat of all of Paradigm B).
Goal: Quantify transfer quality and, critically, check that lifted CDS are biologically intact, not just placed.
Approach: Compare gene counts for a transfer rate, then translate each lifted CDS from the target and check for a valid start, a single terminal stop, and correct length - coordinate success is not intactness.
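import gffutils
from Bio import SeqIO
from Bio.Seq import Seq

def orf_integrity(lifted_gff, target_fasta):
    genome = SeqIO.to_dict(SeqIO.parse(target_fasta, 'fasta'))
    # Multi-segment CDS often share one ID in GFF3; create_unique keeps every segment.
    db = gffutils.create_db(lifted_gff, ':memory:', merge_strategy='create_unique')
    # Group CDS segments by parent transcript: the ORF spans all of them, so
    # translating each segment on its own would fail nearly every spliced gene.
    by_tx = {}
    for cds in db.features_of_type('CDS', order_by='start'):
        for parent in cds.attributes.get('Parent', [cds.id]):
            by_tx.setdefault(parent, []).append(cds)
    valid = total = 0
    for segments in by_tx.values():
        total += 1
        # Splice in genomic order (GFF is 1-based inclusive), then orient by strand.
        seq = Seq(''.join(str(genome[c.seqid].seq[c.start - 1:c.end]) for c in segments))
        if segments[0].strand == '-':
            seq = seq.reverse_complement()
        if len(seq) % 3:
            continue  # length not a multiple of three: not an intact ORF
        prot = seq.translate()
        if prot.startswith('M') and prot.endswith('*') and prot.count('*') == 1:
            valid += 1
    rate = valid / total if total else 0.0
    print(f'Intact ORFs: {valid}/{total} ({rate:.1%}) -- a clean lift can still land in a pseudogene')
    return valid, total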
Also: read and classify the unmapped file (not just count it); run BUSCO on the lifted protein set and compare to the reference (a drop quantifies silently lost conserved genes); compare to a de novo annotation to expose reference bias.
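The BUSCO comparison reduces to subtracting two completeness percentages. The sketch below assumes the one-line score string from a BUSCO short summary, of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the exact layout can vary between BUSCO versions, and the function names are hypothetical.

```python
import re

# Assumed layout of the BUSCO one-line score string; check it against your summary file.
_BUSCO = re.compile(r"C:(?P<C>[\d.]+)%.*?F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%")


def busco_scores(summary_line):
    """Parse complete / fragmented / missing percentages from a BUSCO score string."""
    match = _BUSCO.search(summary_line)
    if match is None:
        raise ValueError(f"unrecognized BUSCO score string: {summary_line!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


def completeness_drop(reference_line, lifted_line):
    """Percentage points of complete BUSCOs lost between the reference and lifted protein sets."""
    return busco_scores(reference_line)["C"] - busco_scores(lifted_line)["C"]
```

A positive drop measures conserved genes that were present in the reference and did not survive the transfer. Both runs must use the same lineage dataset for the numbers to be comparable.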
Trigger: liftOver/CrossMap to transfer genes between species. Mechanism: synteny fragments into short chains; most genes have no co-linear counterpart. Symptom: plausible-looking output that silently dropped most genes. Fix: miniprot/TOGA/GeMoMa (sequence/orthology), not chains.
Trigger: reporting a transfer complete from the success file alone. Mechanism: failures go to a side file; exit code 0. Symptom: a clean GFF missing entire gene families. Fix: read and classify unmapped (deletion/split/duplicated).
Trigger: trusting a lifted gene because it placed. Mechanism: the locus can be pseudogenized/frameshifted. Symptom: RNA-seq quantified against a gene with an internal stop at residue 40. Fix: -polish + ORF check; miniprot frameshift tags; TOGA intactness class.
Trigger: liftoff ... reference.fa target.fa. Mechanism: Liftoff is target reference. Symptom: nonsense mapping. Fix: target first, reference last.
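A cheap pre-flight check can catch the swap before Liftoff runs: the sequence IDs in the annotation should all exist in the reference FASTA. This is a sketch with hypothetical function names, and it assumes the GFF has no embedded ##FASTA section. It cannot detect a swap when both assemblies use identical sequence names (for example chr1 in both).

```python
def fasta_ids(fasta_lines):
    """Sequence IDs from FASTA header lines (text up to the first whitespace)."""
    return {line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and line[1:].strip()}


def gff_seqids(gff_lines):
    """Column-1 sequence IDs from GFF3 feature lines."""
    return {line.split("\t", 1)[0] for line in gff_lines
            if line.strip() and not line.startswith("#")}


def check_liftoff_order(gff_lines, target_fasta_lines, reference_fasta_lines):
    """Raise ValueError if annotation sequence IDs are missing from the reference FASTA."""
    seqids = gff_seqids(gff_lines)
    ref_hits = len(seqids & fasta_ids(reference_fasta_lines))
    tgt_hits = len(seqids & fasta_ids(target_fasta_lines))
    if ref_hits < len(seqids):
        hint = "; FASTA arguments look swapped" if tgt_hits > ref_hits else ""
        raise ValueError(
            f"{len(seqids) - ref_hits} annotation sequence IDs are absent "
            f"from the reference FASTA{hint}"
        )
    return True
```

The argument order mirrors the Liftoff command line (target, then reference), so the same variables can be passed to both.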
-a/-s to "rescue" features

Trigger: lowering coverage/identity to clear the unmapped pile. Mechanism: a 35%-coverage hit is usually a paralog/pseudogene/repeat match. Symptom: low-confidence placements laundered into the success file. Fix: treat default-threshold failures as signal; loosen only with a biological hypothesis and validate rescues individually.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Same-species mapping rate ≥99% | Liftoff/ClinVar studies | a 97% rate signals wrong chain / convention mismatch |
| Liftoff -a/-s default 0.5 | Liftoff | loosening manufactures false placements; failure is often signal |
| liftOver -minMatch default 0.95 (per-feature) | UCSC | a long feature with one chain gap fails silently |
| BUSCO on lifted set vs reference | completeness audit | a drop quantifies silently lost conserved genes |
| Divergence rule: species->liftOver/Liftoff; genus->+miniprot; family->TOGA/GeMoMa; beyond->de novo | lab convention | matches paradigm to where the chain/identity breaks |
| Every coordinate carries its build | reproducibility | a coordinate without a build is unusable |
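The -minMatch row is easiest to see with arithmetic: a 10,000 bp feature that loses 600 bp to a single chain gap remaps 94% of its bases and fails at the 0.95 default. The sketch below is a simplified model of that per-feature ratio under the table's description, not a reimplementation of liftOver; the function name is hypothetical.

```python
def passes_min_match(feature_length, unmapped_bases, min_match=0.95):
    """True if the remapped fraction of a feature meets the -minMatch threshold (simplified model)."""
    if feature_length <= 0 or not 0 <= unmapped_bases <= feature_length:
        raise ValueError("need feature_length > 0 and 0 <= unmapped_bases <= feature_length")
    remapped_fraction = (feature_length - unmapped_bases) / feature_length
    return remapped_fraction >= min_match
```

Because the ratio is taken per feature, long genes are the ones a single gap pushes under the threshold, which is why they are over-represented in the unmapped file.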
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Many unmapped features (same species) | wrong/patch-mismatched chain; contig naming (chr1 vs 1) | use the exact-pair chain; harmonize names |
| Mass gene loss, clean GFF | silent dropping | read/classify the unmapped file |
| Lifted genes with internal stops | landed in pseudogene/frameshift | -polish; ORF check; re-predict de novo in problem loci |
| Paralog/copy collapse or swap | no -copies, or mapped to the paralog | -copies/-sc per family; TOGA orthology graph |
| Most genes lost cross-species | coordinate liftover used across species | switch to miniprot/TOGA |
| mtDNA coordinates don't match | hg19 chrM != rCRS | record the exact MT record, not just "hg19" |
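For the chr1-versus-1 naming mismatch in the first row, a small renaming helper is often enough for primary chromosomes. The sketch below is an assumption-laden convenience with a hypothetical function name: it handles only the chr prefix and the M/MT alias, and unplaced scaffolds or alt contigs need an explicit mapping table instead.

```python
def harmonize_seqid(seqid, style="ucsc"):
    """Convert a primary-chromosome name between 'chr1' (ucsc) and '1' (ensembl) styles."""
    bare = seqid[3:] if seqid.startswith("chr") else seqid
    is_mito = bare in ("M", "MT")
    if style == "ucsc":
        return "chrM" if is_mito else "chr" + bare
    if style == "ensembl":
        return "MT" if is_mito else bare
    raise ValueError(f"unknown style: {style!r}")
```

Renaming changes only the label: it does not turn the hg19 chrM sequence into rCRS, so the mitochondrial record still has to be identified exactly.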