genome-annotation/eukaryotic-gene-prediction/SKILL.md
Predicts protein-coding gene structures (exons, introns, UTRs) in eukaryotic genomes with BRAKER3 (RNA-seq + protein evidence), BRAKER1/BRAKER2, GALBA (protein-only), Funannotate (fungi), GeMoMa (homology projection), or Helixer/Tiberius (deep-learning ab initio). Covers the evidence-first tool decision, mandatory soft-masking, the training-set-quality-dominates principle, OrthoDB clade-partition selection, the one-isoform-per-locus and missing-UTR traps, merge/split errors, and reference bias against orphan genes. Use when annotating a newly assembled eukaryotic genome, choosing a gene-prediction pipeline based on available evidence, or diagnosing a poor annotation.
npx skillsauth add GPTomics/bioSkills bio-genome-annotation-eukaryotic-gene-predictionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: BRAKER 3.0+, GALBA 1.0.11+, AUGUSTUS 3.5+, Funannotate 1.8+, HISAT2 2.2.1+, STAR 2.7.11+, BUSCO 5.5+, samtools 1.19+, gffutils 0.12+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesReproducibility is engineered, not assumed: run BRAKER/GALBA/Funannotate from their official containers, and pin the OrthoDB partition version (v11 vs v12 give different hints -> different models), the repeat library, and the Dfam release. GeneMark (inside BRAKER) historically required an expiring academic .gm_key; the requirement has relaxed in recent versions but check the installed version's docs. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Predict genes in my eukaryotic genome" -> Identify protein-coding gene structures from a soft-masked assembly using RNA-seq and/or protein evidence to train and guide a gene finder.
braker.pl --genome=masked.fa --prot_seq=orthodb_clade.fa --rnaseq_sets_ids=SRR... --softmasking (BRAKER3)Gene-prediction accuracy is governed by the quality of the training set and the extrinsic evidence, not by which gene-finder is chosen. A 2024-era HMM (AUGUSTUS) trained on a clean, evidence-validated gene set beats a fancier model trained on garbage. This reframes every decision:
| Pipeline | Citation | Evidence used | When | |----------|----------|---------------|------| | BRAKER3 | Gabriel 2024 Genome Res | RNA-seq + protein (OrthoDB) | Default when both exist; HC-training-set method | | BRAKER1 | Hoff 2016 Bioinformatics | RNA-seq only | spliced-read intron hints train GeneMark-ET | | BRAKER2 | Brůna 2021 NAR Genom Bioinform | protein only (broad OrthoDB) | ProtHint mines hints from a remote protein DB | | GALBA | Brůna 2023 BMC Bioinformatics | protein only (close relatives) | beats BRAKER2 on large vertebrate genomes with good close proteomes | | GeMoMa | Keilwagen 2018 BMC Bioinformatics | reference annotation + intron-position conservation | project an existing close-relative annotation | | Funannotate | Palmer & Stajich 2020 (Zenodo) | any | fungal de facto standard; train->predict(EVM)->update(PASA) | | MAKER2 / EVM | Holt 2011; Haas 2008 | many tracks | combiner / build-it-yourself route; transparent evidence weighting | | Helixer / Tiberius | Stiehler 2020; Gabriel 2024 | none (ab initio, DL) | evidence-poor genomes in well-represented clades | | AUGUSTUS / GeneMark-ES | Stanke 2006; Lomsadze | hints / self-train | components; standalone ab initio is the last resort |
OrthoDB rule: BRAKER2/3 want a clade partition (pick the smallest partition that still contains the target clade - Vertebrata for a fish, not all Metazoa). GALBA wants a few close-relative proteomes, not a broad clade.
| Evidence / scenario | Recommended | Why | |---------------------|-------------|-----| | RNA-seq + protein DB | BRAKER3 | state-of-the-art at all genome sizes | | RNA-seq only | BRAKER1 | intron hints from spliced reads | | Protein only, close relatives | GALBA | miniprot-aligned close proteomes train AUGUSTUS | | Protein only, broad/distant | BRAKER2 | ProtHint mines a remote OrthoDB clade | | No evidence, represented clade | Tiberius or Helixer | DL ab initio now matches evidence-based on vertebrates | | Reference annotation of a close relative | GeMoMa | homology + intron-position projection | | Fungus (any evidence) | Funannotate | tiny-intron-aware; bundled EVM + PASA update | | Need isoforms + UTRs | add PASA / Iso-Seq update step | predictors emit one CDS-only model per locus | | Genome not yet masked | -> repeat-annotation (soft-mask first) | mandatory prerequisite | | Want to assess the result | -> annotation-qc | BUSCO genome-vs-proteome, OMArk, sanity metrics |
Run repeat-annotation to soft-mask (repeats -> lowercase) before prediction. Unmasked TEs contain ORFs and pseudo-splice-sites; the predictor calls thousands of spurious genes inside repeats (catastrophic in plants where >80% of the genome can be TE) and TE domains pollute training. Pass --softmasking so BRAKER honors lowercase as a soft penalty (a real gene can still span a repeat). Hard-masking (repeats -> N) destroys sequence and truncates real repeat-overlapping genes - avoid it for prediction. But over-aggressive masking with an uncurated library deletes real multi-copy families (NLR/R-genes, zinc-fingers): filter the repeat library against a protein DB and confirm conserved families survive.
# Align RNA-seq with a splice-aware aligner; output sorted BAM
hisat2-build masked.fasta idx
hisat2 -x idx -1 R1.fq.gz -2 R2.fq.gz --dta -p 16 | samtools sort -@4 -o rnaseq.bam
samtools index rnaseq.bam
# BRAKER3: protein = an OrthoDB clade partition (smallest that contains the clade)
braker.pl --genome=masked.fasta --prot_seq=Vertebrata.fa \
--bam=rnaseq.bam --softmasking --threads=16 --species=my_species \
--gff3 --workingdir=braker3_out
--rnaseq_sets_ids=SRR...,SRR... --rnaseq_sets_dirs=/fastq/ auto-downloads/aligns reads in place of --bam. Outputs: braker.gtf/braker.gff3 (TSEBRA-combined), braker.codingseq, braker.aa.
galba.pl --genome=masked.fasta --prot_seq=close_relatives.faa \
--species=my_species --threads=16
Prefer BRAKER3 whenever RNA-seq exists - intron evidence substantially improves splice-site accuracy.
funannotate mask -i assembly.fa -o masked.fa --cpus 16
funannotate train -i masked.fa -o out -l R1.fq -r R2.fq --species "Genus species"
funannotate predict -i masked.fa -o out -s "Genus species" \
--transcript_evidence transcripts.fa --protein_evidence proteins.fa
funannotate update -i out --cpus 16 # PASA adds UTRs and isoforms
Goal: Compute the triage panel that reveals annotation health where gene count and BUSCO cannot - the isoform ratio, mono-exonic fraction, and protein-length distribution.
Approach: Load the GFF3 into gffutils; compute the mRNA:gene ratio (1.00 = isoform-naive), the single-exon fraction, and CDS-length stats; flag clade-anomalous values.
import gffutils
MONOEXONIC_FLAG = 0.30 # >30% single-exon in a vertebrate suggests unmasked TEs/pseudogenes/fragments (calibrate per clade)
def gene_model_stats(gff_file):
db = gffutils.create_db(gff_file, ':memory:', merge_strategy='merge')
genes = list(db.features_of_type('gene'))
mrnas = list(db.features_of_type(['mRNA', 'transcript']))
exon_counts = [len(list(db.children(tx, featuretype='exon'))) for tx in mrnas]
mono_frac = sum(1 for e in exon_counts if e == 1) / len(exon_counts) if exon_counts else 0
mrna_per_gene = len(mrnas) / len(genes) if genes else 0
if mrna_per_gene <= 1.001:
print('WARNING: one isoform per locus (mRNA:gene == 1.00) -- isoform/UTR-naive; AS analyses untrustworthy')
if mono_frac > MONOEXONIC_FLAG:
print(f'WARNING: mono-exonic fraction {mono_frac:.1%} high -- check masking/contamination')
return {'genes': len(genes), 'mrna_per_gene': mrna_per_gene, 'mono_exonic_fraction': mono_frac}
funannotate update). Human GENCODE has ~4-5 isoforms/gene; a fresh annotation has one.Trigger: running BRAKER without soft-masking, or hard-masking. Mechanism: TE ORFs become genes / masked sequence truncates real genes. Symptom: 2x inflated gene count, high mono-exonic fraction, or fragmented models. Fix: soft-mask with a curated library; pass --softmasking.
Trigger: annotating a contaminated/fragmented assembly. Mechanism: self-trainer learns contaminant/truncated gene structure and applies it genome-wide. Symptom: confidently wrong, BUSCO-green models. Fix: decontaminate (FCS-GX/BlobTools) and check assembly BUSCO/N50 first.
Trigger: a "close enough" pre-trained species or wrong clade partition. Mechanism: splice-site/intron-length params or protein hints mismatched. Symptom: systematically mis-placed exon boundaries; clean-looking GFF3. Fix: train on the target (BRAKER does this); pick the smallest correct OrthoDB clade.
Trigger: AS/3'-tag/APA analysis against a de novo annotation. Mechanism: one CDS-only model per locus. Symptom: discarded scRNA-seq reads; empty AS results. Fix: add a PASA/Iso-Seq update step; check mRNA:gene ratio.
Trigger: treating high D as good. Mechanism: uncollapsed haplotigs vs real WGD vs split models. Symptom: inflated gene count. Fix: if no known WGD and D>5-8%, purge_dups the assembly first; if known polyploid, confirm via synteny/Ks and keep.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Soft-mask before prediction | universal | unmasked TEs inflate spurious genes | | Gene count vs nearest relative (±, reconcile with ploidy) | clade norm | 1.5-2x with no WGD = haplotigs/over-prediction; ~0.5x = over-masking/under-training | | Mono-exonic fraction ~10-20% (vertebrate) | clade norm | >25-30% = unmasked TEs/pseudogenes/fragments; fungi legitimately higher | | Protein length unimodal ~300-450 aa | eukaryote norm | sub-100-aa spike = spurious/fragmented; fat left tail = partials | | BUSCO-Duplicated ~1-3% (clean haploid) | assembly norm | >5-8% with no WGD -> purge_dups before annotating | | mRNA:gene ratio | annotation structure | == 1.00 means isoform/UTR-naive | | Pick smallest OrthoDB partition containing the clade | BRAKER guidance | broader = noisier hints, slower |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Thousands of extra short genes | unmasked repeats | soft-mask; --softmasking |
| BRAKER fails mid-run | expired GeneMark key / special chars in FASTA headers | use the container; sed 's/ .*//' genome.fa |
| Many single-exon genes | unmasked TEs / contamination / fragmented assembly | verify masking; decontaminate; check N50 |
| Low protein-BUSCO, high genome-BUSCO | predictor missed present genes (training/evidence) | fix evidence/masking, not the assembly |
| AS analysis returns nothing | one-isoform annotation | run PASA/Iso-Seq update |
| Suspiciously few NLR/ZNF genes | over-masked with uncurated library | filter repeat library against a protein DB |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.