genome-annotation/prokaryotic-annotation/SKILL.md
Annotates bacterial and archaeal genomes (isolates, MAGs, plasmids) with Bakta (active versioned databases, NCBI-compliant output) or Prokka (legacy), producing GFF3/GenBank/EMBL/FASTA with INSDC locus tags. Covers Bakta-vs-Prokka-vs-PGAP-vs-DFAST choice, light-vs-full database tiers, translation-table selection (11/4/25), archaeal and leaderless-gene caveats, the small-ORF blind spot, pseudogene-vs-phase-variation, the pangenome re-annotation trap, and submission compliance. Use when annotating a newly assembled prokaryotic genome, choosing an annotation tool, re-annotating a collection for pangenomics, or preparing annotations for NCBI/DDBJ submission.
npx skillsauth add GPTomics/bioSkills bio-genome-annotation-prokaryotic-annotationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Bakta 1.9+, Prokka 1.14.6 (legacy), tRNAscan-SE 2.0+, CheckM2 1.0+, gffutils 0.12+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesAnnotation content tracks the database version, not just the binary: a Bakta full DB (schema v5+, ~30 GB zipped) and a Bakta light DB give different functional calls, and two runs months apart can differ purely from DB updates. Record the Bakta DB version in methods. Prokka's bundled databases are frozen (~2019-2021). If code throws an error, introspect the installed tool and adapt rather than retrying.
"Annotate my bacterial genome" -> Call protein-coding genes, tRNAs/rRNAs/ncRNAs, and other features, then assign function by database identity, and emit submission-ready files.
bakta --db db/ assembly.fa (default), prokka --outdir out assembly.fa (legacy), NCBI PGAP (submission/RefSeq-grade)For a typical isolate, Prodigal/Pyrodigal recovers >95-99% of true coding genes. The unsolved problems are the start codon (translation initiation site), the small/overlapping/recoded ORFs, and above all the function. Three consequences a postdoc must internalize:
A confidently wrong product name is worse than "hypothetical protein." Names assigned by loose homology transfer across distant lineages and self-amplify through databases - there is no mechanism to retract a correction once it spreads, so annotation accuracy has gone down, not up, as sequencing scaled (Salzberg 2019 Genome Biol 20:92). Bakta's design counters this: gene-level identity only via exact (MD5/UniRef100) match, falling back to cluster level, then "hypothetical" rather than guessing. 25-50% hypothetical is healthy for a non-model organism; near 0% means over-confident transfer.
Comparing gene counts across tools/versions/DBs is invalid. Tool agreement is "highly dependent on the organism of study" and biased toward model organisms (Dimonaco 2022 Bioinformatics 38:1198) - there is no universal best tool. Differences between two annotations are mostly tool artifacts, not biology. For any collection (pangenomics), re-annotate every assembly from FASTA with one pipeline + one pinned DB; merging published annotations inflates the accessory genome ~10× (Tonkin-Hill 2020 Genome Biol 21:180).
Annotation completeness ≠ assembly completeness. A fragmented/contaminated assembly produces truncated partial CDS at every contig break, inflated counts, and missing rRNA operons. Run CheckM2 before trusting any annotation QC number.
| Tool | Maintained | Database approach | Output | When | |------|-----------|-------------------|--------|------| | Bakta | Yes (active) | Curated, versioned, alignment-free (UniRef + AMRFinderPlus + expert systems) | GFF3/GBFF/EMBL/FASTA/TSV/JSON + plot | Default for new work; reproducible, MAG-aware | | Prokka | Frozen ~2021 | BLAST hierarchy vs frozen UniProt/RefSeq + HMM | GFF3/GBK/FAA + tbl | Legacy only; pangenome pipelines (Roary) expect Prokka GFF | | NCBI PGAP | Yes (NCBI) | RefSeq protein-family models + ProSplign | ASN.1/GenBank, submission-ready | GenBank/RefSeq submission; best pseudogene/frameshift/selenoprotein handling | | DFAST | Yes (DDBJ) | DFAST reference DBs + swappable refs | INSDC files | DDBJ submission; fast, flexible reference swap | | RAST / BV-BRC | Yes | SEED subsystems | GenBank/GFF + subsystems | Subsystem/metabolic framing; web; not for direct INSDC submission |
Prokka uses Aragorn (tRNA) + barrnap (rRNA); Bakta uses tRNAscan-SE 2.0 (tRNA, domain-specific) + Infernal/Rfam (rRNA/ncRNA) + PILER-CR (CRISPR arrays) + DeepSig (signal peptides). Bakta is deliberately conservative on naming, so it reports "hypothetical" where Prokka confidently (sometimes wrongly) names a gene - Bakta looking "less annotated" is it being honest.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Routine isolate, reproducible | Bakta, full DB | standardized, versioned, current functional calls |
| Submitting to GenBank/RefSeq | PGAP | RefSeq re-annotates with PGAP regardless; best pseudogene handling |
| Submitting to DDBJ | DFAST | INSDC-ready, DDBJ-aligned |
| MAG / metagenome bin | Bakta --meta + CheckM2 gate | anonymous-mode calling; QC before trust |
| Mycoplasma / Mollicutes | Bakta --translation-table 4 | UGA=Trp; table 11 splits genes at every UGA |
| Archaeon | Bakta + verify tRNAscan-SE archaeal model; PGAP if N-termini matter | leaderless mRNAs degrade Prodigal TIS |
| Inherited Prokka pangenome pipeline | pin Prokka version, or re-call all in Bakta | tool consistency vs current biology |
| Subsystem/metabolic view | RAST / BV-BRC | SEED subsystem categories |
| Genome not yet assembled / poor QC | -> genome-assembly/assembly-qc | fix assembly before annotating |
| AMR for clinical report | -> run AMRFinderPlus/CARD-RGI directly | --organism context, point mutations, partials matter |
# Database (record the version): full ~30 GB for publishable annotation; light for triage/CI
bakta_db download --output db/ --type full
bakta --db db/ --output bakta_out --prefix ecoli_k12 \
--genus Escherichia --species coli --strain K-12 \
--locus-tag ECK12 --gram - --complete --threads 16 \
assembly.fasta
Key flags: --translation-table {11,4,25} (default 11), --gram {+,-,?} (gates DeepSig signal-peptide calls; default ?), --complete (all sequences are finished replicons; enables oriC detection - do NOT use on draft contigs), --meta (metagenome/MAG mode), --compliant (enforce INSDC structure), --keep-contig-headers, --proteins <faa> (trusted-protein transfer). Set --genus/--species from a GTDB-Tk classification, not a guess. Primary outputs: .gff3, .gbff, .faa, .ffn, .fna, .tsv, plus .hypotheticals.tsv and .inference.tsv (open the inference column to ask why a product was assigned).
prokka --outdir prokka_out --prefix my_genome --locustag MYORG \
--genus Escherichia --species coli --cpus 8 --rfam assembly.fasta
Use only for tool-chain consistency with an existing Prokka-based pangenome workflow, and pin the version. Its bundled databases are frozen: a gene family characterized after ~2019 is "hypothetical" in Prokka but named by current Bakta/PGAP, so the same gene flips core/accessory purely on DB vintage.
Goal: Load Bakta/Prokka GFF3 into a queryable database and compute coding density, the first sanity number.
Approach: Build a gffutils database, sum CDS lengths, divide by genome length; flag values outside the expected band.
import gffutils
CODING_DENSITY_LOW = 0.85 # <0.85 in a free-living bacterium suggests wrong table, fragmentation, or heavy pseudogenization
CODING_DENSITY_HIGH = 0.93 # >0.93 suggests ORF over-calling (spurious short hypotheticals)
def coding_density(gff_file, genome_length):
db = gffutils.create_db(gff_file, ':memory:', merge_strategy='merge')
coding_bp = sum(c.end - c.start + 1 for c in db.features_of_type('CDS'))
density = coding_bp / genome_length
if density < CODING_DENSITY_LOW or density > CODING_DENSITY_HIGH:
print(f'WARNING: coding density {density:.1%} outside expected 85-93%')
return density
/pseudo) is the best automated arbiter.prfB/RF2 (a +1 frameshift at a slippery CTTT + internal SD), dnaX (−1), and IS-element transposases encode one protein across two frames; naive callers emit two short ORFs. PGAP recognizes a curated set.Trigger: comparing counts from Bakta vs Prokka vs old RefSeq, or mixed-vintage records. Mechanism: tool/DB differences dominate biological differences. Symptom: "novel genes" or accessory-genome inflation. Fix: re-annotate every genome from FASTA with one pipeline + pinned DB.
Trigger: table 11 on a Mollicute. Mechanism: UGA read as stop. Symptom: low coding density, doubled short gene count, high hypothetical fraction. Fix: --translation-table 4; confirm from GTDB-Tk.
Trigger: running Bakta before CheckM2. Mechanism: contig breaks truncate CDS; contamination mixes organisms. Symptom: partial CDS at ends, inflated/chimeric gene set, missing rRNA operons. Fix: CheckM2 first; contamination >5% -> decontaminate.
Trigger: reading the product column as ground truth. Mechanism: loose-homology transfer below ~40% identity is unreliable, especially for promiscuous folds. Symptom: a "histidine-kinase expansion" that is one over-called domain. Fix: open .inference.tsv; prefer InterPro/Pfam architecture over free-text product when they conflict.
Trigger: Prodigal single mode on a MAG/plasmid/phage. Mechanism: self-training needs a full single-organism genome. Symptom: poor calls on short/mixed input. Fix: Bakta --meta (anonymous mode).
Trigger: Bakta on macOS. Mechanism: DeepSig was dropped from the default mac conda env (~v1.9.4). Symptom: silently absent signal-peptide calls. Fix: verify the run log; run on Linux if SPs matter.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Coding density ~88-90% (band 85-93%) | bacterial genome norm | <85% = wrong table/fragmentation/decay; >93% = over-calling | | ~850-1,000 genes/Mb (~1 gene/kb) | bacterial gene density | far higher = over-call; far lower = under-call/wrong table | | Hypothetical 25-50% | annotation norm | ~0% = over-confident transfer; >60-70% = novel lineage or wrong tool/table | | tRNA count ≥ ~20 (often 40-60) | one isoacceptor set minimum | far fewer = fragmented assembly broke tRNA regions | | rRNA operons ~1-15 (e.g. ~7 in E. coli) | copy-number norm | zero/fractional in a "complete" genome = short-read repeat collapse | | Prodigal minimum ~30 aa / 90 nt | caller default | shorter ORFs need a dedicated sORF caller + Ribo-seq | | CheckM2 contamination ≤5%, completeness ≥90% | MIMAG-aligned | above/below -> annotation QC numbers uninterpretable, fix assembly first |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Genes split, low coding density | wrong translation table | --translation-table 4/25; confirm from GTDB-Tk |
| Many hypothetical proteins | novel organism / light DB | full Bakta DB; add InterProScan/eggNOG; normal if 25-50% |
| Low gene / tRNA count, missing rRNA | fragmented assembly | CheckM2; prefer long-read/complete assembly |
| RefSeq record differs from the local GFF | RefSeq is PGAP, GenBank keeps the submitter's annotation | report WP_ accession + locus_tag; expect divergence |
| Pseudogene spike | ONT homopolymer indels vs real decay vs phase variation | check homopolymer context; corroborate with short-read/HiFi |
| Submission rejected on locus tags | invented prefix | register the prefix via BioSample (3-12 alnum, starts with a letter) |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.