genome-annotation/functional-annotation/SKILL.md
Assigns GO terms, Pfam/InterPro domains, KEGG orthologs, EC numbers, and product names to predicted proteins using eggNOG-mapper (orthology), InterProScan (domain signatures), and KofamScan (KEGG), routing specialized functions to dbCAN/antiSMASH/AMRFinderPlus/SignalP. Covers the orthology-vs-domain-vs-homology paradigms, the annotation-error percolation cascade, domain-presence-is-not-function, GO IEA circularity in enrichment, evidence tiering, and bit-score/coverage thresholds. Use when adding functional annotation to predicted genes, choosing between eggNOG-mapper and InterProScan, or judging how much to trust a functional label.
npx skillsauth add GPTomics/bioSkills bio-genome-annotation-functional-annotationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: eggNOG-mapper 2.1.15 (pin for reproducibility), InterProScan 5.66+, KofamScan 1.3+, pandas 2.2+, AGAT 1.4+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesAnnotation content tracks database release: record the eggNOG DB version, InterPro/Pfam release, and KEGG/KofamScan profile date, and note whether InterProScan used the EBI precalculated lookup service. eggNOG-mapper v3 is under testing (not production) - pin v2.1.15. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Functionally annotate my predicted proteins" -> Transfer GO/KEGG/Pfam/EC/product labels from characterized proteins by orthology and domain signatures, attaching a confidence tier and provenance to each.
emapper.py -i proteins.faa --itype proteins -m diamond (eggNOG-mapper), interproscan.sh -i proteins.faa -f TSV,GFF3 -goterms -pa (InterProScan)Almost every label on a new genome is transferred by homology/orthology/ML from a small island of experimentally characterized proteins. The transfer chain is lossy and self-reinforcing - it behaves like a percolation cascade (Gilks 2002 Bioinformatics 18:1641): an over-specific name assigned in year 0, deposited with no record that it was transferred, becomes the nearest hit for the next genome, whose label becomes evidence for the next. By the time a query reaches NR, "number of hits agreeing" measures how far an error spread, not correctness. Schnoes 2009 (PLoS Comput Biol 5:e1000605) found misannotation reaching ~80% in bulk databases (TrEMBL/NR) and near-zero in curated Swiss-Prot - the gap is the curation. Three load-bearing consequences:
1.1.1.-; specific name -> superfamily; whole-protein -> per-domain). PI/reviewer pressure to "annotate everything" manufactures the next genome's percolating error.| Paradigm | Tool | Mechanism | Failure mode | |----------|------|-----------|--------------| | Orthology | eggNOG-mapper | seed-ortholog -> orthologous group -> consensus transfer | tax-scope sensitive; HGT/xenologs break the orthology assumption | | Domain/signature | InterProScan | profile HMMs/matrices -> integrated InterPro entries | a domain implies a capability, not the substrate; broad families uninformative | | KEGG ortholog | KofamScan | per-KO HMMs + adaptive thresholds | KO assignment, not pathway proof | | Homology best-hit | DIAMOND vs Swiss-Prot | top-hit similarity, transfer label | best-hit != ortholog; transitive error propagation | | ML / structure | DeepGO, DeepFRI, Foldseek | learned sequence/structure -> GO | low precision; ontology terms not products; reaches twilight zone only |
Default workhorse pair: eggNOG-mapper + InterProScan (orthogonal evidence: orthology vs signatures), reconciled afterward. Add KofamScan if KEGG pathway reconstruction is the goal (its adaptive per-KO thresholds are stricter than eggNOG's KEGG_ko). DIAMOND-vs-Swiss-Prot is the cheap product-name layer; never use it alone for GO.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Bacterial isolate | Bakta/PGAP product names + eggNOG-mapper + InterProScan | structural pipeline first, then orthology + domains |
| Eukaryotic proteome | InterProScan (domains+GO+pathways) + eggNOG-mapper | orthogonal evidence, reconcile |
| Metagenome / MAG | eggNOG-mapper --itype metagenome (+ KofamScan, dbCAN) | built-in gene calling; KEGG modules |
| Twilight-zone / ORFan (no homolog) | ML (DeepGOPlus) or structure (ESMFold -> Foldseek -> DeepFRI) | only handle on the homology-free fraction; low-confidence leads |
| CAZymes / BGCs / AMR / signal peptides | -> dbCAN / antiSMASH / AMRFinderPlus / SignalP6 | a generic Pfam hit gives no substrate/phenotype/cluster |
| GO enrichment downstream | -> pathway-analysis/go-enrichment (mind IEA circularity) | enrichment on IEA partly tests the pipeline against itself |
download_eggnog_data.py --data_dir db/ -y # ~44 GB (DIAMOND DB installed by default; -D skips it)
emapper.py -i proteins.faa --itype proteins -m diamond \
--tax_scope auto --data_dir db/ --cpu 16 -o annot --output_dir out/
Three stages: (1) seed-ortholog search (DIAMOND/MMseqs2/HMMER) anchors the query - this is a best-hit and is not the annotation; (2) orthology assignment retrieves the seed's fine-grained orthologs within the chosen taxonomic scope; (3) functional transfer pools terms across the set of orthologs (which damps single-entry misannotation - this is why eggNOG-mapper beats raw DIAMOND-vs-NR). --tax_scope is the single most consequential parameter: too broad gathers distant orthologs and over-generalizes function; auto lets each seed take its most-informative phylogenetic ceiling. --itype {proteins,CDS,genome,metagenome} (genome/metagenome runs Prodigal first). Output .emapper.annotations columns include seed_ortholog, eggNOG_OGs, COG_category, Description, Preferred_name, GOs, EC, KEGG_ko, PFAMs (read the actual header; - = empty).
interproscan.sh -i proteins.faa -f TSV,GFF3 -goterms -pa -cpu 16
Runs member-database scanners (Pfam, PANTHER, NCBIfam, SUPERFAMILY, CDD, SMART, Gene3D, Hamap, PROSITE, ...) and integrates overlapping signatures into InterPro entries (stable IPRxxxxxx, with a type: Family/Domain/Repeat/Site/Homologous Superfamily). Report at the InterPro-entry level - it is the consensus that survives one member DB being wrong. -goterms adds the interpro2go mapping (these GO are IEA/electronic); -pa maps Reactome/MetaCyc. By default it queries the EBI precalculated lookup service (fast, MD5-keyed); -dp forces local compute (novel/confidential sequences, reproducibility). Java 11+ and a tens-of-GB data bundle required; for millions of proteins, chunk the FASTA into array jobs.
Goal: Merge eggNOG and InterProScan per protein while preserving provenance, so a curated name is never silently overwritten by a generic domain.
Approach: Parse each tool's table, keep source namespaces separate, union GO with source tags, and prefer the orthology Preferred_name/Description for the human-readable product.
import pandas as pd
def parse_eggnog(path):
df = pd.read_csv(path, sep='\t', comment='#', header=None)
cols = ['query', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs', 'max_annot_lvl',
'COG_category', 'Description', 'Preferred_name', 'GOs', 'EC', 'KEGG_ko']
df.columns = (cols + [f'c{i}' for i in range(len(df.columns) - len(cols))])[:len(df.columns)]
return df
def best_product_name(row):
name = row.get('Preferred_name', '-')
return name if name not in ('-', '', None) else 'hypothetical protein' # honest default, not a forced guess
Use AGAT (agat_sp_manage_functional_annotation.pl) to graft BLAST/InterProScan results onto a GFF3 (it handles the spec edge cases). For GO deliverables use GAF (carries the evidence code); keep each tool in its own Dbxref namespace.
1.1.1.- is a valid statement of ignorance (the EC equivalent of "hypothetical"). Demand orthology or a curated rule before asserting a full four-level EC.* marks above-threshold hits) or eggNOG's KEGG_ko. A "complete module" is a reconstruction (a gap can be non-orthologous gene displacement; a filled step can be a paralog doing something else), not proof of flux.Trigger: transferring a specific function from one DIAMOND/BLAST top hit (esp. TrEMBL/NR). Mechanism: best-hit != ortholog; the bulk-DB hit is likely itself an auto-annotation. Symptom: confident specific names with no provenance. Fix: orthology consensus + Swiss-Prot donors.
Trigger: copying the exact substrate/EC of a characterized homolog onto a distant relative. Mechanism: mechanistically-diverse superfamilies share fold, not substrate. Symptom: a "muconate cycloisomerase" that does something else. Fix: demote to superfamily / partial EC as identity and coverage fall.
Trigger: leaving scope too broad/narrow or unpinned. Mechanism: distant orthologs over-generalize, or no informative orthologs. Symptom: vague or missing function. Fix: auto, or pin the known clade.
Trigger: enriching IEA annotations against a mismatched background. Mechanism: measures the mapping table and study popularity, not biology. Symptom: "enriched" for whatever well-studied genes are annotated for. Fix: non-IEA where possible; matched background; pin versions; caveat the result.
Trigger: inferring CAZyme substrate / AMR phenotype / BGC product from a plain Pfam domain. Mechanism: the substrate/phenotype/cluster signal is not in a generic domain. Symptom: wrong substrate or phenotype call. Fix: route to dbCAN / AMRFinderPlus / antiSMASH.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Reason in bits-per-residue, not raw e-value | alignment statistics | e-value scales with DB size (a database-size artifact); bits/residue is density |
| Bidirectional coverage ≥50-70% query and subject | transfer practice | one-domain coverage justifies only a domain-level claim |
| ~40% identity over full length (well-behaved families only) | soft floor | no safe identity in mechanistically-diverse superfamilies; demote specificity instead |
| Named fraction "too high for the taxon" (>90% on a novel isolate) | over-annotation smell test | loose thresholds manufacturing names; expect 20-50% hypothetical |
| eggNOG --tax_scope auto | eggNOG-mapper | per-seed informative ceiling |
| KofamScan adaptive per-KO threshold (*) | Aramaki 2020 | a single global e-value misfires across KO families |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Low annotation rate | fragmented ORFs / narrow scope | check protein quality; --tax_scope auto; run both tools and merge |
| Specific name on a distant homolog | over-specific transfer | demote to superfamily / partial EC; record identity |
| eggNOG DB errors | DB/version mismatch | re-download; pin emapper 2.1.15 |
| InterProScan memory/time | full proteome at once | chunk FASTA; keep lookup service on; drop PANTHER/Gene3D if not needed |
| Enrichment "too clean" | IEA circularity / study bias | matched background; pin GO release; caveat |
| Multidomain protein mislabeled | named by first/best domain | report all domains with coordinates |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.