comparative-genomics/ortholog-inference/SKILL.md
Infer orthologous genes and gene families across species using OrthoFinder3 (HOG-based phylogenetic orthology), SonicParanoid2, Broccoli, ProteinOrtho, OMA / FastOMA hierarchical orthologous groups, eggNOG-mapper, JustOrthologs, and TOGA whole-genome-alignment orthology. Use when building single-copy ortholog sets for phylogenomics, classifying co-orthologs and in/out-paralogs after gene duplication, propagating functional annotation via orthology with awareness of the ortholog conjecture, distinguishing speciation from duplication via gene-tree species-tree reconciliation, computing Quest-for-Orthologs benchmark performance, or running synteny-aware ortholog detection in WGD-affected lineages.
npx skillsauth add GPTomics/bioSkills bio-comparative-genomics-ortholog-inferenceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: OrthoFinder 3.0+ (bioRxiv 2025.07.15.664860), SonicParanoid 2.0.8+ (Cosentino 2023), Broccoli 1.2+ (Derelle 2020), ProteinOrtho 6.3.0+ (Lechner 2011 + recent), OMA standalone 2.6.0+, FastOMA 0.3.5+ (Majidian 2025), eggNOG-mapper 2.1.12+, JustOrthologs 2.0+, DIAMOND 2.1.10+, MMseqs2 17-b804f+, IQ-TREE 2.3.6+, BUSCO 5.7+, Compleasm 0.2.7+, BioPython 1.84+, R 4.4+ for downstream tree-based reconciliation.
Before using code patterns, verify installed versions match. If versions differ:
orthofinder --help; sonicparanoid --help; oma --helppip show eggnog-mapper; which fastomaIf code throws Diamond requires N more sequences than provided, KeyError on species tree taxa, STAG branch length 0, or HOG file format mismatch, the OrthoFinder v2 -> v3 file layout changed (Orthogroups/ -> Phylogenetic_Hierarchical_Orthogroups/; rooted gene trees are now per-HOG); update parsing accordingly.
"Find the orthologs of my gene(s) across these species" -> Choose between graph-based (RBH / similarity-clustering: fast, lower recall) and tree-based (gene-tree reconciliation: higher accuracy, slower) frameworks; recognize that "orthology" splits into 1:1, 1:many, many:many, and the practical unit for most pipelines is the HOG (Hierarchical Orthologous Group) -- a maximal cluster of genes descended from a single ancestral gene at a defined taxonomic level (Altenhoff 2013 PLoS Comp Biol 9:e1002954). The "ortholog conjecture" (orthologs more functionally similar than paralogs) is supported but weakly (Altenhoff 2012 PLoS Comp Biol 8:e1002514); don't treat 1:1 ortholog labeling as automatic functional equivalence.
orthofinder -f proteomes/ -t 16 -M msa -- HOG output in v3 layoutsonicparanoid -i proteomes/ -o output --mode default -- ML predictor + protein language modelbroccoli.py -dir proteomes/ -threads 16 -- direct OG with chimeric handlingoma standalone HOG inference at every taxonomic levelproteinortho6.pl --project=run proteomes/*.faa -- graph clustering with optional syntenyemapper.py -i proteins.faa --output project --cpu 16 -- eggNOG annotation transfer| Tool | Approach | Output | Strength | Fails when |
|------|----------|--------|----------|------------|
| OrthoFinder3 (Emms & Kelly 2025 bioRxiv) | DIAMOND2-ultra search -> gene trees -> rooted via STRIDE -> HOG inference at every node | HOGs at every taxonomic level, orthologs, species tree, gene-duplication events | Best Quest-for-Orthologs benchmark (Altenhoff 2024); HOGs are 12% more accurate than v2 orthogroups, 20% with outgroup | Slow with > 200 species without -c 1 clustering pre-step; new file layout breaks old parsers |
| SonicParanoid2 (Cosentino 2023 NAR 51:e85) | RBH + ML classifier with protein language model embeddings | Pairwise orthologs + orthogroups | Best accuracy among graph methods at competitive speed; ML correction reduces InParanoid errors | InParanoid lineage still has paralogy confusion in WGD clades; close-relative duplications hard to resolve |
| Broccoli (Derelle 2020 MBE 37:3389) | k-mer similarity -> directed graph -> chimera-aware OG inference | OGs + chimera flagging | Robust to chimeric assemblies; runs without species tree | Less accurate than OrthoFinder on benchmark; no HOG output |
| ProteinOrtho 6 (Lechner 2011 BMC Bioinf 12:124) | Pairwise BLAST/DIAMOND + connectivity graph; optional synteny module | Orthogroups + synteny option | Fast; scales to 1000+ genomes; -synteny enables co-linear-anchor filtering | Lower recall than tree-based; synteny module slow and requires GFFs |
| OMA standalone (Altenhoff 2019 NAR 47:D424) | Strict RBH + verification + HOG inference | HOG database; orthologs at each taxonomic level | Conservative; highest precision in QfO benchmarks; "Fast" mode for prefiltering | Lowest recall among methods (Altenhoff 2024); slow for large datasets |
| FastOMA (Majidian 2025 NAR 53:D421) | OMA HOG inference with GPU-accelerated DIAMOND + Roothap | Same HOG output as OMA, 10-100x faster | Scales OMA to 1000+ genomes | Newer; less benchmarked in production |
| eggNOG-mapper 2 (Cantalapiedra 2021 MBE 38:5825) | DIAMOND/MMseqs2 against eggNOG 5 -> map to precomputed orthogroups | OGs + functional annotation (GO/KEGG/COG) | Standard for functional annotation propagation; phylogeny-aware | Pre-computed OGs; cannot add novel species coherently; only as fresh as eggNOG release |
| JustOrthologs 2 (Miller 2019 Bioinformatics 35:546) | DNA-based; exon-aware; close-species RBH | Pairwise orthologs | Extremely fast for closely related species (same family); preserves splice variants | Only suitable for closely related species |
| TOGA (Kirilenko 2023 Science 380:eabn3107) | Whole-genome-alignment chain -> ML projection + intactness classification | Per-query orthologs with intact/lost/missing call | Modern paradigm for vertebrate-scale orthology; handles gene loss explicitly; integrates with CESAR 2.0 | Requires WGA (Cactus); not designed for prokaryotes or fungi |
| HOGENOM / HOGsuite (Penel 2009 BMC Bioinf 10:S3) | Tree-based HOGs in databases | Pre-computed HOG database | Legacy; for downstream use of stored HOGs | Not for new computation; outdated taxon sampling |
Methodology evolves; Quest-for-Orthologs benchmarks (Altenhoff 2024 Genome Biol 25:115) refresh annually. Verify the current QfO benchmark results before locking on a single tool for novel benchmarking-grade work.
| Scenario | Recommended approach | Why |
|----------|------------------------|-----|
| 5-50 vertebrate / animal genomes for phylogenomics | OrthoFinder3 -M msa -A mafft -T iqtree | HOG quality at moderate scale; species tree included |
| 50-200 genomes, any clade | OrthoFinder3 default; SonicParanoid2 as second method | Cross-validate consensus HOGs |
| 200-1000 genomes, scaling required | FastOMA or SonicParanoid2 | OrthoFinder3 slow; cluster-first option (-c 1) helps |
| 1000+ bacterial genomes (pangenome scope) | Use [[pangenome-analysis]] pipeline; OrthoFinder unsuitable | Pangenome-specific tools needed |
| Closely related strains / same family | JustOrthologs or ProteinOrtho with synteny | Fast; preserves splice variants |
| WGD-affected lineage (plants, salmonids, teleosts) | OrthoFinder3 + synteny verification (see [[synteny-analysis]]); or GENESPACE | WGD inflates paralogy; synteny anchoring required |
| Functional annotation transfer | eggNOG-mapper 2 | Pre-computed; phylogeny-aware; integrates GO/KEGG |
| Single-copy orthologs for concatenation phylogenomics | OrthoFinder3 Phylogenetic_Hierarchical_Orthogroups/N0.tsv filter for 1:1 | Most rigorous HOG layer; standard concatenation input |
| Gene-loss detection across mammals/birds at scale | TOGA + CESAR 2.0 | Explicit intact/lost classification; handles assembly-gap noise |
| Functional annotation transfer in poorly characterized genome | eggNOG-mapper + OrthoFinder3 single-copy orthologs cross-validated | Functional priors + phylogenetic confidence |
| Ortholog detection in highly fragmented assembly | BUSCO/Compleasm first to assess completeness; flag affected OGs | Assembly fragmentation creates false absence -> spurious lineage-specific losses |
| Distant homolog detection (50-200 Myr divergence) | MMseqs2 sensitive (-s 7.5) + OrthoFinder3 | Default DIAMOND misses distant homologs; sensitive search needed |
| Phylogenetic orthology with paralog-tolerant species trees | ASTRAL-Pro2 on OrthoFinder gene trees | Coalescent species tree from gene trees; handles paralogy explicitly |
| HOG-level functional propagation across many species | OMA standalone HOG database | Strict hierarchical orthology; supports functional propagation per taxonomic level |
| Synteny-anchored orthology in repeat-heavy genome | ProteinOrtho -synteny or GENESPACE | Filters tandem duplicates; uses gene order to disambiguate paralogs |
Trigger: OrthoFinder run without a sufficiently distant outgroup; only ingroup taxa included.
Mechanism: OrthoFinder roots gene trees via STRIDE / STAG using outgroup-based duplication signals. Without an outgroup, the root inferred for each gene tree may be internal, mistakenly classifying a duplication-then-loss pattern as orthology. The "1:1 orthologs" returned may actually be hidden paralogs (Emms & Kelly 2017 MBE 34:3267 STRIDE; benchmark shows 8% improvement in HOG accuracy with outgroup).
Symptom: Phylogenomic concatenation produces low-bootstrap species tree; per-gene trees show inconsistent rooting; same-clade species show longer-than-expected branches in single-copy ortholog trees.
Fix: Include >= 1 outgroup taxon at the next-higher taxonomic level (sister phylum / class / order). For vertebrates, use cyclostomes (lamprey) as outgroup for jawed vertebrates; for plants, use a non-flowering plant for angiosperms. Re-run with outgroup; cross-check HOG file Phylogenetic_Hierarchical_Orthogroups/N0.tsv for 1:1 stability.
Trigger: Proteome FASTA contains multiple isoforms per gene (Ensembl, RefSeq with -NR flag).
Mechanism: OrthoFinder / SonicParanoid / OMA treat each protein sequence as a gene; isoforms are clustered together within orthogroups but inflate the apparent gene count per species, producing artifactual "co-orthologs" that are just isoforms of one gene.
Symptom: Per-species gene count > 2x what gene-annotation pipeline reported; orthogroups contain multiple proteins from same gene; gene names contain isoform suffixes (.1, -iso1, etc.).
Fix: Pre-filter proteomes to longest isoform per gene. Use agat_sp_keep_longest_isoform.pl (AGAT toolkit), Biopython snippet on Ensembl gene-isoform mapping, or Trinotate longest-ORF picker. OrthoFinder3 ships a wrapper tools/primary_transcript.py for Ensembl/UniProt format.
Trigger: Genomes annotated by different pipelines (Augustus, MAKER, Funannotate, NCBI RefSeq) mixed in one analysis.
Mechanism: Pipelines differ in handling of intron predictions, gene boundaries, and small-gene filtering. A gene predicted by Augustus but missed by MAKER appears as an apparent lineage-specific gene in the MAKER-annotated species, producing spurious "species-specific" orthogroups.
Symptom: CAFE (gene family evolution) shows extreme expansions / contractions for species with different annotation pipelines; per-species "unique" gene count varies 5-10x between technically similar genomes.
Fix: Re-annotate all genomes with one pipeline (e.g. BRAKER3 or Funannotate) before orthology. When re-annotation isn't possible, normalize via BUSCO/Compleasm completeness (filter OGs absent from species with < 95% BUSCO complete). Document annotation pipeline per species in methods.
Trigger: Transferring GO/KEGG functional annotation from 1:1 ortholog without testing functional divergence.
Mechanism: Studies show orthologs are weakly more functionally similar than paralogs (Altenhoff 2012 PLoS Comp Biol 8:e1002514; Stamboulian 2020 Bioinformatics 36:i219), but effect size is small and dependent on evolutionary distance. Subfunctionalization (Force 1999 Genetics 151:1531), neofunctionalization, and dosage subfunctionalization can rapidly differentiate 1:1 orthologs.
Symptom: Transferred annotation contradicts species-specific experimental data; ortholog has fold-change different expression patterns; rapid evolution (dN/dS > 0.3) on one branch only.
Fix: For high-confidence annotation transfer, require: (1) 1:1 orthology AND (2) low branch dN/dS (< 0.2 for both branches) AND (3) GO evidence code ISO with experimental support upstream. Tag annotation as "predicted from ortholog" rather than equating function. eggNOG ortholog-conjecture-aware mode provides confidence scoring.
Trigger: Reciprocal best hits (RBH) method on a gene that has been replaced by a paralog in one lineage.
Mechanism: Lineage A retains the original ortholog; Lineage B lost the ortholog and replaced its function with a paralog (xenologous replacement). RBH between A and B identifies the paralog as the "ortholog" because it's the best hit.
Symptom: dN/dS on RBH ortholog pair > 1 on the B branch (suggesting positive selection but actually paralog substitution); gene tree shows the B sequence is sister to other paralogs of the A gene, not to A.
Fix: Use tree-based methods (OrthoFinder3, OMA HOGs) for organisms with extensive paralog history. RBH is appropriate for orthogroup pre-screening only. JustOrthologs is explicitly an RBH method, so it's affected by this; reserve for closely related species.
Trigger: Plant, yeast, fish, or salmonid analysis where one or more ancient WGD events occurred.
Mechanism: WGD doubles all genes; subsequent gene loss is biased toward certain functional categories (Dosage Balance Hypothesis; Freeling 2007 PNAS 104:8723). Sequence-only orthology cannot distinguish "ortholog" (single ancestral gene) from "homeolog" (paralog from WGD). In hexaploids (wheat, Brassica), this triples the confusion.
Symptom: Multiple co-orthologs per species in orthogroups for WGD-affected lineages; rate variation across "co-orthologs" suggests one is the true ortholog and the others are recent duplicates.
Fix: Use synteny-aware methods (GENESPACE -- Lovell 2022 eLife 78526; ProteinOrtho -synteny; OrthoFinder3 + post-hoc synteny verification). Restrict 1:1 ortholog phylogenomics to genes in single-copy syntenic regions. For deep WGD ancestry (e.g. 2R vertebrate WGD), modern orthologs of post-2R paralogs (ohnologs) can no longer be unambiguously identified by sequence; use synteny + duplication-dating.
Trigger: orthofinder -M msa -A mafft default; very divergent orthogroups (deep taxonomy).
Mechanism: MAFFT default settings choose accurate-but-fast iterative refinement. For sequences with >50% divergence, MAFFT's auto choice may downgrade to FFT-NS-2, producing alignments with > 30% poorly aligned columns that distort gene-tree inference and downstream HOG construction.
Symptom: Per-OG MAFFT logs show "FFT-NS-2 selected"; gene trees have unstable rooting; bootstrap support < 60% for many internal branches.
Fix: Force mafft-linsi or specify -A mafft --thread -1 --localpair --maxiterate 1000. For deeply divergent OGs, run PRANK or MUSCLE5 post-hoc on critical OGs and re-build gene trees with IQ-TREE. PREQUAL or HmmCleaner segment-filtering improves resulting trees (Di Franco 2019 BMC Eco Evo 19:21).
| Quantity | Threshold | Source / Rationale |
|----------|-----------|-------------------|
| QfO benchmark adoption | 100% within-species ortholog pairs identified | Altenhoff 2024 Genome Biol 25:115; minimum competence threshold |
| BUSCO/Compleasm completeness for inclusion | >= 90% complete (single + duplicated) before orthology run | Below this, expect inflated lineage-specific OGs |
| OrthoFinder3 outgroup distance | >= one sister taxonomic level (sister phylum / class / order) | Emms & Kelly 2017 MBE 34:3267 STRIDE rooting; +20% HOG accuracy |
| Single-copy ortholog filter for phylogenomics | Present in >= 90% of species, exactly 1 copy each | Standard convention; below this, missing data biases tree inference |
| MMseqs2 sensitivity for divergent homologs | -s 7.5 for > 50% divergence | mmseqs2 documentation; default -s 4.0 misses distant homologs |
| ProteinOrtho --conn (connectivity) | >= 0.1 default; 0.2 stricter for less paralogy confusion | Lechner 2011 |
| OMA HOG inclusion criterion | RBH + Smith-Waterman score and pairwise stability | OMA convention; sub-clade HOGs nested in supergroup HOGs |
| Ortholog age threshold for functional transfer | divergence < 200 Myr OR dN/dS < 0.2 | Stamboulian 2020 Bioinformatics 36:i219; ortholog conjecture stronger at lower divergence |
| Annotation pipeline normalization | All species annotated with same pipeline OR BUSCO-completeness within 5% | Avoid annotation-heterogeneity bias on CAFE |
| eggNOG-mapper minimum score | bit score / e-value defaults; check --seed_ortholog_score | eggNOG-mapper docs |
| TOGA intactness classes (loss_summ_data.tsv) | I (intact), PI (partial intact), UL (uncertain loss), L (lost), M (missing/assembly gap), PM (partial missing) | Kirilenko 2023 + TOGA repo |
| TOGA orthology relationships (orthology_classification.tsv) | one2one, one2many, many2one, many2many, PG (paralogous projection) | Kirilenko 2023 |
| SonicParanoid2 confidence | >= 0.9 high-confidence orthologs; 0.5-0.9 moderate | Cosentino 2023 docs |
| Reasonable expected runtime (200 species, 16 cores) | OrthoFinder3 default ~8-24h; SonicParanoid2 ~2-6h; FastOMA ~3-8h | Hardware-dependent benchmarks |
Goal: Produce HOG-based orthology with species tree and gene-duplication events for any clade.
Approach: Prepare cleaned per-species proteomes (longest isoforms) -> include outgroup -> run with MSA + IQ-TREE option -> parse HOG output at appropriate taxonomic level.
# Pre-clean proteomes: longest isoform per gene
for f in raw_proteomes/*.faa; do
python tools/primary_transcript.py $f > cleaned/$(basename $f)
done
# OrthoFinder v3 with MSA + IQ-TREE for tree-based HOG inference
orthofinder \
-f cleaned/ \
-t 16 \
-a 4 \
-M msa \
-A mafft \
-T iqtree \
-S diamond_ultra_sens \
-y \
-o orthofinder_run
# Output of interest (v3 layout):
# orthofinder_run/Results_<date>/Phylogenetic_Hierarchical_Orthogroups/N0.tsv (root-level HOGs)
# orthofinder_run/Results_<date>/Single_Copy_Orthologue_Sequences/ (single-copy MSA-ready)
# orthofinder_run/Results_<date>/Species_Tree/SpeciesTree_rooted.txt
# orthofinder_run/Results_<date>/Gene_Duplication_Events/ (per-branch dup counts)
'''Parse OrthoFinder v3 HOG output for downstream analysis.'''
import pandas as pd
from pathlib import Path
def load_hogs(results_dir, level='N0'):
'''HOG file path differs from v2: Phylogenetic_Hierarchical_Orthogroups/{level}.tsv.'''
p = Path(results_dir) / 'Phylogenetic_Hierarchical_Orthogroups' / f'{level}.tsv'
df = pd.read_csv(p, sep='\t')
return df.set_index('HOG')
def _is_present(v):
'''OrthoFinder v3 HOG cells are either NaN (older versions), '' (newer), or a comma-separated gene list.'''
return not (pd.isna(v) or v == '' or str(v).strip() == '')
def single_copy_hogs(hog_df, min_species_fraction=0.9):
'''Return HOG IDs where every present species has exactly 1 gene and >= fraction of species present.'''
sp_cols = [c for c in hog_df.columns if c not in ('OG', 'Gene Tree Parent Clade')]
is_single = hog_df[sp_cols].apply(
lambda r: all((not _is_present(v)) or ',' not in str(v) for v in r), axis=1
)
n_present = hog_df[sp_cols].apply(lambda r: sum(_is_present(v) for v in r), axis=1)
keep = is_single & (n_present >= min_species_fraction * len(sp_cols))
return hog_df.index[keep].tolist()
def classify_orthology(hog_df, sp_pair):
'''Per-HOG classify pairwise orthology between two species.
Returns one of: '1-1', '1-many', 'many-1', 'many-many', 'absent', 'sp1-only', 'sp2-only'.
'''
sp1, sp2 = sp_pair
types = {}
for hog, row in hog_df.iterrows():
v1, v2 = row.get(sp1), row.get(sp2)
n1 = 0 if not _is_present(v1) else len(str(v1).split(','))
n2 = 0 if not _is_present(v2) else len(str(v2).split(','))
if n1 == 0 and n2 == 0: types[hog] = 'absent'
elif n1 == 0: types[hog] = 'sp2-only'
elif n2 == 0: types[hog] = 'sp1-only'
elif n1 == 1 and n2 == 1: types[hog] = '1-1'
elif n1 == 1 and n2 > 1: types[hog] = '1-many'
elif n1 > 1 and n2 == 1: types[hog] = 'many-1'
else: types[hog] = 'many-many'
return types
Goal: Run a second orthology method to cross-validate OrthoFinder3 calls (consensus increases QfO benchmark precision).
Approach: SonicParanoid2 with default ML predictor -> intersect orthogroups with OrthoFinder HOGs at the root level -> compute Jaccard agreement.
sonicparanoid -i cleaned/ -o sp2_run --mode default --threads 16 --pfam pre-computed
# Output: sp2_run/runs/<date>/orthogroups/flat.ortholog_groups.tsv
from itertools import combinations
import pandas as pd
def jaccard_ogs(ogs_a, ogs_b):
'''Compute mean Jaccard between two orthogroup sets keyed by gene IDs.
Each og is a frozenset of gene IDs.'''
set_a = {gene: og for og in ogs_a for gene in og}
set_b = {gene: og for og in ogs_b for gene in og}
common = set(set_a) & set(set_b)
jacs = []
for g in common:
a = set_a[g]
b = set_b[g]
j = len(a & b) / len(a | b) if a | b else 0
jacs.append(j)
return sum(jacs) / len(jacs) if jacs else 0
Consensus single-copy HOGs (1:1 in both methods) are the highest-confidence input for downstream phylogenomic concatenation or selection analyses.
Goal: Identify orthologs and classify gene-loss/intactness across hundreds of mammal or bird genomes.
Approach: Run Progressive Cactus whole-genome alignment (see [[whole-genome-alignment]]) -> TOGA projects reference genes through chains -> classifies each query gene as I/PI/UL/L/M/PM.
# After Cactus alignment producing reference-query chain files
toga.py \
--chain cactus_chain.bb \
--bed reference_genes.bed \
--tDB target.2bit \
--qDB query.2bit \
--nextflow_dir nf_pipeline_dir \
--pn project_name \
--cpu 64 \
--quiet
# Output:
# project_name/loss_summ_data.tsv per-gene intactness call
# project_name/orthology_classification.tsv one-to-one / one-to-many / many-to-many
TOGA emits eight classes per gene in loss_summ_data.tsv: I (intact), PI (partial intact), UL (uncertain loss), L (lost), M (missing under assembly gap), PM (partial missing), PG (paralogous projection / no orthologous chain found), N (no data). For evolutionary analysis, treat I as functional; PI as fragmented (often functional but caveat); L + UL as candidates for true gene loss after manual review of read coverage; M / PM are assembly-quality issues, not biology; PG indicates the algorithm found only paralogous chains. CESAR 2.0 provides exon-aware coding annotation projection used internally.
Goal: Propagate GO / KEGG / EC annotations from curated orthologs to a novel proteome.
Approach: Run eggNOG-mapper with appropriate evolutionary level -> integrate with OrthoFinder HOGs for confidence stratification.
emapper.py \
--output project \
-i novel_proteome.faa \
--cpu 16 \
--decorate_gff genome.gff \
--tax_scope auto \
--target_orthologs all \
--evidence_type experimental \
--pfam_realign denovo
Output project.emapper.annotations includes per-gene: best ortholog, taxonomic scope of OG, GO terms, KEGG pathways, COG category. Confidence stratification: experimental-evidence GO (EXP, IDA, IPI, IMP, IGI, IEP) > non-experimental (ISO from 1:1 close ortholog) > IEA (electronic). Tag annotation transfer evidence in output.
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| OrthoFinder 1:1; SonicParanoid 1:many | Recent duplication in one lineage; OrthoFinder collapsed | Inspect gene tree; if duplication post-speciation, both are co-orthologs (correct in SP2) |
| OrthoFinder 1:1; ProteinOrtho merges into single OG | Different connectivity threshold | OrthoFinder HOG is more granular; both correct at different levels |
| OMA "1:1 across all species"; OrthoFinder "split orthogroups" | OMA more conservative (strict RBH) | Use OMA for stringent comparative analyses; OrthoFinder for broader recall |
| OrthoFinder OG contains tandem duplicates | Tandem duplications not collapsed | Use ProteinOrtho -synteny or GENESPACE post-filter; or apply MCScanX tandem detection (default 5-gene window) |
| eggNOG ortholog disagrees with OrthoFinder | eggNOG uses fixed reference set; OrthoFinder uses the sampled species | Trust OrthoFinder for sampled-species 1:1; trust eggNOG for functional annotation from curated references |
| TOGA reports "Lost" but OrthoFinder finds ortholog | TOGA chain-projection failed; assembly gap | Re-check assembly contig at locus; if gap, treat as missing (M) not lost (L) |
| All methods disagree on a single OG | Likely tandem duplicates, chimeric assembly, or hidden paralogy | Manual gene-tree inspection; treat OG as low-confidence |
Operational rule for publication: Cross-validate single-copy orthologs across 2+ methods (OrthoFinder3 + SonicParanoid2 or OrthoFinder3 + OMA); use consensus HOG set for phylogenomics. Functional annotation transfer requires either 1:1 orthology + low branch dN/dS, or eggNOG ISO/EXP evidence. Report annotation pipeline normalization in methods.
| Pushback | Standard response | |----------|-------------------| | "Why this orthology method?" | OrthoFinder3 / OMA / SonicParanoid2 chosen based on QfO benchmark (Altenhoff 2024) and clade-appropriate scale; consensus single-copy HOGs reported | | "Outgroup?" | Outgroup taxon X from sister taxonomic level included; STRIDE-rooted gene trees | | "Isoforms?" | Longest-isoform-per-gene pre-filter applied via AGAT / OrthoFinder primary_transcript.py | | "Annotation pipeline heterogeneity?" | All species annotated with Y pipeline OR BUSCO completeness within Z% across species | | "WGD?" | Acknowledged; synteny-aware verification via GENESPACE / ProteinOrtho synteny | | "Functional transfer evidence?" | 1:1 ortholog + dN/dS < 0.2 + eggNOG EXP/IDA evidence + tagged as predicted | | "Cross-validation?" | OrthoFinder3 + second method (SonicParanoid2 / OMA); consensus single-copy HOGs used | | "Ortholog conjecture caveat?" | Acknowledged (Altenhoff 2012; Stamboulian 2020); high-confidence single-copy 1:1 within short divergence; functional divergence flagged |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| OrthoFinder "no orthogroups found" | Proteomes empty or wrong path | Check FASTA non-empty; default file extension .fa .faa .fasta recognized |
| HOG file Phylogenetic_Hierarchical_Orthogroups missing | v3 layout new; parser expects v2 path | Use Phylogenetic_Hierarchical_Orthogroups/N0.tsv for v3; OrthoFinder v2 used Orthogroups/Orthogroups.tsv |
| All single-copy HOGs are tiny (1-2 genes) | Outgroup too distant; only paralog-free genes survive | Add intermediate outgroup; relax single-copy fraction to 0.7-0.8 |
| SonicParanoid2 ML model fails to download | Network or model registry issue | Pre-download with sonicparanoid-get-test-data; use --mode legacy as fallback |
| OMA "infinity recursion in HOG" | Self-referential HOG (rare bug) | Update to latest OMA; bug reports on GitHub for known cases |
| eggNOG-mapper says "no hits" for half the proteome | Default DIAMOND sensitivity too low | Add --sensmode more-sensitive or use --sensmode ultra-sensitive |
| TOGA "no chain file" | Cactus output not properly converted | Use halSynteny and chainNet UCSC pipeline; verify chain file format |
| Gene IDs in HOG have unexpected suffix __rev | Synteny strand convention | Strip the suffix; check OrthoFinder log for the convention applied |
| Per-species gene count radically different from annotation | Isoforms not filtered | Re-filter to longest isoform per gene |
| Annotation transfer says "Hypothetical" everywhere | eggNOG version mismatch with proteome | Update eggNOG database to current; v5 vs v6 has different OG layout |
| ProteinOrtho synteny option times out | Slow synteny module on huge genomes | Skip synteny for screening; run only on candidate OGs |
# OrthoFinder3
conda install -c bioconda orthofinder=3.0
# Or via Python wrapper: pip install orthofinder3
# SonicParanoid2
conda install -c bioconda sonicparanoid
# Broccoli
git clone https://github.com/rderelle/Broccoli && cd Broccoli && pip install .
# ProteinOrtho 6
conda install -c bioconda proteinortho
# OMA standalone (Linux)
wget https://omabrowser.org/standalone/OMA.tgz && tar xf OMA.tgz && cd OMA && ./install.sh
# FastOMA
pip install fastoma
# eggNOG-mapper 2
conda install -c bioconda eggnog-mapper
# JustOrthologs
git clone https://github.com/ridgelab/justOrthologs && cd justOrthologs && python setup.py install
# TOGA
conda env create -f https://raw.githubusercontent.com/hillerlab/TOGA/master/toga_env.yml
# Requires nf-core and Nextflow
# Quality control
conda install -c bioconda busco compleasm
For Quest-for-Orthologs benchmark submission, follow https://orthology.benchmarkservice.org/ instructions; the benchmark refreshes annually.
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.