Nucleic Acids Res. Jan 2012; 40(D1): D284–D289.
Published online Nov 16, 2011. doi:  10.1093/nar/gkr1060
PMCID: PMC3245133

eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges

Abstract

Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450 904 orthologous groups (62.5%).

INTRODUCTION

Orthology, defined as homology via speciation (1), is a crucial concept in evolutionary biology and is essential for disciplines such as comparative genomics, metagenomics and phylogenomics. The concepts of orthology and paralogy, with the latter being defined as homology via duplication (1), have been used as a foundation to introduce the concept of clusters of orthologous groups: proteins that have evolved from a single ancestral sequence existing in the last common ancestor (LCA) of the species that are being compared, through a series of speciation and duplication events (2). Orthologous groups (OGs) have proven useful for functional analyses and the annotation of newly sequenced genomes (3–5) as orthologs tend to have equivalent functions (6).

A number of orthology prediction methods have been recently introduced that can be classified into (i) graph-based methods, from the reciprocal-best-hit approach (7) to more sophisticated methods, such as the identification of best-hit triangles (2,8–11) and other clustering-based approaches (12–15) or (ii) tree-based methods that can be further classified into methods that use tree reconciliation to infer orthologs (16–19) and those that do not (20,21). Their methodological advantages and disadvantages have been reviewed in refs (22–24).
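The simplest graph-based strategy mentioned above, the reciprocal-best-hit approach, can be illustrated with a minimal sketch. The data layout (a dictionary of pairwise scores keyed by `(genome, protein)` tuples) and the function names are assumptions made for illustration; this is not the procedure used by eggNOG or by any of the cited tools.

```python
def best_hits(scores):
    """Map (query, target_genome) to the best-scoring subject in that genome.

    `scores` maps (query, subject) -> similarity score; every protein is a
    hypothetical (genome, name) tuple.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query[0] == subject[0]:
            continue  # ignore hits within the same genome
        key = (query, subject[0])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}


def reciprocal_best_hits(scores):
    """Return the set of protein pairs that are each other's best hit."""
    best = best_hits(scores)
    pairs = set()
    for (query, _target_genome), subject in best.items():
        # Keep the pair only if the subject's best hit in the query's genome
        # is the query itself.
        if best.get((subject, query[0])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs
```

A pair is reported only when both directions agree, which is why a protein whose best hit already has a better partner is left unpaired.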

An important point is that OGs depend on their taxonomic context. The broader the taxonomic range, the deeper the LCA is placed, resulting in larger OGs with lower resolution of the orthologous relationships. Conversely, a narrower taxonomic range results in more fine-grained groups. For this reason, the first and most successful resource, COG (2), provided OGs for certain taxonomic ranges, namely COGs for all three domains of life, KOGs for Eukaryotes (8) and arCOGs for Archaea (9). Some automatic orthology prediction methods also provide distinct sets of OGs for an increasing number of taxonomic groups [e.g. OrthoDB (10), eggNOG (11) and OMA (12)].

The functional annotation of OGs is particularly necessary, as functional insights from well-studied proteins/species can be transferred to uncharacterized orthologs. Moreover, several genome annotation tools [e.g. (25)] use the functional annotations of OGs to automatically map function information to large-scale genomic data. The most common form of orthologous group annotation is a consensus-based (longest common string) approach (9,12,18,21,26) in which the description of the OG is derived from available annotations of the member proteins. Only a few available resources conduct a more robust manual annotation of the groups (8) or incorporate multiple annotation sources for the description and annotate the groups with functional categories (8,11).
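As an illustration of the consensus-based idea described above, the following sketch derives a group description as the longest run of words shared by all member annotations. Working on whole words, lower-casing the input and requiring the phrase in every member are simplifying assumptions; the cited resources may differ in all of these details.

```python
def _contains(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def longest_common_phrase(descriptions):
    """Return the longest run of words present in every description."""
    token_lists = [d.lower().split() for d in descriptions]
    if not token_lists:
        return ""
    shortest = min(token_lists, key=len)
    # Try candidate phrases from the shortest description, longest first.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            phrase = shortest[start:start + length]
            if all(_contains(tokens, phrase) for tokens in token_lists):
                return " ".join(phrase)
    return ""
```

For example, member annotations such as "DNA polymerase III subunit alpha" and "putative DNA polymerase III alpha chain" would share the phrase "dna polymerase iii".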

Here, we describe the third version of eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups), a database that provides orthologous groups for 943 Bacteria, 69 Archaea and 121 Eukaryotes. In total, 721 801 OGs have been computed including about twice as many orthologous relations for genes compared to the previous version. Most importantly, it contains considerably more taxonomically restricted OGs with higher resolution, covering 41 taxonomically relevant ranges such as Proteobacteria or Metazoans.

SELECTION OF GENOMES

We downloaded complete proteomes from RefSeq (27), Ensembl (28), UniProt (29), GiardiaDB (30), JGI (http://genome.jgi-psf.org/) and TAIR (31). This particular set of genomes also forms the basis for the most recent STRING (32) and STITCH (33) databases, allowing for easy integration across these databases.

The analyses were performed on 1133 complete genomes, encoding 5 214 234 proteins. The genomes were selected based on pertinence and quality. Apart from the many model organisms that were included in the database, the species were selected based on their taxonomic position to ensure a dense sampling of 41 selected taxonomical ranges (see below) as well as a broad coverage of the tree of life. As genome quality significantly affects the accuracy of orthology assignment (34,35), all genomes in eggNOG v3 were manually selected for genomic quality based on sequencing coverage and genome completeness judged by the coverage of 40 phylogenetic marker genes (36,37).

CONSTRUCTION OF ORTHOLOGOUS GROUPS AT DIFFERENT TAXONOMIC LEVELS

The first step of the eggNOG pipeline is an all-against-all similarity search. Because the computational cost of such a search grows quadratically with the number of sequences, eggNOG v3 now uses the SIMAP database (38) for the required homology comparisons. SIMAP uses the FASTA heuristics (39), which are better at capturing sequences with a lower degree of similarity than BLAST (40), which was previously used in eggNOG, at the cost of slower searches.
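To give a sense of the scale involved, the number of sequence pairs in an all-against-all comparison can be computed directly. The sketch below counts unordered pairs and excludes self-comparisons, which is an assumption about how the comparisons are counted; real search tools also avoid most of these comparisons through heuristics.

```python
def pairwise_comparisons(n_sequences):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


# Protein count taken from the genome selection described above.
N_PROTEINS = 5_214_234
pairs = pairwise_comparisons(N_PROTEINS)

# Doubling the number of sequences roughly quadruples the number of pairs.
growth = pairwise_comparisons(2 * N_PROTEINS) / pairs
```

For the 5 214 234 proteins analysed here this is on the order of 10^13 pairs, which is why precomputed similarities are attractive.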

After the homology searches and the subsequent clustering step (11), 4 396 591 (84%) of all proteins investigated were assigned to at least one of the 721 801 orthologous groups generated by eggNOG (Figure 1). We extended the COGs, KOGs and arCOGs (8,9) to include the 1133 organisms, 121 eukaryotic and 69 archaeal species, respectively. As an enhancement to the 4873 COGs, 4850 KOGs and 7538 arCOGs, additional groups have been created as non-supervised OGs (NOGs), eukaryote-specific NOGs (euNOGs) and archaea-specific NOGs (arNOGs), extending those original COGs/KOGs/arCOGs by 101 208 NOGs, 41 267 euNOGs and 11 387 arNOGs. To provide a higher resolution of orthologous groups in frequently used taxonomic ranks, we applied our procedure to several subsets of organisms separately. Apart from the level of Eukaryotes (euNOGs) and Archaea (arNOGs), to provide information for all three domains of life, we provide newly derived bacteria-specific NOGs (bactNOGs). Subsequently, the orthology for 22 bacterial levels such as Firmicutes (firmNOGs), Proteobacteria (proNOGs) and Actinobacteria (actNOGs) (Figure 1) is further resolved, as well as for 14 major levels in the eukaryotic clade including Animals (meNOGs) and Fungi (fuNOGs).

Figure 1.
In addition to the over 100 000 orthologous groups in the last universal common ancestor (LUCA), eggNOG v3 also provides orthologous groups and functional annotation for an additional 40 taxonomic levels. Here we display each level with its abbreviated ...

AUTOMATED ANNOTATION OF PROTEIN FUNCTION

An important feature of eggNOG v3 is the automatic functional annotation of the OGs. The groups are annotated with a function description based on the functional annotations of each protein member within the group (26) and in parallel with one of 25 functional categories (11) compatible with those provided by the COG and KOG databases (8).

In eggNOG v3, the functional annotation pipeline has similarly been optimized to scale to the large amount of data. This has led to a significant improvement in computation time while simultaneously increasing the total number of functionally annotated OGs. Between eggNOG v2 and eggNOG v3, for corresponding taxonomic levels, the total number of annotated OGs increased by 28.8% and 10.0% for function description and functional category, respectively. In summary, of the 721 801 OGs in eggNOG v3, 62.5% have a functional annotation and 47.6% have been classified into a functional category (for details see Figure 1).
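The headline figures quoted in the text can be cross-checked with a few lines of arithmetic. The constants below are copied from the Abstract and the preceding sections; the variable and function names are only illustrative.

```python
# Figures quoted in the text.
GENOMES = {"Bacteria": 943, "Archaea": 69, "Eukaryotes": 121}
TOTAL_PROTEINS = 5_214_234
ASSIGNED_PROTEINS = 4_396_591
TOTAL_OGS = 721_801
DESCRIBED_OGS = 450_904


def percent(part, whole, digits=1):
    """Share of `part` in `whole`, as a percentage rounded to `digits`."""
    return round(100.0 * part / whole, digits)


total_genomes = sum(GENOMES.values())                           # 1133
protein_coverage = percent(ASSIGNED_PROTEINS, TOTAL_PROTEINS)   # 84.3
described_share = percent(DESCRIBED_OGS, TOTAL_OGS)             # 62.5
```

The results agree with the rounded values reported in the text (1133 genomes, 84% of proteins assigned, 62.5% of OGs with a function description).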

FURTHER IMPROVEMENTS

As the exponential growth of genomes and genes therein leads to considerable issues regarding performance, a number of technical improvements and speedups have been introduced; for example, the parallelization of some key aspects of the OG pipeline has contributed to the performance enhancement.

One important step in the eggNOG pipeline is the inference of in-paralogs. Proteins that belong to a given subset of species and are more similar to each other than to proteins belonging to species outside that subset are defined as in-paralogs. In this release, we determined the aforementioned subsets automatically: for the universal, domain- and phylum-specific OGs, we grouped organisms within the same taxonomic order. For taxonomical ranges between the phylum and class, we used the taxonomical family, while for ranges below the class level we grouped given species together.
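The in-paralog definition given above can be expressed as a small sketch. The input layout (symmetric scores keyed by `frozenset` pairs, a protein-to-species mapping and a set of species forming the subset) is hypothetical, and the strict "greater than" comparison against the best outside hit is an assumed reading of "more similar"; the actual eggNOG implementation is not described at this level of detail.

```python
def inparalog_pairs(scores, species_of, subset):
    """Return protein pairs inside `subset` that are in-paralog candidates.

    scores:     dict mapping frozenset({p, q}) -> similarity score
    species_of: dict mapping protein -> species
    subset:     set of species treated as one group
    """
    # Best score of each in-subset protein against any protein outside it.
    best_outside = {}
    for pair, score in scores.items():
        p, q = tuple(pair)
        for inside, outside in ((p, q), (q, p)):
            if species_of[inside] in subset and species_of[outside] not in subset:
                best_outside[inside] = max(best_outside.get(inside, 0.0), score)

    pairs = set()
    for pair, score in scores.items():
        p, q = tuple(pair)
        if species_of[p] in subset and species_of[q] in subset:
            # Both members must be closer to each other than to any outsider.
            if score > best_outside.get(p, 0.0) and score > best_outside.get(q, 0.0):
                pairs.add(pair)
    return pairs
```

Changing `subset` from a single species to an order or family changes which pairs qualify, which is the reason the grouping level matters for each taxonomic range.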

QUALITY ASSESSMENT OF eggNOG v3.0

So far, the majority of quality assessment tests are based on the functional conservation of predicted orthologs (41–44); however, it has been acknowledged that a phylogeny-based benchmarking approach would be more appropriate (44,45). We therefore manually curated a set of orthologous groups exemplifying multiple caveats of orthology prediction (35), named Reference OGs (RefOGs), which were used to compare the quality of this release with that of eggNOG v2. As many as 95% of the reference orthologs can be detected in the new release compared to only 75% in the previous version (Figure 2). This is mainly due to the updated genome annotations in eggNOG v3. We estimated the impact of four error sources: (i) false assignments, (ii) missing orthologs, (iii) fusions and (iv) fissions (for details see Figure 2). eggNOG v3 is less influenced than eggNOG v2 by false assignments and missing orthologs. In particular, missing orthologs affect only 41% of the RefOGs in this release compared to 57% in the previous one. The high coverage of the benchmark set (95%) due to new genome annotations is the major contributor to this observation, highlighting the importance of frequent database updates, which is one of our goals. On the other hand, the previous release contains slightly fewer artificial fusions and fissions. As the coverage of compared species affects the accuracy of orthology assignment (35), it can be expected that the addition of more species does not always improve all benchmark parameters.

Figure 2.
Quality assessment of eggNOG v3. We used 70 manually curated families (RefOGs) to test the accuracy of orthology prediction of the new release compared to eggNOG v2. For each release, we identified the orthologous group (OG) with the largest overlap of ...
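The comparison of a reference family against predicted groups can be sketched as follows. The function picks the predicted OG with the largest overlap, as in the caption above, and then counts discrepancies; the exact definitions of the error categories are given in Figure 2 and ref. (35), so the set-difference definitions used here are simplified assumptions. Fusions are not modelled, and at least one predicted OG is assumed to exist.

```python
def assess_refog(ref_members, predicted_ogs):
    """Compare one reference family with a collection of predicted OGs.

    ref_members:   set of protein identifiers in the curated family
    predicted_ogs: dict mapping OG name -> set of protein identifiers
    """
    overlaps = {name: len(ref_members & members)
                for name, members in predicted_ogs.items()}
    best = max(overlaps, key=overlaps.get)  # OG with the largest overlap
    best_members = predicted_ogs[best]
    return {
        "best_og": best,
        # Reference proteins absent from the best-matching OG.
        "missing": ref_members - best_members,
        # Proteins in the best-matching OG that are not in the reference.
        "false_assignments": best_members - ref_members,
        # More than one hit group hints at a split (fission) of the family.
        "groups_hit": sum(1 for n in overlaps.values() if n > 0),
    }
```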

ACCESS OPTIONS

To improve the usability of eggNOG v3, a new, modernized web interface was developed. As with the previous versions, the new interface provides data that can be downloaded under the Creative Commons Attribution 3.0 License at http://eggnog.embl.de. The available data include the OGs, protein sequences, multiple sequence alignments, precomputed gene trees (Figure 3) as well as the annotation of 62% of the OGs. Possible queries include multiple OG names, gene names and/or protein names. One goal of the new interface is to simplify the navigation of the various OGs by (i) a cleaner, more intuitive interface as well as (ii) an interactive species tree on the right side of the search results. The interactive species tree facilitates the navigation across different hierarchical levels by following the orthologs through the taxonomic levels. Homo sapiens serves as the default species for protein name queries; however, this can be changed to one of a number of common species within the search results. The multiple sequence alignments can be displayed using the Jalview applet (46) or downloaded in aligned or unaligned form. Precomputed phylogenetic trees are also provided and can be viewed together with any assigned PFAM (47) and SMART (48) domain via iTOL (49) or downloaded in Newick format.
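Trees downloaded in Newick format are plain text and can be inspected with very little code. The sketch below extracts leaf names from a Newick string using a regular expression; it assumes unquoted labels without embedded parentheses, commas, colons or semicolons, and the example tree is invented rather than taken from eggNOG. A dedicated phylogenetics library is preferable for anything beyond a quick look.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")".
_LEAF = re.compile(r"[(,]\s*([^(),:;]+)")


def leaf_names(newick):
    """Return leaf labels of a Newick tree in order of appearance."""
    return [name.strip() for name in _LEAF.findall(newick)]


# Hypothetical three-leaf tree with branch lengths and one internal label.
example = "((A:0.1,B:0.2)n1:0.3,C:0.4);"
leaves = leaf_names(example)
```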

Figure 3.
Screenshot of a results page. The eggNOG database was queried for the term ‘smoothened’. The top left picture demonstrates the simplified navigation of multiple search terms and species selection. The navigation tree at the top right of ...

CONCLUSIONS/PERSPECTIVES

With eggNOG v3, we provide one of the most comprehensive and up-to-date databases of orthologous groups available that delivers protein function annotation for 1133 genomes across the three domains of life. Not only does eggNOG v3 cover a broad taxonomic spectrum, but it also supplies orthologous groups for 41 manually selected taxonomic ranges. The modern, easy-to-use web interface facilitates the usage of the database with novel extended functionalities, such as an interactive species tree to assist the navigation through the increased number of hierarchical levels. Our future plans include the ongoing improvement of the quality of orthology and functional assignments, a further increase of taxonomic ranges and technical improvements to manage the computational challenges that come along with the expected exponential increase of available genomes.

FUNDING

EMBL; MetaHit RTD EC (201052); Novo Nordisk Foundation Center for Protein Research; Swiss Institute of Bioinformatics; and the University of Zurich through its Research Priority Program ‘Systems Biology and Functional Genomics’. Funding for open access charge: EMBL (internal).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We would like to thank Yan Yuan for all his help and support on all technical and infrastructure issues we encountered during this project.

REFERENCES

1. Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:99–113. [PubMed]
2. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637. [PubMed]
3. Eisen JA. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998;8:163–167. [PubMed]
4. Huynen MA, Snel B, von Mering C, Bork P. Function prediction and protein networks. Curr. Opin. Cell Biol. 2003;15:191–198. [PubMed]
5. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P. STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005;33:D433–D437. [PMC free article] [PubMed]
6. Koonin EV. Orthologs, paralogs and evolutionary genomics. Annu. Rev. Genet. 2005;39:309–338. [PubMed]
7. Östlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer EL. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010;38:D196–D203. [PMC free article] [PubMed]
8. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. [PMC free article] [PubMed]
9. Makarova KS, Sorokin AV, Novichkov PS, Wolf YI, Koonin EV. Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol. Direct. 2007;2:33. [PMC free article] [PubMed]
10. Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV. OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res. 2011;39:D283–D288. [PMC free article] [PubMed]
11. Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, von Mering C, Doerks T, Jensen LJ, et al. eggNOG v2.0. extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res. 2010;38:D190–D195. [PMC free article] [PubMed]
12. Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res. 2011;39:D289–D294. [PMC free article] [PubMed]
13. Chen F, Mackey AJ, Stoeckert CJ, Jr, Roos DS. OrthoMCL-DB. Querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006;34:D363–D368. [PMC free article] [PubMed]
14. Uchiyama I. MBGD: a platform for microbial comparative genomics based on the automated construction of orthologous groups. Nucleic Acids Res. 2007;35:D343–D346. [PMC free article] [PubMed]
15. Linard B, Thompson JD, Poch O, Lecompte O. OrthoInspector: comprehensive orthology analysis and visual exploration. BMC Bioinform. 2011;12:11. [PMC free article] [PubMed]
16. Wapinski I, Pfeffer A, Friedman N, Regev A. Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics. 2007;23:i549–i558. [PubMed]
17. Huerta-Cepas J, Bueno A, Dopazo J, Gabaldón T. PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res. 2008;36:D491–D496. [PMC free article] [PubMed]
18. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees. Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–35. [PMC free article] [PubMed]
19. Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Hériché JK, Hu Y, Kristiansen K, Li R, et al. TreeFam. 2008 Update. Nucleic Acids Res. 2008;36:D735–D740. [PMC free article] [PubMed]
20. van der Heijden RT, Snel B, van Noort V, Huynen MA. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinform. 2007;8:83. [PMC free article] [PubMed]
21. Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009;37:W84–W89. [PMC free article] [PubMed]
22. Kuzniar A, van Ham RC, Pongor S, Leunissen JA. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008;24:539–551. [PubMed]
23. Gabaldon T. Large-scale assignment of orthology. Back to phylogenetics? Genome Biol. 2008;9:235. [PMC free article] [PubMed]
24. Kristensen DM, Wolf YI, Mushegian AR, Koonin EV. Computational methods for Gene Orthology inference. Brief Bioinform. 2011;12:379–391. [PMC free article] [PubMed]
25. Kuzniar A, Lin K, He Y, Nijveen H, Pongor S, Leunissen JA. ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res. 2009;37:W428–W434. [PMC free article] [PubMed]
26. Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res. 2008;36:D250–254. [PMC free article] [PubMed]
27. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. [PMC free article] [PubMed]
28. Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, et al. Ensembl 2011. Nucleic Acids Res. 2011;39:D800–D806.
29. The UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011;39:D214–D219. [PMC free article] [PubMed]
30. Aurrecoechea C, Brestelli J, Brunk BP, Carlton JM, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, et al. GiardiaDB and TrichDB: integrated genomic resources for the eukaryotic protist pathogens Giardia lamblia and Trichomonas vaginalis. Nucleic Acids Res. 2009;37:D526–D530. [PMC free article] [PubMed]
31. Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008;36:D1009–D1014. [PMC free article] [PubMed]
32. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39:D561–D568. [PMC free article] [PubMed]
33. Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, Jensen LJ, Beyer A, Bork P. STITCH 2: an interaction network database for small molecules and proteins. Nucleic Acids Res. 2010;38:D552–D556. [PMC free article] [PubMed]
34. Milinkovitch MC, Helaers R, Depiereux E, Tzika AC, Gabaldón T. 2x genomes–depth does matter. Genome Biol. 2010;11:R16. [PMC free article] [PubMed]
35. Trachana K, Larsson TA, Powell S, Chen WH, Doerks T, Muller J, Bork P. Orthology prediction methods: a quality assessment using curated protein families. Bioessays. 2011;33:769–780. [PMC free article] [PubMed]
36. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287. [PubMed]
37. Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P. Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS One. 2011;6:e22099. [PMC free article] [PubMed]
38. Rattei T, Tischler P, Götz S, Jehl MA, Hoser J, Arnold R, Conesa A, Mewes HW. SIMAP–a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters. Nucleic Acids Res. 2010;38:D223–D226. [PMC free article] [PubMed]
39. Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63–98. [PubMed]
40. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
41. Pryszcz LP, Huerta-Cepas J, Gabaldon T. MetaPhOrs. Orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res. 2010;39:e32. [PMC free article] [PubMed]
42. Hulsen T, Huynen MA, de Vlieg J, Groenen PM. Benchmarking ortholog identification methods using functional genomics data. Genome Biol. 2006;7:R31. [PMC free article] [PubMed]
43. Chen F, Mackey AJ, Vermunt JK, Roos DS. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One. 2007;2:e383. [PMC free article] [PubMed]
44. Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 2009;5:e1000262. [PMC free article] [PubMed]
45. Boeckmann B, Robinson-Rechavi M, Xenarios I, Dessimoz C. Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees. Brief Bioinform. 2011;12:423–435. [PMC free article] [PubMed]
46. Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191. [PMC free article] [PubMed]
47. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al. The Pfam protein families database. Nucleic Acids Res. 2008;36:D281–D288. [PMC free article] [PubMed]
48. Letunic I, Doerks T, Bork P. SMART 6. Recent updates and new developments. Nucleic Acids Res. 2009;37:D229–D232. [PMC free article] [PubMed]
49. Letunic I, Bork P. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 2011;39:W475–W78. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
