Copyright © 2006 The Author(s)

Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits

ETH Zurich, Institute of Computational Science, CH-8092 Zürich
1Swiss Institute of Bioinformatics, CMU, Michel-Servet 1, CH-1211 Genève
*To whom correspondence should be addressed. Tel: +41 44 6327472; Fax: +41 44 6321172; Email: cdessimoz/at/inf.ethz.ch

Received March 14, 2006; Revised May 23, 2006; Accepted June 1, 2006.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.

INTRODUCTION

The identification of orthologous genes is a central problem in bioinformatics.
Orthologs are genes that evolve from a common ancestor through speciation events, as opposed to paralogs, which result from gene duplication (1). Discriminating orthologs from paralogs is an important but non-trivial task. It is important because function conservation is considerably higher among orthologs (2), and also because only orthologs reflect the history of their species (1), meaning that phylogeny inferences must be based on orthologs. It is non-trivial because this distinction requires precise estimates of evolutionary distances from data that are often noisy. Other complications include gene deletion, variations in evolutionary rates, lateral gene transfer (LGT), or simply the fact that orthology and paralogy are non-transitive relations, meaning that the relation of every pair of genes must be analyzed separately.

So far, several projects have addressed this problem systematically. Of those, the COGs database (3,4) is by far the best established, probably due to its early inception, its wide scope, its reasonable performance and its presence on the NCBI website. The significance of COG in the community is reflected by hundreds of references in scientific articles. Even more importantly, most current initiatives for the identification of orthologs use ideas derived from the methodology of COG, in particular the idea of genome-specific best hit (5–7). Of all the projects that depend on either the methods or the results of COG, few question their accuracy.

In its last accessible release (2003), the COGs database groups 138 458 proteins from 66 prokaryotes into 4873 groups that consist of orthologs and in-paralogs. The term in-paralog was coined by Remm and coworkers (6) and describes in this context paralogs inside the same species (‘trivial paralogs’), as opposed to out-paralogs, which result from a duplication event prior to the last speciation event.
[Strictly speaking, in/out-paralogy is a relation defined over two sequences and a speciation event of reference. When that event is omitted, it is here the last speciation event that is implied.] The inclusion of in-paralogs is usually justified by the fact that such sequences are orthologous to every other sequence within their group. Consequently, the relation of every pair of sequences inside the same COG is unambiguous: pairs of sequences from the same species are paralogs; all other pairs are expected to be orthologous.

The construction of COG groups is based on the fact that orthologous genes almost always have a higher level of sequence conservation than paralogs. Hence, genome-specific best hits (‘BeTs’) are likely to be formed between orthologs. Yet, if the corresponding ortholog is missing, a BeT might link paralogous sequences. That problem is partly taken care of by COG's approach: BeTs are only grouped when they form triangles, and triangles are merged only when they have a common side. However, if more than one species has lost the corresponding ortholog, the construction over triangles will not suffice to prevent paralogs from being clustered together. This scenario is far from unlikely, because losses occurring before speciation events get replicated, and therefore the problem becomes very significant as more species and strains are included for analysis. In fact, simple situations, such as the one illustrated on Figure 1
The difficulty caused by a single missing ortholog can be easily avoided by requiring that all BeTs be symmetrical, which is what most other projects do. However, if the corresponding ortholog is missing in both genomes, even a symmetrical BeT will link paralogs. Therefore, even symmetrical BeTs do not necessarily link orthologs. This problem could be solved through phylogenetic analysis of the relevant gene families, in particular tree reconciliation (8), but this procedure is not yet practical in large-scale, automated contexts (2).

In the following, we present an algorithm that detects non-orthology without the need for gene tree construction, then report its application to the last version of the COGs database. The algorithm was developed in the context of our own orthology classification project OMA (9), in which it is used to verify every predicted orthologous relation.

MATERIALS AND METHODS

The algorithm presented here is designed to detect non-trivial paralogous relations within groups of orthologs such as COG groups. Knowing that a paralogous relation within a group is likely to be caused by the loss of the corresponding ortholog in both species, the algorithm looks for a third-party species, which we call the ‘witness of non-orthology’, in which both corresponding orthologs are present (Figure 2
We finish this overview of the algorithm by considering the impact of LGT and gene fusion/fission. Clearly, the algorithm presented here was not designed to detect LGT events between x1 and y2, an interesting problem in itself that remains largely unsolved. More importantly here, an LGT in a third-party species Z can lead to a situation where Z wrongly appears to be a witness of non-orthology: consider three orthologous proteins x1, y2 and z3 in three species X, Y and Z. At some point, Z acquires through LGT a member of that orthologous family, which we now refer to as z4. Z keeps both copies z3 and z4. Furthermore, Z happens to be closer to X than Y, while the donor of z4 is closer to Y than X. This situation leads to a misclassification by our algorithm. Although such cases cannot be ruled out, we did not encounter any among the numerous case-by-case analyses performed on the results. It could be that orthologous gene displacement of z3 by z4 through homologous recombination is a much more likely scenario, and besides, the frequency of LGT appears to be higher among closely related species (11).

As for gene fusion or gene fission, the units for amino acid sequence analysis are no longer proteins but domains. Even though the analysis of homologous domains from distinct proteins is scientifically meaningful, our analysis remains at the level of entire proteins to simplify matters. Note that the complications caused by LGT events and, probably to a lesser extent, by gene fusion/fission are not specific to our method and pose challenges to other approaches as well, in particular tree reconciliation.

Input data

The algorithm uses two inputs: the COGs database and pairwise sequence alignments between all proteins involved in the analysis. As introduced above, the orthology of two sequences is verified through an exhaustive search of the corresponding sequences in complete third-party genomes. Therefore, a large number of genomes is desirable.
However, since the relation between every pair of sequences is needed, such searches require the computation of a very large number of pairwise alignments. For practical reasons, all results presented here use the Smith–Waterman (12) all-against-all protein alignments precomputed in the scope of the OMA project (9).

Comparison of evolutionary distances

The algorithm uses evolutionary distances to detect paralogs. However, the distance estimates are subject to perturbation, which must be taken into account when comparing them. Therefore, assuming that errors are normally distributed, the difference Δ(d1, d2) of two distances d1, d2 has expected value:
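The comparison can be sketched with the standard identities for the difference of two normally distributed estimates. This is an illustrative reconstruction rather than the authors' own expression: it assumes Δ(d1, d2) = d1 − d2, and the symbols σ² and cov are notation introduced here. The covariance term vanishes when the two estimates are independent.

```latex
\[
E\big[\Delta(d_1, d_2)\big] = E[d_1] - E[d_2],
\qquad
\sigma^2\big(\Delta(d_1, d_2)\big) = \sigma^2(d_1) + \sigma^2(d_2) - 2\,\operatorname{cov}(d_1, d_2)
\]
```

Under this sketch, d1 would be called significantly larger than d2 when d1 − d2 exceeds k standard deviations of Δ, where k is a tolerance parameter.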
Algorithm

The algorithm goes through each COG group and verifies, within each group, that every two genes x1, y2 coming from different species have a significant alignment and are indeed orthologs. Alignments are considered significant if the score is above 130 (47 bits, which typically corresponds to an E-value around 2e−6) and the length of the alignment is at least 50% of the shorter sequence. The verification of orthology is performed through the search, in each third-party genome Z, of two genes z3 and z4 that fulfill the three conditions (i–iii) presented at the beginning of this section:
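As a rough illustration of the two checks described in this section, the following Python sketch implements the alignment filter and a simplified witness test. It rests on assumptions: the conditions (i–iii) are not reproduced here, so the test only assumes that x1 must be significantly closer to z3 than to z4 and y2 significantly closer to z4 than to z3. The `Estimate` class and all function names are hypothetical, and the default tolerance of 1.96 standard deviations is the value quoted in the note on parameter choice.

```python
from dataclasses import dataclass
from math import sqrt

K = 1.96            # tolerance in standard deviations (value quoted in the text)
MIN_SCORE = 130     # alignment score threshold quoted in the text
MIN_COVERAGE = 0.5  # alignment length relative to the shorter sequence


@dataclass(frozen=True)
class Estimate:
    """Hypothetical container: a pairwise distance estimate and its variance."""
    dist: float
    var: float


def is_significant(score: float, aln_len: int, len_a: int, len_b: int) -> bool:
    """Score above the threshold and alignment covering >= 50% of the shorter sequence."""
    return score > MIN_SCORE and aln_len >= MIN_COVERAGE * min(len_a, len_b)


def significantly_greater(a: Estimate, b: Estimate, cov: float = 0.0, k: float = K) -> bool:
    """True if a.dist exceeds b.dist by more than k standard deviations of the difference."""
    var_delta = a.var + b.var - 2.0 * cov
    return a.dist - b.dist > k * sqrt(max(var_delta, 0.0))


def is_witness(x1_z3: Estimate, x1_z4: Estimate, y2_z3: Estimate, y2_z4: Estimate) -> bool:
    """Assumed test: x1 is clearly closer to z3, and y2 is clearly closer to z4."""
    return significantly_greater(x1_z4, x1_z3) and significantly_greater(y2_z3, y2_z4)
```

A pair (x1, y2) would then be flagged as non-orthologous if `is_witness` returns true for some pair (z3, z4) in at least one third-party genome Z. The covariance defaults to zero here only for simplicity.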
A note about parameter choice. As mentioned previously, the classification of protein pairs into orthologs and non-orthologs can be very difficult or even impossible, especially when a speciation event immediately follows a duplication event, or in situations of frequent gene gain and gene loss, as is observed in certain groups of proteins, such as metabolic enzymes. Here, the choice of k = 1.96 standard deviations was established empirically such that the false-positive rate (orthologs misclassified as non-orthologs) is much smaller than the false-negative rate (missed non-orthologs). In other words, we expect that our algorithm reports only clear-cut cases of paralogy.

Phylogenetic analysis

To verify individual cases reported by the algorithm, phylogenetic trees were constructed using independent, common software packages, as follows: sequences were aligned using Muscle (16) and ClustalW (17). Whenever the two alignments differed, the one that seemed more likely was selected. Short sequences, suspicious regions and most gap-containing columns were removed. Distance matrices (JTT, gamma) generated with protdist (18) were used to construct phylogenetic trees using neighbor (18). Clusters of interest were selected for detailed analysis. Alignments of the selected data were performed using Tcoffee (19), and the result was subsequently modified as described above, also considering the Tcoffee CORE (consistency of overall residue evaluation) values for the alignment. The stability of the tree topology was assessed by building an extended majority rule consensus tree using consense (18) from BIONJ (20) searches performed on 1000 bootstrap replicates, which were constructed with seqboot (18). Protein trees of the data subset were constructed using the Bayesian tree-building method MrBayes (21) (JTT; invgamma-4; 1 000 000 generations). The trees were rooted using an outgroup whenever a suitable ancient paralog could be found.
Note that since the analysis aims at clustering homologs into clans, and not at predicting their hierarchical order, placement of the root is not critical here.

Validation

The performance of the algorithm was evaluated using the HAMAP database (22), a collection of orthologous microbial protein families generated manually by expert curators in the Swiss–Prot group. The database was retrieved on November 23, 2005. Proteins from the 99 most represented species also present in our OMA project were used in the analysis: of all 29 245 proteins, there were 21 831 proteins (75.6%), grouped in 1189 orthologous families. That yielded 309 829 pairwise relations to be verified by our procedure.

The algorithm classified 279 568 (90.2%) relations as orthologous and 9420 (3.0%) as paralogous. The remaining 20 841 (6.7%) relations had alignments below our significance threshold and could therefore not be processed. The accuracy of the algorithm, in particular its very low false-positive rate, was confirmed by the following observations:

First, paralogy is often reflected by different Swiss–Prot ID names (e.g. GREA/GREB) (23). Of the 9420 predicted paralogs, only 2728 (29.0%) have identical ID names.

Second, the distribution of the paralogs among HAMAP families was investigated: all 9420 cases of paralogy found by the algorithm are concentrated in only 150 (12.6%) of the 1189 HAMAP families. This is consistent with the fact that the inclusion of just one paralogous protein into an orthologous family is likely to result in several paralogous relations inside that family. And indeed, in all except 8 of these 150 families, more than one paralogous pair was detected.

Third, these 8 improbable cases were inspected individually using phylogenetic analysis, which confirmed that they are bona fide paralogs (possibly xenologs).
Fourth, the predicted cases of paralogy were compared to the gene trees over HAMAP families built by the group of Laurent Duret (http://pbil.univ-lyon1.fr/help/HAMAP.html), in a similar way as HOBACGEN (24). In total, 7217 predicted cases could be mapped to those trees. In 6418 (88.9%) instances, paralogy was confirmed by the trees, a remarkably high level of consistency considering that the two methods are very different. As for the conflicting 799 cases, which are distributed among 51 families, we believe that most of them are caused by inaccuracies in the gene trees, which are constructed using a variant of Neighbor Joining on observed divergence, a rather crude measure of evolutionary distance.

RESULTS AND DISCUSSION

The algorithm was run on the current release of the COGs database (4) (http://www.biomedcentral.com/1471-2105/4/41). We used the precomputed all-against-all results from 107 complete genomes, of which 52 are represented in COGs, whereas the remaining 55 genomes were only used as potential witnesses of non-orthology. [The complete list is available in the Supplementary Data.] The 4654 COGs contain a total of 5 537 713 pairwise relations. Pairs between proteins from the same species (484 043) were not considered further. Additionally, 2 733 371 relations involve at least one protein from a species outside our set of 107 genomes. Consequently, the following results were obtained through the verification of 2 320 199 relations, 45.9% of all potential orthologous relations.

The results are presented in Table 1. Surprisingly, 44% of the relations had alignment scores below our significance threshold of 130, which corresponds to an E-value of about 2e−6, and could therefore not be verified. This implies that an important fraction of relations within COGs cannot be, on the basis of pairwise alignments, reliably considered homologous.
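The percentages quoted above can be re-derived from the raw counts with a few lines of Python. This is only a consistency sketch: the helper `share` and the variable names are illustrative, and "potential orthologous relations" is taken here to mean all relations minus the same-species pairs.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Comparison with the HAMAP gene trees (counts quoted in the text).
mapped_cases = 7_217
confirmed = 6_418
conflicting = mapped_cases - confirmed

# COG run (counts quoted in the text).
all_relations = 5_537_713
same_species = 484_043
verified = 2_320_199
cross_species = all_relations - same_species  # potential orthologous relations

print(conflicting)                     # 799
print(share(confirmed, mapped_cases))  # 88.9
print(share(verified, cross_species))  # 45.9
```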
The other result is the significant proportion of non-orthologous relations found by the algorithm, more than a quarter of the pairs that could be verified. They are distributed among about a third of all COGs. The list of such groups, along with all detected non-orthology cases, is available in the Supplementary Data. If we require the presence of at least two witnesses of non-orthology for a pair to be considered non-orthologous, the algorithm still finds 251 391 (19.4%) such pairs within 1146 (24.6%) COGs. When removing the sequence with the most non-orthologous relations from each COG group, the total number of non-orthologous pairs decreases by only 24 868 (1.9%).

The majority (70%) of the groups containing predominantly non-orthologs are involved in metabolic processes, according to the functional description of the COGs database, although they only constitute a minority of all COGs. In contrast, groups involved in information storage and processing (8%) or cellular processing and signaling (11%) less frequently include non-orthologs. The remaining 11% are poorly characterized proteins. This result is in agreement with previous studies, which state that in prokaryotes, metabolic functions are under high evolutionary pressure from changing environments (25).

Phylogenetic analysis of selected COG groups

The presence of non-orthology in some COG groups is hardly a surprise and was in fact recently acknowledged by Koonin, coauthor of COG, in a review article (2). What is surprising here is rather the extent of non-orthology detected by the algorithm. That prompted us to verify, in addition to the validation work reported in the previous section, a number of our predictions using detailed phylogenetic analysis. In this section, we report the conclusions of such analyses on three COGs, for which we could build Bayesian likelihood trees of high confidence, confirmed by consensus NJ trees with high bootstrap values.
Clan assignments were made based on those trees, and considering lineage and function, whenever reliable annotations could be found. We strongly expect that pairs of proteins across clans be non-orthologous, and use these results to evaluate the accuracy of the predictions made by the algorithm. COG0508 consists of complex-forming acyltransferases that are composed of an N-terminal biotin or lipoic acid attachment domain, a central protein–protein interaction domain, followed by the catalytic 2-oxoacid dehydrogenases acyltransferase domain. The phylogenetic analysis of roughly half of the proteobacterial sequence data from COG0508 suggests the existence of at least four distinct subgroups (see Figure 4
COG0513 includes various DEAD-box containing RNA helicases. The phylogenetic analysis of the proteobacterial data from this group suggests the existence of six clans (see Figure 5
COG1113 consists of members of the amino acid-polyamine-organo-cation (APC) superfamily from bacteria, specifically those integral membrane proteins that are involved in the transport of amino acids in prokaryotes. The phylogenetic analysis of this group suggests the existence of various clans (see Figure 6
CONCLUSION

We present here a new algorithm for the detection of non-orthologous relations caused by the limitations of genome-specific best hit methods, such as the COGs database. The algorithm, rather than building gene trees, a process both computationally expensive and error-prone, works with pairwise distance estimates. The accuracy of the algorithm was evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Using conservative parameters, the algorithm detected non-orthology in a third of the COG groups. Methods sensitive to correct orthology assignments, such as function prediction, phylogenetic trees or genome rearrangement analysis, will profit from both the algorithm and the results presented here.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

Acknowledgments

The authors thank G. Cannarozzi, D. Margadant, A. Schneider and two anonymous reviewers for their comments and suggestions on the manuscript.

Conflict of interest statement. None declared.

REFERENCES

1. Fitch W.M. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:99–113.
2. Koonin E.V. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 2005;39:309–338.
3. Tatusov R.L., Koonin E.V., Lipman D.J. A genomic perspective on protein families. Science. 1997;278:631–637.
4. Tatusov R.L., Fedorova N.D., Jackson J.D., Jacobs A.R., Kiryutin B., Koonin E.V., Krylov D.M., Mazumder R., Mekhedov S.L., Nikolskaya A.N., et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41.
5. Fujibuchi W., Ogata H., Matsuda H., Kanehisa M. Automatic detection of conserved gene clusters in multiple genomes by graph comparison and p-quasi grouping. Nucleic Acids Res. 2000;28:4029–4036.
6. Remm M., Storm C., Sonnhammer E. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 2001;314:1041–1052.
7. Lee Y., Sultana R., Pertea G., Cho J., Karamycheva S., Tsai J., Parvizi B., Cheung F., Antonescu V., White J., et al. Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. 2002;12:493–502.
8. Goodman M., Czelusniak J., Moore G.W., Romero-Herrera A.E. Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 1979;28:132–168.
9. Dessimoz C., Cannarozzi G., Gil M., Margadant D., Roth A., Schneider A., Gonnet G.H. OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements. In: McLysaght A., Huson D.H., editors. Lecture Notes in Computer Science. Vol. 3678. Springer-Verlag; 2005. pp. 61–72.
10. Doolittle R.F. Convergent evolution: the need to be explicit. Trends Biochem. Sci. 1994;19:15–18.
11. Lawrence J.G., Hendrickson H. Lateral gene transfer: when will adolescence end? Mol. Microbiol. 2003;50:739–749.
12. Smith T.F., Waterman M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
13. Gonnet G.H. A Tutorial Introduction to Computational Biochemistry Using Darwin. Technical Report Informatik. Switzerland: ETH Zurich; 1994.
14. Muller T., Vingron M. Modeling amino acid replacement. J. Comput. Biol. 2000;7:761–776.
15. Gonnet G.H., Hallett M.T., Korostensky C., Bernardin L. Darwin v. 2.0: an interpreted computer language for the biosciences. Bioinformatics. 2000;16:101–103.
16. Edgar R.C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113.
17. Chenna R., Sugawara H., Koike T., Lopez R., Gibson T.J., Higgins D.G., Thompson J.D. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003;31:3497–3500.
18. Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author; 1993.
19. Poirot O., O'Toole E., Notredame C. Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res. 2003;31:3503–3506.
20. Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 1997;14:685–695.
21. Ronquist F., Huelsenbeck J.P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574.
22. Gattiker A., Michoud K., Rivoire C., Auchincloss A.H., Coudert E., Lima T., Kersey P., Pagni M., Sigrist C.J.A., Lachaize C., et al. Automated annotation of microbial proteomes in Swiss-Prot. Comput. Biol. Chem. 2003;27:49–58.
23. Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., et al. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370.
24. Perriere G., Duret L., Gouy M. HOBACGEN: database system for comparative genomics in bacteria. Genome Res. 2000;10:379–385.
25. Pal C., Papp B., Lercher M.J. Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nature Genet. 2005;37:1372–1375.
26. Charollais J., Pflieger D., Vinh J., Dreyfus M., Iost I. The DEAD-box RNA helicase SrmB is involved in the assembly of 50S ribosomal subunits in Escherichia coli. Mol. Microbiol. 2003;48:1253–1265.
27. Jones P.G., Mitta M., Kim Y., Jiang W., Inouye M. Cold shock induces a major ribosomal-associated protein that unwinds double-stranded RNA in Escherichia coli. Proc. Natl Acad. Sci. USA. 1996;93:76–80.
28. Carpousis A.J. The Escherichia coli RNA degradosome: structure, function and relationship in other ribonucleolytic multienzyme complexes. Biochem. Soc. Trans. 2002;30:150–155.
29. Ohmori H. Structural analysis of the rhlE gene of Escherichia coli. Jpn. J. Genet. 1994;69:1–12.
30. Diges C.M., Uhlenbeck O.C. Escherichia coli DbpA is a 3′→5′ RNA helicase. Biochemistry. 2005;44:7903–7911.