Figure 1. Phylogenomic Analysis of Protein Function Using Subfamily Annotation
Table 1. Resources for Phylogenomic Analysis
Functional Classification Using Phylogenomic Inference

Duncan Brown and Kimmen Sjölander are at the University of California Berkeley, Berkeley, California, United States of America.

Fran Lewitter, Editor. Whitehead Institute, United States of America.

* To whom correspondence should be addressed. E-mail: kimmen/at/berkeley.edu

Copyright: © 2006 Brown and Sjölander. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context. As originally outlined by Jonathan Eisen [1–3], phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and, finally, inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved (Figure 1).
In practice, phylogenomic inference of gene function is rarely used. The majority of novel sequences are assigned a putative function through annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as “hypothetical” or “unknown”) had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error.

The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years [8–10]. These errors have four root causes:

Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query.

Domain shuffling. Domain fusion and fission events add a further layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.

Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.

Evolutionary distance. Two proteins can share a common ancestor and domain structure, yet have very different functions simply because they are found in very distantly related species.

Phylogenomic analysis, properly applied, avoids these errors and provides a mechanism for detecting existing database annotation errors [3,7].

Why then is phylogenomic inference not used more widely? We believe there are four reasons. First, the actual frequency of annotation error is not known, so the gravity of the situation is not recognized. Second, phylogenomic inference is a much more complicated endeavor than a simple database search and requires significantly more expertise and computing resources. It is therefore not easily applied at the genome scale. Third, millions of dollars and years of effort have been poured into developing computational annotation systems that depend on annotation transfer from top database hits, perhaps overlaid with domain prediction methods such as PFAM or the NCBI CDD [11,12]. Fourth, phylogenomic approaches to protein function prediction have arisen only in the last few years, while database search methods have been available for much longer. Revolutions do not normally take place overnight. As a result, phylogenomic inference is applied on a one-off basis, for a few protein superfamilies here and there.

This may be about to change. A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see Table 1). Some of these methods have based annotation transfer on the identification of orthologs [13–15] or of functional subfamilies [6,16–21]. Other groups have used whole-tree analyses [22–24]. Still other groups employ expert knowledge to define functional subtypes and then develop statistical models to allow users to classify novel sequences [25,26]; these expert system-based approaches are unfortunately limited by the scarcity of experimental data for most protein families.
It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.

Tree Topology Accuracy

Phylogenomic inference is based on a fundamental assumption: the phylogenetic tree topology used as the basis of functional inference is correct. This assumption must be questioned, particularly when highly divergent sequences (e.g., with pairwise identities less than 25%) are included in a tree. Protein superfamilies provide distinct challenges to phylogenetic reconstruction. Following gene duplication, proteins can undergo significant structural and functional changes associated with neofunctionalization, resulting in loop regions and other parts of protein structures not being strictly homologous across all members of a multigene family (see Figure 2).
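
The 25% pairwise identity figure mentioned above can be checked directly from an alignment. The sketch below is a minimal illustration, assuming two rows taken from an existing MSA (equal length, with `-` marking gaps) and one common definition of identity: identical residues divided by the number of columns in which both sequences have a residue. Other denominators are in use and give different values. The function name and the toy sequences are hypothetical.

```python
def pairwise_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either row
        compared += 1
        if x == y:
            identical += 1
    return identical / compared if compared else 0.0


# Toy aligned rows (invented for illustration only).
seq1 = "MKT-AYIAKQR"
seq2 = "MRTLAY-AHQR"

identity = pairwise_identity(seq1, seq2)
# Flag pairs that fall below the 25% level discussed in the text.
highly_divergent = identity < 0.25
print(f"identity = {identity:.2f}, highly divergent = {highly_divergent}")
```

Applied to every pair of rows in an alignment, a check of this kind could be used to flag families in which the tree topology deserves extra scrutiny.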
Assessing the expected accuracy of phylogenetic methods for protein superfamily reconstruction is a challenge in itself. Unlike phylogenetic reconstruction of species trees, where fossil evidence exists to help investigators assess tree accuracy, we have no fossil record for protein superfamilies. Simulation studies have tended to assume models of molecular evolution that are appropriate to single orthologous DNA sequences [29], but do not normally address many of the complexities of protein multigene family evolution. This has begun to change; models have been introduced that incorporate a wider range of information, such as indel evolution and structural constraints [30–33]. Still, we believe there is a long way to go in this regard before simulation studies can effectively assess the expected accuracy of phylogenetic inference in protein superfamilies.

An additional complication in phylogenetic reconstruction of protein families is the almost universal dependence on an accurate MSA as input. Studies of alignment accuracy for pairs of proteins at different levels of evolutionary and structural divergence show dramatic increases in alignment errors with sequence divergence [34]. Several recent methods have bypassed this issue by concurrent estimation of a phylogeny and an MSA from unaligned sequences [35–37]; we look forward to future developments in this area.

Another barrier to the use of phylogenomic inference methods is their computational complexity. Owing to the large size of protein superfamilies (with hundreds or thousands of taxa), many applications of phylogenomic inference employ fast distance-based methods instead of character-based approaches, or forgo even simple models of evolution in favor of faster hierarchical clustering algorithms (e.g., the Panther system [38]).
Without an objective understanding of the expected accuracy of individual phylogenetic tree estimation methods under different conditions, we cannot know whether functional inferences based on these analyses are accurate. In practice, the likely accuracy of a particular tree is typically assessed through bootstrap analysis or by comparing trees constructed using different phylogenetic reconstruction methods. Analysis of multiple trees for a given family often shows regions of agreement as well as significant disagreement: closely related subtrees are often found consistently across different methods, with the primary differences between trees lying in the coarse branching order between these conserved subtrees. Functional inferences can then be based on subtrees with high bootstrap support or on those subtrees that are found in the strict or majority consensus of several tree methods. However, these methods of analysis are time-consuming and impractical for large datasets or for high-throughput application.

The Reliability and Source of Existing Database Annotations

Any system of functional inference depends on the accuracy of the annotations of its characterized members. The Gene Ontology Consortium has provided a mechanism whereby sequence annotations have associated evidence codes, documenting the origin of the annotation (e.g., by electronic means, by direct assay, or by a traceable author statement) [39]. We believe that annotation transfer, even in a phylogenomic context, should only be performed when solid experimental support is available. Our analysis of more than 300,000 proteins in the UniProt database shows that only 3% of proteins with functional annotations have experimental support. We suspect that many more proteins than these have been experimentally pursued, but that the results of these experiments are not being propagated efficiently (or at all) to the sequence databases or to the GO Annotation project [40].
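
The evidence-code filter argued for above can be sketched in a few lines. This is a minimal illustration, assuming annotations are available as (protein, term, evidence code) tuples. The records are invented toy data, and the set of codes treated as experimental (EXP, IDA, IPI, IMP, IGI, IEP) is an assumption based on the Gene Ontology evidence code categories; it does not reproduce the UniProt analysis reported in the text.

```python
# GO evidence codes treated here as experimental (assumption for this sketch).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimentally_supported(annotations):
    """Return the set of proteins with at least one experimental annotation."""
    return {
        protein
        for protein, _term, code in annotations
        if code in EXPERIMENTAL_CODES
    }


def fraction_with_experimental_support(annotations):
    """Fraction of annotated proteins that have any experimental support."""
    proteins = {protein for protein, _term, _code in annotations}
    if not proteins:
        return 0.0
    return len(experimentally_supported(annotations)) / len(proteins)


def transferable(annotations):
    """Keep only the annotations that are backed by an experimental code."""
    return [rec for rec in annotations if rec[2] in EXPERIMENTAL_CODES]


# Toy records: protein and term names are hypothetical placeholders.
toy = [
    ("protA", "term_1", "IEA"),  # inferred from electronic annotation
    ("protA", "term_2", "IDA"),  # inferred from direct assay
    ("protB", "term_1", "IEA"),
    ("protC", "term_3", "TAS"),  # traceable author statement
    ("protD", "term_1", "IEA"),
]

print(fraction_with_experimental_support(toy))
print(transferable(toy))
```

Under this policy only the `transferable` records would be propagated across a tree; whether author-statement codes such as TAS should also qualify is a design decision that the sketch leaves open.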
One reason experimental results fail to reach the databases is the inconsistent use of standard sequence identifiers in the biological literature, and we applaud the efforts at various journals to improve this status quo (see, e.g., Genome Research and the PLoS journals). We would go further and recommend that sequence databases specifically encourage ontology annotation during sequence submission. We expect that advances in text-mining software will also help correct the imbalance, although the field is not yet at a point to contribute on a large scale [41]. Finally, we believe that mechanisms must be put into place to enable annotation errors to be more easily corrected. The UniProt database responds to community requests for annotation error correction; other sequence databases might do well to follow its lead.

Functional Inference Based on Assumed Orthology

Orthologs (genes or proteins related by speciation) are generally assumed to have greater functional similarity than paralogs, which are related by gene duplication. However, inference accuracy also depends on evolutionary distance and the particular functional attribute under consideration. Some attributes of protein families, such as the three-dimensional structure, persist across large evolutionary distances. Other attributes, such as substrate specificity, can be modified by a handful of amino acid substitutions in critical positions. The persistence of certain traits may be more limited in some families and more expansive in others. The assumption that orthology implies functional similarity must therefore be tempered by an assessment of evolutionary distance [42,43].

Moreover, determining orthology is not always straightforward. RIO and Orthostrapper take the approach of using phylogenetic trees to assess orthology between homologs [14,15]. This is clearly the most accurate approach, although its accuracy depends on the estimated phylogeny.
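
To make the speciation/duplication distinction concrete, the sketch below labels the internal nodes of a rooted gene tree with the species-overlap heuristic: a node is treated as a duplication if its two child subtrees share any species, and as a speciation otherwise. Two genes are then called orthologs when their last common ancestor is a speciation node. This is a generic heuristic shown for illustration, not the procedure used by RIO or Orthostrapper; the nested-tuple tree encoding, the `gene|species` leaf naming, and the toy tree are all assumptions.

```python
def species_of(leaf: str) -> str:
    """Extract the species from a leaf named 'gene|species' (assumed format)."""
    return leaf.split("|")[1]


def leaves(tree):
    """List the leaf names of a binary tree encoded as nested 2-tuples."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def ortholog_pairs(tree):
    """Return gene pairs whose last common ancestor is a speciation node."""
    pairs = set()
    if isinstance(tree, str):
        return pairs
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species_of(x) for x in left_leaves} & {
        species_of(y) for y in right_leaves
    }
    if not shared:
        # No species overlap: treat this node as a speciation event.
        for a in left_leaves:
            for b in right_leaves:
                pairs.add(frozenset((a, b)))
    # Recurse so that deeper speciation nodes are also reported.
    pairs |= ortholog_pairs(left)
    pairs |= ortholog_pairs(right)
    return pairs


# Toy tree: an ancient duplication (root) followed by speciation in each copy.
toy_tree = (("a1|human", "a1|mouse"), ("a2|human", "a2|mouse"))

for pair in sorted(sorted(p) for p in ortholog_pairs(toy_tree)):
    print(pair)
```

In the toy tree the root is called a duplication, so `a1` and `a2` genes are reported as paralogs even when they come from different species, which is exactly the case in which a top database hit can mislead.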
However, tree-based methods such as these require estimation of a new tree for each family of interest, and trees must be recomputed whenever novel sequences are added to the family. This limits their application in large-scale endeavors. The COG database makes the simplifying assumption that proteins are orthologs if they are reciprocal top BLAST hits [13], but this limits the resulting relationships, and domain shuffling, high sequence diversity within the family, and incomplete genome sequencing can all contribute to error. Finally, the dearth of experimental evidence supporting functional annotations, together with ambiguous tree topology reconstruction, often limits the number of proteins that can be annotated effectively based strictly on orthology.

Because of the limitations in restricting functional annotations to orthologs, methods have been developed to allow functional inference to extend beyond the strict confines of orthology. The SIFTER algorithm enables annotations to be propagated over a phylogenetic tree, using GO annotations and priors over existing annotations [22]. We believe this Bayesian approach shows great promise in automating the functional annotation of novel sequences.

The Future of Phylogenomic Inference

We have focused in this paper on the use of phylogenomic inference of protein function. However, phylogenomic inference can be applied to a wide array of protein family attributes. Selection of templates for comparative model construction can be performed in a phylogenomic context, e.g., picking the template that has the smallest tree distance to a target of unknown structure. Phylogenomic inference of pathway involvement may also be possible under some circumstances, for instance, in cases in which a subtree contains orthologs in closely related species.
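
Template selection by tree distance can be sketched as follows. The example is a minimal illustration, assuming the phylogeny is available as an undirected weighted graph (each node mapped to its neighbours and branch lengths) and that "tree distance" means the sum of branch lengths along the path between two leaves. The node names, branch lengths, and helper functions are hypothetical.

```python
def path_length(tree, start, goal):
    """Sum of branch lengths on the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, length in tree[node].items():
            if neighbour != parent:  # never walk back along the same branch
                stack.append((neighbour, node, dist + length))
    raise ValueError(f"{goal!r} is not reachable from {start!r}")


def closest_template(tree, target, templates):
    """Pick the template with the smallest tree distance to the target."""
    return min(templates, key=lambda t: path_length(tree, target, t))


# Toy tree given as (node, node, branch length) edges; values are invented.
edges = [
    ("root", "n1", 0.2),
    ("root", "n2", 0.3),
    ("n1", "target", 0.1),
    ("n1", "tmplA", 0.4),
    ("n2", "tmplB", 0.1),
    ("n2", "tmplC", 0.2),
]
tree = {}
for u, v, w in edges:
    tree.setdefault(u, {})[v] = w
    tree.setdefault(v, {})[u] = w

best = closest_template(tree, "target", ["tmplA", "tmplB", "tmplC"])
print(best, path_length(tree, "target", best))
```

Because a tree has exactly one path between any two nodes, a simple depth-first walk that avoids the parent branch is sufficient here; no shortest-path algorithm is needed.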
Looking to the future of phylogenomic analysis, we believe that the greatest improvement to this field will take place when investigators have access to rigorously validated biological data through which phylogenomic methods can be assessed for accuracy. The Structure Function Linkage Database [44], which links protein structures with detailed information on partial chemical reactions, is an important contribution in this regard. Carefully designed benchmark datasets, such as those developed by the protein structure prediction community (e.g., the Astral datasets [45] and SCOP [46]), as well as the international biennial CASP experiment [47], have the potential to transform the field. The protein structure prediction field is one of the most mature in all of computational biology, and we believe this is due (at least in part) to the availability of challenging benchmark datasets and international experiments. The phylogenomic community needs analogous datasets appropriate for our own development and maturation. The natural competitiveness of computational biologists is put to good use when we can push our methods to ever-increasing levels of accuracy.

Acknowledgments

The authors would like to thank an anonymous reviewer for very helpful comments.
Footnotes

Author contributions. DB and KS wrote the paper.

Competing interests. The authors have declared that no competing interests exist.

Funding. The authors received no specific funding for this article.