Copyright © 1999, The National Academy of Sciences

Biochemistry

Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles

Molecular Biology Institute and Departments of Energy Laboratory of Structural Biology and Molecular Medicine, and Chemistry and Biochemistry, University of California, Box 951570, Los Angeles, CA 90095-1570

*M.P. and E.M.M. contributed equally to this work.
†To whom reprint requests should be addressed. e-mail: yeates/at/mbi.ucla.edu.

Contributed by David S. Eisenberg
Accepted January 20, 1999.

Abstract

Determining protein functions from genomic sequences is a central goal of bioinformatics. We present a method based on the assumption that proteins that function together in a pathway or structural complex are likely to evolve in a correlated fashion. During evolution, all such functionally linked proteins tend to be either preserved or eliminated in a new species. We describe this property of correlated evolution by characterizing each protein by its phylogenetic profile, a string that encodes the presence or absence of a protein in every known genome. We show that proteins having matching or similar profiles strongly tend to be functionally linked. This method of phylogenetic profiling allows us to predict the function of uncharacterized proteins.

Keywords: genomic, bioinformatics, metabolic pathways, structural complexes

The fully sequenced genomes of numerous organisms offer large amounts of information about cellular biology (see the genomes listed at the web site of The Institute for Genomic Research: www.tigr.org). It is a central challenge of bioinformatics to use this information in discovering the function of proteins. Functional assignments of genes come primarily from biochemical experimentation, which can be extended by matching recently sequenced proteins to those that have already been characterized (1).
For the exceptionally well studied genome of Escherichia coli (2), these and related techniques (3, 4) have led to tentative functional assignments of slightly more than half of its proteins (5). The problem of assigning functions to the remaining proteins is addressed here.

Our computational method detects proteins that participate in a common structural complex or metabolic pathway. Proteins within these groups are defined as functionally linked. The underlying hypothesis is that functionally linked proteins evolve in a correlated fashion, and, therefore, they have homologs in the same subset of organisms. For instance, we expect to find flagellar proteins in bacteria that possess flagella but not in other organisms. In short, we show that if two proteins have homologs in the same subset of fully sequenced organisms, they are likely to be functionally linked. We exploit this property systematically to map links between all the proteins coded by a genome. In general, pairs of functionally linked proteins have no amino acid sequence similarity with each other and, therefore, cannot be linked by conventional sequence-alignment techniques.

METHODS

To represent the subset of organisms that contain a homolog, we constructed a phylogenetic profile for each protein. This profile is a string with n entries, each one bit, where n corresponds to the number of genomes (16 in the present article). We indicate the presence of a homolog to a given protein in the nth genome with an entry of unity at the nth position. If no homolog is found, the entry is zero. Proteins are clustered according to the similarity of their phylogenetic profiles. Similar profiles show a correlated pattern of inheritance and, by implication, functional linkage. The method predicts that the functions of uncharacterized proteins are likely to be similar to characterized proteins within a cluster (Fig. 1).
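The profile construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the genome names, protein names, and hit sets are hypothetical toy data, and the homolog search itself is assumed to have been done already, so each protein simply arrives with the set of genomes in which a significant homolog was found. Only exact-match grouping is shown, which is the simplest form of clustering by profile.

```python
from collections import defaultdict


def build_profile(genomes_with_homolog, genome_order):
    """Return a bit string with '1' where a homolog is present, '0' otherwise.

    Position i of the string corresponds to genome_order[i].
    """
    return "".join("1" if g in genomes_with_homolog else "0" for g in genome_order)


def cluster_identical(profiles):
    """Group protein names that share exactly the same profile string."""
    clusters = defaultdict(list)
    for name, profile in profiles.items():
        clusters[profile].append(name)
    return dict(clusters)


# Hypothetical toy input: four genomes and three proteins (the article uses 16 genomes).
genome_order = ["genomeA", "genomeB", "genomeC", "genomeD"]
hits = {
    "P1": {"genomeA", "genomeC"},
    "P2": {"genomeA", "genomeC"},
    "P3": {"genomeB"},
}

profiles = {name: build_profile(found, genome_order) for name, found in hits.items()}
clusters = cluster_identical(profiles)
```

In this toy input, P1 and P2 share a profile and so fall into the same cluster, which is the situation the method interprets as a possible functional link.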
We computed phylogenetic profiles for the 4,290 proteins encoded by the genome of E. coli by aligning (6) each protein sequence (Pi) with the proteins from 16 other fully sequenced genomes (listed at the web site of The Institute for Genomic Research: www.tigr.org). Proteins coded by the nth genome are defined as including a homolog of Pi if they align to Pi with a score that is deemed statistically significant.‡

RESULTS AND DISCUSSION

To test whether proteins with similar phylogenetic profiles are functionally linked, we examined the phylogenetic profiles for two proteins that are known to participate in structural complexes, the ribosome protein RL7 and the flagellar structural protein FlgL, as well as a protein known to participate in a metabolic pathway, the histidine biosynthetic protein HIS5. We first identified all other E. coli ORFs with phylogenetic profiles identical to those of these three proteins and then those ORFs with profiles that differ by one bit. The results are shown in Fig. 2.
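As a rough illustration of the two steps just described, the sketch below computes the P value threshold given in the footnote (one expected chance hit among n × m comparisons) and then lists the proteins whose profiles are identical to, or differ by one bit from, a query profile. The function names and the toy profiles are assumptions made for this example; the alignment step that would produce the P values is not shown.

```python
def p_value_threshold(n_query_proteins, m_other_proteins):
    """Threshold 1/(n*m): one chance match expected over all n*m comparisons."""
    return 1.0 / (n_query_proteins * m_other_proteins)


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def profile_neighbors(profiles, query, max_diff=1):
    """Names of proteins whose profile differs from the query's by at most max_diff bits."""
    target = profiles[query]
    return sorted(
        name
        for name, profile in profiles.items()
        if name != query and hamming(profile, target) <= max_diff
    )


# Hypothetical toy profiles over four genomes.
toy_profiles = {
    "query": "1110",
    "p_same": "1110",  # identical profile
    "p_one": "1111",   # differs by one bit
    "p_far": "0001",   # differs in every position
}

identical = profile_neighbors(toy_profiles, "query", max_diff=0)
within_one_bit = profile_neighbors(toy_profiles, "query", max_diff=1)
```

Passing `max_diff=0` reproduces the identical-profile search, and `max_diff=1` adds the proteins that differ by a single bit.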
Homologs of ribosome protein RL7 are found in 10 of 11 eubacterial genomes and in yeast but not in archaeal genomes. We find that more than half of the E. coli proteins with the RL7 phylogenetic profile or profiles that differ from it by one bit have functions associated with the ribosome (Fig. 2). The comparisons of the phylogenetic profiles of flagellar proteins (Fig. 2) [...] The examples included in Fig. 2 [...]
The similarity of the phylogenetic profiles of the proteins that share a common keyword was evaluated by a statistical test; we compared the number of neighbors found in our keyword groups to the average number of neighbors found in a group of the same size but with randomly selected E. coli proteins. We found that, on average, the random sets contain very few neighbors compared with the keyword groups, even though the keyword groups contain only a fraction of all possible neighbor pairs. Thus, proteins that are functionally linked are far more likely to be neighbors in profile space than randomly selected proteins. However, we find only a fraction of all possible neighbors within a group. Therefore, not all functionally linked proteins have similar profiles; they may fall into multiple clusters in profile space. It is interesting to note that hypothetical proteins are also more likely to be neighbors than random proteins, suggesting that many hypothetical proteins are part of uncharacterized pathways or complexes. A second indication that functionally linked proteins are likely to have similar phylogenetic profiles comes from the analysis of classes of proteins obtained from the EcoCyc library (Encyclopedia of Escherichia coli Genes and Metabolism, ref. 8). We selected several classes that contain more than 10 members and that represent well known biochemical pathways. The results of our analysis are listed in Table 2. The conclusions that we draw from this analysis are similar to those found with the keyword groups: members of the group are far more likely to have neighboring profiles than members of a randomly selected control group.
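The comparison against random groups can be sketched as follows. This is only an illustration of the idea, not the statistical test used in the article: the bit-difference cutoff that defines a neighbor is left as a parameter (`max_diff`) because it is an assumption here, and the number of random trials, the seed, and the toy profiles are likewise hypothetical.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def count_neighbor_pairs(profiles, group, max_diff=1):
    """Count pairs within the group whose profiles differ by at most max_diff bits."""
    return sum(
        1
        for a, b in combinations(group, 2)
        if hamming(profiles[a], profiles[b]) <= max_diff
    )


def random_baseline(profiles, group_size, trials=1000, max_diff=1, seed=0):
    """Average neighbor-pair count over random groups of the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = sorted(profiles)
    total = 0
    for _ in range(trials):
        total += count_neighbor_pairs(profiles, rng.sample(names, group_size), max_diff)
    return total / trials


# Hypothetical toy profiles; "a", "b", "c" stand in for a keyword group.
toy_profiles = {
    "a": "1100", "b": "1100", "c": "1101",
    "d": "0011", "e": "0010", "f": "1111",
}
group = ["a", "b", "c"]
observed = count_neighbor_pairs(toy_profiles, group)
expected = random_baseline(toy_profiles, len(group), trials=200)
```

A functionally linked group would be expected to show an `observed` count well above the `expected` random average, which is the pattern reported for the keyword groups.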
Finally, we attempted to determine the ability of our method to predict the function of uncharacterized proteins. We equate the function of a protein with that of its neighbors in phylogenetic-profile space. This equation is accomplished by means of the keyword annotations found in the SwissProt database. To test the efficacy of this method, we compared the keywords of each characterized protein to those of the neighbors in phylogenetic-profile space. All of the neighbors, in this case, were other proteins with identical profiles. We found that on average 18% of the neighbor keywords overlapped the known keywords of the query protein. By comparison, random proteins had only a 4% overlap with the same set of neighbors. We make the rough estimate that, for more than half of E. coli proteins, we can assign the general function correctly by examining the functions of their phylogenetic-profile neighbors. This estimate should also hold true for the ability of phylogenetic profiles to assign functions to uncharacterized proteins.

CONCLUSION

The phylogenetic profile of a protein describes the presence or absence of homologs in organisms. Proteins that make up multimeric structural complexes are likely to have similar profiles. Also, proteins that are known to participate in a given biochemical pathway are likely to be neighbors in the space of phylogenetic profiles. These findings indicate that comparing profiles is a useful tool for identifying the complex or pathway in which a protein participates. Finally, we were able to make functional assignments of uncharacterized proteins by examining the function of proteins with identical phylogenetic profiles.

As the number of fully sequenced genomes increases, scientists will be able to construct longer and potentially more informative protein phylogenetic profiles. There are at least 100 genome projects underway that are due to be completed within the next few years.
These data will enable the construction of profiles 100 bits rather than 16 bits in length. Because the number of profile patterns grows exponentially with the number of fully sequenced genomes, the results of 100-bit comparisons should be considerably more informative than those with 16 bits. Furthermore, because the newly sequenced genomes will include several eukaryotic organisms, protein phylogenetic profiles also should become a useful tool for studying structural complexes and metabolic pathways in these higher organisms.

Acknowledgments

This work was supported by a postdoctoral fellowship from the Sloan Foundation and the Department of Energy (to M.P.), by a Hollaender postdoctoral fellowship from the Department of Energy and the Oak Ridge Institute for Science and Education (to E.M.), and by grants from the Department of Energy and the National Institutes of Health.

Note Added in Proof

A data structure similar to the one described in Methods has been proposed independently by Ragan and Gaasterland (9) for describing the distribution of proteins in genomes.

Footnotes

‡The statistical significance of an alignment score is described by the probability (P) of obtaining a higher score when the sequences are shuffled. To compute a P value threshold, we first consider the total number of sequence comparisons that we are performing. If there are n proteins in E. coli and m in all other genomes, this number is n × m. If we were to compare this number of random sequences, we would expect one pair to yield a P value of 1/(n × m) by chance. We, therefore, set this P value as our threshold.

References

1. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. J Mol Biol. 1998;283:707–725.
2. Blattner F R, Plunkett G, Bloch C A, Perna N T, Burland V, Riley M, Collado-Vides J, Glasner J D, Rode C K, Mayhew G F, et al. Science. 1997;277:1453–1474.
3. Tatusov R L, Mushegian A R, Bork P, Brown N P, Hayes W S, Borodovsky M, Rudd K E, Koonin E V. Curr Biol. 1996;6:279–291.
4. Andrade M A, Sander C. Curr Opin Biotechnol. 1997;8:675–683.
5. Riley M. Nucleic Acids Res. 1998;26:54.
6. Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Nucleic Acids Res. 1997;25:3389–3402.
7. Bairoch A, Apweiler R. Nucleic Acids Res. 1998;26:38–42.
8. Karp P, Riley M, Paley S, Pellegrini-Toole A. Nucleic Acids Res. 1998;26:50–53.
9. Gaasterland T, Ragan M A. Microb Comp Genomics. 1998;3:177–192.