Copyright © 2004 The Protein Society

Proteome-wide functional classification and identification of prokaryotic transmembrane proteins by transmembrane topology similarity comparison

1Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, Hirosaki 036-8561, Japan
2Department of Developmental Biology and Neuroscience, Graduate School of Life Sciences and 3Department of Molecular Immunology, Institute of Development, Aging and Cancer, Tohoku University, Sendai 980-8577, Japan

Reprint requests to: Toshio Shimizu, Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, 3, Bunkyo-cho, Hirosaki 036-8561, Japan; e-mail: slsimi/at/si.hirosaki-u.ac.jp; fax: 81-172-39-3638.

Received April 15, 2004; Revised May 19, 2004; Accepted May 19, 2004.

Abstract

We propose a new method for classifying and identifying transmembrane (TM) protein functions on a proteome scale by applying a single-linkage clustering method based on TM topology similarity, which is calculated simply by comparing the lengths of loop regions. In this study, we focused on 87 prokaryotic TM proteomes consisting of 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea. Prior to performing the clustering, we first categorized individual TM protein sequences as “known,” “putative” (similar to “known” sequences), or “unknown” by using a homology search and a sequence similarity comparison against SWISS-PROT, to assess the current status of the functional annotation of the TM proteomes based on sequence similarity only. More than three-quarters, that is, 75.7% of the TM protein sequences are functionally “unknown,” with only 3.8% and 20.5% of them being classified as “known” and “putative,” respectively.
Using our clustering approach based on TM topology similarity, we succeeded in increasing the fraction of TM protein sequences functionally classified and identified from 24.3% to 60.9%. The clusters obtained correspond well to functional superfamilies or families, and the functional classification and identification are successfully achieved by this approach. For example, in a cluster of TM proteins with six TM segments, 109 of the 119 sequences annotated as “ATP-binding cassette transporter” are properly included, together with 122 “unknown” sequences.

Keywords: transmembrane protein, transmembrane topology similarity, functional classification and identification, proteome-wide analysis, prokaryotic genome

Genome projects have provided an enormous number of potential protein sequences, whose functions researchers have tried to identify by using computer-based methods. Many of these proteins, however, have not yet been annotated, with about half of all proteome sequences being classified as functionally “unknown” or “putative” at best (Serres et al. 2001). Such is the case, in particular, for transmembrane (TM) proteins, which account for as much as 20% to 30% of the total number of proteins in individual proteomes (Boyd et al. 1998; Jones 1998; Wallin and von Heijne 1998; Mitaku et al. 1999; Pasquier and Hamodrakas 1999; Stevens and Arkin 2000; Krogh et al. 2001; Liu and Rost 2001; Arai et al. 2003). As described later, functionally “unknown” sequences make up more than three-quarters of all TM proteomes. Furthermore, among the 70,228 full-length protein sequences with a function annotated as “known” in SWISS-PROT release 41 (containing 122,564 sequences in total; Boeckmann et al. 2003), only 10,796 are TM protein sequences, compared with 59,432 soluble protein sequences (details are given in the Materials and Methods section).
This shortage of “known” TM protein sequences in the SWISS-PROT database would naturally cause a serious delay in the classification and identification of TM protein functions if sequence similarity is used as the only criterion. On the other hand, recent studies have revealed that TM protein functions are closely related to TM topology (the number of TM segments [TMSs], the positions of the TMSs, and the N-tail location), and can be classified and identified with high accuracy using TM topology information as the primary basis, even without using sequence similarity directly (Sugiyama et al. 2003; Inoue et al. 2004). Individual functional groups have their own specific TM topologies, that is, characteristic combination patterns of loop lengths. The similarity of TM topologies between two TM protein sequences can be evaluated rather easily by comparing the lengths of corresponding loop regions between the two sequences, as described in detail in the Materials and Methods section. A pair of TM protein sequences with a higher sequence identity usually shows a higher TM topology similarity. In some cases, however, the TM topology similarity remains high between two sequences belonging to the same functional group (at the superfamily level) even if the sequence similarity is below the twilight zone. For example, for the pair mouse GABA receptor α 6 (GAA6_MOUSE) and human neuronal acetylcholine receptor α 5 (ACH5_HUMAN), the sequence identity is only 15.8%, whereas the TM topology similarity is as high as 96.9%. Thus, the classification and identification of TM protein functions on a proteome scale is expected to improve considerably by making good use of TM topology information in addition to sequence similarity. One approach for obtaining more reliable and accurate TM topology predictions is the ConPred program (Ikeda et al. 2002; Arai et al. 2004; Xia et al. 2004), which is based on a consensus strategy combining several proposed prediction methods and achieves an accuracy increase of as much as 10%, for example, in predicting the entire TM topology of prokaryotic TM protein sequences, from 56.5% (by MEMSAT 1.8 [Jones et al. 1994] and HMMTOP 2.0 [Tusnády and Simon 1998, 2001]) to 68.1% (Arai et al. 2004).

In this study, we propose a new approach for classifying and identifying TM proteome functions by using a clustering method based on TM topology similarity. We focused on predicted TM proteins from 87 completed prokaryotic (72 bacterial and 15 archaean) genome sequences. In this approach, when sequences of unknown function are segregated into a cluster together with sequences of known function, not only functional classification but also functional identification is achieved. Prior to carrying out the clustering, we first identified functions of the predicted TM protein sequences and classified them into three categories by using a homology search and a sequence comparison on the basis of sequence similarity, that is, “known,” “putative” (similar to “known” sequences), and “unknown.”

Results and Discussion

Table 1 summarizes the 87 prokaryotic (31 proteobacterial, 22 gram-positive bacterial, 19 other bacterial, and 15 archaean) proteomes used in this study. Out of 239,359 protein sequences in the 87 proteomes, 53,053 TM protein sequences (22.2% of the 87 proteomes) were obtained together with their TM topologies following the procedure described in the Materials and Methods section. We focused on the TM proteins with between 1 and 12 TMSs (1~12-tms), because only 3.8% of all of the TM proteins in the proteomes have more than 12 TMSs. The number and the fraction of predicted 1~12-tms TM proteins in each proteome are also listed in Table 1.
Most of the proteomes fall in a narrow range around 21% over the four categories of prokaryotic species, with a few extremes, for example, 13.2% for Buchnera aphidicola and 29.1% for Tropheryma whipplei. The average fraction of TM proteins per proteome was calculated as 21.3% over all 87 species. The distribution of the number of TMSs in the 51,044 TM protein sequences is given in the second column of Table 2.
Current status of the proteome-wide functional identification of TM protein sequences based on sequence similarity only

The current level of functional identification of 1~12-tms TM proteins obtained by sequence homology searches (and similarity comparisons) is shown in Table 2. The fractions of TM protein sequences identified as “known” by our approach, which are defined as almost identical to or exactly the same as one of the sequences registered in the SWISS-PROT database with an unambiguous function, are extremely low: 5.2% for 12-tms TM proteins and 5.0% for 9-tms TM proteins at the highest, and only 3.8% as an average over all 1~12-tms TM proteins. The fractions of “putative” sequences, the functions of which are inferable from the functionally known sequences in SWISS-PROT, range widely from the minimum, 11.3% for 2-tms, to the maximum, 34.9% for 9-tms TM proteins, with an overall average of 20.5%. The “known” and “putative” sequences added together amount to only 24.3%, that is, about one-quarter of the TM proteomes, indicating the majority (i.e., more than three-quarters) of TM proteomes are still classified as functionally unknown. The results listed in Table 2 are illustrated in detail separately for each species in Figure 1.
As with the fraction of “known” and “putative” sequences put together, 10 species belonging to the γ-proteobacteria among the proteobacteria (from E. coli to P. multocida in the list) stand out among the other species. This again reflects the large number of “known” E. coli sequences in SWISS-PROT. Overall, the proteobacterial genomes far exceed the other three species categories in the fractions of “known” and “putative” sequences. The archaean TM proteomes have the smallest fractions of “known” plus “putative” sequences, 8.4% as an average over the 15 species. Interestingly, 65.1% of the “putative” sequences over all the archaean genomes are annotated after the proteobacterial “known” sequences, while only 23.3% of them are annotated directly after the archaean “known” sequences.

Threshold TM topology similarities and the minimum cluster size

We assumed the proteome-scale functional classification using the clustering approach was successful when more than 50% of all the sequences were included in clusters of at least 10 sequences (the minimum cluster size). The threshold TM topology similarities used as the criteria for clustering were determined based on this assumption. The conditions (the 50% coverage and the minimum cluster size of 10) adopted in our approach are not based on any scientific data, but rather are purely empirical ones. This assumption is, however, supported by the relationship between the threshold TM topology similarity and the minimum cluster size: with increasing minimum cluster size, the threshold TM topology similarities decrease rapidly at first and then reach saturated levels at a minimum cluster size of around 10 for most numbers of TMSs (see Supplemental Fig. 1). Threshold TM topology similarities thus determined are, for example, 98%, 85%, and 82% for 1-tms, 6-tms, and 12-tms TM protein sequences, respectively, as shown in the third column of Table 3. As expected, stricter threshold similarity values are obtained for the smaller numbers of TMSs.
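As an illustration of the threshold selection just described, the following Python sketch clusters sequences by single linkage and reports the strictest threshold at which more than 50% of the sequences fall into clusters of at least 10 members. It is only a sketch under stated assumptions: the similarity function takes the mean of the min/max ratios of corresponding loop lengths, which is an assumed form and not necessarily the exact formula used in this study; scanning integer thresholds from 100 down to 50 is likewise an assumption; and the names `topology_similarity`, `single_linkage`, `coverage`, and `pick_threshold` are hypothetical.

```python
from itertools import combinations


def topology_similarity(loops_a, loops_b):
    """Percent similarity of two loop-length lists.

    ASSUMPTION: mean of min/max ratios over corresponding loops.
    The exact formula of the study may differ.
    """
    if len(loops_a) != len(loops_b):
        raise ValueError("sequences must have the same number of loops")
    ratios = []
    for a, b in zip(loops_a, loops_b):
        longer = max(a, b)
        # Two zero-length loops are treated as identical (an assumption).
        ratios.append(min(a, b) / longer if longer > 0 else 1.0)
    return 100.0 * sum(ratios) / len(ratios)


def single_linkage(items, threshold, similarity=topology_similarity):
    """Group item indices so that any pair with similarity >= threshold
    ends up in the same cluster (union-find over all pairs, O(n^2))."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)


def coverage(clusters, min_size):
    """Fraction of all items that sit in clusters of at least min_size."""
    total = sum(len(c) for c in clusters)
    covered = sum(len(c) for c in clusters if len(c) >= min_size)
    return covered / total if total else 0.0


def pick_threshold(items, min_size=10, target=0.5, candidates=range(100, 49, -1)):
    """Return the strictest candidate threshold whose large clusters
    cover more than `target` of the items, or None if none does."""
    for t in candidates:
        if coverage(single_linkage(items, t), min_size) > target:
            return t
    return None


if __name__ == "__main__":
    # Toy loop-length lists (hypothetical data), small minimum cluster size.
    toy = [[10, 20, 30], [10, 20, 30], [11, 20, 30], [50, 5, 90]]
    print(single_linkage(toy, 90))
    print(pick_threshold(toy, min_size=3))
```

The all-pairs comparison is quadratic in the number of sequences, which is acceptable for a sketch but would need a faster neighbor search for tens of thousands of sequences per TMS class.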
Comprehensive functional classification and identification of TM protein sequences based on TM topology similarity

The results of the functional classification and identification of TM proteomes using the single-linkage clustering method based on TM topology similarity are summarized in Table 3 for 1~12-tms TM proteins. The numbers of large clusters generated range from 22 to 74, with more clusters generated for smaller numbers of TMSs and fewer for larger numbers, in general. In these large clusters, more than half of all of the TM proteome sequences are included, a large majority of which (69.8%) are “unknown” sequences together with “known” and “putative” sequences, indicating that a large number of “unknown” sequences have been functionally classified and identified by this approach. Taking into account the “known” plus “putative” sequences included in the small clusters all together, the number of functionally annotated TM protein sequences runs up to 60.9% of the TM proteome sequences, a significant improvement over the 24.3% obtained from the sequence homology search plus similarity comparison. The percentages of newly classified and identified sequences using this approach are displayed in Figure 1.

The following describes the details of the functional classification and identification attained by this approach, taking 6-tms TM proteins as an example. Table 4 provides the list of the 27 large clusters generated by single-linkage clustering based on TM topology similarity (threshold similarity 85%) for 6-tms TM proteins, enumerated in order of cluster size. The largest cluster, Cluster 1, includes 1085 sequences, nearly one-fourth of all of the 6-tms TM protein sequences, with the “known” plus “putative” sequences (679 in total) annotated as “transport system permease protein” except for one sequence (annotated as photosystem II chlorophyll-binding protein). This implies the 406 “unknown” sequences (37.4% of the 1085 sequences) included in the cluster also could be annotated as transport system permease proteins. By further clustering based on sequence similarity (threshold sequence identity 30%) within Cluster 1, we obtained 46 subclusters that correspond to functional subgroups, for example, “dipeptide transport system permease dppB” (in total 228 sequences including “unknown” sequences), “maltose transport system permease malD” (212 sequences), “lactose transport system permeases lacF” (181 sequences), and “sulfate transport system permease cysT” (118 sequences), suggesting that the TM topology-based clustering may correspond to a superfamily- or family-level classification, whereas the sequence similarity-based clustering may correspond to a family- or subfamily-level one in this case.
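The inference above, that the “unknown” members of a cluster may share the function of its annotated members, can be sketched as a simple consensus rule. The majority-vote rule, the record layout, and the function name `propose_annotation` are assumptions made for illustration; the text does not specify how a cluster-level function is assigned.

```python
from collections import Counter


def propose_annotation(cluster):
    """Suggest a function for the 'unknown' members of one cluster.

    `cluster` is a list of dicts with the hypothetical keys
    'id', 'status' ('known' | 'putative' | 'unknown') and 'function'.
    Returns (consensus_function, support, unknown_ids), where support is
    the fraction of annotated members carrying the consensus function.
    ASSUMPTION: a plain majority vote; not the authors' stated rule.
    """
    annotated = [m["function"] for m in cluster
                 if m["status"] in ("known", "putative")]
    unknown_ids = [m["id"] for m in cluster if m["status"] == "unknown"]
    if not annotated:
        # A cluster made only of 'unknown' sequences gets no proposal.
        return None, 0.0, unknown_ids
    function, count = Counter(annotated).most_common(1)[0]
    return function, count / len(annotated), unknown_ids


if __name__ == "__main__":
    permease = "transport system permease protein"
    toy_cluster = [
        {"id": "k1", "status": "known", "function": permease},
        {"id": "p1", "status": "putative", "function": permease},
        {"id": "p2", "status": "putative", "function": permease},
        {"id": "p3", "status": "putative",
         "function": "photosystem II chlorophyll-binding protein"},
        {"id": "u1", "status": "unknown", "function": None},
        {"id": "u2", "status": "unknown", "function": None},
    ]
    print(propose_annotation(toy_cluster))
```

Reporting the support value alongside the proposal keeps visible how uniform the annotated members are, which matters for clusters that are less homogeneous than Cluster 1.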
The top 13 clusters, except for Clusters 9 and 12, contain sequences that distribute over all the species categories, indicating the TM proteins of these functional groups are essential for the life of prokaryotic species. By comparison, Cluster 14 (phage infection protein) contains sequences from only gram-positive bacterial and other bacterial species, and the sequences in Cluster 20 (intracellular separation protein) exist only in proteobacterial and gram-positive bacterial genomes. In Table 4, we have four clusters composed of only “unknown” sequences, Clusters 9, 16, 23, and 27. Of these, Clusters 23 and 27 comprise the sequences from only archaean and proteobacterial species, respectively. These “unknown” protein sequences must be not only novel but also biologically important functional groups. We expect further experimental studies would characterize these sequences and elucidate their functions in detail.

Cluster 3 (231 sequences, of which 109 are “known” or “putative” assigned as “ATP-binding cassette [ABC] transporters”) clearly illustrates how well the TM topology-based clustering works in the functional classification and identification of TM proteins. Out of 119 6-tms sequences annotated as “ABC transporter,” 109 sequences (91.6%) are captured properly in this cluster, and the remaining 10 sequences are spread across nine small clusters: one sequence in a cluster with the size of nine (including nine sequences in total, N-in topology), one in a size-four cluster (N-out), two in a size-two (N-in), one in a size-two (N-out, +SP), and five orphan sequences (all N-in). The other 122 sequences are all “unknown,” and no sequences with other functions are included in this cluster at all. TM topology models of the 231 sequences are illustrated in Figure 2.
Another typical example is Cluster 10 (52 sequences), whose TM topology models are presented in Figure 3.
The TM proteins contained in Cluster 10 share characteristic TM topology features, as seen in Figure 3.

Materials and Methods

Data source

We used 239,359 open reading frames (ORFs) from 87 sequenced prokaryotic genomes registered in GenBank (Benson et al. 2004) for this study, as listed in Table 1. The ORFs were downloaded from ftp://ncbi.nlm.nih.gov/genbank/genomes/ on March 6, 2003. The 87 genomes included 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea according to the classification in GenBank.

Prediction of TM protein sequences and their TM topologies from the proteomes

Out of the protein sequences translated from the ORFs, we segregated TM protein sequences and predicted their TM topologies according to the following procedure: (1) prediction of TM protein sequence candidates using SOSUI (≥98% accuracy; Hirokawa et al. 1998); (2) removal of predicted signal peptide (SP) regions using DetecSig (88% accuracy; Lao and Shimizu 2001; Lao et al. 2002); and (3) prediction of TM topology by ConPred (68.1% accuracy; Arai et al. 2004). A more detailed description of this procedure is given in our previous article (Arai et al. 2003).

Functional identification of TM protein sequences based on sequence similarity

We first categorized the 114,965 full-length protein sequences in SWISS-PROT release 41 into “known,” “putative,” or “unknown” groups according to the level of functional annotation. For this categorization, we adopted the simple but rational criteria given in the GTOP database (http://spock.genes.nig.ac.jp/?genome/func.html; Kawabata et al. 2002). The criterion for discriminating sequences with a “known” function requires at least one of the following: (1) more than five letters with functional information in the DE line, (2) at least one informative word in the KW line, or (3) both “-!- FUNCTION” and “-!- CATALYTIC ACTIVITY” in the CC line.
Sequence entries were classified as “putative” if the entry contains one of the following descriptions: (1) “HOMOLOG,” “HOMOLOGY,” “HYPOTHETICAL,” “POTENTIAL,” “POSSIBLE,” “PROBABLE,” or “PUTATIVE” in the DE line; (2) “BY SIMILARITY,” “HYPOTHETICAL,” “POTENTIAL,” “POSSIBLE,” “PROBABLE,” or “PUTATIVE” in the “CC -!- FUNCTION” or “CC -!- CATALYTIC ACTIVITY” line; or (3) “HYPOTHETICAL PROTEIN” in the KW line. When only the “known” criterion is satisfied, the sequence is regarded as “known.” When both the “known” and “putative” criteria are satisfied, the sequence is classified as “putative.” Sequences that do not satisfy the “known” criterion are categorized as “unknown,” even if the “putative” criterion is satisfied. Through this procedure, we obtained 70,228 “known” (10,796 TM protein sequences), 39,296 “putative” (6643), and 5441 “unknown” (754) sequences from SWISS-PROT release 41.

Next, we classified the 51,044 predicted TM protein sequences from the 87 prokaryotic genomes into three categories in agreement with the functional description levels in SWISS-PROT using a BLAST homology search (Altschul et al. 1990, 1997) and an ALIGN (Myers and Miller 1988) sequence comparison, as illustrated in Figure 4.
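The rule that combines the “known” and “putative” criteria for SWISS-PROT entries into three annotation levels can be sketched as follows. The sketch covers only the final combination of the two criteria plus a crude substring check of the DE line; the helper names `has_putative_marker` and `annotation_level` are hypothetical, and detecting “functional information” in the DE, KW, and CC lines is not attempted here.

```python
# Marker words for the DE line, taken from the description in the text.
PUTATIVE_DE_WORDS = (
    "HOMOLOG", "HOMOLOGY", "HYPOTHETICAL", "POTENTIAL",
    "POSSIBLE", "PROBABLE", "PUTATIVE",
)


def has_putative_marker(de_line):
    """Crude check: does the DE line contain any 'putative' marker word?

    ASSUMPTION: case-insensitive substring matching; the original
    procedure may tokenize the line differently.
    """
    text = de_line.upper()
    return any(word in text for word in PUTATIVE_DE_WORDS)


def annotation_level(meets_known, meets_putative):
    """Combine the two criteria into 'known', 'putative' or 'unknown'.

    - 'known' criterion not met        -> 'unknown' (even if 'putative' is met)
    - both criteria met                -> 'putative'
    - only the 'known' criterion met   -> 'known'
    """
    if not meets_known:
        return "unknown"
    return "putative" if meets_putative else "known"


if __name__ == "__main__":
    de = "PROBABLE ABC TRANSPORTER PERMEASE PROTEIN"  # hypothetical DE line
    print(annotation_level(meets_known=True,
                           meets_putative=has_putative_marker(de)))
```

Keeping the combination rule separate from the line-level checks makes the precedence explicit: failing the “known” criterion always yields “unknown,” regardless of any “putative” marker.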
Next, the “known” or “putative” candidate sequence from the BLAST search process was aligned with the matched SWISS-PROT sequences to calculate the global sequence identities between them using the ALIGN program with the default settings, except for the substitution matrix (BLOSUM 62 was used). The matched SWISS-PROT sequence with the highest identity was characterized as the most similar one to the candidate sequence, and the candidate sequence was finally classified into one of the three categories according to the value of the highest identity: “known” (highest identity ≥95%), “putative” (≥30% and <95%), or “unknown” (<30%). When a query sequence is categorized as “known” or “putative,” it is considered to be a functionally identified TM protein, and the function of the matched SWISS-PROT sequence is assigned to the query sequence.

Functional classification and identification of TM protein sequences based on TM topology similarity

The procedure for classifying and identifying TM protein functions based on TM topology similarity is illustrated in Figure 5.

In the definition of the TM topology similarity, $n$ is the number of TMSs, $l_{1,i}$ and $l_{2,i}$ are the lengths of the $i$-th loop in sequences 1 and 2, respectively, and $\min(l_{1,i}, l_{2,i})$ and $\max(l_{1,i}, l_{2,i})$ are the lengths of the shorter and longer of the two loops, respectively. Within the individual TM topology-based clusters, the sequences are further clustered by a single-linkage method based on sequence similarity (threshold sequence identity 30%) using the ALIGN program with the default settings, except for the substitution matrix (BLOSUM 62 was used), to generate subclusters that are expected to correspond to functional subgroups in the TM topology-based clusters, as illustrated in Figure 5.

Electronic supplementary material

Supplemental materials are (1) lists of the obtained large clusters based on TM topology similarity for 1~12-tms TM proteins (named “Supple_Table1.doc”), (2) Supplemental Figure legends (“Supple_Fig_legends.doc”), (3) Supplemental Figure 1.

Acknowledgments

This research was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas (C) “Genome Information Science” (no. 15014203) and a Grant-in-Aid for Scientific Research (C) (no. 14580665) from the Ministry of Education, Culture, Sports, Science and Technology of Japan. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Notes

Supplemental material: see www.proteinscience.org

Article and publication are at http://www.proteinscience.org/cgi/doi/10.1110/ps.04814404.