A New Machine Learning Approach for Protein Phosphorylation Site Prediction in Plants

1 Department of Computer Science, University of Missouri, Columbia, Missouri 65211
2 C.S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211
3 Department of Biochemistry, University of Missouri, Columbia, Missouri 65211
4 Center for Information Science and Technology, Temple University, Philadelphia, PA 19122
5 Center for Computational Biology and Bioinformatics, Indiana University Schools of Medicine and Informatics, Indianapolis, IN 46202
* Corresponding author. Tel.: +1 573-884-1887; Fax: +1 573-882-8318; Email: xudong/at/missouri.edu

Abstract
Protein phosphorylation is a crucial regulatory mechanism in various organisms. With recent improvements in mass spectrometry, phosphorylation site data are rapidly accumulating. Despite this wealth of data, computational prediction of phosphorylation sites remains a challenging task. This is particularly true in plants, because information on the substrate specificities of plant protein kinases is limited and because current phosphorylation prediction tools are trained with kinase-specific phosphorylation data from non-plant organisms. In this paper, we propose a new machine learning approach for phosphorylation site prediction. We incorporate protein sequence information and protein disordered regions, and integrate the machine learning techniques of k-nearest neighbor and support vector machine to predict phosphorylation sites. Test results on the PhosPhAt dataset of phosphoserines in Arabidopsis and the TAIR7 non-redundant protein database show good performance of the proposed method.

Keywords: Protein Phosphorylation, Phosphoproteomics, Arabidopsis, Protein Disorder, KNN, SVM

1 Introduction
Reversible protein phosphorylation is one of the most pervasive posttranslational modification mechanisms, regulating diverse cellular processes in various organisms.
It has been estimated that about 30% of all proteins in a cell are phosphorylated at any given time [1]. In recent years, publicly available protein phosphorylation data have rapidly accumulated due to large-scale mass spectrometry studies of protein phosphorylation in different organisms [2-6] and the development of associated phosphorylation web resources [7-11]. Protein phosphorylation can occur on serine, threonine and tyrosine residues, as well as on histidine and aspartate residues in the case of two-component phosphorelays. However, O-linked phosphorylation, specifically on serine residues, is the most common form of phosphorylation in eukaryotes. Despite the increasing number of large-scale phosphorylation studies, experimental identification of phosphorylation sites is still a difficult and time-consuming task. Therefore, more efficient methods for predicting phosphorylation sites in silico are desirable.

A number of phosphorylation site prediction tools have been developed, including Scansite 2.0 [12], NetPhosK [13], PredPhospho [14], DISPHOS [15], KinasePhos [16], PPSP [17], pkaPS [18], Predikin [19], GPS 2.0 [20], AutoMotif [21] and CRPhos [22]. However, these tools have limitations when predicting phosphorylation sites in plants for two major reasons: (1) they were trained mostly on phosphorylation data from non-plant, mainly mammalian, organisms; (2) all of them except DISPHOS were trained on kinase-specific phosphorylation data and aimed to predict kinase substrate specificities. Meanwhile, the phosphorylation data in plants are not as well annotated as those in mammals, with much less information available on the specificity of phosphorylation sites and their corresponding kinases. Therefore, there is a clear need to train a reliable phosphorylation predictor for plants, given the increased frequency of protein kinases in plant genomes and the lack of knowledge about their substrate specificities.
With the recently released PhosPhAt database, potential phosphoserines were predicted for the Arabidopsis protein database TAIR7 [23] by a support vector machine (SVM) trained on the experimental data collected in the database [10]. Nevertheless, there is room for improvement in prediction accuracy. In this paper, we propose a new machine learning approach for phosphorylation site prediction in plants. It integrates features from protein disorder information, nearest neighbors of known phosphorylation sites, and amino acid frequencies in the sequences surrounding phosphorylation sites to train an SVM for phosphorylation site prediction. The key difference between our method and the previous study [10] is that we incorporate protein disorder prediction and nearest neighbor information in the prediction. A previous study demonstrated that disorder information significantly improved the discrimination between phosphorylation and non-phosphorylation sites [15]. With the increasing volume of empirical phosphorylation sites, it is advantageous to use nearest neighbor information. Test results on the PhosPhAt [10] dataset of phosphoserines and the TAIR7 [23] non-redundant protein database indeed show the remarkable performance of our proposed phosphorylation prediction method.

2 Materials and Methods
Phosphorylation site prediction can be formulated as a binary classification problem: each serine, threonine or tyrosine is classified as either a phosphorylation site or a non-phosphorylation site. As with all general binary classification problems, there are three key issues: (1) a well-collected and curated dataset including positive and negative data; (2) a set of effective features to characterize the common patterns in each category and the differences between the two categories; (3) a classifier trained from the known data, capable of making reliable predictions for new data.
In this study, datasets were extracted from the TAIR7 protein database and the PhosPhAt phosphorylation database, as discussed in Section 2.1. Outputs from a protein disorder predictor, outputs from k-nearest neighbor predictions and amino acid frequencies around the phosphorylation sites were taken as features, as discussed in Section 2.2. We used SVM as the classifier.

2.1 Phosphorylation Dataset
Phosphorylation data in the model organism Arabidopsis thaliana collected in PhosPhAt [10] and the Arabidopsis thaliana protein database TAIR7 were used in this study. Sequences with high similarities were first removed from TAIR7 to build a non-redundant (NR) protein database using BLASTClust in the BLAST package version 2.2.19 with a sequence identity threshold of 30%. As a result, 12,018 representative proteins remain in the TAIR NR database. The PhosPhAt phosphorylation data were then incorporated, resulting in 1152 phosphoproteins in the TAIR NR database, which contain 2050 phosphorylation sites, including 1818 phosphoserines, 130 phosphothreonines and 102 phosphotyrosines. We study only phosphoserine events in this paper because of the large amount of data available for training and testing. However, the proposed method can be applied to all types of phosphorylation sites.

A 25-residue-long amino acid sequence surrounding each phosphoserine, with the phosphoserine in the middle, was extracted from each phosphoprotein in the TAIR NR database. Phosphoserines with fewer than 12 residues upstream or downstream were discarded. As a result, we retrieved a positive set of 1671 sequences surrounding phosphoserines. Similarly, the 433,744 sequences surrounding the non-phosphoserines (serines other than the phosphoserines) were assumed to be the negative set. Although not all these sites are necessarily true negatives, it is reasonable to believe that the vast majority of them are.
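The window extraction described in Section 2.1 can be illustrated with a minimal Python sketch. The function names `extract_serine_windows` and `split_windows`, the 1-based position convention, and the toy sequence are assumptions made for illustration; they are not taken from the original implementation.

```python
def extract_serine_windows(sequence, flank=12):
    """Return (position, window) pairs for every serine that has at least
    `flank` residues on both sides. Positions are 1-based (an assumption).
    With flank=12 each window is 25 residues long with the serine centered."""
    windows = []
    for i, residue in enumerate(sequence):
        if residue != "S":
            continue
        # Discard serines too close to either terminus.
        if i < flank or i + flank >= len(sequence):
            continue
        windows.append((i + 1, sequence[i - flank:i + flank + 1]))
    return windows


def split_windows(windows, phospho_positions):
    """Split windows into a positive set (known phosphoserines) and a
    negative set (all other serines, assumed to be non-phosphorylated)."""
    known = set(phospho_positions)
    positives = [w for pos, w in windows if pos in known]
    negatives = [w for pos, w in windows if pos not in known]
    return positives, negatives


if __name__ == "__main__":
    # Toy protein: two serines far enough from the termini, one too close.
    toy = "S" + "A" * 12 + "S" + "A" * 12 + "S" + "A" * 12
    wins = extract_serine_windows(toy)
    pos, neg = split_windows(wins, phospho_positions=[14])
    print(len(wins), len(pos), len(neg))
```

Treating every unannotated serine as a negative mirrors the assumption stated in the text, so the negative set of such a sketch may contain undiscovered phosphorylation sites.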
2.2 Feature Extraction and Selection

2.2.1 K-Nearest Neighbor Features
Both the positive and negative sets are very diverse at the sequence level. However, clusters may exist in the positive set, since each phosphorylation site is the substrate of a specific protein kinase, and one kinase can target multiple substrates. It is well known that substrates of the same kinase may share similar sequence patterns [24]. To take advantage of this cluster information when predicting phosphorylation for a new site (represented by its surrounding sequence), we extracted features from the similar sequences in both the positive and negative sets, retrieved by a k-nearest neighbor (KNN) algorithm according to the following procedure.
2.2.2 Protein Disorder Features
It has been observed that sites of posttranslational modifications, including protein phosphorylation sites, are frequently located within disordered regions [15, 26]. In [15], the disorder prediction results for the phosphorylation sites were employed as features to construct a phosphorylation predictor, DISPHOS. In this study, we extracted the disorder information for all residues surrounding each phosphorylation site and combined them to form a set of disorder features for the SVM. The procedure is as follows:
2.2.3 Amino Acid Frequency Features
In [15], Iakoucheva et al. analyzed the amino acid composition of the sequences surrounding phosphorylation sites and found that rigid, buried, neutral amino acids (W, C, F, I, Y, V and L) are significantly depleted, while flexible, surface-exposed amino acids (S, P, E, K) are significantly enriched. This conclusion was confirmed by this study, as illustrated in Section 3.3. This makes the amino acid frequencies good candidates as features for phosphorylation site prediction. In this paper, all 20 amino acid frequencies in each 25-residue sequence were extracted as features for the phosphorylation predictor.

3 Results and Discussions

3.1 KNN Scores as Features
The KNN scores were extracted as features according to the procedure described in Section 2.2.1. A KNN score for a sequence of interest compares its average distance (or dissimilarity) to the nearest neighbors (NNs) in the positive set with that in the negative set. A score smaller than 1 means the sequence is more similar to the positive set; a score larger than 1 means it is more similar to the negative set. The smaller the KNN score, the more similar the sequence is to known phosphorylation sites, and thus the more likely it contains a phosphorylation site.

Figure 1
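The KNN score described in Section 3.1 can be sketched as a ratio of average nearest-neighbor distances. This is only an illustration under stated assumptions: the distance used here is a plain mismatch fraction (Hamming distance), chosen for simplicity and not necessarily the distance used in the study, and the names `hamming_distance`, `mean_knn_distance` and `knn_score`, as well as the value of `k`, are hypothetical.

```python
def hamming_distance(a, b):
    """Fraction of positions at which two equal-length windows differ.
    A stand-in distance for illustration only."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_knn_distance(query, references, k):
    """Average distance from the query to its k nearest reference windows."""
    distances = sorted(hamming_distance(query, r) for r in references)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def knn_score(query, positives, negatives, k):
    """Mean NN distance in the positive set divided by that in the negative
    set. Values below 1 indicate the query is closer to known
    phosphorylation sites; values above 1 indicate the opposite."""
    d_pos = mean_knn_distance(query, positives, k)
    d_neg = mean_knn_distance(query, negatives, k)
    if d_neg == 0:
        # Degenerate case: the query is identical to its negative neighbors.
        return 1.0 if d_pos == 0 else float("inf")
    return d_pos / d_neg


if __name__ == "__main__":
    # Toy 7-residue windows (real windows in the text are 25 residues long).
    positives = ["AAASAAA", "AAASAAC"]
    negatives = ["CCCSCCC", "CCCSCCD"]
    print(knn_score("AAASAAA", positives, negatives, k=2))
```

When scoring a sequence that is itself a member of the training set, it would need to be excluded from its own neighbor search; the sketch leaves that bookkeeping to the caller.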
Interestingly, all of the non-phosphoserines' average KNN scores are around 1, which means that, overall, the sequences in the negative set are not predominantly more similar to NNs in either the positive or the negative set. This is not surprising, since phosphorylation-related clusters are unlikely to exist in the negative set, and thus the sequences in the negative set have a similar chance of finding close neighbors in either the positive or the negative set. In short, KNN scores capture the cluster information in phosphoserines, and hence distinguish them from non-phosphoserines. Therefore, KNN scores are suitable to serve as features for phosphorylation site prediction. The prediction performance of KNN scores is demonstrated in Section 3.4.

3.2 Protein Phosphorylation and Disorder
In this section, we demonstrate that phosphoserines in the dataset we used are predominantly overrepresented in disordered regions, which confirms the effectiveness of the disorder scores as features for phosphorylation prediction.

Figures 2(A) and 2(B)
3.3 Amino Acid Frequency Features
In this section, we study the amino acid composition surrounding the phosphoserines. In Figure 3
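The amino acid frequency features of Section 2.2.3 amount to a 20-dimensional composition vector per window. A minimal sketch follows; the alphabetical ordering in `AMINO_ACIDS` and the function name `amino_acid_frequencies` are assumptions for illustration, and residues outside the 20 standard amino acids are simply ignored in the output vector.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(window):
    """Return the relative frequency of each standard amino acid in a
    window, as a list of 20 floats ordered as in AMINO_ACIDS."""
    if not window:
        raise ValueError("window must not be empty")
    counts = Counter(window)
    return [counts[aa] / len(window) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy 25-residue window with a central serine.
    window = "EEPPKKSSEEPP" + "S" + "KKEEPPSSKKEE"
    features = amino_acid_frequencies(window)
    for aa, freq in zip(AMINO_ACIDS, features):
        if freq > 0:
            print(aa, round(freq, 2))
```

Because the central residue is always serine in this dataset, the serine frequency carries a constant offset of 1/25 in every window.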
3.4 SVM Training and Testing
In this study, an SVM was trained as the classifier between phosphoserines and non-phosphoserines. SVMlight Version 6.02 [28] was used. The parameters were optimized as ‘-t 2 -g 1 -c 10 -x 1’, which selects the radial basis function kernel with gamma equal to 1, sets C (the tradeoff between training error and margin) to 10, and computes the leave-one-out estimate. As mentioned in Section 2.1, there are 1671 serines in the positive set and 433,744 in the negative set. Testing of the proposed method was performed using the following procedure:
The above testing procedure was performed 10 times on each separate set of features (amino acid features only, disorder features only, or KNN features only) and on the combined features (all three sets of features together). Table 1 shows the area under the receiver operating characteristic (ROC) curve (AUC) for each test of each set of features, along with the mean AUCs and the standard deviations.

Figure 4
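The AUC used as the evaluation measure above can be computed directly from classifier scores. The sketch below uses the rank-based (Mann-Whitney) definition, in which the AUC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The function names `roc_auc` and `summarize`, and all numbers in the example, are illustrative assumptions and not results from the study.

```python
import statistics


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.
    labels: 1 for positive, 0 for negative. Higher score = more positive.
    Ties between a positive and a negative count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Mean and sample standard deviation over repeated tests."""
    return statistics.mean(aucs), statistics.stdev(aucs)


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but would be replaced by a sort-based computation for a negative set of several hundred thousand sequences. If raw KNN scores were evaluated this way, they would need to be negated first, since lower KNN scores indicate more likely phosphorylation sites.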
Table 1 and Figure 4

The phosphoserine predictor in [10] gave an AUC of around 0.81 on the redundant Arabidopsis TAIR7 protein dataset. It is worth mentioning that on the redundant dataset our method achieved an AUC of 0.84-0.85, as KNN may find sequence neighbors in close homologs of the query protein.

4 Conclusion and Future Work
In this paper, we developed a new approach for predicting protein phosphorylation sites in plants. We treated phosphorylation site prediction as a binary classification problem, and employed machine learning techniques to solve it. Multiple features were first extracted from the dataset, including features from nearest neighbors, protein disordered regions and amino acid frequencies. We demonstrated that phosphoserines in the PhosPhAt dataset are predominantly overrepresented in disordered regions. An SVM was then trained on these features and used to predict phosphorylation sites in new data. Our method combines KNN, which takes advantage of known sequence fragments around phosphorylation sites that are similar to the query protein sequence, with SVM, which accounts for other generic features. Test results show good performance of the proposed phosphorylation prediction method. As more phosphorylation sites are experimentally identified, the accuracy of our method is expected to increase automatically. In future work, we plan to apply our method to phosphothreonines and phosphotyrosines, as well as to the whole proteomes of Arabidopsis and other plant species. We will also develop a standalone application and a web service based on this work.

Acknowledgments
This work was supported by funding from the National Science Foundation-Plant Genome Research Program [grant number DBI-0604439 awarded to JJT] and the National Institutes of Health [grant number R21/R33 GM078601 awarded to DX]. The authors wish to thank Dr. Predrag Radivojac, Dr. Jingfen Zhang, and Zhiquan He for helpful discussion and technical assistance.
References
1. Steen H, Jebanathirajah JA, Rush J, Morrice N, Kirschner MW. Phosphorylation analysis by mass spectrometry: myths, facts, and the consequences for qualitative and quantitative measurements. Mol Cell Proteomics. 2006;5(1):172–181.
2. Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, Mortensen P, Mann M. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell. 2006;127:635–648.
3. Villén J, Beausoleil SA, Gerber SA, Gygi SP. Large-scale phosphorylation analysis of mouse liver. Proc Natl Acad Sci USA. 2007;104:1488–1493.
4. Chi A, Huttenhower C, Geer LY, Coon JJ, Syka JE, Bai DL, Shabanowitz J, Burke DJ, Troyanskaya OG, Hunt DF. Analysis of phosphorylation sites on proteins from Saccharomyces cerevisiae by electron transfer dissociation (ETD) mass spectrometry. Proc Natl Acad Sci USA. 2007;104:2193–2198.
5. Benschop JJ, Mohammed S, O'Flaherty M, Heck AJ, Slijper M, Menke FL. Quantitative Phosphoproteomics of Early Elicitor Signaling in Arabidopsis. Mol Cell Proteomics. 2007;6:1198–1214.
6. Sugiyama N, Nakagami H, Mochida K, Daudi A, Tomita M, Shirasu K, Ishihama Y. Large-scale phosphorylation mapping reveals the extent of tyrosine phosphorylation in Arabidopsis. Mol Syst Biol. 2008;4:193.
7. Diella F, Gould CM, Chica C, Via A, Gibson TJ. Phospho.ELM: a database of phosphorylation sites–update 2008. Nucleic Acids Res. 2008;36(Database issue):D240–D244.
8. Gnad F, Ren S, Cox J, Olsen JV, Macek B, Oroshi M, Mann M. PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol. 2007;8:R250.
9. Tchieu JH, Fana F, Fink JL, Harper J, Nair TM, Niedner RH, Smith DW, Steube K, Tam TM, Veretnik S, Wang D, Gribskov M. The PlantsP and PlantsT Functional Genomics Databases. Nucleic Acids Res. 2003;31:342–344.
10. Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, Schulze WX. PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Res. 2008;36(Database issue):D1015–D1021.
11. Gao J, Agrawal GK, Thelen JJ, Xu D. P3DB: a plant protein phosphorylation database. Nucleic Acids Res. 2009;37(Database issue):D960–D962.
12. Obenauer JC, Cantley LC, Yaffe MB. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003;31(13):3635–3641.
13. Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4(6):1633–1649.
14. Kim JH, Lee J, Oh B, Kimm K, Koh I. Prediction of phosphorylation sites using SVMs. Bioinformatics. 2004;20(17):3179–3184.
15. Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004;32(3):1037–1049.
16. Huang HD, Lee TY, Tzeng SW, Horng JT. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 2005;33(Web Server issue):W226–W229.
17. Xue Y, Li A, Wang L, Feng H, Yao X. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics. 2006;7:163.
18. Neuberger G, Schneider G, Eisenhaber F. pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase substrate binding model. Biol Direct. 2007;2:1.
19. Saunders NF, Kobe B. The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information. Nucleic Acids Res. 2008;36(Web Server issue):W286–W290.
20. Xue Y, Ren J, Gao X, Jin C, Wen L, Yao X. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Mol Cell Proteomics. 2008;7(9):1598–1608.
21. Plewczynski D, Tkacz A, Wyrwicz LS, Rychlewski L, Ginalski K. AutoMotif Server for prediction of phosphorylation sites in proteins using support vector machine: 2007 update. J Mol Model. 2008;14(1):69–76.
22. Dang TH, Van Leemput K, Verschoren A, Laukens K. Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics. 2008;24(24):2857–2864.
23. Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S, Swing V, Tissier C, Zhang P, Huala E. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008;36(Database issue):D1009–D1014.
24. Kennelly PJ, Krebs EG. Consensus sequences as substrate specificity determinants for protein kinases and protein phosphatases. J Biol Chem. 1991;266:15555–15558.
25. Henikoff S. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992;89:10915–10919.
26. Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V, Obradovic Z, Uversky VN. The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics. 2008;9(Suppl 2):S1.
27. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005;61(suppl 7):176–182.
28. Joachims Thorsten. 2008. SVMlight Version 6.0.2. http://svmlight.joachims.org.