- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

# Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences

^{1}College of Chemistry, Sichuan University, Chengdu 610064 and

^{2}State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610041, P.R. China

## Abstract

Compared to the available protein sequences of different organisms, the number of revealed protein–protein interactions (PPIs) is still very limited. So many computational methods have been developed to facilitate the identification of novel PPIs. However, the methods only using the information of protein sequences are more universal than those that depend on some additional information or predictions about the proteins. In this article, a sequence-based method is proposed by combining a new feature representation using auto covariance (AC) and support vector machine (SVM). AC accounts for the interactions between residues a certain distance apart in the sequence, so this method adequately takes the neighbouring effect into account. When performed on the PPI data of yeast *Saccharomyces cerevisiae*, the method achieved a very promising prediction result. An independent data set of 11 474 yeast PPIs was used to evaluate this prediction model and the prediction accuracy is 88.09%. The performance of this method is superior to those of the existing sequence-based methods, so it can be a useful supplementary tool for future proteomics studies. The prediction software and all data sets used in this article are freely available at http://www.scucic.cn/Predict_PPI/index.htm.

## INTRODUCTION

Identification of protein–protein interactions (PPIs) is crucial for elucidating protein functions and further understanding various biological processes in a cell. It has been the focus of the post-proteomic researches. In recent years, various experimental techniques have been developed for the large-scale PPI analysis, including yeast two-hybrid systems (1,2), mass spectrometry (3,4), protein chip (5) and so on. Because experimental methods are time-consuming and expensive, current PPI pairs obtained from experiments only cover a small fraction of the complete PPI networks (6). Hence, it is of great practical significance to develop the reliable computational methods to facilitate the identification of PPIs.

So far, a number of computational methods have been proposed for the prediction of PPIs. Some methods are based on the genomic information, such as phylogenetic profiles (7), gene neighbourhood (8) and gene fusion events (9,10). Methods using the structural information of proteins (11–13) and the sequence conservation between interacting proteins (14,15) have been reported. Previously predicted (known protein) domains that are responsible for the interactions between proteins have also been considered too (16–20). However, all these methods cannot be implemented if such pre-knowledge about the proteins is not available. Several sequence-based methods (21–25) have shown that the information of amino acid sequences alone may be sufficient to identify novel PPIs, but the highest accuracy of these methods is only ~80%, such as the methods by Martin *et al.* (22), and Chou and Cai (25). So Shen *et al.* (26) have developed an alternative method that yields a high prediction accuracy of 83.9%, when applied to predicting human PPIs. This method considers the local environments of residues through a conjoint triad method, but it only accounts for the properties of one amino acid and its proximate two amino acids. However, the interactions usually occur in the discontinuous amino acids segments in the sequence, and the information of these interactions may be able to further improve the prediction ability of the existing sequence-based methods.

In this article, a new method based on support vector machine (SVM) and auto covariance (AC) was proposed. AC accounts for the interactions between amino acids within a certain number of amino acids apart in the sequence, so this method takes neighbouring effect into account and makes it possible to discover patterns that run through entire sequences. The amino acid residues were translated into numerical values representing physicochemical properties, and then these numerical sequences were analysed by AC based on the calculation of covariance. Finally, the SVM model was constructed using the vectors of AC variables as input. The optimization experiment demonstrated that the interactions of one amino acid and its 30 vicinal amino acids would contribute to characterizing the PPI information. The method was tested by the PPI data of yeast *Saccharomyces cerevisiae* and yielded a prediction accuracy of 87.36%. At last, this model was further evaluated by an independent data set of other yeast PPIs with the prediction accuracy of 88.09%.

## MATERIALS AND METHODS

### Data collection and data set construction

The PPI data was collected from *Saccharomyces cerevisiae* core subset of database of interacting proteins (DIP) (27), version DIP_20070219. The reliability of this core subset has been tested by two methods, expression profile reliability (EPR) and paralogous verification method (PVM) (28). At the time of doing the experiments, the core subset contained 5966 interaction pairs. The protein pairs that contained a protein with <50 amino acids were removed and the remaining 5943 protein pairs comprised the final positive data set. All proteins in the data set were aligned using the multiple sequence alignment tool, cd-hit program (29). The aligned result shows that among the 5943 protein pairs, the overwhelming majority of them (5594 PPIs) have <40% pairwise sequence identity to one another. Although there are only 349 pairs with ≥40% identity in the training data set, the classifier will possibly be biased to these homologous sequence pairs.

Since the non-interacting pairs were not readily available, three strategies for constructing negative data set were used in order to compare the effects of different training data sets on the performance of the method. The first strategy has been described by Shen and colleagues (26) in detail. The non-interacting pairs were generated by randomly pairing proteins that appeared in the positive data set. Here the negative data set based on this method is called Prcp. The second is based on such an assumption that proteins occupying different subcellular localizations do not interact. The subcellular localization information of the proteins in the positive data set was extracted from Swiss-Prot (http://www.expasy.org/sprot/). The proteins without subcellular localization information and those denoted as ‘putative’, ‘hypothetical’ were excluded. The remaining proteins were grouped into eight subsets based on the eight main types of localization—cytoplasm, nucleus, mitochondrion, endoplasmic reticulum, golgi apparatus, peroxisome, vacuole and cytoplasm&nucleus. Each subset contained 10 proteins at least. The non-interacting pairs were generated by pairing proteins from one subset with proteins from the other subset. It must be pointed out that proteins from cytoplasm subset and nucleus subset cannot be paired with those from cytoplasm&nucleus subset. Here the negative data set based on subcellular localization information is called Psub. The two strategies must meet three requirements: (i) the non-interacting pairs cannot appear in the whole DIP yeast interacting pairs, (ii) the number of negative pairs is equal to that of positive pairs and (iii) the contribution of proteins in negative set should be as harmonious as possible (24,26).

As a comparison, the third strategy was used for creating non-interacting pairs composed of artificial protein sequences. It has been demonstrated that if a sequence of one interacting pair is shuffled, then the two proteins can be deemed not to interact with each other (30). Thus, the negative data set was prepared by shuffling the sequences of right-side interacting pairs with *k*-let (*k* = 1,2,3) counts using the Shufflet program (31).

### Feature extraction and AC

Protein–protein interaction can be defined as four interaction modes: electrostatic interaction, hydrophobic interaction, steric interaction and hydrogen bond. Here seven physicochemical properties of amino acids were selected to reflect these interaction modes whenever possible and they are hydrophobicity (32), hydrophicility (33), volumes of side chains of amino acids (34), polarity (35), polarizability (36), solvent-accessible surface area (SASA) (37) and net charge index (NCI) of side chains of amino acids (38), respectively. The original values of the seven physicochemical properties for each amino acid are listed in Supplementary Table S1. They were first normalized to zero mean and unit standard deviation (SD) according to Equation (1):

where *P _{i,j}* is the

*j*-th descriptor value for

*i*-th amino acid,

*P*the mean of

_{j}*j*-th descriptor over the 20 amino acids and

*S*the corresponding SD. Then each protein sequence was translated into seven vectors with each amino acid represented by the normalized values of seven descriptors.

_{j}Artificial intelligence-based techniques such as SVM and the neural network require a fixed number of inputs for training. However, there are often unequal-length vectors because of protein sequences with different lengths. So auto cross covariance (ACC) was used to transform these numerical vectors into uniform matrices. As a statistical tool for analyzing sequences of vectors developed by Wold *et al.* (39), ACC has been adopted by more and more leading investigators for protein classification (40–42). ACC results in two kinds of variables, AC between the same descriptor, and cross covariance (CC) between two different descriptors. In this study, only AC variables were used in order to avoid generating too large number of variants, compared to the limited number of PPI pairs. Given a protein sequence, AC variables describe the average interactions between residues, a certain *lag* apart throughout the whole sequence. Here, *lag* is the distance between one residue and its neighbour, a certain number of residues away. The AC variables are calculated according to Equation (2), where *j* represents one descriptor, *i* the position in the sequence *X*, *n* the length of the sequence *X* and *lag* the value of the lag.

In this way, the number of AC variables, *D* can be calculated as *D* = *lg* × *P*, where *P* is the number of descriptors and *lg* is the maximum *lag* (*lag* = 1, 2, … , *lg*). After each protein sequence was represented as a vector of AC variables, a protein pair was characterized by concatenating the vectors of two proteins in this protein pair.

### Model construction

The classification model for predicting PPIs was based on SVM. Vapnik (43) has given a full description about how to use SVM to do classification. The software libsvm 2.84 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was employed in this work. A radial basis function (RBF) was chosen as the kernel function. Two parameters, the regularization parameter *C* and the kernel width parameter *γ* were optimized using a grid search approach. In statistical prediction, sub-sampling test and jackknife test are often used as two cross-validation methods (44). Jackknife test is deemed more objective and has been widely adopted by many investigators (41,45–56) to test the power of various predictors, but it will take much long time to perform the jackknife test. Although it has been demonstrated that the sub-sampling test cannot avoid arbitrariness according to a recent comprehensive review (45) and a penetrating analysis in (57), it is still a good validation method for the large data set. Considering the numerous samples used in this work, 5-fold cross-validation was used to investigate the training set.

The final data set consisted of 11 886 protein pairs, half from the positive data set and half from the negative data set. Here three-fifths of the protein pairs respectively from the positive and negative data set were randomly chosen as the training set (7130 protein pairs) and the remaining two-fifths (4576 protein pairs) were used as the test set. An SVM model was built using the training set and 5-fold cross-validation, and the performance of this model was evaluated by the test set. In order to test the robustness of the method, this process of random selection of training set and test set, model-building and model-evaluating was repeated five times. Thus, five training sets and five test sets were prepared, so five models were generated. Three parameters, sensitivity, precision and accuracy were used to measure the performance of this method. They are defined as follows:

where *TP*, *TN*, *FP* and *FN* represent true positive, true negative, false positive and false negative, respectively.

## RESULTS AND DISCUSSION

### Comparing the prediction performances of different negative data sets

SVM models were constructed using the five negative data sets derived from the three strategies. In this step, *lg* was initialized to be 25 amino acids. Table 1 gives the average prediction results of SVM models using different negative data sets. Using 1-let, 2-let and 3-let shuffled protein sequences as negative data set, the average prediction accuracy is 79.25, 77.30 and 70.25%, respectively. The trend that the prediction accuracy decreases with the increase of *k* is resulted from the fact that the shuffling procedure provides more native-like artificial proteins by conserving higher-order biases. The model based on the negative data set Prcp yields very low prediction accuracy of 58.42%, while the model built with the same strategy by Shen *et al.* (26) achieves a good performance with an accuracy of 83.90%. It is probably due to the different feature representation methods and data sources. Compared to other four models, the model based on the negative data set Psub gives the best performance. The average prediction accuracy, sensitivity and precision are 86.23, 85.22 and 87.83%, respectively, which indicate that this method is successful in predicting PPIs using the non-interacting pairs of non co-localized proteins as the negative data set. However, it is necessary to point out that selecting non-interacting pairs of non co-localized protein will lead to over-optimistic estimates of classifier accuracy, as denoted by Ben-hur and Noble (58).

### Selecting optimal *lg*

The use of AC with large *lag*s will result in more variables that account for interactions of amino acids with large distances apart in the sequence. The maximal possible *lg* is the length of the shortest sequence (50 amino acids) in the data set. In this study, several *lg*s were optimized in order to achieve the best characterization of the protein sequences. Using Psub as the negative data set, nine models were constructed with nine different *lg*s, respectively (*lg* = 5, 10, 15, 20, 25, 30, 35, 40, 45). The prediction results for the nine models are shown in Figure 1. As seen from the curve, the prediction accuracy increases when *lg* increases from 5 to 30, but it slightly fluctuates when *lg* increases from 30 up to 45. There is a peak point with an average accuracy of 87.36% and the *lg* of 30 amino acids. It is concluded that AC with *lg* less than 30 amino acids would lose some useful features of the protein sequences and larger *lg*s could introduce noise instead of improving the prediction power of the model. So the optimal *lg* is 30 amino acids.

### Comparing the performance of AC with that of ACC

After represented by the seven descriptors, a protein pair was converted into a 420-dimensional (2 × 30 × 7) vector by AC with *lg* of 30 amino acids. However, when ACC is used, a protein sequence will be a vector of 2940 dimension (2 × 30 × 7 × 7). To reduce the calculating time, only AC variables were used as the input of SVM. Here, we also used ACC to transform the protein sequences and compared the performance of the model based on ACC with that of the model based on AC. From Table 2, we can see that the model based on ACC transform gives good results with the average sensitivity, precision and accuracy of 89.93, 88.87 and 89.33 ± 2.67%, respectively. However, when the dimension of vector space is dramatically reduced from 2940 to 420 using AC transform, the performance of the model based on AC is very close to that of the model based on ACC. It proves that CC variables only have a little contribution to the performance of the model and AC variables are the principal components of ACC variables.

*lg*of 30 amino acids

So in this work, the optimal model was based on the negative data set Psub and AC transform with *lg* of 30 amino acids. The prediction results for five test sets are listed in Table 2. For all five models, the prediction accuracies are all >86% with a relatively low SD of 1.38%. On average, the sensitivity, precision and prediction accuracy of this model are 87.30, 87.82 and 87.36%, respectively. These results are obtained based on the original data set that contains homologous protein pairs. However, for the statistical predictions, it is absolutely necessary to avoid redundancy and homology bias in the training data set (57). In order to determine the homology effects, the non-redundant data set was constructed by removing the protein pairs with ≥40% pairwise sequence identity from the whole original data set. The performance of the five models based on this non-redundant data set is shown in Supplementary Table S2. The average prediction accuracy of the non-redundant data set is 86.55%.

Two SVM parameters, *C* and *γ* were optimized as 32 and 0.03125. So using the whole data set, the final prediction model was built with the optimal parameters.

### Performance on the independent data set

In order to evaluate the practical prediction ability of the final prediction model, a large independent data set was constructed. In DIP, the yeast data set contained 17 491 interaction pairs, out of which that which contained a protein with <50 amino acids and those appearing in the training data set were all excluded. Among the remaining 11 474 protein pairs, 10 108 PPIs are correctly predicted by the prediction model and the success rate is 88.09%. In this article, the negative training set was generated by selecting non-interacting pairs of non-co-localized proteins. However, Ben-Hur and Noble (58) have denoted that restricting negative examples to non-co-localized protein pairs leads to a biased estimate of the accuracy of a PPI predictor. So it is necessary to generate a test data set of the non-interacting pairs with the same localization to test the effects of this bias. The yeast proteins used in the positive training set were assigned with the seven main types of localization. The non-interacting protein pairs with the same localization were generated and none of them has occurred in the whole DIP yeast interacting pairs. The performance of this method in predicting such negative samples is summarized in Supplementary Table S3. For cytoplasm and nucleus subsets, only 8000 non-interactions were randomly selected from the large-scale data set, respectively. The result shows that the prediction model is able to correctly predict the non-interacting pairs of all subsets with >80% accuracy, except the cytoplasm subset with 77% accuracy and endoplasmic reticulum subset with 69% accuracy. For all 27 204 non-interactions, the total prediction accuracy is 81.46%. In addition, using the model based on the non-redundant data set, the prediction accuracy for 11 474 yeast PPIs is 93.25% and the result of the non-interacting pairs is shown in Supplementary Table S4. All these results demonstrate that this method is also able to predict non-interacting pairs with the same localization.

## CONCLUSION

In this article, we developed a new method for predicting PPIs only using the primary sequences of proteins. The prediction model was constructed based on SVM and AC. Shen *et al.* (26) have denoted that usually the methods with no local environments of amino acids are not reliable and robust, so they proposed a conjoint triad method to consider the properties of each amino acid and its two proximate amino acids. However, in most cases, the long-range interactions are also important for representing the PPI information. In this article, AC was used to involve the information of interactions between amino acids a longer distance apart in the sequence. A protein sequence was characterized by a series of ACs that covered the information of interactions between one amino acid and its 30 vicinal amino acids in the sequence. So this method adequately takes the neighbouring effect into account. As expected, this method improved the prediction accuracy compared with the current methods. Moreover, three different negative data sets were compared and the model trained using non-interacting pairs of non co-localized proteins yielded the best performance with a high accuracy of 87.36%, when applied to predicting the PPIs of *S. cerevisiae*. Meanwhile, the final prediction model was tested using the independent data set of the yeast PPIs with a good performance. Overall, such a robust method will be a useful tool to elucidate the biological function of newly discovered proteins and to expedite the study of protein networks.

## ACKNOWLEDGEMENTS

The authors gratefully thank Eivind Coward for sharing the Shufflet sequence-randomizing code. The work was funded by the National Natural Science Foundation of China (No. 20775052). Funding to pay the Open Access publication charges for this article was provided by the National Natural Science Foundation of China (No. 20775052).

*Conflict of interest statement*. None declared.

## REFERENCES

*k*-let counts. Bioinformatics. 1999;15:1058–1059. [PubMed]

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (131K)

- Prediction of protein-protein interactions from protein sequence using local descriptors.[Protein Pept Lett. 2010]
*Yang L, Xia JF, Gui J.**Protein Pept Lett. 2010 Sep; 17(9):1085-90.* - Predicting protein-protein interactions based only on sequences information.[Proc Natl Acad Sci U S A. 2007]
*Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H.**Proc Natl Acad Sci U S A. 2007 Mar 13; 104(11):4337-41. Epub 2007 Mar 5.* - Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset.[Amino Acids. 2010]
*Shi MG, Xia JF, Li XL, Huang DS.**Amino Acids. 2010 Mar; 38(3):891-9. Epub 2009 Apr 24.* - Two multi-classification strategies used on SVM to predict protein structural classes by using auto covariance.[Interdiscip Sci. 2009]
*Wu J, Li YZ, Li ML, Yu LZ.**Interdiscip Sci. 2009 Dec; 1(4):315-9. Epub 2009 Nov 14.* - Evaluation of different domain-based methods in protein interaction prediction.[Biochem Biophys Res Commun. 2009]
*Ta HX, Holm L.**Biochem Biophys Res Commun. 2009 Dec 18; 390(3):357-62. Epub 2009 Oct 2.*

- Computational prediction of the human-microbial oral interactome[BMC Systems Biology. ]
*Coelho ED, Arrais JP, Matos S, Pereira C, Rosa N, Correia MJ, Barros M, Oliveira JL.**BMC Systems Biology. 824* - Prediction of Protein-Protein Interaction with Pairwise Kernel Support Vector Machine[International Journal of Molecular Sciences...]
*Zhang SW, Hao LY, Zhang TH.**International Journal of Molecular Sciences. 15(2)3220-3233* - Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Na?ve Bayes[PLoS ONE. ]
*Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H.**PLoS ONE. 9(1)e86703* - Computational Prediction of Protein-Protein Interaction Networks: Algo-rithms and Resources[Current Genomics. 2013]
*Zahiri J, Bozorgmehr JH, Masoudi-Nejad A.**Current Genomics. 2013 Sep; 14(6)397-414* - Combining Phylogenetic Profiling-Based and Machine Learning-Based Techniques to Predict Functional Related Proteins[PLoS ONE. ]
*Lin TW, Wu JW, Chang DT.**PLoS ONE. 8(9)e75940*

- PubMedPubMedPubMed citations for these articles
- SubstanceSubstancePubChem Substance links
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree