PLoS One. 2012; 7(2): e30361.
Published online Feb 2, 2012. doi:  10.1371/journal.pone.0030361
PMCID: PMC3271071

TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences

Christian Schönbach, Editor

Abstract

Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.

Introduction

As a result of the completion of whole-genome sequencing projects, the sequence-structure gap is rapidly increasing. In this context, the accurate prediction of protein structure and function from sequences remains a challenging task. A useful intermediate way to address this is to predict one-dimensional structural properties of proteins including secondary structure, solvent accessibility, residue contact number/order, residue depth, and dihedral torsion angles [1]–[12]. For a comprehensive review of recent progress on the development of one-dimensional predictors, refer to Kurgan and Disfani [13]. In the past two decades, most efforts have been made to predict the first three of these properties, leading to ongoing improvements in prediction performance [14]–[16]. However, with respect to torsion angles, there is increasing interest in the field of structural bioinformatics in developing efficient algorithms that are capable of accurately predicting protein backbone torsion angles from amino acid sequences. This is because they can provide a more detailed description of the backbone conformations, which, if known, can significantly reduce the conformational search space and contribute towards the final prediction of protein three-dimensional structures. For example, predicted torsion angles have been applied to improve protein secondary structure prediction [17], [18], protein fold recognition [19]–[21], multiple sequence alignments [22], [23] and fragment-free tertiary-structure prediction [10].

There are three different backbone torsion angles along protein polypeptide chains: ϕ (Phi), ψ (Psi) and ω (Omega), which involve the backbone atoms C-N-Cα-C, N-Cα-C-N and Cα-C-N-Cα, respectively. Due to the planarity of the linked rigid peptide bonds, the two angles Phi and Psi can essentially determine the backbone geometry of proteins. The third angle Omega does not need to be specified as it is almost always fixed at 180° [11]. This means protein local structures can be unambiguously described by their backbone torsion angles [10]. Therefore, if the real values of Phi and Psi of all residues of a given protein are known, it will be more straightforward to reconstruct the protein structure using the standard bond lengths [11]. In addition, protein backbone torsion angles are closely correlated with protein secondary structures [24]. In particular, different secondary structure types are clustered in different regions of the Ramachandran Phi-Psi diagram [25], so it is possible to predict protein secondary structures based on the predicted torsion angle probabilities. Accordingly, predicted torsion angles have been used as a replacement or supplement to secondary structure for refined local-structure predictions and have also been used to construct simplified protein models for sampling efficiency [9], [10].
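As an illustration of the atom quadruples named above, the following minimal sketch computes a signed torsion angle from four atom positions using the standard atan2 formulation. The function names `dihedral`, `phi` and `psi` are hypothetical helpers introduced here for illustration; this is not the DSSP implementation used in the study.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def phi(c_prev, n, ca, c):
    """Phi of residue i: C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)

def psi(n, ca, c, n_next):
    """Psi of residue i: N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)
```

A cis arrangement of the four atoms yields 0°, and a trans arrangement yields 180° in magnitude, which is the value Omega almost always takes.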

Conventionally, torsion angles were predicted as a few discrete states based on the backbone conformation distributions, and various computational algorithms were developed to predict the discrete states of Phi/Psi angle values [26]–[32]. Machine learning techniques are typically used to train and build prediction models, including neural networks [3], [11], [24], support vector machines [11], [24], [32] and hidden Markov models [28], [30]. In this direction, Helles and Fonseca have recently developed an artificial neural network framework to predict the torsion angle probability distribution of coiled residues [33]. Their method achieved prediction accuracy comparable to that of secondary structure prediction (80%) and was significantly better (4–68%) than the baseline statistics. More recently, Kountouris and Hirst have created an SVM-based predictor called DISSPred for multi-state torsion angles and three-state secondary structures. It has achieved a more competitive predictive performance compared with other previously developed classifiers [34]. As a result of the free movement of proteins in three-dimensional space, however, protein backbone torsion angles are actually continuous variables. Although these earlier methods have achieved prediction accuracy of up to 80% [24], [32], [34] based on the arbitrarily defined discrete states, such predictions cannot specify the actual Phi/Psi values for each state, and therefore have limited value in protein structure prediction.

In view of this, in recent years more attention has been given to real-value prediction of both Phi and Psi torsion angles. The first real-value prediction approach, DESTRUCT, was proposed by Wood and Hirst [35]. In their work, they used the PSI-BLAST program [36] to generate position-specific scoring matrices (PSSM), which was further taken as input to train the iterative neural network models and predict one of the two major torsion angles Psi. Nevertheless, the correlation coefficient between predicted and actual values of the Psi angles was only 0.47. Berjanskii et al. developed a web server, named PREDITOR for predicting protein torsion angles [37]. It combines sequence alignment methods with advanced chemical shift data to generate the predicted torsion angles. 88% of Phi/Psi predictions by PREDITOR are located within 30° of the correct values. Wu and Zhang proposed the ANGLOR predictor based on the composite machine-learning algorithm using support vector machines and neural networks, which has achieved a mean absolute error (MAE) of 28°/46° using built models trained on only 500 protein chains [11]. Dor and Zhou developed a method called Real-SPINE that predicts the real values of structural properties of proteins including residue solvent accessibility and backbone torsion angles, based on integrated neural networks [3]. Trained on a large dataset of 2,640 protein chains, Real-SPINE substantially improved the correlation coefficient to 0.62 between the predicted and actual Psi angles (10-fold cross-validation) through large-scale learning with a slow learning rate and over-fitting protection. Real-SPINE 2.0 server [12], Real-SPINE 3.0 [9] and SPINE X [10] were further developed by Zhou's group, with the prediction accuracy continuously improved by guided learning through neural networks and other refinement techniques. 
In addition, using a database of 997 non-redundant NMR structures, they have further developed a neural-network based predictor for the real-valued prediction of Phi and Psi angle fluctuations [38] based on sequence information only. This predictor achieved ten-fold cross-validated Pearson correlation coefficients (CC) of 0.59 and 0.60, and mean absolute errors of 22.7° and 24.3° for the angle fluctuation of ϕ and ψ, respectively [38]. Altogether, the consensus of these studies has been that real-valued torsion angle predictions by state-of-the-art algorithms have the potential to be employed as a replacement of or supplement to secondary-structure prediction tools, and are expected to substantially improve the quality of protein structure prediction when high-confidence predicted torsion angles are applied as constraints.

More recently, Ahmad et al. proposed a novel approach for the simultaneous prediction of eight one-dimensional structural features (including solvent accessibility, helix-helix contact and backbone torsion angles) for helical membrane proteins by using an integrated prediction system called HTM-One [39]. HTM-One has been shown to outperform the respective models that were separately trained on individual features, as evaluated using rigorous leave-one-out jackknife tests on a non-redundant dataset of 286 helical membrane proteins [39]. The results indicate that, compared with the previous practice of training models individually, the performance of one-dimensional predictors can be significantly improved by training them in an integrated manner. This is clearly an important step in the right direction for addressing the issue of how to improve the prediction performance of one-dimensional structural features of proteins from amino acid sequences.

In this study, we propose a new complementary approach to predict the Phi/Psi angles by support vector regression (SVR) learning from sequence information only. We want to take advantage of the excellent ability of SVR to generalize learning rules and predict the raw values of the given samples. The developed TANGLE (Torsion ANGLE) predictor works by integrating multiple local sequence profiles and global sequence features within a two-level SVR learning framework. Features used by TANGLE include multiple sequence alignment profiles retrieved from the position-specific scoring matrix (PSSM), predicted secondary structure, predicted solvent accessibility and predicted native disorder information. Moreover, other global sequence information such as amino acid contents, sequence length and sequence weight are used as the inputs to TANGLE. To improve the prediction accuracy, various combinations of different feature types with different local window sizes are systematically examined and compared. Finally, TANGLE achieves a significantly better prediction accuracy compared to the ANGLOR predictor [11] and a random amino acid-specific predictor when trained and evaluated on a large dataset with 1,989 protein chains. As an implementation of this approach, we have developed the TANGLE webserver for protein backbone torsion angle prediction. This is freely available at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.

Materials and Methods

Datasets

In order to objectively compare our approach with other previously developed approaches, we used the same datasets as originally developed by Wu and Zhang [11], where the PDB entries with any broken chains or missing residues were excluded. Every pair of sequences in this dataset had a pairwise sequence identity of less than 25%. Among them, 500 proteins were used as the training set, while the remaining 1,026 proteins were used as the independent testing set. The total numbers of residues in the training and testing sets were 70,646 and 142,091, respectively.

The experimental values of Phi and Psi torsion angles were calculated by the DSSP program [40]. Because the four residues in the N- and C-terminus lacked four consecutive atoms that were required to form the torsion angles, they were neglected and not included in the prediction analysis. The calculated Phi/Psi angles by DSSP can be downloaded from our TANGLE website: http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/links.

We normalized the original Phi and Psi angles using their averages and standard deviations based on the whole training dataset, to make most of their values fall within the range between 0 and 1, as suggested previously [5]–[7]. In the training stage, the prediction models were trained based on the normalized values of Phi and Psi, instead of the original values. In the prediction stage, we first predicted the normalized Phi and Psi angles from primary sequences in the independent test set, and then recovered the absolute Phi and Psi angles from their respective predicted normalized values. The calculated Phi and Psi angles in the training set of 500 protein chains can be found in Datasets S1 and S2, respectively, while the calculated Phi and Psi angles in the testing set of 1,026 protein chains can be found in Datasets S3 and S4, respectively.
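The normalize-then-recover procedure can be sketched as follows. The exact normalization formula is not given in the text, so this sketch assumes a plain mean and standard deviation transform fitted on the training angles only; the function names are hypothetical.

```python
import statistics

def fit_normalizer(train_angles):
    """Return (mean, std) estimated from training-set angles only."""
    mean = statistics.mean(train_angles)
    std = statistics.pstdev(train_angles)
    if std == 0:
        raise ValueError("training angles have zero variance")
    return mean, std

def normalize(angles, mean, std):
    """Map raw angles (degrees) to normalized values (assumed z-score form)."""
    return [(a - mean) / std for a in angles]

def denormalize(values, mean, std):
    """Recover angles in degrees from predicted normalized values."""
    return [v * std + mean for v in values]
```

The same `mean` and `std` obtained from the training set would be reused to recover angles for the test set, so that no test-set statistics leak into the model.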

Performance Evaluation

To measure the performance of real-valued torsion angle predictions, we calculated three different measures, the Pearson correlation coefficient, the mean absolute error and root mean square error between predicted and observed Phi and Psi torsion angles.

The Pearson's correlation coefficient (CC) between the predicted and observed torsion angle values is defined as:

\[ CC = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}} \]
(1)

where $x_i$ and $y_i$ are the observed and predicted torsion angle values of the i-th residue, respectively, $\bar{x}$ and $\bar{y}$ are their corresponding means and N is the total number of residues in a protein sequence. CC = 1 indicates that the two sets of values are fully correlated, while CC = 0 indicates that they are completely uncorrelated.

The mean absolute error (MAE) is defined as the average difference in angle degrees between the predicted and the observed torsion angles of all residues, i.e.

\[ MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i-x_i\right| \]
(2)

The root mean square error (RMSE) is given by:

\[ RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i-x_i)^2} \]
(3)

Two RMSE measures were calculated in this study: RMSE_norm and RMSE_raw. The former was calculated based on the normalized values of Phi/Psi angles, while the latter was calculated based on the original (raw) values of Phi/Psi angles. In addition, the CC, RMSE_norm, RMSE_raw and MAE measures were calculated on both the protein chain and residue level, respectively.
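The three measures can be computed directly from paired lists of observed and predicted angles. The sketch below follows the textual definitions (plain differences, with no special handling of angular periodicity, which is an assumption here); function names are illustrative.

```python
import math

def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(observed)
    mx = sum(observed) / n
    my = sum(predicted) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(observed, predicted))
    sx = math.sqrt(sum((x - mx) ** 2 for x in observed))
    sy = math.sqrt(sum((y - my) ** 2 for y in predicted))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(y - x) for x, y in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((y - x) ** 2 for x, y in zip(observed, predicted)) / len(observed)
    )
```

Passing normalized values to `rmse` would give RMSE_norm, and passing raw angles in degrees would give RMSE_raw; calling the functions per chain or on all residues pooled gives the chain-level and residue-level variants.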

Support vector regression (SVR)

Support vector machine (SVM) is a sophisticated supervised machine learning technique based on statistical learning theory [41], [42]. SVM is especially effective when the input data is not linearly separable and the kernel function is required to map the data into a higher dimensional space to find the optimal separating hyperplane. In practice, SVM has two modes: support vector classification (SVC) and support vector regression (SVR). Due to its excellent regression ability, SVR has been applied to predicting accessible surface area [43], contact number [5], [44], B-factor [45], residue depth [8], disulfide connectivity [46], caspase cleavage site [47], gene expression level [48], missing value estimation in microarray data [49], peptide-MHC binding affinity [50], siRNA efficacy [51], gene selection [52], domain boundary [53], and antigenic epitope [54].

In the present study, we use SVR (implemented in the SVM_light package, available at http://svmlight.joachims.org/) to predict torsion angle values from amino acid sequences. We selected the radial basis function (RBF) kernel with ε = 0.01, γ = 0.01 and C = 5.0 to build the models for both the first-level and second-level SVR in TANGLE. This combination of parameters provided the best prediction performance in a preliminary analysis that compared different combinations of C and ε. In the following analysis, we fixed ε at 0.01, γ at 0.01 and C at 5.0 to evaluate the prediction performance of the other sequence encoding schemes. The selection of SVM parameters and of features (using different sliding window sizes) was done using only the training dataset.
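To make the roles of γ, ε and C concrete, the sketch below writes out the RBF kernel, the ε-insensitive loss that SVR minimizes, and the form of a trained SVR's prediction function. It is a didactic illustration with hypothetical names, not the SVM_light code; the support vectors, coefficients and bias in `svr_predict` are assumed to come from an already trained model.

```python
import math

GAMMA = 0.01    # RBF kernel width parameter, as reported in the text
EPSILON = 0.01  # half-width of the insensitive tube, as reported in the text
C = 5.0         # trade-off between flatness and training error (used in training only)

def rbf_kernel(x, z, gamma=GAMMA):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def eps_insensitive_loss(y_true, y_pred, epsilon=EPSILON):
    """Errors smaller than epsilon cost nothing; larger errors cost linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def svr_predict(x, support_vectors, coeffs, bias, gamma=GAMMA):
    """f(x) = sum_k coeff_k * K(sv_k, x) + bias for a trained model."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coeffs)) + bias
```

With a small γ such as 0.01 the kernel decays slowly with distance, so each support vector influences predictions over a broad region of the feature space.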

Two-level support vector regression approach of TANGLE

In this section, we will describe the design of our two-level TANGLE approach that uses two SVR predictors in cascade for predicting protein backbone torsion angles from protein primary sequences. In TANGLE, the first level accepts all the sequence-derived features as inputs to SVR and outputs the initially predicted torsion angles. The second level accepts the torsion angles initially predicted by the first-level SVR predictor and outputs the final refined torsion angles. As the torsion angles of a residue at a particular position in the sequence depend on the local structure of its neighboring residues, introducing another layer of SVR predictor that incorporates the contextual relationship of torsion angles in the proximal neighborhood can potentially enhance the torsion angle prediction of that residue [55]. The idea of designing a two-level SVR approach has been proposed in previous studies of predicting protein solvent accessibility [55]–[57], residue B-factors [58], as well as analyzing condition-specific regulatory networks [59], where use of two-level SVR has been demonstrated to improve the robustness of the prediction system and enhance prediction accuracy.

In this study, we are interested in investigating the influence of various sequence features and their combinations on the prediction performance of torsion angles, within the two-level SVR framework. Figure 1 illustrates the flowchart of our two-level TANGLE approach. As can be seen, there are six different types of sequence-derived features that will be used as inputs to the first-level SVR. These features include (1) position-specific scoring matrices (PSSM) [36]; (2) PSIPRED-predicted secondary structure [60]; (3) SCRATCH-predicted solvent accessibility [61]; (4) DISOPRED2-predicted native disorder [62] and two other global features including (5) sequence length and (6) sequence weight [5]–[8]. A detailed description of these features and their extraction and encoding procedures is provided in the following “Sequence encoding schemes” section.

Figure 1
The architecture of TANGLE for protein backbone Phi and Psi angle predictions.

The second-level SVR takes the predicted output of the first-level SVR in order to further enhance the prediction of torsion angles. Previous studies have indicated that the use of a second-level SVR in cascade can improve the prediction accuracy by capturing the contextual relationships underlying protein structural property values like solvent accessibility and B-factors from the output of the first-level SVR [55]–[58]. Note that in both the first- and second-level SVR predictors, the features for a residue of interest are encoded into SVR input vectors using a sliding local window approach. This will be briefly discussed in the following section.
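One way to picture the second-level input is as a sliding window over the first-level predictions, so that each residue is described by the initial predictions of its neighbors. The sketch below makes two assumptions not stated in the text: positions beyond the chain termini are padded with a constant, and the window half-size is a free parameter. The function name is hypothetical.

```python
def window_over_predictions(first_level_preds, half_window, pad_value=0.0):
    """Build one second-level input vector per residue.

    Each vector holds the first-level predictions for the residue and its
    `half_window` neighbors on either side (padding is an assumption).
    """
    n = len(first_level_preds)
    rows = []
    for i in range(n):
        row = []
        for j in range(i - half_window, i + half_window + 1):
            row.append(first_level_preds[j] if 0 <= j < n else pad_value)
        rows.append(row)
    return rows
```

For a chain of three residues with `half_window=1`, each residue receives a three-element vector, with the padding value filling the positions that fall outside the chain.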

Sequence encoding schemes

Selecting appropriate sequence encoding schemes is an important step as it determines the quality of feature extraction for the SVR models and thus has a significant impact on the prediction performance. In this section, we describe in more detail how to extract and encode the different types of sequence features.

Position-specific scoring matrices (PSSMs) in the form of PSI-BLAST profiles

The position-specific scoring matrix (PSSM) of a residue in the form of a PSI-BLAST profile contains important evolutionary information that indicates whether this residue is conserved in its family of related proteins. Each element in the PSSM reflects the likelihood of a particular residue type occurring at a given position in the multiple sequence alignment. Numerous previous studies have shown that multiple sequence alignments in the form of PSSMs can significantly improve overall prediction performance [63]–[77].

In this study, we obtained the PSSM profile for each sequence in the datasets by running a PSI-BLAST search and encoded each residue using a local sliding window approach based on the PSSM profiles. PSI-BLAST was run for three iterations against the non-redundant NCBI nr database using the default E-value cutoff to obtain the PSSM profiles. All the elements in the PSSM profiles were divided by 10 for normalization, so that most of the values fell within the range of 0 to 1. For a given residue, its local sequence fragment was extracted and encoded as a 20×(2l+1)-dimensional vector using a sliding window scheme, where l denotes the half window size and L = 2l+1 is the full window length (see Figure 2 for extraction and encoding). In order to select the optimal local window size L for the Phi and Psi angle prediction, we evaluated the prediction performance of a variety of local window sizes L, ranging from 3 to 21. In summary, in this encoding scheme, a residue was encoded by a 20×L = 20×(2l+1)-dimensional vector.
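The window encoding described above can be sketched as follows. The sketch assumes the PSSM is given as a list of 20-score rows (one per residue) and that window positions falling outside the chain are filled with zeros; the handling of chain termini is not specified in the text, and the function name is hypothetical.

```python
def encode_pssm_window(pssm, center, half_window, scale=10.0):
    """Encode residue `center` as a 20 * (2 * half_window + 1) vector.

    pssm: list of rows, each holding the 20 PSSM scores of one residue.
    Scores are divided by `scale` (10 in the text) for normalization.
    """
    n = len(pssm)
    vector = []
    for j in range(center - half_window, center + half_window + 1):
        if 0 <= j < n:
            vector.extend(score / scale for score in pssm[j])
        else:
            vector.extend([0.0] * 20)  # zero padding: an assumption
    return vector
```

With `half_window=4` (L = 9) each residue is described by 180 values, and with `half_window=10` (L = 21) by 420 values.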

Figure 2
A sliding window approach is employed to extract and encode local profiles into the first-level SVR model of TANGLE.

Predicted secondary structure information by PSIPRED

The PSIPRED program was chosen to predict the secondary structure information. PSIPRED is an accurate neural network-based predictor for the prediction of secondary structure with an accuracy of up to 80% [60]. The output of PSIPRED includes three-state (helix/strand/loop) prediction and probability scores for each secondary structure type. The users can submit a protein sequence and receive the prediction result both textually via e-mail and graphically via the webserver. In our previous work, we have shown that incorporation of PSIPRED-predicted secondary structure information can significantly improve the prediction performance [6]–[8].

Similarly, for a given residue, its three-state secondary structure profile was extracted and encoded using a sliding window of L = 2l+1 (l = 1, 2, 3, …, 10) consecutive residues. Therefore, in this encoding scheme, a residue was encoded by a 3×L = 3×(2l+1)-dimensional vector.

Predicted solvent accessibility information by SCRATCH

The SSpro program in the SCRATCH software package [61] was used to predict the solvent accessibility of each residue in the datasets. SSpro yields the predicted solvent accessibility status for a residue in a binary format, either as “exposed” or “buried”. The predicted solvent accessibility has been shown to improve the accuracy of predicting natively unstructured regions [78], [79] or loops [80], DNA-binding sites [66], as well as protein interaction hotspots [67]. In this encoding scheme, a residue was encoded by a 2×L = 2×(2l+1)-dimensional vector.

Predicted native disorder information by DISOPRED2

In recent years, researchers have realized that natively disordered regions are commonly responsible for important protein function. As such, there has been an increasing interest in studying such regions in proteins. Natively disordered or unstructured regions are found to be associated with molecular assembly, protein modification and molecular recognition [81]–[83]. Therefore, inclusion of this feature into the SVR models could potentially improve the performance of torsion angle prediction. In previous work, native disorder features have been used to enhance the prediction performance on caspase cleavage sites [46] and phosphorylation sites [84].

In this study, we used the DISOPRED2 server, which was developed using neural networks and is considered to be one of the best predictors of natively unstructured or disordered regions [62]. DISOPRED2 outputs the predicted probability of each residue being natively disordered or ordered, which was extracted and input into the SVR models. In this encoding scheme, a residue was encoded by a 2×L = 2×(2l+1)-dimensional vector.

Other global sequence features

In addition to the sequence and structural features discussed above, we also included some representative global sequence features such as the compositions of the twenty amino acids, sequence length and sequence weight (Figure 1) and incorporated them into the SVR models of TANGLE. These complement the local features. Previous studies have indicated that inclusion of these global sequence features can help to further improve prediction performance in a number of different real-value prediction tasks, namely prediction of residue contact number [5], residue-contact order [7], disulfide connectivity pattern [46], half-sphere exposure [6] and residue depth [8].

To comprehensively investigate the influence of each feature type and improve the prediction performance, we train SVR models using six different sequence encoding schemes. For brevity, we refer to the encoding schemes based on PSI-BLAST profile, PSIPRED-predicted secondary structure, SCRATCH-predicted solvent accessibility, DISOPRED-predicted native disorder and all the combined sequence features, as ‘PB’, ‘PP’, ‘SC’, ‘DISO’ and ‘ALL’, respectively. As more feature types are considered, the dimensionality of the input vector increases accordingly. In the case of the sequence encoding scheme “PB+PP+SC+DISO”, the total number of vector dimensions is (20×L+3×L+2×L+2×L) = 27L. For example, for a local window size of L = 9, each residue is characterized by a 243-dimensional vector.

The sliding window approach to extract the local sequence and structural profiles

For residue encoding, a sliding window approach was used to extract the local sequence profile of each residue in the datasets. For sequence encoding schemes based on feature combinations, the extracted local profiles of various feature types will be further concatenated to generate the SVR inputs. Figure 2 illustrates how to extract local sequence profiles using this sliding window approach in TANGLE, taking sequence encoding scheme “PB+PP+SC+DISO” as an example.
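The concatenation of local profiles and the resulting input dimensionality can be sketched as below. The per-residue widths (20, 3, 2 and 2) come from the encoding schemes described in the text; the concatenation order (feature type by feature type), the zero padding at the chain termini and all names are assumptions made for illustration.

```python
# Per-residue width of each local feature type, as given in the text.
FEATURE_WIDTHS = {"PB": 20, "PP": 3, "SC": 2, "DISO": 2}

def scheme_dimension(scheme, window_length):
    """Input dimension of a scheme such as 'PB+PP+SC+DISO' for window L."""
    return window_length * sum(FEATURE_WIDTHS[name] for name in scheme.split("+"))

def encode_residue(profiles, scheme, center, half_window):
    """Concatenate the windowed profiles of every feature type in `scheme`.

    profiles: dict mapping a feature name to a list of per-residue rows.
    """
    vector = []
    for name in scheme.split("+"):
        rows = profiles[name]
        width = FEATURE_WIDTHS[name]
        for j in range(center - half_window, center + half_window + 1):
            if 0 <= j < len(rows):
                vector.extend(rows[j])
            else:
                vector.extend([0.0] * width)  # zero padding: an assumption
    return vector
```

For example, `scheme_dimension("PB+PP+SC+DISO", 9)` evaluates to 243, matching the 27L count for a window of nine residues.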

Results

Statistical distribution of Phi and Psi angles

The distribution of Phi and Psi angles is displayed using the Ramachandran plot, as shown in Figure 3. This distribution is calculated using the training set with 500 PDB structures containing 70,646 residues. It is apparent that Phi and Psi angles have different distribution patterns: the former has only one peak around −70°, while the latter has two peaks around −50° and 130°, respectively. As discussed previously, the single-peak distribution of Phi angles and the double-peak distribution of Psi angles in the Ramachandran plot result in different degrees of uncertainty and therefore different prediction accuracy for the Phi and Psi angles [11]. Due to their double-peak distribution, Psi angles are more difficult to predict than the single-peak Phi angles, which is reflected by higher MAE and RMSE values for Psi angles and lower values for Phi angles.

Figure 3
The Ramachandran plot and histogram distributions of Phi and Psi angles for all residues in the training set of 500 proteins.

The distribution of Phi/Psi torsion angles shows strikingly different patterns between different secondary structure types. As can be seen from Figure 3, most residues in alpha-helices are located within a narrow range of Phi and Psi angles. The populated area of alpha-helix residues is in the range of −150°<Phi<−20° and −100°<Psi<45°. In the case of beta-strand residues, the two most populated areas are in the range of −150°<Phi<−20° and −100°<Psi<45°, and the range of −150°<Phi<−20° and −100°<Psi<45°, respectively. In contrast to alpha-helix and beta-strand residues, coil residues populate a much broader and more diverse area, indicating that the torsion angles of coil residues are very flexible and that there are no apparent recurrent patterns like those in alpha-helices and beta-strands. This makes it more difficult to predict their Phi and Psi angles [33]. In the case of proline residues, the majority are found in the most populated area with torsion angles (Phi, Psi) of roughly (−75°, 150°), corresponding to the polyproline II helix. In summary, the distribution patterns of torsion angles reflect the internal steric constraints that shape the different types of secondary structures.

Effect of different local window size on the prediction performance

In this section, we chose different local window sizes and calculated the resulting prediction performance in order to examine the effect of the local window size using PSI-BLAST profiles. The performance achieved is shown in Table 1. As increasing the local window size provides more local information, it is reasonable to expect that prediction performance would increase with the enlargement of the window size. It is also expected that prediction performance would begin to decrease beyond a certain window size, as a larger window also includes more noise. From Table 1, we find that this is indeed the case. At a local window size L = 9, the SVR model achieved the best prediction performance for the Phi angle prediction, with a CC of 0.486 and MAE of 29.92. In the case of Psi angle prediction, using local window size L = 13 led to the best prediction accuracy of CC = 0.581 and MAE = 55.38. However, L = 9, 11 and 13 have very similar effects on the prediction performance in terms of the CC, RMSE and MAE measures. Consequently, in the following analysis, we selected all three window sizes for comparing the performance of different sequence encoding schemes.

Table 1
Predictive performance of Phi and Psi angles based on different local window sizes using the PSI-BLAST profile.

Effect of different sequence encoding schemes on the predictive performance

Based on the extracted sequence and predicted structural profiles, we further developed two-level SVR models using different combinations of these profile features, as described in the Methods Section. The prediction performance of Phi and Psi angles by this two-level TANGLE approach on the testing set of 1,026 proteins can be found in Datasets S5 and S6, respectively.

Table 2 compares the prediction performance between six different sequence encoding schemes on the testing dataset with 1,026 protein chains. As shown in Table 2, the sequence encoding scheme “PB+PP” that combines evolutionary information in the form of PSI-BLAST profiles (“PB”) with predicted secondary structure information by PSIPRED (“PP”) achieved the best overall results for Phi angle prediction. The TANGLE model based on this encoding scheme achieved an overall CC of 0.529, RMSE of 46.72 and MAE of 27.85. This is better than the other sequence encoding schemes. In addition, two other sequence encoding schemes, “PB+PP+SC” and “PB+PP+DISO”, achieved similar results, with the same CC value of 0.528, and slightly different MAE values of 27.87 and 27.89, respectively. These results, however, are slightly worse than those of the best sequence encoding scheme “PB+PP”.

Table 2
Prediction performance of Phi and Psi angles using the SVR predictors based on eight different sequence encoding schemes that incorporate various combinations of different types of sequence and structural features.

For Psi angle prediction, the sequence encoding scheme “PB+PP+SC”, which integrates the PSI-BLAST profile with predicted secondary structure and solvent accessibility information, achieved the best overall results. This encoding scheme achieved a CC of 0.654, an RMSE of 69.45 and an MAE of 44.64 between the predicted and observed Psi angles (Table 2). These results suggest that using predicted secondary structure information in combination with PSI-BLAST profiles greatly enhanced the prediction of Phi and Psi torsion angles, which is reasonable given the strong correlations between torsion angle distribution and regular secondary structure types such as alpha-helices and beta-strands. In addition, the higher RMSE and MAE values for Psi angle prediction, compared with those for Phi, again confirm that Psi angles are more difficult to predict.

We further incorporated the predicted solvent accessibility profile (“SC”) into the two-level SVR models. We found that this information is particularly helpful for improving the prediction performance of Psi angles. However, it is not very useful for Phi angle prediction. We also investigated whether inclusion of predicted native disorder information (“DISO”) would further improve the prediction performance of torsion angles. It is somewhat surprising that this information actually decreases the prediction accuracy, as reflected by lower CC and higher MAE values after incorporation of such features into the two-level SVR models. This suggests that the predicted native disorder profile is not helpful in improving the prediction quality of the Phi/Psi angles.

To measure the prediction performance at the protein chain level, we calculated the CCs between the predicted and observed Phi/Psi angles for each protein chain in the testing dataset, as shown in Figure 4. More than 50% of protein chains have a CC of 0.6 or more, and at least 70% of proteins have a CC of at least 0.5. We further analyzed the distribution of MAEs that were averaged on each protein chain, in relation to the observed Phi/Psi angles. This is shown in Figure 5. Residues with Phi angles in the range of 100° to 160° and residues with Psi angles in the range of −180° to −100° have relatively large MAEs, indicating that the predicted Phi/Psi angles for these residues have greater errors. This is both because higher magnitude values will tend to have higher magnitude MAEs and because these residues are under-represented in the current datasets, so that the SVR models cannot be trained well on the inadequate number of data points available. In comparison, residues in the most populated areas in the Ramachandran plot (Figure 3) have the smallest MAEs, e.g. those with Phi angles in the range of −140° to −60° and those with Psi angles in the range of −60° to 120° (Figure 5).

Figure 4
The distributions of correlation coefficients (CCs) of the Phi and Psi angle prediction for 1,026 protein chains in the testing dataset.
Figure 5
The mean absolute errors (MAEs) between the predicted and observed Phi and Psi angles, as a function of the observed angles, divided into bins with equal size of 20°.
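The two analyses behind Figures 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: the error is taken as the plain absolute difference in degrees with no wrap-around at ±180°, angles lie in [−180°, 180°], and the function names `pearson_cc` and `binned_mae` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson_cc(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Pearson correlation coefficient between predicted and observed angles.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def binned_mae(pred: Sequence[float], obs: Sequence[float],
               bin_width: float = 20.0) -> Dict[Tuple[float, float], float]:
    """Mean absolute error grouped by bins of the observed angle."""
    n_bins = int(360.0 // bin_width)
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for p, o in zip(pred, obs):
        # Clamp so that an observed angle of exactly 180 falls in the last bin.
        idx = min(int((o + 180.0) // bin_width), n_bins - 1)
        sums[idx] = sums.get(idx, 0.0) + abs(p - o)
        counts[idx] = counts.get(idx, 0) + 1
    result = {}
    for idx in sorted(sums):
        lower = -180.0 + idx * bin_width
        result[(lower, lower + bin_width)] = sums[idx] / counts[idx]
    return result


# Toy data for one chain (values in degrees, invented for illustration).
obs = [-170.0, -165.0, 50.0]
pred = [-160.0, -175.0, 80.0]
bins = binned_mae(pred, obs)
cc = pearson_cc(pred, obs)
```

Computing `pearson_cc` separately for every chain gives a distribution of the kind summarized in Figure 4, while `binned_mae` over all residues gives a per-bin error profile of the kind plotted in Figure 5.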

In Table 3, we provide the MAEs of the Phi/Psi angle predictions for residues grouped by the twenty residue types, the three secondary structure types and the two solvent accessibility states. It is generally accepted that coils are much more flexible and tend to adopt a greater variability of torsion angles. Accordingly, the MAE values of the coil residues are much higher than those of alpha-helix and beta-strand residues (Table 3). Overall, alpha-helix residues have the smallest MAEs (9.9° for Phi and 18.7° for Psi angle), while coil residues have the largest MAE values (40.8° for Phi and 66.0° for Psi angle). The difficulty of torsion angle prediction for different secondary structure types, as evaluated by MAE values, is closely related to the complexity of the torsion angle distribution (Figure 3) [11].

Table 3
Prediction performance comparison of TANGLE with ANGLOR and the random amino acid-specific predictor.

Moreover, because the degree of steric collision between the side chain and the main chain differs among amino acids, different amino acid types are expected to have different MAEs, which in turn could reflect the varying difficulty of torsion angle prediction [11]. Taking this into consideration, we examined the prediction performance of TANGLE for the twenty amino acid types and calculated their MAE values, as shown in Table 3. Among them, glycine has the largest prediction error, with an MAE of 84° for Phi and 77° for Psi. This is not surprising because glycine has no side chain atom except for a proton, meaning that this amino acid has little geometrical restriction on its backbone torsion angle rotations. Proline is a special amino acid due to the presence of a distinctive cyclic structure in its side chain. Its Phi angle, which is almost locked at approximately −75°, restricts the backbone rotation in the direction of the Phi angle. This gives proline exceptional conformational rigidity compared to other amino acids. On the other hand, because it does not have an amide proton, the inclination of its side chain towards the nitrogen atom results in nearly no steric restriction in the direction of the Psi angle. As a result, proline has the smallest MAE for the Phi angle (13.6°), but the second largest MAE for the Psi angle (59°).

We further divided the residues into two types (buried or exposed) according to the conventional two-state solvent accessibility. The assignment of two-state solvent accessibility was based on the prediction results by the SCRATCH program [61]. From Table 3, we found that the buried residues have relatively smaller MAE values (24.1° for Phi and 40.2° for Psi, respectively) than exposed residues (30.7° for Phi and 47.0° for Psi, respectively). This indicates that the torsion angles of the exposed residues are more difficult to predict than the buried residues. It is worth mentioning that this result is consistent with previous work [11]. The reason might be that residues buried in the core regions of protein structures have less flexibility and more rigid structural constraints compared with exposed residues located on protein surfaces.
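A breakdown of the kind reported in Table 3 amounts to grouping per-residue errors by a categorical label. The sketch below assumes one label per residue (a residue type, a secondary structure class such as H/E/C, or a buried/exposed state) and uses the plain absolute difference in degrees; the function name `mae_by_group` and the toy values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def mae_by_group(labels: Sequence[Hashable], pred: Sequence[float],
                 obs: Sequence[float]) -> Dict[Hashable, float]:
    """Mean absolute error of the predicted angles within each label group."""
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for label, p, o in zip(labels, pred, obs):
        totals[label] += abs(p - o)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy example: H = helix, E = strand, C = coil (angles in degrees, invented).
labels = ["H", "H", "E", "C"]
pred = [-60.0, -65.0, -120.0, -90.0]
obs = [-62.0, -60.0, -130.0, -50.0]
per_class = mae_by_group(labels, pred, obs)
```

The same function applies unchanged to any of the three groupings, since only the label sequence differs.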

Performance comparison with other approaches

The work most closely related to the present study is that of Wu and Zhang, who presented ANGLOR, a neural network and support vector machine-based predictor of real-valued torsion angles from primary sequences [11]. ANGLOR is based on three types of sequence-derived features: position-specific scoring matrices (PSSMs), predicted secondary structure and solvent accessibility information. We compared the prediction performance of our TANGLE predictor with that of ANGLOR.

Another state-of-the-art predictor, HTM-One, is an integrated model that was specifically developed to predict eight one-dimensional structural features (including Phi and Psi torsion angles) for membrane proteins only [39], whereas TANGLE is a two-stage model that was trained to predict protein backbone torsion angles. Because of the distinct properties of membrane proteins, it is infeasible to make a fair comparison of the predictive capabilities of HTM-One and TANGLE. In terms of the advantages and disadvantages of the integrated model versus the two-stage model, the integrated model is more likely to avoid overfitting because it uses various kinds of training data. Further, the integrated model may be particularly useful when the availability of protein data is limited, because it can use various features for training. However, it is difficult to use SVM or SVR for integrated model learning, because standard SVM/SVR is designed to predict a single feature. Thus, it is difficult to apply the integrated approach to problems for which SVM/SVR is very useful.

We note that a rigorous comparison with other available tools is meaningful only when they are developed and tested on the same training and testing datasets. As we used exactly the same training and testing datasets as those used in developing ANGLOR, we could directly compare the performance of the two tools. In addition, we compared TANGLE with a random amino acid-specific predictor, which was built by randomly assigning Phi/Psi angles to a residue from an amino acid-specific pool collected from the 500 protein chains in the training dataset, as suggested in [11]. Intuitively, this amino acid-specific random predictor is able to provide more accurate torsion angle prediction than a completely random predictor that does not take amino acid type information into account. The randomization process for assigning Phi/Psi angles to each predicted residue in the testing dataset of 1,026 protein chains was repeated 10,000 times to achieve a stable predicted angle distribution [11]. The performance comparison between these three predictors is presented in Table 3.
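The idea of the random amino acid-specific baseline can be sketched as follows. This is an illustration only: the way the repeated draws are summarized here (averaging the MAE over repeats) is an assumption, as are the function names and the toy data; the procedure actually used follows [11].

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def build_pools(training: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Collect the observed angles of the training residues by amino acid type."""
    pools: Dict[str, List[float]] = defaultdict(list)
    for aa, angle in training:
        pools[aa].append(angle)
    return pools


def random_predict(sequence: str, pools: Dict[str, List[float]],
                   rng: random.Random) -> List[float]:
    """Assign each residue an angle drawn at random from its own type's pool.

    Fails with IndexError if a residue type has no training examples.
    """
    return [rng.choice(pools[aa]) for aa in sequence]


def mean_random_mae(sequence: str, observed: Sequence[float],
                    pools: Dict[str, List[float]], repeats: int,
                    seed: int = 0) -> float:
    """Average MAE of the random baseline over repeated draws (assumed summary)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        pred = random_predict(sequence, pools, rng)
        total += sum(abs(p - o) for p, o in zip(pred, observed)) / len(observed)
    return total / repeats


# Toy pools with a single angle per residue type, so the draw is deterministic.
pools = build_pools([("G", 80.0), ("P", -75.0)])
baseline_mae = mean_random_mae("GP", [90.0, -70.0], pools, repeats=100)
```

Because every draw respects the residue type, such a baseline already captures type-specific preferences such as the restricted Phi angle of proline, which is why it is a stricter reference than a fully random predictor.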

Overall, for Phi angle prediction, TANGLE (MAE = 27.8° for all residues) performs better than the random amino acid-specific predictor (MAE = 33.8° for all residues) and also outperforms ANGLOR (MAE = 28.2° for all residues). In particular, the prediction of TANGLE is significantly better than that of the random amino acid-specific predictor, with p-values <1.46e-147 and 7.97e-150 for Phi and Psi angle prediction, respectively, by the Wilcoxon signed rank test. For Psi angle prediction, the accuracy of TANGLE (MAE = 44.6° for all residues) is significantly higher than that of the random amino acid-specific predictor (MAE = 80.9° for all residues) and also higher than that of ANGLOR (MAE = 46.4° for all residues). At the level of individual amino acid types, the MAE of TANGLE is significantly smaller than that of the random predictor for all twenty amino acid types. At the secondary structure level, the MAE of TANGLE is also smaller than that of the random predictor for all three secondary structure types.
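For readers unfamiliar with the Wilcoxon signed rank test, a minimal pure-Python sketch is given below. It assumes that the paired samples are error values of two predictors on the same items (for example, per-chain MAEs), which is an assumption about the pairing rather than a detail stated here. It uses the normal approximation without tie or continuity corrections, so it is suitable only for reasonably large samples; in practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be preferred.

```python
import math
from typing import List, Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W+, z, two-sided p) for paired samples x and y.

    Zero differences are dropped and tied absolute differences share
    their average rank. The p-value comes from the normal approximation.
    """
    diffs: List[float] = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, z, p


# Invented per-chain MAEs: the model is better than the baseline on every chain.
model = [25.0, 27.5, 30.1, 22.4, 28.8, 26.3, 31.0, 24.9, 29.2, 27.0]
baseline = [m + 5.0 + k for k, m in enumerate(model)]
w_plus, z, p = wilcoxon_signed_rank(model, baseline)
```

When one predictor has the smaller error on every item, as in this toy case, W+ is zero and the p-value shrinks rapidly as the number of paired items grows.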

The MAE of TANGLE is smaller than that of ANGLOR for both Phi and Psi angle prediction, except for glycine, for which the MAE of TANGLE (84.1° for Phi and 76.7° for Psi) is higher than that of ANGLOR (75.1° for Phi and 66.9° for Psi). The improvement of real-value prediction of torsion angles by TANGLE can be attributed to a combination of multiple factors. While ANGLOR used neural networks to train the predictors for Phi angle prediction and SVM and three types of sequence-based features to train the models for Psi angle prediction, TANGLE used a two-level support vector regression system to refine the prediction results, based on a more integrated set of sequence and predicted structural features. In addition to the difference in the optimal local window sizes used by the two predictors, the performance improvement may be attributed to the design and implementation of the two-level support vector regression learning framework in TANGLE.

The TANGLE server

For the implementation of this work, we have constructed an online server to provide a free academic service of torsion angle prediction from primary sequences, which is available at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/webserver.html. TANGLE requires the user to submit a single amino acid sequence in the FASTA format of the query protein as input, and an Email address to send out the prediction result. When the query sequence is submitted, several third-party programs including PSI-BLAST, PSIPRED, SCRATCH and DISOPRED2 will be executed to generate the respective PSSM, predicted secondary structure, solvent accessibility and native disorder profiles. These will be subsequently used as an input for the trained TANGLE models to make the prediction. As soon as the submission task is completed, the prediction result will be sent to the user via Email.

The TANGLE server is implemented in HTML+Perl and the prediction webpage is shown in Figure 6A. Figure 6B illustrates an example of the prediction results by TANGLE. The prediction results have two sections: the first section gives the primary sequence information of the submitted sequence; in the second section, columns 1–4 correspond to the residue position, residue name, and the predicted Phi and Psi angles, respectively. Furthermore, the plots of the predicted Phi and Psi angles are accessible by clicking the link at the bottom of the result webpage. To assist method developers, the training dataset, testing dataset, and the calculated Phi/Psi angles for all residues in the training/testing dataset used in this work are downloadable from the links webpage. The TANGLE server is currently hosted by a four-CPU Linux system with 16 GB of main memory. The computational time is mainly dependent on the execution of the PSI-BLAST, PSIPRED, SCRATCH and DISOPRED2 programs. A typical job for a sequence of 500 residues takes approximately 5 minutes to complete.

Figure 6
An example of the prediction results by the TANGLE web server.

Case study

To understand where the difficulties of torsion angle prediction arise and to illustrate the significance of the CC, RMSE and MAE measures used in this study, we presented three illustrative examples of TANGLE prediction of Phi and Psi angles and compared the predicted and observed torsion angle profiles for three proteins (Figure 7): the beta1-subunit of the signal-transducing G protein heterotrimer (PDB ID: 1b9x, chain A) [85], the enzyme IIAlactose from Lactococcus lactis (PDB ID: 1e2a, chain A) [86] and the bee venom hyaluronidase in a complex with hyaluronic acid tetramer (PDB ID: 1fcv, chain A) [87]. To investigate the prediction performance with respect to secondary structure content, the three proteins were selected to represent the all-beta, all-alpha and alpha/beta classes, which are abundant in beta-strands, alpha-helices, and a mixture of alpha-helices and beta-strands, respectively. The predicted and observed Phi/Psi torsion angles of these three proteins are displayed in Figure 7.

Figure 7
The predicted and observed torsion angles for three typical alpha-, beta-, and alpha/beta-proteins.

The first example is the beta1-subunit of the signal-transducing G protein heterotrimer with 336 residues and 25 beta-strands [85]. As an all beta-protein, this protein was predicted with a CC of 0.75, a RMSE of 41.6° and a MAE of 25.5° for the Phi angle, and a CC of 0.74, a RMSE of 57.1° and a MAE of 35.9° for the Psi angle. From Figure 7A, we can see that the majority of its regions are in good agreement with the corresponding observed Phi/Psi values, except for several separate positions like residue positions 53, 141, 182, 224 and 306 for the Phi angle, and residue positions 3, 111, 116, 153 and 306 for the Psi angle.

The second example is an all alpha-protein, the enzyme IIAlactose from Lactococcus lactis [86]. It contains 3 alpha-helices with 98 residues. In contrast, this protein was predicted with better accuracy (CC = 0.72, RMSE = 18.0° and MAE = 10.2° for Phi, and CC = 0.77, RMSE = 18.0° and MAE = 16.3° for Psi, respectively). The MAE values of this protein are much better than the first and third examples (See discussion below). Most of the predicted torsion angles are in good agreement with the corresponding observed values. Only the region between residue positions 68 and 73 has the worst prediction with relatively large MAE values (Figure 7B).

The third example is an alpha/beta-protein, the bee venom hyaluronidase. It has 9 alpha-helices, 8 beta-strands, and 320 residues [87]. Compared with the former two examples, it is poorly predicted, with a CC of 0.58, an RMSE of 40.9° and an MAE of 21.5° for the Phi angle, and a CC of 0.69, an RMSE of 62.8° and an MAE of 32.4° for the Psi angle. The prediction errors, as evaluated by MAEs, are particularly large for residues with the highest or lowest peak torsion angle values (Figure 7C). For this protein, the prediction performance for alpha-helix residues (RMSE = 15.3° and MAE = 13.5° for Phi angle, and RMSE = 80.0° and MAE = 79.5° for Psi angle, respectively) is better than that for beta-strand residues (RMSE = 52.4° and MAE = 38.9° for Phi angle, and RMSE = 96.5° and MAE = 91.8° for Psi angle, respectively) and coil residues (RMSE = 57.6° and MAE = 36.7° for Phi angle, and RMSE = 88.3° and MAE = 76.3° for Psi angle, respectively). These results again suggest that torsion angles become harder to predict as the degree of structural irregularity increases.

Discussion

Support vector regression (SVR) is a powerful machine learning technique for addressing real-valued prediction tasks in bioinformatics and computational biology, as its strong theoretical basis in statistical learning makes it possible to minimize the generalization error in the prediction [41], [42]. Compared with other traditional techniques, SVR has several advantages: it uses kernel functions to handle data that are irregularly distributed or have unknown distribution patterns, copes with high-dimensional data, provides robust out-of-sample generalization given an appropriate choice of parameters, yields a solution defined by the support vectors, and strikes a proper balance between bias and variance. Additionally, the two-level SVR approach is appropriate for constructing optimal predictors of the raw values of samples, as the second-stage predictor is introduced to minimize the generalization error produced in the first stage [55][58].

Accurate prediction of protein structural properties such as residue contact number (CN) [5], contact order (CO) [7], solvent accessible surface area (ASA) [9], half-sphere exposure (HSE) [6], residue depth (RD) [8], [16], [73] and so forth can provide valuable information for protein tertiary structure prediction. In previous studies, incorporation of the evolutionary profile in the form of position-specific scoring matrices and predicted structural features such as secondary structure, solvent accessibility and native disorder in the machine learning framework has been shown to be useful for improving the prediction accuracy of protein structural properties. In this study, we have developed a new SVR-based approach TANGLE for the real-valued prediction of protein backbone torsion angles from protein primary sequences. Based on a large benchmark dataset of non-homologous proteins, TANGLE has outperformed an amino acid-specific predictor and one of the state-of-the-art tools ANGLOR [11].

Nevertheless, further improvement of the prediction accuracy of these structural properties remains a challenging problem. More recently, Ahmad et al. proposed novel computational frameworks to predict a variety of structural features of proteins in an integrated manner, and the performance of their integrated system was significantly better than that of models trained separately on individual features [39]. This represents an important step towards developing the next generation of one-dimensional predictors and has important implications for better understanding how these predictable structural features correlate with each other and collectively dictate the dynamics of protein structures. In future work, it would be particularly interesting to explore the possibility of applying this integrative framework to develop more accurate predictors and to comprehensively compare the integrated models, individual models and two-stage models in terms of computational cost, performance and the parameters that need to be optimized.

In general, the Psi angles are more difficult to predict than the Phi angles. We found that the distribution of Phi/Psi angles varies in diversity between secondary structure types, thereby resulting in different degrees of prediction difficulty. Among the three secondary structure types, the prediction error for alpha-helix residues is the smallest, followed by beta-strand residues, while coil residues have the largest MAE values. Also, the torsion angles of the exposed residues are more difficult to predict than those of the buried residues. Due to the various degrees of steric collision between side chains and backbones, different amino acids also have different degrees of prediction difficulty. All these results indicate that training specific predictors for various residue types and secondary structure types might be helpful for further improving the prediction performance. Moreover, incorporation of more relevant features that complement the current feature sets and proper selection of more informative features by powerful feature selection techniques will also be useful for improving prediction accuracy in the future. Further improvement can also be achieved by better handling of the under-represented residues, for which too few data points are available to the prediction models. All these issues constitute the subject of future studies.

In this article, we have developed a new approach TANGLE to predict real-valued torsion angles from primary sequences by using a two-stage support vector regression approach. TANGLE used a variety of multiple sequence-derived features, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. We have comprehensively assessed the effects of different sequence encoding schemes on the prediction performance of torsion angles. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the prediction performance of TANGLE has been shown to outperform a state-of-the-art predictor ANGLOR and an amino acid-specific predictor. Our work provides a complementary and useful approach towards the more accurate prediction of protein backbone torsion angles and complements the current torsion angle prediction algorithms. We hope that by applying the predicted torsion angles as useful restraints, TANGLE will provide significant assistance in facilitating protein structure prediction and protein fold recognition.

Supporting Information

Dataset S1

The Phi angles in the training set of 500 protein chains. The first, second, third, fourth and fifth columns in this file correspond to the residue name, the chain name in PDB structures, the original residue position in the PDB ATOM records, the observed Phi angle calculated by DSSP [40], and the normalized Phi angle which will be used as input to TANGLE, respectively. The last three columns correspond to the annotations of secondary structures by DSSP [40], predicted solvent accessibility by SCRATCH [61] and predicted native disorder by DISOPRED2 [62].

(TXT)

Dataset S2

The Psi angles in the training set of 500 protein chains. The description for each column in this file is similar to that for Dataset S1 above.

(TXT)

Dataset S3

The Phi angles in the testing set of 1,026 protein chains. The description for each column in this file is similar to that for Dataset S1 above.

(TXT)

Dataset S4

The Psi angles in the testing set of 1,026 protein chains. The description for each column in this file is similar to that for Dataset S1 above.

(TXT)

Dataset S5

The prediction performance of Phi angle by TANGLE on the testing set of 1,026 protein chains, as evaluated by four measures: CC, RMSE_norm, RMSE_raw and MAE. These measures were calculated at the protein chain level. The first to fourth columns in the file correspond to CC, RMSE_norm, RMSE_raw and MAE, respectively.

(TXT)

Dataset S6

The prediction performance of Psi angle by TANGLE on the testing set of 1,026 protein chains, as evaluated by four measures: CC, RMSE_norm, RMSE_raw and MAE. These measures were calculated at the protein chain level. The description for each column in this file is the same as for Dataset S5 above.

(TXT)

Acknowledgments

The authors wish to thank the Editor and the two reviewers for their constructive comments on strengthening the presentation of this paper. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University and the Monash e-Research Centre, Monash University.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was supported by grants from the National Health and Medical Research Council of Australia (NHMRC) (No. 490989), the Australian Research Council (ARC) (No. LP110200333), the Hundred Talents Program of the Chinese Academy of Sciences (CAS), the Japan Society for the Promotion of Science (JSPS) (No. S11156), the Knowledge Innovative Program of CAS (No. KSCX2-EW-G-8), and the Tianjin Municipal Science & Technology Commission (No. 10ZCKFSY05600). JS is a NHMRC Peter Doherty Fellow and the Recipient of the Hundred Talents Program of the CAS and the JSPS Short-Term Fellowship to Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol. 1993;232:584–599.
2. Rost B, Sander C. Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc Natl Acad Sci USA. 1993;90:7558–7562.
3. Dor O, Zhou Y. Real-SPINE: an integrated system of neural networks for real-value prediction of protein structural properties. Proteins. 2007;68:76–81.
4. Yuan Z, Burrage K, Mattick JS. Prediction of protein solvent accessibility using support vector machines. Proteins. 2002;48:566–570.
5. Yuan Z. Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics. 2005;6:248.
6. Song J, Tan H, Takemoto K, Akutsu T. HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics. 2008;24:1489–1497.
7. Song J, Burrage K. Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics. 2006;7:425.
8. Song J, Tan H, Mahmood K, Law RH, Buckle AM, et al. Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS ONE. 2009;4:e7072.
9. Faraggi E, Xue B, Zhou Y. Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins. 2009;74:847–856.
10. Faraggi E, Yang Y, Zhang S, Zhou Y. Predicting continuous local structure and the effect of its substitution for secondary structure in fragment-free protein structure prediction. Structure. 2009;17:1515–1527.
11. Wu S, Zhang Y. ANGLOR: a composite machine-learning algorithm for protein backbone torsion angle prediction. PLoS ONE. 2008;3:e3400.
12. Xue B, Dor O, Faraggi E, Zhou Y. Real-value prediction of backbone torsion angles. Proteins. 2008;72:427–433.
13. Kurgan L, Disfani FM. Structural Protein Descriptors in 1-Dimension and their Sequence-Based Predictions. Curr Protein Pept Sci. 2011;12:470–489.
14. Rost B. Review: protein secondary structure prediction continues to rise. J Struct Biol. 2001;134:204–218.
15. Dor O, Zhou Y. Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins. 2007;66:838–845.
16. Zhang H, Zhang T, Chen K, Kedarisetti KD, Mizianty MJ, et al. Critical assessment of high-throughput standalone methods for secondary structure prediction. Brief Bioinform. 2011;12:672–688.
17. Wood MJ, Hirst JD. Protein secondary structure prediction with dihedral angles. Proteins. 2005;59:476–481.
18. Kountouris P, Hirst JD. Predicting beta-turns and their types using predicted backbone dihedral angles and secondary structures. BMC Bioinformatics. 2010;11:407.
19. Zhang C, Hou J, Kim SH. Fold prediction of helical proteins using torsion angle dynamics and predicted restraints. Proc Natl Acad Sci USA. 2002;99:3581–3585.
20. Zhang W, Liu S, Zhou Y. SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model. PLoS ONE. 2008;3:e2325.
21. Wu S, Zhang Y. MUSTER: Improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins. 2008;72:547–556.
22. Huang YM, Bystroff C. Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions. Bioinformatics. 2006;22:413–422.
23. Miao X, Waddell PJ, Valafar H. TALI: local alignment of protein structures using backbone torsion angles. J Bioinform Comput Biol. 2008;6:163–181.
24. Kuang R, Leslie CS, Yang AS. Protein backbone angle prediction with machine learning approaches. Bioinformatics. 2004;20:1612–1621.
25. Ramachandran GN, Ramakrishnan C, Sasisekharan V. Stereochemistry of polypeptide chain configurations. J Mol Biol. 1963;7:95–99.
26. Rooman MJ, Kocher JP, Wodak SJ. Prediction of protein backbone conformation based on seven structure assignments: Influence of local interactions. J Mol Biol. 1991;221:961–979.
27. Kang HS, Kurochkina NA, Lee B. Estimation and use of protein backbone angle probabilities. J Mol Biol. 1993;229:448–460.
28. Bystroff C, Thorsson V, Baker D. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J Mol Biol. 2000;301:173–190.
29. de Brevern AG, Etchebest C, Hazout S. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins. 2000;41:271–287.
30. Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins. 2003;51:504–514.
31. Mooney C, Vullo A, Pollastri G. Protein structural motif prediction in multidimensional phi-psi space leads to improved secondary structure prediction. J Comput Biol. 2006;13:1489–1502.
32. Zimmermann O, Hansmann UH. Support vector machines for prediction of dihedral angle regions. Bioinformatics. 2006;22:3009–3015.
33. Helles G, Fonseca R. Predicting dihedral angle probability distributions for protein coil residues from primary sequence using neural networks. BMC Bioinformatics. 2009;10:338.
34. Kountouris P, Hirst JD. Prediction of backbone dihedral angles and protein secondary structure using support vector machines. BMC Bioinformatics. 2009;10:437.
35. Wood MJ, Hirst JD. Protein secondary structure prediction with dihedral angles. Proteins. 2005;59:476–481.
36. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
37. Berjanskii MV, Neal S, Wishart DS. PREDITOR: a web server for predicting protein torsion angle restraints. Nucleic Acids Res. 2006;34(Web Server issue):W63–69.
38. Zhang T, Faraggi E, Zhou Y. Fluctuations of backbone torsion angles obtained from NMR-determined structures and their prediction. Proteins. 2010;78:3353–3362.
39. Ahmad S, Singh YH, Paudel Y, Mori T, Sugita Y, et al. Integrated prediction of one-dimensional structural features and their relationships with conformational flexibility in helical membrane proteins. BMC Bioinformatics. 2010;11:533.
40. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637.
41. Vapnik V. Statistical learning theory. New York, NY: Wiley; 1998.
42. Vapnik V. The nature of statistical learning theory. New York, NY: Springer; 2000.
43. Yuan Z, Huang B. Prediction of protein accessible surface areas by support vector regression. Proteins. 2004;57:558–564.
44. Ishida T, Nakamura S, Shimizu K. Potential for assessing quality of protein structure based on contact number prediction. Proteins. 2006;64:940–947.
45. Yuan Z, Bailey TL, Teasdale RD. Prediction of protein B-factor profiles. Proteins. 2005;58:905–912.
46. Song J, Yuan Z, Tan H, Huber T, Burrage K. Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure. Bioinformatics. 2007;23:3147–3154.
47. Song J, Tan H, Shen H, Mahmood K, Boyd SE, et al. Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics. 2010;26:752–760.
48. Raghava GP, Han JH. Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics. 2005;6:59.
49. Wang X, Li A, Jiang Z, Feng H. Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics. 2006;7:32.
50. Liu W, Meng X, Xu Q, Flower DR, Li T. Quantitative prediction of mouse class I MHC peptide binding affinity using support vector machine regression (SVR) models. BMC Bioinformatics. 2006;7:182. [PMC free article] [PubMed]
51. Qiu S, Lane T. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Trans Comput Biol Bioinform. 2009;6:190–199. [PubMed]
52. Chen PC, Huang SY, Chen WJ, Hsiao CK. A new regularized least squares support vector regression for gene selection. BMC Bioinformatics. 2009;10:44. [PMC free article] [PubMed]
53. Chen P, Liu C, Burge L, Li J, Mohammad M, et al. DomSVR: domain boundary prediction with support vector regression from sequence information alone. Amino Acids. 2010;39:713–726. [PMC free article] [PubMed]
54. Liang S, Zheng D, Standley DM, Yao B, Zacharias M, et al. EPSVR and EPMeta: prediction of antigenic epitopes using support vector regression and multiple server results. BMC Bioinformatics. 2010;11:381. [PMC free article] [PubMed]
55. Nguyen MN, Rajapakse JC. Two-stage support vector regression approach for predicting accessible surface areas of amino acids. Proteins. 2006;63:542–550. [PubMed]
56. Nguyen MN, Rajapakse JC. Prediction of protein relative solvent accessibility with a two-stage SVM approach. Proteins. 2005;59:30–37. [PubMed]
57. Chen K, Kurgan M, Kurgan L. Sequence based prediction of relative solvent accessibility using two-stage support vector regression with confidence values. J Biomed Sci Eng. 2008;1:1–9.
58. Pan XY, Shen HB. Robust prediction of B-factor profile from sequence using two-stage SVR based on random forest feature selection. Protein Pept Lett. 2009;16:1447–1454. [PubMed]
59. Chen L, Xuan J, Riggins RB, Wang Y, Hoffman EP, et al. Multilevel support vector regression analysis to identify condition-specific regulatory networks. Bioinformatics. 2010;26:1416–1422. [PMC free article] [PubMed]
60. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195–202. [PubMed]
61. Cheng J, Randall AZ, Sweredoski MJ, Baldi P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 2005;33:W72–76. [PMC free article] [PubMed]
62. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004;337:635–645. [PubMed]
63. Ahmad S, Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005;6:33. [PMC free article] [PubMed]
64. Xie D, Li A, Wang M, Fan Z, Feng H. LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res. 2005;Jul 1;33(Web Server issue):W105–110. [PMC free article] [PubMed]
65. Song J, Burrage K, Yuan Z, Huber T. Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information. BMC Bioinformatics. 2006;Mar 9;7:124. [PMC free article] [PubMed]
66. Ofran Y, Mysore V, Rost B. Prediction of DNA-binding residues from sequence. Bioinformatics. 2007;23:i347–i353. [PubMed]
67. Ofran Y, Rost B. Protein–Protein Interaction Hotspots Carved into Sequences. PLoS Comput Biol. 2007;3:e119. [PMC free article] [PubMed]
68. Chen K, Kurgan L. PFRES: protein fold classification by using evolutionary information and predicted secondary structure. Bioinformatics. 2007;23:2843–2850. [PubMed]
69. Kumar M, Gromiha MM, Raghava GP. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics. 2007;8:463. [PMC free article] [PubMed]
70. Kalita MK, Nandal UK, Pattnaik A, Sivalingam A, Ramasamy G, et al. CyclinPred: a SVM-based method for predicting cyclin protein sequences. PLoS ONE. 2008;3:e2605. [PMC free article] [PubMed]
71. Zhang T, Zhang H, Chen K, Shen S, Ruan J, et al. Accurate sequence-based prediction of catalytic residues. Bioinformatics. 2008;24:2329–2338. [PubMed]
72. Zheng C, Kurgan L. Prediction of beta-turns at over 80% accuracy based on an ensemble of predicted secondary structures and multiple alignments. BMC Bioinformatics. 2008;9:430. [PMC free article] [PubMed]
73. Zhang H, Zhang T, Chen K, Shen S, Ruan J, et al. Sequence based residue depth prediction using evolutionary information and predicted secondary structure. BMC Bioinformatics. 2008;9:388. [PMC free article] [PubMed]
74. Verma R, Tiwari A, Kaur S, Varshney GC, Raghava GP. Identification of proteins secreted by malaria parasite into erythrocyte using SVM and PSSM profiles. BMC Bioinformatics. 2008;9:201. [PMC free article] [PubMed]
75. Chauhan JS, Mishra NK, Raghava GP. Prediction of GTP interacting residues, dipeptides and tripeptides in a protein from its evolutionary information. BMC Bioinformatics. 2010;11:301. [PMC free article] [PubMed]
76. Mizianty MJ, Kurgan L. Improved identification of outer membrane beta barrel proteins using primary sequence, predicted secondary structure, and evolutionary information. Proteins. 2011;79:294–303. [PubMed]
77. Chen K, Stach W, Homaeian L, Kurgan L. iFC(2): an integrated web-server for improved prediction of protein structural class, fold type, and secondary structure content. Amino Acids. 2011;40:963–973. [PubMed]
78. Schlessinger A, Liu J, Rost B. Natively unstructured loops differ from other loops. PLoS Comput Biol. 2007;3:e140. [PMC free article] [PubMed]
79. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B. Improved disorder prediction by combination of orthogonal approaches. PLoS One. 2009;4:e4433. [PMC free article] [PubMed]
80. Schlessinger A, Punta M, Rost B. Natively unstructured regions in proteins identified from contact predictions. Bioinformatics. 2007;23:2376–2384. [PubMed]
81. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6:197–208. [PubMed]
82. Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, et al. Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol. 2006;2:e100. [PMC free article] [PubMed]
83. Gsponer J, Futschik ME, Teichmann SA, Babu MM. Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science. 2008;322:1365–1368. [PMC free article] [PubMed]
84. Gao J, Thelen JJ, Dunker AK, Xu D. Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Mol Cell Proteomics. 2010;9:2586–2600. [PMC free article] [PubMed]
85. Gaudet R, Savage JR, McLaughlin JN, Willardson BM, Sigler PB. A molecular mechanism for the phosphorylation-dependent regulation of heterotrimeric G proteins by phosducin. Mol Cell. 1999;3:649–660. [PubMed]
86. Sliz P, Engelmann R, Hengstenberg W, Pai EF. The structure of enzyme IIAlactose from Lactococcus lactis reveals a new fold and points to possible interactions of a multicomponent system. Structure. 1997;5:775–788. [PubMed]
87. Markovic-Housley Z, Miglierini G, Soldatova L, Rizkallah PJ, Muller U, et al. Crystal structure of hyaluronidase, a major allergen of bee venom. Structure Fold Des. 2000;8:1025–1035. [PubMed]

Articles from PLoS ONE are provided here courtesy of Public Library of Science
