

# Multi-Body Coarse-Grained Potentials for Native Structure Recognition and Quality Assessment of Protein Models

[…]^{1,2}, Sumudu P. Leelananda^{2}, Andrzej Kolinski^{1}, Robert L. Jernigan^{2}, and Andrzej Kloczkowski^{2,3,*}

^{1}Faculty of Chemistry, University of Warsaw, Warsaw, Poland

^{2}Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA, USA

^{3}Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, Columbus, OH, USA

## Summary

Multi-body potentials have attracted much recent interest because they account for three-dimensional interactions related to residue packing and capture the cooperativity of these interactions in protein structures. Our goal was to combine long-range multi-body potentials with short-range potentials to improve recognition of the native structure among misfolded decoys. We optimized the weights of the four-body non-sequential, four-body sequential and short-range potentials to obtain the best model ranking results for threading, and compared these results against those obtained with other potentials. (Twenty-six different coarse-grained potentials from the Potentials ‘R’Us web server were used.) Our optimized multi-body potentials outperform all other contact potentials in recognizing the native structure among decoys, both for models from homology template-based modeling and for models from template-free modeling in the CASP8 decoy sets. We also compared the results obtained with these optimized coarse-grained potentials, in which each residue is represented by a single point, with results obtained using the DFIRE potential, which takes atomic-level information into account. For all proteins larger than 80 amino acids, our optimized coarse-grained potentials yield results comparable to those obtained with the atomic DFIRE potential.

## Introduction

Knowledge-based potential functions are used in many different types of computational protein studies, including protein structure prediction^{1–5}, protein design^{6–9}, docking applications^{10–13} and protein folding mechanism studies^{14–17}. Many atomistic potential functions^{18–20} and coarse-grained potential functions^{21–24} have been developed. The use of these potentials has grown significantly because they can greatly reduce the computational cost of modeling and predicting protein structures. A major challenge in computational biology is to derive better coarse-grained potentials that perform as well as atomistic potentials yet are computationally much less expensive.

Many different coarse-grained potentials have been extensively applied to the assessment of protein models and to native structure recognition. Among the most widely used two-body potentials are the Miyazawa-Jernigan potentials^{22}. Betancourt and Thirumalai^{25} suggested that pair-wise potentials are not likely to be sufficient for threading applications. The alternative multi-body potentials are, in principle, able to account for more complex three-dimensional interactions, revealing the effects of dense residue packing. In particular, they can capture the strong cooperativity operative within protein structures. Three-body potentials were proposed and developed by Munson and Singh^{26} and also by Li and Liang^{27}, and both showed improvements over two-body potentials. Four-body potentials were first derived in the context of Delaunay tessellation by Krishnamoorthy and Tropsha^{28}, who demonstrated that these potentials also perform better than two-body potentials.

The four-body contact potentials developed by our group^{29} incorporated sequence information and considered in detail the interactions between backbones and side chains through a simple geometric construction (see Methods for the model description). We also developed them to distinguish between different levels of solvent accessibility of the residues.

These four-body potentials (both sequential and non-sequential) have been successful in recognizing the native structure in most of the *misfolded* decoy sets from the Decoys ‘R’Us data set. However, these potentials fail to recognize the native structures of a significant number of proteins.

In this paper we have improved the performance of the four-body contact potentials by combining the four-body sequential potentials^{29} with the four-body non-sequential potentials^{30} and with short-range potentials. For the short-range knowledge-based potentials, we consider the identities of two consecutive amino acids along the sequence and the pairwise couplings between their virtual torsion and bond angles^{31}. The rankings of the best models are obtained by combining these three sets of potentials and globally optimizing the weight of each component in the sum.

Different measures of the quality of model selection, such as the ranking of the native structure within the decoy sets, the RMSD of the best ranked model and correlation coefficients, all show that both the four-body sequential and the four-body non-sequential potentials on average perform better than or as well as two-body coarse-grained potentials. After optimization, however, the resulting residue-level coarse-grained potential, i.e. the weighted sum of the four-body sequential, four-body non-sequential and short-range potentials, performs better than all other coarse-grained potentials and almost as well as much more detailed (but computationally more costly) atomistic empirical potentials.

## Methods

### Geometric construction for considering interactions

For each set of four consecutive amino acids (*i*, *i*+1, *i*+2, *i*+3) along the sequence (black in Figure 1), we calculated the geometrical center (red) of their four side-chain centers (C^{α} for Gly). Blue residues are those in close proximity to the geometrical center. Six planes can be defined by combining each possible pair of these four points with the red center point, and these planes subdivide the space surrounding the red point into four tetrahedra. The tetrahedra share a common vertex, which is the geometrical center of the four side-chain centers. Each set of four contacting bodies for our four-body potentials is obtained as follows. One triplet of amino acids from a tetrahedron is taken along the sequence, together with another amino acid that is not along the sequence but lies within a cutoff distance from the quartet’s geometrical center (blue residue in Figure 1). This amino acid is considered to be in contact with the triplet within a cutoff distance of 8 Å. The 8 Å cutoff was selected because it gave the best threading results among the cutoff distances that we considered. One example of a four-body set is marked in Figure 1 by the four residues in black boxes. We use tetrahedra to capture long-range interactions between non-bonded side chains and groups of backbone residues. For the sequential four-body potentials we require the triplet of amino acids to be sequential, whereas for the non-sequential four-body potentials this requirement is dropped. The optimized potentials in this paper combine both the sequential and non-sequential four-body potentials with the short-range pair-wise potentials mentioned earlier.
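The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is assumed to be one side-chain center per residue, and the sketch pairs every neighbouring residue with every triplet of the window rather than testing which tetrahedron the neighbour actually falls into.

```python
from math import dist

CUTOFF = 8.0  # Å, the cutoff distance quoted in the text


def centroid(points):
    """Geometric center of a list of 3-D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def sequential_four_bodies(centers, cutoff=CUTOFF):
    """Enumerate (triplet, neighbour) index tuples.

    centers: list of (x, y, z) side-chain centers (C-alpha for Gly),
    one per residue, in sequence order.
    Simplification (assumption): every neighbour is paired with every
    triplet of the window; no tetrahedron membership test is done.
    """
    found = []
    for i in range(len(centers) - 3):
        window = [i, i + 1, i + 2, i + 3]
        center = centroid([centers[k] for k in window])
        # Residues outside the window that lie within the cutoff of the center.
        neighbours = [
            j for j in range(len(centers))
            if j not in window and dist(centers[j], center) <= cutoff
        ]
        # Four ways to pick three of the four window residues,
        # one per tetrahedron sharing the center as a vertex.
        for skipped in window:
            triplet = tuple(k for k in window if k != skipped)
            for j in neighbours:
                found.append((triplet, j))
    return found
```

Counting how often each amino-acid quartet occurs in a structure database would then provide the statistics from which a knowledge-based potential is derived.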

**...**

Extensive studies have compared the performance of different knowledge-based potential functions^{20,33,38} on large data sets of protein models. These evaluations measure how often the native structure is ranked as the conformation with the lowest energy, and they also compute the average Z-score between the energy of the native structure and the next most favorable structure (the larger the average Z-score, the better the evaluation).
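The two evaluation measures mentioned above can be illustrated with a short sketch. The exact Z-score convention is not spelled out in the text, so the code assumes the conventional form, the gap between the mean decoy energy and the native energy in units of the decoy standard deviation; the function names are hypothetical.

```python
from statistics import mean, pstdev


def native_rank(native_energy, decoy_energies):
    """1-based rank of the native structure when sorted by energy (lowest first)."""
    return 1 + sum(1 for e in decoy_energies if e < native_energy)


def z_score(native_energy, decoy_energies):
    """How far the native energy lies below the decoy mean, in decoy std units.

    Assumed convention: larger values mean better separation of the native.
    """
    return (mean(decoy_energies) - native_energy) / pstdev(decoy_energies)
```

A rank of 1 means the potential identified the native structure as the lowest-energy conformation.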

We used CASP8 models as decoy sets (see supplementary materials) to evaluate how well two-body and four-body potential functions identify native (or near-native) protein structures. Twenty-three different two-body potentials (more details about these potentials can be found in Pokarowski *et al.*^{34}) and the sequential^{29} and non-sequential^{30} four-body potentials were used. The targets were divided into two subsets according to the method used to generate decoys for each target. One set comprises models that were obtained using homology (template-based) modeling (153 cases), and the other comprises models obtained from template-free modeling approaches (12 cases).

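The four-body sequential, the four-body non-sequential and the short-range potentials were combined in a simple linear way by using the following formula:

$$E = w_{\text{4-body-seq}}\,E_{\text{4-body-seq}} + w_{\text{4-body-nonseq}}\,E_{\text{4-body-nonseq}} + w_{\text{SR}}\,E_{\text{SR}}$$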

Optimization of the weight of each term was performed to find an optimized potential for computational applications.
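As a minimal sketch of the weighted sum and of how it would be used to rank models, consider the following. The function and variable names are hypothetical, and each model is assumed to come with three precomputed energy terms.

```python
def combined_energy(e_seq, e_nonseq, e_sr, w_nonseq, w_sr, w_seq=1.0):
    """Linear combination of the three energy terms (w_seq defaults to 1.0)."""
    return w_seq * e_seq + w_nonseq * e_nonseq + w_sr * e_sr


def rank_models(models, w_nonseq, w_sr):
    """Return model names sorted by combined energy, lowest first.

    models: dict mapping a model name to (e_seq, e_nonseq, e_sr).
    """
    return sorted(
        models,
        key=lambda name: combined_energy(*models[name], w_nonseq, w_sr),
    )
```

Changing the two free weights can reorder the models, which is why the choice of weights matters for model selection.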

The optimization was carried out using the Particle Swarm Optimization^{32} (PSO) technique. We fixed the weight of the four-body sequential term at 1.0 (w_{4-body-seq} = 1) and varied the weight coefficients of the other two terms, w_{4-body-nonseq} and w_{SR}, using PSO. The main idea behind PSO comes from the observation of swarms of birds or bees. The optimal solution is sought by maintaining a population of candidate solutions (also called particles); the algorithm remembers the best position found by each particle and by the whole population. Particles scan the search space according to a simple movement formula that takes into account the best solution found by the individual particle and by the whole population. When only two parameters are optimized, other methods could be used and would give similar results. For optimization in a higher-dimensional space, however, PSO has significant advantages over the alternatives: it is computationally more efficient than, for example, grid methods, and unlike simulated annealing methods it does not require any arbitrary assumptions.

For each combination of terms we calculated the average RMSD of the best ranked model and the Z-scores for all CASP8 targets. Heat maps of the average best-ranked-model RMSD and Z-score were computed for varying weights w_{4-body-nonseq} and w_{SR} of the optimized potentials, both for targets modeled using (homology) template-based methods and for template-free modeled targets. The native structure rankings obtained with the optimized potentials were compared to those obtained with other coarse-grained potentials and with the atomistic DFIRE potentials^{20}.

The Decoys ‘R’Us dataset^{33} was used for comparison with atomistic potentials. Both single and multiple decoy sets were used in this assessment. A single decoy set consists of a pair of structures, the native structure and one decoy structure, whereas a multiple decoy set contains many decoys for each target structure. We excluded the *multiple loop* set from our assessment because of the poor amino acid packing in loop regions, and also excluded the *ifu* decoy set, because multi-body potentials do not perform well for small structures. (For small proteins there are problems with proper tessellation. Residues at the surface cannot be tessellated correctly without taking into account neighboring solvent molecules.)
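To make the PSO description concrete, here is a minimal textbook-style sketch. The PSO variant and settings actually used are not given in the text, so the inertia and acceleration coefficients, the swarm size, the iteration count and the search bounds below are all assumptions, and the function name `pso` is hypothetical.

```python
import random


def pso(objective, bounds, n_particles=30, n_iter=100,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(position) inside box bounds with a basic PSO.

    bounds: list of (low, high) pairs, one per dimension.
    All numeric settings are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]             # best position seen by each particle
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Movement rule: inertia + pull toward personal and global bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val
```

In the setting of this paper, `objective` would take the pair (w_{4-body-nonseq}, w_{SR}) and return a quantity to minimize, for example the average RMSD of the best ranked model over all targets.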

The RMSD values between the native structure and the best-fitting decoy for each decoy set were computed with the TM-score algorithm^{39}. Spearman’s, Pearson’s and Kendall’s correlation coefficients were calculated for all the target-decoy pairs from the potential energies and the RMSD values to the native conformation. All incomplete decoys were removed from the sets. Z-scores were also calculated for the decoys to evaluate the separation between the native structure and the other structures in energy space. Pearson’s correlation coefficient is expressed as the covariance of two variables normalized by their standard deviations:

$$\rho_{P} = \frac{\mathrm{cov}(X, Y)}{\sigma_{X}\,\sigma_{Y}}$$

Because Pearson’s correlation coefficient assumes linearity between the two variables (in the context of this paper: energy and RMSD), it is more suitable to use alternative correlation measures. In particular, rank order correlation coefficients seem appropriate. Spearman’s rank correlation coefficient is a non-parametric measure of the statistical dependence between two ranked variables. When tied ranks exist (when two different observations have the same value; in this study, when two structures with different RMSD have the same energy), *ρ*_{S} is computed from the same formula as for *ρ*_{P}, applied to the ranks. When there are no tied ranks, Spearman’s correlation coefficient is computed from the simpler formula:

$$\rho_{S} = 1 - \frac{6\sum_{i} d_{i}^{2}}{n(n^{2}-1)}$$

with *d*_{i} = *x*_{i} − *y*_{i} being the difference between the ranks on the two variables for the same structure model, and *n* the number of models.

Kendall’s τ coefficient is a measure of rank correlation, *i.e.* the similarity of the ordering of the data when ranked by different quantities, defined as:

$$\tau = \frac{n_{c} - n_{d}}{\tfrac{1}{2}\,n(n-1)}$$

where *n*_{c} is the number of concordant pairs, *n*_{d} is the number of discordant pairs, and the denominator is the total number of pairs. We call the two pairs of variables [*E*_{i}, RMSD_{i}] and [*E*_{j}, RMSD_{j}] concordant with each other if *E*_{i} > *E*_{j} implies RMSD_{i} > RMSD_{j} (or *vice versa*); otherwise we consider them to be discordant.

The three correlation coefficients are calculated for each target from the energies of its decoys and their RMSD values from the native structure. The coefficients are then averaged over all targets in each of the two categories to obtain average values for each potential function.
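The three coefficients described above can be computed with plain Python as in the sketch below. The function names are hypothetical; Spearman's coefficient is obtained as Pearson's coefficient on average ranks, which also covers tied values, and Kendall's τ uses the pair-counting definition with the total number of pairs in the denominator.

```python
from math import sqrt


def pearson(x, y):
    """Covariance of x and y normalized by their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation of the ranks (valid with or without ties)."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """(concordant - discordant) / total number of pairs."""
    n = len(x)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_c += 1
            elif s < 0:
                n_d += 1
    return (n_c - n_d) / (n * (n - 1) / 2)
```

For a potential that ranks decoys well, all three coefficients between energy and RMSD should be positive and close to 1.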

## Results

### Performances of different individual potential functions for model ranking

The tested potentials are all knowledge-based coarse-grained potentials, and they usually capture the statistics of contacts based on the coordinates of C^{α} (sometimes C^{β}) atoms. They therefore do not take into account the atomic details of proteins. We observe that for template-based modeled targets, the BT potential derived by Betancourt and Thirumalai^{25} performs best in comparison with the other two-body potentials and with the two four-body potentials individually (in terms of correlation coefficients, average Z-score and average RMSD). The best RMSD values are in the range of 4 Å to 5 Å (see Table 1). Four-body potentials perform well in the identification of native structures, and a few other two-body potentials show similar performance, with RMSD in the 4 Å range.

**...**

For the targets from template-free modeling, the performance (in terms of correlation coefficients or average values of Z-score or RMSD) is worse than for the homology-based modeled proteins (see Table 2). Potentials that perform best for template-free modeled targets also perform best for homology template-based modeled targets, but their results for the former are not as good as for the latter. This is because the template-free modeled structures submitted to CASP8 deviated significantly more from the native structures than the template-based homology models, and were usually poorly packed and/or poorly folded. Empirical potentials, which are derived from the interactions in real globular proteins, therefore cannot be applied well to these cases.

**...**

The rankings, RMSDs and correlation coefficients all show that the four-body sequential and four-body non-sequential potentials on average perform better than or as well as two-body potentials.

### Performance of the optimized potentials

The heat map shows the average RMSD (expressed by color) from the native structure for the best ranked homology models, with w_{4-body-nonseq} plotted on the x-axis and w_{SR} on the y-axis, both in steps of 0.05 (see Figure 2). Additional heat maps are given in the Supplementary materials (Figures S1, S2, S3). The best weights in the linear combination of four-body non-sequential, four-body sequential and short-range potentials correspond to the yellow regions in Figure 2. The weight of the four-body sequential potentials is fixed at 1.0. All heat maps (see Supplementary materials, Figures S1, S2, S3) show the same region of best weights, and several values within it give similar results. The optimized weights obtained for the four-body non-sequential and short-range potentials are about 0.28 and 0.22, respectively, for the template-based modeled (homology) targets. For template-free modeled targets the corresponding weights are different: 1.01 and 0.56, respectively. The weights obtained for homology modeled targets were used to assess the quality of our optimized potential on the Decoys ‘R’Us data set.
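The data behind such a heat map could be produced by a simple grid scan, sketched below. The 0.05 step comes from the text; the upper limit of the weight axes (`w_max`), the input layout and the function names are assumptions made for illustration only.

```python
def best_ranked_rmsd(decoys, w_nonseq, w_sr):
    """RMSD of the decoy with the lowest combined energy.

    decoys: list of (e_seq, e_nonseq, e_sr, rmsd) tuples for one target.
    The weight of the sequential term is fixed at 1.0.
    """
    best = min(decoys, key=lambda d: d[0] + w_nonseq * d[1] + w_sr * d[2])
    return best[3]


def scan_weights(targets, step=0.05, w_max=2.0):
    """Average best-ranked RMSD over all targets on a regular weight grid.

    Returns a dict keyed by integer grid indices (a, b), corresponding to
    w_nonseq = a * step and w_sr = b * step. w_max is an assumed axis range.
    """
    n = round(w_max / step)
    grid = {}
    for a in range(n + 1):
        for b in range(n + 1):
            w_nonseq, w_sr = a * step, b * step
            total = sum(best_ranked_rmsd(t, w_nonseq, w_sr) for t in targets)
            grid[(a, b)] = total / len(targets)
    return grid
```

The grid cell with the lowest value corresponds to the best weights, and a broad region of near-equal values indicates that the combination is insensitive to the exact choice.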

**...**

The four-body non-sequential potentials do not necessarily perform better than the sequential potentials, but after optimization the resulting potentials perform better than either of the two individually, better than all other coarse-grained potentials (with an average RMSD approaching ~3.7 Å for the homology modeled targets), and almost at the same level as fully atomistic potentials. For template-free modeled targets the Betancourt-Thirumalai^{25} potentials perform almost as well as the optimized potentials, but for template-based modeled targets the optimized potentials give a significantly better RMSD.

For the *misfolded*, *asilomarh* and *Pdberr&sgpa* data sets from the Decoys ‘R’Us database, the optimized potentials identify all native structures and thereby perform as well as atomistic potentials (data not shown) such as RAPDF^{33}, atomic KBP^{19} and DFIRE (in the case of the DFIRE-B potential, there was one mismatch). In Table S1 (see Supplementary materials), the native structure ranks and the Z-scores are compared for the above atomistic potentials and for our optimized potentials using multiple decoy sets. The optimized potentials identify all native structures in the *lattice-ssfit* decoy set and fail to identify only two native states in the *4-state reduced* decoy set. The average Z-score of the optimized potentials for these decoys is 1.87. Multi-body potentials perform well if protein structures are large enough, sufficiently compact and well-packed with many multi-body contacts (see *Discussion*).

## Discussion

Coarse-grained potentials cannot be expected to recognize protein native structures with 100% accuracy, regardless of the type of modeling used to generate the structural models. This limitation could be due to the sample of structures used to derive the knowledge-based potentials, to the geometric characterization afforded by the models used, to the optimization methods used to generate models, or to the importance of long-range interactions that are not considered in their derivation. Therefore, to obtain better quality assessments it is reasonable to produce decoys using one potential and assess their quality using other scoring functions. An example can be found in McGuffin^{35}.

The RMSDs and Z-scores of the best predicted (by any potential) models using decoys for homology-based modeled targets and template-free modeled targets have been averaged over all targets. The results are shown in Table 3. This suggests that if we obtain RMSD and Z-score values that are not as good as these average values, then it might be possible to further improve the potentials used either by taking a linear combination of potentials or perhaps even by using a non-linear combination. For the results presented in Table 3, we knew the answer in advance, but in cases where there is not a large difference between results from single potentials, there is a chance that by combining potentials we might obtain a better performing combination. We recognize that there may be a significant opportunity for improvements in this field because for the template-free modeled targets there is a large gap between the best average prediction for a single (or optimized) potential, and those using sophisticated methods to combine them.

Here we have combined two types of multi-body potentials with the short-range pair-wise potentials to obtain optimized potentials. The optimized potentials failed to identify the native structure for several small proteins from the Decoys ‘R’Us data set (see Supplementary Materials), and in cases where the structure was stabilized by ions (Zn^{2+}) or ligands (RNA). For proteins larger than about 80 amino acids and for those which are stable alone, our optimized potentials perform as well as the atomistic potentials. This simply reflects the fact that correct packing is essential for protein stability, whether the representation is atomic or coarse-grained. When proteins are large, atomistic potentials are simply impractical in protein folding simulations. Thus, there is a need for efficient, well-performing coarse-grained potentials. We believe that our optimized potentials will be helpful not only for threading and model ranking problems, but also in protein folding simulations.

It is also important to point out that this linear combination of three potential terms is robust. In Figure 2, where we show the average RMSD of the best ranked models for template-based (homology) modeled targets, a yellow island is observed within which the performances are nearly equal. Interestingly, the parameter set obtained from optimization on template-free modeled targets, when considered in the context of Figure 2, shows no significant difference from the parameters optimized on homology template-based models. Thus these potentials can be considered universal, and they do not depend strictly on which type of modeling (homology or template-free) is being considered.

Principal component analysis (PCA) is a method for reducing a number of possibly correlated variables to a smaller number of uncorrelated variables. Li *et al*. carried out a PCA of the Miyazawa-Jernigan potentials^{40}. They used eigenvalue decomposition, which is the most commonly used method in PCA. By identifying the first principal component vector and finding a significant correlation with the vector of hydrophobicity indices of the amino acids, they showed that the dominant driving force for protein folding is the hydrophobic force. Interpreting the major principal components of combined multi-body potentials is much more difficult and requires more work. We have carried out a principal component analysis using the four-body sequence-dependent and non-sequence-dependent, short-range, BT^{25}, MJ3^{36} and SKJG^{37} potentials for the set *1sn3* from Decoys ‘R’Us. The variances of the principal components of the decoy energies for these potentials are shown in Figure 3. Each principal component is a combination of the above six potentials. It can be clearly seen that there is one major principal component, which has the highest variance. The other five principal components are less important and, by definition, are orthogonal to the major principal component and to one another. This indicates that there is high redundancy in energy model space (models usually capture common features of the system and differ mostly in their details). Correlation coefficients between two-body potentials were calculated earlier by Pokarowski *et al.*^{34}. Feng *et al.* found the correlations between sequential and non-sequential four-body potentials^{30}. We presume that combining the best performing potentials that are less correlated should provide the best results. This is something that we will pursue in our future studies.
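The idea of a dominant principal component can be illustrated with a small sketch that estimates how much of the total variance the leading component explains. Note the substitution: instead of a full eigenvalue decomposition, the sketch uses power iteration on the covariance matrix, which recovers only the leading component. The function names and the input layout (one row per decoy, one column per potential) are assumptions.

```python
def covariance_matrix(rows):
    """Sample covariance matrix; rows = decoys, columns = potentials."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(m)]
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
         for b in range(m)]
        for a in range(m)
    ]


def leading_component(cov, n_iter=500):
    """Largest eigenvalue and its eigenvector by power iteration.

    Assumes the uniform start vector is not orthogonal to the leading
    eigenvector and that cov is not the zero matrix.
    """
    m = len(cov)
    v = [1.0 / m ** 0.5] * m
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)
    )
    return eigenvalue, v


def explained_fraction(rows):
    """Share of the total variance carried by the first principal component."""
    cov = covariance_matrix(rows)
    eigenvalue, _ = leading_component(cov)
    return eigenvalue / sum(cov[k][k] for k in range(len(cov)))
```

A fraction close to 1 would mean that the potentials largely carry the same information, which is the redundancy discussed above. In practice the energies from different potentials would probably need to be standardized first, since they are on different scales.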

## Acknowledgments

We acknowledge support from NIH Grants **R01GM072014, R01GM081680** and **R01GM081680-S1**. We would also like to thank Yaping Feng for providing the server code.

## References


- Multi-Body Coarse-Grained Potentials for Native Structure Recognition and Quality Assessment of Protein Models. NIHPA Author Manuscripts. Jun 2011; 79(6): 1923.
