
# GOAP: A Generalized Orientation-Dependent, All-Atom Statistical Potential for Protein Structure Prediction

## Abstract

An accurate scoring function is a key component for successful protein structure prediction. To address this important unsolved problem, we develop a generalized orientation and distance-dependent all-atom statistical potential. The new statistical potential, generalized orientation-dependent all-atom potential (GOAP), depends on the relative orientation of the planes associated with each heavy atom in interacting pairs. GOAP is a generalization of previous orientation-dependent potentials that consider only representative atoms or blocks of side-chain or polar atoms. GOAP is decomposed into distance- and angle-dependent contributions. The DFIRE distance-scaled finite ideal gas reference state is employed for the distance-dependent component of GOAP. GOAP was tested on 11 commonly used decoy sets containing 278 targets, and recognized 226 native structures as best from the decoys, whereas DFIRE recognized 127 targets. The major improvement comes from decoy sets that have homology-modeled structures that are close to native (all within ~4.0 Å) or from the ROSETTA ab initio decoy set. For these two kinds of decoys, orientation-independent DFIRE or only side-chain orientation-dependent RWplus performed poorly. Although the OPUS-PSP block-based orientation-dependent, side-chain atom contact potential performs much better (recognizing 196 targets) than DFIRE, RWplus, and dDFIRE, it is still ~15% worse than GOAP. Thus, GOAP is a promising advance in knowledge-based, all-atom statistical potentials. GOAP is available for download at http://cssb.biology.gatech.edu/GOAP.

## Introduction

One key to the solution of the protein folding and structure prediction problems is an accurate energy function. A perfect energy function should have its global minimum free energy in the native state of a protein. In principle, such an energy function can be obtained from quantum mechanics (1). This is only feasible for small molecules and in general is not yet possible for large systems such as a protein in a solvent. Thus, by necessity, current physics-based approaches approximate the energy function using empirical molecular mechanics force fields (2–5) that contain terms associated with bond lengths, angles, torsional angles, van der Waals, and electrostatic interactions (2,3). The parameters associated with these terms are typically obtained by fitting data from quantum mechanical calculations of small peptide fragments and data from experiment (2–4,6). The resulting physics-based potentials often ignore the contribution of multibody interactions beyond pairs.

In practice, physics-based potentials are currently less successful than knowledge-based potentials (7). Knowledge-based potentials make use of the growing number of experimental protein structures. They can be categorized into contact potentials (8–10) and distance-dependent potentials (11–17), and they describe interactions at the residue or atomic level (8,9,12–19). Whereas most potentials are pairwise-additive, some multibody potentials have been developed (20–23); these are often residue-based (24–30). On the atomic level, orientation dependencies for subsets of atoms have also been investigated (31–33). For example, in dDFIRE, Yang and Zhou (31,34) introduced into DFIRE the orientation dependence of polar atom interactions (treated as dipoles), which includes hydrogen-bonding interactions, and some improvement over DFIRE (14) in refolding the protein terminal regions with secondary structures was observed.

Lu et al. (32) developed an all-atom, orientation-dependent side-chain contact potential. The orientations are defined for blocks of atoms bonded rigidly to the same residue that lie in the same plane. Because the interaction centers are on the block, rather than on individual atoms, this requires that the orientation angles be defined at high resolution to accurately determine the atomic positions within the block. Zhang and Zhang (33) added a side-chain orientation-dependence to their all-atom, distance-dependent potential that uses a reference state generated by random walk theory and showed some improvement over potentials lacking such an orientation dependence. Kortemme et al. (35) developed an orientation-dependent potential specifically for hydrogen bonding.

In this work, because an all-atom distance-dependent potential is likely needed for atomic resolution modeling and refinement, we focus on developing a more accurate knowledge-based, all-atom distance-dependent potential. We generalize the treatment of the orientation-dependence of polar atoms, blocks of atoms, or side chains to all 167 residue-specific, heavy atom types. This generalization is based on the observation that the environment around each atom is anisotropic. This effect is more pronounced for polar atoms and cannot be fully captured by introducing a vectorlike dipole (31) that still requires rotational symmetry around the dipole vector. When residues are hydrogen-bonded, this rotational symmetry might be broken (35). To better characterize their anisotropic environment, a planelike object is introduced for each atom using two of its bonded neighboring atoms and itself. When there is only one bonded neighboring atom (e.g., a backbone oxygen), a next neighboring atom is used (e.g., the C$_{\alpha}$ atom for the backbone oxygen). The introduction of a plane associated with each heavy atom requires five angle parameters in addition to the distance between interaction centers to describe a pair interaction.

We decompose the potential, named “generalized orientation-dependent all-atom potential” (GOAP), into a distance-dependent and a conditional (dependent on the given distance) angle (orientation)-dependent part. The distance dependence is treated identically as in DFIRE (14), a potential that has performed well across various applications (31,36–40). The angle-dependent part is denoted as GOAP AnGular (GOAP_AG). GOAP naturally integrates orientation-dependent polar atom interactions (34), hydrogen-bonding (35), and side-chain interactions (33). It also captures the geometry of the Cysteine disulfide bond. The only free parameter needed to derive GOAP is the sequence separation cutoff for the orientation-dependent part GOAP_AG that ignores the angular dependence between heavy atoms in residues that are close in sequence. This cutoff does not require any training and is determined from simulating the angle distributions with steric interactions and chain-connectivity (i.e., background distributions when specific pairwise interactions are switched off). GOAP was tested on 11 commonly used decoy sets for native structure recognition (33,41–44,46 and (R. Samudrala, E. Huang, and M. Levitt, unpublished)). We describe the results of this evaluation below.

## Method

### Definition of the relative orientation of interacting heavy atom planes

In this method, for each heavy (nonhydrogen) atom, we define an associated plane through the atom and its neighboring bonded heavy atoms; see Fig. 1. When an atom has two or more bonded heavy atoms (atom *A* in Fig. 1, *left*), any two of the bonded heavy atoms can be used (of course, a consistent selection is always made in deriving and evaluating the energy score). When there is only one bonded heavy atom (atom *A* in Fig. 1, *right*, e.g., main-chain oxygen), the next-neighbor, bonded atom is used (e.g., the C$_{\alpha}$ atom for the main-chain oxygen).

For each plane defined by these three atoms (e.g., *A*, *A*_{1}, *A*_{2} in Fig. 1, *left*), we define a local coordinate system using the following unit vectors,

where ${\overrightarrow{r}}_{12}=\overrightarrow{r}\left({A}_{1}\right)-\overrightarrow{r}\left(A\right),\phantom{\rule{1em}{0ex}}{\overrightarrow{r}}_{13}=\overrightarrow{r}\left({A}_{2}\right)-\overrightarrow{r}\left(A\right)$ are the relative vectors from atom A to atoms A_{1} and A_{2}, respectively; ${\overrightarrow{v}}_{x}$ and ${\overrightarrow{v}}_{z}$ lie within the plane, and ${\overrightarrow{v}}_{y}$ is the normal vector to the plane. When there is only one bonded heavy atom (Fig. 1, *right*), we change the definition of ${\overrightarrow{v}}_{z}$ to ${\overrightarrow{v}}_{z}={\overrightarrow{r}}_{12}/\left|{\overrightarrow{r}}_{12}\right|$. The values ${\overrightarrow{v}}_{x}$ and ${\overrightarrow{v}}_{y}$ do not change.
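The construction of the local coordinate system can be illustrated with a short Python sketch. The explicit formulas for the unit vectors are not reproduced in the text above, so the sketch makes assumptions that are only consistent with the stated properties: $\overrightarrow{v}_y$ is taken as the normalized cross product of $\overrightarrow{r}_{12}$ and $\overrightarrow{r}_{13}$ (normal to the plane), $\overrightarrow{v}_z$ as the normalized sum of $\overrightarrow{r}_{12}$ and $\overrightarrow{r}_{13}$ (in the plane) or $\overrightarrow{r}_{12}/|\overrightarrow{r}_{12}|$ in the single-bonded case, and $\overrightarrow{v}_x = \overrightarrow{v}_y \times \overrightarrow{v}_z$. The function name `local_frame` and its arguments are hypothetical.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(sum(c * c for c in u))
    return tuple(c / n for c in u)

def local_frame(a, a1, a2, single_bonded=False):
    """Return (v_x, v_y, v_z) for the plane through atoms A, A1, A2.

    ASSUMPTION: v_z is the normalized bisector r12 + r13 unless the atom has
    only one bonded heavy atom, in which case v_z = r12 / |r12| as in the text.
    """
    r12 = sub(a1, a)          # vector from A to A1
    r13 = sub(a2, a)          # vector from A to A2
    v_y = unit(cross(r12, r13))                       # normal to the plane
    v_z = unit(r12) if single_bonded else unit(add(r12, r13))
    v_x = cross(v_y, v_z)                             # in-plane, completes the frame
    return v_x, v_y, v_z

# Toy coordinates (not taken from a real structure).
frame = local_frame((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Whatever the exact in-plane convention, the three vectors must be mutually orthogonal and of unit length, which is easy to check for the toy input.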

To specify the relative position of the two planes associated with the interacting atoms, we require the distance between A and B, $r_{ab}$, and five angles as defined in Fig. 1: the polar angles ($\theta_a$, $\psi_a$) of vector ${\overrightarrow{r}}_{ab}=\overrightarrow{r}\left(B\right)-\overrightarrow{r}\left(A\right)$ in the local coordinate system of atom A, the polar angles ($\theta_b$, $\psi_b$) of vector ${\overrightarrow{r}}_{ba}=\overrightarrow{r}\left(A\right)-\overrightarrow{r}\left(B\right)$ in the local coordinate system of atom B, and the torsional angle $\chi$ between ${\overrightarrow{v}}_{z}\left(A\right)$ and ${\overrightarrow{v}}_{z}\left(B\right)$ around the axis ${\overrightarrow{r}}_{ab}$ or ${\overrightarrow{r}}_{ba}$.
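As a rough illustration, the distance and the five angles can be computed from two atom positions and their local frames as follows. This is a sketch under assumed conventions, since Fig. 1 is not available here: $\theta$ is measured from $\overrightarrow{v}_z$, $\psi$ is the azimuth measured from $\overrightarrow{v}_x$ toward $\overrightarrow{v}_y$, and $\chi$ is the signed angle between the projections of the two $\overrightarrow{v}_z$ vectors onto the plane perpendicular to $\overrightarrow{r}_{ab}$. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def scale(u, k):
    return tuple(k * c for c in u)

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(c / n for c in u)

def polar_angles(r, frame):
    """Polar angles (theta, psi) of vector r in the frame (v_x, v_y, v_z)."""
    v_x, v_y, v_z = frame
    u = unit(r)
    cos_theta = max(-1.0, min(1.0, dot(u, v_z)))   # clamp against round-off
    return math.acos(cos_theta), math.atan2(dot(u, v_y), dot(u, v_x))

def torsion(vz_a, vz_b, axis):
    """Signed angle between vz_a and vz_b after projection normal to axis."""
    n = unit(axis)
    pa = sub(vz_a, scale(n, dot(vz_a, n)))
    pb = sub(vz_b, scale(n, dot(vz_b, n)))
    return math.atan2(dot(cross(pa, pb), n), dot(pa, pb))

def pair_geometry(pos_a, frame_a, pos_b, frame_b):
    r_ab = sub(pos_b, pos_a)
    r_ba = sub(pos_a, pos_b)
    theta_a, psi_a = polar_angles(r_ab, frame_a)
    theta_b, psi_b = polar_angles(r_ba, frame_b)
    chi = torsion(frame_a[2], frame_b[2], r_ab)
    return math.sqrt(dot(r_ab, r_ab)), theta_a, psi_a, theta_b, psi_b, chi

# Toy frames (v_x, v_y, v_z); both are right-handed and orthonormal.
frame_a = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
frame_b = ((1.0, 0.0, 0.0), (0.0, 0.0, -1.0), (0.0, 1.0, 0.0))
geometry = pair_geometry((0.0, 0.0, 0.0), frame_a, (3.0, 0.0, 0.0), frame_b)
```

Binning $\cos\theta$ rather than $\theta$, as done later for GOAP_AG, is natural with this convention because a uniform distribution of directions is uniform in $\cos\theta$.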

### The GOAP potential

The GOAP potential is extracted from known protein structures based on the inverse Boltzmann equation,

$$E(r_{ab},\theta_a,\psi_a,\theta_b,\psi_b,\chi) = -RT\,\log\frac{p^{obs}(r_{ab},\theta_a,\psi_a,\theta_b,\psi_b,\chi)}{p^{exp}(r_{ab},\theta_a,\psi_a,\theta_b,\psi_b,\chi)}, \tag{2}$$

where *a* and *b* are the atom types of the two interacting atoms, $p^{obs}$ is the probability of the property ($r_{ab}$, $\theta_a$, $\psi_a$, $\theta_b$, $\psi_b$, $\chi$) observed in known protein structures, and $p^{exp}$ is the expected probability of the same property in a reference state without specific interactions (i.e., when $E(r_{ab},\theta_a,\psi_a,\theta_b,\psi_b,\chi) = 0$). $R$ is the universal gas constant and $T$ is the absolute temperature at which all the observed states equilibrate. $T$ is usually assumed to be room temperature (~300 K). In this work, as in others (14,17,19), we consider 167 heavy atom types. Equation 2 can be decomposed into two terms, as

$$E(r_{ab},\theta_a,\psi_a,\theta_b,\psi_b,\chi) = -RT\,\log\frac{p^{obs}(r_{ab})}{p^{exp}(r_{ab})} - RT\,\log\frac{p^{obs}(\theta_a,\psi_a,\theta_b,\psi_b,\chi|r_{ab})}{p^{exp}(\theta_a,\psi_a,\theta_b,\psi_b,\chi|r_{ab})}, \tag{3}$$

where the first term depends only on the distance $r_{ab}$, and the second term depends on the conditional probabilities $p^{obs}(\theta_a,\psi_a,\theta_b,\psi_b,\chi|r_{ab})$ and $p^{exp}(\theta_a,\psi_a,\theta_b,\psi_b,\chi|r_{ab})$. We deal with the two types of terms separately.
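The decomposition into a distance term and a conditional angular term follows from factoring each joint probability as $p(r_{ab})\,p(\mathrm{angles}|r_{ab})$. A minimal numerical check is sketched below. The probabilities are made-up illustrative numbers, a single angle stands in for the five angles, and the value of `RT` (about 0.596 kcal/mol at 300 K) is an assumption about units that the text does not specify.

```python
import math

RT = 0.596  # ASSUMPTION: kcal/mol at ~300 K; the text gives no units

def inverse_boltzmann(p_obs, p_exp, rt=RT):
    """Energy score -RT log(p_obs / p_exp) (natural logarithm)."""
    return -rt * math.log(p_obs / p_exp)

# Hypothetical probabilities for one distance bin and one angle bin.
p_obs_r, p_exp_r = 0.20, 0.10            # distance-only probabilities
p_obs_ang, p_exp_ang = 0.25, 1.0 / 12.0  # conditional angular probabilities

e_distance = inverse_boltzmann(p_obs_r, p_exp_r)
e_angle = inverse_boltzmann(p_obs_ang, p_exp_ang)

# Joint probabilities factor as p(r) * p(angle | r), so the energies add.
e_joint = inverse_boltzmann(p_obs_r * p_obs_ang, p_exp_r * p_exp_ang)
```

Because the logarithm of a product is the sum of the logarithms, `e_joint` equals `e_distance + e_angle` up to floating-point round-off.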

### The DFIRE potential

For the term that depends only on distance in Eq. 3, we employ the DFIRE (14) reference state for extracting the energy score. The DFIRE reference state is a uniformly distributed set of ideal gas (or equivalently an ideal solution of) points in a finite space. In an unbounded system comprised of an ideal gas (noninteracting point particles), the number of pairwise counts at a given distance is *density* × $4\pi r_{ab}^{2}\Delta r_{ab}$. Here, $\Delta r_{ab}$ is the bin size at the given distance. Proteins are of course finite in size. The finite size effect is taken into account by introducing a scaling factor $\alpha < 2$; then, the dependence on distance in the reference state becomes *density* × $4\pi r_{ab}^{\alpha}\Delta r_{ab}$. Another important feature of the DFIRE reference state is the assumption that at a large distance cutoff ($r_{cut}$), the distribution in the reference state equals the observed distribution in real protein structures: $N^{obs}(r_{cut})$ = *density* × $4\pi r_{cut}^{\alpha}\Delta r_{cut}$.

This assumption not only eliminates the problem of an unphysical nonzero energy at the cutoff distance found in other statistical potentials (17,19), but also determines the unknown density parameter. Therefore, the pairwise counts in the reference state can be written as

$$N^{exp}(r_{ab}) = \left(\frac{r_{ab}}{r_{cut}}\right)^{\alpha}\frac{\Delta r_{ab}}{\Delta r_{cut}}\,N^{obs}(r_{cut}). \tag{4}$$

By substituting the probabilities in the first term in Eq. 3 with the number of pairwise observations and expected values, we obtain the DFIRE energy function:

$$E^{DFIRE}(r_{ab}) = -RT\,\log\frac{N^{obs}(r_{ab})}{\left(\dfrac{r_{ab}}{r_{cut}}\right)^{\alpha}\dfrac{\Delta r_{ab}}{\Delta r_{cut}}\,N^{obs}(r_{cut})}. \tag{5}$$

The cutoff $r_{cut}$ is set to 15 Å, and $\alpha = 1.61$ as determined by the best fit of $r^{\alpha}$ to the actual distance-dependent number of ideal gas points in the 1011 finite protein-size spheres that have sizes corresponding to the 1011 nonredundant high-resolution protein structures used for deriving DFIRE (14). Beyond the cutoff distance $r_{cut}$ (i.e., for $r_{ab} > r_{cut}$), $E^{DFIRE}(r_{ab})$ is set to zero.

### The GOAP_AG potential

To overcome the problem of insufficient statistics if the angles are treated as nonseparable, we make the assumption for GOAP_AG that the dependences of the potential on the angles $\theta_a$, $\psi_a$, $\theta_b$, $\psi_b$, and $\chi$ are independent of each other at the given distance. This gives for the angular contribution

$$E^{GOAP\_AG}(\theta_a,\psi_a,\theta_b,\psi_b,\chi|r_{ab}) = E(\theta_a|r_{ab}) + E(\psi_a|r_{ab}) + E(\theta_b|r_{ab}) + E(\psi_b|r_{ab}) + E(\chi|r_{ab}), \tag{6}$$

where $E(\theta_i|r_{ab}) = -RT\log\left(p^{obs}(\theta_i|r_{ab})/p^{exp}(\theta_i|r_{ab})\right)$; $E(\psi_i|r_{ab}) = -RT\log\left(p^{obs}(\psi_i|r_{ab})/p^{exp}(\psi_i|r_{ab})\right)$, $i = a, b$; and $E(\chi|r_{ab}) = -RT\log\left(p^{obs}(\chi|r_{ab})/p^{exp}(\chi|r_{ab})\right)$. Here, $p^{obs}(\mathrm{angle}|r_{ab})$ and $p^{exp}(\mathrm{angle}|r_{ab})$, with angle = $\theta_a$, $\psi_a$, $\theta_b$, $\psi_b$, $\chi$, are the conditional probabilities of the observed and expected angles at the given distance $r_{ab}$. This assumption of independence of angular distributions has also been made in treatments of H-bonding (35) and in dDFIRE (31,34). In deriving $E^{GOAP\_AG}$, we bin the $\cos(\theta_{a,b})$, $\psi_{a,b}$, and $\chi$-values into $N_{bin} = 12$ equally sized bins and assume that the expected probabilities are constant for all bins,

$$p^{exp}(\mathrm{angle}|r_{ab}) = \frac{1}{N_{bin}}. \tag{7}$$

However, this assumption is only good when a suitable sequence separation cutoff is applied for $E^{GOAP\_AG}$. To avoid a zero count, the initial count values for each angle bin are set to 0.1.

### The sequence separation cutoff for GOAP_AG

The rationale for applying a sequence separation cutoff *s* (i.e., two interacting atoms *a* and *b* must reside in separate residues *i* and *j* satisfying |*i* − *j*| ≥ *s*) for the angle-dependent energy term $E^{GOAP\_AG}$ is based on the observation that at small *s*, the angle distributions are mainly determined by steric interactions and direct chain connectivity rather than by nonbonded, nonsteric interactions. Therefore, the expected distributions (when nonbonded, pairwise interactions are switched off) will not be constant (i.e., independent of angle). It is not trivial to obtain accurate expected angle distributions when they are not constant, which is what happens for small cutoff values of *s*. Therefore, we introduce the cutoff parameter *s* into GOAP and ignore the orientation dependence when |*i* − *j*| < *s* (because it cannot be accurately obtained). Then, GOAP can be written as

$$E^{GOAP} = \begin{cases} E^{DFIRE}(r_{ab}) + E^{GOAP\_AG}(\theta_a,\psi_a,\theta_b,\psi_b,\chi|r_{ab}), & |i-j| \ge s, \\ E^{DFIRE}(r_{ab}), & |i-j| < s, \end{cases} \tag{8}$$

where *i* and *j* are the residue numbers on which the two interacting atoms reside.
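The role of the cutoff in the per-pair energy can be sketched as below. The default `s=7` and the exclusion of same-residue pairs from the distance term are the values the paper reports for GOAP_AG and DFIRE, respectively; the function name and signature are hypothetical, and the precomputed energies passed in are placeholders.

```python
def goap_pair_energy(e_dfire, e_goap_ag, res_i, res_j, s=7):
    """Combine distance and angular terms for one atom pair.

    e_dfire and e_goap_ag are precomputed energies for the pair;
    res_i and res_j are the residue numbers of the two atoms.
    """
    separation = abs(res_i - res_j)
    if separation == 0:
        return 0.0                   # same residue: excluded from DFIRE (s = 1)
    if separation < s:
        return e_dfire               # too close in sequence: distance term only
    return e_dfire + e_goap_ag       # full GOAP contribution

# Placeholder energies for illustration only.
same_residue = goap_pair_energy(-1.0, -0.5, 10, 10)
near_in_sequence = goap_pair_energy(-1.0, -0.5, 10, 13)
far_in_sequence = goap_pair_energy(-1.0, -0.5, 10, 17)
```

Summing this quantity over all heavy-atom pairs of a model would give its total score under these assumptions.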

To determine the value of *s* for which the expected angle distribution is essentially constant, we employed a Monte Carlo simulation to simulate the angular distributions allowing only steric interactions (we exclude conformations in which any nonbonded heavy atom pair lies within a distance of 3.3 Å of each other) in a 50-residue Alanine peptide with ideal bond lengths and bond angles taken from CHARMM (3). The standard deviations of the binned distributions are then examined. The standard deviation is defined as

$$\sigma = \sqrt{\left\langle \left(p(\mathrm{angle}|r_{ab}) - \left\langle p(\mathrm{angle}|r_{ab}) \right\rangle\right)^{2} \right\rangle}, \tag{9}$$

where the average is over all bins at given distance, in which $N_{bin}$ = 12 is the number of bins for each angle parameter. The value $\sigma$ measures the uniformity of the distribution. The value *s* should be large enough so that the background distribution is close to uniform, i.e., $\sigma$ = 0. It should also be small enough so that the contribution of GOAP_AG to the total potential will not be neglected too much (note that $\sigma$ = 0 when *s* = ∞). Therefore, a suitable value of *s* is a compromise of the two effects. In practice, we look at the change of the background standard deviation at each *s* (the slope of the $\sigma(s)$ versus *s* curve). A reasonable choice of *s* is when the slope is close to zero (then, an increase of *s* will result in a negligible change of the standard deviation).

Fig. 2 shows the average standard deviations from the simulation of expected angle distributions on a 50-residue Alanine peptide when only steric interactions are applied. The main-chain dihedral angles of the peptide were randomly sampled for 5,000,000 steps. At each step, all the dihedral angles are set to random values. When no steric clashes are present, the conformation will be used for counting the angle distributions. The angle distributions (normalized summation over bins to be 1 for each angle) at each distance bin (20 bins span from 0 to 15 Å) were obtained. The plot shows the standard deviations (*σ*) of the distributions averaged over all-atom type pairs and all distance bins. The steric interactions and chain-connectivity most affect the *θ*-angle and least the *χ*-dihedral angle. At a small cutoff of *s* = 3, the distributions deviate the most from uniform. Beyond *s* = 7, the curves are almost flat with little change. Defining the slope at *s* as *σ*(*s*+1)–*σ*(*s*), we find the slope at *s* = 7 for *θ*-angle distribution to be −0.013. The slopes at *s* = 5, *s* = 6, and *s* = 8 are −0.045, −0.020, and −0.010, respectively. Therefore, at ~*s* = 7, *σ* starts to change very slowly. In this work, *s* = 7 is applied to GOAP_AG. For the distance-dependent part, DFIRE potential, we set *s =* 1, so as to exclude interactions within the same residue.
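The uniformity measure and the slope criterion can be sketched as follows. The slope values are the ones quoted above for the *θ*-angle distribution. The tolerance of 0.015 is a hypothetical threshold chosen only to make the selection rule explicit; the paper picks *s* = 7 by inspecting the curve rather than by a stated numeric threshold. Function names are likewise hypothetical.

```python
import math

def uniformity_sigma(probs):
    """Standard deviation of a normalized binned distribution over its bins."""
    mean = sum(probs) / len(probs)
    return math.sqrt(sum((p - mean) ** 2 for p in probs) / len(probs))

def slopes(sigma_by_s):
    """Finite-difference slope sigma(s + 1) - sigma(s) wherever both exist."""
    return {s: sigma_by_s[s + 1] - sigma_by_s[s]
            for s in sigma_by_s if s + 1 in sigma_by_s}

def first_flat_s(slope_by_s, tol):
    """Smallest s whose slope magnitude is within tol, or None."""
    for s in sorted(slope_by_s):
        if abs(slope_by_s[s]) <= tol:
            return s
    return None

# Slopes for the theta-angle distribution as reported in the text.
reported_theta_slopes = {5: -0.045, 6: -0.020, 7: -0.013, 8: -0.010}

# HYPOTHETICAL tolerance; not a parameter of the published method.
chosen_s = first_flat_s(reported_theta_slopes, tol=0.015)
```

A perfectly uniform 12-bin distribution gives `uniformity_sigma` equal to zero, which is the limit the background distribution approaches as *s* grows.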

### URL for Protein Structure Library and GOAP

GOAP is obtained using the same 1011 protein structures as in DFIRE (14); the list of structures along with the GOAP potential is available at http://cssb.biology.gatech.edu/GOAP. Using more structures to obtain the potential does not significantly change the performance of GOAP.

### Decoy sets and potentials evaluated

Tested decoy sets include the multiple decoy sets from the Decoys ‘R’ Us decoy set at http://dd.compbio.washington.edu/. These sets are the 4state_reduced (42), fisa (41), fisa_casp3 (41), lmds (43), lattice_ssfit (44,47), hg_structal and ig_structal (R. Samudrala, E. Huang, and M. Levitt, unpublished), and ig_structal_hires (45). The MOULDER decoy set (46) is downloaded from http://salilab.org/decoys/. The ROSETTA all-atom decoy set is obtained from http://depts.washington.edu/bakerpg/decoys/, and the I-TASSER set (33) from the Zhang lab is obtained from http://zhanglab.ccmb.med.umich.edu/.

We compare our GOAP potential with the following all-atom potentials: the DFIRE potential (14) that is part of GOAP; the RWplus potential (33) that uses random walk theory for the reference state and includes side-chain orientations; the dDFIRE potential (31,34) that includes polar-polar, polar-nonpolar atom orientations described with vectorlike dipoles; and the OPUS-PSP potential (32) that defines orientations of blocks of side-chain atoms. OPUS-PSP is a contact potential, whereas the others are distance-dependent. The programs dDFIRE, RWplus, and OPUS-PSP are downloaded from the corresponding authors' websites. A perfect potential should rank the native structure as the lowest energy structure. The significance of the native structure energy ($E_{native}$) is given by its Z-score, defined as Z-score = ($E_{native}$ − $E_{ave}$)/$\sigma_E$, where $E_{ave}$ is the average energy of all decoys and $\sigma_E$ is the energy standard deviation of all decoys.

## Results

### Native structure recognition from decoys

The performance of various potentials on the 11 decoy sets for native structure recognition is compared in Table 1. GOAP achieves the best success rate, with 226 out of 278 targets having their native energy as the lowest, and the best average Z-score per target. Compared to DFIRE (at 127), RWplus (at 135), and dDFIRE (at 164), GOAP provides a significant improvement in both success rate and Z-score. Only the performance of OPUS-PSP (at 196) is comparable. Still, our method shows a 15% better success rate compared to OPUS-PSP. All the improvements of GOAP are from the three homology modeling sets (hg_structal, ig_structal, and ig_structal_hires (45)) and the ab initio ROSETTA set.
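The two quantities compared here, the number of targets whose native structure has the lowest energy and the Z-score of the native energy, can be computed as in the sketch below. The energies are invented for illustration. Whether the decoy standard deviation is the population or the sample form is not stated in the text; the population form (`statistics.pstdev`) is an assumption.

```python
import math
import statistics

def z_score(e_native, decoy_energies):
    """(E_native - E_ave) / sigma_E over the decoys of one target."""
    e_ave = statistics.fmean(decoy_energies)
    sigma_e = statistics.pstdev(decoy_energies)  # ASSUMPTION: population form
    return (e_native - e_ave) / sigma_e

def native_is_lowest(e_native, decoy_energies):
    """True when the native structure scores below every decoy."""
    return e_native < min(decoy_energies)

def success_count(targets):
    """Number of targets whose native structure is ranked first."""
    return sum(native_is_lowest(native, decoys) for native, decoys in targets)

# Invented (native energy, decoy energies) pairs for two toy targets.
toy_targets = [
    (-20.0, [-10.0, -12.0, -8.0, -10.0]),   # native is clearly lowest
    (-9.0, [-10.0, -12.0, -8.0, -10.0]),    # native is not lowest
]
toy_z = z_score(*toy_targets[0])
toy_successes = success_count(toy_targets)

# Relative gain of GOAP over OPUS-PSP from the counts quoted in the text.
relative_gain = 226 / 196 - 1.0   # about 0.15
```

A more negative Z-score means the native structure is better separated from the decoy ensemble.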

These sets have the common feature that their decoys have more realistic bond lengths and angles than decoys in most other sets. They are relatively hard for conventional methods such as DFIRE and RWplus that do not fully incorporate orientation dependence. dDFIRE's success rate on the homology modeling sets is comparable to that of OPUS-PSP, but dDFIRE performs poorest on the ROSETTA set. For the five traditional Decoys ‘R’ Us sets (4state_reduced, fisa, fisa_casp3, lmds, and lattice_ssfit) that are most often used in the literature (14,17,19,21,33), GOAP recognizes the native energy as lowest for 30 out of 34 targets, whereas DFIRE, RWplus, and dDFIRE all recognize 28, and OPUS-PSP recognizes 31 targets whose native energy is the lowest. It should be noted that OPUS-PSP has a free parameter (the weight of the repulsive Lennard-Jones term) trained on the 4state_reduced set.

### Correlation of energy score with model quality and model selection

Although the ability to assign the native structure as being lowest in energy is the most important characteristic of a good potential, for an energy function to be useful for guiding conformation sampling, it should also have a good correlation with model quality. In Table 2, we compare the performance of different potentials as assessed by both their Pearson correlation coefficient of energy and TM-score (48) and the TM-score of the lowest energy structure. The 112-protein CASP9 (49) target set (models were generated by all CASP9 servers and downloaded from the CASP9 website http://predictioncenter.org/casp9/; most are homology modeling structures) is also included. Here, we use the TM-score (48) instead of the root mean-square deviation of the model to native, because if the majority of the structure is of good quality, the TM-score is insensitive to local substructures that differ significantly from native, whereas root mean-square deviation is quite sensitive to such effects.

We find that GOAP gives the best Pearson coefficient of the energy score with TM-score (48) and has the best average TM-scores of the selected models. OPUS-PSP does much worse as assessed by the correlation coefficient, but comes in second in model selection. DFIRE is very close to OPUS-PSP in model selection but does much better than OPUS-PSP in its correlation with TM-score. Because DFIRE is part of GOAP, its good performance in correlation and model selection is passed on to GOAP. Because of the inclusion of the orientation-dependent part GOAP_AG, GOAP performs better than DFIRE; e.g., for the three homology modeling decoy sets, GOAP is >5% better, on average, than the other methods in terms of its correlation with TM-score.
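The two model-quality measures used in this comparison can be sketched for a single target as follows: the Pearson correlation coefficient between energy and TM-score over all models, and the TM-score of the lowest-energy model. The numbers are invented and the function names are hypothetical. Because lower energy should correspond to higher TM-score, a well-behaved potential gives a strongly negative coefficient under this sign convention.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def tm_of_lowest_energy(energies, tm_scores):
    """TM-score of the model that the potential ranks first."""
    best = min(range(len(energies)), key=energies.__getitem__)
    return tm_scores[best]

# Invented models for one target: energy decreases as TM-score increases.
energies = [-5.0, -4.0, -3.0, -2.0]
tm_scores = [0.9, 0.7, 0.5, 0.3]

correlation = pearson(energies, tm_scores)
selected_tm = tm_of_lowest_energy(energies, tm_scores)
```

Averaging both quantities over all targets of a decoy set would give per-set summaries of the kind tabulated for each potential.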

Fig. 3 shows some examples of the correlation of TM-score and GOAP energy. It is noteworthy that for target T0581 in the CASP9 set, a template-free modeling target, only GOAP identifies the single good model with a TM-score = 0.66 (BAKER-ROSETTASERVER_TS4; the next best has a TM-score of 0.36) in the first position (see Fig. 3 *d*). The rankings of this model by the other methods are: fourth in DFIRE and RWplus, third in dDFIRE, and fifth in OPUS-PSP. These results show the advantage of GOAP in selecting the best models.

Fig. 3 (caption fragment): (*a*–*c*) Native structures are included to show their positions in the energy landscapes.

To establish the statistical significance of the small difference between GOAP and other methods, two-sided *P* values of the Student's *t*-test of the differences between GOAP and other methods for the Pearson's correlation coefficients and the TM-scores of the lowest energy models are also shown in Table 2. Except for the TM-score difference between GOAP and OPUS-PSP (*P* value = 0.065), GOAP gives statistically significantly (*P* value < 0.05) better results than all other methods for both TM-score and Pearson correlation. The number of targets whose top-ranked models have a TM-score to native >0.5 is also given; clearly, there is very little difference between methods.

We have shown that GOAP performs much better than other all-atom statistical potentials in native structure recognition and consistently better than those potentials in the correlation of the energy with TM-score of the models and in selecting the best models. In what follows, we shall examine the factors that could contribute to the better performance of GOAP as well as the validity of its approximations.

### Effects of sequence separation cutoff and main-chain atoms

The angle-dependent GOAP_AG potential depends on the sequence separation cutoff *s*. We have suggested that *s* should not be too small and have chosen a reasonable value *s* = 7. To show that this choice indeed results in a better potential than a smaller *s* or a larger *s* (for which more orientation-dependent energy is neglected), we give in Table 3 the performance of GOAP with *s* = 2 and *s* = 10. Because two of the compared methods, RWplus (33) and OPUS-PSP (32), considered orientations using only side-chain atoms, we also give in Table 3 the performance of GOAP with *s* = 7 (the default value) and GOAP_AG evaluated only for main-chain atoms or only for side-chain atoms. Clearly, with a smaller *s* = 2, or larger *s* = 10, GOAP's performance is worse than with the default choice *s* = 7 in native recognition success rate and Z-score (compare Table 3 with Table 1).

Table 3 (caption fragment): *s* = 2, 10, and the default *s* = 7 and GOAP_AG evaluated only for main-chain or side-chain atoms, respectively

Even worse performance is seen when main-chain atoms are not included in the GOAP_AG energy. Therefore, the contribution to GOAP_AG from main-chain atoms is more important than that from side-chain atoms. However, the sensitivities of decoy sets to the *s* cutoff and main-chain atom inclusion are different. The most sensitive sets are the homology modeling sets hg_structal, ig_structal, ig_structal_hires, and ROSETTA ab initio set. Thus, proper choice of sequence separation cutoff and inclusion of all atoms in the orientation-dependent energy term are crucial for our method to improve over other orientation-dependent/independent, all-atom potentials.

### Examples of orientation dependence

Here, we examine some examples of angle distributions involving polar-polar, polar-nonpolar, and nonpolar-nonpolar atom pairs. To show that it is necessary to consider the orientations of all, not just polar, atoms, and at what distance the effects of orientations are most important, in Fig. S1 *a*–*c* (see Supporting Material), we present the average standard deviation of the angle-dependent energy terms (see Eq. 6) of the GOAP potential over all polar-polar, polar-nonpolar, and nonpolar-nonpolar pairs, respectively, for 1), $E(\theta|r_{ab})$, 2), $E(\psi|r_{ab})$, and 3), $E(\chi|r_{ab})$. Polar atoms are nitrogen, oxygen, and sulfur in Cysteine; all other atoms are nonpolar. The standard deviation for the energy term of a given pair at given distance is defined as

$$\sigma = \sqrt{\left\langle \left(E(\mathrm{angle}|r_{ab}) - \left\langle E(\mathrm{angle}|r_{ab}) \right\rangle\right)^{2} \right\rangle}, \tag{10}$$

where the average is over the $N_{bin}$ = 12 angle bins. From Fig. S1, we see that all three angles ($\theta$, $\psi$, and $\chi$) for all three kinds of pairs (polar-polar, polar-nonpolar, and nonpolar-nonpolar) deviate most from uniform at ~4 Å, but the differences between different kinds of pairs become obvious at ~6 Å. It is understandable that the differences between different kinds of pairs are larger at distances <4 Å. These results demonstrate that even for nonpolar-nonpolar atom pairs, their full orientation dependence is required. When GOAP is used to calculate the energy scores on the 1011 native protein structures that are used for deriving the GOAP potential, the average DFIRE score per protein is −21,565, whereas the average GOAP_AG score is −19,769. This means that the energy contribution of the orientation-dependent part is almost the same as that of the distance-dependent part for a typical protein.

In Fig. S2, we show some specific examples of the angular dependence of polar-polar, polar-nonpolar, and nonpolar-nonpolar pairs. We shall focus mainly on the *ψ*-dependence because the *θ*, *χ* dependences for polar atoms have been investigated in the dDFIRE potential (31,34). Fig. S2 *a* shows the nonuniform *ψ*-dependence of the disulfide bond Cys SG-Cys SG at 2.25 Å. The energy has two favorable positions at ±75°. Fig. S2 *b* shows the *ψ*-dependence of a typical hydrogen bond (H-bond) between Ala N and Ala O at 2.75 Å. The dip at −105° shows that the *ψ*-degree of freedom is necessary for accurately describing an H-bond. In the dDFIRE potential (31), polar atoms are represented by a dipole and only *θ* is defined for each atom. Fig. S2 *c* shows an example of a polar-nonpolar interaction, Ala O-Ala CB at 3.25 Å. The *ψ*-dependence of the nonpolar atom Ala CB is shown. The interaction is favored when *ψ* > 75°. Fig. S2, *d*–*f*, shows the *θ*-, *ψ*-, and *χ*-dependences of the Ala CB-Ala CB interactions at 3.75 Å, respectively. Even though this interaction involves only nonpolar atoms, its dependence on all three angles is not uniform. Thus, all three angles are required to describe this typical nonpolar atom interaction accurately.

### Effects of orientation dependence on GOAP's performance

The above analysis presented some observations regarding the orientation dependence of atomic pair interactions. Here, we analyze the contributions of different orientational terms and the overall orientational contribution of the nonpolar-nonpolar interactions. In Table 4, we show the performance of GOAP when the contribution of each of the three types of angle (*θ*, *ψ*, and *χ*) terms is not included and when all angle-dependent terms are not considered for nonpolar-nonpolar pairs. Table 4 shows that the contribution from the *θ*-angle is the most important and that from *ψ* the least. It also shows that, consistent with the previous analysis, the orientational dependence from nonpolar-nonpolar interactions contributes somewhat positively to GOAP's performance.

### The independence of angular distributions

The assumption made in Eq. 6 that all angle distributions are independent is intended to overcome the problem of too few cases that satisfy the joint distribution of five angles at each distance. How well this assumption holds is not yet known. We examine here a typical polar-polar interaction of main-chain N-O pairs using amino-acid nonspecific atom types to increase the statistics of the joint distribution and derive the joint distribution with a larger dataset of 3506 proteins downloaded from http://dunbrack.fccc.edu/PISCES.php (50). The covariance of all angle pairs at different distances is given in Fig. S3, *a*–*c*. From these figures, we observe that $\theta_N$ has a relatively stronger covariance with $\theta_O$ at ~6 Å and 8 Å, whereas, with the other three angles, it shows weaker covariation for all distances (see Fig. S3 *a*). The covariance of $\psi_N$ with $\psi_O$ and $\chi$ has a dip or peak at ~2 Å. Beyond 4 Å, the covariance is weak (see Fig. S3 *b*). These results indicate that except for a somewhat narrow range of distances, the assumption of independent angular distributions holds reasonably well.

## Discussion

In this article, we have improved the description of pairwise atomic interactions by introducing the orientation dependence of all individual heavy atoms. However, to obtain the orientational contribution to the potential, a sequence separation cutoff is needed. The cutoff is the only free parameter and is obtained by Monte Carlo simulation of a noninteracting peptide. We find that inclusion of main-chain atoms has a greater effect on GOAP's performance than the cutoff (see Table 3). This is consistent with the findings in the OPUS-PSP article (32), where the authors reported the results for the decoy sets ig_structal and ig_structal_hires when main-chain block types were included.

The results in Wu et al. (26) (46(−2.79) for ig_structal and 19(−3.03) for ig_structal_hires) are much better than the ones (20(0.693) and 14(−0.768)) we obtained using the downloaded OPUS-PSP program that ignores such main-chain interactions. The main-chain blocks include the main-chain amide and carbonyl groups, and therefore, they take into account the hydrogen-bond interactions. However, when these blocks are included in OPUS-PSP, it recognizes only 24 of 34 native structures in the five Decoys ‘R’ Us sets. Thus, inclusion of main-chain interactions does not necessarily improve the overall performance of OPUS-PSP. The authors of OPUS-PSP (32) suggest that their rigid-body description is not suited for optimizing main-chain hydrogen-bond interactions. Another reason could be that OPUS-PSP defines main-chain blocks that do not depend on specific amino-acid types, whereas in real proteins, different amino acids have different preferences for different secondary structures; this feature is included in GOAP.

In tests on commonly used decoy sets, GOAP performs better than other all-atom potentials in native structure recognition. It is also consistently better in the correlation of energy score with model quality, as measured by the TM-score to native, and in the selection of good models. The close homology modeling decoys and the ROSETTA ab initio decoys are particularly sensitive to the performance of all-atom potentials, and GOAP performs consistently better than other potentials on these decoy sets. Thus, GOAP might prove to be useful in high accuracy protein structure refinement and in ab initio structure prediction, but this remains to be demonstrated. Its application to side-chain modeling and protein design might give better results than the OPUS-PSP potential, because it has atomic resolution and a distance dependence, and it includes main-chain atoms, in contrast to the block resolution, contact nature, and side-chain atom restrictions of the OPUS-PSP potential (51). Applications of GOAP to these areas are under investigation.

GOAP can also be included in composite knowledge-based scores, such as the QMEAN score (52) and the score employed by Eramian et al. (53), to develop a more accurate scoring function for model ranking and selection, and for absolute model quality prediction (54,55). These methods integrate different kinds of scores using a machine learning approach or a linear combination with trained weights. Because GOAP does not include short-range (<7 sequence separation) angle correlations, adding a backbone torsion-angle-dependent, knowledge-based score, as in the QMEAN approach, might further enhance its performance in native structure recognition and model selection.
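The linear-combination variant of such a composite score can be sketched in a few lines of Python. This is a minimal illustration, not the QMEAN method or that of Eramian et al.: the function names, the per-term z-score normalization, and the weights in the usage note are all assumptions, and in practice the weights would be trained on a set of models of known quality.

```python
from statistics import mean, pstdev


def zscores(values):
    """Normalize raw scores to zero mean and unit variance over one model set."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # a constant term carries no ranking signal
    return [(v - mu) / sigma for v in values]


def composite_score(terms, weights):
    """Weighted sum of z-normalized score terms, one value per model.

    terms:   {term_name: [raw score for each model]} (all lists of equal length)
    weights: {term_name: weight}; hypothetical values, to be trained in practice
    Lower composite values indicate better models if every term is energy-like.
    """
    n_models = len(next(iter(terms.values())))
    normalized = {name: zscores(values) for name, values in terms.items()}
    return [
        sum(weights[name] * normalized[name][i] for name in terms)
        for i in range(n_models)
    ]
```

For example, `composite_score({"goap": goap_scores, "torsion": torsion_scores}, {"goap": 1.0, "torsion": 0.5})` would return one combined value per model, where `goap_scores` and `torsion_scores` are placeholder lists and the two weights are arbitrary.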

The physical source of the orientation-dependence is the anisotropic nature of the atomic electronic environment that also depends on the position and identity of the interacting partner. Our potential demonstrates that such anisotropy is found in all kinds of atoms (polar and nonpolar). The improved performance of our GOAP potential and other orientation-dependent potentials over orientation-independent ones (such as the DFIRE) has implications for the development of more accurate physics-based, all-atom potentials. Traditional physics-based all-atom force fields (2,3) represent atoms as hard spheres and take into account orientation dependencies only for bonded atoms in the angle and dihedral angle terms.

Nonbonded interactions are described by short-range van der Waals and long-range electrostatic terms and lack any angular orientation dependence. In recent developments of molecular-mechanics force fields, the electronic polarization of the atomic environment has been taken into account by calculating induced charges during the simulation (56). However, this approach is computationally too expensive for protein simulations, and even so, a dipole description of electronic polarization remains inadequate for protein atoms. In contrast, GOAP naturally takes into account the orientation-dependence of H-bonds, disulfide bonds, salt-bridges, and other possible pair interactions at all distances.

However, because the expected angle distributions are estimated inaccurately at shorter sequence separations, we introduced a sequence separation cutoff of *s* = 7, and as a result only nonlocal H-bonds (e.g., those in *β*-sheets and between side chains far apart in sequence) are included. Although the knowledge-based potential can be used directly in Monte Carlo simulations and in model selection, its application to molecular dynamics simulations requires differentiable functions. This could be done using splines. Although GOAP might be useful for sampling conformations using molecular dynamics, the resulting thermodynamic properties might be unrealistic because GOAP is a potential of mean force derived from the statistics of solved protein structures. As a means of generating good quality models, our ranking results here are suggestive, but it remains to be demonstrated whether the good performance of GOAP will be retained when it is used to drive the conformational search rather than to select among extrinsically generated decoys. This is a promising avenue that is currently being pursued.
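One way to obtain a differentiable function from a binned potential is sketched below in Python. It uses a cubic Hermite spline with finite-difference tangents, which is only one of several possible spline choices; the text above does not say which kind of spline would be used. The function name `make_hermite_spline` and the tabulated values in the usage note are hypothetical.

```python
from bisect import bisect_right


def make_hermite_spline(xs, ys):
    """Build a C1 cubic Hermite spline through tabulated points.

    xs: strictly increasing bin centers (e.g., distances); ys: potential values.
    Returns (value, derivative) callables. Outside [xs[0], xs[-1]] the end
    segments are extrapolated, so callers should stay inside the table range.
    """
    n = len(xs)
    if n < 2 or len(ys) != n:
        raise ValueError("need at least two points and matching lengths")

    # Knot tangents: central differences inside, one-sided at the two ends.
    tangents = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        tangents.append((ys[hi] - ys[lo]) / (xs[hi] - xs[lo]))

    def locate(x):
        # Segment index i, segment width h, and local coordinate t in [0, 1].
        i = min(max(bisect_right(xs, x) - 1, 0), n - 2)
        h = xs[i + 1] - xs[i]
        return i, h, (x - xs[i]) / h

    def value(x):
        i, h, t = locate(x)
        h00 = 2 * t**3 - 3 * t**2 + 1
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        return (h00 * ys[i] + h10 * h * tangents[i]
                + h01 * ys[i + 1] + h11 * h * tangents[i + 1])

    def derivative(x):
        # d/dx of the Hermite form; the chain rule contributes a factor 1/h.
        i, h, t = locate(x)
        d00 = 6 * t**2 - 6 * t
        d10 = 3 * t**2 - 4 * t + 1
        d01 = -6 * t**2 + 6 * t
        d11 = 3 * t**2 - 2 * t
        return ((d00 * ys[i] + d01 * ys[i + 1]) / h
                + d10 * tangents[i] + d11 * tangents[i + 1])

    return value, derivative
```

Given made-up tabulated values such as `xs = [2.0, 2.5, 3.0, 3.5]` and `ys = [1.0, -0.5, -0.2, 0.0]`, calling `value, derivative = make_hermite_spline(xs, ys)` yields an energy function and its first derivative, whose negative is the force along that coordinate. The first derivative is continuous across bin boundaries, which is the minimum that a molecular dynamics integrator needs.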

## Acknowledgments

The authors thank Dr. Bartosz Ilkowski for managing the cluster on which this work was conducted.

This work is supported by National Institutes of Health grant GM-48835.

## Supporting Material

**Document S1. Three figures** (PDF, 213K)

## References

*π* interactions in proteins and comparison with semiempirical and quantum chemistry approaches. J. Chem. Inf. Model. 2006;46:884–893.

*α* positions. Protein Sci. 2007;16:1449–1463.

**The Biophysical Society**


GOAP: A Generalized Orientation-Dependent, All-Atom Statistical Potential for Protein Structure Prediction. Biophysical Journal. Oct 19, 2011; 101(8):2043.
