Journal of Molecular Modeling
J Mol Model. Sep 2009; 15(9): 1093–1108.
Published online Feb 21, 2009. doi:  10.1007/s00894-009-0454-9
PMCID: PMC2712621
NIHMSID: NIHMS120758

Solvent accessible surface area approximations for rapid and accurate protein structure prediction

Abstract

The burial of hydrophobic amino acids in the protein core is a driving force in protein folding. The extent to which an amino acid interacts with the solvent and the protein core is naturally proportional to the surface area exposed to these environments. However, an accurate calculation of the solvent-accessible surface area (SASA), a geometric measure of this exposure, is numerically demanding as it is not pair-wise decomposable. Furthermore, it depends on a full-atom representation of the molecule. This manuscript introduces a series of four SASA approximations of increasing computational complexity and accuracy as well as knowledge-based environment free energy potentials based on these SASA approximations. Their ability to distinguish correctly from incorrectly folded protein models is assessed to balance speed and accuracy for protein structure prediction. We find the newly developed “Neighbor Vector” algorithm provides the best balance of accurate yet rapid exposure measures.

Keywords: Environment free energy, Protein structure prediction, Solvent accessible surface area

Introduction

Computational protein structure prediction gains importance in the post-genomic era

Genome sequencing has provided a wealth of information about the amino acid sequence of proteins. While x-ray crystallography and nuclear magnetic resonance spectroscopy made great progress in elucidating the structure of many of these proteins, these experimental techniques are laborious and are not feasible for use on all proteins [1]. In particular, membrane proteins, which comprise greater than 50% of all drug targets [2], and large protein complexes evade experimental structure elucidation. While up to 35% of all proteins are membrane proteins [3], less than 2% of structures deposited in the PDB belong to this class (as of 02/2008). Therefore, there has been an increased demand for computational methods to predict the structure for such proteins and to assist in structure elucidation from sparse or low-resolution experimental data generated by complementary techniques such as electron paramagnetic resonance spectroscopy [4], x-ray crystallography [5], and cryo-electron microscopy [5].

Protein structure prediction techniques can be categorized into comparative modeling techniques that build a model of the target protein based on the known structure of a related template protein, and de novo structure prediction techniques that can be used in the absence of a suitable template structure [6]. Proteins usually fold into the conformation with the lowest free energy, so protein structure prediction is essentially a search amongst all possible conformations of an amino acid sequence for the conformation with the lowest free energy. While both classes of protein structure prediction techniques depend critically on energy functions to evaluate the candidate conformations (also commonly called models), de novo structure prediction in particular requires very rapid yet accurate energy evaluation functions in order to search a large conformational space in a short period of time [6]. These energy evaluation functions approximate the energy of a given protein model and thus provide a way to “score” each model. Both comparative modeling and de novo structure prediction methods have been evaluated in recent critical assessment of structure prediction (CASP) experiments [7] during which computational methods have repeatedly predicted protein structures de novo to within 5 Å Cα rmsd [8].

Knowledge-based energy functions allow accurate and rapid calculation of classical energy terms

Energetic terms, such as hydrogen bonding, electrostatics, and van der Waals forces contribute to the interactions of atoms within a protein as well as between the protein and solvent [9]. While molecular mechanics force-fields seek to individually describe each of these starting from first principles, knowledge-based potentials (KBPs) seek to derive energy functions that describe the net effect of all these contributions in a specific setting, e.g., protein structures [10]. Hence, they approximate the overall free energy more generally, and frequently encompass multiple classical energy terms associated with a physical interaction [11]. KBPs have been shown to be an effective alternative to using atomic solvation parameters to more precisely model the folding process [12].

KBPs relate the probability of a conformation to the energy associated with that conformation using an inverse Boltzmann relation [13]:

E = −k_B T ln( P_observed / P_expected )

which provides a means for the derivation of a free energy from a propensity. Advantages of knowledge-based potentials include the comprehensive and unbiased inclusion of all experimentally elucidated protein structures. Disadvantages are the requirement of a vast knowledge-base [11], potential biases in the knowledge-base that translate into the potentials [11], and difficulty aligning components of the knowledge-based energy contributions with classical energy terms [11]. Nevertheless, the widespread use of knowledge-based free energy potentials in predicting protein structure [14–18], protein-protein interactions [19, 20], protein-ligand interactions [21–24], and in protein design [25, 26] underlines their success in recent years. Knowledge-based energy terms have been derived for all levels of protein architecture, most notably atoms [15, 27], amino acids [18], secondary structure elements [28], and the overall protein fold [29]. Often, several of these knowledge-based free energy approximations are linearly combined into a single composite energy function without addressing the overlap between individual terms that arises when the same classical energy contributions are described at different levels of architecture.

Amino acid environment energy depends on an accurate yet rapid estimation of solvent accessible surface area (SASA)

The amino acid “environment free energy” [30, 31] encompasses amino acid interactions with the solvent (solvation) as well as with the protein core and integrates hydrogen bonding, electrostatics, and van der Waals forces among others. It is an important driving force in protein folding as it maps to effects like surface area minimization, burial of hydrophobic side chains, and side-chain packing density [30].

The extent to which an amino acid interacts with its environment, the solvent and the protein core, is naturally proportional to the degree to which it is exposed to these environments [32]. The solvent-accessible surface area (SASA) is a geometric measure of this exposure, and therefore a dependency exists between SASA and environment free energy [33, 34]; some approaches even assume a strictly linear relation between the two values [32, 35]. An explicit calculation of the SASA is computationally intractable as this value is, by nature, not pair-wise decomposable [36]. Hence an accurate but pair-wise decomposable approximation of SASA is often used in conjunction with KBPs to describe environment free energy [18].

A precise calculation of solvent accessible surface area is numerically demanding and not practical for computational protein structure prediction

SASA is typically calculated by methods involving the in-silico rolling of a spherical probe, which approximates a water molecule, around a full-atom protein model. Lee and Richards presented the first algorithm for calculating the solvent-accessible surface area (SASA) of a molecular surface [37]. Their method involved the extension of the van der Waals radius for each atom by 1.4 Å (the radius of a polar solvent probe) and the calculation of the surface area of these expanded-radius atoms. The Shrake and Rupley algorithm [38] involves the testing of points on an atom’s van der Waals surface for overlap with points on the van der Waals surface of neighboring atoms. Many SASA approximations have been developed including spline approximations [39] and approximations that take advantage of boolean logic and look-up tables [40]. Wodak and Janin’s statistical SASA approximation algorithm is a function of only interatomic distances that approximates each amino acid by one sphere at the center of mass [41]. Many approaches employ a lattice surrounding the protein to approximate its SASA [4244].

A pairwise-decomposable method of SASA approximation is desirable as it can then be employed in minimization approaches, such as dead-end elimination. One SASA approximation that meets this criterion is the method of Street and Mayo, in which a scaled two-body approximation of the buried area is subtracted from the total surface area in order to approximate SASA [36]. The method of Zhang et al. improved upon the Street and Mayo method by accounting for its main shortcoming, the overlapping burial of core residues: areas were calculated in the presence of generic side chains rather than the backbone alone, which reduced the error of the area calculations [45]. One of the more efficient non-pairwise-decomposable algorithms is the maximal speed molecular surfaces (MSMS) algorithm, which fits spherical and toroidal patches onto the surfaces of atoms based on which points on the atom are accessible to a spherical probe that approximates a solvent molecule [46].

Several approximations for burial are based upon “neighborhood densities” [47], a weighted sum of neighboring atoms, and take advantage of the idea that neighborhood density is inversely related to SASA. The method used to approximate burial in an early version of Rosetta, a state-of-the-art protein structure prediction algorithm, uses the number of Cβ atoms within 10 Å of the Cβ of the amino acid of interest [18]. Since that time, this has been modified slightly so that centroids, pseudo-atoms located at the side chain’s center of mass, are used rather than Cβ atoms [48]. Other work has examined various burial approximations and found that the number of Cβ atoms within 14 Å of the Cβ of the amino acid of interest is most conserved in structural alignments, most predictable from amino acid sequence, and provides the greatest utility in fold recognition and sequence alignment [49]. A shortcoming of such burial approximations is their inability to take into account the spatial orientation of neighboring atoms (illustrated in Fig. 3). A method that calculates burial by examining neighborhood densities in four different tetrahedral directions attempts to address this shortcoming [50], as does the “neighbor vector” algorithm introduced in this manuscript.

Fig. 3
This figure depicts a shortcoming of the neighbor count algorithm. Lines are drawn from the amino acid of interest in this case to all neighboring (as defined by the neighbor count algorithm) amino acids. Two scenarios are shown for which the neighbor ...

As is evidenced by the wealth of related literature, this area has been researched extensively and many SASA approximations have been developed. While many of the discussed methods are very accurate, they are also time-consuming and not tractable for use in protein structure prediction, where thousands of protein models need to be evaluated. Additionally, the majority of these methods work on full-atom protein models whereas reduced amino acid representations are often used in early stages of protein structure prediction. Finally, many of these methods return the SASA of the protein model as a whole rather than the SASA of each amino acid (known as rSASA or per-residue SASA), which is necessary in order to take advantage of the knowledge-based potentials.

In this manuscript, the authors seek to build upon several of these approaches and refine them specifically for use in protein structure prediction. While this manuscript focuses on the benefits of a rapid SASA approximation method for protein folding, there are additional areas that would benefit from such a method, such as protein binding and design. Specifically, hydrophobic surface patches, which are important in molecular recognition processes, constitute up to 60% of the SASA of a protein, and methods for their rapid identification based on SASA calculation have been developed [51]. The rSASA calculated by the MSMS algorithm is used as reference standard throughout the present work.

Four SASA approximation algorithms are presented that reflect the trade-off between accuracy and speed

This manuscript systematically introduces and compares a series of rSASA approximations of increasing complexity. KBPs describing the environment free energy of an amino acid as a function of these SASA approximations have been derived. All approximations are examined in terms of both runtime and the ability to discriminate native-like from nonnative-like protein models obtained in structure prediction applications, in order to fine-tune the balance between algorithm speed and accuracy.

Materials and methods

Exposure algorithms of increasing complexity

Neighbor count (NC) The central idea behind the neighbor count algorithm is that the number of neighboring amino acids is inversely proportional to the exposure of an amino acid. The definition of a “neighbor” is expanded in this work by assigning a weight between 0.0 and 1.0 to all amino acids in the protein model based on their proximity to the amino acid of interest. A lower boundary and an upper boundary are chosen such that all amino acids whose Cβ lies at a distance less than or equal to the lower boundary are assigned a neighbor weight of 1.0 (i.e., they are counted as complete neighbors), amino acids whose Cβ lies at a distance greater than the upper boundary are assigned a neighbor weight of 0.0 (i.e., they are not considered neighbors at all), and amino acids whose Cβ lies at a distance between the lower and upper bounds are assigned a weight between 0.0 and 1.0 (see Fig. 1). For glycine, a pseudo-Cβ atom is introduced at the geometric position where an actual Cβ would sit. This expansion of the definition of “neighbor” allows amino acids that are spatially close to the amino acid of interest to have a greater weight in determining the neighbor count while keeping the potential continuously differentiable, a characteristic essential for gradient-based minimization.

w(d) = 1.0, if d ≤ lower bound
w(d) = 0.5 [ cos( π (d − lower bound) / (upper bound − lower bound) ) + 1 ], if lower bound < d < upper bound
w(d) = 0.0, if d ≥ upper bound

where d is the distance between the Cβ of the amino acid of interest and the Cβ of the potential neighbor. The half-cosine interpolation between the bounds keeps the weight continuously differentiable.
Fig. 1
This figure depicts ways in which a “neighboring” amino acid can be defined. a) Previous work uses a step function with a hard boundary to determine which amino acids are neighbors. Any amino acids lying within that boundary are considered ...

The neighbor count value for each amino acid is generated by adding the neighbor weight values of all other amino acids in the protein model as shown in the equation below and Fig. 2.

NC(i) = Σ_{j ≠ i} w(d_ij)

where d_ij is the distance between the Cβ atoms of amino acids i and j.
Fig. 2
This figure depicts the neighbor count algorithm. The inner and outer gray rings represent the lower and upper bounds respectively. The small circles represent the Cβ atoms of amino acids. The black circle represents the amino acid of interest. Amino acids ...

A shortcoming of using the number of neighboring amino acids as a measure of burial is that this approach disregards the spatial distribution of its neighbors. Figure 3 shows two examples that represent different exposure scenarios, yet return the same neighbor count value.
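The following minimal Python sketch illustrates the neighbor weight and neighbor count calculations described above, assuming the Cβ coordinates of an N-residue model are available as an N × 3 NumPy array; the lower and upper bounds shown are illustrative placeholders (the optimized values appear in Table 1), and the half-cosine interpolation is one smooth choice consistent with the description.

import numpy as np

def neighbor_weight(distance, lower=4.0, upper=11.4):
    """Smooth neighbor weight: 1.0 within the lower bound, 0.0 beyond the
    upper bound, and a half-cosine interpolation in between. The bounds
    here are illustrative placeholders, not the optimized values."""
    if distance <= lower:
        return 1.0
    if distance >= upper:
        return 0.0
    return 0.5 * (np.cos(np.pi * (distance - lower) / (upper - lower)) + 1.0)

def neighbor_count(cb_coords, i, lower=4.0, upper=11.4):
    """Neighbor count of residue i: the sum of neighbor weights over all
    other residues, computed from C-beta coordinates (N x 3 array)."""
    distances = np.linalg.norm(cb_coords - cb_coords[i], axis=1)
    weights = np.array([neighbor_weight(d, lower, upper) for d in distances])
    weights[i] = 0.0  # a residue does not count as its own neighbor
    return float(weights.sum())

The smooth transition region, rather than a hard cutoff, is what keeps the derived potential continuously differentiable for gradient-based minimization.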

Neighbor vector (NV) The neighbor vector algorithm is an extension of the neighbor count algorithm that takes into account the spatial orientation of neighboring amino acids.

NV(i) = | Σ_{j ≠ i} w(d_ij) u_ij | / Σ_{j ≠ i} w(d_ij)

where u_ij is the unit vector pointing from the Cβ of amino acid i to the Cβ of amino acid j.

The neighbor vector is a vector associated with each amino acid whose length can range between 0.0 and 1.0. A neighbor vector of length ≈1.0 implies high exposure whereas a neighbor vector of length ≈0.0 implies low exposure (i.e., burial). This is shown graphically in Fig. 4. Note that the neighbor vector is still a pair-wise decomposable measure of exposure.

Fig. 4
This figure depicts the neighbor vector algorithm. The vectors drawn to the Cβ of neighboring amino acids are shown in black and the vector sum is shown in heavyweight black. a) When summed, the vectors essentially cancel out yielding a vector of zero length ...
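Continuing the sketch above (it reuses neighbor_weight), a minimal neighbor vector implementation follows; normalizing the weighted sum of unit vectors by the total weight is one way, consistent with the text, to keep the result between 0.0 and 1.0.

import numpy as np

def neighbor_vector(cb_coords, i, lower=4.0, upper=11.4):
    """Length of the weight-normalized sum of unit vectors from residue i's
    C-beta to each neighbor's C-beta: ~0.0 when neighbors surround the residue
    evenly (buried), ~1.0 when they all lie to one side (exposed)."""
    diffs = cb_coords - cb_coords[i]
    distances = np.linalg.norm(diffs, axis=1)
    weights = np.array([neighbor_weight(d, lower, upper) for d in distances])
    weights[i] = 0.0
    units = np.zeros_like(diffs)
    nonzero = distances > 0.0
    units[nonzero] = diffs[nonzero] / distances[nonzero][:, None]
    weighted_sum = (weights[:, None] * units).sum(axis=0)
    total_weight = weights.sum()
    # a residue with no neighbors at all is treated as maximally exposed
    return float(np.linalg.norm(weighted_sum) / total_weight) if total_weight > 0.0 else 1.0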

Artificial neural network (ANN) As input for an ANN that approximates SASA, an additional term not used in the previous measures is introduced: the dot product of the Cα→Cβ vector with the neighbor vector (NV·CαCβ). Recall that the side chain atoms extend from the Cβ atom. Therefore, this dot product term provides information about the orientation of the side chain of the amino acid of interest with respect to neighboring amino acids. If the Cα→Cβ vector points in the same direction as the neighbor vector, the angle between these vectors will be small and the dot product will be ≈+1.0. If the Cα→Cβ vector points in the opposite direction to the neighbor vector, the angle between these vectors will be large and the dot product will be ≈−1.0 (see Fig. 5). The neighbor count, neighbor vector, and NV·CαCβ are the inputs to the ANN.

Fig. 5
A β-strand is shown where the Cα atoms and Cβ atoms of the strand are represented by black and white circles respectively. The Cβ atoms of neighboring amino acids are represented by white circles. The neighbor vectors are shown as dashed lines. The Cα→Cβ vectors ...

The ANN contains a single hidden layer with three neurons. The ANN was trained using a feed-forward algorithm with back-propagation over 2670 steps (5000 steps were allowed, but the training terminated early due to convergence). The data was split into a training set (80% of the data), a monitor set (10% of the data), and an independent set (10% of the data). The learning rate η was 0.01 and the momentum α was 0.5.
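As a sketch of how these three inputs could feed a network of the described shape, the snippet below uses scikit-learn's MLPRegressor as a stand-in for the authors' ANN implementation; the dot-product feature and the hyper-parameters mirror the description above, but the framework, data handling, and function names are assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

def nv_cacb_dot(ca, cb, neighbor_vector_sum):
    """Cosine of the angle between the C-alpha -> C-beta vector and the
    neighbor vector direction: ~+1.0 when they point the same way, ~-1.0
    when they point in opposite directions."""
    side_chain = (cb - ca) / np.linalg.norm(cb - ca)
    nv_dir = neighbor_vector_sum / np.linalg.norm(neighbor_vector_sum)
    return float(np.dot(side_chain, nv_dir))

def train_exposure_ann(features, exposures):
    """features: one row per residue [neighbor_count, neighbor_vector, nv_cacb_dot];
    exposures: reference relative rSASA values (0.0-1.0) from the MSMS standard."""
    ann = MLPRegressor(
        hidden_layer_sizes=(3,),   # single hidden layer with three neurons
        solver="sgd",              # feed-forward network trained by back-propagation
        learning_rate_init=0.01,   # learning rate eta = 0.01, as in the text
        momentum=0.5,              # momentum alpha = 0.5, as in the text
        max_iter=5000,             # allowed steps; training may converge earlier
        early_stopping=True,       # hold out a monitor set to detect convergence
        validation_fraction=0.1,
    )
    ann.fit(features, exposures)
    return ann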

Overlapping spheres (OLS) The overlapping spheres algorithm is a variant of the Shrake and Rupley [38] algorithm for calculating molecular surfaces, with the exception that spheres surround amino acids rather than atoms. In this algorithm, a sphere is placed around each Cβ and points are placed on the surface of the sphere surrounding the amino acid of interest. The fraction of points on an amino acid’s sphere that do not overlap with any other sphere is used as a measure of exposure (see Fig. 6). The spheres were chosen to have a uniform size regardless of amino acid type; usage of amino acid-specific radii did not lead to a significant improvement in rSASA calculation (data not shown). While the optimal number of points placed on the sphere has been investigated [52], this parameter was not optimized here. Points were distributed uniformly every 5° along the surface of the sphere.

Fig. 6
The overlapping spheres algorithm places a sphere around each Cβ and places points on the surface of the spheres. The points that do not overlap with the spheres of any other amino acids are used as a measure of relative exposure. The Cβ atoms are colored ...
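A minimal sketch of the overlapping spheres idea is given below; the latitude-longitude point grid and the uniform sphere radius are illustrative placeholders (the optimized radius appears in Table 1), and the authors' exact point-placement scheme may differ.

import numpy as np

def sphere_points(step_deg=5.0):
    """Points on a unit sphere spaced every step_deg degrees in the polar and
    azimuthal angles (a simple latitude-longitude grid used for illustration)."""
    thetas = np.radians(np.arange(0.0, 180.0 + step_deg, step_deg))
    phis = np.radians(np.arange(0.0, 360.0, step_deg))
    points = np.array([[np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)]
                       for t in thetas for p in phis])
    return np.unique(np.round(points, 6), axis=0)  # collapse duplicate pole points

def overlapping_spheres_exposure(cb_coords, i, radius=4.75):
    """Fraction of surface points on residue i's sphere that lie outside the
    sphere of every other residue; the uniform radius is a placeholder."""
    surface = cb_coords[i] + radius * sphere_points()
    others = np.delete(cb_coords, i, axis=0)
    # a surface point is buried if it falls inside any other residue's sphere
    distances = np.linalg.norm(surface[:, None, :] - others[None, :, :], axis=2)
    buried = (distances < radius).any(axis=1)
    return float(1.0 - buried.mean())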

Establishment of rSASA reference standard

The maximal speed molecular surfaces (MSMS) [46] algorithm as implemented in the visual molecular dynamics (VMD) [53] molecular visualization package serves as the reference standard method for rSASA. Protein models with the hydrogen atoms removed are used in order to ensure a consistent representation. In order to convert this rSASA measure into a relative exposure, the rSASA for each amino acid in the protein is divided by the rSASA for that amino acid alone in space (i.e., all other amino acids in the protein were removed). This gives a relative exposure for each amino acid in the protein with a minimum exposure of 0.0 (completely buried) and a maximum exposure of 1.0 (completely exposed).
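The normalization itself is a simple ratio; a minimal sketch (with the isolated-residue area supplied by a separate MSMS run on the residue alone) is:

def relative_exposure(residue_sasa, isolated_sasa):
    """Scale an absolute per-residue SASA (in square Angstroms) onto the 0.0
    (completely buried) to 1.0 (completely exposed) scale by dividing by the
    SASA of the same residue computed alone in space. Clamping at 1.0 is a
    small numerical-safety assumption."""
    if isolated_sasa <= 0.0:
        return 0.0
    return min(residue_sasa / isolated_sasa, 1.0)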

Optimization of parameters for each approximation algorithm

In order to determine the optimal parameters for each SASA approximation, a Monte Carlo parameter optimization method is used. The parameter set that produces the output that correlates most highly with the rSASA reference standard is selected as optimal. 90% of the proteins in the representative protein database (described below) are used in parameter optimization while 10% are withheld. The correlations reported in Table 1 are based only upon the withheld 10%.

Table 1
Optimal parameters

The optimal parameters found for each exposure algorithm are shown. The parameters that maximized the correlation of exposures produced by each algorithm with exposures produced by the rSASA reference standard are selected as optimal.
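The exact Monte Carlo scheme is not specified; a generic Metropolis-style search of the kind described could look like the sketch below, where score(params) would return the correlation of an algorithm's exposures with the reference rSASA over the training proteins. The move size, temperature, and number of steps are illustrative assumptions rather than the authors' settings.

import numpy as np

def monte_carlo_optimize(score, initial_params, steps=1000, step_size=0.5,
                         temperature=0.05, seed=0):
    """Randomly perturb the parameter vector, always accepting moves that
    improve the score and occasionally accepting worse moves, and return the
    best parameters seen."""
    rng = np.random.default_rng(seed)
    params = np.asarray(initial_params, dtype=float)
    current = score(params)
    best_params, best = params.copy(), current
    for _ in range(steps):
        trial = params + rng.normal(scale=step_size, size=params.shape)
        value = score(trial)
        if value > current or rng.random() < np.exp((value - current) / temperature):
            params, current = trial, value
            if current > best:
                best_params, best = params.copy(), current
    return best_params, best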

Establishment of representative protein database for generation of KBPs

Statistics are generated for each amino acid type and each of the exposure algorithms by analysis of the representative protein database described in Table 2. This database contains high-resolution (<2.5 Å) structures with <25% homology. The complete list of proteins from the PDB was submitted to the PISCES server [54, 55] to identify proteins with low sequence similarity. The culling parameters were: sequence percentage identity ≤25%, resolution 0.0 Å–3.0 Å, R-factor 0.3, sequence length 40–10,000 amino acids; non-X-ray entries were excluded, as were Cα-only entries. The resulting database of unique structures contained 1795 soluble proteins. Information about the proteins used to create the KBPs is summarized in Table 2.

Table 2
Proteins used in KBP generation

Generation of knowledge-based environment potentials using inverse Boltzmann relation

The following equation describes how histograms are generated for each amino acid type.

propensity(aa_i, bin_j) = (m / n) × Σ_{k=1..n} 1[ exposure_k(aa_i) ∈ e_j ]

where aa_i is amino acid type i, n is the number of amino acids of type i in the database, k indexes those amino acids, j indexes a bin, e_j is the range of exposure values associated with that bin, and m is the number of bins (20 bins are used for all algorithms). Prior to multiplication by the number of bins, the values in each bin are probabilities (0 ≤ probability ≤ 1). Multiplying by the number of bins converts these probabilities to propensities (0 ≤ propensity ≤ number of bins). Propensities are then converted to energies according to the inverse Boltzmann relation discussed earlier.

The relationship between probabilities, propensities, and energies as used in the creation of KBPs is shown in Table 3. P_random is defined as 1/(number of possible exposure values). States found rarely are associated with high energy whereas states found frequently are associated with low energy.

Table 3
Relationship between probabilities, propensities, and energies

Essentially, exposure values that are seen rarely in native proteins are associated with high energy values whereas exposure values that are seen often in native proteins are associated with low energy values. A spline is used to smooth the bins into a differentiable potential. A pseudo-count of 1 is added to each bin so that exposure values that are never seen (i.e., have a count of 0) are not associated with an infinitely large energy.
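Putting the histogramming, pseudo-count, propensity conversion, and inverse Boltzmann step together, a compact sketch for one amino acid type might read as follows (the spline smoothing mentioned above is omitted, and kT is set to 1 in arbitrary units):

import numpy as np

def environment_potential(exposures, n_bins=20, pseudocount=1.0, kT=1.0):
    """Build a knowledge-based environment potential for a single amino acid
    type from its observed relative exposures (values between 0.0 and 1.0)."""
    counts, edges = np.histogram(exposures, bins=n_bins, range=(0.0, 1.0))
    counts = counts.astype(float) + pseudocount   # avoid infinite energies for empty bins
    probabilities = counts / counts.sum()         # 0 <= probability <= 1
    propensities = probabilities * n_bins         # random expectation becomes 1.0
    energies = -kT * np.log(propensities)         # inverse Boltzmann relation
    return energies, edges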

Benchmark proteins are selected such that 10% of the protein models are “native-like”

Nineteen benchmark proteins are selected for analysis of the exposure algorithms. The decoys were generated by the Rosetta folding algorithm and are a subset of the Rosetta benchmark set. For each of the benchmark proteins, multiple protein models are included in the benchmark (between 70 and 1030 depending on the availability of protein models for each benchmark protein). Rmsd100, a normalized form of rmsd [56], is used to examine the deviation of each protein model from the native conformation. Protein models having an rmsd100 value <5 Å are referred to as “native-like” whereas protein models that have an rmsd100 value ≥5 Å are referred to as “nonnative-like.” Additional values (between 4 Å and 7 Å) were also tested as a threshold for the definition of “native-like” and yielded similar results. Protein models are selected such that 10% of the decoys are “native-like” and 90% of the protein models are “nonnative-like”. This provided a “level playing field” and basis for comparison as the maximum enrichment for all benchmark proteins with this distribution of protein models is 10.0.

The protein models analyzed are a subset of the protein models available for a given benchmark protein and are randomly selected from this larger group. This random selection procedure is repeated ten times to provide standard deviations of the evaluation criteria (see below). Additionally, proteins of various sizes, secondary structure compositions, and CATH classifications are chosen to ensure a representative benchmark set (see Table 4).

Table 4
Summary of benchmark proteins used in KBP analysis

Table 4 provides information about the benchmark proteins used for analysis of the KBPs based upon each exposure algorithm. Proteins with multiple types of secondary structural elements and of various sizes are included.

Average rSASA values are used to convert the actual rSASA into a relative exposure for benchmark proteins

In order to facilitate comparison amongst the exposure algorithms, rSASA values computed with the VMD implementation of the MSMS algorithm are converted from actual areas in Å2 to relative exposures (on a scale of 0.0 (completely buried) to 1.0 (completely exposed)). To convert areas into relative exposures, the rSASA is divided by the average rSASA for that amino acid type alone in space. The average values for each amino acid type alone in space are shown in Table 5 along with the standard deviations and the number of amino acids (n) used in determining the average.

Table 5
Average SASA values for amino acids

Evaluation metrics: enrichment, receiver operating characteristic (ROC) curves, and Z-scores are measures of the KBP’s discriminatory power

In order to evaluate the KBPs based upon each exposure algorithm, the ability of each KBP to discriminate between native-like and nonnative-like models is examined. The KBP for each algorithm is used to evaluate the energy of all protein models for each benchmark protein, and the enrichment metric is used to quantify the ability of each KBP to distinguish between native-like and nonnative-like protein models.

enrichment = (number of native-like models among the 10% best-scoring models) / (0.10 × total number of native-like models)

As 10% of the protein models for each benchmark protein are native-like, the maximum enrichment possible for each KBP is 10.0 and a random enrichment is an enrichment of 1.0.
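A sketch of the enrichment calculation under this definition (the 10% lowest-energy models compared against the native-like set) is given below; the exact bookkeeping used by the authors may differ slightly.

import numpy as np

def enrichment(energies, rmsd100_values, native_cutoff=5.0, top_fraction=0.10):
    """Ratio of native-like models found among the best-scoring top_fraction of
    models to the number expected by chance; with 10% of the models native-like,
    the maximum value is 10.0 and random scoring gives ~1.0."""
    energies = np.asarray(energies, dtype=float)
    native = np.asarray(rmsd100_values, dtype=float) < native_cutoff
    n_top = max(1, int(round(top_fraction * energies.size)))
    best_scoring = np.argsort(energies)[:n_top]   # lowest energy = best scoring
    observed = native[best_scoring].sum()
    expected = top_fraction * native.sum()
    return float(observed / expected) if expected > 0 else 0.0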

ROC curves display the true positive rate versus the false positive rate for a binary classification system. In this case, the ability of the KBPs based on the approximation algorithms to correctly classify native-like and nonnative-like protein models is examined. Additionally, the area under the ROC curve (AUC) is determined from these ROC curves. An AUC of 1.0 indicates perfect classification whereas an AUC of 0.5 is representative of a random measure.

Z-scores are calculated for each KBP. A random KBP is expected to achieve a z-score of 0.0. A more negative z-score indicates greater power of the KBP in distinguishing between native-like and nonnative-like protein models.

Z = ( ⟨E⟩_native-like − ⟨E⟩_nonnative-like ) / σ_nonnative-like

where ⟨E⟩ is the mean KBP energy of the indicated set of models and σ_nonnative-like is the standard deviation of the energies of the nonnative-like models.

Results

Increasing algorithm complexity corresponds to a more accurate rSASA approximation yet slower run times

In order to determine how well each exposure algorithm approximates rSASA, the correlation of exposure values produced by each algorithm to the exposure values given by the reference standard rSASA algorithm is examined. Results displaying the correlation with the reference standard rSASA and run times for each algorithm are shown in Table 6. The rSASA reference standard method takes several orders of magnitude longer (3.9e-2 seconds per amino acid for the rSASA reference standard compared to <6e-5 seconds per amino acid for NC, NV, and ANN and 3e-3 for OLS) than any of the approximation methods, indicating its infeasibility for use in rapid protein structure prediction. As expected, as the algorithm complexity increases, the runtime increases as well. Of note, the OLS algorithm is two orders of magnitude slower than the other approximation algorithms but still about 12 times faster than the rSASA reference standard algorithm. The correlation of the neighbor count algorithm is negative because the number of neighbors is inversely proportional to the rSASA. As algorithm complexity increases, the correlation with the rSASA reference standard also increases; the ANN approximation algorithm correlates most highly (r = 0.89) with the rSASA reference standard.

Table 6
Exposure algorithm performance

Visual inspection of KBPs confirm expected trends

A visual inspection of the KBPs ensures that the potentials agree with expectations (see Fig. 7). For example, hydrophobic amino acids in solution are expected to prefer burial, and this is in fact what is seen: hydrophobic amino acids such as valine (V), methionine (M), and phenylalanine (F) prefer a large number of neighbors, a small neighbor vector magnitude, and small relative exposures. Likewise, hydrophilic amino acids are expected to prefer exposure in solution, and this is also the case: the hydrophilic amino acids lysine (K), asparagine (N), and glutamine (Q) prefer low neighbor counts, a large neighbor vector magnitude, and large relative exposures.

Fig. 7
The knowledge-based potentials based upon each exposure algorithm are shown and colored by value where white represents low values and dark gray represents high values. A visual inspection of the KBPs confirms that the energies shown in the KBPs agree ...

Evaluation metrics indicate that the neighbor count algorithm does not perform as well as other approximation algorithms

As evidenced by the enrichment values in Fig. 8, the rSASA reference standard and the neighbor vector, artificial neural network, and overlapping spheres algorithms perform similarly (enrichment = ~3.0) and all outperform the NC method (enrichment <2.5). While no single method clearly dominates the others, some trends can be seen (Fig. 9). In several cases (e.g., 1bq9, 1iib, 1enh), the neighbor count algorithm does not perform as well as the other algorithms. While the rSASA reference standard algorithm often provides the greatest enrichment (e.g., 1bq9, 1iib, 1a19), there are several cases in which the neighbor vector algorithm provides better results (e.g., 1ail, 1b3a, 1e6i).

Fig. 8
The average enrichment, z-score, and area under the ROC curve (AUC) is shown for each exposure algorithm over all benchmark proteins. The z-scores are in light gray, the AUC values are in medium gray, and the enrichment values are in dark gray. The neighbor ...
Fig. 9
The enrichment is shown for each algorithm over all benchmark proteins. There are some proteins for which none of the exposure algorithms provided an enrichment (for example 1scj) while there are some benchmark proteins for which many of the exposure ...

Additionally, the area under the ROC curve (AUC) is examined for the KBPs over the benchmark proteins (see Fig. 10). Again, the AUC values vary widely across benchmark proteins. However, the neighbor count algorithm (AUC = 0.7) lags a bit behind the neighbor vector, artificial neural network, overlapping spheres, and reference standard rSASA algorithms (AUCs ≥0.75).

Fig. 10
The area under the ROC curve (AUC) is shown for each exposure algorithm over all benchmark proteins. The AUC varies widely over the benchmark proteins. There are some proteins for which all algorithms perform very well (for example, 1c9o) while there ...

The z-scores also support the trends shown by the other evaluation metrics. The neighbor count has the least negative z-score (−0.61) whereas the artificial neural network has the most negative z-score (−0.83) with neighbor vector coming in a close second (−0.80).

A detailed analysis of the benchmark protein 1enh

The benchmark protein 1enh is an example where the potentials are able to distinguish between native-like and nonnative-like models to an extent that corresponds to the complexity of each algorithm (i.e., the NC algorithm is the least effective and the OLS algorithm is the most effective). This is indicated by the increasing area under the ROC curve (see Fig. 11a) moving from NC to NV to ANN to OLS. This can also be seen when the rmsd100 is plotted against the energy score assigned to each protein model (see Fig. 11b-f). As the algorithm complexity increases, the KBP more effectively identifies native-like protein models. Of note, the OLS KBP yields a higher enrichment than the rSASA reference standard. This suggests that environment free energy KBPs based on a SASA approximation alone do not capture the complete picture of the environment free energy and that additional factors should be taken into account to describe it more fully. Further examination is necessary to explore this question.

Fig. 11
a) The ROC curve for 1enh. As the algorithm complexity increases, the area under the ROC curve increases. In this case, the OLS algorithm is able to distinguish between native-like and nonnative-like models more effectively than the reference standard ...

For a specific example, consider ALA5 of a 1enh protein model (Fig. 12). The rSASA method determines that the relative exposure of ALA5 is 0.375, ranking it the 13th most exposed of the 54 amino acids in the protein model. The NC algorithm counts 6.495 neighbors for ALA5 and ranks it as the 21st most exposed amino acid in the protein model. However, the NV algorithm discerns that the majority of ALA5’s neighbors lie on one side of the amino acid, leaving the other side relatively exposed; it assigns ALA5 a vector of magnitude 0.568 and ranks it as the 19th most exposed amino acid in the model, closer to its true rank. The ANN predicts a relative exposure of 0.348 for ALA5 and ranks it as the 18th most exposed amino acid in the protein model, again closer to its true rank than the ranks produced by the NC and NV algorithms. The OLS algorithm returns a relative exposure of 0.372 for ALA5 and ranks it as the 13th most exposed amino acid in the protein model, which is in fact its correct rank. The exposure value given for ALA5 of a 1enh protein model, as well as the rank of ALA5 amongst the 54 amino acids in the protein model, is shown in Table 7.

Fig. 12
The backbone and Cβ atoms are shown in gray. The ALA5 Cβ is shown in black. The actual relative rSASA of ALA5 as determined by the reference standard method is 0.375, making it the 13th most exposed amino acid in the protein model. Lines are drawn from the ...
Table 7
Exposure algorithm performance for ALA5

Discussion

Four algorithms for determining the relative exposure on a reduced protein model are presented. The complexity of these algorithms varies and as expected, the simplest algorithms are the most efficient in terms of runtime but less effective in approximating the reference standard rSASA method and distinguishing between native-like and nonnative-like protein models. Also as expected, the more complex algorithms, such as the artificial neural network and overlapping spheres, achieve more accurate exposure measures and are more effectively able to distinguish between native-like and nonnative-like protein models.

Neighbor count is the simplest measure of exposure and achieves the lowest average enrichment. Also as expected, as the algorithms increase in complexity, they are able to achieve a higher enrichment. The ANN is particularly effective at this task and achieves enrichments on reduced protein models that are nearly as high as the enrichments achieved by the rSASA on full-atom protein models.

As the Rosetta models used for benchmarking were generated using the Rosetta environment score, most of these models already bury apolar amino acids, expose polar amino acids, and thus largely satisfy the environment architecture generally expected within proteins. Hence the enrichment test performed in this work is a stringent one that measures improvement over the Rosetta energy function, which explains the rather moderate enrichment values. Substantially higher enrichments can be obtained if models are created without the use of the environment score.

As is seen in Fig. 9 and indicated by the large standard deviations shown in Fig. 8, the degree to which the algorithms are able to recognize native-like protein models varies widely. Consider the high enrichments produced for the protein 1e6i: in this case, the algorithms are fairly effective in distinguishing between native-like and nonnative-like protein models. However, there are proteins that are “hard,” for example 1scj, for which all algorithms produce an enrichment of 0.0 (worse than random).

The maximum possible enrichment of 10.0 is not achieved by any algorithm, including the rSASA reference standard. This indicates that environment free energy approximations based on SASA contain a limited amount of information and that additional energy terms should be considered in order to achieve additional discriminatory power.

The large standard deviations of the enrichment values (shown in Fig. 8) indicate that further improvements to these algorithms are possible. The fact that the reference standard rSASA method does not always perform best in terms of ability to distinguish between native-like and nonnative-like protein models is unexpected (for example, consider the benchmark protein 1tig). The assumption that environment free energy is directly proportional to SASA should be investigated further to determine if this is strictly the case or if there may be other crucial contributions to environment free energy as well.

Future work includes an in-depth analysis of the various histogram sizes used in the creation of the KBPs, as well as optimizing parameters such that optimality is defined by the greatest enrichment for protein structure prediction rather than by correlation with the reference standard rSASA.

Conclusions

Four exposure algorithms of varying complexity are presented that efficiently produce exposures on reduced protein models that closely correlate with the exposure measures given by the rSASA reference standard on a full-atom model. These exposure measures can be used to derive KBPs that provide discriminatory power in distinguishing between native-like and nonnative-like models. This measure of environment free energy is an important energy term but is best utilized as part of a more comprehensive energy evaluation function. For use in computational protein structure prediction, the neighbor vector algorithm provides the best balance of accurate yet very rapid exposure measures. The assumption that environment free energy is directly proportional to SASA will be investigated further.

Acknowledgments

The authors would like to thank all members of the Meiler Lab for helpful discussions and support. This work was supported by grant R01-GM080403 from the National Institute of General Medical Sciences and training grant 2-T15 LM07450-06 from the National Library of Medicine.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Abbreviations

ANN
artificial neural network
AUC
Area under the receiver operating characteristic curve
CASP
Critical assessment of structure prediction
KBP
Knowledge-based potential
MSMS
Maximal speed molecular surfaces
NC
Neighbor count
NV
Neighbor vector
OLS
Overlapping spheres
PDB
Protein data bank
ROC
Receiver operating characteristic
RMSD
Root mean square deviation
rSASA
Per-residue solvent-accessible surface area
SASA
Solvent-accessible surface area
VMD
Visual molecular dynamics

References

1. Baker D, Sali A. Protein structure prediction and structural genomics. Science. 2001;294:93–96. doi: 10.1126/science.1065659.
2. Fang Y, Frutos AG, Lahiri J. Membrane protein microarrays. J Am Chem Soc. 2002;124(11):2394–2395. doi: 10.1021/ja017346+.
3. Wiener MC. A pedestrian guide to membrane protein crystallization. Methods. 2004;34(3):364–372. doi: 10.1016/j.ymeth.2004.03.025.
4. Alexander N, et al. De novo high-resolution protein structure determination from sparse spin-labeling EPR data. Structure. 2008;16(2):181–195. doi: 10.1016/j.str.2007.11.015.
5. Jiang W, et al. Bridging the information gap: computational tools for intermediate resolution structure interpretation. J Mol Biol. 2001;308(5):1033–1044. doi: 10.1006/jmbi.2001.4633.
6. Baker D, Sali A. Protein structure prediction and structural genomics. Science. 2001;294(5540):93–96. doi: 10.1126/science.1065659.
7. Bourne PE. CASP and CAFASP experiments and their findings. Methods Biochem Anal. 2003;44:501–507.
8. Bradley P, et al. Free modeling with Rosetta in CASP6. Proteins. 2005;61(Suppl 7):128–134. doi: 10.1002/prot.20729.
9. Dill KA. Dominant forces in protein folding. Biochemistry. 1990;29(31):7133–7155. doi: 10.1021/bi00483a001.
10. Lazaridis T, Karplus M. Effective energy functions for protein structure prediction. Curr Opin Struct Biol. 2000;10(2):139–145. doi: 10.1016/S0959-440X(00)00063-4.
11. Sippl MJ. Knowledge-based potentials for proteins. Curr Opin Struct Biol. 1995;5(2):229–235. doi: 10.1016/0959-440X(95)80081-6.
12. Juffer AH, et al. Comparison of atomic solvation parametric sets: applicability and limitations in protein folding and binding. Protein Sci. 1995;4(12):2499–2509. doi: 10.1002/pro.5560041206.
13. Boas FE, Harbury PB. Potential energy functions for protein design. Curr Opin Struct Biol. 2007;17(2):199–204. doi: 10.1016/j.sbi.2007.03.006.
14. Chen CT, et al. HYPLOSP: a knowledge-based approach to protein local structure prediction. J Bioinform Comput Biol. 2006;4(6):1287–1307. doi: 10.1142/S0219720006002466.
15. Lu H, Skolnick J. Application of statistical potentials to protein structure refinement from low resolution ab initio models. Biopolymers. 2003;70(4):575–584. doi: 10.1002/bip.10537.
16. Ferrada E, Melo F. Nonbonded terms extrapolated from nonlocal knowledge-based energy functions improve error detection in near-native protein structure models. Protein Sci. 2007;16(7):1410–1421. doi: 10.1110/ps.062735907.
17. Casadio R, et al. Thinking the impossible: how to solve the protein folding problem with and without homologous structures and more. Methods Mol Biol. 2007;350:305–320.
18. Simons KT, et al. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997;268(1):209–225. doi: 10.1006/jmbi.1997.0959.
19. Audie J, Scarlata S. A novel empirical free energy function that explains and predicts protein-protein binding affinities. Biophys Chem. 2007;129(2–3):198–211. doi: 10.1016/j.bpc.2007.05.021.
20. Darnell SJ, Page D, Mitchell JC. An automated decision-tree approach to predicting protein interaction hot spots. Proteins. 2007;68(4):813–823. doi: 10.1002/prot.21474.
21. Evers A, Gohlke H, Klebe G. Ligand-supported homology modelling of protein binding-sites using knowledge-based potentials. J Mol Biol. 2003;334(2):327–346. doi: 10.1016/j.jmb.2003.09.032.
22. Roche O, Kiyama R, Brooks CL 3rd. Ligand-protein database: linking protein-ligand complex structures to binding data. J Med Chem. 2001;44(22):3592–3598. doi: 10.1021/jm000467k.
23. Gohlke H, Hendlich M, Klebe G. Knowledge-based scoring function to predict protein-ligand interactions. J Mol Biol. 2000;295(2):337–356. doi: 10.1006/jmbi.1999.3371.
24. Grzybowski BA, et al. From knowledge-based potentials to combinatorial lead design in silico. Acc Chem Res. 2002;35(5):261–269. doi: 10.1021/ar970146b.
25. Poole AM, Ranganathan R. Knowledge-based potentials in protein design. Curr Opin Struct Biol. 2006;16(4):508–513. doi: 10.1016/j.sbi.2006.06.013.
26. Isogai Y, et al. Design of lambda Cro fold: solution structure of a monomeric variant of the de novo protein. J Mol Biol. 2005;354(4):801–814. doi: 10.1016/j.jmb.2005.10.005.
27. DeBolt SE, Skolnick J. Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: atomic burial position and pairwise non-bonded interactions. Protein Eng. 1996;9(8):637–655. doi: 10.1093/protein/9.8.637.
28. Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins. 1995;23(4):566–579. doi: 10.1002/prot.340230412.
29. Domingues FS, et al. Sustained performance of knowledge-based potentials in fold recognition. Proteins. 1999;Suppl 3:112–120.
30. Koehl P, Delarue M. Polar and nonpolar atomic environments in the protein core: implications for folding and binding. Proteins. 1994;20(3):264–278. doi: 10.1002/prot.340200307.
31. Koehl P, Levitt M. Structure-based conformational preferences of amino acids. Proc Natl Acad Sci USA. 1999;96(22):12524–12529. doi: 10.1073/pnas.96.22.12524.
32. Ooi T, et al. Accessible surface areas as a measure of the thermodynamic parameters of hydration of peptides. Proc Natl Acad Sci USA. 1987;84(10):3086–3090. doi: 10.1073/pnas.84.10.3086.
33. Vizcarra CL, Mayo SL. Electrostatics in computational protein design. Curr Opin Chem Biol. 2005;9(6):622–626.
34. Pokala N, Handel TM. Energy functions for protein design I: efficient and accurate continuum electrostatics and solvation. Protein Sci. 2004;13(4):925–936. doi: 10.1110/ps.03486104.
35. Gordon DB, Marshall SA, Mayo SL. Energy functions for protein design. Curr Opin Struct Biol. 1999;9(4):509–513. doi: 10.1016/S0959-440X(99)80072-4.
36. Street AG, Mayo SL. Pairwise calculation of protein solvent-accessible surface areas. Fold Des. 1998;3(4):253–258. doi: 10.1016/S1359-0278(98)00036-4.
37. Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J Mol Biol. 1971;55(3):379–400. doi: 10.1016/0022-2836(71)90324-X.
38. Shrake A, Rupley JA. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J Mol Biol. 1973;79(2):351–371. doi: 10.1016/0022-2836(73)90011-9.
39. Colloc’h N, Mornon JP. A new tool for the qualitative and quantitative analysis of protein surfaces using B-spline and density of surface neighborhood. J Mol Graphics. 1990;8:133–140. doi: 10.1016/0263-7855(90)80053-I.
40. Le Grand SM, Merz KM Jr. Rapid approximation to molecular surface area via the use of boolean logic and look-up tables. J Comp Chem. 1992;14(3):349–352. doi: 10.1002/jcc.540140309.
41. Wodak SJ, Janin J. Analytical approximation to the accessible surface area of proteins. Proc Natl Acad Sci USA. 1980;77(4):1736–1740. doi: 10.1073/pnas.77.4.1736.
42. Pearl LH, Honegger A. Generation of molecular surfaces for graphic display. J Mol Graphics. 1983;1(1):9–12. doi: 10.1016/0263-7855(83)80048-4.
43. You T, Bashford D. An analytical algorithm for the rapid determination of the solvent-accessibility of points in a three-dimensional lattice around a solute molecule. J Comp Chem. 1994;16(6):743–757. doi: 10.1002/jcc.540160610.
44. Juffer AH, Vogel HJ. A flexible triangulation method to describe the solvent-accessible surface of biopolymers. J Comput Aided Mol Des. 1998;12(3):289–299. doi: 10.1023/A:1016089901704.
45. Zhang N, Zeng C, Wingreen NS. Fast accurate evaluation of protein solvent exposure. Proteins. 2004;57(3):565–576. doi: 10.1002/prot.20191.
46. Sanner MF, Olson AJ, Spehner JC. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers. 1996;38(3):305–320. doi: 10.1002/(SICI)1097-0282(199603)38:3<305::AID-BIP4>3.0.CO;2-Y.
47. Stouten PFW, et al. An effective solvation term based on atomic occupancies for use in protein simulations. Mol Simul. 1993;10(2–6):97–120. doi: 10.1080/08927029308022161.
48. Rohl CA, et al. Protein structure prediction using Rosetta. Methods Enzymol. 2004;383:66–93. doi: 10.1016/S0076-6879(04)83004-0.
49. Karchin R, Cline M, Karplus K. Evaluation of local structure alphabets based on residue burial. Proteins. 2004;55(3):508–518. doi: 10.1002/prot.20008.
50. Weiser J, Shenkin PS, Still WC. Approximate solvent-accessible surface areas from tetrahedrally directed neighbor densities. Biopolymers. 1999;50(4):373–380. doi: 10.1002/(SICI)1097-0282(19991005)50:4<373::AID-BIP3>3.0.CO;2-U.
51. Eisenhaber F, Argos P. Hydrophobic regions on protein surfaces: definition based on hydration shell structure and a quick method for their computation. Protein Eng. 1996;9(12):1121–1133. doi: 10.1093/protein/9.12.1121.
52. Flower DR. SERF: a program for accessible surface area calculations. J Mol Graph Model. 1997;15(4):238–244. doi: 10.1016/S1093-3263(97)00082-X.
53. Humphrey W, Dalke A, Schulten K. VMD: visual molecular dynamics. J Mol Graph. 1996;14(1):33–38. doi: 10.1016/0263-7855(96)00018-5.
54. Wang GL, Dunbrack RL. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005;33:W94–W98. doi: 10.1093/nar/gki402.
55. Wang GL, Dunbrack RL. PISCES: a protein sequence culling server. Bioinformatics. 2003;19(12):1589–1591. doi: 10.1093/bioinformatics/btg224.
56. Carugo O, Pongor S. A normalized root-mean-square distance for comparing protein three-dimensional structures. Protein Sci. 2001;10(7):1470–1473. doi: 10.1110/ps.690101.
