Proc Natl Acad Sci U S A. Mar 14, 2000; 97(6): 2550–2555.
Published online Mar 7, 2000. doi:  10.1073/pnas.040573597
PMCID: PMC15966
Biophysics

Environment-dependent residue contact energies for proteins

Abstract

We examine the interactions between amino acid residues in the context of their secondary structural environments (helix, strand, and coil) in proteins. Effective contact energies for an expanded 60-residue alphabet (20 aa × three secondary structural states) are estimated from the residue–residue contacts observed in known protein structures. Similar to the prototypical contact energies for 20 aa, the newly derived energy parameters reflect mainly the hydrophobic interactions; however, the relative strength of such interactions shows a strong dependence on the secondary structural environment, with nonlocal interactions in β-sheet structures and α-helical structures dominating the energy table. Environment-dependent residue contact energies outperform existing residue pair potentials in both threading and three-dimensional contact prediction tests and should be generally applicable to protein structure prediction.

The main obstacle to protein structure prediction is the immense number of possible conformations accessible to a polypeptide chain, together with the complexity and variety of interactions involved in the folding process. This obstacle makes it a formidable task to carry out protein folding simulations using an all-atom representation of the polypeptide. Coarse-grained protein models, which treat amino acid residues as united interaction sites, offer a more practical approach to tackling the protein folding problem (1). Because such models omit many detailed features of atomic interactions, new energy parameters suitable for representing interactions at a low level of resolution are required. These parameters often are derived empirically from the analysis of experimentally observed residue contact preferences in sets of known structures (2). Whether such knowledge-based potentials correctly reflect the actual physical forces stabilizing the native structures of proteins remains a subject of debate (3–6). Nevertheless, structure-derived potentials have contributed substantially to current theoretical studies of protein folding (7).

Over the years, many specifically designed residue pair potentials have been proposed (2, 7, 8), most of which use the 20-aa alphabet and assume a constant energy contribution for a given residue pair regardless of its local structural environment. However, it is known that the folded conformation of a protein is stabilized by a multitude of weak noncovalent interactions and the contribution of each of these interactions depends on its context within the folded structure (9–11). To derive better, more specific potential energy functions, one has to take into account the influence of varied structural circumstances on the specificity of inter-residue interactions. In this study, we examine pairwise amino acid interactions in the context of secondary structural environments (helix, strand, and coil) and report a set of residue contact energies for an expanded 60-residue alphabet (20 aa × three secondary structural states).

The discriminatory capability of the environment-dependent contact energies (ERCE) was tested first in threading experiments where the secondary structural states of the template protein structures are experimentally determined. Though only the simple nongapped threading protocol was used, the improvement over existing residue pair potentials was already evident, as ERCE correctly recognized the native structures for more than 97% of the test proteins. The applicability of ERCE is greatly extended when combined with secondary structure prediction. The new generation of algorithms for secondary structure prediction benefits from using evolutionary information provided by multiple sequence alignment (12–14). Recently developed structure prediction methods, including both ab initio methods (15–17) and fold recognition-based methods (18–23), all used predicted secondary structures and have achieved an important degree of success. We compared ERCE with several existing residue pair potentials in predicting residue contacts in the native protein structure from the amino acid sequence. A test based on a large number of distinct protein domains showed that the use of ERCE in combination with predicted secondary structures led to a significant improvement in the accuracy of contact prediction. This result has broader implications for protein structure prediction, because an energy function that favors conformations with a higher percentage of native contacts has a better chance of guiding global optimization toward the native folded state (24–26).

Methods

Protein Structure Set.

A set of representative protein domain structures was extracted from the Protein Data Bank (27) by referring to the scop database (version 1.37) (28). Specifically, we started with the <40% identity set built by the authors of scop and then performed additional sorting. First, we removed composite domains and domains with fewer than 50 residues or more than 300 residues. A single member then was selected from each scop family in the four structural classes, all-α, all-β, α/β, and α+β; wherever applicable, structures that had been determined to the highest crystallographic resolution were chosen. Domains with only Cα-traces or containing large sequence gaps were excluded. The structures thus obtained were inspected at the scop superfamily level. If a superfamily had multiple representatives, those structures with resolution worse than 2.30 Å were deleted. Domains that do not form a compact globular structure (e.g., long α-helical coiled-coils) also were removed. The remaining protein domains were examined by pairwise sequence alignment; when the sequence identity between a pair was higher than 25%, only one structure, typically the one with higher resolution, was kept. The final data set contains 407 protein domains; of those, 80 are all-α, 105 are all-β, 91 are α/β, and 131 are α+β (the complete list is available from the authors on request).

The Definition of Secondary Structural States.

The experimentally determined (real) secondary structural states (helix, strand, and coil) of residues in these structures were extracted from the dssp database (29). A simple mapping scheme was used where only H states in dssp were mapped to helix, E states were mapped to strand, and all other states were mapped to coil. The predicted secondary structural states were obtained by running the psi-pred program (14). This program uses the position-specific scoring matrices generated by psi-blast (30) as input. To prepare these matrices, we performed a psi-blast search for each of the 407 protein domains against a nonredundant sequence database, NRDB90 (31). The default E-value cutoff (0.0001) was used, and the maximum number of iterations was set to seven. The overall accuracy of secondary structure prediction for the 407 protein domains is 79.5%.
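The three-state mapping and the resulting 60-letter alphabet can be illustrated with a minimal Python sketch. The function names, the ordering of the alphabet, and the per-residue accuracy helper are assumptions made for illustration; they are not taken from the dssp or psi-pred software.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
STATES = "HEC"                        # helix, strand, coil


def dssp_to_three_state(dssp_code: str) -> str:
    """Map a one-letter dssp code to helix (H), strand (E), or coil (C)."""
    if dssp_code == "H":
        return "H"
    if dssp_code == "E":
        return "E"
    return "C"  # every other dssp state (G, I, B, T, S, blank) becomes coil


def residue_type_index(aa: str, dssp_code: str) -> int:
    """Index (0..59) of a residue in an assumed 60-letter alphabet ordering."""
    state = dssp_to_three_state(dssp_code)
    return AMINO_ACIDS.index(aa) * len(STATES) + STATES.index(state)


def three_state_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state matches the observed one."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("state strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)
```

Under this ordering, alanine in a helix is type 0 and tyrosine in a coil is type 59; any other consistent ordering would serve equally well.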

Extracting Contact Energies from Known Protein Structures.

The combination of 20 aa and three secondary structural states gives rise to 60 residue types. We applied the Miyazawa–Jernigan procedure (MJ) (32) with a different definition of the reference state (33) to derive contact energies for this expanded residue alphabet from the 407 protein domain structures. Specifically, amino acid residues were represented by the centroids of their side chains, and two residues were considered to be in contact if the distance between their centroids fell within RC = 6.5 Å. The numbers of contacts formed between two residues, i and j, and between them and the solvent molecules (represented by 0) were related to the contact energy by a hypothetical chemical reaction

\[
i\text{-}0 + j\text{-}0 \rightleftharpoons i\text{-}j + 0\text{-}0 \tag{1}
\]

The effective contact energy (eij) is defined as the negative logarithm of the equilibrium constant of the reaction (32)

\[
e_{ij} = -\ln \frac{\bar{n}_{ij}\,\bar{n}_{00}}{\bar{n}_{i0}\,\bar{n}_{j0}} \tag{2}
\]

where n̄ij, 2n̄i0, 2n̄j0, and n̄00 are the number of residue i-residue j contacts, the number of residue i-solvent contacts, the number of residue j-solvent contacts, and the number of solvent-solvent contacts, respectively. In practice, eij was estimated by

\[
e_{ij} = -\ln \frac{(N_{ij}/C_{ij})\,(N_{00}/C_{00})}{(N_{i0}/C_{i0})\,(N_{j0}/C_{j0})} \tag{3}
\]

to minimize the bias from amino acid compositional heterogeneity and polypeptide chain connectivity, where Nij, Ni0, Nj0, and N00 are the contact numbers observed in known structures, and Cij, Ci0, Cj0, and C00 are the corresponding quantities expected in a reference state.
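As a minimal sketch of this estimate, the Python function below assumes that the estimator keeps the quasi-chemical ratio form of the equilibrium constant, with each observed contact number divided by its reference-state expectation. The function name and argument names are illustrative, and the numbers in the usage note are invented for the example rather than taken from the data set.

```python
import math


def contact_energy(n_ij, n_i0, n_j0, n_00, c_ij, c_i0, c_j0, c_00):
    """Effective contact energy (in RT units) from observed and expected counts.

    Assumed form: e_ij = -ln[(N_ij/C_ij)(N_00/C_00) / ((N_i0/C_i0)(N_j0/C_j0))],
    where N are observed contact numbers and C are reference-state expectations.
    """
    numerator = (n_ij / c_ij) * (n_00 / c_00)
    denominator = (n_i0 / c_i0) * (n_j0 / c_j0)
    return -math.log(numerator / denominator)
```

For instance, if a residue pair is observed in contact twice as often as the reference state expects while all solvent terms match their expectations, `contact_energy(20, 10, 10, 10, 10, 10, 10, 10)` gives -ln 2, i.e. a favorable energy of about -0.69 RT.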

The calculations of Nij = Σpnij;p and Ni0 = Σpni0;p require the knowledge of the contact numbers nij;p and ni0;p in individual proteins (represented by p). Whereas nij;p was counted directly from the structure, ni0;p was derived from

[Equation 4]

where qi is the precalculated coordination number of residue i (Table 1) and ni;p is the number of residue i in protein p.
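The direct count nij;p relies on the contact definition given earlier in this section (side-chain centroids within RC = 6.5 Å). The following Python sketch illustrates that count only; it assumes the centroid coordinates and the 60-type labels have already been computed, and labels such as "LH" (leucine in a helix) are a hypothetical encoding. The text does not state whether chain neighbors are excluded, so the sketch counts every residue pair. The solvent contact numbers, which depend on Eq. 4, are not reproduced here.

```python
import math
from collections import Counter

CONTACT_CUTOFF = 6.5  # RC in angstroms, as given in the text


def count_contacts(centroids, types, cutoff=CONTACT_CUTOFF):
    """Count residue-residue contacts per unordered pair of residue types.

    centroids: list of (x, y, z) side-chain centroid coordinates
    types:     list of residue-type labels, one per residue (e.g. "LH")
    """
    counts = Counter()
    for a in range(len(centroids)):
        for b in range(a + 1, len(centroids)):
            if math.dist(centroids[a], centroids[b]) <= cutoff:
                # Sort the labels so ("LH", "VE") and ("VE", "LH") share a key.
                counts[tuple(sorted((types[a], types[b])))] += 1
    return counts
```

Summing the returned counters over all proteins in the data set would give Nij for each pair of types.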

Table 1
Coordination numbers of amino acids in different secondary structural environments for RC = 6.5 Å

We assume that the reference state of a protein exhibits the same compactness as the native folded state but with randomly arranged residue-residue and residue-solvent contacts (for details see ref. 33). Under this assumption, we have

[Equation 5]

[Equation 6]

[Equation 7]

where nrr;p = ΣiΣjnij;p and nr0;p = Σini0;p.

Testing the Derived Energies.

To assess whether ERCE offer better discriminatory power than the existing residue pair potentials, we first tested these potentials by using a nongapped threading protocol (34–36). The sequences of proteins with 200 or fewer residues in the data set were threaded through the structures of all proteins of the same or larger size at all possible positions. When the energy of the native structure of a protein is lower than that of any threaded model, we consider that the sequence of the protein successfully recognizes its structure.
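The nongapped threading test can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each template is represented by its three-state string and a list of contacting position pairs, residue types are formed by joining the threaded amino acid with the template's secondary structural state at that position, and pairs missing from the energy table are treated as neutral (0.0).

```python
def pair_key(x, y):
    """Order-independent key for a pair of residue-type labels."""
    return (x, y) if x <= y else (y, x)


def threading_energy(sequence, template_ss, template_contacts, offset, energy):
    """Sum contact energies for `sequence` placed on a template at `offset`."""
    total = 0.0
    n = len(sequence)
    for a, b in template_contacts:      # contacting template positions
        i, j = a - offset, b - offset   # sequence positions mounted on them
        if 0 <= i < n and 0 <= j < n:
            type_i = sequence[i] + template_ss[a]  # e.g. "LH": Leu in a helix
            type_j = sequence[j] + template_ss[b]
            total += energy.get(pair_key(type_i, type_j), 0.0)  # assumed default
    return total


def recognizes_native(sequence, native, decoy_templates, energy):
    """True if the native energy is strictly lower than every threaded model.

    native and each decoy template are (ss_string, contacts) tuples; the
    decoy list is assumed not to contain the native structure itself.
    """
    native_ss, native_contacts = native
    e_native = threading_energy(sequence, native_ss, native_contacts, 0, energy)
    for ss, contacts in decoy_templates:
        for offset in range(len(ss) - len(sequence) + 1):
            if threading_energy(sequence, ss, contacts, offset, energy) <= e_native:
                return False
    return True
```

Templates shorter than the sequence yield no offsets and are therefore skipped, which matches threading only through structures of the same or larger size.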

To assess the potential in a more realistic setting where only predicted secondary structure information can be obtained, we applied ERCE to the prediction of the three-dimensional contacts in protein structures. In particular, we concentrated on nonlocal contacts, i.e., contacts formed by residues not closely associated by the polypeptide chain. For contacts between on-chain neighbors, short-range interactions are probably more important. A sequence separation of at least four residues was required for a contact to be considered. For the purpose of comparing energy functions, we used a simple method for contact prediction where any residue pair that had energy lower than a specified cutoff was predicted to be in contact. By varying the cutoff value, the tradeoff between the completeness (or coverage) of the prediction (how many native contacts are predicted) and the accuracy of the prediction (how many predicted contacts correspond to actually observed contacts) can be examined. The actual three-dimensional contacts were identified from protein structures by using the criterion that two residues in contact must form at least four atom–atom contacts (two atoms are considered in contact if the distance between them is less than 6.0 Å). The results were essentially the same when a more (or less) stringent definition of contact was used.
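The cutoff-based prediction and its two quality measures can be sketched in Python as below. The sketch assumes that a separation of four residues means |i - j| ≥ 4, that residue types are encoded as labels such as "LE" (leucine in a strand), and that pairs absent from the energy table are neutral; these choices and all names are illustrative.

```python
def predict_contacts(types, energy, cutoff, min_separation=4):
    """Predict as contacts all nonlocal pairs with energy below `cutoff`."""
    predicted = set()
    for i in range(len(types)):
        for j in range(i + min_separation, len(types)):
            a, b = types[i], types[j]
            key = (a, b) if a <= b else (b, a)
            if energy.get(key, 0.0) < cutoff:  # missing pairs assumed neutral
                predicted.add((i, j))
    return predicted


def coverage_and_accuracy(predicted, native):
    """Coverage: fraction of native contacts predicted.
    Accuracy: fraction of predicted contacts that are native."""
    hits = len(predicted & native)
    coverage = hits / len(native) if native else 0.0
    accuracy = hits / len(predicted) if predicted else 0.0
    return coverage, accuracy
```

Lowering the cutoff keeps only the most favorable pairs, which typically raises accuracy at the expense of coverage; sweeping the cutoff traces out the tradeoff described above.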

Results

The Influence of Secondary Structural Environments on Inter-Residue Interactions.

The calculated contact energies are given in Fig. 1. The 60-by-60 ERCE parameters are separated into six groups with respect to the secondary structural environments of the interacting pair: α-α, β-β, α-β, α-C, β-C, and C-C (where α, β, and C represent helix, strand, and coil, respectively). Similar to the contact energies determined by MJ (32), the energies derived here are dominated by hydrophobic interactions. The correlation coefficients of the contact energies from different groups with each other and with those determined by MJ (32) are summarized in Table 2.

Figure 1
Secondary structural ERCE (in RT units). The energy parameters are divided into six groups, α-α, β-β, α-β, α-C, β-C, and C-C, where α, β, and C represent helix, strand, ...
Table 2
Comparisons between energy parameters

Despite the high correlations, the average values of contact energies from different groups differ significantly (Table 2). The α-α and β-β contacts are generally more favorable than other types of interactions. This result is understandable considering that the cores of most proteins consist of closely packed regular secondary structural elements where helices tend to gather into bundles and β-strands often appear in assembled β-sheets. The β-β contact energies are on average nearly one RT unit lower than the α-α contact energies. Because most of the tertiary contacts formed by β residues are cross-β-bridge contacts, this result suggests that there may be a fundamental difference between the structural and energetic principles governing β-strand register and α-helix packing. Among the energy groups, the α-β contact energies show the largest variations, and therefore, highest specificity, consistent with the unique geometric feature of α-β packing observed in protein structures (37, 38).

Threading Experiments.

There are 316 protein domains in our data set that contain 200 or fewer residues; the number of conformations generated by threading ranges from 4,000 (for a 200-residue protein) to 40,000 (for a 50-residue protein). As shown in Table 3, ERCE consistently outperformed MJ in the threading tests in all four structural categories, with an overall 6% improvement in success rate. For 88 all-β proteins and 49 α/β proteins, perfect recognition was achieved. In seven of the nine cases where ERCE failed to identify the native structures, the native structures were among the five most favorable conformations. MJ failed three times as often; in half of those cases, the native structures were ranked below the top five.

Table 3
Results of the threading experiments

Contact Prediction.

Fig. 2 shows the performance of ERCE in contact prediction for 407 proteins. Both predicted and real secondary structures were used, and the prediction based on real secondary structures in conjunction with ERCE (ERCE-Real) is noticeably better. However, a significant improvement over MJ in terms of prediction accuracy was obtained by both methods. The encouraging result achieved by ERCE in combination with predicted secondary structures (ERCE-Pred) adds a practical value to ERCE. We also have applied other published residue pair potentials for the same task; of those, the only potential that outperformed MJ was one recently reported by Skolnick and coworkers (JS) (5). Interestingly, a recent update to MJ performed slightly worse than the original energy table (data not shown), although the updated energy table was derived from a much larger set of protein structures (36).

Figure 2
Contact prediction using three different potentials: MJ, JS, and ERCE. ERCE predictions based on predicted (ERCE-Pred) and real secondary structures (ERCE-Real) both are shown. The x axis indicates the fraction of experimentally determined contacts ...

More detailed comparisons of different prediction methods are shown in Table 4. The 407 test proteins were divided into three size categories and four structural classes. Because the fraction of residue pairs that form tertiary contacts in a larger protein is less than that in a smaller protein, the level of difficulty in predicting contacts rises as the size of the protein increases. This trend is clearly seen in the case of a random prediction (Table 4). Although the use of a potential energy function improves the chance of detecting native contacts, the basic trend of prediction accuracy versus protein size persists. ERCE-based predictions are consistently better than those of other residue pair potentials for all three size categories.
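The size dependence of the random baseline follows from simple counting, illustrated by the Python sketch below. It assumes that a random prediction picks uniformly among all residue pairs separated by at least four positions, so its expected accuracy equals the fraction of such pairs that are native contacts. The function name and the contact numbers in the usage note are hypothetical.

```python
def random_prediction_accuracy(n_residues, n_native_contacts, min_separation=4):
    """Expected accuracy of picking nonlocal residue pairs uniformly at random."""
    span = n_residues - min_separation
    if span <= 0:
        raise ValueError("protein too short to contain any eligible pair")
    eligible_pairs = span * (span + 1) // 2  # pairs (i, j) with j - i >= min_separation
    return n_native_contacts / eligible_pairs
```

If the number of native contacts grows roughly in proportion to chain length (say, two per residue, a made-up figure), a 100-residue protein gives 200 / 4656, or about 0.043, whereas a 200-residue protein gives 400 / 19306, or about 0.021, because the number of eligible pairs grows quadratically.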

Table 4
Comparisons of the accuracy of contact prediction by different methods

The analysis of prediction performance with respect to structural classes is more revealing. Compared with the random prediction, the use of residue pair potentials such as MJ and JS yields the maximal gain in contact prediction for all-α proteins. The added value of using ERCE is that it also improves predictions for the other three structural classes (all-β, α/β, and α+β) such that the overall improvement ratios relative to a random prediction are comparable for all four structural classes.

Discussion

Long-range interactions play an important role in determining the tertiary structures of proteins. However, computational simulations of such interactions have encountered great difficulties in the past. Coarse-grained residue pair potentials attempt to strike a balance between the accuracy of energy representation and computational expediency by harnessing the wealth of information about intramolecular interactions provided by experimentally solved structures. In this study, we extended the contact energy formulation (32) to include the influence of secondary structural environments on the specificity of residue interactions, taking advantage of the amount and quality of structural data currently available. Other ways to divide up the data to extract features usable in various categories of protein simulations also have been reported (10, 39–41).

The added value of ERCE over existing residue pair potentials was explored in both threading and three-dimensional contact prediction experiments. The merit of using contact prediction to test energy functions is that it is independent of the conformational search algorithms and can be readily applied to a large set of protein structures. Because the objectives of two-dimensional contact map prediction and three-dimensional structure prediction are essentially the same (16, 42–44), the combined power of ERCE and secondary structure prediction illustrated here should be of general interest to protein structure prediction.

It should be noted that we used the same set of 407 structures to derive and test ERCE. It has been noticed before that contact energies derived by the MJ formulation are relatively insensitive to changes in database size and content (36, 40). Because our data set contains a large number of nonredundant structures, a jackknife test produces essentially the same results (data not shown). In fact, we also used an independent set of 91 protein chains (33) to derive an ERCE table, and the parameters thereof were highly correlated with those given in Fig. 1 and achieved comparable accuracy in threading and contact prediction experiments.

Acknowledgments

We thank Inna Dubchak and Bob Jernigan for critical reading of the paper. We also thank David Jones for communicating the psi-pred program. This work was supported by grants from the Department of Energy (DE-AC03-76SF00098), the National Science Foundation (97–23352), and the National Institutes of Health (CA78406).

Abbreviations

ERCE, environment-dependent residue contact energies; MJ, Miyazawa–Jernigan procedure; JS, J. Skolnick et al. potential

Footnotes

Article published online before print: Proc. Natl. Acad. Sci. USA, 10.1073/pnas.040573597.

Article and publication date are at www.pnas.org/cgi/doi/10.1073/pnas.040573597

References

1. Levitt M. J Mol Biol. 1976;104:59–107.
2. Jernigan R L, Bahar I. Curr Opin Struct Biol. 1996;6:195–209.
3. Godzik A, Kolinski A, Skolnick J. Protein Sci. 1995;4:2107–2117.
4. Thomas P D, Dill K A. J Mol Biol. 1996;257:457–469.
5. Skolnick J, Jaroszewski L, Kolinski A, Godzik A. Protein Sci. 1997;6:676–688.
6. Zhang C. Proteins Struct Funct Genet. 1998;31:299–308.
7. Hao M-H, Scheraga H A. Curr Opin Struct Biol. 1999;9:184–188.
8. Rooman M, Gilis D. Eur J Biochem. 1998;254:135–143.
9. Minor D L, Jr, Kim P S. Nature (London). 1994;371:264–267.
10. Cootes A P, Curmi P M G, Cunningham R, Donnelly C, Torda A E. Proteins Struct Funct Genet. 1998;32:175–189.
11. Hutchinson E G, Sessions R B, Thornton J M, Woolfson D N. Protein Sci. 1998;7:2287–2300.
12. Rost B, Sander C. J Mol Biol. 1993;232:584–599.
13. Cuff J A, Barton G J. Proteins Struct Funct Genet. 1999;34:508–519.
14. Jones D T. J Mol Biol. 1999;292:195–202.
15. Skolnick J, Kolinski A, Ortiz A R. J Mol Biol. 1997;265:217–241.
16. Ortiz A R, Kolinski A, Skolnick J. Proc Natl Acad Sci USA. 1998;95:1020–1025.
17. Huang E S, Samudrala R, Ponder J W. J Mol Biol. 1999;290:267–281.
18. Russell R B, Copley R R, Barton G J. J Mol Biol. 1996;259:349–365.
19. Di Francesco V, Garnier J, Munson P J. J Mol Biol. 1997;267:446–463.
20. Rice D W, Eisenberg D. J Mol Biol. 1997;267:1026–1038.
21. Rost B, Schneider R, Sander C. J Mol Biol. 1997;270:471–480.
22. Aurora R, Rose G D. Proc Natl Acad Sci USA. 1998;95:2818–2823.
23. Grigoriev I V, Kim S-H. Proc Natl Acad Sci USA. 1999;96:14318–14323.
24. Onuchic J N, Luthey-Schulten Z, Wolynes P G. Annu Rev Phys Chem. 1997;48:545–600.
25. Dill K A, Chan H S. Nat Struct Biol. 1997;4:10–19.
26. Liwo A, Lee J, Ripoll D R, Pillardy J, Scheraga H A. Proc Natl Acad Sci USA. 1999;96:5482–5485.
27. Bernstein F C, Koetzle T F, Williams G J B, Meyer E F, Jr, Brice M D, Rodgers J R, Kennard O, Shimanouchi T, Tasumi M. J Mol Biol. 1977;112:535–542.
28. Murzin A, Brenner S E, Hubbard T, Chothia C. J Mol Biol. 1995;247:536–540.
29. Kabsch W, Sander C. Biopolymers. 1983;22:2577–2637.
30. Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Nucleic Acids Res. 1997;25:3389–3402.
31. Holm L, Sander C. Bioinformatics. 1998;14:423–429.
32. Miyazawa S, Jernigan R L. Macromolecules. 1985;18:534–552.
33. Zhang C, Vasmatzis G, Cornette J L, DeLisi C. J Mol Biol. 1997;267:707–726.
34. Hendlich M, Lackner P, Weitckus S, Floeckner H, Froschauer R, Gottsbacher K, Casari G, Sippl M J. J Mol Biol. 1990;216:167–180.
35. Kocher J P, Rooman M J, Wodak S J. J Mol Biol. 1994;235:1598–1613.
36. Miyazawa S, Jernigan R L. J Mol Biol. 1996;256:623–644.
37. Cohen F E, Sternberg M J E, Taylor W R. J Mol Biol. 1982;156:821–862.
38. Reddy B V B, Nagarajaram H A, Blundell T L. Protein Sci. 1999;8:573–586.
39. Bahar I, Jernigan R L. Folding Des. 1996;1:357–370.
40. Bahar I, Jernigan R L. J Mol Biol. 1997;266:195–214.
41. Miyazawa S, Jernigan R L. Proteins Struct Funct Genet. 1999;15:347–356.
42. Gobel U, Sander C, Schneider R, Valencia A. Proteins Struct Funct Genet. 1994;18:309–317.
43. Shindyalov I N, Kolchanov N A, Sander C. Protein Eng. 1994;7:349–358.
44. Olmea O, Rost B, Valencia A. J Mol Biol. 1999;293:1221–1239.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences