Proc Natl Acad Sci U S A. Aug 28, 2001; 98(18): 10125–10130.
Published online Aug 14, 2001. doi:  10.1073/pnas.181328398
PMCID: PMC56926
Biophysics

TOUCHSTONE: An ab initio protein structure prediction method that uses threading-based tertiary restraints

Abstract

The successful prediction of protein structure from amino acid sequence requires two features: an efficient conformational search algorithm and an energy function with a global minimum in the native state. As a step toward addressing both issues, a threading-based method of secondary and tertiary restraint prediction has been developed and applied to ab initio folding. Such restraints are derived by extracting consensus contacts and local secondary structure from at least weakly scoring structures that, in some cases, can lack any global similarity to the sequence of interest. Furthermore, to generate representative protein structures, a reduced lattice-based protein model is used with replica exchange Monte Carlo to explore conformational space. We report results on the application of this methodology, termed TOUCHSTONE, to 65 proteins whose lengths range from 39 to 146 residues. For 47 (40) proteins, a cluster centroid whose rms deviation from native is below 6.5 (5) Å is found in one of the five lowest energy centroids. The number of correctly predicted proteins increases to 50 when atomic detail is added and a knowledge-based atomic potential is combined with clustered and nonclustered structures for candidate selection. The combination of the ratio of the relative number of contacts to the protein length and the number of clusters generated by the folding algorithm is a reliable indicator of the likelihood of successful fold prediction, thereby opening the way for genome-scale ab initio folding.

The inability to predict routinely the tertiary structure of a protein from its amino acid sequence remains one of the most challenging unsolved problems in biophysics. Contemporary approaches to this problem can be divided roughly into three categories of increasing complexity: (i) homology modeling (1, 2), (ii) threading (3, 4), and (iii) ab initio folding (5–9). The first two methods use the structures of already solved proteins as templates. The third, the ab initio method, does not require that an example of the fold of the protein of interest be previously solved. In principle, such an approach is very powerful; however, significant unresolved issues remain. First, there are problems with the search algorithms used to explore the protein's conformational space (10). Second, the energy functions used to evaluate the fitness of a given conformation cannot, in general, distinguish the native structure from alternative, protein-like decoys (11). To compensate for the imperfections in the energy functions, another way of selecting representative folds is required, with clustering of the structures being a promising approach (7–9). Finally, for a folding algorithm to be practical, one has to develop criteria that allow one to estimate the likelihood that a given prediction will be successful.

In this article, we address each of these issues and present the results of applying our ab initio method to a representative 65-protein test set. To restrict the protein's conformational space, we employ the SICHO (SIde CHain Only) model (5) to represent the protein as a lattice chain connecting vertices, each vertex lying at the center of mass of a given residue's α-carbon and side chain heavy atoms. To restrict the conformational search further and to improve the correlation of energy with fold quality, we use both predicted secondary structure and tertiary contacts. Residue-based contacts are extracted from a threading protocol (3) that generates consensus contacts even when the proteins used to predict these contacts are not globally similar to the fold of the sequence of interest. Quite often, the number and accuracy of the predicted contacts are sufficient to guide the model into the neighborhood of the native fold. Another set of restraints that contains predicted distances between pairs of residues in local fragments also is used. To address the issue of fold selection, we combine the structure-clustering algorithm of Betancourt and Skolnick (12) with a knowledge-based heavy-atom pair potential selection procedure to select representative structures (13). This statistical potential is distance-dependent and is based on 167 types of residue-specific heavy atoms. Finally, to estimate the likelihood that the prediction is successful, we show that the number of predicted contacts and the number of clusters obtained from the simulations provide a confidence level for the prediction quality. We call the entire procedure TOUCHSTONE.

Methods

The SICHO Lattice Model.

The SICHO model is a 646-neighbor lattice embedded in an underlying cubic lattice grid with a spacing of 1.45 Å. The energy function consists of three types of terms: Egeneric, Especific, and Erest. Egeneric biases the model chain toward protein-like conformations and is independent of amino acid sequence (5). Especific is a sequence-dependent potential that consists of three terms: a weak bias toward the predicted secondary structure (14, 15), a sequence-dependent short-range geometric bias for fragments (16), and a protein-specific pairwise potential (17). Homologous proteins are removed from the database when the latter two terms are calculated. As in threading discussed below, no proteins with an E value < 0.01 are considered. The last term, Erest, is the newly derived restraint term extracted from threading (see below).

Prediction of Tertiary Restraints.

Two kinds of restraints are incorporated into our prediction scheme. The first type is the side chain contact predictions derived from the threading results. Here, a pair of residues predicted to be in contact must be at least five residues apart in the sequence. Quite often in threading, even when no template is hit with a significant Z score, common contacting substructures can be found in templates with weak Z scores from which the contacts can be predicted. Sometimes these common substructures that are in contact have a similar secondary structure and sometimes they do not, but they can experience similar interaction environments. In particular, our new threading algorithm, PROSPECTOR (3), uses four different scoring functions. For the top 20 scoring structures (the top 5 structures from each scoring function), whose Z scores are >1.3, a contact is predicted when it is present in 25% of the structures. These contacts are also converted to a protein-specific pairwise potential (17), which is used in the subsequent threading iteration. The consensus contacts are again collected, and the procedure is repeated for a third time. Then, all of the predicted contacts from all stages are used in the folding simulation. The restraint potential is not designed to satisfy all predicted contacts, because they are not exactly correct. This inaccuracy arises because the contacts are sometimes collected from incorrect hits and also because of alignment problems in the threading algorithm. Therefore, a given structure receives a favorable energy contribution when a predicted contact is satisfied within plus or minus two residues. Furthermore, there is no energy penalty when at least 50% of all of the predicted contacts are satisfied. The 50% threshold is set with reference to the average accuracy of the contact prediction, which is 73.6% (see below); it is kept below this average accuracy to ensure that too many wrong contacts are not enforced. In practice, for 62 of the 65 proteins, the accuracy is better than 50%. Finally, local distance restraints are derived from multiple sequence alignments for short-sequence fragments no more than four residues in length.
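The consensus step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the PROSPECTOR implementation: the function name `consensus_contacts` and the input format (one set of residue-index pairs per top-scoring template) are hypothetical. Only the 25% frequency threshold and the five-residue minimum sequence separation are taken from the text.

```python
from collections import Counter


def consensus_contacts(template_contacts, min_fraction=0.25, min_separation=5):
    """Return residue pairs present in at least `min_fraction` of the templates.

    template_contacts: list of sets of (i, j) residue-index pairs, one set per
    top-scoring threading template (hypothetical input format).
    """
    counts = Counter()
    for contacts in template_contacts:
        # Normalise pair order and drop pairs that are too close in sequence.
        seen = {(min(i, j), max(i, j)) for i, j in contacts
                if abs(i - j) >= min_separation}
        counts.update(seen)  # each template votes at most once per pair
    needed = min_fraction * len(template_contacts)
    return sorted(pair for pair, n in counts.items() if n >= needed)
```

With eight templates, for example, a pair must appear in at least two of them to be kept; pairs closer than five residues in sequence are never counted.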

We employ replica exchange Monte Carlo (18) to search conformational space. This protocol has been shown to be more effective than conventional simulated annealing in a simple protein-like model (19). Fifty copies at different temperatures covering the entire folding transition region are used. Then, the conformations in trajectories at the three lowest temperatures are clustered (12). It takes about 100–150 days of computer time to perform 50 runs for a protein. Clustering is performed in two steps: (i) first, structures are clustered within each trajectory, and (ii) the resulting centroids are clustered again among the different trajectories.
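For readers unfamiliar with replica exchange, the exchange move can be sketched as follows. The sketch uses the standard Metropolis swap criterion, accepting with probability min(1, exp[(1/kT_i − 1/kT_j)(E_i − E_j)]); whether the simulations reported here use exactly this form, which pairs are attempted, and how often are not stated in the text and are assumptions. The function names and the reduced units (k_b = 1) are hypothetical.

```python
import math
import random


def swap_probability(energy_i, temp_i, energy_j, temp_j, k_b=1.0):
    """Standard Metropolis probability of exchanging replicas i and j."""
    delta = (1.0 / (k_b * temp_i) - 1.0 / (k_b * temp_j)) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def exchange_sweep(energies, temperatures, rng=random.random):
    """Attempt one sweep of swaps between adjacent temperature levels.

    energies[k] is the energy of the conformation currently held at
    temperatures[k]; swapping list entries stands in for swapping the
    conformations themselves. The list is modified in place and returned.
    """
    for k in range(len(energies) - 1):
        p = swap_probability(energies[k], temperatures[k],
                             energies[k + 1], temperatures[k + 1])
        if rng() < p:
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return energies
```

A swap that moves the lower-energy conformation to the lower temperature is always accepted; the reverse swap is accepted only with a Boltzmann-weighted probability, which lets trapped conformations escape through the high-temperature replicas.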

Structure Selection with an Atomic Potential.

The structures generated from the Monte Carlo simulations are rebuilt at atomic detail (20), and a heavy-atom knowledge-based potential (13) is then used to rank-order them. A scan-and-delete procedure is applied, in which the lowest energy structure is selected for each cluster, and then all of the higher-energy structures in the same cluster are removed. After this process, all of the nonclustered structures and the lowest energy structure from each cluster remain. The five lowest energy structures are then selected.
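The scan-and-delete selection can be written compactly. The following is a minimal sketch, assuming each structure is given as a hypothetical `(name, energy, cluster_id)` tuple with `cluster_id` set to `None` for nonclustered structures; the energies would come from the atomic potential, which is not reproduced here.

```python
def scan_and_delete(structures, top_n=5):
    """Keep the lowest-energy member of each cluster plus all nonclustered
    structures, then return the names of the `top_n` lowest-energy survivors.

    structures: iterable of (name, energy, cluster_id) tuples; cluster_id is
    None for a structure that belongs to no cluster (hypothetical format).
    """
    kept = []
    used_clusters = set()
    # Scan from lowest to highest energy.
    for name, _energy, cluster in sorted(structures, key=lambda s: s[1]):
        if cluster is not None:
            if cluster in used_clusters:
                continue  # a lower-energy member of this cluster is already kept
            used_clusters.add(cluster)
        kept.append(name)
    return kept[:top_n]
```

Because the scan runs in order of increasing energy, the first member of a cluster that is encountered is automatically its lowest-energy representative.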

Results and Discussion

The 65 test proteins, which cover a wide variety of protein types, are given in Table 1. There are 4 small proteins (which have little secondary structure), 21 α-proteins, 20 β-proteins, and 20 αβ-proteins, according to the CATH classification (21) obtained from the biomolquest server (22). The proteins range in length from 39 to 146 aa. The test set also includes 40 proteins randomly chosen from the paper by Simons et al. (7).

Table 1
Predicted tertiary restraints and folding simulation results

The tertiary restraints and the results of the folding simulations are also found in Table 1. The average accuracy of secondary structure predictions (Q3) is 79.1%. On average, 33.0% of the long-range contacts are correctly predicted, and, on average, 73.6% are correct within plus or minus two residues. In addition, the average error in the rms deviation (rmsd) of the local fragment prediction is 0.38 Å. It also should be noted that the number of predicted contacts has substantially increased from our other study (6), where correlated mutation analysis was used.
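To make the "within plus or minus two residues" measure concrete, the following sketch counts a predicted contact as correct if some native contact lies within the given shift at both ends of the pair. That reading of the tolerance, the function name `contact_accuracy`, and the representation of contacts as residue-index pairs are assumptions made for illustration.

```python
def contact_accuracy(predicted, native, tolerance=2):
    """Fraction of predicted contacts matched by a native contact whose two
    residue indices each differ by at most `tolerance` positions."""
    native_set = {(min(i, j), max(i, j)) for i, j in native}
    if not predicted:
        return 0.0
    shifts = range(-tolerance, tolerance + 1)
    hits = 0
    for i, j in predicted:
        i, j = min(i, j), max(i, j)
        if any((i + di, j + dj) in native_set for di in shifts for dj in shifts):
            hits += 1
    return hits / len(predicted)
```

Setting `tolerance=0` gives the exact-match accuracy, so the two figures quoted above correspond to the same calculation at two tolerance settings.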

Fig. 1 shows that the prediction accuracy grows as the number of predicted contacts increases; accuracy reaches 70% for 34 of the 41 cases where the number of restraints is larger than the number of protein residues. This trend arises because the number and the accuracy of the restraints increase together when the threading algorithm detects significant common local structures.

Figure 1
The number of the predicted long-range contacts and their accuracy (within one or two residues) are shown. Proteins of the different structural types are plotted separately: △, small proteins; ●, α-helical proteins; □, β-proteins; ...

For 47 of 65 proteins (72.3%), at least one cluster centroid (within the top five centroids, at most) with an rmsd <6.5 Å from native was successfully obtained (44 ≤ 6 Å, 39 ≤ 5 Å). 2lfb has the ninth cluster with an rmsd of 4.9 Å. All have the correct topology. When the atomic potential is used in the selection procedure, 50 proteins were successfully predicted (46 ≤ 6 Å, 39 ≤ 5 Å). If the best structure is counted, 58 proteins (89.2%) have a structure ≤6.5 Å. On the other hand, the lowest energy structures of only 36 proteins satisfy this criterion. This result shows the imperfections in the current folding potentials as well as the practical usefulness of selecting structures by population with the clustering algorithm. In many cases, there are pairs of topological mirror-image structures (where the chirality of turns is reversed, but helices, if present, are right-handed) among the obtained cluster centroids. It is interesting to note that when one of the centroids has the proper fold, in most cases the mirror-image structure is also obtained.

Fig. 2 shows some representative results for the superimposition of the experimental and predicted structures extracted from the native-like cluster. The predicted structures are shown by thick lines and the native (experimental) structures are shown by thin lines. Fig. 2A shows 1aoy, whose rmsd from native is 4.5 Å. Fig. 2B shows 1mba, whose rmsd from native is 2.7 Å. Fig. 2C shows the best cluster centroid of 2pcy, whose rmsd from native is 4.0 Å. Fig. 2D shows 2azaA, with an rmsd from native of 4.5 Å. Fig. 2E shows 1shaA, with an rmsd from native of 3.6 Å. Fig. 2F shows 1erv, whose rmsd from native is 2.3 Å. Fig. 2G shows 1cewI, whose rmsd from native is 7.2 Å. Fig. 2H shows 1tsg, whose rmsd from native is 8.7 Å, and Fig. 2I shows 5fd1, whose rmsd from native is 9.7 Å.

Figure 2
Superimposition of representative experimentally observed and predicted structures. The predicted structures are shown by thick lines, and the native structures are shown by thin lines. (A) 1aoy, rmsd 4.5 Å. (B) 1mba, rmsd 2.7 Å. (C) 2pcy, ...

To make an ab initio folding algorithm practical, one has to establish the level of confidence of a given prediction. In the majority of the cases in Fig. 3, there is a proper fold when the number of obtained clusters is small. Indeed, if the number of clusters is equal to or less than five, a proper fold is obtained in 28 of 33 (84.8%) cases. Moreover, all 16 cases were successful when the number of the obtained clusters was two or three.

Figure 3
The number of successful cases relative to the number of clusters. Black, the successful cluster (rmsd <6.5 Å) is obtained as the first cluster; crosshatch, the second cluster; horizontal hatch, one of the other clusters; white, successful ...

Fig. 4 shows the relationship between the quality of the simulation results and the number of the predicted contacts, which is another indication of how successful the simulation should be. When the number of restraints is more than the number of residues in the sequence, a cluster centroid closer than 6.5-Å rmsd to the native structure is obtained in 32 of 41 cases (78.0%). When the number of restraints is 150% or more relative to the sequence length, the success rate improves further to 88.0% (22 of 25 proteins). A proper fold is always obtained in either of two cases: (i) when the number of obtained clusters is equal to or less than three or (ii) as shown in Table 2, when the number of clusters is less than or equal to five, and the number of provided restraints is 150% or more of the sequence length.
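The two indicators can be combined into a simple rule of thumb. In the sketch below, the thresholds (three clusters, five clusters, restraints exceeding the sequence length, and restraints at 150% of the sequence length) are taken from the text, but the three-tier labels and the function name `fold_confidence` are illustrative assumptions, and the rule has only been observed on this 65-protein test set.

```python
def fold_confidence(n_clusters, n_restraints, length):
    """Classify a prediction using the empirical indicators reported above.

    n_clusters:   number of clusters produced by the folding simulations
    n_restraints: number of predicted long-range contact restraints
    length:       number of residues in the sequence
    """
    ratio = n_restraints / length
    # Conditions under which a proper fold was always obtained in the test set.
    if n_clusters <= 3 or (n_clusters <= 5 and ratio >= 1.5):
        return "high"
    # Conditions with reported success rates of roughly 78-88%.
    if n_clusters <= 5 or ratio > 1.0:
        return "moderate"
    return "low"
```

For example, a 100-residue protein with 150 restraints and five clusters falls in the "high" tier, whereas the same protein with 50 restraints and eight clusters falls in the "low" tier.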

Figure 4
The number of long-range restraints and the quality of the clusters for each protein. (A) rmsd of the best cluster centroid. (B) rmsd of the best structure among all of the simulations. △, small proteins; ●, α-helical proteins; ...

Table 2
Summary of successful predictions with the number of clusters and restraints

It is important to note that in contrast to other methods (7, 23), both the accuracy of contact prediction and the success rate when the number of predicted contacts is sufficiently large are completely independent of the type of secondary structure of the protein.

There are two situations in which our method failed to obtain a native-like cluster. In the first case, there are no proper structures below 6.5 Å in the predicted structure pool, so that there is no chance to get a resulting proper cluster centroid (eight cases: 2ezk, 1ah9, 1iyv, 1rip, 4fgf, 1ctf, 1tsg, and 5fd1). However, for 4fgf, 1tsg, and 5fd1, the global topology of the best cluster is almost correct (rmsds of 9.7 Å, 8.7 Å, and 9.7 Å, respectively). For 1ah9, the positions of the last two β-strands are exchanged, and the rest of the structure is correct in the seventh cluster centroid (rmsd of 7.5 Å). For 1ctf, even the best structure did not have the correct topology, although its rmsd was <6.5 Å. For the other proteins, global assembly of the correctly predicted local substructures went wrong.

The other undesirable scenario is when there are some proper folds below 6.5 Å in the pool. These folds were neglected or averaged out during the two steps of the clustering procedure because there were too few of them (10 cases: 6pti, 1a32, 2af8, 1bq9A, 1pse, 2fdn, 1stu, 1vcc, 1stfI, and 1cewI). However, for 1a32, the topology of the first cluster centroid is correct despite its poor rmsd. A small number of good structures are included in this cluster, but they are averaged out by a larger number of improper folds. As for 1stu, in the fourth cluster centroid, the direction of the C-terminal helix deteriorated because of contamination by incorrect structures in the cluster, but the rest of its fold is correct. Interestingly, the eighth cluster centroid of 1stu is the mirror image of the native structure. As for 1cewI, in the first cluster centroid, the β-sheet and the large helix located over it are consensus features and thus well reproduced, but the remaining fragment comprising residues 60–80 was distorted. For 1bq9A, structures with an rmsd <5 Å were neglected in the clustering process. For 1pse and 2fdn, there was only one proper structure (rmsd <6.5 Å) in the simulations, which was neglected in the clustering procedure.

We have also tried candidate selection by using the atomic potential to address the issue of rare but good quality structures. Furthermore, when the near-native structures do form a cluster, the atomic potential/cluster picking procedure can usually also pick those good candidates in the top five (see below). For each of the 65 proteins, five structures are selected for final analysis. The best structures selected by the atomic potential also are shown in Table 1. In three cases, 1a32, 1stu, and 1bq9A, the atomic potential selected near-native structures that do not belong to any cluster and that are 2–3 Å better than the cluster-selected ones. In 2fdn, the atomic potential picked a structure 7.6-Å rmsd from native, whereas the best cluster has an rmsd of 9.6 Å. For the rest of the cases, the two methods have comparable performance. With this procedure, we have successfully predicted the near-native structure in 50 of the 65 cases (76.9%), an improvement of 3 proteins.

In examining the 40 proteins also used by Simons et al. (7), our method clearly did better in 19 proteins and worse in 5. For the remaining 16 proteins, the results are almost the same or sufficiently similar; thus, it is hard to say which is better (because of differences in clustering methods).

Conclusions

We have demonstrated that ab initio structure prediction has become more feasible by using tertiary restraints derived from threading results, even when the threaded structures lack the global topology of the target protein. For 47 of 65 proteins, the simulated structures are clustered into a proper fold of less than 6.5-Å rmsd to the native structure. When the atomic potential is used, the number of correct predictions increases to 50 of 65. The resulting structure can be used for further analyses such as functional annotation by matching three-dimensional active-site motifs (24) or for low-resolution ligand docking (25).

Based on the present study, we can draw the following conclusions. First and foremost, by using predicted tertiary restraints of moderate accuracy, it is possible to predict protein structures of up to ≈150 residues in length. For example, 1mba, which is 146 residues long, has folded to 2.7-Å rmsd from the native structure, which was not previously possible. Considering the moderate accuracy and abundance of predicted contacts, the restraints are implemented in such a way that only 50% of them need to be satisfied; yet, this is sufficient to guide the conformation toward native-like structures in many cases. Another important point is that this procedure facilitates the correct folding of proteins having any kind of secondary structure. Finally, we have established empirical indicators of successful prediction; these are the ratio of the number of contacts to the protein's size (the number and accuracy of the contacts are highly correlated) and the number of clusters generated by the folding simulation. These indicators of when folding is successful should be quite useful in blind predictions.

Despite these significant improvements, almost all of the components of the algorithm may have to be revised to increase the fidelity and accuracy of this prediction engine further. For better or worse, the quality of the tertiary restraints dictates the success of our folding algorithm. Thus, additional work to improve their number and accuracy is still required; efforts to improve the threading-based contact prediction protocol as well as the evolutionary methods (6) will be necessary. Furthermore, both the energy function and the conformational search scheme need to be dramatically improved to reduce their reliance on the tertiary contacts. Nevertheless, the current study demonstrates that the methodology has reached a practical level. We note that this fully automated ab initio folding algorithm is one of the components of a unified approach for protein structure/function prediction (26, 27) that also includes generalized comparative modeling and that is applicable for large-scale prediction. Efforts to fold all of the small proteins in Mycoplasma genitalium are estimated to take a minimum of 8,500 CPU days on our cluster.

Acknowledgments

This research was supported in part by National Institutes of Health Grants GM-37408 and GM-48835.

Abbreviation

rmsd, rms deviation

References

1. Sanchez R, Sali A. Proteins, Suppl. 1997;1:50–58. [PubMed]
2. Guex N, Peitsch M C. Electrophoresis. 1997;18:2714–2723. [PubMed]
3. Skolnick J, Kihara D. Proteins. 2001;42:319–331. [PubMed]
4. Panchenko A R, Marchler-Bauer A, Bryant S H. J Mol Biol. 2000;296:1319–1331. [PubMed]
5. Kolinski A, Skolnick J. Proteins. 1998;32:475–494. [PubMed]
6. Ortiz A R, Kolinski A, Skolnick J. J Mol Biol. 1998;277:419–448. [PubMed]
7. Simons K T, Strauss C, Baker D. J Mol Biol. 2001;306:1191–1199. [PubMed]
8. Aszodi A, Gradwell M J, Taylor W R. J Mol Biol. 1995;251:308–326. [PubMed]
9. Huang E S, Samudrala R, Ponder J W. J Mol Biol. 1999;290:267–281. [PubMed]
10. Berne B J, Straub J E. Curr Opin Struct Biol. 1997;7:181–189. [PubMed]
11. Park B, Levitt M. J Mol Biol. 1996;258:367–392. [PubMed]
12. Betancourt M R, Skolnick J. J Comp Chem. 2001;22:339–353.
13. Lu H, Skolnick J. Proteins. 2001;44:223–232. [PubMed]
14. Rost B, Sander C. Proc Natl Acad Sci USA. 1993;90:7558–7562. [PMC free article] [PubMed]
15. Jones D T. J Mol Biol. 1999;292:195–202. [PubMed]
16. Kolinski A, Jaroszewski L, Rotkiewicz P, Skolnick J. J Phys Chem. 1998;102:4628–4637.
17. Skolnick J, Kolinski A, Ortiz A. Proteins. 2000;38:3–16. [PubMed]
18. Swendsen R H, Wang J S. Phys Rev Lett. 1986;57:2607–2609. [PubMed]
19. Gront D, Kolinski A, Skolnick J. J Phys Chem. 2001;113:5065–5071.
20. Feig M, Rotkiewicz P, Kolinski A, Skolnick J, Brooks C L III. Proteins. 2000;41:86–97. [PubMed]
21. Orengo C A, Michie A D, Jones S, Swindells M B, Thornton J M, Jones D T. Structure (London). 1997;5:1093–1108. [PubMed]
22. Bukhman Y, Skolnick J. Bioinformatics. 2001;17:468–478. [PubMed]
23. Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll D R, Kaźmierkiewicz R, Oldziej S, Wedemeyer W J, Gibson K D, Arnautova Y A, et al. Proc Natl Acad Sci USA. 2001;98:2329–2333. (First Published February 20, 2001; 10.1073/pnas.041609598) [PMC free article] [PubMed]
24. Fetrow J S, Skolnick J. J Mol Biol. 1998;281:949–968. [PubMed]
25. Wojciechowski M, Skolnick J. J Comput Chem. 2001, in press.
26. Kolinski A, Betancourt M R, Kihara D, Rotkiewicz P, Skolnick J. Proteins. 2000;44:133–149. [PubMed]
27. Skolnick J, Kolinski A. Adv Chem Phys. 2001, in press.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences