A Measure of Progress in Fold Recognition?

Aron Marchler-Bauer and Stephen H. Bryant*

Computational Biology Branch, National Center for Biotechnology Information
National Library of Medicine, NIH Bldg. 38A, Room 8N805
8600 Rockville Pike, Bethesda, MD 20894

PROTEINS: Structure, Function, and Genetics Supplement 3:218-225, 1999

(Reproduced with permission - U.S. Government work)

We present a retrospective analysis of CASP3 threading predictions, applying evaluation and assessment criteria used at CASP2. Our purpose is twofold. Firstly, we wish to ask whether measures of model accuracy are comparable between CASP3 and CASP2, even though they have been calculated differently. We find that these quantities are effectively the same, and that either may be used to compare model accuracy. Secondly, we wish to assess progress in fold recognition by comparing the numbers of CASP2 and CASP3 models that cross specific accuracy thresholds. We find that the number of accurate models at CASP3 drops sharply as the targets become more difficult, with less extensive similarity to known structures, exactly the pattern seen at CASP2. CASP3 teams do not seem to have predicted accurate models for targets of greater difficulty, and for a given difficulty range the best CASP3 models seem no more accurate than the best models at CASP2. At CASP3, however, we find greater numbers of accurate models for medium-difficulty targets, with extensive similarity to a known structure but no shared sequence motifs. Threading methods would appear to have become more reliable for modeling based on remote evolutionary relationships.

Introduction

Measuring progress in fold recognition would appear to be a simple matter. "Blind" predictions for CASP2(1) and CASP3(2) represent the state of the art in threading methods as of 1996 and 1998, respectively, and one need only use this database to ask: Were the threading models produced for CASP3 more accurate than models produced for CASP2? Were there greater numbers of accurate models produced at CASP3, as compared to CASP2? These seemingly simple questions may not be so simple to answer, however. There are significant technical differences in the way threading alignment accuracy was measured at CASP2 and CASP3, and to compare model accuracy one must verify that these alternative measures are equivalent, or nearly so. To compare the numbers of accurate models one must also define what one means by "accurate". While it is straightforward to choose a specific threshold, there is certainly no unique or universally accepted way to do so. Furthermore, while one might expect prediction success to depend on target difficulty, there is no reason to expect that CASP2 and CASP3 have presented equal mixtures of easy, medium and hard targets. Yet to make a valid comparison one must somehow assign target difficulties, for which there is also no unique or universally accepted metric.

Model accuracy at CASP3 and CASP2 was evaluated by comparing threading alignments to a reference structure-structure alignment. The CASP3 assessor used the number of correctly aligned target residues, sf0+sf4, as a part of his competitive ranking of threading models(3,4), and the CASP2 assessor similarly used the fraction of correctly aligned residues, ASp4(5,6). The reference structure-structure alignments used to compute sf0+sf4 and ASp4 are quite different, however. At CASP3 the structure comparison program PROSUP(7) searched among alternative structure-structure alignments of the predicted and observed target structures to find a reference alignment that maximized sf0. At CASP2 the structure comparison programs DALI(8), SSAP(9) and VAST(10) compared the target structure to all templates in the database and computed a single structure-structure alignment for each template found to be similar to the target. It is impossible to know a priori whether these differences in the "standard of truth" are important, and whether model accuracy has been measured in comparable ways at CASP3 and CASP2. To address this question we therefore compute CASP2 accuracy measures for all CASP3 models and present here a quantitative comparison.

Sustained performance of CASP3 threading methods was assessed by ranking models for each target, considering the sf0+sf4 accuracy measure, with 6 points awarded for first-place accuracy, 5 points for second place, and so on(3,4). As noted by the CASP3 assessor, the sum of points at CASP3 is analogous to a statistic from the sport of formula-1 automobile racing. In formula-1 racing, drivers are similarly awarded points based on their ranking in a number of races and compared according to the sum of points. Sustained performance at CASP2 was assessed differently, by counting the number of models that exceeded a fixed (though arbitrary) accuracy threshold with respect to the ASp4 or CSpc (Contact Specificity) accuracy measures(11,12). CASP2 scores are analogous to a statistic from a different competitive sport, baseball. Baseball players are often compared according to the number of "home runs", i.e. the number of times they hit the ball farther than a fixed (though arbitrary) distance. Either assessment style is a reasonable way to judge sustained performance. The CASP3 assessment style seems less suited to judging progress over time, however, because it is based entirely on relative performance. From the number of formula-1 points, for example, one cannot tell who was driving faster, the top driver in 1996, or the top driver in 1998. But one may infer that a baseball player with more "home runs" was hitting the ball farther than a player with less, no matter when these performances were recorded. To measure progress in fold recognition we therefore rely on counts of accurate models, using accuracy thresholds equivalent to those applied at CASP2.

Measuring progress in fold-recognition at CASP is perhaps more difficult than comparing formula-1 drivers or baseball players, however. At CASP the "playing field" does not stay the same from year to year, since new targets must be chosen and the extent of their similarity to known structures will vary. To assess progress one must somehow "level the playing field" by correcting for differences among the targets and comparing predictions for targets of comparable difficulty. After CASP2 it was suggested that target difficulty be characterized by plotting the degree of sequence similarity vs. the extent of structural similarity with respect to available templates(11,12). Difficulty categories are assigned based on a target's falling within distinct regions of this "phase diagram". "Medium" targets, for example, are those with 60% or more of residues superimposable on a known structure, but without recognizable sequence motifs. Here we present this "phase diagram" of target difficulty for both CASP3 and CASP2 targets, and we compare the numbers of accurate CASP3 and CASP2 models by difficulty categories. We suggest an interpretation, but the reader may of course use these data to make his or her own assessment of progress!

Methods

We obtained predictions for CASP3 fold recognition targets from the LLNL Prediction Center(13). CASP3 predictions were made available as three-dimensional models in PDB-format, including predictions originally submitted as target-template alignments. For calculation of CASP2 evaluation quantities we converted predictions back into target-template alignments, for all models where the PDB(14) template was named in the prediction and where we could unambiguously assign 60% or more of model residues to residues in that template, with 2.5 Å or less Cα RMS residual. We also converted models submitted as separate segments into a single model including a larger fraction of the target, whenever this did not result in physically implausible models. The CASP3 organizers treated individual segments as separate models, and a small number of predictions thus differ from those they evaluated, but this difference has no significant effect on evaluations shown below. CASP2 evaluation quantities were calculated using the structure-structure alignments generated by the VAST algorithm, as distributed to CASP3 predictors prior to the meeting(15). We could not calculate evaluation quantities based on reference alignments by DALI or SSAP, since target-template alignments by these methods were not computed for CASP3(2). For models based on templates not recognized by VAST and models where alignment reconstruction was not possible we calculated only those CASP2 evaluation quantities that do not depend on a reference structure-structure alignment, including Contact Specificity, CSpc. CASP2 evaluation quantities have been described in detail previously(6) and are summarized in the caption to Figure 1. The complete set of models used in this study and CASP2 evaluation quantities calculated for CASP2 and CASP3 predictions are available electronically(16,17). CASP3 evaluation quantities used in the comparisons below were taken directly from the WWW site maintained by the CASP3 organizers(13).

Results

Comparable Measures of Model Accuracy?

To determine whether measures of model accuracy from CASP2 and CASP3 are comparable we plot in Figure 1 (sf0+sf4)/nres vs. ASp4 and CSpc. The quantity sf0+sf4 gives the number of correctly aligned residues relative to the PROSUP structure-structure alignment used at CASP3, allowing a tolerance of 4 residues shift error. We express this value as a fraction, (sf0+sf4)/nres, where nres is the length of the predicted CASP3 alignment, so as to place it on the same scale as the CASP2 quantities. The CASP2 quantity ASp4 (Alignment Specificity) gives the fraction of correctly aligned residues relative to VAST structure-structure alignments. The CASP2 quantity CSpc (Contact Specificity) gives the fraction of correctly predicted contacts, Cα pairs under 8 Å apart, separated by 5 or more residues in the polypeptide chain. One may see that (sf0+sf4)/nres and ASp4 are highly correlated. For CASP3 fold-recognition models based on templates recognized as similar by VAST the correlation coefficient is 0.89, increasing to 0.92 if comparative modeling targets are also included (not shown). Values for sf0 and sf4 are calculated for all models, regardless of their similarity to the target, and in this respect (sf0+sf4)/nres is similar to CSpc. One may see there are a large number of CASP3 models recognized as inaccurate by either measure, most based on templates not recognized as similar to the target by VAST (small dots in Figure 1). For the subset of models based on templates similar to the target, as recognized by VAST, the correlation of (sf0+sf4)/nres and CSpc is 0.80.
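As an illustration of the alignment-based measures, the following minimal sketch computes a specificity of the ASp4 or (sf0+sf4)/nres kind. It assumes, for illustration only, that both the predicted and the reference alignment are stored as dictionaries mapping target residue numbers to template residue numbers, and that shift error is the difference in aligned template position. The function name and data layout are hypothetical and are not those of the CASP2 or CASP3 evaluation software.

```python
def alignment_specificity(predicted, reference, tolerance=4):
    """Fraction of predicted aligned residues that agree with the reference.

    Both alignments are dicts mapping target residue number -> template
    residue number (an assumed representation). A residue counts as correct
    when its template position is within `tolerance` of the reference one.
    """
    if not predicted:
        return 0.0
    correct = 0
    for target_res, template_res in predicted.items():
        ref_res = reference.get(target_res)
        if ref_res is not None and abs(template_res - ref_res) <= tolerance:
            correct += 1
    return correct / len(predicted)


# Hypothetical toy alignments, not taken from any CASP model.
predicted = {1: 10, 2: 11, 3: 15, 4: 30}
reference = {1: 10, 2: 12, 3: 13, 5: 20}

print(alignment_specificity(predicted, reference))               # 4-residue tolerance
print(alignment_specificity(predicted, reference, tolerance=0))  # exact agreement only
```

In this sketch a residue that is absent from the reference alignment counts as incorrect, and the denominator is the number of residues in the predicted alignment, in the spirit of nres.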

Figure 1

Figure 1: Correlation of CASP2 and CASP3 model accuracy measures, (a) (sf0+sf4)/nres vs. ASp4, and (b) (sf0+sf4)/nres vs. CSpc. Values are expressed as percentages. The CASP3 measure sf0 gives the number of correctly aligned residues according to the PROSUP target vs. model structure-structure alignment. The CASP3 measure sf4 gives the number of additional residues that are correctly aligned if one allows a shift-error tolerance of 4 residues. Here we plot Alignment Specificity (sf0+sf4)/nres, where nres is the number of residues in the CASP3 model. The CASP2 measure ACrct gives the number of correctly aligned residues according to the VAST target vs. template structure-structure alignment. The CASP2 measure ACrct4 gives the total number of residues that are correctly aligned allowing a 4-residue shift-error tolerance. Here we plot Alignment Specificity, ASp4=(ACrct+ACrct4)/nres. In (a) we draw a line with slope 1 and intercept 0, to indicate the expected behavior if (sf0+sf4)/nres and ASp4 were identical. The CASP2 quantity CCrct gives the number of residue pairs predicted to be in contact by the threading model which are also in contact in the true structure of the target. Here we plot Contact Specificity, CSpc=CCrct/nc, where nc is the total number of contacts predicted by the threading model. The relationship of CSpc and (sf0+sf4)/nres (or ASp4) is approximately quadratic, and in (b) we plot CSpc on a scale that is linear in the square root of CSpc, and we calculate the correlation coefficient accordingly. CASP3 models based on templates recognized as similar by VAST are plotted as large dots, and those based on templates not recognized as similar by VAST are plotted as small dots. Fragmentary models assigning coordinates to less than 45% of domain residues are omitted from the plots.
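The definition CSpc=CCrct/nc can be sketched in a few lines. The example below assumes that model and target Cα coordinates are available as dictionaries keyed by residue number, and it takes a contact to be a Cα pair under 8 Å apart and separated by 5 or more residues, as stated in the text. The function names and toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def contacts(ca_coords, max_dist=8.0, min_sep=5):
    """Return the set of residue pairs (i, j) whose C-alpha atoms are in contact.

    ca_coords maps residue number -> (x, y, z) in Angstroms.
    """
    pairs = set()
    for i, j in combinations(sorted(ca_coords), 2):
        if j - i >= min_sep and math.dist(ca_coords[i], ca_coords[j]) < max_dist:
            pairs.add((i, j))
    return pairs


def contact_specificity(model_coords, target_coords):
    """CSpc = CCrct / nc: fraction of model contacts also present in the target."""
    predicted = contacts(model_coords)
    if not predicted:
        return 0.0
    observed = contacts(target_coords)
    return len(predicted & observed) / len(predicted)


# Hypothetical three-residue toy structures (coordinates in Angstroms).
model = {1: (0.0, 0.0, 0.0), 6: (5.0, 0.0, 0.0), 11: (10.0, 0.0, 0.0)}
target = {1: (0.0, 0.0, 0.0), 6: (5.0, 0.0, 0.0), 11: (20.0, 0.0, 0.0)}

print(contact_specificity(model, target))
```

Because this quantity compares the model directly with the target structure, it needs no reference structure-structure alignment, which is why it can be computed for every model.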

To determine whether the ranking of CASP3 predictions would be affected by differences in model accuracy measures we have calculated the CASP3 assessor's "formula-1" table using CSpc (Contact Specificity), a CASP2 measure that does not depend on structure-structure alignment. As in the CASP3 assessor's table(18), models are given 6 points if they have the highest CSpc, 5 points if they have the second-highest CSpc, etc. We have not attempted to award "bonus points", and the ranking is based entirely on the relative values of CSpc for models based on templates judged by the assessor to belong to the correct SCOP superfamily or fold, i.e. those assigned a letter A through F in his assessment. This analysis is shown in Table 1. Comparing to the assessor's "formula-1" table(18), one sees that ranking by CSpc gives essentially the same results as a ranking that considered sf0+sf4. The top 6 prediction teams are the same and occur in almost the same order, and there are only minor differences in the ranking of the remaining teams. Results are also very similar to the CASP3 assessor's "formula-1" table if one uses ranks based on ASp4 (not shown). One may conclude that the differences in model accuracy measures between CASP2 and CASP3 were not critical with respect to CASP3 assessment. The CASP3 evaluation quantities are very similar to those used previously at CASP2.

Table 1

SCOP Class       | Superfamily              | Fold             | New Fold |
Target        MR | 81 44 85 83 54 53 63b 79 | 46 71a 67 43 71b | 52 56    | C|Pt
================================================================================
212 Bryant    (1)| B           B  A  E   F  | A  E             |    N     | 8|28
166 SB-Fold   (3)| A           A         C  | C  A             | N  N     | 7|28
005 Jones     (2)| D     D  B     C  B      | F  F             |          | 7|22
176 UNAGI     (4)|       B  E     B      D  | F                | N  N     | 7|18
217 Sippl     (5)| C  B        D  F         | F  D             |          | 6|17
019 UCSC      (6)|    D     E            B  | F                | N        | 5|12
-----------------+--------------------------+------------------+----------+-----
074 Sternberg (7)|             C     F      | D                |    N     | 4|09
028 Godzik   (11)|    C  E  A               |                  |          | 3|12
017 Sjolander(12)|    A     C               | F                |          | 3|11
156 Benner-Co(15)|             F     D   A  |                  |          | 3|10
066 Fischer  (10)|                   A      |        E         |    N     | 3|09
033 Elofsson (13)|                F         | F  B             |          | 3|07
061 Yang      (9)|             E         E  | F                |          | 3|05
090 Valencia (16)|                          | F  F             |    N     | 3|03
003 Hubbard  (17)|                          | B  C             |          | 2|09
201 Tatsuya  (14)|                F         |           C      |          | 2|05
147 BMERC    (22)|             F     C      |                  |          | 2|05
105 Moult    (25)|       C     F            |                  |          | 2|05
040 Olszewski(21)|                          |        D     F   |          | 2|04
273 Reva     (23)| E                        | F                |          | 2|03
009 Xu-Ying  (18)|       F        E         |                  |          | 2|03
085 Park     (19)|       F                  | F                |          | 2|02
035 Baker     (8)|             F         F  |                  |          | 2|02
257 Gregoret (38)|       A                  |                  |          | 1|06
045 Torda    (29)|                          |        C         |          | 1|04
162 Coulson  (35)|                D         |                  |          | 1|03
190 Kolinskol(20)|          E               |                  |          | 1|02
023 Timms    (31)|                          | F                |          | 1|01
224 Finkelste(37)|                          | F                |          | 1|01
053 Avbelj   (32)|                          | F                |          | 1|01
072 Weber    (33)|                          | F                |          | 1|01
168 Eisenberg(36)|                F         |                  |          | 1|01
136 Blundell (34)|             F            |                  |          | 1|01
142 Taylor   (28)|       F                  |                  |          | 1|01
143 Solovyev (26)|                          |                  |    N     | 1|01
179 GMD-SCAI (27)|                          |                  |    N     | 1|01

Legend to Table 1: CASP3-style assessment using CASP2 model accuracy measures. Models based on the correct fold, as judged by the CASP3 threading assessor, are awarded one "fold point" as indicated by letters A through F and counted in column "C". Models for each target are ranked by accuracy using the CASP2 evaluation quantity CSpc (see text). Only the model with id=1 is considered for each team and letters A through F (most through least accurate) score 6 through 1 accuracy points respectively, as at CASP3. Accuracy points are summed in column "Pt". Targets 52 and 56 were considered novel folds by the CASP3 assessor. "N" indicates that "None" was predicted for the model with id=1, and awarded one "fold point" and one accuracy point(18). Ranking of teams follows the number of "fold points" and secondarily the sum of accuracy points, as at CASP3. Column "MR" gives the rank assigned to this team in the CASP3 assessor's "formula-1" table, as presented at CASP3(18). We note that the CASP3 assessor's evaluation of model accuracy was not based strictly on numerical measures(3). We find, however, that evaluation of model accuracy based on the sf0 and sf4 measures emphasized at CASP3 gives very similar results: Assignment of model accuracy points based on (sf0+sf4)/nres gives a ranking of the top 6 teams that is the same as shown here, and almost the same as the ranking presented at CASP3(18) (not shown). We emphasize that the CASP3-style assessment shown here is based on relative accuracy; models listed as "A" need not cross the CASP2-style accuracy thresholds applied in Table 2 and Figure 2.
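The point assignment described in the legend to Table 1 can be sketched as follows. This is a simplified illustration, not the assessor's procedure: it assumes that each target is given as a dictionary of CSpc values for the id=1 models already judged to be based on a correct fold, that ranks below fifth all receive the letter F (1 point), and that ties are broken arbitrarily. The "N" entries for novel folds and any bonus points are not modeled, and team names and values are hypothetical.

```python
def accuracy_points(cspc_by_team, top=6):
    """Award 6, 5, 4, ... points by descending CSpc, with a floor of 1 point."""
    ranked = sorted(cspc_by_team, key=cspc_by_team.get, reverse=True)
    return {team: max(1, top - rank) for rank, team in enumerate(ranked)}


def rank_teams(targets):
    """Sum fold points and accuracy points over targets and rank the teams.

    Teams are ordered by number of fold points, then by sum of accuracy points.
    """
    totals = {}
    for cspc_by_team in targets:
        for team, points in accuracy_points(cspc_by_team).items():
            fold, acc = totals.get(team, (0, 0))
            totals[team] = (fold + 1, acc + points)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical CSpc values (percent) for two targets.
targets = [
    {"team_a": 60.0, "team_b": 40.0},
    {"team_a": 30.0, "team_c": 50.0},
]

for team, (fold_points, acc_points) in rank_teams(targets):
    print(team, fold_points, acc_points)
```

Since the points depend only on the order of models within each target, the totals say nothing about whether any model crosses a fixed accuracy threshold.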

Close examination of Figure 1 does reveal some differences in the CASP3 and CASP2 evaluation quantities. There are a small number of models that VAST finds to be completely misaligned, with ASp4=0, while PROSUP assigns nonzero sf0+sf4. Similarly one sees a few models with no correctly predicted contacts, CSpc=0, but with nonzero sf0+sf4 values. Presumably this difference reflects PROSUP's search for alternative structure-structure alignments that maximize sf0, although we note that we cannot directly compare the VAST and PROSUP alignments, since the latter have not been distributed to predictors. There are also a few models with intermediate values of (sf0+sf4)/nres and/or CSpc that appear to be based on templates not recognized as similar by VAST. Some are cases where the CASP3 prediction did not name the template, and others are cases where the extent of target-template similarity falls below VAST's significance threshold. The above analysis shows that these differences are minor issues, however, in comparison of CASP2 and CASP3 model accuracy. As was concluded after CASP2, different structure-structure comparison methods tend to agree in their identification of the more accurate threading models(5,11).

More or Less Difficult Targets?

In Figure 2 we plot a "phase diagram" of target difficulty for fold recognition targets from both CASP2 and CASP3. Each target is characterized by two values, the fraction of target residues that may be superimposed on database templates and the fraction of identical residues in the corresponding structure-structure alignments. These values reflect the extent and degree of similarity to previously known structures, and they together reflect the difficulty or "predictability" of a fold-recognition target. Data are based on structure-structure alignments by the VAST algorithm(10,15,17), although we note that values for CASP2 targets are very similar to those calculated previously from a combination of VAST and DALI alignments(11). As one might expect for fold recognition targets selected by the CASP2 and CASP3 organizers, all targets fall in the "twilight zone" of sequence similarity, below 20% identity. The CASP2 and CASP3 targets vary widely, however, with respect to the extent of structural similarity to available templates.

Figure 2

Figure 2: "Phase diagram" of target difficulty for CASP2 and CASP3 fold recognition targets. The extent of structural similarity of the target and database templates is given as the length of the VAST structure-structure alignment divided by the length of the target chain or domain. The degree of sequence similarity is given by the percentage of identical residues in the VAST alignment. CASP2 fold-recognition targets are numbered 2 through 38 and CASP3 fold-recognition targets are numbered 43 through 83. Targets for which at least one team predicted an accurate model are indicated by a large square symbol. Small symbols indicate other aspects of similarity of the target and database templates: "x" indicates that the target shares recognizable sequence motifs with one or more database templates (see text). Circles indicate that the similarity is detected by VAST only, with filled circles indicating "impossible" targets where the common substructure detected in the template is not very extensive, predicting 25% or less target residue contacts. For each target we consider only those structural neighbors that were available at the time of CASP2 or CASP3, taking VAST data from the sets distributed to predictors at the time(15,17). When more than one database template is similar to the target we average across structural neighbors where the VAST alignment contains at least 85% of the number of residues as the longest VAST alignment. Target length is taken as the length of the chain except in cases when domain boundaries were specified by the CASP2 or CASP3 organizers or correctly identified by one or more teams. For CASP3 targets 63, 71, 79 and 83 two domains were identified, and we consider as the target the domain most similar to database templates(15). 
Two of these domains are listed separately as 63b and 71a in the assessor's table(18) and in Table 1; there were few predictions for the additional domains of targets 63 and 71, but if they are treated as separate targets (63a and 71b)(18) they fall in the "hard" region of the plot, with no accurate predictions. We exclude targets where similarity to database templates was recognizable by BLAST(19) with default parameters; this affects only CASP3 target 85. We note that all CASP3 targets shown as accurately modeled were predicted by at least one group as the model with id=1, with the exception of target 71, where a single model with id=4 had (sf0+sf4)/nres > 50%.
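The averaging over structural neighbors described in the caption to Figure 2 can be illustrated with a short sketch. It assumes that each neighbor is summarized by the length of its structure-structure alignment and its percentage of identical residues; the function name and the toy values are hypothetical and do not come from the VAST data sets.

```python
def phase_diagram_point(target_length, neighbors, min_rel_length=0.85):
    """Return (percent of target superimposed, percent identity) for one target.

    neighbors is a list of (aligned_length, percent_identity) tuples. Only
    neighbors whose alignment has at least `min_rel_length` times the length
    of the longest alignment contribute to the averages.
    """
    if not neighbors:
        return None
    longest = max(length for length, _ in neighbors)
    kept = [(length, pid) for length, pid in neighbors
            if length >= min_rel_length * longest]
    mean_length = sum(length for length, _ in kept) / len(kept)
    mean_identity = sum(pid for _, pid in kept) / len(kept)
    return 100.0 * mean_length / target_length, mean_identity


# Hypothetical target of 100 residues with three structural neighbors.
neighbors = [(80, 10.0), (70, 8.0), (40, 20.0)]
print(phase_diagram_point(100, neighbors))
```

In this toy case the third neighbor is dropped because its alignment is shorter than 85% of the longest one, so it does not influence either coordinate of the plotted point.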

To understand the relationship of target difficulty and prediction success we identify in Figure 2 those targets for which at least one team produced an accurate model. We employ the "critical" accuracy threshold suggested after CASP2(11), that at least 50% of aligned residue pairs in the threading alignment agree with aligned residue pairs in the reference structure-structure alignment, within a shift-error tolerance of 4 residues. For CASP2 targets accurate models are those with Alignment Specificity (ASp4) greater than 50% or Contact Specificity (CSpc) greater than 25%. For CASP3 targets accurate models are those where (sf0+sf4)/nres is 50% or greater, equivalent to the Alignment Specificity threshold from CASP2. We also require that predictors place at least 20% confidence in the corresponding model. For CASP2 models, Fold Recognition Specificity (Conf x TSpc) must be 20% or greater(6), and for CASP3 models at least 1 of the 5 allowed alternatives must cross the model accuracy threshold. Fragmentary models including less than 45% of chain or domain residues are excluded and considered inaccurate for both CASP2 and CASP3 predictions.
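The thresholds just described can be written as two small predicates. This is a minimal sketch: the function names and argument layout are hypothetical, all quantities are taken as percentages except sf0, sf4 and nres, which are residue counts, and ASp4 is allowed to be missing when no reference alignment exists.

```python
def accurate_casp2(asp4, cspc, coverage, fr_specificity):
    """CASP2-style test for a single model.

    asp4 may be None when no reference structure-structure alignment exists.
    """
    if coverage < 45.0:          # fragmentary model
        return False
    if fr_specificity < 20.0:    # Conf x TSpc below 20%
        return False
    return (asp4 is not None and asp4 > 50.0) or cspc > 25.0


def accurate_casp3(alternatives):
    """CASP3-style test over up to 5 alternative models from one team.

    alternatives is an iterable of (sf0, sf4, nres, coverage) tuples.
    """
    for sf0, sf4, nres, coverage in alternatives:
        if coverage < 45.0 or nres == 0:
            continue
        if 100.0 * (sf0 + sf4) / nres >= 50.0:
            return True
    return False


# Hypothetical values for illustration.
print(accurate_casp2(asp4=55.0, cspc=10.0, coverage=80.0, fr_specificity=30.0))
print(accurate_casp3([(20, 10, 100, 90.0), (40, 10, 100, 90.0)]))
```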

The relationship of target difficulty and prediction success is rather obvious in Figure 2. One may simply draw a line that separates all 13 targets for which an accurate model was predicted from the remaining 12 targets where no team predicted an accurate model. The targets for which accurate models were predicted are those with more extensive structural similarity to database templates and/or a greater degree of sequence similarity. Roughly speaking, the accurately modeled targets are those where 60% or more of target residues could be superimposed on a database template. The number of data points is small, but there is no indication that this pattern has changed between CASP2 and CASP3. CASP3 predictors do not seem to have produced accurate models for targets of greater difficulty, where there is less extensive structural similarity to a previously known structure. We note, however, that some models for "hard" CASP3 targets came close to the critical accuracy threshold. Some models for CASP3 target 44, for example, were accurate with respect to individual domains of the target, even though they were inaccurate with respect to the complete prediction.

More or Less Accurate Models?

To categorize targets by difficulty one may divide the "phase diagram" in Figure 2 into distinct regions. Following CASP2 we suggested a 3-tier classification of "easy, medium and hard" targets(11), and this same system seems informative for CASP3. "Hard" prediction targets are those for which the fraction of residues that may be superimposed on a known template is less than 60%. "Medium" targets are those with more extensive structural similarity, where 60% or more of residues may be superimposed on a database template, but with no sequence motifs sufficient for fold identification. "Easy" targets are those with sequence motifs sufficient for fold assignment, as identified by PSI-BLAST(19) and/or search of relevant literature, usually with 12% or more sequence identity. Under this classification the only "easy" fold-recognition targets at CASP3 are targets 54 and 79. Target 54 (VanX) was assigned to a structural family present in PDB well before the CASP3 experiment(20), and the helix-turn-helix DNA-binding motifs in target 79 (MarA) could be detected using well known sequence-pattern collections(21).
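The 3-tier classification can be expressed as a simple rule. In this sketch the presence of a recognizable sequence motif is supplied as a boolean input, since it rests on PSI-BLAST searches and literature knowledge rather than on a formula. The function name is hypothetical; the SCFrac values in the example are taken from Table 2.

```python
def difficulty(sc_frac, has_sequence_motif):
    """Classify a target as easy ("E"), medium ("M") or hard ("H").

    sc_frac is the percentage of target residues superimposable on a
    database template; has_sequence_motif says whether sequence motifs
    sufficient for fold assignment were found.
    """
    if has_sequence_motif:
        return "E"
    return "M" if sc_frac >= 60.0 else "H"


# SCFrac values from Table 2 for three CASP3 targets.
print(difficulty(46.04, True))    # target 54
print(difficulty(67.89, False))   # target 81
print(difficulty(51.65, False))   # target 43
```

The motif test is applied first because "easy" targets in Table 2, such as target 54, may have less than 60% of residues superimposable on a template. "Impossible" targets, which are defined separately, are not handled here.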

Table 2

    D Size #Crct SCRms SCFrac SC%Id BSCLen BSCRms BMLen BMRms BMCSpc 
====================================================================
T04 E   84    16   2.5  58.69  12.8     61   2.05    65  2.97   61.6
T31 E  242    14   2.6  72.19  15.4    188   2.38   202  4.16   70.7
T02 M   88     1   2.2  69.32   6.6     70   1.87    64  2.83   65.0
T14 M  252     7   4.1  80.36   8.3    204   2.72   132  5.45   33.0
T38 M  152     2   3.7  78.29  10.2     94   3.50    98  5.72   53.9
T20 H  320     0   4.3  36.25   8.2     90   2.71   203  7.63   28.8
T22 H  591     0   3.1  13.54  11.2     65   2.18    93  9.27    9.3
--------------------------------------------------------------------
T54 E  202     3   1.8  46.04  17.2     93   1.80   116  8.80   40.8
T79 E   65     7   2.4  87.69   7.0     44   1.90    51  4.33   37.5
T46 M  119     6   3.3  68.32   7.3     88   3.10    84  6.62   51.0
T53 M  264     5   3.4  87.88  11.2    232   3.40   204  5.96   39.3
T63 M   65     2   1.7  73.38   8.8     50   1.90    60  3.81   67.5
T71 M  125     1   2.7  71.84   8.5     99   2.40    84  8.48   35.4
T81 M  152     6   3.4  67.89  10.9    106   2.20   109  2.93   70.3
T83 M   80     5   3.1  70.12   9.6     59   2.00    65  4.91   44.7
T43 H  158     0   2.9  51.65  11.9     88   3.20   105 15.00   21.9
T44 H  347     0   3.3  52.42  11.2    187   3.10   209 15.60   17.5
T59 H   75     0   1.9  50.00   9.0     40   1.30    62  9.47   34.4
T67 H  187     0   2.8  37.43  10.0     70   2.80   118 17.40   24.8
T80 H  219     0   2.1  29.45  16.3     62   2.20   118 16.10   18.1

Legend to Table 2: Counts of accurate models from CASP2-style assessment and properties of the targets and best models from CASP2 and CASP3. Column "D" refers to target difficulty: "E" for easy targets, "M" for medium targets, and "H" for hard targets. Targets 4 through 38 are from the CASP2 experiment. "Size" is the number of residues in the chain or domain regarded as the prediction target. Column "#Crct" is the number of models, counting only one from each team, that cross specific accuracy thresholds (see text). "SCRms" is the average Cα RMS-residual between the target and VAST structural neighbors, including all neighbors where alignment length is 85% or more of the longest VAST alignment. "SCFrac" is the fraction superimposed by VAST, and "SC%Id" the percentage of identical residues in structural superpositions, averaged as for "SCRms". "BSCLen" is the length of the longest VAST alignment and "BSCRms" its RMS superposition residual. "BMLen" gives the extent of a "best" model, chosen according to the value of CSpc. "BMRms" gives the RMS-residual when this model is superimposed on the true target structure and "BMCSpc" the percentage of contacts predicted correctly by the model. For brevity we exclude "impossible" fold-recognition targets from Table 2. For CASP2 "impossible" targets are those where the jury of structure comparison methods did not identify significant similarity with database templates(11). For CASP3 "impossible" targets are those where the CASP3 assessor considered no prediction to be based on a correct fold, awarding no score of F or above, and where structural neighbors identified by VAST, averaged as for SCRms, conserve less than 25% of contacts in the target domain. "Impossible" CASP3 targets identified by the latter criterion are targets 52, 56, 61, 75 and 77, as indicated in Figure 2.

Table 2 shows the difficulty category for each CASP2 and CASP3 target and the number of accurate models for that target. Table 2 also lists a few properties of what we have picked as the "best" model for each target. One may see from Table 2 that no accurate models were predicted for "hard" targets at either CASP2 or CASP3. Accurate modeling of "hard" targets seems to be beyond the limits of current threading methods. Accuracy of the "best" models also seems little changed from CASP2 to CASP3. There was one model with under 3 Å RMS for a "medium" target at CASP2, target 2, and the same is true at CASP3, for target 81. The most striking difference between CASP2 and CASP3, perhaps, is the large number of accurate predictions for "easy" targets at CASP2. There may be several explanations for this. From Figure 2 we see that CASP2 targets 4 and 31 are more sequence-similar to database templates than other targets, with the exception of CASP3 target 54. These similarities may have been more accessible to sequence-based prediction methods, contributing to a higher level of prediction success. Targets 4 and 31 are also members of well-understood structural families, the OB-fold and the trypsin-like serine proteases, where characteristic sequence motifs were well-documented in the literature(12).

There is a clear suggestion of progress, however, if one focuses on the "medium" targets in Table 2. While there were relatively few accurate models for "medium" targets at CASP2, most of the "medium" targets at CASP3 were accurately modeled by 5 or 6 different teams. This suggestion of progress is confirmed when one considers the "medium" targets in more detail. Target 14, the CASP2 medium target with the greatest number of accurate models, is perhaps "easier" than the rest: It is a member of the TIM-barrel structural family, which is well described in the literature and very common in the structural database. The two CASP3 medium targets with the fewest accurate predictions are perhaps a little "harder" than the rest: Target 63 has two domains, but this was recognized in advance by few predictors. Target 71 differs from database templates in many structural details, such that the CASP3 assessor has assigned it to a novel SCOP superfamily(3). CASP2 certainly showed that accurate predictions for "medium" targets are possible. At CASP3, however, these predictions seem to have become more reliable, with different threading methods producing both specific recognition and accurate models for all 6 "medium" targets. It is also interesting to note that the two top-ranked teams at CASP3 used largely automated alignment procedures(22,23), while the top-ranked team at CASP2 relied on manual alignments(24).
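Readers who wish to make their own assessment may find it convenient to tabulate the "#Crct" column of Table 2 by experiment and difficulty category. The sketch below copies those counts from Table 2 and reports, for each group, the number of targets and the median number of accurate models. The choice of the median as a summary is ours for illustration only and is not part of the CASP2 or CASP3 assessment.

```python
from statistics import median

# (experiment, difficulty, #Crct) copied from Table 2.
TABLE2 = [
    ("CASP2", "E", 16), ("CASP2", "E", 14),
    ("CASP2", "M", 1), ("CASP2", "M", 7), ("CASP2", "M", 2),
    ("CASP2", "H", 0), ("CASP2", "H", 0),
    ("CASP3", "E", 3), ("CASP3", "E", 7),
    ("CASP3", "M", 6), ("CASP3", "M", 5), ("CASP3", "M", 2),
    ("CASP3", "M", 1), ("CASP3", "M", 6), ("CASP3", "M", 5),
    ("CASP3", "H", 0), ("CASP3", "H", 0), ("CASP3", "H", 0),
    ("CASP3", "H", 0), ("CASP3", "H", 0),
]


def summarize(rows):
    """Group counts by (experiment, difficulty); return (n_targets, median)."""
    groups = {}
    for experiment, difficulty, n_correct in rows:
        groups.setdefault((experiment, difficulty), []).append(n_correct)
    return {key: (len(values), median(values)) for key, values in groups.items()}


summary = summarize(TABLE2)
for key in sorted(summary):
    print(key, summary[key])
```

These raw counts are not normalized by the number of teams submitting models for each target, so the summary is descriptive only.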

Discussion

It is perhaps satisfying to find that the "extensive changes of the fold-recognition evaluation criteria"(13) between CASP2 and CASP3 do not seem to have greatly affected evaluation of model accuracy. Most of the model accuracy measures used at CASP2 and CASP3 depend on structure-structure alignments, and these have been calculated in different ways (target-template vs. target-model comparison) using different structure comparison programs (PROSUP vs. DALI, SSAP and VAST). We find, however, that the resulting model accuracy measures are highly correlated and that the differences have little effect on assessment of CASP3 predictions. It is perhaps not surprising that evaluations using target-template and target-model alignments are similar: Threading models copy coordinates from a template, and these alignments are thus nearly equivalent. The most novel feature of the CASP3 model accuracy measures is PROSUP's search of alternative structure-structure alignments, to find a reference alignment that maximizes sf0. This does not seem to affect identification of the more accurate models, however, as shown in Figure 1, and by graphical comparison of sf0+sf4 values calculated using DALI as opposed to PROSUP reference alignments(25). This is perhaps good news for future CASPs, since it suggests that the complexity of allowing "bets" on alternative structure-structure alignments is unnecessary for reliable model evaluation.

It is also satisfying, perhaps, to see a clear dependence of prediction success on target difficulty. In the "phase diagram" of Figure 2 one sees an obvious relationship between the occurrence of accurate models and the extent of structural similarity of the target and database template. When 60% of target residues may be superimposed on a database template, roughly speaking, one or more teams have predicted accurate models, for all 13 "easy or medium" targets from CASP2 and CASP3. Conversely, for the 12 "hard or impossible" targets with less than 60% of residues superimposable on a database template, there were no accurate models at either CASP2 or CASP3. Threading methods score sequence-structure compatibility according to how each residue from the target sequence "fits" the structural environment of the site to which it is aligned. That environment may be described as the solvent accessibility at that site, for example, or, in sequence-based methods, as a list of residue types preferred at that site. As structural similarity of the target and template becomes less extensive, however, a greater proportion of environment descriptors will be incorrect: The actual solvent accessibility at a conserved site in the target will differ, as will the list of preferred residue types. Thus it would be rather surprising if one did not see a dependence of threading success on the extent of structural similarity. As the extent of structural similarity goes down, the signal-to-noise ratio in a threading calculation must also go down.
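
The argument that incorrect environment descriptors erode the threading signal can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the two-state "buried"/"exposed" environment alphabet, the +1/-1 preference values, the set of residues treated as hydrophobic, and the function names. It is not the scoring function of any CASP2 or CASP3 method, and it covers only a gapless alignment.

```python
# Toy illustration only: a two-state environment alphabet and a +1/-1
# preference table, both hypothetical and not taken from any CASP method.
HYDROPHOBIC = set("AVLIMFWC")


def site_score(residue, environment):
    """Return +1 if the residue type suits the site environment, else -1."""
    prefers_buried = residue in HYDROPHOBIC
    site_is_buried = environment == "buried"
    return 1 if prefers_buried == site_is_buried else -1


def threading_score(sequence, environments):
    """Sum per-site fit scores for a gapless sequence-structure alignment."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment list must have equal length")
    return sum(site_score(r, e) for r, e in zip(sequence, environments))


def corrupt(environments, n_wrong):
    """Invert the first n_wrong descriptors, mimicking template sites whose
    environment differs from the true environment in the target."""
    swap = {"buried": "exposed", "exposed": "buried"}
    return [swap[e] if i < n_wrong else e for i, e in enumerate(environments)]


sequence = "LKVEIDAS"
# Environments as they would be in the target's own structure (made-up data).
true_env = ["buried", "exposed"] * 4

for n_wrong in (0, 2, 4):
    print(n_wrong, threading_score(sequence, corrupt(true_env, n_wrong)))
```

In this toy case each incorrect descriptor lowers the score by 2, so the score for the correct alignment falls from 8 to 4 to 0 as 0, 2 and 4 of the 8 sites are described incorrectly. With half of the descriptors wrong the correct alignment scores no better than one would expect by chance, which is the sense in which the signal-to-noise ratio must fall with the extent of structural similarity.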

What is perhaps least satisfying in the present analysis is its indication of limited progress in fold recognition. As one sees in Table 2, there are a greater number of accurate predictions at CASP3 for "medium" targets, and one may readily conclude that threading methods have become more reliable for detection of remote evolutionary relationships. On the other hand one sees that "hard" targets remain beyond the reach of threading methods. This can be interpreted negatively, in the sense that threading methods have a long way to go before they approach the sensitivity of structure-structure comparison. Indeed, the "phase diagram" in Figure 2 was originally proposed as a means to measure improvement in threading sensitivity(11), but it shows no obvious improvement between CASP2 and CASP3. Threading methods must ultimately fail, of course, as the extent of structural similarity between target and template decreases. Bearing this intrinsic limitation in mind, one can also make a positive interpretation of the similarity threshold apparent in Figure 2: Perhaps the best threading methods are already working about as well as is possible. The only way to distinguish these alternative interpretations, of course, is to wait and see what predictions are made at CASP4 and beyond!

Acknowledgements

We thank the CASP3 experimentalists, predictors and organizers for providing prediction and evaluation data for this analysis. We thank Ken Addess and Tom Madej for calculating VAST structure-structure alignments for CASP3 targets. We thank Anna Panchenko for valuable discussions and the NIH intramural research program for support.

References

  1. Moult, J., Hubbard, T., Bryant, S.H., Fidelis, K., Pedersen, J.T. Critical assessment of methods of protein structure prediction (CASP): round II. Proteins Suppl. 1:2-6, 1997.
  2. Moult, J., Hubbard, T., Fidelis, K., Pedersen, J.T. Critical assessment of methods of protein structure prediction (CASP): round III. Proteins Suppl. 3:3-6, 1999.
  3. Murzin, A.G. Structure classification based assessment of CASP3 predictions for the fold recognition targets. Proteins Suppl. 3:88-103, 1999.
  4. Lackner, P., Koppensteiner, W.A., Domingues, F.S., Sippl, M.J. Automated large scale evaluation of protein structure predictions. Proteins Suppl. 3:7-14, 1999.
  5. Levitt, M. Competitive assessment of protein fold recognition and alignment accuracy. Proteins Suppl. 1:92-104, 1997.
  6. Marchler-Bauer, A., Bryant, S.H. Measures of threading specificity and accuracy. Proteins Suppl. 1:74-82, 1997.
  7. Feng, Z.K., Sippl, M.J. Optimum superimposition of protein structures: ambiguities and implications. Folding & Design 1:123-132, 1996.
  8. Holm, L., Sander, C. Mapping the protein universe. Science 273:595-602, 1996.
  9. Orengo, C.A., Taylor, W.R. SSAP: Sequential structure alignment program for protein structure comparison. Methods in Enzymology 266:617-635, 1996.
  10. Gibrat, J.-F., Madej, T., Bryant, S.H. Surprising similarities in structure comparison. Current Opinion in Structural Biology 6:377-385, 1996.
  11. Marchler-Bauer, A., Levitt, M., Bryant, S.H. A retrospective analysis of CASP2 threading predictions. Proteins Suppl. 1:83-91, 1997.
  12. Marchler-Bauer, A., Bryant, S.H. A measure of success in fold recognition. Trends in Biochemical Sciences 22:236-240, 1997.
  13. http://predictioncenter.llnl.gov/casp3/
  14. http://rutgers.rcsb.org/pdb/
  15. http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/casp3/casp3vast.html
  16. http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/casp3/casp3eval.html
  17. http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/casp2/index.html
  18. http://predictioncenter.llnl.gov/casp3/results/FR-Summary.gif
  19. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402, 1997.
  20. McCafferty, D.G., Lessard, I.A., Walsh, C.T. Mutational analysis of potential zinc-binding residues in the active site of the enterococcal D-Ala-D-Ala dipeptidase VanX. Biochemistry 36:10498-10505, 1997.
  21. Hofmann, K., Bucher, P., Falquet, L., Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Research 27:215-219, 1999.
  22. Jones, D.T., Tress, M., Bryson, K., Hadley, C. Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins Suppl. 3:104-111, 1999.
  23. Panchenko, A., Marchler-Bauer, A., Bryant, S.H. Threading with explicit models for evolutionary conservation of sequence and structure. Proteins Suppl. 3:133-140, 1999.
  24. Murzin, A.G., Bateman, A. Distant homology recognition using structural classification of proteins. Proteins Suppl. 1:105-112, 1997.
  25. http://PredictionCenter.llnl.gov/casp3/SUMMARY/cm1/

Corresponding Author: Stephen H. Bryant
Phone: (301) 435-7792
FAX: (301) 480-9241
Email: bryant@ncbi.nlm.nih.gov