# Evaluation of model quality predictions in CASP9

^{1}Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis, CA 95616, USA

^{2}Department of Physics, Sapienza University of Rome, 5 P.le Aldo Moro, 00185 Rome, Italy

^{*}To whom correspondence should be addressed: Andriy Kryshtafovych, Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis, CA 95616, USA; Email: akryshtafovych@ucdavis.edu; Tel/Fax: +1 5307548977

## Abstract

CASP has been assessing the state of the art in the *a priori* estimation of accuracy of protein structure prediction since 2006. The inclusion of a model quality assessment category in CASP contributed to rapid development of methods in this area. In the last experiment forty-six quality assessment groups tested their approaches to estimate the accuracy of protein models as a whole and/or on a per-residue basis. We assessed the performance of these methods predominantly on the basis of the correlation between the predicted and observed quality of the models on both global and local scales. The ability of the methods to identify the models closest to the best one, to differentiate between good and bad models, and to identify well-modeled regions was also analyzed. Our evaluations demonstrate that even though global quality assessment methods seem to approach perfection (weighted average per-target Pearson's correlation coefficients as high as 0.97 for the best groups), there is still room for improvement. First, all top-performing methods use consensus approaches to generate quality estimates and this strategy has its own limitations and deficiencies. Second, the methods that are based on the analysis of individual models lag far behind clustering methods and need a boost in performance. The methods for estimating per-residue accuracy of models are less accurate than global quality assessment methods, with an average weighted per-model correlation coefficient in the range of 0.63–0.72 for the best 10 groups.

**Keywords:** CASP, QA, model quality assessment, protein structure modeling, protein structure prediction

## Introduction

The role of protein structure modeling in biomedical research is steadily increasing^{1–3}. Models are routinely used to address various problems in biology and medicine. Contrary to experimentally derived structures, where accuracy can be deduced from experimental data and typically falls within a narrow range, theoretical models are usually un-annotated with quality estimates and can span a broad range of the accuracy spectrum. Thus, reliable *a priori* estimates of global and local accuracy of models are critical in determining the usefulness of a model to address a specific problem. For example, high-resolution models (expected C-alpha atom RMSD from the experimental structure ~1Å; expected GDT_TS>80) often are sufficiently accurate for detecting sites of protein-ligand interactions^{4}, understanding enzyme reaction mechanisms^{5}, interpreting the molecular basis of disease-causing mutations^{6}, solving crystal structures by molecular replacement^{7,8} and even for drug discovery^{9–11}. A model of medium accuracy (typically 2–3Å C-alpha atom RMSD from the native structure, GDT_TS>50) can still be useful for detecting putative active sites in proteins^{12,13}, virtual screening^{14} or predicting the effect of disease-related mutations^{15}. Low resolution models can be useful for providing structural characterization of macromolecular ensembles^{13}, recognizing approximate domain boundaries^{13}, helping choose residues for mutation experiments^{16} or formulating hypotheses on the protein molecular function^{17,18}.

In response to these needs, the computational biology community has focused on the Model Quality Assessment (MQA) problem, i.e. on the possibility of predicting the accuracy of structural models when experimental structural data are not available. Several dozen papers have been published on the subject in recent years^{19}. CASP now includes a specific category for testing QA methods and a large number of prediction groups participate^{20,21}. In CASP9, 46 groups (including 34 servers) submitted predictions of the global quality of models and 22 also provided estimates of model reliability on a per-residue basis. Here we assess the performance of these groups and discuss the problems facing the field.

## Materials and Methods

### Submission procedure and prediction formats

The procedure for submitting QA predictions in CASP9 did not change from that used in CASP8. Server models submitted in the tertiary structure prediction categories (TS and AL) were archived at the Prediction Center and posted on the web following the closing of the server prediction deposition time window on a target. The same day, web locations of the tarballs were automatically sent to the registered QA servers, which in turn had three calendar days to submit quality estimates for the models. Human groups were invited to download the server models and submit their quality estimates to CASP according to the deadlines set by the organizers for the tertiary structure prediction on the corresponding target.

The QA predictions were accepted in two modes: QMODE 1 (QA1) for the assessment of the overall reliability of models, and QMODE 2 (QA2) for the assessment of the per-residue accuracy of models. In QMODE 1 predictors were asked to score each model on a scale from 0 to 1, with higher values corresponding to better models and value of 1.0 corresponding to a model virtually identical to the native structure. In QMODE 2 predictors were asked to report estimated distances in Angstroms between the corresponding residues in the model and target structures after optimal superposition. Details of the QA format are provided at the Prediction Center website http://predictioncenter.org/casp9/index.cgi?page=format#QA.

### Evaluation data: targets and predictions

7,116 QA predictions on 129 targets were submitted to CASP9; all are accessible from http://predictioncenter.org/download_area/CASP9/predictions/ (file names starting with QA). These predictions contain quality estimates (global and residue-based) for 39,702 tertiary structure models generated by the CASP9 server groups (http://www.predictioncenter.org/download_area/CASP9/server_predictions/). Thirteen targets were cancelled by the organizers and the assessors for tertiary structure prediction^{22}, and those were also excluded from the QA assessment, leaving 116 targets to be assessed^{*}.

Protein structure prediction is usually a time demanding process, and in order to allow human-expert predictors extra time for modeling challenging proteins, CASP9 targets were released as either human/server or server only targets. In the MQA category, though, methods are usually much faster and therefore all targets were used for model quality estimation.

In CASP9, targets that were difficult for structure prediction also appeared to be difficult for model quality prediction (see Figure S1 in Supplementary Material). This fact can be explained, in part, by the observation that the best performing methods are consensus methods (see further analysis in the Results), which work better for the TBM targets for which the cluster center is dominated by the presence of structurally similar templates, while for hard modeling cases there is usually no consensus or, in some cases, a wrong one. As results from structure comparison programs become less meaningful below some cut-off (e.g., a model with a GDT_TS score of 20 does not superimpose with a target significantly better than a model with a GDT_TS score of 15), the relationship between model quality estimates and structure similarity scores for difficult targets can be misleading. Thus, inclusion of such targets in the evaluation dataset might have introduced noise. To check this, we ran three separate evaluations: one on the whole set of 116 targets, and two more on reduced sets, composed of the targets where at least one model with a GDT_TS score above 40 or 50 existed (102 and 89 targets, respectively). The comparative analysis across these three target sets showed that each of the main evaluation scores is quite stable, with the Spearman's rank correlation coefficient (SRCC) ranging from 0.92 to 0.94 for different pairs of the test sets. Thus, except when otherwise noted, throughout this paper we refer to the results of the analysis performed on all 116 targets.

Unlike tertiary structure assessment, the QA evaluation was performed on whole targets without splitting them into evaluation subunits, as it was impossible to dissect a single score submitted for the whole model into quality scores for the constituent domains. For the same reason, we excluded from the calculations the so-called multi-frame models consisting of two or more segments predicted independently, i.e. not using a common Cartesian frame of reference^{†}. We also disregarded models shorter than 20 amino acids and, for the QA2 assessment, those for which fewer than seven local quality prediction groups submitted their estimates. All in all, we evaluated the performance of QA methods on 35,198 server models.

### Evaluation measures and assumptions

#### What is compared and how?

In CASP, model quality predictions are evaluated by comparing submitted estimates of global reliability and per-residue accuracy of structural models with the values obtained from the sequence-dependent LGA^{23} superpositions of models with experimental structures (http://predictioncenter.org/download_area/CASP9/results_LGA_sda/). Therefore, perfect QA1 scores should correspond to the LGA-derived GDT_TS scores (divided by 100) and predicted per-residue distances in QA2 should reproduce those extracted from the optimal model-target superpositions. In both prediction modes, estimated and observed data are compared on a target-by-target basis and by pooling all models together. The first approach rewards methods that are able to correctly rank models regardless of their absolute GDT_TS values, while the second accentuates how well the method is able to assign different scores to models of different quality regardless of their ranking within the set of models for the specific target.
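The difference between the two comparison regimes can be illustrated with a minimal Python sketch. The function names and the toy data below are hypothetical and are not taken from the CASP evaluation scripts; the sketch only shows that a method can rank models perfectly within every target and still assign poorly calibrated absolute scores across targets.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def per_target_correlations(data):
    """data maps a target id to (predicted QA1 scores, observed GDT_TS values)."""
    return {target: pearson_r(pred, [g / 100.0 for g in gdt])
            for target, (pred, gdt) in data.items()}

def pooled_correlation(data):
    """Correlation computed after pooling the models of all targets together."""
    pooled_pred, pooled_obs = [], []
    for pred, gdt in data.values():
        pooled_pred.extend(pred)
        pooled_obs.extend(g / 100.0 for g in gdt)
    return pearson_r(pooled_pred, pooled_obs)

# Hypothetical toy data: the ranking is perfect within each target,
# but the absolute scores are too low on T_A and too high on T_B.
toy = {
    "T_A": ([0.2, 0.3, 0.4], [60, 70, 80]),
    "T_B": ([0.6, 0.7, 0.8], [20, 30, 40]),
}
```

On this toy set every per-target correlation is 1, while the pooled correlation is negative, which is the kind of discrepancy the pooled regime is designed to expose.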

#### Correlation coefficients: Pearson or Spearman?

As predicted values should ideally duplicate the observed ones, a linear relationship between them is expected. This assumption is confirmed by the visual inspection of the data (see Figure S2 in Supplementary Material). Therefore, the Pearson's product-moment correlation coefficient (PMCC) *r* is a sensible choice to measure the level of association between the predicted and observed scores. However, PMCC is very sensitive to outliers and it assumes normally distributed data, which is usually not the case for sets of per-target GDT_TS scores or residue distance errors. Thus, it may seem that distribution-free association measures, e.g. Spearman's ρ (SRCC) or Kendall's τ, are more appropriate for the problem at hand. However, these measures also have flaws, as it is not appropriate to use rank-based measures for sets with multiple tied original values and because they present problems in handling big sets of data^{24}. Also, even though non-parametric measures are more robust in guarding against outliers, they cannot guarantee more sensible results on such data^{25}. In order to eliminate bias in the analysis connected with the selection of the association measure, we have evaluated all the data using both parametric and non-parametric inferential statistical methods. The comparison of the results showed that the choice of the association measure has only marginal influence on the conclusions (Spearman's ρ between the rankings based on SRCC and PMCC and their z-scores ranged from 0.97 to 0.99 for both QA1 and QA2). In what follows, we use Pearson's *r* for data analysis since, in general, it gives a more accurate estimate of the correlation between continuous values, and it has been shown to be less prone to bias than rank-based measures for big sets of data even when the assumption of a normal bivariate distribution is violated^{26}. The raw results of the correlation analyses are available at http://predictioncenter.org/casp9/qa_analysis.cgi.

#### Transformation of correlation coefficients

As correlation coefficients are not additive^{27,28}, their averaging has to be preceded by a transformation into additive quantities. Fisher's transformation^{29} is the best known technique to do so. The following transformation

$$Z = \frac{1}{2}\ln\frac{1+r}{1-r} \tag{1}$$

converts the correlation coefficient *r* into a normally distributed variable *Z* with variance *s*^{2} = 1/(*n*−3), where *n* is the number of observations. Once *r* values are converted into *Z* values, an arithmetic mean score $\bar{Z}$ can be computed and subsequently transformed into the correlation coefficient weighted mean value $\bar{r}$ by using the inverse formula

$$\bar{r} = \frac{e^{2\bar{Z}} - 1}{e^{2\bar{Z}} + 1} \tag{2}$$

Note that while the Fisher transformation is usually used for PMCCs when observations have a bivariate normal distribution, it can also be applied to SRCCs in more general cases.
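As an illustration of the averaging procedure described in this section, the following minimal Python sketch converts correlation coefficients with the Fisher transformation, averages them, and converts the mean back. The function names are hypothetical, and the optional weighting of each *Z* value by *n*−3 (the inverse of its variance) is an assumption of this sketch rather than a documented detail of the CASP evaluation.

```python
import math

def fisher_z(r):
    """Fisher transformation of a correlation coefficient (requires |r| < 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def inverse_fisher_z(z):
    """Inverse Fisher transformation (mathematically equal to tanh(z))."""
    return (math.exp(2.0 * z) - 1.0) / (math.exp(2.0 * z) + 1.0)

def mean_correlation(rs, ns=None):
    """Average correlation coefficients in Z space and transform back.

    rs: correlation coefficients (e.g. one per target).
    ns: optional numbers of observations behind each coefficient. If given,
        each Z value is weighted by n - 3; this weighting is an assumption
        of the sketch. Without ns a plain arithmetic mean of Z is used.
    """
    weights = [1.0] * len(rs) if ns is None else [n - 3.0 for n in ns]
    z_mean = sum(w * fisher_z(r) for w, r in zip(weights, rs)) / sum(weights)
    return inverse_fisher_z(z_mean)
```

Because the transformation stretches values close to 1, averaging 0.9 and 0.1 in *Z* space gives a result above their arithmetic mean of 0.5.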

#### Evaluation measures for QA1 assessment

Correlation between the predicted accuracy scores and the corresponding GDT_TS values for the submitted server models was used as the main evaluation measure for assessing the QA1 results. In the per-target assessment regime, we calculate the Pearson's correlation coefficient for each group on each target, and the corresponding z-score derived from the distribution of the per-target PMCC values obtained by all groups. The final score for each prediction group is determined by the weighted mean of PMCCs and the average z-score over the set of predicted targets. In the “all models pooled together” regime, correlation coefficients are calculated for all quality estimates submitted by a group on all targets. The group scores are next compared to the correlation coefficients obtained by other groups using the standard Z-test procedure^{‡}.

Besides the correlation measure, we have also evaluated performance of the global quality estimators by (a) testing the ability of prediction groups to distinguish between good and bad models, (b) calculating the difference in quality between the model predicted to be the best and the actual best model, and (c) comparing results of the methods to the results of two naïve predictors: BLAST/LGA^{20} and NAÏVE_CONSENSUS.

The ability of predictors to discriminate between good and bad models was assessed with the receiver operating characteristic (*ROC*) analysis^{30§}. A *ROC* curve shows the correspondence between the true positive rate of a predictor (*Sensitivity*) and its false positive rate (1 − *Specificity*) for a set of probability thresholds (from 0 to 1 in our case). For each threshold, a model is considered a positive example if its predicted QA1 score is equal to or greater than the threshold value. The area under a *ROC* curve (*AUC*) is indicative of the classifier accuracy^{31}: an *AUC* of 1 identifies a perfect predictor, while an *AUC* of 0.5 corresponds to a random classifier. We have computed the *AUC* scores using the trapezoid integration rule with a threshold increment of 0.05 for four reference “model goodness” parameters: GDT_TS=30, 40, 50 and 60. The scores for all goodness parameters appeared to be highly correlated, with the lowest pair-wise PMCC of 0.98 (for the GDT_TS=30 and GDT_TS=60 pair). Therefore we show here the results for only one of the goodness parameters, defining good models as those having GDT_TS≥50.
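The threshold-based *AUC* computation described above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the *pROC*-based code used for the assessment: the function name is hypothetical, and the explicit (0, 0) end point is added here so that the curve is closed even when some models are scored exactly 1.0.

```python
def roc_auc(predicted, observed_gdt, good_cutoff=50.0, step=0.05):
    """Area under the ROC curve by trapezoid integration over fixed thresholds.

    predicted:    QA1 scores in the range [0, 1].
    observed_gdt: observed GDT_TS values (0-100) of the same models.
    A model is 'good' if GDT_TS >= good_cutoff, and it is called positive
    at a threshold t if its predicted score is >= t.
    """
    good = [g >= good_cutoff for g in observed_gdt]
    n_pos = sum(good)
    n_neg = len(good) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both good and bad models are required")
    n_steps = int(round(1.0 / step))
    points = []
    for i in range(n_steps + 1):
        t = i / n_steps  # thresholds 0, 0.05, ..., 1 for the default step
        tp = sum(1 for p, g in zip(predicted, good) if g and p >= t)
        fp = sum(1 for p, g in zip(predicted, good) if not g and p >= t)
        points.append((fp / n_neg, tp / n_pos))
    points.append((0.0, 0.0))  # assumption: close the curve at the origin
    points.sort()
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc
```

With a threshold increment of 0.05 the curve has at most 21 distinct points, so the resulting *AUC* is a coarse approximation when many predicted scores fall between two adjacent thresholds.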

The loss in quality between the best available and the estimated best model was calculated for the targets where at least one good model (scoring higher than the specified cutoff) was present.

The naïve BLAST/LGA predictor assigns a score to a model based on its structural divergence from the most closely related known protein structure detectable by standard sequence analysis. The predictor first searches the protein structure database – frozen at the time of release of the corresponding target – for the best potential template by running at most five PSI-BLAST iterations with default parameters. Next, it superimposes the selected structure onto the input protein model by running LGA with default parameters in sequence independent mode. Finally, the resulting LGA_S score is multiplied by the model-to-target coverage ratio (the shorter the model – the lower the ratio) and divided by 100 to obtain a number between 0.0 and 1.0.

The NAÏVE_CONSENSUS predictor assigns a quality score to a model based on the average pair-wise similarity of the model to all other models submitted on that target. The predictor superimposes all models submitted on the target by running LGA with default parameters in the sequence dependent mode. Next, for each model the quality score is calculated by averaging the GDT_TS scores from all pair-wise comparisons, followed by appropriate scaling.
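The averaging step of such a consensus predictor can be sketched as below. The sketch assumes that the pair-wise GDT_TS values have already been computed (for example with LGA) and are supplied as a square matrix; the function name is hypothetical, and dividing by 100 is only one possible reading of the "appropriate scaling" mentioned above.

```python
def naive_consensus_scores(pairwise_gdt):
    """Score each model by its mean similarity to all other models.

    pairwise_gdt: square matrix (list of lists) in which pairwise_gdt[i][j]
                  is the GDT_TS (0-100) between models i and j.
    Returns one score per model, scaled to [0, 1] by dividing by 100
    (the scaling is an assumption of this sketch).
    """
    n = len(pairwise_gdt)
    if n < 2:
        raise ValueError("at least two models are required")
    scores = []
    for i in range(n):
        # Skip the diagonal: a model is not compared with itself.
        others = [pairwise_gdt[i][j] for j in range(n) if j != i]
        scores.append(sum(others) / len(others) / 100.0)
    return scores
```

A model that resembles most of the other submissions receives a high score regardless of whether the majority is correct, which is why a wrong consensus on hard targets misleads this kind of predictor.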

#### Evaluation measures for QA2 assessment

As in the QA1 mode, correlation was the basic evaluation measure for assessing the QA2 results. Here the correlation is measured between the estimated and actual distances in Angstroms between the corresponding Cα atoms of the model and the experimental structure after their optimal superposition. The Pearson *r* coefficients and the corresponding z-scores are computed for each server model. While calculating correlation for the QA2 data, we had to overcome the problem of the correlation coefficients being distorted by the high distance values in the poorly modeled regions of a protein. From a practical point of view, for a residue being misplaced by several Angstroms (e.g. more than 5 Å) the exact distance does not make much difference and thus we set the predicted and observed distance errors exceeding 5 Å to 5 Å. The final score of each prediction group is determined by the weighted mean of PMCCs and the average z-score over the set of predicted models^{**}.
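The effect of capping the distances can be shown with a short Python sketch. The function names are hypothetical; the only ingredients taken from the text are the 5 Å cap applied to both predicted and observed distances and the use of Pearson's *r*.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def capped_distance_correlation(predicted, observed, cap=5.0):
    """Per-model correlation of per-residue distance errors capped at `cap` Angstroms."""
    capped_pred = [min(d, cap) for d in predicted]
    capped_obs = [min(d, cap) for d in observed]
    return pearson_r(capped_pred, capped_obs)
```

For example, a prediction of 20 Å for a residue that is actually 8 Å away lowers the uncapped correlation, while after capping both values become 5 Å and the residue counts as correctly flagged as unreliable.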

The aforementioned procedure of setting an upper limit on the distance values makes the analysis of distance error associations more sensible, but it also introduces a bias into the analysis as many data points acquire the same values, possibly affecting the accuracy of the correlation-based conclusions. On the other hand, our analysis in the “all models together” mode is meant to determine the ability of the QA methods to identify reliable and unreliable regions in the model regardless of this bias. To perform such an analysis we used two descriptive statistics measures: Matthews's correlation coefficient^{32}

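$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{3}$$

and accuracy

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$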

The two measures are calculated on the whole set of residues for two distance cut-offs, 5Å and 3.8Å, separating reliably predicted residues from unreliable ones. In formulae (3) and (4), TP [FN] is the number of residues in the model that are closer than the specified cut-off to the corresponding residues in the native structure and are estimated in the QA prediction to be closer than [at least as far away as] this cut-off. TN [FP] is the number of residues in the model that are at least as far away as the specified cut-off from the corresponding residues in the native structure and are estimated in the QA prediction to be at least as far away as [closer than] this cut-off. The MCC and accuracy scores are highly correlated (Pearson's *r*=0.99 [0.98] for the 5Å [3.8Å] distance threshold), and therefore we show the results for only one of them (MCC) in what follows.
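A minimal Python sketch of this residue classification is given below. It follows the TP, FN, TN and FP definitions stated above and the standard definitions of Matthews correlation coefficient and accuracy; the function names and the convention of returning 0 for an undefined MCC are assumptions of the sketch.

```python
import math

def confusion_counts(predicted, observed, cutoff=3.8):
    """Count TP, FP, TN, FN for per-residue distances at a distance cut-off.

    A residue is 'reliable' (positive) if its distance is below the cut-off.
    """
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if o < cutoff:          # actually well modeled
            if p < cutoff:
                tp += 1
            else:
                fn += 1
        else:                   # actually poorly modeled
            if p >= cutoff:
                tn += 1
            else:
                fp += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(tp, fp, tn, fn):
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)
```

Unlike the capped correlation, these measures depend only on which side of the cut-off each residue falls, so they are unaffected by ties created when distances are truncated.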

#### Ranking of participating groups: z-scores, t-, Wilcoxon-, Z- and DeLong- tests

The correlation coefficients obtained by each group for each target (in QA1) or model (in QA2) and on the whole set of targets were converted into z-scores. As in previous CASPs^{20,21}, the performance of each group was measured by the average of the z-scores after replacing negative values with zeros. The choice of neglecting negative z-scores is meant not to penalize groups that, by attempting more novel and riskier methods, might obtain negative scores in some cases.
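The scoring rule can be sketched in Python as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def z_scores(values):
    """Convert the scores obtained by all groups on one target into z-scores.

    Uses the population standard deviation (an assumption of this sketch).
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

def group_score(per_target_z):
    """Average z-score of one group after replacing negative values with zeros."""
    return sum(max(z, 0.0) for z in per_target_z) / len(per_target_z)
```

With this rule a group scoring z = 1.0, −2.0 and 0.5 on three targets receives (1.0 + 0 + 0.5)/3 = 0.5, so a single poor prediction does not cancel out the good ones.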

The statistical significance of the differences in performance of the participating methods was verified by the two-tailed paired t-tests (or Wilcoxon tests) on the common set of predicted targets/models in the target-based analysis regime and by Z-tests for the analysis of all models pooled together.

In the per-target assessment regime, we ran paired t-tests on PMCCs and Wilcoxon signed rank tests on SRCCs. The raw correlation coefficients are used because the significance levels of the tests based on Fisher's Z transformations are shown to be severely distorted for skewed distributions^{26}.

For the “all targets together” assessment, Z-tests were performed on the correlation coefficients in accordance with standard statistical practice. To test whether Pearson's correlation coefficients *r*_{1} and *r*_{2} from two different samples are significantly different, we converted them into the corresponding Fisher's Z_{1} and Z_{2} using formula (1) and then computed a statistic *Z* by dividing their absolute difference by the pooled standard error, i.e.

$$Z = \frac{\left| Z_1 - Z_2 \right|}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}} \tag{5}$$

where *n*_{1} and *n*_{2} represent the number of models evaluated by the two predictors. The corresponding *p*-value from the standard normal probability table helps assess whether the difference between *r*_{1} and *r*_{2} is statistically significant at the desired confidence level.
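The test can be sketched in a few lines of Python. The function name is hypothetical, and the two-tailed *p*-value computed from the standard normal distribution is an assumption about how the probability table was read; the numbers in the usage note are invented and are not CASP results.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Z-test for the difference between two independent Pearson correlations.

    Returns the Z statistic and a two-tailed p-value from the standard
    normal distribution.
    """
    z1 = math.atanh(r1)  # Fisher transformation, 0.5 * ln((1 + r) / (1 - r))
    z2 = math.atanh(r2)
    pooled_se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = abs(z1 - z2) / pooled_se
    p_value = math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))
    return z, p_value
```

For two hypothetical predictors with *r* = 0.9 and *r* = 0.8, each evaluated on 103 models, the statistic is about 2.64 and the two-tailed *p*-value is below 0.01. Because the standard error shrinks with the number of models, even small differences in *r* can become significant on very large pooled sets.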

Statistical significance of the differences between the *AUC* scores in the *ROC* analysis was assessed using the DeLong non-parametric tests^{33}.

#### Software used

Quality assessment calculations were performed using a set of in-house Java, C and Perl scripts pulling data from the CASP results database and the statistical package *R*^{34} with the installed *pROC* library^{35}.

## Results

### QA1: assessment of global model accuracy estimates

#### QA1.1: per-target analysis

Figure 1A shows the mean z-scores and PMCC weighted means on the whole set of targets for all forty-six prediction groups. Several top performing groups obtained very similar results. This visual conclusion is confirmed by the results of the statistical significance tests on the common set of predicted targets. According to the paired Student's *t*-test, the top-ranked eight predictors (MuFOLD-WQA, MuFOLD-QA, QMEANClust, United3D, Multicom-cluster, Mufold, MetaMQAPclust and MQAPmulti – all using clustering techniques) appear to be indistinguishable from each other, and perform better than the rest of the groups at the *p*=0.01 significance level (see Table S1 in Supplementary Material for details).

It should be noted that not all groups submitted quality estimates for all models and therefore correlation coefficients for different groups on a specific target might be calculated on slightly different subsets of models. This may raise a question of reliability of direct comparisons of the scores for different groups. To check the influence of this discrepancy on the evaluation scores, we compared the results of the QA methods on the whole set of models with those obtained on randomly selected subsets of models. For each QA group, we have randomly selected 30 models for each target (approximately 10% of all submitted models) and calculated the correlation with the observed quality. We repeated this procedure 100 times for each group and for each target and calculated the PMCC means weighted over the number of trials and over all targets predicted by the group. The resulting PMCCs appear to differ by no more than 0.2% (data not shown) from the correlation coefficients calculated on the whole set of models, therefore indicating a very high stability of the results.
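The resampling check can be sketched in Python as follows. This is a simplified illustration: the function names are hypothetical, the fixed random seed is added for reproducibility, and a plain arithmetic mean over trials is used instead of the weighted mean described above.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def subsampled_mean_correlation(predicted, observed, sample_size=30,
                                trials=100, seed=0):
    """Mean correlation over random subsets of the models of one target.

    Simplification: the per-trial coefficients are averaged arithmetically.
    """
    rng = random.Random(seed)
    size = min(sample_size, len(predicted))
    total = 0.0
    for _ in range(trials):
        chosen = rng.sample(range(len(predicted)), size)
        total += pearson_r([predicted[i] for i in chosen],
                           [observed[i] for i in chosen])
    return total / trials
```

Comparing the value returned for a target with the correlation computed on all of its models gives a rough measure of how sensitive a group's score is to the particular subset of models it evaluated.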

The ability of predictors to identify the best models in the decoy sets of all models submitted for the target was assessed on targets for which at least one model obtained a GDT_TS score higher than 40. For each target we have calculated the ΔGDT_TS difference between the model identified as best by the QA predictor and the model with the highest GDT_TS score. Average ΔGDT_TS scores over all targets attempted by each group are presented in Figure 2A. The best prediction groups reach an average ΔGDT_TS score of about 5. Thus, the actual best models might be significantly down the list from those designated as best. Figure 2B supports this conclusion showing that even for the best groups, the model designated as best is 2 GDT_TS units or closer to the best available model for only approximately one in three targets (green + yellow bars in the figure). Even though all best predictors are again clustering methods, it is encouraging to see that the best quasi-single model method (QMEANdist) and the best single model method (ProQ2) attain ΔGDT_TS scores that are roughly only 2 GDT_TS units worse than that of the best clustering method (Figure 2A). It should be noted, however, that this small difference in absolute scores translates into substantial (approximately 40%) difference in relative terms, and overall low rankings of these two groups.
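The ΔGDT_TS measure can be sketched in Python as follows. The function names and the toy numbers in the usage note are hypothetical; the cutoff of 40 GDT_TS units for including a target follows the text above.

```python
def delta_gdt_ts(predicted, observed_gdt):
    """Loss in GDT_TS between the best available model and the model
    that received the highest predicted quality score."""
    top_index = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(observed_gdt) - observed_gdt[top_index]

def mean_delta_gdt_ts(targets, min_best_gdt=40.0):
    """Average loss over targets that have at least one model above the cutoff.

    targets: iterable of (predicted scores, observed GDT_TS values) pairs.
    Returns None if no target qualifies.
    """
    deltas = [delta_gdt_ts(pred, gdt) for pred, gdt in targets
              if max(gdt) > min_best_gdt]
    return sum(deltas) / len(deltas) if deltas else None
```

For a target whose models have GDT_TS values of 65, 60 and 30 and predicted scores of 0.7, 0.9 and 0.4, the selected model is the second one and the loss is 5 GDT_TS units.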

#### QA1.2: models from all targets pooled together

Figure 1B reports the results of the correlation analysis in the “all models pooled together” mode. The QMEANClust group proved to be the best in assigning absolute quality scores to models coming from proteins of different modeling difficulty. It outperforms all other groups, including the three next best - Multicom-cluster, ModFOLDclust2 and MetaMQAPclust - which are statistically indistinguishable from each other according to the Z-tests (Table S2 in Supplementary Material) and not far behind QMEANClust in terms of PMCC values.

The ability of predictors to discriminate between good and bad models was additionally assessed with the receiver operating characteristic analysis. Figure 3 shows that the ROC curves for the top performing groups (and subsequently their AUC scores) are very similar, suggesting that the corresponding methods have similar discriminatory power. However, according to the results of non-parametric DeLong tests, the QMEANclust AUC score proved to be statistically better than that of all other groups, except for MULTICOM-cluster (see Table S2 in Supplementary Material). Comparing the AUC scores for the GDT_TS=40 and GDT_TS=50 goodness cut-offs (see the two panels in the inset of Figure 3), one can assert that they are similar for all groups except for MuFOLD-WQA, which has better discriminating power at the smaller “goodness” cutoff.

Summarizing the QA1.2 assessment we want to emphasize that, similarly to QA1.1, clustering methods dominate the results tables.

#### QA1 results: comparison with previous CASPs

The comparison of the CASP9 results with the results from previous experiments is important for establishing whether the MQA field is making progress.

Figures 4A and B show the correlation coefficients obtained by groups participating in CASP9 and CASP8 for both the per-target and “all models together” assessment. CASP9 groups display better performance than the CASP8 groups according to both assessment procedures. Consistent improvement in the correlation scores is noticeable for both the best and moderately-well performing groups, with a more pronounced improvement for the latter groups.

Figure 5 presents the cumulative distributions of the correlation coefficients for the last three CASPs. We show the fraction of the observed Pearson's correlation coefficients attaining values larger than those specified along the x-axis. It is apparent that the fraction of cases with larger *r* has consistently and significantly increased over the last four years. For example, the percentage of QA1 predictions yielding correlation coefficients 0.8 or higher increased from 30% in CASP7 to 50% in CASP8 and to 70% in CASP9! These results look even more impressive when one takes into consideration the fact that CASP9 targets were harder than CASP8 targets, which, in turn, were harder than CASP7 ones^{36}, and that there were fewer consensus methods in CASP9 than in CASP8. Therefore, the observed progress cannot be attributed to the decreased target difficulty or larger number of consensus methods, but rather reflects methodological improvements implemented over the last three CASPs. At the same time, it should be mentioned that there are no conceptually novel approaches among the best performing CASP9 methods and the observed progress is most likely associated with improvements of the existing QA servers. Indeed, the comparison of performance of the best CASP9 groups that have also participated in CASP8 shows that none performed worse, with many significantly improving their results. This is particularly true of the MUFOLD-QA and United-3D (Circle in CASP8) groups, which have improved their correlation scores by more than 30%.

#### QA1 results: comparison with naïve methods

The effectiveness of QA1 methods in CASP9 was tested by comparing their performance with that of two naïve predictors: BLAST/LGA, assigning a global accuracy score to a model based on its distance from the best template found by sequence similarity, and NAÏVE_CONSENSUS, assigning a quality score based on the structural similarity of a model to other models submitted on the target (see Materials and Methods).

BLAST/LGA uses only the information available from the best template and therefore is conceptually similar to quasi-single model methods. Quality assessment scores were generated for all models submitted on 79 single-domain TBM targets, where PSI-BLAST detected at least one potential template. In order to compare the naïve predictor with participating groups in an unbiased manner, we recomputed the z-scores on the selected 79 TBM targets from the average and standard deviation values of the Pearson's *r* distributions for the forty-six official predictors. It is apparent that while the BLAST/LGA predictor performs worse than the best clustering and quasi-single model methods, its z-score is higher than that of any of the CASP9 pure single-model methods (see Figure S3, Supplementary Material).

To benchmark the effectiveness of clustering techniques we compared them to the NAÏVE_CONSENSUS method utilizing information from all tertiary structure models submitted on a target. Figure 1 demonstrates that this method would have been among the best performing methods, had it participated in CASP9. In the QA1.1 assessment mode, the naïve method achieves the highest wmPMCC of 0.97 and is statistically indistinguishable from the eight top performing groups (Table S1 in Supplementary Material); in the QA1.2 mode, it attains a PMCC of 0.946 and is statistically indistinguishable from the best performing method (QMEANclust, PMCC=0.949) both according to the correlation-based and ROC-based analysis (Table S2 in Supplementary Material). These results show that even though the best CASP consensus predictors reach very high correlation scores, they do not compare favorably with a simple naïve clustering method.

#### Open issues

Comparison of the QA1 results from the latest CASPs points to clear though modest progress in the area: all assessment scores have improved since CASP8 and correlation coefficients for the best groups have nearly reached saturation (0.97) so it may seem that the QA1 problem has been solved. But a closer look reveals hidden problems and issues that need attention.

As in the two previous CASPs, all top performing methods in CASP9 relied on a consensus technique to assess model quality (see Figures 1–3 for the results and Table I for the classification and brief description of the methods). However, for real life applications researchers may want to obtain estimates for single models downloaded from one of the many widely used model databases^{37–39}. Therefore there is an urgent need for methods that can assign a quality score to a single model without requiring the availability of tens of models from diverse servers. Unfortunately, these methods lag behind the best consensus-based techniques: the best quasi-single model method in CASP9 was ranked 18^{th} in both QA1.1 (Lee group) and QA1.2 (Splicer) correlation-based assessments, while the best “pure” single-model method (Multicom-novel) was ranked only 28^{th} in both QA1.1 and QA1.2.

Appreciating the outstanding performance of clustering methods in CASP, the question arises of whether such a performance can be attributed to the CASP model set being easier (for quality assessment) than those that one might expect in real life applications. As the CASP model set contains many models of different quality (while this is not necessarily the case in real life applications), it can be hypothesized that there is a bias in the scores arising from diversity of the models in the datasets. Unfortunately, it is impossible to confirm or reject this hypothesis based on the CASP data alone, but we can obtain an approximate answer to this question by assessing how much the scores of the participating methods differ for various subsets of the CASP models. Figure 6 shows that the correlation scores of the QA1 methods drop significantly and approximately linearly with the decrease in the number of bad models in the subset. If only the best 50% of the models for each target are taken into account, the PMCC values decrease by about 50% as well. When only the 60 best models per target (approximately 20% of the whole target set) are used for the analysis, the correlation coefficients for all groups drop below the significance level (<0.2). Another way of verifying that method scores are worse on sets of models with limited spread in quality is illustrated in Figure S4 of the Supplementary Material, where the correlation coefficients calculated on the whole model dataset are compared with those calculated on relatively good models only (GDT_TS above 50). Analysis of the results shows that the correlation coefficients for the best groups drop by approximately 0.2 in both assessment modes.
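
The subset analysis described above can be sketched as follows. This is an illustration only: the helper names and the toy numbers are hypothetical, and the sketch simply recomputes Pearson's correlation after keeping the best fraction of models ranked by observed quality.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def pmcc_on_best_fraction(predicted: Sequence[float],
                          observed: Sequence[float],
                          fraction: float) -> float:
    """PMCC restricted to the top `fraction` of models by observed quality."""
    pairs = sorted(zip(observed, predicted), reverse=True)
    keep = max(2, int(round(fraction * len(pairs))))
    obs_kept = [o for o, _ in pairs[:keep]]
    pred_kept = [p for _, p in pairs[:keep]]
    return pearson(pred_kept, obs_kept)


# Hypothetical observed GDT_TS values and predictions with a fixed +/-5 error.
observed_gdt = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
predicted_gdt = [15.0, 15.0, 35.0, 35.0, 55.0, 55.0, 75.0, 75.0]

full_r = pmcc_on_best_fraction(predicted_gdt, observed_gdt, 1.0)
half_r = pmcc_on_best_fraction(predicted_gdt, observed_gdt, 0.5)
```

With the same absolute prediction error, the correlation on the best half of the toy set is lower than on the full set, because the spread in observed quality shrinks while the error does not.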

**...**

The aforementioned analyses provide grounds for speculation that clustering methods in general might lose their edge when the set of assessed models is more uniform in quality and composed of only relatively good models. This suggestion is backed up by two examples obtained retrospectively, after the end of CASP9 (August 2010), and presented at the CASP meeting in December 2010. We asked Pascal Benkert, the leader of the QMEAN and QMEANclust groups, to re-run his methods on the reduced datasets, containing for each target only the models with GDT_TS ≥ 50 (these datasets, for 85 targets having at least 30 qualified models, are publicly available at http://predictioncenter.org/download_area/CASP9/server_pred_over50/). Results of these two post-CASP model quality assessments were evaluated in the same way as those of regular CASP9 groups. Figure 7 compares the results of QMEAN and QMEANclust on three different prediction/evaluation datasets. It can be seen that for both methods the reduction in the number and diversity of models in the prediction datasets produces a similar drop-off in correlation scores as that caused by the removal of the same models from the evaluation datasets. It is also interesting to notice that the drop in performance is observed for both methods, with the decrease in scores for the clustering method (QMEANclust) being slightly more pronounced, as expected. This might indicate that both single-model and clustering methods are less effective in discriminating models of similar but reasonable quality, and that it is hard to expect the high, CASP-like correlation coefficients in applications outside of CASP.

**...**

Another aspect of global quality assessment that needs improvement is the capability of selecting the very best model in a decoy set. Even though the best methods can attain very high correlation coefficients, none can consistently select the best models for all targets^{††}. Figure 2B shows that even the best methods miss the best available model by 10 GDT_TS units or more in ~20% of cases (red bars).
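
A minimal sketch of how such a selection loss might be computed for one target is given below. It assumes that a method's pick is the model with the highest predicted score and that the loss is the GDT_TS gap to the best available model; the function name and the toy values are hypothetical.

```python
from typing import Sequence


def selection_loss(predicted_scores: Sequence[float],
                   observed_gdt_ts: Sequence[float]) -> float:
    """GDT_TS gap between the best available model and the model picked."""
    if len(predicted_scores) != len(observed_gdt_ts) or not predicted_scores:
        raise ValueError("need non-empty sequences of equal length")
    # Index of the model the QA method ranks first.
    picked = max(range(len(predicted_scores)), key=lambda i: predicted_scores[i])
    return max(observed_gdt_ts) - observed_gdt_ts[picked]


# Hypothetical target with three models: the top-ranked model is not the best one.
toy_predicted = [0.61, 0.72, 0.55]
toy_observed = [58.0, 49.5, 63.0]
loss = selection_loss(toy_predicted, toy_observed)
```

In this toy case the loss is 13.5 GDT_TS units, which would count as a miss under the 10-unit threshold mentioned above.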

### QA2: Assessment of residue-level accuracy estimates

For the twenty one groups that submitted model confidence estimates at the level of individual residues^{‡‡}, we measured the correlation between predicted and observed distance errors as well as the accuracy with which the correctly predicted regions were identified. As described in more detail in Materials and Methods, all distances greater than 5 Å were set to 5 Å in the calculation of the correlation coefficients.
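
The capping step can be illustrated with a short sketch. It assumes the 5 Å cap is applied to both the predicted and the observed per-residue distances before the correlation is computed; the helper names and the toy values are hypothetical.

```python
import math
from typing import List, Sequence


def cap_distances(distances: Sequence[float], cap: float = 5.0) -> List[float]:
    """Replace every distance above `cap` (in Angstroms) with `cap`."""
    return [min(d, cap) for d in distances]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-residue errors (Angstroms) for one model.
predicted_err = [1.0, 2.0, 12.0, 4.0]
observed_err = [0.8, 2.5, 6.0, 30.0]

capped_pred = cap_distances(predicted_err)
capped_obs = cap_distances(observed_err)
r_capped = pearson(capped_pred, capped_obs)
```

Capping keeps a few grossly misplaced residues (such as the 30 Å outlier in the toy data) from dominating the correlation.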

#### QA2.1: local accuracy assessment on per-model basis

Figure 8A shows the mean z-scores and PMCC weighted means for the twenty one QA2 groups on the whole set of models. The PconsM group achieves the highest score according to both measures. The results of this group are statistically indistinguishable from those of the ModFOLDclust2 group (Table S3, Supplementary Material), but differ from those of the second tier of five QA2 groups - IntFOLD-QA, MQAPmulti, MetaMQAPclust, MULTICOM and Pcomb - which are statistically different from the first two and statistically indistinguishable from each other.

#### QA2.2: residues from all models and all targets pooled together

To evaluate the ability of prediction groups to identify good and bad regions in a model, we pooled the submitted estimates for all residues from all models and all targets together (approximately 7,000,000 residues from 35,000 models per QA predictor), and calculated the MCC and the accuracy on this dataset (see Materials and Methods). Figure 8B shows the results of this analysis. Two methods developed by the same research group (ModFOLDclust2 and IntFOLD-QA) show the best results in this analysis, although they are not very different from the others, as the MCC5 for the median 11^{th} group differs from that of the 1^{st} group by only 0.05.
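
A minimal sketch of the pooled classification score is shown below. It assumes that a residue counts as well modeled when its distance error does not exceed the cut-off, for the predicted and the observed values alike; the function name and the toy data are hypothetical, and the precise procedure is the one described in Materials and Methods.

```python
import math
from typing import Sequence


def matthews_cc(predicted_err: Sequence[float],
                observed_err: Sequence[float],
                cutoff: float) -> float:
    """MCC for classifying residues as well modeled (error <= cutoff)."""
    tp = tn = fp = fn = 0
    for p, o in zip(predicted_err, observed_err):
        pred_ok = p <= cutoff
        obs_ok = o <= cutoff
        if pred_ok and obs_ok:
            tp += 1
        elif pred_ok and not obs_ok:
            fp += 1
        elif obs_ok:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical pooled per-residue errors in Angstroms.
pooled_pred = [1.0, 2.0, 6.0, 7.0, 3.0, 8.0]
pooled_obs = [0.5, 4.0, 9.0, 2.0, 1.0, 10.0]

mcc5 = matthews_cc(pooled_pred, pooled_obs, 5.0)
mcc38 = matthews_cc(pooled_pred, pooled_obs, 3.8)
```

The same pooled lists are scored at two cut-offs here to show that the stricter 3.8 Å threshold can change the class assignment of borderline residues and therefore the score.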

#### QA2 results: comparison with previous CASPs

Figure 9A shows the weighted means of the correlation coefficients over all models submitted to CASP9 and CASP8. The best groups show a slightly worse performance in CASP9, while the remaining ones seem to have improved.

**...**

The analysis of performance of the best CASP9 QA2 groups that also participated in CASP8 shows that, on average, there is not much progress, with the best CASP8 group performing noticeably worse in CASP9 (likely due to an error in the automatic procedure of the server).

The cumulative distribution of the QA2 correlation coefficients for the last three CASPs is shown in Figure 5. In contrast to QA1, there is no clear progress between the last two CASPs according to this measure. Also, the percentage of correlation coefficients that are higher than a selected value is always lower in the QA2 mode than in the QA1 mode.

Figure 9B compares the ability to distinguish between the well and not so well modeled regions in a protein. The accuracy is measured in terms of the averaged Matthews correlation coefficient MCC_avg=(MCC5+MCC38)/2. Similarly to what we have observed for the other measures in QA2, the results of the best groups did not improve, while groups achieving an average accuracy have submitted better predictions in CASP9 than in CASP8.
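
For reference, the standard form of the Matthews correlation coefficient for a binary classification, together with the averaged score defined above, can be written as follows. Here TP, TN, FP and FN denote the counts of true positives, true negatives, false positives and false negatives; treating "well modeled" as the positive class is an assumption of this note.

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \qquad \mathrm{MCC}_{avg} = \frac{\mathrm{MCC5} + \mathrm{MCC38}}{2}
$$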

## Discussion and conclusions

In this paper we present the results of the third round of the model quality assessment experiment within the scope of CASP. The methodology for the assessment is now sufficiently robust for drawing general conclusions about the state of the art in the field.

There is clearly room for improvement in this category of prediction. In particular, there is an apparent need for improving single-model methods. The ability to rank models by consensus methods, i.e. to sort a set of models according to their quality, is very useful for structural meta-predictors, but is of limited use for biologists who often need to estimate the quality of a single model or its specific regions.

To further promote the development of single-model methods, we plan to emphasize them in the next CASP by a separate assessment. Looking at the clustering approach, we note that the best methods participating in CASP9 cannot outperform a naïve consensus technique tested in this paper, a rather disappointing result. We also would like to see improvement in the ability of clustering methods to rank models of similar and relatively high quality.

Another issue is that presently the QA1 type assessments cannot be performed at the level of individual domains. This would be desirable though, as individual domains usually present different levels of modeling difficulty and thus constitute separate model quality assessment problems. However, separation into structural domains is feasible only with the knowledge of target structures. Solving the domain level assessment problem might be possible by developing techniques capable of deriving global quality estimates directly from those made at the level of individual residues.

We hope that residue-based estimates of model accuracy will gain more attention and that improvements in this area will continue to appear. After the impressive advances made between CASP7 and CASP8, the progress seems to have slowed down. Our assessment shows that the best QA2 methods in CASP9 performed at the same level or even slightly worse than those in CASP8. The reasons behind this are not clear, and the observed decrease in QA2 performance might simply reflect a higher average difficulty of targets in CASP9^{36}. In any case, we would like to underline that the residue-based error estimates are still less than satisfactory and hope that this somewhat disappointing result will encourage the community to direct efforts in this direction.

## Supplementary Material

- Supp Figure S1 (173K, pdf)
- Supp Figure S2 (4.1M, pdf)
- Supp Figure S3 (251K, pdf)
- Supp Figure S4 (176K, pdf)
- Supp Table S1 (325K, pdf)
- Supp Table S2 (335K, pdf)
- Supp Table S3 (319K, pdf)

## Acknowledgements

This work was partially supported by the US National Library of Medicine (NIH/NLM) - grant LM007085 to KF, and by KAUST Award KUK-I1-012-43 to AT.

## Abbreviations

- MQA: Model Quality Assessment
- QA[1,2]: Quality Assessment mode [1,2]
- TBM: Template-Based Modelling
- RMSD: Root Mean Square Deviation
- GDT_TS: Global Distance Test – Total Score
- CC: Correlation Coefficient
- PMCC: Pearson's product-Moment Correlation Coefficient
- SRCC: Spearman's Rank Correlation Coefficient
- MCC: Matthews' Correlation Coefficient
- MCC5 / MCC38: MCCs for two distance cut-offs, 5 Å and 3.8 Å
- wmPMCC: weighted mean of PMCC

## Footnotes

^{*}Results presented at the CASP meeting were based on 117 targets including T0549, the target canceled just before the meeting by the assessors due to its inadequate quality for tertiary structure assessment.

^{†}Format-wise, multi-frame models are those containing several PARENT…TER blocks - see format description for TS and AL predictions at http://predictioncenter.org/casp9/index.cgi?page=format.

^{‡}Details on calculating z-scores and conducting Z-tests are discussed further in this section.

^{§}A conceptually similar analysis can be performed using Matthews' correlation coefficient or the statistical accuracy score, but the ROC curve analysis is more general, as it does not require a linear relationship between the predicted and observed scores and assumes only monotonicity.

^{**}We also calculated QA2 summary scores using a somewhat different procedure. First, per-residue scores for each model were averaged over all models submitted on a target, and then per-target averages were averaged over all targets. The difference in summary scores between the two procedures was 0.35% on average for all considered measures and all participating groups.

^{††}This is not only a limitation of the QA methods, but also a partial limitation of the assessment method, since QA predictions and evaluations are done on full (i.e. not split into domains) targets.

^{‡‡}QA2 results from the group Pcons were excluded from the analysis as they were identical to the results from the group PconsM.

## References
