Assessing the utility of CASP14 models for molecular replacement

Abstract The assessment of CASP models for utility in molecular replacement is a measure of their use in a valuable real‐world application. In CASP7, the metric for molecular replacement assessment involved full likelihood‐based molecular replacement searches; however, this restricted the assessable targets to crystal structures with only one copy of the target in the asymmetric unit, and to those where the search found the correct pose. In CASP10, full molecular replacement searches were replaced by likelihood‐based rigid‐body refinement of models superimposed on the target using the LGA algorithm, with the metric being the refined log‐likelihood‐gain (LLG) score. This enabled multi‐copy targets and very poor models to be evaluated, but a significant further issue remained: the requirement of diffraction data for assessment. We introduce here the relative‐expected‐LLG (reLLG), which is independent of diffraction data. This reLLG is also independent of any crystal form, and can be calculated regardless of the source of the target, be it X‐ray, NMR or cryo‐EM. We calibrate the reLLG against the LLG for targets in CASP14, showing that it is a robust measure of both model and group ranking. Like the LLG, the reLLG shows that accurate coordinate error estimates add substantial value to predicted models. We find that refinement by CASP groups can often convert an inadequate initial model into a successful MR search model. Consistent with findings from others, we show that the AlphaFold2 models are sufficiently good, and reliably so, to surpass other current model generation strategies for attempting molecular replacement phasing.

with disease or with escape from an immune response. It is also clear that protein structure prediction will accelerate the experimental determination of 3D structures, by improving the models for molecular replacement (MR).
MR is the most commonly used method to determine the unmeasured phases needed to compute an electron density map from a diffraction pattern. This is carried out, typically, by determining the orientation angles and translation vector (together referred to as the "pose") required to superimpose the model generated by prediction with the coordinates of the atoms in the crystal. Models generated by structure prediction supplement the models that can be derived from previously determined structures of homologues in the worldwide Protein Data Bank (wwPDB), 1 often involving extensive editing.
As recently as 20 years ago, it would have been fair to say that even template-based protein models were rarely more useful for MR than the templates on which they were based, because it was too difficult to distinguish the few ways in which they could be improved from the vast number of ways in which they could be degraded. Since then, modeling methods have turned a corner and are becoming progressively more useful. A test for utility in MR was introduced for CASP7, 2 showing that about half of the best available templates in the high accuracy category could be improved by at least one predictor group, although only 33 of 1588 models evaluated were better than the best template. It should be acknowledged here that there is less room for improvement in the high accuracy category than in cases where no closely related template is available. Indeed, in a striking case from CASP7, an ab initio model of a small globular protein was predicted to sufficient accuracy that it could have been used to solve that structure by MR. 3 Other work resulted in the program AMPLE, which seeks to isolate sufficiently accurate substructures from sets of ab initio models by clustering and truncation. 4

When model accuracy was low, a useful score could only be generated if the model was sufficiently good to identify the correct pose in the full search. This problem was circumvented later by the use of rigid-body refinement starting from a structural superposition instead of the full MR search, judging the models by the log-likelihood-gain (LLG) score of the refined model instead of by whether or not the model could be placed. This also had the benefit of dramatically reducing the CPU time required to explore many incorrect solutions with poor models that lack useful signal, and of ensuring that the LLG scores corresponded to models in the correct pose.
Although the success-or-failure aspect of the MR searches was lost, the LLG scores could still be interpreted in the knowledge that MR searches yielding LLG values above 60 are usually correct. 5 A second problem arose in MR scoring when there are multiple copies in the asymmetric unit, or more than one type of component.
With the full MR approach, the MR scoring was restricted to those cases for which there was a single copy of a single protein component in the asymmetric unit of the crystal. However, the rigid body refinement approach allowed these more complicated targets to be scored by placing all copies of the tested model within a background that includes the deposited structure for all other components of the crystal; the increase in the LLG obtained when adding the tested model to the background structure alone was the measure of model quality.
A Phaser script to carry out rigid-body refinement calculations, written by Gábor Bunkóczi, was used by other assessors in the refinement category of CASP10, 6 as well as by us for both the refinement 7 and template-based modeling 8 categories of CASP13. This script was again used here for assessment in CASP14.
Problems remain with the rigid-body refinement approach, not least the fact that it requires diffraction data to be made available to assessors; not all crystallographers contributing targets are able to share these data in advance of publication. A substantial number of targets and domain evaluation units (EUs) derived from them now arise from cryo-EM structure determinations (21 EUs from seven structures in CASP13, 9 and 22 EUs from seven structures in CASP14, 43 ) and hence have no diffraction data. In addition, the LLG scores vary in a crystal-form-dependent fashion, depending on the resolution and quality of the data, the number of copies of the protein in the asymmetric unit of the crystal, and the fraction of the asymmetric unit accounted for by the modeled component. Comparisons among targets require some normalization, generally through the calculation of Z-scores.
In this study, a novel likelihood score is introduced, the "relative expected LLG" (reLLG) that requires only the coordinates of the target to rank the suitability of a model for MR. Most significantly, it is a crystal-form independent measure. We test the reLLG against the LLG score as a ranking measure and demonstrate its utility as a more convenient and robust measure, which should supersede the use of the LLG for this purpose. We find that the ability of refinement groups to improve reLLG values correlates well with their ability to improve the performance of refinement targets in actual MR experiments.
Finally, our results provide another metric by which the superiority of the AlphaFold2 10 models over the others in the assessment can be seen.

| Target selection for log-likelihood-gain scoring
In CASP, structures contributed for the prediction season are examined and divided into smaller pieces (often individual domains) that usually have a relatively compact structure. These are referred to as "evaluation units" or EUs. For CASP14, a total of 96 EUs were selected for evaluation of structure prediction. Prior to the CASP14 meeting, diffraction data were made available by the experimentalists who contributed 32 crystal structures, from which 54 EUs were drawn. These EUs could therefore be included in the MR assessment, which used the previously described diffraction-data-dependent LLG score. Diffraction data were not available at the time of assessment for the remaining 17 EUs drawn from other crystal structures, nor of course for the EUs drawn from cryo-EM or NMR structures.
In the refinement round, a total of 49 prediction targets were selected. These included seven "extended" targets and seven "double-barrelled" targets used to conduct additional experiments in CASP14. For the extended targets, refined models were collected after the initial 3-week period and again after an additional 6 weeks, during which more extensive computations may have been performed (denoted with an "x" in the target name). For the double-barrelled targets, two starting models were chosen for refinement, one typically chosen from the server models and the other from models submitted by the AlphaFold2 group (denoted with "v1" or "v2" in the target name, with the "1" or "2" chosen randomly).
Thirty-four of the 49 total targets were derived from structures determined by X-ray crystallography, of which 20 had diffraction data available at the time of assessment and could therefore be used for LLG calculations.

| Model selection
For the double-barrelled refinement targets, one group recognized correctly that one of the two starting models (the AlphaFold2 model, though it was not identified as such) was superior to the other, and they submitted the better model as a refinement model for the poorer one. While the ability to recognize good models is laudable, it does not reveal anything about the ability of the group to carry out refinement, so the AlphaFold2 models provided by this group were excluded from consideration. All other models for both structure prediction and refinement were evaluated.

| Log-likelihood-gain
As in the case of CASP13, the LLG for each model of each EU was computed by rigid-body refinement in Phaser, using the rest of the final crystal structure as a fixed background for the calculation.
The initial superposition of the evaluation unit on the target was carried out using the sequence-independent structure alignment program TM-align. 11 To allow for an assessment of the impact of the predicted error estimates, the LLG calculations were performed in two different modes for each prediction: once with the B-factor field interpreted as error estimates (used to weight the MR calculations as discussed below) and once with all B-factors set to a constant value. From each of these scores, we subtracted the EU-specific null-model LLG (the LLG value of the models with the lowest GDT_HA, corresponding to the noise), thus calculating the equivalent of the CASP13 increase in LLG over the background. The definition of the EU-specific null-model LLG stems from the observation that at low GDT_HA, LLG values in GDT_HA versus LLG plots can be approximated by linear regression for a given EU.
To calculate the EU-specific null-model LLG, for each EU, the models were binned into 100 equally spaced GDT_HA bins and the average LLG value for each bin taken. This average was computed iteratively, removing at each iteration those data points with an LLG 1σ below the average, until no more data points were excluded. Of these bins, the first 35% (bottom 35% by GDT_HA) were considered further, and the average of their average LLGs taken. Those bins with an average LLG within 3σ of the average over the bottom 35% were sorted by their average LLG and the middle 80% taken. A linear model was fitted to the averages of these bins and the intercept on the y axis taken as the null-model LLG. All models with an LLG below the corresponding null-model LLG were assigned a score of zero. We refer to this difference LLG score as the dLLG for short.
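In outline, the null-model fit described above can be sketched as follows. This is a simplified reimplementation with NumPy; the function and parameter names are our own, and the released analysis notebooks should be consulted for the authoritative version.

```python
import numpy as np

def null_model_llg(gdt_ha, llg, n_bins=100, bottom_frac=0.35):
    """Estimate the EU-specific null-model LLG from (GDT_HA, LLG) pairs.

    Sketch of the procedure described in the text: bin by GDT_HA,
    iteratively trim low-LLG points within each bin, fit a line to the
    low-GDT_HA bin averages and return its y-intercept.
    """
    gdt_ha, llg = np.asarray(gdt_ha, float), np.asarray(llg, float)
    edges = np.linspace(gdt_ha.min(), gdt_ha.max(), n_bins + 1)
    centers, averages = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        vals = llg[(gdt_ha >= lo) & (gdt_ha < hi)]
        # Iteratively discard points more than 1 sigma below the bin mean.
        while vals.size:
            keep = vals >= vals.mean() - vals.std()
            if keep.all():
                break
            vals = vals[keep]
        if vals.size:
            centers.append(0.5 * (lo + hi))
            averages.append(vals.mean())
    centers, averages = np.array(centers), np.array(averages)
    # Bins are in ascending GDT_HA order; keep the bottom 35%.
    n_low = max(2, int(bottom_frac * len(centers)))
    c, a = centers[:n_low], averages[:n_low]
    # Reject bins more than 3 sigma from the mean of these averages,
    # then sort by average LLG and keep the middle 80%.
    sel = np.abs(a - a.mean()) <= 3 * a.std()
    c, a = c[sel], a[sel]
    order = np.argsort(a)
    trim = max(1, int(0.1 * len(a)))
    keep = order[trim:len(a) - trim] if len(a) > 2 * trim else order
    # Linear fit; the y-intercept is the null-model LLG.
    slope, intercept = np.polyfit(c[keep], a[keep], 1)
    return intercept
```

For data in which LLG rises linearly from a baseline, the returned intercept approximates that baseline, which is then subtracted from each model's LLG to give the dLLG.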

| Relative expected log-likelihood-gain
As discussed above, there are substantial advantages to a likelihood score that measures suitability for MR independent of crystal form or structure determination method.
By the correlation theorem of Fourier transforms, the correlation between electron densities is proportional to the complex correlation between structure factors calculated from those electron densities. In turn, the complex correlation in a resolution shell is equivalent to the resolution-dependent σ_A value used in crystallographic likelihood targets, such as the log-likelihood-gain on intensities (LLGI) used for MR. 12 (Note that the complex correlation in a resolution shell is also equivalent to the Fourier shell correlation, or FSC, commonly used to assess cryo-EM reconstructions. 13 ) We have shown that there is a close relationship between σ_A and the score expected to be obtained in likelihood-based MR. The expected log-likelihood-gain (eLLG) can be approximated 5 as the sum, over all Fourier terms, of σ_A⁴/2, allowing valuable optimizations of the MR strategy depending on the qualities of the model and the data. 14 This relationship between electron density overlap and LLG is the basis of the reLLG score discussed below.
Superposition of model and target with an algorithm such as that in the LGA program 15 will not generally optimize the electron density overlap. Therefore, to enable the calculation of the reLLG score, a new phased rigid-body refinement mode was implemented in phasertng, which is under development to replace and enhance the functionality of Phaser. 16 The rigid-body refinement starts from a sequence-independent superposition using LGA. 15 Instead of optimizing the LLGI score, which lacks phase information, it uses a phased likelihood target. This target starts from the assumption, based on the Central Limit Theorem, that structure factors computed from two superimposed models are related by a bivariate complex normal distribution; the assumption of multivariate complex normal distributions also underlies many likelihood-based crystallographic algorithms, including MR, refinement and experimental phasing. The probability distribution relating two sets of structure factors is characterized by a Hermitian covariance matrix. This takes a particularly simple form if the structure factors are first normalized, giving E values for which the mean-squared value is one. In this case, the off-diagonal complex covariance term of the covariance matrix becomes the complex correlation, σ_A:

⟨E_m E_m*⟩ = ⟨E_t E_t*⟩ = 1,   ⟨E_t E_m*⟩ = σ_A

Note that a complex covariance will in general be a complex number, but σ_A is a real number because, if a systematic phase shift were known between the two structures, that would imply the existence of a known relative translation vector, which could be corrected instead.
The likelihood target is the conditional probability of the target structure factors given the known model structure factors. This is derived from the joint distribution by standard manipulations to obtain the conditional variance of the target E-value given the model and its expected value:

⟨E_t⟩ = σ_A E_m,   var(E_t; E_m) = 1 − σ_A²

These parameters are used to express the conditional probability as a complex normal distribution:

p(E_t; E_m) = [1/(π(1 − σ_A²))] exp(−|E_t − σ_A E_m|² / (1 − σ_A²))

The target that is optimized is a log-likelihood-gain, obtained by taking the logarithm of the conditional probability and subtracting the logarithm of the probability of the null hypothesis, which is the Wilson distribution of structure factors 17 and is equivalent to the conditional probability when σ_A is zero, that is, when the model is uncorrelated with the target and is thus uninformative. The contribution of a single Fourier term to the total LLG is given in the following:

LLG = −ln(1 − σ_A²) + [2σ_A E_t E_m cos(Δϕ) − σ_A²(E_t² + E_m²)] / (1 − σ_A²)

The phased log-likelihood-gain is a function of the orientation and position of the model relative to the target, and of the current value of σ_A for each structure-factor pair. The orientation is defined in terms of three rotation angles specifying rotations of the pre-oriented model around axes parallel to x, y and z running through the center of the model. Because the perturbations of the initial orientation will be small, these rotations will be nearly orthogonal and will therefore behave well in the optimization. The position is defined in terms of translations along the x, y, and z axes, which are orthogonal and are essentially independent of the rotations applied around the center of the model. The σ_A values are a function of the resolution of the relevant structure factors and are defined in terms of the radial RMSD for coordinate errors drawn from a single 3D Gaussian.
The value of σ_A is given, as a function of resolution, by the Fourier transform of that Gaussian:

σ_A = √f_p exp(−2π² rmsd² s² / 3)

where f_p is the fraction of the target explained by the model (assumed to be one for the calculations reported here), rmsd is the refined parameter, and s is the magnitude of the diffraction vector (the inverse of the resolution).
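The σ_A curve and the eLLG approximation it feeds can be sketched numerically as follows. This is a minimal illustration assuming the Gaussian error model just described; the function names are our own.

```python
import numpy as np

def sigma_a(s, rmsd, f_p=1.0):
    """sigma_A as a function of diffraction-vector magnitude s (= 1/d),
    for coordinate errors drawn from a single 3D Gaussian with radial
    RMSD `rmsd`, with the model covering a fraction f_p of the target."""
    s = np.asarray(s, float)
    return np.sqrt(f_p) * np.exp(-2.0 * np.pi**2 * rmsd**2 * s**2 / 3.0)

def expected_llg(s_values, rmsd, f_p=1.0):
    """eLLG approximated as the sum over Fourier terms of sigma_A^4 / 2."""
    return float(np.sum(sigma_a(s_values, rmsd, f_p) ** 4 / 2.0))
```

For a perfect, complete model (rmsd = 0, f_p = 1), σ_A is 1 at all resolutions and each Fourier term contributes 1/2 to the eLLG; increasing the rmsd damps σ_A most strongly at high resolution (large s).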
The refinement against the phased log-likelihood-gain can be seen to optimize the electron density overlap: E_t E_m cos(Δϕ) is equivalent, by the correlation theorem, to the contribution of a Fourier term to the density correlation. The variance term in the phased log-likelihood-gain is controlled by the rmsd parameter, which will be optimal when the σ_A values computed as a function of resolution from that rmsd match the mean values of E_t E_m cos(Δϕ) in resolution shells.
Once an optimal superposition is obtained, structure factors calculated from the superposed model and from the target are compared to give the reLLG score. The reLLG calculation also requires making a choice for the high-resolution limit. A calculation carried out to a higher resolution limit would be more sensitive to model errors, whereas the use of lower resolution would be more forgiving. In principle, one could define scores based on different resolution limits, analogous to the way that the GDT_TS score is more forgiving than the GDT_HA score. 15 We have chosen a resolution limit of 2 Å for the calculations here for two reasons. First, the median resolution of crystal structures in the wwPDB 1 is close to this value: 2.2 Å overall, and 2.1 Å for the year 2020. Second, 2 Å is approximately the resolution at which most structures can be completed starting from even a partial correct MR solution. 20

By CASP13, most predictors in the template-based modeling category included error estimates 8 but many participants in the refinement category did not. 7 In this round of CASP, we were pleased to see that most predictors and participants in the refinement category provided coordinate error estimates within a plausible range.
Such error estimates are extremely valuable for MR models. The electron density overlap, and therefore the LLG score, is optimized if the B-factors of the models are increased by an amount that effectively smears each atom's density over its probability distribution of true positions, using the following equation:

ΔB = 8π² rmsd² / 3
This approach was suggested in the high-accuracy assessment for CASP7 2 and supported by tests using either the actual or estimated coordinate errors in models. 22 The practical impact was demonstrated further by showing that this treatment significantly improves the utility for MR of models submitted to CASP10, 23 as well as in the evaluation of template-based modeling for CASP13. 8 To measure the utility of the error predictions numerically, each model was evaluated twice. In the primary calculation, the value in the B-factor field of the model was transformed, using the equation above, from a coordinate-error estimate into a B-factor providing an error weight; in the secondary calculation, the B-factor was substituted with a constant value of 25 Å². (Because the calculation is carried out with normalized structure factors, or E-values, the actual value of the constant B-factor is irrelevant. By extension, the mean value of any B-factor distribution can be altered without affecting the result.) The difference between the two results is a measure of the value added by the error estimates.
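The conversion from an estimated radial coordinate RMSD to the equivalent smearing B-factor can be written as a one-line helper (our own sketch; the distributed phenix.voyager.rmsd_to_bfactor command performs the analogous conversion):

```python
import math

def rmsd_to_bfactor(rmsd):
    """Convert an estimated radial coordinate RMSD (in Angstrom) into the
    isotropic B-factor increment (in Angstrom^2) that smears each atom's
    density over its positional error distribution:
    B = 8 * pi^2 * rmsd^2 / 3."""
    return 8.0 * math.pi**2 * rmsd**2 / 3.0
```

A 1 Å coordinate error thus corresponds to a B-factor increment of about 26.3 Ų, and the quadratic dependence means that poorly determined regions are down-weighted very strongly.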

| Computing group rankings
For all the evaluation measures, Z-scores were computed using an algorithm that has frequently been applied in other rounds of CASP.
The primary ranking was based on model #1 of up to five models submitted for each target; this choice implicitly rewards the ability of groups to assess the relative quality of their models. Z-scores were computed in two steps: a set of initial scores was calculated based on the mean and standard deviation (SD) of all models under consideration. All models yielding a Z-score below −2 in the first pass were considered outliers, and the Z-scores were recomputed using the mean and SD obtained when the outliers were excluded. Finally, the minimum Z-score was set to −2 to avoid excessively penalizing outliers. For ranking, all Z-scores were summed and a penalty of −2 was introduced per target for which a method did not produce a model, effectively treating missing models as outliers.
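The two-pass Z-score scheme can be sketched compactly (a simplified reimplementation; the function and parameter names are our own):

```python
import numpy as np

def two_pass_zscores(scores, floor=-2.0):
    """Two-pass Z-scores with outlier trimming, as described above.
    `scores` holds one ranking score per model for a given target;
    missing models are handled separately via the per-target penalty."""
    s = np.asarray(scores, float)
    z = (s - s.mean()) / s.std()          # first pass over all models
    kept = s[z >= floor]                  # exclude first-pass outliers
    z = (s - kept.mean()) / kept.std()    # recompute with trimmed stats
    return np.maximum(z, floor)           # floor final Z-scores at -2
```

Trimming before the second pass prevents one catastrophically bad model from compressing the Z-scores of all the reasonable ones, while the floor caps the penalty any single model can incur.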
For rankings based on either the conventional LLG or the new reLLG score, the primary ranking was based on interpreting the B-factor field as an estimate of the RMS error in each atom's position, as requested in the submission instructions provided by the CASP organizers. The difference between this LLG or reLLG for error-weighted models and the value computed setting all B-factors to a constant value was used to measure the value added by the coordinate error estimates.

| Software and data availability
The tables with the reLLG calculations as well as the Jupyter notebooks 24 used to analyze them can be found in the following repository: https://github.com/clacri/CASP14_MR_evaluation. The Jupyter notebooks have been prepared to be run in the cloud environment of Google Colaboratory, 25 so that the results can be reproduced without having to set up a specific local environment. The analysis relies on the following Python scientific libraries: Matplotlib, 26 Pandas, 27 and NumPy. 28 Computation of the reLLG was implemented in phaser_voyager, which takes a path to the folder containing the models to evaluate (assumed to be pre-oriented by default, but with an option to carry out a superposition). In addition, the command phenix.voyager.rmsd_to_bfactor is available to facilitate the conversion of an estimated RMSD in the B-factor field to the equivalent B-factor, and the pruning of residues with an RMSD above a chosen threshold.

| Structure prediction assessment
The statistical analysis and ranking calculations were carried out as described in Materials and Methods. Briefly, the primary ranking was based on the sum of the Z-scores for the #1 predictions when the B-factor field was interpreted as an error estimate, including the penalty of assigning a Z-score of −2 for missing models.

| Group rankings by difference log-likelihood-gain (dLLG) scores
Conventional dLLG scores were calculated for the 54 evaluation units corresponding to the 32 targets for which experimental diffraction data were available to us at the time of assessment. We calculated the scores with and without using the error estimates that were intended to be encoded in the B-factor field, thus assessing the impact of the error estimates. The resulting rankings are shown in Figure 2.

In order to compare and assess the novel reLLG score against the traditional CASP dLLG score, we addressed three questions. First, do the dLLG and reLLG yield similar rankings of models for a specific target? Second, do the dLLG and reLLG yield similar group rankings?
Third, do the reLLG calculations obtained from cryo-EM or NMR experiments also yield correlated group rankings?
We compared reLLG scores with dLLG scores for the targets for which diffraction data were available at the time of assessment. We found that the relationship between the two scores is roughly monotonic, indicating that they will deliver similar ranking orders for models.
Next, we examined whether the group ranking on the subset of targets for which diffraction data were available was similar. Figure 3 shows a very strong correlation between the ranking orders, with the top five groups being identical for the two measures.
FIGURE 2 The top 20 groups ranked by the sum of Z-scores of the dLLGs for their #1 predictions. Methods were ranked based on the dLLGs computed when considering the values in the B-factor field as error estimates (predicted RMSD to the target).

To verify that there are no systematic differences in how the reLLG would score models of structures determined by other methods, we compared the group ranking scores that would have been achieved using only cryo-EM targets or NMR targets with those achieved using X-ray targets. The scatter plots in Figure 4 demonstrate a strong correlation among the rankings using all three types of target. Note that the NMR scores are based on only three EUs.
Given that rankings on common targets are very similar using either dLLG or reLLG, that reLLG rankings on sets of targets derived by different methods (X-ray, cryo-EM, NMR) are similar, and that the use of the reLLG allows the use of a much larger data set (96 EUs rather than 54), we expect the ranking based on reLLG to be closer to what would be achieved for dLLG if diffraction data were available for all 96 EUs than the dLLG ranking based on only 54 EUs. The ranking based on reLLG is more robust, and we take it as the authoritative ranking for this study.
The ranking for all targets by reLLG Z-score is shown in Figure 5a. We note that the top three groups are the same in this ranking as in the rankings using just the targets for which diffraction data are available, but there are substantial differences among other methods near the top. Based on the comparisons discussed above, we believe that these differences reflect sampling error rather than a systematic difference between targets with and without diffraction data. Such sampling error should be reduced for the larger set of targets, further supporting the decision to use the reLLG Z-score as the primary ranking measure in this study.

| Utility of coordinate error estimates in MR calculations
CASP participants are asked to contribute error estimates for their predicted models in the B-factor field of submitted PDB files. While the group ranking analysis in this study was done using the information from those estimates, we also computed the reLLG scores with those estimates replaced by a constant value. We then computed, for each group, the difference between the sums of the reLLG scores obtained with and without the error estimates. As can be observed in Figure 5b, the general trend for the top-scoring groups is that the inclusion of the error estimates in the reLLG calculation improves the score.
FIGURE 3 Ranking scores based on dLLG (magenta bars) and reLLG (blue bars) using only targets for which diffraction data were available at the time of assessment. Groups are ordered by their reLLG ranking score.

| Accuracy self-assessment in the prediction category
The ability of the groups to identify their best models and rank them is an important aspect for prospective users, as many users will focus on the top model. Arguably, this is somewhat less important for MR models, as it is reasonably common (though not universal) to test a number of alternative models. One metric that can be used to score the accuracy of self-ranking is a rank correlation. We chose instead to use the fraction of the time that the #1 model is also the best of the five models submitted, because it is easy to understand and corresponds to one of the possible MR scenarios where only the best model is tested.
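The self-ranking metric described above reduces to a very small computation (a sketch; the data layout and names are our own):

```python
def top1_success_rate(scores_by_target):
    """Fraction of targets for which a group's submitted model #1 is also
    the best of its (up to five) submitted models by some quality score.
    `scores_by_target` maps target id -> list of scores, model #1 first."""
    hits = sum(s[0] == max(s) for s in scores_by_target.values())
    return hits / len(scores_by_target)
```

With five models per target, a group ranking at random would be expected to score about 0.2 on this measure, which provides the baseline against which self-assessment ability can be judged.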
A scatter plot comparing the percentage of #1 models ranked correctly with the reLLG ranking score (Figure 6a) shows that there is no overall correlation (correlation coefficient of −0.02) between the ability of an algorithm to predict structure and the ability to rank a set of predictions. This is surprising, as one would expect ranking to be an essential component of successful prediction. Nonetheless, Figure 6b shows that the most successful groups do better than random, with BAKER and FEIG-R1 doing best.

The effect of including the coordinate error estimates in the reLLG scoring was evaluated as for the prediction category. Figure 7b shows that, again, considerable value was added to the model by including good coordinate error estimates. How much value this added can be seen from an alternative ranking based on reLLG Z-scores computed with constant B-factors (Figure S1), which therefore judges purely coordinate accuracy and not the accuracy of the error estimates. A comparison of Figure S1 with Figure 7a shows that, on coordinate accuracy alone, only three groups outperform the naïve predictor.

| Accuracy self-assessment in the refinement category
There is a weak positive correlation (correlation coefficient of 0.31) between the ranking scores for different groups and their ability to correctly rank their best model as #1 (Figures 6c,d). One would expect this to be a strength in deciding whether a starting model had been improved, but it is difficult to see why this ability should be more important for refinement than for the initial prediction where no overall correlation was seen.

| Success of the refined models in MR
We performed MR using search models generated in the refinement category for those cases where diffraction data were made available.
There were 13 targets that fulfilled this requirement. Four of these included extended submissions benefitting from 6 weeks of refinement in addition to the standard 3-week refinement submissions (T1034, T1056, T1067, and T1074). Further to this, T1053, T1067, and T1074 were double-barrelled cases with refinement performed on two initial starting models. In each of these cases one of the starting models was an AlphaFold2 prediction. This gave a total of 20 sets of refined models to be tested in MR. Refined models from 36 different groups were included with each group producing up to five models per target. Starting models were also used in MR for comparison. The full set of target details is provided in Table 1.
Note (Table 1): The three double-barrelled cases had an additional refinement using an AlphaFold2 starting model (highlighted). Refinements denoted with an "x" are those where the model was refined for an additional 6 weeks. Cases with "D" denote starting models representing a single domain from the target.

The MrBUMP automated pipeline 33 was used to carry out the MR tests. Some targets had more than one molecule in the asymmetric unit, but we searched for only one copy to reduce the time taken for the MR run. For proteins with multiple components this is a more demanding test, because the signal in the MR search has a quadratic dependence on the fraction of the scattering accounted for by the model. 5 We deemed this to be an acceptable compromise, as correct placement of the first copy is generally the critical step. Of the 20 sets of starting models, only five proved to be successful search models in MR. Three of these were the AlphaFold2 predictions, with the remaining two being the starting models for R1034/R1034x1 (provided by the Seok server) and R1056/R1056x1 (from UOSHAN). Using these starting models, most groups that participated produced refined models that could also be used successfully in MR. In nine of the remaining 13 cases (including extended targets), refined models were produced that were sufficient for correct placement in MR. The BAKER and FEIG groups proved to be the most successful, yielding positive results in 13 and 12 cases, respectively. Notably, the same six groups appear at the top of the actual MR test as those above the naïve predictor in the reLLG ranking (Figure 7a); the groups that ranked below the naïve predictor provided very few models that succeeded in MR when the starting model failed. Figure S2 shows a ranking of groups by the number of MR successes, along with a comparison of the rankings obtained with the dLLG and reLLG Z-scores.
An example of a successful refinement by the FEIG-S group of a starting model unsuitable as a search model in MR, for the target T1090, is shown in Figure 9.

| Assessment of progress
As seen with many other CASP metrics, the quality of the AlphaFold2 models for MR represents a step change in what can be achieved. It is difficult to attach a numerical value to quantify progress in MR, but there is strong qualitative evidence. In previous rounds of CASP, the quality of models for MR was only measured for the easy and hard subsets of template-based modeling (TBM), but not for the most difficult free-modeling (FM) and borderline FM/TBM categories, because almost none of the FM and FM/TBM models were judged to bear sufficient resemblance to the targets to make that a meaningful exercise.
In addition, this is the first occasion in which targets contributed to CASP were actually solved using submitted models.
Target difficulty in CASP has traditionally been measured using the mean of target rankings by sequence coverage and sequence identity to the closest homologue of known structure. 38 Figure 10a shows that model quality for MR, measured by the reLLG score, still has some dependence on target difficulty by this measure, but there are useful models across the spectrum. In almost all cases, the best models are those produced by AlphaFold2. One striking example is their model #2 of T1078-D1, which achieves an reLLG score of 0.648, the highest seen for any of the targets; this is in spite of the best template in the PDB having a sequence identity of only 9.8% and low coverage. A few targets yielded poor models from all groups; these are outlined in a dashed blue box at the bottom of Figure 10a.
The blue points in this box represent the best AlphaFold2 models for (from left to right) T1093-D2, T1100-D1, T1092-D1, T1083-D1, T1095-D1, and T1099-D1. These all represent cases of targets extracted from subunits of larger assemblies: T1083-D1, for example, is a subunit of a homotetramer stabilized by coiled-coil interactions. The accuracy expected of a template-based model can be estimated from its sequence identity to the target, 19 and this can be translated into a reduced reLLG, as shown in Figure 11. A sequence identity of 30% thus translates into an reLLG value of slightly <0.1. The majority of AlphaFold2 structures across the difficulty scale reach this value, as well as a substantial fraction of the best models from other groups (Figure 10a).
In this study, we have not validated whether or not an reLLG above this level is sufficient for success in MR.

| Relevance of refinement category in CASP
The CASP refinement category was instigated to encourage the development (and allow the evaluation) of expensive computational methods, ones for which most groups do not have the resources to apply to the large number of targets in the prediction round. However, there has been a trend for methods pioneered by refinement groups to be incorporated into the initial models in subsequent CASP rounds, raising the bar for current refinement groups. In this category, a number of server-generated models are traditionally provided for further improvement. In CASP14, this pool of models was supplemented with seven (non-server) AlphaFold2 models. Although the best refinement groups were consistently able to improve the server-generated refinement targets, most refinement methods degrade the AlphaFold2 models, as seen here for MR as well as for other CASP assessment measures. 30 This is in spite of the lack, in the AlphaFold2 algorithm, 10 of the explicit physics-based knowledge employed by the most successful refinement groups (e.g., Heo et al. 31 ). Figure 12 shows that, with one marginal exception (a slight improvement on an AlphaFold2 starting model), the AlphaFold2 model would have scored equal or higher on the reLLG score compared to the best refined model, even including the double-barrelled targets starting from AlphaFold2 models. If the initial AlphaFold2 predicted models had simply been resubmitted for each refinement target, then AlphaFold2 would have topped the refinement rankings as well. In light of predictions of this quality, the refinement category as it currently stands appears to have become redundant. Some consideration of potential future changes can be found elsewhere in this issue. 30,31 In conclusion, we have shown that the reLLG is a useful addition to the assessment metrics for CASP and should replace the metrics