Reliable protein-protein docking with AlphaFold, Rosetta, and replica-exchange

Despite the recent breakthrough of AlphaFold (AF) in the field of protein sequence-to-structure prediction, modeling protein interfaces and predicting protein complex structures remains challenging, especially when there is a significant conformational change in one or both binding partners. Prior studies have demonstrated that AF-multimer (AFm) can predict accurate protein complexes in only up to 43% of cases.¹ In this work, we combine AlphaFold as a structural template generator with a physics-based replica exchange docking algorithm. Using a curated collection of 254 available protein targets with both unbound and bound structures, we first demonstrate that AlphaFold confidence measures (pLDDT) can be repurposed for estimating protein flexibility and docking accuracy for multimers. We incorporate these metrics within our ReplicaDock 2.0 protocol² to complete a robust in silico pipeline for accurate protein complex structure prediction. AlphaRED (AlphaFold-initiated Replica Exchange Docking) successfully docks failed AF predictions, including 97 failure cases in Docking Benchmark Set 5.5. AlphaRED generates CAPRI acceptable-quality or better predictions for 66% of benchmark targets. Further, on a subset of antigen-antibody targets, which is challenging for AFm (19% success rate), AlphaRED demonstrates a success rate of 51%. This new strategy demonstrates the success possible by integrating deep-learning-based architectures trained on evolutionary information with physics-based enhanced sampling. The pipeline is available at github.com/Graylab/AlphaRED.


Introduction
Recent work in physics-based docking approaches tested induced-fit docking,² large ensembles,¹⁵ and fast Fourier transforms with improved energy functions¹⁶ to capture conformational changes and better dock protein structures. Coupling temperature replica exchange with induced-fit docking, ReplicaDock 2.0² achieved successful local docking predictions on 80% of rigid (unbound-to-bound root mean square deviation, RMSD_UB < 1.1 Å) and 61% of medium (1.1 ≤ RMSD_UB < 2.2 Å) targets in the Docking Benchmark 5.0 set.¹⁷ However, like most state-of-the-art physics-based docking methods, ReplicaDock 2.0 performance was limited for highly flexible targets: 33% success rate on targets with RMSD_UB ≥ 2.2 Å. Promisingly, by focusing backbone moves on known mobile residues (i.e., residues that exhibit conformational changes upon binding), ReplicaDock 2.0 sampling substantially improved the docking accuracy. But the flexible residues must first, somehow, be identified. Additionally, physics-based docking is quite slow (6-8 hrs on a 24-core CPU cluster) compared to recent DL-based docking tools (0.1-10 minutes on a single NVIDIA GPU). However, docking-specific DL tools such as EquiDock¹⁸ and dMaSIF¹⁹ do not allow for protein flexibility, and recent tools like GeoDock²⁰ and DockGPT²¹ have very limited backbone flexibility. Further, all of these DL docking tools have low success rates on unbound docking targets such as those in Docking Benchmark 5.5.²⁰

In this work, we combine the features of a top deep learning approach (AlphaFold-multimer⁴) with physics-based docking schemes (ReplicaDock 2.0²) to systematically dock protein interfaces. The overarching goal is to create a robust pipeline for easier, reproducible, and accurate modeling of protein complexes. We investigate the aforementioned questions and create a protocol to resolve AFm failures and capture binding-induced conformational changes. We first assess the utility of AFm confidence metrics to detect conformational flexibility and binding site confidence. Next, we feed these metrics and the AFm-generated structural template to ReplicaDock 2.0, creating a pipeline we call AlphaRED (AlphaFold-initiated Replica Exchange Docking). We test AlphaRED's docking accuracy on a curated set of benchmark targets of bound and unbound protein structures of varying levels of binding-induced conformational change, including antibody-antigen interfaces, which additionally challenge AFm due to the lack of evolutionary information across the interface.²²,²³ In summary, we aim to assess the promise of combining the best of deep learning and biophysical approaches for predicting challenging protein complexes.

Dataset curation. We curated a dataset for conformational flexibility from the Docking Benchmark Set 5.5 (DB5.5),¹⁷ which comprises experimentally characterized (X-ray or cryo-EM) structures of bound protein complexes and their corresponding unbound protein subunits. Each protein target (with unbound and bound structures) is classified based on its unbound-to-bound root-mean-square deviation (RMSD_UB) as rigid (RMSD_UB ≤ 1.2 Å), medium (1.2 Å < RMSD_UB ≤ 2.2 Å), or difficult (RMSD_UB ≥ 2.2 Å). Further, owing to the poor performance of AlphaFold and other predictor groups in predicting antibody-antigen targets in the recent CASP15-CAPRI round,²⁴ we identified a subset comprising only antibody-antigen complexes (including single-domain antibodies, or nanobodies) by extracting all the 67 antibody-antigen structures [...]

[...] state. Antibody-antigen targets further demonstrate a similar trend, however with fewer targets predicted within sub-angstrom accuracy to the bound form (29.7% for Ab-Ag targets as opposed to 41% for DB5.5).
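The RMSD_UB difficulty classes quoted in the dataset curation text can be written as a small helper. This is a minimal sketch, not code from the AlphaRED repository: the function name `classify_flexibility` is hypothetical, and because the text assigns exactly 2.2 Å to both the medium and difficult classes, the sketch assumes that value falls in the medium class.

```python
def classify_flexibility(rmsd_ub: float) -> str:
    """Return the difficulty class for an unbound-to-bound RMSD (in angstroms).

    Thresholds follow the text: rigid <= 1.2 A, medium <= 2.2 A, difficult above.
    Assumption: a value of exactly 2.2 A is treated as medium.
    """
    if rmsd_ub < 0:
        raise ValueError("RMSD must be non-negative")
    if rmsd_ub <= 1.2:
        return "rigid"
    if rmsd_ub <= 2.2:
        return "medium"
    return "difficult"


if __name__ == "__main__":
    for value in (0.8, 1.5, 3.0):
        print(value, classify_flexibility(value))
```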

AlphaFold pLDDT provides a predictive confidence measure for backbone flexibility. AlphaFold employs multiple sequence alignments with a multi-track attention-based architecture to predict a residue-level confidence measure: the predicted local distance difference test (pLDDT), which estimates the agreement between a predicted model and an experimental structure based on the Cα LDDT test (Methods). [...]

In this regard, we compared the computational (AF-pLDDT) and experimental (per-residue RMSD and LDDT) metrics against each other.

As a reference, we first superimposed the unbound partners over the bound structures and calculated residue-wise Cα deviations to determine the per-residue RMSD_BU values. LDDT_BU was measured by calculating the local distance differences in the unbound structure relative to the bound form. These [...]

Interface-pLDDT correlates with DockQ and discriminates poorly docked structures. When the prediction accuracy is lower, it is often evident from lower confidence metrics (such as average pLDDT or PAE). However, for AlphaFold-multimer complex predictions, the confidence metrics of the overall prediction do not correlate with the accuracy of the docked prediction, i.e., even if the complex exhibits higher confidence, the docking interfaces could be incorrect.

We investigated whether any of the AlphaFold predictive metrics could be repurposed for distinguishing native-like binding sites from non-native ones. That is, can one utilize pLDDT or PAE from AFm models to determine whether the predicted docked complex has the accurate binding orientation? Thus, we evaluated accuracy with the DockQ score, the standard metric for docking model quality.

Docking of benchmark targets initiated from AlphaFold models improves performance. With metrics to identify the flexible regions in the protein and the docking accuracy of generated docked models, we next fused AlphaFold-multimer (AFm) with our docking protocol, ReplicaDock 2.0,² to build a protocol for (1) improving on incorrect AF docking predictions and producing alternate, near-native binding models and (2) capturing backbone conformational changes with our induced-fit protocol ReplicaDock 2.0.² We named the protocol AlphaRED (AlphaFold-initiated Replica Exchange Docking). AlphaRED uses AFm-predicted structures as the primary template, estimates docking accuracy metrics, and initiates global docking or refinement protocols as required.
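The branching step described above can be illustrated with a short NumPy sketch that uses the 8 Å interface definition and the interface-pLDDT threshold of 85 reported in this work. It is an illustration under stated assumptions rather than the AlphaRED implementation: the function names, the 0-based residue indexing, the return labels, and the choice to treat a model with no interface as lowest confidence are all hypothetical.

```python
import numpy as np


def interface_residues(xyz_a, res_a, xyz_b, res_b, cutoff=8.0):
    """Residue indices on each partner with any atom within `cutoff` of the other.

    xyz_*: (n_atoms, 3) atom coordinates; res_*: (n_atoms,) 0-based residue
    index of each atom within its own partner.
    """
    xyz_a = np.asarray(xyz_a, dtype=float)
    xyz_b = np.asarray(xyz_b, dtype=float)
    # All-against-all atom distances between the two partners.
    dist = np.linalg.norm(xyz_a[:, None, :] - xyz_b[None, :, :], axis=-1)
    close = dist <= cutoff
    ia = np.unique(np.asarray(res_a)[close.any(axis=1)])
    ib = np.unique(np.asarray(res_b)[close.any(axis=0)])
    return ia, ib


def interface_plddt(plddt_a, plddt_b, ia, ib):
    """Average per-residue pLDDT over the interface residues of both partners."""
    values = np.concatenate(
        [np.asarray(plddt_a, dtype=float)[ia], np.asarray(plddt_b, dtype=float)[ib]]
    )
    # Assumption: no interface at all is treated as the lowest confidence.
    return float(values.mean()) if values.size else 0.0


def choose_protocol(iplddt, threshold=85.0):
    """Hypothetical decision rule: refine confident interfaces, re-dock the rest."""
    return "local_refinement" if iplddt >= threshold else "global_docking"
```

Atom-level distances are used here because the interface definition is atom-based. For large complexes a neighbor search (for example a k-d tree) would avoid building the full distance matrix.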
[...] curve as a function of different metrics for the docking dataset (n = 254). Interface residues are defined based on whether atoms of residues on one partner are within 8 Å of any atom on the other partner. Interface-pLDDT is the average pLDDT of interface residues. Avg-pLDDT corresponds to the average pLDDT across all the residues in the predicted model. Interface contacts and interface residues are the counts of the interface contacts and interface residues, respectively. Interface-pLDDT has the highest AUC score of 0.86. (B) Confusion matrix with an interface-pLDDT threshold separating the predicted labels (false: < 85; true: ≥ 85) and an interface-RMSD threshold separating the actual labels (true: ≤ 4 Å; false: > 4 Å). (C) Interface-pLDDT versus DockQ for all protein targets in the benchmark set. DockQ is calculated from the predicted AlphaFold structure and the experimental bound structure in the PDB. We fit a sigmoidal curve to this available data.

We investigated AlphaRED's performance on all 254 benchmark targets (Fig. 6) [...]

Evaluation on blind CASP15 targets. All results presented thus far may be biased by the fact that these benchmark target structures were [...] complexes where most of the groups performed poorly (Fig. 8).

For each target, we employed the AlphaRED strategy as described in Fig. 5. All targets predicted with AFm had low interface-pLDDT, thereby demanding global docking. This is unsurprising because the targets were nanobody-antigen complexes whose CDRs, particularly CDR H3, are not conserved and for which co-evolution data with the antigen are scarce.³⁶ For representative target T205, our docking strategy improves the performance drastically (interface RMSD of 11.4 Å for the AFm model versus 2.84 Å for AlphaRED) and binds in the correct site. The plot of interface scores versus interface-RMSD shows a distinct funnel with low-energy, medium-quality structures (Fig. 7-top). Since the crystal structures are not yet released, [...]

[...] Starting from the AFm model (orange), global docking performance on 2FJU shows the native-like binding site (gray) and the sampled AlphaRED decoy (blue). For local docking, backbone sampling on mobile residues predicted by residue pLDDT (outlined cartoon) shows that the AlphaRED decoy (blue) moves the backbone towards the bound form (gray).

[...] integrating AlphaFold with biophysical attributes to better predict protein complex structures.

AlphaFold has dramatically transformed the field of structural biology and is currently the state-of-the-art method to predict protein structures from sequences, not just for monomers but also for complexes and higher assemblies.³⁷ One of the key elements of its success was the ability to mine evolutionary links between amino acids across protein families and determine structural templates. This approach dramatically improves prediction accuracy for monomers, as reflected in prior CASP rounds. However, across protein interfaces, the evolutionary constraints can be weak and often skew predictions to inaccurate binding sites. Here we demonstrated how augmenting the predictions of AlphaFold with an energy-function-dependent sampling approach reveals better backbone conformational diversity and accurate prediction of protein complex structures. By utilizing the AlphaRED strategy, we show that failure cases in AFm-predicted models are improved for all targets (lower Irms for 97 of 254 failed targets), with CAPRI acceptable-quality or better models generated for 66% of targets overall (Fig. 9). [...] landscape as demonstrated with AlphaRED's performance on DB5.5. Finally, we evaluated recent CASP15 targets to investigate the extrapolation of this strategy over blind protein targets. CASP15 targets were absent from the training routine of AlphaFold and served as blind challenges to determine the efficacy of the protocol. With AlphaRED, we obtained DockQ scores over 0.23 for all five targets, with medium-quality models (DockQ > 0.49) for targets T205, T207, and T208. AFSample, a top-performing group in CASP15, employed stochastic perturbation with dropout and increased sampling to obtain medium- and high-quality models for these targets. However, AFSample requires GPU simulations to produce ∼240x more models, with compute time ∼1000x that of the baseline AFm.¹⁰ On the other hand, we utilized ColabFold¹¹ to generate 1-5 structures for our docking routine with the baseline version. As opposed to a couple of days on GPU (each GPU node contains up to 48 cores) utilized by AFSample, our docking routine fused with ColabFold uses 5-7 hours on our CPU cluster (runs on 1 node with 24 cores, amounting to ∼100 CPU-hours per target). The AlphaRED docking strategy demonstrates a new and better way to predict protein complex structures within feasible compute times.

This work is particularly impactful for its success rate on antibody-antigen targets. Deep learning promises accurate design and optimization of antibody therapeutics,³⁴ but a lack of fast and accurate docking methods for antibodies prevents high-throughput computational screening. Additionally, this work is impactful because, by integrating a physics-based method for refinement, the pipeline can potentially handle post-translationally modified proteins or non-canonical residue types that are not defined in ML approaches like AF. [...]

[...] leading to structural differences between unbound and bound structures of protein targets.

To measure the conformational change in protein structures, we calculated two metrics: Cα root-mean-square deviation (RMSD) and local distance difference test (LDDT).⁴⁰ To get a detailed representation of the intrinsic motion of a protein, we calculated RMSDs at the residue level, i.e., per-residue Cα RMSD for each residue of a protein target. The sequences and structures of unbound and bound proteins were aligned, and the RMSDs were calculated for the aligned residues. The total sequence lengths were also matched, and lingering terminal residues were trimmed to ensure structural and sequential similarity.
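A per-residue Cα deviation of this kind can be sketched with NumPy as follows. The sketch assumes the two Cα coordinate arrays are already sequence-aligned and trimmed to equal length, and it uses the standard Kabsch algorithm for the rigid-body superposition. The text does not state which superposition method was used, so that choice and the function names are assumptions made for illustration.

```python
import numpy as np


def kabsch_superpose(mobile, target):
    """Rigidly superimpose `mobile` onto `target` (both (N, 3) arrays)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    p = mobile - mobile_center
    q = target - target_center
    # Covariance matrix and its SVD give the optimal rotation.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return p @ rotation.T + target_center


def per_residue_ca_deviation(unbound_ca, bound_ca):
    """Return (per-residue Calpha deviations, overall RMSD) after superposition."""
    unbound_ca = np.asarray(unbound_ca, dtype=float)
    bound_ca = np.asarray(bound_ca, dtype=float)
    if unbound_ca.shape != bound_ca.shape:
        raise ValueError("coordinate arrays must be aligned to the same shape")
    aligned = kabsch_superpose(unbound_ca, bound_ca)
    per_residue = np.linalg.norm(aligned - bound_ca, axis=1)
    rmsd = float(np.sqrt(np.mean(per_residue ** 2)))
    return per_residue, rmsd
```

With one Cα atom per residue, the "per-residue RMSD" reduces to the distance between matched Cα atoms after superposition, which is what `per_residue` holds.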

Local Distance Difference Test (LDDT) is a superimposition-free score that estimates local distance differences in a model relative to a reference structure.⁴⁰ Unlike the Global Distance Test (GDT)⁴⁶ score, which is based on rigid-body superimposition, the LDDT score measures how well the local interactions of the reference are conserved in the protein model. For every residue, it computes the distances between all pairs of atoms D(i, j) in both the model and the reference structure (bound) that lie within a threshold (defined as the inclusion radius, generally set to 10 Å). For each pairwise distance, if the difference between its values in the model and in the reference is within a tolerance threshold, the distance is considered conserved, and the fraction of conserved distances is calculated. The final LDDT score is the average of this fraction for the tolerances of 0.5, 1, 2, and 4 Å.
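The procedure described above can be sketched for Cα atoms in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code used in this work: the function name is hypothetical, the default inclusion radius is the 10 Å quoted in the text, pairs are selected by their distance in the reference structure, and residues with no neighbor inside the radius receive a score of 0.

```python
import numpy as np


def ca_lddt(pred_ca, true_ca, inclusion_radius=10.0, tolerances=(0.5, 1.0, 2.0, 4.0)):
    """Return (overall LDDT, per-residue LDDT) computed on Calpha coordinates."""
    pred_ca = np.asarray(pred_ca, dtype=float)
    true_ca = np.asarray(true_ca, dtype=float)
    n = len(true_ca)
    # Pairwise Calpha distance matrices for reference and model.
    d_true = np.linalg.norm(true_ca[:, None, :] - true_ca[None, :, :], axis=-1)
    d_pred = np.linalg.norm(pred_ca[:, None, :] - pred_ca[None, :, :], axis=-1)
    # Score only pairs inside the inclusion radius in the reference, excluding i == j.
    to_score = (d_true < inclusion_radius) & ~np.eye(n, dtype=bool)
    delta = np.abs(d_true - d_pred)
    # Fraction of tolerances under which each pairwise distance is conserved.
    score = np.mean([delta < t for t in tolerances], axis=0)
    counts = to_score.sum(axis=1)
    per_residue = (to_score * score).sum(axis=1) / np.maximum(counts, 1)
    total = to_score.sum()
    overall = float((to_score * score).sum() / total) if total > 0 else 0.0
    return overall, per_residue
```

Because only internal distances are compared, no superposition of the model onto the reference is needed, which is what makes the score superimposition-free.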

For a protein structure with N residues, the overall LDDT score can be given as follows:

$$\mathrm{LDDT} = \mathrm{norm} \times \sum_{i,j} \mathrm{dists\_to\_score}(i,j)\,\mathrm{score}(i,j) \qquad [1]$$

where norm is the normalization factor

$$\mathrm{norm} = \frac{1}{\sum_{i,j} \mathrm{dists\_to\_score}(i,j)} \qquad [2]$$

and score(i, j) is the LDDT score for the residue i with respect to every other residue j:

$$\mathrm{score}(i,j) = \frac{1}{4} \sum_{t \in \{0.5,\,1,\,2,\,4\}} \mathbb{1}\left[\Delta D(i,j) < t\right] \qquad [3]$$

∆D(i, j) denotes the absolute difference between D_true(i, j) and D_predicted(i, j), calculated as follows:

$$\Delta D(i,j) = \left| D_{\mathrm{true}}(i,j) - D_{\mathrm{predicted}}(i,j) \right| \qquad [4]$$

where D_true(i, j) and D_predicted(i, j) denote the distances between the Cα coordinates of the i-th residue and the j-th residue for the true (reference) and predicted (model) structures, respectively. Let x_i^k and y_i^k represent the k-th coordinate of the Cα atom in the i-th residue in the reference (true) structure and the predicted structure, respectively, such that:

$$D_{\mathrm{true}}(i,j) = \sqrt{\sum_{k=1}^{3} \left(x_i^k - x_j^k\right)^2}, \qquad D_{\mathrm{predicted}}(i,j) = \sqrt{\sum_{k=1}^{3} \left(y_i^k - y_j^k\right)^2} \qquad [5]$$

Finally, the distances to score (dists_to_score(i, j)) are computed as those pairwise distances within an [...]

Using AlphaFold2 as a structural module, we built a pipeline for protein-protein docking to better predict protein complex structures with higher accuracy. As illustrated in Fig. 5, given a sequence of a protein complex, we use the ColabFold implementation of AF2-multimer to obtain a predictive template. [...]

[...] (BalancedKIC) and Sidechain. The sampling weights are biased such that backbone and side-chain movers are weighted higher than rigid-body moves (3:1 weightage for backbone:rigid-body moves). We perform directed backbone sampling by focusing on predicted mobile residues (per-residue pLDDT < 80). This is automated with the BFactorResidueSelector, which selects contiguous sets of residues below the specified pLDDT threshold.

However, unlike the induced-fit strategy in ReplicaDock,² we direct backbone sampling only to the mobile residues (per-residue pLDDT < 80) identified from the AlphaFold model. As in the prior section, we automate this using the BFactorResidueSelector to select contiguous sets of residues below the specified pLDDT threshold. This residue subset is passed along to the backbone movers to sample backbone moves along with small rigid-body moves. Sampled decoys are then refined, i.e., they undergo side-chain packing and minimization, to output docked decoys. The best-ranked decoys based on interface scores are then identified as the top-scoring structures.
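The selection of contiguous low-confidence stretches can be illustrated in plain Python. This sketch is independent of Rosetta and does not reproduce the BFactorResidueSelector: the function name, the `min_len` parameter, and the 1-based inclusive output convention are assumptions made for illustration, and only the pLDDT < 80 cutoff comes from the text.

```python
from typing import List, Sequence, Tuple


def mobile_segments(
    plddt: Sequence[float], threshold: float = 80.0, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return contiguous residue ranges whose per-residue pLDDT is below `threshold`.

    Ranges are 1-based and inclusive; `min_len` drops shorter stretches.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # 0-based index where the current low-confidence run began
    for i, value in enumerate(plddt):
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append((start + 1, i))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(plddt) - start >= min_len:
        segments.append((start + 1, len(plddt)))
    return segments


if __name__ == "__main__":
    scores = [95, 90, 70, 65, 88, 79, 60, 50]
    print(mobile_segments(scores))             # [(3, 4), (6, 8)]
    print(mobile_segments(scores, min_len=3))  # [(6, 8)]
```

Each returned range could then be handed to whichever backbone mover is in use. Requiring a minimum length is one way to avoid sampling isolated single-residue dips in confidence.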

Data Availability. The source code for AlphaRED will be available on GitHub before publication (github.com/Graylab/AlphaRED). An online server implementation will be available on the Gray lab ROSIE server shortly (rosie.graylab.jhu.edu).

[...] Johns Hopkins University may be entitled to a portion of revenue received on licensing Rosetta software including some methods described in this paper. JJG has a financial interest in Cyrus Biotechnology. Cyrus Biotechnology distributes the Rosetta software, which may include methods described in this paper. These arrangements have been reviewed and approved by the Johns Hopkins University in accordance with its conflict-of-interest policies.