Top-Down Machine Learning of Coarse-Grained Protein Force Fields

Developing accurate and efficient coarse-grained representations of proteins is crucial for understanding their folding, function, and interactions over extended time scales. Our methodology involves simulating proteins with molecular dynamics and utilizing the resulting trajectories to train a neural network potential through differentiable trajectory reweighting. Remarkably, this method requires only the native conformation of proteins, eliminating the need for labeled data derived from extensive simulations or memory-intensive end-to-end differentiable simulations. Once trained, the model can be employed to run parallel molecular dynamics simulations and sample folding events for proteins both within and beyond the training distribution, showcasing its extrapolation capabilities. By applying Markov state models, native-like conformations of the simulated proteins can be predicted from the coarse-grained simulations. Owing to its theoretical transferability and ability to use solely experimental static structures as training data, we anticipate that this approach will prove advantageous for developing new protein force fields and further advancing the study of protein dynamics, folding, and interactions.


Introduction
However, conventional MD methods face limitations in their applicability to these processes, mainly due to the high computational costs and the long timescales involved. [19] Early efforts in the development of coarse-grained (CG) protein potentials laid the foundation for knowledge-based (KB) protein models. Early KB works 20,21 made significant progress in creating statistical potentials from Protein Data Bank (PDB) structures. Subsequent KB studies [22][23][24][25][26] furthered this field by designing CG potentials to stabilize PDB structures. Alternatively, bottom-up approaches for parameterizing a CG model from PDB conformations have also been utilized. These methodologies postulate that PDB conformations dominate the equilibrium ensemble and so can be used to determine transferable interaction potentials for CG protein models, with statistical physics approaches that treat many-body structural correlations, 27 or by maximizing the likelihood that the CG model samples PDB conformations, as demonstrated in the Bayesian approach 28 or with the relative entropy approach. [29][31][32][33]

A particularly promising application of machine learning in MD is the development of neural network potentials (NNPs). [34][38][39][40] The development of CG machine-learned NNPs for proteins generally adopts a bottom-up approach, which attempts to reproduce the reference fine-grained statistics. Initial studies focused on directly learning potentials from high-fidelity simulations through variational force-matching, 32,37,39 while more recent research has explored data-efficient strategies, such as flow-matching 41 and denoising diffusion probabilistic models. 42 This bottom-up approach has the advantage of preserving the thermodynamics of the projected degrees of freedom. However, it also necessitates a large quantity of all-atom 3D conformations, and their corresponding energies and forces, sampled from the equilibrium distribution for training the machine learning model. This requirement can be computationally expensive and may result in poor extrapolation in regions of conformational and sequence space where data is scarce. 43

On the other hand, an alternative approach for learning potential energy functions has been demonstrated by Greener et al. 44 through the application of end-to-end differentiable molecular simulations (DMS). However, this method faces challenges when applied to medium-to-large proteins, as it requires a significant amount of memory to store all simulation operations during the forward pass, which are then used in the backward pass. This can lead to exploding gradients, causing instability during training as the accumulated gradients result in large updates to the neural network weights. To address this issue, the differentiable trajectory reweighting (DiffTre) method was developed and applied to learn NNPs for atomistic systems in a more memory-efficient manner. 45 By bypassing differentiation through the entire MD trajectory for time-independent observables, DiffTre enables the learning of top-down coarse-grained potentials.
In this work, we build upon the advantages of DiffTre and demonstrate its applicability for training NNPs of coarse-grained proteins using experimental structures and short differentiable MD trajectories from various proteins. Our approach allows us to train an NNP using differentiable trajectories and uncorrelated states while circumventing the need to save all simulation operations for the backward pass. As a result, our approach is considerably less memory-intensive while retaining the performance of DMS. We apply the proposed methodology to learn two distinct NNPs: one for 12 fast-folding proteins (FF-NNP), which can be utilized to recover their native conformations, and a general NNP (G-NNP) trained with a much larger dataset of computationally solved structures, which demonstrates the capability of extrapolating and folding proteins outside of the training set.

Coarse-Grained Molecular Dynamics
The design of a coarse-grained model starts with the definition of the variables that should be preserved after the dimensionality reduction. In this study, to reduce the dimensionality of a system $R \in \mathbb{R}^{3N}$, we establish a linear mapping $\Xi : \mathbb{R}^{3N} \to \mathbb{R}^{3n}$, projecting it onto a lower-dimensional representation $r \in \mathbb{R}^{3n}$. We transform the all-atom representation into a $C_\alpha$-atom representation, with the retained atoms referred to as coarse-grained "beads". Each $C_\alpha$ bead is assigned a bead type based on its amino acid type, resulting in 21 unique bead types, identified by distinct integers.

Molecular dynamics simulations are employed for training and testing the NNP. We utilize TorchMD, 38 a fully differentiable MD package written in PyTorch, 46 enabling the execution of multiple parallel trajectories. To confine the space explored by the dynamics and to incorporate physical information, we apply a prior energy function $U_\lambda(r, \phi)$. TorchMD integrates the prior energy terms and the NNP to compute the total potential energy. Consequently, the potential energy function is decomposed into a prior potential $U_\lambda$ and a neural network potential $U_\theta$, with the total potential energy given by

$$U_{\lambda,\theta}(r) = U_\lambda(r) + U_\theta(r), \tag{1}$$

where $U_\theta$ represents the potential energy derived from the NNP, parameterized by parameters $\theta$. The network is a graph neural network with a SchNet-based 47 architecture, a continuous-filter convolutional neural network capable of modeling atomic interactions. The implementation, available in the TorchMD-NET package, 40 is defined by Majewski et al. 39

The prior energy $U_\lambda$ is parameterized by constant parameters $\lambda$ and can be decomposed into three terms: a pairwise bonded term to prevent chain breaking, a nonbonded repulsive term to avoid the collapse of the structure into overlapping beads, and a dihedral term to enforce chirality in the system. The Supporting Information provides a detailed description of the prior energy terms. The total prior energy of the system is then expressed as

$$U_\lambda(r) = U_{\text{bonded}}(r) + U_{\text{repulsive}}(r) + U_{\text{dihedral}}(r). \tag{2}$$

To compute the forces, TorchMD evaluates the forces from the priors analytically and obtains the NNP forces with an autograd PyTorch call on the energy term computed with the NNP.
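To make this decomposition concrete, the following minimal PyTorch sketch evaluates a total energy of the form of Equation (1) and obtains forces by automatic differentiation. The prior and network terms here are toy stand-ins of our own naming, not the actual TorchMD or TorchMD-NET implementations.

```python
import torch

def total_energy(pos, prior_fn, nnp_fn):
    """Evaluate U_{lambda,theta}(r) = U_lambda(r) + U_theta(r) for one conformation.

    pos      : (n_beads, 3) tensor of CG bead coordinates
    prior_fn : callable returning the prior energy U_lambda (scalar tensor)
    nnp_fn   : callable returning the NNP energy U_theta (scalar tensor)
    """
    return prior_fn(pos) + nnp_fn(pos)

def forces(pos, prior_fn, nnp_fn):
    """Forces are the negative gradient of the total energy w.r.t. coordinates."""
    pos = pos.detach().requires_grad_(True)
    energy = total_energy(pos, prior_fn, nnp_fn)
    # autograd call analogous to the one TorchMD performs on the NNP energy term
    grad, = torch.autograd.grad(energy, pos)
    return -grad

# Toy stand-ins: a harmonic bonded prior between consecutive beads and a dummy NNP.
def harmonic_prior(pos, k=10.0, r0=3.8):
    d = (pos[1:] - pos[:-1]).norm(dim=-1)  # consecutive C-alpha distances
    return 0.5 * k * ((d - r0) ** 2).sum()

dummy_nnp = lambda pos: 0.01 * (pos ** 2).sum()  # placeholder for U_theta

pos = torch.randn(10, 3) * 5.0
print(forces(pos, harmonic_prior, dummy_nnp).shape)  # torch.Size([10, 3])
```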

Differentiable Trajectory Reweighting
We implement a version of DiffTre 45 in PyTorch to train the NNP, facilitating parallel and distributed training across multiple GPUs and nodes. The package is modular, allowing training for any molecular system and using any experimental observable as a training objective. DiffTre is used to match $K$ outputs of molecular dynamics simulations to experimental observables. In this study, we focus on the folding of coarse-grained proteins, using conformations $r_0 \in \mathbb{R}^{3n}$ of proteins in their native states as experimental observables, where $n$ denotes the number of beads in the system.
In the context of our work, a state, denoted $S_i$, represents a specific configuration of the system at a given time in our simulations. More specifically, $S_i$ is a multidimensional entity that encapsulates both the spatial coordinates $r_i \in \mathbb{R}^{3n}$ of the system and its potential energy $U_{\lambda,\theta,i} \in \mathbb{R}$.
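As a minimal illustration (the container and field names are ours, not the package's), such a state can be represented as:

```python
from dataclasses import dataclass
import torch

@dataclass
class State:
    """One uncorrelated state S_i sampled from a CG trajectory."""
    coords: torch.Tensor  # r_i in R^{3n}, stored as an (n_beads, 3) tensor
    energy: torch.Tensor  # U_{lambda,theta,i}, a scalar evaluated with the
                          # parameters that generated the trajectory
```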
As illustrated in Algorithm 1, we simulate multiple trajectories of length $N$ in parallel, sampling $K$ uncorrelated states $S_i$ from each trajectory $\{S_i\}_{i=1}^{N}$. For each state, we compute the root-mean-square deviation (RMSD) between the state's coordinates and the native conformation coordinates. Subsequently, the weighted RMSD ensemble average can be calculated as

$$\langle \mathrm{RMSD} \rangle = \sum_{i=1}^{K} w_i \, \mathrm{RMSD}(r_i, r_0), \tag{3}$$

with weights

$$w_i = \frac{\exp\left(-\beta \left(U_{\lambda,\theta}(r_i) - U_{\lambda,\hat{\theta}}(r_i)\right)\right)}{\sum_{j=1}^{K} \exp\left(-\beta \left(U_{\lambda,\theta}(r_j) - U_{\lambda,\hat{\theta}}(r_j)\right)\right)}, \tag{4}$$

where $U_{\lambda,\hat{\theta}}$ denotes the potential energy calculated with the reference parameters $\hat{\theta}$ that generated the trajectory, $U_{\lambda,\theta}$ represents the potential energy calculated with the parameters to be updated, and $\beta = 1/(k_B T)$, with $k_B$ being the Boltzmann constant and $T$ the temperature. Note that before the first backward pass $\theta = \hat{\theta}$, and thus $w_i = 1/K$. The algorithm's objective is to minimize a loss function $L_\theta$, which in turn minimizes the ensemble average of the RMSD. During optimization, the ensemble-average RMSD between states sampled from short MD simulations and the native conformation of each protein is minimized. To avoid overfitting on proteins that are easier for the network to optimize, we employ a margin-ranking loss

$$L_\theta = \frac{1}{M} \sum_{j=1}^{M} \max\left(0, \langle \mathrm{RMSD} \rangle_j + m\right), \tag{5}$$

where $M$ represents the mini-batch size and $m$ denotes the margin. For example, when the margin is set to $-1$ Å, the loss vanishes whenever the RMSD ensemble average is lower than 1 Å, and the network parameters are not updated. Training is considered to have converged when the training loss remains constant within a specified error range and further optimization is unlikely to yield significant improvements.
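In code, the reweighting of Equations (3)-(4) and the margin loss of Equation (5) amount to a softmax over energy differences followed by a clamped average. The sketch below is our own minimal rendering, assuming per-state RMSDs and energies are already computed; the reference energies are detached so that gradients flow only through the energies under the current parameters θ.

```python
import torch

def difftre_loss(rmsd, u_new, u_ref, beta, margin):
    """Margin loss on the reweighted RMSD ensemble average (Eqs. 3-5).

    rmsd   : (K,) RMSD of each sampled state to the native conformation
    u_new  : (K,) potential energies under the parameters being updated (requires grad)
    u_ref  : (K,) potential energies under the reference parameters
    beta   : 1 / (k_B T)
    margin : m, e.g. -1.0 A, so the loss vanishes once <RMSD> < 1 A
    """
    log_w = -beta * (u_new - u_ref.detach())
    w = torch.softmax(log_w, dim=0)          # Eq. (4); w_i = 1/K when u_new == u_ref
    rmsd_avg = (w * rmsd).sum()              # Eq. (3)
    return torch.clamp(rmsd_avg + margin, min=0.0)  # Eq. (5), single protein

# For a mini-batch of M proteins, average the per-protein losses:
# loss = torch.stack([difftre_loss(...) for p in batch]).mean()
```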

Markov State Models
Markov state models (MSMs) [48][49][50][51][52] are employed to analyze the CG simulations and compare them with their corresponding all-atom simulations. MSMs can describe the entire dynamics of a system by partitioning it into $n$ discrete states. For a system to be Markovian, it must be "memoryless", meaning that future states depend only on the current state. For Markovian systems, such as MD simulations, a transition probability matrix can be constructed, characterized by the $n$ states and the lag time $\tau$ at which the system's state is recorded. From this matrix, state populations and conditional pairwise transition probabilities can be obtained, and the state populations can be converted into free energies.
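The population-to-free-energy conversion mentioned above is the standard Boltzmann inversion: a state with equilibrium probability $\pi_i$ is assigned the free energy

$$F_i = -k_B T \ln \pi_i,$$

so relative free energies follow directly from the MSM stationary distribution.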
In this work, we employ time-lagged independent component analysis (TICA) 53,54 to project the high-dimensional conformational space onto an optimally reduced low-dimensional space. The resulting space is then discretized using K-means clustering for MSM construction. We featurize the simulation data into pairwise $C_\alpha$ distances and use TICA to project the data onto the first four components. For the coarse-grained (CG) data, we adopt the approach presented by Majewski et al., 39 using the covariance matrices of the all-atom molecular dynamics (MD) to project onto the first three components. This method ensures consistency with established methodologies and facilitates further analysis.
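A compact version of this pipeline can be written with the deeptime library (our choice for illustration; the text does not name the MSM software used). The sketch featurizes each trajectory into pairwise Cα distances, projects with TICA, discretizes with K-means, and estimates a reversible MSM at lag time τ.

```python
import numpy as np
from itertools import combinations
from deeptime.decomposition import TICA
from deeptime.clustering import KMeans
from deeptime.markov import TransitionCountEstimator
from deeptime.markov.msm import MaximumLikelihoodMSM

def pairwise_ca_distances(traj):
    """traj: (T, n_beads, 3) array -> (T, n_pairs) feature matrix."""
    i, j = np.array(list(combinations(range(traj.shape[1]), 2))).T
    return np.linalg.norm(traj[:, i] - traj[:, j], axis=-1)

def build_msm(trajs, lag=100, n_tics=4, n_clusters=200):
    feats = [pairwise_ca_distances(t) for t in trajs]
    tica = TICA(lagtime=lag, dim=n_tics).fit(feats).fetch_model()
    tics = [tica.transform(f) for f in feats]
    kmeans = KMeans(n_clusters=n_clusters).fit(np.concatenate(tics)).fetch_model()
    dtrajs = [kmeans.transform(x) for x in tics]
    counts = TransitionCountEstimator(lagtime=lag, count_mode="sliding").fit(dtrajs).fetch_model()
    msm = MaximumLikelihoodMSM(reversible=True).fit(counts).fetch_model()
    return msm, dtrajs  # msm.pcca(n) then coarse-grains into n macrostates
```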
We apply MSMs to the coarse-grained trajectories because they were already available for the all-atom trajectories and because they let us evaluate the shape of the folding basins. However, we have no expectation that the thermodynamics or kinetics of these coarse-grained simulations correspond to those of the original ones, as the training method does not preserve these quantities. The analysis can instead be interpreted as a way to obtain stable states, which we take as predictions for the native structure. By comparing the most probable macrostate to the protein's native conformation, we can evaluate the predictive capabilities of our CG model. For coarse-graining the MSM into macrostates, we apply the Perron Cluster Cluster Analysis 55 (PCCA) algorithm.

Datasets
The first dataset comprises the crystal structures of 12 fast-folding proteins, previously studied by Lindorff-Larsen et al. 56 (using all-atom MD) and Majewski et al. 39 (using machine-learned CG MD). These proteins exhibit a variety of secondary structure elements, including α-helices and β-sheets.
The second dataset, used for training the general model, was created by searching the AlphaFold Database. 57 This dataset contains 14,871 proteins of between 50 and 150 residues, with structures computationally predicted using AlphaFold2. 58 The selected structures are high-quality predictions, with a global predicted local-distance difference test (pLDDT) score higher than 90, and are clustered at 50% sequence identity using Usearch. 59 We also removed any hits with more than 20% sequence identity to the fast-folding proteins used for testing. This approach ensures a diverse representation of protein structures in the dataset, allowing the NNP to generalize effectively to new and previously unseen protein structures.
By incorporating AlphaFold2-predicted structures into our dataset, we significantly increase its size compared to using experimentally solved structures alone.
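The curation criteria translate into a simple per-entry filter. The sketch below uses hypothetical field names; the Usearch clustering and the 20% identity screen against the fast folders are separate steps not shown here.

```python
def keep_structure(entry, min_len=50, max_len=150, min_plddt=90.0):
    """Keep AlphaFold Database entries that are 50-150 residues long and
    confidently predicted (global pLDDT > 90). `entry` is assumed to be a
    dict with "sequence" and "plddt" keys (hypothetical schema)."""
    return min_len <= len(entry["sequence"]) <= max_len and entry["plddt"] > min_plddt
```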

A Neural Network coarse-grained potential learns the structure of fast-folding proteins
We trained the FF-NNP using the dataset of 12 fast-folding proteins. For training, the learning rate was set to ϵ = 0.0001, and the loss function was defined by Equation (5). The simulation temperature was set to 298 K, and the timestep was 5 fs. We employed trajectories of 1024 steps, with 128 states used for reweighting each trajectory. The margin was set to m = −1.0 Å, and the mini-batch size was 12.
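Collected in one place, the FF-NNP training settings read as follows (the dictionary and its key names are our own summary, not the package's configuration format):

```python
ff_nnp_config = {
    "learning_rate": 1e-4,      # epsilon
    "temperature_K": 298,
    "timestep_fs": 5,
    "traj_length_steps": 1024,  # length N of each differentiable trajectory
    "states_per_traj": 128,     # K uncorrelated states used for reweighting
    "margin_A": -1.0,           # m in the margin-ranking loss, Eq. (5)
    "batch_size": 12,           # M, one entry per fast-folding protein
}
```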
Compared to bottom-up approaches, 32,37,39,41,42 our training process did not require generating expensive reference all-atom data beforehand, and the training took only 5 hours on a single NVIDIA GeForce RTX 2080 GPU. Despite this, the model successfully folded most of the proteins and stabilized their native conformations.
To validate the fast folders' NNP, we performed coarse-grained molecular dynamics simulations using the same proteins used for training. In order to ensure a more comprehensive exploration of the conformational space, we took advantage of the parallel processing capabilities of TorchMD to initiate multiple parallel trajectories for each protein. Rather than starting all trajectories from unfolded conformations, which might limit the exploration, we diversified our starting points. We used 32 different conformations as starting points, representing a wide array of distinct points in the conformational space.
These starting conformations were selected to create a wider initial-condition set, promoting a broader exploration of the potential energy landscape. While in a typical experimental setup such a wide range of initial conformations might not be readily available, our computational approach enabled this extended exploration, which we believe is key to a more robust validation of the NNP. Our simulations were conducted with a timestep of 1 fs and a temperature of 298 K, running for a total aggregated time of 64 ns for each protein.

Figure 1: Representative structures sampled from the minimum RMSD macrostate from coarse-grained simulations of the 12 fast-folding proteins. For each protein we show the experimental structure (gray), the selected conformations from the general NNP (blue), and the selected structures from the fast folders NNP (orange). 10 structures are randomly selected from each model's minimum RMSD macrostate, with the minimum RMSD one highlighted and the others shown as transparent shadows.
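The validation runs can be pictured as one batched integration over all starting conformations at once. The sketch below is a toy overdamped Langevin loop of our own, standing in for TorchMD's parallel integrator, with a user-supplied force function such as the prior-plus-NNP forces from the earlier sketch.

```python
import torch

def run_parallel_cg_md(init_coords, force_fn, n_steps, dt=1.0, gamma=0.1, kT=0.593):
    """Toy batched overdamped Langevin integrator for B parallel CG trajectories.

    init_coords : (B, n_beads, 3) tensor, one row per starting conformation
    force_fn    : callable mapping (B, n_beads, 3) -> forces of the same shape
    kT          : ~0.593 kcal/mol at 298 K
    """
    pos = init_coords.clone()
    traj = [pos.clone()]
    for _ in range(n_steps):
        noise = torch.randn_like(pos)
        pos = pos + dt * force_fn(pos) / gamma + (2 * kT * dt / gamma) ** 0.5 * noise
        traj.append(pos.clone())
    return torch.stack(traj)  # (n_steps + 1, B, n_beads, 3)

# 32 diverse starting conformations per protein, simulated in one batch:
# traj = run_parallel_cg_md(starting_points, nnp_plus_prior_forces, n_steps=...)
```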
From the MSM analysis, we selected the minimum average RMSD macrostate for each protein, considering it the native macrostate of the simulation. Table 1 presents the equilibrium probability of this macrostate, along with its mean and minimum RMSDs. The results suggest that the fast folders' NNP simulations successfully recovered the experimental structure for all simulated proteins except λ-repressor, a3d, and Protein B (Table 1). Furthermore, the equilibrium probability of all these native macrostates is high, indicating extensive sampling of the native conformation. Representative conformations from the macrostate are shown in Figure 1. For Protein B and λ-repressor, the lowest RMSD macrostates exhibit high flexibility and do not form any secondary structure. In contrast, for a3d, the secondary structure is recovered, although the tertiary structure is not correctly aligned.

A Neural Network Potential trained on a Large Protein Dataset learns beyond the training data distribution
In the previous section, we demonstrated that a folding NNP can be learned for a small set of proteins. In this section, we present the results of the G-NNP, which is trained on a much larger dataset of around 15 thousand protein structures, enabling us to test the generalization capabilities of our approach. We used a random 90/10% training/validation split, and, as noted above, the training dataset contained no sequences with more than 20% sequence identity to the fast folders. We trained the G-NNP using a learning rate ϵ = 0.0005. Training used trajectories of 100 steps, with each trajectory reweighted using 20 states, and a mini-batch size of 32. Given the abundance of data in the large dataset, we were able to set the margin to m = 0 Å without overfitting. This parameter configuration was found to yield optimal results when training on large datasets.
Similar to the FF-NNP, we initialized multiple parallel trajectories for each protein. We used the same conditions and starting points as in Section 3.1, with the same total aggregated time.
Results from the MSM analysis reveal that for specific proteins, such as Chignolin or BBA, the minimum RMSD macrostate aligns with the native structure, exhibiting an average RMSD of less than 3.0 Å. Moreover, their equilibrium probabilities are 24.5 ± 0.1% for Chignolin and 17.3 ± 0.1% for BBA (Table 1). This evidence suggests that our NNP can effectively generalize when trained on an ample dataset. For the other proteins, the minimum RMSD of the macrostates is consistently lower than 5 Å. As a result, although the macrostates do not precisely match the native structure, near-native conformations are sampled during the simulations with notable probability.
Simulations of Chignolin, BBA, and Homeodomain successfully sampled folding events, even when starting from completely unfolded conformations. The folding events observed in our simulations, illustrated in Figure 2, provide compelling evidence that our G-NNP generalizes beyond the training set, accurately folding proteins not included in it, albeit with limitations in reproducing the thermodynamic landscapes. As illustrated in Figure S3 of the Supporting Information, the free-energy landscapes produced by our models do not reproduce those generated by all-atom MD.
In addition to the aforementioned observations, Figure 1 illustrates that the G-NNP-selected structures exhibit greater variability than those sampled with the FF-NNP. Nevertheless, the G-NNP more accurately recovers the secondary structure elements and overall shape of the proteins (λ-repressor, a3d, and Protein B) where the FF-NNP fell short.

The Neural Network Potential is comparable to other methods in de-novo structure prediction

While AlphaFold2 58 has indeed revolutionized the domain with its ability to predict native state models, it largely depends on multiple sequence alignments (MSAs) during the initial stages of native state prediction. In contrast, our approach does not require such information, positioning it as potentially more versatile, particularly in contexts where obtaining MSAs is a challenge. Additionally, our model holds the potential to predict not only the native structure but also the whole folding process.
In this section, we evaluate the G-NNP's performance in de-novo structure prediction, where experimental structures are not available. With this, we evaluate the capability of our learned CG force field not only to run dynamics but also to fold proteins to their native conformations. For this purpose, we predict structures by selecting the most probable macrostate of the simulations used in the previous section. To benchmark our model (G-NNP) against other methods, we calculated the average root mean square deviation (RMSD) of the most probable macrostate derived from the simulations. The results are presented in Table 2. We compared our findings with two web servers employing coarse-grained methods for protein folding, UNRES 60 and CABS-fold, 61 as well as the only other method utilizing differentiable molecular simulations to learn coarse-grained parameters, DMS. 44 For the CABS-fold method, as the server was not operational during our analysis, the results for Chignolin, Trp-cage, BBA, and Villin were obtained from the paper by Greener et al. 44 For DMS, we ran the predictions using the same settings reported in their paper, and for the UNRES method, models were generated on their web server using the parameters provided in the MREMD structure prediction example from their tutorial.
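The prediction step itself is straightforward once the MSM and its PCCA macrostates are available: sum the stationary distribution over each macrostate, pick the most probable one, and report the mean RMSD of its member frames to the experimental structure. A minimal sketch with our own array naming:

```python
import numpy as np

def predict_native_macrostate(stationary_dist, macro_assignment, frame_rmsd, frame_macro):
    """Select the most probable macrostate and score it against the native structure.

    stationary_dist  : (n_micro,) MSM stationary distribution
    macro_assignment : (n_micro,) macrostate index of each microstate (from PCCA)
    frame_rmsd       : (n_frames,) RMSD of each frame to the experimental structure
    frame_macro      : (n_frames,) macrostate index of each frame
    """
    n_macro = macro_assignment.max() + 1
    macro_prob = np.array([stationary_dist[macro_assignment == k].sum()
                           for k in range(n_macro)])
    best = int(macro_prob.argmax())
    return best, frame_rmsd[frame_macro == best].mean()
```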
Our general model produced results comparable to other models that use coarse-grained simulations for predicting folded protein conformations. However, CABS-fold and UNRES employ replica exchange algorithms to enhance sampling. Additionally, the DMS method used an initial guess of the secondary structure as a starting point for its simulations, which may affect the comparability of the results. Nonetheless, we want to emphasize that our neural network model, trained from scratch on experimental structures, achieves results similar to those of more sophisticated, pre-existing, manually crafted methods, or to DMS, which is more memory- and time-intensive.
Another aspect of our method, shared by the methods we used as benchmarks (UNRES, CABS-fold, and DMS), is the capacity to illustrate not only the final conformation, as current protein structure prediction methods do, 58,62,63 but also the pathway the protein traverses toward it. This could provide a more comprehensive understanding of protein dynamics and, in combination with additional reference data, could eventually allow predicting both structure and folding pathways.
It is worth noting that our current model does not fully reproduce the entire conformational landscape at this stage. Despite this, we envision our method as a significant stepping stone toward future advancements in the field. Looking ahead, we see the potential of this approach to serve as a pre-training stage, trained on large numbers of proteins to capture the relevant information. Subsequently, it could be combined with active learning strategies to learn the exact forces in sampled conformations, thus more accurately mirroring the thermodynamics, or used as a foundation model that can be fine-tuned for specific downstream tasks.

Conclusions
In this study, we have effectively extended the application of the differentiable trajectory reweighting algorithm to the parameterization of neural network-based protein force fields. We developed a fast-folders neural network potential (NNP) using 12 proteins, highlighting its ability to fold and stabilize the native conformations of proteins within the training data distribution. Furthermore, we constructed a general NNP and showed that the learned potential can generalize outside of the training distribution and predict the folded macrostates of proteins with accuracy similar to existing classical coarse-grained methods. Remarkably, the general NNP, while only trained to maintain the native structure, demonstrated the capability to fold some proteins whose sequences were not present in the training set, starting from entirely unfolded conformations.
We demonstrated that neural network potentials (NNPs) can be trained in a top-down manner, removing the need for expensive reference calculations or memory-intensive end-to-end differentiable simulations when addressing the protein folding problem. While our current results do not encompass the entire protein folding process, including kinetics and thermodynamics, we are optimistic that future enhancements to our approach, in conjunction with bottom-up methodologies, will enable NNPs to achieve superior accuracy and faster inference times compared to current techniques. Future research may involve integrating our method with labeled data from extensive simulations to create a model capable of accurately predicting protein folding behavior through coarse-grained molecular dynamics simulations.

Data and code availability
Code, models, prior parameters, and all the data are freely available at github.com/compsciencelab/torchmd-exp.

Associated Content

Supporting Information available. The SI provides detailed mathematical formulations for the prior energy terms used in the model, as well as the architecture and equations for the graph neural networks. It also includes training curves for the different potentials (Figures S1-S2), a comparative analysis of the free energy landscape across initial TICA dimensions (Figure S3), and a table listing the sequences of the fast-folding proteins studied (Table S1). Additional tables outline the hyperparameters used for neural network training (Table S2) and a comparison of MD simulation speeds for different proteins using different NNPs (Table S3).
The bead features are updated as

$$x_i^{(l+1)} = \mathrm{Aggr}_{j}\left( x_j^{(l)} \odot W^C\left(\lVert r_i - r_j \rVert\right) \right), \qquad l = 1, \dots, N,$$

for $N$ interaction layers, where $W^C$ are continuous filters, generated by expanding the pairwise distance between beads into a set of radial basis functions, and Aggr is an aggregation function that reduces the convolution output; in our case we choose the sum as the aggregation method. Finally, the graph-level feature $U$ is computed, which in our case corresponds to the potential energy of the protein and can be used to compute the forces acting on each bead with an autograd call with respect to the Cartesian coordinates.
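For concreteness, a single continuous-filter convolution of this kind can be sketched in PyTorch as below: a minimal stand-in following the SchNet construction described above, not the TorchMD-NET implementation, with arbitrary layer sizes.

```python
import torch
import torch.nn as nn

class CFConv(nn.Module):
    """One SchNet-style continuous-filter convolution (interaction) layer."""
    def __init__(self, n_features=64, n_rbf=32, cutoff=12.0, gamma=10.0):
        super().__init__()
        # Centers of the radial basis functions used to expand pairwise distances
        self.register_buffer("centers", torch.linspace(0.0, cutoff, n_rbf))
        self.gamma = gamma
        # Filter-generating network W^C: RBF-expanded distance -> per-feature filter
        self.filter_net = nn.Sequential(
            nn.Linear(n_rbf, n_features), nn.Tanh(), nn.Linear(n_features, n_features)
        )

    def forward(self, x, pos):
        # x: (n_beads, n_features) bead features; pos: (n_beads, 3) coordinates
        d = torch.cdist(pos, pos)                                             # pairwise distances
        rbf = torch.exp(-self.gamma * (d.unsqueeze(-1) - self.centers) ** 2)  # (n, n, n_rbf)
        W = self.filter_net(rbf)                                              # (n, n, n_features)
        # Element-wise filter applied to neighbor features, reduced by sum aggregation
        return (x.unsqueeze(0) * W).sum(dim=1)                                # (n, n_features)
```

Stacking $N$ such layers and summing a per-bead readout of the final features then yields the graph-level energy $U$.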

Figure 2: Three individual CG trajectories selected from MD of Chignolin, BBA, and Homeodomain. Each trajectory is visualized using colors ranging from purple to yellow. Each simulation started from an extended conformation and sampled the native structure. Top panels: minimum RMSD conformation of the trajectory (blue) aligned with the experimental structure (gray) for Chignolin, BBA, and Homeodomain. Middle panels: Cα RMSD of the trajectory with reference to the experimental structure. Bottom panels: 100 states sampled uniformly from the trajectory plotted over the CG free energy surface, projected onto the first two time-lagged independent components (TICs). The red line indicates the all-atom equilibrium density by showing the energy level 7.5 kcal/mol above the free energy minimum.

Figure S2: Training curve for the general NNP.

Table 1: Minimum average RMSD macrostate statistics from the MSMs built with CG simulations using the fast-folders NNP, the general NNP, and all-atom MD. The data show the average and minimum RMSD of the macrostate, as well as its equilibrium probability in percentage (Macro prob.) with standard deviation. Proteins with mean RMSD < 3.0 Å are shown in bold.

Table 2: Comparison of Cα RMSDs (Å) of the structures of the 12 fast-folding proteins predicted with our model and with different coarse-graining models. For our method (G-NNP) we show the mean RMSD of the most probable macrostate.

Table S3: Comparison of MD simulation speeds (ns/day) of the FF-NNP and G-NNP. The results were obtained with an NVIDIA GTX 1080 for the FF-NNP and an NVIDIA GeForce RTX 4090 for the G-NNP.