Resolution-adapted recombination of structural features significantly improves sampling in restraint-guided structure calculation

Recent work has shown that NMR structures can be determined by integrating sparse NMR data with structure prediction methods such as Rosetta. The experimental data serve to guide the search for the lowest energy state towards the deep minimum at the native state which is frequently missed in Rosetta de novo structure calculations. However, as the protein size increases, sampling again becomes limiting; for example, the standard Rosetta protocol involving Monte Carlo fragment insertion starting from an extended chain fails to converge for proteins over 150 amino acids even with guidance from chemical shifts (CS-Rosetta) and other NMR data. The primary limitation of this protocol—that every folding trajectory is completely independent of every other—was recently overcome with the development of a new approach involving resolution-adapted structural recombination (RASREC). Here we describe the RASREC approach in detail and compare it to standard CS-Rosetta. We show that the improved sampling of RASREC is essential in obtaining accurate structures over a benchmark set of 11 proteins in the 15-25 kDa size range using chemical shifts, backbone RDCs and HN-HN NOE data; in a number of cases the improved sampling methodology makes a larger contribution than incorporation of additional experimental data. Experimental data are invaluable for guiding sampling to the vicinity of the global energy minimum, but for larger proteins, the standard Rosetta fold-from-extended-chain protocol does not converge on the native minimum even with experimental data and the more powerful RASREC approach is necessary to converge to accurate solutions.

Incoming MPI messages are processed in process_message(), messages are NEW_JOB_ID, BAD_INPUT, and JOB_SUCCESS. Slave jobs ask for a new job--and batch--ID. The jobID is an index into the job--list, the batchID identifies the job--list to be used. If no more jobs are available the slave receives a NULL job--ID that causes him to spin down.
Batches are a set of jobs that are generated by a common set of flags. Usually all batches are known from the start. They can be given to the command--line by run:batches flags1 flags2 flags3 etc.

ARCHIVEJOBDISTRIBUTOR
In iterative mode batches are not known at start time, but are generated based on structural analysis of incoming decoys. To allow dynamic additions to the batch--queue the MpiFileBufJobDistributor is extended to the MPIArchiveJobDistributor. A BATCH_SYNC message is issued by a slave process that received a job with batchID larger than the slave's nr_batches(). Subsequently, batchID-nr_batches() ADD_BATCH messages are sent to update the slaves batch--list. Newly generated batches are added to the master batch queue by the JobDistributor upon receipt of an ADD_BATCH message from the ArchiveManager--process. Old but still unprocessed batches can be obsoleted from the ArchiveManager by sending a CANCEL_BATCH message to the Jobdistributor. No more jobs from a cancelled batch are issued and incoming decoys are not passed on to the ArchiveManager. This is mainly used when switching to a new resampling stage.
When the batch--queue is empty the ArchiveManager is prompted for additional work before taking the default action of spinning down. To this end, the virtual batch_underflow() method has been overridden to send a QUEUE_EMPTY message to the ArchiveManager.

FILEBUF
All work processes that work on the same batch write their results into the same set of files. Parallelization is now compeltely hidden from work--processes and setup, i.e., no process-dependent filenames, such as result_0001, result_0002, etc, are necessary. File output in Rosetta is generally handled via the ozstream class, as a requirement to be able to build the the Rosetta@home executable. To allow parallel output for all Rosetta protocols without changes in protocol source code, we simply replace the underlying streambuf of the ozstream object. The replacement with mpi_stream_buf (mpistream.hh) is triggered by calling the static method enable_MPI_reroute() (done in ::go method of MPIXXXJobDistributor).
File--operations such as open, close, write and flush are now related to the dedicated IO--process that runs an intance of MPIFileBuffer. Each file opened by a slave process creates a corresponding buffer in MPIFileBuffer, whose contents is only written to disk after flush() or close() is called on the mpi_stream_buf. Typical behavior for ROSETTA protocols is to open and close a file for each decoy generated. Thus, srambeling of lines between, e.g., silent--IO frames coming from different processes, is avoided by the unique slave--process dependent buffers.
For communication between mpi_stream_buf and MPIFileBuffer the following messages have been created: MPI_STREAM_OPEN, MPI_STREAM_SEND, MPI_STREAM_CLOSE, and MPI_STREAM_FLUSH. Messages MPI_BLOCK_FILE and MPI_RELEASE_FILE are used in iterative mode to avoid race--conditions when decoys are read by the master process (ArchiveManager) while slave--processes are writing into the same file. Due to the buffered output slave--process continue writing despite the blocking, however, finished frames (flush/close) will not be written from the internal buffer to to the disk.

a. ArchiveManager
An instance of ArchiveManager is the top--level controller in iterative mode. This class encapsulates all basic infrastructure work that is common to any iterative protocol while high-level details, i.e., decision making, is handed over to an instance of an ArchiveBase. The most important interface methods of ArchiveBase called by the ArchiveManager are generate_batches() and read_structures(). The former is called when the QUEUE_EMPTY message has been received from the Jobdistributor (see above) and the latter method is called when new decoys are available in one of the batches output files.
When the ArchiveManager receives JOB_COMPLETION messages issued by the job--distributor it reads the respective output file and compares decoys--tag to figure out which decoys in the file have not been read into the archive. The decoys with new tags are read and forwarded to the the Archive using read_structures(). Additionally, the Manager oversees storing of recovery information and restart, the cancelling of batches and transmission of a stop--signal to the Jobdistributor if no new batches are generated.
For each new batch a directory batch_nnnnn is created where nnnn represents the number of the batch. This directory contains flags and setup files, and takes all output--files. The directory contains a file BATCH_INFO which is maintained by the ArchiveManager to store status information regarding that batch in case that a run has to be restarted. The class Batch provides filenames for the typical files (flags, setup--files, output--decoys) and represents the BATCH_INFO information in memory.

b. The ArchiveBase Hierarchy b.i. ArchiveBase
The ArchiveBase provides an interface for a general controller of iterative protocols that maintain a pool of structures as common feature. It provides methods read_structures() to add fresh decoys to the archive, and generate_batches() to issue new (resampling) work. Moreover, the interface provides hooks to safe and restore the archive content and status.

b.ii. EvaluatedArchive
For RASREC, the decision which structures to keep, is based on a weighted sum of scores. This decision is handled by the EvaluatedArchive subclass. Any iterative protocol that is based on a pool of structures that are selected via a weighted some of energies and other metrics would be derived from this class. To give a negative example, a Pareto selection scheme, for instance, would have to implement a different selection scheme, and would not be derived from this class.
A clean modular design would compute all energies and metrics of incoming decoys when they arrive. However, on the BG/P system with relatively slow single processors and 2048 slave processors rescoring of incoming structures was too expensive. Benchmarking revealed that not the rescoring, as one would expect, but the generation of poses out of silent--file frames was causing the bottleneck. To avoid this problem a non--local evaluation is toggled by set_evaluate_local( false ) such that finished decoys have to be evaluated from the generated slave--job before submitted to file. The compound--score is stored as score_final in the silent--file. Obviously, with decoys coming from different batches care has to be taken that the values in score_final are consistent. This is really against good software engineering practice and caused all the commonly expected problems. Much later we found a way to make pose--generation out of silent--frames much faster (as implemented in current svn), such that the non--local evaluation might be obsolete. This design--decision is worth revisiting before further method--development.
Individual metrics or energies can be computed from each incoming decoy using instances of evaluation::PoseEvaluator and an instance of ScoreFunction. The weighted sum is cached as _archive_select_score_ in the silent--struct of each archived decoy. Only when weights are changed or score--terms added, these values get re--evaluated. All necessary book--keeping for the chaching mechanism is handled by EvaluatedArchive.

b.iii. IterativeBase (RASREC_Base)
The logic of the RASREC protocol is implemented in classes IterativeCentroid (stages I--IV) and IterativeFullatom( stages V--VI ), both classes are derived from IterativeBase. A wrapper class called IterativeAbrelax bundles IterativeCentroid and IterativeFullatom into a single object. Most of the functionality is implemented in the base class.
The class overloads generate_batch() to issue new batches based on its pool of decoys ( accessed with decoys() ).
Methods to setup input files for the batches are called gen_XXX and are invoked in stage-dependent manner from generate_batch(). These methods are gen_enumerate_pairings(I--II), gen_resample_topologies(II--III), gen_resample_stage2(IV), gen_resample_fragments(IV--VI), gen_cen2fullatom(V), gen_cen2fullatom_non_pool_decoys(V) in IterativeBase and gen_resample_core(VI) in IterativeFullatom. When non_local evaluation is activated the method gen_evaluation_output is called for each batch to add the computation of the scores and energies used for decoy selection. These functions are called with an instance of Batch as the only parameter, the generating functions add specific flags to the flag--file provided by Batch and add special input--files to the directory provided by Batch.
In stage V half of of the generated batches start an all--atom refinement from the archived structures. The other 50% of stage V--generated decoys are started from randomly selected decoys from all previous batches. This hedging yields a broader sampling of the all--atom energy function and thus ensures that a converged full--atom ensemble is only found if its all--atom energies are indeed much lower than those of surrounding conformational space.

NMR RESTRAINTS FOR BENCHMARK
Experimental chemical shift data was used in all calculations. Experimental or simulated RDC and NOE data was used as indicated in Table 1 of the main text. Details on restraint generation are given below. Experimental NOE data were used for proteins ER541, BtR324B, ARF1, and ALG13, and for 1i1b, 1i1b 2 and 2z2i NOE restraints were simulated.

SIMULATED RDC RESTRAINTS
Simulated RDCs were calculated from an ensemble of 20 structures obtained by relaxing the native structure within the Rosetta all-atom energy and adding Gaussian noise of 0.3Hz amplitude relative to a typical tensor amplitude of Da=10Hz . The structural variation in the RDC generating ensemble renders the simulated data more realistic: even before noise is added no single structure can be fitted to the RDCs with Q < 10%; the native structure fits the resulting RDCs with a Qfactor of 17-30%. The noise amplitude of 0.3Hz used here might be unrealistically small in the context of the larger structures and noise estimates around +/-1 or even +/-2 Hz might be more realistic. In practice, however, the magnitude of the added noise is relatively unimportant when used within a quadratic restraint potential. Thus the structural noise is far more detrimental to structure determination.

a. Experimental
For 2jyx and 2kd7 distance restraints based on experimental NOEs were taken from the published data set (BtR324B, PDB id: 2kd7; ER541, PDB id: 2jyx and ARF1, PDB id: 2k5u). To determine which of those published NOE cross-peaks would be unambiguously and automatically assignable from a 4D NOESY we filtered for HN-HN peaks that were separated from each other by more than the typical tolerance used in automated NOESY assignment for the (1H direct/1H indirect/15N) dimensions (0.03/0.05/0.4 ppm). We have deliberately chosen a more conservative set of tolerances than the usually employed (0.02/0.05/0.2ppm) to make sure that structure calculation from the generated set of sparse NOEs is not easier than during a blind calculation.

b. Simulated
For 1i1b, 2rn2, 2z2i, 1sua and 1g68 NOE distance constraints were generated from the native structure as follows: for each HN-HN distance < 4 Å we generated a constraint with lower bound of 1.5 Å and upper bound of 6 Å. For protein 1i1b we also generated a set of HN-HN NOEs with a cutoff < 5 Å . The results for this data set are reported as 1i1b2. Normally 5-6 Å is the detection limit in protonated samples. For deuterated samples and long NOE mixing times one can see peaks for amide protons up to 7 Å apart. In analogy to our conservative strategy for ER541 and BtR324B we have deliberately chosen the conservative cutoff of 4Å.