End-to-end learning of multiple sequence alignments with differentiable Smith–Waterman

Abstract

Motivation: Multiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for.

Results: Here, we implement a smooth and differentiable version of the Smith–Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood.

Availability and implementation: Our code and examples are available at: https://github.com/spetti/SMURF.

Supplementary information: Supplementary data are available at Bioinformatics online.


Smooth and differentiable Smith-Waterman
Pairwise sequence alignment can be formulated as the task of finding the highest-scoring path through a directed graph in which edges correspond to an alignment of two particular residues or to a gap. The edge weights are match scores for the corresponding residues or the gap penalty, and the score of a path is the sum of its edge weights. The Smith-Waterman algorithm is a dynamic programming algorithm that returns a path with the maximal score. A smooth version instead finds a probability distribution over paths in which higher-scoring paths are more likely. We describe a smooth Smith-Waterman formulation in which the probability that any pair of residues is aligned can be formulated as a derivative.

Figure 1: The alignment graph for sequences $X = x_1 x_2 x_3$ and $Y = y_1 y_2$. Edge labels describe the corresponding aligned pair, and colors indicate the weights. All red edges start at the source $s$, and all orange edges end at the sink $t$. The bold path corresponds to the alignment of $X$ and $Y$ written on the right.

Figure 1 illustrates an alignment graph. For sequences $x_1, x_2, \ldots, x_{\ell_x}$ and $y_1, y_2, \ldots, y_{\ell_y}$, the vertex set contains grid vertices $v_{ij}$ for $0 \le i \le \ell_x$ and $0 \le j \le \ell_y$, a source $s$, and a sink $t$. The directed edges are defined so that each path from $s$ to $t$ corresponds to a local alignment of the sequences. The table below describes the definitions, meanings, and weights of the edges.

| Edge | Meaning | Weight |
| $v_{i-1,j-1} \to v_{i,j}$ | $x_i$ and $y_j$ are aligned | $x_i, y_j$ alignment score $a_{ij}$ |
| $v_{i,j-1} \to v_{i,j}$ | $y_j$ is aligned with the gap character | gap penalty $g$ |
| $v_{i-1,j} \to v_{i,j}$ | $x_i$ is aligned with the gap character | gap penalty $g$ |
| $s \to v_{i,j}$ | $x_k$ for $k \le i$ and $y_k$ for $k \le j$ are excluded | $0$ |
| $v_{i,j} \to t$ | $x_k$ for $k > i$ and $y_k$ for $k > j$ are excluded | $0$ |

The Smith-Waterman algorithm iteratively computes the highest score of a path ending at each vertex and returns the highest-scoring path ending at $t$. Let $w(u \to v)$ denote the weight of the edge $u \to v$, and let $N^-(v) = \{u \mid u \to v \text{ is an edge}\}$ denote the incoming neighbors of $v$. Let $f(v)$ be the value of the highest-scoring path from $s$ to $v$. Taking $f(s) = 0$, we compute

$$f(v) = \max_{u \in N^-(v)} \big\{ f(u) + w(u \to v) \big\}.$$

For grid vertices this simplifies to

$$f(v_{i,j}) = \max\big\{ f(v_{i-1,j-1}) + a_{ij},\ f(v_{i,j-1}) + g,\ f(v_{i-1,j}) + g,\ 0 \big\}.$$
A path with the highest score is computed by starting at the sink t and tracing backward along the edges that achieve the maxima. (For further explanation see Chapter 2 of Durbin et al. (1998) or Smith and Waterman (1981)).
Following the general differentiable dynamic programming framework introduced in Mensch and Blondel (2018), we implement a smoothed version of Smith-Waterman. We compute a smoothed version of the function $f$, which we denote $f_S$, by replacing the max with logsumexp. We again take $f_S(s) = 0$, and define

$$f_S(v) = \log \sum_{u \in N^-(v)} \exp\big( f_S(u) + w(u \to v) \big). \tag{1}$$

We use these smoothed scores and the edge weights to define a probability distribution over paths in $G$, or equivalently over local alignments.
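To make the recursion concrete, the following is a minimal, unvectorized sketch in JAX (an illustration, not necessarily the implementation in the SMURF repository; the function name and the uniform gap penalty default are our own choices):

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def smooth_sw_score(match, gap=-3.0):
    """Smoothed local-alignment score f_S(t), following Equation (1).

    match: (lx, ly) array of match scores a_ij (0-indexed here).
    Replacing logsumexp with jnp.max recovers classic Smith-Waterman.
    """
    lx, ly = match.shape
    # f[i, 0] = f[0, j] = 0: these vertices are reached only from the source s
    f = jnp.zeros((lx + 1, ly + 1))
    for i in range(1, lx + 1):
        for j in range(1, ly + 1):
            incoming = jnp.array([
                f[i - 1, j - 1] + match[i - 1, j - 1],  # match edge, weight a_ij
                f[i, j - 1] + gap,                      # gap in x
                f[i - 1, j] + gap,                      # gap in y
                0.0,                                    # edge from the source s
            ])
            f = f.at[i, j].set(logsumexp(incoming))
    # every grid vertex has a 0-weight edge to the sink t
    return logsumexp(f.ravel())
```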
Definition 1. Given an alignment graph $G = (E, V)$, define a random walk starting at vertex $t$ that traverses edges of $G$ in the reverse direction according to the transition probabilities

$$\mathbb{P}(\text{step from } v \text{ to } u) = \exp\big( f_S(u) + w(u \to v) - f_S(v) \big) \quad \text{for } u \in N^-(v),$$

and ends at the absorbing vertex $s$. Let $\mu_G$ be the probability distribution over local alignments in which the probability of an alignment $A$ is equal to the probability that the random walk follows the reverse of the path in $G$ corresponding to $A$.
Under the distribution $\mu_G$, the probability that residues $x_i$ and $y_j$ are aligned can be formulated as a derivative. Mensch and Blondel (2018) describe this relationship in generality for differentiable dynamic programming on directed acyclic graphs. We state their result as it pertains to our context and provide a proof in our notation in the "Proof of Proposition 1" section below.
Proposition 1 (Proposition 3 of Mensch and Blondel (2018)). Let $G$ be an alignment graph and $\mu_G$ be the corresponding probability distribution over alignments. Then

$$\frac{\partial f_S(t)}{\partial a_{ij}} = \mathbb{P}_{\mu_G}\big( x_i \text{ and } y_j \text{ are aligned} \big).$$
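Assuming the smooth_sw_score sketch above, Proposition 1 can be exercised directly with jax.grad: the gradient with respect to the match-score matrix is the matrix of posterior match probabilities, i.e. the expected alignment.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
toy_scores = jax.random.normal(key, (5, 4))  # toy a_ij for two short sequences

# By Proposition 1, entry (i, j) is the probability under mu_G that
# x_{i+1} and y_{j+1} are aligned (0-indexed arrays, 1-indexed text).
expected_alignment = jax.grad(smooth_sw_score)(toy_scores)
print(expected_alignment.sum(axis=1))  # each x_i is matched with probability <= 1
```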

GREMLIN details
GREMLIN is a probabilistic model of protein variation that uses the MSA of a protein family to estimate the parameters of an MRF of the form

$$P(x; v, w) = \frac{\exp E(x; v, w)}{Z}, \qquad E(x; v, w) = \sum_{i=1}^{L} v_i(x_i) + \sum_{1 \le i < j \le L} w_{ij}(x_i, x_j),$$

where $L$ is the number of columns in the MSA, $v_i$ represents the amino acid propensities for position $i$, $w_{ij}$ is the pairwise interaction matrix for positions $i$ and $j$, and $Z$ is the partition function (the value $\exp E(\,\cdot\,; v, w)$ summed over all sequences $x$). Typically the model is trained by maximizing the pseudo-likelihood of observing all sequences in the alignment (Kamisetty et al., 2013; Ovchinnikov et al., 2014; Ekeberg et al., 2013; Balakrishnan et al., 2011). Here we follow the approach of Bhattacharya et al. (2020) and Rao et al. (2020) and use Masked Language Modeling (MLM) to find the parameters $w$ and $v$. The pairwise terms $w_{ij}$ can be used to predict contacts by reducing each matrix $w_{ij}$ to a single value that indicates the extent to which positions $i$ and $j$ are coupled.
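For concreteness, a sketch of the MRF energy for a one-hot encoded sequence (the parameter shapes and the zero-diagonal convention are our assumptions; the actual code may organize parameters differently):

```python
import jax.numpy as jnp

def mrf_energy(x_onehot, v, w):
    """E(x; v, w) for a one-hot sequence.

    x_onehot: (L, A); v: (L, A) per-position propensities;
    w: (L, L, A, A) couplings with w[i, j] == w[j, i].T and a zero diagonal,
    so halving the symmetric double sum yields the sum over i < j.
    """
    linear = jnp.einsum("ia,ia->", x_onehot, v)
    pairwise = 0.5 * jnp.einsum("ia,ijab,jb->", x_onehot, w, x_onehot)
    return linear + pairwise
```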

Data selection for SMURF
For our analysis of SMURF on proteins, we used the MSAs and contact maps collected in Anishchenko et al. (2017). For training and initial tests, we used a reduced-redundancy subset of 383 families constructed in Dauparas et al. (2019). Each family has at least 1K effective sequences, and no pair of families has an E-value greater than 1e-10, as computed by an HMM-HMM alignment (Hildebrand et al., 2009). Figure 11 includes data from 99 of these families that have at most 128 sequences. A list of the families used in each setting is available in our GitHub repository. For each non-coding RNA, we aligned the RNA sequence in the PDB along with the corresponding Rfam sequences to an appropriate Rfam covariance model using Infernal (Nawrocki and Eddy, 2013). We then analyzed these sequences using the same procedure outlined for proteins. We evaluated the efficacy of the predicted contact maps using the PDB-derived contact map, where two nucleotides are classified as in contact if the minimum atomic distance is below 8 angstroms. A list of the families used is available in our GitHub repository.

Details of SMURF
SMURF has two phases: BasicAlign and TrainMRF. Both begin with the learned alignment module (Main Text Figure 1), but they have different architectures and loss functions afterwards.
BasicAlign. Similarity matrices produced by randomly initialized convolutions will produce chaotic alignments that are difficult for the downstream MRF to learn from. The purpose of BasicAlign is to learn initial convolutions whose induced similarity matrices yield alignments with relatively homogeneous columns (see Figure 8). The input to BasicAlign is a random subset of sequences $S = \{S^{(1)}, \ldots, S^{(B)}\}$ in the protein family. A pairwise alignment between each sequence and the first sequence $S^{(1)}$ is produced via the learned alignment module (as described in Main Text Figure 1). This set of alignments can be viewed as an MSA where each column of the MSA corresponds to a position in the first sequence. Averaging the MSA yields the distribution of residues in each column. Let $M_{ix}$ be the fraction of sequences in $S$ with residue $x$ aligned to position $i$ of $S^{(1)}$,

$$M_{ix} = \frac{1}{B} \sum_{k=1}^{B} \sum_{j=1}^{\ell_k} p^k_{ij}\, \mathbf{1}\big[ S^{(k)}_j = x \big], \tag{3}$$

where $\ell_k$ is the length of $S^{(k)}$ and $p^k_{ij}$ is the probability that position $i$ of $S^{(1)}$ is aligned to position $j$ of $S^{(k)}$ under the smooth Smith-Waterman alignment. (Note that $\sum_x M_{ix}$ is less than one when there are sequences with a gap aligned to position $i$ of $S^{(1)}$.) The BasicAlign loss (Equation (4)) is computed by taking the squared difference between each aligned one-hot encoded sequence and the averaged MSA.

TrainMRF. In TrainMRF, masked language modeling is used to learn the MRF parameters and further adjust the alignment module convolutions (see Figure 9). The input to TrainMRF is a set of sequences drawn at random from the MSA, $S = \{S^{(1)}, \ldots, S^{(B)}\}$. A random 15% of the residues of the input sequences are masked, and the masked sequences are aligned to the query via the learned alignment module (as described in Main Text Figure 1). The parameters for the alignment module are initialized from BasicAlign, and the query is initialized as the one-hot encoded reference sequence for the family. The MRF has two sets of parameters: symmetric matrices $w_{ij} \in \mathbb{R}^{A \times A}$ for $1 \le i, j \le \ell_R$ with $w_{ij} = w_{ji}$ that correspond to pairwise interactions of the positions in the reference sequence, and position-specific bias vectors $b_i \in \mathbb{R}^{A}$ for $1 \le i \le \ell_R$. Here $\ell_R$ denotes the length of the reference sequence, and $A$ is the alphabet size ($A = 20$ for amino acids and $A = 4$ for nucleotides). Unlike traditional parameterizations of an MRF, we do not include gaps in our alphabet. Since our task is reconstructing masked positions in unaligned sequences, we have no need to predict gap characters.
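The averaging step in BasicAlign admits a one-line tensor implementation. Below is a sketch assuming padded, batched arrays (the shapes and names are our assumptions): p[k, i, j] holds the smooth Smith-Waterman probability $p^k_{ij}$, and padding rows of the one-hot sequences are all zero.

```python
import jax.numpy as jnp

def average_msa(p, seqs_onehot):
    """Soft MSA column statistics M_ix from Equation (3) (a sketch).

    p: (B, Lq, Lmax) alignment probabilities to the first sequence
    seqs_onehot: (B, Lmax, A) one-hot sequences, zero rows for padding
    Returns M: (Lq, A).
    """
    B = p.shape[0]
    return jnp.einsum("kij,kja->ia", p, seqs_onehot) / B
```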
After the sequences are aligned to the query, the infill distribution for each masked position is determined by the MRF parameters as follows. For a masked position $j$ in sequence $k$, we define $\hat{S}^{(k)}_j \in \mathbb{R}^A$ as the predicted probability distribution over residues at position $j$ of sequence $S^{(k)}$. Let $p^k_{it}$ be the probability that position $t$ of $S^{(k)}$ is aligned to position $i$ of the query under the smooth Smith-Waterman alignment, and let $m^k_t$ be the indicator that position $t$ in sequence $S^{(k)}$ was masked. To compute $\hat{S}^{(k)}_j$, we first compute a score for each residue $x$ that is equal to the expected value (under the smooth alignment) of the terms of the function $E(\,\cdot\,; b, w)$ specific to position $j$ or involving position $j$ and an unmasked position. Then we compute the infill distribution by taking the softmax. Formally,

$$\hat{S}^{(k)}_j = \operatorname{softmax}_x \Big( \sum_i p^k_{ij} \Big( b_i(x) + \sum_{i'} \sum_{t} p^k_{i't} \big(1 - m^k_t\big)\, w_{i i'}\big(x, S^{(k)}_t\big) \Big) \Big). \tag{5}$$

The infill distribution is an approximation of how likely each residue is to be present at position $j$ in sequence $k$ if position $j$ were aligned to some position in the query sequence $S^{(1)}$. The approximation considers the values of the linear terms $b$ and the pairwise terms $w$ corresponding only to unmasked positions. (In the case that position $j$ in sequence $k$ is almost certainly an insertion relative to the query sequence $S^{(1)}$, i.e. $\sum_i p^k_{ij}$ is small, our computation will likely provide a poor guess for the residue; in the extreme case where $\sum_i p^k_{ij} = 0$, the infill distribution is uniform over the alphabet. Our model does not have a mechanism to learn the identities of residues that are insertions relative to the query sequence. Ultimately, this is not a concern since we do not use information about insertions to predict the contacts of the query sequence.) We train the network using a cross entropy loss (Equation (6)) and L2 regularization on $w$ and $b$ with $\lambda = 0.01$.

After each iteration, the query is updated to reflect the inferred MSA. Let $R$ be the one-hot encoding of the reference sequence. We define $C_{i+1}$ as a rolling weighted average of the MSAs learned through iteration $i$ and $Q_i$ as the query for iteration $i$ (Equation (7)), where $M_i$ is the averaged MSA computed as described in Equation (3) from the sequences in iteration $i$, $\eta = 0.90$, and $\gamma = 0.3$. This process is illustrated by the light blue arrows in Figure 9. Preliminary results on the training set had suggested that updating the query in this manner improved results for some families. However, the ablation study on the test set does not suggest improvement (Figure 14); further investigation is needed to determine the benefits of changing the query between iterations.
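A sketch of the infill computation for a single sequence, following the prose description of Equation (5) (the array shapes and helper name are our assumptions):

```python
import jax
import jax.numpy as jnp

def infill_distribution(p, seq_onehot, mask, b, w, j):
    """Predicted residue distribution for masked position j (a sketch).

    p: (Lq, Lk) with p[i, t] = prob. that position t aligns to query position i
    seq_onehot: (Lk, A) one-hot sequence; mask: (Lk,) with 1 at masked positions
    b: (Lq, A) biases; w: (Lq, Lq, A, A) symmetric couplings
    """
    # expected residue content at each query position, unmasked residues only
    context = jnp.einsum("it,t,ta->ia", p, 1.0 - mask, seq_onehot)   # (Lq, A)
    # pairwise contribution: sum over partner positions i' of w_{i i'} . context
    coupled = jnp.einsum("irab,rb->ia", w, context)                  # (Lq, A)
    # expected energy terms for position j under the smooth alignment
    logits = jnp.einsum("i,ia->a", p[:, j], b + coupled)             # (A,)
    return jax.nn.softmax(logits)
```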
Once training is complete, we use $w$ to assign a contact prediction score to each pair of positions. The score $c_{ij}$ measures the pairwise interaction between positions $i$ and $j$, and $\tilde{c}_{ij}$ is the score after applying the APC correction (Dunn et al., 2008),

$$\tilde{c}_{ij} = c_{ij} - \frac{\bar{c}_{i\cdot}\, \bar{c}_{\cdot j}}{\bar{c}_{\cdot\cdot}}, \tag{8}$$

where $\bar{c}_{i\cdot}$ denotes the mean of row $i$, $\bar{c}_{\cdot j}$ the mean of column $j$, and $\bar{c}_{\cdot\cdot}$ the overall mean of the scores.
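A sketch of the contact-score extraction, taking $c_{ij}$ to be the Frobenius norm of $w_{ij}$ (a common reduction, though the exact one is not restated here) followed by the APC correction of Equation (8):

```python
import jax.numpy as jnp

def contact_scores(w):
    """Contact scores from couplings with APC correction (a sketch).

    w: (L, L, A, A) symmetric coupling matrices. We reduce each w_ij to its
    Frobenius norm, zero the diagonal, then apply the average product correction.
    """
    c = jnp.sqrt(jnp.sum(w ** 2, axis=(-2, -1)))
    c = c * (1.0 - jnp.eye(c.shape[0]))        # ignore self-interactions
    row_mean = c.mean(axis=1, keepdims=True)   # c-bar_{i.}
    apc = row_mean * row_mean.T / c.mean()     # c-bar_{i.} c-bar_{.j} / c-bar
    return c - apc
```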

SMURF hyperparameter selection
Throughout our hyperparameter search, we kept the following parameters constant: fraction of residues masked at 15%, number of convolution filters at 512, convolution window size at 18, temperature at 1.0, and regularization $\lambda$ in Equation (6) at 0.01. In all runs, we initialized the gap penalty at $-3$ and allowed the network to learn a family-specific gap penalty. Our hyperparameter search consisted of three stages. In the final stage, based on the results of the hyperparameter search on the select families, we performed a final hyperparameter search on the entire training set. We noticed that performance was better for larger batch sizes, but it was not always possible to run the large batch sizes on our 32 GB GPU for families with longer sequences. For our final hyperparameter search, we used the largest batch size of {64, 128, 256} that would fit in memory for each family. We set $\eta = 0.90$ and $\gamma = 0.3$, and selected 3000 BasicAlign / 1000 TrainMRF iterations because these parameters led to relatively strong results across the restricted set of families. Learning rate 0.10 outperformed learning rate 0.05 on the restricted set, but learning rate 0.05 generally outperformed learning rate 0.10 in the initial grid search on the full training set. We ran a final test with the aforementioned parameters and the two learning rates on the entire training set, and found that learning rate 0.05 was optimal overall.
We also ran 4000 iterations of MLM-GREMLIN with predetermined optimal parameters: learning rate 0.05 and batch size 64. We found very similar performance between 3000 and 4000 iterations of MLM-GREMLIN. We chose to compare SMURF to 4000 iterations of MLM-GREMLIN so that both methods were trained for 4000 iterations.

Data selection for AlphaFold experiment
For our case study, the initial multiple sequence alignments (MSAs) were obtained from the MMseqs2 webserver as implemented in ColabFold (Mirdita et al., 2021). After trimming the MSAs to their official domain definitions, they were further filtered to reduce redundancy to 90 percent and to remove sequences that do not cover at least 75 percent of the domain length, using HHfilter (Steinegger et al., 2019). Continuous domains under 200 residues in length, with at least 20 sequences, RMSD (root-mean-square distance) greater than 3 angstroms, and predicted LDDT (confidence metric) below 75 were selected for the experiment. We include one discontinuous target, T1064-D1 (SARS-CoV-2 ORF8 accessory protein), with only 16 sequences as an extra case study, as this was a particularly difficult CASP target that required manual MSA intervention, guided by pLDDT, to predict well (Jumper et al., 2021a). The filtered MSAs were unaligned (gaps removed, deletions relative to the query added back in) and padded to the maximum length.

AlphaFold experiment details
AF takes a discrete-valued MSA as input and encodes it as a one-hot tensor, which is input into the network. In our work, the input to AF is an amortization of one-hot encoded alignment tensors over the distribution of alignments output by the LAM. We found that the AF predictions were particularly sensitive to the random mask used during evaluation (see Figure 15). For this reason we omitted the mask during the evaluation of the MMseqs2 MSA and throughout our optimization procedure. For simplicity, we considered only one of the five AF models and did not give AF access to the length of insertions relative to the query during our optimization procedure. Our objective function sought to maximize the pLDDT and minimize the alignment error as returned by AF's "model 3 ptm". The AF predictions from the MMseqs2 MSAs tended to get the overarching structure correct but were incorrect on certain parts of the sequence. Our goal was for our optimization to correct the incorrect parts of the structure. For this reason we used the more stringent metric of RMSD (rather than the GDT measure of global structure) to evaluate the accuracy of our alignments.
When the number of sequences is low, we find the optimization to be especially sensitive to parameter initialization. To increase robustness, for each target we carried out 180 independent optimization trajectories of 100 iterations each using ADAM. Each trajectory is defined by a random seed, a learning rate ($10^{-2}$, $10^{-3}$, or $10^{-4}$), and whether a cooling scheme was used in Smith-Waterman (temperature fixed at 1.0, or decreased linearly from 1.5 to 0.75 across the 100 iterations). In each case, we used 512 convolutional filters, initialized the gap penalty at $-3$, and allowed the network to learn the gap penalty.

Proof of Proposition 1

For completeness, we now repeat the proof of Proposition 1 given in Mensch and Blondel (2018) for the special case of smooth Smith-Waterman. Proposition 1 gives a probabilistic interpretation of the gradient of $f_S(t)$ with respect to the edge weights $a_{ij}$. We first give a probabilistic interpretation of the gradient of $f_S(t)$ with respect to the vertex scores $f_S(v_{ij})$.
Proposition 2. Let $G$ be an alignment graph. With respect to the random walk described in Definition 1,

$$\frac{\partial f_S(t)}{\partial f_S(v)} = \mathbb{P}(v \text{ is visited}).$$

Proof. Let $N^+(v) = \{u \mid v \to u \text{ is an edge in } G\}$ denote the outgoing neighborhood of $v$. Let $u_1, \ldots, u_n$ denote the vertices of $G$ in a reverse topological order. We prove the statement by induction with respect to this order. Note $u_1 = t$, and $\mathbb{P}(t \text{ is visited}) = \frac{\partial f_S(t)}{\partial f_S(t)} = 1$. Assume that the statement holds for all $u_j$ with $j < i$. By the chain rule,

$$\frac{\partial f_S(t)}{\partial f_S(u_i)} = \sum_{u \in N^+(u_i)} \frac{\partial f_S(t)}{\partial f_S(u)} \cdot \frac{\partial f_S(u)}{\partial f_S(u_i)} = \sum_{u \in N^+(u_i)} \mathbb{P}(u \text{ is visited}) \cdot \mathbb{P}(\text{step from } u \text{ to } u_i) = \mathbb{P}(u_i \text{ is visited}),$$

where in the second equality we apply the inductive hypothesis together with the observation that differentiating Equation (1) gives $\frac{\partial f_S(u)}{\partial f_S(u_i)} = \exp\big( f_S(u_i) + w(u_i \to u) - f_S(u) \big)$, the transition probability from Definition 1.
Proof of Proposition 1. It suffices to show that for the match edge $u \to v$ carrying the weight $a_{ij}$,

$$\frac{\partial f_S(t)}{\partial a_{ij}} = \mathbb{P}(\text{the edge } u \to v \text{ is traversed}),$$

where the traversal occurs from $v$ to $u$ in the random walk. Observe

$$\frac{\partial f_S(t)}{\partial a_{ij}} = \frac{\partial f_S(t)}{\partial f_S(v)} \cdot \frac{\partial f_S(v)}{\partial a_{ij}} = \mathbb{P}(v \text{ is visited}) \cdot \exp\big( f_S(u) + a_{ij} - f_S(v) \big) = \mathbb{P}(v \text{ is visited}) \cdot \mathbb{P}(\text{step from } v \text{ to } u),$$

where the second equality applies Proposition 2 and differentiates Equation (1); the final product is exactly the probability that the walk traverses the edge $u \to v$ in reverse.

Differences from the Needleman-Wunsch implementation of Morton et al. (2020)
Morton et al. (2020) implement a differentiable version of the Needleman-Wunsch global alignment algorithm (Needleman and Wunsch, 1970). Their implementation differs from ours in how gaps are parameterized. Consequently, their output indicates where gaps or matches are likely, whereas our output expresses matches in an expected alignment. Morton et al. (2020) define

$$v_{i,j} = \mu_{i,j} + \max_\Omega\big\{ v_{i-1,j-1},\ v_{i-1,j} + g_{i,j},\ v_{i,j-1} + g_{i,j} \big\},$$

where $g_{i,j}$ is the gap penalty for an insertion or deletion at $i$ or $j$, $\mu_{i,j}$ is the alignment score for $X_i$ and $Y_j$, and $\max_\Omega(x) = \log\big( \sum_i \exp(x_i) \big)$ (see Appendix A of Morton et al. (2020)). The values $v_{i,j}$ are analogous to our definition of $f_S$ on grid vertices (Equation (1)) with match scores $\mu_{i,j} = a_{i,j}$. In the alignment graph for their formulation, gap edges have weight $\mu_{i,j} + g_{i,j}$. In our alignment graph, gap edges have weight $g$; the match score $\mu_{i,j}$ does not play a role, and our gap penalty is not position dependent. Their code outputs the derivatives of the final alignment score with respect to each $\mu_{i,j}$. This derivative is high whenever the dominant alignment path uses an edge whose weight includes $\mu_{i,j}$, including the edges that correspond to gaps. In contrast, in our formulation $a_{i,j} = \mu_{i,j}$ appears on the match edge only, and so $\partial f_S(t) / \partial a_{i,j}$ is high only when the dominant alignment path uses the edge corresponding to a match. Proposition 1 establishes that $\partial f_S(t) / \partial a_{i,j}$ is equal to the probability that $X_i$ and $Y_j$ are aligned, so our output is an expected alignment.

Vectorization in our SSW implementation
Following the approach of Wozniak (1997), we implement a version of smooth Smith-Waterman in which the values on each anti-diagonal are computed simultaneously. The vectorization speeds up our code substantially. In order to compute the final score $f_S(t)$, we iteratively compute the scores of the grid vertices $f_S(v_{i,j})$, each of which takes as input the values $f_S(v_{i-1,j})$, $f_S(v_{i,j-1})$, and $f_S(v_{i-1,j-1})$. In a simple implementation, a for loop over $i$ and $j$ is used to compute the values $f_S(v_{i,j})$ (Figure 4a). To leverage vectorization, we instead compute the values $f_S(v_{i,j})$ along each diagonal in tandem, i.e. all $(i, j)$ such that $i + j = d$. To implement this, we rotate the matrix that stores the values $f_S(v_{i,j})$ by 45 degrees so that each diagonal corresponds to a row (see Figure 4b). In the rotated matrix, the values in row $d$ are a function of the values in rows $d - 1$ and $d - 2$, and therefore we can apply vectorization to quickly fill the matrix.
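A sketch of the rotation trick for the linear-gap recursion of Equation (1) (our illustration; the repository's implementation differs in details such as masking and padding). Each diagonal $d$ is updated in one vectorized step from diagonals $d-1$ and $d-2$:

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def smooth_sw_striped(match, gap=-3.0):
    """Anti-diagonal ("striped") smooth Smith-Waterman (a sketch).

    match: (lx, ly) match scores. Cells with i + j = d depend only on
    diagonals d - 1 and d - 2, so each diagonal is one vectorized update.
    """
    lx, ly = match.shape
    neg = -1e30  # stand-in for -inf that keeps logsumexp well behaved
    i = jnp.arange(lx + 1)

    def boundary(d):
        # 0 at on-grid cells of diagonal d with i == 0 or j == 0, neg elsewhere
        j = d - i
        on_grid = (j >= 0) & (j <= ly)
        return jnp.where(on_grid & ((i == 0) | (j == 0)), 0.0, neg)

    prev2, prev = boundary(0), boundary(1)
    total = logsumexp(jnp.stack([prev2, prev]))
    for d in range(2, lx + ly + 1):
        j = d - i
        interior = (i >= 1) & (j >= 1) & (j <= ly)
        a = match[jnp.clip(i - 1, 0, lx - 1), jnp.clip(j - 1, 0, ly - 1)]
        incoming = jnp.stack([
            jnp.roll(prev2, 1) + a,    # f(v_{i-1,j-1}) + a_ij
            jnp.roll(prev, 1) + gap,   # f(v_{i-1,j}) + g
            prev + gap,                # f(v_{i,j-1}) + g
            jnp.zeros(lx + 1),         # restart from the source s
        ])
        new = jnp.where(interior, logsumexp(incoming, axis=0), boundary(d))
        total = jnp.logaddexp(total, logsumexp(new))
        prev2, prev = prev, new
    return total  # f_S(t): logsumexp over all grid vertices
```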

SSW options
Our smooth Smith-Waterman implementation has the following four additional options.
Temperature parameter. The temperature parameter $T$ controls the extent to which the probability distribution over alignments is concentrated on the most likely alignments; higher temperatures yield less concentrated alignments. We compute the smoothed score for the vertex $v$ as

$$f_S(v) = T \log \sum_{u \in N^-(v)} \exp\big( ( f_S(u) + w(u \to v) ) / T \big),$$

which matches Equation (1) at the default $T = 1$.
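In code, the change is a one-line helper; assuming the smooth_sw_score sketch above, its logsumexp calls would be replaced with this:

```python
from jax.scipy.special import logsumexp

def smooth_max(values, temperature=1.0):
    """T * logsumexp(values / T): approaches a hard max as T -> 0
    and matches Equation (1) at T = 1."""
    return temperature * logsumexp(values / temperature)
```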
Affine gap penalty. The "affine gap" scoring scheme introduced to Smith-Waterman by Gotoh (1982) applies an "open" gap penalty to the first gap in a stretch of consecutive gaps and an "extend" gap penalty to each subsequent gap. The open gap penalty is usually larger than the extend penalty, thus penalizing a single gap of length L less severely than L separate single-residue gaps.
To implement an affine gap penalty, we use a modified alignment graph with three sets of grid vertices that keep track of whether the previous pair in the alignment was a gap or a match. Edges corresponding to the first gap in a stretch are weighted with the "open" gap penalty. Figure 5a illustrates the incoming edges of the three grid vertices for $(i, j)$. Paths corresponding to alignments with $x_i$ and $y_j$ matched pass through $v^D_{ij}$, paths corresponding to alignments with a gap at $x_i$ pass through $v^L_{ij}$, and paths corresponding to alignments with a gap at $y_j$ pass through $v^T_{ij}$. Storing three sets of grid vertices requires three times the memory used by the version with a linear gap penalty. For this reason we implemented SMURF with a linear gap penalty.
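A sketch of the three-state recursion (state names follow Figure 5a loosely; the exact edge weights and which gap-to-gap transitions are allowed are our assumptions):

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def smooth_sw_affine(match, g_open=-5.0, g_ext=-1.0):
    """Smooth Smith-Waterman with affine gaps via three DP states (a sketch).

    D[i, j]: x_i and y_j matched; L[i, j]: gap at x_i; T[i, j]: gap at y_j.
    """
    lx, ly = match.shape
    neg = -1e30
    D = jnp.full((lx + 1, ly + 1), neg)
    L = jnp.full((lx + 1, ly + 1), neg)
    T = jnp.full((lx + 1, ly + 1), neg)
    for i in range(1, lx + 1):
        for j in range(1, ly + 1):
            # a match may follow any state, or start fresh from the source s
            D = D.at[i, j].set(match[i - 1, j - 1] + logsumexp(jnp.array(
                [D[i - 1, j - 1], L[i - 1, j - 1], T[i - 1, j - 1], 0.0])))
            # opening a gap costs g_open; extending an existing gap costs g_ext;
            # dropping the T -> L term below implements the "restrict turns" option
            L = L.at[i, j].set(logsumexp(jnp.array(
                [D[i - 1, j] + g_open, L[i - 1, j] + g_ext, T[i - 1, j] + g_open])))
            T = T.at[i, j].set(logsumexp(jnp.array(
                [D[i, j - 1] + g_open, T[i, j - 1] + g_ext, L[i, j - 1] + g_open])))
    # only match-state vertices have edges to the sink t (Figure 5a)
    return logsumexp(D.ravel())
```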
Restrict turns. Smooth Smith-Waterman is inherently biased toward alignments with an unmatched stretch of $X$ followed directly by an unmatched stretch of $Y$ over alignments with an equally long unmatched stretch in one sequence. Consider the example illustrated in Figure 5b, where the highest-scoring match states are depicted by bold black, light blue, and dark green lines. Suppose the match scores of the light blue and the dark green segments are identical. With a standard Smith-Waterman scoring scheme (no affine gap), the alignment containing the black and light blue segments has the same score as each alignment containing the black and dark green segments. However, there are more alignments that pass through the dark green segment. There are ten ways to align ABC and VW with no matches (the red, purple, orange, brown, and light green paths illustrate five such ways), but only one way to align VWXYZ with gaps (navy blue). Smooth Smith-Waterman will assign the same probability to each of these paths. However, since ten of the eleven paths go through the dark green segment, the expected alignment output by smooth Smith-Waterman will favor the dark green segment. This bias becomes more pronounced the longer the segments: there are $\binom{2L-A}{L}$ alignments of a sequence of length $L$ and a sequence of length $L - A$ with no matches.
To remove this bias, we implemented a "restrict turns" option that forbids an unmatched stretch in the $X$ sequence from following an unmatched stretch in the $Y$ sequence. To do so, we again use an alignment graph with three sets of grid vertices to keep track of the previous pair in the alignment. Removing the edge marked with an asterisk in Figure 5a forbids transitions from an unmatched stretch in the $Y$ sequence to an unmatched stretch in the $X$ sequence. When implemented with this restrict turns option, smooth Smith-Waterman will find exactly one path through the dark green and black segments in Figure 5b: the path highlighted in red. Due to the increased memory requirement of the restrict turns option, we did not utilize it in SMURF.
Global Alignment. We also implement the Needleman-Wunsch algorithm, which outputs global alignments rather than local alignments.

Figure 5: (a) The grey labels describe the corresponding aligned pair for each group of edges. The red edge is incoming from the source vertex $s$. There is an outgoing edge from $v^D_{ij}$ to the sink $t$ for all $i, j \ge 1$ (not pictured). The edge marked with an asterisk is removed under the "restrict turns" option. (b) Without the restrict turns option, there are ten paths containing both black segments and the dark green segment. The red, purple, orange, brown, and light green paths illustrate five of these paths. There is only one path that contains both black segments and the light blue segment, as depicted in navy blue. The sub-alignments corresponding to the colored segments are written on the right. With the restrict turns option, the purple, orange, brown, and green paths are not valid.

Figure 6: Distribution of learned similarity scores by amino acid pairs for 2HBAA. Blue: histograms of standardized similarity scores input into SSW in the LAM for protein family 2HBAA, by identity of amino acid pair. Orange: overlaid histogram of scores restricted to pairs of residues with expected value of alignment at least 0.5. All scores are standardized by subtracting the mean and dividing by the standard deviation, computed over all similarity scores between 24 sequences and the query (i.e. values in the similarity tensor illustrated in Main Text Figure 1). The black lines represent the BLOSUM scores for the pair of amino acids, standardized in the same manner. Only amino acids that are present at least four times in the query, or on average at least four times among the 24 input sequences, are included in the figure.

Figure 8: Illustration of BasicAlign. Input sequences are aligned to the first sequence via the learned alignment module (Main Text Figure 1), and the corresponding MSA is averaged. Squared loss (Equation (4)) is computed between the averaged MSA and the one-hot encoding of the aligned input sequences.

Figure 9: Illustration of TrainMRF. Masked input sequences are aligned to the query via the learned alignment module (Main Text Figure 1). A prediction for the masked positions is computed from the MRF parameters according to Equation (5). The network is trained with the cross entropy loss given by Equation (6). The light blue arrows illustrate the update to the query that occurs between iterations of training; the query is a weighted average of the one-hot query sequence and a running average of the MSAs computed in previous iterations, see Equation (7). The grey arrow depicts the extraction of the contact map from the MRF matrix $w$ at the end of training, as described in Equation (8).

Further analysis of example SMURF predictions

RNA contact prediction
By comparing the positive predictive value (PPV) for different numbers of predicted contacts, we see that SMURF consistently yields a higher PPV for Rfam family RF00167 (Main Text Figure 2b). For RF00010, it starts off higher but then drops off faster, leading to a lower overall AUC. Upon visual inspection of the contact predictions, MLM-GREMLIN evidently generates more false positive predictions in seemingly random locations. On the other hand, SMURF largely resolves this issue, even for RF00010, presumably as a result of a better alignment. Interestingly, SMURF's lower AUC for RF00010 can be attributed to a concentration of false positive predictions near the 5' and 3' ends. It remains unclear whether these represent a coevolution-based structural element that was not present in the specific RNA sequence deposited in the PDB or whether they arise from artifacts of the learned alignment.

Protein contact prediction and alignments.
Next, we investigated the contact predictions and alignments produced by SMURF. Figure 12 and Figure 13 illustrate the contact predictions, corresponding positive predictive value (PPV) plots, and alignments for the three families that improved the most and the least (respectively) under SMURF as compared to MLM-GREMLIN. The poor performance of SMURF on 3LF9A can be attributed to the misalignment of the first ≈25 residues of many sequences (including the one illustrated) to positions ≈75 to 100 of the reference, rather than to the first 25 positions of the reference. This is likely because the gap penalty for leaving positions ≈25 to 75 unaligned outweighs the benefit of aligning to the beginning of the reference. Since our code computes a local alignment, there is no penalty for leaving positions at the beginning of the reference unaligned. Perhaps using our implementation of Smith-Waterman with an affine gap penalty would lead the network to learn a less severe penalty for long gaps and arrive at the correct alignment. For the most improved families, we see that SMURF tends to make fewer false positive predictions in seemingly random positions, as observed for RNA.

Figure 12: Contact predictions and alignments for the three most improved protein families. Left: Comparison of contact predictions between SMURF (red) and MLM-GREMLIN (blue). Gray dots represent PDB-derived contacts, circles represent true positive predictions, and x's represent false positive predictions. Middle: The positive predictive value (PPV) for the top N predicted contacts, with N ranging from 0 to 2L. Right: Comparison of the alignment of a random sequence in the family to the reference sequence. Red indicates aligned pairs that appear in the SMURF alignment but do not appear in the given alignment. Blue indicates aligned pairs that appear in the given alignment but do not appear in the alignment found by SMURF.

Figure 13: Contact predictions and alignments for the three worst performing protein families (as compared to MLM-GREMLIN). Left: Comparison of contact predictions between SMURF (red) and MLM-GREMLIN (blue). Gray dots represent PDB-derived contacts, circles represent true positive predictions, and x's represent false positive predictions. Middle: The positive predictive value (PPV) for the top N predicted contacts, with N ranging from 0 to 2L. Right: Comparison of the alignment of a random sequence in the family to the reference sequence. Red indicates aligned pairs that appear in the SMURF alignment but do not appear in the given alignment. Blue indicates aligned pairs that appear in the given alignment but do not appear in the alignment found by SMURF.

Figure 14: Ablation results. Contact AUC for SMURF versus ablated methods. Each point represents one family in the test set. In "Constant Query," we did not update the query with the averaged MSA between iterations (as depicted by the light blue arrows in Figure 9). In "No BasicAlign," the convolutions were not initialized with BasicAlign, and instead TrainMRF was run for 4000 iterations. In "pseudo-alignment," we replaced Smith-Waterman with a pseudo-alignment obtained by taking the softmax of the similarity matrix row-wise and column-wise, multiplying the resultant matrices, and taking the square root (similar to Bepler and Berger (2018)).

Figure 15: Sensitivity of AlphaFold predictions to random masking. By default, a random mask is used when AlphaFold makes a structure prediction (Jumper et al., 2021b). The distribution of RMSD of AlphaFold predictions for MMseqs2 MSAs with different random seeds used for the masks is shown. The black line shows the RMSD of the prediction without the mask.