Fitness effects of mutations to SARS-CoV-2 proteins

Knowledge of the fitness effects of mutations to SARS-CoV-2 can inform assessment of new variants, design of therapeutics resistant to escape, and understanding of the functions of viral proteins. However, experimentally measuring effects of mutations is challenging: we lack tractable lab assays for many SARS-CoV-2 proteins, and comprehensive deep mutational scanning has been applied to only two SARS-CoV-2 proteins. Here we develop an approach that leverages millions of publicly available SARS-CoV-2 sequences to estimate effects of mutations. We first calculate how many independent occurrences of each mutation are expected to be observed along the SARS-CoV-2 phylogeny in the absence of selection. We then compare these expected observations to the actual observations to estimate the effect of each mutation. These estimates correlate well with deep mutational scanning measurements. For most genes, synonymous mutations are nearly neutral, stop-codon mutations are deleterious, and amino-acid mutations have a range of effects. However, some viral accessory proteins are under little to no selection. We provide interactive visualizations of effects of mutations to all SARS-CoV-2 proteins (https://jbloomlab.github.io/SARS2-mut-fitness/). The framework we describe is applicable to any virus for which the number of available sequences is sufficiently large that many independent occurrences of each neutral mutation are observed.

With millions of SARS-CoV-2 sequences shared publicly, almost all mutations that are tolerated by the virus are observed dozens to hundreds of times. Where on the tree and how often on the tree we observe specific mutations has information about the effects of these mutations on viral spread. The mutation rate depends on the nucleotides involved and possibly on the sequence context and other viral determinants, but for the purpose of this derivation, we will assume the neutral rate µ is known. If the mutation is neutral, the total number of times the mutation is observed on the tree is µT, where T is the total length of the tree (assuming that the mutation never reached high frequency which is true for almost all mutations, particularly when mutations are counted on a per-clade basis relative too the clade founder as done above).
If a mutation reduces fitness, the lineages descending from branches on which this mutation happened will spread more slowly than those without this mutation. As a result, the down-stream subclades are smaller and more short lived, which in turn means that they will be less likely to be sampled and represented in the tree. To infer a mutation's effect on fitness, we need to calculate how the probability of observation depends on this fitness effect.
For a mutation to be represented in the tree, one of its descendants has to be sampled and sequenced. If the total number of descendants is w and the sampling fraction is , the probability that the mutation is present in the tree is W is a random number that depends on the realization of the transmission process, which is commonly modeled by a branching process with birth rate b and death rate d. The death rate here corresponds to clearing an infection, the birth rate to onward transmission. The latter is affected by the fitness cost of the mutation. To obtain insight how the probability of observing a lineage depends on parameters, we calculate the probability p(w, T|t) that a lineage had an integrated size w = T t k(t ) dt , where t is the birth time of the lineage, T is the current time, and k(t ) is the size of the lineage at time t . To calculate p(w, T|k), we generalize it slightly to p(W, T|k, t), where k is the number of individuals at the start time t. This quantity obeys the following "first-step" equation: We will solve for the Laplace transformp(z, T|k, t) = ∞ 0 dw e −wz p(w, T|k, t) =p k (z, T|1, t). Using the following identity for the derivative of the Laplace transform and setting k = 1, we have This simplifies further to if we substitute φ(z, T|t) = 1 −p(z, T|t).
where it is important to note that the derivative is with respect to the first time point and the interval T − t is shrinking with increasing t.

Constant birth and death rate
If the fitness effect of the mutation in question is detrimental and the overall population is constant (background b 0 = d 0 ), all mutant lineages will eventually die out and we can consider large T − t and the long time asymptotic ∂ t φ(z, T|t) = 0. Further define b = b 0 − s and d = d 0 where s is the fitness cost of the mutation (so larger values indicate a greater fitness cost). The steady state generating function is then with solution Since φ(z) = 1 − e −wz p(w) dw, φ( ) is exactly the probability that a lineage is sampled when the entire population is sampled at rate . We thus expect two regimes: if the square of the fitness cost exceeds the sampling intensity (typically at 1% or less), the probability of sampling a lineage is essentially inversely proportional to the fitness cost. The sampling probability of lineages with smaller costs effects depends less strongly on s. Their sampling mostly comes down to stochasticity independent of the fitness cost.

S1
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ; https://doi.org/10.1101/2023.01.30.526314 doi: bioRxiv preprint

Growing populations
In many scenarios relevant for lineages that arise during a viral outbreak, the background population isn't constant but is undergoing a rapid exponential expansion. The background birth rate b 0 is bigger than d 0 in this case. Since the population is growing, deleterious mutations can increase in frequency deterministically and we can not send the t to infinity as before. Instead, we need to integrate backwards in time starting from φ(z, T|T) = 0 at t = T. While φ(z, T|t) is small and the quadratic term can be neglected, this is approximately solved by where γ 0 is the growth rate of the background population. At longer times when ze γ 0 (T−t) ∼ 1 and φ is no longer small, φ tends towards a constant value determined by the same quadratic equation as above. This limit is neither interesting or relevant for the present purpose, since there are very few lineages that emerged early enough to have saturated φ. Instead, we need to average φ (the linear approximation) over all the time points when the lineage could have arisen.
This derivation assumed that γ 0 (T − t) 1, i.e. that the overall population size has expanded substantially. The most relevant fitness effects will be those with s(T − t) > 1, that is the fitness effect has strong effect on variant frequency, but s < γ 0 such that the variant is still spreading and can give rise to large lineages in an expanding variant. In this case, the above simplifies to In a variant that has been growing with rate γ 0 for a time τ = T − t and sampled with z = , we thus expect that the number of times we observe separate mutant lineages depends on s as This has a very similar behavior as the solution for constant population size, which suggests that the overall dependence on s is robust and we can assume that the number of times a mutation is observed is inversely proportional to its effect on fitness. The same basic dependency is observed at steady state in a quasi-species [65]. In a constant population, this relationship breaks down for dense sampling > √ s. In growing population, the approximation fails if the product of fitness effect and the time over which the variant has grown, sτ, is small, i.e., if the fitness cost does not affect variant frequency strongly. In these cases, there is still a dependence on s, but it is weaker.

S2
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ; https://doi.org/10.1101/2023.01.30.526314 doi: bioRxiv preprint Figure S1 Correlation between amino-acid fitness estimates from the ∼7-million sequence set of all publicly available sequences as of May-11-2023, or the set of all ∼14-million sequences in GISAID as of March-29-2023. The orange text in the upper-left corner of the plot shows the Pearson correlation. The high correlation indicates that our estimates are not substantially limited by noise related to statistical sampling of mutation counts, since using another tree with twice as many sequences does not substantially alter the estimates. Note that the main text figures all use the smaller ∼7-million publicly available sequence set. See https://jbloomlab.github.io/SARS2-mut-fitness/mat_aa_fitness_corre lations.html for an interactive version of this plot that also shows the correlations for other sets of publicly available sequences.

S3
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  The distribution of inferred fitness cost of four-fold degenerate sites at regions of known non-coding constraint within the protein coding genes (the ribosomal slippage and TRS sites) versus all other four-fold degenerate sites. In total, 25 four-fold degenerate sites are in these regions of known constraint, whereas 4844 four-fold degenerate sites are outside these regions. Panel C shows that a large fraction of mutations in these known regions of constraint are deleterious, while the distribution for the remaining four-fold degenerate sites is centered around 0 between -2 and +1. Furthermore, strongly deleterious mutations with fitness estimate below -3 predominantly fall in the 25 sites of known non-coding constraint. If the two additional conserved regions in E and M are included, all but one mutation outside these regions has a fitness cost estimate above -3.

S4
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ; https://doi.org/10.1101/2023.01.30.526314 doi: bioRxiv preprint Figure S3 A version of Figure 3B but with genes ordered by position in the genome rather than extent of constraint on nonsynonymous mutations. See https://jbloomlab.github.io/SARS2-mut-fitness/effects_dist_position_order.html for an interactive version of this plot.  Figure S4 Effects of stop-codon and amino-acid mutations across ORF3a. The black area plot shows the mean effect of all amino-acid mutations at each site, and the purple points show the effects of stop codon mutations. There is strong selection against stop codons (negative effects) for all but the C-terminus of ORF3a, but only a few positions show strong selection against amino-acid substitutions. See https://jbloomlab.github.io/SARS2-mut-fitness/ORF3a.html for an interactive version of this plot along with zoomable heatmap of the effects of specific amino-acid substitutions.

S5
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ; https://doi.org/10.1101/2023.01.30.526314 doi: bioRxiv preprint Figure S6 The mutation effect estimates from the current study are substantially better correlated with deep mutational scanning measurements for spike than predicted mutation effects made using three other approaches. Each row shows the correlation of the fitness estimates from the current study and full-spike deep mutational scanning measurements versus values from another study. The Maher et al (2022) and Thadani et (2023) studies make predictions for individual amino-acids, so those correlations are with the mutation-level fitness estimates and deep mutational scanning measurements. The Rodriguez-Rivas et al (2022) study only reports site-level predictions, so those correlations are with the site mean fitness estimates and deep mutational scanning measurements. Each row only shows mutations or sites with a reported value from all three approaches. The numbers in the upper right give the Pearson correlation. See https://jbloomlab.github.io/SARS2-mut-fitness/comparator_corr.html for an interactive version of this plot.

S7
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ;https://doi.org/10.1101/2023 doi: bioRxiv preprint Effects of individual mutations that fixed in at least one clade of SARS-CoV-2, faceted by whether they are in spike or another protein. "Mutation polarity" indicates if the point shows the effect of the mutation estimated using all viral clades (including those that have fixed the mutation), or just from direct forward occurrences of the mutation in clades in which it has not yet fixed. Some mutations are estimated to be more favorable when including clades in which they have fixed (blue circles) in addition to just clades in which it has not yet fixed (orange squares)-when this occurs, it suggests epistatic entrenchment of the mutations [38,60]. Note that clades in which a mutation has already fixed contribute to estimates of its fitness via estimates of the effect of its reversion and via estimates of the effects of mutations to other amino acids at the same site. See https://jbloomlab.github.io/SARS2-mut-fitness/clade_fixed_muts.html for an interactive version of this plot.

S8
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ; https://doi.org/10.1101/2023.01.30.526314 doi: bioRxiv preprint

Figure S9
Relationship between fitness effects of mutations and two measures of the number of descendants. At top is shown the log ratio of counts of the mutation on non-terminal (internal) to terminal (tip) branches; larger values indicate mutations more likely to be found in viruses that leave descendants. At bottom is shown the mean log number of tip descendants that share all the mutations on each branch containing the mutation of interest; larger values again indicate mutations more likely to be found in viruses that leave more descendants. Each point is an amino-acid mutation, the orange line is a least-squares regression, and the orange text in the upper left give the number of mutations and the Pearson correlation coefficient. This plot shows only mutations with at least 10 expected counts and 5 actual counts. See https://jbloomlab.github.io/SARS2-mut-fitness/fitness_vs_terminal.html for an interactive version of this plot that allows filtering by the number of actual or expected counts, or by gene. The number of descendants is calculated using the "leaves_sharing_mutations" variable of the UShER mutation-annotated tree.

Figure S10
Number of mutations per branch on all branches of the SARS-CoV-2 tree with at least one mutation. The analysis used here only considers mutations on branches with four or fewer mutations and so excludes the small fraction of highly mutated branches.

S9
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted June 6, 2023. ; https://doi.org/10. 1101/2023