Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness

Repeated emergence of SARS-CoV-2 variants with increased fitness underscores the value of rapid detection and characterization of new lineages. We have developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase fitness, including previously identified spike mutations and many non-spike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, ranks the fitness of lineages as new sequences become available, and prioritizes mutations of biological and public health concern for functional characterization.

the length scale to one twentieth of the extent of the gene. When considering the entire set of amino acid changes, i.e. all 2,904 coefficients that make up βf, we compute the Moran I statistic for two different length scales. We note, however, that the Moran I statistic is somewhat simplistic, since it is designed to pick up spatial structure at a single length scale. In particular it can be insensitive to complex spatial structure that involves multiple overlapping substructures at different scales. Nevertheless it offers a simple quantitative metric for identifying spatial structure in the coefficients βf.

Analysis of substitution statistics
To assess enrichment of amino acid changes we compared the event frequencies for the leading mutation sets (as determined by posterior mean/std ranking) against a background of all mutations used as features in the model using multiple testing corrected binomial tests. We performed this analysis for both the asymmetric case (where A->V and V->A are different events) and for the symmetric case.

Comparison to other regression models
We fit non-hierarchical logistic regression models in R version 4.0.3. The stats::glm() was used to fit binomial logistic regression models and the nnet::multinom function was used to fit multinomial logistic regression models. For multinomial logistic regressions, the data were filtered to contain sequences between January 1 2021 and December 31 2021 from the most common 25 PANGO lineages in the 10 countries with the most sequences available. The resulting dataset was downsampled to 10% of its initial size.
We compared PyR0 to PyR0-NoMut, a hierarchical, multinomial, Bayesian model identical to PyR0 except that the mutations were removed from the model as follows: σ5 ~ LogNormal (-4, 2) [new scalar latent variable] log_rate_c ~ Normal(0, σ5) [lineage-level log growth rate] pc ~ Normal (log_rate_c, σ4) [region-local lineage-specific log growth rate] We also compare PyR0 as described below to an earlier version of the model run on 2.1 million genomes. This earlier version was similar to the present version, but contained some minor differences, detailed in (17), that were updated as we improved and simplified the model. These differences include the former use of PANGO lineages which have been replaced by fine-grained phylogenetic clustering, the inclusion of deletions (which we now exclude because they cannot be placed on the mutation-annotated tree created by USHER), a logistic prior rather than a simpler Laplace prior, and a more factorized variational posterior distribution which has been replaced by a simpler distribution.
weekly seasonality, but binning coarser than a week so as to reduce memory requirements; this results in 56 time bins.
Because sample counts vary widely across GISAID geographic region (by as much as five orders of magnitude), we aggregate regions into the following coarse partitions: each country counts as a region, and any first level subregion of a country counts as a region if it has at least 50 samples; otherwise it is aggregated into a whole-country bin. Note this means that e.g. a country may be split up into its larger regions, with smaller regions being subsumed into an aggregate country level bin. We then drop regions without samples in at least two different time intervals, resulting in 1560 regions in total. Figure S26 shows the distribution of samples among countries and GISAID regions.
After preprocessing, the model input data are a T × P × C = 56 × 1560 × 3000 shaped array ytpc ∈ ℕ of counts (this array is sparse and our inference code uses a sparse representation), and an C × F = 3000 × 2904 shaped array Xsf ∈ {0, 1} of mutation features.
Cases per day (see Figure 3 inset) were estimated by multiplying confirmed case count data from Johns Hopkins University by the estimated proportion of each lineage within each (time, region) bin. We manually matched each GISAID region to the finest enclosing JHU region.

Lineage Clustering
Our method relies on a partitioning of genetic samples into clusters, where we estimate the fitness of each cluster. We initially tried to use the 1544 PANGO lineages as clusters, but found that some PANGO lineages appeared to include multiple distinct viruses of different fitness, e.g. B.1.1. exhibits two peaks in relative abundance in England, contrary to our multivariate logistic growth model. We therefore refined the 1544 PANGO lineages into 3000 finer clusters, with rates estimated individually for each cluster. Indeed Figure S4 shows that some PANGO lineages contain multiple distinct clusters of fitness estimates differing by more than a factor of two.
To create genetic clusters finer than PANGO lineages we began with a complete 4,833,238 node phylogeny of all GISAID samples maintained by Angie Hinrichs (26) (this phylogeny was created using UShER (25), excluding private mutations, masking difficult-to-sequence regions, eliding deletions, parsimoniously imputing missing sequence data). To coarsen the 4,833,238node phylogenetic tree down to 3000 nodes (treated as clusters) we greedily collapsed parentchild edges, minimizing the the following distance function TreeDistance(-,-) between two mutation annotated trees where T is the true mutation annotated tree, T' is the collapsed tree whose nodes we treat as clusters, u and v are sample sequences, mrca (T,u,v) is the sequence of the most recent common ancestor of u,v in the mutation annotated tree T, and EditDistance(-,-) counts amino acid substitutions between two sequences. This objective function minimizes the mean edit distance between the true mrca sequence and its cluster's sequence, for each pair of sequences. Changes in the objective function can be computed cheaply, and the O(n log(n)) time greedy algorithm can process the entire n=4,833,238 node phylogeny in under 5 minutes. Empirically this heuristic clustering produces trees that are approximately balanced in both cluster size and cluster-cluster edit distance, on both the true data and on synthetic datasets. Figure S26 shows the distribution of samples among both coarse PANGO lineages and the finer clusters. Figure S28 shows small example trees produced by clustering large synthetic trees.

Probabilistic Model
We model relative lineage growth with a hierarchical Bayesian regression model with a multinomial likelihood. Arrays in the model index over one or more indices: T=56 time steps (increments of 14 days) t; C=3000 clusters c; P=1560 regions ("places") p; and F=2904 amino acid substitutions ("features") f. The model, shown below, regresses lineage counts ytpc∈ ℕ in each time-region-lineage bin against amino acid mutation covariates Xcf ∈ {0,1}. The variables y and X are observed and all other variables in the model are latent. Each latent variable is governed by a prior distribution. The full model is specified as follows (visualized in Figure  S29), where the observed counts ytpc are underlined: The proportion of lineages in a single time-region bin is modeled as a Multinomial distribution whose probability parameter is a multivariate logistic growth function softmax(αp· + tβp·/τ) with intercept αpc and slope βpc in units of generation time τ = 5.5 days (these units are for interpretability only; the model does not use the notion of generation, and thus is robust to changes in generation time). Here the dot subscripts αp·∈ ℝ C , βp·∈ ℝ C , and ytp.∈ ℕ C denote vectors over cluster ids. The softmax function implements the multivariate generalization of logistic growth, inputting and outputting vectors, and is defined as For a simple model of two lineages, each of the two components of the softmax function are sigmoid curves; however for more lineages, the functional forms may be more complex. Early iterations of the model used overdispersed likelihoods such as Dirichlet-Multinomial to account for additional variability not directly encoded in the generative process. However, we found that we can obtain much more accurate model predictions by using a Multinomial likelihood and accounting for model misfit by adding hierarchical structure elsewhere. The intercepts αpc denote initial relative log prevalence of cluster c in region p; these are modeled hierarchically around the global relative log prevalence αc of each cluster. The slopes βpc are modeled hierarchically around global per-cluster fitness # # $# that are linearly regressed against amino acid substitution features Xcf . These linear coefficients βf can be directly interpreted as the effect of a mutation on a lineage's fitness, all other variation being equal. In figures we plot posterior means as an estimate of effect size and plot the posterior z-score ] %/' = : | |/ as a proxy for statistical significance.
Note that by regressing against amino acid changes we obviate the need to directly incorporate phylogenetic information into the model: if two lineages are close together in a phylogeny, then their amino acid features are likely also similar, so their regressed fitness values will likely be similar. By sharing statistical strength in this way we are also able to make accurate predictions for emergent lineages with few observations. (Note phylogenetic information is still used in preprocessing, since our clustering is created from an UShER phylogenetic tree.) Both of the hierarchies in α and β empirically improve model fit in the presence of heavily skewed observations (e.g. most samples are from the UK, and there is a long tail of sparsely sampled regions). We chose these model structures based on extensive cross-validation and forecasting experiments.
We place weak priors on scale parameters σ1, σ2, and σ4 (these denote standard deviations, the square roots of prior variance). The σ1 and σ2 priors are centered at large values to allow for wide variation in initial infection proportions across regions. The σ4 prior is centered around the smaller value e −4 ≈ 0.018 because we expect little variation of relative fitness across geographic regions a priori (some variation is expected, due to geographic variations in e.g. age distribution, behavior, or genetics as in binding affinity due HLA complex genotypes (27)). We fix the linear regression scale parameter σ3 to a small value, forcing the regression problem towards a sparse solution (i.e. we assume a priori that most observed mutations have little effect on fitness). We choose a Laplace prior on regression coefficients because it is heavier-tailed than a Normal prior, but not so heavy-tailed that the regression problem becomes multimodal (as it would for e.g. a Cauchy or Student's t prior).
We conducted a systematic analysis to evaluate the level of L1 regularization (S30). Stronger regularization (σ3 = 0.0001) resulted in better mutation cross-validation but resulted in a Manhattan plot that was implausibly sparse and failed to capture the known and experimentallyvalidated hits in the N gene. This proportional growth model differs from many forecasting models in the literature that are formulated in terms of absolute sample counts. Whereas our Multinomial likelihood allows us to model only the relative portions of lineages in each (time,region) bin, a Poisson likelihood would force us to additionally model the total number of genome samples in each (time,place) bin, a task which is less related to viral dynamics and more related to local lab capacity, political dynamics, and local calendars. We choose to model relative proportions rather than absolute counts because the relative model is robust to a number of sources of bias, including: sampling bias across regions (e.g. one region samples 1000x more than another); sampling bias over time (e.g. change in sampling rate over time); and change in absolute fitness of all lineages, in any (time, region) bin (e.g. due to changes in local policies or weather, as long as those changes affect all lineages equally). However the model is susceptible to the following sources of bias: biased sampling in any (time, region) cell (e.g. sequencing only in case of S-gene target failure); and changes in sampling bias within a single region over time (e.g. a country has a lab in only one city, then spins up a second lab in another distant city with different lineage proportions).
Some clusters of mutations occur together in only one or a small set of lineages, such as Omicron. If mutations are perfectly correlated across lineages, the model will score them identically, i.e. the effect size will be apportioned equally among the different mutations. However, in many cases, mutations occur independently in different numbers of lineages. In these cases PyR0 will use this additional information to "break ties" and disentangle the effects of co-occurring mutations so that mutations that consistently emerge in multiple lineages that are highly fit will be inferred to have large effects, whereas mutations that are found in some highly fit lineages and some less-fit lineages will be inferred to have small or negligible effect sizes. The lineages in which independent mutations have emerged is listed in Supplemental Data File 2.
We interpret the regression coefficients as the relative fitness based on a well-known result in population genetics (28) that the change in genotype frequency in a large haploid population under selection follows a logistic curve, where the logistic growth rate parameter defines the relative fitness of genotypes.

Advantages and Limitations
This model has several advantages over existing approaches. First, it provides a principled, agnostic approach that can be applied to a large dataset to identify lineages that demonstrate concerning epidemiological features. Second, by modeling the relative fitness of lineages separately across 1560 geographic regions, the model is robust to region-specific differences in non-pharmaceutical interventions and vaccination rates. Third, the hierarchical nature of the model which represents lineages as collections of mutations draws on the underlying biology and yields both strain-and lineage-specific coefficients from a single inferential approach. Fourth, the model is fully Bayesian, and the use of regularizing priors enables us to estimate coefficients for mutations and infer estimates for all extant lineages. Finally, the model is open-source, builds on optimized numerical libraries, can be configured to run on CPUs or GPUs, and includes an option to be run without mutations.
The model also has several limitations. First, as a generalized linear regression model, the model does not explicitly incorporate higher-order (non-linear) interactions between mutations (epistasis). While the linear-additive model of mutation biology is a coarse approximation to true biology including epistasis, our hierarchical model serves as a framework to explore such models (29,30) on SARS-CoV-2 genomic surveillance data. Second, the model is susceptible to sources of bias within the data. For example, preferential sequencing of SGTF lineages may bias the estimates. Third, because the model learns more from data-intensive regions, bias in such regions would have a corresponding greater impact on the resulting estimates. Fourth, the model does not incorporate migration. Thus, in order to be considered in future region-specific dynamics, a lineage must be present in a given region at the time the model is run, an effect that can be seen with BA.2 dynamics in Figure S7, i.e. BA.2 is forecast to rise only in regions where it has been detected. Inclusion of a migration model could improve long-term forecasting accuracy.
While a hierarchical model that includes mutation-level coefficients has several advantages over alternate modeling approaches, we do not believe that there is a "best" model that applies in all cases. As Supplemental Figures S6, S24, and S25 show, models at all levels of hierarchical organization and data pooling can be useful, and a systematic surveillance program should, we would argue, routinely run a diverse set of models. These might include binomial logistic regression models within a single well-sampled region, binomial logistic regression models that jointly model several regions, multi-region multivariate logistic regression models without mutation-specific coefficients as well as PyR0, which is a multi-region multivariate logistic regression model that incorporates mutation-specific coefficients.

Probabilistic Inference
The model is implemented in the Pyro probabilistic programming language (16) built on PyTorch (31). To fit an approximate joint posterior distribution over all latent variables (a space of dimension 375,909), we train a flexible reparameterized variational distribution using stochastic variational inference. Our variational approach starts by reparameterizing the model via a sequence of learnable but distribution-preserving decentering transforms (32) on the α and β latent variables. Reparameterizing is particularly helpful in avoiding Neal's-funnel situations (33) by smoothing out the geometry of latent variables with Normal prior whose scale parameter is also a latent variable. After reparameterizing we model the posterior over all variables as a joint multivariate Normal distribution whose covariance matrix is parametrized by a rank-200 matrix plus a diagonal matrix D with positive entries: where is an unconstrained matrix of size 375,909 x 200. This low-rank multivariate Normal distribution allows the approximate posterior to capture correlated uncertainty among competing mutations each of which might explain increased fitness. This variational distribution has 75,936,525 parameters to be optimized (much larger than the number 375,909 of latent variables, but much smaller than the 375,909 ⨉ (375,909 + 1) / 2 ≅ 7 ⨉ 10 10 parameters that would be required to represent a full-rank covariance matrix).
Variational inference is performed for 10,000 iterations with the Adam optimizer (34) with clipped gradients and an exponentially decreasing learning rate schedule and initial learning rates between 0.05 and 0.0025 for different parameter groups (see Figure S31). Optimization proceeds in batch-mode, i.e. without any data subsampling. We initialize model parameters to median prior values with a small amount of noise added to avoid scale parameters collapsing early in training. After inference we make predictions by drawing 1000 posterior samples. See source code for detailed optimizer and initialization configuration.
Inference and prediction on a single GPU (NVIDIA Tesla A100 with 48GB of RAM) takes about 10 minutes (compared to 14.5 hours on an 8-core CPU), which is less than the amount of time required to download and preprocess each daily snapshot of data from GISAID. The cost of fitting the model is O((TP+F)C), dominated by pointwise mathematical operations, particularly computing the softmax function on a dense array of shape T×P×C. This cost does not depend directly on the number of genetic samples, since samples are aggregated into counts y of constant shape T×P×C.
We emphasize that inference in this model is very challenging due to the large dimension of the latent space (namely 375,909), itself a consequence of the large number of regions, lineages, and mutations considered by the model (35). While variational inference has a number of attractive features, especially computationally, like any approximate inference scheme it comes with disadvantages. In our case the most notable disadvantage of variational inference is its tendency to yield biased posterior uncertainty estimates. Typically posterior uncertainty is underestimated, leading to credible intervals (CI) that in some cases can be unrealistically narrow. The primary parameters of interest in the PyR0 model are the mutation-level coefficients βf and the perlineage fitness values # # $# . Since the latter quantity governs the prior over βpc, which in turn directly feeds into the multinomial likelihood, the per-lineage fitness estimates are more-orless tightly constrained by the observed counts ytpc. Consequently the posterior uncertainty of per-lineage fitness is comparatively easy to estimate and we expect variational inference to yield reasonable credible intervals for these quantities. In contrast the mutation-level coefficients βf interact with correlated features Xcf (leading to a multi-modal posterior) and are less directly constrained by the observed counts ytpc. Consequently it is significantly more challenging to estimate the corresponding posterior uncertainty. In practice we obtain implausibly narrow credible intervals for these quantities and the posterior uncertainty must be interpreted with caution. Importantly, while the uncertainty estimates for βf should not be taken at face value, we believe that they are still very useful for interpreting inferred model parameters, since they can be used to rank/prioritize different hits βf. In particular, while the absolute magnitudes of βf uncertainty estimates are implausible, their relative magnitudes are representative of the amount of supporting evidence, and thus are useful for ranking. Since we consider a large number of mutations (F=2904) this information is invaluable for designing experiments for functional characterization.

Implementation
We implemented the PyR0 model using the probabilistic programming language Pyro (16). The model leverages PyTorch and Pyro to scale efficiently to large data sets and can therefore be applied continuously as datasets grow, completing model training and prediction with millions of viral genomes in minutes on a single GPU. We chose the Pyro framework because it cleanly separates model specification from inference customization, and scales to large models and datasets by leveraging GPUs. This flexible modeling framework allowed us to experiment with different hierarchical structures. Additionally by relying on an open source and well-tested modeling and inference framework, we minimize the risk of introducing software bugs into our analysis. The speed of inference-which took about 10 minutes on a single GPU on the full dataset of >6 million genomes-allowed quick model iteration and thorough validation on subsets of the data, including both geographic cross-validation and temporal data truncation.

Prediction
In Figure 1, the 95% confidence intervals in parentheses were estimated by drawing 1000 samples from the variational posterior distribution. Confirmed cases per day were estimated at the end of the training period (Jan 20 2021) by combining our model's relative lineage portions with confirmed case count data from Johns Hopkins university. Quantities defined over our 3000 fine clusters were aggregated up to coarser PANGO lineages for reporting. To facilitate downstream use of model predictions we have provided complete tables of lineage fitness estimates (Data S1) and mutation coefficients (Data S2). These predictions have been used e.g. by Nextstrain.org to visualize our predicted mutational fitness along a phylogenetic tree ( Figure  S32).

Validation
We considered the possibility of biased submission to the GISAID database and compared results obtained from the full dataset with results obtained from disjoint subsets. For this purpose we divided the data into samples from the most heavily sampled region (Europe, with 3.3M samples) and those from the rest of the world (with 3.1M samples) (Figures S1,S11). This split is motivated by most samples originating from the UK: we widened the region around the UK until the region and its complement both had roughly equivalent statistical strength and narrow posterior estimates. We conducted two-fold cross-validation experiments for both lineages ( Figure S2) and mutations ( Figure S19). Additionally, in Figure S33, we show that PyR0 lineagelevel Δ log R estimates are largely driven by regions with the largest numbers of samples and are thus robust to the manner in which under-sampled regions are organized into spatial units.
We found the full GISAID dataset to be invaluable to making accurate predictions. Using data up to July 2021, we tried restricting to either all CDC data or CDC's randomly sampled NS3 dataset and found those subsets to result in insufficient diversity and lead to unclear results (Pearson correlation 0.49, 0.28, respectively). Using data snapshots from mid January 2022, we tried restricting to open data available in GENBANK, but found the model made implausible estimates of Omicron fitness, due to a combination of lack of geographic diversity (GENBANK has only about 1/10 as many geographic regions as we were able to extract from GISAID data, and particularly has very few samples from South Africa) and data upload latency (GISAID appeared to have ~1 week upload latency, versus ~1 month for GENBANK).
To evaluate predictions based on sequence alone, we conducted a leave-one-out (LOO) analysis in which each lineage's fitness was estimated from its mutational profile after: i) training on the full dataset; and ii) training on the full dataset with each lineage and its subclade (all descendents) removed ( Figure S17). We compared these LOO estimates to a baseline estimator in which each child lineage's fitness is the same as its parent.
Our model assumes each single point mutation independently linearly contributes to change in fitness. A natural generalization is to search for groups of mutations that affect fitness. To explore this we fit a similar model of both single and pair mutations, considering only pairs that lie within the same gene. Fitting this model on data up to July 2021, we discovered no pairwise mutations stronger than the top 100 single mutations. While these experiments did not discover pairwise mutations, we believe that more sophisticated models would be able to measure epistasis, but sophistication in that area is beyond the scope of the present work.
We also compared our inference of lineages and mutations from the current model, run on 6. Finally, to compare our assessments of lineage fitness to other logistic regression approaches, we compared lineage fitness estimates for PyR0 without mutations ( Figure S11), multinomial logistic regression ( Figure S12), and binomial logistic growth curves (S13), showing good agreement across all approaches.

Supplemental Note 2:
Cell culture Cells were cultured in humidified incubators with 5% CO2 at 37º C, and monitored for mycoplasma contamination using the Mycoplasma Detection kit (Lonza LT07-318). HEK293 Homo sapiens, female, embryonic kidney cells (ATCC CRL-1573) were cultured in DMEM supplemented with 10% heat-inactivated FBS, 1 mM sodium pyruvate, 20 mM GlutaMAX, 1× MEM non-essential amino acids, and 25 mM HEPES, pH 7.2. Virus Infectivity Assays 16 hours prior to transduction, adherent cells were seeded in 96 well plates. HEK-293 cells were plated at 5 x 10 4 cells per well. Cells were incubated in virus-containing media for 16 hrs at 37°C when fresh medium was added to cells. 48 to 72 hours after transduction cells were assessed for luciferase activity. Cells transduced with luciferase expressing virus were assessed using Promega Steady-Glo system (Promega Madison, WI). GraphPad Prism 8.4.3 was used to analyze the infectivity data using a ratio paired t test. In these experiments, all values shown are the mean with standard deviation, with the actual calculated two-tailed P value indicated.

Supplemental Note 3:
We include here an extended discussion of high-scoring mutations.

Spike
Many of the high-scoring mutations in our model were from Spike. As discussed in the main text, fitness-associated Spike mutations tended to cluster in hotspots in NTD, RBD, and furincleave domain. For Omicron, antibody escape appears to be the dominant mechanism as predicted by our model (Figure 3B, S25B) and shown experimentally (36,37), across all variants the mechanisms by which these mutations enhance growth likely include enhancing infectivity, promoting cell-cell fusion, and increasing immune escape beyond humoral immunity, and other mechanisms. The mutational fitness of individual mutations was strongly between the 2.1 million genomes model and the 6.4 million genomes model (rho = 0.62, p < 2 x 10 -16 , Figure S15F), and includes many mutations that have been experimentally characterized including those at positions E484, L452, T478, and N501. Some of these (e.g. E484K) have been shown to promote escape from neutralizing antibodies ( (38)) whereas others (N501Y, N501T) have been shown to enhance binding to ACE2 (9). While T478K is the among top-scoring Spike mutations in our dataset, we are not yet aware of any definitive experimental results that would explain a potential fitness for this mutation.
Spike D614G was highly ranked by our initial model but appeared to have a neutral effect in the 6.4 million genomes model, a situation that also affected ORF1b P314L, a mutation that is almost perfectly linked to Spike D614G. We attribute this to effective fixation during the followup period, as nearly all variants of concern and all genomes sequenced since July 2020 contain these two mutations ( Figure S34).

Nucleocapsid
The concentration of putative transmission-promoting substitutions in N at positions 160-210 is remarkable, but is supported by a similar observation in Ebola virus (39), and recent data for SARS-CoV-2 showing mutations in that region increase the efficiency of viral packaging (40), validating some of the model's most unexpected predictions and supporting its ability to identify novel biology.

ORF1
Our model highlighted mutations within the ORF1 non-structural proteins (nsps) whose functions are not fully understood (e.g. Table S3). We found two predominant clusters within ORF1a: one in the C-terminal ~120 amino acids of nsp4 and the other within the N-terminal ~160 amino acids of nsp6 ( Figure S13C). Nsp4 and nsp6 are both membrane-anchored proteins with roles in assembly and concentration of the viral replication and transcription complex (RTC) machinery within double-membrane vesicles (41). Amino acid substitutions in these regions, combined with transmission-associated mutations identified within additional RTCassociated nsps (e.g., nsp12-16, Figure S13D), may therefore affect the kinetics of replication and gene expression, resulting in higher virus yields from infected cells. Nsp2, a rapidly evolving accessory protein (42)(43)(44) whose proposed function in disrupting host cell signaling (45) and viral mRNA translation initiation (46) remains obscure, harbored many additional mutations associated with higher fitness ( Figure S13C).
The ORF1a-ORF1b polyprotein is processed into 16 non-structural proteins by two viral proteases: a papain-like protease (nsp3) and 3C-like protease (nsp5). Multiple transmissionassociated mutations were found within the protease coding regions (e.g., ORF1a:V1750A, ORF1a:P3395H). Most of the amino acid substitutions identified by our model were outside of the domains containing catalytic residues for nsp3 (C1674, H1835, D1849) or nsp5 (H3304, C3408) (47) (48). However, the potential effects of these mutations on protease architecture and activity warrant further experimentation. A few of the top mutations from our model (e.g., ORF1a:T3255I, ORF1a:A3571V) are positioned adjacent to nsp cleavage sites, potentially influencing local structures and kinetics of polyprotein processing by nsp3 and nsp5 ( Figure  S13C-D).
Multiple highly-ranked mutations are distributed across the replication and transcriptionassociated nsps in ORF1b ( Figure S13D). The P314L mutation in nsp12 -the viral RNAdependent RNA polymerase (RdRP) -emerged early during the pandemic and became established in circulating lineages alongside S D614G (7). And like D614G, P314L was highly ranked by our initial model but had an approximately neutral effect in the 6.4 million genomes model, presumably due to fixation of this mutation during the follow-up period. A later variant at this site (P314F) was also highly ranked in our list. Additional mutations in nsp12 can be found within the canonical fingers (D445A, V631I, D514N, G662S), palm (M592I, H604Y, T701I, C721R, S763F), and thumb (L820F, L829I, D870N) subdomains of the RdRP conserved catalytic fold ( Figure S15). The functional effects of these mutations on polymerase processivity and fidelity remain to be investigated. A structural model of the SARS-CoV-2 polymerase complex has been resolved (49) (50), and contains a single subunit of nsp12, two subunits of the nsp13 helicase, and additional RdRP cofactor proteins (nsp7, 8, and 9). The ORF1b P314 residue is located at the interaction interface between nsp12 and a single subunit of nsp8. Moreover, several of the top mutations from our dataset ORF1b (e.g., P1000L, P1001S, Q1011H) are harbored within the nsp13 N-terminal zinc-binding domain that directly interacts with nsp8 (51). These findings implicate transmission-associated mutations within the SARS-CoV-2 RNA synthesis machinery in altering the stability of the replication complex, possibly via interactions with nsp8.
Nsp14 is a dual-functional enzyme with N-terminal 3'-to-5' exonuclease (ExoN) and C-terminal guanine-N7 methyltransferase (N7-MTase) activities (52) and is a core component of the coronavirus RNA proofreading complex. Nsp14 is uniquely responsible for excision of mismatched bases from the nascent RNA and methylation of the viral mRNA cap structure. Two mutational hotspots in nsp14 map to discrete regions in the ExoN (e.g., T1540I, I1566V) and N7-MTase (e.g., D1848Y, P1936H) domains. The functional consequences of these clusters of transmission-associated mutations on mRNA synthesis and genome replication remain unknown. Figure S1. Overview of the PyR0 analysis pipeline. After clustering UShER's mutation annotated tree, sequence data are used to construct spatio-temporal lineage prevalence counts ytpc and amino acid substitution covariates Xcf. Pyro (16) is used to fit a Bayesian multivariate logistic multinomial regression model to ytpc and Xcf. A.
B.      Figure S3, heatmaps depict the prediction L1 error, and the inset bar plots depict the aggregated prediction errors over all periods. Note the rapid increase in error as new fit lineages emerge in a region (vertical dashed lines provided as a guide to the eye), followed by rapid recovery and stabilization of forecasting accuracy within only a single period, highlighting the predictive value of PyR0 for detecting variants of concern. Refer to Table S1 for tabulated forecasting accuracy figures in several other regions.        . Right: the same quantities but relative to a baseline estimator in which each child lineage's fitness is the same as that of its parent lineage. If a mutation is entirely removed from the LOO dataset, then the corresponding mutation coefficient is estimated as zero. The evaluation metrics are Pearson correlation (⍴) and mean absolute error (MAE). The MAE of the leave-one-out estimator is 0.001, more than 100x smaller than the MAE of the baseline estimator (0.129). Both panels highlight the CDC's variants of concern and variants of interest. The lineages selected for testing are those with at least 100 samples and with the largest deviations from their parent, i.e. where the baseline estimator performs worst. Note that the fitness of child lineages can deviate substantially from that of the parent, e.g. BA.1 is surprisingly higher fitness than its parent B.1.1.529.           Table S1. Regional evaluation of forecasts. We evaluate the ability of PyR0 to accurately forecast the dominant lineage 4-and 8-weeks into the future in six selected regions with a relatively large number of GISAID samples. Percentage accuracies are obtained by averaging over 45 training windows.  Table S2. Spatial structure of the inferred amino acid coefficients βf. We report one-sided pvalues for the Moran I spatial autocorrelation statistic computed using a permutation test. We use a gaussian weighting function of the form exp(-distance 2 /lengthscale 2 ), where distance is measured in units of nucleotides. We find that there is significant evidence for spatial structure in S, N, ORF7a, ORF3a, and ORF1a as well as across the SARS-CoV-2 genome as a whole.