Exploring the Interaction between E484K and N501Y Substitutions of SARS-CoV-2 in Shaping the Transmission Advantage of COVID-19 in Brazil: A Modeling Study

ABSTRACT. The COVID-19 pandemic poses serious threats to global health, and the emerging mutation in SARS-CoV-2 genomes is one of the major challenges of disease control. Considering the growth of epidemic curve and the circulating SARS-CoV-2 variants in Brazil, the role of locally prevalent E484K and N501Y substitutions in contributing to the epidemiological outcomes is of public health interest for investigation. We developed a likelihood-based statistical framework to reconstruct reproduction numbers, estimate transmission advantage associated with different SARS-CoV-2 variants regarding the marking (identifying) 484K and 501Y substitutions (including Alpha, Zeta, and Gamma variants) in Brazil, and explored the interactive effects of genetic activities on transmission advantage marked by these two mutations. We found a significant transmission advantage associated with the 484K/501Y variants (including P.1 or Gamma variants), which increased the infectivity significantly by 23%. In contrast and by comparison to Gamma variants, E484K or N501Y (including Alpha or Zeta variants) substitution alone appeared less likely to secure a concrete transmission advantage in Brazil. Our finding indicates that the combined impact of genetic activities on transmission advantage marked by 484K/501Y outperforms their independent contributions in Brazil, which implies an interactive effect in shaping the increase in the infectivity of COVID-19. Future studies are needed to investigate the mechanisms of how E484K and N501Y mutations and the complex genetic mutation activities marked by them in SARS-CoV-2 affect the transmissibility of COVID-19.

INTRODUCTION COVID-19, the etiological agent of which is severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), 1 posed a serious threat to global health and swept the world in 2020; the pandemic is still ongoing. 2,3 As of March 31, 2021, more than 127 million COVID-19 cases had been confirmed worldwide, with more than 2.7 million associated deaths.
The control of COVID-19 requires knowledge of the driving factors that may affect the transmission process 4,5 ; virus mutation is one of the major challenges. 6,7 Around September 2020, genetic variants carrying the N501Y substitution on the spike (S) protein of SARS-CoV-2 were first detected in the United Kingdom 8 and, then spread globally and trended to reach fixation rapidly in many places (e.g., South Africa, 9 Brazil, 10 the United States, 11 and the United Kingdom). 12,13 In Brazil, the 501Y variants, as well as other amino acid changes, were clustered into the B.1.1.28.1 lineage by COVID-19 Genomics Consortium UK, which is also known as the variant of concern (VoC) 202101/02. 14 15,16 The mutation E484K was first identified in South Africa and became prevalent in many places, including the United Kingdome and Brazil. 17 These emerging variants may affect the epidemiological characteristics of COVID-19 18,19 and the protective effects of vaccines. [20][21][22][23] Considering the growing patterns of the epidemic curve in Brazil, the possible contributions of both E484K and N501Y substitutions are of public health interest for investigation.
The rapid spread of 501Y and 484K variants indicates a possible transmission advantage over their preceding variants. 24 On one hand, recent analyses reported evidence that the N501Y substitution was associated with an increase in infectivity of COVID-19, 12,13,[25][26][27] which appears similar to the situation of D614G substitution reported previously. [28][29][30][31] On the other hand, the relationship between E484K substitution and COVID-19 transmissibility appears inconclusive. 32,33 The survival or functional profile of pathogen could be altered through genetic mutation and, as a consequence, change its infectivity. 34 Referring to the previous studies on seasonal influenza viruses, 35 a few key amino acid substitutions may lead to changes in antigenic features and epidemiological outcomes, 36,37 and the interaction among them may become more complicated. As an example, the R384G substitution in the nucleoprotein (NP) of H3N2 virus enhances the ability of in-host immune escape, 38 which increases transmissibility, 35 but this substitution appears detrimental. In contrast, the commutations including E375G and M239V in NP could compensate and restore the viral fitness or functionality of H3N2 virus, 39,40 just as the mutated strains rapidly reached fixation in 1993-1994 influenza season. For the COVID-19 epidemics in Brazil, how the mutation activities marked by E484K and N501Y substitutions, as well as the possible interactive effect between them, might shape the transmission advantage remains unassessed.
Exploring the role of mutation activities in determining disease transmissibility is of importance to understand how the evolutionary process at molecular scale may shape the epidemiological outcomes at population scale. 28,31,41,42 In this study, we adopt a statistical framework to infer the real-time transmissibility associated with different SARS-CoV-2 variants with respect to E484K and N501Y substitutions in Brazil. We explore the interactive effects between E484K and N501Y in shaping the transmission advantage of COVID-19.

METHODS
SARS-CoV-2 sequencing data and COVID-19 surveillance data. The SARS-CoV-2 strains were obtained via the global initiative on sharing all influenza data (GISAID) with collection dates ranging from January 1, 2020 to January 31, 2021, in the Brazil. 43 A total of 4,210 complete human SARS-CoV-2 strains were retrieved. We excluded sequences with more than 5% ambiguous amino acids during the alignment, and a total of 4,052 sequences were included for further analysis. Multiple sequence alignment was performed using MAFFT version 7, 44 and the "Wuhan-Hu-1" (GISAID: EPI_-ISL_402125 or GenBank: NC_045512.2) SARS-CoV-2 genome is considered as the reference sequence.
The surveillance data of COVID-19 cases in the Brazil were collected via the WHO COVID-19 surveillance platform. 45 To avoid the under-ascertainment due to reporting delay, we drop the observations since February 2021. As such, the surveillance data of COVID-19 cases from January 1, 2020 to January 31, 2021, are included in the analysis, which match the period of SARS-CoV-2 sequencing data. To adjust for the weekly cycle in the COVID-19 case time series, the 7-day moving average is adopted for further analysis. The COVID-19 cases time series are shown in Figure 1A.
Statistical parameterization. Variant-specific reproduction number. The time-varying reproduction number is commonly adopted to quantify the instantaneous transmissibility of infectious disease in an epidemic. Using the estimation framework in Cori et al., 46 the epidemic growth is modeled as a branching process. Thus, the reproduction number at time t, R(t), is expressed as the ratio of C(t) over Ð 1 0 wðtÞCðt2tÞdt, which is commonly known as the renewable equation. 47,48 Here, the C(t) is the observed new incidences of COVID-19 at time t. The function w(Á) is the distribution of the generation time (GT) of the disease, that is, COVID-19. The GT is defined as the time interval between the time of exposure, that is, being infected, of a primary case and that of his associated secondary case in the consecutive transmission generation. 49 Thus, the distribution w(Á) is predefined in our model, which is commonly estimated from contract tracing surveillance data. [50][51][52][53] To set up the analysis for COVID-19, we consider w as a Gamma distribution having mean (6SD) values of 5.3 (62.1) days by averaging the GT estimates from the existing literatures. [50][51][52][53][54][55][56][57] Slight variation in the settings of the GT will not affect our main findings.
To incorporate the information of SARS-CoV-2 variants, we denote the proportion (or prevalence) of the j-th variant of concern (VoC) at calendar time t by r j (t), which is timevarying. Straightforwardly, X j r j ðtÞ51 for all t. We denote the variant-specific reproduction number for the j-th VoC at time t by R j (t), and we have R j ðtÞ5 . We  weekly proportion of 484E & 501Y  (1) Equation (1) will be used to formulate the likelihood function in the remaining sections. Note that the index j is merely used for labeling instead of ranking. Transmission advantages and their interactive effects. For convenience, we label the original variant as the (j 5) 0-th VoC, that is, the 484E/501N variant, and thus its associated variant-specific reproduction number is R j 5 0 . Similarly, we label the 484E/501Y, 484K/501N, and 484K/501Y variants as the j 5 1, 2, and 3 VoC, respectively.
Following previous work, 28,58 the transmission advantage of the mutated variant against the original type is defined as the ratio (denoted by h) of the strain-specified reproduction numbers. Considering the transmissibility of the original 484E/501N variant (R j 5 0 ) as the reference level, the transmission advantage of the j-th VoC is h j 5 R j R j50 . Thus, the reproduction number of cases infected by the j-th VoC is R j 5 h j ÁR j50 , and h j 5 0 5 1 by definition. If h j . 1, the j-th variant may be more infectious than the original genetic variant, and vice versa. In addition, the overall reproduction number is R j50 ðtÞ X j ½h j r j ðtÞ. We consider h j as a constant, which reflects an intrinsic nature of the j-th SARS-CoV-2 variant, and thus h j is invariant with time. Hence, we have R j (t) 5 h j ÁR j50 (t) for all time t. Then, we calculate the expected number of COVID-19 cases, E[C(t)], at time t in Eq. (2).
As such, for the observed sequencing data, the expected chance (or probability) that a randomly selected strain at the t-th day is j-th VoC, E[r j (t)], is given in Eq. (3).
Straightforwardly, X j E½r j ðtÞ51 for all t. For the E484K and N501Y substitutions, the 484E/501Y (j 5 1, i.e., including the B.1.1.7 or Alpha variants) and 484K/ 501N (j 5 2, i.e., including the P.2 or Zeta variants) are two variants with merely one substitution, whereas the 484K/501Y variant (j 5 3, i.e., including the P.1 or Gamma variants) has both substitutions. To explore the interactive effects on the variant-specific reproduction number, we compare h j53 and the product of (h j51 Áh j52 ). If h j53 is larger than (h j51 Áh j52 ), the E484K and N501Y substitutions may enhance the transmissibility than their separated partial effects, and vice versa.
Likelihood-based inference. According to Eq. (2), we construct the likelihood function L ðcÞ t of the daily number of cases using a Poisson-distributed framework with observation at C t and rate parameter at E[C t ] as in Eq. (4).
Here, the C t is the observed number of COVID-19 cases on day t and is the discretized C(t), which means The value of C t can be obtained from the number of COVID-19 cases time series as shown in Figure 1A. Note that the superscript (c) merely indicates that the likelihood function is for the number of cases, which does not indicate the power. For the observed sequencing data, we denote the numbers of j 5 0, 1, 2, and 3 variants by m j50,t , m j51,t , m j52,t , and m j53,t , respectively, for the day t. Thus, we model the sampling process of genetic variants using a generalized Bernoulli distribution, that is, categorical distribution, with probabilities at E[r j (t)]s in Eq.
Here, the E[r j,t ] is the expectation of variant prevalence for day t. Note that the superscript (s) merely indicates that the likelihood function is for genetic variants, which does not indicate the power. With Eqs. (4) and (5), we reconstruct the R j50,t time series, denoted by fR j50,t g, and estimate h j51 , h j52 , and h j53 using the overall log-likelihood function ' defined in Eq. (6).
'ðfR j50,t g, h j51 , h j52 , h j53 jfC t g, fm j50,t g, We calculate the maximum likelihood estimates (MLE) of parameters to determine transmission advantage of 484E/ 501Y (h j51 ), 484K/501N (h j52 ), and 484K/501Y (h j53 ) variants by using the likelihood framework defined in Eq. (6). The 95% confidence intervals (95% CI) are calculated using the profile likelihood estimation framework with a x 2 quantile as the cutoff, 59,60 which has also been adopted in previous work. [61][62][63][64][65][66][67] Fitting schemes and their selection. We explore the interactive effects between E484K and N501Y substitutions of SARS-CoV-2 in shaping the transmission advantage in Brazil. We consider the following eight fitting scenarios with respect to h j and investigate the role of E484K and N501Y substitutions contributing to the transmissibility of COVID-19: all of h j51 , h j52 , and h j53 are assumed at 1: scenario (#1): h j51 5 h j52 5 h j53 5 1; two of h j51 , h j52 , or h j53 are assumed at 1, and the remaining one is freely estimated: scenario (#2): h j51 5 h j52 The settings of the eight fitting scenarios are presented in Table 1.
We conduct the model fitting and parameter estimation under each scenario. The scenario with the best fitting performance is selected according to lowest values of 4 different information criteria including Akaike information criterion (AIC), corrected AIC for small sample size (AICc), Bayesian information criterion (BIC), and Hannan-Quinn information criterion (HQIC).
Sensitivity analysis. Sensitivity analysis was conducted to examine the robustness and significance of the determine transmission advantage estimates, that is, hs. We examine the consistency of both directions of the effects and their 95% confidence intervals (CI) under alternative settings. The following three sensitivity checking schemes are performed.
For the first scheme, we consider a univariate logistic regression model between the overall R t as response and r j51,t , r j52,t , and r j53,t as regressors. The regression coefficients of all r j,t are evaluated as the effect size of SARS-CoV2 variants on the overall transmissibility of COVID-19. For the second scheme, we repeat the estimating process of transmission advantage with alternative PDF of GT, that is, w(Á), which is introduced in the previous section, "Variantspecific reproduction number." We consider shorter and longer versions of mean GT at 4 days, 53,68 and 7.5 days, 2 respectively. For the third scheme, we repeat the analysis by replacing the Poisson-distributed likelihood function in Eq. (4) with a negative binomial (NB) distribution to further account for the superspreading potential of COVID-19 transmission. For the setting of NB distribution, we fixed the dispersion parameter at 0.4, which follows the estimation in recent studies. 55,[69][70][71][72][73][74]

RESULTS AND DISCUSSION
As of this writing, in Brazil, the epidemic curve had grown since March 2020 with two major epidemic waves, the first in August 2020, and the second is ongoing ( Figure 1A). The original 484E/501N variants started being replaced by the other three types of variants in October 2020 and had almost vanished in Brazil after January 2021 ( Figure 1C). In January 2021, the prevalence of the emerging 484K variants was 71.1% and 46.6% for 501Y variants. The prevalence of 484E/501Y (j 5 1 type) variants is 10.1%, 34.6% for 484K/ 501N (j 5 2 type) variants, and 36.5% for 484K/501Y (j 5 3 type) variants ( Figure 1D-F). As such, the linkage disequilibrium (LD) is calculated at 0.03, which indicates the occurrence of two mutations is likely random. Specially, the 484K/ 501Y variants are classified into the B.1.1.28.1 (or P.1, or 20J/501Y.V3) lineage.
For the eight fitting scenarios summarized in Table 1, we find that the transmission advantage of 484K/501Y (h j53 ) variants is estimated larger than 1 significantly and consistently. In contrast, the scale of h j51 for 484E/501Y variants appears statistically unclear compared with 1, that is, not significantly larger than 1. For the model selection, scenario 2 has the lowest AICc, BIC, and HQIC, and scenario 5 has the lowest AIC, and AICc. Both scenarios 2 and 5 also have close values of AIC with a difference of only 0.4. As such, scenario 2 is considered the main result and thus is presented in Figure 1B.
The modeling framework in this study links the mutation activity at molecular scale and COVID-19 transmissibility at population scale. We reconstruct the instantaneous reproduction numbers (R t ) of COVID-19 cases infected by 484K/501Y variants and other variants in Brazil under fitting scenario 2 ( Figure 1B). The overall trends of reproduction numbers are relatively high in the early phase of outbreak before and in May 2020 and the November 2020 for the second major epidemic wave, but gradually decrease thereafter. The average scale of reproduction number during the early outbreak is largely consistent with previous estimates. 2,3,[75][76][77] We report the estimated proportions of four types of SARS-CoV-2 variants, E[r t ], fit the observed sequencing data well ( Figure 1C-F). We infer the transmission advantage h j53 for 484K/501Y variants at 1.23 (95% CI: 1.04-1.41), which means the E484K and N501Y substitutions together increase 23% of COVID-19's transmissibility in Brazil, whereas other emerging variants, 484E/501Y and 484K/501N, are unlikely to have significant transmission advantage. Thus, in Figure  1B, the reproduction number of the 484K/501Y variant appears higher than that of the other genotypes (non-484K/ 501Y, or non-Gamma variants). For sensitivity checking, we find that the h j53 estimates are consistently and significantly larger than 1 in similar scales as the main estimates (data not shown), which validates our findings.
We focus on the second major epidemic wave because E484K and N501Y substitutions emerged during the same  period. Although the reproduction number of non-484K/ 501Y variants has fluctuated around 1 since December 2020, the reproduction number, R j53 , of 484K/501Y variants are largely greater than 1 during the same period, which has led to a large epidemic wave in Brazil in 2021 (see Figure 1A). Given that 484K/501Y variants trend to reach fixation, both the herd immunity threshold and intrinsic growth rate of epidemic may increase, which was previously discussed regarding the situation in United Kingdom, 12,25 and thus the local nonpharmaceutical interventions for COVID-19 control may be enforced. Hence, we highlight the importance of our analytical framework, such that the public health risks related to viral mutations may be controllable with early preparedness.
The increase in transmissibility associated with the 484K/ 501Y variants is biologically reasonable. The N501Y substitution is a mutation on a key contact residue in the receptor binding domain on the S protein 78 and is found to increase the ability of human angiotensin-converting enzyme 2 binding and cell infectivity in animal models, 79 which appears similar to the previous D614G substitution. 80 The E484K substitution is also located in the viral RBD and confers resistance to several monoclonal antibodies by affecting the binding process. 6,81,82 However, according to the h j51 estimates in Table 1, N501Y substitution alone appears less likely to form an advantage in fitness without accompanying 484K. Similarly, E484K substitution alone also might not secure a concrete transmission advantage robustly or significantly. Our finding indicates that the combined impact of 484K/501Y outperforms their independent effects, which implies statistically an interactive relationship.
Previous studies reported a higher case fatality risk among the individuals infected by SARS-CoV-2 strains in the B.1.1.7 lineage, 42,83 some of which carry both E484K and N501Y substitutions. Given the transmission advantage of 484K/501Y (i.e., h j53 . 1), the increasing intensity of COVID-19-related mortality is a public health concern. Clinical severity remains largely unassessed for the B.1.1.28 lineage in Brazil, and unexpected clinical outcomes may warrant adjustments in the treatment strategies. The emergence of 484K/501Y variants and its mutations (e.g., K417N or V1176F), combined with other VoC in Brazil and other places, implies the capacity of SARS-CoV-2 to evolve new phenotypes rapidly. 84 Although the neutralizing level of BNT162b2 vaccine-elicited sera is recently found to be satisfactory for 484K/501Y variants, 20 further investigation is required for other vaccine candidates at population scale.
This study has the following limitations. First, our analysis was based on the sequence data released in GISAID and thus is subject to the selection bias of sequences being released to the public domine. 12 Second, the reconstruction of reproduction numbers relies on the setting of the generation time (GT). Theoretically, the GT distribution might be altered by the mutated strains. However, by screening the literature, we find no evidence that GT is associated with the E484K or N501Y substitution in SARS-CoV-2, and thus we model GT distribution of COVID-19 w(Á) as a fixed Gamma distribution, following previous studies. [50][51][52][53][54] Third, we consider w(Á) as a fixed distribution. In the real-world situation, the time interval between transmission generations might vary, 75,85 which may affect the reconstruction of the reproduction number. However, the long-term trends of R t estimates are unlikely to change due to slight variation in GT. 75,86 Thus, we consider the impact of this limitation on the inference of transmission advantage may be negligible, and our model can be extended to a more complex context with the time-varying GT data available. Fourth, due to the lack of data from different Brazilian regions, we aggregated the national COVID-19 cases in Brazil to reconstruct the reproduction number series ( Figure 1B). We acknowledge this analytical scheme neglects the heterogeneities in epidemiological characteristics of COVID-19 transmission, 55,74,87,88 geographic separation in SARS-CoV-2 variants, 89 individual response and vulnerability to COVID-19, 90-92 and various nonpharmaceutical interventions 93,94 for wide regions across different Brazilian locations. We note that the transmission advantage may vary under different local settings or situations. Fifth, ideally, C(t) in the R t estimation should be the number of COVID-19 cases with onset at time t. However, because surveillance data by date of onset are unavailable, we adapted the current dataset by reporting data as a proxy for the COVID-19 incidence time series. If one considers a constant reporting lag, the R t estimates will have the same trends but are shifted for this lag. Considering that a similar reporting delay also occurred during the collection of SARS-CoV-2 sequencing data, the effects of the two reporting lags could be counteracted. We believe that this approximation is unlikely to affect the main conclusions of this study. Furthermore, with detailed reporting lag of information for each individual case, adjustments for reporting delay can be carried out based on our current analytical framework. Sixth, this study focuses on exploring the effects on changing disease transmissibility associated with mutation activities, but real-world biological mechanisms, which are usually more complex, remain uncovered. Future studies are needed to explore the mechanisms of how E484K and N501Y mutations in SARS-CoV-2 affect the transmissibility of COVID-19. Seventh, there exist other mutations in the B.1.1.28 lineage, such as K417N and V1176F, but we merely considered the E484K and N501Y because they are dominant in B.1.1.28.1 and B.1.1.28.2 lineages, respectively. Given the lack of individual patient information, time-series data were used in this work, which means there is information loss from the data aggregation. As pointed out in previous studies, 83 the independent effects of each commutation may not be disentangled in this study due to identification issues that the samples might fail to inform each estimate. Eighth, the interpretation of our findings should be limited to the COVID-19 epidemics in Brazil, but similar investigations could be conducted for other regions. Ninth, for simplification, we consider the transmission advantage (h) of new variants versus the wildtype as constant over time. This model assumption may not necessarily be strictly held considering several real-world determinants, including the accumulation of population immunity against different strains, 24 selection pressure due to intervention strategies, 27,58 and behavioral factors related to disease spread and transmission. 95 Alternatively, the estimating framework of h can be extended into a real-time basis. Lastly, as a data-driven study, the estimated association should be treated with caution. In an ecological setting, 13 although our analysis provides statistical evidence on the likelihood of causality, the findings in this study cannot guarantee causality, which requires further biomedical experiments for verification.