Evolutionary Tracking of SARS-CoV-2 Genetic Variants Highlights an Intricate Balance of Stabilizing and Destabilizing Mutations

ABSTRACT The currently ongoing COVID-19 pandemic caused by SARS-CoV-2 has accounted for millions of infections and deaths across the globe. Genome sequences of SARS-CoV-2 are being published daily in public databases and the availability of these genome data sets has allowed unprecedented access to the mutational patterns of SARS-CoV-2 evolution. We made use of the same genomic information for conducting phylogenetic analysis and identifying lineage-specific mutations. The catalogued lineage-defining mutations were analyzed for their stabilizing or destabilizing impact on viral proteins. We recorded persistence of D614G, S477N, A222V, and V1176F variants and a global expansion of the PANGOLIN variant B.1. In addition, a retention of Q57H (B.1.X), R203K/G204R (B.1.1.X), T85I (B.1.2-B.1.3), G15S+T428I (C.X), and I120F (D.X) variations was observed. Overall, we recorded a striking balance between stabilizing and destabilizing mutations, therefore leading to well-maintained protein structures. With selection pressures in the form of newly developed vaccines and therapeutics to mount in the coming months, the task of mapping viral mutations and recording their impact on key viral proteins should be crucial to preemptively catch any escape mechanism for which SARS-CoV-2 may evolve.

standstill (1). During the course of 11 months, the coronavirus disease 19  pandemic has caused more than 81 million confirmed cases in 220 countries, with close to 1,770,000 fatalities. (2). Initially, and rightly, the efforts were focused on minimizing the number of cases and deaths due to . This included fast tracking the search and development of novel treatment and prevention options (4). Today, however, as vaccine candidates have started showing promising results, there is a cautious shift towards assessing the efficacy of vaccine candidates with respect to the circulating diversity of SARS-CoV-2 and its continuously evolving genetic variants (5).
Functional mutations that help the virus to adapt to the recent host-shift events are hypothesized to drive the evolution of transmissibility and virulence in SARS-CoV-2 (6). Shortly after the first isolated SARS-CoV-2 genome from China was published, .30,500 distinct mutations were catalogued in the CoV-GLUE database (http://cov-glue.cvr.gla .ac.uk/) among globally circulating strains of this virus (7). Variations in the genetic makeup are key determinants in measuring the evolutionary distance and stability of SARS-CoV-2 from the first sequenced isolate (8). Moreover, tracking the evolution of SARS-CoV-2 since its introduction in humans is a high-priority undertaking to prevent future waves of this pandemic from escaping the global preparedness (9). Since many vaccine candidates currently under development are derived from the first available SARS-CoV-2 sequences, recurrent genetic changes may have an unforeseen impact on their sustained effectiveness in the longer term (10).
The availability of whole-genome sequences of SARS-CoV-2 in public repositories such as Global Initiative on Sharing All Influenza Data (GISAID) and real-time data visualization pipeline NextStrain (https://nextstrain.org) offers a great opportunity for scientists to track the evolutionary path of this virus (11,12). Phylogenetic Assignment of Named Global Outbreak LINeages tool (PANGOLIN) has been the most widely used tool for lineage assignment to newly emerging variants. PANGOLIN (https://cov-lineages.org/pangolin.html) has also been deployed in establishing the transmission patterns of various clones of this virus (13). Since coronaviruses frequently recombine, tracking the evolution and assigning lineages has been challenging (13,14). As a result, multiple studies that tracked the evolution of SARS-CoV-2 have been hugely controversial. For example, doubts have been cast on the claim of finding more aggressive L type strains emerging from S type strains (14). Similarly, the hypothesis that rapid spread of the D614G variant of SARS-CoV-2 indicates a possible fitness advantage has been questioned (15)(16)(17). Therefore, in the current and highly sensitive global circumstances due to this pandemic, having a detailed map of mutations highlighting their prospective role in therapeutics and vaccine development can prepare us better for the future waves of continuously evolving SARS-CoV-2. In this study, we present a catalogue of the most important genomic mutations recorded between December 2019 and November 2020 in SARS-Cov-2 and their possible impact on the stability of protein candidates that form the most crucial part of vaccines and also constitute the most common therapeutic targets.

RESULTS
Diversity of SARS-CoV-2 genomes. Of the 7,000 SARS-CoV-2 genomes screened, we constructed a robust phylogenetic tree of 513 genomes strategically selected to reflect the most complete diversity among the isolates by covering all the PANGOLIN lineages. Lineage assignment based on the PANGOLIN tool indicated the circulation of seven distinct lineages and/or sublineages, such as A, B.1, B.1.1, B.1.1.1, B.2, B.3, and B.6. This is in line with the phylogenetic groupings by GISAID (S, L, V, O, G, GH, and GR) (Fig. 1). As the epidemic has progressed and mutations have accumulated, further subdivision of major lineages into sublineages has been observed. Overall, a total of 61 lineages and sublineages have been found to be circulating concurrently in multiple countries around the world. In general, numerous introductions of different variants were observed across the globe with a few sublineages (C.2, D.2) being restricted to certain regions. While the B.1.113 lineage, for example, has been exclusively reported from India, lineages C.2 and D.2 have been geographically confined to South Africa and Australia, respectively.
Major amino acid substitutions. Mutation mapping showed a total of 106 amino acid substitutions (missense mutations in .5 genomes) from a representative set of 513 genomes. The analysis also revealed 36 mutations that were found in .5% of genome sequences, while 12 major substitutions were lineage-defining mutations (Fig. 1). The first major mutation to appear was L84S in ORF8 (present in 8.6% of the genomes) that has defined the A lineage (i.e., clade S in the GISAID classification). The subsequent amino acid substitutions L37F in ORF3a and G251V in nsp6 were found to be present in 13.3% and 1.4% of genomes, respectively. The combination of G251V and L37F, which was initially considered a defining mutation pattern for the B.2 to B.6 lineage (clade V in GISAID classification), has shown under more detailed analysis that isolates carrying the G251V mutation are distributed in other lineages too. The predominant lineage-defining mutations in the whole data set were D614G (85.5%) and P323L (85.5%), after originally appearing in late January 2020 (Fig. 2). Other major mutations noted are Q57H (26.5%), R203K/G204R (33%), G15S (12%), I120F (11.5%), and T85I (14%).
Dominance of the D614G variant. Two mutations have become consensus: D614G in S (nucleotide 23,403, A to G) and P323L (also known as P4715L) in nsp12 (nucleotide 14,143, C to T). These mutations were present in 80.5% of the sequences and have defined the B.1 lineage (G in GISAID classification). The widely discussed D614G variant is speculated to have been introduced in Europe at the end of January (EPI_ISl_422424) before becoming globally dominant. Genomes with D614G mutations were assigned as B.1 by PANGOLIN or GH/GR by GISAID. Notably, founder lineage B.1 and its sublineages B.1.X, B.1.1.X, D.X, and C.X that carry both D614G and P323L mutations have become the dominant variants across the world (87% of global collection per CoV-GLUE as of 30 November 2020).
As the pandemic has progressed, several other major substitutions affecting the protein structure have appeared. These are Q57H (nucleotide 25,563, G to T) in ORF3a, the R203K 1 G204R combination (nucleotide 28,881, GGG to AAC) in nucleocapsid, and T85I (nucleotide 1,059, C to T) in ORF1a. The region-specific sub lineages C.1, C.2, D.1, and D.2 were found to cumulatively harbor multiple mutations. Amino acid substitutions such as T428I and G15S in ORF1a were reported in sublineages C.1 and C.2, and the S477N substitution in the spike (S) protein along with I120F in nsp2 specifically established the sublineage D.2 ( Fig. 1).
Structural analysis of SARS-CoV-2 mutants. The possible structural consequences of 11 lineage-defining missense mutations identified in this study were investigated. Among the mutations, three were considered stabilizing to the respective protein structures, while six mutations were destabilizing ( Table 1). The significance of these mutations in evolutionary selection cannot be solely predicted by DDG, or change in free energy. Hence for a precise interpretation, correlation of DDG, DDS, and N-H S 2 ( Table S2 in the supplemental material) order parameter values of the proteins have been taken into account based on fine local alterations in structures. All lineage- defining mutations except two have reduced the vibrational entropies of the proteins, thereby decreasing the flexibility in the structures (Table 1).
Additionally, the impact of mutations in key structural proteins that potentially allows any pathogen to escape available treatment and prevention regime was investigated. Among the 59 major missense mutations, our analysis using both the SDM and DUET servers predicted 16 missense mutations as stabilizing and 23 missense mutations as destabilizing the protein structure. Twenty major mutations were predicted to be neither stabilizing nor destabilizing, as the DDG values provided by the SDM and DUET servers were contradictory ( Table 2).
Balance of stabilizing and destabilizing mutations. Overall, from both the data sets, 70 amino acid substitutions in SARS-CoV-2 were tested for stability, of which 19 were stabilizing, 29 were destabilizing, and 22 showed inconclusive results. Computational prediction to understand the effect of amino acid substitutions in SARS-CoV-2 revealed a balance of stabilization and destabilization of the proteins.
When checked for amino acid substitutions, the stabilizing mutation in spike (S) protein predicted an increase in the rigidity of its structure ( Fig. 3; Fig. S1). The increased rigidities of the structure may provide a stable conformation to the protein that may positively influence the binding of spike protein to the ACE2 receptor (18). The major mutations D614G and S477N were located at potential epitope regions (codons 469 to 882), with S477N particularly positioned in the receptor-binding domain (RBD) of the S protein (319 to 541).
The most frequent amino acid substitutions were observed in the nucleocapsid (N) protein, in which the variants S194L, D103Y, P13L, S197L, M234I, and S188L were predicted to be stabilizing according to both the analytical servers (Table 2). In contrast, membrane (M) and envelope (E) proteins accounted for the least number of amino acid substitutions. The amino acid changes in M (T175M) indicated a stabilizing effect, while E does not account for any stabilizing variant. Structural analysis of double (D614G 1 S477N; D614G 1 A222V) and triple (D614G 1 S477N 1 A222V) mutation patterns in the S protein indicated DDG values of 0.228, 0.195 and 0.129, respectively (Table 3). This signifies that accumulation of spike mutation in D614G-bearing lineages could potentially be affecting the stability of the spike and therefore may influence the binding affinity toward the ACE2 receptor.

DISCUSSION
Since the beginning of the COVID-19 pandemic, whole-genome, sequence-based phylogenetic inference has been heavily utilized in tracing viral origins and transmission chains (19). However, as the virus has evolved with time, genomic data are being increasingly used in guiding infection risk and control strategies. Several genomic mutations have been mapped that seem to be of advantage to the virus (20). In parallel, numerous vaccine candidates have been designed using genomic data from the original SARS-CoV-2 strain of Wuhan and many are now approved for use or at latestage trials (21,22). Based on immunological data obtained from infected and   (23)(24)(25). However, as vaccines are introduced and successful treatment options become available, it is vital that we carefully monitor the mutations in the immunogenic region of SARS-CoV-2 genome (26). Mapping these changes to protein structure will allow preemptive forecasting of the direction of change in vaccine effectiveness and guide future preparedness efforts. We analyzed the impact of recurrent amino acid replacements in the genomic evolution and proteome stability of SARS-CoV-2 from its introduction in December 2019 through to November 2020. Our analysis found an intriguing balance of stabilizing and destabilizing mutations, which may have allowed SARS-CoV-2 to evolve and persist without losing pathogenicity. SARS-CoV-2 is considered a slowly evolving virus, as it possesses an inherent proofreading mechanism to repair the mismatches during its replication. This is believed to have a crucial role in maintaining the stability and integrity of the viral genome (27,28). Our analysis confirmed previously recorded positive natural selection of the D614G, S477N (29), A222V, and V1176F (30) variants and a global expansion of the PANGOLIN variant B.1 (11). In addition, we also observed a positive natural selection of Q57H (B.1.X), R203K/G204R (B.1.1.X), T85I (B. 1.2-B.1.3), G15S1T428I (C.X), and I120F (D.X) variants (Fig. 2).
Apart from the 11 clade-defining mutations, some of the major missense mutations were in the four structural proteins (E, M, N, and S). When analyzed for their impact (n = 59) in the respective protein structure, the spike glycoprotein, more specifically its RBD domain, was found to be most vulnerable to frequent mutations. This may be due to the immunological observation that most neutralizing anti-SARS-CoV-2 antibodies have been found to target the RBD domain of the S protein (31,32). Consistent with  this finding, a total of 4,170 missense mutations have been reported in the spike protein, with 683 on the RBD domain alone (CoV-Glue, accessed 12 December 2020). Computational prediction to understand the effect of amino acid substitutions in E, M, N, and S proteins revealed a balance of stabilization and destabilization of the proteins. While viral populations carrying mutations with higher stabilizing effects (positive DDG values) would be expected to become dominant variants, it is interesting to note that destabilization mutations in the major protein targets of SARS-CoV-2 have also generated variants that have been hugely successful. For example, many of the favorably selected variants, such as L18F, L5F (spike); R203K, G204R, and A220V (nucleocapsid), were found to be destabilizing the respective protein structure ( Table 1). As destabilizing mutations are known for their crucial functional roles, a trade-off between stabilizing and destabilizing mutations may balance the protein function and structure in ways that are not yet fully understood (33,34). In our study, the effect of mutations on respective proteins was primarily estimated based on the physical change in free energy for a single "native" protein conformation. To allow the most robust correlation of mutations with molecular evolution, the mutational effects for the protein in an unfolded state, and the possibility of structural adjustment of the folded state in response to the mutation, needs to be explored in future studies when more structural dynamic information becomes available (35). While our study highlights the impact of DDG analyses as a reference frame for evolutionary evaluation, molecular evolution is likely a consequence of complex amalgamation of changes in free energy, entropy, solvent accessibilities, etc. (36). As the data on these unchecked parameters becomes available, predicting evolutionary selection of mutation with respect to the phylogeny would become confirmatory. Our study highlighting preliminary data linking free energy and phylogeny would help streamline the scope of future studies by providing a baseline matrix.
The currently circulating spike variants or RBD variants need to be taken into account while evaluating vaccine candidates or neutralizing monoclonal antibodies against SARS-CoV-2 (37). Mapping the viral mutations that escape antibody binding is essential for accessing the efficacy of therapeutic and prophylactic anti-SARS-CoV-2 agents (29,38). Recently generated experimental evidence suggests that leading vaccines (mRNA-1273, BNT162b1, and ChAdOx1 nCoV-19) and two potent neutralizing antibodies (REGN10987 and REGN10933) are unlikely to be affected by the dominant variant D614G (23,24,(39)(40)(41). As all three candidate vaccines encode RBD or the part of spike protein as antigens, the viral population is expected to try and escape by altering the positioning of the respective antigens (42) under vaccine-induced selection pressure. Notably, complete escape mutation maps of 3,804 of the 3,819 possible RBD amino acid mutations against 10 human monoclonal antibodies are already in place (29,42). The antigenic effect of key RBD mutations against the REGN-COV2 cocktail (REGN10933 and REGN10987) showed N439K and K444R variants escaped neutralization only by REGN10987, while E406W escaped both individual REGN-COV2 antibodies and the cocktail (38). Similar strategies should be adopted to map all antibody resistance mutations against neutralizing antibodies elicited after vaccination. Once mutation escape maps are available for all successful vaccine candidates, vaccine roll-out strategies should be carefully planned to counter geographically confined escape mutants.
In conclusion, our study highlights the importance of continued genomic surveillance, mutation mapping, stability analysis, and potential escape mutation cataloguing both in the pre-and postvaccination period of SARS-CoV-2 so as to design the epidemiologically best vaccination programs. The currently observed mutation pattern and subsequent phylogenetic diversification of SARS-CoV-2 seem to be strongly influenced by the negative and positive selection pressures. The overall variation in SARS-CoV-2 sequences is currently low compared to many other RNA viruses. One of the possible reasons for the low rate of mutations can be attributed to the widespread absence of neutralizing antibodies or the selective pressure. Once the virus population is challenged with the vaccine candidates or therapeutic monoclonal antibodies, the currently known epitopes on surfaces of SARS-CoV-2 proteins are likely to undergo rapid forced change for survival. Thus, the prevalence of such possible escape mutations needs to be monitored even more carefully after vaccination if we are to remain ahead of this rapidly shifting pandemic curve.

MATERIALS AND METHODS
Data acquisition and curation. In total, we have retrieved 7,000 genomes from GISAID EpiCoV database (https://www.gisaid.org/). Data sets that were flagged as complete (.28,000 bp) were screened and subsequently manually curated for excluding low quality/coverage sequences and duplicates. Sequence metadata was retrieved and only genomes containing sampling time and location were chosen for the study. Lineages were assigned from alignment file using the Phylogenetic Assignment of Named Global Outbreak LINeages tool PANGOLIN v1.07 (https://github.com/hCoV-2019/pangolin). We selected a subset of 513 genomes (Table S1 in the supplemental material) that belongs to all major PANGOLIN lineages and common mutations for the optimal output of the phylogenetic tree.
Mutation profiling. In order to identify the genetic variants, assembled genomes were mapped against the reference (Wuhan-Hu-1: accession: NC_045512) using Snippy mapping and variant calling pipeline (https://github.com/tseemann/snippy) (46). Among the SNPs, missense SNPs (nonsynonymous) were extracted using custom-written bash scripts and manually curated as per the CoV-GLUE database (http://cov-glue.cvr.gla.ac.uk/). Specifically, we considered 11 lineage-defining mutations and 59 major missense mutations in four major structural proteins: envelope protein (E), membrane glycoprotein (M), nucleocapsid phosphoprotein (N), and spike protein (S). Structural analysis of 70 amino acid substitutions in SARS CoV-2 mutants were analyzed to examine the potential impact of these mutations on protein stability.
Structural analysis. The structural impact of mutations has been assessed from the COVID-3D server (http://biosig.unimelb.edu.au/covid3d), which has integrated analytics regarding mutation-based structural changes in a protein. Vibrational entropy (VE) (DDS) and unfolding Gibbs free energy (FE) (DDG) were considered markers to ascertain the stability of the variants. Gibbs free energy (FE) (DDG) values from the site directed mutator (SDM), DUET, and DynaMut tools available in COVID-3D server were considered (47,48). The change in vibrational entropy energy (DDSVib ENCoM) between wild-type and mutant protein was calculated using DynaMut (49). VE explains the occupation probabilities of protein residues in an energy landscape based on average configurational entropies. Considerable decrease in VE increases the rigidity of the proteins (50). FE, on the other hand, describes the free energy alterations while unfolding a kinetically stable protein (49). The positive and negative values of DDG indicate the stabilizing and destabilizing mutations. DynaMine (http://dynamine.ibsquare.be/) was employed to validate the stability profiles through residue level (sequence-based) dynamics. Backbone N-H S 2 order parameter values (atomic bond vector's movement restrictions) were generated according to the molecular reference frame. These N-H S 2 order parameter values are evaluated from experimentally determined NMR chemical shifts. A value above 0.8 is considered highly stable, values between 0.6 and 0.8 can be considered to be functionally contextual, and values .0.6 are highly flexible (51).
Data availability. The genome sequences used in this study are available in the Global Initiative on Sharing All Influenza Data (GISAID) with accession IDs (see Table S1 in the supplemental material).

ACKNOWLEDGMENTS
We thank Soumya Basu (ICMR, Senior research Fellow) for his contribution and helpful advice in the structural analysis. We gratefully acknowledge the Department of Clinical Microbiology, Christian Medical College, Vellore, Tamil Nadu, India, for providing all the necessary computational facilities for this work. We are grateful to the staff of Christian Medical College for their assistance with data curation.
This work received no specific external funding and the work was carried out depending on the resources of the host institute.
We declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Evolutionary Events of Mutations in SARS-CoV-2