Early detection and improved genomic surveillance of SARS-CoV-2 variants from deep sequencing data

Summary A key task of genomic surveillance of infectious viral diseases lies in the early detection of dangerous variants. Unexpected help to this end is provided by the analysis of deep sequencing data of viral samples, which are typically discarded after creating consensus sequences. Such analysis allows one to detect intra-host low-frequency mutations, which are a footprint of mutational processes underlying the origination of new variants. Their timely identification may improve public-health decision-making with respect to traditional approaches exploiting consensus sequences. We present the analysis of 220,788 high-quality deep sequencing SARS-CoV-2 samples, showing that many spike and nucleocapsid mutations of interest associated to the most circulating variants, including Beta, Delta, and Omicron, might have been intercepted several months in advance. Furthermore, we show that a refined genomic surveillance system leveraging deep sequencing data might allow one to pinpoint emerging mutation patterns, providing an automated data-driven support to virologists and epidemiologists.


INTRODUCTION
The dramatic impact of the COVID-19 pandemic at the global scale has proven that pivotal efforts must be devoted by the scientific community to the timely identification and quantification of hazardous variants, i.e., those showing increased virulence, pathogenesis, or ability to escape therapeutic strategies such as vaccines (Oude Munnink et al., 2021;Elliott et al., 2021). To this end, institutions such as the World Health Organization (WHO), the European Centre for Disease Prevention and Control (ECDC), the Centers for Disease Control and Prevention (CDC), and others are repeatedly updating the lists of the so-called variants of interests (VOIs), variants of concern (VOCs), variants under monitoring (VUMs) (also named Variants Being Monitored [VBMs]) and de-escalated aminos (DEVs) (World Health Organization, 2021; European Centre for Disease Prevention and Control, 2021; Centers for Disease Control and Prevention, 2021).
Each variant is identified and categorized according to institution-specific molecular and epidemiological criteria (Konings et al., 2021). VOCs are usually associated with evidence of diminished effectiveness of treatments, increased transmissibility, immune escape, and/or diagnostic escape. VOIs present genetic changes that are predicted or known to cause the same effect of a VOC but have a limited prevalence (e.g., in circumscribed outbreak clusters). VUMs bear genetic markers suspected to impact the epidemic dynamics but circulate at a very low level. Finally, a variant is classified as DEV if it is no longer circulating or if there is solid evidence that it does not affect the overall epidemiological situation.   iScience Article SARS-CoV-2 genome. In Table 1, one can find the set of SARS-CoV-2 variants listed as VOI, VOC, VUM, or DEV by at least one of the three public health bodies as of October 26th 2021, and the related WHO label (in Greek letters), Pango lineage, and set of SMoIs (note that the Omicron variant was added to the list, despite being designed as VOC by the WHO on November 26th 2021). For further details on the categorization criteria please refer to the institution websites, whereas for an up-to-date association between mutations, variants, and lineages please refer to Hodcroft (2021).
During the pandemic, the analysis and surveillance of SARS-CoV-2 variants has benefited from the surge of Next-Generation Sequencing (NGS) experiments performed via different protocols (e.g., Illumina RNA-seq or Amplicon) on viral samples, which are typically collected from primary isolates of infected people. The data generated are made available on portals such as GISAID (Shu and McCauley, 2017), Nextstrain (Hadfield et al., 2018), Cov-Lineages.org (O'Toole et al., 2021a(O'Toole et al., , 2021b, NCBI SARS-CoV-2 Resources (National Center for Biotechnology Information, 2021), or the EMBL-EBI COVID-19 Data Portal (EMBL-EBI Covid-19 Data Portal, 2021). However, the large majority of available datasets include only the consensus sequences of the samples, rather than the source deep sequencing data from which such sequences are generated with distinct criteria. In fact, the proportion of deep sequencing datasets shared on public repositories has been significantly lower than that of consensus sequence during the course of the pandemic. To provide a clarifying example, we focus on two of the most widely used portals collecting NGS data of viral samples: as of August 2021, 4,558,675 consensus sequences are stored on the GISAID database (Shu and McCauley, 2017), whereas only 975,767 samples (z21:4%) are included in (Illumina paired-end) sequencing datasets accessible (from different sources) on the NCBI website (National Center for Biotechnology Information, 2021), of which only approximately 50% are of high quality (see Figure 1 and see STAR Methods for further details). Furthermore, to the best of our knowledge, all existing portals collecting deep sequencing datasets do not currently provide any unified standard for their processing and analysis, nor any automated computational tool for their integration and in-depth investigation. Hence, the retrieval, processing, and analysis of such datasets may require a significant amount of ad hoc manual work and computation time, possibly hindering their proper exploitation.
Deep sequencing data are indeed essential to characterize the intra-host mutational landscape of viral samples and, especially, to identify the presence of minor mutations, i.e., single-nucleotide variants (SNVs) and indels detected with low mutation frequency (MF, which-roughly-is the ratio of alternative allele reads over the total reads for a specific genome position) (Liu et al., 2022). Minor mutations areby definition-not included in consensus sequences but characterize the heterogeneous ensemble of viral subpopulations, known as quasispecies (Domingo et al., 2012;Knyazev et al., 2021), which are typically present in single hosts. Most important, every new genomic mutation first originates as minor within single iScience Article hosts, due to replication errors of viral polymerases that are often induced by host-related mutational processes Maspero et al., 2021;Ramazzotti et al., 2022).
Minor mutations are generally purified, due to detrimental effects on the fitness of the virus, and are significantly affected by transmission events (e.g., bottlenecks), which hamper their diffusion in the population (see Gallego-García et al., 2022), as opposed to fixed mutation, which are usually transmitted from a host to another during infections. However, certain minor mutations may provide the virus with a fitness advantage, for instance in terms of enhanced reproductive potential or increased infectiveness and, accordingly, are positively selected, first within hosts and, later, in the population of infected people. For these and many additional transmission-related phenomena, the MF of certain minor mutations can sometimes increase both within single hosts and across infection chains. Once the MF exceeds a certain threshold (typically around 50%), the mutation fixates, meaning that the related viral subpopulation has become dominant within a given host. Only then the mutation is included into consensus sequences. Among the set of fixed mutations, some (e.g., the SMoIs) will eventually contribute to the origination of hazardous variants, subsequently impacting the course of the epidemic.
A finer characterization of the minor mutation landscape is therefore beneficial to (1) intercept impactful mutations prior to their fixation and (2) assess the presence of dangerous minor mutations in samples exhibiting circulating variants. Both aspects are essential for the definition of an effective genomics-informed epidemiological surveillance system, which may drive the design of timely public-health interventions, with substantial repercussions in terms of epidemiological dynamics and socioeconomic costs (Gardy and Loman, 2018;Baud et al., 2020;Verschuur et al., 2021;Davis et al., 2021). In this regard, note that the characterization of the minor mutation landscape has already been successfully exploited to investigate drug resistance, contagion chains, bottleneck effects, and mutational signatures, for a number of infectious diseases including COVID-19 (Beerenwinkel and Zagordi, 2011;Knyazev et al., 2021;Ramazzotti et al., 2021;Graudenzi et al., 2021).
In support of our claims, here we provide the largest up-to-date worldwide study of deep sequencing data of SARS-CoV-2 samples, which includes 220,788 high-quality samples from 137 distinct datasets. Our iScience Article analyses are first focused on a list of 44 variants and related 35 SMoIs, included in the lists of hazardous variants by the WHO, the ECDC, and the CDC. We also analyze a list of 95 further spike mutations that have not been associated to any known variant at the time of writing but display significant diffusion patterns. Because attention was recently raised on the functional role of mutations hitting the nucleocapsid (N) protein (via the analysis of SARS-CoV-2 virus-like particles [Syed et al., 2021]), we also investigated a list of 13 N mutations with potential functional effect (here labelled as N mutations of interest, NMoIs) and a further list of 82 highly diffused N mutations.
In brief, (1) we prove that the identification of several S and N mutations could be anticipated of a significant time-span with respect to standard analyses based on consensus sequences, and (2) we highlight that a significant number of samples harboring circulating variants display homoplastic minor mutations, which might lead to the origination of new variants.

RESULTS
The analyses presented in this work focus on (1) the set of 35 S mutations (SMoIs) associated to the 44 SARS-CoV-2 variants included in the lists of VOCs, VOIs, VUMs, or DEVs in at least one of the WHO, the CDC, and the ECDC websites, as of October 26th 2021 (see Table 1; Omicron variant was labeled as VOC on November 26th 2021); (2)

Early detection of mutations of interest Spike mutations of interest associated to known variants
One of the major differences in employing deep sequencing data instead of consensus sequences lies in the possibility of detecting, in principle, any genomic mutation with great temporal advance. To corroborate this claim, we first analyzed separately each SMoI associated to the hazardous variants included in Table 1. A total amount of 35 distinct mutations were analyzed, involved in 44 variants (we recall that any variant can be associated to one or more Pango lineages ).
In particular, we computed (1) the MF of all SMoIs in all samples, (2) the prevalence of each SMoI in the population (i.e., proportion of samples) when detected as either minor (MF R5% and <50%) or fixed (MF R 50%), with respect to collection date (grouped by month) and location (grouped by continent) (Note that we restricted the detection time analysis on the months in which any mutation is detected [with MF R5%] in at least 5 samples in a given location, so to ensure a minimum level of statistical significance [for a discussion on filtering criteria, please refer to the Limitations of the study section]).
Overall, 4 (out of 35) SMoIs were detected as minor (in at least 5 samples) at least one month prior to their initial detection as fixed at the global level, i.e., S:Q414K in March 2020, S:L452R in September 2020, and S:H655Y in January 2020, and 2 additional SMoIs in at least one of the considered geographical regions, i.e., S:L18F, S:T478K, and S:A701V, in September 2020 in Africa (see Figure 2). The 6 SMoIs detected in advance characterize a large number of hazardous variants (see Table 1) and, in particular, S:Q414K is associated to one variant (B. In Figure 3, one can find the in-depth analysis of mutations S:L452R and S:H655Y, whereas the analysis of all remaining early detected S and N mutations is presented in Figures S3-S36. More in detail, at the global scale, mutation S:L452R is first observed as minor in Africa in September 2020 and as fixed in North America in October 2020 (1 month in advance). The anticipation is remarkably amplified when considering the local scale. In fact, in Africa the mutation is observed as fixed only in December 2020, that is 3 months later.
A similar trend is observed for mutation S:H655Y, which is firstly detected as minor in January 2020 in North America and as fixed in June 2020 in Europe (5 months in advance at the global scale) and in October 2020 in North America (9 months in advance at the local scale). Notice also that when mutation S:H655Y was first detected in Europe, Africa, and Oceania, the large majority of samples exhibited it as minor. These important results show that, once a variant is identified as VUM, VOI, or VOC, it is possible to detect the related mutations considerably before their fixation at both the global and the local scale, allowing for a timely implementation of containment strategies.
As an aside note, the distribution of the MF proves that most SMoIs are present either at a very low or at a very high frequency within hosts, suggesting the presence of strong purifying selection and of bottlenecks, as already observed in ) (see Figures S3-S36).

Additional spike mutations
Even if a mutation is not associated to any of the known variants, it is possible to investigate its selection and fixation dynamics with remarkable anticipation by looking at its MF variation in time and its diffusion in the population.
To this end, we analyzed the list of (single-nucleotide) mutations meeting the following criteria: (1) falling on the spike gene, (2) not being associated to any of the variants of  Figure 2. Early detection of 6 SMoIs associated to hazardous variants from deep sequencing data Analysis of SMoIs: S:L18F, S:Q414K, S:L452R, S:T478K, S:H655Y, and S:A701V (see Table 1). Circles with purple borders mark the first month in which the mutation was detected as minor (MF R5% and <50%) in at least 5 samples, while been still undetected as fixed (MF R50%); circles with blue borders mark the month in which the mutation was first detected as fixed in at least 1 sample; red lines highlight the anticipation (when >1 months). The analysis is performed by splitting the samples in the 6 distinct geographical regions and by reporting the corresponding results at the global scale. All circles contain a pie-chart that displays the ratio of samples showing that mutation either as minor or as fixed in that month (further details are provided in the main text). For each SMoI the related variants are also reported.    Jacob et al. (2021). These further results prove that the analysis of the minor mutation landscape might be effective to intercept hazardous S mutations prior to their fixation in the population, even when not associated to lists of known variants.
We finally note that 9 additional highly diffused S mutations were initially detected both as minor (in at least 5 samples) and fixed ( Mutation S:G446V shows particularly interesting dynamics, as it has been observed mostly (> 90% samples) as minor mutation since March 2020, but has been showing an increase in MF since November 2020, and is now observed as a fixed variant approximately in 50% of the samples presenting the mutation in July 2021 and August 2021. Moreover, mutation S:G446V has been associated to attenuate monoclonal and serum antibody neutralization (Liu et al., 2021).

Nucleocapsid mutations
We repeated the analysis by first focusing on the list of 13 NMoIs selected in Syed et al. (2021). Four of them, in particular, were associated to a significant increased mRNA delivery and expression from the analysis of SARS-CoV-2 virus-like particles (N:P199L, N:S202R, N:R203K, and N:R203M), and it was also hypothesized that one of such mutations may be responsible for the increased spread of variants including Delta (N:R203M). Furthermore, we selected an additional list of 83 highly diffused N mutations, with the criteria employed for the additional S mutations and described earlier (mutations N:D3E, N:D3H, N:D3V, and N:K256* were removed from the analysis after manual curation).  iScience Article Of such mutations, N:D377Y was identified as NMoI in Syed et al. (2021) and in Africa was discovered as minor 3 months in advance. In Figure 5, one can find such mutation, in addition to three mutations first detected as minor at the global scale (see above), whereas the remaining mutations are shown in Figure S2.
Although ad-hoc investigations on the possible functional effect of such mutations are clearly required, these findings demonstrate the effectiveness of deep sequencing data analyses to intercept possibly hazardous mutations.
In further support to this claim, let us also notice that 1 additional NMoI was initially detected both as minor (in at least 5 samples) and fixed (

Improved genomic surveillance of circulating variants
In addition to the early detection of mutations, deep sequencing data are important for the characterization of the intra-host diversity of SARS-CoV-2 samples that are already associated to circulating variants, overcoming the intrinsic limitations of studies on consensus sequences. This analysis has important repercussion in terms of genomic surveillance.
In fact, homoplastic minor mutations (i.e., retrieved in distant lineages, with sufficient sample size) are typically not related to transmission events but emerge independently in unrelated hosts, either due to a possible fitness advantage or due to mutational hotspots . Accordingly, they should be flagged and carefully considered, as they might possibly lead to the origination of new dangerous variants, if positively selected due to any underlying functional advantage. Thus, their characterization, in combination with analyses directed to the evaluation of positive selection processes (e.g., via molecular simulation-see the Limitations of the study Section for a dedicated discussion on the topic) might allow one to design opportune alert systems and timely intervention strategies.  Table 1 (samples assigned to the ''Other'' category were excluded from the analysis, whereas, at the time of writing, Pangolin did not associate any sample to the Omicron variant). In Figure 6, one can find the prevalence of the selected minor mutations in the samples associated to the different variants. Only the mutations retrieved in at least 1% of the samples in at least one variant are considered.  Overall, 19 (out of 44) variants listed in Table 1 (dark purple check marks) display homoplastic minor (S or N) mutations of interest. This result confirms the benefits of analyzing deep sequencing data to pinpoint the emergence of possibly hazardous mutations, also in samples harboring known variants, with an improved resolution with respect to standard analyses of consensus sequences.

DISCUSSION
Thanks to the largest up-to-date analysis of deep sequencing datasets of SARS-CoV-2 samples, we proved that standard studies based on consensus sequences might be scarcely effective for the early detection of mutations of interests and for the fine monitoring of homoplastic mutations that might lead to the origination of new variants. These aspects are even more relevant when considering the exceptional proportion of the COVID-19 pandemic and should be wisely considered in the ominous prospective of future epidemics.
Accordingly, a refined estimation of key epidemiological parameters (e.g., R t ) from deep sequencing data might lead to significant differences in the predictions delivered by the wide range of currently available epidemiological models (Davis et al., 2021;Kraemer et al., 2021), as well as in the accuracy and robustness of phylodynamics methods, which in most cases rely on consensus sequences (Duchene et al., 2020). This might affect, in turn, the geo-temporal narrative on variants origination, as well as that of infection chains and (multiple) introductions (Deng et al., 2020), possibly guiding improved testing and response strategies.
In this work, we started the analysis from the set of mutations related to known variants. Yet, a large set of additional S and N mutations were detected as minor with great advance with respect to standard analyses. As specified in the Limitations of the study section, it would be important to combine this data-driven result with automated approaches aimed at estimating the fitness advantage of mutations and variants.
Furthermore, the difference in the anticipation window related to the distinct geographical region suggests a straightforward way of improving current surveillance practices: once a variant start being monitored, even if already fixed in some areas, the analysis of the related minor mutations on different locations might allow one to intercept outbreak clusters with great advance, as well as to better estimate its overall prevalence.
It is also vital to point out that both the overall number of (minor) mutations detected in advance, and the magnitude of anticipation, would have dramatically benefited from the possibility of accessing a larger number of deep sequencing datasets, especially at the beginning of the epidemic. This issue mostly resulted from the absence of shared standards for testing and sequencing, which also contributed to the origination of relevant sampling biases and geographical inhomogeneities (Díaz-Pachó n and Rao, 2021).
In this respect, given the strong evidences of our results, a methodological paradigm shift aimed at exploiting deep sequencing data seems to be opportune to improve genomic surveillance and might take advantage of the ever-increasing computational power and the pervasive data-sharing networks available to the scientific community. Accordingly, we advocate a collective effort for the definition of standardized best practices for deep sequencing data processing, analysis, and sharing (e.g., distribution of VCF files in FAIR-compliant repositories (Wilkinson et al., 2016)), to be implemented on top of existing databases and portals.

Limitations of the study
Any analysis of intra-host viral diversity highly depends on the quality of upstream variant/haplotype calling, which is in turn closely related to the adopted technology and the testing criteria ( iScience Article 2021). When dealing with low-frequency mutations, one of the main criticisms lies in the difficulty of identifying true mutations from sequencing artifacts or phantom mutations, due, e.g., to mutational hotspots (Bandelt et al., 2002), as well as the possibility of dropouts, due, e.g., to uneven coverage (Posada-Cespedes et al., 2017). The analysis of distinct sets of minor variants might deceive statistical inference approaches, leading to partially incorrect results, and this might apply to the case of early detection of mutations as well. To this end, many variant callers currently exist to correct for data-specific errors (Knyazev et al., 2021) and might be tested to assess the robustness of the results discussed hereby.
It is also important to remark that the filtering criteria employed in the detection time analysis (Results Section) are-by construction-arbitrary, even if in this case they were designed to ensure a good trade-off between sensitivity of the analysis and overall statistical robustness. In fact, we decided to employ stringent criteria for minor mutations (e.g., we considered only the mutations observed as minor in at least 5 samples), so to reduce the impact of noisy observations and sequencing artifacts, and much softer criteria for fixed mutations (e.g., we considered such mutations when found in at least 1 sample), so to demonstrate the general validity of our approach. Even if the exploration of the countless filtering combinations is unpractical, we expect that different and less strict criteria might lead to even greater magnitudes of anticipation in the detection of mutations, further confirming our results.
Another general limitation of this study is related to the practical unfeasibility of providing up-to-date results with respect to the current (and ever-changing) knowledge regarding the evolution of the SARS-CoV-2 virus and its variants, which however does not affect the general message of the work. For updates on SARS-CoV-2 phylogenomic model and variants, we refer the reader to the websites of public bodies such as World Health Organization, 2021; Centers for Disease Control and Prevention, 2021; and European Centre for Disease Prevention and Control, 2021 or to portals such as Shu and McCauley, 2017;Hadfield et al., 2018;O'Toole et al., 2021aO'Toole et al., , 2021bNational Center for Biotechnology Information, 2021;andEMBL-EBI Covid-19 Data Portal, 2021 (Hodcroft, 2021).
As mentioned in the previous section, several advanced methods for the investigation of the selection processes involving genomic mutations are currently available, e.g., via molecular simulations of the functional effects of genomic changes, and represent a fundamental complementary aspect of genomic surveillance. For instance, in Starr et al. (2020), the authors evaluate the fitness of SARS-CoV-2 mutations by estimating their effect on: (1) the human ACE2 receptor binding affinity and (2) the virus receptor binding domain (RBD) folding stability, thus providing a further instrument for vaccine design and genomic surveillance. In Pond (2021), state-of-the-art statistical methods are employed to identify which positions in the SARS-CoV-2 genome may be subject to positive or negative selection, also allowing one to pinpoint the most interesting sites according to different criteria, via an interactive tool. Further studies aimed at predicting the fitness advantage of mutations are presented, e.g., in Shang et al., 2020;Yan et al., 2020;Wu et al., 2021;Zhou et al., 2021;and Greaney et al., 2021. It would be surely worthwhile to integrate similar approaches with methods for the early detection of mutations from deep sequencing data, as proposed here, so to deliver comprehensive genomic surveillance systems covering the many facets of viral evolution.
We finally note that the limitations related to partial and inhomogeneous testing/sampling of the population are discussed in the Discussion section.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: