Current mutatome of SARS-CoV-2 in Turkey reveals mutations of interest

As the underlying pathogen for the COVID-19 pandemic that has affected tens of millions of lives worldwide, SARS-CoV-2 and its mutations are among the most urgent research topics today. Mutations in the virus genome can complicate attempts at accurate testing or developing a working treatment for the disease. Furthermore, because the virus uses its own proteins to replicate its genome, rather than host proteins, mutations in the replication proteins can have cascading effects on the mutation load of the virus genome. Due to the global, rapidly developing nature of the COVID-19 pandemic, local demographics of the virus can be difficult to analyze and track accurately, despite the importance of such information. Here, we analyzed available, high-quality genome data of SARS-CoV-2 isolates from Turkey and identified their mutations relative to the reference genome, to understand how the local mutatome compares to genomes worldwide. Our results indicate that viral genomes in Turkey have one of the highest mutation loads, and that certain mutations are remarkably frequent compared to global genomes. We also made the data on Turkey isolates available in an online database to facilitate further research on SARS-CoV-2 mutations in Turkey.

SARS-CoV-2 has a single-stranded RNA genome that codes for the proteins responsible for its own replication, many of which are produced via cleavage of the Orf1ab polyprotein, the product of the largest gene in the genome. Therefore, mutations in the SARS-CoV-2 genome can lead to cascading effects by reducing the fidelity of subsequent replication cycles. Key proteins in the RNA replication complex include nsps 7, 8, and 12 (also known as RNA-dependent RNA polymerase or RdRp), which together form the core polymerase complex (Kirchdoerfer and Ward, 2019; Peng et al., 2020), as well as nsp14, a dual-function protein that joins the larger replication complex as a 3'-5' error-correcting exonuclease (Subissi et al., 2014; Romano et al., 2020). Our previous findings show that frequently observed mutations in both nsp12 and nsp14 are associated with an increase in mutation density in the SARS-CoV-2 genome (Eskier et al., 2020a, 2020b, 2020c).
In this study, we aimed to analyze the current mutatome of SARS-CoV-2 in Turkey, with three main questions in mind: (i) are there any key recurring mutations observed in a large number of isolates? (ii) how does the distribution of mutations among isolates compare to other regions in the world? and finally, (iii) are there any mutations observed in Turkey but not in the rest of the world? We focused on the latter two questions in particular, with an emphasis on mutations of interest previously described in the literature. Our findings reveal the presence of three main clades of SARS-CoV-2 in Turkey, roughly analogous to 19A, 20A, and 20B as described in Nextstrain, with a preponderance of high mutability variants (Eskier et al., 2020a, 2020b, 2020c) compared to international isolates. Furthermore, we identified several frequently recurrent, previously uncharacterized variants in Turkey isolates that are not observed in isolates from other countries, and that can serve as candidates for validation and study. Finally, we collected our analysis of Turkey isolates in a regularly maintained and updated database, which we hope will serve as a resource for future research on the local mutatome of SARS-CoV-2.

Materials and methods
2.1. Genome sequence filtering, retrieval, and preprocessing
SARS-CoV-2 isolate genome sequences and the corresponding metadata were obtained from the GISAID EpiCoV database on 28 July 2020 4. These sequences were filtered to limit our dataset to isolates with the location "Europe/Turkey", which resulted in 180 isolate sequences. We applied further quality filters, selecting only isolates obtained from human hosts (excluding environmental samples and animal hosts), those sequenced for the full length of the genome (sequence size of 29 kb or greater), and those with high coverage of the reference genome (<1% N content, <0.05% unique mutations, no unverified indel mutations), which narrowed the list down to 166 isolates. Because characters other than A, C, G, T, or N would not be aligned according to their potential biological meanings, all nonstandard nucleotide characters were changed to N using the Linux sed command to ensure alignment accuracy, and the isolates were then aligned against the SARS-CoV-2 reference genome using the MAFFT (v7.450) alignment software (Katoh et al., 2002). Variant sites in the isolates were annotated using snp-sites (2.5.1), bcftools (1.10.2) 5, and ANNOVAR (release date 24 October 2019) software (Wang et al., 2010; Page et al., 2016) to identify whether a given mutation was synonymous or nonsynonymous. In addition, the 5' untranslated region of the genome (bases 1-265) and the 100 nucleotides at the 3' end were removed from the alignment and annotation files due to a high number of gaps and unidentified nucleotides.
4 The GISAID Initiative (2008)
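The preprocessing steps above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study, which relied on sed, MAFFT, snp-sites, bcftools, and ANNOVAR. The function names and constants below are hypothetical, only the length and N-content filters are shown, and the trimming step assumes that alignment columns coincide with reference coordinates (that is, no inserted columns relative to the reference).

```python
import re

# Thresholds taken from the text; the constant names are illustrative.
MIN_LENGTH = 29_000      # "sequence size of 29 kb or greater"
MAX_N_FRACTION = 0.01    # "<1% N content"
UTR5_END = 265           # 5' untranslated region: bases 1-265 (1-based)
TAIL_LENGTH = 100        # the 100 nucleotides at the 3' end

_NONSTANDARD = re.compile(r"[^ACGTN]")


def mask_nonstandard(seq: str) -> str:
    """Uppercase the sequence and replace every character other than
    A, C, G, T, or N with N (a Python stand-in for the sed step)."""
    return _NONSTANDARD.sub("N", seq.upper())


def passes_quality(seq: str) -> bool:
    """Return True if the sequence is full length and has <1% N content."""
    if len(seq) < MIN_LENGTH:
        return False
    return seq.count("N") / len(seq) < MAX_N_FRACTION


def trim_ends(aligned: str) -> str:
    """Drop the 5' UTR (first 265 columns) and the last 100 columns.
    Assumes alignment columns map one-to-one onto reference positions."""
    return aligned[UTR5_END:len(aligned) - TAIL_LENGTH]
```

Masking runs on the raw sequences before alignment, so gap characters introduced later by the aligner are not affected by `mask_nonstandard`.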

Development of the database and user interfaces
The genome data are stored in a MariaDB 10.3.22 database installed on the Debian Linux 10 operating system. For the web application, the genome data are visualized on a map using jVectorMap with HTML5 and Ajax web development techniques, built on the Django 3.0.5 framework and the Python 3.7.3 programming language. A modified version of TreeTime, an open-source phylogenetic analysis software, is used to create the phylogenetic tree (Sagulenko et al., 2018).

The distributions of mutations across isolates in Turkey
Our analysis of the genome sequences of 166 isolates from Turkey revealed 258 distinct mutations across the isolates, 87 of which are observed in multiple isolates and 43 of which are found in at least five isolates (hereafter referred to as recurring mutations). Of the 43 recurring mutations, 19 are nonsynonymous, 21 are synonymous, and 3 are found outside of coding regions. C>T transitions are the most common, comprising over half of the mutations, consistent with previous international findings on C>U hypermutations in SARS-CoV-2 (Simmonds, 2020). The most commonly seen mutations are 3037 C>T, 14408 C>T, and 23403 A>G, observed together in 139 of the isolates, with one singleton instance of 23403 A>G, also consistent with previous findings (Pachetti et al., 2020; Yin, 2020). Orf1ab mutations are the most common, comprising 23 of the recurring mutations, consistent with the size of the gene, as Orf1ab makes up two thirds of the SARS-CoV-2 genome. The Orf9 (nucleocapsid; N) gene has the second highest number of recurring mutations (n = 7, although 3 of them belong to the 28881-28883 trinucleotide block mutation), followed by the Orf5 (membrane; M) and S genes (n = 5) (Table 1).
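As an illustration of how such tallies can be derived from per-isolate variant calls, the following Python sketch counts how many isolates carry each mutation, keeps those that meet the "at least five isolates" threshold, and summarizes substitution types. The data structure and function names are assumptions made for this example, and the study's own counts came from the snp-sites, bcftools, and ANNOVAR output rather than from this code.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# (1-based genome position, reference base, alternate base)
Mutation = Tuple[int, str, str]


def recurring_mutations(
    isolates: Dict[str, Set[Mutation]], min_isolates: int = 5
) -> Dict[Mutation, int]:
    """Map each mutation seen in at least `min_isolates` isolates
    to the number of isolates that carry it."""
    counts = Counter(m for muts in isolates.values() for m in muts)
    return {m: n for m, n in counts.items() if n >= min_isolates}


def substitution_spectrum(mutations: Iterable[Mutation]) -> Counter:
    """Count distinct mutations by substitution type, e.g. 'C>T'."""
    return Counter(f"{ref}>{alt}" for _, ref, alt in mutations)
```

Calling `substitution_spectrum` on the set of distinct mutations, rather than on per-isolate calls, gives the share of each substitution type among mutations irrespective of how many isolates carry them.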

Database implementation
Data regarding Turkey isolates are available as a database comprising an interactive phylogenetic tree of the isolates, a geographical heatmap of sequenced isolates, and tables for both the mutatome of individual isolates and summaries of the mutations observed in the isolates (Figure). The phylogenetic tree can be viewed in both real time and divergence time, and colored according to nucleotide of interest, location, or sequencing date. The tables are generated using the sequencing metadata available from GISAID as well as ANNOVAR variant annotation tables. We aim to regularly validate and update the database as new sequences are made available 7. Future plans include implementation of Nextstrain clade and branch information in the phylogenetic tree to aid the user in comparisons with international sequencing data.

Discussion
COVID-19 has been causing tremendous challenges for clinicians, healthcare systems, societies, and governments, and has required the development of novel approaches to fight the pandemic. With an unpredictable future course for the ongoing pandemic, close monitoring and characterization of mutations have emerged as a top priority for better understanding of possible genotype-phenotype relations, and therefore better management of healthcare efforts. Mutations in any viral infection, especially those that have crossed interspecies barriers, have to be considered in the context of natural selection. As the evolution of a virus will likely affect its fitness in a new host, any attempts against such an infection have to consider the causal relationships between genomic variants and the spread of the virus. Previous studies suggest that the selective pressure on mutations in SARS-CoV-2 in human hosts is largely confined to modest positive selection, with very little purifying selection, due to the short span of the pandemic, and that most of the positive selection has occurred in previous hosts (MacLean et al., 2020). Therefore, any investigation of the mutations will need to consider that most of the mutations have to be beneficial or neutral to create true strains of the virus. A comprehensive analysis by Jungreis et al. (2020) showed that SARS-CoV-2 mutations are excluded from the evolutionarily conserved amino acid residues and nucleotides, and the authors concluded that both synonymous and nonsynonymous mutations are under purifying selection. Therefore, not only the nonsynonymous mutations, but also the synonymous ones should be considered as potentially functional.
7 The database is freely accessible at http://covid19.ibg.edu.tr.
Many studies have already provided lines of evidence that support a role for the S D614G mutation in increased infectivity, and likely in transmissibility, of SARS-CoV-2 (Daniloski et al., 2020; Korber et al., 2020). It is possible that new mutations that affect viral behavior may arise, and therefore the emergence and spread of such mutations should be monitored closely. However, with tens of millions affected worldwide, monitoring every single mutation is a challenging task. We believe that our database will provide a valuable and practical resource for researchers in Turkey, as well as in other countries, to track the spread of SARS-CoV-2 mutations in Turkey.
Our findings show that the viral isolates in Turkey have, on average, accumulated a higher number of mutations compared to other regions, even after normalizing for the fact that isolates sequenced earlier during the pandemic have accumulated fewer mutations. Furthermore, isolates from Turkey have more mutations in the Orf1ab gene, which produces the polyprotein that is cleaved into the mature peptides responsible for viral replication, than those from any other region. In addition, they have the third highest number of mutations in the S gene, which is responsible for the viral infection of the cells. As these two genes have the highest potential impact on the replication and transmission cycle of the virus, a higher mutation density in these genes can lead to an accelerated mutation rate. Of note, the frequency of the 18877 C>T mutation in nsp14, the 3'-5' exonuclease responsible for error correction during genomic replication, is the second highest of any country in Turkey 6. Our previous study (Eskier et al., 2020a) shows a strong correlation between increased mutation density and the 18877 C>T mutation, which might be a potential reason for Turkey's increased SNV average per isolate.
Two groups of mutations we identified that are worth further attention are the 3037 C>T, 14408 C>T, 23403 A>G haplotype and the 28881-28883 block mutation. The mutations within each group are found almost exclusively together, both in Turkey and worldwide. In both cases, Turkey has a higher incidence of mutations in these groups than the worldwide average and than four of the major regions (Asia, Europe, North America, Oceania). We previously found that the 14408 C>T and 23403 A>G mutations, when occurring together, are strongly associated with increased mutation density over time (Eskier et al., 2020a), and the prevalence of both these mutations and the 18877 C>T mutation in Turkey isolates may further contribute to a variant-rich mutation landscape (Eskier et al., 2020b). 28881-28883 GGG>AAC is found in the N gene, whose product is responsible for packaging the genome into newly produced virions in cells and for regulating the host cell response (McBride et al., 2014). The mutation disrupts an SR-rich motif in the nucleocapsid protein, a disruption that was found to cause reduced transmissibility in SARS-CoV, a similar betacoronavirus with high homology to SARS-CoV-2 (Tylor et al., 2009; Ayub, 2020). It is not clear whether the mutations in each group are selected together and show homoplasic recurrence across isolates, or whether they are the result of a strong founder effect.
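One simple way to quantify how exclusively a group of mutations travels together is to count isolates that carry all, some, or none of the sites in the group. The Python sketch below does this for an arbitrary set of positions. The function name and input layout are hypothetical, and the code only illustrates the bookkeeping, not the analysis performed in the study. It also cannot distinguish co-selection from a founder effect, which would require phylogenetic context.

```python
from typing import Dict, Iterable, Set


def haplotype_cooccurrence(
    isolates: Dict[str, Set[int]], sites: Iterable[int]
) -> Dict[str, int]:
    """Count isolates carrying all, only some, or none of the given
    mutated positions. `isolates` maps an isolate ID to the set of
    genome positions at which it differs from the reference."""
    wanted = set(sites)
    tally = {"all": 0, "partial": 0, "none": 0}
    for positions in isolates.values():
        hits = len(wanted & positions)
        if hits == len(wanted):
            tally["all"] += 1
        elif hits > 0:
            tally["partial"] += 1
        else:
            tally["none"] += 1
    return tally
```

A group that is found almost exclusively together would show a "partial" count close to zero, as with the single isolate reported to carry 23403 A>G without the other two haplotype mutations.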
A major concern when analyzing the isolate sequences from Turkey is the limited nature of the data. The sequences are few in number, and their geographical and temporal distributions are highly skewed, leading to difficulty in understanding the transmission routes of the virus across the country. Furthermore, new sequences are often made available in large batches by the centers, which may introduce further bias into the samples through batch-specific sequencing or assembly artifacts. Unless verified by multiple centers, in multiple batches, or by other experimental methods, caution is required when studying these mutations. As more genomes are sequenced, a clearer picture of the SARS-CoV-2 mutatome in Turkey will emerge, and we will likely be able to draw more solid conclusions. Finally, it should be noted that mutational profiles of viral genomes may determine whether infected patients will develop lasting immunity and remain protected from re-infection. Although exposure to SARS-CoV-2 protected rhesus macaques from re-infection with the same strain of virus (Deng et al., 2020), questions remain as to whether each recovered patient will have lasting immunity. Recent news reports indicated that four patients from Hong Kong, Belgium, the Netherlands, and the USA, who had earlier recovered from COVID-19, have been reinfected with a different strain of SARS-CoV-2 than that of the original infection 8 (Tillett et al., 2021). In support of this observation, an earlier study reported that convalescent plasma from some COVID-19 patients showed reduced neutralizing activity against pseudoviruses with the D614G mutation in culture environment (Hue et al., 2020).
8 Euronews (1993). Euronews [online]. Website https://www.euronews.com/2020/08/25/two-cases-of-covid-19-reinfection-reported-in-europe [accessed 28 August 2020].
We do not have a clear understanding of the viral determinants of lasting immunity to SARS-CoV-2; however, it seems that certain viral proteins may be more critical than others, based on analyses of patient plasma samples. Grifoni et al. (2020) suggested that the M, Spike, and N proteins are the major determinants of the CD4+ response, with additional responses to nsp3, nsp4, ORF3a, and ORF8. Hachim et al. (2020) showed that the ORF8, ORF3b, and N proteins of SARS-CoV-2 elicited the strongest specific antibody responses in infected patients. It is plausible that certain mutations within these proteins affect the immune response; however, it remains to be explored whether any of the mutations that are common or more frequently seen in Turkish isolates have any effect on the immune response.