Comparative sequence analysis of SARS-CoV-2 suggests its high transmissibility and pathogenicity

Aim: Because the highly pathogenic SARS-CoV-2 is newly introduced to humans, we aimed to understand the unique features of its genome and proteins that are crucial for high transmissibility and disease severity. Materials & methods: The available genome and protein sequences of SARS-CoV-2 and of known human and nonhuman CoV were analyzed using multiple sequence alignment programs. Results: Our analysis revealed some unique mutations in SARS-CoV-2 spike, ORF1a/b, ORF3a/3b and ORF8. The most interesting ones were in the spike angiotensin-converting enzyme 2 receptor binding-motif and the generation of a furin-like cleavage site, as well as deletions of the ORF3a ‘diacidic motif’ and the entire ORF3b. Conclusion: Our data suggest that SARS-CoV-2 has diverged from SARS-CoV-1 but is closest to bat-SL-CoV. Unique mutations in spike and ORF3a/b proteins strongly endorse its adaptive evolution, enhanced infectivity and severe pathogenesis in humans.

antigenic and is used as a serological marker [9]. The nonstructural replicase proteins (pp1a: nsp1-nsp11 and pp1b: nsp12-nsp16) are involved in mRNA synthesis and replication, and the accessory proteins (3a, 3b, 6, 7a, 7b, 8 and 9b) participate in modulating host innate immunity. Upon infection, SARS-CoV-2 first attaches to the naso-/oro-pharyngeal inner linings and then moves down to the lungs, which are even richer in ACE2, and triggers cell damage. The bat-SL-CoV S protein is known to bind to civet and horseshoe bat ACE2 receptors [10]. Similarly, the SARS-CoV-2 S protein binds to the ACE2 of airway epithelium and alveolar type-2 pneumocytes that produce pulmonary surfactant [11]. In SARS-CoV-1 patients, apoptosis-mediated damage of lung epithelial cells as well as hematological changes including lymphopenia, thrombocytopenia and occasionally leucopenia have been observed, suggesting a role for apoptosis in disease severity [12]. Moreover, the SARS-CoV-1 N, 3a, 3b and 7a proteins are reported to induce apoptosis in cultured cells [13,14].
Theoretical framework
SARS-CoV-2 has faster 'human-to-human' transmission rates and higher pathogenicity than SARS-CoV-1 and MERS-CoV [15]. Notably, SARS-CoV-1 was successfully contained because the majority of infections happened in hospital settings, where spread occurred during the late and symptomatic phase [16,17]. Unlike SARS-CoV-1, most of the spread of SARS-CoV-2 is occurring through asymptomatic infection [18], which is an obstacle to its quick containment. In addition, SARS-CoV-2 has been reported to survive longer than SARS-CoV-1 on certain surfaces including cardboard, plastic and stainless steel [19], which raises the chances of its fomite transmission. In general, HCoV do not cause life-threatening disease. However, owing to the zoonotic origin of SARS-CoV-1 and SARS-CoV-2, humans lack natural immunity to them, making these viruses aggressively pathogenic [20]. The lack of this pre-existing immunity, called 'herd immunity', is an important reason why naive humans take much longer to develop adaptive immune responses against SARS-CoV-2.
Moreover, SARS-CoV-2 has an incubation period of 2-14 days, which is longer than that of MERS-CoV (2-7 days) and SARS-CoV-1 (2-7 days) [21,22]. The symptoms include fever, cough and breathlessness, and may range from mild pneumonia to severe illness [22,23]. Nearly 80% of COVID-19 cases remain asymptomatic or show very mild and self-recovering symptoms, about 15% of cases show high fever, pneumonia and breathlessness, and up to 5% develop respiratory or multiorgan failure and death [22]. Also, COVID-19 patients with diabetes or hypertension have significantly increased expression of cellular ACE2 receptors, putting them at high risk of mortality [24]. Clinical studies have shown that COVID-19 patients with severe pneumonia may rapidly progress to acute respiratory distress syndrome, septic shock or multiple organ failure and death [24]. Nonetheless, unlike for SARS-CoV-1 and MERS-CoV, the precise mechanism by which SARS-CoV-2 modulates host innate immune responses and causes severe pathogenesis remains elusive [21]. Moreover, digestive symptoms and liver inflammation are also reported in hospitalized COVID-19 patients, which are attributed to the activities of cytotoxic T cells and Kupffer cells [25-30]. Unlike in SARS and MERS cases, cardiac disease, arrhythmia and hypertension have been observed twice as often among critical COVID-19 patients [31,32].
Although the clinical manifestations of COVID-19 are now well understood, the mechanism(s) underlying its high infection rate and pathogenicity are not yet clearly established. Several recent studies have reported comparative sequence analyses revealing some important aspects of specific mutations among SARS-CoV-2 isolates from different geographical regions [33-38]. In this report, we therefore analyzed the available human and nonhuman CoV genome and protein sequences to gain insight into the high transmissibility and disease severity of SARS-CoV-2.

Multiple sequence analysis
Translation of protein sequence from cDNA sequence was performed using the sequence analysis application MacVector. Genome and proteins/ORFs sequence alignment was analyzed using multiple sequence alignment A.
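As an illustration of the translation step only, the sketch below converts a cDNA string to protein using the standard genetic code. It is not the MacVector procedure used in this study; the function name `translate`, the `to_stop` option and the short input fragment are assumptions made for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna, to_stop=True):
    """Translate a cDNA string in reading frame 1; '*' marks a stop codon."""
    seq = cdna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*" and to_stop:
            break  # stop at the first in-frame stop codon
        protein.append(aa)
    return "".join(protein)


# Hypothetical 33-nt fragment, used only to exercise the function.
fragment = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC"
print(translate(fragment))
```

Passing `to_stop=False` keeps translating past stop codons, which can help when inspecting an ORF for internal stops.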

Comparison of genome sequences of SARS-CoV-2 with other CoV
In the present study we compared the genome and protein sequences of SARS-CoV-2, SL-CoV, MERS-CoV and other HCoV. Our comparative analysis showed no significant similarity of SARS-CoV-2 with MERS-CoV or the common cold-causing HCoV-OC43. The SARS-CoV sequences from horseshoe bats, civets and humans had strong similarity in their RNA sequences (Table 2A). Interestingly, genome analysis of the reference Wuhan Hu-1 and USA WA-1 sequences revealed up to 87 and 96% identity with the bat-SL-CoV-ZXC21 and bat-SL-CoV-RaTG13 isolates, respectively. This indicated that SARS-CoV-2 had a separate line of evolution from bat-SL-CoV or bat-like mammals as compared with SARS-CoV-1.
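The identity percentages quoted above depend on how aligned columns are counted. A minimal sketch of one common convention follows, assuming two rows taken from an existing alignment; the function name `percent_identity` and the choice to skip columns that are gaps in both rows are assumptions for this example, not the settings of the alignment program used here.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two rows of an existing alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column is a gap in both rows; ignore it
        compared += 1
        if a == b:  # neither residue can be a gap at this point
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned rows (hypothetical), six columns with one gap and one mismatch.
print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```

Other conventions divide by the length of the shorter sequence or exclude every gapped column, so reported values are comparable only when the same denominator is used.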

Mutational analysis of SARS-CoV-2 S protein
Amino acid sequence alignment of SARS-CoV-2 with SARS-CoV-1 showed that most mutations were acquired in the S protein (Table 2B). Interestingly, the RBM residues located within the RBD extracellular domain showed the most variability, suggesting that it is a 'hotspot' of rigorous mutations (Table 2C). As with the genome sequences, the SARS-CoV-2 S protein was most identical in sequence to that of bat-SL-CoV (Table 2D). One of the unique features of the SARS-CoV-2 S protein is the presence of a furin-like cleavage site (Figure 1) with an insertion of proline-arginine-arginine-alanine (PRRA) residues, not reported in HCoV or bat-SL-CoV isolates. In line with this, we also did not observe this furin-like cleavage site in any of the other known CoV sequences (Figure 1). However, whether the 'PRRA' insertion evolved in bats or in another intermediate mammal, such as civet or pangolin, remains an area for further investigation.
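A search for furin-like sites of the kind described above can be sketched as a pattern scan over a protein sequence. The example assumes the minimal furin consensus R-X-X-R; the function name `find_furin_like_sites` and the two short fragments are illustrative assumptions written for this sketch and should be checked against the deposited sequences before any real use.

```python
import re

# Minimal furin consensus R-X-X-R; the lookahead allows overlapping hits.
FURIN_LIKE = re.compile(r"(?=(R..R))")


def find_furin_like_sites(protein):
    """Return (0-based position, motif) for every R-X-X-R match."""
    return [(m.start(), m.group(1)) for m in FURIN_LIKE.finditer(protein.upper())]


# Illustrative fragments around the S1/S2 junction (assumed, not authoritative).
with_prra = "TNSPRRARSVAS"   # contains the PRRA insertion followed by R
without_prra = "SLLRSTSQ"    # single arginine, no R-X-X-R pattern

print(find_furin_like_sites(with_prra))
print(find_furin_like_sites(without_prra))
```

A motif match only flags a candidate site; whether furin actually cleaves there depends on structural context that a sequence scan cannot capture.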

Mutational analysis of ORF3a/3b
Unlike other Beta-CoVs, SARS-CoVs express unique ORFs, of which ORF3a is the largest. Our sequence analysis revealed that the 274-residue-long 3a encoded by SARS-CoV-2 is a major mutational 'hotspot', with only 72% similarity to that of SARS-CoV-1 (Table 2B). We also observed mutations leading to deletion of the 3a diacidic motif (EXD) in SARS-CoV-2. Notably, a significant truncation in ORF3b (also called orf4 elsewhere) due to the introduction of multiple stop codons was also observed. Therefore, the truncated 3b significantly differentiated SARS-CoV-2 from other Beta-CoVs (Figure 2). Taken together, our analysis has highlighted some of the interesting features of SARS-CoV-2 that may be crucial for its pathogenicity and evolution.
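The two ORF3 observations above (loss of the EXD motif and truncation by in-frame stop codons) correspond to simple sequence checks. The sketch below assumes frame-1 reading of a cDNA string and an E-X-D pattern on the protein; the function names and toy inputs are hypothetical and are not taken from the sequences analyzed here.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_before_first_stop(cdna):
    """Count sense codons in frame 1 before the first in-frame stop codon."""
    seq = cdna.upper().replace("U", "T")
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return len(seq) // 3  # no stop found: every complete codon is counted


def diacidic_positions(protein):
    """Return 0-based start positions of every E-X-D (diacidic) match."""
    return [m.start() for m in re.finditer(r"(?=E.D)", protein.upper())]


# Toy inputs (hypothetical): an early stop codon, and a protein with one EXD.
print(codons_before_first_stop("ATGGCTTAAGGG"))
print(diacidic_positions("MKEGDLL"))
```

Comparing `codons_before_first_stop` between homologous ORFs gives a quick indication of truncation, although a proper comparison should start from an alignment.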

Discussion
Approximately 80% of viruses that infect humans are zoonotic; such viruses are initially ill adapted to a new host, replicate slowly and are transmitted inefficiently [39]. Therefore, their 'animal-to-human' and 'human-to-human' transmission greatly depends on their evolution into virulent strains that can adapt well to human hosts. RNA viruses, owing to the low replication fidelity (∼10⁻⁴ errors/site/cycle) of their RNA polymerase, are more genetically diversified than DNA viruses [1]. In addition, post-transcriptional nucleotide modifications, genetic reassortment or virus-host recombination may further lead to the establishment of stable strains or lineages in human populations [1]. In a recent analysis, the most prevalent A→G mutation observed in SARS-CoV-2 RNA was suggested to be caused by the host RNA-deamination mechanism [40]. In view of this, both viral RNA replication errors and the host RNA-modification system have a significant impact on mutation rates toward host adaptation and pathogenicity. Previously, phylogenetic analysis has shown very close similarity of the SARS-CoV-2 genome (∼96% identity) and S protein (∼80% identity) with bat-SL-CoV (ZXC21 and ZC45) [41-44]. Notably therein, while the 'S1' subunit is highly variable (∼70% similarity), 'S2' sequences are conserved and share ∼99% identity with both bat-SL-CoV and SARS-CoV-1 [44,45]. Notably, five critical amino acids in the RBD differ between SARS-CoV-2 and SARS-CoV-1, suggesting strong binding of the SARS-CoV-2 S protein to the ACE2 receptor and high infectivity [46,47]. Our data showed insignificant similarity of SARS-CoV-2 with MERS-CoV and HCoV-OC43. Interestingly, the two SARS-CoV-2 isolates (Wuhan Hu-1 and USA WA-1) revealed up to 87 and 96% sequence identity, respectively, with bat-SL-CoV. This indicated that SARS-CoV-2 could have a separate line of evolution from bat-SL-CoV or bat-like mammals (e.g., civet and pangolin) as compared with SARS-CoV-1.
Notably, SARS-CoV-1 has already been detected in masked-palm or gem-faced civet cats, which are commonly sold at Chinese wildlife/wet markets [14,48]. In line with this, during the early phase of the SARS-CoV-1 outbreak in China, over 40% of the infected individuals were associated with wildlife market or restaurant workers [10]. Likewise, some of the initial cases of SARS-CoV-2 infection were also suggested to be linked to a Wuhan wet market, indicating its possible transmission from bats, civets or pangolins to humans [49]. Our data showed that most mutations were acquired in the SARS-CoV-2 S protein, especially in the RBM, suggesting its enhanced binding to ACE2 as compared with SARS-CoV-1. In addition, we did not observe a furin-like cleavage site in any of the known CoV except SARS-CoV-2, as reported elsewhere [50,51]. Cleavage of the SARS-CoV-2 S protein by furin leads to an open conformation that gives the advantage of enhanced binding to host ACE2 receptors [51]. Moreover, neuropilin-1, which binds furin-cleaved substrates, has been reported to significantly improve the ACE2-mediated entry of SARS-CoV-2 [52]. Because a role for furin-like proteases in virus entry is characteristic of pathogenic flu viruses [53], the presence of a furin-like cleavage site in SARS-CoV-2 strongly supports its high transmission and pathogenicity in humans.
A recent comparative sequence analysis of the S, E, M, N and ORF8 proteins of several SARS-CoV-2 isolates from 15 countries against the reference Wuhan Hu-1 has revealed substitutions and/or deletions in the S protein at 13 sites, substitutions at three sites in the N protein and one substitution in the M protein [54]. Another sequence analysis has identified frequent mutations in the S, N, ORF1ab and ORF8 proteins in 20 SARS-CoV-2 isolates and evaluated their potential effects on protein stability and possible functional consequences [34]. Of these, the co-occurrence of some mutations across different proteins has suggested structural and/or functional interactions among viral proteins, and their involvement in virus adaptability and enhanced transmission. Notably, analysis of the structural stability of S protein mutants has further indicated the viability of specific variants that could be more prone to global distribution, temporally and spatially [34]. Sequence analysis of Russian SARS-CoV-2 isolates, including those from other countries, has revealed a set of seven common mutations in the S and N proteins, suggesting multiple imports to Russia, local circulation and varying patterns of spread [36]. Meta-analysis of SARS-CoV-2 isolates within the USA has also identified over 900 unique variants in at least three samples [55]. These included 487 missense and 348 synonymous mutations, four in-frame deletions, five stop codon insertions/deletions and 66 intergenic recombinations. In another study, though diversity of SARS-CoV-2 strains seemed to emerge globally, no geographical clustering was observed, which suggested multiple introductions [37]. Interestingly, however, their 5' terminal sequences were more variable than the 3' termini, indicating S, E, ORF1ab and ORF3a as key drivers of diversity, notably with the RBD as a mutational hotspot.
A previous study on SARS-CoV-1 has revealed the presence of an 'EXD' motif within the internalization motif of ORF3a that regulates the surface expression, interaction and internalization of the 3a and S proteins [56]. Our sequence analysis has shown mutations leading to deletion of 'EXD' in SARS-CoV-2 3a. Notably, the SARS-CoV-2 ORF3b was found to be truncated due to the introduction of multiple stop codons, which significantly differentiated it from other Beta-CoVs that code for eight accessory proteins. In line with the reported SARS-CoV-1 truncated 3a activity [57], a recent study has shown the interference of SARS-CoV-2 3b with the host interferon system [58]. Because 3a has been reported to modulate the IL-2 promoter and interfere with interferon signaling [59], it would be interesting to study the consequences of such observed mutations in clinical settings.
Moreover, SARS-CoV-2 ORF3b and ORF8 have been found to induce a strong antibody response in the early and late phases of infection [60]. In SARS-CoV-2 ORF8, a substitution (Leu→Ser) has been observed in 23 isolates, suggesting its high impact on protein functionality and pathogenesis [37]. Moreover, a proposed phylogenetic tree with representative HCoV and bat-CoV has identified at least two hypervariable hotspots in the ORF8 protein, one of which shows a Leu→Ser substitution [33]. Notably, analysis of new SARS-CoV-2 isolates in Italy has reported no evidence for the putative 382-nucleotide deletion in ORF8 reported in Singapore [38,61]. Taken together, our analysis has endorsed some of the interesting features of SARS-CoV-2 that may be crucial for its pathogenicity and high transmission in humans.
In addition to acquired mutations, differential host factors, such as age, health, physiology, nutritional status, past exposure, travel history, co-infections, immune-competence, comorbidities and genetics, significantly determine the susceptibility to a novel virus [62]. A recent phylogenetic analysis of SARS-CoV-2 sequences from Hong Kong has shown their linkage to European isolates [63]. Interestingly therein, despite insignificant variations between their genomes, they had different clinical presentations, suggesting a more important role of host factors than of mutations in pathogenicity. New viruses introduced to humans may further evolve into more aggressive strains, as seen for SARS-CoV-2. Herein, the intricate 'host-pathogen-environment' interplay is very important in understanding the evolution and adaptation of such novel viruses [1]. In view of this, while the emergence of SARS-CoV-2 in naive regions is caused primarily by human movement, local emergence is driven by a combination of environmental and socio-traditional changes [62]. Notably, viral transmission rates are often higher in dense than in sparse populations, and social contacts greatly enhance human-to-human spread. Very interestingly, a recent analysis of SARS-CoV-2 data has revealed its accelerated and high transmission in Italy because of 'air pollution', measured as days exceeding the limits set for PM10 (particulate matter 10 μm) [64]. In particular, hinterland cities with average set limits along with low wind speed had very high infection rates compared with coastal cities with high wind speed. This study suggested accelerated 'polluted air-to-human' transmission dynamics of SARS-CoV-2 [64]. Moreover, inhalation of aerosolized or splattered infectious virus particles has been implicated in community spread.
Previously, a study on hundreds of residents of a housing society in Hong Kong showed that the building's faulty drainage significantly contributed to the aerosolization and respiratory spread of fecal SARS-CoV-1 [65]. In view of this, shedding of infectious SARS-CoV-2 in the stool and urine of COVID-19 patients points to a risk of 'waterborne or fecal-oral' as well as 'airborne or respiratory' transmission [66]. Further, respiratory, flu or pneumonia viruses, including certain HCoV, survive in cold seasons and gradually wane with a rise in temperature. Nonetheless, although the emergence of SARS-CoV-2 in colder weather suggested plausible seasonality, the arrival of summer did not affect the infection rate [62].
The current surveillance of transmission dynamics of infectious pathogens is mainly based on the reproduction number (R0) and fatality rates, which have been adopted for real-time monitoring of the COVID-19 pandemic [62]. A few mathematical-computational models that combine a framework for host, epidemiological and molecular data for SARS-CoV-2 have demonstrated understanding of patterns of evolution, global spread and country-by-country distribution [67,68]. However, due to the rapid increase in RNA sequencing data, mutation rates differ from viral protein to protein and from study to study. A recently proposed mathematical model has highlighted the emergence phenomena of SARS-CoV-2 and the effects of evolutionary adaptations on spreading processes [69].
Another such model offers quantification (Index c: contagions) of the environmental risk of exposure to future COVID-19 epidemics in a given region [70]. These theoretical models could be helpful in formulating a proactive epidemiological and environmental strategy in the prevention of such pandemics.

Conclusion
Mutations acquired by SARS-CoV-2 during 'human-adaptation' and 'human-to-human' spread could provide insights into its transmission dynamics that, together with clinical and epidemiological data, can predict disease prognosis. In view of this, our genome and protein sequence analysis of SARS-CoV-2 has revealed several novel mutations, the most important ones being in the ACE2 receptor binding-motif and the generation of a furin-like cleavage site in the spike protein. This suggests high infectivity of SARS-CoV-2 in humans through enhanced cell attachment and facilitated entry. Observed mutations within the replicase protein may be crucial for the enhanced replication of the viral genome. In addition, mutations within the accessory proteins (3a, 3b etc.) could have significant roles in evading or modulating the host innate immune system and sustaining virus replication. Nonetheless, the consequences of such mutations for virus infectivity and tissue-tropism remain to be studied in animal models. Since COVID-19 is spread even during the asymptomatic phase of the disease, it will be interesting to study the replication of the virus in the early phase and how the innate and adaptive immune systems respond to the life cycle of SARS-CoV-2, associated with high transmission and pathogenesis. Furthermore, a larger sample size, including the recently emerged SARS-CoV-2 mutant strains, and a rigorous analysis using more advanced tools would further enhance our knowledge on the subject.

Summary points
• The underlying mechanisms of the high transmission rate and pathogenicity of SARS-CoV-2 remain poorly understood.
• Our comparative sequence analysis of SARS-CoV-2 and other CoVs identifies unique mutations in spike, ORF1a/b, ORF3a/3b and ORF8.
• Its most conserved E, M and N protein sequences have, however, undergone fewer mutations.
• The most crucial mutations are in the spike ACE2 binding domain and the creation of a furin-like cleavage site, as well as deletion of ORF3b.
• Sequence analysis reveals that SARS-CoV-2 has diverged from SARS-CoV-1 but is closest to bat-SL-CoV.
• The roles of unique mutations are, therefore, envisaged in the high transmission and pathogenicity of SARS-CoV-2.
Author contributions
K Padhan and MK Parvez contributed to conceptualization, methodology, software, data analysis and manuscript writing. MS Al-Dosari contributed to data analysis, manuscript writing and editing.