Emerging SARS-CoV-2 diversity revealed by rapid whole genome sequence typing

bioRxiv [Preprint]. 2020 Dec 28:2020.12.28.424582. doi: 10.1101/2020.12.28.424582.

Abstract

Background: Discrete classification of SARS-CoV-2 viral genotypes can identify emerging strains and detect geographic spread, viral diversity, and transmission events.

Methods: We developed a tool (GNUVID) that integrates whole genome multilocus sequence typing and a supervised machine learning random forest-based classifier. We used GNUVID to assign sequence type (ST) profiles to each of 69,686 SARS-CoV-2 complete, high-quality genomes available from GISAID as of October 20 th 2020. STs were then clustered into clonal complexes (CCs), and then used to train a machine learning classifier. We used this tool to detect potential introduction and exportation events, and to estimate effective viral diversity across locations and over time in 16 US states.

Results: GNUVID is a scalable tool for viral genotype classification (available at https://github.com/ahmedmagds/GNUVID ) that can be used to quickly process tens of thousands of genomes. Our genotyping ST/CC analysis uncovered dynamic local changes in ST/CC prevalence and diversity with multiple replacement events in different states. We detected an average of 20.6 putative introductions and 7.5 exportations for each state. Effective viral diversity dropped in all states as shelter-in-place travel-restrictions went into effect and increased as restrictions were lifted. Interestingly, our analysis showed correlation between effective diversity and the date that state-wide mask mandates were imposed.

Conclusions: Our classification tool uncovered multiple introduction and exportation events, as well as waves of expansion and replacement of SARS-CoV-2 genotypes in different states. Combined with future genomic sampling the GNUVID system could be used to track circulating viral diversity and identify emerging clones and hotspots.

Publication types

  • Preprint