Genomic characterization and epidemiology of an emerging SARS-CoV-2 variant in Delhi, India

Deadly surge in Delhi In the spring of 2021, Delhi, India experienced a wave of coronavirus cases that overwhelmed healthcare services despite the population showing a high level of immune positivity. Dhar et al. collated a mixture of serosurveillance, quantitative polymerase chain reaction, and genomic data, finding that waves of variants had passed through the Delhi population during 2020 and 2021. The alpha (B.1.1.7) variant dominated in March 2021 and was rapidly replaced by the delta (B.1.617.2) variant in April and May 2021. The delta variant outcompeted its predecessors by mutations that enhanced replication, immune evasion, and host receptor avidity, thus increasing transmissibility, reinfection, and vaccination breakthrough. —CA


Materials and Methods Ethics
All work conducted under this study was approved by Institutional Ethics Committees at NCDC and CSIR-IGIB. Testing and sequencing of positive samples for genomic surveillance is exempted from individual informed consent, being a mandated public health service of NCDC and CSIR-IGIB for public health purposes. The use of deidentified data generated through clinical and public health services for research was reviewed and approved under certificates CSIR-IGIB/IHEC/2020-21/01 Dt. 28.03.2020, and NCDC Ethical review committee No: 2020/NERC/14. The serosurvey was approved under certificate CSIR-IGIB/IHEC/2020-21/01 Dt. 28.03.2020 and CSIR-IGIB/IHEC/2020-21/02 Dt. 23.02.2021, for project entitled "Phenome India -A long-term longitudinal observational cohort study of health outcomes" with individual informed consent from each participant (4).

Sampling and Metadata Collection
Nasopharyngeal and throat swab samples from COVID-19 confirmed cases with Ct value < 25 were collected and transported to Biotechnology Division, National Centre for Disease Control (NCDC), New Delhi, from the various testing sites across different states in India as per the sampling strategy of Central Surveillance Unit (CSU) of Integrated Disease Surveillance Programme (IDSP), NCDC. 24 post-vaccination RT-PCR confirmed SARS-CoV-2 positive samples were also included in this study (Data S4). All patient details and metadata were filled on the patient identification form and were accompanied with the samples. A total of 11,335 samples were received for whole genome sequencing at NCDC between November 2020 to May 2021 which were processed for viral genome sequencing.

PCR Amplification, Viral Genome Sequencing and Assembly
Viral RNA was isolated from the patient samples using MagNA Pure RNA extraction system (Roche) following the manufacturer's instructions. Whole-genome sequencing of the viral isolates was done as per the COVIDSeq protocol by Illumina using the NextSeq 550 platform. A total of 376 samples per lot were processed in batches of 94 with indexes A-D by loading 1.4 pmol of the library on the 75 cycle High Output Kit flow cell. Approximately 20GB of data was generated by genome sequencing which was processed using the Illumina DRAGEN COVID Pipeline and DRAGEN COVID Lineage Tools (v3.5.1). The raw data sequencing data generated in binary base call format (BCL) from the NextSeq 550 instrument was demultiplexed to FASTQ files using bcl2fastq (Illumina,v2.20). The raw reads were aligned against the SARS-CoV-2 reference genome (NC_045512.2) following the pipeline (23). The minimum accepted alignment score was set to 12 and alignment results with scores <12 were discarded. The coverage threshold and virus detection threshold were set to 20 and 5 respectively. The variant calling target coverage which specifies the maximum number of reads with a start position overlapping any given position was set at 50. The consensus sequences generated for the samples at the end of the DRAGEN COVID Pipeline were used for downstream analysis including lineage assignment and phylogenetic analysis. Out of the 9,557 genome sequences generated, 7,858 sequences with complete metadata were used for further analysis.

Genome Datasets and Lineage Analysis
Two datasets were compiled to estimate the lineage frequency of SARS-CoV-2 for the state of Delhi and other surrounding states and union territories (UTs) in North India: Punjab, Haryana, Uttar Pradesh, Chandigarh, Himachal Pradesh, Uttarakhand, Jammu and Kashmir, and Ladakh (Data S1). Dataset A comprises 7,858 genome sequencing data generated in this study. Dataset B comprises SARS-CoV-2 genome sequences from these states publicly available in GISAID (with collection dates up to 30th June 2021). Only those sequences from GISAID having complete date of collection (YYYY-MM-DD) were included in Dataset B and appended to Dataset A. For Dataset A, lineages were assigned to the genome sequences using the Pangolin tool (15,16) (version 3.1.8, pangoLEARN version 2021-07-28) to match the current version of lineage assignments on GISAID. The lineage data was segregated according to states and date of collection. State wise frequency of variant of concern B.1.617.2 (Delta) was plotted along with frequencies of B. The dataset of number of tests, confirmed cases and positivity rate for the state of Delhi was taken from the state level database maintained at NCDC (Data S2). For states other than Delhi, sequences with dates of collection between 1 November 2020 and 30 June 2021 were used to analyze lineage frequencies of the virus aggregated monthly. The dataset of the number of tests and confirmed cases for other states was accessed from https://covidtoday.github.io/backend/. A table describing the GISAID accession IDs, date of collection and lineages of the samples used to calculate lineage proportions in different states from Datasets A and B is given as Data S1. The acknowledgement table for genome sequences downloaded from GISAID is given as Data S3. All FASTA sequences included as Dataset A are available at https://github.com/banijolly/ncov-Delhi-Epidemiology (DOI:10.5281/zenodo.5521073) (21). Details for the 24 post-vaccination samples and the respective genome sequence IDs are given in Data S4. Raw Ct values for genes ORF1a (Target 1) and E (Target 2) analysed for the time period between July 2020 -June 2021 is available as Data S5.
Phylogenetic Analysis 1787 genome sequences from the state of Delhi were used for phylogenetic analysis along with an additional 152 B.1.617.2 (Delta) and 271 B.1.1.7 (Alpha) genomes from the state of Maharashtra and Punjab respectively. The phylogenetic tree was constructed following the Nextstrain protocol for genetic epidemiology of SARS-CoV-2 (https://nextstrain.github.io/ncov/) (24,25). Briefly, the genome sequences were aligned against the SARS-CoV-2 reference genome MN908947 (23) using NextAlign (24) using the default parameters of the Nextstrain protocol. The first 100 and last 50 bases were masked in the resulting alignment file before further processing. A fast maximum likelihood phylogenetic tree was constructed using IQTREE2 (26) using a general time reversible model (GTR) by specifying a log-likelihood epsilon value of 0.05 for final model parameter estimation. The tree was further processed to resolve polytomies and rerooted to have sample hCoV-19/Wuhan/WH01/2019 (EPI_ISL_406798) as the root. The pipeline further processes the tree using TreeTime (25) to estimate a skyline coalescent using a fixed clock rate of 8x10 -4 . The resulting phylogenetic tree was visualized and annotated using the R package ggtree (27). The phylogenetic network of SARS-CoV-2 isolates from Punjab collected during February-March 2021 was visualized in Auspice ( fig. S3). The tree files generated by the analysis are available at the following repository (21).

Serosurvey
The serosurvey was conducted through a voluntary participation wherein personnel working at CSIR labs/centers and their family members gave their blood samples in July-Sep 2020 (Phase 1) and January-February 2021 (Phase 2) and May-July 2021 (Phase 3) (details are given in table S1 and Data S6). The study was approved by the Institutional Human Ethics Committee of CSIR-IGIB vide approval CSIR-IGIB/IHEC/2019-20 & CSIR-IGIB/IHEC/2020-21/02 and carried out in over 40 CSIR laboratories and centers spread across the country. Blood samples (6 ml) were collected in EDTA vials from each participant and analyzed on site or transported to CSIR-IGIB, New Delhi for Analysis. Elecsys Anti-SARS-CoV-2 kit from Roche Diagnostics was used to detect antibodies to SARS-CoV-2 NucleoCapsid antigen. It is a qualitative kit which was used for screening and a Cut-off index COI >1 was considered seropositive. Positive samples were further tested for quantitative antibody titers using the same manufacturer's kit directed against the spike protein (Santigen). An antibody levels >0.8 U/ml was considered seropositive as per manufacturer's protocol. The detection range of this kit is from 0.4 U/ml to 250 U/ml. For samples, where values of >250 U/ml were obtained; appropriate dilutions were made. Neutralizing antibody (NAb) response directed against the spike protein (RBD site) was assessed using GENScript cPass kit which is a surrogate virus neutralization test (sVNT). A value of 30% or above was considered to have neutralizing ability. sVNT neutralization assay data for Phase 1 and Phase 2 is available as Data S7 and data for subjects with and without reinfection is available as Data S8 and Data S9 respectively.

Supplementary Text Epidemiological Model
The model described here builds on a previously published model of SARS-CoV-2 transmission introduced in Flaxman et al, 2020 (11), subsequently extended into a two-category framework in Faria et al, 2020 (28). Replication code is available at https://github.com/ImperialCollegeLondon/Delta_Variant_Delhi (22). Model Specification The model describes two categories, denoted ∈ {1,2}. The populationunadjusted reproduction number for the first category is defined as where 0 is a scale parameter (3.3), is a logistic function, and is an second-order autoregressive process with weekly time innovations, as specified in earlier work (29). The population-unadjusted reproduction number of the second category is modelled as where is a parameter defining the relative transmissibility of category 2 compared to category 1 and [ 2 ,∞) is an indicator function taking the value of 0 prior to 2 , and 1 thereafter, highlighting that category 2 does not contribute to the observed epidemic evolution before its emergence. The prior for is chosen because it weakly informative, setting 90% of prior mass a between × 0.4 and × 1.8 increase in transmissibility, while maintaining a neutral to conservative default in the context of increased transmissibility since it has a mean of × 1 and median of × 0.9.
Infections arise for each category according to a discrete renewal process (30,31) where is the total population size, , is the total extent of population immunity to category present at time , and is the generation interval distribution.
The susceptible depletion term for category is modelled as where \ denotes not-, under assumptions of symmetric cross-immunity with prior ∼ Beta(2,1) . 6 Immune escape or the evasion of cross-immunity of Delta, as reported in the analysis, is defined as the complement of the cross-immunity parameter, that is (1 − ). The prior for has been chosen to reflect our default assumption that the mostly likely scenario is no evasion of cross-immunity.
− is the time-dependent waning of immunity elicited by previous infection, which is modelled as a Rayleigh survival-type function with Rayleigh parameter of sigma = 310, which produces 50% of individuals still immune after 1 year. The cross-immunity susceptible term , is modelled as Infections in Delhi are seeded for six days at the start of the epidemic from 1 as The model generates deaths via the following mechanistic relationship: The infection fatality ratios of each of the categories (ifr ) are given moderately informative priors: ifr ∼ Normal(0.25, 0.02 2 ) ∈ [0,100] 12 with our central estimate based on the results of Brazeau et al (32) and adjusted for the demography of the city. We allow for variation around the estimate of 0.25 however, with the prior providing some support for IFRs in the range 0.15% -0.35%. This range is similar to estimates by Banaji for Mumbai, a city with comparable demographics, for which an IFR is reported with 95% confidence intervals of (0.15%, 0.33%) (33). A limitation of the model is the assumption of homogeneous exposure across subsets of the population.
Likelihood component 1. The observation model uses three types of data from four sources. In the first, the likelihood for the expected deaths , is modelled as negative-binomially distributed, with mortality data and diffuse dispersion prior ∼ Normal(0, 5 2 ) ∈ [0, ∞) . 14 and underreporting factor , which describes the degree of death underascertainment, e.g. a value of 0.25 means 25% of COVID-19 deaths are not reported (due, e.g. to limited testing). The central scenario chosen for death underreporting is 50%, due to the wide range values reported in available literature for Delhi and India (33)(34)(35)(36)(37)(38). To mitigate the effect of the uncertainty in death underreporting, sensitivity tests are carried out a range of values, in {10%, 33%, 50%, 66%}, to ensure inferences are robust. Likelihood component 3. Serological data are incorporated in our modelling framework, using results from the survey presented in this work, and from Velumani et al. (17). The observed seropositivity ( ) on a given day, , is modelled as follows where − is the cumulative probability of an individual infected on day having seroconverted less seroreverted by time . This distribution is empirical and based on (40). The term is a multiplicative random effect specified to mitigate likely biases in the serological data where is the logistic function.
Eq 4 can be modified to account for population effects (decreasing susceptible population over time) such that no over-shooting happens due to discretization as follows (41,42): The formula for , is derived from a continuous time model on [ − 1, ]. This is to avoid discrete time effects such as infections going above the total population . Specifically, we assume that the infections ( ) in [ − 1, − 1 + ] are given by the differential equation ∂ ( )/ ∂ = �1 − � , + ( )� / �, which has the solution (1) = as above.
Modelling limitations Inferences from the model are subject to multiple limitations, arising from biases in the data, and in our choices of priors. In particular, the level of underreporting in Delhi is poorly characterised. This uncertainty is mitigated by sensitivity testing, as described in the Supplementary Information. Similarly, uncertainty in the temporal waning of immunity, the date of when Delta was first introduced to Delhi, and the IFRs of SARS-CoV-2 variants in Delhi, all provide sources of bias that we currently only mitigate through sensitivity testing. Furthermore, we note serological data is likely to be systematically biased. This is partially addressed through the inclusion of an additional term in the serology likelihood that provides a multiplicative random effect.
Computational notes The analysis uses R version 3.6.3. Inferences are based on 2000 iterations of Hamiltonian Monte Carlo using 2 chains, with rhat statistics confirmed to be less than 1.02. The inference is performed using rstan version 2.21.2. Replication code and data are available at https://github.com/ImperialCollegeLondon/Delta_Variant_Delhi.  (Fig 3D-F and fig S4) Table S1. Seropositivity in the Delhi CSIR cohort. Seropositivity is shown for Phase I to III, along with 95% CI. In the beginning of the pandemic, with strong lockdowns, higher infection rates were seen in outsourced staff providing frontline and essential services, while laboratory employees and students were relatively protected. By mid-pandemic, with lockdowns lifted by July 2020, use of private vs public transport also determined risk. By the end of the Delta wave in Delhi, all subgroups reached similar seropositivity, suggesting universally high exposures. For comparisons between serially obtained values, or between sub-groups, statistical significance of difference of proportions was calculated via the standard 2x2 Chi-statistic; p<0.01 was considered significant. Table S2.
Inferred changes in epidemiological characteristics of B.1.617.2, depending on the timing of introduction assumed in the model, and level of under-ascertainment present in Delhi mortality data. Results presented are the median, with the 50% Bayesian Credible Interval, bCI, in brackets. Note that "Immune escape" refers specifically to the escape of immunity conferred by prior infection with other variants, rather than escape from immunity acquired through vaccination. It is further important to note that immune escape and transmissibility increase inferred for B.1.617.2 are values given with reference to the composition of earlier and co-circulating variants in Delhi, from the start of the epidemic to 25 May 2021.