Estimating global, regional, and national daily and cumulative infections with SARS-CoV-2 through Nov 14, 2021: a statistical analysis

Summary Background Timely, accurate, and comprehensive estimates of SARS-CoV-2 daily infection rates, cumulative infections, the proportion of the population that has been infected at least once, and the effective reproductive number (Reffective) are essential for understanding the determinants of past infection, current transmission patterns, and a population's susceptibility to future infection with the same variant. Although several studies have estimated cumulative SARS-CoV-2 infections in select locations at specific points in time, all of these analyses have relied on biased data inputs that were not adequately corrected for. In this study, we aimed to provide a novel approach to estimating past SARS-CoV-2 daily infections, cumulative infections, and the proportion of the population infected, for 190 countries and territories from the start of the pandemic to Nov 14, 2021. This approach combines data from reported cases, reported deaths, excess deaths attributable to COVID-19, hospitalisations, and seroprevalence surveys to produce more robust estimates that minimise constituent biases. Methods We produced a comprehensive set of global and location-specific estimates of daily and cumulative SARS-CoV-2 infections through Nov 14, 2021, using data largely from Johns Hopkins University (Baltimore, MD, USA) and national databases for reported cases, hospital admissions, and reported deaths, as well as seroprevalence surveys identified through previous reviews, SeroTracker, and governmental organisations. We corrected these data for known biases such as lags in reporting, accounted for under-reporting of deaths by use of a statistical model of the proportion of excess mortality attributable to SARS-CoV-2, and adjusted seroprevalence surveys for waning antibody sensitivity, vaccinations, and reinfection from SARS-CoV-2 escape variants. 
We then created an empirical database of infection–detection ratios (IDRs), infection–hospitalisation ratios (IHRs), and infection–fatality ratios (IFRs). To estimate a complete time series for each location, we developed statistical models to predict the IDR, IHR, and IFR by location and day, testing a set of predictors justified through published systematic reviews. Next, we combined three series of estimates of daily infections (cases divided by IDR, hospitalisations divided by IHR, and deaths divided by IFR), into a more robust estimate of daily infections. We then used daily infections to estimate cumulative infections and the cumulative proportion of the population with one or more infections, and we then calculated posterior estimates of cumulative IDR, IHR, and IFR using cumulative infections and the corrected data on reported cases, hospitalisations, and deaths. Finally, we converted daily infections into a historical time series of Reffective by location and day based on assumptions of duration from infection to infectiousness and time an individual spent being infectious. For each of these quantities, we estimated a distribution based on an ensemble framework that captured uncertainty in data sources, model design, and parameter assumptions. Findings Global daily SARS-CoV-2 infections fluctuated between 3 million and 17 million new infections per day between April, 2020, and October, 2021, peaking in mid-April, 2021, primarily as a result of surges in India. Between the start of the pandemic and Nov 14, 2021, there were an estimated 3·80 billion (95% uncertainty interval 3·44–4·08) total SARS-CoV-2 infections and reinfections combined, and an estimated 3·39 billion (3·08–3·63) individuals, or 43·9% (39·9–46·9) of the global population, had been infected one or more times. 
1·34 billion (1·20–1·49) of these infections occurred in south Asia, the highest among the seven super-regions, although the sub-Saharan Africa super-region had the highest infection rate (79·3 per 100 population [69·0–86·4]). The high-income super-region had the fewest infections (239 million [226–252]), and southeast Asia, east Asia, and Oceania had the lowest infection rate (13·0 per 100 population [8·4–17·7]). The cumulative proportion of the population ever infected varied greatly between countries and territories, with rates higher than 70% in 40 countries and lower than 20% in 39 countries. There was no discernible relationship between Reffective and total immunity, and even at total immunity levels of 80%, we observed no indication of an abrupt drop in Reffective, indicating that there is not a clear herd immunity threshold observed in the data. Interpretation COVID-19 has already had a staggering impact on the world up to the beginning of the omicron (B.1.1.529) wave, with over 40% of the global population infected at least once by Nov 14, 2021. The vast differences in cumulative proportion of the population infected across locations could help policy makers identify the transmission-prevention strategies that have been most effective, as well as the populations at greatest risk for future infection. This information might also be useful for targeted transmission-prevention interventions, including vaccine prioritisation. Our statistical approach to estimating SARS-CoV-2 infection allows estimates to be updated and disseminated rapidly on the basis of newly available data, which has been and will continue to be crucially important for timely COVID-19 research, science, and policy responses. Funding Bill & Melinda Gates Foundation, J Stanton, T Gillespie, and J and E Nordstrom.


Introduction
Measuring SARS-CoV-2's daily infection rate, cumulative infections, and the proportion of the population with one or more infections is essential for understanding the determinants of past transmission, identifying ongoing inequities, predicting future trajectories of the COVID-19 pandemic, and, in theory, prioritising vaccination allocations. Daily infections are also the crucial input into measuring the changing effective reproductive number (R effective , the number of subsequent infections caused by a new infection). 1-3 A robust assessment of R effective by day in each location is useful to help evaluate the effect of the wide range of non-pharmaceutical interventions that have been deployed during the pandemic. The R effective over time is also a crucial input into future forecasts of COVID-19. 4

Cumulative infections can help us identify

Research in context
Evidence before this study This study was conceptualised and developed from the start of the pandemic to fill a void in the provision of timely estimates of SARS-CoV-2 infections for tracking the pandemic and to provide inputs to epidemiological models of transmission. Several research groups have estimated SARS-CoV-2 daily or cumulative infections in select locations at specific points in time. For example, the US Centers for Disease Control and Prevention estimates cumulative infections by approximating the infection-detection ratio (IDR) using assumptions about the portion of the population who will seek care. The SeroTracker project reports on the universe of seroprevalence surveys and some attributes of these surveys, but it does not make estimates of cumulative infections based on these data. Noh and Danuser (2021) used reported deaths and published estimates of the infection-fatality ratio (IFR) to estimate cumulative infections for US states and select countries. To our knowledge, however, no source has provided estimates, either periodic or regularly updated, of global daily and cumulative SARS-CoV-2 infections at this resolution (399 administrative units).

Added value of this study
This study is the first comprehensive analysis of global daily and cumulative SARS-CoV-2 infections to date and improves upon previous infection estimation strategies in several important ways. First, we combined three approaches that have been used to estimate daily infections: cases divided by the IDR, hospitalisations divided by the infection-hospitalisation ratio (IHR), and deaths divided by the IFR. Combining these estimates gave us a more robust estimate of daily infections that was less susceptible to biases within and between each type of measure. Second, estimates of total COVID-19 deaths derived from a comprehensive assessment of excess mortality and a statistical estimate of the portion of excess mortality directly due to COVID-19 allowed for more meaningful interpretation of spatial heterogeneity in total COVID-19 mortality rates. Third, we used a systematic analysis of available seroprevalence data matched in space and time to cases, hospitalisations, and deaths to empirically estimate the IDR, IHR, and IFR. Because the IHR and IFR are profoundly age related, we also estimated age-standardised ratios for these quantities. Fourth, for locations without seroprevalence surveys, we used statistical models based on the available empirical data and the testing of a wide range of covariates to predict the IDR, IHR, and IFR. Fifth, we used daily infections to estimate cumulative infections and, with assumptions on cross-variant immunity, the cumulative number of individuals with one or more infections, as well as posterior estimates of cumulative IDR, IHR, and IFR. Sixth, we incorporated corrections to the primary data into the analysis to deal with known biases such as waning antibody test sensitivity. Seventh, our ensemble model reflects the uncertainty of the data sources, model design, and parameter assumptions included in the analysis.
Finally, the methods developed to triangulate on daily infections, cumulative infections, and the proportion of the population infected once or more than once have been developed into easily applied statistical code, so estimates can be shared and updated rapidly and iteratively on the basis of the frequency of newly reported data.
which nations and communities have been able to keep transmission at lower levels, potentially creating the opportunity to learn from these success stories. Finally, a sound measurement of the proportion of the population ever infected could help to identify which communities are at greater risk of future transmission and might be a factor that should be considered in vaccine prioritisation. 5 Several studies have estimated cumulative infections in select countries at specific points in time. [6][7][8][9] Some of these studies have used seroprevalence surveys, while others have made estimates of infections by assuming a particular infection-detection ratio (IDR). 7,10-12 One study estimated infections in the USA and other select countries, 13 and other studies have done multinational systematic reviews and meta-analyses of seroprevalence surveys. 14,15 The fundamental problem in all of these analyses is that each of the data series observed has potential biases: reported cases capture only a portion of infections, and this portion will be a function of the availability of testing; reported deaths capture only a subset of total COVID-19 deaths, and the infection-fatality ratio (IFR) can vary widely over time and across locations; [16][17][18][19] the proportion of patients with an infection who are admitted to hospital can also vary over time and location; and seroprevalence surveys can be influenced by sampling design, waning of sensitivity of antibody tests, and vaccination rates. Few studies have combined data from reported cases, reported deaths, hospitalisations, and seroprevalence surveys to triangulate daily infections, and WHO only routinely reports confirmed cases, not estimated infections. 20 The use of such sources of incomplete, biased, and heterogeneous case data uncritically in research, science, and policy will result in inferences confounded to unknown levels by these known problems.
In this study, we present an approach to estimating past SARS-CoV-2 daily infections, cumulative infections through Nov 14, 2021, and the proportion of the population with one or more infections on the basis of reported cases, total deaths attributable to COVID-19, hospitalisations, and seroprevalence surveys. This approach attempts to deal with the biases in each of these measures and use them all to triangulate daily infections. With this statistical approach to the fusion of these data streams, we aimed to provide a method that can be applied on a rapid and ongoing basis, so that these estimates remain maximally relevant for research, science, and policy and can be immediately and freely available. Importantly, we incorporated various sources of uncertainty in daily infections into the analysis to help informed assessment of the variation in space and time of the fidelity of the estimates.

Overview
We derived comprehensive global estimates of daily and cumulative SARS-CoV-2 infections for the duration of the COVID-19 pandemic, using the heterogeneous universe of reported epidemiological data (iteratively curated, corrected, and calibrated into an internally complete and consistent time series at national and subnational levels) to further timely research, discovery, and policy inference. Our approach can be divided into seven steps, which are applied by use of an ensemble model framework. First, we developed a dataset of reported COVID-19 cases, total COVID-19 deaths, and hospitalisations (where available), corrected for known biases such as lags in reporting. Second, we identified representative SARS-CoV-2 seroprevalence surveys that could be used to create a database of cumulative infections and adjusted them for waning antibody sensitivity, vaccinations, and reinfection from escape variants. Third, using adjusted seroprevalence survey data matched to cases, hospitalisations, and deaths, we created an empirical database of IDRs, infection-hospitalisation ratios (IHRs), and IFRs. Fourth, for locations without seroprevalence surveys and to estimate a complete time series for each location, we developed statistical models to predict the IDR, IHR, and IFR by location and day, as a function of a wide range of covariates. Fifth, three series of estimates of daily infections (cases divided by IDR, hospitalisations divided by IHR, and deaths divided by IFR) were combined into a more robust estimate of daily infections. Sixth, we used the combined time series of daily infections to estimate cumulative infections and the cumulative proportion of the population with one or more infections, and calculate posterior estimates of cumulative IDR, IHR, and IFR.
Seventh, we converted daily infections into a historical time series of R effective by location and day, on the basis of assumptions of duration of the period from infection to infectiousness and time an individual spent being infectious. Estimates are given for all ages and both sexes combined for 190 countries and territories, and for subnational locations in ten of those countries, aggregated into 21 regions, seven super-regions, 21 and globally, from the start of the COVID-19 pandemic through Nov 14, 2021. This study complies with the Guidelines for Accurate and Transparent Health Estimates Reporting recommendations (appendix 1, section 2). 22 All code used in the analysis can be found online.

Ensemble framework
Our model system includes many component parts that are inherently uncertain, ranging from input data sources and parameter assumptions to model specification. To account for this, we developed an ensemble framework wherein we varied the data and model settings across 100 iterations of the analysis, which were then run independently to yield 100 estimates of infections. These sources of uncertainty include seroprevalence survey error; bootstrapped samples of our seroprevalence database; estimates of seroreversion rates; estimates of total COVID-19 mortality; parameterisation of cross-variant immunity, increased risk of hospitalisation and death from non-ancestral SARS-CoV-2 variants, and durations associated with COVID-19 natural history; covariate selection and specification of statistical models of the IDR, IHR, and IFR; and triangulation of infections on the basis of cases, hospitalisations, and deaths (more details regarding the ensemble framework in appendix 1, section 9).
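The way 100 independent submodel runs can be turned into a point estimate with an uncertainty interval can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the toy submodel, and all numbers are hypothetical, and we assume here that the 95% uncertainty interval is taken as the 2·5th and 97·5th percentiles of the draws, which this section does not state explicitly.

```python
import random
import statistics


def summarise_draws(draws, lower=0.025, upper=0.975):
    """Return (mean, lower bound, upper bound) for a list of ensemble draws."""
    ordered = sorted(draws)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1 - frac) + ordered[hi] * frac

    return statistics.fmean(ordered), quantile(lower), quantile(upper)


def run_submodel(rng):
    """Hypothetical submodel: one iteration with its own parameter draw."""
    idr = rng.uniform(0.2, 0.4)   # illustrative infection-detection ratio
    reported_cases = 30.0         # illustrative reported cases, in millions
    return reported_cases / idr   # implied cumulative infections, in millions


rng = random.Random(0)
draws = [run_submodel(rng) for _ in range(100)]   # 100 iterations, as in the text
mean_estimate, lower_bound, upper_bound = summarise_draws(draws)
print(f"{mean_estimate:.1f} million ({lower_bound:.1f}-{upper_bound:.1f})")
```

In the analysis described here each iteration varies many inputs at once (survey samples, seroreversion rates, covariates, and so on); the single random ratio above only stands in for that variation.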

Data inputs and corrections
Data of reported cases were obtained largely from Johns Hopkins University (Baltimore, MD, USA), 23 with exceptions and additions noted in appendix 1 (section 4.1) and appendix 2 (section 4). Hospital admissions were largely sourced from national databases such as that of the Department of Health and Human Services (HHS) in the USA and the Secretaria de Vigilância em Saúde in Brazil (for an exhaustive list see appendix 2, section 1). Deaths were based on reported deaths data from Johns Hopkins University 23 and various national sources from locations where data inconsistencies were evident in the Johns Hopkins University datasets (more details in appendix 1, section 4.3, and appendix 2, section 2). To account for the prevalent issue of under-reporting in COVID-19 deaths, we applied a scalar of reported to total COVID-19 deaths in our analysis. Total COVID-19 deaths, as defined by WHO, are all deaths where the deceased individuals were actively infected with SARS-CoV-2 at the time of the death. Estimates of total COVID-19 mortality were constructed with use of the statistical model developed by the COVID-19 Excess Mortality Collaborators to predict the excess mortality rate for all locations between Jan 1, 2020, and Nov 14, 2021. 16 To estimate total COVID-19 mortality, we predicted a counterfactual excess mortality rate due to COVID-19 in which the IDR was set to the maximum observed values among all locations. The predicted excess mortality rate from this counterfactual analysis, corrected for under-reporting, resulted from insufficient testing and changes in mortality driven by behaviours such as deferred health care during periods of lockdown. We used the ratio of this counterfactual excess mortality rate and the prediction for the same period as a proxy for the proportion of excess mortality that is total COVID-19 mortality. Subsequently, a scalar of reported COVID-19 deaths to total COVID-19 deaths can be derived (more details in appendix 1, section 9.4). 
We identified seroprevalence surveys through a search protocol that leveraged previous reviews, 24,25 SeroTracker, 26 and routine inclusion of national and subnational surveys undertaken by governmental organisations. Studies that focused on specific subsets of the population (either a specific subpopulation such as health-care workers or specific locations such as specific cities) were typically excluded as a result of not being representative. In total, we identified 2817 seroprevalence survey datapoints (of 6420 reviewed) for inclusion in this analysis.
Although most data streams for daily cases, deaths, and hospitalisations are indexed by date of report, some are indexed by date of event; in these instances, lags in reporting create misleading trends in the most recent days of data. These trends are gradually corrected over time as reporting systems catch up but, to prevent this occurrence from influencing our models, we needed to evaluate each individual data source and determine an appropriate number of days to exclude in any iteration of the analyses.
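The exclusion of the most recent days from event-dated series can be sketched as below. This is an illustrative helper only: the function name, the `indexed_by` labels, and the 3-day window are assumptions, since the number of days to exclude was determined separately for each data source.

```python
from datetime import date, timedelta


def drop_incomplete_tail(series, indexed_by, days_to_exclude):
    """Trim the most recent days from a series indexed by date of event.

    series: list of (date, count) tuples sorted by date.
    indexed_by: "event" or "report" (hypothetical labels).
    """
    if indexed_by != "event" or days_to_exclude <= 0 or not series:
        return list(series)
    cutoff = series[-1][0] - timedelta(days=days_to_exclude)
    return [(day, count) for day, count in series if day <= cutoff]


start = date(2021, 11, 1)
# Toy admissions series whose final days look artificially low.
admissions = [(start + timedelta(days=i), 100 - 5 * i) for i in range(10)]

trimmed = drop_incomplete_tail(admissions, indexed_by="event", days_to_exclude=3)
untouched = drop_incomplete_tail(admissions, indexed_by="report", days_to_exclude=3)
```

Series indexed by date of report are returned unchanged, because their most recent values are not revised upward as reporting catches up.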
Some hospital admissions data series only became available weeks or months after the beginning of the COVID-19 pandemic; for example, the HHS database began in July, 2020. However, total cumulative hospitalisations are required to create our empirical estimate of IHR. In these instances, we leveraged information from the metrics that did have complete time coverage (cases and deaths) to impute the earlier portion of the admissions time series (appendix 1, section 4.2).

Seroprevalence survey adjustments
Seroprevalence surveys were corrected for vaccination, because vaccination generates a positive anti-spike antibody test in most individuals who receive the vaccine. 27 In locations where vaccination rates have increased over time, population levels of anti-spike antibodies will be elevated. To correct for this, we adjusted seroprevalence estimates downward on the basis of vaccination rates in adults in every location, accounting for vaccination of previously infected individuals (appendix 1, section 5.1).
Seroprevalence surveys provide an estimate of the number of individuals who have been infected with SARS-CoV-2 one or more times; these surveys do not detect repeat infections in a single individual. Because reinfection can be common in settings where escape variants such as beta (B.1.351), gamma (P.1), and delta (B.1.617.2) are present, [28][29][30] we had to adjust seroprevalence data to estimate the cumulative number of infections, that is, to include both first and any subsequent infections. We used a level of cross-variant immunity of 30% to 70% between escape variants and ancestral variants and alpha (B.1.1.7), on the basis of an empirical analysis conducted by the COVID-19 Forecasting Team (unpublished). This estimate did not take into account that some individuals could have been infected more than once with ancestral variants. 31 A detailed explanation of how we adjusted for escape variant prevalence is given in appendix 1 (section 5.2).
Lastly, seroprevalence surveys were corrected for waning sensitivity of antibody tests. We identified eight categories of antibody tests; for each of these, we used a reported curve of sensitivity over time. [32][33][34] To implement the correction based on waning, we used initial estimates of the timing of infection based on reported deaths. We did not adjust for specificity, as reported specificity for all available commercial assays included in the analysis is over 95% and mostly over 98% (more details in appendix 1, sections 5.3 and 9.3). 35

Empirical estimates of the IDR, IHR, and IFR
Using the adjusted seroprevalence data we have described, we created a dataset of 2817 empirical measurements of the IDR in which the numerator was the cumulative number of confirmed cases and the denominator was the number of cumulative infections and reinfections combined. We aligned cases and seroprevalence on the basis of individual record data suggesting that exposure to a laboratory-confirmed case was typically 10-13 days 36 and exposure to seroconversion was 14-17 days. [37][38][39] Figure 1A shows these empirical estimates of location-specific IDR over the course of the pandemic. For the purposes of visualising the data, the IDR data are time-localised to the average date of infection based on the model estimate and daily cases.
Using adjusted seroprevalence surveys matched to cumulative hospitalisations, we developed a dataset of 2580 empirical estimates of the IHR. Based on the same data and analysis used to determine the lag for cases, 36 we used a 10-13-day lag for hospitalisations. Far fewer locations reported hospitalisations, so less information was available for this metric than for the IDR. We used 703 surveys that included age-specific seroprevalence data to estimate the IHR age pattern, and we then used indirect age standardisation to estimate the age-standardised IHR across locations and used those age-standardised estimates in the modelling of the IHR (more details on indirect standardisation methods in appendix 1, section 6.1). Figure 1B shows the universe of available age-standardised IHR over time. For the purposes of visualising the data, IHR data are time-localised to the average date of admission.
Using the 718 seroprevalence surveys with age-specific detail, the COVID-19 Forecasting Team 40 estimated the age pattern of the IFR. We used this age pattern to create a dataset of age-standardised IFR data using 2817 pairs of adjusted seroprevalence surveys and death data, assuming 22-28 days from exposure to death on the basis of analyses of patient-level data in the USA. 41 Time indexing of IFR data was based on the average date of death for each observation. Figure 1C shows the relationship between age-standardised IFR and time.
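How a single seroprevalence survey is matched in time to cumulative cases or deaths to give an empirical ratio can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and all inputs are hypothetical, single lag values are picked from within the reported ranges (10-13 days to case confirmation, 14-17 days to seroconversion, 22-28 days to death), and adjusted seroprevalence multiplied by population stands in for the cumulative infections used as the denominator.

```python
from datetime import date, timedelta


def empirical_ratio(survey_date, adjusted_seroprevalence, population,
                    cumulative_outcome, outcome_lag_days, seroconversion_lag_days):
    """Ratio of a cumulative outcome to cumulative infections for one survey.

    cumulative_outcome: dict mapping date -> cumulative count (cases or deaths).
    """
    # People seropositive at the survey were exposed at least this long ago.
    exposure_cutoff = survey_date - timedelta(days=seroconversion_lag_days)
    # Outcomes from those same exposures are observed after the outcome lag.
    outcome_date = exposure_cutoff + timedelta(days=outcome_lag_days)
    cumulative_infections = adjusted_seroprevalence * population
    return cumulative_outcome[outcome_date] / cumulative_infections


survey_date = date(2020, 9, 1)
cumulative_cases = {date(2020, 8, 28): 25_000}    # illustrative values
cumulative_deaths = {date(2020, 9, 11): 500}      # illustrative values

idr = empirical_ratio(survey_date, 0.10, 1_000_000, cumulative_cases,
                      outcome_lag_days=11, seroconversion_lag_days=15)
ifr = empirical_ratio(survey_date, 0.10, 1_000_000, cumulative_deaths,
                      outcome_lag_days=25, seroconversion_lag_days=15)
print(round(idr, 3), round(ifr, 4))
```

With these toy inputs, the survey implies 100 000 cumulative infections, so 25 000 matched cases and 500 matched deaths correspond to an IDR of about 0·25 and an IFR of about 0·005.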

Statistical models of the IDR, IHR, and IFR
To generate estimates of daily infections from cases, hospitalisations, and deaths, we needed estimates of the IDR, IHR, and IFR by location for each day during the pandemic. We used a cascading implementation of a Bayesian regression framework 42 to estimate each of these measures (more details in appendix 1, section 6.2). The cascading regression model allows for a flexible fit to the key covariates, including the option to specify them as splines, and borrows strength across locations. After parameterising the relationship of seroprevalence to cases, hospitalisations, and deaths through predictive models of IDR, IHR, and IFR, we used local covariates and age structure to generate predictions of these ratios in both in-sample and out-of-sample locations based on our hierarchical cascade model. For the IDR model, the most spatially and temporally consistent predictive relationship was between testing per person and the IDR. To capture the rise in health system capacity to deliver testing, we used the observed maximum testing rate up to a given date as the covariate. Additionally, we included universal healthcare coverage, the Healthcare Access and Quality (HAQ) Index, and the proportion of the population older than 65 years as covariates that each submodel selected from in our ensemble. These covariates were estimated for all locations as part of the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD; appendix 1, sections 6.3 and 9.6). 43 Predictive covariates for IHR and IFR were primarily based on a list of underlying medical conditions identified by the US Centers for Disease Control and Prevention (CDC) as increasing the risk of severe illness from SARS-CoV-2 infection.
44 We cross-referenced this list with a study of individuals admitted to hospital in the USA 41 that evaluated the increased risk of in-hospital death to identify seven possible covariates, all of which were included in our models as age-standardised prevalence in the population (estimated as part of GBD): obesity, smoking, diabetes, cancer, chronic obstructive pulmonary disease, cardiovascular disease, and chronic kidney disease. 43 Several of these covariates, most prominently obesity, were further supported by relationships in US claims 45 and Brazil hospitalisations data. 46 To this list, we also added universal health-care coverage and the HAQ Index. We then tested all possible combinations of these covariates and selected the top 100 most predictive combinations to use across submodels in our ensemble models of IHR and IFR (more details in appendix 1, section 9.6). We estimated age-standardised IHR and IFR using these covariates and then converted estimates back to all-age IHR and IFR to reflect population structure. We accounted for reductions in the IFR due to improved treatment over the course of the pandemic by including a spline on time in the regressions in addition to the ensemble covariates (more details on these models in appendix 1, sections 6.4, 6.5, and 9.6).
Vaccines and variants also affect the likelihood of severe disease and death, and thus influence both the IHR and the IFR. First, vaccination strategies that prioritise older age groups before younger ones can temporarily increase the relative proportion of infections that occur in younger individuals, thus lowering the population-level IFR and IHR for at least a period of time. Additionally, COVID-19 vaccines have been shown to confer higher levels of protection from severe disease and death than from mild infection, also serving to lower the overall IFR and IHR. The prevalence of variants with higher likelihood of severe disease and death can conversely increase these ratios, 47 and the introduction of escape variants can increase them further by reducing vaccine efficacy. More information on how we accounted for these features can be found in appendix 1 (section 6.6).

Robust estimates of daily infections
We then paired the estimates of our ratio models with data that were reported by local jurisdictions (accounting for reporting biases in cases through the testing covariate and in deaths through the total COVID-19 death scalars) to estimate infections in a manner that was sensitive to local context, even in the absence of seroprevalence data. By dividing cases by the modelled IDR, hospitalisations by the modelled IHR, and deaths by the modelled IFR, we produced three daily infections time series (or two if only cases and deaths were reported for a given location). Estimates based on each input data type were shifted back in time by their respective lags, such that they were all indexed on date of infection. We then fit a time series spline model using all three data sources as inputs to triangulate a best estimate of daily infections. After deriving this mean estimate of daily infections, we sampled the residuals of the intermediate case-based, hospitalisation-based, and deaths-based infection estimates independently in each submodel and refit the infections curve to these data; this enabled us to more accurately reflect the volatility in reporting practices, such as for deaths, in our ensemble distribution of daily infections (more details in appendix 1, section 7).
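The division by modelled ratios and the shift back to date of infection can be sketched as follows. This is a deliberately simplified illustration: days are integer indices, all names and numbers are hypothetical, and a plain average across the available series replaces the time series spline model used in the analysis.

```python
def infections_from(reported, ratio, lag_days):
    """Convert a reported daily series into infections indexed by infection day.

    reported: dict mapping report day -> count (cases, admissions, or deaths).
    ratio: dict mapping report day -> modelled ratio (IDR, IHR, or IFR).
    """
    return {
        day - lag_days: count / ratio[day]
        for day, count in reported.items()
        if ratio.get(day)  # skip days with a missing or zero ratio
    }


def combine(*estimates):
    """Average the available infection estimates for each infection day.

    Stand-in for the spline fit: a simple mean, labelled as an assumption.
    """
    days = sorted(set().union(*(estimate.keys() for estimate in estimates)))
    combined = {}
    for day in days:
        values = [estimate[day] for estimate in estimates if day in estimate]
        combined[day] = sum(values) / len(values)
    return combined


cases = {20: 300, 21: 330}          # illustrative daily reported cases
deaths = {34: 5, 35: 6}             # illustrative daily total COVID-19 deaths
idr = {20: 0.3, 21: 0.3}            # illustrative modelled IDR
ifr = {34: 0.005, 35: 0.005}        # illustrative modelled IFR

from_cases = infections_from(cases, idr, lag_days=11)
from_deaths = infections_from(deaths, ifr, lag_days=25)
daily_infections = combine(from_cases, from_deaths)
```

Here both series land on infection days 9 and 10 after shifting, so the two estimates can be compared directly; a location with admissions data would contribute a third series in the same way.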

Cumulative infections and cumulative proportion of the population infected at least once
Daily infections, including reinfections, were summed to derive an estimate of cumulative infections. With this estimate of cumulative infections, we then returned to reported cases, reported hospitalisations, and total COVID-19 deaths to produce posterior estimates of cumulative IDR, IHR, and IFR. Where the reported data were not available, the posterior ratio estimate would be equal to the prediction from the ratio model. To estimate the proportion of individuals who were infected with SARS-CoV-2 at least once by Nov 14, 2021, we used the same assumptions already described. The crucial assumptions required were cross-variant immunity, the prevalence of escape variants, and the assumption that exposure to escape variants is independent of the probability of previous infection with ancestral variants.
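The summation of daily infections and the fallback rule for posterior ratios can be sketched as below. The function names and values are illustrative assumptions; the estimation of the proportion infected at least once, which depends on the cross-variant immunity assumptions, is not shown.

```python
from itertools import accumulate


def cumulative(daily_infections):
    """Running total of daily infections (reinfections included)."""
    return list(accumulate(daily_infections))


def posterior_ratio(total_reported, total_infections, model_prediction):
    """Posterior cumulative ratio (IDR, IHR, or IFR).

    Falls back to the ratio model's prediction when no reported data exist.
    """
    if total_reported is None:
        return model_prediction
    return total_reported / total_infections


daily_infections = [1000, 1500, 2500]            # illustrative values
cumulative_infections = cumulative(daily_infections)
total_infections = cumulative_infections[-1]

posterior_idr = posterior_ratio(1250, total_infections, model_prediction=0.3)
# No admissions reported for this hypothetical location:
posterior_ihr = posterior_ratio(None, total_infections, model_prediction=0.02)
```

With 1250 reported cases against 5000 cumulative infections, the posterior cumulative IDR is 0·25, whereas the IHR simply equals the model prediction because admissions were not reported.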
Figures found in appendix 3 show cases, hospitalisations (where available), deaths, IDR, IHR, IFR, daily infections, cumulative infections, and cumulative proportion of the population infected at least once for 399 locations.

R effective estimation in the past
Using daily infections, we directly estimated R effective in the past by location and day, where R effective at time t is:

$$R_{\text{effective}}(t) = \frac{\text{infections}(t+\theta)}{\text{infections}(t)}$$

The assumptions required for this estimation are the duration from infection to being infectious and the period of infectiousness, collectively represented as θ. We used ranges of 3-5 days for both assumptions to generate estimates of R effective in the past. These estimates are useful for identifying the effect of different non-pharmaceutical interventions on transmission in different settings. An R effective lower than 1·0 indicates that the epidemic is shrinking, whereas an R effective higher than 1·0 indicates that the epidemic is growing.
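The ratio that defines R effective can be computed directly from a daily infections series, as in the sketch below. We assume here that θ is the sum of the two durations (4 days each in this toy example, within the 3-5-day ranges); the function name and the infection counts are hypothetical.

```python
def r_effective(daily_infections, theta):
    """R effective at day t as infections(t + theta) / infections(t).

    Returns one value per day t for which t + theta is inside the series;
    days with zero infections yield None because the ratio is undefined.
    """
    values = []
    for t in range(len(daily_infections) - theta):
        current = daily_infections[t]
        values.append(daily_infections[t + theta] / current if current > 0 else None)
    return values


theta = 4 + 4  # assumed: infection-to-infectious plus infectious period, in days
daily_infections = [100] * 8 + [150] * 8   # illustrative step increase
r_values = r_effective(daily_infections, theta)
print(r_values[0])  # 1.5, i.e. above 1·0, so the epidemic is growing
```

Because the estimate for day t needs infections at day t + θ, the final θ days of any series have no R effective value.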

Role of the funding source
The funders of the study had no role in the study design, data collection, data analysis, data interpretation, or the writing of the report.

Results
Globally, daily SARS-CoV-2 infections steadily increased over the first several months of the pandemic, surpassing 3 million daily infections for the first time in mid-April, 2020, and then doubling to 6 million per day 6 weeks later (figure 2A). Global daily infections remained higher than 5 million per day until dipping slightly below that threshold after a period of decline in

Figure 3: Cumulative proportion of the population infected with SARS-CoV-2 at least once by Nov 14, 2021, by country and territory
The first administrative level is mapped for countries that are modelled at that level and have a population greater than 100 million.

Over 70% of the population had been infected in 40 countries, including over 80% in 17 countries and across states in Mexico, India, and Pakistan. More than half the population had been infected in an additional 55 countries and territories across every super-region, except high income. Notable cross-border variations were observed in some parts of the world, such as at the interface of western and central Europe, where the percentage of the population infected was substantially lower in Germany, Austria, and Italy than in the bordering nations Poland, Czechia, Slovakia, Hungary, and Slovenia. In South America, a clear demarcation can be seen splitting the tropical and Andean nations from the southern nations and the Brazilian state Rio Grande do Sul. Countries in mainland southeast Asia, such as Laos, Thailand, and Vietnam, maintained a much lower percentage of population infected than neighbouring south Asian countries or island nations within the region, such as Indonesia and the Philippines. The cumulative percentage of population infected varied widely within most countries for which subnational units were modelled in this analysis, varying by a factor of two across administrative units in Brazil, India, Italy, and Mexico; a factor of three in Germany and Spain; and over a factor of four in the USA (table).
Cumulative total COVID-19 deaths and death rates on Nov 14, 2021, can be found in the table and appendix 1 (section 9.4). Although roughly 5·6 million deaths due to COVID-19 had been reported by this date, estimated total deaths attributable to COVID-19 were nearly three times as high at 15·1 million (95% UI 11·2-20·2), a rate of 195 deaths per 100 000 people (145-262). Across all countries and territories, the estimated death rate ranged from no more than 1 per 100 000 people in New Zealand and China to 1125 (724-1709) in Bolivia. Death rates over 450 per 100 000 were estimated in 23 countries, as well as many states in Mexico; multiple states in Brazil, Italy, and the USA; and one in India. At least one country in every super-region except southeast Asia, east Asia, and Oceania surpassed 300 estimated deaths per 100 000, for a total of 51 countries. Estimated death rates remained very low throughout much of east and southeast Asia, high-income Asia Pacific, Australasia, and select countries such as Norway, Iceland, and Qatar.
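The headline figures in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only the point estimates quoted above; the implied population is a derived quantity shown for illustration, not a figure reported in the text.

```python
# Point estimates quoted in the paragraph above.
reported_deaths = 5.6e6
estimated_deaths = 15.1e6
rate_per_100k = 195

# Ratio of estimated to reported deaths ("nearly three times as high").
ratio = estimated_deaths / reported_deaths

# Population implied by the death count and the rate (derived, not reported).
implied_population = estimated_deaths / rate_per_100k * 100_000

print(f"estimated / reported deaths: {ratio:.2f}")
print(f"implied population: {implied_population / 1e9:.2f} billion")
```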
Posterior estimates of the IDR showed that 44·6% (95% UI 42·3-47·2) of COVID-19 infections were detected in the high-income super-region, with 18 countries and parts of Canada, Italy, Spain, and the USA detecting over half of the COVID-19 infections that occurred in those locations by Nov 14, 2021. Countries in Latin America and the Caribbean and central Europe, eastern Europe, and central Asia detected about 10% of infections on average, and fewer than 10% of infections were identified in each of the remaining four super-regions (table). The IHR varied by a factor of four across super-regions, and the IFR by a factor of five. The highest IHR and IFR were estimated primarily in countries with older population structures, such as Japan. The lowest IDR, IHR, and IFR were all detected in sub-Saharan Africa, where only the southern region exceeded 0·5% for any ratio (table).
During the first 20 months of the pandemic, R effective varied widely across locations and time, from lower than 0·1 to higher than 2·0. Only 39% of location-weeks for which total immunity was under 10% had R effective lower than 1. Between 10% and 20% total immunity, this proportion increased to 56%, and between 20% and 30% total immunity, we observed an additional increase to 65% of location-weeks with an R effective lower than 1 (figure 4). However, over the range of 30-60% total immunity, the percentage of observations with R effective lower than 1 decreased back to 55%. This absence of a clear relationship highlights the many other factors such as seasonality, physical distancing mandates, mask use, and new variant spread that have influenced R effective over time. From 60% to 70% total immunity, we observed 60% of observations with R effective lower than 1, and above 70% total immunity, 72% of location-weeks had an R effective lower than 1. Although these data suggest transmission to be somewhat lower at the highest levels of total immunity observed thus far, even with total immunity at 80%, we saw no indication of an abrupt drop in R effective.
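The comparison above amounts to grouping location-weeks by total immunity and computing the share with R effective below 1 in each group. A minimal sketch follows, assuming hypothetical arrays of total immunity (as proportions) and R effective per location-week. The bin edges mirror the ranges quoted in the text, while the function name and toy values are illustrative assumptions.

```python
import numpy as np


def share_below_one(total_immunity, r_eff, bin_edges):
    """Share of location-weeks with R effective < 1 within each immunity bin.

    Bins are half-open intervals [low, high); empty bins return NaN.
    """
    immunity = np.asarray(total_immunity, dtype=float)
    r = np.asarray(r_eff, dtype=float)
    shares = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (immunity >= low) & (immunity < high)
        shares.append(float(np.mean(r[in_bin] < 1.0)) if in_bin.any() else float("nan"))
    return shares


# Bin edges follow the ranges in the text: <10%, 10-20%, 20-30%, 30-60%, 60-70%, >70%.
edges = [0.0, 0.1, 0.2, 0.3, 0.6, 0.7, 1.0]

# Toy location-weeks, not study data.
immunity = [0.05, 0.08, 0.15, 0.15, 0.75, 0.72]
r_values = [1.2, 0.9, 0.8, 1.1, 0.7, 0.9]
shares = share_below_one(immunity, r_values, edges)
print(shares)
```

A sharp herd immunity threshold would appear as a jump in these shares towards 100% beyond some bin; the text reports no such abrupt change.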

Discussion
In this study, we estimated that global daily SARS-CoV-2 infections fluctuated between 3 million and 17 million new cases per day from April, 2020, to October, 2021. In total, we estimated that between the start of the pandemic and Nov 14, 2021, there were 3·80 billion (95% UI 3·44-4·08) total SARS-CoV-2 infections and reinfections combined and that 3·39 billion (3·08-3·63) individuals had been infected with SARS-CoV-2 one or more times. The proportion of the population that had been infected at least once ranged from under 1% to over 80% across countries and territories. The highest cumulative infection rates were estimated in sub-Saharan Africa; central Europe, eastern Europe, and central Asia; and south Asia. Translating daily infections into R effective showed no clear herd immunity threshold.
Cumulative infection rates through Nov 14, 2021, varied greatly across countries and territories and between subnational units within countries. This variation can be explained by a combination of factors including policies enacted by governments to encourage mask use and reduce social interaction, [48][49][50] presence of escape variants, testing and contact tracing capacity, 51,52 previous exposure to other coronaviruses, 53 baseline patterns of social interaction, and more. For instance, greatly different levels of cumulative infection were found in some neighbouring countries with similar patterns of non-COVID-19 disease burden, such as Norway and Sweden. 43 In these two countries, testing and contact tracing strategies, government restrictions, and mobility patterns varied substantially, 54 contributing to substantially different SARS-CoV-2 infection outcomes. Other countries, such as Australia and New Zealand, have shown how early and effective lockdowns, combined with geographical isolation and travel restrictions, have kept transmission low throughout the study period. 18,55,56 Excess mortality and seroprevalence data available suggest that some of the most severe COVID-19 epidemics occurred in eastern Europe and central Asia. This might be related to comparatively less public intervention, such as mask mandates or stay-at-home orders. 57,58 But, although it might be tempting to ascribe all variations in cumulative infections to effective public health action in different countries, the April-September, 2021, surges in many southeast Asian countries where, up to the end of March, public health responses to the pandemic had been swift and believed to be effective, [59][60][61] suggest that other factors might also be contributing to these patterns. 57,58
The empirical measurements of the IDR suggest that it was low early in the pandemic, when testing was scarce, and increased as testing capacity expanded. On average, the IDR increased steadily, especially over the course of the first year of the pandemic, but with marked variation across countries. This variation highlights how analyses based on the assumption that SARS-CoV-2 IDR is constant across location and time 7 could be very misleading. Although we expect that, in general, IDR increased as testing capacity increased, national guidance on who should be tested, and changes in that guidance over the course of the pandemic, might also affect the IDR. For example, on May 1, 2021, the CDC issued guidance not to test vaccinated individuals who had been exposed to COVID-19 but did not have symptoms. 62 Likewise, the advent of workplace and school testing programmes in the later months of 2021 might also shift the IDR up in some countries. Great care needs to be taken when interpreting trends based only on reported cases in the later phases of the pandemic. In many settings, hospitalisations, which tend to be a robust measure of more severe disease, are likely to be more informative than confirmed infections.
Our analysis suggests that the cumulative IFR across countries and territories ranged from 0·1% to 2·0% as of Nov 14, 2021. Age standardisation has been shown to explain a considerable portion of this variation, 40 but substantial differences remain in the available data. Some of this variation appears to be due to the prevalence of certain comorbidities, 63 and some could be residual errors in the estimation of excess mortality or seroprevalence in the available data. Nevertheless, it might turn out that other factors, such as previous exposure to other coronaviruses, help explain the considerable variation in the age-standardised IFR that is observed in the data. Such variation in the IFR should caution against studies that assume the IFR (either all-age or age-standardised) is constant across locations and over time. The temporal analysis of the IFR supports the clinical observation that the IFR was initially much higher in March and April, 2020, and subsequently declined as clinical practice improved, particularly in approaches to oxygenation and the use of corticosteroids. [64][65][66][67] Trials of some oral antivirals have shown substantial effectiveness in preventing severe disease and death, suggesting that the IFR might decline further in the coming months if these and other antivirals become widely available and if diagnostic capacity is able to support early treatment. 68,69
We did not find a clear relationship between R effective and total immunity up to 60%. Over 60%, R effective was more often under 1·0 than over 1·0. Despite this finding, we observed no obvious herd immunity threshold in the data. The generally weak relationship between R effective and total immunity highlights the powerful role of other factors driving infection, including physical distancing mandates, seasonality, mask use, and the emergence of new variants over the study period (especially the delta variant). Although figure 4 does not show us the prospects for reaching herd immunity in each location for any given season or variant, the overall relationship points to the very high degree of combined natural and vaccine-derived immunity that might be needed to block community transmission (especially in the winter months).
This empirical analysis has several important limitations. First, some seroprevalence surveys (such as the CDC monitoring of laboratory data) might be biased, but the direction of the bias is difficult to ascertain. Additionally, in reporting serosurveys, various corrections can be applied to produce estimates, including the use of sampling weights, correcting for manufacturing sensitivity and specificity, or, in some instances, full correction for waning detectability. Where possible, we attempted to standardise for this by extracting data that were adjusted for sampling frame and manufacturing sensitivity, but not more complex corrections. If this was not possible, we used the raw numerator and denominator as reported. In some instances, no metadata were provided to describe whether any correction had been applied. In all instances, these values were treated as equivalent. Second, we have assumed that one of the key covariates for the IDR is demonstrated testing capacity. By construction, this variable cannot decline as it is the maximum value of previously observed daily testing rates. In some countries, changes in guidance on who gets tested could lead to declines in effective testing and the IDR, and we may have missed these changes. The CDC guidance in spring, 2021, not to test vaccinated individuals who were asymptomatic or mildly symptomatic is an example of such a policy. Third, vaccination increases the proportion of the population who test positive on antispike antibody tests. We note in some locations, particularly in the UK, attempts to account for vaccination rates resulted in decreasing estimates of seroprevalence over time, suggesting that assumptions about the probability of vaccinated individuals being identified in serological surveys in those locations are incompatible with the data collected; in these instances, we excluded the seroprevalence data from the analysis.
Fourth, matched seroprevalence surveys with reported cumulative cases, hospitalisations, and deaths provide an interval measure of the IDR, IHR, and IFR from the beginning of the pandemic to the period of the survey. We used these interval measures to derive relationships for the daily IDR, IHR, and IFR. This approach decreases our ability to identify drivers of shorter-term fluctuations in these key rates. Fifth, the availability of hospital admissions data in low-income and middle-income settings was generally low, minimising its effect on the estimation process in many countries. Sixth, we used estimates of total COVID-19 mortality based on the measurement or estimation of excess mortality multiplied by a statistical estimate of the proportion of excess mortality directly attributable to infection with SARS-CoV-2. This statistical estimation was based on removing the effect of a low IDR and reduced mobility that might be a proxy for deferred care and other health effects of isolation. This estimate of the proportion of excess mortality that is total COVID-19 has wide UIs. Eventually, better data will emerge on causes of death during the pandemic that will hopefully refine the estimate of total COVID-19 deaths. The wide uncertainty in the ratio of total COVID-19 to reported COVID-19 is reflected in the uncertainty analysis in this study. Seventh, our model permitted a maximum of two infections per individual: in the case where a person gets an ancestral or alpha variant infection, they might also be infected with a beta, gamma, or delta variant. There is evidence of waning naturally derived immunity, suggesting that an individual might become more broadly susceptible to reinfection sometime after exposure. 70
This empirical analysis of past COVID-19 infections ends at the point where the omicron (B.1.1.529) wave was first detected in Gauteng province in South Africa. Omicron is much more transmissible than previous variants and has shown immune escape. 71 Since Nov 14, 2021, the omicron wave has taken off in all countries and territories. Because of much lower severity of disease, the IDR is likely to have dropped considerably during the omicron wave. Models suggest that more than 50% of the world might have been infected with omicron already; however, a detailed analysis will have to await new seroprevalence data emerging in the coming months. Cumulative infections for COVID-19 through to March, 2022, might be nearly double what occurred through Nov 14, 2021.
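The second limitation above describes demonstrated testing capacity as the maximum of previously observed daily testing rates, which is why it cannot decline. A minimal sketch of that construction follows, assuming a hypothetical series of daily tests per capita with no missing values; the function name and numbers are illustrative and any smoothing applied in the study is not reproduced.

```python
import numpy as np


def demonstrated_testing_capacity(daily_testing_rate):
    """Running maximum of the daily testing rate observed up to each day."""
    rates = np.asarray(daily_testing_rate, dtype=float)
    return np.maximum.accumulate(rates)


# Toy daily testing rates (tests per 1000 people), not study data.
observed = [1.0, 2.5, 2.0, 3.0, 1.5]
capacity = demonstrated_testing_capacity(observed)
print(capacity.tolist())  # [1.0, 2.5, 2.5, 3.0, 3.0]
```

The observed rate falls on the third and fifth days while the capacity covariate stays flat, which illustrates how a real decline in effective testing could go unnoticed by the IDR model.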

Conclusion
COVID-19 has had a staggering impact on the world, with 3·39 billion (95% UI 3·08-3·63) people infected with SARS-CoV-2 at least once as of Nov 14, 2021. These findings highlight the potential for COVID-19 to have a continued and profound impact on the world's population. The vast differences in cumulative proportion of the population infected across countries and territories can help policy makers identify locations whose transmission-prevention strategies should be emulated, as well as those populations at greatest risk of future infection, a factor that should be considered in global vaccine prioritisation. Our statistical approach to estimating SARS-CoV-2 infection, which can be applied routinely and will allow for rapid availability of estimates, will be crucially important for research, science, and policy efforts towards pandemic preparedness, response, and control in the coming months and years. It has been and continues to be made freely available to all on a routine basis.
was partly supported by CNPq (310679/2016-8 and 465518/2014-1), by FAPEMIG (PPM-00428-17 and RED-00081-16) and CAPES (88887.507149/2020-00). D F Santomauro is employed by the Queensland Centre for Mental Health Research, which receives core funding from the Department of Health, Queensland Government. C S Wiysonge's work is supported by the South African Medical Research Council.