Estimating incidence of infection from diverse data sources: Zika virus in Puerto Rico, 2016

Emerging epidemics are challenging to track. Only a subset of cases is recognized and reported, as seen with the Zika virus (ZIKV) epidemic where large proportions of infection were asymptomatic. However, multiple imperfect indicators of infection provide an opportunity to estimate the underlying incidence of infection. We developed a modeling approach that integrates a generic Time-series Susceptible-Infected-Recovered epidemic model with assumptions about reporting biases in a Bayesian framework and applied it to the 2016 Zika epidemic in Puerto Rico using three indicators: suspected arboviral cases, suspected Zika-associated Guillain-Barré Syndrome cases, and blood bank data. Using this combination of surveillance data, we estimated the peak of the epidemic occurred during the week of August 15, 2016 (the 33rd week of year), and 120 to 140 (50% credible interval [CrI], 95% CrI: 97 to 170) weekly infections per 10,000 population occurred at the peak. By the end of 2016, we estimated that approximately 890,000 (95% CrI: 660,000 to 1,100,000) individuals were infected in 2016 (26%, 95% CrI: 19% to 33%, of the population infected). Utilizing multiple indicators offers the opportunity for real-time and retrospective situational awareness to support epidemic preparedness and response.


Introduction
The emergence and rapid spread of Zika virus (ZIKV), an arbovirus transmitted by Aedes species mosquitoes, in the Americas [1] resulted in large-scale epidemics throughout the tropical areas of the region. The first confirmed locally acquired ZIKV case in Puerto Rico was reported on December 31, 2015 [2], followed by more than 36,000 confirmed cases in 2016 [3]. While confirmed cases provided an indicator of transmission intensity, reported cases represented a small proportion of actual infections [4] in part because many ZIKV infections are asymptomatic or mild, and are not captured by surveillance systems [5][6][7]. Furthermore, distinguishing symptomatic (i.e., disease) cases of ZIKV infections from other arboviral infections (e.g., dengue, chikungunya) was difficult due to their similar symptoms (e.g., fever, rash), and serological cross-reactivity with dengue viruses (DENV). Despite these challenges, estimating the underlying ZIKV infection incidence was critical to assess useful metrics (e.g., transmission intensity, the number of people previously infected, and the number still at risk) that informed prevention and response measures, and preparation for severe outcomes like Guillain-Barré Syndrome (GBS) [8] and congenital Zika syndrome [9].
ZIKV serosurveys, like those conducted in Yap and French Polynesia [5,10], provided estimates of cumulative incidence, but are logistically difficult, require substantial time and resources, and present diagnostic challenges due to varying duration of infection markers (RNA and different types of antibodies) and cross-reactivity [11]. Therefore, it is important to find alternative methods to estimate incidence of infection, including statistical techniques that can be easily applied to surveillance data.
Although many infections are undetected during outbreaks, data exist for the set of infections captured through surveillance systems. Bayesian statistical methods explicitly consider both variability in observations (data) and uncertainty in model parameters (e.g., probability of observation), and are well-suited to address challenges like estimating quantities that are not directly observed. During the emergence of ZIKV in Martinique, Andronico et al. [12] developed a Bayesian model to explicitly incorporate a classic epidemiological compartmental model with surveillance data from Martinique using prior information on ZIKV transmission, reporting rates, and GBS risk from French Polynesia. We employed a similar approach in Puerto Rico incorporating multiple surveillance indicators and prior information on the probability of observing infections.
We considered surveillance data on suspected arbovirus cases, suspected Zika-associated GBS cases, and infections identified through a subset of blood banks as indicators of infection. Suspected arbovirus cases identified through passive surveillance reflect symptomatic careseeking individuals with symptoms indicative of ZIKV, dengue virus, or chikungunya virus infection. Suspected Zika-associated GBS cases represented a more severe and easily recognized manifestation, though GBS can also result from other causes. Blood donor data provided information on asymptomatic and pre-symptomatic infections identified through blood screening. To capture underlying infection dynamics, we used a generic Time-series Susceptible-Infected-Recovered epidemic model within a Bayesian framework to relate infections to data by utilizing evidence-based assumptions on detection probabilities for each indicator. In this framework, we estimated the weekly incidence on ZIKV infection and the cumulative number of infections in Puerto Rico in 2016.

Estimated infections based on individual surveillance indicators
In 2016, 65,820 suspected arboviral disease cases, 175 suspected GBS cases, and 360 ZIKV-positive blood donors (out of 54,588 tested), were reported in Puerto Rico (Fig 1A-1C). We used these data to estimate the weekly and cumulative ZIKV infection incidence in 2016 for each surveillance indicator independently (Fig 1D). Estimated suspected arbovirus cases, suspected GBS cases and ZIKV-positive blood donors from the individual indicator models reflected a reasonable fit to these reported indicator data (Fig A in S1 Text). Weekly ZIKV infections estimated from suspected arbovirus cases and suspected GBS cases had similar trends over time, peaking in August 2016 and declining thereafter. Using the blood bank data, estimated ZIKV incidence peaked in June followed by high incidence through August and declined afterwards. For cumulative incidence estimates based on each of the three indicators, the median estimate was lowest using suspected GBS cases (880,000 infections) and highest using blood bank testing data (960,000 infections), with substantial overlap of credible intervals (Table 1). Estimates based on suspected arbovirus cases had the lowest uncertainty (95% Credible Interval (95% CrI): 630,000 to 1,200,000) and estimates based on suspected GBS cases had the highest (95% CrI: 420,000 to 1,300,000). The estimated proportion of the population infected during the outbreak was similar across the three indicators, with 27% (95% CrI: 19% to 35%) infected using suspected arbovirus case indicator, 26% (95% CrI: 12% to 38%) using the suspected GBS indicator, and 28% (95% CrI: 19% to 37%) using the blood bank indicator.

Estimated infections with combined surveillance indicators
Using a combination of the three surveillance indicators, infections peaked between August and September 2016 (Fig 1E), reflecting the combined peaks in incidence from the three indicators. We estimated that the total incident ZIKV infections was most likely between 810,000 and 970,000 infections (50% CrI, 95% CrI: 660,000 to 1,100,000) (Table 1), corresponding to 24% to 29% (50% CrI, 95% CrI: 19% to 33%) of the total population. These estimates correspond well to an a priori triangular probability distribution used to anticipate resource needs during the epidemic [13,14] (Fig 2). The combined estimates had reduced uncertainty compared to that triangle distribution and each independent estimate based on the individual surveillance indicators.

Prior and posterior parameter distributions
For each of the estimates reported above, we used a model with an informed set of prior parameter distributions. However, we also compared these estimates to those with less informed priors (increased variance) and naïve priors ( Fig 3A). Increased uncertainty in the priors resulted in similar median estimates for infections throughout 2016, but increased uncertainty especially for the lower bound of the credible intervals ( Table 2).
For the three prior distributions, the baseline transmission parameter (β 0 ) converged on a similar value regardless of the surveillance data used. As expected, the less informative priors led to higher uncertainty in the posterior distributions for the outcome parameters, particularly for the individual models in which additional data are not available to inform the estimates ( Fig 3B). The most notable effect was seen for the suspected GBS surveillance model. In this case, the posterior baseline GBS risk (p G0 ) was slightly higher and the ZIKV-specific infection risk (p G|Z ) was highly uncertain, indicating a lack of sufficient information to distinguish between the two components of GBS risk. On the other hand, the combined model was able to resolve all parameters regardless of the assumed prior variance. However, parameters using naïve priors still had more uncertainty in their posterior distributions ( Fig 3C).

Posterior estimates over time
Over the progression of the outbreak, the parameter posteriors evolved over time as more data became available for each surveillance indicator ( Fig 3D). For the first 4 weeks of 2016, the posterior estimates for individual parameters largely reflected the priors. However, by 8 weeks the posteriors started to shift, narrow, and stabilize. We observed similar trends when assessing how the incidence estimates of the individual and combined models changed over 4-week increments (Fig C in S1 Text). When incorporating new data for each surveillance indicator and the combined model over the course of the 2016 epidemic, the incidence estimates had the largest uncertainty in the earliest weeks of the outbreak (Fig C in S1 Text), though the uncertainty was larger for individual models. For each 4-week estimation, the end-of-year estimate based on the full dataset fell within the uncertainty bounds.
Evaluating the time-varying transmission parameter (β t ) possible seasonality showed the β t parameter had a constant pattern with some week-to-week variation (Fig D in S1 Text). This trend is consistent with a pattern expected of an emerging pathogen as opposed to a seasonal trend, where stronger seasonal oscillations would be expected.

Discussion
Estimates of the true burden of infection for ZIKV, like other pathogens, is challenging because many infections are inapparent; apparent infections are not always recognized, confirmed or reported; and disease surveillance systems for collecting case data are highly varied. Here, we developed an epidemic model, applied within a Bayesian modeling framework, to estimate ZIKV incidence in Puerto Rico using three separate infection indicators available from multiple surveillance systems and assumptions about detection probabilities for each system. Our approach utilizes different data sources to increase the precision of infection estimates over time and may further reduce bias by accounting for inherent surveillance biases based on the probability of detection. Using this framework, we estimated that ZIKV infections occurred in roughly a quarter of the population, resulting in 890,000 total incident ZIKV infections in Puerto Rico in 2016, translating to an average of 36 to 44 new infections per 10,000 people per week. The peak of the epidemic occurred during the week of August 15, 2016 (i.e., week 33), when an estimated 120-140 weekly incident infections occurred per 10,000 people, and correspond to the observed mid-August peak of the outbreak when 2,542 ZIKV cases were confirmed [3].
In contrast to our analysis, studies from other islands found substantially higher seroprevalence estimates, including 73% in Yap in 2007 [5], 49% in French Polynesia in 2013-2014 [10], and 42%-50% in Martinique in 2015 [7,31]. The differences in the estimated underlying infection burdens may be in part due to heterogeneity in exposure to infection even within an island population, as seen in early dengue serosurveys in Puerto Rico and municipality-level estimates for ZIKV infection rates [27]. In this analysis, we did not attempt to estimate municipality level due to sparse suspected GBS case data and limited spatial representativeness of blood bank data. Invasion patterns, by nature drive some spatial heterogeneity which can influence longer-term transmission dynamics, since areas with high immunity may limit transmission to areas with low immunity. Other factors that may contribute to differing levels of immunity include human mobility, underlying socioeconomics, and cross-immunity from other arboviruses, and all warrant further examination. The estimated 2016 Puerto Rico infection rate is well below most model estimates for how large a Zika epidemic would be if the population was perfectly mixed [32].
The framework developed here offers advantages beyond the estimation of epidemic size. In practice, we used the model to actively inform situational awareness beginning in August 2016. From the analysis of performance over time, the model could have provided useful information even earlier in the epidemic, despite limited availability of information about the outcome probabilities. Informative priors were available early in the year for baseline GBS risk and for the blood bank reporting factor. While specification of informative prior distributions for the reporting of ZIKV cases and suspected GBS cases would have been more challenging, they would not have been completely naïve. Critically, combining indicators lessened the need for strong prior information for any single indicator. Assuming less precise priors had some effect on outcome precision for all models but had the least effect when indicators were combined, except in the case of completely naïve priors. Similar observations were reported by Andronico et al. [12], where examination of different prior assumptions did not strongly affect parameter estimates.
Our approach had some limitations. One of our key indicators, suspected arboviral cases, uses a broad case definition to capture all potential symptomatic cases, and likely includes cases caused by other circulating arboviruses, such as dengue and chikungunya viruses. Confirmed rather than suspected cases could have been used as an indicator, however confirmation is also dependent on testing and test timing and introduces an additional delay. The results also suggest this would have made little difference as the model based only on suspected arboviral cases gave estimates in the same range as those for the other indicators and 92% of suspected arboviral cases were confirmed as ZIKV infections, as dengue and chikungunya were rare in Puerto Rico in 2016 [3]. We expect if there had been more dengue and chikungunya cases, our framework would improve differentiation from ZIKV cases from other arboviral cases. In general, our prior assumptions were based on data available at the time. If these early data contain hidden bias, resulting estimates could also be biased. Our results suggest that being conservative with respect to precision does not sacrifice substantial precision in the results, in part due to using multiple indicators. Our framework did not allow for substantial transmission variation beyond some week-to-week variability and reduced incidence due to acquired immunity in those already infected. In practice, transmission likely changes seasonally and can be influenced by vector control measures. In this case, with the introduction of a completely novel virus and the lack of large-scale effective vector control measures, we expect there would have been little change in transmissibility throughout the year and examination of β t over time did not show evidence of seasonal variation. Nonetheless, it is likely that there are some seasonal patterns to transmission. Those would likely be stronger in other epidemiological settings and could be incorporated into future versions of the framework. In addition, including biologically relevant weighting of the transmission parameter relative to each surveillance indicator could provide additional insight on transmission dynamics over time, including direct estimates of the reproductive number, as well as differences between the surveillance indicators [33,34].
Reported cases generally underestimate the true number of incident infections occurring in an epidemic, since they capture only recognized and reported clinical infections. However, multiple imperfect indicators provide the opportunity to estimate the underlying incidence of infection, utilizing multiple complementary indicators: (1) a broad non-specific indicator with relatively high counts (suspect arboviral disease cases); (2) an indicator of a severe outcome that is rare, but likely to have high reporting fidelity (GBS); and (3) an age-and geographybiased sample of infection prevalence (blood donors). Each indicator offers a unique but biased insight into the progression of the epidemic, capturing different case subgroups including different age-groups, and, when combined, can provide critical situational awareness about the progression of the epidemic.
The approach described here can estimate how many people have been infected in near real-time or identify changes in the trajectory of incidence across various indicators. It is also useful for post-hoc analysis to understand what the impact may have been on the populationlevel and whether more transmission may be expected. With 19-33% of the population infected in 2016, ZIKV transmission should be much more limited but still possible in Puerto Rico, particularly in areas that may have experienced lower infection rates. These insights are critical both for preparedness for and response to future epidemics, and this modeling approach is applicable to future Zika epidemics as well as epidemics of other pathogens. This approach could be applied to future outbreaks of dengue and other arboviruses in Puerto Rico, using suspected arboviral case, GBS, blood bank, and potentially other data as indicators. Likewise, other types of surveillance data, like reported influenza-like-illness (ILI)-associated hospitalizations, outpatient ILI visits and reported laboratory-confirmed specimens, could be used to enhance influenza surveillance. Each have limitations as individual indicators of the incidence of influenza infection, but when combined they may best approximate the incidence of influenza infection. Approaches like the one we present here provide a tool to incorporate these diverse data and the uncertainties in them to generate timely estimates of incident infection and inform response and control efforts.

Ethics statement
Exemption was obtained from the CDC Human Subjects Research Office as the data were collected as part of regular surveillance activities.

Data
We collected data on suspected arboviral disease cases, suspected GBS cases, and infections detected among blood donors reported in Puerto Rico during January 1, 2016-December 31, 2016. A suspected arboviral disease case was defined as any patient with clinically suspected illness resulting from an arbovirus infection and reported through the Passive Arboviral Diseases Surveillance System (PADSS) in Puerto Rico [3,35]. Suspected GBS cases were patients experiencing onset of neurological symptoms characteristic of GBS (e.g., bilateral flaccid limb weakness [13]) reported through the GBS Passive Surveillance System-a surveillance system capturing GBS cases along with patients experiencing other neurological symptoms (e.g., encephalitis [13]) and operated by the Puerto Rico Department of Public Health (PRDH) with support from the Centers for Disease Control and Prevention (CDC) [36]. Not all suspected GBS cases were confirmed to be either ZIKV infections or GBS cases, rather they represented real-time reports of possible GBS cases. We used an indicator for asymptomatic and pre-symptomatic infections using blood bank data beginning in April 2016 when all blood donations were tested for ZIKV RNA [30]. Two blood collection agencies provided the numbers of total donations and the number testing positive for ZIKV [30], the population of donors was not representative, being predominantly male and not including anyone under age 16. For the population of Puerto Rico, we assumed the population was approximately 3.4 million people based on 2016 census estimates [37]. See S1 Data for the weekly suspected arboviral disease cases, suspected GBS cases, and blood donor data.

Epidemic model
We developed a generalized Bayesian discrete Time-series Susceptible-Infected-Removed model [33,34] to fit surveillance data over the 52 weeks of 2016 to estimate weekly ZIKV infection incidence in Puerto Rico. We used an underlying SIR epidemic model, where a proportion of the population was infected each week, z t , and was defined as the product of the proportion infected in the previous week (z t−1 ), the proportion of population that was susceptible in the previous week (s t−1 ), and a time-varying transmission rate, β t : b t � Normal ð0;1Þ ðb 0 ; sÞ where we use the notation Distribution (a,b) (θ) to indicate that we use the stated Distribution with relevant parameter(s) θ but restricted to support (a,b), and so scaled to yield a valid distribution. The time-varying transmission rate (β t ) was assumed to have a constant mean reflecting no substantial control, with random variability between weeks, and was constrained to be greater than or equal to zero using a half-normal distribution. We assumed a prior for the baseline transmission rate (β 0 , β 0~N ormal (0,1) (2, 1)) reflecting an expected weekly reproductive number on the order of 1 to 5 [32]. The prior for the standard deviation had a similar scale (σ~Normal (0,1) (0, 1)). Though the average time between successive generations of arboviral infections, i.e., generation times, is typically several weeks [33,38], we implemented this model with a weekly time step, intended to reflect a generic representation of a weekly transmission process in which β t cannot be directly interpreted as R 0 , the basic reproductive number. We did not explicitly aim to model the initial phases of the epidemic, and therefore only estimated an initial proportion of the population infected in the first week of 2016, using a restricted normal prior to indicate a small prevalence of infection that week (z 0~N ormal (0,1) (0, 0.001)). All other individuals were assumed susceptible to infection, as evidence suggests only very limited transmission prior to 2016 [2]. Our model assumed a closed population, meaning that susceptible and infection population estimates depended only on population-level risk, and that there were no births, deaths, or migration.

Reporting models
For each surveillance indicator, we estimated the probability of observing an infection as a function of infection risk (z t ) and the observation process for each data type within the epidemic model. The combined model estimated incident infections from the three individual indicators within one epidemic model.

Suspected arboviral cases
Given that laboratory testing identified very few dengue and chikungunya cases [3], we assumed most suspected arboviral cases were suspected Zika cases. We estimated the expected number of reported suspect arboviral cases (S t ) as the product of population size (N), ZIKV infection prevalence (z t ), and the probability of reporting a clinical suspect Zika case per infection (p S|Z ) and fit the case data using the negative binomial distribution formulated as a mean and dispersion:  Table 3). The dispersion, φ S , was assigned a prior distribution with high expected overdispersion, F S~N ormal (0,1) (0, 1000).

Suspected ZIKV-associated GBS cases
We assumed the number of observed suspected GBS cases (GBS t ) came from a binomial distribution: where p G0 is the weekly risk of GBS due to other causes and p G|Z is the probability of suspect GBS given ZIKV infection and p G0 +(p G|Z −p G0 � z t ) represents the probability of an individual in population N being reported as a suspected GBS case. We assumed the global baseline GBS risk was 0.8-1.9 GBS cases per 100,000 per year [40] for Puerto Rico, and the weekly risk was approximated using a beta prior distribution: p G0~B eta(23, 8.9x10 7 ). The prior for the probability of suspected GBS given ZIKV infection p G|Z was based on an estimated range of 0.5-4.6 GBS cases per 10,000 ZIKV infections [39], which was approximated with a beta distribution (p G|Z~B eta(5.9, 2.3x10 4 )).

Blood bank indicator
We assumed the number of positive blood donors (B Z,t ) was a binomial sample from all tested donors (B t ): We used an adjustment factor (f BB ) to account for the duration of test positivity and the proportion of infected individuals excluded from donating blood because they were symptomatic at the time of donation [30]. Assuming weekly testing, the equations and distributions of Chevalier et al. were used to approximate a prior distribution for this factor: We sampled from distributions of each component, p A , the proportion of asymptomatic infections [5,6,11]; V, the duration of viremia [41]; and D, incubation period [41,42] (see S1 Text), to estimate a Gamma prior for f BB : f BB = Gamma(6.7, 7.6).

Analysis of priors
We examined the effect of different prior distribution assumptions on model posteriors for the probability of a suspected case being reported (p S|Z ), the probability of acquiring GBS if ZIKV infected (p G|Z ), and the relative incidence of ZIKV in the general population compared to positivity in blood donor (f BB ) parameters. We assessed the models with three alternative types of prior variance: informative (as described above), informative with doubled standard deviation in the prior and naïve (uniform or flattened). Prior distributions for each parameter under each assumption are available in Table A in S1 Text.

Model fitting
We fitted epidemic models using a Markov chain Monte Carlo (MCMC) Bayesian framework to estimate incident ZIKV infections and 95% credible intervals [95% CrI] from the three surveillance indicators individually and using all three indicators combined. For each indicator model, we performed 1,001,000 iterations for three chains, and discarded the initial 1,000 iterations as the burn-in period. We evaluated convergence using the Gelman-Rubin diagnostic [32] and thinned the output using every 1,000th sample to obtain 1,000 effectively uncorrelated simulations per chain.