Feasibility of very short-term forecast models for COVID-19 hospital-based surveillance

Abstract INTRODUCTION: We evaluated the performance of Bayesian vector autoregressive (BVAR) and Holt’s models to forecast the weekly COVID-19 reported cases in six units of a large hospital. METHODS: Cases reported from epidemiologic weeks (EW) 12-37 were selected as the training period, and from EW 38-41 as the test period. RESULTS: The models performed well in forecasting cases within one or two weeks following the end of the time-series, but forecasts for a more distant period were inaccurate. CONCLUSIONS: Both models offered reasonable performance in very short-term forecasts for confirmed cases of COVID-19.

The coronavirus disease (COVID-19) pandemic has already claimed more than 190,000 lives in Brazil at the time of this writing. Among the many efforts made to increase knowledge about this new disease and enable strategies for its mitigation, mathematical and statistical models can provide useful information about the dynamics of COVID-19. These approaches include susceptibleinfected-recovered (SIR) models and their extensions, exponential smoothing, and models based on S-shaped curves, such as the logistic, Gompertz, and Richards curves 1 . These models are used to describe the temporal variations of an outbreak and can provide short-term forecasts of new cases of the disease. Furthermore, when the aim is to evaluate two or more simultaneous time-series of events, we can consider the use of multivariate approaches such as vector autoregressive (VAR) models 2 . VAR models are very useful, especially in the field of economics 2 , but they also have applications in epidemiology 3 , including the modeling of COVID-19 data 4 .
In this study, we used a multivariate time-series analysis based on a Bayesian extension of the VAR model in order to describe the weekly number of confirmed cases of COVID-19 reported in six units of the Clinical Hospital of the Ribeirão Preto Medical School (HCFMRP), University of São Paulo (USP), and to explore the possibility of very short-term forecasts. Multivariate time-series data can provide more information than univariate time-series data. VAR models and their extensions allow us to investigate how each variable impacts the other in each separate time-series, and this can improve the forecasting process. We compared the forecast accuracy of this multivariate analysis with those obtained by exponential smoothing based on Holt's method.
We used data from the Center of Epidemiological and Hospital Surveillance (NVEH) of the HCFMRP. By convention, epidemiological weeks (EW) run from Sunday to Saturday, and we considered the weekly number of confirmed cases of COVID-19 reported from EW 12 (March 15-21, 2020) to EW 41 (October 4-10, 2020). The HCFMRP is a large hospital complex with many health departments and 1,194 general beds in total. There were 45,866 hospitalizations and 41,377 surgeries in 2019. The health units of the HCFMRP included in the present study were the central unit located on the USP campus (here referred as "Campus"), the State Hospital of Américo Brasiliense (HEAB), the State Hospital of Ribeirão Preto (HERP), the State Hospital of Serrana (HES), the Reference Center in Women's Health (MATER), and the Emergency Unity (UE). Four of these units are located in Ribeirão Preto, a medium-sized city in the northwest region of the state of São Paulo, Brazil, with approximately 700,000 inhabitants, and the other two units are located in the cities of Américo Brasiliense and Serrana, located near Ribeirão Preto. Patients suspected of infection were tested for COVID-19 by reverse transcription polymerase chain reaction (RT-PCR). This was a study conducted with anonymized secondary data and therefore did not require approval from the Human Research Ethics Committee.
Formally, a time-series is defined as a collection of random variables {Y t }, where the time index t assumes integer values t = 1, 2, 3, and so on. Let us consider a K-dimensional multivariate time-series denoted by Y t = (Y 1t , …, Y Kt ). The VAR model, in this matrix form, is given by where p is the order of the autoregressive process (lag), B 0 = (b 10 , b 20 , ..., b K0 ) is a K×1 column vector of coefficients, B j are KxK matrices of coefficients (j = 1,...,p), and e t = (e 1(t) , e 2(t) , ..., e K(t) ) are the errors of the model 2,3 . It is often assumed that the errors follow a multivariate normal distribution with a vector mean of 0 and variance-covariance matrix S. An autoregressive model of order p was used, meaning that the p preceding values are used to predict the next value. In this study, we consider a multivariate set of K = 6 time series, each relative to a health unit of HCFMRP. The length of each series was 30 weeks, considering data until October 10, 2020. The estimation of a VAR model uses an equation-by-equation approach, where ordinary least squares (OLS) regressions were used for each equation. One of the prerequisites for the estimation of a VAR model is that the analyzed time-series are stationary, or that they show similar behavior throughout their duration. In this case, their mean and variance remained unchanged with time. Brooks 5 argues that, "if one wishes to use hypothesis tests, either singly or jointly, to examine the statistical significance of the coefficients, then it is essential that all of the components in the VAR are stationary." As we used the cumulative number of COVID-19 cases to conduct model fitting, variables are obviously non-stationary given the increasing behavior of the corresponding curves. As an alternative, we can use the Bayesian vector autoregressive (BVAR) model. BVAR models use Bayesian methods to estimate the parameters of the VAR model, and according to Holden 6 , some authors have argued that this approach may be applied to non-stationary variables. For example, in a study discussing the asymptotic distribution theory for statistics from autoregressive models with a unit root, including VAR models, Sims et al. 7 claim that "because the Bayesian approach is entirely based on the likelihood function, which has the same Gaussian shape regardless of the presence of nonstationarity, Bayesian inference need take no special account of nonstationarity." Gupta and Kotzé 8 also provide some discussion on this.
The Akaike information criterion (AIC) values for different models with lag length (p) ranging from 1 to 6 are, respectively, given by 969.8, 898.6, 880.2, 761.5, 626.4, and 839.9. Models with p > 6 were not fitted, because high-order lags may overfit the data. Based on these AIC values, we decided to consider p = 5, because models with lower AIC values are preferred. Figure 1 describes the time-series for the cumulative number of cases reported in each health unit and the predicted values obtained from the BVAR model, based on p = 1 and p = 5. These models were fitted using the Metropolis-Hastings algorithm in the R package "BVAR", based on 100,000 simulated samples taken after discarding an initial 5,000 burn-in period. Convergence of the simulated samples was verified by trace and density plots, and correlation between successive samples was inspected with autocorrelation plots. The BVAR package uses the Minnesota prior as baseline 9 , which is a prior distribution that transforms the VAR model into a random walk process for each variable. Figure 1 shows a sudden increase in the number of hospitalizations at the HERP in EW 31, which is accompanied by a decrease in the number of hospitalizations in the other units with the most beds. This decrease is not so clear in the corresponding graphs because the cumulative number of hospitalizations is relatively large in these units, but this situation describes the possible reallocations of patients between the units of HCFMRP. In this way, when considering interrelations between time-series variables, the BVAR model can provide an effective method to cope with these effects.
Holt's method is a special case of the Holt-Winters exponential model in which seasonality is absent. In the equation for Holt's method, the forecasted value of the series at time t is given by where L t is the estimated level given by T t is the estimated slope given by and α and β are smoothing parameters 10,11 . Holt's model can be applied by using the "holt" function included in the R package "forecast".
We then assessed the feasibility of using BVAR and Holt's models to obtain short-term forecasts for weekly COVID-19 cases in each health unit. Given that Holt's method is a univariate approach, we adjusted independent models for each health unit. We considered weekly reports from the date on which the first case was notified in one of the units of HCFMRP up to EW 37 as the training period. The values of the validation period were the corresponding observations from EW 38-41. Comparisons between the forecasted and actual values were based on the mean absolute percent error (MAPE), and Theil's U entropy coefficient was used as a measure of outof-sample forecasting accuracy 12 . The lower the MAPE value, the better the performance of the corresponding model. When Theil's coefficient is greater than 1, the forecasts under consideration are less accurate than those obtained by a naïve approach, or a simple method in which the forecasts are equal to the last observed value. Table 1 shows the weekly observed COVID-19 cases and the corresponding forecasts obtained from the BVAR and Holt's models (with 95% prediction intervals) in the validation period. Table 2 shows the MAPE and Theil's U values considering forecasts of   Even in a hospital context, the dynamics of COVID-19 can change very quickly during the time course of disease in response to a large number of factors, including changes in mitigation strategies for transmission in the community and people's adherence to them, and the availability of tests for essential screening 13 . The future of an epidemic in a population during its course is thus hard to predict, and mathematical and statistical models are only capable of simulating what can happen in the future if the conditions observed in the present do not change 14 . In the present study, we have data reported in weeks, and a period of two weeks or more can be large enough to produce substantial changes in exogenous variables that affect the dynamics of the disease. Although important, these limitations do not reduce the usefulness of the study in obtaining very short-term forecasts for confirmed cases of COVID-19, since out-of-sample predictions for the next one or two weeks can provide useful information for healthcare planning and distribution of resources. In economics, very short-term forecasting is sometimes called nowcasting (a contraction of "now" and "forecasting"), where forecasts for a very near future are useful to monitor changes in variables of interest in real time 15 . In the context of the HCFMRP, accurate short-term forecasts can create alerts for situations of peak demand for hospitalizations during the pandemic and can help health managers optimize the costs of supplies and staff. Our results suggest that the BVAR model is a feasible tool to obtain these short-term forecasts. In addition, the "BVAR" package in R offers researchers with programming knowledge a powerful tool for the study of multivariate time-series in epidemiology.
The limitations of the present study include uncertainties in the diagnostic accuracy of the tests used to determine the COVID-19 status and, from a statistical perspective, the fact that the time-series are relatively short and the weekly case counts were sometimes small, which can make inferences difficult. Despite these shortcomings, both BVAR and Holt's models offered reasonable performances in very short-term forecasts for confirmed cases of COVID-19 and can be valuable tools for disease surveillance.

AUTHORS' CONTRIBUTION
EZM: conception, data analysis, interpretation; ADCP: critical review and acquisition of data; AFC: management of the system of information; ACE: data typing, database validation; RAM: database validation and data organization; JMS: critical review and interpretation; FBR: critical review and interpretation; DCA: review, data analysis and discussion. All authors have read and approved the final version of the manuscript and contributed to its writing, review, and editing.