Testing the Accuracy of the ARIMA Models in Forecasting the Spreading of COVID-19 and the Associated Mortality Rate

Background and objectives: The current SARS-CoV-2 pandemic has affected the lives of tens of millions of people around the world over the last nine to ten months. Although the situation is stable to some extent within the developed countries, approximately one million people have already died as a consequence of the unique symptomatology that they displayed. Thus, the need to develop an effective strategy for monitoring, restricting, and especially predicting the evolution of COVID-19 is urgent, especially in middle-class countries such as Romania. Material and Methods: Autoregressive integrated moving average (ARIMA) models were created to predict the epidemiological course of COVID-19 in Romania using two statistical software packages (STATGRAPHICS Centurion (v.18.1.13) and IBM SPSS (v.20.0.0)). To increase the accuracy, we collected data for the established interval (1 March to 31 August) from the official website of the Romanian Government and the World Health Organization. Results: Several ARIMA models were generated, from which ARIMA (1,2,1), ARIMA (3,2,2), ARIMA (3,1,3), ARIMA (3,2,2), ARIMA (3,1,3), ARIMA (2,2,2) and ARIMA (1,2,1) were considered the best models. For this, we took into account the lowest value of mean absolute percentage error (MAPE) for March, April, May, June, July, and August. For STATGRAPHICS Centurion (v.18.1.13), MAPE was 9.3225 (March), 0.975287 (April), 0.227675 (May), 0.161412 (June), 0.243285 (July), 0.163873 (August), and 2.29175 (March – August); for IBM SPSS (v.20.0.0), MAPE was 57.505 (March), 1.152 (April), 0.259 (May), 0.185 (June), 0.307 (July), 0.194 (August), and 6.013 (March – August). Conclusions: This study demonstrates that ARIMA is a useful statistical model for making predictions and provides an idea of the epidemiological status of the country of interest.


Introduction
Towards the end of 2019, local hospitals in Wuhan city, Hubei region, began to report more and more cases each day of severe pneumonia of unknown etiology. It was difficult for clinicians to establish a diagnosis on the basis of the unique symptomatology that the first patient had. Fortunately, within a relatively short interval it was revealed that the so-called patient zero was infected with a novel beta-coronavirus. Named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) after its predecessor, the virus was shown to spread through person-to-person transmission and ultimately to cause the coronavirus disease (COVID-19) [1][2][3].
Intriguingly, 2019-nCoV is a zoonotic family member, and for a long time it was speculated that Rhinolophus sinicus is the natural host of SARS-CoV. Unfortunately, no clear evidence has been found that incriminates the horseshoe bat, and everything remains at a hypothetical stage almost a year after the first case was reported. Most likely, these assumptions were based on the current knowledge that the bat is a natural reservoir for pathogens. Because of the novelty of the virus, more than fifty people were confirmed as SARS-CoV-2-infected patients by the beginning of 2020 [4].
In retrospect, humanity has not faced such a crisis since the Spanish flu of 1918 to 1919/1920, which is estimated to have caused the death of fifty to one hundred million people [5]. Although both clinicians and researchers are in a race against time with this virus, the latest statistics issued by the World Health Organization (WHO) suggest that over twenty million people have tested positive and approximately eight hundred thousand have died despite their best efforts (https://covid19.who.int/).
Considering the uncontrolled and fulminant spread of SARS-CoV-2, it was demonstrated soon after its identification that the elderly and those with associated chronic diseases are the most predisposed [6]. However, these figures vary, not because of a lack of data but rather because of the finite capacity of epidemiological surveillance. Based on the aforementioned, the need for a reliable and efficient strategy for planning health infrastructure is all the more imperative, especially for mid-class countries. Whereas Westernized countries have all the necessary resources, in Romania the situation may reach a critical point and soon be labeled a second Lombardy.
There is an increasing trend in the current literature regarding the possible epidemiological course of COVID-19. Both mathematical and statistical models are crucial for determining short- and long-term case estimates [6]. One example is the AutoRegressive Integrated Moving Average (ARIMA) model, which has been successfully applied in the past to estimate the prevalence and incidence of numerous other highly infectious diseases (Table 1) [7].
Unlike the other studies conducted, the present study aims to estimate COVID-19 cases through ARIMA using two distinct statistical software packages (IBM SPSS and STATGRAPHICS) in order to test their reliability and accuracy. It also aims to present the evolution of the mortality rate in Romania, given that the ratio between the number of positive cases and deaths in the last thirty days is high, almost double that of the same intervals of the previous months.

Data
The daily prevalence data for COVID-19 were taken from the Ministry of Internal Affairs of Romania (https://www.mai.gov.ro) and compared with the figures reported by the World Health Organization (WHO) (https://covid19.who.int/). MS Excel was used to build a time-series database.
Even though the first case in Romania was reported on 27 February, we decided the following: (1) to test the accuracy of the ARIMA models, the established interval was divided into one-month subdivisions, each with a fourteen-day forecast into the next month that was compared with the numbers reported daily by the Romanian Government and WHO; (2) to perform a forecast from the so-called point zero (27 February) until the present day, 31 August, also with a fourteen-day forecast. Descriptive statistics of the COVID-19 data for the established intervals (1 March-31 March, 1 April-30 April, 1 May-31 May, 1 June-30 June, 1 July-31 July, 1 August-31 August, and 27 February-31 August) are given in Table 2. The current situation in Romania (31 August) is as follows: 85,833 confirmed cases and 3539 deaths. At least thirty observations are recommended for an optimum ARIMA model [15]. Thus, the data set was used to build and analyze a case estimation model, on the assumption that it will be useful in the future for predicting the evolution of COVID-19 in Romania. Therefore, a time-series containing at least 45 data points was used to predict SARS-CoV-2 prevalence in Romania over the next two weeks with a 95% confidence interval (CI).
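The interval scheme described above (one-month subdivisions plus the full period, each followed by a fourteen-day forecast) can be sketched in Python as follows. This is only an illustration: the year 2020 is assumed from the context of the study, and the helper names month_window and forecast_window are hypothetical.

```python
from datetime import date, timedelta

# Assumptions (not stated as code in the study): the year is 2020, each
# training window is one calendar month from March to August, and every
# window is followed by a 14-day forecast horizon.
FORECAST_DAYS = 14

def month_window(year, month):
    """Return (first_day, last_day) of a calendar month."""
    first = date(year, month, 1)
    # The first day of the following month minus one day is the last day.
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    return first, next_first - timedelta(days=1)

def forecast_window(last_training_day, horizon=FORECAST_DAYS):
    """Return (start, end) of the forecast period after a training window."""
    start = last_training_day + timedelta(days=1)
    return start, start + timedelta(days=horizon - 1)

windows = []
for month in range(3, 9):  # March .. August
    train_start, train_end = month_window(2020, month)
    windows.append((train_start, train_end, *forecast_window(train_end)))

# Full interval from the so-called point zero (27 February) to 31 August.
full_end = date(2020, 8, 31)
windows.append((date(2020, 2, 27), full_end, *forecast_window(full_end)))

for train_start, train_end, fc_start, fc_end in windows:
    n_obs = (train_end - train_start).days + 1
    print(train_start, train_end, n_obs, "->", fc_start, fc_end)
```

Each monthly window holds 30 or 31 daily observations, which matches the recommendation of at least thirty observations quoted in the text.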
Initially, the outbreak did not affect Romania significantly, but starting from 23 July, the number of positive cases per day exceeded 1000. Since then, fewer than 1000 cases per day were registered only on 4, 11, 18, 24, and 25 August, the highest number being reported on 28 August with 1504 confirmed cases and 38 deaths.

The ARIMA Model
A time-series, as the name suggests, is a succession of data points indexed in time order [16]. More precisely, time-series models are used to predict the future values of a series [17], and ARIMA has become a simple-to-use algorithm since it was introduced in the 1970s [15]. ARIMA is preferred over other models because it takes into account all (in)dependent variances. Moreover, beyond fitting a wide range of data, it can model temporal dependency, from seasonality to cyclicity.
In summary, the autoregressive integrated moving average (ARIMA) technique is used for tracking linear tendencies and is denoted by three ordered parameters. For non-seasonal ARIMA, AR(p) (autoregression) represents the order of autoregression, MA(q) (moving average) the order of the moving average, and I(d) the degree of differencing.
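To make the role of the differencing degree d concrete, the following minimal Python sketch applies first-order differencing d times to a series. The function name difference and the quadratic example series are illustrative assumptions and are not taken from the study data.

```python
def difference(series, d=1):
    """Apply first-order differencing (y[t] - y[t-1]) d times."""
    values = list(series)
    for _ in range(d):
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values

# Hypothetical cumulative counts with a quadratic trend (illustrative only).
cumulative = [t * t for t in range(1, 9)]

first = difference(cumulative, d=1)   # still shows a linear trend
second = difference(cumulative, d=2)  # trend removed, values are constant

print(first)
print(second)
```

Under this assumption, a series that needs two rounds of differencing before its level stops drifting corresponds to d = 2, as in the ARIMA (1,2,1) or ARIMA (3,2,2) specifications.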
Viewed as a time-series, Yt represents a succession of observations indexed by time t [18]. A deterministic or stochastic time-series can be described by the function Yt = f(t) or Yt = X(t), respectively, where X is a random variable. AR(p) (Equations (1a) and (1b)) predicts the future value using the previous p observations as inputs, where θ or Φ is the multiplying coefficient, εt or ωt is the random error (white noise) at time t, and µ is the mean of the series. For a stationary time-series, the mean of εt or ωt is 0 and the variance is denoted σ². Here, δ or α denotes the same constant.
The MA(q) polynomial of degree q is given in Equations (2a) and (2b). The ARMA(p,q) expression is obtained by combining the AR(p) and MA(q) terms, as represented in Equations (3a) and (3b) [19].
On the other hand, there are also circumstances in which the time-series is not stationary. Stationarity should therefore be verified; if the condition is not satisfied, the series can be made stationary by differencing, which introduces the additional parameter d. Once ∆Y "takes over" the non-stationary differences of Yt, ∆Y can be expressed as in Equation (4), with L representing the likelihood of the data.
To test the accuracy of our model, we analyzed three measures: root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) (Equations (5)-(7)). For a good fit, the values of MAE, MAPE, and RMSE must be low. All analyses were performed using STATGRAPHICS Centurion (v.18.1.13) and IBM SPSS (v.20.0.0), with statistical significance set at p < 0.05.
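For orientation, the conventional textbook forms of the models and error measures named in this section are sketched below. They are not necessarily the exact notation of Equations (1a)-(7). The assignment of symbols is an assumption: δ is the constant, Φ_i are the autoregressive coefficients, θ_j are the moving-average coefficients, ε_t is white noise, Ŷ_t is the fitted or forecast value, and n is the number of observations. Some texts write the moving-average terms with minus signs instead.

```latex
\[
\begin{aligned}
\text{AR}(p):\quad & Y_t = \delta + \Phi_1 Y_{t-1} + \dots + \Phi_p Y_{t-p} + \varepsilon_t \\
\text{MA}(q):\quad & Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \\
\text{ARMA}(p,q):\quad & Y_t = \delta + \sum_{i=1}^{p} \Phi_i Y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} \\
\text{First difference}:\quad & \Delta Y_t = Y_t - Y_{t-1} \\
\text{RMSE}:\quad & \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( Y_t - \hat{Y}_t \right)^2} \\
\text{MAE}:\quad & \frac{1}{n} \sum_{t=1}^{n} \left| Y_t - \hat{Y}_t \right| \\
\text{MAPE}:\quad & \frac{100}{n} \sum_{t=1}^{n} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right|
\end{aligned}
\]
```

An ARIMA(p,d,q) model in this convention is an ARMA(p,q) model fitted to the series after it has been differenced d times.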

Mortality Rate
We collected data to determine the mortality rate by sex, by age group (from <10 years as the minimum to >80 years as the maximum), by the presence of associated comorbidities, and by hospitalization in intensive care units (ICUs) during the established intervals. All these parameters were calculated using Excel software. Unfortunately, there are some limitations in this context. More specifically, the figures related to the sex of patients, their age, the associated comorbidities, and ICU admission are incomplete as a consequence of poor data management by the Romanian government during this pandemic. Consequently, we were able to collect data only for the last several months: from 11 June for sex, age, and associated comorbidities, and from 17 March for ICU patients.

Results
Building an ARIMA model for any given time-series involves four steps: assessment of the model, estimation of parameters, diagnostic checking, and prediction. The first step, which is imperative, is to verify whether the mean, variance, and autocorrelation of the time-series are consistent throughout the established interval [20]. Therefore, two time-series plots, together with autocorrelation function (ACF) and partial autocorrelation function (PACF) graphs (Figure 1), were generated to test seasonality and stationarity. ACF is a statistical metric that determines whether the prior values are related to the latest values or not, while PACF gives the value of the correlation coefficient between the variable and its time lag [13]. Both are imperative in detecting misspecification, with model performance being measured by the Akaike information criterion (AIC) and the Bayesian information criterion of Schwarz (BIC) [21]. Estimated autocorrelations for Romania are presented in Figure 1; the straight lines indicate the limit of two standard deviations, and the bars that extend beyond the lines suggest statistically meaningful autocorrelations.
Additionally, a series of ARIMA models were created, and their performances were compared using various statistical tools. All statistical procedures were performed on the transformed COVID-19 data. The ARIMA models with the lowest MAPE values were considered optimal. Among the tested models, ARIMA (1,2,1), ARIMA (3,2,2), ARIMA (3,1,3), ARIMA (3,2,2), ARIMA (3,1,3), ARIMA (2,2,2), and ARIMA (1,2,1) were chosen as the best models for Romania. The models fitted to the COVID-19 data are presented in Figure 1 and Tables 3 and 4. In Table 4, the parameter estimates for the best models are presented. The fitted and predicted values are presented in Figure 2. As seen in Table 5 for both software packages, the estimate of confirmed cases for the next two weeks may be between 2450.74 and 5673.29.
Regarding the mortality rate, from 11 June until 31 August a total of 2261 deceased patients were identified, of whom 1356 (59.97%) were male and 905 (40.03%) were female. The most affected age group was 70 to 79 years, in which SARS-CoV-2 caused the death of 709 people, followed by 60 to 69 years with 621 deaths and >80 years with 526 deaths (Figure 3). The remaining 405 deaths occurred in younger people: 260 were between 50 and 59 years, 104 between 40 and 49 years, 30 between 30 and 39 years, 10 between 20 and 29 years, 1 between 10 and 19 years, and none were under 10 years old. Of the 2261 people, 2184 (96.59%) had comorbidities and 77 did not. In addition, since 17 March, when the first 4 ICU patients were confirmed, the total number of ICU patients registered until 31 August was 506 (Figure 4).
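The selection rule used in the Results (keep the candidate model with the lowest MAPE) can be sketched in plain Python as follows. The observed and fitted values below are hypothetical placeholders, not study data, and the model labels are attached to them for illustration only. The functions mae, rmse, and mape follow the usual textbook definitions rather than the internals of STATGRAPHICS or SPSS.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero.
    """
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observed values and fitted values from two candidate models.
observed = [100, 110, 120, 130]
candidates = {
    "ARIMA(1,2,1)": [98, 112, 119, 131],
    "ARIMA(2,2,2)": [90, 120, 110, 140],
}

# Score every candidate and keep the one with the lowest MAPE.
scores = {name: mape(observed, fitted) for name, fitted in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAPE = {score:.3f}%")
print("Selected:", best)
```

Because MAPE divides by the observed value, it is sensitive to intervals with very small counts, which may partly explain why the MAPE reported for March is much larger than for the later months.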

Discussion
Based on our results, it can be concluded that Romania will face an even higher number of infections which can exceed one hundred thousand. In terms of the number of deaths, these figures are not comparable with other countries such as Italy, Spain, or France. The probability of exceeding 1000 is very small, especially due to the high longevity rate of people from other states compared to Romania.
According to the current literature, this is the first study of its kind. Thus, the idea of testing the accuracy of the ARIMA model using two distinct statistical software packages is novel, all the more so as middle-class countries have neither the necessary resources nor a reliable strategy for restraining the rate of contagion or transmissibility in such conditions. For an unknown reason, most studies have focused on Westernized countries or on China's neighboring countries.
Recently, a team of authors proposed three new methods for studying the epidemiological course of COVID-19. The first is a universal physics-based model designed to assess the COVID-19 dynamics in Europe. The model fits the existing curve, in that the results obtained by simulation yield an evolution curve close to the one describing the current status. This "overlap" can be explained by the fact that the approach is based on a universal mechanism whose structural concept is "diffusion over a lattice". It has been successfully applied to seven European countries and further offers the chance to study memory effects through autocorrelation within epidemiological dynamical systems [22]. Furthermore, Demertzis et al. [23] applied an exploratory time-series analysis built on a recent conceptualization. More specifically, it is dedicated to detecting connective communities by developing a novel spline regression in which the knot vector is determined by community detection in a complex network. Through this approach, the authors demonstrated the reliability of this exploratory time-series analysis for decision-making in Greece, mainly because diagnostic testing, services, and resource strategies vary between countries. Finally, Tsiotas et al. [24] used the modularity optimization algorithm, in which the generated visibility graphs describe a sequence of different typologies of the disease. According to their results, the current pandemic in Greece is about to enter its second half in a decreasing manner, whereas the chances of a "maximum infection" are low because the saturation point has been reached.
Quarantine is the first alternative. Chintalapudi et al. [25] demonstrated that in Italy this approach led to a reduction of up to 35% in the total registered cases, in parallel with a significant percentage (66%) of recovered cases.
Considering the empathic nature of humankind, self-isolation or quarantine could have wide-ranging and serious repercussions on people's psychological profile. The psycho-social impact is considerable, with post-traumatic stress disorder (PTSD) and depression representing just two examples [26]. The gut-brain axis (GBA) component should not be neglected, since it is already known that a long-term loss of host eubiosis can promote psychiatric or neurodegenerative disorders [27].
Based on the above, from our point of view, social confinement is a two-sided approach. López et al. [28] considered that social confinement should remain in place for at least 8 weeks, because 99% of the current wave was attributed to human intervention, and recommended a resumption of daily activities up to 50%. Chakraborty et al. [29] supported the arguments of López et al., taking into consideration that people >65 years are more susceptible, and considered an adequate medical center arrangement necessary.
A study conducted by Williamson et al. [30], which brought together a cohort of over 17 million people in the UK, demonstrated an increased risk among Black and South Asian people, a predisposition attributed to age, sex, and related medical conditions. Miller et al. [31] assumed a scenario in which around 20% of the US population will be infected, especially in some counties compared to the rest of the country. The authors built this scenario on a series of assumptions regarding transmission, contact patterns, the basic reproductive rate, and the efficiency of quarantine.
Although travel restrictions and social distancing significantly reduce the risk of transmission, the evidence regarding the use of face masks is inconsistent. Regardless of the status of the individual, even for an asymptomatic carrier, face masks can mitigate the risk [32]. A recent systematic review and meta-analysis conducted by Chu et al. [33] brought together 172 observational studies across 16 countries, with a cohort of 25,697 patients. As expected, the risk decreases as the physical distance increases beyond 1 m, and vice versa. Intriguingly, even eye protection was associated with less infection.
However, a question arises. Why is there such a significant difference in the total number of deaths between countries? A cross-sectional dataset comprising 169 countries, used to investigate factors associated with cross-country variation, revealed that the mortality rate is influenced by a series of variables: government effectiveness, the number of hospital beds, transport infrastructure, and, most importantly, the number of tests performed [34].
If all these measures are not taken seriously, we could face a much more severe second wave [35], reflected in the number of deaths reported each day. A similar event has been recorded in Romania as a consequence of the violation of these prevention measures.
An investigation of 12,343 SARS-CoV-2 genome sequences from individuals in 6 distinct geographical regions revealed that the ORF1ab 4715L and S protein 614G variants are directly correlated with fatality rates. The authors also showed that the bacillus Calmette-Guérin (BCG) vaccine and the frequency of several HLA alleles are associated with fatality rates and the number of infected cases [36].
From our point of view, researchers and clinicians should change the direction of this topic. This raises the next question: "If it is already known that angiotensin-converting enzyme 2 (ACE2) receptors [37,38] are also found in different niches along the digestive tract, why is the number of studies that aim to identify SARS-CoV-2 using rectal swabs or stool samples limited?" On several previous occasions, the presence of viral signatures in stool samples has been demonstrated, starting from day seven and lasting up to almost two weeks after infection [39][40][41][42]. This hypothesis is also supported by additional evidence that gastrointestinal manifestations range from mild [40][43][44][45] to moderate [46][47][48][49].
The temperature could play an important role in the spreading of this virus. Demongeot et al. [50] concluded that high temperatures restrict the range of action of SARS-CoV-2, but this does not mean that in the cold season there will not be big question marks as to whether or not a person is infected with SARS-CoV-2, especially when it will overlap with influenza infections.
In conclusion, Eastern European countries such as Romania are at particular risk because of the vulnerabilities in the health system, corruption, and emigration of doctors. All these delays and the poor organization represent the consequences of the communist regime that still makes its mark even after more than three decades. It should be noted that Romania has also faced several economic crises, the critical point being reached on February 5 this year, at which point it collapsed [51].
Identical to Western models, and consistent with WHO guidelines (distance between people of about 1.5 m, wearing a mask, isolation, and massive testing), all these measures have been implemented also in Romania. Despite the efforts made, the sums allocated for carrying out such tests are insignificant, the equipment is missing, the staff is not qualified, and the hospitals are at full capacity.
Cumulatively, all these negative aspects are certified by an increasing number of infected people in contrast to the rest of Europe where the situation has reached the upper limit and is now stabilizing. What is certain is that Romania does not yet have an effective strategy to reduce the number of patients.

Conclusions
Forecasting the prevalence of SARS-CoV-2 is imperative at present, especially for health departments. As has been described and demonstrated throughout this study, time-series models play a crucial role in disease prediction. In this study, ARIMA time-series models were successfully applied to estimate the overall prevalence of COVID-19 in Romania. However, in our experience, and although both software packages have proven effective, Statgraphics offers a much wider spectrum of possibilities in terms of speed, analysis, and utility. These advantages matter all the more in the current pandemic, where providing a clear perspective in a short interval is vital for every individual.