Forecasting the COVID-19 Diffusion in Italy and the Related Occupancy of Intensive Care Units

This paper provides a model-based method for the forecast of the total number of currently COVID-19 positive individuals and of the occupancy of the available intensive care units in Italy. The predictions obtained—for a time horizon of 10days starting from March 29th—will be provided at a national as well as at a more disaggregated level, following a criterion based on the magnitude of the phenomenon. While those regions hit the most by the pandemic have been kept separated, those less aﬀected regions have been aggregated into homogeneous macroareas. Results show that—within the forecast period considered (March 29th–April 7th)—all of the Italian regions will show a decreasing number of COVID-19 positive people. The same will be observed for the number of people who will need to be hospitalized in an intensive care unit. These estimates are valid under constancy of the government’s current containment policies. In this scenario, northern regions will remain the most aﬀected ones, whereas no signiﬁcant outbreaks are foreseen in the southern regions.


Introduction
On March 19th, the death toll paid by Italy for the spread of the virus COVID-19 amounted to 3405 deaths, the highest paid by a single country in the world. Despite hard and relatively timely lockdown policies implemented by the government, on March 26th this figure has risen to 8165 deaths.
In such an emergency situation, a reliable forecast method for the infection development is essential for policy and decision makers to design evidence-based policies and to implement fast actions to curb the spread of the infection. In particular, predicting the number of people currently tested positive for COVID-19 (thereafter "positive cases") could be useful to draw the epidemiological curve of the infection and therefore to predict its peak. Other than this variable, the forecasting procedure presented in this paper is used to predict the future values of another crucial variable, i.e., the number of people needing hospitalization in an intensive care unit (ICU). e Italian ICU system is at the moment severely stressed due to the spread of the disease; therefore, predictions of future ICU demand could be fruitfully considered in the design and the implementation of operational schemes. e forecast horizon for both the variables is of 10 days starting from March 29th.
Since the Italian regions are affected in different extents by the COVID-19, it has been decided to perform the forecasting exercise for the following geographical areas: Lombardia, Piedmont, Valle d'Aosta, Veneto, Friuli Venezia Giulia, Trentino Alto Adige, Lazio, and Campania. e remaining regions have been grouped in the following macroareas: "Center" (Marche, Umbria, and Toscana) and "South" (Abruzzo, Molise, Puglia Basilicata, Calabria, Sicilia, and Sardegna). At least two other reasons justify such a break down: (1) e different starting times recorded for the lockdowns (2) e southern regions have been hit less severely and therefore, especially at the beginning of the observation period, show several zeroes or low numbers across the considered time span In essence, in this study, the available official data, detailed in Section 2, have been employed in a three-step procedure, i.e.: (1) Data preprocessing, in which data anomalies are identified and corrected according to an approach of the type a Kalman filter (2) Univariate forecasting, based on an autoregressive moving average (ARMA) model for number of positive cases and ICU (3) Bootstrap-based generation of predicted values and confidence intervals

The Data
is paper employs the data related to COVID-19, collected and regularly updated by the Italian National Institute of Health (an agency of the Italian Ministry of Health) and by the Italian Civil Protection Department. e whole data set is freely and publicly available in a comprehensive database, accessible on the Internet at the web address https://github. com/pcm-dpc/COVID-19/tree/master/dati-regioni (the file name is dpc-covid19-ita-regioni-20200323.csv). It collects crucial data related to all the persons tested for COVID-19-from the outbreak of the pandemics (February 24th)and, in particular, (1) is a collection of 21 data points-representing 19 Italian regions plus the two autonomous provinces of Trento and Bolzano-one for each day starting from the disease's outbreak, (2) considers crucial variables, such as positive cases, recovered cases, deaths, number of people hospitalized, and number of people admitted to intensive care units (ICUs).
As already pointed out, in the present study, the variables of interest are the number of people who have been It is worth outlining how, according to the regulations issued by the Italian government, only the people showing moderate to severe symptoms, generally associated with the infection, or who have been in close proximity with at least one positive person, are tested. erefore, the predictions obtained are to be referred to the sample, as no attempt have been made to carry out inference procedures for the estimation of the variables at the population level.
In order to correctly process the data, all the regions showing no positive cases at the beginning of the recording period and/or low values along the whole time span have been aggregated into macroareas. is has been done to (i) give more meaningful results and (ii) save degrees of freedom (which are always precious in short time series).
In details, the prediction exercise will be performed on the following regions/macroareas: e only exception is Valle d'Aosta, which has been left separated as no aggregation options could be found.
To simplify notation, for both the variables of interest V and U, the following convention is introduced: (1) Here, the upper left superscript (denoted by the upper case Latin letter K) refers to the geographical areas (i.e., North, Center, and South), whereas the subscript j is associated with the number the different regions or macroareas are codified with, as above detailed. For example, by the symbols A V 6 and B V 2 the number of positive cases for the Emilia-Romagna region and the Center macroarea "Marche-Umbria-Toscana" are respectively identified.

Data Preprocessing
Missing data and other anomalies become the first challenge when designing predictive models, as statistical methods, in general, are designed and tested under the assumption of no missing observations [1]. Before delving into the details of the proposed procedure, a word of caution is needed since, unfortunately, a visual inspection of the data suggests the presence of a number of anomalous data, both at a regional and a country level. e detected anomalies might be associated with the biological sample collecting process and the related testing procedures. In fact, the typical lab workflow is governed by a set of rigid protocols which might be critically affected by factors such as the availability of manpower, swabs, reagents, and other laboratory materials. In emergency situations, such a workflow can be disrupted, and temporal inconsistency might appear as a result. For example, a set of samples might be delivered to a laboratory with longer than usual delays with respect to the time of collection, or a given lab can only complete the screening process for a certain number of samples. In both the cases, a shift of one day (or more) in the release of the lab results can be reasonably expected. A further source of anomalies is represented by the data entry and data editing processes, carried out in working environment likely affected by the risk of contagious and under rigid deadlines.
An example of such anomalous data is given in Figure 1, where the series 1 V t,1 (Lombardia) is depicted. Here, some data points showing values inconsistent with the overall pattern are clearly noticeable. Given the (very small) available sample size, the relative weight of such data is almost surely not negligible and can introduce severe distortions in the model parameter inference procedures and thus in the predicted values.
In order to correct those data, a Kalman smoother statespace model [2] has been applied. In particular, the Kalman smoother adopted is of the type fixed-point smoothing. is algorithm is designed to obtain the estimate of a realization w N (the time t N is fixed N < K) of a given random variable W t , given a set of observations Z k � z k | 0 ≤ N ≤ k . A thorough explanation of this method goes beyond the scope of the paper; therefore, the interested reader is referred to the excellent paper by Sage and Melsa [3].
In Figure 2, the corrected version of the series 1 V t,1 -resulting by applying the Kalman smoother-is depicted. Not only this series lends itself to a better visual inspection but, more importantly, is more suitable to be processed by the adopted prediction model.

Theoretical Framework
e approach used in this paper relies on (i) the theory of stochastic process and (ii) a resampling method. While the former is necessary to generate the input (predicted values) of the bootstrap algorithm, as well as to justify the employment of the outlier correction method, the latter serves the purpose of (1) generating the final predictions, which are affected by a reduced amount of uncertainty (with respect to those generated by the stochastic model), (2) yielding the related confidence intervals.

e Stochastic Processes Paradigm.
e approach proposed in the present paper relies on the assumption that the (transformed) time series K V j,t and K V j,t are approximately a realization of a process of the type ARMA (autoregressive moving average) [4].
Let X � (X t ) t∈Z be a real 2 nd order stationary process, and it is said to admit a ARMA(p, q) representation (p, q ∈ Z) if, for some constant a 1 , . . . , a p , b 1 , . . . , b q under the following conditions: Here, F t denotes the sigma algebra induced by ε(j),j ≤ t, and p j�0 a j Z j and q j�0 b j Z j are assumed not to have common zero. e above conditions assure that X t can be represented as  with β j decreasing to 0 at geometric rate. e dynamics of the series under investigations are not suitable for this theoretical framework as it requires 2ndorder stationarity and homoscedasticity; those conditions are simultaneously achieved by preprocessing the series according to the following filter: log(∇ d ), being the symbol ∇ the difference operator and the exponent d indicating the order of the difference. To fully understand the role played by ∇, the backward operator B is now introduced. In essence, B moves the time index of an observation back by p time intervals, i.e., B p x t � X t−p , and thus we have

e Resampling Method.
In order to extract valuable information from our data and, at the same time, decrease the total amount of uncertainty associated with the outcomes of the ARMA model, a resampling procedure has been employed. Among the several resampling methods for dependent data available-many of which are freely and publicly available in the form of powerful routines working under software packages such as Python ® or R ® -the adopted resampling method is of the type maximum entropy bootstrap (MEB). Proposed by Vinod [5] and subsequently improved (see, e.g., Vinod [6]), it is based on basic assumptions which are different from those usually followed by standard schemes. In more details, while in the classic bootstrap an ensemble Ω represents the population of reference the observed time series is drawn from, in MEB a large number of ensembles (subsets), say ω 1 , . . . , ω N becomes the elements belonging to Ω, each of them containing a large number of replicates x 1 , . . . , x J . Unlike standard bootstrap schemes, in the MEB case the resample set Ω mimics the observed realization of the underlying stochastic process, in MEB a large number of subsets, say ω 1 , . . . , ω N becomes the elements belonging to Ω, each of them containing a large number of replicates x 1 , . . . , x J . Among the important features of the MEB scheme, it is worth mentioning the consistency of its bootstrap samples with the ergodic theorem (see, e.g., Birkhoff [7]) and with the probabilistic structure of the observed time series. In Figure 3, an example of the application of MEB for the variable 1 V t,1 is given.

The Forecasting Method
In what follows, the proposed procedure is presented in a step-by-step fashion: (1) Equation (2) is estimated for both V t and U t so that the model orders (equation (2)) M 1 and M 2 become available. (2) For each time series V t and U t , the MEB procedure is applied so that the sets V and U-each containing B � 500 "bonafide" replications-are available, i.e., Figure 4, the set V for the variable 1 V t,1 is given).
(3) For each of the replications stored in V, equation (2) is estimated according to the model order selected, i.e., M 1 , and the 1-to 10-step-ahead predictions-as well as the 5% and 95% bootstrap confidence interval-are generated. t-percentile method. e explanation of this procedure goes beyond the scope of this paper; therefore, the interested reader is referred to the excellent paper by Berkowitz and Kilian [8]. (6) CI * L (h � 2, . . . , 10) and CI * U (h � 2, . . . , 10) (the subscript b is omitted for brevity) are computed conditional to a subset of V, say V, made up of the bootstrap replications whose range falls between the minimum and maximum values of the values of the confidence intervals computed for h � 1. In symbols, Steps 1-6 are repeated for U t , so that a new matrix of prediction of dimension B × 3 is built, i.e., h � 1, 2, . . . , 10], whose columns are as in F U (h) and denoted by the symbols CI * U (h), U * t , and CI * U (h). Unfortunately, the whole procedure cannot be considered fully automatic since the estimation of equation (2) (step 1) is required.

e Adopted Models.
e stochastic model structures identified for both V t and U t are almost always of the type ARMA (1, 0), with the exception of Campania (ARMA(0, 1), for both the variables V t and U t ) and Emilia-Romagna, for which the best model for the variable U t is of the type ARMA(1, 1). e most suitable prefilter (equation (5)) has been always of the type d � 3 difference of the natural log of the variables of interest.

Empirical Evidences
At the national level (data have been plotted in Figure 4), the peak in the number of COVID-19 positives will be reached on April 2nd, with a number of predicted positive close to 77,000. e maximum forecasted value for the occupied ICU-expected for April 4th-will be 4280. ese values have been calculated using an indirect methodology, i.e., by summing up the estimates obtained at a disaggregated level.
e results related to COVID-19 positives and the ICU occupancy are reported, respectively, in Tables 1 and 2 (i) Lombardia-the most affected region-will reach the peak of positive cases (25963) and of the demand of ICUs (1425), respectively, on April 2nd and 4th. (ii) Emilia-Romagna is the second most affected region by COVID-19 but still shows a very high number of victims. e trend of infected people will reach its peak on April 5th, whereas the Journal of Probability and Statistics 5