Short-Term Infectious Diarrhea Prediction Using Weather and Search Data in Xiamen, China

Infectious diarrhea has high morbidity and mortality around the world. For this reason, diarrhea prediction has emerged as an important problem to prevent and control outbreaks. Numerous studies have built disease prediction models using large-scale data. However, these methods perform poorly on diarrhea data. To address this issue, this paper proposes a parsimonious model (PM), which takes historical outpatient visit counts, meteorological factors (MFs) and Baidu search indices (BSIs) as inputs to perform prediction. An experimental evaluation was done to compare the short-term prediction performance of ten algorithms for four groups of inputs, using data collected in Xiamen, China. Results show that the proposed method is effective in improving the prediction accuracy.


Introduction
To keep up with the pace of income growth, urbanization, and globalization, risk management of infectious diseases in public has become a critical task [1]. Infectious diarrhea (ID) [2] is one of the most common infectious diseases in the world, which infects more than 1 billion persons. It became the 37th legally notifiable disease in the China's National Notifiable Disease Reporting and Surveillance System, and any new case must be reported within 24 hours of diagnosis [3].
Early warning techniques [4] have been developed to monitor the status of infectious diseases and the demand for healthcare and health services. ese techniques can support decision making for medical intervention strategies [5], by preinforming people, health service providers, and the government. e problem of predicting upcoming diarrhea outpatient visits can be viewed as time series prediction problem. In past decades, numerous studies have used autoregression (AR), autoregressive integrated moving average model (ARIMA), and machine learning methods to predict upcoming values based on past observations. e widely used machine learning methods are multiple linear regression (MLR), support vector regression (SVR), and random forest regression (RFR) [6][7][8]. A famous autoregression method is that of Box and Jenkins [9], which has been applied in many fields [10], such as for electricity load forecasting and stock price prediction. Another famous statistical method is spline interpolation [11], which learns and uses a cubic spline interpolation to predict future values. But the performance of these methods degrades when dealing with nonstationary and chaotic time series, such as those of diarrhea outpatients.
Recently, to alleviate the uncertainty of a time series, exogenous data have been collected and fused into machine learning methods to achieve better predictions [12][13][14]. Hereto, the methods using exogenous data are called NARX [15]. According to the structure of NARX, we categorize them into wide models and deep models.
A wide model commonly builds more than two components in a layer. To capture the sequential features of a time series, a recurrent neural network (RNN) [16] is adopted to process exogenous data. To discriminatively process exogenous data and historical observations, the encoder-decoder framework [17] is introduced to undertake predictions. Meanwhile, the gated recursive unit (GRU) [18] is used to replace RNN in the framework, which captures long-and short-term memory (LSTM) [19]. Moreover, an attention mechanism is designed to adjust the values of exogenous inputs and historical inputs in the stage of encoding and decoding, respectively [20][21][22]. e deep model consecutively connects components from inputs to outputs using more than two neural network layers. For example, DL4Epi in [23] consists of a CNN layer, a RNN layer, and a residual link layer. DilatedRNN in [24] dilated three RNN layers from input to predict glucose incidence rate. TPA-LSTM in [25] uses a temporal pattern attention layer to deepen a model as well.
However, not only wide models but also deep models suffer from the problem of needing numerous samples. Given a time series of weekly or monthly diarrhea outpatients, the number of training samples is usually in hundreds. If a model has lots of parameters and the training samples are few, the learned model has a poor generalization performance. In the following, we call a model having many parameters a heavy model. In reality, heavy models remember almost the training samples and thus perform poorly for predicting unknown inputs.
How to build a parsimonious model, which has few parameters and learns from hundreds of samples? Observations about from previous models reveal that models assign weights to input elements, and then weights are passed and transformed into the target. Can the weights of elements be reduced? Can the reduced weights be passed to the output?
To address the issues, we propose a parsimonious model (PM). e proposed model first assigns a vector to each input dimension, and then the vectors are connected to the target. e rest of this paper is organized as follows. Section 2 briefly introduces the study area and data sources. Section 3 illustrates the defined problem and presents the proposed method. Section 4 describes the experimental setting and related benchmarks. Section 5 compares different methods and analyses their sensitiveness. Finally, a conclusion is drawn in Section 6.

Study Materials
is section first introduces the study area, i.e., Xiamen city. en, it briefly describes meteorological factors and search behavior in this city using basic descriptive statistics. e basic statistical variables of inputs and the target are listed in Table 1. e reasons for choosing meteorological factor (MF) data and Baidu search index (BSI) data as exogenous data are that those data are significantly correlated to the infectious diseases. MF data have been proved to be associated with the incidence of infectious diseases [26,27], such as ID and "hand, foot, and mouth disease" (HFMD). As an important entrance for users to obtain digital information resources on the Internet, the search engine provides lots of useful information. e search index of specific keywords directly reflects the social attention on infectious diseases [28][29][30]. Hence, MF data and BSI data are considered as exogenous inputs to models.
In order to determine the relationships between exogenous variables and the target variable, the Person correlation coefficient (PCC) values are figured out in the last column of Table 1. All of the exogenous factors are significantly correlated to weekly outpaitent counts except weekly maximum temperature. H max is used as an important input, since humidity is considered as an important factor [27].

Study Areas.
Xiamen is a developing city, which has a population of around 4 million permanent residents and around 4 million temporary residents as of 2018. It is an important special economic zone in China since 1980 and is located in southeast China. It covered a land area of 1,699.39 km 2 and a sea area of over 390 km 2 in 2017. Xiamen has a monsoonal humid subtropical climate, characterised by long, hot, and humid summers (but moderate compared to much of the rest of the province) and short, mild, and dry winters, and the annual mean temperature is 20.7°C [31].
Herein, the detailed population from 2012 to 2016 is listed in Table 2 as a background for this study. e annual growth of its population is between 1.3% and 2.1%, which is relatively stable. Morbidity and mortality are influenced by disease trends within a period. Meanwhile, infectious diarrhea is the most common infectious disease. Hence, the number of infectious diarrhea outpatient visits is key to monitor the status of infectious diseases in a city.

Meteorological Factor (MF).
e MF data from January 1, 2012, to December 31, 2016, were collected from Weather Underground (http://www.wunderground.com), which is a popular personal weather website. ose data from meteorological monitoring sites provide globally comprehensive, timely, and reliable meteorological data. At the moment, this website provides web-based application program interfaces for users to download data or develop third-party applications. e provided weather information is formatted in days. We calculate 261 weekly weather information pairs by those daily records. e selected weather information in weeks consists of 12 factors, e.g., highest temperature (°C), average dew point (°C), lowest atmospheric pressure (hPa), and average relative humidity (%). us, we get 12 groups of exogenous data to describe weather conditions.

Baidu Search Index (BSI).
Baidu is the most widely used search engine (http://index.baidu.com) in China, and it is also the largest Chinese search platform. People in Xiamen, China, are accustomed to using that search engine as well. Baidu search index records the search volume of many terms, which are queried by users since January 2011, and it is publicly available. BSI has been used to monitor the incidence of influenza epidemics [33], analyse regional infectious diseases, and perform real-time monitoring and prevention [28].
A major concern to acquire query data (search indices) is to find proper query words [29], which reflects user behavior about searching for infectious diarrhea. e engine returns daily counts of a given keyword, which can be conditioned by a region and a platform. Six time series of each query word are downloaded by choosing a region from {Xiamen, countrywide} and a platform from {mobile, PC, total}, respectively. ose data pairs are calculated in weeks. We have six groups of search indices, and they are used as exogenous data for each keyword. A correlation coefficient analysis is carried out on those input indices and target output (i.e., the number of cases). We found that the indices of the keyword "腹泻" (i.e., ID) have significant correlation, while others do not. Hence, we adopt the six group indices as inputs of models.

Approach
is section first defines the problem of time series prediction using exogenous data and then introduces data preprocessing and postprocessing. Finally, the proposed method is presented.

Problem Formulation.
e epidemic prediction problem can be viewed as a time series forecasting task [10]. A time series is defined as a sequence of observation values with consecutive identical interval lengths.
Let y t ∈ R 1 denote the observation measured at time t and x t ∈ R n denote the exogenous data measured at time t, where n is the number of input series. Furthermore, let T be a time window size. e known exogenous series with window size T is symbolized as [x 1 , x 2 , . . . , x T ], and the historical target observations are denoted as [y 1 , y 2 , . . . , y T ].
e goal is to predict the value of a future time point y T+1 , given historical observations and exogenous series. In detail, a nonlinear mapping F(·) is applied into the predictive formula: Moreover, let X � (x 1 , x 2 , . . . , x T ) denote the available training data in a time-span of size T and y � (y 1 , y 2 , . . . , y T ) be the recent historical outpatient visit counts.

One-Step Forward Split and Normalization
3.2.1. One-Step Forward Split. A time series cannot be directly used as input of regression models. Hence, we conduct one-step forward split to transform a time series to supervised data. e formula of the one-step split is shown as follows:

Scientific Programming 3
where M is the length of inputted time series and T is the window size.

Normalization.
It is a process of rescaling the data from the original range so that all values are within a given range [34], usually within 0 and 1 or 0.05 and 0.95. Both min-max normalization and standard (i.e., z-score) normalization methods are commonly applied to normalize time series data. Min-max normalization scales data in the [0, 1] interval by using the bounds of the minimum and maximum values. Standardizing a dataset involves rescaling the distribution of values, so that the mean of observed values is 0 and the standard deviation is 1. e mean and standard deviation estimates of a dataset can be more robust to new data than the minimum and maximum.
We choose standard normalization to rescale inputs and the target, since we regard the outpatient visit counts as Gaussian distribution, and the standardization fits a Gaussian distribution well. e standard normalization of inputs is formulated and recovered as where d ∈ R M denotes a feature of observed samples, M is the number of observed samples, d ′ is the normalized data, μ(d) is the mean value of d, and σ(d) is the standard variation of d. e recovered formula is applied on model outputs in the postprocessing stage. Because normalization is only applied on exogenous data, the recovery action can be skipped while generating predictions.

Parsimony Model (PM).
e diagram of the proposed PM is shown in Figure 1. is diagram consists of three stages: data preprocessing, PM, and postprocessing. In the preprocessing stage, the exogenous time series is normalized and combined with target time series to split the supervised data using one-step forward. In the PM processing stage, a parameter layer is exploited to extract pattern features from heterogeneous inputs. e outputs of the model are generated by linear summarization of the intermediate state result.
e left bottom part of Figure 1 gives toy examples of the structure of the parsimonious method with respect to inputs and weights. e yellow solid circle denotes exogenous inputs, such as meteorological factor or Baidu search index. e blue solid rectangle denotes historical observed outpatient visits, and red solid rectangle is the target. e yellow solid rectangles are weights to be learned. e goal is to reduce these complex models and find a parsimonious model to learn and predict well on the diarrhea data collections.
erefore, the weights of the first layer in a neural network are exploited to reconsider model structures. e left bottom part of Figure 1 gives inputs of exogenous data (i.e., MF, BSI, or MF + BSI) and historical cases, and the right part indicates the weights and connections among them (arrows between circles and rectangles). e two following definitions are introduced to provide clearer explanations.

Definition 1. A weight is assigned to each input element.
Definition 1 is formulated as follows: where [X; y] ∈ R (n+1)×T is the input of a given sample (i.e., input matrix), y T+1 is the target of this sample, which means the outpatient visit counts of the upcoming week, W ∈ R (n+1)×T is the weight corresponding to the inputs (i.e., weight matrix), and F 1 is a linear or nonlinear mapping from input and weight matrices to target. Both wide models and deep models in the research field of epidemic prediction are based on this definition. A wide model processes X and y in different components. For example, the encoder-decoder structure encodes X, passes a code to the decoder, and the decoder processes the code and y to make predictions. A deep model adds layers on the sequence of W, such as CNN and RNN.
When learning with a small number of samples, the number of samples is usually in hundreds, and these models have a burden of parameter training. In reality, the number of parameters is greater than the number of samples times the number of exogenous factors. Hence, these trained models would remember all the training samples and have a value of loss close to 0, but they poorly predict upcoming events.
erefore, the wide model and the deep model are condensed into a light neural network by setting F 1 as a linear function or nonlinear function, such as sigmoid function. But the condensed light neural network has many parameters as well. us, we try to use two vectors to restore weight W in equation (4). e aim is not only to have fewer parameters during the training stage but also to obtain better performance at the predicting stage.

Definition 2. A weight is assigned to each input dimension.
Definition 2 is formulated as follows: where [X; y] ∈ R (n+1)×T is the input matrix, and y T+1 is the prediction target (i.e., the upcoming week outpatient counts), w r ∈ R n+1 and w c ∈ R T are the weights corresponding to two input dimensions (i.e., weight vectors), and F 2 is a linear function or a nonlinear mapping from input and weight matrices to the target. ere are limitations in accurately capturing the input matrix of a sample solely on the basis of using two vectors, whose lengths are the same as the size of input matrix. Hence, we exploit two ways to recover W in equation (4) using w r and w c : the addition method and the multiplication method.
For each element w ij in W, it can be recovered using the addition method. e addition recovery formula is where w ij is the recovered value, w ri ∈ w r denotes the i-th element of w r , and w cj ∈ w c denotes j-th element of vector 4 Scientific Programming w c , respectively. Equation (6) shows that a matrix can be recovered by two vectors, which would reduce the training parameters.
Another recovery method is the multiplication method. For each element w ij in W, it can be recovered using the multiplication method. e multiplication recovery formula is where w ij is the recovered value, w ri ∈ w r denotes the i-th element of w r , and w cj ∈ w c denotes j-th element of vector w c , respectively. Equation (7) shows that a matrix can be recovered by two vectors, which would reduce the training parameters as well.
To benefit the recovery effects from the both methods, we add them together and get According to equations (8) and (5), we have where w ij ∈ W. Lots of methods applied to equation (5) can be used to equation (6) as well, whereas the goal of this paper is trying to find a parsimony method to fit well with smallscale datasets. Hence, F 2 in equation (9) should be as simple as possible.
We adopt the summarization and sigmoid activate function to replace F 2 and make it be simple. e prediction function is formulated as where s ij ∈ [X; y] denotes the element in the i-th row and j-th column of input matrix [X; y]. Equation (10) is called the prediction function of the proposed PM.

Experimental Settings
is section first describes the exogenous feature selection, experimental settings, and relevant related work, which are compared with our proposed method. Finally, an analysis of the comparable results is provided.

Experimental Configuration.
e disease data collection is divided into two subsets: the first part, from the 1st week of 2012 to the 52nd week of 2015, is used to build and train model; the remaining part, from the 1st to the 52nd week of 2016, is utilized to assess the learned models.
All neural models are trained using the Adam optimizer [35]. e batch size is set to 32. eir learning rate is set to 0.001, and mean squared error (MSE) is chosen as the loss function. For RNN and LSTM, the number of hidden neurons is set to {64, 128}.  Figure 1: e diagram of the proposed PM.

Scientific Programming
We ran each method five times and reported the median. e optimal RMSE values of models at given T and input data are in bold types. DA-RNN [20] and DL4Epi [23] have been applied on the four groups of inputs, and they are not converged in training stage. Hence, their results are not listed and compared.

Evaluation Metrics.
A number of performance evaluation criterion have been employed to evaluate and compare the performance of models, but there is a no uniform standard. erefore, we evaluate models based on error metrics, which are commonly adopted in regression performance evaluation.
e evaluation criterion consists of the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R 2 ). ese metrics are expressed as the following mathematical expressions: where y i T+1 is an actual value at week (relative time) i in the test set, y i T+1 is a predicted value at week (relative time) i, y T+1 is the mean value in the test set, and N is the number of weeks in the test period. e model providing the smallest MAE and RMSE and the largest R 2 is considered to have the best performance.

Comparable Methods.
According to the type of input variables, the prediction methods are categorized into univariate models and multivariate models. We choose models which can work with multiple inputs and unstable series and keep ARIMA as a baseline.

Multiple Linear Regression (MLR).
It is widely adopted to model the connection between several independent variables and dependent variables. e prediction model of MLR based on generalized MLR analysis formula is given below: where w y ∈ R T and b are weights and bias to be learned, T is the size of the historical data, ε represents the random error term, and E(ε) � 0.
We extend MLR to model exogenous inputs and target as follows: where w xy ′ ∈ R (n+1)T×1 denotes weights of exogenous inputs and historical data at past T time intervals.

Random Forest Regression (RFR).
It is a CART [36] method. It has been used to predict the death characteristics of patients, postoperative prognosis of hepatocellular carcinoma, and other data [37]. Random forest regression (RFR) consists of a collection of unpruned regression trees using different bootstrap samples of the training data. In each bootstrap sample, a random sample with replacement and with the same length, some of the data are repeated, and the left-out samples are called out of bag (OOB). In practice, the number of trees n trees and the size of the variable subset n subsets should be optimized to reach the ideal forest by minimizing the OOB error. e values of the parameters n trees and n subsets were optimized simultaneously by using the grid-search method ranging from 10 to 1000 (with step size 10) and from 1 to 9 (with step size 1), respectively. e parameter values which give the lowest RMSE of the OOB data were selected as an indicator for the performance.

Support Vector Regression (SVR).
It is a nonlinear kernel-based regression method, which tries to find the best regression hyperplane with smallest structural risk in a high dimensional feature space. It has been used in many medical related applications, such as diagnosis of the incidence of infectious diarrhea [26]. e relationship between input (or inputs) and output is formulated as where ϕ(·) denotes kernel function, x t is a vector of inputs at time t, and b is a bias term. Radial basis function expressed as K(a, b) � exp(− |a − b| 2 /σ 2 ) was used as a kernel function because of its advantages and simple implementation only one tuning parameter.

Gradient Boosting Regression (GBR).
It is a powerful regression model that enhances the decision tree model using gradient boosting, which produces a regression model by combining a series of weak prediction models [38].
GBR iteratively constructs decision trees, and the newly added decision tree is trained according to the negative gradient information of the loss function from the current model. e goal of GBR is to learn an optimal model F(x) that minimizes n i�1 L(y i , F(x i )) for a specified loss function L (y, F(x)). e optimal model F * (x) of GBR can be calculated as follows: where F 0 (·) is the first built decision tree, m is the number of iterations, v represents the shrinkage parameter that controls the learning rate of GBR, h i (·) denotes the tree trained in the 6 Scientific Programming i-th iteration, and ρ i is the weight of h i (·). e generated optimal model is used for testing.

Extreme Gradient Boosting Regression (XGBoost).
It is an improvement on the gradient boosting algorithm [39]. XGBoost adds the regularization terms during the decision tree construction phase. e loss function is where y t represents the prediction of the t-th training sample and Ω is the regularization term. e regularization term is calculated as follows: where N is the number of leaf nodes, w is the vector formed by all leaf node values of the decision tree, and c and λ are manually set parameters. Similar to GBR, the goal of XGBoost is also to minimize the loss function. Besides, XGBoost uses weight shrinkage and column sampling techniques to resist overfitting.

Convolutional Neural Networks (CNN1d). CNN [40]
consists of convolutional layers that are based on the convolutional operation. Filtering with kernel window function gives an advantage of image or series processing to CNN architectures with fewer parameters, which are beneficial for computing and storage [41][42][43]. e basic convolution operation is shown as follows: where s denotes mapped space and W denotes kernel. CNN has been verified to have good accuracy results when applied to pattern recognition [44].

Neural Network Regression (NNR).
On account of the neural network (NN) approximation and generalization property, NN-based prediction is widely used. Recurrent neural network is a type of deep neural network specially designed for sequence modelling [16]. e main idea of RNN is to provide a weighted feedback connection between layers of neurons and add time significance to the entire artificial NN. But RNN may face the problem of vanishing gradient or gradient explosion during backpropagation. LSTM was proposed by Hochreiter and Schmidhuber in [45]. We adopt a lite RNN model like NARX in [15], which consists of an input layer, an RNN layer, and an output layer, to test its performance. e nonlinear sequential layer can be a LSTM layer or a gated recurrent unit (GRU) [18] layer as well.

Experimental Results and Analyses
is section presents the results of intensive experiments to evaluate algorithms over four kinds of inputs and analyses them. Figure 2 shows the performance of PM with varying window size T. Figure 3 shows the evaluation results of all the methods over all the inputs for all metrics. Figure 4 gives the visual comparison of PM methods over four groups of inputs.

e Impact of Window Size T.
Parameter sensitivity is a significant part of experimental analysis. In time series prediction study, T is the most interesting parameter of time delay. Different window sizes will affect the parameter complexity and the overall prediction performance of the methods. In addition, window size is an important indicator to reveal the incubation period of infectious diseases. erefore, this study visualized the influence of different window sizes on the proposed model. e performance of PM with varying window size T in terms of MAE, RMSE, and R 2 is plotted in Figure 2.
According to the results, when we vary T and keep other parameters fixed, it is easily observed that the performance of PM degraded when T increases from 4 to 20, and the performance tends to get worse as the length of delay time steps continues to increase. A possible reason is that the longer lagged time delay leads to larger feature dimension of the inputs, which causes an increase in the complexity of the regression function in the training process and directly influences the weighted parameter representation for smallscale sample, resulting in the regression function unable to accurately fit the input. By setting T � 4, we notice that PM usually achieves the best performance, which possibly suggests that the diarrhea infection cycle pattern lags about four days in Xiamen.
is is an important artificial parameter required by the predictive model.

e Performance of Time Series Prediction.
e optimal performance of the benchmark and proposed methods is found at T � 4. Hence, the evaluation results of all the comparable methods are shown in Figure 2 when T is fixed at 4.
Several important observations are made about these results: (1) By benefiting from MF and BSI data, the PM model obtains the best performance in terms of three metrics (2) e performances of MLR and GBR are better than some complex RNN methods, and ARIMA cannot perform well under unstable time series (3) CNN1d and RNN methods can achieve stable performance when processing each type of inputs (4) Both MF and BSI data can improve the prediction accuracy of diarrhea outpatient counts in upcoming weeks.
To demonstrate the effectiveness of PM, we compare the performance of PM against some common methods for the same prediction tasks. e comparable experiments choose MLR, GBR, RFR, SVR, and XGBoost as the representative of the traditional methods and three other deep learning Scientific Programming 7 methods. e deep learning methods include CNN, RNN and LSTM with {64, 128} hidden nodes.
Comparing the performances of all the methods over the four inputs, according to the results of predictions based on historical cases, we can observe that the PM method shows the best performance, and CNN1d, RNN, GBR, and MLR also have good results. e performance of MLR is better than the performance of deep learning methods when the inputs are solely on the basis of historical cases. A possible reason is that the small-scale data collections cannot train deep learning methods well, especially the LSTM method, either overfitting or underfitting. According to the results of predictions based on cases+BSI, we can observe that the traditional m`ts that common learning methods cannot extract sufficient features from high dimensional inputs to predict the upcoming values. According to the results of predictions based on historical cases and BSI, we can observe that the performance of all the methods has degraded, which suggests that too many missing values will greatly affect the prediction performance of the model. However, PM has ability to adapt to input and to deal with missing values, which is significantly stronger than other models. According to the the inputs are solely on the basis of predictions based on cases+MF+BSI, we can observe that PM obtains the best results. Compared to the CNN1d method, it is difficult for traditional learning models to give stable performance with exogenous inputs. e prediction accuracy of the LSTM method has a slight improvment when compared with MLR.    two weights for each input element greatly decreases the need of parameters, which can be updated by the training process, and the two weighted vectors are adopted to learn temporal weights and factor weights.
When compared with CNN1d, the proposed method exploits the multiplication and addition of two vectors to generate predictions instead of using a kernel matrix to calculate the inputted 2d data to get output values. When T is small, the convolution kernel of CNN has poor performance, since the kernel size should smaller than input size. Moreover, not only wide or deep models but also machine learning models have abilities in overcoming overfitting using decay weights (i.e., regularization items) in their loss functions. e key concern is that more parameters will lead to hard training on small-scale data collections. erefore, PM is more concise and effective, which has better results and better adaption on broad inputs.
Another considerable result is that the performance differences in results among the methods are enlarged with complicated inputs. For example, when only considering the outpatient cases with T � 4, PM's performance is similar to other models. But the performance differences become larger when fusing exogenous inputs, since the mined patterns from heterogeneous data are more complex than those found in only outpatient case data. PM has a better ability to handle small-scale heterogeneous input data. e results of correlation experiments also show that it is completely feasible to use exogenous factors to predict the incidence of infectious diarrhea in Xiamen.

e Visualization Analysis of Predictions.
In time series prediction, it is sometimes more interesting to compare the performance on capturing the so-called extreme events, e.g., an oscillation after a stable growth or decay, or a huge sudden change during oscillations. us, a visualization comparing real values and PM predictions is presented in Figure 4. e red dashed lines mark the 10 th and 23 rd week, whose values follow the highest and lowest real values, respectively. e prediction of time series extreme events has been proposed by previous studies [46]. We can see that the number of outpatients peaked at 9 th week and 10 th week, and their outpatient counts significantly dropped. In this situation, PM models with BSI + MF fit the real value better than other methods, which illustrates the ability of weighted layers in summarizing the interactions of exogenous inputs, and the heterogeneous data fusion is an effective method to improve prediction accuracy. e advantage is also retained in the method with MF and BSI. It also should be noted that at 9 th week, the number of cases reaches the peak values, and MF effectively improves the accuracy in this timestamp, which makes up for the loss of accuracy of fusion of BSI and historical data in the forecast at that time. Historical cases and MF have slightly fluctuated at 13 th week. By incorporating BSI data, the influence of random disturbance in this timestamp is obviously reduced for prediction results. When the value at 23 rd week followed the lowest real value, the performance of the PM model with all heterogeneous data fusion is not stable.
Moreover, the predicted value using MF or BSI of 45 th week is obviously higher, and the situation is reflected in the fusion of MF and BSI. A possible reason is that the internal random disturbance of exogenous data has not been completely eliminated in the preprocessing stage. erefore, the extreme weather conditions or special events and degrade the prediction performance.

Conclusions
In this paper, we focus on predicting the number of infectious diarrhea outpatient visits in the upcoming week. A parsimonious model (PM) is proposed by condensing previous prediction models. e benchmarks of ten algorithms with four groups of inputs show the advantage of our method. It achieves better prediction performance by consolidating MF data and BSI data.
In the future, we will try to investigate the prediction of other cities and improve the robustness of PM in predicting those cities. Moreover, the simultaneous forecast of multiple cities would be our future research direction.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that they have no conflicts of interest.