A LASSO-Based Prediction Model for Child Influenza Epidemics: A Case Study of Shanghai, China

Child influenza is an acute infectious disease that places a substantial burden on children and their families. Accurate real-time prediction of child influenza epidemics can aid scientific and timely decision-making that may reduce the harm done to children infected with influenza. Several models have been proposed to predict influenza epidemics. However, most existing studies focus on adult influenza prediction. This study demonstrates the feasibility of using the LASSO (least absolute shrinkage and selection operator) model to predict influenza-like illness (ILI) levels in children between 2017 and 2020 in Shanghai, China. The performance of the LASSO model was compared with that of other influenza-prediction techniques, including autoregressive integrated moving average (ARIMA), random forest (RF), ordinary least squares (OLS), and long short-term memory (LSTM). The LASSO model outperformed the other candidate models. Owing to the variable shrinkage and low-variance properties of LASSO, it eliminated unimportant features and avoided overfitting. The experimental results suggest that the LASSO model can provide useful guidance for short-term child influenza prevention and control for schools, hospitals, and governments.


Background
Influenza is an infectious respiratory disease and the leading cause of respiratory illness among children. It can cause serious illness and even death in children. The annual influenza morbidity in children often exceeds 30%, which is higher than that for other age groups [1][2][3][4]. This is because children lack prior immunity to this disease. The average hospitalization rate of children attributable to influenza is the highest among all diseases. Further, it is estimated that 9,000-106,000 children younger than 5 years die due to respiratory diseases caused by influenza in 92 countries around the world [5]. During influenza epidemics, the influenza incidence rate in preschool and school-aged children may exceed 40% and 30%, respectively [1,6,7]. This results in the highest rates of outpatient medical visits and hospitalizations of young children for acute respiratory diseases [8]. In addition, children are the primary disseminators of influenza in communities [6]. Influenza also leads to other consequences, including complications, antibiotic treatment, absence from daycare or school, and parental work loss [4]. Consequently, accurate real-time prediction of child influenza is important and can aid in the adoption of effective measures to reduce the harm caused by child influenza.
Over the last few years, several methods have been utilized to predict influenza levels. These methods can be broadly classified into two categories: statistical and mechanistic models. Statistical approaches are based on statistics or machine learning principles, e.g., the autoregressive integrated moving average (ARIMA) model [9,10]; the seasonal autoregressive integrated moving average (SARIMA) model [11], which is a variant of ARIMA; the elastic net model [12]; the least absolute shrinkage and selection operator (LASSO) model [13,14]; the random forest (RF) model [15]; the support vector machine (SVM) model [12]; the long short-term memory (LSTM) model [16]; and the transformer model [17]. For instance, the ARIMA model has been used to predict the expected morbidity of influenza cases in Ningbo, China [10]. Some studies have reported accurate prediction of influenza-like illness (ILI) in the Netherlands and the USA using the LASSO model, providing early warnings in advance of influenza epidemics [13,14]. According to research on the forecast of ILI incidence rates at both the national (France) and regional levels (the Brittany region in France), a regression model based on SVM (with a linear kernel) outperformed other models, including the RF and elastic net + residuals fitted by ARIMA (ElasticNet + ARIMA) models.
Mechanistic models include compartmental and agent-based models (ABMs). Compartmental models describe transitions among subpopulations with different disease states and can represent disease transmission dynamics in a population. Examples of compartmental models include the susceptible-exposed-infectious-recovered (SEIR) model [18] and the susceptible-infectious-recovered-susceptible (SIRS) model [19], both of which are based on differential equations. Compartmental models have several variants and can model the temporal dynamics of other diseases, such as dengue [20,21] and breast cancer [22], using fractional derivatives. ABMs [23] describe the transmission behavior of each individual in a population. Additionally, by combining multiple models and leveraging the advantages of each, ensemble approaches [24,25] usually yield better prediction results.
To further improve prediction accuracy, numerous methods combine various data sources, including influenza surveillance data [14], web search data [26,27], temperature data [16], air pollutant data [28,29], Wikipedia access data [30], Twitter posts [31], and electronic medical records [32]. For instance, the ARGO method combines Google search term data and the LASSO model to capture people's dynamic search behavior over time; it has been noted to exhibit excellent influenza prediction performance. Climate [16] and air pollutant data [29] have also been demonstrated to be correlated with the influenza incidence rate to some degree, and the performance of forecast models may be enhanced by accounting for them. Chretien et al. [33], Nsoesie et al. [34], and Reich et al. [35] have provided detailed reviews of influenza prediction approaches in recent years.
Owing to their lack of prior contact with and immunity to influenza viruses, children exhibit the highest influenza infection rate among all age groups. Influenza vaccination is currently the most effective approach to slow down the transmission of the disease, and immunization of children can significantly reduce influenza incidence. In China, influenza vaccination is not routinely recommended, and immunization of children is unusual. Compared to adults, children exhibit different characteristics in terms of influenza immunity and infection.
Although a variety of methods have been proposed for influenza prediction, most existing studies [10-17, 19, 23-28] have focused on influenza prediction in adult or mixed populations rather than in child populations. Among the few studies focusing on the latter, He et al. [9] and Rao et al. [36] employed the ARIMA model to predict the rate of influenza virus infection in children in Wuhan and Suzhou, China. However, the authors did not compare the performance of ARIMA with that of other prediction models. Thus, the current state of research does not reveal the extent to which mainstream influenza prediction methods are applicable to pediatric influenza or how well they predict influenza in children. To answer these questions and identify an accurate forecasting model for child influenza in China, we performed a case study in Shanghai, thoroughly comparing the child influenza prediction performances of five statistical approaches: ARIMA, OLS (ordinary least squares), RF, LSTM, and LASSO.
To this end, we obtained the number of weekly outpatient visits from children with ILI between 2015 and 2020 from Shanghai Children's Hospital. The covariates were taken to be temperature and air pollutant data. The response was taken to be the 1-week-ahead ILI level. The experimental results indicate that LASSO outperformed the other four baseline models. The RMSE (root mean squared error) of LASSO was lower than those of ARIMA, RF, OLS, and LSTM by 9.47%, 22.96%, 26.67%, and 33.70%, respectively. Additionally, owing to the variable shrinkage property of LASSO, the coefficients of the temperature and air pollutant features shrank to zero and had no impact on the predicted child ILI levels.
Thus, estimation using LASSO can be expected to provide adequate guidance on child influenza prevention and control for schools, hospitals, and Centers for Disease Control (CDCs). Hospitals can use the accurate estimates for scientific and timely decision-making regarding public health resource allocation. Schools can also use the reliable estimates to remind students and their households to take precautions against influenza.

Methods
The methods used to support the findings of the study are explained in detail in the following sections.

Data Sources.
Three data sources were used to predict child ILI activity: historical data on children's outpatient visits for influenza, temperature data, and air pollutant data. The historical outpatient visits were used as influenza surveillance data to represent juvenile ILI activity levels. Previous studies have reported a strong correlation between low temperatures (0-5°C) and high levels of influenza virus infection [37,38]. Furthermore, much evidence has been reported indicating that air pollution is a risk factor for respiratory diseases, such as influenza, and that air pollutants (PM2.5, PM10, SO2, CO, NO2, and O3) affect the incidence of influenza significantly [39,40]. Influenza prediction performance can be improved by accounting for these environmental factors [41].
Children's Outpatient Visits for Influenza.
We obtained data on children's outpatient visits for influenza from Shanghai Children's Hospital. Shanghai is one of the largest cities in China. It has an area of 6340.5 square kilometers and a population that exceeded 24 million by the end of 2020. The gross domestic product (GDP) per capita was 24,443 US dollars in 2020. The population of children aged 0-14 years is approximately 2.44 million (based on national census data). Shanghai Children's Hospital is located in the central district of Shanghai. It was the first specialized pediatric hospital in Shanghai. It serves almost all children under 14 years of age in the city. In this study, child ILI was defined as the number of outpatient visits from children seeking medical attention for ILI symptoms. Data on weekly outpatient visits were collected from Shanghai Children's Hospital for the period between January 1, 2015 and May 31, 2020.

Temperature Data.
Maximum and minimum temperatures for each day in the study period were obtained from a historical weather website [42]. As they exhibit a high correlation, we used them to calculate the average temperature of each day.
Daily temperature and air pollutant data were collected for the period between January 2015 and May 2020. These data contained occasional missing values, which were filled in using linear interpolation. Subsequently, the daily data were aggregated into weekly averages of temperature and air pollutant levels. Note that web search data were not used in this study, primarily because children are only a subset of the whole population, and many search terms related to influenza are not directly related to children.
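The interpolation and weekly aggregation steps can be sketched as follows. This is a minimal NumPy illustration rather than the study's preprocessing code: the function names and the two weeks of temperature values are hypothetical, and weeks are assumed to be consecutive 7-day blocks.

```python
import numpy as np

def fill_missing_linear(values):
    """Fill NaN entries by linear interpolation between neighbouring observed days."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(values.size)
    observed = ~np.isnan(values)
    # np.interp holds the first/last observed value constant beyond the observed range.
    return np.interp(idx, idx[observed], values[observed])

def weekly_mean(daily, days_per_week=7):
    """Average consecutive 7-day blocks; a trailing partial week is dropped."""
    daily = np.asarray(daily, dtype=float)
    n_weeks = daily.size // days_per_week
    blocks = daily[: n_weeks * days_per_week].reshape(n_weeks, days_per_week)
    return blocks.mean(axis=1)

# Hypothetical daily average temperatures (deg C) for two weeks, with two gaps.
daily_temp = [4.0, 5.0, np.nan, 7.0, 6.0, 5.0, 4.0,
              3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0]
filled = fill_missing_linear(daily_temp)
weekly = weekly_mean(filled)
```

The same two functions would apply unchanged to each air pollutant series.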

Predictive Models.
In this study, LASSO was utilized as a regularized estimation model to forecast ILI. LASSO is an extension of linear regression and exhibits variable shrinkage and selection. Owing to its sparsity property, the coefficients of some of its features are zero. This property enables LASSO to avoid overfitting and enhances its predictive accuracy.
For comparison, we used four methods: OLS, ARIMA, RF, and LSTM. OLS [44] is perhaps the simplest linear regression method, and LASSO is obtained by adding the L1 penalty to OLS. Unlike OLS, LASSO exhibits the sparsity property. ARIMA [45] is a classical time-series prediction method that considers temporal autocorrelation and past forecasting errors. RF is an ensemble learning method consisting of multiple small decision trees; RF regression averages their predictive results to obtain accurate predictions [46]. LSTM is a special type of recurrent neural network (RNN) that can learn long-term dependencies in time-series data [16]. It has been commonly utilized in the current deep learning era. The main characteristics of each approach are described below.

LASSO and OLS.
Given N observations $(x_i, y_i)$, $i = 1, \ldots, N$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ denotes the p features and $y_i$ denotes the outcome variable, the linear regression model assumes that

$$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i, \tag{1}$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ denotes the unknown regression parameter and $\epsilon_i$ denotes the error term. OLS estimates the β parameter by minimizing the following least-squares objective function:

$$\min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2. \tag{2}$$

Equation (2) provides OLS estimates. In general, all OLS estimates are nonzero, and OLS estimates have low bias but high variance. Thus, small changes in inputs affect the estimates significantly.

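LASSO estimates the β parameter by solving

$$\min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 \quad \text{subject to } \|\beta\|_1 \le t, \tag{3}$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ denotes the $l_1$ norm of β and $t \ge 0$ denotes a user-specified parameter that controls the degree of variable shrinkage.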
The LASSO problem can also be written in the following Lagrangian form:

$$\min_{\beta} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2 + \lambda \|\beta\|_1 \tag{4}$$

for some $\lambda \ge 0$. By Lagrangian duality, there is a one-to-one relationship between the solutions of equation (3) and those of equation (4). The most important utility of the $l_1$ norm is that when t is sufficiently small, LASSO usually shrinks some of the regression coefficients to zero. This enhances the interpretability of LASSO: uncorrelated features are eliminated from the model, and the model becomes easier to understand. Interpretability is becoming increasingly significant in the era of big data. Currently, we can readily obtain hundreds, thousands, or even millions of features, but we cannot identify in advance the features that are related to the outcome variable. LASSO removes unrelated features and preserves only important features to avoid this problem.
By excluding unrelated features from the complete OLS model, the prediction variance is decreased, but the bias is increased as a tradeoff [47]. If the reduction in prediction variance exceeds the increase in bias, the accuracy of the model is enhanced. Therefore, LASSO often outperforms OLS in terms of accuracy. By choosing an appropriate λ for LASSO, uncorrelated features are eliminated and accuracy is improved. In this study, the value of λ was optimized using cross-validation (CV) based on the root mean square error (RMSE) metric. In particular, we trained the LASSO model corresponding to different values of the λ parameter, ranging from 0 to 100000 at intervals of 10. The λ value corresponding to the lowest RMSE score was selected for the final LASSO model. It was implemented using Python scikit-learn package version 1.0.9.
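To illustrate how the $l_1$ penalty in the Lagrangian form drives coefficients to exactly zero and how λ can be chosen by validation RMSE, the following minimal sketch implements LASSO with cyclic coordinate descent in NumPy. It is not the scikit-learn implementation used in the study. The function names, the synthetic data, the short λ grid, the 1/(2N) scaling of the squared-error term (which only rescales λ), and the single chronological hold-out split standing in for cross-validation are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2N))*||y - b0 - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    N, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centring handles the unpenalised intercept
    col_sq = (Xc ** 2).sum(axis=0) / N
    beta = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += Xc[:, j] * beta[j]        # partial residual without feature j
            rho = Xc[:, j] @ resid / N
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= Xc[:, j] * beta[j]
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data (assumption): only the first two of six features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = 5.0 + X @ true_beta + rng.normal(scale=0.5, size=120)
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

# Pick the lambda with the lowest validation RMSE from a small illustrative grid.
grid = [0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 10.0]
scores = {}
for lam in grid:
    b0, b = lasso_cd(X_tr, y_tr, lam)
    scores[lam] = rmse(y_va, b0 + X_va @ b)
best_lam = min(scores, key=scores.get)
best_b0, best_beta = lasso_cd(X_tr, y_tr, best_lam)
```

With a very large λ every coefficient is thresholded to zero and the model predicts the training mean, which is why validation RMSE eventually rises as λ grows.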

ARIMA.
ARIMA is one of the most widely used methods for time series prediction. ARIMA incorporates differencing, lagged values of the prediction variable, and lagged errors [45,46]. Differencing is used to convert a nonstationary time series into a stationary time series. The lagged values of the prediction variable comprise the autoregressive (AR) part of ARIMA, and the lagged errors form the moving average (MA) part of ARIMA. ARIMA involves three parameters (p, d, q): p denotes the order of the autoregressive part, d denotes the degree of first differencing, and q denotes the order of the moving average part. The traditional method of identifying and fitting an ARIMA model is complex, time-consuming, and subjective. Therefore, we used the grid search strategy to identify the optimal ARIMA model based on minimal Akaike's Information Criterion (AIC) via CV. During grid search configuration, the p and q values were taken between 0 and 5, and the d value was selected between 0 and 2. The (p, d, q) parameters with minimal AIC values were chosen to optimize the ARIMA model. The ARIMA model was implemented using Python statsmodels package version 0.13.0.

Random Forest.
RF models nonlinear relationships in data and has been used to predict various types of infectious diseases [46,48]. It is an ensemble learning method that combines a large number of decorrelated classification and regression trees (CARTs). To this end, it constructs each tree using the bagging (bootstrap aggregation) technique. Bagging involves sampling with replacement to create decorrelated decision trees during training. RF can be used for classification and regression tasks. As our prediction target is the number of ILI visits, we used RF regression for prediction, taking the average of all individual decision trees as the output. The grid search strategy was applied to optimize hyperparameters by minimizing RMSE using CV. The RF model was applied using Python scikit-learn package version 1.0.9. The n_estimators hyperparameter represents the number of trees in the forest, and its value was varied between 50 and 500 in intervals of 10. The other hyperparameters were set to their default values.

LSTM.
Neural networks (NN) are widely used to model nonlinear relationships in data. As an extension of NN, RNN is designed to deal with sequential data. It combines current inputs and past information to obtain the output. However, RNNs suffer from the gradient vanishing problem. To resolve this, LSTM [16,49] incorporates an input gate, a forget gate, and an output gate within the RNN cell. These cells transmit past information corresponding to multiple time steps to subsequent time steps. Primarily, LSTM saves old information for later use, thereby avoiding the gradient vanishing problem during training [40]. The LSTM model was applied using the Python Keras package version 2.3.1, which was built on TensorFlow package version 2.1.0. The LSTM model comprised two layers. The hiddenNum hyperparameter of each layer was determined using a grid search strategy based on cross-validation, with values varying from 8 to 128 in intervals of 8. Default values were used for the other hyperparameters.

Feature Selection.
Although LASSO exhibits variable shrinkage and selection, feature selection was performed to ensure fair comparison between LASSO and other prediction models, and the selected features were passed to all models for ILI level prediction. The features were selected using mutual information [50] owing to its suitability for measuring nonlinear relationships between random variables. The mutual information coefficient of each feature was calculated and scaled to the range 0-1. During the calculation of mutual information, the time window of the lag was selected to be 52 weeks (one year) based on previous research [16] and the number of features. Features with mutual information coefficients less than 0.4 were removed. The remaining features were passed directly to LASSO, OLS, RF, and LSTM, as these are multivariate models. In contrast, ARIMA is a univariate model that uses only lagged ILI occurrences for prediction.
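The lag-window construction and the thresholding step can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which mutual information estimator was used or how the scores were scaled, so the histogram plug-in estimator, the bin count, the scaling by the maximum score, the function names, and the synthetic seasonal series are all hypothetical choices.

```python
import numpy as np

def make_lag_matrix(series, max_lag=52):
    """Column k-1 of X holds the value k weeks before the target week."""
    series = np.asarray(series, dtype=float)
    rows = range(max_lag, series.size)
    X = np.array([[series[t - k] for k in range(1, max_lag + 1)] for t in rows])
    y = series[max_lag:]
    return X, y

def mutual_information(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def select_features(X, y, threshold=0.4):
    """Scale MI scores to 0-1 (division by the maximum) and keep those >= threshold."""
    mi = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    scaled = mi / mi.max()
    keep = np.flatnonzero(scaled >= threshold)
    return keep, scaled

# Synthetic weekly series with a yearly cycle (assumption, not the hospital data).
rng = np.random.default_rng(0)
t = np.arange(300)
series = 1000 + 500 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 50, t.size)

X, y = make_lag_matrix(series, max_lag=52)
keep, scaled = select_features(X, y, threshold=0.4)
selected_lags = keep + 1                      # column index 0 corresponds to lag 1
```

Temperature and pollutant lag features could be appended as extra columns of `X` before scoring.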

Model Assessment.
To assess the effectiveness of the aforementioned models, we used a naive method for comparison. The naive method uses the value of the previous week as the predicted value of the current week. It was adopted as the baseline model, and models outperforming the naive method were considered effective.
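A minimal sketch of the naive baseline and of an error expressed relative to it is given below. The weekly counts and the model predictions are hypothetical numbers, and the function names are illustrative.

```python
import numpy as np

def naive_forecast(series):
    """Predict each week (from the second onward) with the previous week's value."""
    series = np.asarray(series, dtype=float)
    return series[:-1]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def relative_rmse(y_true, model_pred, naive_pred):
    """Ratio below 1 means the model beats the naive method."""
    return rmse(y_true, model_pred) / rmse(y_true, naive_pred)

ili = np.array([1200.0, 1500.0, 2100.0, 2600.0, 2400.0, 1900.0])  # hypothetical counts
truth = ili[1:]                                # weeks that can be predicted
naive_pred = naive_forecast(ili)
model_pred = np.array([1450.0, 2000.0, 2500.0, 2500.0, 2000.0])   # hypothetical model output
rel = relative_rmse(truth, model_pred, naive_pred)
```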
The dataset was divided into training and testing sets. Data corresponding to 2015-2016 were used as training data, and data corresponding to 2017-2020 were used as test data. We used the rolling-origin recalibration method for evaluation [51]. The predictions for each week in the test set were obtained by moving the data from the test set to the training set sequentially. Once the data were updated each week, all prediction models were retrained to predict the ILI level of the following week. Retrospective estimates of child influenza activity for 2017-2020 were evaluated using an out-of-sample approach.
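The rolling-origin recalibration loop can be sketched as below. To keep the example short, the model being refit each week is a plain least-squares autoregression, which is a stand-in chosen for illustration and not one of the tuned models in the study. The function names, the lag choice, and the synthetic series are likewise assumptions.

```python
import numpy as np

def fit_ar_ols(series, lags):
    """Least-squares fit of series[t] on an intercept and series[t-k] for k in lags."""
    max_lag = max(lags)
    X = np.array([[1.0] + [series[t - k] for k in lags]
                  for t in range(max_lag, len(series))])
    y = series[max_lag:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_next(series, coef, lags):
    """One-step-ahead prediction from the most recent observations."""
    x = np.array([1.0] + [series[-k] for k in lags])
    return float(x @ coef)

def rolling_origin_forecast(series, n_train, lags=(1, 2)):
    """For each test week, refit on all earlier weeks and predict that week."""
    series = np.asarray(series, dtype=float)
    preds = []
    for origin in range(n_train, len(series)):
        history = series[:origin]              # training set grows by one week per step
        coef = fit_ar_ols(history, lags)       # recalibrate the model
        preds.append(predict_next(history, coef, lags))
    return np.array(preds)

# Synthetic three-year weekly series (assumption): two years train, one year test.
rng = np.random.default_rng(1)
weeks = np.arange(156)
ili = 2000 + 800 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 100, weeks.size)
preds = rolling_origin_forecast(ili, n_train=104)
```

Each prediction uses only data observed before its target week, so the resulting errors are out-of-sample.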
Five accuracy evaluation metrics were adopted to compare the performances of the five models: RMSE, mean absolute error (MAE), mean absolute percentage error (MAPE), correlation coefficient, and correlation coefficient of increment.
The following notations are used: $y_i$ denotes the true value of ILI at time $t_i$, $x_i$ denotes the predicted value of ILI at time $t_i$, $\bar{y}$ denotes the average value of the time series $y_i$, and $\bar{x}$ denotes the average value of the time series $x_i$.
RMSE is a measure of the average difference between the true and predicted values. The RMSE of the two time series $y_i$ and $x_i$ ($i = 1, \ldots, n$) is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}. \tag{5}$$

MAE measures the average absolute difference between the true and predicted values. The MAE of $y_i$ and $x_i$ is defined as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|. \tag{6}$$

RMSE and MAE are widely used in prediction tasks, and they both measure the average extent of prediction errors. MAE averages the absolute prediction errors directly, with equal weights for all errors. In contrast, RMSE squares the prediction errors before computing the average. Therefore, RMSE gives large weights to large errors. As a result, it is particularly suitable in cases where large errors are especially undesirable.
MAPE is a measure of the average percentage difference between the true and predicted values. The MAPE of $y_i$ and $x_i$ is defined as follows:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{y_i}. \tag{7}$$

The correlation coefficient is the Pearson correlation coefficient and measures the linear relationship between true and predicted values. The correlation between $y_i$ and $x_i$ is defined as follows:

$$\mathrm{Corr} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}. \tag{8}$$

Finally, the correlation of increment between $y_i$ and $x_i$ is defined as follows:

$$\mathrm{Corr\ of\ increment} = \mathrm{Corr}(y_i - y_{i-1},\ x_i - x_{i-1}). \tag{9}$$
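The five metrics can be written compactly in NumPy, as in the sketch below. The function names and the four-point example series are illustrative, and MAPE is returned as a fraction rather than a percentage.

```python
import numpy as np

def rmse(y, x):
    """Root mean squared error between truth y and prediction x."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def mae(y, x):
    """Mean absolute error."""
    return float(np.mean(np.abs(x - y)))

def mape(y, x):
    """Mean absolute percentage error, as a fraction of the true value."""
    return float(np.mean(np.abs(x - y) / y))

def corr(y, x):
    """Pearson correlation between truth and prediction."""
    return float(np.corrcoef(y, x)[0, 1])

def corr_of_increment(y, x):
    """Pearson correlation between week-to-week changes of truth and prediction."""
    return corr(np.diff(y), np.diff(x))

y_true = np.array([100.0, 200.0, 400.0, 300.0])   # hypothetical ILI levels
y_pred = np.array([110.0, 180.0, 420.0, 330.0])   # hypothetical predictions

metrics = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "Corr": corr(y_true, y_pred),
    "Corr of increment": corr_of_increment(y_true, y_pred),
}
```

Because MAPE divides by the true value, the same absolute error counts for more when the true ILI level is small.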

Results
The results obtained in this study are described in the following sections.

Influenza Prediction Accuracy of LASSO.
Retrospective estimates of ILI levels were obtained using LASSO, ARIMA, RF, OLS, LSTM, and the naive model for the period between week 1 of 2017 and week 20 of 2020. We compared the estimates with the ground truth, i.e., the ILI level provided by Shanghai Children's Hospital. The accuracy metrics of the models are presented in Table 1. The columns of Table 1 list different time periods, including the entire period, the off-season, and the four regular flu seasons in 2016-2020. The regular annual flu season lasts from week 40 of each year to week 20 of the following year. The test was conducted for the period 2017-2020. Thus, the start of the 2016-2017 season was taken to be week 1 of 2017. The prediction curves of all models and the ground truth are depicted in Figure 1.
Over the entire period, LASSO outperformed the other models in terms of all metrics except the relative MAPE. Overall, in terms of relative RMSE, MAE, and MAPE, LASSO was the most accurate model in most seasons, including the off-season and the regular flu seasons. During some regular seasons, the performance of the ARIMA model was slightly better than that of LASSO. For instance, in the 2016-2017 flu season, LASSO (relative RMSE = 1.038 and relative MAE = 1.070) exhibited the second-best performance in terms of relative RMSE and MAE, second to the ARIMA model (relative RMSE = 0.979 and relative MAE = 1.051). The correlation coefficient of LASSO was the highest in all seasons except the 2016-2017 flu season, when it (corr = 82.6%) was slightly inferior to that of ARIMA (corr = 86.0%). Moreover, LASSO exhibited the highest correlation of increments during all periods.
The RMSE values of LASSO were observed to be lower than those of ARIMA, RF, OLS, and LSTM by 9.47%, 22.96%, 26.67%, and 33.70%, respectively. The relative RMSE values of LASSO (0.899) and ARIMA (0.993) over the entire period were both less than one, while those of RF (1.167), OLS (1.226), and LSTM (1.356) were not. This suggests that, compared to the naive method, LASSO and ARIMA were effective, but RF, OLS, and LSTM were not.
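The reported percentages follow from the whole-period relative RMSE values quoted above, reading each percentage as the reduction in RMSE that LASSO achieves relative to the competing model. Because every relative RMSE shares the same naive denominator, the ratio of relative RMSEs equals the ratio of RMSEs. The short check below uses only the numbers in the text; the function name is illustrative.

```python
# Whole-period relative RMSE values as reported in the text.
relative_rmse = {"LASSO": 0.899, "ARIMA": 0.993, "RF": 1.167, "OLS": 1.226, "LSTM": 1.356}

def percent_reduction(model, reference="LASSO"):
    """Percentage by which the reference model's RMSE is below the given model's RMSE."""
    return 100.0 * (relative_rmse[model] - relative_rmse[reference]) / relative_rmse[model]

reductions = {m: round(percent_reduction(m), 2) for m in ("ARIMA", "RF", "OLS", "LSTM")}
# reductions -> ARIMA 9.47, RF 22.96, OLS 26.67, LSTM 33.7
```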
Although ARIMA was the second-best model over the whole period in terms of the relative RMSE, the discrepancy between LASSO and ARIMA during regular flu seasons was slight. However, LASSO (relative RMSE = 0.872) outperformed ARIMA (relative RMSE = 1.201) by a notable margin during the off-season.
Both RF and LSTM are nonlinear models and exhibited poorer performances than the naive method. Figure 1 indicates that the prediction curve of RF does not contain as many undulations as that of LSTM. OLS is similar to LASSO in principle, but its performance was observed to fall between those of RF and LSTM. This suggests that OLS exhibits higher variance than LASSO, which can also be observed in Figure 1. Note that OLS exhibited remarkably large estimation errors on March 23, 2017, December 28, 2017, and February 20, 2020.
The vertical orange dashed line in Figure 1 represents January 23, 2020, during the 2019-2020 flu season. From this day, COVID-19 began to spread rapidly, and the Chinese government closed Wuhan down to prevent the epidemic from spreading outside Wuhan. Other cities, including Shanghai, strictly managed the travel of residents: students attended classes at home, and office workers worked from home or went on vacation. Adoption of home isolation measures reduced the exposure of children to influenza, and the number of children infected with influenza decreased sharply. The ILI level was centered around 2000 between the middle of February and May 2020, as depicted in Figure 1. As MAPE calculates the percentage of errors, equation (7) indicates that if the ground truth is a small value, MAPE is very likely to be relatively large, even if the prediction error is not very large. For this reason, the relative MAPE of LASSO (2.076) was significantly larger than that of ARIMA (1.288) during this period. Nevertheless, the patterns of previous flu seasons were different from those of the 2019-2020 flu season after the outbreak of COVID-19, and all prediction models performed worse than the naive method in terms of MAPE.
Figure 2 depicts a scatter plot of the ground truth and all prediction results. From LASSO to LSTM, as $R^2$ decreased, the errors of the prediction models gradually increased, and a tendency towards wider confidence intervals for interval-based prediction was observed.
Figure 3 depicts the dynamic regression coefficients of LASSO obtained by applying the rolling-origin recalibration evaluation method to data between week 1 of 2017 and week 20 of 2020. The coefficients of features num_lag_1 and num_lag_51 were positive; these features represent the ILI number 1 week and 51 weeks earlier, respectively (one year usually contains 52 weeks).
Feature num_lag_1 has been highlighted in dark red to indicate that the ILI number of the previous week exerted the greatest influence on that of the current week, and feature num_lag_51 exhibited a much weaker effect. The coefficients of features num_lag_2 and num_lag_52 were negative, and their absolute values were generally lower than those of features num_lag_1 and num_lag_51, respectively. The former pair seemed to slightly offset the latter pair. The coefficients of features concerning the temperature, PM2.5, PM10, SO2, NO2, and O3 were constantly zero. Compared to the autoregressive features of ILI number, they exhibited almost no impact on the ILI number of the current week; thus, their coefficients were reduced to zero by LASSO. The green vertical dashed line represents the outbreak of COVID-19. Subsequently, the coefficients of num_lag_2 and num_lag_52 became remarkably smaller. These coefficients are directly proportional to the predicted ILI number. Owing to home isolation measures, the number of children infected with influenza decreased sharply.

Table 1: Influenza estimation accuracies of different models. The best performance corresponding to each accuracy metric in each time period is highlighted in boldface. The reported RMSE, MAE, and MAPE scores are relative to the absolute error of the naive method, i.e., the ratio of the error of a given method to that of the naive method. The numbers in parentheses denote the absolute errors of the naive method.

The Distribution of Coefficients of the Most Important Features.
Figure 5 displays box plots of the coefficients of the ten features with the largest mean absolute values of the coefficients. The features were ordered from left to right in terms of the mean absolute values of the coefficients. As all of these features are autoregressive, the coefficient values of these features represent their importance. The four most important features were observed to be num_lag_1, num_lag_51, num_lag_2, and num_lag_52. The negative coefficients of the features num_lag_2 and num_lag_52 slightly offset the positive coefficients of num_lag_1 and num_lag_51. This relationship is illustrated in Figure 3. The mean absolute values of the coefficients of the other six features did not exceed 0.02; thus, these features were deemed trivial compared to the four most important features.

The Variable Shrinkage Property of LASSO.
In equation (3), the parameter t bounds the size of $\sum_{j=1}^{p} |\beta_j|$. When t is large, the constraint on the $l_1$ norm of β is relaxed and the estimated coefficients can be large. In particular, when t is greater than $\sum_{j=1}^{p} |\hat{\beta}_j^{\mathrm{OLS}}|$, the $l_1$ norm of the OLS estimates, the constraint is inactive and the LASSO estimates coincide with the OLS estimates. In contrast, if t is small, the constraint on $\sum_{j=1}^{p} |\beta_j|$ is strict, and the estimated coefficients are also small. When t is sufficiently small, some of the estimated coefficients become zero: this is the variable shrinkage property of LASSO. This property can be visualized using the definition of the shrinkage ratio, s:

$$s = \frac{t}{\sum_{j=1}^{p} |\hat{\beta}_j^{\mathrm{OLS}}|}. \tag{10}$$

Figure 3: Heat map of dynamic regression coefficients of LASSO between week 1 of 2017 and week 20 of 2020. These regression coefficients were computed using the rolling-origin recalibration evaluation method. The X-axis represents the date and the Y-axis corresponds to the regression coefficients of the features. Positive coefficients are indicated in red, negative coefficients are indicated in blue, and zero is indicated in white. The feature num_lag_x denotes the ILI number with lag x, and temp_lag_x denotes the average temperature with lag x. The green vertical dashed line represents January 23, 2020, when the Chinese government closed Wuhan down to prevent the COVID-19 epidemic from spreading outside Wuhan.
When s lies between 0 and 1, the corresponding coefficients of the solution can be obtained. The collection of coefficients for all s (0 < s ≤ 1) was generated for LASSO. The generated coefficients of LASSO for week 1 of 2017 are depicted in Figure 6. As the overall number of features was large, only the ten most important features are depicted. When the shrinkage ratio s was gradually increased, the most important feature, num_lag_1, first entered the LASSO model. Subsequently, the features num_lag_51 and num_lag_47 entered the model. These features exerted a smaller influence on the predictive target than num_lag_1. In the figure, s is represented on the X-axis, and the outcome of variable shrinkage can be observed. For instance, corresponding to s = 0.5, the three features num_lag_1, num_lag_47, and num_lag_51 were used to predict the ILI number, and their coefficients were determined by the intersections of the coefficient curves and the red dashed vertical line. The coefficients of the other features were zero.

Figure 6: The X-axis represents the shrinkage ratio and the Y-axis represents the estimated coefficients. The ten most important features are displayed in the figure. The red dashed line represents s = 0.5.

When λ was increased further, the RMSE gradually increased. This behavior can be explained by the fact that, when λ became too large, the amount of variable shrinkage was also large. This shrank the coefficients of some important features, even to zero in some cases. Consequently, the performance of LASSO was degraded. Therefore, the selection of λ is critical in LASSO. As the computational efficiency of LASSO is very high and λ is the only parameter, parameter selection for LASSO can be performed efficiently.

The Effect of Feature Selection on Prediction Accuracy.
The effects of selecting different numbers of features on LASSO were assessed in this case study. Feature selection was performed using mutual information, and features with mutual information values exceeding a certain threshold were selected and input into LASSO. The thresholds were set to 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, and 0.2, and the corresponding numbers of selected features were 5, 7, 14, 26, 60, 128, and 217, respectively. Different LASSO models were trained using these groups of features. For each group of selected features, the LASSO model was trained with CV performed using λ values ranging from 1 to 100000 in intervals of 10.
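The threshold-based selection step described above can be sketched as follows. This is a hedged illustration rather than the estimator used in the study: mutual information is approximated here with a simple equal-width histogram, which is an assumption, and the function names, bin count, feature names, and values are all hypothetical.

```python
import math
from collections import Counter


def discretize(values, n_bins):
    """Map continuous values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=4):
    """Histogram estimate of mutual information (in nats) between x and y."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def select_features(features, target, threshold, n_bins=4):
    """Keep the names of features whose estimated MI exceeds the threshold."""
    return [
        name
        for name, column in features.items()
        if mutual_information(column, target, n_bins) > threshold
    ]


# Invented toy data: one feature mirrors the target, one carries a coarse
# version of it, and one is unrelated to it.
target = [0, 1, 2, 3, 4, 5, 6, 7]
features = {
    "copy_of_target": [0, 1, 2, 3, 4, 5, 6, 7],
    "coarse_signal": [0, 0, 0, 0, 1, 1, 1, 1],
    "unrelated": [0, 1, 0, 1, 0, 1, 0, 1],
}

for threshold in [1.0, 0.5]:
    print(threshold, select_features(features, target, threshold))
```

Lowering the threshold admits more features, which is the mechanism behind the growing feature counts reported for the thresholds 0.8 down to 0.2.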
The minimal RMSE for each group of selected features is depicted in Figure 8. The effects of the selected features on LASSO were not negligible. The RMSE values ranged from 1637 (60 selected features) to 1844 (14 selected features). This result is noteworthy, considering that LASSO itself performs variable shrinkage and selection. In particular, the RMSE did not increase or decrease monotonically as the number of selected features increased. The RMSE was lowest with 60 selected features. Therefore, despite the variable shrinkage property of LASSO, thorough feature selection is still beneficial.

Discussion
In this study, a prediction method based on LASSO was proposed to track the incidence of influenza in children in Shanghai, China. To the best of our knowledge, this is the first attempt to evaluate state-of-the-art prediction models thoroughly in the case of influenza incidence in children and identify the superior model. LASSO was used to discard unnecessary features and select the most significant features based on lagged ILI levels, temperature, and air pollutant data. Owing to its variable shrinkage property, LASSO eliminated unimportant features and outperformed other influenza tracking models, including ARIMA, RF, OLS, and LSTM, in terms of prediction accuracy. Compared to adults, children exhibit different characteristics of influenza immunity and infection. Previous studies have demonstrated that LASSO is effective in predicting influenza in adults. However, the predictive accuracy of LASSO for pediatric influenza had remained unclear. This study revealed that LASSO also achieves accurate prediction of influenza trends in children.
The interpretability of LASSO is enhanced by its variable shrinkage properties. The ILI level from the previous week was observed to have a significant impact on the ILI level of the current week, and ILI levels with lags 51, 2, and 52 provided information of decreasing importance. This reflects a strong temporal autocorrelation and yearly cycle pattern in child influenza data that is corroborated by the ARGO (Autoregression with Google search data) model [14]. In ARGO, the current week's ILI level is significantly influenced by the ILI levels of the previous week and those from half a year ago and one year ago.
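The autoregressive features discussed above (num_lag_1, num_lag_51, and so on) are simply shifted copies of the weekly ILI series. A minimal sketch of how such a design could be assembled is given below; the helper name `make_lag_features` and the example series are hypothetical, and only the `num_lag_x` naming follows the convention used in this paper.

```python
def make_lag_features(series, lags, prefix="num"):
    """Build one feature row per week from earlier values of the same series.

    Each row maps '<prefix>_lag_<k>' to the value observed k weeks before
    the target week. Weeks without enough history are skipped.
    """
    max_lag = max(lags)
    rows, targets = [], []
    for t in range(max_lag, len(series)):
        rows.append({f"{prefix}_lag_{k}": series[t - k] for k in lags})
        targets.append(series[t])
    return rows, targets


# Invented weekly counts, used only to show the alignment of lags and targets.
weekly_ili = [10, 12, 15, 14, 13, 17, 21, 19, 16, 18]
rows, targets = make_lag_features(weekly_ili, lags=[1, 3])
print(rows[0], targets[0])
```

With real data, lags up to 52 weeks would be generated in the same way, so that a yearly cycle can surface as nonzero coefficients on lags near 51 and 52.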
The proposed approach is not directly compared to ARGO, as the two use different datasets and features. However, both this study and ARGO utilized the relative RMSE metric with respect to the naive method. The relative RMSE of the proposed approach (0.899) was observed to be much larger than that of ARGO (0.608). The difference could be attributed to the differences in the adopted features. In addition to the lagged ILI features, the proposed approach considered temperature and air pollutant features, whereas ARGO uses Google search term features. In our approach, lagged temperature and air pollutant features were not selected by LASSO, and their coefficients were decreased to zero. However, the coefficients of many Google search features are not zero, and these features play an important role in ARGO. It is likely that if we accounted for child influenza-related search data, the accuracy of our approach could be further improved. Previous studies have reported the relationships between temperature/air quality data and ILI levels and employed these data to improve the performance of predictive models. In this study, before the features were passed to LASSO, they were filtered via feature selection using the mutual information criterion. Many of the temperature and air pollutant features passed this feature selection step, but their final coefficients were reduced to zero by LASSO, as depicted in Figure 3. This demonstrates the primary advantage of LASSO: even among the selected and potentially useful features, relatively unimportant features were eliminated while significant features were retained. In this study, the temperature and air pollutant features were observed to be relatively unimportant compared to the autoregressive features of the ILI number. Thus, our results indicate that air pollution does not have a significant impact on child influenza. However, we only used a dataset corresponding to Shanghai, and, thus, the result may be a consequence of the limited data sample. Therefore, further data collection and research are required.
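The relative RMSE used above to place the proposed approach and ARGO on a common scale can be sketched as follows. This illustration assumes that the naive method is the persistence forecast, that is, each week is predicted by the previous week's observed value; the function names and the example numbers are hypothetical and are not results from this study.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def naive_forecast(series):
    """Persistence forecast: the prediction for week t is the value at week t-1."""
    return series[:-1]


def relative_rmse(model_predictions, series):
    """RMSE of the model divided by RMSE of the naive forecast.

    model_predictions must align with series[1:], because the first week
    has no previous observation for the naive method to use.
    """
    observed = series[1:]
    return rmse(model_predictions, observed) / rmse(naive_forecast(series), observed)


# Invented weekly counts and model output, for illustration only.
weekly_ili = [100, 120, 150, 130, 110]
model_predictions = [110, 140, 140, 120]
print(relative_rmse(model_predictions, weekly_ili))
```

A value below 1 means that the model beats the naive forecast, and smaller values indicate a larger improvement, which is why 0.608 denotes a stronger result than 0.899.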
Although LASSO exhibits variable shrinkage and selection, the impact of the features input into LASSO on its estimation accuracy was not negligible. This was attributed primarily to the variable shrinkage property of LASSO. When several unrelated features were input into LASSO, their coefficients were often close to zero but not equal to zero. This increased the variance and decreased the estimation accuracy of LASSO. Therefore, cautious feature selection remains an important factor influencing LASSO's accuracy.
Another merit of LASSO is that it involves a single hyperparameter, λ. Furthermore, its training process is highly efficient. Therefore, hyperparameter search for LASSO is a trivial task. ARIMA, RF, and LSTM all involve more than three hyperparameters, and the training processes for these models are less efficient than that of LASSO. Therefore, hyperparameter search is tedious in these cases.
This study has several limitations. Owing to availability issues, we only used a child influenza dataset pertaining to Shanghai, China. Influenza vaccination rates in children vary across regions and countries. Moreover, children from different countries exhibit different levels of immunity against influenza. Thus, the effectiveness of LASSO as a predictor of child influenza incidence in other cities or on larger scales, such as states and countries, requires further research. Additionally, the child influenza dataset used in this study corresponded to a relatively short period, from January 1, 2015 to May 31, 2020. Finally, the predictive target was taken to be the 1-week-ahead ILI level. We only focused on short-term forecasts in this study, and long-term forecasts were not considered.

Conclusions
In this study, the feasibility of using the LASSO model to predict child ILI activity level based on data corresponding to the period 2017-2020 in Shanghai, China, was demonstrated. The proposed model leverages data from multiple input data sources, including lagged ILI number, lagged temperatures, and lagged air pollutant data. Owing to the variable shrinkage property of LASSO, the coefficients of the unimportant features (lagged temperature and air pollutant features) are decreased to zero. In contrast, autoregressive ILI number features are preserved as important features. The proposed LASSO model outperforms the other candidate models assessed in the study. Although there are some distinctions between child and adult influenza, this study demonstrates that LASSO is effective and accurate for child influenza prediction, making it a powerful tool for providing guidance on child influenza prevention and control for schools, hospitals, and the CDC. Although LASSO exhibits variable shrinkage, feature selection continues to have a significant impact on its performance. Thus, cautious feature selection can further improve its prediction accuracy. In future work, we intend to study the deeper relationship between feature selection and LASSO, and investigate long-term forecasts, such as the 1-month-ahead ILI level. In addition, we wish to evaluate the feasibility of the LASSO model using child influenza datasets in other cities or regions.
reviewed, and edited the manuscript; Yu Xu and Dayu Cheng visualized the study; Tao Pei supervised the study; Jin Zhu and Tao Pei administered the project; Yuan Liu and Tao Pei acquired funding. All authors have read and agreed to the final manuscript. Jin Zhu, Yu Xu, and Guangjun Yu contributed equally to this work.