Predicting the Number of Reported Pulmonary Tuberculosis in Guiyang, China, Based on Time Series Analysis Techniques

Tuberculosis (TB) is one of the world's deadliest infectious disease killers today, and despite China's increasing efforts to prevent and control TB, the TB epidemic is still very serious. In the context of the COVID-19 pandemic, if reliable forecasts of TB epidemic trends can be made, they can help policymakers with early warning and contribute to the prevention and control of TB. In this study, we collected monthly reports of pulmonary tuberculosis (PTB) in Guiyang, China, from January 1, 2010 to December 31, 2020, and monthly meteorological data for the same period, and used LASSO regression to screen four meteorological factors that had an influence on the monthly reports of PTB in Guiyang, including sunshine hours, relative humidity, average atmospheric pressure, and annual highest temperature, of which relative humidity (6-month lag) and average atmospheric pressure (7-month lag) have a lagging effect with the number of TB reports in Guiyang. Based on these data, we constructed ARIMA, Holt-Winters (additive and multiplicative), ARIMAX (with meteorological factors), LSTM, and multivariable LSTM (with meteorological factors). We found that the addition of meteorological factors significantly improved the performance of the time series prediction model, which, after comprehensive consideration, included the ARIMAX (1,1,1) (0,1,2)12 model with a lag of 7 months at the average atmospheric pressure, outperforms the other models in terms of both fit (RMSE = 37.570, MAPE = 10.164%, MAE = 28.511) and forecast sensitivity (RMSE = 20.724, MAPE = 6.901%, MAE = 17.306), so the ARIMAX (1,1,1) (0,1,2)12 model with a lag of 7 months can be used as a predictor tool for predicting the number of monthly reports of PTB in Guiyang, China.


Introduction
Tuberculosis (TB) is a chronic infectious disease caused by Mycobacterium tuberculosis (M.tb), which is spread mainly through the respiratory tract. M.tb can infect various organs throughout the body, with lung infections being the most common, and is the 13th leading cause of death worldwide [1][2][3]. It is estimated that about 25% of the world's population is infected with M.tb [4], and that their lifetime risk of developing TB is as high as 5% to 10% [5], which poses a large threat to human life and health. Despite the tremendous efforts made by countries around the world to prevent and control tuberculosis, it will remain a major public health problem. According to the Global Tuberculosis Report 2021 published by the World Health Organization (WHO) [6], the decline in the global incidence of TB has slowed from previous years, with approximately 9.9 million new cases of TB in 2020 and an incidence rate of 127 per 100,000. It is worrying that the incidence of TB in China turns down to be up in 2020, from 58/100,000 in 2019 to 59/100,000, and becomes the second highest burden of this disease in the world [7]. At the same time, because the COVID-19 pandemic in 2020 has had a huge impact on the provision of basic services for TB, the number of reported TB patients in China has dropped significantly, and about 217,000 cases of TB patients have not been diagnosed or reported to the China Disease Prevention and Control Information System; so, the prevention and control situation is very serious and there is an urgent need to take action to reduce the risk of disease in the population.
PTB is an infectious disease with an incubation period and the potential for widespread population infection [8], and the construction of appropriate predictive models to understand the trend of the epidemic in advance will be of great benefit to the prevention and treatment of tuberculosis. At present, many scholars around the world have constructed mathematical models to be applied in the prediction of infectious disease epidemic trends [9][10][11][12][13], which can be roughly divided into two categories, namely, traditional prediction models and advanced prediction models, and the performance of each method depends on multiple factors, such as data trends, periodicity and noise, and other environmental and social factors, of which the construction of infectious disease prediction models based on time series analysis technology is widely used. In the field of public health and medicine, time series analysis focuses on the continuous observation of biomedical data over time, looking for patterns of change and thus predicting future trends [14]. The ARIMA (Autoregressive Integrated Moving Average) model [15], the Holt-Winters model, and the LSTM (long short-term memory) network model [16] are popular time series models. ARIMA and Holt-Winters models are traditional forecasting models with well-established modeling steps and good statistical properties, but their linear modeling approach suffers from a number of limitations [17]. LSTM models are a type of deep learning and have received widespread attention in the field of infectious disease prediction because they can deal well with the long-term dependence of data and remember the long-term relationships of data in prediction and show great advantages in fitting structurally complex infectious disease data [18,19]. Since TB has certain transmissible properties as an infectious disease, it is worth discussing whether the LSTM model is necessarily superior to the traditional time series model in terms of prediction.
Since 2020, the COVID-19 epidemic has been raging around the world, and in order to control the epidemic, China implemented an unprecedented social intervention strategy on January 23, 2020 to carry out an emergency response, requiring the whole society to mobilize and advocate good personal prevention, wearing masks, washing hands correctly, and maintaining social distancing as routine measures for the prevention and control of the COVID-19 epidemic. Because COVID-19 and TB are both respiratory infectious diseases and both have similarities in prevention and control methods, the conventional prevention and control strategies implemented during the COVID-19 pandemic may have played a crucial role in reducing the incidence of tuberculosis, and this routine measure will continue for a long time in China. Therefore, in the context of the COVID-19 epidemic, building some more sensitive timeseries models for the decline in the number of TB cases will be an interesting question.
Related studies have shown that PTB is a seasonal disease in mainland China, with the epidemic showing a distinct "high in the west and low in the east" regional distribution, with temperate continental, highland, and mountainous climates dominating the western region, which is prone to seasonal increases in PTB [20,21]. Xiao et al. [22] found that temperature, humidity, wind speed, and sunlight may influence changes in PTB incidence by analyzing the relationship between PTB incidence and meteorological factors in Jinghong, Yunnan Province, southwestern China. The incidence of tuberculosis in Qinghai Province [23] and Guangxi Zhuang Autonomous Region [24] also has different degrees of relationship and lagging effect with some meteorological factors. Therefore, climatic factors play an important role in the onset and transmission of tuberculosis and in predicting the number of tuberculosis cases.
To our knowledge, there are few relevant studies that combine the above time series models with meteorological factors and then predict the number of PTB cases. In this study, based on time series analysis techniques, we developed ARIMA, ARIMAX with meteorological factors, Holt-Winters (additive and multiplicative), LSTM without meteorological factors, and LSTM with meteorological factors, respectively, based on the monthly reported number of TB cases and monthly meteorological data in Guiyang City, Guizhou Province from 2010 to 2020. The six prediction models were compared using mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE), and the best prediction model was selected.

Research Area.
Guiyang is the capital of Guizhou Province, located in the eastern part of the Yunnan-Guizhou Plateau in southwest China, between 106°07′ and 107°17′ East longitude and 26°11′ and 26°55′ North latitude, with a total area of 8,034 km 2 and a resident population of 5,987,000. Guiyang City has a humid and mild subtropical climate with both plateau and monsoon characteristics. The annual average temperature is 15.3°C, the annual average relative humidity is 77%, the annual average total precipitation is 1129.5 mm, the annual average sunshine hours is 1148.3 hours, and the ultraviolet intensity is weak.

Data Scores. Monthly reports of tuberculosis in Guiyang
City from January 1, 2010 to December 31, 2020 were obtained from the Tuberculosis Management Information System of the China Disease Prevention and Control Information System, which included a total of 38,835 cases of tuberculosis. Meteorological data for the study period were obtained from the Guiyang City Statistical Yearbook. Meteorological indicators included average temperature (°C), total precipitation (mm), average atmospheric pressure (100 Pa), relative humidity (%) sunshine hours (hour), annual highest temperature (°C), and annual lowest temperature (°C). Data from 1 January 2010 to 31 December 2019 were defined as the training set to build and compare the prediction models, and data from 1 January 2020 to 31 December 2020 were the number of monthly PTB reports (defined as the test set) in 2 Computational and Mathematical Methods in Medicine Guiyang City after the onset of the COVID-19 to evaluate the prediction performance of the models.

Variance Inflation
Factor. The variance inflation factor (VIF) is a measure of the severity of multicollinearity in a multiple linear regression model. It represents the ratio of the variance of the estimated regression coefficients to the variance of the independent variables assuming that there is no linear correlation between them. The formula is as follows [25]: where R i is the coefficient of determination of the independent variable x i on the rest of the independent variables doing the regression analysis. The larger the VIF, the stronger the collinearity between the independent variables. In general, if the VIF > 10, it means that the regression model has severe multicollinearity.
2.4. LASSO Regression. LASSO, known as the least absolute shrinkage and selection operator, was first proposed by Tibshirani [26] in 1996 and is a popular penalty regression method, which penalizes the absolute value of each regression coefficient by constructing a penalty function so that the regression coefficient is compressed even to zero, while requiring the sum of the absolute values of all regression coefficients to be less than or equal to the penalty parameter λ (λ ≥ 0). A more refined model is finally obtained. It retains the advantages of subset shrinkage and is a kind of biased estimation for dealing with covariance data and therefore possesses very good results for the treatment of highdimensional data and the selection of variables. The formula is as follows: The multiple linear regression equation is assumed to be The regression coefficients satisfy Minimize the sum of squared residuals under (3): In LASSO regression, the C p (Mallow's C p ) statistic is usually used to select the optimal subset, with a smaller C p value representing the optimal subset.  [15], of which p represents the autoregressive order, d represents the difference order, q represents the moving average order, P represents the season the order of autoregression, D is the order of seasonal difference, Q is the order of seasonal moving average, and S is the cycle step. The ARIMAX model is an extension of the ARIMA model, and when the ARIMA model includes other time series (meteorological factors) as input variables, it is called the ARIMAX model. The modeling steps for both models are roughly as follows: (1) stationarity and white noise test of the original sequence: use the augmented Dickey-Fuller (ADF) test to judge whether the original sequence is stationary, if it is a nonstationary sequence, then perform sequence transformations, such as logarithmization, difference (d), and seasonal difference (D), and then use "Ljung-Box" statistic to perform white noise test on the stationary sequence to see if it has analytical value; (2) model identification: the model is automatically identified by the "auto. arima ()" function in RStudio, and the possibility of p, q, P, Q, and S is preliminarily determined by combining the properties of the stationary series sample autocorrelation coefficient (ACF) and partial autocorrelation coefficient (PACF); (3) parameter estimation: after the model is identified, the value of the unknown parameter in the model is estimated by using the observed value of the sequence. The method chosen for this paper is conditional least squares and maximum likelihood estimation hybrid method (CSS-ML); and (4) model test: the best model is selected by comprehensive consideration of AIC (Akaike Information Criterion), and the residual sequence of the model should be a white noise sequence. [27] is a time series analysis and forecasting method divided into an additive model and a multiplicative model. The method is applicable to nonstationary series containing linear trends and periodic fluctuations and uses exponential smoothing (EMA) to allow the model parameters to continuously adapt to changes in the nonstationary series and to provide shortterm forecasts of future trends.

Holt-Winters Model. The Holt-Winters method
The Holt-Winters additive model can be expressed as follows: The recursion formula is as follows: The predicted value for the forward k period iŝ The Holt-Winters multiplicative model can be expressed as follows: The recursion formula is as follows: The predicted value for the forward k period iŝ In the above equation, a ðt − 1Þ is the unbiased estimate of the sequence intercept term, where the t − 1 moment eliminates the seasonal effect, b ðtÞ is the unbiased estimate of the slope of the t-moment b, c ðtÞ is the unbiased estimate of the t-time seasonal index Sj, xt is the latest observation obtained at the t-time, and m is the period length of the sea-sonal effect; α, β, γ are smoothing coefficients that meet the 0 < α, β, γ < 1.

LSTM
Model. The LSTM model, known as the long short-term memory model, was first proposed in 1997 by Hochreiter and Schmidhuber [16]. Because of its unique design structure, it is suitable for handling and predicting important events with long intervals and delays in time series. Currently, application areas include text generation [28], speech recognition [29], machine translation [30], and infectious disease prediction [31]. LSTM is a special kind of recurrent neural network (RNN) [32], which combines short-term and long-term memory through gate control, overcomes the gradient disappearance or gradient explosion of traditional RNN models, and is better at dealing with the problem of multiple variables. The individual circulatory structures of LSTM (also known as cells) consist of input gates, forgetting gates, output gates, and unit states. The output gate determines how much of the input data of the network needs to be saved to the cell state at the current moment; the forgetting gate determines how much of the unit state at the previous moment needs to be retained at the current moment; an output gate is a control of how much of the current unit state needs to be output to the current output value. Prediction methods that use only a single piece of data as input belong to univariatable LSTM and multivariable predictions can be constructed to improve the accuracy of predictions, taking into account that some variables exhibit periodic changes. In this paper, a multivariable LSTM prediction model of PTB reporting was established using the "torch" package in RStudio, which not only considered multiple meteorological elements but also  Computational and Mathematical Methods in Medicine the lagging effect of meteorological factors on the incidence of PTB. Since the LSTM model is sensitive to the input size of the data, the data should be readjusted to the range of 0 to 1 (also known as standardization), and this study has been parameterized several times to determine the optimal model based on the minimum root mean square error (RMSE) of the test set. The main parameters are shown in Table 1 y = 1, 2, 3, ⋯, n, and n represents the sequence sample size.
2.9. Data Processing and Analysis. The number of monthly PTB reports and monthly meteorological indicators in Guiyang City from 2010 to 2020 was collated and summarized through Excel 2010, using the "forecast," "tseries," and "torch" packages in RStudio (https://www.rstudio.com/) to establish ARIMA models, ARIMAX models, Holt-Winters (additive and multiplicative) model, LSTM model, and multivariable LSTM model. The statistical test level is α = 0:05.

Epidemic Situation of Pulmonary Tuberculosis in
Guiyang. From January 1, 2010 to December 31, 2020, a total of 38,835 cases of PTB were registered in Guiyang City, with the maximum number of monthly reports being 477 cases and the minimum number of monthly reports being 189 cases. From the decomposition of multiplicative time series (Figure 1), there is a long-term trend and a clear seasonality in the number of monthly reports of PTB in Guiyang. The long-term trend shows that the number of PTB reports in Guiyang from 2010 to 2020 shows an overall downward trend. The seasonality shows that the number of PTB reported in Guiyang is the lowest level in February and the high incidence period from March to July and two "small peaks" in September and November. Random effects showed that when trend effects and seasonal effects were excluded, the monthly reports of PTB in Guiyang showed randomness.
The results are found in Table 3, where the VIF values of average temperature (°C) and annual lowest temperature (°C) are greater than 10, indicating that there is a multicollinearity between meteorological factors. To overcome the problem of collinearity, LASSO regression is used for meteorological variables screening and selection, the optimal subset is selected by the C p (Mallow's C p ) statistic of the model, five meteorological factors (Table 3) are screened out when C p = 6:149, then linear regression is performed, and there are four variables that pass the significance test (Table 4):

ARIMA Model.
We defined the data from 1 January 2010 to 31 December 2019 as the training set to build the prediction model. From Figure 1, we know that the number of monthly PTB reports in Guiyang City has a long-term trend and seasonality; so, the ARIMA ðp, d, qÞ ðP, D, QÞ S model was chosen for fitting. According to the modeling steps, the series was transformed to a smooth series (Dickey − Fuller = −5:855, p < 0:01) after a 1st order 12step difference and conformed to the nonwhite noise property (p < 0:01), so that d = 1, D = 1, and S = 12. The parameters p, q, P, and Q were estimated by the "auto.arima ()" function was automatically fitted in combination with manual observation of ACF and PACF plots to estimate the parameters p, q, P, and Q. The "auto.arima()" function selected ARIMA (1,0,1) (0,1,1) 12 as the best model, which clearly did not fit the actual situation and required further observation of the ACF and PACF plot (Figure 2), initially identified as p = 0 or 1 and q = 0 or 1.
The identification of P and Q is more difficult, but in general, it does not exceed order 2; so, we took 0, 1, and 2 from low to high order, respectively, and tried them one by one (Table 5). After considering the fitting effect of each ARIMA model in the training set and the sensitivity of predicting the number of PTB incidence from January 1 to December 31, 2020, we chose the ARIMA (1, 1, 1) (0, 1, 2) 12 model as the relatively optimal model, which has a white noise series of residuals (p = 0:863) and AIC = 1124:440, which is a good fit, see Figure 3.

ARIMAX Model.
In order to evaluate the correlation between meteorological factors and the number of PTB registrations in Guiyang with different lags, we developed ARIMA models for sunshine hours, relative humidity, average atmospheric pressure, and annual highest temperature ( Table 6) to obtain the residual series of meteorological factors.
The ARIMA (1, 1, 1) (0, 1, 2) 12 model for the number of PTB cases in Guiyang City from 1 January 2010 to 1 December 2019 was analyzed using the crosscorrelation function (CCF) with the lag time of the meteorological factor for the same period (Figure 4), and it can be seen from the figure RH (6-month lag) and AAP (7-month lag). Based on this, we combined the RH (6-month lag) and AAP (7-month lag) with the ARIMA (1, 1, 1) (0, 1, 2) 12 model in turn to construct the ARIMAX model (Table 7). After considering the fitting effect of each ARIMAX model in the training set and the sensitivity of predicting the number of PTB incidences from January 1 to December 31, 2020, we selected the AAP (7-month lag) + ARIMAX (1, 1, 1) (0, 1, 2) 12 model as the relatively optimal model, which had a white noise series of residuals (p = 0:973) and AIC = 998:980, which is a good fit ( Figure 5).

Holt-Winters (Additive and Multiplicative) Model. The
Holt-Winters (additive) model and Holt-Winters (multiplicative) model were constructed using the Holt-Winters () function in RStudio to automatically fit the data based on the principle of optimality of fit, as the original series has a certain long-term trend and periodicity. The results show that the Holt-Winters (additive) model outperforms the Holt-Winters (multiplicative) model in terms of fitting and prediction sensitivity see Table 8 and Figure 6.    Figure 7 and Table 9, the multivariable LSTM +RH (6-month lag) + AAP (7-month lag) model outperforms the LSTM in terms of training set fitting and prediction sensitivity.

Comparison of Time Series Model.
In this paper, MAPE, RMSE, and MAE were used to assess the fit and predictive sensitivity of the forecasting models (

Discussion
In recent years, due to the increase in global climate change and extreme weather events, the spread of infectious diseases   7 Computational and Mathematical Methods in Medicine has been seriously affected, and the existing health problems of human beings have multiplied [33]. In southwest China, the weather usually changes rapidly, with more extreme weather and more susceptibility to meteorological factors. Therefore, it is of great significance to consider the influence of meteorological factors on the occurrence and spread of infectious diseases while constructing mathematical models to predict the epidemic trend of infectious diseases, which will help optimize the prevention strategy of infectious diseases.
In this study, we found that meteorological factors were correlated with the number of PTB reports in Guiyang City from 2010 to 2020 based on monthly PTB reports and meteorological data for the same period, and that there was a lagged effect, with each model significantly improving performance when meteorological factors were included. At   the same time, after the decomposition of the multiplicative time series, we found that the number of monthly PTB reports in Guiyang city has the characteristics of seasonality and long-term trend; so. it is a good choice to use the time series analysis technique to construct a prediction model. We chose the classical ARIMA/ARIAMX, Holt-Winters (additive and multiplicative) models and the advanced LSTM/multivariable LSTM models to evaluate the performance of the six models on training set and prediction sensitivity by including or not including meteorological factors. The results of this study showed that the ARIMAX (1, 1, 1) (0, 1, 2) 12 + AAP (lag 7) model with the inclusion of meteorological factors showed the best performance and was suitable for predicting the epidemic trend of tuberculosis in Guiyang after the occurrence of a COVID-19, and accurately predicting the epidemic trend of tuberculosis could help the local government to formulate appropriate prevention and control measures. ARIMA and Holt-Winters (additive and multiplicative) models are traditional statistical models for time series prediction, both of which reveal the patterns of historical data over time and are suitable for short-term prediction. They are widely used for predicting the epidemic trends of infec-tious diseases such as hand, foot, and mouth disease [34], tuberculosis [9], and COVID-19 [35,36] because of their advantages of easy and fast modeling approach and high prediction accuracy. In this study, the ARIMA model and Holt-Winters (additive and multiplicative) model were generally good predictors and fits for the number of monthly PTB registrations in Guiyang from 2010 to 2020, but the ARIMA model was superior to the Holt-Winters model. The reason for this may be that the ARIMA model constantly adjusts the parameters during the modeling process, taking into account the developmental characteristics of the series, especially for series with complex interactions between long-term trends, periodicity, and stochastic fluctuations. The Holt-Winters model, on the other hand, decomposes the variation pattern of the series by means of exponential smoothing, which is suitable for analyzing data with little variation over time, and its modeling process is simpler than that of the ARIMA model, but the Holt-Winters model wastes serious information on the stochastic fluctuations in the series, which leads to a less than ideal model fitting accuracy.
It is well known that environmental and natural factors largely influence the incidence and transmission of PTB.      model and is able to maintain long-term storage of information, thus allowing accurate modelling of data where shortterm or long-term dependencies exist [41]. Based on this, one might think that the LSTM model would outperform the ARIMA/ARIMAX and Holt-Winters models in training and prediction, but the univariate LSTM model was found to be the worst performer in our study. The LSTM model is an advanced recurrent neural network, which improves the accuracy of the model prediction results with a larger number of data sets [42]. We then further constructed the multivariable LSTM model, i.e., adding meteorological factors with lagged effects to the LSTM model, and the results obtained showed that the meteorological factors could improve the performance of the LSTM model, but it was still worse than the ARIMA/ARIMAX model and Holt-Winters model. The LSTM model is a complex neural network that requires a large amount of data for training, and too few training samples can lead to overfitting [43]. In this study, we are constructing a time series forecasting model based on monthly data from 2010 to 2019, with a small sample size and strong linear dependence between series; so, the ARIMA/ARIMAX model and Holt-Winters model will perform better when they have a clear trend in the series. This study compared the predictive performance of ARIMA, ARIMAX, Holt-Winters, LSTM, and multivariable LSTM models with and without meteorological factors using time series analysis techniques, with the aim of finding a suitable model for predicting the trend of tuberculosis epidemic in Guiyang, China. Of course, there are still some shortcomings in this study; firstly, we only considered meteorological factors in our modeling process. In future work, we will incorporate social and economic influences into the prediction model and continuously update the PTB data to obtain more accurate results. Secondly, the monthly TB data we collected did not distinguish between the mobile population and drug-resistant PTB reports, and some of the data were difficult to obtain; so, further research is necessary to predict trends in the mobile population and drug-resistant PTB reports in Guiyang. Finally, we need to further delineate different time scales in future studies to explore the performance of different time series prediction models.

Conclusions
In this study, we constructed six time series models with and without meteorological factors using monthly PTB reports and meteorological data for the same period from 2010 to 2020 in Guiyang City, China, of which the ARIMAX (1, 1, 1) (0, 1, 2) 12 + AAP (lag 7) model performed best in the training and validation sets. It shows that the addition of meteorological factors can improve the accuracy of the prediction model, which can be applied to the prediction of the trend of tuberculosis epidemic in Guiyang City, thus helping the local government to formulate effective intervention measures and prevention and control strategies, which is of practical significance and value for the comprehensive prevention and control of TB, and also provide methodological reference for the trend prediction of other infectious diseases.

Data Availability
The data used and analyzed in this paper are available in the Tuberculosis Management Information System of the China Disease Prevention and Control Information System and the Guiyang Statistical Yearbook.

Supplementary Materials
This supplementary material is the original data for this study and contains monthly data on TB registration reports in Guiyang City from 1 January 2010 to 31 December 2020 and monthly meteorological data for the same period. (Supplementary Materials)