Time Series Modeling of Tuberculosis Cases in India from 2017 to 2022 Based on the SARIMA-NNAR Hybrid Model

Tuberculosis (TB) remains one of the most severe and persistent health threats in developing countries, where it also constrains social and economic development. The present study forecasts the notified prevalence of TB, accounting for seasonality and trend, by applying a SARIMA-NNAR hybrid model. Monthly notified TB cases in India (2017 to 2022) were obtained from the NIKSHAY database repository. Time series models were constructed using seasonal autoregressive integrated moving average (SARIMA), neural network autoregressive (NNAR), and SARIMA-NNAR hybrid approaches. Candidate models were selected with the help of the Bayesian information criterion (BIC) and the Akaike information criterion (AIC), and the three models were then compared on forecasting accuracy. A total of 12,576,746 notified TB cases were reported over the study period, with a monthly average of 174,677.02 cases. The RMSE, MAE, and MAPE values were 13738.97, 10369.48, and 6.68 for the hybrid model; 19104.38, 14304.15, and 9.45 for the SARIMA model; and 11566.83, 9049.27, and 5.37 for the NNAR model, respectively. The NNAR model therefore fitted and forecast the time series better than the SARIMA and hybrid models. It is a suitable model for forecasting notified TB incidence and can serve as a tool for future prediction, which will assist in devising policies and strategies for better prevention and control.


Introduction
Tuberculosis is a highly infectious disease and one of the leading causes of ill health and death worldwide [1]. Until the COVID-19 pandemic, TB was the leading cause of death from a single infectious agent, ranking above HIV/AIDS [2]. TB is caused by the bacillus Mycobacterium tuberculosis, which is transmitted from person to person through respiratory droplets expelled when an infected individual coughs or sneezes [3]. According to WHO reports, the number of people newly diagnosed with TB fell from 7.1 million in 2019 to 5.8 million in 2020 [2], while TB deaths rose to 1.3 million in 2020 [2]. In India, by contrast, 1,933,381 TB cases were notified in 2021, 19% more than in 2020 (1,628,161) [4]. India bears the largest TB burden in the world, with an estimated incidence of 188 per 100,000 population (129-257 per 100,000 population) as of 2021 [4]. Although worldwide TB incidence is declining by 1% to 2% per year, it is still a significant global threat to public health, especially in developing countries [5].
Population gatherings, cultural programs, and religious festivals are influencing factors that can increase the incidence of TB while they take place. Monitoring the trend of TB incidence and establishing an accurate predictive model can therefore help to control further transmission of TB [6]. Several studies have been conducted in different geographical regions of India, including its eastern, western, northern, and southern regions [7,8].
Time series analysis is a statistical technique for data recorded over time and for the analysis of trends in such data. It can be used to track the occurrence in a population of diseases such as tuberculosis, cancer, diabetes, and kidney disease. By providing insight into how diseases evolve over time, it helps to identify patterns, forecast future trends, and ultimately alleviate the disease burden.
ARIMA and SARIMA are among the most widely used time series forecasting models and were proposed by Box and Jenkins in 1970 [9]. These models rely entirely on fitting past values of the series in order to project future values. The family comprises five forms: AR (p), MA (q), ARMA (p, q), ARIMA (p, d, q), and SARIMA (p, d, q) (P, D, Q) s. Such a model can forecast TB occurrence and thereby provide an early warning system for control of the disease. ARIMA and SARIMA models have frequently been used to forecast epidemic or pandemic diseases such as COVID-19 [10][11][12], malaria [13], influenza [14], hand, foot, and mouth disease [15], and TB [16].
The SARIMA model is widely used for prediction in the field of infectious diseases [17,18] and has been adopted throughout the world as a primary method for predicting TB incidence [19]. However, the nationwide trend in the effect of seasonality on TB cases has not been examined [8,20]. A previous assessment of seasonality and trend in India reported high seasonal variation in the northern region and little or no seasonality in the southern and central regions [21]. Seasonal variations have also been reported in other countries, such as China [22] and the United Kingdom [23]. Seasonal factors such as precipitation, humidity, temperature, and day-night length vary between geographic regions and might be responsible, but their impacts are not completely understood. It is therefore critical to analyze the seasonality of TB in order to identify emerging concerns, which may aid in devising future strategies for prevention and control [19]. In addition, several nationwide TB prediction studies in other countries have relied on the SARIMA model, which reflects only linear information [24,25]. In this study, we used the SARIMA model to capture the linear information in the nationwide incidence of TB in India, whereas the neural network autoregressive model accounts for the nonlinear information.
Artificial neural network (ANN) models are time series forecasting methods built on general mathematical models that permit nonlinear relationships between the response variable and its predictors [26]. This nonlinear mapping gives them a strong ability to predict with high accuracy. A nonlinear autoregressive neural network (NARNN) is an effective model with high fault tolerance in time series forecasting. In some studies, this model is referred to as a neural network autoregressive (NNAR) model [15]. However, some studies have noted that a single ANN model may not capture both the linear and the nonlinear patterns in a time series [27]. Previous studies that combined SARIMA and ANN models showed high prediction accuracy and overcame the inadequacies of the single models [28]. The SARIMA model fits linear relations best, whereas the ANN model fits nonlinear relations best [29]. A combination of the two is therefore used to improve time series forecasting.
The primary goal of the current study is to compare the forecasting efficiency of the SARIMA model, the NNAR model, and the hybrid model with respect to their accuracy in forecasting the prevalence of TB cases in India. We expected the NNAR model to forecast more efficiently than the other time series models [30]. The selected model can help to predict future TB cases and can serve as reference information for better prevention and control strategies.

Data Collection.
Secondary data on TB from January 2017 to December 2022 were extracted from the open repository web portal of the Central TB Division, Ministry of Health and Family Welfare, Government of India (https://tbcindia.gov.in). All the secondary data on notified TB incidences were registered in the NIKSHAY repository (https://reports.nikshay.in/Reports/TBNotification). Data were extracted separately for each month of every year, giving a total of 72 monthly observations over the 6 years. The data were then entered into an Excel (2019) sheet to build a time series database. Time series analyses were carried out in R (version 4.0.3; Vienna, Austria) [31,32] using the RStudio integrated development environment (PBC, Boston, MA) [33], and P < 0.001 was considered statistically significant.

SARIMA (1, 1, 2)(0, 0, 1) [12] Model.
ARIMA is widely accepted and broadly used for univariate time series forecasting. An ARIMA model extended with a seasonal component is known as a SARIMA model. It consists of seasonal components (P, D, Q) and nonseasonal components (p, d, q) and accounts for seasonality that repeats cyclically with period s in the time series. It is expressed as SARIMA (p, d, q) (P, D, Q) s, where p = nonseasonal AR order, d = nonseasonal differencing, q = nonseasonal MA order, P = seasonal AR order, D = seasonal differencing, Q = seasonal MA order, and s = period of the repeated seasonal pattern.

Canadian Journal of Infectious Diseases and Medical Microbiology
The expression of the SARIMA model is given as follows [30]:

$$\phi_p(B)\,\Phi_P(B^s)\,(1 - B)^d\,(1 - B^s)^D\,x_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t \tag{1}$$

The nonseasonal components are as follows:

$$\phi_p(B) = 1 - \phi_1 B - \cdots - \phi_p B^p, \qquad \theta_q(B) = 1 - \theta_1 B - \cdots - \theta_q B^q \tag{2}$$

The seasonal components are as follows:

$$\Phi_P(B^s) = 1 - \Phi_1 B^s - \cdots - \Phi_P B^{Ps}, \qquad \Theta_Q(B^s) = 1 - \Theta_1 B^s - \cdots - \Theta_Q B^{Qs} \tag{3}$$

where $B$ denotes the backshift operator, $\varepsilon_t$ is the estimated residual error at $t$, $x_t$ is the observed value at $t$ (1, 2, . . ., k), $\phi$ denotes the AR coefficients, $\theta$ the MA coefficients, $\Theta$ the seasonal MA coefficients, and $\Phi$ the seasonal AR coefficients.
First, stationarity of the time series is essential for fitting the SARIMA model. Stationarity was therefore checked with an Augmented Dickey-Fuller (ADF) unit root test. Seasonal and nonseasonal differences (D and d) are applied to convert nonstationary data into stationary data. Second, the order of the model was selected from the autocorrelation function (ACF) and partial autocorrelation function (PACF), which are shown in Figure 1. The model with the lowest values of the Bayesian information criterion (BIC) and Akaike information criterion (AIC) [34,35] was then chosen as the appropriate model for the time series of notified TB cases. Subsequently, we checked the reliability of the selected model, estimated its parameters, and used the model orders (p, P, d, D, q, Q) with the corresponding differencing. The ACF measures the correlation of the series with its own past values, whereas the PACF measures the correlation at a given lag after the effect of shorter lags has been removed [36]. Both the BIC and the AIC are penalized log-likelihood criteria. Finally, we checked the residuals with the ACF, the PACF, and the Ljung-Box test. Under the null hypothesis, the residuals are white noise with no autocorrelation among them.
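The residual check described above can be illustrated with a minimal sketch. The study itself worked in R; the plain Python below is only an illustration, and the function names `acf` and `ljung_box_q` are hypothetical. It computes the sample autocorrelations and the Ljung-Box statistic Q = n(n+2) Σ r_k² / (n−k); turning Q into a P value would additionally require the chi-square distribution, which is not implemented here.

```python
def acf(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a numeric sequence."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


# Toy input, not data from the study.
toy = [1, 2, 3, 4, 5]
r_toy = acf(toy, 2)          # lag-1 and lag-2 autocorrelations
q_toy = ljung_box_q(toy, 1)  # Q statistic using lag 1 only
```

A small Q relative to the chi-square reference distribution is what "no autocorrelation in the residuals" refers to.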
Before applying the models, the time series of notified TB cases was split into training and testing datasets in an approximately 70:30 ratio. The training dataset ran from January 2017 to February 2021 (50 observations) and the testing dataset from March 2021 to December 2022 (22 observations). The training dataset was used to construct the models and the testing dataset to validate their forecasting performance. Forecasting performance was assessed with the RMSE, MAE, and MAPE metrics, and finally the models were used to forecast the prevalence of notified TB cases from January 2023 to December 2023.

NNAR (3, 1, 2) [12] Model.
Artificial neural networks are mathematical models inspired by the brain that are applied in time series forecasting. They are widely used for complex nonlinear forecasting [37]. Lagged values of the time series can be used as inputs to a neural network, just as they are used in a linear autoregressive model, and the resulting model is called a neural network autoregressive (NNAR) model. It is usually written as NNAR (p, k) and, for seasonal data, as NNAR (p, P, k) m, where p is the number of lagged inputs, P is the number of seasonal lagged inputs, k is the number of nodes in the single hidden layer, and m is the length of the seasonal period. An NNAR (p, 0) model is equivalent to an ARIMA (p, 0, 0) model, but without the restrictions on the parameters that ensure stationarity. With seasonal data, it is also useful to add the last observed values from the same season as inputs. As an example, the NNAR (3, 1, 2) 12 model has inputs $y_{t-1}$, $y_{t-2}$, $y_{t-3}$, and $y_{t-12}$, and two neurons in the hidden layer. An NNAR (p, P, 0) m model is likewise equivalent to a SARIMA (p, 0, 0) (P, 0, 0) m model, but without the restrictions on the parameters that ensure stationarity.
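As a rough illustration of the two steps in this section, the sketch below performs the chronological 50/22 split and builds the lagged input rows that an NNAR (3, 1, k) 12 network would see. It is plain Python rather than the R workflow used in the study; `split_series`, `nnar_inputs`, and the placeholder series `y` are hypothetical names and values, not the NIKSHAY data.

```python
def split_series(series, n_train):
    """Chronological split: the first n_train points train, the rest test."""
    return series[:n_train], series[n_train:]


def nnar_inputs(series, p, P, m):
    """Build (inputs, target) rows for an NNAR(p, P, k)_m style network.

    Each row holds the last p values plus the values at seasonal lags
    m, 2m, ..., P*m (seasonal lags already covered by 1..p are skipped).
    """
    lags = list(range(1, p + 1))
    lags += [j * m for j in range(1, P + 1) if j * m not in lags]
    start = max(lags)
    rows = []
    for t in range(start, len(series)):
        rows.append(([series[t - lag] for lag in lags], series[t]))
    return lags, rows


# Placeholder for the 72 monthly counts (January 2017 to December 2022).
y = list(range(1, 73))
train, test = split_series(y, 50)   # January 2017 to February 2021, then the rest
lags, rows = nnar_inputs(train, p=3, P=1, m=12)
```

Each row pairs the inputs (y at lags 1, 2, 3, and 12) with the value to be predicted; the first 12 training points serve only as inputs, so 38 rows remain for fitting.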

Hybrid (SARIMA-NNAR) Model.
A hybrid model was considered because the series may contain both a linear and a nonlinear autocorrelation component. The SARIMA and NNAR methodologies work together to predict future values from the observed past time series and are appropriate for linear and nonlinear problems, respectively. For the present study, the hybrid model therefore combined the SARIMA model (linear) with the NNAR model (nonlinear). The linear relations in the TB time series were examined with the SARIMA model, while the nonlinear relations remaining in its residuals were examined with the NNAR model [15,38,39]. The linear and nonlinear parts were then added together for future prediction; that is, the linear SARIMA part and the nonlinear residual part were combined to estimate the occurrence of TB cases at time t. The construction of the combined (SARIMA-NNAR) hybrid model is shown in Figure 2.
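The additive combination described in this section can be written compactly. The notation below is an assumption introduced for illustration ($L_t$ for the linear part, $N_t$ for the nonlinear part, $e_t$ for the SARIMA residual, $f$ for the fitted network, $n$ for the number of residual lags); it follows the usual residual-based hybrid scheme and is a sketch rather than the exact formulation of the study.

```latex
\begin{align}
y_t &= L_t + N_t
  && \text{series as linear plus nonlinear part} \\
e_t &= y_t - \hat{L}_t
  && \text{residual left by the SARIMA fit } \hat{L}_t \\
\hat{N}_t &= f(e_{t-1}, e_{t-2}, \ldots, e_{t-n})
  && \text{NNAR fitted to the residuals} \\
\hat{y}_t &= \hat{L}_t + \hat{N}_t
  && \text{hybrid forecast at time } t
\end{align}
```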

Forecast Assessment Methods.
All the models were fitted to training data from the past 72 months, and the number of cases was forecast from the previous values of the time series. The forecasting results of the SARIMA, NNAR, and hybrid models were evaluated with three computational parameters commonly used to assess time series forecasts of TB cases: mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(x_t - \hat{x}_t\right)^2} \tag{4}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|x_t - \hat{x}_t\right| \tag{5}$$

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right| \tag{6}$$

Here $x_t$ is the actual occurrence, $\hat{x}_t$ is the predicted occurrence, $n$ is the number of forecasts, $A_t$ is the actual value of the quantity being predicted, and $F_t$ is the predicted value.
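For readers who want to reproduce the three error measures, a minimal plain-Python sketch follows. The function names and the toy numbers are hypothetical and are not values from the study; MAPE is expressed in percent and assumes that no actual value is zero.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


# Toy example, not data from the study.
observed = [100, 200, 400]
forecast = [110, 190, 400]
scores = (rmse(observed, forecast), mae(observed, forecast), mape(observed, forecast))
```

RMSE and MAE are in the units of the series (cases per month), whereas MAPE is scale-free, which is why the three are usually reported together.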

Results
The total number of notified TB cases in India from January 2017 to December 2022 was 12,576,746 (Table 1). Descriptive statistics for the monthly series are given in Table 2: the mean was 174,677.02 cases, the standard deviation 31,906.84, the minimum 83,647, and the maximum 228,814. The time series plot of the monthly occurrence of TB cases is depicted in Figure 1. The peak values of notified TB cases often occurred in March, April, and May. Additive decomposition of the TB time series revealed a seasonal pattern that repeats every 12 months (Figure 3). This confirms that the notified TB cases follow a seasonal pattern.
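The idea behind additive decomposition can be sketched in a few lines. The code below is a simplified classical decomposition in plain Python (centered 2 x 12 moving average for the trend, month-wise averages of the detrended values for the seasonal indices); it is an assumption about the general technique, not a reproduction of the R routine used in the study, and the synthetic series is invented for illustration.

```python
def additive_decompose(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + remainder.

    The trend is a centered 2 x period moving average (period must be even)
    and is None for the first and last period // 2 positions.
    """
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    # Average the detrended values separately for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    centre = sum(raw) / period
    seasonal_index = [s - centre for s in raw]  # indices sum to zero
    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal_index[t % period]
        for t in range(n)
    ]
    return trend, seasonal_index, remainder


# Synthetic 72-month series: linear trend plus a fixed 12-month pattern.
pattern = [0, 1, 3, 4, 3, 1, 0, -1, -3, -4, -3, -1]
synthetic = [1000 + 5 * t + pattern[t % 12] for t in range(72)]
trend, seasonal_index, remainder = additive_decompose(synthetic)
peak_position = seasonal_index.index(max(seasonal_index))  # 0 = first month of the cycle
```

On real data, the largest seasonal indices mark the peak months, which is how a March to May peak would show up in such a decomposition.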

Performance in the SARIMA Model.
The ADF test showed that the time series of notified TB cases was nonstationary (P value 0.3257). After one difference (d = 1), the series became stationary (P value 0.05162). The most appropriate model was then selected with the auto.arima() function of the forecast library in R [42][43][44]. The selected model was SARIMA (1, 1, 2) (0, 0, 1) [12], which had the lowest AIC (1632.85) and BIC (1646.43). The Ljung-Box test on this model gave a nonsignificant P value of 0.3049, indicating no autocorrelation in the residuals.
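The relation between the two criteria reported here can be checked with a short calculation. With log-likelihood ln L, k estimated parameters, and n observations, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, so BIC − AIC = k(ln n − 2). The sketch below is plain Python; the values k = 6 and n = 71 are assumptions (they are not stated in the text) chosen because they reproduce the reported gap between 1632.85 and 1646.43.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_lik


def bic_from_aic(aic_value, k, n):
    """BIC implied by a given AIC, since BIC - AIC = k * (ln n - 2)."""
    return aic_value + k * (math.log(n) - 2)


# Assumed values: k = 6 parameters, n = 71 observations after one difference.
implied_bic = bic_from_aic(1632.85, k=6, n=71)
```

Because ln n exceeds 2 for any series longer than seven points, the BIC penalizes each extra parameter more heavily than the AIC, which is why the two criteria can favor different orders.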

Performance in the NNAR Model.
For the seasonal component, P was set to 1. The model was generated automatically by the nnetar() function of the forecast library [15,40,45]. We tried different values of P several times and finally obtained the best prediction model, NNAR (3, 1, 2) [12], which had the smallest error. The residuals of this model showed no autocorrelation.

Performance in the Hybrid Model.
The hybridModel() function of the forecastHybrid package in R [40,45] was applied to build the hybrid model. Within this hybrid model, the SARIMA component is determined by the auto.arima() function and the NNAR component by the nnetar() function. Both components were fitted together on the training time series. The Ljung-Box test demonstrated no autocorrelation in the residuals.

Accuracy Evaluation among the Models.
The performances of the three models (SARIMA, NNAR, and hybrid) in the training and testing datasets are shown in Table 3. Table 3 lists the comparison criteria (RMSE, MAE, and MAPE); in the training dataset, the hybrid and NNAR models had lower errors than the SARIMA model. The forecasted TB notifications for the next 12 months are given in Table 4. The forecasting time series plots of the SARIMA, NNAR, and hybrid models are shown in Figures 4-6 together with the training and testing data of notified TB incidences. The fitted values of the models approximately follow the actual notified TB cases, as depicted in Supplementary Figures 1-3. According to the models (SARIMA, NNAR, and hybrid), the number of TB notifications will remain constant in the coming months, with peak occurrences in March, April, and May.
According to the WHO, the COVID-19 pandemic has had a significant impact on TB diagnosis and treatment in India. The disruptions caused by the pandemic delayed diagnosis and the start of treatment, which resulted in a 25% annual shortfall in TB case notifications in 2020 compared with 2019 [41]. In addition, the pandemic led to the wholesale mobilization of healthcare efforts and infrastructure towards its control and management, leaving little attention for other afflictions such as TB.

Discussion
This study applied NNAR and hybrid models to estimate the number of TB notifications. The NNAR model was the best-fitted model for forecasting the time series of notified TB cases from 2017 to 2022 in India. To estimate the efficacy of the models, we compared the results of the SARIMA, NNAR, and hybrid models. Both the NNAR and hybrid models met all the comparison criteria required for the time series of TB cases. However, the best estimation results were obtained by the NNAR model, followed by the hybrid model and the SARIMA model, respectively. Various parameters were tried during the development of the NNAR model in order to obtain the best forecasts. The hybrid model was also considered because it captures more characteristics of the data than a nonhybrid model [40]. The present analysis suggests that seasonal variation may be a contributing factor to TB incidence in India, as reflected in the periodicity of the notified TB series. This study also revealed a pronounced peak from March to May and a decline from October to December [20]. Several similar studies also show a seasonal influence on TB transmission [8,38]. According to the neural network and hybrid time series models, the projected results showed that the prevalence of TB cases in India would continue to increase in the coming months or years. A similar study was conducted in provinces of China, where SARIMA (0, 1, 1) (0, 1, 1) 12 and SARIMA-GRNN models predicted the prevalence of TB cases on the basis of influencing factors such as seasonality and trend [46,47]. Another study in a Chinese province analyzed the prevalence of TB by applying the SARIMA (1, 0, 0) (1, 0, 1) 12 model. These studies suggest that NNAR and hybrid models can explain the seasonality and trend in TB forecasting more efficiently than SARIMA [48]. The basic ARIMA model is widely applied in the field of epidemic and disease modeling and has been extended to SARIMA, NNAR, and various hybrid models.
The most suitable model was chosen on the basis of the evaluation parameters given in Table 3. NNAR was the best model and the hybrid model the second best for prediction. These models can help to monitor notified TB incidences and to adopt the necessary measures accordingly.
The main strength of the current study is its analysis of nationwide data, which is the first of its kind in India. Also, for the first time, hybrid models have been applied to nationwide TB data alongside SARIMA and NNAR models. However, the current study has a few limitations as well. Foremost, demographic indexes such as age, gender, educational status, caste, and religion were not available for the study. In addition, socioeconomic and climatic factors that could influence the seasonal variations were not included. Finally, the selected models were derived using data only from January 2017 to December 2022 in India and were tested against only one year of the available datasets. Hence, the findings should be reassessed cautiously with additional time series data in future studies.

Conclusion
The findings of the study are based on time series SARIMA, NNAR, and hybrid models fitted to notified TB cases from January 2017 to December 2022 in India. We estimated TB incidences with the NNAR and hybrid models. The NNAR model performs better than the individual SARIMA model or the hybrid (SARIMA-NNAR) model for predicting the prevalence of notified TB cases. This forecast will assist policymakers in devising better policies and strategies for the control and prevention of TB.

Figure 1: Time series with ACF and PACF plot of monthly notified TB cases from January 2017 to December 2022.

Figure 3: Additive decomposition of the monthly notified TB time series.

Table 1: Monthly reported notified TB cases from January 2017 to December 2022.

Table 2: Descriptive statistics for the available database from 2017 to 2022 in India.

Table 3: Accuracy parameters of the SARIMA, NNAR, and hybrid models for the training and testing datasets.

Table 4: Forecasting comparison between the models for January 2023 to December 2023.