Using Sequence Mining to Predict Complex Systems: A Case Study in Influenza Epidemics



Introduction
With the rapid development of societies and economies worldwide, health technologies have improved and health facilities have expanded. Nevertheless, influenza infection still confronts societies with a number of health problems.
Consequently, influenza continues to pose a great threat to human health, and controlling it has become an important global challenge. Influenza has also brought huge losses to national economies. Although other infectious diseases, such as smallpox and malaria, have been efficiently controlled, seasonal influenza still has high occurrence rates and causes many emergent health problems, including early deaths worldwide [1]. Influenza was therefore the first infectious disease for which a surveillance system was implemented, yet its effective control remains elusive. Search terms submitted by millions of Internet users around the world have been used to develop systems that detect influenza outbreaks at the earliest stages [2,3]. The rapid adoption of the Internet has opened new gates for developing and enhancing healthcare. Many researchers have used the huge amounts of data available on the Internet and on social media platforms such as Twitter or Facebook to discover novel methods for diagnosing diseases. Thus, language patterns from the Internet and social media have proved useful in analysing and predicting chronic diseases and in determining the behaviours and habits that increase the likelihood of those diseases. Web search activity data have also been used to understand population behaviour and trends in noncommunicable diseases: such diseases have been detected by using web search activity data together with examination data submitted to the relevant health officials, and the search activity data follow the same trend as the examination data [4,5].
Researchers have compared Internet search query data relating to the key modifiable risk factors of noncommunicable diseases with clinical population data from the US Centers for Disease Control and Prevention (CDC). Real-time surveillance based on web search data can provide a proxy for clinical population data and thereby enhance healthcare systems. Most previous research has tried to predict influenza using Internet search query data alone. Here, we developed a new model that predicts the influenza epidemic with higher accuracy than existing models. The main contribution of this study is to propose a complex system that can assist in enhancing time series models in the healthcare domain.
The INtelligent Time Series (INTS) model combines clustering with a time series model. It was developed to predict the influenza epidemic based on Google search data. We found that the INTS model yields better results than other proposed models, such as Google Flu Trends (GFT) and AutoRegression with Google search data (ARGO).

Background of the Study
Typically, the Internet is a primary tool through which individuals seek information about their wellbeing and, in doing so, supply data. Individuals who are affected by infections or medical problems frequently look online for suitable medications or treatments. Various studies have recommended notable methods for predicting influenza epidemics [6,7]. In November 2008, Google launched the Google Flu service, which uses a computational search term model to predict influenza activity. In 2009, Google also offered Google Flu Trends (GFT), a digital method for public health surveillance [8]. By gathering web information, its developers claimed to estimate influenza epidemics validly. The novelty of the GFT model is that it uses data from the Centers for Disease Control and Prevention (CDC) to find specific search terms in digital data for predicting influenza epidemics. Various subsequent studies have modelled their approaches on GFT in order to enhance it [9][10][11][12].
Hence, we present the INtelligent Time Series (INTS) model, which outperforms the alternative models considered here in predicting influenza epidemics from Internet search queries. A growing number of studies focus on data-based monitoring of infectious illnesses to complement current technologies and develop new models [13][14][15][16][17][18][19]. Furthermore, models for detecting infectious diseases are now being developed using large amounts of information, such as Internet search queries [20][21][22][23][24]. Thus, it becomes possible to collect and process Internet search information to monitor the healthcare system. According to Towers et al. [25], Internet search information can detect an epidemic faster than standard surveillance technologies. For instance, when Huang et al. predicted hand, foot, and mouth disease using the generalized additive model (GAM), the model that included search query data obtained the best results. As such, new big data surveillance tools have been shown to have the benefits of easy accessibility and of recognising patterns in infectious disease before formal organisations do [26]. Social media provides big data containing useful information that can help discover those patterns. Tenkanen et al. reported that it is comparatively simple to obtain useful information from big data on social media for developing a real-time system [27].
Related research has used Twitter information to forecast mental illness [28]. Beyond seasonal epidemics, a new type of influenza virus against which there is no previous immunity can show human-to-human transmission and cause millions of deaths before a vaccine is developed. A surveillance system can use search queries from Google's search engine for influenza epidemic surveillance [29]. Quick and early estimation and prediction of the influenza epidemic before it spreads greatly helps governments, health officials, and healthcare organisations to take appropriate decisions and timely prevention measures. In addition, influenza epidemic surveillance provides information about the spread of influenza on a larger scale, and it supports preemptive measures and awareness campaigns that minimize the spread of the disease. The increased number of Internet users and researchers has helped establish the use of Google's search engine as a new monitoring scheme that complements the traditional one. Thus, Google Flu Trends tracks the queries that Google users submit about influenza; these queries correlate with the CDC influenza data and provide a projection 1 to 2 weeks before the CDC releases. Researchers [35] and Chretien et al. [36] presented a useful literature review of work in this area and described the methodologies and data used to estimate and predict the influenza epidemic. As they pointed out, some researchers used search queries to forecast influenza outbreaks. Twitter, Facebook, and Foursquare are examples of sites where individuals intentionally post updates on their daily behaviours, health status, and physical locations. Paul et al. [37] used data from Twitter to improve influenza forecasting. They observed that tweets were positively correlated with existing surveillance data provided by the CDC. Harshavardhan Achreka et al. [38] developed digital flu surveillance using Twitter data to estimate and predict the influenza epidemic. They argued that tweets collected from Twitter could substantially help to detect influenza outbreaks earlier.
Thus, the objective of the present research was to build a model that assists in predicting the influenza epidemic using Google search queries. We integrated machine intelligence with the existing time series model to enhance the prediction of the influenza epidemic.

Data Sets
Epidemiological Surveillance Data.
A weighted version of the CDC's influenza-like illness (ILI) activity level data was obtained from the Centers for Disease Control and Prevention, which routinely collects epidemiological data and national statistics on influenza incidence on a weekly basis. We collected the ILI data from January 4, 2009 (week 1) to December 27, 2015 (week 52), across a total period of 312 weeks. This period covers the influenza seasons reported by the CDC in the USA. Data from the CDC ILINet system were obtained from [43], which provides weekly influenza surveillance information on outpatient and viral illnesses at the national and regional levels. We decided to use the CDC ILI data because they form a reliable, well-established data set. All reports about the CDC ILI are publicly available [44].

Google Correlate.
Search engines have become a significant part of everyday life and an indispensable aid for understanding it. The Google search engine helps us search for a person or a place and provides important information about events, problems, solutions, and more. Many search engines are available, such as Google, Bing, AOL, and Yahoo. Since Google is the most widely used search engine, we built our models on Google's services. Google Trends provides statistics on the searches made around the world with respect to search query, location, and time. Google Flu Trends is a good example of using such data to predict the influenza epidemic. Our ultimate objective is to construct a model comparable to the GFT system and other standard designs using open-source information and an enhanced methodology. Our objective in data collection is therefore to find open-source search query data that resemble the search queries used for GFT. For each season, ILI data could be acquired from the CDC as our reference. The GFT system exploited 50 million of the most popular search queries in the United States, where a query was defined as a complete sequence of terms entered by a user, to discover the queries most closely linked to the CDC data. Using a simpler technique, Google also built Google Correlate, which takes a time series uploaded by a user and returns the 100 search queries most correlated with it at the national level, together with the time series of these queries. We therefore used this tool to obtain an open-source data set that reasonably matches the query data used in the GFT model, which Google has not released.
We uploaded the weighted CDC ILI data from January 4, 2009 (week 1) to December 27, 2015 to Google Correlate and obtained the 100 most highly correlated search queries. For each of these queries, the returned time series is not the raw search volume: Google Correlate standardises the search volume of each query by subtracting the mean and dividing by the standard deviation, so that each series has mean zero and standard deviation one across time. The output time series ranged from January 4, 2004, through January 24, 2016, because Google Correlate contains data only for that period. We compared our model with the original and revised (October 2014) Google Flu Trends models. We observed that all the search terms obtained from Google Correlate were related to influenza activity. The 10 search terms with the highest correlations were selected for predicting the influenza epidemic in this work.
To make the Google Correlate data compatible with the trend data, min-max normalisation was used to scale the data between 0 and 2. Table 1 shows the full set of search terms, which contain information about the influenza epidemic. We obtained 100 search queries, the same number as the GFT model, but we selected only 10 search queries, taking their sum for each week.
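The selection of the most correlated search terms can be illustrated with a short sketch. Google Correlate performed this ranking internally, so the code below is only an illustration of the idea under stated assumptions: the function name `top_k_correlated`, the synthetic ILI curve, and the three fake query series are all hypothetical and are not taken from the study's data.

```python
import numpy as np

def top_k_correlated(target, candidates, k=10):
    """Rank candidate series by Pearson correlation with the target series."""
    target = np.asarray(target, dtype=float)
    scores = {}
    for name, series in candidates.items():
        series = np.asarray(series, dtype=float)
        # Pearson correlation between the ILI series and one query series
        scores[name] = float(np.corrcoef(target, series)[0, 1])
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(name, scores[name]) for name in ranked[:k]]

# Synthetic illustration only: a fake weekly ILI curve and three fake queries.
rng = np.random.default_rng(0)
weeks = np.arange(104)
ili = 1.0 + np.sin(2 * np.pi * weeks / 52.0)
queries = {
    "query_a": ili + rng.normal(0.0, 0.1, weeks.size),  # closely tracks ILI
    "query_b": ili + rng.normal(0.0, 1.0, weeks.size),  # noisy version of ILI
    "query_c": rng.normal(0.0, 1.0, weeks.size),        # unrelated noise
}
top2 = top_k_correlated(ili, queries, k=2)
print(top2)
```

With real data, `candidates` would hold the 100 query series returned by Google Correlate and `k` would be 10.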

Normalisation Method.
Normalising a time series typically helps to improve the fit of time series models. The min-max method, implemented in Matlab, was employed for scaling the data.
This method transforms the data to the range 0 to 2:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \left(\mathrm{New}_{\max_x} - \mathrm{New}_{\min_x}\right) + \mathrm{New}_{\min_x},$$

where $x_{\min}$ is the minimum of the data and $x_{\max}$ is the maximum of the data, $\mathrm{New}_{\min_x}$ is the new minimum, 0, and $\mathrm{New}_{\max_x}$ is the new maximum, 2.
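As a minimal sketch of this scaling step, the following Python function maps a series onto the range 0 to 2. The study used Matlab; Python is used here only for illustration, and the function name `min_max_scale` and the handling of a constant series are assumptions rather than details from the paper.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=2.0):
    """Linearly rescale x so that its minimum maps to new_min and its maximum to new_max."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Assumption: a constant series has no spread, so map it to the lower bound.
        return np.full_like(x, new_min)
    return (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

# Standardised values such as those returned by Google Correlate can be negative.
scaled = min_max_scale([-1.2, 0.0, 0.6, 2.4])
print(scaled)
```

After scaling, the smallest value becomes 0 and the largest becomes 2, which puts the standardised query series on a common non-negative scale.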

Prediction Models.
In this section, the proposed system is presented. Figure 1 shows the generic framework of the proposed system.

The INtelligent Time Series (INTS) Model.
The INTS model predicts influenza outbreaks using Google search queries. Figure 2 illustrates how the INTS model forms a hybrid of an existing time series prediction model and the k-means clustering algorithm. The time series model is used to predict the influenza epidemic from Google search queries, while the k-means clustering algorithm is employed separately to analyse the search patterns obtained from Google. The novelty of the INTS model lies in its integration of the results obtained from the WES time series prediction model with the centroids obtained from the k-means algorithm. The INTS model is thus a function of the WES predictions and the k-means centroids:

$$EP_i = f(P_i, C_i),$$

where $f$ is the prediction function generated from the time series model and the clustering centroids. Integrating the two improved the prediction results. A comparative prediction result between the INTS model and existing time series models is presented, and the INTS model is found to outperform them. The steps of the proposed INTS algorithm are discussed in the following subsections. For the INTS algorithm (Algorithm 1), let $S_i$ be the sample of the $i$th day, $K$ the number of clusters, $K_i$ the $i$th cluster, and $C_i$ the centroid of the $i$th cluster. Let $P_i$ be the prediction for the $i$th sample obtained by using the WES model, and let $EP_i$ be the enhanced prediction for the $i$th sample obtained by using the proposed model. The components of the proposed INTS system are as follows.

(1) Weighted Exponential Smoothing (WES) Model. Exponential smoothing models are among the most important prediction approaches and are widely used in industry and commerce.
In its basic form, the smoothed value is updated as

$$S_{T+1} = \alpha x_T + (1 - \alpha) S_T,$$

where $\alpha$ is the smoothing constant, $x_T$ is the observation at time $T$, and $S_{T+1}$ is the smoothed data. The exponential smoothing method is a generalisation of the moving average technique and is suited to stationary time series data. The idea behind exponential smoothing is to smooth the original time series and to use the smoothed values to forecast future values.
The weighted exponential smoothing (WES) model is among the most commonly used models for forecasting from time series. It is used when the data show a roughly horizontal pattern, without long-term trends or short-term fluctuations. This means that the WES strategy is used to predict a time series when the series is at its normal level.
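To make the smoothing idea concrete, the sketch below implements simple exponential smoothing with the standard recursion, in which each forecast is alpha times the latest observation plus (1 - alpha) times the previous forecast. This is the textbook form, not necessarily the exact WES variant used in the study. The function name and the choice to initialise with the first observation are assumptions.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts: forecasts[t] is the prediction for series[t]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Assumption: the first forecast is initialised with the first observation.
    forecasts = [series[0]]
    for x in series[:-1]:
        # New forecast = alpha * latest observation + (1 - alpha) * previous forecast
        forecasts.append(alpha * x + (1.0 - alpha) * forecasts[-1])
    return forecasts

ili = [1.0, 2.0, 4.0, 3.0]
pred = exponential_smoothing(ili, alpha=0.5)
print(pred)  # [1.0, 1.0, 1.5, 2.75]
```

A value of alpha close to 1 makes the forecast follow the most recent observation closely, while a value close to 0 gives a smoother, slower-moving forecast.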
(2) K-Means Clustering Algorithm. Clustering time series is one of the most difficult clustering problems in time series data mining. Subsequence clustering uses a sliding window to extract subsequences from a single long time series and then clusters these segments. Another type is time-point clustering, which clusters time points based on a combination of their temporal proximity and the similarity of their values. This kind of clustering is similar to time series segmentation, but it differs in that not all points need to be assigned to a cluster, because some of them are deemed noise. In subsequence clustering, it is important to consider how the technique can be used to categorise a vast quantity of time series data and how it can generate significant results. Recent studies have focused on subsequence time series clustering to improve time series models. Our goal was to use the cluster centroids to improve the WES time series model, and we found this technique to be the more viable one for that purpose. K-means clustering is one of the simplest unsupervised learning methods for addressing the well-known clustering problem. The procedure is simple: it partitions a given data set into a fixed number of clusters, chosen in advance, by minimising the within-cluster sum of squared distances:

$$J = \sum_{k=1}^{c} \sum_{i=1}^{c_k} \lVert x_i - \mu_k \rVert^2,$$

where $\lVert x_i - \mu_k \rVert^2$ is the squared Euclidean distance between data point $x_i$ and the centroid $\mu_k$ of cluster $k$, $c_k$ is the number of data points in cluster $k$, and $c$ is the number of clusters.
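The following is a minimal NumPy sketch of the k-means procedure (Lloyd's algorithm), which alternates between assigning points to their nearest centroid and recomputing the centroids. It is not the implementation used in the study. The function name, the random initialisation, the stopping rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Return (centroids, labels, within-cluster sum of squared distances)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids, then the objective value.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    objective = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, objective

data = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                 [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
centroids, labels, objective = k_means(data, k=2)
print(centroids, labels, objective)
```

In the INTS setting, the rows of `points` would be the normalised search query samples and the returned centroids would be the values combined with the WES predictions.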

Support Vector Machine Regression (SVMR).
The support vector machine regression (SVMR) model is an increasingly common variant of the support vector machine used for regression problems. Although the support vector machine algorithm is common in classification, SVMR is trained to produce numerical values.
The formulations of the SVM and SVMR algorithms differ, but the basic idea is shared: map the data set $X$ into a high-dimensional feature space $F$ through a mapping function $\pi$ induced by the kernel function, and perform linear regression in $F$ [45,46]. The SVMR algorithm is useful for problems that would require many parameter estimates under traditional statistical methods. Whereas the SVM algorithm classifies data, SVMR uses the ε-insensitive loss. The main principle is the same as in SVM classification, but a different function is minimised. In ε-insensitive support vector regression, the goal is to find a function $f(x)$ that deviates by at most $\varepsilon$ from the actually obtained targets $y_i$ for all training data and is at the same time as flat as possible:

$$f(x) = \langle w, x \rangle + b,$$

with $x$ replaced by $\pi(x)$ when a kernel is used. For this function, we have to solve the following problem:

$$\min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^2$$

subject to

$$y_i - \langle w, x_i \rangle - b \le \varepsilon, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon.$$

If the problem is not feasible, we introduce the slack variables $\xi_i, \xi_i^*$, which gives the so-called soft margin formulation:

$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \left(\xi_i + \xi_i^*\right)$$

subject to

$$y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\, \xi_i^* \ge 0.$$

The constant $C > 0$ determines the trade-off between the flatness of $f(x)$ and the amount up to which deviations larger than $\varepsilon$ are tolerated. This corresponds to the ε-insensitive loss function $|\xi|_\varepsilon$:

$$|\xi|_\varepsilon = \begin{cases} 0, & |\xi| \le \varepsilon, \\ |\xi| - \varepsilon, & \text{otherwise.} \end{cases}$$

Figure 3 displays the hyperplane of the SVM algorithm for classification and regression purposes. The SVM algorithm is a very powerful machine learning algorithm for classification, and it can also solve regression problems, as shown in Figure 3. Figure 4 displays the process of the SVMR model for predicting influenza outbreaks using Google search queries.
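The role of the ε tolerance and of the constant C can be illustrated with a small sketch of the ε-insensitive loss and the corresponding soft margin objective for a linear model. This is a didactic evaluation of the objective, not a solver, and the function names and toy numbers are assumptions. At the optimum, the two slack variables of a sample sum to the loss computed here.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the residual outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

def svr_primal_objective(w, b, X, y, C=1.0, epsilon=0.1):
    """Soft margin objective 0.5 * ||w||^2 + C * sum of epsilon-insensitive losses."""
    w = np.asarray(w, dtype=float)
    pred = np.asarray(X, dtype=float) @ w + b  # linear model f(x) = <w, x> + b
    return 0.5 * float(w @ w) + C * float(epsilon_insensitive_loss(y, pred, epsilon).sum())

X = [[1.0], [2.0], [3.0]]
y = [1.05, 2.5, 2.0]
losses = epsilon_insensitive_loss(y, [1.0, 2.0, 3.0], epsilon=0.1)
objective = svr_primal_objective([1.0], 0.0, X, y, C=1.0, epsilon=0.1)
print(losses, objective)
```

The first sample lies inside the tube and contributes no loss, whereas the other two are penalised only for the part of their residual that exceeds ε. A larger C penalises these violations more heavily relative to the flatness term.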

Artificial Neural Network Using Particle Swarm Optimisation (ANNPSO).
Particle swarm optimisation (PSO) was developed as a global optimisation method: a population-based stochastic optimisation technique for continuous nonlinear functions. Compared with other metaheuristics, PSO has gained popularity and has been clearly shown to be an effective optimisation algorithm. Every member of the swarm, called a particle, flies around the multidimensional search space with a velocity that is constantly updated by the particle's own experience and by the experience of the particle's neighbours or of the whole swarm. This gives rise to two variants of the PSO algorithm: PSO with a global neighbourhood and PSO with a local neighbourhood. In the global variant, called the gbest model, every particle moves towards its best previous position and towards the best particle in the whole swarm [47,48]. In the local variant, called lbest, every particle moves towards its best previous position and towards the best particle in its restricted neighbourhood. Because PSO has a memory of the past, knowledge of good solutions is kept by all particles, and particles cooperate to share information within the swarm. The PSO algorithm is based on a velocity update and a position update. The velocity update is

$$v_i(t+1) = w\, v_i(t) + c_1 r_1 \left(p_i - x_i(t)\right) + c_2 r_2 \left(g - x_i(t)\right),$$

and the position update is

$$x_i(t+1) = x_i(t) + v_i(t+1),$$

where $x_i$ and $v_i$ are the position and velocity of particle $i$, $p_i$ is its personal best position, $g$ is the best position found by the swarm, $c_1$ and $c_2$ are acceleration coefficients, and $r_1$ and $r_2$ are random numbers drawn uniformly from $[0, 1]$. The random inertia weight $w$ takes the commonly used form

$$w = 0.5 + \frac{\mathrm{rand}()}{2}.$$

An artificial neural network (ANN) is a type of computational model that is regularly used in machine learning, software engineering, and other research disciplines. This computational model is designed to mirror the immense system of neurons in a brain. It is commonly used for problems that are hard to program explicitly, because of its capacity to learn from examples.
The type of ANN used in this research is a fully connected feedforward network, in which each input is connected to all the hidden neurons. For simplicity and training speed, only a single hidden layer was used. PSO is a global, population-based search algorithm that can be used to train neural networks, identify neural network architectures, adjust network learning parameters, and optimize network weights. PSO avoids becoming trapped in a local minimum because it does not rely on gradient information [47]. The role of PSO in the ANN is to obtain the best set of weights (the particle position), with several particles moving to find the best solution. The dimension of the search space is the total number of weights and biases. By following the personal best solution of each particle and the global best of the entire swarm, the algorithm completes the optimisation. The success or failure of a population-based algorithm depends on its ability to balance exploration and exploitation efficiently. An inappropriate balance between the two can result in poor optimisation, which may suffer from premature convergence, local optimum trapping, and stagnation. Figure 5 shows the flow process of the ANNPSO model for predicting the influenza epidemic using Google search queries.
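The velocity and position updates can be sketched as a small gbest PSO in Python. This is an illustration under stated assumptions, not the study's implementation: the acceleration coefficients `c1 = c2 = 1.494`, the random inertia weight drawn as `0.5 + rand/2`, the search bounds, and the toy sphere objective are all assumed here. The swarm size of 25 and the 200 iterations match the settings reported for the experiments. In ANNPSO the position vector would hold the network weights and biases, and the objective would be the network's prediction error.

```python
import numpy as np

def pso_minimise(objective, dim, n_particles=25, n_iter=200,
                 c1=1.494, c2=1.494, bounds=(-5.0, 5.0), seed=0):
    """Global-best PSO; returns the best position found and its objective value."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    pos = rng.uniform(low, high, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    g = pbest_val.argmin()
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for _ in range(n_iter):
        w = 0.5 + rng.random() / 2.0  # assumed random inertia weight in [0.5, 1)
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity update: inertia + pull towards personal best + pull towards global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # Position update, kept inside the search bounds.
        pos = np.clip(pos + vel, low, high)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = pbest_val.argmin()
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, float(gbest_val)

# Toy objective (sphere function) standing in for the network error.
best_pos, best_val = pso_minimise(lambda p: float(np.sum(p ** 2)), dim=3)
print(best_pos, best_val)
```

Because no gradient of the objective is needed, the same loop can be applied to any function that maps a weight vector to an error value.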

Performance Metrics.
Four indicators were used to evaluate the prediction models: the mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE) were used as error measures, alongside the correlation coefficient R reported with the results. The error measures are defined as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{t=1}^{N} \left(x_t - \hat{x}_t\right)^2,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left(x_t - \hat{x}_t\right)^2},$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left|x_t - \hat{x}_t\right|,$$

where $x_t$ are the observed responses, $\hat{x}_t$ are the estimated responses, and $N$ is the total number of observations.
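For concreteness, the three error measures can be computed as in the short sketch below. The function names and the four-point example are illustrative assumptions and are not taken from the study's data.

```python
import math

def mse(observed, estimated):
    """Mean square error."""
    return sum((o - e) ** 2 for o, e in zip(observed, estimated)) / len(observed)

def rmse(observed, estimated):
    """Root mean square error."""
    return math.sqrt(mse(observed, estimated))

def mae(observed, estimated):
    """Mean absolute error."""
    return sum(abs(o - e) for o, e in zip(observed, estimated)) / len(observed)

observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.0, 2.5, 2.0, 4.5]
print(mse(observed, estimated))   # 0.375
print(rmse(observed, estimated))
print(mae(observed, estimated))   # 0.5
```

Since RMSE is the square root of MSE, the two measures rank models identically, while MAE is less sensitive to a few large errors.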

Results Analysis
Our analyses used the data from January 4, 2009 (week 1) to December 27, 2015 (week 52) across a total period of 312 weeks, covering 7 years of the CDC data. The CDC data were uploaded to Google Correlate, yielding 100 search query terms related to the influenza epidemic. Of these, the 10 search queries with the highest correlation were selected and analysed in this study. The min-max method was used for normalisation, and three experiments were conducted to obtain the prediction results. These three experiments are presented in the following sections.

Results Analysis of the INTS Model.
The weighted exponential smoothing algorithm was applied to the search terms obtained from Google Correlate. The model depends on the smoothing constant α, which was tested with values from 0.1 to 0.9, and the MSE was examined for each of these values. The value α = 0.9 was selected as the smoothing constant because it gave fewer errors than the other values and was therefore appropriate for data prediction. To enhance the prediction of the conventional weighted exponential smoothing model, the k-means clustering algorithm was used. The first step was to determine the number of clusters. We started with eight clusters. Because one cluster contained very few objects, we reduced the number of clusters until every cluster contained a sufficient number of objects. Finally, we determined that five clusters were appropriate. The centroids of these clusters were then considered; each object is assigned to a specific cluster through its centroid. The centroids were integrated with the results obtained from the existing WES algorithm. The predictive capabilities of our intelligent model were compared with those of the existing GFT, ARGO, GFT + AR, AR(3), and naive models, using the real CDC data as the reference. MSE, RMSE, and MAE were used to evaluate the performance of the proposed INTS model in comparison with the existing prediction models. The obtained results (Table 3) showed significant advantages for our proposed model: among the models compared, the INTS model was the most effective and robust predictor for enhancing the prediction of the influenza epidemic using search terms. Figure 6 illustrates the prediction performance of the intelligent time series model. Figure 7 shows the regression plot of the INTS model.
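The choice of the smoothing constant can be sketched as a simple grid search over α from 0.1 to 0.9, keeping the value with the lowest MSE. The code is a minimal illustration using the standard exponential smoothing recursion. The helper names and the short synthetic series are assumptions, so the selected value here says nothing about the value selected in the study.

```python
def ses_forecasts(series, alpha):
    """One-step-ahead simple exponential smoothing forecasts."""
    forecasts = [series[0]]  # assumption: initialise with the first observation
    for x in series[:-1]:
        forecasts.append(alpha * x + (1.0 - alpha) * forecasts[-1])
    return forecasts

def forecast_mse(series, alpha):
    """MSE between the series and its one-step-ahead forecasts."""
    preds = ses_forecasts(series, alpha)
    return sum((x - p) ** 2 for x, p in zip(series, preds)) / len(series)

def select_alpha(series, alphas):
    """Return the (alpha, MSE) pair with the lowest MSE over the candidate grid."""
    best_alpha, best_mse = None, float("inf")
    for alpha in alphas:
        err = forecast_mse(series, alpha)
        if err < best_mse:
            best_alpha, best_mse = alpha, err
    return best_alpha, best_mse

alphas = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
series = [1.0, 1.2, 1.8, 2.9, 4.1, 5.0, 4.4, 3.1, 2.0, 1.3]  # synthetic example
best_alpha, best_mse = select_alpha(series, alphas)
print(best_alpha, best_mse)
```

The same loop extends directly to RMSE or MAE by swapping the error function.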

Results Analysis of the ANNPSO Model.
The ANNPSO intelligent model was implemented to predict the influenza epidemic using Google search terms. The min-max method was used to scale the data in order to enhance the prediction models. The ANNPSO model was adapted with a view to developing a smart healthcare system. Since the weights of the ANN need to be optimized, the positions of the particles in the PSO algorithm need to be tracked. The problem space includes all combinations of the weight values of the ANN. This search space has n dimensions, where n is the total number of weights to optimize. Each particle has an n-dimensional position vector and velocity vector. The particle swarm flies around this search space to find the optimum weight set. To assess the fitness of a particle, its weights are assigned to the ANN and the resulting predictive accuracy is measured; this gives the particle's fitness. If this fitness is the best so far for the particle, it is taken as its personal best, and if it is the best so far for the swarm, it is taken as the global best. The particle swarm was thus used to improve the weights of the ANN. In total, 25 particles were considered for 200 iterations. Table 3 summarizes the prediction results of ANNPSO for influenza epidemics. The adapted model obtained satisfactory results: MSE = 0.0024, RMSE = 0.493, MAE = 0.25, and R = 0.94%. These results indicate that Google search queries have a strong relationship with clinical data. Figures 8 and 9 display the performance of the ANNPSO model.

ALGORITHM 1: INTS algorithm.
(1) Cluster the search queries $S_i$ obtained from Google Correlate using a clustering technique (k-means).
(2) Apply a conventional time series prediction model, such as the WES model.
(3) Obtain the prediction $P_i$ using the WES model.
(4) Modify $P_i$ using $C_j$, the centroid of the $j$th cluster: $EP_i = f(P_i, C_j)$.

Results Analysis of the SVMR Model.
Table 3 also reports the prediction results obtained from adapting the SVMR model, which performed well. The support vector machine algorithm was applied to predict influenza using the RBF kernel. The RBF kernel function is robust and efficient compared with other SVM kernel functions. The parameters C, γ, and ε were varied to attain the best performance, and the optimum values were selected according to the lowest errors obtained. The prediction results show that there is a relationship between clinical data and web search terms. The obtained MSE, RMSE, and MAE were 0.136, 0.369, and 3.888, respectively, with R = 0.99, which indicates a strong relationship between clinical data and web search activity.
Figures 10 and 11 exhibit the estimation performance of the SVMR model for predicting the influenza outbreak.

Discussion
In the present research paper, some significant implications have been presented for estimating and predicting the influenza epidemic at the local and national levels of the USA. Existing prediction models with different input data sets have used the pattern of the whole period and of the seasonal period, whereas the INTS model uses a single input data set covering the whole period. The results obtained from the intelligent model were compared with those of different existing models using different input data sets over the whole period and by season. We observed that the INTS model outperformed all the alternative models with respect to MSE, RMSE, MAE, and correlation of increment. Table 4 shows the prediction results of the adapted models against the existing models.
Further, an important advantage of tracking Google search behaviour is the possibility of earlier detection. This is significant because action by clinical or official health bodies is often delayed until an investigation has been completed. Furthermore, people seek health information on the web before, or even instead of, medical visits, so search activity can help detect earlier stages of the illness. It is therefore imperative to pay attention to the use of Google search queries for developing models that can help detect influenza in its earlier stages. This can be done in combination with CDC data on patients in the USA. Our intention is to demonstrate a robust, dynamic, and more accurate methodology for predicting the influenza epidemic. We have concluded that the INTS model is stronger and more robust than the alternative models, such as ARGO and GFT. The prediction results of the INTS model demonstrated its superiority in accuracy and robustness over all the alternative models that predict influenza from Google searches. The methodology of the proposed system can help improve the healthcare system by using Google search queries.

Conclusions
Web activities are an important source of health information. Consequently, web searches provide vital information about the activity of numerous infectious diseases. For example, an influenza epidemic can occur when an infectious disease quickly spreads to many people, and web data make it possible to examine thoroughly the relationship between influenza search queries and actual influenza occurrence. Three adapted prediction models, namely INTS, ANNPSO, and SVMR, were presented to improve the prediction of the influenza epidemic by using Google search queries. The methodology of these proposed models outperformed other existing models and provided higher accuracy and robustness in predicting influenza. The ANNPSO and SVMR models are existing methods that were applied here to predict influenza using Google search terms. The novelty of the proposed research, however, is the development of the INTS model, which is a new model for predicting the influenza epidemic.
The prediction results demonstrate that the proposed INTS model can be effectively employed to predict influenza outbreaks using Google search queries. A comparative prediction result between the GFT and ARGO models and the present SVMR and ANNPSO models is presented. The results of the alternative model are 0.0013, 0.0369, 0.0185, and 0.97 for the MSE, RMSE, MAE, and correlation of increment performance measures, respectively. It is also observed that the INTS model performs better than existing models such as ARGO and GFT.

Conflicts of Interest
The authors declare that they have no conflicts of interest.