Early Detection of Seasonal Outbreaks from Twitter Data Using Machine Learning Approaches

Seasonal outbreaks have several different periods: they occur primarily during winter in temperate regions, while in tropical regions influenza may occur throughout the year, triggering outbreaks more irregularly. Similarly, dengue occurs at the start of the rainy season in early May and reaches its peak in late June. Dengue and flu affected various countries in the years 2017–2019, and streaming Twitter data reveals the status of dengue and flu outbreaks in the most affected regions. This work shows that Social Media Analysis (SMA) can be used to detect epidemic outbreaks and to understand the sentiment of social media users regarding various diseases. Providing awareness about seasonal outbreaks through SMA is an effective way for researchers and healthcare responders to detect outbreaks early. The proposed model aims to find the sentiment about the disease in tweets, and tweets related to seasonal outbreaks are classified into two classes: disease positive and disease negative. This work proposes a machine-learning-based approach to detect dengue and flu outbreaks on the social media platform Twitter, using four machine learning algorithms: Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Decision Tree (DT), with Term Frequency-Inverse Document Frequency (TF-IDF) features. For the experimental analysis, two datasets (dengue and flu) are analyzed individually. The experimental results show that the RF classifier outperforms the comparison models in terms of accuracy, precision, recall, F1-measure, and Receiver Operating Characteristic (ROC) curve. The proposed work offers favorable performance, with precision, accuracy, recall, and F1-measure ranging from 84% to 88% for conventional machine learning techniques.


Introduction
An infectious disease may have several periods that typically occur in a specific season in the prevaccination era. Rapid identification of a seasonal outbreak is essential so that healthcare professionals can respond more quickly and efficiently. Seasonal outbreaks can lead to serious diseases, such as influenza or Influenza-Like Illness (ILI), which can lead to death when an epidemic breaks out in a region [1,2]. Influenza is a respiratory disease that causes a significant number of deaths globally every year. The flu is often medically mild and is recognized by symptoms such as headache, sneezing, fever, sore throat, and cough [3]. Influenza shots are almost always available during the winter season, and an infected person should consult a specialist rather than a general practitioner. Influenza can harm the patient and can lead to a much more severe condition if not treated. As reported in [4], flu is an epidemic outbreak, and where the proliferation of infectious diseases, particularly influenza, is a real risk to people, the government must implement appropriate health surveillance to control the epidemic. Similarly, dengue is a mosquito-borne viral infection that causes serious ILI and can also lead to a potentially fatal complication, severe dengue fever. Dengue is among the most rapidly spreading contagious diseases in the world. Therefore, real-time monitoring, early detection, and identification of contagious diseases related to outbreaks of influenza or dengue are important for public health [1,5].
In health surveillance systems, Online Social Media (OSM) offers effective resources for detecting epidemic outbreaks and an active way of coping with them [6,7]. The effect of seasonal epidemic outbreaks (i.e., dengue or influenza) on population safety can be minimized by early notice of disease detection. OSM can also be integrated into surveillance systems to track the frequency of infectious outbreaks faster than healthcare practitioners and health agencies such as the Centers for Disease Control and Prevention (CDC) [1]. The CDC uses the Influenza-Like Illness Surveillance Network (ILINet), a platform through which healthcare professionals track early alerts of flu outbreaks. While effective, it is an expensive and slow process: it takes weeks or even months before data becomes accessible from the CDC. Therefore, various studies concentrate on alternatives that track ILI by leveraging SMA and identify early alerts regarding infectious diseases through real-time analysis. OSM applications like Twitter can be used to predict public disease outbreaks and can promote timely information [8][9][10]. With the help of OSM, it is possible to alert healthcare consultants so that they can provide the appropriate services and monitor the epidemic. Nowadays, people frequently use OSM applications to share ideas, opinions, and health status, specifically when there is an epidemic in a region. To mitigate epidemic outbreaks, SMA can deliver efficient information for disease monitoring and is a useful way to interact with the public [11][12][13][14][15][16][17].
In this paper, a machine-learning-based intelligent model is proposed that retrieves text on dengue and flu from Twitter. The tweets are categorized into two groups, positive and negative, where positive tweets represent dengue- and flu-infected cases (i.e., tweets that describe the symptoms of dengue or flu in infected people) and negative tweets represent cases that are neither dengue- nor flu-infected. The main contribution of the proposed work is the use of the machine learning techniques RF, KNN, SVM, and DT for tweet identification with the help of a TF-IDF approach. A unigram approach is applied, which means that public sentiment on dengue is derived from individual words in tweets. The early detection of seasonal outbreaks (dengue and flu in our case) through SMA can provide awareness about these outbreaks. The model identifies dengue- and flu-related tweets in various regions using tweet data. Some people still think dengue is not dangerous and consider it similar to a "seasonal flu", a seasonal disease with available treatments and vaccines, even though seasonal flu itself kills around 30,000 Americans a year. The rest of the paper is structured as follows: Section 2 presents a brief review of related work. The approach followed to conduct the experiments is described in Section 3, while Section 4 presents the evaluation and analysis of the results. Concluding remarks and future research lines are presented in Section 5.

Related Work
In this section, a review of related work conducted on detecting seasonal epidemics from social media is presented.
With the advancement of machine learning, it has been increasingly adopted for SMA regarding disease detection. Wang et al. [18] developed a framework for influenza prediction based on real-time OSM data. In their work, they deployed a Partial Differential Equation (PDE) model for prediction. Furthermore, with flu reduction evaluation, they predicted the future volume of tweets. Chen et al. [19] presented a model based on two temporal topic models, one supervised and one unsupervised, to capture users' hidden states and geographic information from their tweets for better trend estimation. The gap between phenomenological disease surveillance strategies and epidemiological techniques has been narrowed by this approach using tweet data.
Recently, influenza outbreaks have been mainly reported by the CDC; influenza now spreads throughout the year, with a rise in cases between January and March. The CDC has been encouraged by professionals to perform its role in supporting awareness of early detection and to provide the necessary resources to control or cure the flu epidemic, which becomes a risk for people with the advent of cold weather. There was a need for media campaigns to raise awareness of the flu shot, and the healthcare organizations concerned must initiate efforts to educate people about the impact of flu on their lives, the preventive actions they can take, and the appropriate treatments and medicines [20]. Now, in the OSM era, disease-related information is also shared on OSM directly or indirectly [2]. Information about flu outbreaks can be easily extracted through SMA, particularly from the microblogging platform Twitter, to detect influenza outbreaks early so that healthcare professionals can provide resources or medications and control the epidemic [1,2]. Dengue is a globally transmitted contagious infection [21]. To detect outbreaks of dengue fever effectively and to examine the effect on primary prevention, dengue monitoring data is highly needed [22]. In China, dengue has become a significant public health concern, with expanding affected regions and a large number of cases recently [23]. However, for infectious disease and dengue virus surveillance systems in a country as large as China, no extensive steps have been taken so far to predict and monitor early warnings of dengue outbreaks [23].
With a strong cooperative response from government and Nongovernmental Organizations (NGOs), healthcare professionals are helping to prevent further spread of the epidemic [24]. Public awareness of seasonal epidemic outbreaks, especially dengue fever, an understanding of the actual problem, and approaches to monitoring dengue outbreaks are significant considerations. Perceptions, behaviour, and approaches regarding dengue in cities are explored in many studies [4,25]. Social networking could be utilized efficiently to classify people infected with diseases and to influence health awareness (e.g., influenza, dengue fever, anxiety, malaria, and measles), with interventions to improve public health. Through the use of SMA, emergency alert signs of seasonal outbreaks can be identified, and the time between occurrence and diagnosis can thus be shortened. Table 1 lists the relevant research studies on outbreak detection using machine learning approaches. The literature shows that previous studies on OSM-based disease detection focused on conventional machine learning approaches, including Naïve Bayes, Support Vector Machine, and linear regression, and that many works focus on detecting the frequency of tweets about a disease. The important considerations of the proposed work are primary prevention and intervention, which involve identification and alert systems for infectious disease outbreaks, epidemic tracking, and the modelling and evaluation of public health emergencies. Recent years have seen fast development of machine learning approaches in rapidly changing, dynamic, and data-rich environments to achieve these tasks. In this article, a set of current publications on how to improve the use of the virtual world's sensor and OSM data to enhance seasonal outbreak detection and early warning capacities is summarized. Furthermore, a collection of methods aimed at mapping the spread of seasonal outbreaks and estimating epidemiological data from Twitter messages is also presented. The proposed work focuses on the supervised machine learning approaches DT, KNN, SVM, and RF and on how to improve the performance of these traditional machine learning approaches for analyzing seasonal epidemic outbreaks. The proposed work presents promising results for traditional machine learning techniques.

The Proposed Methodology
The proposed methodology incorporates various components, including data gathering, preprocessing, feature extraction, and classification. Figure 1 represents the proposed methodology adopted for dengue detection. In the following subsections, we discuss each component in detail.

Data Gathering.
In this work, the benchmark dataset on dengue and flu designed by Amin et al. [10] is analyzed. They labelled 6000 tweets on dengue and flu as infected or not infected. Tweets are considered positive when they describe the symptoms of dengue/flu in infected people, and the remaining tweets are considered negative when they carry dengue-/flu-related information but not symptoms, as shown in Table 2, while Table 3 shows the total number of labelled tweets.

Preprocessing.
After the text is obtained, the collected data is subjected to preprocessing steps commonly applied in natural language processing [30], such as removing stop words, sparse terms, and particular words, as those do not convey meaningful information. Then punctuation, accent marks, and other diacritics were removed. After that, the tweet text was converted into tokens. These preprocessing steps were incorporated to enhance the efficiency of the proposed model and to improve the processing speed.
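The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the stop-word list `STOP_WORDS` and the function name `preprocess` are hypothetical, and the removal of sparse terms is omitted.

```python
import string
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "for", "i", "in", "is", "of", "the", "to"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet: str) -> List[str]:
    """Lowercase, strip diacritics and punctuation, tokenize, drop stop words."""
    text = tweet.lower()
    # Decompose accented characters, then drop the combining marks (diacritics).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces so "fever,cough" still yields two tokens.
    text = text.translate(_PUNCT_TO_SPACE)
    # Whitespace tokenization, followed by stop-word removal.
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("Been fighting dengue for the last 5 days. Today I finally get to rest.")
print(tokens)
```

Whether digits such as "5" should be kept is a design choice that the text does not specify; the sketch keeps them.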

Feature Extraction.
To convert text data into numbers, feature vectors and TF-IDF are adopted in this work. A feature vector converts a tweet into a matrix of token counts, while TF-IDF is among the most common techniques for feature extraction from text in machine learning [31]. In a given corpus or dataset, the TF-IDF value indicates the significance of a term. TF-IDF calculates how significant a word is by weighting its frequency of occurrence in a text against how often the same word appears in other texts. If a word appears several times in a specific document but not in others, then it may be extremely important to that particular document, and thus more significance is given to it. Mathematically, it is calculated as follows:

$$\mathrm{TF} = \frac{\text{number of occurrences of a word in a document}}{\text{total number of words in a document}}, \qquad \mathrm{IDF} = \log\frac{\text{total number of documents}}{\text{number of documents containing the word}}. \tag{1}$$

The terms in the equations above are defined as follows. Term Frequency (TF): it measures how often a word occurs in a given tweet, relative to the length of the tweet. Inverse Document Frequency (IDF): it measures how rare a word is across all tweets. The TF-IDF score, the product of TF and IDF, indicates the significance of a word in a given document.
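As a worked illustration of the TF and IDF definitions, the following sketch computes TF-IDF weights for a toy corpus of tokenized tweets. It assumes the natural logarithm (the text does not state the base), and the function names and the toy corpus are hypothetical.

```python
import math
from typing import Dict, List


def term_frequency(term: str, doc: List[str]) -> float:
    """Occurrences of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term: str, docs: List[List[str]]) -> float:
    """log(total documents / documents containing the term); natural log assumed."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document."""
    return [
        {
            term: term_frequency(term, doc) * inverse_document_frequency(term, docs)
            for term in set(doc)
        }
        for doc in docs
    ]


# Toy corpus of three already-tokenized tweets (illustrative only).
docs = [["dengue", "fever", "dengue"], ["flu", "fever"], ["flu", "season"]]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "dengue" occurs twice and appears in no other document, so it receives a higher weight than "fever", which also appears in the second document. Library implementations often apply smoothing or normalization, so their values can differ from this plain formula.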

Machine Learning Models
Four machine learning algorithms (DT, KNN, SVM, and RF) are applied and evaluated for the classification task. In the following subsections, we will discuss those algorithms and explain the evaluation measures used.

Decision Tree
One of the foremost machine learning techniques is the Decision Tree (DT) [32]. DT can be used to solve both classification and regression problems, but it is most commonly used for classification. It builds classification models in a tree-like structure, where internal nodes represent the attributes of a dataset, branches represent the decision rules, and each leaf node represents an outcome. The nodes of a DT are arranged in levels, in which the root node is the first or top-most node. All internal nodes (i.e., nodes having at least one child) contain tests on input variables or attributes. Based on the test result, a sample is routed to the relevant child node, and the test-and-split process continues until a leaf node is reached. The leaf or terminal nodes correspond to the decision outcomes. DTs have been observed to be simple to understand and easy to learn and are a basic component of many procedures for medical diagnosis. When navigating the tree to classify a sample, the results of all the tests at each node along the path provide adequate information to infer its class. In the mathematical formulation, x denotes the leaves of the tree, u the root node, and v a subset of the tree.
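The traversal from the root to a leaf can be sketched as follows. This is only an illustration of how a fitted tree classifies a sample, not of how the tree is learned; the tree below is hand-built, and its features ("fever", "season") and thresholds are hypothetical TF-IDF weights chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # outcome stored at a terminal node


@dataclass
class Node:
    feature: str      # attribute tested at this internal node
    threshold: float  # decision rule: value <= threshold goes left, otherwise right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], sample: Dict[str, float]) -> str:
    """Apply the test at each internal node until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        value = sample.get(node.feature, 0.0)  # absent terms have weight 0
        node = node.left if value <= node.threshold else node.right
    return node.label


# Hand-built, hypothetical tree: symptom words suggest an infected case unless
# the tweet is mostly about the season in general.
tree = Node(
    feature="fever",
    threshold=0.1,
    left=Leaf("negative"),
    right=Node(feature="season", threshold=0.1, left=Leaf("positive"), right=Leaf("negative")),
)

print(classify(tree, {"fever": 0.4}))
print(classify(tree, {"fever": 0.4, "season": 0.5}))
```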

K-Nearest Neighbor
One of the simplest classification techniques is the K-Nearest Neighbor (KNN) method [29]. It can be considered a simplified version of the Naïve Bayes classifier. Unlike the Naïve Bayes technique, the KNN approach does not need probability values to be considered. The "K" in KNN is the number of nearest neighbors whose labels are polled to classify a sample. For the same sample, different values of K can produce different classification accuracy. In the mathematical formulation, u_i is the ith instance of the samples and y is the predicted result.
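The polling of the K nearest neighbors can be illustrated with a short sketch. It assumes Euclidean distance and a simple majority vote, neither of which is specified in the text, and the two-dimensional training points are made-up feature vectors.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_X: List[Sequence[float]], train_y: List[str],
                x: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training points closest to x."""
    # Pair every training point's Euclidean distance to x with its label.
    distances = sorted((math.dist(u, x), label) for u, label in zip(train_X, train_y))
    # Poll the labels of the k nearest neighbours.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D feature vectors (illustrative only).
train_X = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (0.15, 0.7)]
train_y = ["positive", "positive", "negative", "negative", "negative"]

print(knn_predict(train_X, train_y, (0.85, 0.15), k=3))
print(knn_predict(train_X, train_y, (0.85, 0.15), k=5))
```

The same query point is labelled differently for K = 3 and K = 5, which is why the value of K affects classification accuracy.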

Support Vector Machine
Table 1: Relevant research studies on outbreak detection using machine learning approaches.
To predict influenza or flu trends based on real-time data from OSM; In future work, this systematic approach will be extended to other types of outbreak.
J. S. Coberly et al. [26]; 2014; SVM; Dengue; Twitter; This study proposed a method that focused on geographic information about dengue fever; To associate new cases of dengue in a region with the cases reported by public health departments in the Philippines; The overall tweet must be processed to achieve better insight into the text data.
A. Alessa et al. [27]; 2019; To propose a VazaDengue system to detect mosquito-borne disease in tweets; To report and visualize new incidences of outbreaks; In the future, there is a need to utilize Instagram content, such as the classification of image data associated with the relevant post.
L. Chen et al. [19]; 2016; Topic, hidden Markov model; Flu; Twitter; This work proposed syndromic surveillance of flu outbreaks in tweets; To predict the flu outbreak, temporal topic models were deployed in this work; In future work, the proposed state transition probabilities can be utilized for traditional epidemiological approaches.

Support Vector Machine (SVM) is a classification technique used for supervised machine learning [33]. SVM sorts the data into classes by finding a line, often referred to as a hyperplane, that separates the dataset into categories. For text categorization, the fundamental concept behind linear SVM is to find a hyperplane that separates the documents in the dataset. In the mathematical formulation of SVM, S represents a sample of the dataset, and each x is associated with a y value indicating whether the item or its features belong to the class.
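The hyperplane idea behind a linear SVM can be sketched as follows. The sketch uses the standard linear decision function w·x + b and the hinge loss, which may differ in notation from the formulation in the paper; the weight vector `w` and bias `b` are hypothetical values rather than trained parameters.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def hinge_loss(w: Sequence[float], b: float, x: Sequence[float], y: int) -> float:
    """Zero when x is on the correct side with margin at least 1, positive otherwise."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


# Hypothetical parameters for a two-feature problem (not learned from data).
w, b = [2.0, -1.0], -0.5

print(predict(w, b, [1.0, 0.5]), hinge_loss(w, b, [1.0, 0.5], 1))
print(predict(w, b, [0.5, 0.25]), hinge_loss(w, b, [0.5, 0.25], 1))
```

The second sample is classified correctly but lies inside the margin, so it still incurs a positive hinge loss; training a linear SVM amounts to choosing w and b that keep such losses small while maximizing the margin.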

Random Forest
Table 2: Sample labelled tweets.
2. Flu season has been stopped in its tracks this winter. (Negative)
3. My daughter Francine has contracted dengue fever. She is in a critical point of the fever she is very weak and ill. Her chance of survival is reducing every day as her blood platelets have dropped dangerously. (Positive)
4. Dengue season is starting. Make sure you all are removing/changing water from coolers, indoor plants, and other small containers at least once a week. (Negative)
5. Flu season is in full swing, and we are currently in the usual peak months of December and February. (Negative)
6. May the healthcare system be able to handle the added burden of dengue season. (Negative)
7. Been fighting dengue for the last 5 days. Today I finally get to rest. (Positive)
8. My 20-year-old nephew just lost a close friend to dengue. Feeling a little numb and blank on hearing the news. (Positive)

A Random Forest (RF) consists of a large number of DTs that work as an ensemble [34,35]. Each tree in an RF outputs a class prediction, and the class with the most votes becomes the prediction of the model. The basic idea of RF is that it is an ensemble method consisting of several DTs, analogous to the trees in a forest [34]. Individual DTs tend to overfit the training data, resulting in high variance in the classification outcome for a minor alteration in the input features. They are very sensitive to their training data, making them prone to errors on the test dataset. The DTs of an RF are each learned from a different part of the training dataset. To classify a new sample, its input vector is passed down each DT of the forest. Each DT then considers a different part of the input vector and delivers a classification result. The forest then chooses the class with the most "votes" for discrete classification results, or aggregates the outputs of all trees in the forest for numeric outcomes. Because the RF algorithm combines the outputs of several different DTs, the variance that would result from relying on a single DT for the same dataset can be reduced. In the mathematical formulation, z denotes the training examples drawn from x, y, and yz denotes a regression or classification tree. After training, predictions for unseen samples x′ can be made by averaging the predictions from all the individual regression trees on x′.
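The aggregation step of an RF can be sketched as follows. The sketch covers only bootstrap sampling and the combination of per-tree outputs (majority vote for classes, averaging for numeric outputs); the three "trees" are hypothetical one-rule stand-ins rather than fitted DTs, and all function names are illustrative.

```python
import random
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Sequence


def bootstrap_sample(data: Sequence, rng: random.Random) -> List:
    """Draw len(data) items with replacement, so each tree sees a different subset."""
    return [rng.choice(data) for _ in data]


def majority_vote(trees: List[Callable], x) -> str:
    """Classification: the class predicted by most trees wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def average_prediction(trees: List[Callable], x) -> float:
    """Regression: average the numeric outputs of all trees."""
    return mean(tree(x) for tree in trees)


# Hypothetical stand-ins for fitted trees, each looking at a different feature.
def tree_fever(x: Dict[str, float]) -> str:
    return "positive" if x.get("fever", 0.0) > 0.1 else "negative"

def tree_platelets(x: Dict[str, float]) -> str:
    return "positive" if x.get("platelets", 0.0) > 0.1 else "negative"

def tree_season(x: Dict[str, float]) -> str:
    return "negative" if x.get("season", 0.0) > 0.1 else "positive"

forest = [tree_fever, tree_platelets, tree_season]
print(majority_vote(forest, {"fever": 0.4}))
print(majority_vote(forest, {"fever": 0.4, "season": 0.3}))
```

A single tree (for example `tree_fever`) would label both samples positive; the vote across trees changes the second result, which illustrates how the ensemble reduces dependence on any one tree.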

Performance Measures
Different measures are adopted for determining the performance of machine learning techniques [36]. The performance measures used here capture multiple aspects of the results for distinguishing the infected from the noninfected. Precision, accuracy, recall, F-score, and ROC curve are computed for classifying dengue and flu tweets as infected or not infected (presented in Section 4). The ROC curve can be used for binary as well as multiclass classification tasks; the better the ROC curve, the better the model is at distinguishing the target classes. Accuracy is the ratio of correctly detected infected/noninfected cases to all cases, and it is determined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TN: True Negative (correct rejection), that is, a dengue negative is categorized as not infected; TP: True Positive (correct detection), that is, a dengue positive is categorized as infected; FP: False Positive (type-I error), that is, a dengue negative is categorized as infected; FN: False Negative (type-II error), that is, a dengue positive is categorized as not infected.
Precision, or positive predictive value, is the proportion of correctly predicted disease positive cases among all cases predicted as disease positive, and it can be computed as

$$\text{Precision} = \frac{TP}{TP + FP}.$$

Sensitivity, or recall, is the proportion of actual infected cases that are detected among the total number of actual infected cases, which can be measured as

$$\text{Recall} = \frac{TP}{TP + FN}.$$

The F-score, or F1 measure, is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

It gives a balanced summary of how well the test data are classified, which matters when a large number of actual cases must not be missed.
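The measures in this section can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration with made-up labels; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 when the denominator is zero (a convention chosen for this sketch)."""
    return numerator / denominator if denominator else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return safe_div(tp + tn, tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    return safe_div(tp, tp + fp)


def recall(tp: int, fn: int) -> float:
    return safe_div(tp, tp + fn)


def f1_score(p: float, r: float) -> float:
    return safe_div(2 * p * r, p + r)


# Made-up labels: 1 = infected (positive), 0 = not infected (negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn), p, r, f1_score(p, r))
```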

Results and Discussion
The selected machine learning algorithms (DT, KNN, SVM, and RF) were applied to the data and evaluated using precision, accuracy, recall, F1 measure, and ROC curve. To train the proposed model, we split the dataset into two parts using the train-test split method of scikit-learn: (i) Training data: 80% of the data is used as the training set, to which the model is fitted. (ii) Test data: to evaluate the model on unseen data, the remaining 20% of the dataset is used as testing data. Table 4 shows the detailed results. It is observed that the RF model leads to slightly improved results, while DT and KNN also perform well. The confusion matrices and ROC curves for the testing data are shown in Figures 2 to 9. In the plots of the confusion matrix and ROC curve, "1" denotes the positive class, while "0" denotes the negative class. Figure 10 shows the training and test accuracy for each model (DT, KNN, and RF), while Figure 11 shows the performance accuracy achieved by adopting the selected machine learning models.
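The 80/20 division can be illustrated with a short standard-library sketch. The paper reports using scikit-learn for this step; the function below is a hypothetical stand-in that shows the idea (shuffle once with a fixed seed, then hold out 20%), and the sample tweets and labels are placeholders.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(samples: Sequence, labels: Sequence, test_fraction: float = 0.2,
                     seed: int = 42) -> Tuple[List, List, List, List]:
    """Shuffle indices reproducibly and hold out test_fraction of the data."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Placeholder data: ten dummy tweets with alternating labels.
tweets = [f"tweet {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = split_train_test(tweets, labels)
print(len(X_train), len(X_test))
```

Keeping the test portion untouched during training is what allows the reported measures to reflect performance on unseen tweets.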

Conclusion
The important considerations of primary prevention and intervention include identification and alert systems for infectious disease outbreaks and the modelling and evaluation of public health emergencies. Recent years have seen fast development of machine learning approaches in rapidly changing, dynamic, and data-rich environments to achieve these tasks. In this article, we summarized a set of current publications on how to improve the use of the virtual world's sensor and OSM data to enhance seasonal outbreak detection and early warning capacities. We also presented a collection of methods aimed at mapping the spread of seasonal outbreaks and estimating epidemiological data from Twitter messages. This paper proposes a machine learning approach for the early detection of dengue and flu seasonal outbreaks. Four algorithms were applied (DT, KNN, SVM, and RF), and TF-IDF was used for feature extraction. The proposed methodology was evaluated on two datasets of 6000 labelled tweets on dengue and flu [10]. The results were evaluated using confusion-matrix-based performance measures and ROC curves and are visualized graphically. The results showed that the RF classifier outperformed SVM, DT, and KNN in terms of accuracy, precision, recall, and F1 measure. The proposed work offers favourable performance, with precision, accuracy, recall, and F1 measure ranging from 84% to 88% for conventional machine learning approaches.
Despite the substantial findings of the proposed model, it has certain drawbacks. Supervised learning was used in this research, so the data used for model training needed to be labelled. Training the model in an unsupervised way would avoid the need for labelled data. In the future, the proposed model may also be applicable as a surveillance system to quickly detect the transmission of coronavirus (COVID-19).
Data Availability
The data of this article are available from the corresponding author upon request.

Figure 11: Results for precision, recall, and F1-score.