Tweets Classification and Sentiment Analysis for Personalized Tweets Recommendation



Introduction
In the last decade, social networks have witnessed multifold advancements due to the rapid digitization of the service industry and other advancements in the field of information technology. A plethora of information sharing platforms and the increased connectivity with the Internet [1] have also led to a change in the general perspective of networking, socialization, and personalization [2]. For the month of December 2018, an average of 1.52 billion users were active on Facebook daily [3]. This is in addition to the auxiliary services offered by Facebook, such as WhatsApp, Messenger, and Instagram, each of which has over 1 billion active users per month [4]. Similarly, as identified from third-party reports, other platforms, such as YouTube owned by Google, iMessage by Apple, and WeChat by Tencent, are also part of the no-longer-elite club of platforms with 1 billion monthly active users. More significantly, three out of every four adult Internet users now actively use at least one social network platform [5]. From a purely technological point of view, this enhanced connectivity has created unique challenges and opportunities [6,7] by allowing users not only to consume services but also to share their experiences, feelings, and thoughts. One of the most impactful and emerging social networks is Twitter, which allows its users to broadcast the latest (personal, communal, national, or international) events in the form of short messages, "tweets," which typically comprise text, audiovisual content, and/or links to external websites [8,9]. Twitter plays a key role in many fields such as social marketing [10], election campaigns [11], academia [12], and news. Hashtags (words identified by the symbol #) form a key part of any tweet, allowing public content to be categorized and made searchable for users. This allows hashtags to enrich the shared content and enables valuable analysis that leads to new insights and trends.
In terms of information discovery and knowledge creation, this plethora of user-created content allows the application of sentiment analysis, which aims to provide an automated mechanism for determining the writer's attitude towards the subject or its overall contextual polarity [13]. These insights are especially useful for digital marketing, allowing organizations and in some cases governments (such as during the Arab Spring [14]) to monitor and measure social media and gain actionable business/social intelligence, so that they can understand how people view their brands, products, and services and improve brand visibility.
In the same manner, social media is now playing an active role in improving healthcare service delivery [15]. Shifting to a more user-centric approach, social media enables near real-time information flow, which in turn enables immediate interventions for individuals and communities in hospitals, at clinics, or at home [16]. For example, in a survey [17], the authors reported that searching for health information stood out as the third most popular online activity. Today, patients, irrespective of their age, gender, or socio-economic standing, rely on the web to find healthcare information related to their particular needs [18,19]. Additionally, patients can now make more informed decisions by examining the experiences of their peers in terms of symptoms, reactions, and treatments related to a particular disease, thereby bridging the communication gap between patients and healthcare providers [20]. In addition, healthcare organizations can also benefit by responding to problems in a timely manner and by monitoring the user's behaviors, conditions, and feelings between visits [2]. Keckley and Hoffmann [21] studied online social networks to analyze their effect on patient health and found that people benefit more when sharing their data on social networks such as the PatientsLikeMe portal [22]. This virtual connectivity can provide many benefits, such as improved medication adherence, pharmacovigilance [23], reduction in side effects, enhanced community support, improved epidemiological analysis [24], and generally better healthcare services. Consequently, it is safe to say that healthcare benefits are directly related to social reachability [25]. According to the PwC Health Research Institute [26], almost 90% of users aged 18-24 were willing to share their health information on social networks.
However, such extensive use of social media has also introduced the problem of information overload. With an overwhelming amount of data on social media, users find it difficult to get personalized and concise information. Short and noisy text on social media also makes it hard to understand the full context and classify data. In this paper, we propose a framework for providing personalized recommendations to the user by analyzing their health interests on social networks. While this work can be generalized to many domains, the research work presented henceforth is focused on processing healthcare data and information. The proposed classification and sentiment analysis system uses a semantic structure, important keywords, and opinion words from tweets to monitor user interests and then generates personalized healthcare and wellness-related tweet recommendations. These personalized tweets consist of publicly available content which is precisely preclassified by our system. For tweet classification, the proposed system uses a domain-specific seed list which helps to decide which category a particular tweet belongs to. After classification, the proposed system also applies a lexicon-based sentiment analysis approach to extract topic-level sentiments in tweets. To increase the accuracy of tweet analysis, the proposed system also uses synonyms with keywords. The proposed model performs more precise analyses of tweets by enriching them with temporal patterns and keyword semantics, which optimizes the filtering result and helps to extract more knowledge from tweets. For testing of profile generation, we collected 6000 user tweets and generated user profiles by extracting health-related keywords, entities, and sentiments. For classification, the system was tested on almost 1,000,000 tweets of different categories.
Due to our preclassification strategy and other significant improvements, our current model showed an accuracy of 96% for tweet classification, which is significantly better than our previously published approach, with an accuracy of 89.5% [27].
The proposed system also measured how much information for one category can be extracted from other categories which were ignored by keyword-based search from tweets. The main contribution of the presented work is the complete design and implementation of a personalized recommender system for a user based on their temporal social media history.
The proposed system does not rely only on keyword-based interest; it also takes the user's temporal sentiments into account. The syntactic and semantic analysis of tweets leads to more complete profile generation and tweet classification. The rest of this research paper is structured as follows. Section 2 discusses related work closely aligned with our work. In Section 3, we present the theoretical foundations of the proposed platform and its components, followed by Section 4, which briefly describes our implementation strategy and presents the evaluation results of the proposed system. Finally, Section 5 concludes the research work and highlights future work.

Related Work
Social media analytics is an active, interdisciplinary research field, which has enabled researchers to gain unique perspectives into human and data behaviors. The volume and variety of this largely unstructured data, produced at high velocity, has led to the development of many tools and technologies for extracting or rather enhancing the value of social interactions. Yet, there still remain many challenges in terms of identifying relevant data, tracking actions and reactions, increasing the veracity of data, optimization of data storage, data processing and visualization of information, extracting hidden patterns, and closing the data to knowledge loop [28]. A key task for researchers pursuing applied research in this field is to not only identify the techniques used for converting data to information and subsequently knowledge but also to look at its impact [29]. Twitter, with its streaming API and a large open (in terms of keeping their tweets public) user base, has further enabled the monitoring and analysis of a rich gold mine of data produced via a novel information propagation strategy. One of the more recent works on analyzing tweet propagation, for prominent Mexican political figures, through the utilization of visual aids and pattern recognition approaches has been laid out by [30]. In this work, the authors collected tweets from six prominent Mexican politicians, along with the mentions, retweets, and favorites of their tweets. By applying sentiment analysis followed by a contrast pattern-based classifier working on 124 extracted (5 nominal and 119 numerical) features, the authors were able to quantify the impact of tweets based on their propagation patterns.
In an earlier approach, as presented by [31], the authors utilized social features (such as number of followers, favorites, and others) and tweet features (such as number of hashtags, tweet length, and others) to predict the likelihood of a tweet being repropagated (also known as retweeting). In this work, the authors used a passive-aggressive algorithm for automated categorization of tweets. The performance of their model was slightly higher than manual categorization by human subjects. Tweet categorization is also important for identifying relevant data for early responders immediately after a disaster event. Li et al. [32] have built on earlier works and presented a supervised Naive Bayes model, along with an iterative self-training strategy, which is able to provide good results. However, the presented results are from a controlled environment (the CrisisLexT6 labelled data set, covering 6 disasters between Oct 2012 and July 2013), and its application in a live environment would require a lot of data preprocessing.
A use case of such categorization is to build recommendation systems, which can provide a more personalized experience to the users. A basic URL recommendation system based on user tweets, topic interest models, and social voting was introduced by Chen et al. [33]. Using 12 voting algorithms and feedback from 44 users, the authors were able to provide a basic platform for future recommendation systems based on Twitter data. Abel et al. [34-36] analyzed user modeling for presenting personalized news recommendations and improved the semantics of Twitter activities by enriching news items with tweets. The work used topic-based, entity-based, and hashtag-based methods to analyze user modeling. They also focused on temporal pattern extraction in users' profiles. Piao and Breslin [37] analyzed user modeling strategies by incorporating categories, classes, and connected entities from DBpedia for extending user interest profiles and found that their proposed method significantly outperforms existing approaches in the context of link recommendations. A dynamic user modeling-based recommendation system was proposed by Deng et al. [38] to integrate information extracted from tweets and the video ranking system employed by YouTube based on the same user's profile. This strategy greatly enhanced the relevancy of the video recommendations. Celik et al. [39] identified the semantic relationship between Twitter entities to provide mediation among the same, thereby allowing the users to access the relevant content of their interest. Balabanović and Shoham [40] proposed a system to build a user profile by combining both collaborative and content-based recommendation techniques. In content-based recommendation systems, the user's own preferences are considered for providing recommendations. In collaborative recommendation, on the other hand, the system identifies users with tastes similar to those of the given user and provides recommendations based on this similarity.
Another popular use case of data analytics on Twitter is sentiment analysis. Yi et al. [41] presented a model to extract only subject-based sentiments from tweets by extracting topics and sentiments, followed by an application of a mixture model to detect relations between them. Similarly, Nasukawa and Yi [42] identified sentiment related to a particular subject using natural language processing techniques. The novelty of their approach lay in a Markov model-based tagger for recognizing parts of speech, followed by statistics-based techniques to identify sentiments related to a subject. Godbole et al. [43] introduced a system to determine public sentiment, and its variation over time, for news and blog entities. Using synonyms and antonyms, the authors were able to find a path between positive and negative polarity and to enlarge the seed list.
Some of the other popular use cases include improved search, improved tweet contents, and predicting election outcomes. Reviewing studies catering to these use cases is an important tool for identifying techniques that can help improve the impact and effectiveness of the recommendation system. Guo and Lease [44] proposed a novel ranking model for enriching the search functionality on Twitter with personalization and content analysis. Clark and Araki [45] introduced a text normalization technique to categorize errors and informal language used on social media into different groups, followed by natural language processing techniques to correct common phonetic and slang mistakes. In contrast, Laniado and Peter [46] studied hashtags on Twitter and demonstrated mappings of fifty percent of hashtags to entities in Freebase. Hashtags were assessed as strong identifiers along four dimensions: frequency, specificity, consistency, and stability. Lösch and Müller [47] proposed a method to associate hashtags with encyclopedia entities. Their system used Wikipedia entities as descriptions of hashtags in a microblogging service to understand the actual context of hashtags. Tumasjan et al. [48] analyzed Twitter as a source for predicting elections. They used the context of the German federal election to investigate whether Twitter is used as a forum for political deliberation. They used LIWC 2007 [49], a text analysis software, which uses a psychometrically validated dictionary for identifying and assessing the emotional, cognitive, and structural components of given text samples. The authors used 12 dimensions, including past and future orientation, positive and negative emotion, sadness, anxiety, anger, tentativeness, certainty, work, achievement, and money, to extract political sentiments from this data.
In this paper, we provide users with personalized health-related profiling and aggregated sentiment analysis using precisely classified data and sentiments. We propose a novel approach for analyzing the behavior and lifestyle of individuals by monitoring patients' self-reported data and social posts. The Archivist is a service that finds and archives tweets using the Twitter search API. It helps the user to get real-time trend information on Twitter [50]. Our model uses the Archivist to collect Twitter data and processes them using natural language processing techniques to extract knowledge and sentiments from tweets. Twitter contains a lot of information; however, the proposed model focuses on how the information is filtered precisely to provide personalized knowledge to users.

The Proposed System Architecture
Twitter is a popular social media platform that enables users to post short texts, images, and videos of a personal and/or collaborative nature. This data provides a unique insight into the user's personality. Of particular interest to our research work are the user's interests and emotions, which are used by our proposed system to build a user profile and then provide personalized data/services to similar users. Our proposed system, as shown in Figure 1, consists of two modules and integrates with Twitter as a plug-in application. The first module builds the user health profile by extracting the user's profile information, health interests, and emotions enriched with temporal patterns. To achieve these objectives, Alchemy API [51] is used for the extraction of the user's interests from free text (tweets). The API processes unstructured text using natural language processing techniques and machine learning algorithms to produce keywords, entities, concepts, and the sentiment of the user in relation to these keywords and entities. The second module collects public data from Twitter and precisely classifies it to recommend personalized data to users based on their generated profile. To classify tweets and extract topic-level sentiments, the system analyzes tweets using domain-specific seed words, opinion words, an n-gram generator, a POS tagger, a synonym binder, and a dependency parser. Seed words and opinion words are enriched with synonyms to increase the accuracy of classification.

Data Manager.
Data manager acts as a pluggable interface to Twitter, which internally utilizes a data fetcher to acquire streaming data. These data are received in XML format, a sample of which is shown in Figure 2. Each tweet is encapsulated in a structured format, containing the username of the person tweeting, timestamp of the tweet, textual content of the tweet, its unique identifier, any associated image, and other information. Using a DOM parser, we parse this XML corpus to extract the username, tweet date, status, tweet ID, and image fields. We then apply text preprocessing on the tweet text (status field) to convert the raw data into meaningful information. The main aim of this step is to convert abbreviations and slang contained in the tweets into their formal counterparts.
This aim is set to alleviate tweet behaviorisms, which have informally encouraged the use of abbreviations (such as "plz" instead of "please" and "gud" instead of "good") and other slang words [52] by Twitter users to save time and space. Users can also repeat characters to emphasize a particular word (such as "Plzzz," as shown in the second tweet in Table 1). Such words represent noise in the data, since they affect the knowledge extraction process. The data preprocessor module removes this noise by utilizing a repository of 1300 slang words. As a result of this process, the resulting data are free of the slang and abbreviated words most commonly used on social media. Additionally, the spell checker module uses Jazzy (a Java-based spell checking API) to correct any spelling mistakes in the data. The final data produced by the data manager is very rich and can be used by the consuming services to build a user profile and extract knowledge.
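The parsing and slang-normalization steps described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' Java implementation; the XML element names, the three-entry slang table, and the rule for squeezing repeated characters are all assumptions made for the example, and the Jazzy spell-checking step is not reproduced.

```python
import re
from xml.dom import minidom

# Hypothetical XML layout: these element names are assumptions,
# not the schema shown in the paper's Figure 2.
SAMPLE_XML = """<tweets>
  <tweet>
    <username>alice</username>
    <date>2018-12-01T08:15:00</date>
    <status>Plzzz check ur sugar, it is gud</status>
    <id>1001</id>
  </tweet>
</tweets>"""

# Tiny illustrative slang table; the paper's repository holds about 1300 entries.
SLANG = {"plz": "please", "gud": "good", "ur": "your"}

_REPEAT = re.compile(r"(.)\1{2,}")  # any character repeated three or more times


def normalize_word(word):
    """Expand slang, also trying the word with repeated characters squeezed."""
    lower = word.lower()
    single = _REPEAT.sub(r"\1", lower)    # "plzzz" -> "plz"
    double = _REPEAT.sub(r"\1\1", lower)  # "gooood" -> "good"
    for candidate in (lower, single, double):
        if candidate in SLANG:
            return SLANG[candidate]
    # No slang match: keep the word, squeezing long runs to two characters.
    return double if double != lower else word


def normalize_text(text):
    """Normalize every word in the text, leaving punctuation untouched."""
    return re.sub(r"\w+", lambda m: normalize_word(m.group()), text)


def _text_of(node, tag):
    """Text of the first <tag> descendant, or None if it is missing or empty."""
    found = node.getElementsByTagName(tag)
    if not found or found[0].firstChild is None:
        return None
    return found[0].firstChild.data.strip()


def load_tweets(xml_text):
    """Parse the XML corpus and attach a normalized copy of each status."""
    fields = ("username", "date", "status", "id", "image")
    records = []
    for node in minidom.parseString(xml_text).getElementsByTagName("tweet"):
        record = {field: _text_of(node, field) for field in fields}
        record["clean_status"] = normalize_text(record["status"] or "")
        records.append(record)
    return records
```

Calling `load_tweets(SAMPLE_XML)` yields one record whose `clean_status` field holds the normalized text, while the raw `status` is kept alongside it so that later stages can still inspect the original wording.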

Profile Builder.
This submodule extracts useful information from tweets and maintains a temporal history to build a profile based on the user's health interests. Profile builder extracts the user's interests by using Alchemy API. It accepts unstructured text and obtains knowledge by exposing the semantic richness hidden in posts, using named entities and the sentiment related to those entities. The system stores extracted keywords, entities, and user sentiments in the user's profile repository for future use. Table 1 shows a sample of the keywords, entities, and associated sentiments extracted by profile builder using the IBM Watson Natural Language Understanding module (Alchemy API). For instance, the tweet "I feel my high blood pressure is at an unsafe level every time I'm at work. It's seriously going to give me a depression one of these days" when processed through this API shows "high blood pressure" as the most relevant keyword with the highest confidence score of 0.99206. Similarly, the highest rated concept for this tweet is "hypertension" with a score of 0.915043. The overall sentiment associated with this keyword is negative, with a score of −0.96. The other sample tweets with their corresponding keywords, entity concepts, and entity sentiments are shown in Table 1, along with their scores in parentheses. For each of these attributes, we have selected the top keyword, concept, and sentiment, with respect to their relevance in the text. It is also pertinent to note that not all entities are correctly identified, as in the case of the third example in the table, "Wide awake, I've got a headache and work in the morning," which has the correctly identified keyword "headache" with a score of 0.71, but an unrelated concept "2006 singles" with a confidence score of 0.86. We do not discard this incorrect conceptualization, which only slightly affects the overall accuracy, as will be shown in the result section.
After extracting this information from tweets, profile builder searches for temporal patterns in the user's interests; e.g., in the morning, the user is usually interested in the blood sugar level, while in the evening, the user usually talks about insulin and diet. If the same pattern appears more than two times, profile builder attaches temporal information to the extracted knowledge so that it can be used for data recommendations. All the extracted data and temporal information are then stored in the database.
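The "more than two times" rule for temporal patterns can be illustrated with a short Python sketch. The time-slot boundaries and the function names are hypothetical choices for this example; the paper does not state how a day is divided.

```python
from collections import Counter
from datetime import datetime


def time_slot(timestamp):
    """Map a timestamp to a coarse slot; the hour boundaries are an assumption."""
    hour = timestamp.hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"


def temporal_patterns(observations, min_repeats=3):
    """Return {(slot, keyword): count} for pairs seen more than two times.

    `observations` is an iterable of (datetime, keyword) pairs, one per
    keyword extracted from a tweet.
    """
    counts = Counter(
        (time_slot(timestamp), keyword.lower())
        for timestamp, keyword in observations
    )
    return {pair: n for pair, n in counts.items() if n >= min_repeats}
```

With three morning mentions of "blood sugar" and only two evening mentions of "insulin", only the morning pattern would be attached to the profile; the evening one would need a third occurrence.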

Knowledge Extractor.
Knowledge extractor module consumes the processed tweets coming from the data manager in order to apply natural language processing and sentiment analysis techniques to precisely classify them. In particular, the proposed system uses the Stanford Part-of-Speech (POS) tagger, dependency parser, four-gram, and a synonym binder to classify the tweets. The tags identified by the Stanford POS tagger are used to extract synonyms from WordNet. Additionally, the synonym binder helps improve the accuracy of classification by binding synonyms from the seed list with each noun word. This binder is based on the WordNet dictionary, which also allows us to identify the contextual meaning of the present words. Jaws API [53] provides the synonym binder with an external interface to WordNet. For example, the word workout is not present in our seed list; however, its bound synonym exercise does exist.
The synonym binder also handles other problems related to word structure. For example, it can convert plurals to singulars, thereby binding calories with calorie and exercises with exercise. Sentiment analyzer uses a sentiment lexicon to extract positive, negative, and neutral sentimental words from these enriched tweets. For positive and negative sentiments, the system uses the list of 6800 words from [54]. In addition, for the neutral class, a list of neutral keywords was built after analyzing tweets.
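A minimal Python sketch of the binding idea follows. The hand-written `SYNONYMS` table is only a stand-in for the WordNet lookup that the paper performs through the Jaws API, and the plural-stripping rule is a deliberately naive assumption; the seed terms are illustrative.

```python
# Illustrative seed terms and a hand-written synonym table. The paper
# obtains synonyms from WordNet; this dictionary is only a stand-in.
SEED_LIST = {"exercise", "calorie", "insulin"}
SYNONYMS = {"workout": {"exercise"}}


def singular_candidates(word):
    """Yield the word followed by naive guesses at its singular form."""
    yield word
    if word.endswith("ies"):
        yield word[:-3] + "y"
    if word.endswith("es"):
        yield word[:-2]
    if word.endswith("s"):
        yield word[:-1]


def bind(word):
    """Return the seed term that `word` binds to, or None if there is none."""
    for candidate in singular_candidates(word.lower()):
        if candidate in SEED_LIST:
            return candidate
        for synonym in sorted(SYNONYMS.get(candidate, ())):
            if synonym in SEED_LIST:
                return synonym
    return None
```

Here `bind("calories")` resolves to "calorie" and `bind("Workouts")` resolves to "exercise". Trying several singular guesses against the seed list avoids committing to one suffix rule, which would turn "calories" into "calory".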
The proposed system classifies tweets based on the knowledge extracted from them. This classification process depends on the seed list, which is used to identify the particular category that a tweet belongs to. In this research work, we have focused on the healthcare domain by keeping the most frequently used healthcare and wellness terms in our seed list. The classified data are stored in a knowledgebase for improving accuracy and for future use. Once the proposed system has classified tweets and detected sentimental words in them, the Stanford dependency parser is used to identify the relation between the extracted categories and sentimental words. This helps the system to find topic-based sentiments in tweets. The proposed system uses the dot, exclamation mark, and hyphen as sentence boundaries for splitting tweets into sentences if there are multiple sentences in a tweet. Typed dependencies are grammatical relations between words which help to decide whether or not a sentiment belongs to a specific word. They also help in extracting multiple sentiments from a tweet. Figure 3 shows how dependencies are used to find topic-based sentiments. The dependency parser also helps to find the negation of any sentimental word so that its value can be inverted; e.g., the tweet "I don't like the taste of that medicine" contains the negation of the positive word "like." Without considering negation, the system would not be able to link the negative sentiment to "taste."
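The sentence splitting and negation handling described above can be approximated in a few lines of Python. Note the substitution: this sketch does not use the Stanford dependency parser; it flips polarity when a negator appears within a small token window before the sentiment word, which is a much cruder heuristic. The word lists and the window size are assumptions made for the example.

```python
import re

# Toy lexicons for illustration; the paper uses a 6800-word lexicon [54].
POSITIVE = {"like", "good", "helpful"}
NEGATIVE = {"bad", "painful", "unsafe"}
NEGATORS = {"not", "no", "never", "don't", "doesn't", "didn't", "can't"}


def split_sentences(tweet):
    """Split on dot, exclamation mark, and hyphen, as described in the text."""
    return [part.strip() for part in re.split(r"[.!\-]", tweet) if part.strip()]


def sentence_sentiments(sentence, window=2):
    """Return (word, polarity) pairs, where polarity is +1 or -1.

    Polarity is inverted when a negator occurs within `window` tokens
    before the sentiment word (a stand-in for a parsed negation relation).
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    results = []
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        else:
            continue
        if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
            polarity = -polarity
        results.append((token, polarity))
    return results
```

For the tweet "I don't like the taste of that medicine", the sketch reports "like" with polarity -1. Unlike a dependency parse, it cannot say that the negative sentiment attaches to "taste"; linking sentiment words to topics is exactly what the typed dependencies add.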

Filter Engine.
Filter engine processes classified tweets using the personalized profile and the aggregated sentiment results to recommend relevant data to the user. While generating data recommendations, filter engine also incorporates temporal patterns extracted by profile builder to generate more valuable, time-specific recommendations. Figure 4 shows the positive, negative, and neutral sentiments associated with the various common drugs used by diabetic patients and mentioned in their tweets. This sort of filtering can enable physicians and caregivers to optimize drug delivery by incorporating patient sentiments in their medicine prescription process. This could have a positive impact on medication adherence by the diabetic patient. Figure 5 shows another use case of the filter engine's application, whereby the diabetic patient is shown relevant tweets based on similar keywords and sentiments to reinforce constructive dialog and create a virtual support system for the diabetic patient. Through this approach, patients can obtain useful information related to their disease and others' experiences with different kinds of insulin, drugs, or medical tests.
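One way to picture the filter engine is as a scoring function over classified tweets. The Python sketch below is an assumption-laden illustration: the record layout, the scoring weights, and the `recommend` function are hypothetical, since the paper does not specify how profile matches and temporal patterns are combined.

```python
def recommend(tweets, profile, slot, limit=5):
    """Rank classified tweets for a user at a given time slot.

    tweets:  list of dicts with a 'text' string and a 'topics' set
    profile: dict with 'interests' (set of topics) and
             'temporal' ({slot: set of topics preferred in that slot})
    The weights (1 per shared interest, 2 per slot-preferred topic)
    are arbitrary illustrative choices.
    """
    preferred = profile.get("temporal", {}).get(slot, set())
    scored = []
    for tweet in tweets:
        shared = tweet["topics"] & profile["interests"]
        if not shared:
            continue  # tweet does not match the user's profile at all
        score = len(shared) + 2 * len(tweet["topics"] & preferred)
        scored.append((score, tweet))
    # sort() is stable, so tweets with equal scores keep their input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored[:limit]]
```

Passing the current time slot lets a tweet about blood sugar rank higher in the morning for a user whose profile records that morning pattern, while tweets sharing no topic with the profile are dropped.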

Implementation and Result
While the presented approach can be generalized to any domain, in this research work, we have extended our previous approach, presented in [27], to extract healthcare knowledge from publicly available tweets, providing recommendations for diabetes. In order to realize the proposed framework, we have used Java and other open APIs to create an application which amalgamates the data curation service, knowledge extraction service, user profile building service, and filter engine into the proposed recommendation system. These services are briefly explained in the following subsections.
By applying seed list-based classification and sentiment analysis, the system was able to recommend personalized diabetes-related tweets to users. The seed list was generated using the work presented in [55,56]. Google Refine was used to overcome redundancy problems and formatting issues. To calculate the accuracy of our proposed system, we have used the seed list for diabetes for tweet filtration. By integrating our proposed system with Twitter, the user would be able to get precisely classified and personalized data with sentiment values. Moreover, this tweet data is useful for clustering, trend analysis, and recommendations as well. The details of the data collection process, our experiments, and their results are as follows.

Data Collection.
The Archivist tool has been used to crawl a specific set of tweets for all the keywords presented in Table 2. Table 2 also shows the number of extracted tweets, along with their classification accuracy when using only n-gram and when using n-gram with synonyms.
To generate user profiles, we analyzed the tweets of 100 users and collected 6000 tweets related to diabetes. Some of the tweets collected for profile generation could not provide any information about user health interests, so the system ignored them and used only those tweets which helped to generate the user's health profile. The seed list of diabetes-related terms has been generated by utilizing the work presented in [55,56]. This list was then divided into two parts, by using natural language processing to classify diabetes-related terms based on their definition in the original source. As a result, 417 terms have been classified into categories, such as test, condition, body cell, diabetic study, professional, devices, medicine, and others (not to be confused with the well-defined category "other"). For example, "hyperinsulinemia" was defined in the seed source as "a condition in which the level of insulin in the blood is higher than normal caused by overproduction of insulin by the body." The proposed system classified it as a "condition" term. Our system was able to classify 80.5% of the terms, leaving only 81 terms, which were labelled as belonging to the "other" category.
For sentiment analysis, the proposed system used the list of positive and negative sentiment words, composed of 6800 words, from [54]. For the neutral class, we manually built a list of 30 keywords.

Testing.
Almost six thousand tweets were used to generate user health profiles. By using Alchemy API, the system extracted all important keywords, entities, and sentiments from tweets. This information is used to build the user profile, which helps to provide the user with personalized data recommendations. The recommended data consist of precisely classified tweets together with public sentiment analysis.

Spell checker also improved system performance, as social media data contain spelling and typing errors. The proposed system has processed almost one million tweets of different categories for testing and verification of classification and sentiment analysis. By considering only four-gram, 129,839 diabetes-related tweets were successfully classified from all categories. However, when the proposed system was employed in full, using four-gram and the synonym binder, 142,285 diabetes-related tweets were classified from all categories.
This is because the synonym binder binds the context of words from tweets, which improves the categorization process. By applying preprocessing and then semantic and syntactic analysis, system accuracy has reached up to 96% for diabetes-related tweets, as shown in Table 2. The system used the n-gram model with the synonym binder to achieve this accuracy. Retrieving diabetes-related tweets from other categories decreased information loss and increased the quality of sentiment analysis. Simple keyword-based search on Twitter is not able to provide all the related information for a specific category. This can be greatly enhanced by using a seed list, which enables the retrieval of information related to the keyword. In the legacy search case, the term "diabetes" would only return tweets containing this keyword. However, using the seed list to perform an advanced search can also return additional information by retrieving those tweets which do not explicitly contain this keyword but are still of interest to the diabetic patient or the caregiver, for example, "Morning walk is very helpful to maintain blood glucose." This tweet is not returned when we search Twitter for diabetes; however, the proposed system has successfully classified it as a diabetes-related tweet.
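The difference between a plain keyword search and seed-list matching over n-grams can be shown with a small Python sketch. The seed terms below are illustrative, and reading "four-gram" as "all n-grams of up to four tokens" is an assumption; the synonym binder is left out to keep the example short.

```python
import re

# Illustrative seed terms; the paper's diabetes seed list has 417 entries.
SEED_LIST = {"diabetes", "insulin", "blood glucose", "blood sugar"}


def ngrams(tokens, max_n=4):
    """Yield every contiguous n-gram of 1 to max_n tokens as a string."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def matched_seed_terms(tweet, seed_list=SEED_LIST, max_n=4):
    """Return the seed terms that occur in the tweet as whole n-grams."""
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    return {gram for gram in ngrams(tokens, max_n) if gram in seed_list}


def is_diabetes_related(tweet):
    """A tweet is kept if at least one seed term matches."""
    return bool(matched_seed_terms(tweet))
```

For the example tweet from the text, "Morning walk is very helpful to maintain blood glucose.", the word "diabetes" never appears, yet the two-token seed term "blood glucose" matches, so the tweet is retained.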
Dependency parser has helped the proposed system to find an accurate relationship between sentiments and classes. It has also helped to find multiple sentiments for multiple classes from a single tweet. Figure 3 shows how the proposed system has extracted multiple topic-based sentiments from a single tweet. At first, sentimental words and topics were extracted, but it was not clear which sentiment was related to which topic. So, the system used a dependency parser to bind sentiments to topics. Dependency parser also helps the system in negation detection, e.g., "neg (good, not)" shows that "good" is negated. Negation inverts the opinion of the sentimental word from positive to negative and vice versa. Figure 4 shows the sentiment analysis of the tweet data generated for a diabetic person. It shows that 37% of tweets about basal insulin are positive, 38% are negative, and 25% have neutral sentiments. The figure also shows that the majority of sentiment for glucagon is negative.
These results help the user not only to find related tweets but also to see aggregated sentiments. Through the application of advanced natural language processing techniques, such as topic modeling, keyword extraction, and sentiment analysis, the classification accuracy is greatly improved. Figure 6 shows a comparison of the proposed system with the existing technique [27]. It shows an improvement of 6.5 percentage points (from 89.5% to 96%) over the existing technique in accurately classifying tweets related to diabetes and a 22.8% improvement in classification for blood pressure.
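The per-topic sentiment shares of the kind reported for Figure 4 amount to a simple aggregation over labelled tweets. The following Python sketch assumes each classified tweet has already been reduced to a (topic, label) pair; the function name and the rounding to whole percentages are choices made for this example.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")


def sentiment_shares(labelled):
    """Return {topic: {label: percent}} from (topic, label) pairs.

    Percentages are rounded to whole numbers, so the three shares of a
    topic may not always sum to exactly 100.
    """
    counts = defaultdict(Counter)
    for topic, label in labelled:
        counts[topic][label] += 1
    shares = {}
    for topic, counter in counts.items():
        total = sum(counter.values())
        shares[topic] = {
            label: round(100 * counter[label] / total) for label in LABELS
        }
    return shares
```

Fed 100 synthetic labels for "basal insulin" split 37/38/25, the function reproduces the shares quoted above; the synthetic input only mirrors the reported percentages and is not the paper's data.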
Additionally, the proposed system addresses a key use case of information loss, caused by a legacy keyword-based search engine. Twitter search can be greatly enhanced by using seed lists and short text classification to extract a larger set of related information, without increasing the cognitive load on the user. Table 2 shows the effectiveness of using this process for extracting information related to diabetes. Information diffusion varies in each category: while 10.6% of tweets from the diet category and 6.1% of tweets from the dengue category contain valuable information about diabetes, in the blood pressure category we found 95% of tweets containing content related to diabetes. Legacy keyword search on Twitter was not able to extract these tweets. It is also important to note that the information collected through this process is not unique, and as we found out, there is an

Conclusion
In this research work, we have demonstrated a personalized recommendation system based on user profile matching. We have also presented the effectiveness of using a synonym binder for avoiding information loss and enhancing the knowledge extraction process, which was also supported by a sentiment analyzer. Sentiment analysis captures people's attitudes towards different topics, which can be used to generate a richer user profile and personalized recommendations. Topic-based sentiment analysis additionally helps the user to gather summarized public opinions on entities of their interest. Domain-specific seed words helped to decrease the information loss incurred by keyword-based search. The user profile generated from social media can be integrated with a clinical decision support system (CDSS) or electronic health record (EHR) to learn about user interests and behavior in detail. In the future, we plan to integrate user information from other social media and user activity logs to find interesting patterns and use them in personalized recommender systems.

Data Availability
The data and code related to the data will be made available on GitHub.