Exploiting Language Models to Classify Events from Twitter

Classifying events in Twitter is challenging because tweet texts contain a large amount of temporal data with a lot of noise and a wide variety of topics. In this paper, we propose a method to classify events from Twitter. We first find the distinguishing terms between tweets in events and measure their similarities using language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), which have been widely studied on large text corpora in computational linguistics. The relationships between term words in tweets are discovered by checking them under each model. We then propose a method to compute the similarity between tweets based on their features, including common term words and the relationships among their distinguishing term words. The resulting similarity measure is explicit and convenient to apply in k-nearest neighbor techniques for classification. Experiments on the Edinburgh Twitter Corpus show that our method achieves competitive results for classifying events.


Introduction
Twitter (https://twitter.com/) is a social networking application that allows people to microblog about a broad range of topics. Users of Twitter post short texts, called "tweets" (about 140 characters), on a variety of topics ranging from news events and pop culture to mundane daily events and spam. Recently, Twitter has grown to over 200 million active users producing over 200 million tweets per day. As a popular microblogging and social networking service, it presents many opportunities for research in natural language processing (NLP) and machine learning [1][2][3][4][5][6]. Locke and Martin [5] and Liu et al. [4] trained classifiers to recognize entities based on annotated Twitter data for Named Entity Recognition (NER). Other research has explored Part of Speech (PoS) tagging [3], geographical variation in language found on Twitter [2], modeling informal conversations [1], and applying NLP techniques to help crisis workers with the flood of information following natural disasters [6]. Benson et al. [7] applied distant supervision to train a relation extractor to recognize artists and venues mentioned within tweets of users who list their location.
Classifying events in Twitter is a difficult task that focuses on the automatic identification and classification of various types of events in tweet texts. In Twitter, events are topics that often draw public attention, for example, football matches or natural disasters. Several approaches have been proposed to classify events for detection, such as wave analysis [8,9], topic models based on latent Dirichlet allocation [10], hierarchical Dirichlet processes [11], and text classification and clustering [12]. Kireyev et al. [8] explored the use of topic models for the analysis of disaster-related Twitter data. Sakaki et al. [12] investigated the real-time interaction of events such as earthquakes in Twitter and proposed an algorithm to monitor tweets and detect target events. However, existing approaches fail in either latent topic detection or the analysis of term relationships. Because topic model techniques [13][14][15] focus only on grouping sets of relevant words (called topics), they miss the relations between topics. Consider the tweets from two events shown in Table 1. It is easy to recognize that T1 and T2 discuss event 1 and that T4 and T5 discuss event 2. However, a system using topic models will group T1, T2, and T3 into the same event category even though T3 does not belong to the event, because the relation words <"passed away," "dead," "died"> in these tweets fall within the same topic.
Likewise, T6 will be grouped into event 2 together with T4 and T5 even though T6 does not belong to this event, because the sets of relation words <"plane," "crash," "helicopter">, <"Russia," "KHL team," "Lokomotiv," "hockey">, and <"kills," "dead"> in these tweets fall within the same topics, respectively. Because of these limitations of topic models, we propose a method that exploits language models with relation references to analyze not only topics but also the relatedness of events in tweets.

Computational Intelligence and Neuroscience

Table 1 (fragment): sample tweets and whether each belongs to the target event.
T4: plane crash kills majority of KHL team Lokomotiv. Yes
T5: plane crash in Russia kills 36 or 37 assumed to be hockey player. Yes
T6: plane crash, helicopter, was in Moscow with 2 dead. No
In this paper, we investigate the use of generative and discriminative models for identifying the relationships among objects in tweets that describe one or more instances of a specified event type. We adapt language modeling approaches that capture how descriptions of event instances in text are likely to be generated. Our method finds the distinguishing term words between tweets and examines them against a series of relationships extracted by language models such as ConceptNet [16] and LDA-SP [17]. These language models have been widely studied on large text corpora in computational linguistics. Examining the distinguishing terms and common terms of two tweets under each model makes their relationship clear enough to measure the similarity of the tweets. The resulting similarity measure is explicit and convenient to apply in classifier algorithms, such as SVM and k-nearest neighbor (kNN), to classify events in Twitter.
The rest of this paper is structured as follows. Section 2 presents related work on event detection. Section 3 discusses exploiting language models and presents a method to calculate the similarity between tweets for event classification. Section 4 presents and discusses experiments on the Edinburgh Twitter Corpus for event classification. Section 5 ends with conclusions and future work.

Related Work
Several applications have detected events on the Web, in weblogs [18][19][20], news stories [21,22], or scientific journal collections [23]. Glance et al. [19] presented the application of data mining, information extraction, and NLP algorithms for event detection across a subset of approximately 100,000 weblogs. They implemented a trend searching system that provides a way to estimate the relative buzz of word of mouth for given topics over time. Nallapati et al. [22] attempted to capture the rich structure of events and their dependencies on a news topic through event models, by recognizing events and their dependencies in event threading. Besides the standard word-based features, their approach took into account novel features such as the temporal locality of stories for event recognition. In addition, some studies [24][25][26][27] have analyzed social networks to search for or detect emergency events on the internet. Dai et al. [25] presented a cycle model to describe the internet spreading process of emergency events, which applied the Tobit model by analyzing social psychological impacts. Hu et al. [27] analyzed historical attributes combined with HowNet polarity and sentiment words in microblogs that carry information about social emergency events, and they provided guidance for analyzing the dissemination of microblog information related to social emergency events on the internet. Meanwhile, Dai et al. [24] proposed a method to search for the shortest paths of emergency events through the IBF algorithm by analyzing social networks.
Some research has focused on summarizing Twitter posts for detecting events [28][29][30][31]. Harabagiu and Hickl [28] focused on the summarization of microblog posts relating to complex world events. To summarize, they captured event structure information from tweets and user behavior information relevant to a topic. Takamura et al. [31] summarized Japanese Twitter posts on soccer games during the time when people provided comments and expressed opinions on the timeline of a game's progress. They represented user actions in terms of retweets, responses, and quoted tweets. In particular, Sharifi et al. [30] detected events in Twitter by summarizing trending topics using a collection of a large number of posts on a topic. They created summaries in various ways and evaluated them using metrics for automatic summary evaluation.
Recently, several approaches have been proposed to detect events from tweets using topic models [8,10,12]. Kireyev et al. [8] explored the use of topic models for the analysis of disaster-related Twitter data. Becker et al. [32] and Popescu et al. [33] investigated discovering clusters of related words or tweets that correspond to events in progress. Sakaki et al. [12] investigated the real-time interaction of events in Twitter such as earthquakes and proposed an algorithm to monitor tweets and detect a target event. Diao et al. [10] attempted to find topics with bursty patterns on microblogs; they proposed a topic model that simultaneously captures two observations: posts published around the same time and posts published by the same user. However, existing approaches still fail in either latent topic detection or the analysis of term relationships, because tweet messages usually share very few common words within a topic. Therefore, in this paper we propose a method to discover the relationships of objects in tweets by exploiting language models, which are used to compare the snippets indirectly for classifying events in Twitter.

Exploiting Language Models to Classify Events
In this paper, we investigate the use of generative and discriminative models for identifying the relationships among objects in tweets that describe one or more instances of a specified event type. We adapt language modeling approaches that capture how descriptions of event instances in text are likely to be generated. We use language models to select plausible relationships between term words in tweets, such as "Object-Object" or "Object-relation-Object" relationships, with the aim of detecting the relatedness of an event in tweets. We assume that the data collection behind the language models contains suitable knowledge about the relationships among term words, so that the elemental relationships among tweets can be discovered by statistical analysis and used to classify events. We explore two language models that have obtained high correlation with human judgment: ConceptNet and LDA-SP. These models are used for calculating the similarity of pairs of tweets for detecting events. The relationship between the distinguishing term words of the tweets is discovered by checking their relatedness under pairs of relations. In addition, the similarity between tweets is computed based on their common term words and the relationship between their distinguishing term words. This similarity is intuitive and convenient to apply in classifier algorithms to classify events in Twitter. The proposed method consists of four stages, shown in Figure 1: (1) data collection, (2) labeling, (3) data modeling, and (4) machine learning. Stages 1 and 2 are discussed in Section 4.1; stage 3 is discussed in Section 3; and stage 4 is discussed in Sections 3.3 and 4.2.

ConceptNet Model.
To model the "Object-Object" relationships in tweets, we consider the ConceptNet [16] model. It is a large semantic graph containing concepts and the relations between them. It includes everyday basic, cultural, and scientific knowledge, which has been automatically extracted from the internet using predefined rules. In this work, we use ConceptNet 5, the most current version. Because it is mined from free text using rules, the database has an uncontrolled vocabulary and contains many false or nonsensical statements. ConceptNet contains 24 relations with over 11 million relation pairs. For example, "Nasa is located in United States" is represented as AtLocation("Nasa", "United States") in the ConceptNet model.
In the experiments, relations are extracted by keyword matching against the relations in target events.
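As a rough illustration of the keyword matching described above, the following sketch looks up the relations that link two terms in a small in-memory list of assertions. It is not the actual ConceptNet 5 interface: the `ASSERTIONS` list, `build_index`, and `relations_between` are hypothetical names, and only the AtLocation example is taken from the text.

```python
from collections import defaultdict

# Toy stand-in for ConceptNet assertions: (relation, start concept, end concept).
# Only the AtLocation entry comes from the text; the others are illustrative.
ASSERTIONS = [
    ("AtLocation", "nasa", "united states"),
    ("IsA", "amy winehouse", "singer"),
    ("RelatedTo", "dead", "passed away"),
]


def build_index(assertions):
    """Index assertions by the unordered pair of concepts they connect."""
    index = defaultdict(set)
    for relation, start, end in assertions:
        index[frozenset((start, end))].add(relation)
    return index


def relations_between(index, term_a, term_b):
    """Return the relation names linking two terms, matched by lowercase keyword."""
    key = frozenset((term_a.lower(), term_b.lower()))
    return sorted(index.get(key, set()))


index = build_index(ASSERTIONS)
print(relations_between(index, "Nasa", "United States"))  # ['AtLocation']
print(relations_between(index, "Nasa", "Moon"))           # []
```

Indexing by an unordered pair is a simplification made here so that the lookup does not depend on argument order; a real relation store would normally keep the direction of each relation.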

LDA-SP Model.
To model the "Object-relation-Object" relationships in tweets, we adapt the LDA-SP model [17], which has been used for the selectional preference task to obtain the conditional probabilities of two objects in a relation. In particular, LDA-SP, using LinkLDA [34], is an extension of latent Dirichlet allocation (LDA) [13] that simultaneously models two sets of distributions for each topic. The generative graphical models of LDA and LDA-SP are depicted in Figure 2. The authors of LDA-SP presented a series of topic models, to which objects are assigned, for the task of computing selectional preferences. These models vary in the degree of independence that is assumed between Topic i and Topic j. The two sets of distributions represent the two arguments of the relation R(Topic i, Topic j). Each topic contains a list of relation words. For each relation R, both topics are drawn from the same distribution, so that two different topics, Topic i and Topic j, share the same relation (Figure 2(b)). LDA-SP is thus able to capture information about the pairs of topics that commonly co-occur. To model the relations with LDA-SP, we also follow the data preparation in [21], in which tuples were automatically extracted by TextRunner [35] from 500 million Web pages. This resulted in a vocabulary of about 32,000 noun phrases and a set of about 2.4 million tuples with 601 topics in our generalization corpus. Some samples of topics extracted through LDA-SP are illustrated in Table 3.
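The generative idea described above can be sketched as follows: for one relation, a single topic distribution is drawn, and the topic of each argument is then sampled from that same distribution, while each topic keeps separate word lists for the first and second argument. This is a simplified illustration only. The function names, the toy topics, and the uniform choice of words within a topic are assumptions, and no inference is performed.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_tuples(relation, n_tuples, arg1_topics, arg2_topics, alpha, rng):
    """Generate (arg1, relation, arg2, z1, z2) tuples for a single relation."""
    theta = sample_dirichlet(alpha, rng)  # one topic distribution per relation
    tuples = []
    for _ in range(n_tuples):
        z1 = sample_index(theta, rng)  # topic of the first argument
        z2 = sample_index(theta, rng)  # topic of the second argument, same theta
        arg1 = rng.choice(arg1_topics[z1])  # uniform within topic (assumption)
        arg2 = rng.choice(arg2_topics[z2])
        tuples.append((arg1, relation, arg2, z1, z2))
    return tuples


# Hypothetical topics: each topic has one word list per argument position.
ARG1_TOPICS = [["plane", "helicopter"], ["singer", "amy winehouse"]]
ARG2_TOPICS = [["russia", "khl team"], ["london", "home"]]

rng = random.Random(0)
samples = generate_tuples("crash", 5, ARG1_TOPICS, ARG2_TOPICS, [0.5, 0.5], rng)
for sample in samples:
    print(sample)
```

Because both topic indices come from the same `theta`, topic pairs that are both likely under a relation tend to be generated together, which is the co-occurrence behavior the text attributes to LDA-SP.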

Similarity Measures in Tweets.
Classifying events in tweets from Twitter is a very challenging task because very few words co-occur in tweets. Intuitively, the problem can be solved by exploring the relationships between tweets; the intrinsic relationships among words may be discovered with a thesaurus. Hence, we present a method to discover the intrinsic relationships between objects based on statistical analysis of language models and then derive the similarity between tweets accordingly. We consider two types of relationships in tweets: "Object-Object" and "Object-relation-Object."
"Object-Object." The event "Death of Amy Winehouse" is posted in tweets T1, T2, and T3 shown in Figure 3. Traditional methods can only find one co-occurring term, "Amy Winehouse," in the tweets after removing stop words. However, if we analyze and compare the relatedness between the pairs <"Singer"-"Amy Winehouse">, <"Amy Winehouse"-"passed away">, <"Amy Winehouse"-"dead">, and <"Amy Winehouse"-"R.I.P.">, closer relationships are exposed: "Object-Object" as "Topic 1-Topic 2," where the set of terms {"Singer"; "Amy Winehouse"} is in Topic 1 and the set of terms {"death", "passed away", "R.I.P."} is in Topic 2.
"Object-relation-Object." The event "plane carrying Russian hockey team Lokomotiv crashes" is posted in T4, T5, and T6 shown in Figure 4. We can discover "Object-relation-Object" relationships such as <"Plane"-"crash"-"KHL team Lokomotiv">, <"Plane"-"crash"-"Russia">, and <"Plane"-"crash"-"KHL team">. This also exhibits the closer relationship "Object-relation-Object" as "Topic 3-crash-Topic 4," where the term {"plane"} belongs to Topic 3 and the set of terms {"russia", "khl team lokomotiv", "hockey", "khl team"} belongs to Topic 4.
Our method extracts relation tuples from language models such as ConceptNet and LDA-SP. We treat all tweets in the collection equally and then match them against the tuples generated from ConceptNet and LDA-SP. Hence, if we can discover relation tuples that act as a "third party" for both tweets and calculate the similarity between the two tweets by comparing the distinguishing term words with these tuples, we may find the real relationship underlying the two tweets. We assume that the data collection behind the language models contains sufficient knowledge about the relationships among term words, from which we can find the elemental relationships among tweets. For computing the similarity between tweets, we derive a set of relations, $r = (o_i, o_j)$, matched between the language models and the tweets, and combine it with Bag-of-Words. Considering two original tweets, $d_1$ and $d_2$, in data collection $D$, we check the objects $o_i$, $o_j$ existing in each tweet against the relation tuples $r = (o_i, o_j)$ extracted from the ConceptNet model. When using LDA-SP, we examine not only the relations $r$ but also the objects $o_i$, $o_j$ existing in each tweet and then match them with the relation tuples $r = (o_i, o_j)$ generated from LDA-SP. We then replace the matched objects in tweets by the relation tuples from the language models. Thus, the relationship between the distinguishing terms of the tweets can be discovered by examining their relatedness under pairs of relations through the "third party."
We calculate the similarity between two tweets based on their common terms and the relationship between their distinguishing terms. To calculate the similarity between two tweets in an event category, we represent them as vectors:
$$\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{nj}),$$
where $w_{ij}$ is the weight of the $i$th feature in the vector of $d_j$ and is defined by the tf-idf measure as follows:
$$w_{ij} = \mathrm{tf}_{ij} \times \log\frac{N}{\mathrm{df}_i},$$
where $N$ is the total number of documents in the collection, $\mathrm{df}_i$ is the document frequency, that is, the number of documents in which term $t_i$ occurs, and $\mathrm{tf}_{ij}$ is the term frequency, that is, simply the number of occurrences of term $t_i$ in document $d_j$.
With the relationship between the two distinguishing term words on a diversity of assigned model tuples, we can calculate the similarity of the vectors $\vec{d}_1$ and $\vec{d}_2$ of two tweets with the cosine method:
$$\mathrm{sim}(\vec{d}_1, \vec{d}_2) = \frac{\sum_{i=1}^{n} w_{i1} w_{i2}}{\sqrt{\sum_{i=1}^{n} w_{i1}^{2}} \sqrt{\sum_{i=1}^{n} w_{i2}^{2}}},$$
where $w_{i1}$ and $w_{i2}$ are the weights of the $i$th feature in the two vectors.
For classifying events from tweets, many classifiers first need to calculate the similarity between tweets. The performance of kNN depends mainly on the similarity calculation and on the selection of a proper number of neighbors. Therefore, it is intuitive and convenient to apply the similarity calculation between tweets to kNN for classifying events. If our proposed method calculates the similarity among tweets more accurately, kNN will select more appropriate neighbors for a test case, and its classification performance will be higher than with the original tf-idf alone. We therefore expect the proposed method to be more effective at calculating tweet similarity for event classification. The results are discussed in more detail in the experimentation section.
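The pipeline described in this section (replace matched terms by a shared relation label, weight features with tf-idf, compare tweets with cosine similarity, and classify with kNN) can be sketched as follows. This is a minimal sketch under stated assumptions: the `TERM_TO_RELATION` mapping is hypothetical and stands in for tuples matched in ConceptNet or LDA-SP, the weight is taken as tf times log(N/df), and the toy tweets and labels are illustrative.

```python
import math
from collections import Counter

# Hypothetical mapping from distinguishing terms to a shared relation label.
# In the paper this role is played by tuples matched in ConceptNet or LDA-SP.
TERM_TO_RELATION = {"dead": "REL_DEATH", "died": "REL_DEATH", "r.i.p.": "REL_DEATH"}


def substitute(tokens):
    """Replace matched terms by their relation label; keep other tokens."""
    return [TERM_TO_RELATION.get(token, token) for token in tokens]


def idf_weights(docs):
    """Compute log(N / df) for every term seen in the training documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def vectorize(doc, idf):
    """Build a sparse tf-idf vector; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf if term in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query, vectors, labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_tokens = [
    ["amy", "winehouse", "found", "dead"],
    ["singer", "amy", "winehouse", "died"],
    ["plane", "crash", "kills", "team"],
    ["plane", "crash", "russia", "hockey"],
]
labels = ["event 1", "event 1", "event 2", "event 2"]

docs = [substitute(tokens) for tokens in train_tokens]
idf = idf_weights(docs)
vectors = [vectorize(doc, idf) for doc in docs]

query = vectorize(substitute(["winehouse", "r.i.p."]), idf)
print(knn_predict(query, vectors, labels, k=3))  # event 1
```

In this toy example the query tweet shares the literal word "winehouse" with the first two training tweets, and the substituted label `REL_DEATH` adds a second shared feature even though "r.i.p.", "dead", and "died" are different surface words.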

Experimental Datasets and Evaluation Measures.
We conducted experiments on the Edinburgh Twitter Corpus [36], a collection of events in Twitter, for event classification. The corpus contains 3034 tweet IDs spread over 27 event categories. Some tweets in the dataset have since been deleted or lost from Twitter. We developed a tool using the Twitter API (http://twitter4j.org) to collect documents including tweets, retweets, responses, and quoted tweets; we then filtered the documents to guarantee that each event category contains at least 70 tweets. After the removal of noise and stop words, each word is stemmed into its root form. Table 4 marks the nine remaining significant event categories used in the experiments with a check mark: event 1, event 6, event 7, event 9, event 13, event 14, event 15, event 16, and event 21.
In this study, experiments are evaluated based on precision, recall, and F-measure. These evaluation metrics are often used to rate the performance of information retrieval systems. Precision is the number of correct results divided by the total number of returned responses; recall is the number of correct results divided by the number of results that should have been returned; and the F-measure balances recall and precision, as follows:
$$\text{Precision} = \frac{\text{number of correct responses}}{\text{number of responses}}, \qquad \text{Recall} = \frac{\text{number of correct responses}}{\text{number of corrects}},$$
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
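The metrics above can be computed per event category as in the following sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall); the function name and the toy labels are illustrative.

```python
def precision_recall_f(predicted, gold, target):
    """Compute precision, recall, and balanced F-measure for one category."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == target and g == target)
    returned = sum(1 for p in predicted if p == target)  # number of responses
    expected = sum(1 for g in gold if g == target)       # number of corrects
    precision = correct / returned if returned else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


gold = ["e1", "e1", "e1", "e2", "e2"]
predicted = ["e1", "e1", "e2", "e1", "e2"]
precision, recall, f_measure = precision_recall_f(predicted, gold, "e1")
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.667 0.667 0.667
```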

Experiments and Comparison.
To check the similarity between tweets before the experiments, we selected some sample tweets from the experimental datasets, as shown in Table 1. We used tf-idf combined with the similarity functions to compare performance before and after using language models. Note that T1 and T2 discuss the same event, and T4 and T5 also discuss the same event. The similarity of each of these two pairs was calculated after stop-word removal. The results in Table 5 show that using ConceptNet and LDA-SP increases the similarity of tweets from the same category. Moreover, if a tweet does not belong to the target event, like T3 and T6, the method reduces the similarity measure, which helps the system classify efficiently.

Table 1 (fragment): sample tweets.
T2: Amy Winehouse found dead at her home in North London.
T3: such a shame I loved her music R.I.P. Amy Winehouse.

To classify events, 70% of the tweets in each category are randomly selected for training, and the rest are used for testing. In our experiments, we compare the performance of four classifiers: (1) baseline kNN (without a language model); (2) baseline SVM; and kNN combined with our proposed methods, namely, (3) kNN-M1 (kNN with the ConceptNet language model) and (4) kNN-M2 (kNN with the LDA-SP language model). The SVM is also constructed using the tf-idf method to weight each vector component of the tweet and is used as a second baseline for comparison with our proposed methods. We chose SVM because it is a powerful and robust method for text classification [37][38][39]. The evaluation follows a 5-fold cross-validation scheme. Table 6 shows the performance results for the nine event categories from Twitter. The bold numbers show the best F-measure for each event among the four methods. For instance, the system obtained the highest F-measure of 85.3% in event 1 with method kNN-M2. Method kNN-M1 yielded the better F-measure in most of the event categories: event 6, event 7, event 9, event 14, event 15, and event 16. Method kNN-M2 achieved the better F-measure in three categories: event 1, event 13, and event 21.
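The per-category 70/30 split described above can be sketched as follows. The function name, the fixed seed, and the toy data are assumptions made for illustration; the sketch covers only a single random split, not the 5-fold cross-validation.

```python
import random
from collections import defaultdict


def split_per_category(items, train_fraction=0.7, seed=0):
    """Randomly split (text, label) pairs so each label keeps the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in items:
        by_label[label].append((text, label))
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(round(train_fraction * len(group)))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: 10 placeholder tweets for each of two event categories.
items = [("tweet %d" % i, "event 1") for i in range(10)]
items += [("tweet %d" % i, "event 2") for i in range(10, 20)]

train, test = split_per_category(items)
print(len(train), len(test))  # 14 6
```

Splitting inside each category, rather than over the whole collection, keeps small categories represented in both the training and the test portions.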
The overall performance comparison is presented in Figure 5. We can see that kNN-M1 outperforms kNN-M2, SVM, and kNN. Both of our proposed methods also score higher than the baselines, kNN and SVM, on most of the performance metrics. In the overall results, kNN-M1, kNN-M2, SVM, and kNN obtained F-measures of 85%, 84.7%, 78.4%, and 76.8%, respectively.

Discussions.
We believe that the effective performance of our proposed methods results from the following reasons. First, noise and exclamative and repeated texts usually occur in the tweets of each event. The following are examples of such tweets. T1: "Sad day Sky sources now confirming Amy Winehouse is dead A musical legend who died way too young in my opinion," T2: "Amy Winehouse found dead in her London flat according to sky news," and T3: "Hmm. . .omg. . .gruuu Amy Winehouse is dead not totally surprised though ohhh." We can observe that {"Amy Winehouse"; "dead"} is repeated text, {"gruuu"; "ohhh"} is noise text, and {"Hmm"; "omg"} is exclamative text. The repeated text contributes positively to the similarity measure, whereas noise and exclamative texts contribute negatively. In preprocessing, stop words were removed automatically using a defined list of stop words. However, we checked and revised noise texts manually if they did not belong to the list of stop words. For example, words such as "deaddddd" were revised into "dead," and {"RIP," "R I P"} were revised into "R.I.P."
The second reason is that high-quality universal datasets are used to build the language models. In this study, more than five billion relation records extracted from ConceptNet are used to build the models. In addition, the models from LDA-SP are built by extracting 2.4 million relation tuples and 601 topics. Furthermore, ConceptNet is a graphical relationship model that uses predefined rules, whereas LDA-SP still has some errors [17] in computing word statistics. In the experimental results, the performance of ConceptNet is better than that of LDA-SP.
The third reason concerns how relationships are analyzed. The models extracted from LDA-SP analyze relationships more intensively than ConceptNet does; nevertheless, ConceptNet obtained better performance results. Texts from tweets are incomplete sentences, which cause failures in grammar parsing for relation analysis, and we did not include grammar parsing when analyzing tweets with the LDA-SP model. Therefore, ConceptNet exhibits better performance than LDA-SP for classifying events from Twitter.
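The preprocessing discussed above (removing stop words and noise, collapsing stretched words such as "deaddddd," and unifying the spellings of "R.I.P.") was partly manual in this study. The following sketch shows how similar rules could be automated with regular expressions. The word lists and patterns are illustrative assumptions, not the lists actually used.

```python
import re

# Illustrative lists only; the study used its own stop-word list and manual checks.
STOP_WORDS = {"is", "a", "the", "in", "at", "her", "to", "not", "though"}
NOISE_OR_EXCLAMATIVE = {"hmm", "omg", "gruuu", "ohhh"}


def normalize(text):
    """Lowercase, unify R.I.P. spellings, drop stop/noise words, collapse repeats."""
    text = text.lower()
    # Map "rip", "r i p", and "r.i.p." to the single form "r.i.p."
    text = re.sub(r"\br\.?\s?i\.?\s?p\b\.?", "r.i.p.", text)
    tokens = re.findall(r"r\.i\.p\.|[a-z0-9']+", text)
    kept = []
    for token in tokens:
        if token in STOP_WORDS or token in NOISE_OR_EXCLAMATIVE:
            continue
        # Collapse any character repeated three or more times, e.g. "deaddddd".
        kept.append(re.sub(r"(\w)\1{2,}", r"\1", token))
    return kept


print(normalize("Hmm omg gruuu Amy Winehouse is deaddddd R I P ohhh"))
# ['amy', 'winehouse', 'dead', 'r.i.p.']
```

Collapsing runs of three or more identical characters is a heuristic: it repairs "deaddddd" but would not correct a word stretched by only one extra letter, so some manual checking of the kind described above may still be needed.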

Conclusion and Future Work
We have presented methods to classify events from Twitter. We first find the distinguishing terms between tweets in events and calculate their similarity with the language models LDA-SP and ConceptNet. Next, we discover the relationship between the distinguishing terms of the tweets by examining them under each model. Then, we calculate the similarity between two tweets based on their common terms and the relationship between their distinguishing terms. The outcomes make it convenient to apply kNN techniques to classify events in Twitter. As a result, our approach obtained better performance with both ConceptNet and LDA-SP than the baseline kNN and SVM methods. Several aspects remain for future work. First, the approach can be evaluated on a larger corpus and with other event types.
Second, we will continue to investigate how to apply grammar parsing to tweets so that relationships can be analyzed more deeply for classifying events. Finally, the approach can be extended to unsupervised learning with semantic similarity models such as pointwise mutual information (PMI) [40,41] and latent semantic analysis (LSA) [42,43].