Continuous Similarity Learning with Shared Neural Semantic Representation for Joint Event Detection and Evolution

In the era of the rapid development of today's Internet, people often feel overwhelmed by vast official news streams or unofficial self-media tweets. To help people obtain the news topics they care about, there is a growing need for systems that can extract important events from this amount of data and construct the evolution procedure of events logically into a story. Most existing methods treat event detection and evolution as two independent subtasks under an integrated pipeline setting. However, the interdependence between these two subtasks is often ignored, which leads to a biased propagation. Besides, due to the limitations of news documents' semantic representation, the performance of event detection and evolution is still limited. To tackle these problems, we propose a Joint Event Detection and Evolution (JEDE) model, to detect events and discover the event evolution relationships from news streams in this paper. Specifically, the proposed JEDE model is built upon the Siamese network, which first introduces the bidirectional GRU attention network to learn the vector-based semantic representation for news documents shared across two subtask networks. Then, two continuous similarity metrics are learned using stacked neural networks to judge whether two news documents are related to the same event or two events are related to the same story. Furthermore, due to the limited available dataset with ground truths, we make efforts to construct a new dataset, named EDENS, which contains valid labels of events and stories. The experimental results on this newly created dataset demonstrate that, thanks to the shared representation and joint training, the proposed model consistently achieves significant improvements over the baseline methods.


Introduction
In recent years, with the great development of the Internet and AI technologies, tremendous volumes of news articles to report the breaking events are being rapidly generated by various media providers, for example, TV broadcast, the governmental news website, and information Portals (sina. com, Tencent News, Headlines Today, CNN, BBC, etc.). Besides, some popular social media, such as Twitter, Sina Weibo, and Facebook, also has become an attractive platform for people for expressing opinions, broadcasting news, and discussing topics. Compared with the former traditional news media, the tweet stream produced by the emerging social media or self-media often contains many informal expressions or irrelevant information. Despite massive information about a number of ongoing hot events is spread at every moment, people still feel unable to acquire useful information about the events of concern. In solving the problem of information overload, event detection and evolution on news streams have drawn extensive research attention over the past few years [1][2][3][4][5][6]. Generally, an event refers to a particular thing that happens at a specific time and place [7], for example, Opening Ceremony of the 29 th Modern Summer Olympic Games held in Beijing on August 8 th , 2008. Under the premise of this definition, event detection aims to identify a group of news documents that report the same topic possibly in different ways [8].
erefore, a new topic can be detected if new news documents are found not belonging to existing topic clusters. Followed by event detection, event evolution connects related events logically to tell evolving stories. Event detection and evolution provide an emerging and more natural alternative way to present news corpora; hence, people can be easier to know the development process of topics of interest. In this paper, we focus on the detection and evolution of events from formal news media.
For decades, research has introduced many methods for event detection [5,6,9] and event evolution [1][2][3][4]8] in the area of Topic Detection and Tracking (TDT) [10]. Of these methods, a similar pipeline is followed; that is, event detection and evolution are considered as two independent subtasks. First, news documents are aggregated into event clusters through any clustering algorithm such as K-means [11], incremental clustering [12], and locality sensitive hashing (LSH) [13]. In addition, there are some methods based on topic modeling [14] and term weighting [15] for event detection. After detecting extensive and diverse events, various methods such as event timelines [16], event threads [1], event evolution graphs [3], and information maps [17] are used to identify the relationship between events and present a clear story development process.
Despite great progress in these above methods, the interdependence between the event detection and evolution tasks is often ignored and has not been fully utilized. For example, a news document that comprehensively indicates the events in event detection should be scored high in the corresponding story architecture. Furthermore, as the base of events, the similarity of the two news documents also depicts the similarity of the events that documents belong to; hence, a better understanding of a news document is helpful for both event detection and event evolution. Besides, some of the popular clustering methods, for example, K-means, which requires setting the number of clusters in advance to discover events, may also not be applicable for the rapidly varying news streams that occur daily in the real world.
To handle these challenges, in this paper, we investigate a neural network model that monitors the news streams in the open domain, jointly detecting new events and tracking the evolution process of events, as illustrated in Figure 1. To better describe the news documents, we introduce the advanced bidirectional GRU network followed by attention operations to achieve the fix-length vectorial semantic representation. Such a representation can depict both finegrained news documents and holistic events; hence, it is shared across the event detection and evolution tasks. Moreover, inspired by great achievements achieved by multitask learning, we propose to jointly learn two continuous neural similarity metrics for judging whether two news documents are related to the same event or two events are related to the same story under the framework of the Siamese network. During the testing, the online incremental clustering method of news documents and events are applied for event detection and evolution, respectively. To further promote the interaction between the two subtasks, the strategy of neural stacking [18] is applied to the pipeline. Specifically, the hidden neural layers of the event detection model are fed as additional input features to the event evolution model, and the errors of event evolution are propagated to event detection during training, so that information is better shared between the predecessor detection and successor evolution. e contributions of this paper could be summarized as follows.
(i) To the best of our knowledge, this is the first work that considers the correlation of event detection and event evolution and investigates a neural model for Joint Event Detection and Evolution, which benefits both the tasks. (ii) e introduced GRU attention network produces a more powerful shared semantic representation than traditional features as the basis of two subtasks. Furthermore, the continuous similarity metric learning from data provides a direct, effective, and robust criterion to calculate the similarity of document pairs in event detection and event pairs in event evolution, respectively. e joint learning of shared semantic representation and continuous similarity metrics provides a good foundation for later incremental clustering based event detection and evolution pipeline.  (2) e top right corner shows the shared semantic representation of news stream, which is the input of both event detection network and event evolution network. Note that we utilize a standard bidirectional gated recurrent unit (GRU) model with attention mechanism to learn the shared representation across two subtasks (event detection subtask and event evolution subtask). (3) e bottom right corner shows the event detection module, in which our goal is to aggregate news reports related to the same event together. (4) After detecting a series of events, we further organize these events into multiple stories in an online manner, which forms the event evolution module in the bottom left corner.
on the real-world EDENS dataset demonstrate that our model significantly outperforms several baseline methods in both event detection and evolution benchmarks. e remainder of this paper is organized as follows: in Section 2, the related literature is reviewed. In Section 3, some concepts and problems are defined. In Section 4, we elaborate on the proposed joint model for event detection and evolution. Experiments and analysis are presented in Section 5. Finally, we conclude the paper and talk about future work in Section 6.

Related Work
In this section, some related literature about event detection, event evolution, and joint model is reviewed.

Event Detection.
Following the original Topic Detection and Tracking (TDT) program [10], event detection aims to discover new or previously unidentified events from the information stream. Most of the works detect events of interest from news media [8,11,19] or social media [5,20,21]. In addition, some research works are also called event detection but are essentially different from the former task; for example, [22] talks about ACE event detection task which focuses on extracting events with entities from sentences and [23] considers the event detection as a text classification problem, that is, categorizing each event to predefined types.
ere are also some works concerning the detection of certain types of events, such as violent and disaster events [5,24], significant events in the calendar [25]. In this paper, we study the problem of detecting events in the open domain from the news streams in an unsupervised clustering manner. e types of events are not specially restricted.
In general, the work on event detection can be divided into three categories, that is, word and phrase-based approaches, distance-based clustering approaches, and probabilistic model-based approaches. Specifically, word and phrase-based approaches detect events by utilizing important words or phrases. Zhou et al. [11] utilize Jaccard similarity coefficient × Inverse Dimension Frequency with time order to identify words with salient scores as event words and then extract the document embedding by applying word2vec to event words, and, finally, Bikmeans is employed to cluster all news documents based on obtained document embeddings. Slonim and Tishby [26] propose a two-phase strategy for document clustering. ey first find word-clusters such that most of the mutual information between words and documents is preserved and then leverage the word-clusters to perform document clustering.
Word and phrase-based approaches mainly focus on mining event words and do not directly utilize the eventrelated documents. On the contrary, distance-based clustering approaches group documents into events by measuring the similarity between documents with some distance metrics like Euclidean distance or Cosine distance. e performance of clustering mainly relies on document feature representations and clustering methods. e common feature representations include TF-IDF [27,28], BM25 term weighting [29], and neural vector [5,20]. e clustering methods include K-means [30], density-based algorithms [31], incremental clustering [5,20,27], or self-organizing map clustering [28]. Among these approaches, those based on neural representation and online incremental clustering show the best performance in both accuracy and efficiency; hence, the proposed model in this paper also follows this strategy.
Probabilistic model-based approaches assume a document generation process and then infer topic distributions of documents by optimizing generation probability [32,33]. Some representative topic models include Latent Dirichlet Allocation (LDA) [14], Probabilistic Latent Semantic Indexing (PLSA) [34], and Gaussian Mixture Model (GMM) [35]. ese models are computationally intensive and do not produce satisfying results for a finer clustering.

Event Evolution. Traditional Topic Detection and
Tracking (TDT) program [10] only detects new events or tracks previously spotted events, but the relationship between events is not interpreted. To help users better know the developing structure of events, different approaches have been proposed from various aspects. Reference [2] proposes a topic evolution model by discovering the temporal pattern of events with timestamps of the text stream. Reference [4] further proposes to construct a temporal event graph to analyze event evolution and determines the dependencies of two events by considering their temporal relationships, content dependencies, and event causality. Reference [1] proposes the concept of Event reading and tries to append each new event to its most similar earlier event. e similarity between two events is measured by the TF-IDF cosine similarity of the event centroids. Reference [3] proposes to measure the similarity relationship of events by analyzing content similarity, time proximity, and document distribution and models the event evolution structure by a directed acyclic graph (DAG).
Although the above structures effectively describe the evolution process of events, sorting events by timestamps omits the logical connection between events, while graph structures do not consider the evolving consistency of the whole story, leading to unnecessary connections between events. To solve these problems, [17] proposes the Metro Map model, which defines some metrics such as coherence and diversity for story quality evaluation and identifies the storyline by solving an optimization problem to maximize the topic diversity of storylines while guaranteeing the coherence of each storyline. Reference [36] further proposes a cross-modal solution, where two requisite properties of an ideal storyline, that is, coherence and diversity, are investigated; then, a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. ese works regard the event evolution as an optimization problem with given news corpora. However, they do not deal with newly generated documents at any time and update story structures in an online manner. Recently, [8] proposes a structure of story tree to Computational Intelligence and Neuroscience characterize the evolution of the story, grows story trees with events online, and presents the most effective representation effect according to extensive user experience. In this paper, various evolution structures are investigated in conjunction with deep learning based event clustering algorithm.

Neural Joint Modeling.
e joint model has been extensively researched in various NLP tasks, for example, joint entity and relation extraction [37], joint event extraction and visualization [38], joint event detection and summarization [5,39], joint event detection and prediction [23], and joint parsing and name entity recognition [40]. e key to joint models is designing shared features to capture mutual information between the integrated tasks. For this purpose, neural network models have been shown effective by inducing semantic features automatically after data-driven joint training [5,41].
Among most of the neural joint models, two main strategies have been employed. On one hand, shared semantic representation learning between different tasks is crucial; for example, a shared LSTM layer is used for joint event detection and summarization [5], as well as joint event detection and prediction [23]. On the other hand, neural stacking is also widely applied by feeding the hidden layer of a predecessor model as additional input features to its successor model [5,18,42]. During training, errors of the predecessor model can thus be propagated to its successor model, so that information is better shared between the predecessor and successor models. In this paper, we first investigate the joint neural model for event detection and event evolution through a shared semantic representation and stacking model.

Preliminary
In this section, we first describe some key concepts and notations used in this paper and formally define our problem.

News Report. A news report (often edited by journalists)
is usually to report significant real-world breaking news. Different from popular tweets in social media, which contain various expressions of the same meaning, many informal colloquial phrases, or irrelevant information, most news reports formally describe time, location, person, organization, action, and some important keywords related to the news. In this paper, a news report R i is represented as and t i are the news body, news title, and news publish time, respectively.

News Stream.
A news stream is a continuous and temporal stream of news reports R 1 , R 2 , . . . , R k , . . . starting at an initial time t 0 .

Event.
An event E is a particular thing that happens at a specific time and place [7,43]. Specifically, it is a set of news documents reporting the same real-world breaking news [8].

Event Timestamp.
As mentioned before, an event consists of several news reports, and each report may have a different publish time. In order to clearly suggest the temporal information of an event, we simply consider the earliest news report time as the event timestamp.

Story.
A story S describes a news topic and comprises a set of related events that report a series of evolving realworld breaking news [8]. A directed link from events E 1 to E 2 indicates a temporal development or a logical connection between two events. An example can help further clarify the concept of stories versus events; that is, " e ZTE incident" in Table 1 is a story, which consists of several events, such as "2018-04-17: the US government bans ZTE from buying sensitive products from the US companies," "2018-04-19: ZTE issues an internal letter and sets up a crisis task force to urge employees to be clear-headed," "2018-04-23: ZTE issues a further announcement: measures have been taken to comply with the rejection order," . . ., "2018-06-08: ZTE's ban is lifted and $1 billion is fined," . . ., "the ZTE incident is over, and ZTE will start again with confidence." 3.6. Event Detection. Following the Topic Detection and Tracking (TDT) program [10], the task of event detection (or event clustering) is to cluster news reports of the same realworld events, so that a new event can be detected if new news reports are found not belonging to existing event clusters.

Event Evolution.
e task of event evolution is to connect the extracted related events to form a story, that is, S � E 1 , E 2 , . . . , E n . Since the relationship of event evolution includes temporal relationship or logical relationship [8], we can describe event evolution from different aspects in this paper.

Joint Event Detection and Evolution. Given a news
represents the news report published at time t i , our objective is to jointly cluster all news reports R into a set of events E � E 1 , E 2 , . . . , E |E| and connect the extracted events to form a set of stories S � S 1 , S 2 , . . . , S |S| , where each story . , E i n contains a set of related events E of the same topic.

Joint Model for Event Detection and Evolution
In this section, we elaborate on the proposed neural model for Joint Event Detection and Evolution (JEDE). e overall framework is illustrated in Figure 2, where H * is the shared semantic representation of a news report document, H dec is the hidden state of event detection, and H evo is the hidden state of event evolution. Specifically, the proposed JEDE model consists of two submodels: event detection network and event evolution network, which are based on the shared semantic representation. Besides, two submodels are stacked by feeding the hidden layer of the predecessor event detection network as additional input features to its successor event evolution network. e above design brings two advantages. On the one hand, as the common input of two subtasks, the shared semantic representation can achieve the optimal parameters through joint training across tasks, which benefit both subtasks. On the other hand, the neural stacking manner makes the errors of event evolution be propagated to event detection during the training stage, so that information is better shared between the predecessor event detection and successor event evolution. Hence, the stacked hidden layer can provide more useful information to assist the successor task during the inference stage.
In the following, we first introduce the shared semantic representation via learning a GRU attention network. en, we describe the event detection network where the key issue is the calculation of similarity between news report documents. Next, we describe the event evolution network where the key issue is the calculation of similarity between events. Finally, the training process of the proposed approach is presented in detail.

Shared Semantic Representation.
We apply a standard bidirectional gated recurrent unit (GRU) model [44] with an attention mechanism to learn the shared semantic  Computational Intelligence and Neuroscience 5 representation across two subtasks. Let d � (w 1 , w 2 , . . . , w n ) be the body words sequence of a news report, where n is the news length and w i is the i-th word in the news report. We first map each word w i into a vector x i using a pretrained word2vec tool (http://code.google.com/p/word2vec). Specifically, after training using the skip-gram algorithm [45], there is a word embedding matrix; that is, where V is the vocabulary of the preset size and d w is the embedding dimension of word w. en, each word is transformed into a real-valued vector by looking up the pretrained W wrd ; that is, where v i is a one-hot representation of |V| dimension. Hence, a news report document can be represented as X � (x 1 , x 2 , . . . , x n ).
en, we employ a bidirectional gated recurrent unit (GRU) network to extract the hidden vector representation H � (h 1 , h 2 , . . . , h n ) of X. e GRU can effectively solve the problem of long-term dependencies in RNN network, and, compared with traditional LSTM, GRU has fewer parameters and is easier to converge; hence, it is more suitable for our real-time system. Specifically, the GRU recurrently processes elements in the sequence X. At each step i, the hidden state vector h i of GRU model is updated based on the current vector x i and the previous state vector h i−1 , formulated as h i � GRU(x i , h i−1 ), where GRU refers to the standard GRU function [44].
Based on the general GRU representation and following the [33], bidirectional GRU processes the word embedding sequence X from both head to tail ( ⟶ ) and tail to head (←) and then concatenates two state vectors as output. Formally, it is calculated as follows: where h → i and h ← i represent the i-th state vector generated by GRU from two directions, respectively, x i is the i-th input embedding vector, h i is the i-th hidden state vector, and [·, ·] represents the concatenation operation.
Finally, the attention mechanism [46] is applied to aggregate the hidden vector representation sequence H � (h 1 , h 2 , . . . , h n ) into a fix-length vector as the shared semantic representation for d; that is, H * � Att ([h 1 , h 2 , . . . , h n ]). Specifically, we first map each hidden vector representation h i into a normalized vector representation u i in a hidden space; that is, where W h and b h are trainable parameters and the value range of u i is between −1 and 1. en, an attention weight α i is computed from the inner product between u i and a trainable vector u s , followed by a softmax operation; that is, e final output vector is the product between the input hidden vector representation sequence H � (h 1 , h 2 , . . . , h n ) and the learned attention weights α � (α 1 , α 2 , . . . , α n ); that is, H * � i α i h i . It can be seen that, by assigning different attention weights to each vector representation, important features are emphasized and noisy or irrelevant information is suppressed, so that we can achieve a more discriminative semantic representation of news report documents for subsequent tasks.

Event Detection
Network. For the event detection task, our goal is to aggregate news reports related to the same event together [10]. To this end, an online incremental clustering algorithm [27] is employed to cluster incoming news reports into corresponding event groups. Specifically, suppose that we have detected k events E � E 1 , E 2 , . . . , E k , where each event includes several news reports; that is, For an incoming news report R, we first calculate the similarity score s i j between it and each new report R i j in the existing event cluster E i and take the maximum similarity score s i � max j s i j to measure the similarity between it and the event cluster E i . en, we consider a threshold μ − 3δ to decide whether the news report R belongs to an existing cluster, where μ is the mean of all previous similarity scores and δ is the standard deviation. If all similarity scores s i , i � 1, 2, . . . , k are below the threshold, we empirically think a new event cluster is detected by including the news report R. Otherwise, the news report is added to the most similar existing event cluster, that is, E m , m � argmax i s i .
According to the above description, it can be seen that the core of the incremental clustering is to calculate the similarity between two news reports R i and R j . For this, we consider the Siamese network [5]. Specifically, for two news reports R i and R j , we first extract their own semantic representation vector H * i and H * j , as described in Section 4.1. en, two vectors are concatenated together, followed by a multilayer perception (mlp) and softmax layer to output the final similarity probability score P dec , as formulated in Here, ⊕ denotes the concatenation operation, σ represents the sigmoid function, and W h dec , b h dec , W dec , and B dec are trainable model parameters.
In addition, for the event detection task, the temporal information is also crucial. Hence, we additionally feed the temporal vector H * t of R i and R j from the news publish time to the Siamese network, resulting in Here, H * t i and H * t j denote the news publish time vector of R i and R j , respectively. It is noted that the temporal vector 6 Computational Intelligence and Neuroscience H * t of time t is mapped from the Unix timestamp of t using the same pretrained word2vec tool as Section 4.1.
e result of event detection network is the detected events set; that is, E � E 1 , E 2 , . . . , E |E| , where |E| is the number of events in the whole dataset, and E i � R i 1 , R i 2 , . . . , R i n , where i n is the number of news report documents in the event E i .

Event Evolution Network.
After detecting a series of events, we further organize these events into multiple stories in an online manner. Each story covers several interrelated events, which are connected based on their temporal order to characterize the evolving process of that story. Suppose that we have discovered l stories S � S 1 , S 2 , . . . , S l , where each story includes several detected events, that is, For an incoming event E, our online algorithm to grow the story first identifies the story to which the event belongs. If it does not belong to any existing stories, we create a new story by including it. Afterwards, for all events in each story, we simply connect them by their timestamp order. More specifically, we first calculate the similarity score δ i j between E and each event E i j in the existing story cluster S i and further take the maximum similarity score δ i � max j δ i j to measure the similarity between it and the story cluster S i . en, we still apply a similar threshold strategy to event detection to decide whether the new event E belongs to an existing story S i . If the similarity δ i is below the threshold, a new story is established. Otherwise, the event is added to the most similar existing story group, that is, S m , m � argmax i δ i .
Similar to the predecessor event evolution, we still consider a Siamese network to calculate the similarities between events due to its high accuracy and ease of integration, and the input of event evolution network is the event set. Concretely, for two events E i and E j , we first take the average of semantic representation vectors of news reports which are subordinate to them, as their own semantic representation vectors ε i and ε j . en, two vectors and corresponding timestamp vectors are concatenated together and fed into the former event detection network to generate the hidden feature vector H dec , as formulated in (5).
Furthermore, since the similarity between news reports is highly related to the similarity between corresponding events, for better integration between event detection and event evolution, we additionally feed the obtained hidden feature vector H dec of event detection network to the Siamese network, as is formalized in where ⊕ denotes the concatenation operation; σ represents the sigmoid function; W h evo , b h evo , W evo , and B evo are model parameters; ε i or ε j is semantic representation vector of the event E i or E j ; and ε t i or ε t j is the timestamp vector of the event E i or E j .
Finally, the proposed model generates a series of story sets; that is, S � S 1 , S 2 , . . . , S |S| , where |S| is the number of stories, and S i � E i 1 , E i 2 , . . . , E i n , where i n is the number of event from S i . Moreover, events in each story are connected from front to back based on their temporal or logical relationship.

Training and Parameters Setting.
e training of the joint model is based on a pair of news reports as input. For the event detection, our training objective is to minimize the cross-entropy loss between the predicted similarity probability and the one-hot ground truth derived from the labeled event ids. Afterwards, we take the H * i and H * j as ε i and ε j , respectively. And they are fed into the subsequent event evolution network to predict the story similarity. A crossentropy loss between the predicted similarity probability and the ground truth derived from the labeled story ids is used to optimize the event evolution network and is also propagated to the predecessor event detection network by the hidden feature vector H dec . Hence, the model parameters from the two networks can be jointly updated using the Nesterov Adam optimizer with the learning rate of 1e − 4 . It should be noted that it is reasonable to take the semantic vector of news reports to represent the event for training the event evolution network because an event consists of several related news reports, which should have similar semantic representations; thus, they are close to their average, which is used in the testing. e model is implemented in Keras with Tensorflow backend. e batch size is set to 16, and the number of epochs is set to 8. Besides, we train word embeddings using the skip-gram algorithm [45] and fine-tune them during the training of the joint model. e size of word embeddings is set to 32, the size of the hidden layers H dec , H evo , and shared semantic representation H * is set to 16. Before the experiment, we calculated the size of the word bag in advance, and our model predicted 2,500 words, with an extra token covering all the other words. e Dropout technology [47] is used on the word embeddings with a ratio of 0.2 for avoiding overfitting.

Experiments
In this section, the proposed model is evaluated on a realworld dataset and proved its effectiveness.

Dataset.
e proposed JEDE model is evaluated on a real-world Chinese News Report Documents dataset collected by the crawler tool (https://github.com/liuhuanyong/ EventMonitor). e original dataset does not contain available labels for all the two subtasks that we investigate. Hence, we have taken efforts to manually annotate 1,931 news report documents with their corresponding event id and story id, which we from now on refer to as EDENS (Event Detection and Evolution from News Streams) dataset, where the number of stories is 12, the number of events is 694, and each story contains 58 events on average, covering different topics in the open domain, as presented in Computational Intelligence and Neuroscience Table 1. During the annotation of the EDENS dataset, we invited four human annotators, including two doctoral students and two senior undergraduate students majoring in Computer Science and Chinese Language and Literature, to read news reports, respectively, mark news events and event evolution orders artificially, and review the results jointly. e interrater agreement is to refer to Baidu Encyclopedia and Wikipedia.
In addition, we removed common stop words and only kept tokens which are verbs, nouns, or adjectives from these news report documents. In the experiments, 80% of documents with annotated event id and story id are randomly selected as the training set, and the remaining data serves as the test set.

Evaluation Metrics.
We choose several metrics to evaluate the effectiveness of our model for event detection and evolution.

P, R, and F1.
e Precision (P), Recall (R), and F1 score are used to evaluate the clustering performance of the event detection and evolution tasks. Formally, they are defined as follows: where T � E 1 , E 2 , . . . , E |E| or S 1 , S 2 , . . . , S |S| is the set of artificially detected real-world events or stories (ground truth), and T ′ � E 1 ′ , E 2 ′ , . . . , E |E′| ′ or S 1 ′ , S 2 ′ , . . . , S |S′| ′ is the set of events or stories detected from our model in this paper. Among the above metrics, Precision measures the percentage of correctly detected events or stories. Recall gives the percentage of events or stories in the ground truth which are correctly detected. F-measure is defined as the harmonic mean of Precision and Recall to balance the two metrics. Higher values for Precision, Recall, and F1 indicate a better model for event detection and evolution.

Normalized Topic Weighted Minimum Cost (C min ).
For the event detection task, we also take the same evaluation method as in [5] that the normalized Topic Weight Minimum Cost (C min ) from the standard TDT evaluation procedure [10] is used to evaluate clustering accuracy. Formally, it is defined as follows: C min � C miss * N miss len(cluster) + C fa * N fa len(cluster) , where C miss and C fa are the costs of a missed detection and a false alarm, respectively, and set to C miss � 0.5, C fa � 0.5 in the experiments. In addition, N miss and N fa are the number of missed detections and false alarms, respectively, and len(cluster) is the size of the event cluster. From the above formulation, it can be seen that C min is a linear combination of missed detection and false alarm error probabilities, which allows a fair comparison between different methods based on such a single metric value. Lower C min shows better performance.

ACC.
Besides, we use ACC (i.e., accuracy) to evaluate the accuracy of event detection and evolution, which is the ratio of the correctly detected results of our model in all annotated results.

Evaluation of Event Detection Task.
We first report the performance of different models for event detection on the Chinese News Report Documents dataset. e following state-of-the-art baseline methods are used for comparisons: (i) JEDS [5] is a joint neural model that includes three subtasks: event filtering, detection, and summary, which is specially tailored for Twitter events. It uses shared text representation and neural stacking for joint event detection and summarization. In addition, it uses LSTM as the document representation of each tweet. (ii) LSH [9] (locality sensitive hashing) is used to detect and track events on unbounded high volume tweet stream in constant time and space, and it utilizes bag-of-words to represent each tweet. (iii) DBSCAN [31] (Density-Based Spatial Clustering of Applications with Noise) is a classical density-based clustering algorithm. Unlike the K-means algorithm, it does not need to preset the number of clusters but can generate clusters for arbitrary shapes based on the number of data speculation clusters.
According to Section 5.2, we use P, R, F1, C min , and ACC as the evaluation metrics of events detection results, as illustrated in Table 2. We can see that all neural networks based models significantly outperform traditional clustering models by a large margin, which can be explained by the fact that neural network models can capture a richer feature representation compared to discrete models. Among all the methods, our proposed JEDS model yields the best performance, reducing C min by around 3%, and improving the F1 score and ACC by around 4% and 1%, respectively, compared to the state-of-the-art method JEDS. is may because our proposed JEDE model uses a more discriminative GRU attention network in place of the regular LSTM in JEDS. Besides, the highest accuracy rate of 77.21% in our model implies that most of the detected document clusters (events) are pure; that is, most events only contain documents that talk about the same real-world breaking news.

Evaluation of Event Evolution Task.
Given the set of events extracted by our JEDE model, we further evaluate the performance of the event evolution task on the large twelve 8 Computational Intelligence and Neuroscience stories mentioned above. As described in [8], the event evolution task essentially involves two aspects: (a) identifying the story to which the event belongs and (b) presenting the evolving structure of that story. erefore, the following experimental results are reported based on these two aspects. Specifically, for task (b), following [8], events in a story can be organized in one of the following four structures, as shown in Figure 3: (a) Flat structure [10]: this structure does not include the dependencies between events. (b) Timeline structure [48]: this structure linearly organizes events by their timestamps. (c) Graph structure [3]: this structure checks the connection between all pairs of events and keeps a subset of most strong connections. (d) Tree structure [8]: this structure, called story tree, applies a tree to characterize the structures of evolving events within a story. e regular updating operations include merging, extending, and inserting.
To make fair comparisons, we use the same preprocessing and event cluster procedures proposed in our JEDE model to develop different story structures. Specifically, the JEDE models with flat, timeline, graph, and tree structures are termed as JEDE-Flat, JEDE-Timeline, JEDE-Graph, and JEDE-Tree, respectively. In addition, the Story Forest model, which introduces the tree structure (d) in Figure 3, clusters events into stories by calculating the compatibility between events and story tree based on their keyword sets. erefore, based on the same events clustering results from our JEDE model and event evolving structure, the Story Forest model is used to compare with our proposed JEDE-Tree model on the task (a). Overall, the above five models are used for benchmarking the event evolution task.
We enlisted 50 volunteers to blindly evaluate the results given by different approaches, and, in line with [8], the output story structures are compared from three aspects: the logical coherence of paths (Consistent Paths), the readability of different story structures (Best Structures), and the numbers of one's repeat reading for understanding the story structure (Repeat Readings). In terms of effectiveness, the Consistent Paths and Best Structure are used to evaluate whether a model can help news readers correctly capture the development of events in a short time. In terms of efficiency, the number of repeat readings for understanding the development of a story is recorded in order to compare the efforts a user spent on understanding. As shown in Table 3, in terms of four evolution structures, the tree structure shows the best performance, followed by the timeline structure, which basically accords with human perception. Besides, the flat structure also presents better results than the graph structure, which may indicate that the logic structures of a large portion of real-world news stories are simple, and thus complex graphs are easy to generate an overkill. It should be noted that path coherence is meaningless for flat or graph structure; hence, we ignore the corresponding metric results. A further observation can find that our proposed JEDE-Tree outperforms the Story Forest model on all metrics, with their only difference is in clustering events into a story. is may be explained that our model learns a better event similarity metric by adopting the Siamese neural network models.

Evaluation of Joint Model.
In order to investigate whether the joint model improves the accuracy of both tasks in our pipeline setting, we compare our model with several variants, including the following: (i) JEDE w/o shared uses a separate bidirectional GRU attention network in event detection and evolution to learn a semantic representation of news reports or events. In this pipeline setting, there is no parameter sharing. e temporal information is not used in learning the similarity metrics of event detection and evolution. (ii) JEDE w/o stack employs a bidirectional GRU attention network to learn a shared semantic representation H * for event detection and evolution without neural stacking and backpropagation between tasks. e temporal information is not used in learning the similarity metrics of event detection and evolution. (iii) JEDE w/o time uses a shared semantic representation and stacked event detection and evolution. e temporal information is not used in learning the similarity metrics of event detection and evolution.
(iv) JEDE is the proposed Joint Event Detection and Evolution model, which learns a shared semantic representation from a bidirectional GRU attention network for both subtasks and stacks both subtasks with backpropagation training. e temporal information is also used in learning the similarity metrics of event detection and evolution. Table 4 tabulates the results of different ablation settings, where the performance of task (a) in the event evolution, namely, events clustering into stories, is reported. As can be seen from the table, (1) the original pipeline model (i.e., JEDE w/o shared) shows the worst results but achieves a significant improvement on both tasks by simply sharing the representation of each task (i.e., JEDE w/o stack).  Computational Intelligence and Neuroscience JEDE). It demonstrates that temporal information can indeed enhance event detection and evolution. In summary, based on the shared semantic representation, neural stacking between both subtasks, and temporal information, the proposed joint model JEDE outperforms all baseline models on event detection and evolution tasks.

Case Study.
In this subsection, we provide a qualitative analysis to present the evolution process of events from our model JEDE, taking the story of " e ZTE incident" for example. According to Section 5.4, we select the most effective "Story Tree" structure to represent the event evolution, as shown in Figure 4. e detected story contains 90 nodes, where each node indicates an event in the ZTE incident, and each link represents a temporal or logical connection between two events. For brevity, we randomly delete some nodes for display. Specifically, for instance, event 12 says " e U.S. side agrees to let ZTE submit additional evidence," and event 55 says "ZTE's ban is lifted and $1 billion is fined." Most events are arranged by timeline, but there are 5 paths to represent the evolution of events by logical relationship, where the path "0->2->29" talks about the beginning of the ZTE incident, branch "30->46->54" is about the China-US trade consultation and its impact on ZTE, branch "49->59->75->84" is related to the ZTE share price, and so forth. Overall, qualitative results from our model JEDE show that our model successfully aggregates related news reports to the same event and further clearly present the development process of the whole story by the "Story Tree" structure, which demonstrates the effectiveness of our model. e bold values highlight the performance of the proposed JEDE model on several metrics. It can be seen from this ablation study that shared semantic representation, neural stacking, and time information help the proposed JEDE model achieve the optimal performance. We have described these results in Section 5.5.    Table 3 that our model with the tree structure performs better than other baseline models. We have described these results in Section 5.4.

Conclusions and Future Work
In this paper, we propose an effective neural network model JEDE for Joint Event Detection and Evolution, which benefits both tasks through information sharing between the two tasks. Our model takes the vast streams of trending and breaking news as input and clusters related news reports into several events as well as organizing events in various sensible story structures such as either a tree or a timeline or a flat structure in an online manner. Different from previous popular pipeline settings, our model first uses the bidirectional GRU network and attention mechanism to learn the vectorial semantic representation of news reports as well as events, without the bag-of-words assumption, which is globally shared on both event detection and evolution tasks. en, two similarity metrics about a pair of news documents or events are learned continuously by neural stacking, so that information is better shared between the predecessor event detection and successor event evolution networks. Empirical experiments on a newly annotated real-world dataset EDENS demonstrate the superior performance of our model over several baseline models on both subtasks. In summary, our model is able to effectively deal with event detection and evolution online for massive amounts of breaking news data.
Despite the remarkable improvements, the proposed model still faces some challenges. For example, as we all know, event description is crucial in event mining task. However, our model just simply uses the title of the earliest news report to summary an event, but without considering the richer contextual information contained in the clustered news reports. Hence, a well designed summarization module is necessary in the future joint model. Besides, this paper primarily talks about events about formal news media, and it is implicitly assumed that detected events are all realistic, due to the authority of formal news reports. However, for emerging social media, lack of supervision and free Twitter expressions may produce some rumor events, which may hinder people's understanding of the truth about topics of interest [49,50]. is issue has not yet been solved by the proposed model; hence, it is worth thinking deeply about how to detect and exclude the interference of rumors and present a clean and clear event evolution process for informal social media. e rumor propagation [51][52][53] mechanism is also another research topic of interest in the future.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.  Computational Intelligence and Neuroscience 11