Learning Evolutionary Stages with Hidden Semi-Markov Model for Predicting Social Unrest Events

Social unrest events are common in modern society and need to be handled proactively. An effective approach is to continuously assess the risk of upcoming social unrest events and predict their likelihood. Our previous work built a hidden Markov model- (HMM-) based framework to predict indicators associated with country instability, but it left two shortcomings: it omitted the interaction between event participants, and it learned the state residence time only implicitly. To address these shortcomings, we propose a new prediction framework in this paper, using frequent subgraph patterns and hidden semi-Markov models (HSMMs). The feature called BoEAG (Bag-of-Event-Association-subGraph) is constructed based on frequent subgraph mining and the bag-of-words model. The new framework leverages the large-scale digital history events captured from GDELT (Global Data on Events, Location, and Tone) to characterize the transitional process of the social unrest events' evolutionary stages, uncovering the underlying event development mechanics and formulating the social unrest event prediction as a sequence classification problem based on Bayes decision. Experimental results with data from five main countries in Southeast Asia demonstrate the effectiveness of the new method, which outperforms the traditional HMM by 5.3% to 16.8% and the logistic regression by 11.2% to 43.6%.


Introduction
The era of information technology has accelerated the development of the Internet of things, social media, and big data. As a data-intensive science, social computing is an emerging field that leverages the capacity to collect and analyze data with unprecedented breadth, depth, and scale. It represents a new computing paradigm and an interdisciplinary research and application field. Topics related to social computing have attracted the attention of more and more researchers.
Social unrest events such as protests, strikes, demonstrations, and occupy movements, which are common in both democracies and authoritarian regimes [1], are an important research focus in the social computing area. Most social unrest events are initially intended as a demonstration to the public or the government. However, they often escalate into general chaos, resulting in violence, riots, sabotage, and other forms of crime and social disorder. Take Thailand for example: a series of political protests and three military coups happened between 1990 and 2015, resulting in the government being deposed and illustrating the power of social unrest. Figure 1 depicts the activities that causally preceded the protest against the amnesty bill in Bangkok on August 7, 2013. Anticipating these latent instabilities before they occur and applying preventive strategies has important ramifications, such as prioritizing citizen grievances for decision makers, issuing travel warnings for the tourism industry, and giving social scientists insight into how citizens express themselves. This has motivated many social and data science researchers to focus on revealing the patterns contained in these events and on predicting future latent social unrest.
Traditionally, research on social unrest was based on static analysis from a macroqualitative perspective by political researchers. With the development of data science, especially the rise of big data, more and more data-driven approaches have been proposed that offer microscopic insight into possible social unrest events. In the last century, most researchers conducted prediction work using human-coded data, including WEIS [2] and COPDAB [3]. In the last two decades, several small-scale vertical machine-readable datasets [4,5] and large-scale coded event data such as ICEWS (Integrated Crisis Early Warning System) [6] and GDELT [7] have appeared, fueling the development of computational methods for the analysis and prediction of social unrest.
Our previous work [8], published in Discrete Dynamics in Nature and Society, built a hidden Markov model- (HMM-) based framework to predict indicators associated with country instability. The framework used the temporal burst patterns in GDELT event streams as features to train the hidden Markov models. There are two shortcomings in that work. First, the temporal burst pattern is essentially a simple feature based on the number of coded events, so the interaction characteristics between event participants are missing. Second, the probability of state residence time in the HMMs decreases exponentially with time, which is not in line with the actual course of social unrest events.
In response to the above shortcomings, we propose a new prediction framework in this paper, using frequent subgraph patterns and hidden semi-Markov models (HSMMs). The new framework also leverages the large-scale digital history events captured from GDELT to characterize the transitional process of the social unrest events' evolutionary stages. Our proposed framework converts the GDELT event streams to frequent subgraph patterns to capture interaction features better. In addition, the HSMM mechanism ensures that the prediction model can explicitly learn the probability distribution of state residence time from the historical data. Finally, the social unrest event prediction is formulated as a sequence classification problem using Bayes decision. More concretely, our main contributions in this updated paper are four-pronged:
(i) First, we identify a sequence of stages of events that potentially lead to social unrest. Typical evolutionary stages of social unrest include appeal, accusation, refuse, escalation, and eruption, where each stage corresponds to a state in the hidden semi-Markov model. It should be noted that not all unrest events go through all four development stages before reaching the eruption stage.
(ii) Second, we propose the BoEAG (Bag-of-Event-Association-subGraph) features to capture the characteristics of frequent patterns instead of the temporal burst patterns used in our previous work [8]. The original GDELT data within a certain time are represented as an event element association graph, from which the frequent subgraph patterns are mined. The BoEAG features are then constructed like the classic BoW (bag-of-words) model [9] used in text processing.
(iii) Third, we propose a hidden semi-Markov model-based framework which contains four major components: ground set extraction, BoEAG feature construction, HSMM training, and event prediction. The ground set contains social unrest events that are significant enough to garner more-than-usual real-time coverage in mainstream news reporting. The BoEAG features of the GDELT stream are taken as the observations. Then, two HSMMs are trained, one for social unrest-prone sequences and one for social unrest-free sequences, after which the likelihoods of new sequences are calculated and predictions are made, with Bayes decision theory specifying the classification rule.
(iv) Last, we conduct extensive experimental evaluations with GDELT event data from five main countries in Southeast Asia. The proposed framework outperforms the traditional HMM by 5.3% to 16.8% and the logistic regression method by 11.2% to 43.6% for different countries. Sensitivity analyses are also conducted, revealing the impact of the parameters on the new framework's performance.
[Figure 1: News excerpts on the activities that preceded the protest against the amnesty bill in Bangkok on August 7, 2013.]
The paper is organized as follows. A brief introduction to related work is provided in Section 2. Our HSMM-based social unrest event prediction framework is presented in Section 3. In Section 4, extensive experiments to evaluate the performance of the new method are conducted and analyzed. The work is summarized and conclusions are drawn in Section 5.

Social Unrest Event Prediction.
Predictive analysis of social unrest events has long stayed at the level of qualitative analysis relying on the experience of experts, especially political scientists. Since 2009, research studies on social unrest event prediction based on data mining have taken shape in some international political science journals [5,10]. Especially since 2013, with the popularization of big data technology, big data-driven social unrest event prediction research has ushered in a period of vigorous development. In the conferences such as SIGKDD [11,12], WWW [13,14], SDM [15], AAAI [1,16], and journals such as IEEE Trans. [17,18], more than 30 related works have been published in succession, and the degree of attention is evident.
Planned event prediction methods do not need to mine patterns from the previous data. They are based on the hypothesis that protests that are larger will be more disruptive and will communicate support for their cause better than smaller protests. Mobilizing large numbers of people is more likely to occur if a protest is organized and the time and place are announced in advance [1,11,25]. For example, Basnet et al. [34] used the GDELT data to propose a clustering method based on spatiotemporal k-dimensional structure trees to study the spatiotemporal distribution of conflict events in India in 2014.
Classification-based prediction incorporates volume features and informative features such as semantic topics to train a classification model and then predicts the occurrence of future events. Several classification methods have been used, such as random forest [13], support vector machines [21], logistic regression [22,24,28,35], and LASSO-based logistic regression [26,27]. Wang et al. [36] used the LSTM model combined with GDELT's event data to predict the number of conflict events. Yang et al. [37] used a two-stage sentiment analysis method based on deep neural networks to conduct early warning research on group aggregation behavior. Phillips [38] summarized the use of social media to predict future events, including applied research in the detection of political events and threat events. Parrish [39] used a GRU recurrent neural network sequence model and aggregated the GDELT event data by day, splicing them into feature vectors to determine whether a country has a social unrest event, including domestic political crisis, riots, racial violence, and change of leadership. Zhao et al. [40] used multitask learning with geographical spatial stratification to judge whether unrest events occurred on a specified date. Wu et al. [41] used the "Protest Participation Theory" proposed in the field of political science, combined with a support vector machine (SVM) model, to conduct early warning research on social unrest events. Deng et al. [12] extracted and learned graph representations from historical event documents. By employing the hidden word graph features, the model predicts the occurrence of future events and identifies sequences of dynamic graphs as event context.
Time series-based mining exploits the temporal correlation of relevant features, such as tweet volume, by adopting appropriate approaches. For example, Achrekar et al. [42] used autoregressive modeling to predict flu trends using Twitter data. Radinsky et al. [29] used NYT news articles from 1986 to 2007 to build event chains and identify significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world.
So far, there are few works aiming at utilizing GDELT to make predictions about social unrest. Existing works attempted to use linear regression [43], time series forecasting [44], deep neural networks [36,39], and frequent subgraphs [28,35] to conduct the prediction work using GDELT. In [27], GDELT and ICEWS are used as data sources to predict unrest in Latin America. Nevertheless, in these works, comparatively little attention has been paid to consider the event evolutionary stages in the prediction models.

Hidden Semi-Markov Model.
A hidden semi-Markov model (HSMM) is a statistical model with the same structure as a hidden Markov model except that the probability of there being a change in the hidden state depends on the amount of time that has elapsed since entry into the current state.
This is in contrast to the original hidden Markov models, where there is a constant probability of changing state given survival in the state up to that time.
HSMM was first proposed by Baum et al. [45] and has been successfully used in many applications, including word recognition [46], daily return series modeling in financial markets [47,48], equipment health diagnosis and prognosis [49], activity recognition and abnormality detection [50], DNA analysis [51], and online failure prediction [52]. It is worth noting that our work draws on the basic idea of online failure prediction in a commercial telecommunication system by Salfner et al. [33,52]. These works motivated us to apply the hidden Markov model and the hidden semi-Markov model to the social unrest event prediction task. In particular, we adopt the prediction mechanism and the Bayes decision-based classification.

The GDELT Dataset.
The GDELT Project [7] is a real-time network diagram and database of global human society for open research. It monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, and events driving our global society every second of every day, creating a free open platform for computing on the entire world. Each day, the GDELT Project compiles a list of over 300 categories of "events," from riots and protests to peace appeals and diplomatic exchanges, recording the details of each event, including its georeferenced location, into a master "event database" of more than a quarter billion events, dating back to 1979 and updated each morning around 4 AM EST. In particular, GDELT 2.0 has been online since 19 February 2015 and updates every 15 minutes, capturing the world's breaking events and reactions in near real time.
In the GDELT event data table, each record has 58 fields (61 fields in GDELT 2.0), capturing information pertaining to a specific event in CAMEO format [53]. In this paper, we use the following nine fields from a record: SQLDATE, MonthYear, EventRootCode, GoldsteinScale, NumMentions, AvgTone, ActionGeo_CountryCode, ActionGeo_Lat, and ActionGeo_Long. SQLDATE and MonthYear are the date the event took place in YYYYMMDD format and YYYYMM format, respectively. EventRootCode defines the root-level category the event code falls under. For example, code 1452 (engage in violent protest for policy change) has a root code of 14 (PROTEST). This makes it possible to aggregate events at various resolutions of specificity. GoldsteinScale is a numeric score from −10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. NumMentions is the total number of mentions of this event across all source documents, which can be used to assess the importance of an event: the more discussion of that event, the more likely it is to be significant. AvgTone is the average tone of all documents containing one or more mentions of this event. The score ranges from −100 (extremely negative) to +100 (extremely positive). ActionGeo_CountryCode is the 2-character FIPS10-4 country code for the location of the event. ActionGeo_Lat and ActionGeo_Long are the centroid latitude and centroid longitude of the landmark for mapping. The dataset is also available on Google Cloud Platform and can be accessed using Google BigQuery. In this paper, we export the GDELT event data for the experiments from the Google BigQuery web service.

HSMM-Based Social Unrest Event Prediction
3.1. Framework. Proactive reaction to social unrest events is at first glance closely coupled with social unrest event detection: an unrest event needs to be detected before the government can react to it. However, what should primarily be avoided is not the detection result but the eruption of the social unrest event itself, which makes a big difference. Hence, efficient proactive handling of social unrest events requires predicting the future level of social unrest, to judge whether the current situation bears the risk of an unrest event or not. The evolutionary stages of a social unrest event cannot be directly observed. However, the stages are more or less explicitly encoded in reports on the Internet. The basic assumption of our approach is that the eruption of social unrest events can be identified, using HSMMs, from the frequent subgraph patterns of the event sequence prior to the time of occurrence. The prediction mechanism for upcoming social unrest events is illustrated in Figure 2. If a prediction is performed at time $t$, we would like to know whether a social unrest event will occur between time $t + \Delta t_l$ and $t + \Delta t_l + \Delta t_p$.
$\Delta t_l$ is usually called the lead time. It has a lower bound called the warning time $\Delta t_w$, which is determined by the time needed for the responsible organization, such as the government, to perform some proactive action, e.g., the time needed to make a public statement. $\Delta t_d$ is the data window size, that is, the length of the data window which contains the predictive sequence of data. The sequence describes the current state of the country or district. The prediction period $\Delta t_p$ is the length of the time interval for which the prediction holds.
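As a minimal sketch of this timing scheme, the following Python snippet derives the data window and the prediction interval from a prediction time t. The function name `prediction_windows`, its parameter names, the choice of days as the time unit, and the example values are illustrative assumptions, not part of the original framework.

```python
from datetime import date, timedelta


def prediction_windows(t, data_window, lead_time, prediction_period, warning_time=0):
    """Return ((data_start, data_end), (pred_start, pred_end)) as dates.

    All durations are in days. Assumption: the data window ends at t
    (inclusive) and the prediction interval is [t + lead, t + lead + period].
    """
    if lead_time < warning_time:
        # The lead time is bounded below by the warning time.
        raise ValueError("lead time must not be shorter than the warning time")
    data_start = t - timedelta(days=data_window - 1)
    pred_start = t + timedelta(days=lead_time)
    pred_end = pred_start + timedelta(days=prediction_period)
    return (data_start, t), (pred_start, pred_end)


# Hypothetical settings: 14-day data window, 3-day lead time, 7-day prediction period.
data_win, pred_win = prediction_windows(
    date(2013, 8, 1), data_window=14, lead_time=3, prediction_period=7, warning_time=1
)
```

Raising an error when the lead time falls below the warning time simply makes the lower bound described above explicit.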
Based on the above prediction mechanism, our prediction task revolves around predicting significant social unrest events at the country level, considering each country on its own. To accurately predict social unrest events, it is crucial to be able to characterize the underlying stage of these events before their occurrence by utilizing the relevant GDELT event record observations. We propose a hidden semi-Markov model-based framework to characterize the underlying development of these events. Figure 3 illustrates the proposed HSMM-based social unrest event prediction framework, which contains four major components: ground set extraction, BoEAG feature construction, HSMM training, and event prediction.
Formally, denote ER as a basic GDELT event record. ER("column name") means the value of a specified column in a record. Denote $D = \{ER_{c,t}\}_{c \in \Omega,\, t \in \Gamma}$ as a collection of GDELT event records split by country $c \in \Omega$ over the time period $\Gamma$. The country $c$ and the day $t$ can be filtered by ER(ActionGeo_CountryCode) and ER(SQLDATE), respectively. Since event records are added to the GDELT event table by the hundreds or thousands every day, we aggregate them by day and define $DAER_{c,t}$ as the daily aggregated event record on day $t$ in country $c$.
Then, a sequence of DAERs is defined as $s = \{DAER_{c,t}\}_{t \in T \subseteq \Gamma}$, which contains all the daily aggregated event records in country $c$ in the time period $T \subseteq \Gamma$.
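The daily aggregation can be sketched in a few lines of Python. The field names follow the GDELT columns listed earlier; the helper names `aggregate_daily` and `daer_sequence`, the choice of summary fields stored per day, and the sample records are assumptions made only for illustration.

```python
from collections import defaultdict


def aggregate_daily(records):
    """Group raw event records into daily aggregates keyed by (country, day)."""
    daily = defaultdict(lambda: {"n_events": 0, "mentions": 0, "unrest_mentions": 0})
    for rec in records:
        key = (rec["ActionGeo_CountryCode"], rec["SQLDATE"])
        agg = daily[key]
        agg["n_events"] += 1
        agg["mentions"] += rec["NumMentions"]
        if rec["EventRootCode"] == "14":  # CAMEO root code 14 = PROTEST
            agg["unrest_mentions"] += rec["NumMentions"]
    return dict(daily)


def daer_sequence(daily, country, days):
    """Ordered daily aggregates for one country; days without records are empty."""
    empty = {"n_events": 0, "mentions": 0, "unrest_mentions": 0}
    return [daily.get((country, day), empty) for day in days]


# Hypothetical records, for illustration only.
records = [
    {"ActionGeo_CountryCode": "TH", "SQLDATE": "20130807", "EventRootCode": "14", "NumMentions": 12},
    {"ActionGeo_CountryCode": "TH", "SQLDATE": "20130807", "EventRootCode": "04", "NumMentions": 3},
    {"ActionGeo_CountryCode": "MY", "SQLDATE": "20130807", "EventRootCode": "14", "NumMentions": 5},
]
daily = aggregate_daily(records)
seq = daer_sequence(daily, "TH", ["20130806", "20130807"])
```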

Ground Set Extraction.
Ground truth is absolutely vital for the prediction problem. Unfortunately, until now, there is no public ground set in the social unrest prediction area.
As a result, in this paper, we treat GDELT as the ground truth for social unrest events. According to our manual inspection, the generated ground set reflects real-world happenings well (see Figure 4).
For each country, the social unrest events we are interested in predicting are those that are significant enough to garner more-than-usual real-time coverage in mainstream news reporting for the country. That is, there is a significant social unrest event in country $c$ on day $t$. In GDELT, root event code 14 can be taken to mean social unrest. More records with event code 14 mean more social unrest event report coverage. For each country $c$ we are interested in, we first aggregate the count of event mentions with root event code 14 on each day $t$. Since new events are added to GDELT by the hundreds or thousands every day, there is a heterogeneous upward trend in event mentions, so what counts as more than usual changes over time. To remove this upward trend, we normalize the mention counts with root code 14 by the average volume of the trailing quarter (90 days). That is, we define $M_{c,t}$ as the normalized total count of social unrest event mentions on day $t$ in country $c$, obtained from ER(NumMentions), the value of NumMentions of each record. Next, we define the average event mention count per day in country $c$ as the mean of $M_{c,t}$ over $t \in \Gamma$, where $\Gamma$ denotes the set of days in the training set.
To smooth the data, we consider a seven-day moving average. By definition, we say that a significant social unrest event in country $c$ occurs during the 7-day stretches whose moving average is significant with respect to the threshold $\theta$, where $\theta$ is the significance threshold.
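The labeling steps can be illustrated with the following Python sketch. Since the exact formulas are not reproduced here, the code makes its own assumptions: "volume" is taken to be the total daily mention count, the first days without a full trailing window are left unlabeled, and a 7-day stretch is called significant when its moving average is at least θ times the training average. All function names are hypothetical.

```python
def normalize_by_trailing_volume(unrest, volume, window=90):
    """unrest[t] divided by the mean of volume over the trailing `window` days.

    Days without a full trailing window get None (assumption).
    """
    out = []
    for t in range(len(unrest)):
        if t < window:
            out.append(None)
            continue
        avg = sum(volume[t - window:t]) / window
        out.append(unrest[t] / avg if avg > 0 else 0.0)
    return out


def forward_moving_average(values, k=7):
    """Mean of values[t:t+k] for every t that starts a full k-day stretch."""
    return [sum(values[t:t + k]) / k for t in range(len(values) - k + 1)]


def significant_stretch_starts(m, theta, k=7):
    """Start days of k-day stretches whose average is >= theta * overall mean.

    The comparison against theta * mean is an assumed form of the rule.
    """
    baseline = sum(m) / len(m)
    smoothed = forward_moving_average(m, k)
    return [t for t, v in enumerate(smoothed) if v >= theta * baseline]


# Toy data with short windows so the arithmetic is easy to follow.
normalized = normalize_by_trailing_volume([2, 2, 2, 6], [10, 10, 20, 20], window=2)
starts = significant_stretch_starts([1, 1, 1, 1, 5, 5, 1, 1], theta=2.0, k=2)
```

Skipping the first trailing window is consistent with training data that begin one quarter after the first available day.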

BoEAG Feature Construction.
The Bag-of-Event-Association-subGraph (BoEAG) feature is constructed from frequent subgraphs and the bag-of-words model. The original GDELT data within a certain time are first represented as one large event element association graph. Then, the frequent subgraph patterns are mined from this single large graph. Finally, the BoEAG features are constructed like the classic BoW (bag-of-words) model. The event element association graph draws on the SUBDUE system [54], which analyzed aviation safety events using graph mining. That system converts a series of aviation safety-related event records into graph data for processing. The node labels represent the aviation safety event id and the attribute values. The edge labels represent the attribute names (such as location, time, and flight altitude) and the relationships between events. For example, the "near_to" relationship means that the distance between the locations of two accidents is within 200 km. Figure 5 gives a schematic diagram of the event element association graph used in this paper. The figure contains two events numbered id1 and id2. The node labels in the figure represent the number and attribute values of the GDELT event record, and the edge labels represent the attribute names, such as event type, location, participants, and GoldsteinScale value. When two events share at least one participant, they are connected by an edge labeled "relate_to." The bag-of-words model is a feature vectorization method commonly used in text retrieval and text classification. In this paper, BoEAG feature construction is similar to BoW. The collection of GDELT event element association graphs aggregated by day corresponds to the corpus in the BoW model. Each event element association graph corresponds to a document, and each frequent subgraph corresponds to a word in the BoW model.
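A small Python sketch may help to picture the event element association graph. It stores the graph as a node set plus a set of labeled edges; the record layout (`id`, `event_type`, `location`, `participants`) and the function name `build_event_graph` are assumptions for illustration and do not reproduce the exact graph encoding of the paper.

```python
from itertools import combinations


def build_event_graph(events):
    """Return (nodes, edges); edges are (source, label, target) triples."""
    nodes, edges = set(), set()
    for ev in events:
        eid = ev["id"]
        nodes.add(eid)
        # Attribute values become nodes; attribute names become edge labels.
        for attr in ("event_type", "location"):
            nodes.add(ev[attr])
            edges.add((eid, attr, ev[attr]))
        for person in ev["participants"]:
            nodes.add(person)
            edges.add((eid, "participant", person))
    # Two events sharing at least one participant get a "relate_to" edge.
    for a, b in combinations(events, 2):
        if set(a["participants"]) & set(b["participants"]):
            edges.add((a["id"], "relate_to", b["id"]))
    return nodes, edges


# Hypothetical events, for illustration only.
events = [
    {"id": "id1", "event_type": "14", "location": "Bangkok", "participants": ["GOV", "OPP"]},
    {"id": "id2", "event_type": "11", "location": "Bangkok", "participants": ["GOV"]},
    {"id": "id3", "event_type": "14", "location": "Manila", "participants": ["LAB"]},
]
nodes, edges = build_event_graph(events)
```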
The tf-idf weight $w_{s,i}$ of the frequent subgraph $s$ in the $i$-th event element association graph can be calculated by the following formula: $$w_{s,i} = f_{s,i} \times \log\frac{N}{n_s}, \quad (4)$$ where $f_{s,i}$ denotes the frequency of subgraph $s$ in the event association graph $i$. This value can be directly obtained through the single-graph frequent subgraph mining algorithm SSIGRAM proposed in our previous work [55]. $N$ denotes the number of event association graphs, that is, the time span of the dataset in days; $n_s$ is the number of event association graphs that contain subgraph $s$.
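As a hedged illustration, the snippet below computes tf-idf vectors from per-graph subgraph frequencies using the common form f × log(N / n_s), which is assumed here to match the weighting described above. The function name `boeag_tfidf` and the toy frequencies are hypothetical; in the paper, the frequencies come from SSIGRAM.

```python
import math


def boeag_tfidf(freqs, vocabulary):
    """freqs: one dict per daily graph, mapping subgraph code -> frequency.

    Returns one tf-idf vector per graph, ordered like `vocabulary`.
    """
    n_graphs = len(freqs)
    # n_s: number of graphs that contain subgraph s at least once.
    doc_freq = {s: sum(1 for f in freqs if f.get(s, 0) > 0) for s in vocabulary}
    vectors = []
    for f in freqs:
        row = []
        for s in vocabulary:
            if doc_freq[s] == 0:
                row.append(0.0)  # subgraph never observed
            else:
                row.append(f.get(s, 0) * math.log(n_graphs / doc_freq[s]))
        vectors.append(row)
    return vectors


# Toy frequencies for three daily graphs and two subgraph "words".
freqs = [{"A": 2, "B": 1}, {"A": 1}, {"B": 3}]
X = boeag_tfidf(freqs, ["A", "B"])
```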
Algorithm 1 gives the process of BoEAG feature construction described above. The input of the algorithm includes three parameters: the original GDELT event records (for example, the set of event records within a certain period of time in a certain country), the support threshold, and the maximum number of subgraphs. The output is the BoEAG feature vector set. The frequent subgraph mining step returns at most the maximum number of subgraphs. That is, when the total number of frequent subgraphs found during the mining process reaches $N_{max}$, the mining stops iterating and arranges all subgraphs in descending order of frequency. Line 24 obtains the standard adjacency matrix coding sequence of each subgraph and uses it as the "word." Line 25 calculates the tf-idf feature vector corresponding to each event association graph according to formula (4).

Structure of HSMM.
A social unrest event usually passes through a series of evolutionary stages over a longer or shorter life cycle, meaning that it is usually not a sudden outbreak. Typical stages in the life cycle include appeal, accusation, refuse, escalation, and eruption. In this paper, we design a hidden semi-Markov model with five states and a left-to-right structure, shown in Figure 6. The five states correspond, from left to right, to the typical stages of the evolutionary process of social unrest events, namely appeal, accusation, refuse, escalation, and eruption. The state sequence starts from S1 (appeal) and ends at S5 (eruption). During a state transition, the number of the next state cannot be lower than the number of the current state. Correspondingly, the state transition probability matrix $A = [a_{ij}]$ is upper triangular, that is, $a_{ij} = 0$ for $j < i$. In the traditional HMM, the state residence time probability $P_i(d)$ decreases exponentially with the number of residence time units $d$ [56], which is not consistent with the state residence time in many real-world application scenarios, especially social unrest events. To overcome this shortcoming, the state residence time probability distribution can be explicitly introduced into the HMM so that the model can learn the probability distribution of the state residence time from historical data.
This is the original intention of the hidden semi-Markov model.
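The exponential decay follows directly from the self-transition probability: in an HMM, the probability of staying in state i for exactly d time units is a_ii^(d-1) (1 - a_ii), a geometric distribution. The short Python sketch below, with a hypothetical function name, makes this visible.

```python
def hmm_residence_pmf(a_ii, max_d):
    """Implicit residence time distribution of an HMM state.

    Staying exactly d steps means d - 1 self-transitions followed by one exit.
    """
    return [a_ii ** (d - 1) * (1.0 - a_ii) for d in range(1, max_d + 1)]


pmf = hmm_residence_pmf(0.5, 4)
# The probabilities shrink by the factor a_ii at every step, so the most
# likely residence time is always a single time unit.
```

An HSMM removes this restriction by giving each state its own explicit residence time distribution.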
Let $S = \{s_i\}$ denote the set of latent states, $1 \le i \le N$. Let $\pi = [\pi_i]$ denote the vector of initial state probabilities. Given a sequence of the above BoEAG feature observations $O$, a standard continuous HSMM can be defined as $\lambda = (\pi, A, B, P)$, where the initial state probability $\pi$ and the output matrix $B$ have the same meaning as in the HMM, $A = [a_{ij}]$ is the state transition matrix, and $P$ describes the state residence time. This paper considers discrete residence times, that is, the state residence time can only be an integer multiple of the residence time unit, e.g., a day. Let $D$ represent the maximum possible residence time; then, $P = [p_{id}]$ can be denoted as a residence time probability matrix of size $N \times D$, whose element $p_{id}$ represents the probability of the state $s_i$ lasting $d$ time units.
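To make the left-to-right structure concrete, the following Python sketch builds an initial transition matrix for the five stages. It assumes, as is common for explicit-duration HSMMs, that self-transitions are excluded because residence time is handled by P, and it spreads probability uniformly over later states as a starting guess. Both choices and the function name are assumptions, not the paper's stated initialization.

```python
def left_to_right_transitions(n_states):
    """Initial transition matrix: state i may only move to a later state j > i.

    The last state has no outgoing transitions (all-zero row).
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        n_targets = n_states - 1 - i
        for j in range(i + 1, n_states):
            A[i][j] = 1.0 / n_targets
    return A


stages = ["appeal", "accusation", "refuse", "escalation", "eruption"]
A = left_to_right_transitions(len(stages))
```

Allowing jumps to any later state reflects the observation that not every unrest event passes through all stages before eruption.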

Sequence Likelihood. Consider an observation sequence $O = [o_1, o_2, \ldots, o_L]$ consisting of the BoEAG feature vectors of $L$ days.
The goal of hidden semi-Markov model training is to optimize the model parameters $\pi$, $A$, $B$, and $P$ so that the likelihood of the model generating the sequence $O$ is maximized. Given the HSMM $\lambda = (\pi, A, B, P)$, the sequence likelihood $P(O \mid \lambda)$ of the observation sequence $O$ is defined as the sum, over all hidden state sequences $s$, of the joint probability of $O$ and $s$ under $\lambda$, where $s = [s_t]$ represents the hidden state sequence of length $L$. Similar to the traditional HMM, the sum over $s$ can be calculated by the forward-backward algorithm proposed in [57]. The difference is that the state residence time needs to be explicitly added during the derivation.

ALGORITHM 1: The algorithm of BoEAG feature construction.
Require: original event records ER, support threshold τ, and maximum number of subgraphs returned N_max
Ensure: BoEAG feature set X_set
(1) EAG_set ⟵ ∅ /* The set of event association graphs */
(2) SubG_set ⟵ ∅ /* The set of subgraphs */
(3) X_set ⟵ ∅
(4) DAER_list ⟵ ER: event records aggregated by day
(5) for DAER_t in DAER_list do /* All the event records at date t */
(6)   EAG_t ⟵ ∅ /* All the event association graphs at date t */
(7)   for e_i in DAER_t do
(8)     if e_i is not traversed then
(9)       EAG_t ⟵ constructing the graph unit of event e_i
(10)      for e_j in DAER_t do
(11)        if e_j is not traversed then
(12)          EAG_t ⟵ constructing the graph unit of event e_j
(13)          if e_i and e_j contain at least one identical participant then
(14)            EAG_t ⟵ generating "relate_to" edge between e_i and e_j
(15)-(19) ...
(20) for EAG_t in EAG_set do
(21)   SubG_t ⟵ SSIGRAM(EAG_t, τ, N_max) /* Mining frequent subgraphs using the SSIGRAM algorithm */
(22)   SubG_set.add(SubG_t)
(23) end for
(24) Representing each subgraph in SubG_set as its standard adjacency matrix (CAM) coding sequence (for details of the standard adjacency matrix, please refer to [55])
(25) X_set ⟵ TF-IDF(SubG_set) /* Calculating the feature set using formula (4) */
(26) return X_set
The forward variable $\alpha_t(j)$ can be calculated recursively from front to back, after which the sequence likelihood can be computed efficiently. The backward variable $\beta_t(i)$ is defined analogously for the hidden state $i$ starting at time $t$; it can be calculated recursively from back to front, and the sequence likelihood can likewise be computed efficiently from it.
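The forward recursion can be sketched as follows. This is the textbook explicit-duration formulation and not necessarily the exact recursion used in the paper. To keep the example short, it assumes discrete observation symbols instead of Gaussian mixtures, no self-transitions, a first segment that starts at the first observation, and a last segment that ends at the last one. The function name and argument layout are hypothetical.

```python
def hsmm_likelihood(obs, pi, A, B, P):
    """Forward pass of an explicit-duration HSMM with discrete observations.

    obs: list of symbol indices; pi[j]: initial probability of state j;
    A[i][j]: transition probability (A[i][i] is ignored);
    B[j][k]: probability that state j emits symbol k;
    P[j][d - 1]: probability that state j lasts d time units, d = 1..D.
    alpha[t][j] = P(o_1..o_t, a segment in state j ends exactly at t).
    """
    T, N, D = len(obs), len(pi), len(P[0])
    alpha = [[0.0] * N for _ in range(T + 1)]  # alpha[0] is unused padding
    for t in range(1, T + 1):
        for j in range(N):
            total, emit = 0.0, 1.0
            for d in range(1, min(D, t) + 1):
                # Extend the segment one step back: it now covers t-d+1..t.
                emit *= B[j][obs[t - d]]
                if t == d:
                    enter = pi[j]  # the segment is the first one
                else:
                    enter = sum(alpha[t - d][i] * A[i][j] for i in range(N) if i != j)
                total += enter * P[j][d - 1] * emit
            alpha[t][j] = total
    return sum(alpha[T])


# Two-state left-to-right toy model: state 0 emits symbol 0, state 1 emits symbol 1.
likelihood = hsmm_likelihood(
    obs=[0, 1, 1],
    pi=[1.0, 0.0],
    A=[[0.0, 1.0], [0.0, 0.0]],
    B=[[1.0, 0.0], [0.0, 1.0]],
    P=[[0.5, 0.5], [0.5, 0.5]],
)
```

In the toy model the only consistent explanation is one day in state 0 followed by two days in state 1, so the likelihood is the product of the two duration probabilities. For real sequences, the recursion would be carried out with scaling or in the log domain to avoid underflow.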

Parameter Estimation.
There are four parameters to be estimated in model training: the initial probability distribution $\pi$, the state transition probability $a_{ij}$, the observation probability density function $b_i(o_t)$, and the state residence time probability density function $p_j(d)$. $\pi$ and $a_{ij}$ can be calculated directly. For $b_i(o_t)$ and $p_j(d)$, the form of the probability density function needs to be specified in advance. We use a multivariate Gaussian mixture density to describe the observation probability density: $$b_i(o_t) = \sum_{m=1}^{M} c_{im}\, \mathcal{N}(o_t; \mu_{im}, U_{im}),$$ where $M$ represents the number of Gaussian mixture components; $c_{im}$ is the weight of the $m$-th Gaussian component in state $i$, with $\sum_{m=1}^{M} c_{im} = 1$; and $\mu_{im}$ and $U_{im}$ are the mean and covariance of the $m$-th Gaussian component in state $i$, respectively.
We use a single Gaussian distribution to describe the probability density of the state residence time $p_j(d)$: $$p_j(d) = \mathcal{N}(d; m_j, \sigma_j^2),$$ where $m_j$ and $\sigma_j^2$ are the mean and variance, respectively.
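Because residence times are restricted to integer multiples of the time unit, a Gaussian residence time density has to be evaluated on d = 1, ..., D in practice. The sketch below additionally renormalizes the values so that they sum to one; this renormalization, the function name, and the example parameters are assumptions made for illustration.

```python
import math


def gaussian_duration_pmf(mean, std, max_d):
    """Gaussian density evaluated at d = 1..max_d and renormalized to sum to 1."""
    weights = [math.exp(-((d - mean) ** 2) / (2.0 * std ** 2)) for d in range(1, max_d + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical state that typically lasts about four days (D = 10).
pmf_row = gaussian_duration_pmf(mean=4.0, std=1.5, max_d=10)
most_likely_duration = pmf_row.index(max(pmf_row)) + 1
```

Unlike the geometric residence time implied by an HMM, this distribution can peak at a duration longer than one time unit.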
Denote the variable $\xi_t(i, j)$ as the probability of transferring from state $i$ to state $j$ after residing for $d$ time units at time $t$, given the observation sequence $O$ and the model parameters $\lambda$. With the definitions of the forward variable and the backward variable, $\xi_t(i, j)$ can be calculated from them. The parameter estimation can then be achieved by the expectation maximization (EM) algorithm, also known as the Baum-Welch algorithm for HMMs [57]. The E step of the EM algorithm constructs a Q function, which is then maximized in the M step. Thus, we can obtain the re-estimated model parameters $\pi$, $a_{ij}$, $b_i(o_t)$, and $p_j(d)$.
Then, the process iterates until the parameters converge or the maximum number of iterations is reached. As the ground truth contains multiple positive samples and negative samples, we need to use multiple sets of observation data to train the model.
Denote $O = \{O^{(1)}, O^{(2)}, \ldots, O^{(K)}\}$ as the training data containing $K$ observation sequences. All observation sequences have the same length $L$. We assume that the observation sequences are independent of each other. $P(O \mid \lambda)$ represents the joint probability of the observation sequences under a given model; then, $$P(O \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda).$$ Finally, we trained two HSMMs on two corresponding sets of sequences: one set consists of the sequences that precede the positive 7-day stretches by the lead time, and the other consists of the negative sequences. Thus, one model characterizes the evolution process leading to a social unrest event, while the other characterizes the process that does not lead to a social unrest event.

Event Prediction.
After the training of model parameters, we formalize the social unrest event prediction as a sequence classification problem. For the prediction, an unknown sequence that precedes the target 7-day stretch by the lead time is aligned with the model of each class. The sequence is classified into the class with the higher alignment score, that is, the higher likelihood. However, the likelihood $P(O \mid \lambda)$ becomes small very quickly for long sequences, so that the limit of double-precision floating point operations may be reached. For this reason, the log-likelihood is used as a scaling technique. Besides, different costs should be associated with classification. For example, falsely classifying a SU-prone sequence as SU-free might be much worse than vice versa.
We use Bayes decision theory to specify the classification rule: the unknown observation sequence O is classified as SU-prone if

$$\log P(O \mid \lambda_{SU}) - \log P(O \mid \lambda_{\overline{SU}}) > \log\frac{c_{\overline{SU},SU} - c_{\overline{SU},\overline{SU}}}{c_{SU,\overline{SU}} - c_{SU,SU}} + \log\frac{P(\overline{SU})}{P(SU)},$$

where $\lambda_{SU}$ and $\lambda_{\overline{SU}}$ denote the models trained on SU-prone and SU-free sequences, respectively, and c_ta denotes the cost of assigning a sequence of type t to class a; e.g., $c_{SU,\overline{SU}}$ denotes the cost of falsely classifying a SU-prone sequence as SU-free. $P(SU)$ and $P(\overline{SU})$ are constants representing the prior probabilities of SU-prone and SU-free sequences, respectively (see, e.g., [58] for a derivation of the formula). Thus, given the costs of misclassification, the right-hand side of this inequality determines a constant threshold on the difference of the sequence log-likelihoods, denoted as ε. If the threshold is small, more sequences will be classified as SU-prone, increasing the chance of detecting SU-prone sequences. On the other hand, the risk of falsely classifying a SU-free sequence as SU-prone is then also high. If the threshold increases, the behavior is reversed: more and more SU-prone sequences go undetected, at a lower risk of misclassifying SU-free sequences.
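The decision rule can be sketched in a few lines, assuming the standard minimum-expected-cost form of the Bayes rule, in which the threshold is the log cost ratio plus the log prior ratio. The function names, argument names, and all cost and prior values below are hypothetical and chosen only for illustration.

```python
import math


def decision_threshold(
    c_su_su: float,      # cost of assigning a SU-prone sequence to SU-prone
    c_su_free: float,    # cost of assigning a SU-prone sequence to SU-free (miss)
    c_free_su: float,    # cost of assigning a SU-free sequence to SU-prone (false alarm)
    c_free_free: float,  # cost of assigning a SU-free sequence to SU-free
    prior_su: float,     # prior probability of a SU-prone sequence
) -> float:
    """Threshold on log P(O | SU model) - log P(O | SU-free model)."""
    prior_free = 1.0 - prior_su
    cost_ratio = (c_free_su - c_free_free) / (c_su_free - c_su_su)
    return math.log(cost_ratio) + math.log(prior_free / prior_su)


def classify(loglik_su: float, loglik_free: float, epsilon: float) -> str:
    """Label a sequence from its two log-likelihoods and the threshold epsilon."""
    return "SU-prone" if loglik_su - loglik_free > epsilon else "SU-free"


# Illustrative settings: equal costs versus a miss that is nine times as costly.
eps_equal_costs = decision_threshold(0.0, 1.0, 1.0, 0.0, prior_su=0.1)
eps_costly_miss = decision_threshold(0.0, 9.0, 1.0, 0.0, prior_su=0.1)
label = classify(loglik_su=-120.0, loglik_free=-121.0, epsilon=eps_costly_miss)
```

With equal costs, the rare SU-prone class needs a log-likelihood advantage of about log 9 to be chosen. Raising the cost of a miss lowers the threshold, so more sequences are flagged as SU-prone.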

Experimental Evaluation
This section presents an experimental evaluation of the performance of the proposed HSMM-based prediction framework on data from five countries in Southeast Asia.

Experiment Design
Our focus area covers five major nations in Southeast Asia: Thailand, Malaysia, the Philippines, Indonesia, and Cambodia. These countries have experienced mass protests of varying degrees over the past decade, so they are ideal sources of research data. As mentioned above, GDELT uses the CAMEO coding system [53], where root event code 14 represents social unrest. Figure 7 illustrates the mention counts of protest events occurring in these countries, retrieved from GDELT between January 1, 2001, and February 29, 2016. Among them, Thailand was mentioned the most in protest reports (25877 times), followed by the Philippines (23381 times), while Cambodia was mentioned the least (7322 times). In consideration of the quarterly normalization in Section 3.2, the actual training data were from April 1, 2001, to December 31, 2013, and the test data were from January 1, 2014, to February 29, 2016.

Comparison Methods.
For comparison, three methods are selected in this paper. The first is the traditional hidden Markov model (HMM). Its structure is also left-to-right, as in Figure 6, except that no explicit state residence time probability distribution is estimated during model training; the remaining steps are the same as in the HSMM method. The second is the logistic regression method: two logistic regression models are trained, and sequence classification is conducted based on them. The third is a baseline that does not train any model. It directly uses the historical probability of protest event records in a country as the probability of future social unrest events.

Performance Metrics.
We evaluate our social unrest event prediction framework using metrics similar to those described in Kallus et al. [13]. We quantify the success of the proposed predictive mechanism and of the comparison methods by their balanced accuracy. Let T_ct ∈ {0, 1} and P_ct ∈ {0, 1}, respectively, denote whether a significant social unrest event occurs in country c during the days t − 3, t − 2, t − 1, t, t + 1, t + 2, and t + 3 and whether we predict there to be one. The true positive rate (TPR) is the fraction of positive instances (T_ct = 1) correctly predicted to be positive (P_ct = 1), and the true negative rate (TNR) is the fraction of negative instances predicted negative. The balanced accuracy (BACC) is the unweighted average of these:

$$\mathrm{BACC} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}.$$

BACC, unlike the marginal accuracy, cannot be artificially inflated. Due to the unbalanced distribution of positive and negative examples in our dataset, always predicting "no social unrest event" without using any data yields a marginal accuracy of nearly 90% but a balanced accuracy of only 50%, since TPR = 0 and TNR = 1. In fact, a prediction made without any relevant data always yields a BACC of 50% on average, by statistical independence.
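As a small illustration of why the balanced accuracy resists inflation, the sketch below computes both metrics for a constant "no event" predictor. It uses made-up labels with 10% positives, not the actual dataset, and the function name is hypothetical.

```python
from typing import Sequence


def balanced_accuracy(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Unweighted mean of the true positive rate and the true negative rate."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    pos = sum(1 for t in truth if t == 1)
    neg = len(truth) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in truth")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Made-up labels: 10 positive and 90 negative 7-day windows.
truth = [1] * 10 + [0] * 90
always_negative = [0] * 100

marginal = sum(1 for t, p in zip(truth, always_negative) if t == p) / len(truth)
bacc = balanced_accuracy(truth, always_negative)
```

Here the marginal accuracy equals the share of negative windows, while the balanced accuracy is that of an uninformed predictor.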

Parameter Settings.
In the ground truth extraction stage, the threshold θ is set to 2.3. This value is approximately equal to the 90% quantile of the standard exponential distribution, so approximately 10% of the 7-day time windows in the ground truth will be marked positive.
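The relation between the threshold and the quantile can be checked directly from the inverse distribution function of the exponential distribution. This is a short sketch, and the function name is hypothetical.

```python
import math


def exponential_quantile(q: float, rate: float = 1.0) -> float:
    """Inverse CDF of the exponential distribution: -ln(1 - q) / rate."""
    return -math.log(1.0 - q) / rate


theta_exact = exponential_quantile(0.90)  # equals ln(10)
tail_at_2_3 = math.exp(-2.3)              # fraction of mass above theta = 2.3
```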
In the BoEAG feature extraction stage, the maximum number of returned frequent subgraphs N_max is set to 10000.
The logistic regression has one parameter, the iteration convergence threshold, which is set to 10^−6 in the experiment. The baseline method does not require any parameter values to be set in advance. The HMM and the HSMM both have six parameters that need to be set: the number of hidden states N, the number of Gaussian mixture components M used to estimate the probability density of the observations, the prediction interval Δt_p, the lead time Δt_l, the prediction data time window Δt_d, and the likelihood threshold ε. In the experiments, N, M, and Δt_p are fixed, that is, they take the same values on the datasets of all five countries. We set N = 5, M = 3, and Δt_p = 7. Setting Δt_p = 7 means that we determine whether a social unrest event will occur during a 7-day (one-week) time window. In contrast, Δt_d, Δt_l, and ε are adjustable, and their optimal values are obtained by performing 10-fold cross-validation on the training set of each country. Δt_l ranges from one day to seven days, Δt_d takes the values 10, 20, 30, and 40 days, and ε ranges over [−2, 2] with a step of 0.1. The final values are shown in Table 1.

Figure 4 takes Thailand as an example, giving its normalized number of protest reports (the red line represents the threshold θ). We mark the ten 7-day time windows with the most reports and give a brief description of each. These are social unrest events that actually happened in Thailand, such as the "Tak Bai incident" of October 28, 2004, with about 1500 protesters, which occurred in the Tak Bai district in Southern Thailand and was caused by the detention of 6 Muslim believers. The protest conflict against the Abhisit government, which broke out in Bangkok on April 7, 2009, is also included. This also shows the effectiveness of the proposed method of extracting ground truth from GDELT data.
Table 3 gives the balanced accuracy (BACC) values of the hidden semi-Markov model (HSMM), the traditional hidden Markov model (HMM), the logistic regression, and the baseline method on the test set. With the BoEAG feature pattern, the HSMM-based prediction method proposed in this paper performs best on the test datasets of all countries, which indicates that the HSMM can indeed better model the characteristics of mass protest events because it explicitly considers the residence time of each evolutionary stage of event development. The HMM performs second best, followed by the logistic regression, and the baseline performs worst, being essentially random guessing. A comparison across the five countries shows that each method performs best on the Thailand test set, especially the HSMM method, which achieves a BACC value of 95.9%. For all five countries, our proposed HSMM-based approach achieved the best overall performance in balanced accuracy, outperforming the HMM model by 12 [...] used in our previous work [8], we can see that the BoEAG pattern constructed from frequent subgraphs can better model the stages of social unrest events, as the BACC values of the HSMM, the HMM, and the logistic regression all improve when the BoEAG patterns are used. This is because the BoEAG pattern considers both temporal bursts and the interaction between event participants.

Event Prediction Results.
By adjusting the likelihood ratio threshold ε, a series of pairs of true positive rate (TPR) and false positive rate (FPR) can be obtained, and ROC analysis can then be performed for each method. Figure 8 shows the ROC curves of three methods: the HSMM, the HMM, and the logistic regression. The larger the area under the ROC curve (AUC), the better the prediction performance of the model. Among the three methods shown, the AUC of the hidden semi-Markov model (HSMM) is the largest on each test set, so its performance is the best among the methods.
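The threshold sweep behind an ROC curve can be sketched as follows. The scores stand for log-likelihood differences between the two models, and all values are invented for illustration. The trapezoidal AUC is one common choice and is an assumption here, not necessarily the procedure used for Figure 8.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return one (FPR, TPR) pair per threshold; a sequence is flagged if score > eps."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = []
    for eps in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > eps)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > eps)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule, with both end points added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Invented log-likelihood differences and true labels (1 = SU-prone).
scores = [1.5, 0.8, 0.3, -0.2, -0.9, -1.4]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [round(-2.0 + 0.1 * i, 1) for i in range(41)]  # -2.0, -1.9, ..., 2.0

curve = roc_points(scores, labels, thresholds)
area = auc_trapezoid(curve)
```

Lowering ε moves the operating point toward the upper right of the curve (more detections, more false alarms), and raising it moves the point toward the origin.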

Sensitivity Analysis on Δt_l and Δt_d.
Although the model parameters are fixed on the training set by 10-fold cross-validation, it is still necessary to investigate the performance of the prediction model under different lead times Δt_l and prediction data time windows Δt_d, which also offers guidance for applying the model in practice. Figure 9 shows how the prediction performance of the HSMM on each test set varies with Δt_l and Δt_d. The lead time Δt_l ranges from 1 day to 10 days, and Δt_d takes the values 10 days, 20 days, and 30 days. Two phenomena can be observed. First, as the lead time Δt_l increases, the overall prediction accuracy of the model decreases. In most cases, the BACC value is highest when Δt_l = 1. This is consistent with intuition: the closer the observation data are to the time of the event, the more accurately the event can be predicted. Second, the performance of the model does not depend monotonically on the length of the time window of the observation sequence data. A longer observation sequence does not necessarily yield a higher prediction accuracy, because more data also bring more interference. Given the trained prediction model and the lead time parameter, different test sets require different prediction data time windows to achieve optimal prediction accuracy.

Discussion
This paper presents a hidden semi-Markov model-based framework that leverages large-scale coded digital history events captured from GDELT. The framework uses the frequent subgraph patterns mined from the GDELT event streams to uncover the underlying event evolution mechanics and formulates social unrest event prediction as a sequence classification problem. Extensive empirical testing with data from five countries in Southeast Asia demonstrated the effectiveness of this framework in comparison with the traditional HMM, the logistic regression model, and the baseline model. The results indicate that the GDELT dataset does reflect some useful precursor indicators that reveal the causes or evolution of future events.
We plan to conduct our future work in the following three directions. First, we plan to introduce a multilevel prediction mechanism into our framework, for example at the city level or the province level. Second, GDELT 2.0 also provides event mention details and global knowledge graphs [59] in real time, which can give us detailed insights into the events. More machine learning and deep learning methods, such as graph neural networks [60], can be developed with more event elements. Third, the prediction framework may be improved by distinguishing widespread news coverage from localized coverage.

Data Availability
The GDELT data used to support the findings of this study are included within the article in Section 2.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.