Predicting Social Unrest Events with Hidden Markov Models Using GDELT

Proactive handling of social unrest events, which are common in both democracies and authoritarian regimes, requires that the risk of an upcoming social unrest event be continuously assessed. Most existing approaches pay comparatively little attention to event development stages. In this paper, we use the autocoded event dataset GDELT (Global Data on Events, Location, and Tone) to build a Hidden Markov Model (HMM) based framework to predict indicators associated with country instability. The framework utilizes the temporal burst patterns in GDELT event streams to uncover the underlying event development mechanics and formulates social unrest event prediction as a sequence classification problem based on Bayes decision theory. Extensive experiments with data from five countries in Southeast Asia demonstrate the effectiveness of this framework, which outperforms the logistic regression method by 7% to 27% and the baseline method by 34% to 62%, depending on the country.


Introduction
Social unrest events (protests, strikes, demonstrations, and occupations) are common happenings in both democracies and authoritarian regimes [1]. Most social unrest events are initially intended as a demonstration to the public or the government. However, on many occasions they escalate into general chaos, resulting in violence, riots, sabotage, and other forms of crime and social disorder. Take Thailand as an example: a series of political protests and three military coups happened between 1990 and 2015, resulting in the government being deposed, which illustrates the power of social unrest. Figure 1 depicts the activities that causally preceded the protest against the amnesty bill in Bangkok on August 7, 2013. Anticipating these latent instabilities before they occur and applying preventive strategies to avoid them have important ramifications, such as prioritizing citizen grievances for decision makers, issuing travel warnings for the tourism industry, and giving social scientists insight into how citizens express themselves. This has motivated many social and data science researchers to focus on revealing the patterns contained in these events and on predicting future latent social unrest.
In the last century, most researchers conducted prediction work using human-coded data, including WEIS [2] and COPDAB [3]. In the last two decades, several small-scale vertical machine-readable datasets [4,5] and large-scale coded event datasets like ICEWS [6] and GDELT (Global Data on Events, Location, and Tone) [7] appeared, fueling the development of computational methods for the analysis and prediction of social unrest. It is worth mentioning that the GDELT dataset, which contains more event records than any other event dataset, opens up a new perspective on this research area. So far, few works have used GDELT to make predictions about social unrest. Existing works attempted to use linear regression [8], time series forecasting [9], and frequent subgraphs [10,11] to conduct prediction with GDELT. In [12], GDELT and ICEWS are used as data sources to predict unrest in Latin America. Nevertheless, these works pay comparatively little attention to event development stages in forecasting models built on GDELT.
This paper develops a hidden Markov model based framework that leverages large-scale digital history events captured from GDELT to characterize the transitional process of social unrest event development. In the HMM approach discussed by Rabiner [13], the sequence of observed events can be considered to yield a likely path of hidden states or phases in which the events occur, which is consistent with the concept of event development stage.
Our proposed framework utilizes the temporal burst patterns in GDELT event streams to uncover the underlying event development mechanics, starting from the prior probability of each stage. Eventually, social unrest event prediction is formulated as a sequence classification problem. More concretely, our main contributions to social unrest event prediction with the GDELT dataset are four-pronged: (i) First, we identify a sequence or stages of events that potentially lead to social unrest (as in Figure 1).
The paper is organized as follows: a brief introduction to related work is provided in Section 2. Our HMM-based social unrest event prediction framework is presented in Section 3. In Section 4, extensive experiments to evaluate the performance of the new model are conducted and analyzed. The work is summarized and conclusions are drawn in Section 5. The Appendix gives a technical discussion of Section 3.4.

Related Work
In this section, we give a brief introduction to existing work related to this paper, including research on the analysis of social unrest events and a guide to the GDELT dataset.

Research on Social Unrest Events.
Current research on the analysis of social unrest events can be categorized into two main types: event detection and event prediction.
Event detection tells users what is going on. It has long been addressed and is an extensively studied topic in the literature. Researchers utilize news or social networks, for example, Twitter, as real-time and ubiquitous social sensors to promptly discover new events. Document clustering techniques are used to identify events retrospectively or as the stories arrive [14]. Works like [15-17] focus on extraction patterns (templates) to extract information from text. For a survey of these detection techniques on Twitter, we point the reader to [18]. However, these event detection approaches can only uncover events after they have occurred and are unable to predict future events, because they all focus on observations that directly reflect currently occurring events rather than on precursor indicators that reveal the causes or development of future events [19].
Event prediction has been explored in a variety of applications, including elections [20,21], disease outbreaks [22], stock market movements [23,24], social unrest event prediction [11,12,25-31], movie earnings [23], crime [32], and failure prediction [33]. Most recent social unrest event prediction techniques can be categorized into three types: planned event forecasting, classification based prediction, and time series mining. Planned event prediction methods do not need to mine patterns from previous data. They are based on the hypothesis that larger protests are more disruptive and communicate support for their cause better than smaller protests, and that mobilizing large numbers of people is more likely if a protest is organized and its time and place are announced in advance [1,26,29]. Classification based prediction incorporates volume features and informative features such as semantic topics to train a classification model and then predicts the occurrence of future events. Several classification methods have been used, such as random forest [27], support vector machines [22], logistic regression [10,11,23,25], and LASSO based logistic regression [12,28]. Time series based mining uses the temporal correlation of relevant features such as tweet volume. For example, Achrekar et al. [34] used autoregressive modeling to predict flu trends using Twitter data. Radinsky and Horvitz [30] utilized NYT news articles from 1986 to 2007 to build event chains and identify significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world.

The GDELT Dataset.
The GDELT Project [7] is a real-time network diagram and database of global human society for open research. It monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, and events driving our global society every second of every day, creating a free open platform for computing on the entire world. Each day the GDELT Project monitors the news media across nearly every corner of the world and compiles a list of over 300 categories of "events", from riots and protests to peace appeals and diplomatic exchanges, recording the details of each event, including its georeferenced location, into a master "event database" of more than a quarter-billion events, dating back to 1979 and updated each morning around 4AM EST. Since 19 February 2015, GDELT 2.0 has been online; it updates every 15 minutes, capturing the world's breaking events and reactions in near real time.
In the GDELT event data table, each record has 58 fields (61 fields in GDELT 2.0), capturing information pertaining to a specific event in CAMEO format [35]. In this paper, we use the following nine fields from a record: SQLDATE, MonthYear, EventRootCode, GoldsteinScale, NumMentions, AvgTone, ActionGeo_CountryCode, ActionGeo_Lat, and ActionGeo_Long. SQLDATE and MonthYear are the date the event took place in YYYYMMDD format and YYYYMM format, respectively. EventRootCode defines the root-level category the event code falls under. For example, code 1452 (engaging in violent protest for policy change) has a root code of 14 (PROTEST). This makes it possible to aggregate events at various resolutions of specificity. GoldsteinScale is a numeric score from −10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. NumMentions is the total number of mentions of this event across all source documents, which can be used to assess the importance of an event: the more an event is discussed, the more likely it is to be significant. AvgTone is the average tone of all documents containing one or more mentions of this event. The score ranges from −100 (extremely negative) to +100 (extremely positive). ActionGeo_CountryCode is the 2-character FIPS10-4 country code of the location of the event. ActionGeo_Lat and ActionGeo_Long are the centroid latitude and centroid longitude of the landmark for mapping.
The dataset is also available on Google Cloud Platform (https://cloud.google.com/) and can be accessed using Google BigQuery. In this paper, we export the GDELT event data for the experiments from the Google BigQuery web service (https://bigquery.cloud.google.com/table/gdelt-bq:full.events?pli=1).

HMMs-Based Social Unrest Events Prediction
3.1. Framework. Proactive reaction to social unrest events is at first glance closely coupled with social unrest event detection: an unrest event needs to be detected before the government can react to it. To be precise, however, what should primarily be avoided is not the detection result but the eruption of the social unrest event itself, which makes a big difference. Hence, efficient proactive handling of social unrest events requires predicting the future level of social unrest, to judge whether the current situation bears the risk of an unrest event or not. The basic assumption of our approach is that the eruption of social unrest events can be identified, using HMMs, by characteristic patterns in the event sequence prior to the time of occurrence. The prediction mechanism for upcoming social unrest events is illustrated in Figure 2. If a prediction is performed at time $t$, we would like to know whether a social unrest event will occur between time $t + \Delta t_l$ and $t + \Delta t_l + \Delta t_p$.
$\Delta t_l$ is usually called the lead time. $\Delta t_l$ has a lower bound called the warning time $\Delta t_w$, which is determined by the time needed for the specified organization, such as the government, to perform some proactive action, for example, the time needed to make a public statement. $\Delta t_d$ stands for the length of the data window, called the data window size, which contains the predictive sequence of data. The sequence describes the current state of the country or district. The prediction period $\Delta t_p$ is the length of the time interval for which the prediction holds.
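To make the timing quantities concrete, the following minimal sketch computes the data window and the predicted interval for a prediction made on a given day. The function name `prediction_windows` and the choice to treat both windows as inclusive ranges of whole days are assumptions made for illustration; the text above only fixes the roles of the lead time, data window size, and prediction period.

```python
from datetime import date, timedelta


def prediction_windows(t, lead_days, window_days, period_days):
    """Return (data_window, predicted_interval) as pairs of inclusive dates.

    t            -- day on which the prediction is performed
    lead_days    -- lead time (Delta t_l)
    window_days  -- data window size (Delta t_d)
    period_days  -- prediction period (Delta t_p)
    """
    # Assumption: the data window is the window_days days ending on t.
    data_window = (t - timedelta(days=window_days - 1), t)
    # Assumption: the predicted interval starts lead_days after t and
    # spans period_days whole days.
    first_predicted = t + timedelta(days=lead_days)
    last_predicted = first_predicted + timedelta(days=period_days - 1)
    return data_window, (first_predicted, last_predicted)


data_window, predicted = prediction_windows(date(2013, 7, 30), 1, 10, 7)
```

With a lead time of one day, a ten-day data window, and a seven-day prediction period, a prediction made on July 30, 2013, uses data from July 21 to July 30 and covers July 31 to August 6.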
Based on the above prediction mechanism, our prediction task revolves around predicting significant social unrest events at the country level, considering each country alone. To accurately predict social unrest events, it is crucial to characterize the underlying development of these events before they occur, using observations from the relevant GDELT event records. We propose a Hidden Markov Model based framework to characterize this underlying development. Figure 3 illustrates the proposed HMMs-based social unrest event prediction framework, which contains four major components: ground set extraction, burstiness modeling, HMM training, and, last, event prediction.
Formally, denote by ER a basic GDELT event record; ER("column name") is the value of the specified column in a record. Denote by $D = \{ER_{c,t}\}_{c \in \Omega, t \in \Gamma}$ a collection of GDELT event records split by country $c \in \Omega$ over the time period $\Gamma$. The country $c$ and the day $t$ are selected through ER(ActionGeo_CountryCode) and ER(SQLDATE), respectively. Since event records are added to the GDELT event table by the hundreds or thousands every day, we aggregate them by day: $DAER_{c,t}$ denotes the daily aggregated event record on day $t$ in country $c$. A sequence of DAERs is then defined as $S_c = \{DAER_{c,t}\}_{t \in T \subseteq \Gamma}$, which contains all the daily aggregated event records in country $c$ in the time period $T \subseteq \Gamma$.
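The daily aggregation described above can be sketched as follows. This is an illustrative sketch only: records are assumed to be plain dictionaries keyed by GDELT column names, and the helper names `aggregate_daily` and `daer_sequence` are hypothetical.

```python
from collections import defaultdict


def aggregate_daily(records):
    """Group raw event records by (country code, SQLDATE).

    Each value is the list of records for one country on one day,
    which plays the role of a daily aggregated event record (DAER).
    """
    daily = defaultdict(list)
    for rec in records:
        key = (rec["ActionGeo_CountryCode"], rec["SQLDATE"])
        daily[key].append(rec)
    return dict(daily)


def daer_sequence(daily, country, days):
    """Return the DAER sequence of one country over an ordered list of days.

    Days without any record yield an empty list.
    """
    return [daily.get((country, day), []) for day in days]


records = [
    {"ActionGeo_CountryCode": "TH", "SQLDATE": 20130807, "NumMentions": 5},
    {"ActionGeo_CountryCode": "TH", "SQLDATE": 20130807, "NumMentions": 2},
    {"ActionGeo_CountryCode": "ID", "SQLDATE": 20130807, "NumMentions": 1},
]
daily = aggregate_daily(records)
sequence = daer_sequence(daily, "TH", [20130806, 20130807])
```

Keeping the raw records inside each DAER, rather than reducing them immediately, leaves later steps free to compute counts, ratios, or means from the same grouping.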

Ground Set Extraction.
Ground truth is vital for the prediction problem. Unfortunately, there is so far no public ground set in the social unrest prediction area. As a result, in this paper we treat GDELT as the ground truth for social unrest events. According to our manual inspection, the generated ground set reflects real-world happenings well (see Figure 5).
For each country, the social unrest events we are interested in predicting are those significant enough to garner more-than-usual real-time coverage in mainstream news reporting for the country; that is, the target is a significant social unrest event in country $c$ on day $t$. In GDELT, root event code 14 can be taken to mean social unrest, and more records with root event code 14 mean more coverage of social unrest events. For each country $c$ of interest, we first aggregate the count of event mentions with root event code 14 on each day $t$. Since new events are added to GDELT by the hundreds or thousands every day, the event mentions show a heterogeneous upward trend, and what counts as more than usual changes over time. To remove this upward trend, we normalize the mention counts with root code 14 by the average volume of the trailing quarter (90 days). That is, we let

$$\hat{m}_{c,t} = \frac{m_{c,t}}{\frac{1}{90}\sum_{t'=t-90}^{t-1} m_{c,t'}}, \qquad m_{c,t} = \sum_{\substack{ER \in ER_{c,t} \\ ER(\mathrm{EventRootCode}) = 14}} ER(\mathrm{NumMentions}),$$

where $\hat{m}_{c,t}$ is the normalized total count of social unrest event mentions on day $t$ in country $c$ and ER(NumMentions) is the value of NumMentions of each record. Next we define the average event mention count on each day in country $c$ as

$$\bar{m}_{c} = \frac{1}{|\Gamma|}\sum_{t \in \Gamma} \hat{m}_{c,t},$$

where $\Gamma$ denotes the set of days in the training set.
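A minimal sketch of the trailing-quarter normalization follows. It assumes that "average volume of the trailing quarter" means the mean daily count of code-14 mentions over the preceding 90 days, and that days without a full 90-day history are left unnormalized; both are assumptions, and the function names are hypothetical.

```python
def raw_unrest_mentions(day_records):
    """Sum NumMentions over one day's records whose root event code is 14."""
    return sum(
        rec["NumMentions"]
        for rec in day_records
        if str(rec["EventRootCode"]) == "14"
    )


def normalize_by_trailing_mean(counts, window=90):
    """Divide each daily count by the mean of the preceding `window` days.

    counts -- daily raw counts in chronological order
    Returns a list of the same length; entries are None where fewer than
    `window` previous days exist or where the trailing mean is zero.
    """
    normalized = []
    for i, count in enumerate(counts):
        if i < window:
            normalized.append(None)
            continue
        trailing_mean = sum(counts[i - window:i]) / window
        normalized.append(count / trailing_mean if trailing_mean > 0 else None)
    return normalized


# 90 quiet days followed by one day with three times the usual volume.
normalized = normalize_by_trailing_mean([2] * 90 + [6])
```

Under these assumptions the first 90 days of data cannot be normalized, which is consistent with raw data beginning on January 1, 2001, and a usable period beginning on April 1, 2001.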

Burstiness Modeling.
The states of the social unrest event are unobserved but have a close theoretical analog in the concept of development stage that has been explicitly coded in the dataset. Usually, a social unrest event breeds, develops, and evolves until its eventual occurrence, through a longer or shorter life cycle, meaning that it is usually not a sudden outbreak. Typical stages in the events' life cycle include appeal, accusation, refuse, escalation, and protest. Of course, not every social unrest event goes through all of these stages. Our HMM characterizes the development of each significant social unrest event as a sequence of latent states, with a sequence of DAERs being the observations generated by the latent stages.
The GDELT event data captures various types of event owing to the CAMEO event code scheme, through the EventRootCode field in the data table. In consideration of the stages of social unrest events, the event types in Table 1 are added to our observations. The count of each of those event types can reflect signals of social unrest event development. Given a sequence of daily aggregated event records $S_c$, denote by $r^{k}_{c,t}$ the ratio of events with event root code $k$ on day $t$ in country $c$:

$$r^{k}_{c,t} = \frac{n^{k}_{c,t}}{\sum_{j=1}^{20} n^{j}_{c,t}},$$

where $n^{k}_{c,t}$ is the number of records in $DAER_{c,t}$ with root code $k$, $k = 10, 11, 12, 13, 14$, and the denominator sums over the 20 root event types in GDELT.
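The ratio features can be sketched as below. The sketch assumes each record carries one of the 20 CAMEO root codes, so that the denominator equals the number of records on that day; the constant and function names are hypothetical.

```python
from collections import Counter

# Root codes used as observation features (10 through 14).
UNREST_ROOT_CODES = ("10", "11", "12", "13", "14")


def root_code_ratios(day_records):
    """Return the share of one day's events falling under each tracked root code.

    Assumption: every record has a valid root code, so the total number of
    records equals the sum of counts over all 20 root event types.
    """
    counts = Counter(str(rec["EventRootCode"]) for rec in day_records)
    total = sum(counts.values())
    if total == 0:
        # No events reported that day: all ratios are defined as zero here.
        return [0.0] * len(UNREST_ROOT_CODES)
    return [counts[code] / total for code in UNREST_ROOT_CODES]


day = [
    {"EventRootCode": "14"},
    {"EventRootCode": "14"},
    {"EventRootCode": "10"},
    {"EventRootCode": "01"},
]
ratios = root_code_ratios(day)
```

Using ratios rather than raw counts keeps the features comparable across days with very different reporting volume.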
The observed variable $o$ should include the ratios of the above five event types. In addition, we add the mean value of ER(AvgTone), denoted $at_{c,t}$, and the mean value of ER(GoldsteinScale), denoted $gs_{c,t}$, to the observation variable. Thus, the observation $o_{c,t}$ is a vector with 7 dimensions:

$$o_{c,t} = \left[r^{10}_{c,t},\ r^{11}_{c,t},\ r^{12}_{c,t},\ r^{13}_{c,t},\ r^{14}_{c,t},\ at_{c,t},\ gs_{c,t}\right].$$

Here we use a Gaussian mixture output distribution:

$$b_i(o_t) = \sum_{m=1}^{M} w_{i,m}\, \mathcal{N}(o_t;\ \mu_{i,m}, \Sigma_{i,m}),$$

where $M$ is the number of mixture components in the Gaussian mixture and $\sum_{m=1}^{M} w_{i,m} = 1$. As shown in Figure 3, the goal of the training process is to generate both the social unrest event model (SU) and the non-unrest event model ($\overline{SU}$), that is, to calculate the parameters $\lambda_{SU}$ and $\lambda_{\overline{SU}}$. We use the Baum-Welch expectation-maximization algorithm [13] for this purpose. The objective of the training algorithm is to optimize the HMM parameters $\pi$, $A$, and $B$ such that the overall training sequence likelihood is maximized. The sequence likelihood is defined as the probability that a given HMM $\lambda$ generates the observation sequence $O = [o_1, \dots, o_L]$:

$$P(O \mid \lambda) = \sum_{s} \pi_{s_1}\, b_{s_1}(o_1) \prod_{t=2}^{L} a_{s_{t-1},s_t}\, b_{s_t}(o_t),$$

where $s = [s_t]$ denotes a sequence of latent states of length $L$. The sum over $s$ means that all possible state sequences are considered. However, this results in unacceptable complexity, especially when the observation sequence is long.
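As an illustration of a Gaussian mixture output distribution, the sketch below evaluates a mixture density for a single latent state. It assumes diagonal covariance matrices to stay dependency-free; the text does not state which covariance structure was used, and the function names are hypothetical.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_emission(x, weights, means, variances):
    """Mixture density for one state: sum of weighted component densities.

    weights   -- mixture weights, required to sum to 1
    means     -- one mean vector per component
    variances -- one vector of per-dimension variances per component
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# One-dimensional toy example with two identical standard-normal components.
density = gmm_emission([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

In the setting described above, `x` would be a 7-dimensional observation and each state would carry its own set of weights, means, and variances.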
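Here we adopt the forward algorithm or the backward algorithm [13] to solve this issue. With $N$ latent states, denote the forward variable as $\alpha_t(i) = P(o_1, \dots, o_t, s_t = i \mid \lambda)$. We have

$$\alpha_1(i) = \pi_i\, b_i(o_1), \qquad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{i,j}\Big]\, b_j(o_{t+1}).$$

Finally, the sequence likelihood can be efficiently computed by

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_L(i).$$

A backward variable $\beta$ is defined as

$$\beta_L(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j).$$

The sequence likelihood is

$$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i).$$

Using $\alpha$ and $\beta$ together, let $\xi_t(i,j)$ denote the probability that a transition from latent state $i$ to state $j$ takes place at time $t$:

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.$$

$\alpha$, $\beta$, and $\xi$ can be used to maximize the model parameters. The computation of $\alpha$, $\beta$, and $\xi$ and the subsequent maximization of the model parameters are iterated until convergence, which reaches at least a local maximum. Unlike the standard procedure, which starts from a randomly initialized HMM, we initialize $A$ and $\pi$ from the long history of GDELT event records, which reduces the randomness of the parameter initialization. See Appendix for a more technical discussion.
Finally, we trained two HMMs on two corresponding sets of sequences: one set consists of sequences prior to the positive 7-day stretches minus the lead time period, and the other consists of the corresponding sequences for negative stretches. Thus, one model characterizes the development process leading to a social unrest event, while the other characterizes the process that does not lead to a social unrest event.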
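The following sketch contrasts the forward and backward recursions with brute-force enumeration of all state sequences on a toy model. It follows the textbook recursions in Rabiner [13]; the emission values are assumed to be precomputed as a table `B[t][i]`, and all names and numbers are illustrative.

```python
from itertools import product


def forward_likelihood(pi, A, B):
    """Sequence likelihood via the forward recursion.

    pi[i]   -- initial probability of state i
    A[i][j] -- transition probability from state i to state j
    B[t][i] -- emission likelihood of observation t under state i
    """
    n = len(pi)
    alpha = [pi[i] * B[0][i] for i in range(n)]
    for t in range(1, len(B)):
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[t][j]
            for j in range(n)
        ]
    return sum(alpha)


def backward_likelihood(pi, A, B):
    """Sequence likelihood via the backward recursion."""
    n = len(pi)
    beta = [1.0] * n
    for t in range(len(B) - 2, -1, -1):
        beta = [
            sum(A[i][j] * B[t + 1][j] * beta[j] for j in range(n))
            for i in range(n)
        ]
    return sum(pi[i] * B[0][i] * beta[i] for i in range(n))


def brute_force_likelihood(pi, A, B):
    """Sum over all N**L state sequences; feasible only for tiny examples."""
    n, length = len(pi), len(B)
    total = 0.0
    for states in product(range(n), repeat=length):
        p = pi[states[0]] * B[0][states[0]]
        for t in range(1, length):
            p *= A[states[t - 1]][states[t]] * B[t][states[t]]
        total += p
    return total


pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.9]]
likelihoods = (
    forward_likelihood(pi, A, B),
    backward_likelihood(pi, A, B),
    brute_force_likelihood(pi, A, B),
)
```

All three values agree up to floating-point error, but the recursions cost on the order of $N^2 L$ operations, whereas the enumeration visits $N^L$ state sequences.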

Event Prediction.
After the model parameters are trained, social unrest event prediction is formalized as a sequence classification problem. For the prediction, an unknown sequence prior to the target 7-day stretch minus the lead time period is aligned with the model of each class. The sequence is assigned to the class with the higher alignment score, that is, the higher likelihood. However, the likelihood $P(O \mid \lambda)$ becomes very small for long sequences, so that the limits of double-precision floating-point arithmetic are reached. For this reason, a scaling technique based on log-likelihoods is used. Besides, different costs should be associated with classification. For example, falsely classifying a SU-prone sequence as SU-free might be much worse than vice versa.
We use Bayes decision theory to specify the classification rule: the unknown sequence of observations $O$ is classified as SU-prone if and only if

$$\log P(O \mid \lambda_{SU}) - \log P(O \mid \lambda_{\overline{SU}}) > \log\frac{c_{\overline{SU},SU} - c_{\overline{SU},\overline{SU}}}{c_{SU,\overline{SU}} - c_{SU,SU}} + \log\frac{P(\overline{SU})}{P(SU)},$$

where $c_{t,a}$ denotes the associated cost for assigning a sequence of type $t$ to class $a$; for example, $c_{SU,\overline{SU}}$ denotes the cost for falsely classifying a SU-prone sequence as SU-free. $P(SU)$ and $P(\overline{SU})$ are constants representing the prior probabilities of SU sequences and $\overline{SU}$ sequences, respectively. See, for example, [36] for a derivation of the formula. Thus, given the costs of misclassification, the right-hand side of this inequality determines a constant threshold on the difference of sequence log-likelihoods, denoted as $\theta$. If the threshold is small, more sequences are classified as SU-prone, increasing the chance of detecting SU-prone sequences; on the other hand, the risk of falsely classifying a SU-free sequence as SU-prone is also high. If the threshold increases, the behavior is reversed: more and more SU-prone sequences go undetected, at a lower risk of false classification for SU-free sequences.
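One common way to avoid the underflow described above is to run the forward recursion entirely in log space with the log-sum-exp trick. The sketch below shows that variant; it is not necessarily the scaling scheme used in the experiments, and the function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_forward(log_pi, log_A, log_B):
    """Log sequence likelihood via the forward recursion in log space.

    log_pi[i]   -- log initial probability of state i
    log_A[i][j] -- log transition probability from state i to state j
    log_B[t][i] -- log emission likelihood of observation t under state i
    """
    n = len(log_pi)
    log_alpha = [log_pi[i] + log_B[0][i] for i in range(n)]
    for t in range(1, len(log_B)):
        log_alpha = [
            logsumexp([log_alpha[i] + log_A[i][j] for i in range(n)])
            + log_B[t][j]
            for j in range(n)
        ]
    return logsumexp(log_alpha)


# A single-state chain emitting likelihood 1e-5 at each of 100 steps:
# the plain product is 1e-500 and underflows to 0.0 in double precision.
log_likelihood = log_forward([0.0], [[0.0]], [[math.log(1e-5)]] * 100)
```

The log-likelihood here is simply 100 times log(1e-5), a perfectly representable number, even though the likelihood itself cannot be stored as a double.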
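The cost-sensitive decision rule can be sketched as follows. The threshold follows the standard minimum-expected-cost derivation for two classes; the label strings, function names, and the example prior are assumptions chosen for illustration.

```python
import math


def decision_threshold(cost, prior_su):
    """Threshold on log P(O | SU model) - log P(O | non-SU model).

    cost[(true_type, assigned_class)] -- misclassification cost table
    prior_su                          -- prior probability of SU sequences
    """
    numerator = cost[("notSU", "SU")] - cost[("notSU", "notSU")]
    denominator = cost[("SU", "notSU")] - cost[("SU", "SU")]
    return math.log(numerator / denominator) + math.log((1.0 - prior_su) / prior_su)


def classify(loglik_su, loglik_not_su, theta):
    """Label a sequence SU-prone when the log-likelihood gap exceeds theta."""
    return "SU" if loglik_su - loglik_not_su > theta else "notSU"


# Zero-one costs: correct decisions cost 0, both kinds of error cost 1.
zero_one = {
    ("SU", "SU"): 0.0,
    ("SU", "notSU"): 1.0,
    ("notSU", "SU"): 1.0,
    ("notSU", "notSU"): 0.0,
}
# Illustrative prior: roughly the share of positive stretches in the data.
theta = decision_threshold(zero_one, prior_su=0.115)
label = classify(-100.0, -103.0, theta)
```

Raising the cost of missing a SU-prone sequence enlarges the denominator and lowers the threshold, so more sequences are flagged, which is the tradeoff described above.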

Experimental Evaluation
This section presents an experimental evaluation of the performance of the proposed HMM based prediction approach based on comprehensive experiments on GDELT event data from five main countries from Southeast Asia.

Experiment Design
4.1.1. Dataset. Our goal in this paper is to predict the overall level of social unrest using GDELT, and our focus area covers five major nations in Southeast Asia: Thailand, Malaysia, the Philippines, Indonesia, and Cambodia. The numerous event records extracted from online media and the frequent protests or strikes throughout these countries make them ideal for studying patterns and signals prior to social unrest events. As mentioned above, GDELT uses the CAMEO coding system [35], where root event code 14 can be taken to mean social unrest. Figure 4 illustrates the mention counts of protest events occurring in these countries, retrieved from GDELT between January 1, 2001, and February 29, 2016. The average counts of protest events per year for each country range from 480 in Cambodia to 1700 in Thailand. In consideration of the quarterly normalization in Section 3.2, the actual training data cover April 1, 2001, to December 31, 2013, and the test data January 1, 2014, to February 29, 2016. The number of positive 7-day stretches in the 778 weeks, for the training and testing periods and for each country, is shown in Table 2. The training period includes 666 7-day stretches and the testing period 112. An example plot of the ground set for Thailand is shown in Figure 5, with annotations of news abstracts describing the social unrest events in the top ten stretches above the threshold.

Comparison Methods.
We compare the proposed HMM based social unrest event prediction method with a logistic regression (LogReg) model and a baseline method.
The LogReg model [32] also treats event prediction as a classification problem. The input feature is the sum of event mentions of each type in the predictive sequences during the period $\Delta t_d$. The output is 1 if there is an event and 0 otherwise. The baseline method takes the probability of historical social unrest event occurrence as the probability of future social unrest event occurrence. Note that this baseline is also used as the prior parameter in the training process of the HMMs.

Performance Metrics.
We evaluate our social unrest event prediction framework using metrics similar to those described in Kallus [27]. We quantify the success of the proposed predictive mechanism and the comparison methods by their balanced accuracy. Let $y_{c,t} \in \{0, 1\}$ and $\hat{y}_{c,t} \in \{0, 1\}$, respectively, denote whether a significant social unrest event occurs in country $c$ during the days $t - 3$, $t - 2$, $t - 1$, $t$, $t + 1$, $t + 2$, and $t + 3$ and whether we predict there to be one. The true positive rate (TPR) is the fraction of positive instances ($y_{c,t} = 1$) correctly predicted to be positive ($\hat{y}_{c,t} = 1$), and the true negative rate (TNR) is the fraction of negative instances predicted negative. The balanced accuracy (BACC) is the unweighted average of these:

$$\mathrm{BACC} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}.$$

BACC, unlike the marginal accuracy, cannot be artificially inflated. Because positive and negative examples are unbalanced in our dataset, always predicting "no social unrest event" without using any data yields a marginal accuracy of nearly 90% but a balanced accuracy of only 50% (TPR = 0, TNR = 1). More generally, a prediction made without any relevant data always yields a BACC of 50% on average, by statistical independence.
4.1.5. Parameter Settings. The baseline method does not require any parameters, and we implemented the LogReg method following its original description. The proposed HMM based prediction method has four prior parameters (the prediction period $\Delta t_p$, the dimension of the observation vector, the number of latent states $N$, and the number of Gaussian mixture components $M$) and three tunable parameters (the lead time $\Delta t_l$, the data window size $\Delta t_d$, and the threshold $\theta$). We used a prediction period $\Delta t_p$ of seven days (one week) in our experiments. The number of latent states and the number of Gaussian mixture components were set to 5 and 3, respectively. The tunable parameters were estimated by tenfold cross-validation, maximizing the average balanced accuracy over the five countries. The lead time $\Delta t_l$, the data window size $\Delta t_d$ (that is, the sequence length), and the threshold $\theta$ were set to 1, 10, and 6, respectively. Finally, we use the open-source HMM toolbox developed by Murphy [37] to implement the various HMM functions.
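The balanced accuracy metric is straightforward to compute directly from labels and predictions. The sketch below uses a hypothetical helper name and a synthetic label vector whose positive share (11.5%) mirrors the ground set described in this paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the true positive rate and the true negative rate."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to compute BACC")
    true_pos = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Synthetic labels: 115 positive and 885 negative stretches.
y_true = [1] * 115 + [0] * 885
always_negative = [0] * len(y_true)

marginal_accuracy = sum(
    1 for y, p in zip(y_true, always_negative) if y == p
) / len(y_true)
bacc = balanced_accuracy(y_true, always_negative)
```

A predictor that always answers "no event" reaches a marginal accuracy of 0.885 on these labels, while its TPR of 0 and TNR of 1 give a balanced accuracy of exactly 0.5.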

Event Prediction Results
Figure 6 compares our proposed HMM based prediction method with the LogReg model and the baseline method on the BACC metric. For all five countries, our proposed approach achieved the best overall balanced accuracy, outperforming the LogReg model by 27%, 17%, 7%, 15%, and 7% and the baseline by 62%, 39%, 45%, 43%, and 34% for Thailand, Indonesia, the Philippines, Malaysia, and Cambodia, respectively. This is likely because our HMM based prediction framework better captures the features and characterizes the development stages of social unrest events behind the observed sequence data. The poor performance of the baseline method, which is close to that of a totally random model, indicates that focusing solely on the probability of historical social unrest event occurrence is insufficient for social unrest event prediction.
One of the advantages of our HMM prediction method is that it allows a customizable threshold to control the tradeoff between the true positive rate (TPR) and the false positive rate (FPR = 1 − TNR). As we vary the threshold $\theta$, we can monotonically trade off TPR against FPR. The range of achievable rates for each country is plotted in Figure 7 for both the HMM based prediction method and the LogReg method. The HMM based social unrest event prediction model also outperforms the LogReg model for each individual country: the area under the curve (AUC) of the HMM method is clearly larger than that of the LogReg method in every country. In particular, the prediction task for Thailand achieves the best performance. This is probably because Thailand experienced more massive social unrest events, so that more patterns of development were learned.

Sensitivity Analysis on $\Delta t_l$ and $\Delta t_d$.
Although we fixed the parameter values for the comparison in the last section, we also studied the impact of the number of days of lead time, that is, the parameter $\Delta t_l$, and of the data window size $\Delta t_d$ on the event prediction performance for each country. We varied $\Delta t_l$ from 1 to 10 and chose three data window sizes $\Delta t_d$: 10, 20, and 30. The detailed trends are illustrated in Figure 8, leading to two main observations. First, overall, the balanced accuracy decreases as the lead time increases, for all data window sizes in each country, which indicates that the performance is sensitive to the lead time within the given range of parameter values. In most cases, a lead time of 1 day achieves the best performance. This is consistent with the common understanding that the closer one is to the social unrest event, the more likely it is to be predicted. Second, as shown by these curves, the balanced accuracy has no obvious trend with respect to the data window size; the relationship depends on the specific lead time and the specific country. For example, for Thailand, $\Delta t_d = 10$ performs best with $\Delta t_l = 1$, while $\Delta t_d = 30$ performs best with other lead times. For other countries, the relationship is different. This suggests that different data window sizes should be used for different lead times and different countries to achieve the best prediction performance.

Discussion
This paper presents a hidden Markov model based framework that leverages large-scale digital history coded events captured from GDELT, utilizing the temporal burst patterns in GDELT event streams to uncover the underlying event development mechanics and formulating social unrest event prediction as a sequence classification problem. Extensive empirical testing with data from five countries in Southeast Asia demonstrated the effectiveness of this framework in comparison with the logistic regression model and the baseline model, and showed that the GDELT dataset does reflect some useful precursor indicators that reveal the causes or development of future events.
We plan to conduct our future work in the following four aspects. First, we will apply the proposed framework to city-level prediction within a country. Second, we want to add other informative data, such as Twitter and Facebook, to enhance the prediction accuracy. In addition, GDELT 2.0 also provides event mention details and global knowledge graphs [38] in real time, which can give us detailed insights into the events. Third, we plan to label a ground truth dataset for social unrest events in Asia, like the Gold Standard Report (GSR) [12] for Latin America, to better evaluate our future methods. Last, in this paper we do not consider the geographical factor, which also affects event coverage. Next we will improve our model to distinguish widespread news coverage from localized coverage.

First, the transition probability from state $s_i$ to itself is the biggest. This reveals the fact that each type of event often lasts for a period of time. Second, for each state, its second biggest transition probability comes from its neighbouring state, which means that the evolution of event stages follows certain rules and is not a random process.

Figure 1: Event development stages before the protest against the amnesty bill on August 7, 2013, in Bangkok, Thailand.

Figure 3: The proposed HMMs-based social unrest event prediction framework: two HMMs are trained, one for SU-prone sequences and one for SU-free sequences. SU-prone sequences consist of observations within a time window of length $\Delta t_d$ preceding a social unrest event by lead time $\Delta t_l$. SU-free sequences consist of observations at times when no social unrest event was imminent. $t$ is the time at which the prediction is performed. $\Delta t_p$ is the prediction period.

Figure 5: Normalized SU event mention counts of Thailand with annotations for the top ten stretches above the threshold (red line).

Figure 6: Comparison of our HMM based method with the LogReg model and the baseline method based on the BACC metric.

Figure 8: Sensitivity analysis on lead time $\Delta t_l$ and data window size $\Delta t_d$.

A standard continuous HMM can be defined as $\lambda = (A, B, \pi)$. $A$ is an $N \times N$ state transition probability matrix, where $a_{i,j} = P(s_j \mid s_i)$ is the transition probability of moving from the latent state $s_i$ to the latent state $s_j$. $B$ is the emission probability matrix. The output probability for each state, $b_{i,t} = b_i(o_t) = P(o_t; \phi_i)$, is a function of the observation $o_t$ that depends on the model parameters $\phi_i$.

Table 2: Number of positive 7-day stretches in the 778 weeks of our experiment in different countries. The training data cover April 1, 2001, to December 31, 2013, and the test data January 1, 2014, to February 29, 2016.

4.1.2. Ground Set. The ground set was generated in the manner described in Section 3.2. Across the five countries considered, about 11.5% of 7-day stretches are labeled positive, distributed mostly evenly among the countries. The whole training and testing period includes 5448 days, or 778 weeks.

Table 4: Initial state transition matrix $A$ and initial state probabilities $\pi$ for Thailand.