Multi-Ideology, Multiclass Online Extremism Dataset, and Its Evaluation Using Machine Learning

Social media platforms play a key role in fostering the outreach of extremism by influencing the views, opinions, and perceptions of people. These platforms are increasingly exploited by extremist elements for spreading propaganda, radicalizing, and recruiting youth. Hence, research on extremism detection on social media platforms is essential to curb its influence and ill effects. A study of the existing literature on extremism detection reveals that it is restricted to a specific ideology, to binary classification with limited insights into extremism text, and to manual data validation methods for checking data quality. In existing research studies, researchers have used datasets limited to a single ideology. As a result, they face serious issues such as class imbalance, limited insights from class labels, and a lack of automated data validation methods. A major contribution of this work is a balanced extremism text dataset that covers multiple ideologies and is verified by robust data validation methods, for classifying extremism text into popular extremism types such as propaganda, radicalization, and recruitment. The presented extremism text dataset generalizes across multiple ideologies, drawing on the standard ISIS dataset, the Gab White Supremacist dataset, and recent tweets on ISIS and white supremacist ideology. The dataset is analyzed to extract features for the three focused classes of extremism using TF-IDF unigram, bigram, and trigram features. Additionally, pretrained word2vec features are used for semantic analysis. The extracted features in the proposed dataset are evaluated using machine learning classification algorithms such as multinomial Naïve Bayes, support vector machine, random forest, and XGBoost. The best results were achieved by the support vector machine using the TF-IDF unigram model, with an F1 score of 0.67. The proposed multi-ideology and multiclass dataset shows comparable performance to the existing datasets limited to a single ideology and binary labels.


Introduction
Social media have become an integral part of life in the current era. People share their thoughts, beliefs, and ideas over social media platforms. Social media platforms such as Twitter, Facebook, WhatsApp, and Instagram are popular mediums of expression among people. Over 474,000 messages are posted on Twitter, and 293,000 statuses are updated on Facebook [1].
Social media platforms offer extensive outreach and hence have become extremely influential. This makes them a perfect tool for extremists to spread propaganda, radicalize, and recruit. Extremist groups share violent messages, images, and videos over social media. Extremist organizations such as the Islamic State of Iraq and Syria (ISIS) [2] and Al Qaeda [3] use social media platforms to spread extremism amongst susceptible youth.
Similarly, far-right-wing organizations such as Alt-Right [4] and Proud Boys [5] also use social media platforms to radicalize and recruit the youth. Bill S-894 [6] claims that 73% of the violent incidents in the USA after 11 September 2001 have links with far right-wing organizations.
In the recent Christchurch mosque attack [7], the perpetrators were influenced by the Oslo attacker's manifesto [8], spread through online means. The perpetrators live-streamed the Christchurch mosque attack on Facebook [8]. Facebook blocked the initial spread of the attack video; however, some reuploads were left undetected [9].
Online extremism research is crucial to constrain the spread of harmful ideologies amongst the susceptible youth. It also helps the regulatory bodies to monitor and control the spread of extremism.
Online extremism is carried out in the following three ways: (1) spreading propaganda, (2) attracting youths through recruitment messages, and (3) radically changing perceptions of an individual or community.
Propaganda is "content, generally biased, which is exploited for the personal or the political cause" [10]. Misinformation used for political gains is also termed "propaganda." Propaganda has typically been used by dictatorial regimes, such as Nazi Germany and the former Soviet Union, to brainwash people. Propaganda such as "America is dead! Long Live America" [11] is used to attract people.
Jihadist propaganda mainly related to ISIS can be found in their online magazines "Dabiq" and "Rumiyah" [12]. The magazines contain propaganda in the form of glorification of the caliphate and battlefield [13]. White supremacist propaganda used by some organizations follows methods such as pamphlets similar to ISIS [11].
Radicalization is a "change in behavior, attitude, and perception towards a person or a community" [14]. Miscreants use online radicalization to mislead people by quoting their beliefs that may be political or religious [15]. Both jihadists and white supremacists use current events, encourage weapons, and violent attacks as radicalization strategies [11]. Text such as "you do realize IS wants to destroy every single nation-state, Arab or Kurd or communist does not matter, that they come across?" [16], radicalizes people in the name of religion, organization, or nation.
Recruitment in the area of extremism is the "incitement of youths to sacrifice themselves and perform violent acts on behalf of the extremist organization" [17]. Jihadist-ISIS recruiters glorify ISIS fighters' death as martyrdom and exploit it as a recruitment tactic [18]. White supremacists use "feelings of inadequacy," "anti-government themes," and recently "coronavirus themes" to recruit disgruntled youth [11]. Extremists use posters with text such as "Join the Atomwaffen Division," which directly calls for recruitment to the specific extremist organization [11].
Every type of extremist text and speech, such as propaganda, radicalization, and recruitment, has distinct features and effects. These are also explained in [19]. As social media reach is ever-expanding, extremist organizations use these platforms to spread propaganda, radicalize people, and recruit them for violent acts. Thus, it is necessary to develop a tool for identifying propaganda, radicalization, and recruitment to restrict the spread of extremism on social media platforms [16]. Online extremism research faces the following challenges:
(1) Lack of publicly available datasets of extremism text
(2) Lack of ideology-independent and balanced datasets of extremism text
(3) Lack of automated data validation methods for checking the quality of data
(4) Lack of accurate automated detection methods for online extremism text
(5) Limited work on classifying extremism content into categories such as radicalization, propaganda, and recruitment
The contributions of our work are as follows:
(1) Construction of a balanced, multi-ideology extremism text dataset collected from multiple sources such as the StormFront dataset [20], the Gab dataset [21], the ISIS Kaggle dataset [22], and Twitter
(2) The application of statistical data validation methods for checking the quality of the proposed dataset
(3) The development of an automated framework for the detection of online extremism text, which classifies extremism content as radicalization, propaganda, or recruitment
(4) Implementation of the proposed framework with AI techniques for efficient and accurate detection of online extremism
(5) Comparative performance analysis of the proposed Merged ISIS-White Supremacist (MIWS) dataset against the Merged ISIS (MIS) dataset and the Merged White Supremacist (MWS) dataset
(6) Investigation of the best feature extraction technique and classifier for the proposed extremism text dataset
This research work targets two ideologies: ISIS/jihadist and white supremacist.
These two ideologies were selected on the basis of factors such as infamy [23], support of violence [2,8], and the spread of ideology online and offline [24]. Twitter is one of the most popular social media platforms with an extensive reach. Multiple studies have shown that extremists prefer Twitter for spreading propaganda, radicalization, and recruitment [16,25,26]. The StormFront [20] and Gab [21] datasets are referred to as hate speech datasets. Hate speech is defined as the "attack or use of discriminatory language with reference to a person or group" [27]. At the same time, extremism can be referred to as "ideas that are opposed to society's core values which can be of various forms racial or religious supremacy or ideologies that deny basic human rights or democratic principles" [28]. There are multiple definitions of hate speech [29,30] and similarly multiple definitions of extremism [31,32]. However, the definitions and interpretations of hate speech and extremism overlap significantly. Organizations such as the EU already consider StormFront and Gab the primary platforms for right-wing extremist views [33]. Therefore, the StormFront and Gab datasets are treated as extremist datasets in this paper.
Computational Intelligence and Neuroscience

Related Work
Existing literature on extremism detection is analyzed by considering the employed datasets and the classifier techniques applied.

Standard Dataset.
Standard datasets collect extremism text based on a specific ideology. The ISIS Kaggle dataset [22] was compiled by the Fifth Tribe organization to analyze the online spread of ISIS and to counteract it. The dataset contains 17,350 tweets from 112 pro-ISIS user accounts, collected after the Paris attacks [34] in November 2015. The dataset contains 15,684 English-language tweets. This dataset includes the username, location, number of followers, and timestamp of each tweet. It is used in multiple studies to detect and analyze ISIS supporters [35,36]. The ISIS Kaggle dataset is unlabelled. Different researchers have used various techniques to label the dataset. The main problem with the ISIS Kaggle dataset is that it contains old accounts, which Twitter may have suspended for violating its hate speech policy.
The "About ISIS Kaggle Dataset" [37] acts as a counterpoise to the ISIS Kaggle Dataset. This dataset has around 122K tweets mentioning "isis," "isil," "daesh," "islamic state," "raqqa," and "mosul." The dataset is unlabelled, containing pro-ISIS accounts, as the data collected is based on keywords. Most of the accounts are unavailable or deleted in the ISIS Kaggle dataset.
In the ISIS Religious Text Kaggle dataset [38], data is collected by Fifth Tribe. This dataset is compiled by scraping fifteen issues of Dabiq and nine issues of Rumiyah. The dataset contains a total of 2,685 texts. Standard datasets related to jihadism or ISIS ideology are unlabelled and contain suspended accounts.
There are very few standard datasets available in the literature on White supremism hate speech. de Gibert et al. [20] compiled the StormFront dataset. Kennedy [21] collected 27,000 posts from the Gab social network. The Gab social network claims to preserve the freedom of speech and has become a haven for disseminating hate speech. The authors categorize posts into attack on human dignity (HD), call for violence (CV), and offensive/vulgar language (VO). The authors further classify HD and CV into implicit, explicit, race/ethnicity, nationality, gender, religion, sexual orientation, ideology, political ideology, and mental/physical health. The authors considered three classes, HD, VO, and hate (a combination of HD and CV), for the classification.
There are very few standard datasets for either ISIS or White supremacist ideology. The accounts from which data is collected may be inactive, suspended, or deleted by the user or the social media platforms. Moreover, the labels provided within the datasets are inadequate to provide insights into extremism linguistics in both ideologies. Furthermore, there is a lack of data validation techniques to evaluate the standard datasets. Because of these issues, many researchers prefer to collect extremism-related data from various sources and annotate it manually.

Custom Dataset.
Similar to standard datasets, custom datasets are created to represent specific ideologies. Berger [25] in 2014 collected 20,000 ISIS-related accounts from Twitter. The author analyzed the location of supporters, languages spoken by the supporters, identification information of supporters, when the supporter accounts were created, the content of posts by ISIS supporters, and the methods used for the identification of propaganda and recruitment.
Chatfield et al. [16] collected 3,036 tweets from @shamiwitness, who was a known ISIS sympathizer. The tweets of @shamiwitness were manually annotated with propaganda, radicalization, and recruitment by the authors. The account of @shamiwitness is now suspended, so no further analysis can be performed. The authors rely on manual data validation methods with no statistical evidence.
Rowe and Saif [39] used the dataset provided by O'Callaghan et al. [40] as the SEED dataset. From the SEED dataset, the authors identified 154K users suspected of spreading ISIS propaganda. The authors collected 3,200 tweets from each user resulting in 104 million tweets. The authors found 43% of tweets in English, 41% in Arabic, and the rest in Spanish and Dutch. For validation of the dataset, the authors used interrater agreement using two annotators. In addition, the authors used a sample of 2,000 tweets for manual validation, and the agreement of annotators was between 0.4 and 0.6 Fleiss' Kappa. The authors did not use any other statistical technique for data validation.
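To make the interrater agreement figure concrete, the following is a minimal sketch of how Fleiss' Kappa can be computed from per-item annotator labels. It uses the standard textbook formula and is not taken from any of the cited studies; the function name `fleiss_kappa` and the toy labels are assumptions chosen for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # p_j: share of all assignments that went to category j
    p = {c: sum(cnt[c] for cnt in counts) / (n_items * n_raters) for c in categories}
    # P_i: extent of agreement among the raters on item i
    per_item = [
        (sum(v * v for v in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    observed = sum(per_item) / n_items          # mean observed agreement
    expected = sum(v * v for v in p.values())   # agreement expected by chance
    if expected == 1.0:                         # only one category was ever used
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical toy annotations: four texts, three annotators each.
toy = [
    ["propaganda", "propaganda", "propaganda"],
    ["radicalization", "radicalization", "recruitment"],
    ["recruitment", "recruitment", "recruitment"],
    ["propaganda", "radicalization", "radicalization"],
]
kappa = fleiss_kappa(toy)
print(round(kappa, 3))  # prints 0.5
```

In this toy case two of the four items are labeled unanimously and two are split, which yields a kappa of 0.5, inside the 0.4 to 0.6 band reported above.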
Kaati et al. [41] used 66 Twitter users as seeds obtained from Shumukh al-Islam Forum. The authors used hashtags such as #ISLAMICSTATE, #ILoveISIS, and #AllEyesOnISIS. Thus, a total of 27,253 English pro-ISIS tweets and 16,000 Arabic pro-ISIS tweets were collected. The authors did not provide any information on data validation.
Ashcroft et al. [42] used methods similar to those described by Kaati et al. [41] to collect a total of 7,500 tweets consisting of pro-ISIS, anti-ISIS, and random contexts. Unfortunately, most of the data were collected from older accounts, which may have been suspended.
Benigni et al. [43] used a two-step snowballing process to collect accounts related to ISIS. In the first step, the authors used five seed accounts to collect 1,345 unique accounts. In the second step, the authors collected 119,156 user accounts that followed or were related to the 1,345 accounts of the previous step. Thus, the authors collected a total of 862M tweets by the end of step two. Unfortunately, due to the Twitter data-sharing policy, the tweets collected by the authors were not available to the public.
Abrar et al. [44] gathered 13,369 terrorism-supporting tweets, 16,506 terrorism-nonsupporting tweets, and 38,617 random tweets. However, the authors neither mentioned any seed accounts or terrorism-specific keywords used to gather tweets nor applied any data validation methods to the collected dataset.
Ahmad et al. [45] gathered ISIS-related tweets using keywords such as ISIS, bomb, and suicide. The authors also used manually identified seed words for identifying ISIS-related tweets. The authors conclude that 12,754 tweets were extremist and 8,432 were nonextremist. However, the research work lacks data validation on the collected data.
Asif et al. [46] used the Facebook pages of news agencies such as PTV news, Dawn, and Geo to gather extremist texts. A total of 19,497 posts were collected, of which 5,279 were labeled as moderate, 6,912 as highly extreme, 2,991 as low extreme, and 4,315 as neutral. The authors used survey-based validation with 109 random people. However, the survey covered only a sample of 25 posts, which may not represent the whole dataset.
Gialampoukidis et al. [47] collected ISIS-related data by searching five keywords provided by law enforcement agencies and domain experts. This resulted in 9,528 tweets from 4,400 suspected ISIS-supporting users. Unfortunately, this dataset is unavailable due to the data-sharing policy of Twitter.
Researchers have collected data on extreme right-wing, White supremacist ideology from different sources and locations. Jaki and De Smedt [48] collected 50,000 tweets from about 100 Twitter users suspected of supporting far-right ideology in Germany. The authors also collected 50,000 neutral tweets. The authors did not provide any details about data validation methods.
Berger [26] manually collected data from 41 Twitter users who supported the alt-right movement. By checking these accounts' followers, the author collected 27,895 user accounts suspected of supporting the alt-right movement. Berger also collected data from 33,766 neutral user accounts. The author used manual validation for the collected data. The Alt-Right Demographics dataset is not publicly available due to Twitter data-sharing policies, so the results cannot be reproduced.
Some researchers also collected data from multiple ideologies. For example, De Smedt et al. [49] used a multidomain perspective for extremism detection. The authors divided the text into jihadism (ISIS), extremism (far right-wing from Germany, Belgium, Netherlands, US, UK, and Canada), sexism, and racism. The authors collected 50,000 tweets for jihadism, 92,500 tweets for extremism, 10,000 tweets with 15,000 Facebook posts for racism, and 65,000 posts from Incels.me about sexism. The authors used hate and safe labels for the extremism, jihadism, sexism, and racism domains. The authors also used left and right labels for the extremism domain. In addition to detection, the authors performed demographic profiling, psychological profiling, sentiment analysis, and network analysis. Unfortunately, De Smedt et al. do not provide access to the datasets due to strict Twitter policies on data sharing.
Similarly, Berger [23] compared two ideologies, ISIS and Nazis, by collecting data from Twitter. First, to identify the users with White supremacist and Nazi sympathies, the author used 18 seed accounts. The author then collected around 200 tweets from a total of 25,406 followers of these 18 seed accounts. Then, for analysis, the author used 4,000 highly relevant Nazi-sympathizing accounts. Finally, the author used a similar strategy to collect 4,000 ISIS-sympathizing accounts from Twitter.
Heidarysafa et al. [50] compared the women-specific content of ISIS with women-specific Catholic preaching. The authors collected 20 articles from Dabiq and Rumiyah targeting women and 132 articles from catholicwomensforum.org. The authors relied on manual validation but did not provide any statistical evidence.
Araque and Iglesias [51] used different datasets such as Pro-Neu, Pro-Anti, Magazines, SemEval2019 [52], and Davidson [53] to classify radicalization and hate speech using AffectiveSpace and SenticNet. The authors also used multiple features such as TF-IDF and similarity-based sentiment projection (SIMON) for prediction.
Mussiraliyeva et al. [54] collected religious extremist posts in the Kazakh language from the VKontakte [55] social media platform. The authors used different extremist keywords such as "kafir" and "kill" to identify extremist texts. The annotation of an extremist text is based on the presence or absence of selected extremist keywords within the text.
From Table 1, it is observed that the issues plaguing custom datasets are data availability, result reproducibility, binary classification, data imbalance, and single ideology focus. Data availability is an issue due to the policies of social media platforms, which in turn affects the reproducibility of the results for other researchers. Nearly all the researchers using custom datasets use binary classification, which is inadequate for deeper analysis. Extremist data is scarcer than nonextremist data, so class imbalance is inherent in the custom datasets. The biggest problem of both standard and custom datasets is that their focus is on a single ideology.
Thus, there is a need for a generic dataset of extremism text that accounts for multiple ideologies. Additionally, the dataset should help classify extremism text into popular types, that is, propaganda, radicalization, and recruitment. A generic dataset with multiple ideologies, combined with a single multiclass classification model, can efficiently detect online extremism text. These challenges are further explained in Section 3.

Challenges with Existing Online Extremism Datasets.
There are various research gaps in the datasets of online extremism text. The following challenges are observed in online extremism text datasets, as illustrated in Figure 1:

Data Imbalance and Binary Classification.
Data imbalance is a serious problem for online extremism datasets. The StormFront dataset [20] and Gab dataset [21] are good examples of class imbalance. As extremism data is only a fraction of the total data on social media, creating a class-balanced dataset is challenging. Another problem with the datasets is the binary or, at most, three-class classification of extremism data. Extremist-nonextremist, pro-ISIS-not pro-ISIS, and hate-not hate are some of the available binary classes. The third class, if available, is either called "irrelevant" or "neutral." Unfortunately, this classification does not provide analytical insights into the extremism text, thus limiting the understanding of extremist activities on social media. Moreover, the expressions of extremism are complex and change over time. Therefore, it is necessary to create the categories based on the context of extremist texts.
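As a minimal sketch of how class imbalance can be quantified before training, the snippet below reports each class's share of the data and the ratio of the largest class to the smallest. The function name `class_balance` and the 10/90 label split are hypothetical and do not reflect the actual label counts of any dataset cited here.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class shares and the majority-to-minority count ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio

# Hypothetical binary-labeled corpus with a 1:9 split.
labels = ["hate"] * 10 + ["nonhate"] * 90
shares, ratio = class_balance(labels)
print(shares)  # share of each class in the corpus
print(ratio)   # 1.0 means perfectly balanced; larger means more skewed
```

A ratio far above 1 signals that accuracy alone will be misleading and that per-class metrics or rebalancing should be considered.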

Language.
Extremism in different ideologies is spread through different languages. Thus, the identification of extremist text becomes more challenging. Most researchers focus on English, as it is the global language and extremists widely use it to spread their ideology worldwide. Multiple studies by Jaki and De Smedt [48], and De Smedt [49], have addressed online extremism in the Dutch and German languages. Rowe and Saif [39] collected a dataset containing ISIS-related tweets in English, Arabic, Spanish, and Dutch but limited their research to English and Arabic.

Outdated Dataset.
Standard datasets such as the ISIS Religious Text dataset [38] are old because they were obtained during the early days of ISIS. Another issue is the strict data-sharing policy of social media, which makes updating old datasets impossible. This strict data-sharing policy is also one reason for the small number of standard datasets.

Validation.
Most researchers use manual validation with interrater agreement. As it is impossible to validate an entire dataset manually, a few random samples are used for data validation. Thus, bias is introduced unknowingly. The number of experts also affects the bias in data validation. Fewer experts may give good interrater agreement, but the bias persists. The use of multiple experts may lower the bias, but the interrater agreement may deteriorate [46].

Data Quality Assessment.
In online extremism research, researchers often collect their own data [26,35]. Due to the restrictions of social media and other issues, previous custom datasets are not publicly available. So, the comparison of datasets is a major issue in online extremism research. This also makes it hard to compare results: as no two studies use the same dataset, comparing results obtained with different methods and techniques is difficult in online extremism detection research.

Suspended Accounts.
Social media has a strict policy on violence and hate speech [29,56]. Thus, many accounts with such extreme ideologies get suspended immediately. So even after data collection, other researchers cannot reproduce the results due to the unavailability of suspended accounts.
This work aims to address data quality challenges, data validation, data imbalance, and binary classification in extremism datasets. The challenges about languages and suspended accounts do not fall into the scope of this work.

Classifiers.
Network-based, machine learning-based, and deep learning-based techniques are popularly used in online extremism research [19].

Network/Graph-Based Techniques.
Network/graph-based techniques are primarily used for the following reasons:
(i) To cluster extremists on social media
(ii) To identify extremist communities on social media
(iii) To perform data collection by identifying connections among the extremists
Since 2015, only a few studies have used the network/graph-based approach. Agarwal and Sureka [57] used the breadth-first search and shark search algorithms to find extremists and their communities on YouTube. The authors used the class names relevant (extremist) and irrelevant (nonextremist). By using the shark search algorithm, the authors achieved an accuracy of 0.74 and an F1 score of 0.85.
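The idea of expanding from seed accounts by following connections can be illustrated with a generic breadth-first search. This is a minimal sketch under stated assumptions, not the crawler used in [57]: the adjacency dictionary, the account names, and the `bfs_collect` helper are all hypothetical.

```python
from collections import deque

def bfs_collect(graph, seeds, max_depth):
    """Return {account: hops from nearest seed} for accounts within max_depth hops."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in depth:  # first visit is the shortest path in BFS
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth

# Hypothetical connection graph: account -> accounts it links to.
graph = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "isolated": [],
}
reached = bfs_collect(graph, ["seed"], max_depth=2)
print(reached)
```

With a two-hop limit the search reaches `a`, `b`, `c`, and `d` but not `e`, and it never reaches `isolated`, which has no path from the seed. Accounts without connections to the seeds are invisible to this kind of traversal.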
Saif et al. [58] used closegraph to extract subgraphs of extremists on Twitter. The authors used these subgraphs as features for machine learning algorithms such as Naïve Bayes, maximum entropy, and SVM. In addition to subgraphs, the authors used unigram, sentiment, and semantic features. The authors concluded that SVM performs the best with a precision, recall, and F1 score of 0.93 for pro-ISIS and anti-ISIS classes. Petrovskiy and Chikunov [59] also used graph techniques to extract features such as node page rank, hub and authority measure, and betweenness centrality. These features are then used as input for algorithms such as logistic regression, random forest, and XGBoost. The XGBoost algorithm outperforms the other algorithms with an area under the ROC curve of 0.95 for train and 0.94 for test data.
Moussaoui et al. [60] used a possibilistic graph for extremist community detection. Features such as semantic similarity, structural similarity, and possibilistic similarity are extracted using a possibilistic graph-based approach. The authors used subgraphs as feature input to machine learning algorithms. The authors used Naïve Bayes, multinomial Naïve Bayes (MNB), and stochastic gradient descent (SGD) classifiers for extremism detection. SGD achieved a precision of 0.81 and an accuracy of 0.86 for extremism detection.
Network/graph techniques are used mostly to identify communications and interconnections but suffer from multiple challenges:
(i) They cannot work for disconnected nodes in the graph
(ii) Semantic analysis of extremism text cannot be performed with network/graph techniques
Thus, to overcome the challenges of the network/graph approach, machine learning-based and deep learning-based methods are used for online extremism detection.
The machine learning-based approach is used for the classification of data into extremist, nonextremist, or neutral [46,61] or the classification of data into extremist and antiextremist [39,42].
Agarwal and Sureka [64] used k-nearest neighbor and libSVM to identify hate-oriented text from Twitter. The authors used the term frequency as the feature. The authors obtained an accuracy of 0.97, a precision of 0.78, and a recall of 0.83.
Asif et al. [46] used MNB and support vector classifier (SVC) to classify Facebook posts and comments as moderate, high extreme, low extreme, and random. SVC performs better for the classification than multinomial Naïve Bayes, giving an accuracy of 0.82.
Benigni et al. [43] proposed iterative vertex clustering and classification (IVCC) for extremism detection. The authors also used k-means, Louvain grouping, and the Newman method for extremism detection. The authors classify Twitter users into ISIS members, nonmembers, and suspended. IVCC outperforms the other classification methods with an accuracy of 0.96 and an F1 score of 0.93.
Araque and Iglesias [36] used feature engineering by creating emotion features (EmoFeat) and similarity-based feature extraction (SIMON) methods. The authors labeled the data as positive (extremist) and negative. The authors obtained the highest F1 score of 0.94 for EmoFeat and SIMON, with the dataset containing extremist and neutral tweets.
Ashcroft et al. [42] used stylometric, sentiment, and time-based features for online extremism detection. The authors classify data into radical and nonradical. The authors used SVM, Naïve Bayes, and AdaBoost. With all the features, AdaBoost gave a precision of 0.88, a specificity of 0.99, and a sensitivity of 0.79, outperforming the other algorithms.
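Because the studies in this section report different metrics (accuracy, precision, recall or sensitivity, specificity, F1), it helps to see how each follows from the same confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from any cited study.

```python
def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),     # flagged items that are truly positive
        "sensitivity": safe_div(tp, tp + fn),   # also called recall
        "specificity": safe_div(tn, tn + fp),   # negatives correctly left alone
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical imbalanced test set: 8 extremist texts among 1,000.
m = binary_metrics(tp=5, fp=2, tn=990, fn=3)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.995 while sensitivity is only 0.625, which shows why accuracy alone can overstate performance when extremist text is a small fraction of the data.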
Fernandez et al. [35] divided extremists into individual (micro) influence, group (meso) influence, and global (macro) influence based on their tweets. The authors used the collaborative filtering and Naïve Bayes classification methods. The authors used precision as a performance metric. Using Naïve Bayes, the precision obtained for micro is 0.79, for meso is 0.69, and for macro is 0.90.
Mussiraliyeva et al. [62] divided Kazakh language posts from VKontakte [55] into extremist and nonextremist classes. The authors used different classifiers such as logistic regression, MNB, and SVM. The authors also used decision tree-based classifiers such as random forest and gradient boosting. Of all these classifiers, gradient boosting with word2vec gave the best F1 score of 0.86.
Mussiraliyeva et al. [54] used multiple features such as linguistic inquiry and word count (LIWC), part-of-speech (POS), and TF-IDF. The authors used numerous machine learning algorithms such as SVM, k-nearest neighbors (KNN), decision tree, random forest, Naïve Bayes, and logistic regression. KNN using the oversampling method with statistical and TF-IDF features gives an accuracy of 0.99 for religious extremism classification.
Araque and Iglesias [51] used a combination of multiple features such as AffectiveSpace, SenticNet, TF-IDF, and SIMON. The authors used machine learning algorithms such as logistic regression and linear SVM.
De Smedt et al. [67] identified extremist hate speech within English, Arabic, and French language tweets. The authors used character trigrams as features. The tweets were labeled as hate and safe. The authors used libSVM as the classifier. The F1 score was 79% for English, 80% for French, and 84% for Arabic.
Ul Rehman et al. [63] used religious words, radical words, and bad words to detect online extremism. The authors used two classes, extremist and nonextremist. The authors used different algorithms such as Naïve Bayes, SVM, and random forest for the classification. The SVM with all the features outperforms the other algorithms with an F1 score of 0.87.
Sharif [61] divided tweets into pro-Taliban, pro-Afghan, neutral, and irrelevant. The authors used unigrams, bigrams, and TF-IDF for feature extraction. The authors also used principal component analysis (PCA) to reduce dimensions. The research work used Naïve Bayes, SVM, and random forest. SVM with TF-IDF and bigrams offers the best precision of 0.84. Table 2 provides a comparison of all these studies in brief.
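Several of the studies above feed TF-IDF weighted n-grams to a classifier. The following is a minimal, dependency-free sketch of that feature extraction step, assuming whitespace tokenization and the textbook weighting tf × log(N/df). Library implementations often apply smoothing and normalization, so their values will differ. The helper names and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All word n-grams of length n_min..n_max, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs, n_min=1, n_max=2):
    """Return one {term: weight} dict per document."""
    term_counts = [Counter(ngrams(d.lower().split(), n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({
            term: (c / total) * math.log(n_docs / df[term])
            for term, c in counts.items()
        })
    return vectors

# Hypothetical three-document corpus.
docs = ["join the group", "join the march", "read the news"]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term!r}: {weight:.4f}")
```

Terms that occur in every document, such as "the" here, receive a weight of zero, while the bigram "the group", which is unique to the first document, is weighted above "join the", which appears in two documents.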

Deep Learning-Based Techniques.
Even though machine learning-based approaches are popular, they face challenges such as the following:
(i) They depend heavily on manual feature extraction or feature engineering
(ii) They are not suitable for large and unstructured datasets
(iii) Context identification is a challenge
These issues of machine learning methods can be addressed by using the deep learning approach. In the deep learning-based approach, researchers have tried CNN [45], gated recurrent unit (GRU) [45], LSTM [65], and BERT [65].
A deep learning-based approach is used for the following reasons:
(i) Automated feature extraction
(ii) Pretrained models on a large corpus
Recently, deep learning approaches have come into routine use in online extremism detection owing to automated feature extraction and the availability of large computing power.
Kaur et al. [72] classified data into radical, nonradical, and irrelevant classes. The authors used word2vec for feature extraction. Multiple algorithms such as SVM, maximum entropy, and random forest were used. The authors primarily focused on the deep learning approach using LSTM. LSTM with word2vec gives the best precision of 85.96%.
Ahmad et al. [45] used n-grams, TF-IDF, and bag-ofwords (BoW) as feature extraction methods for online extremism detection. Te authors used the CNN model, LSTM model, FastText with word embedding, and GRU. Te LSTM with CNN model ofers an accuracy of 0.92 and a precision of 0.90 outperforming other algorithms.
Alatawi et al. [65] used BERT to detect hate speech related to White supremism on Twitter. Te work used pretrained networks such as Google News Word Vectors, GloVe trained on Wikipedia, and GloVe trained on Twitter. Te authors also train the extremist data using word2vec, referring to it as White supremacist word2Vec (WSW2V). BERT with WSW2V outperformed other techniques with an F1 score of 0.79 and a precision of 0.80. Te direct comparison between approaches in online extremism detection is a problem. Tis is due to the use of diferent datasets, most of which are custom and not publicly available.
Mussiraliyeva et al. [73] in a recent study used CNN and LSTM to classify extremist posts collected from VKontakte. Te CNN and LSTM both provide an AUC of 0.99 for extremism classifcation in the Kazakh language. Table 3 compares the studies employing deep learning for extremism detection.

Proposed Architecture.
This section proposes the architecture for constructing the dataset, which will be used to classify extremism text into the propaganda, radicalization, and recruitment classes, with discussions on data validation methods. The architecture is modularized into the following phases: data collection, data preprocessing, data annotation, and data validation, which are shown in Figure 2.

Data Collection.
The construction of the proposed dataset was performed by collecting data from popular standard extremist text datasets and recent extremist tweets collected from Twitter. Initially, these datasets were divided according to ideology: the ISIS dataset as jihadist, and the StormFront and Gab datasets as White supremacist. All these three datasets together contain around 24,900 extremist tweets. StormFront and Gab have two unique labels, hate and nonhate, while ISIS contains only extremist tweets. In addition, the StormFront dataset accounted for the posts between the years 2002 and 2017, while no data collection timeline is given for the Gab dataset. Twitter was the preferred social media platform for collecting extremist tweets as it is the first choice for extremists to reach out to the target audience. In addition, it is popularly used in research work [48,67] due to its easy accessibility and microblogging format.

Data Extraction from Twitter.
As the standard dataset has its challenges such as outdated text, as mentioned in the previous section, we collected recent extremism tweets from Twitter from January 2021 to June 2021.
The Twitter API allows the collection of real-time tweets with different parameters. It provides a choice to collect tweets based on specific terms or hashtags, tweets of a specific user, tweets from a specific geographical area, and tweets of a specific language. The Twitter API also gives additional information such as username, location, and @user mentions in the tweet. Different queries were formulated, and a final query was selected. To collect ISIS extremism text, specific keywords such as "murtadeen," "munafqeen," "khawarij," "tafkir," "kufar," and "murtad" were used. These are popularly used ISIS-related words obtained from works such as [16,41]. In addition, keywords such as "white genocide," "white lives matter," "it's okay to be white," and "anti-white" were used to collect White supremacist-related tweets. These White supremacist supporting keywords were obtained from [74][75][76].
A total of 2,000 ISIS supporting and 2,000 White supremacist supporting tweets were collected. All these collected tweets are in the English language. Figure 3 provides the keywords used and the wordcloud of hashtags found for White supremacist and jihadist-ISIS supporting tweets. [11,16]. The assumption is that seed examples from different sources provided by different experts may reduce expert bias. A total of 100 examples were identified for jihadist-ISIS and 100 examples for White supremacists on propaganda, radicalization, and recruitment.
As the examples are taken from different research works, they have multiple keywords and different contexts associated with them, reducing the overall bias of the SEED dataset. In Table 4, a few examples are presented to show the tweets and posts considered propaganda, radicalization, and recruitment by the respective studies.

Data Preprocessing.
In this phase, data preprocessing is carried out in the following steps: (i) Removing Stopwords. Stopwords were removed at this step. Then, the words representing nouns, verbs, adverbs, and adjectives were selected. This ensured the inclusion of only relevant words in the final process. The preprocessing steps are illustrated in Figure 4.

Topic Modelling.
Topic modelling is a method to recognize, understand, and summarize a large collection of textual information. Topic modelling is a way to extract a group of words (topics) that accurately represent the collection of documents in a corpus. It is also a form of text mining in which word patterns in a corpus are identified.

Latent Dirichlet Allocation (LDA).
LDA is a probabilistic topic modeling algorithm, which extracts topics from documents, and words in the document are collected by observing their probabilistic distribution.
There are different techniques other than LDA to identify abstract information from a corpus. Latent semantic analysis (LSA) [78] and probabilistic latent semantic indexing (pLSI) [79] are some of them.
LDA focuses on topic identification and analysis, while LSA focuses on reducing matrix dimensions. LSA converges faster due to dimensionality reduction but at the expense of accuracy. pLSI uses a probabilistic model with dimensionality reduction and is faster with acceptable accuracy. Top2Vec is a recent development in finding topics within the documents. Top2Vec [80] has considerable advantages over LDA such as no need for stopword removal, stemming, or lemmatization. BERTopic [81] too has advantages such as deep learning and visualization. But both Top2Vec and BERTopic require a good amount of data, which is a limitation of our study. In addition, LDA is preferred as we need a specific number of topics. Moreover, LDA is used in multiple studies for extremism detection, thus making LDA reliable for extremism detection research.
LDA assumes a mixture of probabilistic distributions of topics over the corpus and of words over each topic. LDA works in the following way, as shown in Figure 5: (i) Assume there are k topics over the entire corpus. (ii) Distribute the k topics across document M according to the per-document topic distribution, whose parameter is denoted as α. The topic distribution for document M is denoted as θ. (iii) Calculate z, which is the topic of the nth word in document M, where N is the number of words in the given document. (iv) Calculate the probability that word w belongs to a particular topic based on the following: (a) The unique topics in document M.
(b) The frequency with which the word w has been assigned to a particular topic across all documents, also denoted as β.
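For reference, the steps above correspond to the generative process usually written for LDA. The block below is a sketch of that standard formulation for a single document, not a formula taken from this work. It assumes that $\beta_{z_n}$ denotes the word distribution of topic $z_n$ and that $\alpha$ is the Dirichlet parameter of the per-document topic distribution.

```latex
\begin{align*}
\theta &\sim \mathrm{Dirichlet}(\alpha) \\
z_n \mid \theta &\sim \mathrm{Multinomial}(\theta), \qquad n = 1, \dots, N \\
w_n \mid z_n, \beta &\sim \mathrm{Multinomial}(\beta_{z_n})
\end{align*}
```

Under this reading, topic extraction amounts to inferring $\theta$ and $\beta$ from the observed words $w_n$.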
For this study, different topics need to be identified within the extremism corpus. Later, these topics are compared for the labeling of extremist texts. So, LDA is used to extract topics from the extremism corpus due to its advantages as mentioned above and as described in Figure 5.

Cosine Similarity.
Cosine similarity computes the similarity between vectors. It calculates the cosine of the angle between vectors and determines whether the vectors point in the same direction. In NLP, cosine similarity is commonly used to measure the similarity between extracted features. Cosine similarity normalizes by the total length of the vectors; with TF-IDF vectors, for example, it therefore takes repetitions of a word into account [82].

This property is used to identify unique words for a particular class in this work. So, cosine similarity is considered for assigning labels from the SEED datasets to the primary datasets. In this work, data labeling is designed to be a four-step process, and the steps are described as follows: (1) Step 1. In the first step, datasets are merged according to ideology. The ISIS Kaggle dataset was merged with recent tweets of jihadist-ISIS collected from Twitter, referred to as the Merged ISIS dataset (MIS). Similarly, the StormFront dataset, Gab dataset, and White supremacist tweets collected from Twitter were merged to form the Merged White Supremacist dataset (MWS). This process is shown in Figure 6. Only the text or tweet data is selected from these standard datasets; everything else is discarded. To preserve the distinct characteristics of each ideology, we adopt the strategy of identifying individual clusters within the ideological datasets. To identify these clusters, the topic modelling approach was chosen [83]. For feature extraction, TF-IDF is used. TF-IDF calculates important words in the corpus with respect to documents. However, even if TF-IDF presents important words, it falls short in identifying context. So, to extract topics from the primary dataset, latent Dirichlet allocation (LDA) [83] is used. As this work aims to classify text into three classes (propaganda, radicalization, and recruitment), three topics are extracted from the MIS and MWS datasets. To achieve this, GridSearchCV [84] is applied to the LDA model with hyperparameters such as n_topics = [3, 4, 5], learning_rate = [0.999, 0.99999], cv = 10, and batch = "online." Using these hyperparameters, the model with the best results gives n_topics of 3 with distinct words per topic.
(2) Step 2. In the second step, we extract a single topic for the propaganda, radicalization, and recruitment examples for each SEED dataset of jihadist-ISIS and White supremacist ideology using LDA. This results in a single topic with the respective important words in propaganda, radicalization, and recruitment. (3) Step 3. To label text in the IS dataset and the WS dataset, the cosine similarity [85] between the topics of the individual MIS and MWS datasets and the topics of propaganda, radicalization, and recruitment from the SEED dataset is calculated. This results in a similarity matrix. When the similarity is maximum for a topic and a label, the respective label (propaganda, radicalization, or recruitment) is assigned to that topic. Thus, documents in the IS and WS datasets are labeled as propaganda, radicalization, or recruitment according to their topics. Figure 11 shows the complete process of data labeling. The calculated cosine similarity between seed labels and identified topics is small. There are different reasons for low cosine similarity, such as few seed examples and not enough significant features in the SEED dataset. These low cosine similarities are accepted because two different datasets, i.e., the SEED dataset and the tweet + website dataset, are compared.
This research work aims to develop an ideology-independent extremism detection model. So, to achieve this aim, the two datasets, MIS and MWS, are merged. This is carried out by retaining tweets or posts, topics, ideology, and labels from both datasets. This merged dataset will be henceforth referred to as the Merged ISIS-White Supremacist dataset (MIWS). As seen in Table 5, for the MIS dataset, topic 0 is labeled as propaganda, topic 1 as radicalization, and topic 2 as recruitment, as significant cosine similarity was found with the respective classes in the SEED ISIS dataset. On the other hand, in the MWS dataset, topic 0, topic 1, and topic 2 are labeled as radicalization, recruitment, and propaganda, respectively, as a significant similarity score was found with the respective classes of the SEED White Supremacist dataset. Figures 12(a)-12(c) show important words in the MIWS dataset for propaganda, radicalization, and recruitment.
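The topic-to-label assignment in Step 3 can be illustrated with a minimal sketch. It assumes each topic is represented as a sparse word-weight dictionary. The function names `cosine_similarity` and `assign_labels` and the toy weights are hypothetical and are not taken from the original experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {word: weight}."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def assign_labels(corpus_topics, seed_topics):
    """Give each corpus topic the seed label with the maximum cosine similarity."""
    labels = {}
    for topic_id, topic_vec in corpus_topics.items():
        scores = {
            label: cosine_similarity(topic_vec, seed_vec)
            for label, seed_vec in seed_topics.items()
        }
        labels[topic_id] = max(scores, key=scores.get)
    return labels


# Hypothetical toy topic-word weights (illustration only, not real data).
seed_topics = {
    "propaganda": {"video": 0.6, "news": 0.4},
    "radicalization": {"enemy": 0.7, "fight": 0.3},
    "recruitment": {"join": 0.8, "brothers": 0.2},
}
corpus_topics = {
    0: {"news": 0.5, "video": 0.3, "state": 0.2},
    1: {"fight": 0.6, "enemy": 0.2, "war": 0.2},
    2: {"join": 0.5, "us": 0.3, "brothers": 0.2},
}
labels = assign_labels(corpus_topics, seed_topics)
```

Note that this sketch picks the best label for each topic independently, so two topics could in principle receive the same label. A one-to-one assignment would need an extra constraint.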

Data Validation (MIWS).
In this section, we discuss the statistical tests employed for the data quality assessment. We employed three statistical techniques: cosine similarity, the Wilcoxon signed-rank test, and the chi-square test.

Cosine Similarity.
Cosine similarity can be used to compare the similarity between samples. Propaganda, radicalization, and recruitment are compared based on words and their TF-IDF scores. The cosine function was applied to pairs of classes. These pairs are described in Table 6. Thus, each class is represented by distinct unique words, and they influence each class differently. Figure 11 shows cosine similarity between different datasets, while in Table 6, similarities are seen within classes of the same dataset. Thus, even if the values in Table 6 look significant, there is not enough similarity within the dataset given the N1 and N2 sizes.

Wilcoxon Signed-Rank Test.
Wilcoxon signed-rank test [86] is a nonparametric test. It can determine whether the two samples are collected from the population of the same distribution. Wilcoxon signed-rank test is also used to compare two closely related samples and perfectly matched samples.
In this paper, the Wilcoxon signed-rank test is used to test whether the selected random samples belong to a particular class, i.e., propaganda, radicalization, or recruitment. Figure 13 shows the detailed experiments performed to calculate the Wilcoxon signed-rank test. CountVectorizer [87] was applied for feature extraction to the corpus of each class separately. CountVectorizer returns a matrix with the count of tokens. This was performed so that higher count words from each corpus may get priority. TfidfVectorizer [88] was also considered for this experiment, but it leads to a dimensional mismatch for the Wilcoxon signed-rank test. To perform these experiments, a null hypothesis is required, which is as follows:
H0: the medians of the word counts of the classes are equal. Therefore, there is no significant difference between the classes.
H1: the medians of the word counts of the classes are not equal. Therefore, there is a significant difference between the classes.
The Wilcoxon signed-rank test compares examples based on two test statistics. The first is the W test statistic, which is the sum of ranks with differences below or above zero. The second is the p value, which is the evidence against the null hypothesis. Together, W and the p value determine the validity of the null hypothesis.
To calculate W, the following procedure is performed. Let N be the sample size, and for each pair i, let $x_{1,i}$ and $x_{2,i}$ denote the measurements.
(i) Calculate $|x_{2,i} - x_{1,i}|$ and $\operatorname{sgn}(x_{2,i} - x_{1,i})$, where sgn is the sign function that returns the sign of a real number. (ii) Exclude the pairs with $|x_{2,i} - x_{1,i}| = 0$; the reduced sample size is $N_r$. (iii) Order the remaining pairs in ascending order of the difference $|x_{2,i} - x_{1,i}|$. (iv) Rank the pairs, with the smallest nonzero difference ranked as 1. Let $R_i$ denote the rank. (v) The test statistic W is calculated as

$$W = \sum_{i=1}^{N_r} \left[ \operatorname{sgn}(x_{2,i} - x_{1,i}) \cdot R_i \right]$$

The p value is considered as the evidence against the null hypothesis. The null hypothesis is rejected if the p value is <0.05. This threshold of 0.05 or 5% is considered the level of significance. The count for each word representing the classes is calculated.
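Steps (i)-(v) can be traced with a short sketch that computes W as the sum of the ranks, each multiplied by the sign of its difference. It is illustrative only: the function name `wilcoxon_w` is hypothetical, and assigning tied absolute differences their average rank is an assumption that the text does not state.

```python
def wilcoxon_w(x1, x2):
    """Return (W, N_r) for paired samples, following steps (i)-(v).

    W is the sum of sign(x2_i - x1_i) * R_i over the pairs with a
    nonzero difference. Tied absolute differences share their average
    rank (an assumption; the text does not specify tie handling).
    """
    if len(x1) != len(x2):
        raise ValueError("paired samples must have the same length")
    # Steps (i)-(ii): signed differences, excluding zero differences.
    diffs = [b - a for a, b in zip(x1, x2) if b != a]
    n_r = len(diffs)
    # Step (iii): order the pairs by ascending absolute difference.
    order = sorted(range(n_r), key=lambda i: abs(diffs[i]))
    # Step (iv): assign 1-based ranks, averaging over ties.
    ranks = [0.0] * n_r
    pos = 0
    while pos < n_r:
        end = pos
        while end + 1 < n_r and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    # Step (v): signed-rank sum.
    w = sum((1 if d > 0 else -1) * r for d, r in zip(diffs, ranks))
    return w, n_r


# Hypothetical word counts for the same five words in two classes.
w_stat, n_reduced = wilcoxon_w([10, 12, 9, 7, 5], [12, 12, 8, 10, 9])
```

Library implementations may report a different statistic. For example, SciPy's `scipy.stats.wilcoxon` returns the smaller of the positive and negative rank sums for its default two-sided test, so its value is not directly comparable to this signed sum.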
As the classification is a multiclass classification, the tests are divided into different cases, which are as follows: (i) Case 1: the propaganda class and recruitment class are compared using CountVectorizer of n number of words from both classes. (ii) Case 2: the radicalization class and propaganda class are compared using CountVectorizer of n number of words from both classes. (iii) Case 3: the recruitment class and radicalization class are compared using CountVectorizer of n number of words from both classes. Table 7 shows the cases, their samples, test statistics, hypothesis, and inference. The Wilcoxon signed-rank test provides the test statistic "W," which is used to calculate the p value from the reference table [86].

Chi-Square Test.
The chi-square test is a popular statistical test used to evaluate the relationship between two variables [89]. Most of the time, the chi-square test is applied to test the dependence between the occurrence of a term and the occurrence of a class. Moreover, it is commonly used as a feature selection method. For example, the following formula is used to calculate the rank of terms that appear in the corpus:

$$\chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

Here, $e_t$ and $e_c$ are binary variables in the contingency table, t is the term, c is the class, D is the corpus, N is the observed frequency, and E is the expected frequency. The term t and class c are said to be dependent if $\chi^2$ is high. This makes term t an important feature that indicates class c. Table 8 shows important words within the ISIS SEED, WS SEED, and MIWS datasets obtained by applying the chi-square test. Each dataset has a few repeated words. This can be attributed to different ideologies, sources, and dataset sizes.

As seen in Table 6, cosine similarity shows that the obtained classes, namely, propaganda, radicalization, and recruitment, are significantly different. The Wilcoxon signed-rank test also shows significant differences between the classes, so they have distinct features that make them unique. The chi-square test in Table 8 shows distinct word features that depict propaganda, radicalization, and recruitment. Thus, it can be inferred that the newly formed classes propaganda, radicalization, and recruitment stand unique under statistical validation methods.
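The chi-square term ranking described in this section can be sketched for a single term and class with a 2x2 contingency table of document counts. This is a minimal illustration. The helper names `term_class_counts` and `chi_square`, and the toy documents, are hypothetical and do not come from the MIWS data.

```python
def term_class_counts(docs, labels, term, target):
    """Count documents by term presence (e_t) and class membership (e_c)."""
    n11 = n10 = n01 = n00 = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_class = label == target
        if has_term and in_class:
            n11 += 1
        elif has_term:
            n10 += 1
        elif in_class:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic summed over the four cells of the 2x2 table."""
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_totals = {1: n11 + n10, 0: n01 + n00}
    class_totals = {1: n11 + n01, 0: n10 + n00}
    chi2 = 0.0
    for (e_t, e_c), obs in observed.items():
        # Expected count if term and class were independent.
        expected = term_totals[e_t] * class_totals[e_c] / n
        if expected > 0:
            chi2 += (obs - expected) ** 2 / expected
    return chi2


# Hypothetical tokenized documents and labels (illustration only).
docs = [["join", "us"], ["join", "now"], ["news", "video"], ["fight", "enemy"]]
labels = ["recruitment", "recruitment", "propaganda", "radicalization"]
score = chi_square(*term_class_counts(docs, labels, "join", "recruitment"))
```

Ranking every term in the vocabulary by this score for each class gives the kind of per-class word list reported in Table 8.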

Dataset Evaluation
Experimental Setup.
Experiments were carried out on an HP Workstation Z8 G4 machine. It is equipped with a Xeon processor of 3 GHz, 128 GB of RAM, and an Nvidia Quadro P400 GPU with 2 GB memory. In addition, some experiments were carried out on an Nvidia DGX-Server with 4 Nvidia Tesla V-100 GPUs with 32 GB memory. Due to the limited capability of these systems, Google Colab was used. All the results in Table 9 were obtained on Google Colab.

Size of Datasets.
The sizes of the datasets are provided in Table 10.

Analyzing Imbalance in Datasets.
The balance and imbalance in the datasets are shown in Table 11.

Feature Extraction Techniques.
To create word vectors, different feature extraction techniques are used in online extremism detection. In this work, the following feature extraction techniques are used. As seen in Table 9, TF-IDF is used as the feature extraction technique. TF-IDF gives the important words in a document based on their weightage in the corpus [90]. Thus, TF-IDF was chosen, as it shows word importance and is also used in many studies. Unigrams are considered to identify and elevate the importance of unique words representing a particular class: propaganda, radicalization, or recruitment.

Bigrams and Trigrams with TF-IDF.
Bigram and trigram features are used with TF-IDF for more complex analysis. These features provide the combinations of words that affect the classification of the documents.
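To make the unigram, bigram, and trigram TF-IDF features concrete, the sketch below computes them for tokenized documents. It uses the textbook weighting tf × log(N/df). This weighting is an assumption: library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization, so their values differ. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one {ngram: weight} dict per document, with weight = tf * log(N / df)."""
    counts_per_doc = [Counter(ngrams(doc, n)) for doc in docs]
    document_frequency = Counter()
    for counts in counts_per_doc:
        document_frequency.update(counts.keys())  # each doc counts once per n-gram
    total_docs = len(docs)
    vectors = []
    for counts in counts_per_doc:
        length = sum(counts.values())
        vectors.append({
            gram: (count / length) * math.log(total_docs / document_frequency[gram])
            for gram, count in counts.items()
        })
    return vectors


# Hypothetical tokenized documents (illustration only).
docs = [["breaking", "news", "today"], ["news", "today", "live"]]
unigram_vectors = tfidf(docs, n=1)
bigram_vectors = tfidf(docs, n=2)
```

In this toy corpus, n-grams shared by both documents get zero weight, while n-grams unique to one document get a positive weight. This is the property that lets unique words or word combinations stand out for a class.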

Word2Vec.
Word2vec uses a neural network to learn word embeddings or word vectors from the given corpus. Word2vec is used to gather more dimensional features to classify extremism text into propaganda, radicalization, and recruitment. The word2vec model pretrained on Google News with 300 dimensions was used for feature extraction in this work. Figure 14 shows word vectors and their positions with respect to each other using t-SNE. Euclidean distance is used as a metric to calculate the distance between features. Thus, the smaller the Euclidean distance, the more frequently the words appear together in a group. In Figure 14, it can be seen that extremism influencing words are close to each other. Words such as "islamic state," "dead," "Afghanistan," "wounded," and "targeted" form a group. It can also be observed that "bomb," "raqqa," "destruction," "gaza," "terror," "attack," and "battle" indicate the focus of groups on a particular location. Words such as "white," "muslims," "muslim," and "black" stood out from other keywords, indicating their usage in different contexts. Thus, word2vec can be effectively used for online extremism detection. Word2vec is used in combination with the classifiers mentioned in the next section. Word2vec is fine-tuned to a window size of 15, a minimum count of 10 words, and ten iterations to provide the best possible performance metrics.

Classifiers.
To classify and predict, this work uses the following ML algorithms:

Multinomial Naïve Bayes.
MNB works on the probabilistic principle. Naïve Bayes assumes that there exists conditional independence between every pair of features. In addition, MNB also assumes that the distribution for all pairs is a multinomial distribution. This assumption of a multinomial distribution works well in the case of word counts in a document. Thus, classifying text data based on the probabilistic appearance of a word within the document helps to get a baseline for performance metrics.
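The idea of classifying by the probabilistic appearance of words can be sketched with a tiny multinomial Naïve Bayes over token counts. This is a teaching sketch, not the configuration used in the experiments. The class name `TinyMultinomialNB`, the add-one smoothing parameter `alpha`, and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over token lists (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing, an assumption

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def log_posterior(self, doc, label):
        """Unnormalized log P(label | doc) under the independence assumption."""
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + self.alpha * len(self.vocab)
        score = math.log(self.class_counts[label] / self.total_docs)
        for word in doc:
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + self.alpha) / denominator)
        return score

    def predict(self, doc):
        return max(self.class_counts, key=lambda label: self.log_posterior(doc, label))


# Hypothetical tokenized training documents (illustration only).
train_docs = [["join", "us", "now"], ["join", "today"], ["watch", "video"], ["video", "news"]]
train_labels = ["recruitment", "recruitment", "propaganda", "propaganda"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["join", "now"])
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.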

Support Vector Machine.
In online extremism detection, SVM can separate important words of a particular group or class by defining the exact separation line. This separation line is referred to as a hyperplane. SVM creates support vectors that are at the optimal distance from the hyperplane. This ensures that the words of a particular group are at a significant distance from the words of another group. So, one can get fairly accurate performance metrics due to this property of SVM.

Random Forest.
Random forest uses multiple decision trees to classify data. Every decision tree consists of decision nodes, root nodes, and leaf nodes. Every decision tree in a random forest is trained on a subsample of the dataset. Thus, each tree is ensured to be built upon the best subset of features. It takes the majority output of the decision trees to arrive at the classification. This reduces overfitting, thus making random forest a good choice for extremism text classification.

XGBoost.
XGBoost uses gradient boosting for the classification. In XGBoost, gradient boosting is achieved by pruning trees backward that exceed the maximum depth of tree criteria, thus increasing the speed of the algorithm by employing the depth-first technique. XGBoost can also work with a small amount of data. XGBoost also supports out-of-core computing, that is, it can handle data that do not fit into main memory. Another advantage of XGBoost is that it provides parallelization, thus making the classification process faster. Figure 15 provides details about the ML pipeline for the best-fit model. In this pipeline, the MIWS dataset with preprocessed data is taken as input. Table 10 shows the count of tweets, while Table 12 lists the hyperparameter values. The ML algorithms are scored on the basis of performance metrics such as precision, recall, and F1 score. The ROC-AUC curve is also created for visualising the performance of the algorithms. On the basis of the performance metrics and the ROC-AUC curve, the best-fit model is selected. A total of 64 experiments were conducted to get consistent results. The final models for every algorithm provided stable results, as shown in Table 12. The bold values in Table 12 indicate the best results due to these hyperparameter values.
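The scoring step of this pipeline can be illustrated by computing precision, recall, and F1 score per class from true and predicted labels. The sketch below is illustrative. The function names are hypothetical, the toy labels are not real results, and macro averaging is an assumption, since the text does not state how per-class scores are aggregated.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: {"precision", "recall", "f1"}} using one-vs-rest counts."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores (macro averaging, an assumption)."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Hypothetical labels (illustration only, not results from this work).
classes = ["propaganda", "radicalization", "recruitment"]
y_true = ["propaganda", "propaganda", "radicalization",
          "radicalization", "recruitment", "recruitment"]
y_pred = ["propaganda", "radicalization", "radicalization",
          "radicalization", "recruitment", "propaganda"]
scores = per_class_scores(y_true, y_pred, classes)
overall = macro_f1(scores)
```

Reporting the per-class values next to the average helps show whether one class, for example recruitment, is consistently harder to predict than the others.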

Results and Discussion
Multiple machine learning classifiers are used to assess and measure the classification performance of extremism data into propaganda, radicalization, and recruitment. The algorithms used are MNB, SVM, random forest, and XGBoost.

These machine learning classifiers are chosen as they have been popularly used in online extremism detection research [36,62].

Comparison of TF-IDF Unigram Results.
Figures 16-19 show the comparative performance of the four feature extraction techniques with the classifiers. It can be observed from the figures that TF-IDF unigram outperforms the other feature extraction techniques, as unigrams extract the unique words that characterize the class. On the other hand, bigrams and trigrams offer comparatively low performance compared to unigrams for the frequent combinations of words in the multi-ideology MIWS dataset.
Word2vec with XGBoost offers comparable performance for the MIWS dataset, as it is pretrained on Google News data, which may have accounted for extremism text. XGBoost with word2vec gives an F1 score of 0.60. It is also observed that word2vec can achieve better performance with more training epochs.

ROC-AUC (Unigram) for All Classifiers for MIWS.
The receiver operating characteristic (ROC) curve is the graph that shows the performance of classification models at all classification thresholds [91]. The area under the curve (AUC) represents the total two-dimensional area underneath the ROC curve [92]. It is observed that the performance of all classifiers on the MIWS dataset is satisfactory, with an AUC of around 0.70 for MNB and SVM. For random forest and XGBoost, the AUC is around 0.65. Thus, it can be said that SVM with TF-IDF unigram outperforms the other classifiers.
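As an illustration of what the AUC measures, the sketch below computes it for one class against the rest. It uses the equivalent pairwise formulation: the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy scores are hypothetical. How the multiclass curves in this work were aggregated is not stated, so the one-vs-rest framing is an assumption.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("both positive and negative examples are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for one class versus the rest.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_flags = [True, False, True, False]
auc = roc_auc(example_scores, example_flags)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the values of around 0.65 to 0.70 reported above.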
Furthermore, SVM performs better because it separates the classes by a margin based on the unique words present in the MIWS dataset. The standard deviation of the results on the MIWS dataset is quite low. Thus, the results are stable. Table 14 provides the rank for each algorithm with features based on the results in Table 9. The Friedman rank test was performed to determine a rank-based significance for the obtained results. The p value calculated by the Friedman test was less than 0.05, that is, 1.7651e-8. As seen in Tables 9 and 14, the ranks were calculated in descending order of results, so the lower the rank, the more significant the results are. Therefore, the SVM + TF-IDF results are significant and better than the other algorithm and feature combinations.
Tables 15-18 give the precision, recall, F1 score, and support for the TF-IDF unigram on the MIWS dataset for the chosen classifiers. It can be observed that SVM is the best classifier for the propaganda, radicalization, and recruitment classes with F1 scores of 0.68, 0.72, and 0.63, respectively.
For bigram and trigram features, the performance of the algorithms reduces drastically. This can be attributed to the different words based on the ideologies that are merged in a single dataset. Thus, bigrams and trigrams may not be effective in identifying and analyzing multiple ideologies together. Word2vec gives better performance for XGBoost. The F1 score obtained from XGBoost with word2vec is 0.60. Figures 22 and 23 show the confusion matrices obtained by applying MNB, SVM, RF, and XGBoost on the MIWS dataset.
As seen in Tables 9-18, the results are somewhat low. This is due to the merging of two different ideologies, as the aim is to develop a generalized and ideology-independent extremism detection model. Methods and techniques to improve the results are discussed in the Future Work section. Table 9 shows the comparative performance of the classifiers on the different feature extraction methods.
The MIWS dataset with ∼17,000 ISIS and ∼11,000 WS examples is a multi-ideology dataset. The extremist dataset was developed and validated with three statistical methods, which showed that the dataset is robust with unique features in the three classes. The performance of ML algorithms on these extracted features in the dataset also shows potential for applying DL classifiers.

Limitations.
The size of the dataset is an important aspect of machine learning. However, the size of the SEED dataset used in this work is limited, with fewer research articles. This is due to the lower availability of extremist text examples classified as propaganda, radicalization, and recruitment in the existing literature. Even with data imbalance, current data provides acceptable results, but balanced data is required to predict extremist text with precision.
The extremist text in the existing literature was manually labeled as propaganda, radicalization, and recruitment by experts. However, this labeling is limited by interrater agreement or expert opinion in the existing literature. Thus, the SEED dataset that is employed for topic modeling has the threat of expert bias. Hence, the work relies on statistical validation techniques to verify the strength of the dataset. Furthermore, it is challenging both experimentally and ethically to quantify the bias of experts. Hence, at the current stage of research, it is not possible to compare the bias of both experts and the ML algorithm.
In this work, only three different topics or classes are considered for extremism text classification. Therefore, these topics were identified using simple LDA. Context-aware LDA [93] or context-aware topic modeling could be used to extract multiple different topics within extremism text.
Rigorous statistical tests were essential for estimating the strengths of the topic clusters. This work employed cosine similarity, Wilcoxon signed-rank, and chi-square tests for data validation as they were popularly employed in the literature. However, more statistical tests can be additionally employed to ensure the quality of data.
In this work, only four feature extraction techniques and four machine learning classifiers are employed on the developed MIWS dataset. Therefore, the results are limited by the choice of these representative classifiers and feature extractors. The purpose of the classification and feature extraction was to realize a model that would accurately classify the dataset.
A variety of advanced feature extraction techniques such as pretrained vectors can be further evaluated for better accuracy. Advanced classifiers and transformers can also be employed for achieving better accuracy.

Conclusion
This work focuses on constructing a multi-ideology and multiclass extremism text dataset with a comparative analysis of the performance of feature extraction techniques and machine learning classifiers. Most extremism research studies focus on a single ideology, with binary or tertiary classification such as extremist, nonextremist, and irrelevant classes. Consequently, there are limited insights from such works [19].
In this work, we develop a multi-ideology dataset with the most popular jihadist-ISIS and White supremacist ideologies. This dataset provides a broader view of extremism text with popular extremist ideologies brought together for better insights into data. The work also builds a multiclass extremist text dataset by classifying data as propaganda, radicalization, and recruitment.
The extremist text dataset was made contemporary by collecting extremist texts from different data sources (Twitter, ISIS Kaggle, StormFront dataset, and Gab dataset). In addition, we created ideology-specific datasets, which are called MIS (jihadist-ISIS) and MWS (White supremacist), and the proposed MIWS (multi-ideology) dataset, with data preprocessing techniques applied.
A SEED dataset was created using existing literature that provided us with labeled examples of propaganda, radicalization, and recruitment. Then, the labeled SEED dataset was used to group/cluster the MIS, MWS, and MIWS datasets into propaganda, radicalization, and recruitment by using the LDA technique and cosine similarity. The grouping/clustering was further validated using statistical techniques. In this work, three different statistical tests, namely cosine similarity, the Wilcoxon signed-rank test, and the chi-square test, validated the data labeling. Thus, our work avoids the expert bias that results from manual validation in previous literature. The visualization of word vectors with t-SNE is also performed to highlight the unique words in the propaganda, radicalization, and recruitment classes from the MIWS dataset.
To assess the performance of the datasets, multiple features such as TF-IDF (unigram, bigram, and trigram) and pretrained word2vec (Google News) are used. These features were provided as input to classifiers such as MNB, SVM, RF, and XGBoost. For the proposed MIWS dataset, TF-IDF unigram with SVM provides the highest precision of 0.69, recall of 0.68, and F1 score of 0.68. Thus, the results obtained using ML algorithms can be considered as a baseline for future work consisting of deep learning techniques.
This work pioneers the development of a multi-ideology extremism text dataset, MIWS, which supports classifying extremism data into multiple classes such as propaganda, radicalization, and recruitment, with robust statistical data validation techniques employed. Furthermore, this work investigates the best feature extraction technique and classifier for the proposed MIWS dataset, which yields better classification performance.

Future Work.
The presented work is an important milestone in online extremism text detection research. It opens multiple avenues in the following research areas:

Versatility of Extremism Text Dataset.
Our work shows that multi-ideology datasets create a broader view of extremism text with classification performance comparable to that of single-ideology datasets. In the future, the presented dataset can be made more versatile by including other popular extremist ideologies and sources. Enlarging the SEED dataset may also produce more significant results. Different techniques such as word mover's distance [94] can also be used to calculate and improve the similarity between labels and topics.

Feature Extraction Techniques.
Context-aware topic modeling can be used to extract multiple different topics such as promoting violent acts and antisemitism. Popular feature extraction techniques based on pretrained vectors, such as GloVe [95] and FastText [96], can be employed to extract complex relationships among extremism data. These can further enhance the accuracy of extremism detection models.

Transfer Learning and Deep Learning Approaches.
This research work uses machine learning classifiers for evaluating the proposed dataset. Future work can use deep learning models such as LSTM and CNN, and pretrained networks such as FastText, BERT, or RoBERTa, for a better semantic analysis of extremism data. This can help achieve higher performance for the classification of extremism text into propaganda, radicalization, and recruitment.

Detection of Extremism Based on Geographical Context.
The geographical location of extremists and extremist organizations plays an important role in analyzing propaganda, radicalization, and recruitment on social media platforms. Researchers have used tweet location to identify extremist affiliations. It is also necessary to identify the targeted nations from the extremist text, which can help anticipate the activities of extremists. Thus, the extraction of geographical locations can play a major role in providing insights into extremist propaganda, radicalization, and recruitment tactics.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.