Design and Analysis of a Novel Authorship Verification Framework for Hijacked Social Media Accounts Compromised by a Human

Compromising the online social network account of a genuine user, by imitating the user's writing traits for malicious purposes, is a standard method. Then, when it happens, the fast and accurate detection of intruders is an essential step to control the damage. In other words, an efficient authorship verification model is a binary classifier that investigates whether a text was written by the genuine user or not. Herein, a novel authorship verification framework for hijacked social media accounts, compromised by a human, is proposed. Significant textual features are derived from a Twitter-based dataset. The dataset is composed of 16124 tweets with 280 characters, crawled and manually annotated with the authorship information. The XGBoost algorithm is then used to highlight the significance of each textual feature in the dataset. Furthermore, the ELECTRE approach is utilized for feature selection, and the rank exponent weight method is applied for feature weighting. The reduced dataset is evaluated with several classifiers, and the achieved F-score is 94.4%.


Introduction
Online social networks (OSNs) are considered essential sources of information and platforms that bring people together. The wide usage of OSNs, along with the crowds' self-confidence, leaves OSNs unprotected against hijackers, and the resulting hijacking attacks represent significant challenges in OSNs [1]. Authorship verification originates mainly from the science of stylometry, which studies the writing style of the author, and counts features in the text that indicate the personality of the author [2]. The goal of authorship verification is to determine whether two separate documents were written by the same author [3]; Rocha et al. [4] defined authorship verification as follows: given two tweets, a prediction is made as to whether or not they are from the same author.
The hijacking attacks that accompany OSNs could be the following.

Compromised Accounts by Human Beings.
Hackers gain access to the OSN account through credential information or through the user's misuse of the account security features. The intruder is disguised as a genuine user to post fake news or gain private information from the user's network, which affects the user's reputation and leads to economic loss [15]. Trang et al. [1] have mentioned many examples describing this kind of attack. This kind of attack has been discussed in [1,12,18]. The apparent anomalous behaviors of fake accounts and of accounts compromised by a bot are the key insights that researchers use to uncover them [11,17]. Twitter shut down 70 million of these accounts in 2018. In contrast, detecting accounts compromised by human beings is still a challenge [11], as there is no way yet to authenticate users' writing styles in OSNs.
Motivated by the challenge of finding a method for profiling users through the most significant textual features that exist in the short messages on OSNs, in order to authenticate their writings with high accuracy, we propose a model in this study.
In summary, the main steps of the study are as follows. A Twitter-based dataset has been crawled and manually annotated with the authorship information to evaluate the proposed model, which gathers some stylometry features from the tweets. Then, the XGBoost algorithm is implemented for feature extraction of the dataset. To further improve the classification results, one of the Multicriteria Decision-Making (MCDM) approaches, called ELECTRE, is employed for feature reduction: ELECTRE chooses the most important features and eliminates the features that have a negative impact on the performance of the classifiers. The selected features are assigned weights according to their ranks using the rank exponent weight method. Finally, they are given to different classifiers, among which the logistic regression algorithm has achieved the highest performance. Later, the proposed model is compared with traditional models using the performance measures recall, precision, F-score, and accuracy. The rest of this paper is divided into four sections. Section 2 presents the review of literature. A detailed explanation of the proposed approach is given in Section 3. Then, in Section 4, we shed light on the experimental results. Section 5 presents the conclusion of the paper and the limitations of the study.

Literature Review
Egele et al. [13], following a profile-based paradigm, have developed a system called COMPA to detect inconsistent behavior in compromised accounts by clustering the behavioral features that describe the genuine user's idiolect, using a Sequential Minimal Optimization (SMO) classifier. The model was validated on Twitter and Facebook datasets. COMPA shows low precision in small-scale hijacking attacks [1].
Trång et al. [1], following an instance-based paradigm, have improved COMPA to detect single hijacking attacks by replacing the clustering of accounts according to similar behavior with a classification of whether each account is compromised or not. The improved version of COMPA shows lower-than-expected results for single hijackings. The authors have suggested that COMPA include moving anomaly scores and stylometric features in future work.
Barbon et al. [19] have proposed a model to detect accounts compromised by a bot by combining the textual features that describe the user's profile. In their model, the k-NN algorithm is used for classification. Their model has been evaluated using a Twitter dataset that involved 1000 users. They have achieved a classification accuracy of over 93%, although the accuracy decreased for Twitter accounts that are not used regularly. Therefore, the authors have suggested adding nonlinguistic features to improve the accuracy.
Lagerholm [20] has proposed measuring the similarity among benign and malicious user tweets. In the model proposed by Lagerholm, the basic feature set includes n-grams, term frequency-inverse document frequency (tf-idf), and Bag of Words; a Long Short-Term Memory (LSTM) neural network is then used as the classifier. The experimental evaluation of Lagerholm's scheme involved the tweets of eight different users with cross-topics, and the approach attained an accuracy of 93.32%. Barbon et al. [19] have suggested using a dataset with related topics to make the model more applicable in real life. Another study [21] has developed a system for continuous authentication, in which a combination of deep belief networks and Gaussian units has been introduced for classification. The proposed approach has been evaluated using short messages that consist of blocks of text of 140, 280, and 500 characters based on Enron e-mails and Twitter feeds, yielding an EER ranging from 8.21% to 16.73% across different configurations.
Kaur et al. have introduced a model to quantify the dissimilarity between texts by known and unknown users. The K-means algorithm is used for classification. Their feature set has included Bag of Words (BOW), n-grams, folksonomy, and stylometric features. Their model has been evaluated using a public Twitter dataset and has scored a classification accuracy of 89.24% [22]. A similar model has been developed by Seyler et al., using a feature set that included statistical measures extracted from a public Twitter dataset, which is classified using a logistic regression classifier. The achieved accuracy for synthetic data is 85% [16].
Recently, Savyan and Bhanu [15] have proposed an unsupervised system for authorship verification named UbCadet, which analyzes the anomalous behavior of OSN users by quantifying the similarity between tweet text, hashtag, time, and geolocation. UbCadet has been evaluated using Yelp and Twitter datasets. The UbCadet system has produced an overall accuracy of 83.1% when analyzing the feature set of five users.
After reviewing the studies on authorship verification in the literature, there seems to be relatively little research on authorship verification of accounts compromised by human beings, especially on Twitter-based datasets, which has motivated the current study. The present study takes feature reduction into account to select the most influential features, noting that the high prediction accuracy of a machine learning (ML) model depends on extracting the most relevant features and applying the appropriate dimension reduction method [23]. The current paper improves the authorship verification accuracy of hijacked social media accounts compromised by a human by proposing a three-layered dimension reduction approach followed by ML algorithms to classify the messages. The main contributions of the manuscript are listed as follows:
(i) Creating and manually annotating a same-genre and same-topic Twitter-based dataset with the authorship information for evaluating the proposed system.
(ii) Verifying the authorship of a tweet by combining lexical, syntactic, and semantic features that can be used effectively on any short text messaging service.
(iii) Using the traditional ML classifiers as metalearning algorithms that serve as a preprocessor in the feature selection process.
(iv) Applying the three-layered dimension reduction approach, which uses the best-performing metalearning algorithm as a preprocessor to measure the contribution of each feature in verifying the tweet's authorship and then ranks these features using the MCDM approach (ELECTRE), where the least influential feature set is disregarded. The remaining features are assigned weights according to their ranks using the rank exponent weight method.
(v) Implementing different ML algorithms for message classification to achieve the highest performance.
The proposed model, shown in Figure 1, is detailed in Section 3.

The Model
The general flow of the proposed model is described in Figure 1, and the flowchart is illustrated in Figure 2. The authorship verification process passes through seven main subprocesses, as shown in Figure 1. The first step is to collect the users' tweets, tweets' history, and all available and related attributes. In the second step, the collected tweets are cleaned by eliminating the unused attributes and are standardized in a single format. During the third step, different textual feature vectors are extracted from the text and combined in one matrix to represent the tweets' corpus. In the fourth step, the best-performing classification algorithm among four different classifiers is selected as a preprocessor to quantify each feature's ability to verify the tweet's authorship. Then, in the fifth step, feature reduction is done by ranking the feature sets using the MCDM approach ELECTRE. In the sixth step, the remaining feature sets are assigned weights according to their ranks using the rank exponent weight method. Finally, the four different classifiers are applied to the weighted feature sets to achieve the highest performance in verifying the message authorship. In the following subsections, a detailed explanation of the model steps is given.

Collecting and Preprocessing the Data.
With over 320 million active users, Twitter has become one of the most popular microblogging OSNs [5]. Unfortunately, there is no publicly available standardized Twitter dataset for authorship verification studies [3]. So, it is crucial to introduce a Twitter dataset to facilitate authorship verification models. The way to obtain Twitter data is through its API [5]. However, Twitter allows retrieving only a limited amount of data through the API, about 3200 tweets in blocks of 280 characters, together with their related features, over many batches [24]. Therefore, a crawler has been built using the Python programming language and the Tweepy library to form the dataset. The data is collected from different users with the same topic and same genre. Of course, in real-life scenarios, the authors differ in topic and genre (e.g., documents, e-mails, tweets, etc.), but the main challenge is to focus on the author's stylometry [25,26]. Meanwhile, information from cross-topic or cross-genre texts could mislead the model [25], which makes authorship verification difficult [27].
The Twitter dataset contains 16124 instances from different users, in which the retrieved objects related to the tweet are "ID" and "text", which represent the author's ID and the message, respectively. Considering authorship verification as a one-class classification problem [28], the instances are labeled with 1 for the set of tweets from the known author and 0 for the set of tweets from the unknown author.
The collected data is preprocessed using simple regular expressions that retain the noise needed to precisely express the author's style. According to Rocha et al. [4], the dataset in authorship verification should be preprocessed carefully, since eliminating or reshaping the corpus may affect the idiosyncratic features of the author. In the preprocessing, the first phase is removing the retweets and replacing URLs with URL characters that do not affect the author's writing style [26]. The second phase is to replace punctuation marks, emojis, hashtags, percentages, and months with the metatags "!," "?," "#," "%," and "m," respectively. Moreover, the dataset is tokenized.
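The cleaning phases described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' crawler code: the regular expressions, the `URL` placeholder token, the `RT @` retweet convention, and all function names are assumptions, and only the retweet, URL, hashtag, and percentage steps are shown.

```python
import re

# Hypothetical patterns; the paper does not list its exact expressions.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?%")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def is_retweet(text):
    """Assume retweets carry the conventional 'RT @user' prefix."""
    return text.startswith("RT @")


def clean_tweet(text):
    """Replace URLs, hashtags, and percentages with short metatags."""
    text = URL_RE.sub("URL", text)   # URLs first, so '#' inside a URL is untouched
    text = HASHTAG_RE.sub("#", text)
    text = PERCENT_RE.sub("%", text)
    return text


def tokenize(text):
    """Split into word tokens and single punctuation characters."""
    return TOKEN_RE.findall(text)


def preprocess(tweets):
    """Drop retweets, then clean and tokenize the remaining tweets."""
    return [tokenize(clean_tweet(t)) for t in tweets if not is_retweet(t)]
```

Punctuation is kept as separate tokens here so that stylistic noise such as repeated exclamation marks survives the cleaning step.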

Feature Extraction.
ML algorithms are designed to learn from numerical vectors of prespecified size, not from text data containing character sequences of various sizes. Thus, the text data should be translated to numerical vectors before processing. Herein, the extracted numerical features are illustrated below.

Lexical Feature.
The lexical or linguistic attributes include all character- and word-based statistical measures extracted from the corpus [19,22], independent of the language [29]. The lexical features can be extracted at the word level or the character level [21]. In this study, the extracted character-level feature is tweet-length variation, while the word-level features are the average number of words per tweet and lexical diversity. One of the most important lexical features commonly used in the authorship verification literature is the n-gram [21]. This study used character 2-grams to consider the difference in the order of the characters between known and unknown authors, and word 2-grams to consider multiword expressions [30] and, in addition, to keep the order of word pairs [20].
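As a rough illustration of the lexical features named above, the following Python sketch computes character and word 2-grams together with simple statistics. The function names are hypothetical, and two definitions are assumptions: lexical diversity is taken as the type-token ratio, and tweet-length variation as the population standard deviation of character lengths; the study may define them differently.

```python
import statistics


def char_ngrams(text, n=2):
    """All contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n=2):
    """All contiguous word n-grams, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_diversity(tokens):
    """Type-token ratio (assumed definition): unique tokens / all tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_words_per_tweet(tweets):
    """Mean number of whitespace-separated words over a list of tweets."""
    return sum(len(t.split()) for t in tweets) / len(tweets)


def tweet_length_variation(tweets):
    """Population standard deviation of tweet lengths (assumed definition)."""
    return statistics.pstdev(len(t) for t in tweets)
```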

Syntactic Feature.
The syntactic features describe author traits independent of context [21] via the punctuation that marks the document boundaries to identify the sentences that can be tokenized [21]. Stop words and Part-of-Speech Tagger (POST) measures identify the function of the word in the context [29]. POST categories include pronouns, prepositions, conjunctions, and auxiliary verbs, which grammatically describe the relationship between words in the corpus [31]. The function words, exclamation marks, question marks, and apostrophes per tweet are the features used in this study.
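The per-tweet syntactic counts listed above could be gathered along the following lines. This is only a sketch: the short `FUNCTION_WORDS` set is a hypothetical stand-in for whatever function-word list the study used, and the feature names are illustrative.

```python
# Hypothetical, deliberately small function-word list for illustration only.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "in", "on", "to", "is", "are",
}


def syntactic_features(tweet):
    """Count punctuation marks and function words in a single tweet."""
    words = tweet.lower().split()
    # Strip surrounding punctuation so "mat?" is compared as "mat".
    stripped = [w.strip(".,!?;:'\"") for w in words]
    return {
        "exclamation_marks": tweet.count("!"),
        "question_marks": tweet.count("?"),
        "apostrophes": tweet.count("'"),
        "function_words": sum(w in FUNCTION_WORDS for w in stripped),
    }
```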

Semantic Feature.
The semantic features are used to understand the meaning of a word or sentence in the linguistic context and its relations with other linguistic units [29]. The semantic components extracted in the current study are word embedding, Bag of Words (BOW), and tf-idf.
Word embedding represents the text as numerical vectors, in which words with the same meaning have a similar representation. Thus, the words need to be vectorized and combined to form the word embedding. The vector length equals the number of features that describe the word (e.g., if there are 200 features, then the vector length is 200); the number of features is less than the total number of words, and each feature value lies in [−1, 1]; the closer the value is to 1, the better the feature represents the word. The algorithm used in the model to build the word embedding is Word2vec, a pretrained word embedding technique that was developed by Mikolov in 2013 [32].
Mikolov et al. [33] have suggested two associated models that are used to learn vector representations of words from datasets. The first is the Continuous Bag-of-Words (CBOW) model, which, unlike the traditional Bag-of-Words model, predicts a word according to its context, and the second is the continuous Skip-gram model, which uses the input word to predict the surrounding words, as shown in Figure 3.
(i) Bag of Words. BOW is the count of token occurrences in the tweet, while tokenization splits the tweet into tokens and each token represents a word [34].
(ii) Term Frequency-Inverse Document Frequency. tf-idf gives more consideration to the importance of the word than its frequency in the corpus through calculating the term frequency in the tweet according to its frequency in all the tweets [35].
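The two counting schemes in items (i) and (ii) can be made concrete with a short Python sketch. The text does not state the exact tf-idf variant, so the textbook form (relative term frequency multiplied by log(N / document frequency)) is assumed here; libraries often apply additional smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each token of one tweet to its number of occurrences."""
    return Counter(tokens)


def tf_idf(corpus):
    """tf-idf weights for a corpus given as a list of token lists.

    Assumed form: tf = count / tweet length, idf = log(N / df),
    where df is the number of tweets containing the token.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        weights.append({
            tok: (c / len(tokens)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return weights
```

With this form, a token that appears in every tweet receives a weight of zero, which is how tf-idf discounts words that are frequent across the whole corpus.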

Selecting the Metalearning ML Algorithm.
In the past decade, advances in computing power and the growth of available datasets have together led to an increase in the variety of ML algorithms. Moreover, metalearning has made algorithm selection and parameter tuning less complex [36]. According to Hospedales et al., metalearning transfers experience to the machine learning model through many phases [37]. On the other hand, traditional ML algorithms such as SVM and K-Nearest Neighbor are, according to Lemke et al., very successful in metalearning algorithm selection [36]. In order to reduce the feature size, which in turn improves the classifier accuracy, the extracted dataset has been evaluated using several metalearning algorithms: Support Vector Machine (SVM), logistic regression, random forest, and XGBoost. Different metrics can be used to measure the rate of recognition; some are listed in Table 1 (see [22,23] for details). Table 2 shows the experimental results for the metalearning algorithms. The best-performing algorithm is also used as a preprocessor in the feature selection process.
The XGBoost algorithm has many parameters that can be tuned to avoid overfitting and obtain better accuracy, such as the Booster parameters, Max Depth, and Min Child Weight. More parameter descriptions can be found in [38].
To optimize the accuracy of the XGBoost algorithm through parameter tuning, the experiments have increased the value of the Max Depth parameter from 7 to 9. Table 3 reflects the change in the XGBoost performance. As shown in Table 3, the accuracy of the XGBoost algorithm has surpassed those of the other algorithms.
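For reference, the recognition measures used throughout this comparison (accuracy, precision, recall, and F-score) can be computed from binary labels as in the sketch below. The standard confusion-matrix definitions are assumed, with label 1 for the known author and 0 for the unknown author as in the dataset description; the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-score for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }
```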

Security and Communication Networks
Kotsiantis et al. have given the experimental recommendation to use ML algorithms as an introductory stage to discover the features' importance [39]. Thus, the XGBoost algorithm can be used to predict the contribution of each feature set in recognizing a tweet's authorship, which is reflected in Table 4. The ELECTRE method is used to select the most powerful features from Table 4. ELECTRE is an MCDM approach for ranking many alternatives in Multicriteria Decision-Making problems [40]. The name ELECTRE stands for elimination and selection that reflects the truth [41]. It was developed as a philosophy to solve complex decision-making problems with many alternatives and few criteria [42]. It is based on binary superiority comparisons between alternatives according to the appropriate criteria [43]. The stepwise implementation of ELECTRE [44] is demonstrated in the following parts. Table 4 represents the decision matrix, where the textual features in the first column are the $m = 8$ alternatives to be ranked, and the measures in the first row are the $n = 5$ criteria that are used to rank the alternatives.

Calculate the Normalized Decision Matrix.
In order to make interattribute comparisons, the variation in the data scale between elements should be small; normalization techniques scale down the elements to fall between 0 and 1.
Equation (1) has been used to normalize the decision matrix in Table 4, for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$.

Figure 3: The CBOW and Skip-gram models' architecture [33].

The decision-maker assigns a weight to each criterion to express its importance relative to the other criteria; Saaty has developed a scale to obtain these weights [45], as shown in Table 5. The matrix of criterion importance relative to the other criteria, shown in Table 6, is developed using Kaur et al.'s experiment [22], with the addition of the runtime criterion, which is given a strongly important weight.
Normalizing the matrix of criteria importance gives the resultant matrix in Table 7. The criteria weights shown in Table 8 are calculated by averaging, for each criterion, the normalized importance values relative to the other criteria in the same row.
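The weight derivation described above (normalize the pairwise importance matrix, then average each row) can be sketched as follows. Dividing each entry by its column sum is an assumption, chosen because it is the usual approximation in Saaty-style pairwise comparison; the function name and the example values are hypothetical and are not taken from Tables 6 to 8.

```python
def criteria_weights(pairwise):
    """Derive criteria weights from a square pairwise importance matrix.

    Each entry is divided by its column sum (assumed normalization),
    and the weight of a criterion is the mean of its normalized row.
    """
    n = len(pairwise)
    col_sums = [sum(row[j] for row in pairwise) for j in range(n)]
    normalized = [[row[j] / col_sums[j] for j in range(n)] for row in pairwise]
    return [sum(row) / n for row in normalized]


# Hypothetical 2-criterion example: criterion 0 is 3 times as important as 1.
example = [[1.0, 3.0],
           [1.0 / 3.0, 1.0]]
weights = criteria_weights(example)
```

The resulting weights sum to 1, which matches the constraint placed on the weight vector in the next step.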

The Calculation of the Weighted Normalized Decision Matrix.
Table 8 contains $W = (w_1, w_2, \ldots, w_n)$, the weight vector for the criteria, where $w_j \geq 0$ and $\sum_{j=1}^{n} w_j = 1$. The weighted normalized decision matrix is calculated by multiplying the normalized decision matrix obtained in Step 2 by the criteria weights.
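The normalization and weighting steps can be illustrated with the short sketch below. The vector (Euclidean) normalization used here, $x_{ij} / \sqrt{\sum_i x_{ij}^2}$, is an assumption based on common ELECTRE practice and may differ from the formula the authors applied; it also presumes non-negative entries and no all-zero column. Function names are hypothetical.

```python
import math


def normalize(matrix):
    """Column-wise vector normalization (assumed form for this sketch)."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix))
             for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in matrix]


def apply_weights(normalized, weights):
    """Multiply every column j of the normalized matrix by weight w_j."""
    return [[value * w for value, w in zip(row, weights)]
            for row in normalized]
```

For non-negative inputs, every normalized entry falls between 0 and 1, in line with the purpose of the normalization step.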

Determine the Discordance and Concordance Set.
The elements of every pair of alternatives in the weighted normalized decision matrix are compared. The concordance set contains the criteria on which the first alternative of the pair is better than or equal to the second, and the discordance set contains the criteria on which it is worse.
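A minimal Python sketch of the two sets, for one ordered pair of alternatives `p` and `q`, is given below. It assumes that every criterion is oriented so that larger values are better (a cost criterion such as runtime would need to be inverted first), and the names `v`, `p`, and `q` are illustrative.

```python
def concordance_set(v, p, q):
    """Criteria indices where alternative p is at least as good as q."""
    return {j for j in range(len(v[p])) if v[p][j] >= v[q][j]}


def discordance_set(v, p, q):
    """Criteria indices where alternative p is strictly worse than q."""
    return {j for j in range(len(v[p])) if v[p][j] < v[q][j]}
```

By construction, the two sets are disjoint and together cover all criteria for a given pair.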

Calculate the Concordance and Discordance Matrix.
The concordance matrix is calculated by adding the weights of the criteria in the concordance set of each pair of alternatives. The discordance matrix is calculated by dividing the sum of the differences over the discordance set elements by the sum of the differences over all criteria elements.
Make Calculations of the Advantages.
The averages of the concordance and discordance values are calculated. A "yes" is stated for the values in the concordance matrix that are greater than or equal to the concordance average, which yields the concordance index matrix, while a "no" is stated for the values in the discordance matrix that are less than or equal to the discordance average, which yields the discordance index matrix.
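The matrix and threshold computations described above can be sketched as follows. The sketch follows the wording of the text: concordance entries sum the weights of the criteria on which one alternative is at least as good as the other, and discordance entries divide the summed absolute differences on the worse criteria by the summed absolute differences on all criteria. Averaging over off-diagonal entries only, the assumption that larger values are better, and all function names are choices made for this illustration.

```python
def concordance_matrix(v, weights):
    """c[p][q]: total weight of criteria where p is at least as good as q."""
    m = len(v)
    c = [[0.0] * m for _ in range(m)]
    for p in range(m):
        for q in range(m):
            if p != q:
                c[p][q] = sum(w for a, b, w in zip(v[p], v[q], weights)
                              if a >= b)
    return c


def discordance_matrix(v):
    """d[p][q]: share of the total difference where p is worse than q."""
    m = len(v)
    d = [[0.0] * m for _ in range(m)]
    for p in range(m):
        for q in range(m):
            if p == q:
                continue
            total = sum(abs(a - b) for a, b in zip(v[p], v[q]))
            worse = sum(abs(a - b) for a, b in zip(v[p], v[q]) if a < b)
            d[p][q] = worse / total if total else 0.0
    return d


def off_diagonal_mean(matrix):
    """Average over all entries except the diagonal (needs m >= 2)."""
    m = len(matrix)
    total = sum(matrix[p][q] for p in range(m) for q in range(m) if p != q)
    return total / (m * (m - 1))


def concordance_index(c):
    """True where a concordance entry is at or above the average."""
    avg, m = off_diagonal_mean(c), len(c)
    return [[p != q and c[p][q] >= avg for q in range(m)] for p in range(m)]


def discordance_index(d):
    """True where a discordance entry is at or below the average."""
    avg, m = off_diagonal_mean(d), len(d)
    return [[p != q and d[p][q] <= avg for q in range(m)] for p in range(m)]
```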

Calculate Net Superior and Inferior Values.
The alternatives are ranked according to their net superior and net inferior values, which are calculated from the concordance and discordance matrices, respectively. Table 9 demonstrates the textual features' superior ranking based on the ELECTRE method.
ELECTRE's ranking result has demonstrated the best ranking for the semantic feature. Furthermore, the last rank is the stop words feature, while the others are ranked based on their priority in verifying the tweet's authorship, as shown in Table 9.
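One common way to obtain net superior and net inferior values is the net-flow form, in which the value for an alternative is its row sum minus its column sum in the corresponding matrix. The sketch below uses that form as an assumption; it is not confirmed to be the exact formula used in this study. The tie-break rule, the function names, and the alternative names in the usage note are likewise hypothetical.

```python
def net_values(matrix):
    """Row sum minus column sum for each alternative (assumed net-flow form)."""
    m = len(matrix)
    return [
        sum(matrix[p][q] for q in range(m) if q != p)
        - sum(matrix[q][p] for q in range(m) if q != p)
        for p in range(m)
    ]


def rank_alternatives(names, concordance, discordance):
    """Order names by descending net superior value.

    Ties are broken by ascending net inferior value (an assumed rule).
    """
    net_superior = net_values(concordance)
    net_inferior = net_values(discordance)
    order = sorted(range(len(names)),
                   key=lambda p: (-net_superior[p], net_inferior[p]))
    return [names[p] for p in order]
```

For example, `rank_alternatives(["A", "B"], [[0, 0.8], [0.5, 0]], [[0, 0.6], [0.4, 0]])` places "A" first, because its net superior value is the larger of the two.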

Results and Discussion
While feature selection determines which features are used in the prediction, feature weighting indicates the importance of the selected features by assigning different weights according to their priority [46]. Feature weighting is used in [15,47]. Ranking the textual features using the ELECTRE approach highlighted the differences in their performance. Furthermore, in addition to eliminating the stop words feature in the feature selection step, a weight is assigned to each feature based on its rank using the rank exponent weight method [48], which is defined in terms of $r_j$, the feature rank; $n$, the number of features; and $P$, the weight of the most important criterion.
To build the dataset, 16124 tweets from different Twitter users have been crawled, and the features extracted from the cleaned tweets represent the dataset attributes. Computers with 16 GB of RAM and Python 3 have been used to perform the experiments. K-fold cross-validation is used to index the training and test sets in each fold iteration.
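The rank exponent weighting can be illustrated with the textbook form of the method, $w_j = (n - r_j + 1)^p / \sum_k (n - r_k + 1)^p$. This form is an assumption for the sketch, and the exponent `p` is treated here as a free parameter, which may not match how $P$ is obtained in this study. The function name is hypothetical.

```python
def rank_exponent_weights(ranks, p=1.0):
    """Weights from ranks (1 = best) using the textbook rank exponent form.

    p = 0 gives equal weights; larger p concentrates weight on top ranks.
    """
    n = len(ranks)
    scores = [(n - r + 1) ** p for r in ranks]
    total = sum(scores)
    return [s / total for s in scores]
```

With three features ranked 1, 2, and 3 and `p = 1`, the scores are 3, 2, and 1, so the best-ranked feature receives half of the total weight.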
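The way K-fold cross-validation indexes the training and test sets can be sketched as below. The number of folds is not stated in the text, so `k` is left as a parameter; contiguous folds without shuffling and the function name are assumptions made for this illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

Each sample appears in exactly one test fold, so every tweet is used for testing once and for training in the remaining iterations.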
Table 10 shows the authorship verification results obtained by training the classifiers on the combined weighted ranked feature values. It illustrates the performance of the logistic regression, SVM, random forest, and XGBoost algorithms in verifying the tweets' authorship based on the measures runtime, accuracy, recall, precision, and F-score. The comparative analysis of the classifiers, after feature selection and weighting using the ELECTRE approach and the rank exponent weight method, highlights that the logistic regression algorithm outperforms the other ML classifiers. Hence, logistic regression can be relied upon in the classification step of our model as a constructive approach to verify the tweet's authorship. Figure 4 illustrates the performance of the logistic regression classifier before applying the suggested model, after the feature selection process using the ELECTRE approach, and after feature selection and weighting. It is evident from the figure that applying the suggested model yields better performance than logistic regression without feature selection and weighting and logistic regression with feature selection only. The higher values of accuracy, recall, precision, and F-score help to justify the superiority of the suggested model, which improves the accuracy from 90.29% to 91.1%, meaning that more tweets are recognized correctly. Figure 5 illustrates that applying feature selection using the ELECTRE approach and the rank exponent weight method lessens the training time of the logistic regression classifier for verifying the tweet's authorship. Figure 6 compares the performance in verifying the tweet's authorship of the suggested model versus the UbCadet model in [15].
It is observed that, for the measures accuracy, precision, and F-score, the suggested model slightly enhances the performance of the logistic regression compared with before applying the suggested model, as illustrated in Figure 4.
While the ELECTRE approach yields a high performance rate for certain ML classifiers, a limitation of our study is that the weights that determine the criteria importance rely on subjective inputs by the decision-maker [43]. The incentivized collusion networks mentioned in [17] are also considered a limitation, as they impede the authorship verification process: users get paid to publish promotional messages on their accounts, which modifies the textual features that indicate the personality of the author.

Conclusion
The present study proposes a hybrid model for authorship verification in OSNs. Due to the lack of a publicly available standardized Twitter dataset for authorship verification, a same-genre and same-topic Twitter-based dataset is crawled and manually annotated with the authorship information. Successive preprocessing steps were performed to prepare the dataset for feature extraction. Then, a three-layered dimension reduction approach was applied. At the outset, the XGBoost algorithm was selected as a preprocessor to calculate the textual features' performance in verifying the tweet's authorship on each criterion. Next, the MCDM approach ELECTRE is used to solve the feature selection problem. At the methodological level, it is the first application of ELECTRE to this domain, which is expected to be useful for similar problems. Most criteria and their relative weights are obtained based on Kaur et al.'s experiment [22]. ELECTRE uses pairwise comparisons to rank eight textual features extracted from tweets. Further, the rank exponent weight method has been used for weighting the selected features. The reduced dataset is evaluated with four ML classifiers, two of which show enhanced performance compared with traditional ML in terms of runtime, accuracy, recall, precision, and F-score. The experimental analysis of the proposed model with the logistic regression classifier shows a high F-score, reaching 94.4% for block sizes of 280 characters in verifying the tweet's authorship. As future work, the model implementation can be extended to other classification problems with a high number of features.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure
The author Suleyman Alterkavı (Sleman AlTerkawi) has dual citizenship, so his name is written in two different ways.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 6: Comparison of the suggested model performance with the UbCadet model [15].