Comparative Experiment on TTP Classification with Class Imbalance Using Oversampling from CTI Dataset

Cyber threat intelligence (CTI) is used to identify the situation and attack mechanism of a security threat, and the MITRE adversarial tactics, techniques, and common knowledge (ATT&CK) framework is used as the de facto standard security threat modeling technique. However, analyzing a large amount of data using the tactics, techniques, and procedures (TTP) of ATT&CK with a limited number of security personnel is time-consuming. To solve this cost-sensitive issue, research on automated classification of TTP from CTI data using artificial intelligence techniques is currently underway but remains challenging. This is because CTI data are domain-specific, and therefore, it is difficult to obtain labeled data to be used as training data for AI models. Hence, the distribution of training data related to TTP labeling is imbalanced, and the current accuracy of ML-based TTP classification is still around 60–80%. This study aims to improve the TTP classification accuracy from unstructured CTI data using machine learning while mainly focusing on solving the problems of small training datasets and TTP class imbalance. Therefore, we propose a TTP classification method that applies easy data augmentation (EDA) and compare its performance with those of previous studies. By applying the proposed methodology, a 60–80% improvement was observed compared to the reference baseline model, TRAM. This indicates that preprocessing with the EDA technique is effective at improving the performance of TTP classification from unstructured CTI data.


Introduction
The security operations center (SOC) collects security threat data to protect an organization's ICT infrastructure from internal and external cyber threats while monitoring and responding to security breaches. However, with the gradual expansion and ever-increasing number of cyberattacks, it is becoming more challenging for the SOC to promptly handle security solution events and respond to security breaches.
This is because analyzing a large amount of data and providing a sophisticated response take a long time, and there is a dearth of skilled security personnel and resources. Security orchestration, automation, and response (SOAR) technology [1], a new paradigm of security control technology, addresses these issues by automating various security threat response processes, which reduces the repetitive tasks of security personnel and helps them respond quickly and accurately to various security events. The core of SOAR is the integration of security orchestration and automation (SOA), security incident response platform (SIRP), and threat intelligence platform (TIP) features [2]. SOA interlocks and automates different workflows between numerous security solutions. The SIRP automates the process of responding to a security incident according to the response policy for each type of security incident. The TIP enables real-time collection and correlation analysis of internal and external threat data. Cyber threat intelligence (CTI) analysis is becoming crucial in quickly and effectively responding to advanced cyberattacks.
CTI data comprise various information related to cyber threats, including information on attackers, attack procedures, and attack methods. They consist of threat data analyzed by security experts, data collected from various threat sensors (such as threat data and detection data), and other related data [3]. Artificial intelligence (AI) models trained using such data are increasingly used in the detection of new threats. In the recent past, the MITRE adversarial tactics, techniques, and common knowledge (ATT&CK) framework has often been used when analyzing cybersecurity threats and establishing a response strategy [4].
This is because the ATT&CK framework, developed by MITRE, is an open-source project that is easily interoperable with other security-threat-related frameworks, such as CVE, CVSS, CAPEC, and CPE, and can be updated regularly whenever new attack techniques and patterns are discovered.
However, using CTI data in conjunction with the tactics, techniques, and procedures (TTP) of MITRE ATT&CK is difficult.
This is because extracting TTP information from CTI data is cost-sensitive and time-consuming: CTI reports, such as advanced persistent threat (APT) reports, are unstructured threat data provided in sentence form. Manually converting these explanatory TTP sentences into the TTP naming or ID format of the ATT&CK structure is time-consuming and requires strong expertise [5]. To address these problems, there have been several efforts since 2018 to identify (extract) TTP information from CTI reports or to automatically classify the tactics and techniques in TTP.
However, several issues must be addressed to increase the performance of automatic TTP extraction or classification from CTI reports using AI models [6]. The first issue is insufficient training data. Labeled training data that pair CTI text with its TTP labels, which machine learning models require, are not sufficiently available. The second issue is generalization error due to misdetection. As attackers constantly vary their attacks and use more advanced attack techniques, the continuous updating of TTP classification for CTI reports with new attack techniques may result in significant generalization errors and inaccurate results. The purpose of this study is to improve TTP classification performance with insufficient training data by comparing and testing various data sampling methods. The contributions of this study are as follows: (i) To address the issues of insufficient dataset size and class imbalance in the field of CTI, two oversampling techniques, namely, the synthetic minority oversampling technique (SMOTE) and easy data augmentation (EDA), were utilized, and changes in the TTP classification performance for sentence units of CTI reports were measured. (ii) The experimental results showed better precision, recall, and F1 scores at the sentence level than previous works, which served as reference models. An experiment with three datasets was conducted to show the generalization performance. The structure of this study is as follows. Section 2 describes previous studies related to MITRE ATT&CK modeling and machine learning (ML)-based TTP classification. Section 3 defines the problem and describes the proposed methodology of this study. Section 4 describes the experimental design and evaluation metrics. Section 5 discusses the results and the comparative analysis with previous studies for verification of the proposed method. Finally, Section 6 describes the conclusions, implications, and future research directions.

MITRE TTP Modeling.
In CTI analysis, security threat modeling is a key step for developing and evaluating defense systems against targeted attacks, such as APT attacks and spear phishing. Security threat modeling, which covers the various cyber kill chains, tactics, techniques, and procedures used by attackers to carry out attacks, has long been studied, and well-known examples include STRIDE, Cyber Kill Chain, and MITRE ATT&CK modeling. Table 1 shows the characteristics of the three modeling approaches. STRIDE [7], developed by Praerit Garg and Loren Kohnfelder of Microsoft in 1999, was the first model to identify computer security threats, and it is the model with the highest level of abstraction. It models six representative security threats that infringe on the three major elements of information protection, namely, confidentiality, integrity, and availability.
Cyber Kill Chain [8] was announced by Lockheed Martin in 2009 and is a strategic model for blocking APT attacks infiltrating a company in seven stages, namely, reconnaissance, weaponization, delivery, exploitation, installation, command and control, and exfiltration. The cyber kill chain model defines more specific attack steps than STRIDE, and defenders can utilize it when establishing a step-by-step defense strategy against APT attacks. The MITRE ATT&CK framework [4,9] is a modeling technique developed by MITRE in 2018. As shown in Figure 1, ATT&CK consists of tactics, techniques, and procedures related to attack techniques used to analyze the lifecycle of cyber attackers and achieve attack goals in the pre- and post-attack exploit operational stages. Currently, the enterprise ATT&CK matrix has 14 tactics and around 200 techniques (about 578 techniques in total when subtechniques are included).

Related Work.
The analysis of advanced attack technologies is becoming crucial for responding quickly and effectively to intelligent cyber threats. To effectively analyze cyberattacks, the information used in cyberattacks (e.g., malicious code, IP, domain, and vulnerability), the similarity between resources, attack techniques, attack targets, and activity times should be analyzed.
To identify TTP from CTI data using ML techniques, the type of CTI data used as input data is important. CTI data can be categorized as structured data and unstructured data.
Structured CTI data can express and contain TTP information in standardized formats, such as STIX, Database, and JSON, making it easier to identify TTP data from structured data than from unstructured data. However, the TTP data must be entered in advance in the specification field. Unstructured CTI data can have various forms, including reports and web pages, and when new threats arise, they are often shared in the form of reports. Therefore, studies on the use of AI and natural language processing (NLP) techniques for automated TTP identification or classification from unstructured data began in 2017.
MITRE released TRAM [6], an open-source tool that can automatically identify and classify TTP from CTI reports using machine learning. Its greatest contribution is the disclosure of proof-of-concept code and data that can automatically classify tactics, techniques, and procedures with machine learning and NLP techniques. TRAM built its own dataset; the model takes sentences from CTI reports as input and performs multiclass classification at the techniques level of TTP. The classification performance ranged between 50% and 60%.
Husari et al. proposed TTPDrill [10] and ActionMiner [11]. TTPDrill aimed to collect CTI reports from websites to identify ATT&CK techniques and CAPEC attack patterns at the document level. This approach extracts and weighs threat action-related candidate information, namely, subject, verb, and object, from each CTI report through part-of-speech tagging, then maps them to 187 techniques and 19 tactics and converts the result into a STIX structure. In addition, the ActionMiner model was published as a follow-up study.
The purpose of this model was to find the same threat information in CTI reports by extracting object-verb pairs related to malicious software using entropy and mutual information from Wikipedia.
Legoy et al. [12] proposed the rcATT model, which is an ML model used for automatically identifying TTP from sentence units in unstructured CTI reports. This approach uses term frequency-inverse document frequency (TF-IDF) and Word2Vec as word embedding techniques, and the decision tree, support vector classifier, and AdaBoost models as classifiers. The multiclass classification performance was measured as 79.3% at the tactics level and 72.22% at the techniques level.
Ayoade et al. [13] proposed a TTP classification model using TF-IDF and support vector machines. The proposed model uses the Symantec dataset as the training dataset and APT reports as the test dataset to extract attacker actions from various CTI reports. In addition, classification performance experiments were conducted by applying various bias correction methods. The classification performance obtained was 63% at the tactics level and 96.3% at the kill chain level.
Nakanishi et al. [14] proposed the SECCMiner model. This model is not an ML-based TTP automatic classification model; its purpose is to identify TTP-related keywords included in CTI reports using the TF-IDF NLP technique.
Kim et al. [15] proposed a technology to collect indicators of compromise (IoC) from CTI reports using NLP techniques. The IoC data and attack techniques (TTP) used for cyberattacks were extracted using the SyntaxNet technique from Google. Evaluation of 190 reports based on the F1 score showed an average performance of 76%.
You et al. [5] proposed a TTP intelligence mining model that extracts and classifies TTP information from unstructured CTI reports. This model, which focuses on embedding techniques for the text in unstructured CTI reports, used Sentence-BERT embeddings in the feature extraction step and a two-dimensional convolutional neural network and a bidirectional long short-term memory network as classifiers, and obtained a high F1 score of 0.97. Experiments were conducted to classify six attack classes based on a dataset of 6,061 TTP-related sentences.
In summary, previous studies mainly followed two approaches: studies that aimed to find TTP and IoC data in CTI reports using various NLP techniques, and studies that classified TTP in the MITRE ATT&CK framework using ML techniques. However, the performance of identifying threat information or classifying it as TTP was only 70% to 80%. In addition, previous studies have suggested that automating CTI analysis requires securing quality training data and minimizing the generalization error between training data and actual data. This indicates that research on technology to automatically identify or classify cyber threat information using AI techniques is still in the early stages and that there are many open issues to solve.

Problem Definition.
The biggest issues facing automated TTP classification in CTI are related to the quality of training data, such as small dataset size and class imbalance. The performance of ML models depends on the quality of the data used to train them. If the training data are not balanced across different classes, the performance of the ML model is significantly degraded. Most learning models are trained under the assumption that the training classes have similar proportions. In practice, however, predictive accuracy increases for classes with many samples and decreases for classes with few samples, which lowers the overall performance.
CTI data are in the form of reports written as unstructured text sentences. The input features X for TTP classification are the sentences or documents that make up the CTI report itself, and the output Y is a tactic or technique of the TTP. However, the biggest problem with CTI is that training samples with TTP labels are extremely rare. This is because the CTI data and TTP information are domain-specific, and therefore, little labeled training data is available. Since TTP information is the result of security experts analyzing cyber threat information, such as log information of security equipment or attackers' techniques, it takes several months to analyze TTP. Therefore, it takes a long time for training data with TTP labels to be released, so it is difficult to collect training data. Moreover, it is difficult to obtain labeled datasets because ML-based TTP classification studies are in the early stages. In addition, because TTP consists of 12 tactics and 200 techniques, the data are inherently unbalanced, which inevitably results in significantly fewer data instances for each class. Currently, the only training dataset used to automatically classify TTP in unstructured CTI data is the training dataset provided by TRAM of MITRE.

Class Imbalance Issues.
In supervised learning, the problem of class imbalance can arise when there is an unbalanced distribution of classes in the training dataset [16]. While imbalances in class distributions can vary, severe imbalances are more difficult to model and may require specialized skills. The general solutions for addressing unbalanced data sampling currently include oversampling and undersampling [17].
Undersampling is the process of reducing the sample size of the majority class, which has a higher proportion, to balance the amount of sample data belonging to each class. However, performance degradation may occur due to the loss of useful data because data belonging to the majority class is omitted. Oversampling is a technique that supplements the training data with multiple copies of some minority classes to increase the sample size of the minority class. Existing oversampling techniques include random oversampling, SMOTE, and data augmentation. SMOTE [18] is an oversampling technique proposed by Chawla et al. in 2002. This technique involves the use of the k-nearest neighbor algorithm to find close neighbors of data instances belonging to the minority class and to generate virtual data through interpolation, such that the virtual data corresponds to the minority class and is not identical to the original data. In other words, a sample of a class with a small number of instances is taken and a random value is added to create a new instance, which is then added to the data [6].
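The interpolation step of SMOTE can be illustrated with a minimal Python sketch. It is not the implementation used in this study; the function name smote_sample, the parameters k and n_new, and the toy two-dimensional points are assumptions chosen for illustration.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with four two-dimensional feature vectors (hypothetical data).
minority = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_sample(minority)
```

Each synthetic point lies on the line segment between two existing minority samples. A class with a single sample has no neighbour to interpolate with, so this sketch would fail for such a class.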
EDA [19], published at the EMNLP 2019 conference, is a technique for increasing the amount of data by transforming the currently available data and is used when the amount of training data is insufficient or when a class imbalance occurs. In text classification tasks, the EDA technique improves the performance of classification models with only a small amount of data without requiring additional external data or generation models. The methods used in EDA include synonym replacement, which replaces a specific word with a synonym; random deletion, which deletes a random word; random insertion, which inserts a synonym of a randomly chosen word at a random position in the sentence; and random swap, which exchanges the positions of any two words in a sentence [20,21]. The training data can be increased in various forms using the EDA technique, which improves the performance of AI models. The above four methods can be used to produce 4 + α augmented sentences for one sentence. The EDA study also showed that the sentences made in this way preserve the label of the original sentence, that is, the original meaning. Moreover, generating sentences introduces a moderate amount of noise, which can suppress the overfitting that may occur when data are scarce.
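The four EDA operations can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the SYNONYMS table is a hypothetical stand-in for the WordNet lookup used by EDA [19], random insertion here inserts a synonym of a randomly chosen word as in the EDA paper, and all function names and the example sentence are illustrative.

```python
import random

# Hypothetical toy synonym table; EDA itself looks synonyms up in WordNet.
SYNONYMS = {"sends": ["transmits"], "stolen": ["exfiltrated"], "attacker": ["adversary"]}

def synonym_replacement(words, rng):
    """Replace one word that has a known synonym."""
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return words[:]
    i = rng.choice(candidates)
    return words[:i] + [rng.choice(SYNONYMS[words[i]])] + words[i + 1:]

def random_insertion(words, rng):
    """Insert a synonym of a randomly chosen word at a random position."""
    candidates = [w for w in words if w in SYNONYMS]
    if not candidates:
        return words[:]
    synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
    pos = rng.randint(0, len(words))
    return words[:pos] + [synonym] + words[pos:]

def random_swap(words, rng):
    """Exchange the positions of two randomly chosen words."""
    if len(words) < 2:
        return words[:]
    i, j = rng.sample(range(len(words)), 2)
    out = words[:]
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, rng, p=0.2):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def eda(sentence, seed=0):
    """Return one augmented sentence per EDA operation."""
    rng = random.Random(seed)
    words = sentence.split()
    ops = (synonym_replacement, random_insertion, random_swap, random_deletion)
    return [" ".join(op(words, rng)) for op in ops]

augmented = eda("the malware sends stolen credentials to the attacker")
```

Each augmented sentence would keep the TTP label of the original sentence, which is how the minority classes gain additional training samples.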
Back translation [22] was first introduced at ACL 2016. That study attempted to improve the performance of machine translation using monolingual data, and back translation was one of the methods it found to be effective. The machine translation model has an encoder-decoder structure. The source sentence is input to the encoder, the target sentence is input to the decoder, and the translation model is then trained. The authors of that study proposed a methodology for creating artificial, imperfect source sentences from target sentences. In other words, the original sentence is translated into another language and then retranslated to create a noisy source sentence. Based on this concept, back translation techniques have been used in several studies to increase the amount of training data and improve the performance of text classification models.
In this study, the oversampling technique was used to solve the class imbalance problem of limited training data for automatic TTP classification. Minority classes with few samples degrade the performance of the classification model because a model trained on unbalanced data is biased toward the majority classes. However, if the minority-class data are removed and only the majority-class data are used, it may be difficult to properly classify rare TTP techniques. The oversampling technique was therefore used as a methodology that can make full use of limited training data. In addition, the performance improvement and effectiveness of the classification model using the oversampling technique were verified.

Datasets.
We prepared three training datasets to assess the generalization of the experiment. Table 2 summarizes the features of the three datasets. The first dataset was the training dataset provided by TRAM. This training data comprised 1,410 CTI-related sentences and 100 classes corresponding to the techniques level of TTP. The second dataset was the data we prepared. Because the amount of training data provided by TRAM was small and covered only 100 techniques, we collected information provided by MITRE ATT&CK to organize a total of 4,250 sentences related to 578 techniques (including subtechniques). However, the number of instances per class was limited to 24. The last dataset was a combined dataset, comprising the TRAM dataset and our dataset, which consisted of 5,660 sentences related to 578 techniques. Figure 2 shows the distribution plot of the three datasets.
As shown in Figure 2, the data distribution of the combined dataset was reinforced compared to the distribution of samples per TTP class provided by TRAM. Figure 3 shows the experimental procedure in two steps. The first step is a preprocessing stage that involves data preprocessing, sentence representation, and oversampling. Sentence representation was performed using CountVectorizer, a bag-of-words technique that expresses text as a numerical feature vector.
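The bag-of-words representation can be illustrated with a plain-Python sketch that approximates what a count vectorizer does. The tokenization rule (lowercasing and keeping tokens of two or more word characters), the function names, and the two example sentences are assumptions for illustration, not the exact configuration of this study.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", sentence.lower())

def fit_vocabulary(sentences):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = {tok for s in sentences for tok in tokenize(s)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

def transform(sentences, vocab):
    """Turn each sentence into a vector of token counts over the vocabulary."""
    rows = []
    for s in sentences:
        counts = Counter(tokenize(s))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Hypothetical CTI-style sentences.
sentences = [
    "The malware creates a scheduled task",
    "The task runs the malware at logon",
]
vocab = fit_vocabulary(sentences)
vectors = transform(sentences, vocab)
# vocab columns: at, creates, logon, malware, runs, scheduled, task, the
# vectors[0] == [0, 1, 0, 1, 0, 1, 1, 1]
# vectors[1] == [1, 0, 1, 1, 1, 0, 1, 2]
```

Word order is discarded in this representation; only the count of each vocabulary token per sentence is passed to the classifier.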

Experimental Procedure.
Then, oversampling techniques such as SMOTE and EDA were applied to the training set, and the dataset was split into training and testing sets in an 8:2 ratio. In the second step, classification and model evaluation were performed by training the classification model using the training set. Then, the performance of the classification model was evaluated on the testing set.
This experiment was conducted in two ways. The first experiment measured the classification performance of the baseline model using the three datasets to provide a baseline for a comparative analysis. The second experiment measured the classification performance of the ML models with oversampling techniques. In this experiment, the SMOTE and EDA oversampling techniques were used.
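The split and oversampling steps could be arranged as in the following minimal sketch. It assumes that the 8:2 split is performed first and that only the training portion is oversampled; this ordering, the random duplication used as a stand-in for SMOTE or EDA, the function names, and the toy labels are illustrative assumptions rather than the exact pipeline of this study.

```python
import random
from collections import Counter, defaultdict

def split_80_20(samples, seed=0):
    """Shuffle the samples and split them into 80% training and 20% testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def oversample_to_max(train, seed=0):
    """Duplicate random minority samples until every class in the training
    set has as many samples as the largest class (a stand-in for SMOTE/EDA)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in train:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced

# Hypothetical imbalanced data: eight sentences of one technique, two of another.
samples = [(f"sentence {i}", "T1059") for i in range(8)]
samples += [(f"sentence {i}", "T1053") for i in range(8, 10)]

train, test = split_80_20(samples)
balanced_train = oversample_to_max(train)
class_counts = Counter(label for _, label in balanced_train)
```

Oversampling after the split keeps augmented copies of a sentence out of the testing set, so the evaluation reflects unseen sentences.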

Evaluation Methods.
This section explains the evaluation method for the proposed model using our dataset. This experiment used accuracy, precision, recall, and the F1 score derived from the confusion matrix as performance indicators of the classification model. Since this experiment is a multiclass classification problem with imbalanced classes, we focused on the F1 score and its micro- and macro-averaged variants as performance metrics. In classification problems, the precision, recall, and F1 scores change depending on the number of instances in the target class. Therefore, we used the microaverage and macroaverage metrics, which are methods for averaging the performance over the target classes and evaluating the performance of classification models with imbalanced classes. The microaverage takes the number of instances in each class into account when calculating the average and is a metric that responds sensitively to class imbalance. The macroaverage takes the average regardless of the number of instances in each class and is an indicator of the overall performance of the model.
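The difference between the two averages can be seen in a small worked example. The sketch below is illustrative only; the function names and the toy labels (eight samples of class A and two of class B) are assumptions.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 score from true positive, false positive, and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one of the two B samples is misclassified
micro, macro = micro_macro_f1(y_true, y_pred)
# Class A: F1 = 16/17; class B: F1 = 2/3.
# micro = 0.9, macro = (16/17 + 2/3) / 2, roughly 0.804
```

The single error on the minority class lowers the macro-F1 far more than the micro-F1, which is why both are reported for imbalanced data.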

Experiment Results.
This section describes the experimental results of the baseline models and the results obtained by applying the oversampling techniques SMOTE and EDA.

Results with Baseline Model.
This baseline experiment provides a reference for evaluating the effectiveness of the oversampling techniques in our approach. The results of the first experiment are shown in Table 3. Using the training data provided by the TRAM model, which was used as the baseline model, the classification accuracy at the techniques level was between 32.55% and 57%. The micro-F1 score was between 26.08% and 57.28%, and the macro-F1 score was between 20.65% and 34.32%. The reason for the poor performance is that the total number of data samples is insufficient, making the dataset unsuitable for training.
[...] proposed dataset was between 22% and 26.43%, and the combined dataset was between 21.92% and 28.30%. The performance of the experiment with SMOTE showed little improvement compared to the reference model. Oversampling with SMOTE depends on the number of nearest neighbors k, so each class needs a sufficient number of samples. Since the dataset used in this experiment contains classes with a small number of samples, it seems that the experiment was conducted with k = 1, resulting in low performance.

Results with EDA.
Data augmentation techniques for oversampling text can be divided into modification methods and generation methods. One of the modification methods is easy data augmentation (EDA), which augments text without external data using four text editing techniques; generation methods include back translation and conditional pretraining. In this study, we performed the experiment using EDA and back translation. Table 5 shows the classification performance with the EDA-BT (back translation) technique. With the EDA technique applied, the classification performance was between 88.01% and 96.36% for the TRAM dataset, between 90.80% and 91.05% for the proposed dataset, and between 90.44% and 91.11% for the combined dataset, a clear improvement over the baseline model.

Discussion: Comparative Analysis.
This section compares the experimental results described in the previous section and analyzes them against the results of previous studies.

Experiments Comparison.
This comparison contrasts the results before and after applying oversampling with SMOTE and EDA. Here, accuracy and F1 score were used as predictive performance metrics, and ROC-AUC metrics were used to evaluate the effectiveness of the model. As shown in Table 6, the classification results using EDA oversampling are 90% to 95% on average, which indicates good classification performance.
These results show that, compared to the baseline model, the EDA technique significantly improved the average accuracy and micro/macro-F1 scores regardless of dataset type. Figure 4 compares the accuracy, micro-F1 scores, and macro-F1 scores for multiclass classification of the baseline model, SMOTE, and EDA. Here, the X-axis is divided into the three datasets and the Y-axis is the result of each performance metric. Figure 4(a) shows the classification performance of the baseline model, Figure 4(b) shows the classification performance with SMOTE oversampling, and Figure 4(c) shows the classification performance with EDA oversampling. As shown in the figure, the results of EDA improved by about 40% compared to the baseline model before applying the oversampling technique. Figure 5 shows the ROC-AUC results of the experiment with oversampling. ROC-AUC is an indicator of the discriminant power of classification models. The AUC values of the SMOTE and baseline models range from 0.62 to 0.78, indicating that the discriminant performance of these models is average. In the case of EDA, the AUC value ranges from 0.95 to 0.98, which is the best discrimination performance among the models. Table 7 presents a comparative analysis between the current work and existing studies. As mentioned in the discussion of previous studies, the comparative analysis targeted ML-based TTP automatic classification models. The purpose of previous studies is similar to that of the current study: extracting TTP keywords from CTI reports, classifying tactics and techniques, or extracting attack behaviors. However, each study applied different techniques.
Like the current approach, previous studies commonly focus on data preprocessing techniques, but they differ in the detailed data preprocessing methods and models applied, such as TF-IDF and embedding techniques. There are also differences in input/output relationships. In five studies, including Husari et al. [10,11], the input data were at the document level of the CTI report. These studies aim to define the full content of the CTI report as a TTP or attack model. In three studies, including the current study, the input data were sentence-unit texts in the CTI report. These studies aim to define one sentence as a TTP or attack model. Evaluation metrics also differ from the proposed approach. Husari et al. [10,11] and Nakanishi et al. [14] focused on implementation over evaluation metrics. In the current work, we selected F1 scores as evaluation metrics because we were addressing the class imbalance problem, whereas rcATT [12] and Ayoade et al. [13] reported accuracy.

Comparison of Previous Studies.
Direct comparative analysis with previous studies is difficult because the datasets, models, input/output relationships, and purposes used in each study are different. However, the accuracy of TTP classification on text-style CTI reports is not high, at 20-50% on average, and previous studies addressing this problem report improved performance of 60-90%. Against this background, the results of this study show a good improvement compared to previous studies.

Conclusions
In this study, we present an automated classification of TTP from CTI data. As the occurrence of cybersecurity threats increases rapidly, it is necessary to quickly identify and respond to attacks. CTI information is mainly used to understand these threat situations and attack mechanisms. It is important to map a large amount of CTI information to a standardized attack model. However, analyzing a large amount of CTI data with a limited number of security personnel is time-consuming. Therefore, in this study, we present an automated method for classifying TTP from unstructured CTI data using machine learning. It is expected that this will enable faster identification of and response to security threats.
In this study, we also focus on improving TTP classification accuracy while solving the problem of small training datasets and TTP class imbalances. Imbalanced data is one of the most important problems to be solved in machine learning. We present the comparative experimental results of TTP identification and classification performance obtained by using data augmentation techniques during data preprocessing to address insufficient training data in CTI domains. As a result, when the training data augmentation technique was applied to the TRAM model, which is the reference baseline model, a performance improvement of about 60%-80% in the F1 score was achieved.
However, a limitation of this work is that it is highly prone to generalization errors. In particular, due to the nature of the cybersecurity domain, the accuracy of ML models is bound to vary depending on the content and amount of unknown new security threats or CTI reports, as attackers continue to find new attack techniques to bypass existing defense models. Therefore, after solving this generalization error and classifying TTP from known information through rule-based matching, we believe that the proposed model can be applied to unmatched CTI information through machine learning. Future studies need to consider various improvements, such as quality training data generation, word embedding methods, model selection, and optimization, to improve automated TTP classification performance.
Data Availability
The experimental data supporting the current study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.