Towards Supercomputing Categorizing the Maliciousness upon Cybersecurity Blacklists with Concept Drift

In this article, we carry out a case study to optimize the classification of the maliciousness of cybersecurity events by IP address using machine learning techniques. The optimization focuses on time complexity. First, we use the extreme gradient boosting model; second, we parallelize the machine learning algorithm to study the effect of using different numbers of cores. We classify the maliciousness of cybersecurity events in a biclass and a multiclass scenario. All the experiments are carried out with a well-known optimal set of features: the geolocation information of the IP address. However, the geolocation features of an IP address can change over time. Also, the relation between the IP address and its maliciousness label can change if the address is tested several times. The models' performance could then degrade because the information acquired from training on past samples may not generalize well to new samples. This situation is known as concept drift. For this reason, it is necessary to study whether the proposed optimization works in a concept drift scenario. The results show that concept drift does not degrade the models. Also, boosting algorithms achieve competitive or better performance compared with similar research works in the biclass scenario and an effective categorization in the multiclass case. The most efficient setting, in terms of high-performance computing resources, is reached using five nodes.


Introduction
Data science has become essential for companies and organizations to extract actionable knowledge. This can be a competitive edge whose value is directly related to the quality of the datasets used and the efficiency of the models and their implementations. The case of computer security incident response teams (CSIRTs) and the management of cybersecurity databases is one of the best-known examples of this scenario. A cybersecurity database is a database with reports of cybersecurity events. A cybersecurity report contains data about a cybersecurity incident that is considered malicious. The information included is, for example, its geolocation, time stamp, type of event, and confidentiality. Thus, a cybersecurity database contains a lot of unstructured and correlated information. Several sources of information provide streams of data, and they are enriched by human agents, usually by querying external blacklist or malware platforms. The information is updated daily, weekly, or online depending on the source and the type of event that is reported. Data flow is constant and dynamic, generating large volumes of data. In this scenario, knowing the severity of a cybersecurity event of potentially malicious activity is essential to determine an appropriate response.
In this article, we present a case study in which we optimize the application of supervised machine learning (ML) models to classify cybersecurity data streams of IP addresses in terms of the level of maliciousness of the associated cybersecurity incident. In particular, we have applied the extreme gradient boosting algorithm and used geolocation features [1]. The study has been carried out on 99720 IP addresses provided by the Spanish National Cybersecurity Institute (INCIBE). We have conducted the experiments in two scenarios: biclass and multiclass. Also, the data distribution can change over time, yielding a concept drift scenario and increasing the possible error associated with the models [2][3][4]. Detecting concept drift, or its absence, is therefore crucial to evaluate the suitability of the models and its possible effect on the classification of the maliciousness [5]. For this reason, we have extracted the data for the experiments at two different time points, and we have analyzed the degree of concept drift and its possible effect on the accuracy of the results.
Since the usual cybersecurity databases are huge, we have created the ML models using a different number of cores to optimize the procedure's time complexity. Then, we highlight the necessary high-performance computing (HPC) resources, considering the validity of incoming data and resulting measures.
The concrete results of our experiments are threefold: we show that there is no significant concept drift among the proposed databases; we evaluate the degradation of geolocation features; and, finally, we assess the suitability of HPC for creating ML models on those cybersecurity databases. Regarding the latter, although HPC does not improve the accuracy, sensitivity, or specificity of the ML algorithms, the optimum number of cores [6] is reached at 5, where execution time is reduced by about 50%.
The article is organized as follows: In Section 2, we review the related work. In Section 3, the experimental details are explained. The results are presented and discussed in Section 4. Finally, the conclusions and the references are given.

Related Work
A cybersecurity event is a cybersecurity change that may have an impact on organizational operations (including capabilities or reputation) (https://csrc.nist.gov/glossary/term/cybersecurity_event). The severity of a cybersecurity event is a measure that determines its risk or maliciousness. Assessing this characteristic is crucial to ensure that the countermeasures taken are appropriate. For this reason, an increasing body of literature tries to solve this task from several approaches. The perspective depends mainly on the type of cybersecurity event we deal with and the resources we have available. There are tools based on different methodologies and standards (Microsoft Security Bulletin Vulnerability Rating [7], Common Vulnerability Scoring System (CVSS) [8], Open Web Application Security Project (OWASP) Risk Rating Methodology [9], and Cyber Incident Scoring System [10], among others) and other approaches based on data science and ML models [11]. In all cases, we need a tool that not only determines the maliciousness of a cybersecurity event as closely as possible but also attaches importance to identifying false negatives, since these cases may become difficult situations for citizens, institutions, and companies. This work focuses on the maliciousness assigned to an IP address. We therefore use records of the IPs involved in the different cybersecurity events, understood as any occurrence of an adverse nature in a public or private sphere within a country's information and communication networks. In particular, an IP address's severity is considered a measure of its reputation. We deal with cybersecurity databases in which all IP addresses are associated with threats. It is not a question of determining whether an address is malicious: we know that all the records are "threats." The point is to assess the level of maliciousness to provide an adequate response.
Measuring the maliciousness, or reputation, of an IP address has been studied from several perspectives. The first approach is using blacklists to create alerts. These works apply techniques such as time series forecasting, clustering, or ML models based on data in the blacklists, reaching maximum accuracy rates of 0.776 when predicting whether an IP can be considered malicious. One of the disadvantages of this approach is the vast volume of blacklisted or whitelisted IPs needed to create the models [12][13][14][15]. The second approach takes advantage of contextual information about the IP address, such as geolocation, DNS records, hosts, and the address itself. This information is easy to extract and does not require a large volume of data. In this case, the models are based on computations of the frequencies at which contextual information appears or, again, on clustering techniques [16][17][18][19]. A global accuracy of 0.77 is reached when classifying an IP address as malicious or not. Another perspective is analyzing the dynamic behavior of the IP address from logs or intrusion detection systems [20][21][22]: the number of alerts generated, requests, accesses, etc. Although this approach reaches the best accuracies, 0.91-0.93, it implies additional resource costs because it requires monitoring and extracting the features online.
As mentioned, one of the most used approaches to categorizing the maliciousness of an IP address is applying ML models. However, although relevant features such as geolocation variables are available, the values of these variables, or the blacklists themselves, are expected to change over time, leading to a concept drift scenario. A concept drift scenario is one in which the data distribution changes [3]. If $X$ denotes the feature vector space of a data sample and $Y$ is the label space of $X$, then concept drift happens if $P_t(y \mid X) \neq P_{t+\Delta t}(y \mid X)$ and/or $P_t(X) \neq P_{t+\Delta t}(X)$, where $P_t(X)$ is the marginal distribution of the data at an instant $t$ and, analogously, $P_t(y \mid X)$ is the conditional distribution of the label at $t$. The drift is real if the difference appears in the conditional probability, virtual if it appears in the marginal probability, or a combination of both if it appears in both [2][3][4]. Although there are studies in which concept drift is involved, it is usually not the focus of the research.
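As a rough illustration of the two inequalities above, the following Python sketch estimates each kind of drift from two labelled samples taken at t and t + Δt. It uses the total variation distance between empirical distributions, which is an assumption of this sketch and not necessarily the drift measure used in the experiments; the function names and the toy country-code feature are hypothetical.

```python
from collections import Counter, defaultdict


def empirical(values):
    """Empirical distribution of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def virtual_drift(xs_t, xs_t2):
    """Distance between the marginals P_t(X) and P_{t+dt}(X)."""
    return total_variation(empirical(xs_t), empirical(xs_t2))


def real_drift(sample_t, sample_t2):
    """Largest distance between P_t(y|X=x) and P_{t+dt}(y|X=x),
    taken over the feature values x observed at both time points."""
    def conditionals(sample):
        by_x = defaultdict(list)
        for x, y in sample:
            by_x[x].append(y)
        return {x: empirical(ys) for x, ys in by_x.items()}

    c1, c2 = conditionals(sample_t), conditionals(sample_t2)
    shared = set(c1) & set(c2)
    return max((total_variation(c1[x], c2[x]) for x in shared), default=0.0)


# Toy samples of (feature, label) pairs; the values are made up.
d_t = [("ES", 0), ("ES", 0), ("FR", 1), ("FR", 1)]
d_t2 = [("ES", 0), ("ES", 1), ("FR", 1), ("DE", 1)]

print(virtual_drift([x for x, _ in d_t], [x for x, _ in d_t2]))
print(real_drift(d_t, d_t2))
```

Both quantities lie between 0 and 1, with 0 meaning that the two empirical distributions coincide. With real-valued geolocation features, the values would first need to be binned or compared with a different distance.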
Recently, in [1], an optimal feature set to categorize an IP address's maliciousness was configured with Autosklearn [23] by analyzing contextual variables jointly with temporal information extracted from blacklists. Although the feature set has been optimized, the question now is whether the implementation can be optimized in terms of efficiency and scalability. For this reason, in this work, we have conducted a case study with the optimal configuration set of [1], the geolocation features, but changing the ML algorithm and adding a possible parallelization with HPC resources. Also, we have conducted the experiments in a biclass and a multiclass scenario to compare our results with other research works. 2 Computational and Mathematical Methods

Materials and Methods
In this section, the experimental details are described. (2) Then, we constructed another blacklist from BL by transforming the labels 1 and 3 into 0, and the labels 6 and 9 into 1. That is, we grouped the samples with very low and low severity on the one hand and the samples with high and very high severity on the other. This experiment is denoted by B. We also repeated the experiment transforming the label 1 into 0 and grouping the labels 3, 6, and 9 into 1. This experiment is denoted by B′. The reason for taking a subset of 55728 IPs from BL rather than all the IP addresses is that these present changes in the geolocation features.
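The two label groupings described above can be written as simple lookup tables. The sketch below is only an illustration: the names `MAP_B`, `MAP_B_PRIME`, and `binarize` are hypothetical, and the example labels are made up.

```python
# Severity labels in BL, as stated in the text: 1, 3, 6, and 9.
MAP_B = {1: 0, 3: 0, 6: 1, 9: 1}        # experiment B
MAP_B_PRIME = {1: 0, 3: 1, 6: 1, 9: 1}  # experiment B'


def binarize(labels, mapping):
    """Map multiclass severity labels to a biclass label.

    A label outside the mapping raises KeyError, so unexpected
    values are not silently assigned to a class.
    """
    return [mapping[label] for label in labels]


severities = [1, 3, 6, 9, 3]  # made-up example labels
print(binarize(severities, MAP_B))        # [0, 0, 1, 1, 0]
print(binarize(severities, MAP_B_PRIME))  # [0, 1, 1, 1, 1]
```

The only difference between the two experiments is where label 3 falls, which changes the class balance of the resulting biclass problem.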
The datasets are available at https://github.com/amunc/IP_datasets; because of the confidentiality agreement, the variable containing the IP is transformed to a numerical value to anonymize it. The proportion of each severity class in the datasets is included in Table 1.
3.3. Research Questions. The research questions are described below: following the approach given in [24]. Both take values between 0 and 1. Here, we take $D_1 = D_t$ and $D_2 = D_{t+\Delta t}$.
The ML models have been created with extreme gradient boosting, XGB [25], using the implementation in [26]. The hyperparameters that have been optimized are the depth and the number of trees. 70% of the data is used for hyperparameter optimization, and the other 30% is used to evaluate the suitability of the optimized models. For all experiments, we have performed 10-fold cross-validation. Since the datasets are unbalanced, the response variables that have been analyzed are the accuracy and Matthews' coefficient, $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$, where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively [27]. Also, we have computed the recall (or sensitivity) and the selectivity (or specificity) of the models for each class.
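The metrics named above can be computed directly from the four confusion-matrix counts. The following Python sketch is a minimal illustration; returning 0.0 when a denominator is zero is a convention assumed here, and the counts in the example are made up to show why accuracy alone can mislead on unbalanced data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def recall(tp, fn):
    """Sensitivity: share of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Selectivity: share of actual negatives that are detected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Made-up unbalanced example: 90 positives, 10 negatives.
tp, tn, fp, fn = 90, 5, 5, 0
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC         = {mcc(tp, tn, fp, fn):.3f}")
print(f"recall      = {recall(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```

In this example the accuracy is high because the majority class dominates, while the MCC and the specificity expose the weak performance on the minority class.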
Also, to evaluate the computational cost, we have collected the total time in seconds of the procedure. This includes the time spent on model construction, feature selection, feature construction, preprocessing, and data loading. We highlight that, in our case, the feature selection time is 0.
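One way to collect such a per-stage breakdown is a small timer that accumulates wall-clock seconds under a stage name. The sketch below is hypothetical: the `StageTimer` class, the stage names, and the placeholder workloads are assumptions for illustration, not the instrumentation used in the experiments.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def total(self):
        """Total time of the procedure, as the sum of all stages."""
        return sum(self.seconds.values())


timer = StageTimer()
with timer.stage("data_load"):
    rows = [(i % 7, i % 2) for i in range(10_000)]  # placeholder for loading
with timer.stage("preprocessing"):
    rows = [r for r in rows if r[0] != 3]           # placeholder filter
with timer.stage("feature_construction"):
    features = [(x, x * x) for x, _ in rows]        # placeholder features
timer.seconds.setdefault("feature_selection", 0.0)  # no selection step is run
with timer.stage("model_construction"):
    labels = [y for _, y in rows]
    model = max(set(labels), key=labels.count)      # majority-class stand-in

for name, secs in timer.seconds.items():
    print(f"{name:22s} {secs:.6f} s")
print(f"{'total':22s} {timer.total():.6f} s")
```

Recording the feature selection stage explicitly as 0 keeps the breakdown comparable with pipelines that do run a selection step.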
Finally, to decide whether HPC is a suitable tool to face this problem, we have carried out all the above experiments with different numbers of cores, parallelizing the XGB algorithm. We have performed the analyses with 1, 2, 3, 5, 10, and 16 cores. The experiments have been carried out with Python 3.8.
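A sweep over core counts such as the one described above can be organized as a small timing harness. The Python sketch below is an assumption about how such a sweep might look, not the code used in the experiments: `time_fit` and `DummyModel` are hypothetical names, and the dummy model only stands in for a real estimator so that the sketch runs without extra packages.

```python
import time

CORE_COUNTS = (1, 2, 3, 5, 10, 16)  # core counts listed in the text


def time_fit(make_model, X, y, core_counts=CORE_COUNTS):
    """Fit one fresh model per core count and return {cores: seconds}."""
    timings = {}
    for cores in core_counts:
        model = make_model(cores)
        start = time.perf_counter()
        model.fit(X, y)
        timings[cores] = time.perf_counter() - start
    return timings


class DummyModel:
    """Stand-in estimator with a fit method; it ignores n_jobs."""

    def __init__(self, n_jobs):
        self.n_jobs = n_jobs

    def fit(self, X, y):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)
        return self


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up features
y = [0, 1, 1]                             # made-up labels
timings = time_fit(lambda cores: DummyModel(n_jobs=cores), X, y)
for cores in sorted(timings):
    print(f"{cores:2d} cores: {timings[cores]:.6f} s")
```

With XGBoost installed, the factory could instead return `xgboost.XGBClassifier(n_jobs=cores, max_depth=d, n_estimators=n)`, where `d` and `n` stand for the tuned depth and number of trees; `n_jobs`, `max_depth`, and `n_estimators` are parameters of XGBoost's scikit-learn wrapper. Whether the original experiments set the thread count this way is not stated in the text.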

Results and Discussion
This section is organized according to the research questions proposed. In Table 2, we have included the results of the ML models constructed when they are applied over t and t + Δt.
The confusion matrices are included in Tables 3, 4, and 5.
Table 6 shows the average and the median results of applying the ML models constructed with $D_{t+\Delta t}$ to predict over $D_{t+\Delta t}$.
Regarding the confusion matrices, they are included in Tables 7, 8

and specificity are similar. They are better for the biclass scenario. As expected, the settings obtained with different numbers of cores are very similar. It therefore seems logical to study the time complexity of the process. In this way, we can analyze whether increasing the number of cores and using HPC resources provides the considerable time reduction needed to work in concept drift scenarios.
The overall running time (in seconds) of the biclass and multiclass scenarios is included in Figures 4, 5, and 6. We can observe that using a greater number of cores reduces the time consumed. However, the gain is limited to 5 cores. Beyond this point, the parallelization process reaches its asymptotic behavior and begins to lose time. The boosting algorithm depends on the results of past iterations, so the parallelization model used in XGB does not create several trees in parallel but evaluates several candidate splits in parallel, which are integrated into a single tree in each iteration. Synchronizing the splits incurs additional costs, so adding too many parallel processes increases the time spent in synchronization relative to the computation of the tree model. Using five cores instead of one gives a gain of 39.50% in the binarized case and 53.82% in the multiclass scenario. Thus, by parallelizing the model construction, we reduce the time by roughly half.
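The reported gains can be restated as speedups and parallel efficiencies. The sketch below assumes that "gain" means the relative reduction in running time, g = (T_1 - T_k) / T_1, which is an interpretation of the text and not a definition it gives; the function names are hypothetical.

```python
def gain(t_one_core, t_k_cores):
    """Relative time reduction, e.g. 0.395 for a 39.50% gain."""
    return (t_one_core - t_k_cores) / t_one_core


def speedup_from_gain(g):
    """Speedup T_1 / T_k implied by a relative gain g = 1 - T_k / T_1."""
    return 1.0 / (1.0 - g)


def efficiency(speedup, cores):
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup / cores


# Gains reported in the text for five cores versus one.
for name, g in [("biclass", 0.3950), ("multiclass", 0.5382)]:
    s = speedup_from_gain(g)
    print(f"{name}: speedup {s:.2f}x, efficiency on 5 cores {efficiency(s, 5):.2f}")
```

Under this reading, the reported gains correspond to speedups of about 1.65x and 2.17x on five cores, that is, efficiencies well below 1, which is consistent with the synchronization overhead described above.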
Figures 7, 8, and 9 show the distribution of the running time (in seconds) of the biclass and multiclass scenarios. Even with parallelization, the construction of the model takes the most time.

Conclusions
We have proposed concrete experiments involving dynamic cybersecurity datasets to optimize the categorization of the maliciousness of an IP address using ML models and geolocation information. We have also studied whether concept drift degrades the obtained models. Furthermore, we wanted to know whether HPC would improve our results and performance, since cybersecurity datasets are usually massive. Accurate boosting ML models have been studied, showing that the optimum number of cores is around 5 for the analyzed dataset.
In future work, we plan to relate this optimum to the adequate size of datasets and the type of ML models.

Data Availability
The datasets generated and/or analyzed during the current study are available in a GitHub repository, https://github.com/amunc/IP_datasets.