Selecting the best configuration of hyperparameter values for a Machine Learning model directly affects the performance of the model on the dataset. It is a laborious task that usually requires deep knowledge of both hyperparameter optimization methods and Machine Learning algorithms. Although several automatic optimization techniques exist, they usually consume significant resources, increasing the dynamic complexity needed to obtain high accuracy. Since one of the most critical aspects of this computational consumption is, among others, the available dataset, in this paper we study the effect of using different partitions of a dataset in the hyperparameter optimization phase on the efficiency of a Machine Learning algorithm. Nonparametric inference has been used to measure the rate of different behaviors of the accuracy, time, and spatial complexity obtained with the partitions compared with the whole dataset. Also, a level of gain is assigned to each partition, allowing us to study patterns and identify which samples are more profitable. Since Cybersecurity is a discipline in which the efficiency of Artificial Intelligence techniques is a key aspect for extracting actionable knowledge, the statistical analyses have been carried out over five Cybersecurity datasets.
A Machine Learning (ML) solution for a classification problem is effective if it works efficiently in terms of accuracy and required computational cost. The first factor can be improved from several points of view, each of which may affect the second one in different ways.
The simplest way to obtain an ML model with good accuracy is to test and compare different ML algorithms on the same problem and, finally, choose the one that performs best. However, it is clear that, for instance, a decision tree model does not require, in general, as much computational time and memory to be trained as a Multilayer Perceptron. So, we need to balance the achieved accuracy against the available resources.
Another common and effective approach to reach high accuracy is to work with large training datasets. Nevertheless, this solution is limited by the associated computational cost (obtaining and storing the data, cleaning and transformation processes, and learning from the data). A possible alternative to this problem is to reduce the training set without losing too much information [
Research related to this aspect, in addition to the usual data filtering, focuses on how to optimize the training set, not only on reducing it. The progressive sampling method shows that the performance with random samples of certain sizes is equal to or better than working with the entire dataset [
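As an illustration of the idea behind progressive sampling, the following sketch grows a random training sample until held-out accuracy plateaus. The size schedule, tolerance, and decision tree learner are assumed placeholders, not the cited method's exact procedure:

```python
# Hedged sketch of progressive sampling: stop growing the training sample
# once held-out accuracy improves by less than a tolerance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def progressive_sample_size(sizes=(100, 200, 400, 800, 1400), tol=0.01):
    """Return the smallest sample size whose accuracy is near the plateau."""
    best, chosen = 0.0, sizes[0]
    for n in sizes:
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
        acc = clf.score(X_te, y_te)
        if n != sizes[0] and acc - best <= tol:
            return chosen, best  # accuracy plateaued: keep the smaller sample
        best, chosen = acc, n
    return chosen, best

n_opt, acc_opt = progressive_sample_size()
```

The schedule and tolerance would in practice be tuned to the dataset at hand; the point is only that accuracy often saturates well before the full training set is used.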
On the other hand, tuning the hyperparameters of an ML algorithm is a critical aspect of the model training process and is considered best practice for obtaining a successful Machine Learning application [
Given a supervised ML algorithm, the continuous HPO is usually solved by gradient descent-based methods [
Bayesian HPO algorithms balance the exploration of promising hyperparameter configurations against the exploitation of already-known good configurations, so that each test either improves the results or gains more information [
Although applying HPO algorithms to an ML model greatly improves the accuracy of the resulting models, we cannot overlook the computational complexity of implementing these techniques. This is a critical issue because obtaining good performance through HPO may require generating many candidate configurations, numerous function evaluations, and expensive computational resources. For example, GP methods usually require a high number of iterations. Likewise, some derivative-free optimizations behave poorly in hyperparameter optimization problems because the optimization target is smooth [
Recently, a Radial Basis Function (RBF) has been proposed as a deterministic surrogate model to approximate the error function of the hyperparameters through dynamic coordinate search that requires fewer evaluations in Multilayer Perceptron (MLP) and convolutional neural networks [
Examples of this application are the accelerated RS version (2x) [
We can also find studies about the effect of these HPO methods in the efficiency of the ML algorithms comparing different methods [
The goal of this article is to carry out an empirical statistical analysis of the effect of the size of the datasets used in the HPO stage on the performance of several ML algorithms. The studied response variables are the main components of the efficiency of an algorithm: quality and dynamic complexity [
The research questions that will be studied are the following:
In the case that we obtain a positive answer, we follow to the next questions.
In order to answer the research questions formulated above, an experiment has been carried out with different publicly available datasets concerning learning tasks on cybersecurity events. Cybersecurity is a challenging research area due to the sophistication and the number of kinds of cybersecurity attacks, which, in fact, increase very fast as time goes by. In this framework, traditional tools and infrastructures are not useful because we deal with big data created at high velocity, and the solutions and predictions must be faster than the threats. Artificial Intelligence and ML analytics have turned out to be among the most powerful tools against cyberattackers (see [
Experimental analyses have been carried out in order to investigate possible statistical differences in the efficiency of the Machine Learning algorithms Random Forest (RF), Gradient Boosting (GB), and MLP when different sample sizes are used in several HPO selection algorithms. We have used nonparametric statistical inference because, in an experimental design in the field of computational intelligence, these techniques are very useful for analyzing the behavior of a method with respect to a set of algorithms [
The paper is structured as follows. In Section
Let
Let
Usually, in practice, one has a dataset
Suppose we have a ML model
The study that will be done is statistical, so we will consider several datasets as well as many Machine Learning algorithms. Different state-of-the-art HPO algorithms will be considered and applied to every possible combination of Machine Learning models and datasets in order to compare them.
Experiments were conducted testing eight HPO methods over five cybersecurity datasets for three ML algorithms.
As mentioned above, we have evaluated the efficiency of three well-known Machine Learning algorithms, commonly used in classification and prediction problems: RF, GB, and MLP. The library used is [
Ensemble methods are commonly used because they are based on the underlying idea that many different joint predictors will perform better than any single predictor alone. Ensembling techniques can be divided into bagging and boosting methods.
Bagging methods build many independent learners whose outputs are combined by averaging in order to give a final prediction. They handle overfitting and reduce variance. The best-known example of a bagging ensemble method is the RF. An RF is a classifier consisting of a collection of decision trees. Each tree is constructed by applying an algorithm to the training set and an additional random vector that is sampled via bootstrap resampling, so the trees run and give independent results (see [
Boosting methods, in contrast, are ensemble techniques in which the predictors are built sequentially, each learning from the mistakes of the previous predictor in order to optimize the subsequent learner. They usually take less time/fewer iterations to get close to the actual predictions, but the stopping criteria must be chosen carefully. They reduce bias and variance and can cope with overfitting. One of the most common boosting methods is GB. The library used is the GradientBoostingClassifier, and we tune its discrete hyperparameters, namely the number of predictors and their maximum depth.
On the other hand, an artificial neural network is a model organized in layers (input layer, output layer, and hidden layers). An MLP is a modification of the standard linear perceptron in which multiple layers of connected nodes are allowed. The standard algorithm for training an MLP is the backpropagation algorithm (see [
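The three algorithms can be instantiated with scikit-learn as follows. The hyperparameter values shown here are illustrative placeholders, not the tuned settings of the study:

```python
# Hedged sketch: the three ML algorithms compared in the study, built with
# scikit-learn on a synthetic dataset. Hyperparameter values are arbitrary.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    # Bagging ensemble: independent bootstrapped trees, results averaged.
    "RF": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
    # Boosting ensemble: trees fitted sequentially on the previous errors;
    # the tuned hyperparameters in the paper are n_estimators and max_depth.
    "GB": GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0),
    # Multilayer perceptron trained with backpropagation.
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Each classifier exposes the same `fit`/`score` interface, which is what makes it straightforward to run the same HPO pipeline over all three.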
In Table
HPO algorithms. The star symbol
Name | Reference | Ready? | Python minimal version | Smart | Library |
---|---|---|---|---|---|
PS | [ | | 2.7 and 3 | X | [ |
TPE | [ | | 2.7 and 3 | X | [ |
CMA-ES | [ | | 2.7 and 3 | X | [ |
NM | [ | | 2.7 and 3 | | [ |
RS | [ | | 2.7 and 3 | X | [ |
SMAC | [ | | 3 | X | [ |
The choice of datasets for the experiments was motivated by several criteria: availability on public servers; diversity in the number of instances, classes, and features; and relevance to cybersecurity. The set of datasets
Description of the set of datasets
Dataset | Name | Instances | Features | Classes | Reference |
---|---|---|---|---|---|
| Spambase | 4601 | 57 | 2 | [ |
| Robots in RTLS | 6422 | 12 | 3 | [ |
| Phishing websites | 11055 | 30 | 2 | [ |
| Intrusion Detection (NSL-KDD) | 148517 | 39 | 5 | [ |
| Credit Card Fraud Detection | 284807 | 30 | 2 | [ |
Regarding the transformation of the feature data of the datasets that required it, this has been performed manually in Python.
Dataset
Also, the datasets have different numbers of instances, features, and classes of the target variable.
For each dataset
Description of the partitions of the set of datasets
Dataset | | | | |
---|---|---|---|---|
| 383 | 766 | 2300 | 4601 |
| 535 | 1070 | 3211 | 6422 |
| 921 | 1842 | 5527 | 11055 |
| 12376 | 24752 | 74258 | 148517 |
| 23733 | 47467 | 142403 | 284807 |
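Assuming the partition sizes in the table correspond to fractions of roughly 1/12, 1/6, and 1/2 of each dataset (an inference consistent with the listed counts, e.g. 4601/12 ≈ 383), drawing such a partition could be sketched as:

```python
# Hedged sketch: draw a random subsample of a dataset for the HPO phase.
# The fractions 1/12, 1/6, 1/2 are inferred from the table, not stated
# explicitly in this section.
import numpy as np

def subsample_indices(n_rows, frac, seed=0):
    """Return a random subset of row indices of size floor(frac * n_rows)."""
    rng = np.random.default_rng(seed)
    k = int(frac * n_rows)
    return rng.choice(n_rows, size=k, replace=False)

# For Spambase (4601 instances) these sizes match the table's first row.
sizes = {frac: len(subsample_indices(4601, frac)) for frac in (1 / 12, 1 / 6, 1 / 2)}
```

The returned indices would then select the rows actually fed to the HPO algorithm, while the final model evaluation still uses the full train/test split.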
On the other hand, we fix once and for all a partition
Finally, in order to build response variables to measure the goal of the study, we apply each
The time complexity, measured in seconds, is the sum of the time needed by the HPO algorithm to find the optimal hyperparameters and the time needed by the ML algorithm for the training phase. The spatial complexity, measured in KB, is defined as the peak memory usage during the HPO algorithm run, including the internal structures of the algorithm as well as the loading of the train and test datasets.
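A minimal sketch of how these two response variables could be collected in Python, using the standard library's `time` and `tracemalloc`; the measured callable is a stand-in for the combined HPO search and training run, and the authors' actual instrumentation may differ:

```python
# Hedged sketch: measure elapsed time (seconds) and peak memory (KB) of a
# callable, mirroring the two response variables defined in the text.
import time
import tracemalloc

def measure(run_hpo_and_train):
    """Run a callable; return (result, elapsed seconds, peak memory in KB)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = run_hpo_and_train()
    elapsed = time.perf_counter() - t0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 1024  # bytes -> KB

# Toy stand-in for the HPO search plus training phase.
res, secs, peak_kb = measure(lambda: sum(range(100_000)))
```

Note that `tracemalloc` only tracks Python-level allocations; memory allocated inside native extensions would need an OS-level tool instead.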
The analyses have been carried out by the authors on high-performance computing resources provided by SCAYLE (
The analysis script has been implemented in Python. Python uses an automatic memory management system, the garbage collector, which releases unused memory space. This process can be nondeterministic, and certain fluctuations in the results may be due to memory not being released at a given moment.
It is worth noting that different technical tools (either software or hardware) could affect the collected data about time and spatial complexity. This fact should be taken into account if we want to measure the effect of data sizes on the response variables in absolute terms, that is, for a single HPO algorithm. However, the influence of the technical specifications on the response variables is not a relevant factor in this study. This is a comparative study in which all the measures are collected under the same conditions, so the possible effect of the technical elements on the data is the same in each experiment. The same comparative patterns are expected to appear with other technical characteristics.
The aim of the study is to decide whether the size of the data is a factor that influences the efficiency of an ML algorithm using an HPO method among partitions of the data.
We perform the following statistical analysis. First, for each level of the size, that is, the partitions
At this point, we have applied the inference described above over the accuracy, the time, and the spatial complexity. Then, we have obtained the
From the results obtained in Wilcoxon's tests, we assign to each
This correspondence is shown for each algorithm
The design of Table
Finally, we compute the average gain of the smaller partitions overall, per dataset, and per ML algorithm. These results let us measure the reliability of the conclusions.
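A minimal sketch of the paired nonparametric comparison described above, using `scipy.stats.wilcoxon` on synthetic accuracy values; the numbers are illustrative only and not results from the study:

```python
# Hedged sketch: Wilcoxon signed-rank test between accuracies obtained with
# a smaller partition and with the whole dataset (paired by experiment).
# The accuracy values below are synthetic, for illustration only.
from scipy.stats import wilcoxon

acc_partition = [0.91, 0.89, 0.92, 0.90, 0.88, 0.93, 0.90, 0.91]
acc_full      = [0.92, 0.90, 0.92, 0.91, 0.90, 0.94, 0.91, 0.93]

stat, p_value = wilcoxon(acc_partition, acc_full)
significant = p_value < 0.05  # reject equality of distributions at the 5% level
```

The same paired test would be applied analogously to the time and spatial complexity measurements, yielding the per-dimension significance counts reported in the results tables.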
Level of gains where ‘
Level | Condition |
---|---|
9 | If |
8 | If |
7 | If |
6 | If |
5 | If |
4 | If |
3 | If |
2 | If |
1 | If |
0 | In another case |
The first research question deals with whether the size of the partition used in the HPO phase influences, in a certain sense, the efficiency of the algorithm. Once the comparison tests described in the previous section have been performed, we count, for each ML model, how many combinations show statistically significant differences across all the HPO methods and all the datasets. Although we may find an influence on the response variables, we do not know whether this influence is positive or not. So, the second research question focuses on the study of these differences and equalities. Next, we analyze the reliability of the results. Finally, we include an overview of the global results obtained.
In the case of RF, the results included in Table
Statistically significant differences in each dimension for all HPO across all
Combination | Accuracy | Time Complexity | Spatial Complexity | (Acc+T.C.+S.C) |
---|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | |
Also, we can see that the partition
In the case of GB, the results included in Table
Statistically significant differences in each dimension for all HPO across all
Combination | Accuracy | Time Complexity | Spatial Complexity | (Acc+T.C.+S.C) |
---|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | |
Regarding the concrete partitions, the results are more homogeneous for GB than for RF, although
Finally, in the case of MLP, the results included in Table
Statistically significant differences in each dimension for all HPO across all
Combination | Accuracy | Time Complexity | Spatial Complexity | (Acc+T.C.+S.C) |
---|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | |
It is worth noting that
In general, we have found a strong effect of the size of the dataset used in HPO on the time and spatial complexity for the three ML algorithms. In the case of accuracy, the ensemble methods (RF and GB) show a medium effect (a rate of around 40%), while for the MLP the level of effect is very low.
In order to study in depth whether the observed effect is positive or negative when working with smaller partitions, we examine the evolution of the response variables as the size of the partition grows. We study the differences encountered in each dimension of efficiency.
In Figures
Accuracy, time, and spatial complexity of RF (panels showing accuracy, time, and spatial complexity for each of the five datasets).
Accuracy, time, and spatial complexity of GB (panels showing accuracy, time, and spatial complexity for each of the five datasets).
Accuracy, time, and spatial complexity of MLP (panels showing accuracy, time, and spatial complexity for each of the five datasets).
Both the time and spatial complexity show an increasing trend, the former being more pronounced. Also, the spatial complexity is more variable for MLP than for the ensemble methods. So, in order to quantify the efficiency gained when tuning the hyperparameters with a smaller proportion of data, different levels are assigned according to Table
Note that, in general, it is not true that the accuracy increases as the size of the partition used for the HPO phase does. This can be seen more clearly when the ML model is GB or MLP and, certainly, it depends on the dataset. See, for instance, the three charts for
The average gains obtained with RF, GB, and MLP are included in Figures
Average of gain for each smaller partition with respect to the whole dataset in RF.
Average of gain for each smaller partition with respect to the whole dataset in GB.
Average of gain for each smaller partition with respect to the whole dataset in MLP.
For all the studied ML algorithms, we can obtain a gain when smaller partitions are used for HPO. Also, we can clearly identify four different patterns (see Table
Patterns of profit.
ML method | Pattern 1 | Pattern 2 | Pattern 3 | Pattern 4 |
---|---|---|---|---|
RF | NM | SMAC, RS | PS, TPE, CMA-ES | |
GB | RS | NM, TPE, CMA-ES | SMAC, PS | |
MLP | RS, SMAC | CMA-ES | PS, TPE, NM |
In the case of RF, the lowest level of gain is 4.5 and the maximum is 7. The other ensemble method, GB, obtains profit levels between 2.5 and 8. Finally, the MLP algorithm shows values between 5 and 7.5. Thus, the neural network is the algorithm for which the HPO phase performs best with smaller partitions, followed by RF and GB.
In addition, for all ML algorithms there is at least one HPO algorithm that obtains a profit level greater than 6 in the smaller partitions
If we compute the average of gain in each dataset, we obtain the results shown in Figure
Average of gain for each smaller partition with respect to the whole dataset in each dataset
Gain of RF for dataset
Gain of GB for dataset
Gain of MLP for dataset
The averages of the profit level for RF are between 3.2 and 8.5, the largest dataset being
Finally, we can conclude that, in general, in all datasets we gain efficiency by optimizing the hyperparameter values with smaller partitions, even though the data differ in terms of features, number of instances, or classes of the target variable. So, the results are consistent and reliable.
In Table
Average rate of statistically significant differences in each dimension for all HPO across all
Combination | Accuracy | Time Complexity | Spatial Complexity |
---|---|---|---|
| | | |
| | | |
| | | |
| | | |
Once we have statistically and globally analyzed the efficiency of an ML algorithm when we use smaller partitions instead of the whole dataset, the next step is to examine in depth those cases in which differences are encountered. It should be noted that these could provide either a gain or a loss of effectiveness. Also, a statistically significant difference in the accuracy, for example, could be due to a variation as small as a few ten-thousandths of the total. In these cases, the relevance of this difference is meaningfully related to the order of gain or loss in the time and spatial complexity. The global average of the level of gain, according to Table
Average level of profit for all ML algorithms considered and for each comparison between
Combination | HPO | |
---|---|---|
| | |
| | |
| | |
The global results obtained show an average level of profit between 5.04 and 6.33 out of 9, with an increasing trend related to the size of the partition.
Cybersecurity is a dynamic and emerging research discipline that faces increasingly complex problems requiring innovative solutions. The value of a Cybersecurity database is very high due to the actionable knowledge that can be extracted from it, but in most cases we have to deal with a large volume of data, which entails expensive costs in resources and time. Artificial Intelligence techniques, such as Machine Learning, are powerful tools that allow us to extract and generate knowledge in Cybersecurity, among other fields.
One of the main issues in reaching quality results with Machine Learning is the optimization of the hyperparameter values of the algorithm. However, automatic HPO methods entail a cost in terms of dynamic complexity.
In this work, we have developed a statistical analysis of the effect of using smaller samples of the dataset for this process and its influence on the effectiveness of the Machine Learning solution. The study was carried out over five different public Cybersecurity datasets. The results let us conclude that working with smaller partitions turns out to be more efficient than performing the same process with the whole dataset. The obtained gain differs depending on the ML algorithm and the HPO technique, with the highest level of profit provided by 50% of the dataset.
As future work, the next landmark would be to search for the optimal partition that obtains the best gain, as well as to study other HPO methods over more types of ML algorithms.
Covariance Matrix Adaptation Evolutionary Strategies
Gaussian Process
Gradient Boosting
Hyperparameters Optimization
Machine Learning
Multilayer Perceptron
Nelder-Mead
Particle Swarm
Radial Basis Function
Random Search
Random Forest
Sequential Model Automatic Configuration
Sequential Model-Based Optimization
Successive Halving
Tree Parzen Estimators.
The datasets supporting this meta-analysis are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
The authors would like to thank the Spanish National Cybersecurity Institute (INCIBE), who partially supported this work. Also, in this research, the resources of the Centro de Supercomputación de Castilla y León (SCAYLE, www.scayle.es), funded by the “European Regional Development Fund (ERDF)”, have been used.