Perimeter-based detection is no longer sufficient for mitigating the threat posed by malicious software. This is evident as antivirus (AV) products are replaced by endpoint detection and response (EDR) products, the latter allowing visibility into live machine activity rather than relying on the AV to filter out malicious artefacts. This paper argues that detecting malware in real-time on an endpoint necessitates an automated response due to the rapid and destructive nature of some malware. The proposed model uses statistical filtering on top of a machine learning dynamic behavioural malware detection model in order to detect individual malicious processes on the fly and kill those which are deemed malicious. In an experiment to measure the tangible impact of this system, we find that fast-acting ransomware is prevented from corrupting 92% of files with a false positive rate of 14%. Whilst the false-positive rate currently remains too high to adopt this approach as-is, these initial results demonstrate the need for a detection model that is able to act within seconds of the malware execution beginning; a timescale that has not been addressed by previous work.
Our increasingly digitised world broadens both the opportunities and motivations for cyberattacks, which can have devastating social and financial consequences [
Due to the huge numbers of new malware appearing each day, the detection of malware samples needs to be automated [
This paper argues that both of these approaches are vulnerable to evasion from the attacker. Static methods may be thwarted by simple code-obfuscation techniques whether rules are hand-generated [
There are several key challenges to address in detecting malware on-the-fly on a machine in use by comparison with detecting malicious applications that are detonated in isolation in a virtual machine. These are summarised below:
This paper seeks to address these key challenges and provides preliminary results, including a measure of “damage prevented” in a live environment for fast-acting destructiveware. As well as the results from these experiments, this paper contributes an analysis of computational resource consumption against detection accuracy for many of the most popular machine-learning algorithms used for malware detection.
The key contributions of this paper are as follows:

- The first general malware detection model to demonstrate damage mitigation in real-time using process detection and killing
- Benchmarking of commonly used ML algorithm implementations with respect to computational resource consumption
- Presentation of real-time malware detection against more user background applications than have previously been investigated, increasing from 5 to 36 (up to 95 simultaneous processes)
The next section outlines related work, followed by a report of the three methodologies that were tested to address these challenges, in which the method for evaluating these models is also explained. The experimental setup is described in Section
Machine learning models trained on static data have shown good detection accuracy. Chen et al. [
Dynamic behavioural data are generated by the malware carrying out its functionality. Again machine learning models have been used to draw out patterns between malicious and benign software using dynamic data. Various dynamic data can be collected to describe malware behaviour. The most commonly used data are API calls made to the operating system, typically recorded in short sequences or by frequency of occurrence. Huang and Stokes’s research [
Dynamic detection is more difficult to obfuscate but typically the time taken to collect data is several minutes, making it less attractive for endpoint detection systems. Some progress has been made on early detection of malware. Previous work [
OS = operating system; HPCs = Hardware performance counters; DT = Decision Tree; MLP = Multi-layer perceptron; NN = Neural Network; RF = Random Forest.
Previous work has begun to address the four challenges set out in the introduction. Table
Real-time malware detection literature: problems considered.

Ref. | (1) Signal separation | (2) Early detection | (3) Quick classification/latency | (4) Impact of automated actions | Resource consumption | Real-time tested | Malware types | OS | # Samples | Features | Algorithm
---|---|---|---|---|---|---|---|---|---|---|---
[ | X | X | General | Linux | 200 | HPCs | Boosted DT | ||||
[ | X | X | General | Linux | 200 | HPCs | Boosted DT | ||||
[ | X | X | X | X | General | Linux | 798 | API calls | MLP | ||
[ | X | X | X | General | Windows | 1,554 | Memory addresses, instructions | NN | |||
[ | X | X | General | Windows, Linux | 500 | API calls | NN | ||
[ | X | X | X | X | General | Windows | 9,992 | API calls | RF + NN | ||
[ | X | X | X | X | Crypto ransomware | Windows | 497 | File data | Rules |
To the best of our knowledge, challenge
Challenge
Greater attention has been paid to challenge
Both Sun et al. [
Challenge
As noted above, supervised learning models average errors across the training set but in the case of real-time detection and process killing, a
Tackling this issue was attempted in three different ways, and all three are reported here in the interests of reporting negative results as well as the one which performed the best. These were:

- Statistical methods to smooth the alert surface and filter out single false-positives
- Reinforcement learning, which is capable of incorporating the consequences of model actions into learning
- A regression model based on the feedback of a reinforcement learning model, made possible by having the ground-truth labels
Figure
High-level depiction of three approaches taken.
It is expected that transitioning from a supervised learning model to a real-time model will see a rise in false positives, since a single alert means that a benign process (and all of its child processes) is terminated, which effectively renders all of its future data points false positives. Filtering the output of the models, just as the human brain filters out transient electrical impulses in order to separate background noise from relevant data [
The proposed automated killing model may be better suited to a reinforcement learning strategy than to supervised learning. Reinforcement learning uses rewards and penalties from the model’s environment. The problem that this paper is seeking to solve is essentially a supervised learning problem, but one for which it is not possible to average predictions. There are no opportunities to classify the latter stages of a process if the agent kills the process, and this can be reflected by the reward mechanism of the reinforcement learning model (see Figure
Two limitations of this approach are that (1) reinforcement learning models can struggle to converge on a balanced solution, since they must learn to balance the exploration of new actions with the re-use of known high-reward actions, commonly known as the exploration-exploitation trade-off [
For reinforcement learning, loss functions are replaced by reward functions which update the neural network weights to reinforce actions (in context) that lead to higher rewards and discourage actions (in context) that lead to lower rewards; these contexts and actions are known as state-action pairs. Typically, the reward is calculated from the perceived value of the new state that the action leads to, e.g., points scored in a game. Often this cannot be pre-labelled by a researcher since there are so many (maybe infinite) state-action pairs. However, in this case, all possible state-action pairs can be enumerated, which is the third approach tested (regression model, outlined in the next section).
The reinforcement model was still tested. Here the reward is
There are a number of reinforcement learning algorithms to choose from. This paper explores q-learning [
The DQN tries out some actions; stores the states, actions, resulting states, and rewards in memory; and uses these to learn the expected rewards of each available action, the action with the highest expected reward being the one that is chosen. Neural networks are well-suited to this problem since their parameters can easily be updated; tree-based algorithms like random forests and decision trees can be adapted to this end, but not as easily. Future rewards can be built into the reward function and are usually discounted according to a tuned parameter, typically signified by
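The discounted-reward update at the heart of this learning loop can be illustrated with a minimal tabular Q-learning sketch. Note that this is a simplification of the paper's DQN (which uses a neural network approximator over continuous machine-metric states); the state space, learning rate, discount factor, and exploration rate below are purely illustrative:

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # illustrative hyperparameters
ACTIONS = ["wait", "kill"]

# Q-table: state -> expected return of each available action
q = {s: {a: 0.0 for a in ACTIONS} for s in range(5)}

def choose(state):
    # Epsilon-greedy: explore occasionally, otherwise exploit
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(q[state], key=q[state].get)

def update(state, action, reward, next_state, done):
    # "kill" ends the episode, so no future rewards remain to discount
    future = 0.0 if done else max(q[next_state].values())
    q[state][action] += ALPHA * (reward + GAMMA * future - q[state][action])

# One illustrative transition: killing malware in state 0 yields reward +1
update(0, "kill", 1.0, None, done=True)
print(q[0]["kill"])  # 0.1 after a single update
```

Because "kill" is terminal, only the "wait" action ever propagates discounted future rewards, which is exactly the asymmetry discussed above.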
In Mnih et al.’s [
Q-learning may enable a model to learn when it is confident enough to kill a process, using the discounted future rewards. For example, choosing not to kill some malware at time
Q-learning approximates rewards from experience, but in this case, all rewards from state-action pairs can actually be pre-calculated. Since one of the actions will kill the process and thus end the “experience” of the DQN, it could be difficult for this model to gain enough experience. Thus pre-calculation of rewards may improve the breadth of experience of the model. For this reason, a regression model is proposed to predict the
Unlike classification problems, regression problems can predict a continuous value rather than discrete (or probabilistic) values relating to a set of output classes. Regression algorithms are proposed here to predict the q-value of killing a process. If this value is positive, the process is killed.
There are two primary differences between this regression approach and the reinforcement learning DQN approach detailed in the previous section. Firstly, the datasets are likely to differ. Since the DQN generates training data through interacting with its environment, it may never see certain parts of the state-action space, e.g., if a particular process
Secondly,
The equation used to calculate the value of killing is positive for malware and negative for benignware; in both cases, it is scaled by the number of child processes impacted and in the case of malware, early detection increases the value of process killing (with an exponential decay). Let
The equation above negatively scores the killing of benignware in proportion to the number of subprocesses and scores the killing of malware positively in proportion to the number of subprocesses. A bonus reward is scored for killing malware early, with an exponential decay over time.
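This scoring scheme can be captured in a function of roughly the following shape. The exact coefficients and decay rate from the paper's equation are not reproduced here, so the constants below are purely illustrative:

```python
import math

def kill_value(is_malware, n_subprocesses, t_seconds, decay=0.1, bonus=1.0):
    """Illustrative score for killing a process at time t_seconds.

    Negative for benignware and positive for malware, in both cases
    scaled by the size of the process tree; killing malware early
    earns an extra bonus that decays exponentially with time.
    """
    scale = 1 + n_subprocesses          # the process plus its children
    if not is_malware:
        return -1.0 * scale             # penalise killing benignware
    early_bonus = bonus * math.exp(-decay * t_seconds)
    return scale * (1.0 + early_bonus)  # reward killing malware, more so early

# Killing malware immediately scores higher than killing it a minute later
assert kill_value(True, 2, 0) > kill_value(True, 2, 60) > 0
assert kill_value(False, 2, 0) < 0
```

A regression model trained on such pre-computed values can then kill any process for which the predicted value is positive.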
Previous research has not addressed the extent to which damage is mitigated by process killing, since Sun et al. [
Early detection is particularly useful for types of malware from which recovery is difficult and/or costly. Cryptographic ransomware encrypts user files and withholds the decryption key until a ransom is paid to the attackers. This type of attack is typically costly to remedy, even if the victim is able to carry out data recovery [
This section outlines the data capture process and dataset statistics.
The same features that were used in previous work [
At the process-level, 26 machine metric features are collected; these were dictated by the attributes available using the Psutil [
The process-level machine activity metrics collected are: CPU use at the user level, CPU use at the system level, physical memory use, swap memory use, total memory use, number of child processes, number of threads, maximum process ID from a child process, disk read, write and other I/O count, bytes read, written and used in other I/O processes, process priority, I/O process priority, number of command line arguments passed to process, number of handles being used by process, time since the process began, TCP packet count, UDP packet count, number of connections currently open, and 4 port statuses of those opened by the process (see Table
26 process-level features: 22 features + 4 port status values.
Category | |||
---|---|---|---|
CPU use (%) | System level | User level | |
Memory use (bytes) | Total | Physical (nonswapped) | Swap |
Child processes | Count | Maximum process ID | Number of threads |
I/O operation bytes on disk (bytes) | Read | Write | Nonread-write I/O operations |
I/O operation count on disk | Read | Write | Nonread-write I/O operations |
Priority | Process priority | I/O process priority | |
Network # packets | TCP packet count | UDP packet count | |
Network # bytes | # Bytes sent | # Bytes received | |
Network other | Number of connections currently open | Statuses of the ports opened by the process (4 statuses) | |
Miscellaneous | Number of command line arguments passed to process | Number of handles being used by process |
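A subset of these metrics can be gathered with Psutil along the following lines. This is a sketch rather than the paper's implementation (some fields, such as handle counts and I/O priorities, are platform-dependent and omitted here):

```python
import os
import time
import psutil

def snapshot(pid):
    """Collect a handful of the process-level metrics described above."""
    p = psutil.Process(pid)
    with p.oneshot():                      # batch the underlying system calls
        cpu = p.cpu_times()
        mem = p.memory_info()
        return {
            "cpu_user": cpu.user,
            "cpu_system": cpu.system,
            "mem_physical": mem.rss,       # nonswapped physical memory (bytes)
            "mem_total": mem.vms,
            "n_threads": p.num_threads(),
            "n_children": len(p.children(recursive=True)),
            "n_cmdline_args": len(p.cmdline()),
            "age_seconds": time.time() - p.create_time(),
            "n_connections": len(p.connections()),
        }

print(snapshot(os.getpid()))
```

Polling a function like this at a fixed interval for every live process yields the sequence of measurements that the models below consume.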
Feature normalisation is necessary for NNs to avoid over-weighting features with higher absolute values. The test, train, and validation sets (
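In practice this means fitting the scaler on the training set only and applying the same learned parameters to the validation and test sets, so that no information leaks between splits. A minimal min-max version (feature values here are illustrative):

```python
def fit_minmax(train_rows):
    """Learn per-feature minimum and range from the training set only."""
    cols = list(zip(*train_rows))
    mins = [min(c) for c in cols]
    ranges = [(max(c) - lo) or 1.0 for c, lo in zip(cols, mins)]  # avoid /0
    return mins, ranges

def transform(rows, mins, ranges):
    # Test/validation values may fall outside [0, 1]; that is expected,
    # because the scaler sees only the training distribution.
    return [[(v - lo) / r for v, lo, r in zip(row, mins, ranges)]
            for row in rows]

train = [[0.0, 10.0], [4.0, 30.0]]
mins, ranges = fit_minmax(train)
print(transform([[2.0, 40.0]], mins, ranges))  # [[0.5, 1.5]]
```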
During data capture, this research sought to improve upon previous work and emulate real machine use to a greater extent than has previously been trialled. The implementation details of the VM, simultaneous process execution, and RL simulation are outlined below:
The following experiments were conducted using a virtual machine (VM) running with Cuckoo Sandbox [
In typical machine use, multiple applications run simultaneously. This is not reflected by behavioural malware analysis research in which samples are injected individually to a virtual machine for observation. The environment used for the following experiments launches multiple applications on the same machine at slightly staggered intervals as if a user were opening them. Each malware sample is launched with a small number (1–3) and a larger number (3–35) of applications. It was not possible to find up-to-date user data on the number of simultaneous applications running on a typical desktop, so here it was elected to launch up to 36 applications (35 benign + 1 malicious) at once, which is the largest number of simultaneous applications for real-time data collection to date. From the existing real-time analysis literature, only Sun et al. [
Each application may in turn launch multiple processes, causing more than 35 processes to run at once; 95 is the largest number of simultaneous processes recorded; this excludes background OS processes.
For reinforcement learning, the DQN requires an observation of the resulting state following an action. To train the model, a simulated environment is created from the pre-collected training data whereby the impact of killing or not killing a process is returned as the next state. For process-level elements, this reduces all features to zero. A caveat here is that in reality, killing the process may not occur immediately and therefore memory, processing power, etc., may still be being consumed at the next data observation. For global metrics, the process-level values for the killed processes (includes child processes of the killed process) are subtracted from the global metrics. There is a risk again that this calculation may not correlate perfectly with what would be observed in a live machine environment.
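The simulated kill described above amounts to zeroing the process-level features and subtracting the tree's resource use from the machine-wide totals. A rough sketch (feature names illustrative):

```python
def simulate_kill(process_features, global_metrics):
    """Approximate the next observation after killing a process tree.

    Process-level features drop to zero and the tree's resource use is
    subtracted from the global (machine-wide) metrics. As noted above,
    this is an approximation: a real kill is not instantaneous, so some
    resources may still be consumed at the next observation.
    """
    next_global = {k: global_metrics[k] - process_features.get(k, 0.0)
                   for k in global_metrics}
    next_process = {k: 0.0 for k in process_features}
    return next_process, next_global

proc = {"cpu": 12.0, "mem": 300.0}   # killed tree's aggregated usage
glob = {"cpu": 55.0, "mem": 900.0}   # machine-wide totals
print(simulate_kill(proc, glob))
```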
In order to observe the model performance, a visualisation was developed to accompany the simulated environment. Figures
Benignware sample, normalised process-level metrics, 6 observations made without process being killed.
Malware sample, normalised process-level metrics, no observations made yet.
The dataset comprises 3,604 benign executables and 2,792 malicious applications (each containing at least one executable), with 2,877 for training and validation and 3,519 for testing. These dataset sizes are consistent with previous real-time detection dataset sizes (Das et al. [
In Pendlebury et al.’s analysis [
PUA = potentially unwanted application, RAT = remote access trojan.
This paper is not concerned with distinguishing particular malware families, but rather with identifying malware in general. However, a dataset consisting of just one malware family would present an unrealistic and easier problem than is found in the real world. The malware families included in this dataset are reported in Table
Malware families with more than 10 samples in the dataset. 315 families were represented in the dataset, with 27 represented more than 10 times. A basic description is provided; it does not cover the wide range of behaviours carried out by some malware families but is intended to indicate the range of behaviours among the top 27 families included in the dataset.
Malware family | # Train set | # Test set | Total | Description |
---|---|---|---|---|
Startsurf | 66 | 273 | 339 | Adware |
Fareit | 33 | 222 | 255 | Spyware |
Vigram | 23 | 212 | 235 | Adware |
Winwrapper | 78 | 8 | 86 | PUA |
Downloadguide | 15 | 59 | 74 | Adware |
Gandcrab | 5 | 54 | 59 | Ransomware |
Emotet | 12 | 46 | 58 | Credstealer |
Chapak | 4 | 37 | 41 | Installer |
Virut | 30 | 2 | 32 | Backdoor |
Installmonster | 12 | 18 | 30 | Installer |
Noon | 8 | 22 | 30 | Spyware |
Gamarue | 11 | 18 | 29 | Backdoor |
Razy | 7 | 16 | 23 | Crypto stealer |
Zeroaccess | 23 | 0 | 23 | Rootkit |
Soft32downloader | 5 | 22 | 23 | Installer |
Appster | 7 | 15 | 22 | PUA |
Prepscram | 1 | 20 | 21 | Installer |
Zusy | 2 | 19 | 21 | Spyware |
Darkkomet | 17 | 1 | 18 | RAT |
Adposhel | 4 | 14 | 16 | Adware |
Swrort | 13 | 0 | 13 | Backdoor |
Slugin | 13 | 0 | 13 | Installer |
Vobfus | 11 | 2 | 13 | Installer |
Speedingupmypc | 1 | 11 | 12 | Adware |
Relevantknowledge | 5 | 6 | 11 | Adware |
Kuaizip | 4 | 7 | 11 | PUA |
Bladabindi | 7 | 4 | 11 | Backdoor |
Other ( | 377 | 260 | 602 | — |
# Other families ( | 184 | 154 | 288 | — |
Unknown | 333 | 291 | 671 | — |
Ascribing family labels to malware is nontrivial since antivirus vendors do not follow standardised naming conventions and many malware families have multiple aliases. Sebastián et al. [
Statistical inspection of the training set reveals that benign applications have fewer sub-processes than malicious applications, with 1.17 processes in the average benign process tree and 2.33 processes in the average malicious process tree. Malware was also more likely to spawn processes outside of the process tree of the root process, often using the names of legitimate Windows processes. In some cases, malware launches legitimate applications, such as Microsoft Excel, in order to carry out a macro-based exploit. Although Excel is not a malicious application in itself, it is malicious in this context, which is why malicious labels are assigned to any process that a malware sample has caused to come into being. One could argue that some processes launched by malware are not malicious, because they do not individually cause harm to the endpoint or user; however, without the malware they would not be running, so they can be considered at least undesirable, if only in the interests of conserving computational resources.
The dataset is split in half with the malicious samples in the test set coming from the more recent VirusShare repository, and those in the training set from the earlier repository. This is to increase the chances of simulating a real deployment scenario in which the malware tested contains new functionality by comparison with those in the training set.
Ideally, the benignware should also be split by date across the training and test set; however, it is not a trivial task to determine the date at which benignware was compiled. It is possible to extract the compile time from the PE header, but the PE author can manually input this date, which had clearly happened in some instances where the compile date was 1970-01-01 or, in one instance, 1970-01-16. In the latter case (1970-01-16), the file is first mentioned online in 2016, perhaps indicating a typographic error [
For training, an equal number of benign and malicious processes are selected, so that the model does not bias towards one class. 10% of these are held out for validation. In most ML model evaluations, the validation set would be drawn from the same distribution as the test set. However, because it is important not to leak any information about the malware in the test set, since it is split by date, the validation set here is drawn from the training distribution.
Data collection used the Psutil [
First, we demonstrate the unsuitability of a full-trace supervised learning malware detection model, which achieved more than 96% detection accuracy in Ref. [
Hyperparameter search space and the hyperparameters of the model giving the lowest mean false-positive and false-negative rates.
Possible values | Process-level data | Process-level data + global metrics | |
---|---|---|---|
Hyperparameter |||
Hidden neurons | 8–1024 | 253 | 193 |
Depth | [1–3] | 2 | 2 |
Batch size | [64, 128, 256] | 128 | 256 |
Epochs | 1–200 | 89 | 161 |
Dropout rate | [0, 0.1, 0.2, 0.3, 0.4, 0.5] | 0.1 | 0.1 |
Window size | 2–30 | 16 | 6 |
Loss function | Binary cross-entropy | ||
Weight update rule | Adam [ | ||
Recurrent unit | GRU cell | ||
Validation F1-score ( | — | 94.61 |
It is expected that supervised malware detection models will not adapt well to process-killing due to the averaging of loss metrics as described earlier. Initially, this is verified by using supervised learning models to kill processes that are deemed malicious. For supervised classification, the model makes a prediction every time a data measurement is taken from a process. This approach is compared with one taking average predictions across all measurements for a process and for a process tree as well as the result of process killing. The models with the highest validation accuracy for classification and killing are compared.
Figure
F1 scores, true positive rates (TPR), and true negative rates (TNR) for partial-trace detection (process measurements), full-trace detection (whole process), whole application (process tree), and with process-level measurements + process killing (process killing) for validation set (left column) and test set (right column).
Three levels of data collection: each measurement, each process, each process tree.
F1-score, true positive rate (TPR), and true negative rates (TNR) (all
Features | Metric | Classify | Dataset | Kill |
---|---|---|---|---|
Proc. Data | F1 | Validation set | ||
Proc. Data | tnr | 94.72 | Validation set | 85.71 |
Proc. Data | tpr | 98.64 | Validation set | 95.80 |
Proc. Data + glob. | F1 | 94.61 | Validation set | 87.69 |
Proc. Data + glob. | tnr | 90.57 | Validation set | 77.31 |
Proc. Data + glob. | tpr | 95.93 | Validation set | 95.80 |
Proc. Data | F1 | 74.91 | Test set | |
Proc. Data | tnr | 69.41 | Test set | 59.63 |
Proc. Data | tpr | 87.52 | Test set | 91.82 |
Proc. Data + glob. | F1 | Test set | 71.83 | |
Proc. Data + glob. | tnr | 79.70 | Test set | 59.63 |
Proc. Data + glob. | tpr | 82.91 | Test set | 90.24 |
The highest F1-score on the validation set is achieved by an RNN using process data only. When process killing is applied, there is a drop of less than 5 percentage points in the F1-score, but more than 15 percentage points are lost from the TNR.
On the unseen test set, the highest F1-score is achieved by an RNN using process data + global metrics, but the improvement over the process data + total number of processes is negligible. Overall, there is a reduction in F1-score from (97.44, 94.61) to (74.91, 77.66), highlighting the initial challenge of learning to classify individual processes rather than entire applications, especially when accounting for concept drift. Despite the low accuracy, these initial results indicate that the model is discriminating some of the samples correctly and may form a baseline from which to improve.
The test set TNR and TPR for classification on the best-performing model (process data only) are 79.70 and 82.91, respectively, but when process killing is applied, although the F1-score drops by 10 percentage points, the TNR and TPR move in opposite directions with the TNR falling to 59.63 and TPR increasing to 90.24. This is not surprising since a single malicious classification results in a process being classed as malicious. This is true for the best-performing models using either of the two feature sets (see Figure
Previous work on real-time detection has highlighted the requirement for a lightweight model (speed and computational resources). In the previous paper, RNNs were the best performing algorithm in classifying malware/benignware, but RNNs have many parameters and therefore may consume significant RAM and/or CPU. They also require preprocessing of the data to scale the values, which other ML algorithms such as tree-based algorithms do not. Whilst RAM and CPU should be minimised, taking model accuracy into account, inference duration is also an important metric.
Although the models in this paper have not been coded for performance and use common Python libraries, comparing these metrics helps to decide whether certain models are vastly preferable to others with respect to computational resource consumption. The PyRAPL library [
For the RNN, a “large” and a “small” model are included. The large models have the highest number of parameters tested in the random search (981 hidden neurons, 3 hidden layers, sequence length of 17) and the smallest (41 neurons, 1 hidden layer, sequence length of 13). These two RNN configurations are compared against other machine learning models which have been used for malware detection: Multi-Layer Perceptron (feed-forward neural network), Support Vector Machine, Naive Bayes Classifier, Decision Tree Classifier, Gradient Boosted Decision Tree Classifier (GBDTs), Random Forest, and AdaBoost.
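Inference duration can be compared across models with a simple harness of the following shape. The paper uses PyRAPL for the CPU and DRAM energy readings, which requires Intel RAPL support; only the timing part is shown here, with the standard library, and `dummy_predict` is a stand-in for any of the classifiers above:

```python
import time

def bench(predict, batch, iterations=100):
    """Average wall-clock inference time per batch over `iterations` runs."""
    start = time.perf_counter()
    for _ in range(iterations):
        predict(batch)
    return (time.perf_counter() - start) / iterations

# Stand-in classifier: flag a measurement as malicious if its mean exceeds 0.5
dummy_predict = lambda batch: [int(sum(row) / len(row) > 0.5) for row in batch]

batch = [[0.1, 0.9, 0.8]] * 100   # batch size of 100, as in the benchmark below
print(f"{bench(dummy_predict, batch) * 1e6:.1f} microseconds per batch")
```

Averaging over repeated runs, as the benchmark below does over 100 iterations, smooths out scheduler noise in the per-batch figures.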
26 features = process-level only; 37 features = machine- and process-level features.
Table
Average resource consumption over 100 iterations for a batch size of 100 vs. F1-scores on validation and test set for classification and process killing across 14 models.
Model | n features | Avg. cpu ( | Avg. dram (W) | Avg. Duration ( | Val F1 | Kill val F1 | Test F1 | Kill test F1 |
---|---|---|---|---|---|---|---|---|
AdaBoost | 26 | 127967.84 | 7981.51 | 6595.37 | 88.35 | 74.36 | 77.19 | 60.09 |
AdaBoost | 37 | 125041.20 | 7142.93 | 6469.16 | 89.63 | 76.07 | 80.10 | 60.14 |
DT | 26 | 3905.63 | 202.65 | 128.02 | 97.39 | 88.48 | 66.44 | 62.95 |
DT | 37 | 96.32 | 83.57 | 79.61 | 62.50 | |||
GBDT | 26 | 8788.41 | 338.78 | 349.31 | 92.27 | 78.26 | 82.47 | 63.33 |
GBDT | 37 | 11005.80 | 486.46 | 329.45 | 93.13 | 80.26 | 84.94 | 63.46 |
MLP | 26 | 11044.88 | 645.14 | 461.04 | 82.84 | 70.18 | 41.62 | 57.65 |
MLP | 37 | 12932.09 | 628.64 | 555.42 | 73.00 | 67.63 | 57.66 | 57.26 |
NB | 26 | 6947.67 | 297.87 | 185.73 | 75.80 | 67.42 | 62.90 | 56.11 |
NB | 37 | 5187.96 | 258.80 | 177.37 | 75.58 | 67.61 | 61.88 | 55.33 |
RF | 26 | 238621.20 | 11052.84 | 8997.31 | 97.12 | 71.58 | 75.97 | |
RF | 37 | 236598.44 | 9967.63 | 8879.97 | 96.57 | 91.05 | ||
RNN | 26 | 887664.31 | 48885.96 | 27869.30 | 90.70 | 74.91 | 73.08 | |
RNN | 37 | 312108.07 | 17120.90 | 10414.58 | 94.61 | 87.31 | 77.66 | 71.95 |
SVM | 26 | 6630490.84 | 464082.07 | 282026.57 | 78.34 | 67.04 | 68.16 | 56.91 |
SVM | 37 | 7792179.78 | 730786.06 | 429081.31 | 64.89 | 65.68 | 61.39 | 56.25 |
From the results above, it is clear that supervised learning models see a significant drop in classification accuracy when processes are killed as the result of a malicious label. This confirmation of the initial hypothesis presented here justifies the need to examine alternative methods. In the interests of future work and negative result reporting, this paper reports all of the methods attempted and finds that simple statistical manipulations on the supervised learning models perform better than using alternative training methods. This section briefly describes the logic of each method and provides a textual summary of the results with a formula where appropriate. This is followed by a table of the numerical results for each method. In the following section, let
Reasoning: Taking the average prediction across the whole process should smooth out spurious process-killing decisions.
Reasoning: Taking the average over a few measurements will eliminate those false positives that are caused by a single false positive over a subset of the execution trace. Window sizes of 2 to 5 are tested. Let
Reasoning: Like the rolling mean, single false positives will be eliminated, but unlike the rolling mean, the alerts are cumulative over the entire trace, such that a single alert at the start and another 30 seconds into the process will cause the process to be killed, rather than requiring that both alerts fall within a window of time. Between 2 and 5 minimum alerts are tested.
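Both filters can be expressed in a few lines; the 0.5 decision threshold, window size, and alert counts below are illustrative of the ranges tested:

```python
from collections import deque

def kill_time_rolling(probs, window=3, threshold=0.5):
    """Kill when the mean prediction over the last `window` measurements
    exceeds `threshold`; return the index of the kill, or None."""
    recent = deque(maxlen=window)
    for i, p in enumerate(probs):
        recent.append(p)
        if len(recent) == window and sum(recent) / window > threshold:
            return i
    return None

def kill_time_alerts(preds, min_alerts=2):
    """Kill once `min_alerts` malicious predictions have accumulated
    anywhere in the trace, however far apart they occur."""
    alerts = 0
    for i, p in enumerate(preds):
        alerts += int(p)
        if alerts >= min_alerts:
            return i
    return None

probs = [0.9, 0.1, 0.1, 0.1, 0.9, 0.8, 0.9]
print(kill_time_rolling(probs))                    # 5: lone early alert filtered out
print(kill_time_alerts([p > 0.5 for p in probs]))  # 4: alerts need not be adjacent
```

The contrast in the example shows the trade-off discussed above: the rolling mean forgets the isolated alert at index 0, while the cumulative threshold counts it toward the kill decision.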
Reasoning: The data are labelled at the application level; therefore, the average prediction across the process tree should be considered for classification.
Reasoning: The data are labelled at the application level; therefore, the sum of resources of each process tree should be classified at each measurement, not the individual processes.
Reasoning: Reinforcement learning is designed for state-action space learning. Both pre-training the model with a supervised learning approach and not pre-training the model were tested.
Reasoning: Though the DQN explores and exploits different state-action pairs and their associated rewards, when the reward for each action is known in advance and the training set is limited, as it is here, Q-learning can be framed as a regression problem in which the model tries to learn the return (reward + discounted future rewards). Training is then faster and any regression-capable algorithm can be used. Let
Table
Summary of the best process killing models by model training methodology. F1, TNR, and TPR for validation and test datasets (full results in Appendix Tables
Methodology | Best dataset | Model | Val | Test | |||||
---|---|---|---|---|---|---|---|---|---|
n features | F1 | tnr | tpr | F1 | tnr | tpr | |||
Supervised learning | Val | RF | 26 | 92.37 | 87.39 | 96.64 | 74.57 | 62.71 | 92.95 |
Test | RF | 37 | 89.68 | 83.19 | 94.96 | 76.43 | 67.19 | 92.52 | |
Rolling mean | Val | RF (min: 2) | 26 | 94.12 | 92.44 | 78.26 | 73.83 | 89.76 | |
Test | RF (min: 2) | 37 | 92.70 | 94.96 | 90.76 | 80.77 | 78.88 | 89.38 | |
Alert threshold | Val | DT (min: 2) | 26 | 92.17 | 95.80 | 89.08 | 73.43 | 67.44 | 86.56 |
Test | RF (min: 2) | 37 | 91.30 | 94.96 | 88.24 | 81.53 | 87.97 | ||
Process tree averaging | Val | RF | 26 | 92.74 | 88.24 | 96.64 | 74.79 | 64.04 | 92.20 |
Test | RF | 37 | 90.48 | 84.03 | 95.80 | 76.34 | 67.66 | 91.92 | |
Process tree training | Val | RF | 26 | 90.35 | 82.58 | 98.32 | 74.20 | 52.44 | 92.74 |
Test | RF | 26 | 90.35 | 82.58 | 98.32 | 74.20 | 52.44 | 92.74 | |
Q-learning | Val | DQN | 26 | 51.71 | 72.27 | 44.54 | 27.74 | 55.50 | 26.94 |
Test | DQN | 26 | 51.71 | 72.27 | 44.54 | 27.74 | 55.50 | 26.94 | |
Regression | Val | RF | 26 | 91.94 | 87.39 | 95.80 | 74.77 | 66.05 | 90.35 |
Test | RF | 26 | 91.94 | 87.39 | 95.80 | 74.77 | 66.05 | 90.35 |
Summary of process killing models, validation and test set score metrics [Table 1 of 3].
Model | Val | Test | ||||
---|---|---|---|---|---|---|
f1 | tnr | tpr | f1 | tnr | tpr | |
AdaBoostModel_glo_pro | 77.58 | 55.46 | 91.60 | 67.04 | 49.80 | 88.67 |
AdaBoostModel_glo_pro mean process tree | 78.01 | 55.46 | 92.44 | 66.75 | 50.48 | 87.59 |
AdaBoostModel_glo_pro process tree min alerts: 1 | 78.87 | 55.46 | 94.12 | 62.29 | 34.03 | 90.35 |
AdaBoostModel_glo_pro process tree min alerts: 2 | 78.87 | 55.46 | 94.12 | 62.29 | 34.03 | 90.35 |
AdaBoostModel_glo_pro process tree min alerts: 3 | 78.87 | 55.46 | 94.12 | 62.29 | 34.03 | 90.35 |
AdaBoostModel_glo_pro process tree min alerts: 4 | 78.87 | 55.46 | 94.12 | 62.29 | 34.03 | 90.35 |
AdaBoostModel_glo_pro rolling mean window: 2 | 79.22 | 70.59 | 84.87 | 69.57 | 60.88 | 84.88 |
AdaBoostModel_glo_pro rolling mean window: 3 | 79.37 | 72.27 | 84.03 | 69.59 | 61.53 | 84.39 |
AdaBoostModel_glo_pro rolling mean window: 4 | 80.67 | 80.67 | 80.67 | 68.44 | 67.80 | 77.34 |
AdaBoostModel_glo_pro sum alerts min: 2 | 80.66 | 78.15 | 82.35 | 69.35 | 66.58 | 79.89 |
AdaBoostModel_glo_pro sum alerts min: 3 | 81.20 | 83.19 | 79.83 | 67.83 | 70.92 | 73.88 |
AdaBoostModel_glo_pro sum alerts min: 4 | 80.87 | 84.87 | 78.15 | 65.92 | 73.32 | 69.00 |
AdaBoostModel_pro | 75.34 | 47.06 | 92.44 | 65.64 | 45.79 | 88.89 |
AdaBoostModel_pro mean process tree | 75.86 | 48.74 | 92.44 | 65.74 | 47.83 | 87.59 |
AdaBoostModel_pro process tree min alerts: 1 | 75.68 | 45.38 | 94.12 | 60.31 | 26.46 | 91.17 |
AdaBoostModel_pro process tree min alerts: 2 | 75.68 | 45.38 | 94.12 | 60.31 | 26.46 | 91.17 |
AdaBoostModel_pro process tree min alerts: 3 | 75.68 | 45.38 | 94.12 | 60.31 | 26.46 | 91.17 |
AdaBoostModel_pro process tree min alerts: 4 | 75.68 | 45.38 | 94.12 | 60.31 | 26.46 | 91.17 |
AdaBoostModel_pro rolling mean window: 2 | 78.03 | 64.71 | 86.55 | 69.35 | 59.63 | 85.47 |
AdaBoostModel_pro rolling mean window: 3 | 77.99 | 67.23 | 84.87 | 68.91 | 59.20 | 84.99 |
AdaBoostModel_pro rolling mean window: 4 | 80.83 | 79.83 | 81.51 | 69.01 | 66.44 | 79.40 |
AdaBoostModel_pro sum alerts min: 2 | 81.12 | 75.63 | 84.87 | 67.91 | 61.06 | 81.68 |
AdaBoostModel_pro sum alerts min: 3 | 81.17 | 80.67 | 81.51 | 66.64 | 64.97 | 76.42 |
AdaBoostModel_pro sum alerts min: 4 | 79.66 | 80.67 | 78.99 | 64.25 | 68.12 | 70.14 |
AdaBoostModel_pro_tree | 75.08 | 47.73 | 94.96 | 64.12 | 30.58 | 86.60 |
AdaBoostRegression_pro_process | 56.63 | 100.00 | 39.50 | 15.06 | 97.92 | 8.40 |
DTModel_glo_pro | 84.53 | 71.43 | 94.12 | 71.41 | 58.19 | 90.62 |
DTModel_glo_pro mean process tree | 85.93 | 73.95 | 94.96 | 71.49 | 59.48 | 89.70 |
DTModel_glo_pro process tree min alerts: 1 | 84.64 | 70.59 | 94.96 | 65.59 | 42.42 | 91.27 |
DTModel_glo_pro process tree min alerts: 2 | 84.64 | 70.59 | 94.96 | 65.56 | 42.34 | 91.27 |
DTModel_glo_pro process tree min alerts: 3 | 84.64 | 70.59 | 94.96 | 65.56 | 42.34 | 91.27 |
DTModel_glo_pro process tree min alerts: 4 | 84.64 | 70.59 | 94.96 | 65.56 | 42.34 | 91.27 |
DTModel_glo_pro rolling mean window: 2 | 88.70 | 88.24 | 89.08 | 75.09 | 70.49 | 86.94 |
DTModel_glo_pro rolling mean window: 3 | 87.87 | 87.39 | 88.24 | 74.57 | 69.77 | 86.61 |
DTModel_glo_pro rolling mean window: 4 | 89.08 | 93.28 | 85.71 | 74.04 | 74.29 | 81.63 |
DTModel_glo_pro sum alerts min: 2 | 89.74 | 91.60 | 88.24 | 75.48 | 72.64 | 85.69 |
DTModel_glo_pro sum alerts min: 3 | 89.47 | 94.12 | 85.71 | 74.00 | 75.62 | 80.38 |
DTModel_glo_pro sum alerts min: 4 | 88.39 | 94.96 | 83.19 | 70.19 | 77.30 | 72.63 |
DTModel_pro | 89.76 | 82.35 | 95.80 | 71.53 | 56.54 | 92.25 |
DTModel_pro mean process tree | 90.91 | 84.03 | 96.64 | 72.13 | 59.38 | 91.06 |
DTModel_pro process tree min alerts: 1 | 90.20 | 82.35 | 96.64 | 64.45 | 37.04 | 92.79 |
DTModel_pro process tree min alerts: 2 | 90.20 | 82.35 | 96.64 | 64.42 | 36.97 | 92.79 |
DTModel_pro process tree min alerts: 3 | 90.20 | 82.35 | 96.64 | 64.42 | 36.97 | 92.79 |
DTModel_pro process tree min alerts: 4 | 90.20 | 82.35 | 96.64 | 64.42 | 36.97 | 92.79 |
DTModel_pro rolling mean window: 2 | 93.16 | 94.96 | 91.60 | 73.82 | 66.19 | 88.40 |
DTModel_pro rolling mean window: 3 | 91.77 | 94.96 | 89.08 | 73.49 | 66.15 | 87.80 |
DTModel_pro rolling mean window: 4 | 90.75 | 95.80 | 86.55 | 72.05 | 69.38 | 82.38 |
DTModel_pro sum alerts min: 2 | 92.17 | 95.80 | 89.08 | 73.43 | 67.44 | 86.56 |
DTModel_pro sum alerts min: 3 | 90.75 | 95.80 | 86.55 | 71.53 | 69.63 | 81.25 |
DTModel_pro sum alerts min: 4 | 89.29 | 95.80 | 84.03 | 67.58 | 70.81 | 73.55 |
DTModel_pro_tree | 85.93 | 73.48 | 97.48 | 70.40 | 43.02 | 91.57 |
DTRegression_pro_process | 89.06 | 80.67 | 95.80 | 71.62 | 57.98 | 91.22 |
GBDTModel_glo_pro | 80.44 | 63.87 | 91.60 | 72.62 | 59.73 | 91.71 |
GBDTModel_glo_pro mean process tree | 80.88 | 63.87 | 92.44 | 72.76 | 60.63 | 91.22 |
GBDTModel_glo_pro process tree min alerts: 1 | 81.75 | 63.87 | 94.12 | 66.32 | 43.28 | 92.14 |
GBDTModel_glo_pro process tree min alerts: 2 | 81.75 | 63.87 | 94.12 | 66.32 | 43.28 | 92.14 |
GBDTModel_glo_pro process tree min alerts: 3 | 81.75 | 63.87 | 94.12 | 66.32 | 43.28 | 92.14 |
GBDTModel_glo_pro process tree min alerts: 4 | 81.75 | 63.87 | 94.12 | 66.32 | 43.28 | 92.14 |
GBDTModel_glo_pro rolling mean window: 2 | 85.12 | 83.19 | 86.55 | 76.06 | 71.50 | 87.80 |
GBDTModel_glo_pro rolling mean window: 3 | 84.52 | 84.03 | 84.87 | 75.87 | 71.82 | 87.15 |
GBDTModel_glo_pro rolling mean window: 4 | 84.12 | 86.55 | 82.35 | 75.25 | 76.69 | 81.57 |
GBDTModel_glo_pro sum alerts min: 2 | 84.87 | 84.87 | 84.87 | 76.46 | 75.65 | 84.66 |
GBDTModel_glo_pro sum alerts min: 3 | 85.22 | 89.08 | 82.35 | 74.12 | 78.77 | 77.78 |
GBDTModel_glo_pro sum alerts min: 4 | 84.44 | 90.76 | 79.83 | 71.99 | 81.10 | 72.30 |
GBDTModel_pro | 80.73 | 62.18 | 93.28 | 71.31 | 58.09 | 90.51 |
GBDTModel_pro mean process tree | 82.05 | 64.71 | 94.12 | 71.76 | 59.59 | 90.14 |
GBDTModel_pro process tree min alerts: 1 | 80.71 | 59.66 | 94.96 | 64.88 | 40.34 | 91.33 |
GBDTModel_pro process tree min alerts: 2 | 80.71 | 59.66 | 94.96 | 64.87 | 40.30 | 91.33 |
GBDTModel_pro process tree min alerts: 3 | 80.71 | 59.66 | 94.96 | 64.87 | 40.30 | 91.33 |
GBDTModel_pro process tree min alerts: 4 | 80.71 | 59.66 | 94.96 | 64.87 | 40.30 | 91.33 |
GBDTModel_pro rolling mean window: 2 | 84.68 | 79.83 | 88.24 | 75.08 | 71.60 | 85.91 |
GBDTModel_pro rolling mean window: 3 | 84.08 | 80.67 | 86.55 | 74.99 | 71.71 | 85.64 |
GBDTModel_pro rolling mean window: 4 | 84.39 | 84.87 | 84.03 | 73.91 | 76.05 | 79.84 |
GBDTModel_pro sum alerts min: 2 | 85.48 | 84.03 | 86.55 | 74.50 | 74.40 | 82.33 |
GBDTModel_pro sum alerts min: 3 | 85.11 | 86.55 | 84.03 | 72.35 | 76.77 | 76.59 |
GBDTModel_pro sum alerts min: 4 | 83.84 | 88.24 | 80.67 | 70.17 | 78.38 | 71.71 |
GBDTModel_pro_tree | 79.02 | 59.09 | 94.96 | 71.08 | 46.68 | 90.51 |
GBDTRegression_pro_process | 89.71 | 87.39 | 91.60 | 71.84 | 80.57 | 72.52 |
MLPModel_glo_pro | 66.67 | 13.45 | 93.28 | 57.92 | 19.00 | 90.68 |
MLPModel_glo_pro mean process tree | 67.48 | 16.81 | 93.28 | 59.79 | 25.74 | 90.51 |
MLPModel_glo_pro process tree min alerts: 1 | 67.46 | 13.45 | 94.96 | 57.61 | 17.64 | 90.84 |
MLPModel_glo_pro process tree min alerts: 2 | 67.46 | 13.45 | 94.96 | 57.61 | 17.64 | 90.84 |
MLPModel_glo_pro process tree min alerts: 3 | 67.46 | 13.45 | 94.96 | 57.61 | 17.64 | 90.84 |
MLPModel_glo_pro process tree min alerts: 4 | 67.46 | 13.45 | 94.96 | 57.61 | 17.64 | 90.84 |
MLPModel_glo_pro rolling mean window: 2 | 67.96 | 28.57 | 88.24 | 58.73 | 32.20 | 84.17 |
MLPModel_glo_pro rolling mean window: 3 | 67.79 | 34.45 | 84.87 | 58.90 | 34.31 | 83.20 |
MLPModel_glo_pro rolling mean window: 4 | 68.75 | 41.18 | 83.19 | 58.32 | 40.55 | 78.16 |
MLPModel_glo_pro sum alerts min: 2 | 68.94 | 38.66 | 84.87 | 59.51 | 39.19 | 81.30 |
MLPModel_glo_pro sum alerts min: 3 | 69.04 | 45.38 | 81.51 | 58.47 | 43.17 | 76.80 |
MLPModel_glo_pro sum alerts min: 4 | 70.63 | 53.78 | 79.83 | 57.23 | 47.72 | 71.76 |
Summary of process killing models, validation, and test set score metrics [Table 2 of 3].
Model | Val F1 | Val TNR | Val TPR | Test F1 | Test TNR | Test TPR
---|---|---|---|---|---|---
MLPModel_pro | 71.43 | 56.30 | 79.83 | 57.54 | 52.53 | 69.38 |
MLPModel_pro mean process tree | 72.18 | 57.14 | 80.67 | 57.06 | 53.53 | 67.97 |
MLPModel_pro process tree min alerts: 1 | 72.32 | 54.62 | 82.35 | 57.41 | 49.41 | 71.06 |
MLPModel_pro process tree min alerts: 2 | 72.32 | 54.62 | 82.35 | 57.41 | 49.41 | 71.06 |
MLPModel_pro process tree min alerts: 3 | 72.32 | 54.62 | 82.35 | 57.41 | 49.41 | 71.06 |
MLPModel_pro process tree min alerts: 4 | 72.32 | 54.62 | 82.35 | 57.41 | 49.41 | 71.06 |
MLPModel_pro rolling mean window: 2 | 71.77 | 66.39 | 74.79 | 48.88 | 63.18 | 50.35 |
MLPModel_pro rolling mean window: 3 | 72.65 | 68.91 | 74.79 | 49.23 | 63.71 | 50.57 |
MLPModel_pro rolling mean window: 4 | 72.34 | 73.95 | 71.43 | 46.40 | 67.05 | 45.26 |
MLPModel_pro sum alerts min: 2 | 73.86 | 72.27 | 74.79 | 47.14 | 66.40 | 46.50 |
MLPModel_pro sum alerts min: 3 | 73.68 | 78.99 | 70.59 | 45.27 | 68.12 | 43.36 |
MLPModel_pro sum alerts min: 4 | 73.30 | 82.35 | 68.07 | 44.48 | 69.63 | 41.73 |
MLPModel_pro_tree | 71.38 | 31.82 | 97.48 | 64.36 | 21.07 | 92.52 |
MLPRegression_pro_process | 38.89 | 53.78 | 35.29 | 54.83 | 48.73 | 67.05 |
MLPRegression_pro_process mean process tree | 37.32 | 57.14 | 32.77 | 56.75 | 57.37 | 65.15 |
NBModel_glo_pro | 67.25 | 9.24 | 96.64 | 55.07 | 10.36 | 89.49 |
NBModel_glo_pro mean process tree | 67.25 | 9.24 | 96.64 | 55.62 | 12.69 | 89.38 |
NBModel_glo_pro process tree min alerts: 1 | 67.25 | 9.24 | 96.64 | 55.00 | 10.18 | 89.43 |
NBModel_glo_pro process tree min alerts: 2 | 67.25 | 9.24 | 96.64 | 55.00 | 10.18 | 89.43 |
NBModel_glo_pro process tree min alerts: 3 | 67.25 | 9.24 | 96.64 | 55.00 | 10.18 | 89.43 |
NBModel_glo_pro process tree min alerts: 4 | 67.25 | 9.24 | 96.64 | 55.00 | 10.18 | 89.43 |
NBModel_glo_pro rolling mean window: 2 | 67.69 | 19.33 | 92.44 | 55.20 | 15.10 | 87.05 |
NBModel_glo_pro rolling mean window: 3 | 67.73 | 26.05 | 89.08 | 55.32 | 17.32 | 86.02 |
NBModel_glo_pro rolling mean window: 4 | 67.76 | 31.09 | 86.55 | 54.28 | 21.08 | 81.68 |
NBModel_glo_pro sum alerts min: 2 | 68.17 | 27.73 | 89.08 | 55.70 | 19.40 | 85.64 |
NBModel_glo_pro sum alerts min: 3 | 67.99 | 31.93 | 86.55 | 54.42 | 22.27 | 81.30 |
NBModel_glo_pro sum alerts min: 4 | 68.03 | 36.97 | 84.03 | 51.32 | 25.39 | 73.44 |
NBModel_pro | 67.06 | 8.40 | 96.64 | 55.60 | 7.03 | 92.63 |
NBModel_pro mean process tree | 67.06 | 8.40 | 96.64 | 56.17 | 9.61 | 92.41 |
NBModel_pro process tree min alerts: 1 | 67.06 | 8.40 | 96.64 | 55.56 | 6.78 | 92.68 |
NBModel_pro process tree min alerts: 2 | 67.06 | 8.40 | 96.64 | 55.56 | 6.78 | 92.68 |
NBModel_pro process tree min alerts: 3 | 67.06 | 8.40 | 96.64 | 55.56 | 6.78 | 92.68 |
NBModel_pro process tree min alerts: 4 | 67.06 | 8.40 | 96.64 | 55.56 | 6.78 | 92.68 |
NBModel_pro rolling mean window: 2 | 67.69 | 19.33 | 92.44 | 56.01 | 13.41 | 89.81 |
NBModel_pro rolling mean window: 3 | 67.52 | 25.21 | 89.08 | 56.18 | 15.81 | 88.78 |
NBModel_pro rolling mean window: 4 | 67.99 | 31.93 | 86.55 | 54.94 | 19.61 | 83.90 |
NBModel_pro sum alerts min: 2 | 68.61 | 29.41 | 89.08 | 56.52 | 18.14 | 88.13 |
NBModel_pro sum alerts min: 3 | 68.21 | 32.77 | 86.55 | 55.27 | 21.30 | 83.63 |
NBModel_pro sum alerts min: 4 | 67.81 | 37.82 | 83.19 | 52.25 | 24.42 | 75.77 |
NBModel_pro_tree | 66.10 | 10.61 | 98.32 | 61.25 | 8.63 | 92.69 |
NBModel_pro_tree mean process tree | 66.10 | 10.61 | 98.32 | 61.25 | 8.63 | 92.69 |
RFModel_glo_pro | 89.68 | 83.19 | 94.96 | 76.43 | 67.19 | 92.52 |
RFModel_glo_pro mean process tree | 90.48 | 84.03 | 95.80 | 76.34 | 67.66 | 91.92 |
RFModel_glo_pro process tree min alerts: 1 | 90.91 | 84.03 | 96.64 | 69.45 | 50.56 | 92.95 |
RFModel_glo_pro process tree min alerts: 2 | 90.91 | 84.03 | 96.64 | 69.45 | 50.56 | 92.95 |
RFModel_glo_pro process tree min alerts: 3 | 90.91 | 84.03 | 96.64 | 69.45 | 50.56 | 92.95 |
RFModel_glo_pro process tree min alerts: 4 | 90.91 | 84.03 | 96.64 | 69.45 | 50.56 | 92.95 |
RFModel_glo_pro rolling mean window: 2 | 92.70 | 94.96 | 90.76 | 80.77 | 78.88 | 89.38 |
RFModel_glo_pro rolling mean window: 3 | 91.30 | 94.96 | 88.24 | 80.19 | 78.67 | 88.51 |
RFModel_glo_pro rolling mean window: 4 | 90.27 | 95.80 | 85.71 | 79.86 | 82.86 | 83.69 |
RFModel_glo_pro sum alerts min: 2 | 91.30 | 94.96 | 88.24 | 81.50 | 81.53 | 87.97 |
RFModel_glo_pro sum alerts min: 3 | 90.27 | 95.80 | 85.71 | 79.99 | 84.01 | 82.76 |
RFModel_glo_pro sum alerts min: 4 | 88.79 | 95.80 | 83.19 | 76.11 | 85.37 | 75.01 |
RFModel_pro | 92.37 | 87.39 | 96.64 | 74.57 | 62.71 | 92.95 |
RFModel_pro mean process tree | 92.74 | 88.24 | 96.64 | 74.79 | 64.04 | 92.20 |
RFModel_pro process tree min alerts: 1 | 92.74 | 88.24 | 96.64 | 68.75 | 48.23 | 93.39 |
RFModel_pro process tree min alerts: 2 | 92.74 | 88.24 | 96.64 | 68.75 | 48.23 | 93.39 |
RFModel_pro process tree min alerts: 3 | 92.74 | 88.24 | 96.64 | 68.75 | 48.23 | 93.39 |
RFModel_pro process tree min alerts: 4 | 92.74 | 88.24 | 96.64 | 68.75 | 48.23 | 93.39 |
RFModel_pro rolling mean window: 2 | 93.22 | 94.12 | 92.44 | 78.28 | 73.83 | 89.76 |
RFModel_pro rolling mean window: 3 | 91.38 | 94.12 | 89.08 | 77.47 | 73.25 | 88.78 |
RFModel_pro rolling mean window: 4 | 89.96 | 94.12 | 86.55 | 77.08 | 77.70 | 83.85 |
RFModel_pro sum alerts min: 2 | 91.38 | 94.12 | 89.08 | 77.98 | 74.65 | 88.40 |
RFModel_pro sum alerts min: 3 | 89.96 | 94.12 | 86.55 | 77.05 | 77.52 | 83.96 |
RFModel_pro sum alerts min: 4 | 88.50 | 94.12 | 84.03 | 73.11 | 79.35 | 75.61 |
RFModel_pro_tree | 90.35 | 82.58 | 98.32 | 74.20 | 52.44 | 92.74 |
RFRegression_pro_process | 91.94 | 87.39 | 95.80 | 74.77 | 66.05 | 90.35 |
SVMModel_glo_pro | 65.23 | 15.97 | 89.08 | 57.34 | 24.24 | 86.23 |
SVMModel_glo_pro mean process tree | 65.23 | 15.97 | 89.08 | 58.11 | 27.39 | 85.91 |
SVMModel_glo_pro process tree min alerts: 1 | 65.23 | 15.97 | 89.08 | 57.32 | 23.81 | 86.45 |
SVMModel_glo_pro process tree min alerts: 2 | 65.23 | 15.97 | 89.08 | 57.32 | 23.81 | 86.45 |
SVMModel_glo_pro process tree min alerts: 3 | 65.23 | 15.97 | 89.08 | 57.32 | 23.81 | 86.45 |
SVMModel_glo_pro process tree min alerts: 4 | 65.23 | 15.97 | 89.08 | 57.32 | 23.81 | 86.45 |
SVMModel_glo_pro rolling mean window: 2 | 65.15 | 26.05 | 84.03 | 57.98 | 33.52 | 81.84 |
SVMModel_glo_pro rolling mean window: 3 | 64.65 | 31.09 | 80.67 | 58.14 | 35.46 | 80.98 |
SVMModel_glo_pro rolling mean window: 4 | 64.31 | 38.66 | 76.47 | 56.76 | 40.37 | 75.34 |
SVMModel_glo_pro sum alerts min: 2 | 65.05 | 36.13 | 78.99 | 58.35 | 39.08 | 79.13 |
SVMModel_glo_pro sum alerts min: 3 | 64.75 | 42.02 | 75.63 | 57.05 | 43.24 | 74.15 |
SVMModel_glo_pro sum alerts min: 4 | 64.89 | 51.26 | 71.43 | 54.70 | 47.40 | 67.59 |
SVMModel_pro | 66.47 | 5.88 | 96.64 | 56.92 | 10.33 | 93.71 |
Summary of process killing models, validation, and test set score metrics [Table 3 of 3].
Model | Val F1 | Val TNR | Val TPR | Test F1 | Test TNR | Test TPR
---|---|---|---|---|---|---
SVMModel_pro mean process tree | 67.25 | 9.24 | 96.64 | 57.55 | 13.34 | 93.33 |
SVMModel_pro process tree min alerts: 1 | 66.47 | 5.88 | 96.64 | 56.21 | 7.28 | 93.88 |
SVMModel_pro process tree min alerts: 2 | 66.47 | 5.88 | 96.64 | 56.21 | 7.28 | 93.88 |
SVMModel_pro process tree min alerts: 3 | 66.47 | 5.88 | 96.64 | 56.21 | 7.28 | 93.88 |
SVMModel_pro process tree min alerts: 4 | 66.47 | 5.88 | 96.64 | 56.21 | 7.28 | 93.88 |
SVMModel_pro rolling mean window: 2 | 66.87 | 15.97 | 92.44 | 58.60 | 22.02 | 90.30 |
SVMModel_pro rolling mean window: 3 | 67.30 | 24.37 | 89.08 | 58.82 | 24.42 | 89.27 |
SVMModel_pro rolling mean window: 4 | 67.99 | 31.93 | 86.55 | 57.98 | 28.97 | 84.66 |
SVMModel_pro sum alerts min: 2 | 67.96 | 28.57 | 88.24 | 59.52 | 27.61 | 88.73 |
SVMModel_pro sum alerts min: 3 | 68.90 | 35.29 | 86.55 | 59.06 | 33.35 | 84.12 |
SVMModel_pro sum alerts min: 4 | 68.75 | 41.18 | 83.19 | 56.68 | 38.87 | 76.10 |
SVMModel_pro_tree | 65.73 | 9.09 | 98.32 | 61.79 | 9.88 | 93.19 |
DQN | 51.71 | 72.27 | 44.54 | 27.74 | 55.50 | 26.94 |
random_search_glo_pro_RNN | 87.69 | 77.31 | 95.80 | 71.83 | 59.63 | 90.24 |
random_search_glo_pro_RNN mean process tree | 88.03 | 78.15 | 95.80 | 72.50 | 61.67 | 89.81 |
random_search_glo_pro_RNN_Regression | 85.71 | 72.27 | 95.80 | 72.44 | 61.78 | 89.59 |
random_search_pro_RNN | 91.20 | 85.71 | 95.80 | 72.63 | 59.63 | 91.82 |
random_search_pro_RNN mean process tree | 91.20 | 85.71 | 95.80 | 73.03 | 60.92 | 91.49 |
random_search_pro_RNN_Regression | 88.37 | 78.99 | 95.80 | 72.71 | 60.70 | 91.06 |
random_search_pro_RNN_tree | 88.19 | 80.67 | 94.12 | 73.72 | 65.79 | 88.56 |
In most cases, the models with the highest F1-score on the validation and test sets are not the same. The highest test-set F1-score is 81.50, achieved by an RF using a minimum alert threshold of 2 and both process-level and global process metrics.
Although the proposed model is motivated by the desire to prevent malware from executing, the best-F1 model above achieves a TNR of only 81.53%. Killing almost 20% of benign processes would not be acceptable to a user. Whilst this research is a novel attempt at very early-stage real-time malware detection and process killing, usability considerations may favour a model with a very high TNR, even at the expense of the TPR.
Considering this, the AdaBoost regression algorithm achieves a 100% TNR with a 39.50% TPR on the validation set. The high TNR is retained on the test set, standing at 97.92%, but the TPR drops even further to just 8.40%. The GBDT also uses regression to estimate the value of process killing and, coupled with a minimum of 4 alerts, performs well on the test set but does not stand out in the validation set, see Table
Two models’ F1-score, TNR, TPR for the validation and test set scoring the highest TNR on the validation and test sets.
Methodology | Model | n features | Val F1 | Val TNR | Val TPR | Test F1 | Test TNR | Test TPR
---|---|---|---|---|---|---|---|---
Regression | AdaBoost | 26 | 56.63 | 100.00 | 39.50 | 15.06 | 97.92 | 8.40 |
Regression + 4 alerts | GBDT | 26 | 85.91 | 95.80 | 77.31 | 68.50 | 94.98 | 56.04 |
Although fewer than 10% of the malicious processes in the test set are killed by the AdaBoost regressor, this model may be the most viable despite the low TPR. Future work may examine the precise behaviour and harm caused by malware that is and is not detected. To summarise the results, the most-detected families were Ekstak (180 processes), Mikey (80), Prepscram (53), and Zusy (49) of the 745 total samples.
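The statistical filters that recur throughout the tables above (rolling mean windows and minimum-alert thresholds) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation; the 0.5 decision boundary and default parameters are assumptions.

```python
def kill_by_rolling_mean(probs, window=2, threshold=0.5):
    """Kill once the rolling mean of the last `window` per-snapshot
    malicious probabilities exceeds `threshold`; return the kill
    step index, or None if the process is never killed."""
    for t in range(window - 1, len(probs)):
        if sum(probs[t - window + 1 : t + 1]) / window > threshold:
            return t
    return None

def kill_by_min_alerts(labels, min_alerts=2):
    """Kill once the cumulative count of per-snapshot 'malicious'
    classifications reaches `min_alerts`; return the kill step or None."""
    alerts = 0
    for t, malicious in enumerate(labels):
        alerts += int(malicious)
        if alerts >= min_alerts:
            return t
    return None

# A single transient spike no longer kills the process:
probs = [0.9, 0.1, 0.2, 0.3]
print(kill_by_rolling_mean(probs))                   # None
print(kill_by_min_alerts([p > 0.5 for p in probs]))  # None
```

Both filters trade a slower response for fewer false positives, which matches the pattern in the tables: TNR rises and TPR falls as the window or threshold grows.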
Although a high percentage of processes are correctly identified as malicious by the best performing model (RF with 2 alerts and 37 features), it may be that the model detects the malware only after it has already caused damage to the endpoint. Therefore, instead of looking at the time at which the malware is correctly detected, a live test was carried out with ransomware to measure the percentage of files corrupted with and without the process killing model working. This real-time test also assesses whether malware can indeed be detected in the early stages of execution or whether the data recording, model inference, and process killing are too slow in practice to prevent damage.
Ransomware is the broad term given to malware that prevents access to user data (often by encrypting files) and withholds the means of restoring that data (usually a decryption key) until a ransom is paid. It is possible to quantify the damage caused by ransomware using the proportion of modified files, as Scaife et al. [
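The proportion-of-modified-files measure can be approximated by hashing files before detonation and re-hashing afterwards. The sketch below is an assumption about how such a measurement might be tooled, not the paper's harness; it also reproduces the damage-reduction arithmetic used in the results table.

```python
import hashlib
from pathlib import Path

def snapshot(root):
    """Map each file under `root` to a SHA-256 digest of its contents."""
    return {p: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(root).rglob("*") if p.is_file()}

def damage(before, after):
    """Proportion of originally present files modified or removed."""
    if not before:
        return 0.0
    changed = sum(1 for p, digest in before.items() if after.get(p) != digest)
    return changed / len(before)

def damage_reduction(files_damaged_without_kill, files_damaged_with_kill):
    """Percentage of files spared relative to the no-killing baseline,
    floored at zero when killing performs worse than the baseline."""
    return max(0.0, 1 - files_damaged_with_kill / files_damaged_without_kill)

# RF row of the results table: 1,464 files damaged versus 19,997
# with no killing gives a 92.68% damage reduction.
print(round(100 * damage_reduction(19997, 1464), 2))  # 92.68
```

The floor at zero mirrors the AdaBoost row, where run-to-run variance in ransomware behaviour left more files encrypted than the no-killing baseline.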
Although the RF with a minimum of 2 alerts using both process and global data gave the highest F1-score on the test set (81.50), earlier experiments showed that RFs are not among the most computationally efficient of the models tested. Therefore, a decision tree trained on process-only data (26 features) is also included in this test, in case time-to-classification matters for damage reduction despite the lower F1-score. The DT also has a very slightly higher TPR (see Table
Random Forest with a minimum requirement of two alerts (“malicious classifications”) and Decision Tree with a rolling mean window of two, each used to decide when to kill a process. F1, TNR, and TPR reported on the validation and test sets.
Model | n features | Val F1 | Val TNR | Val TPR | Test F1 | Test TNR | Test TPR
---|---|---|---|---|---|---|---
RF (alerts: 2) | 37 | 91.30 | 94.96 | 88.24 | 81.50 | 81.53 | 87.97
DT (rolling mean: 2) | 26 | 93.16 | 94.96 | 91.60 | 73.82 | 66.19 | 88.40
Twenty-two fast-acting ransomware samples were identified from a separate VirusShare [
The 22 samples were executed for 30 seconds each without the process killing model and the number of files modified was recorded. The process was repeated with 4 process killing models: the DT with a rolling mean window of 2 and 26 features, the RF with a minimum of 2 alerts and 37 features, the AdaBoost regressor with 26 features, and the GBDT regressor with a minimum of 4 alerts and 26 features.
It was necessary to run the killing model with administrator privileges and to write an exception for the Cuckoo sandbox agent process, which enables the host machine to read data from the guest machine, since the models otherwise killed this process. The need for this exception highlights that there are benign applications with malicious-like behaviours, perhaps especially those used for networking and security.
Figure
Total number of files corrupted by ransomware with no process killing and with four process killing models within the first 30 seconds of execution. Damage reduction is the percentage of files spared relative to the no-killing baseline.
Model | Files damaged | Damage reduction | Detection rate (ransomware TPR) | Test set TPR |
---|---|---|---|---|
No killing | 19,997 | — | — | — |
DT pro rolling mean 2 | 3 | 99.98% | 100.00 | 88.40 |
RF glo + pro min alerts 2 | 1,464 | 92.68% | 100.00 | 87.97 |
GBDT regressor + min 4 alerts | 15,432 | 22.83% | 22.07 | 56.04 |
AdaBoost regressor | 20,578 | 0.00% | 9.09 | 8.83 |
The DT model almost entirely eliminates file corruption, with only three files corrupted. The RF saves 92.68% of files. The ordinal ranking of “damage prevention” matches that of TPR on the test set, although the relationship is not proportional. This matching ordinal relationship indicates that the simulated impact of process killing on the collected test set was perhaps a reasonable approximation of at least fast-acting ransomware damage, despite the test-set TPR metrics being based on other malware families too.
The DT demonstrates that this architecture is capable of preventing damage, but its TNR on the test set is so low (66.19) that it cannot be preferred to the RF (81.53 TNR), which still prevents over 90% of file damage.
The GBDT prevents some damage and detects a comparable proportion of the ransomware samples (roughly 1 in 5). The AdaBoost regressor detected 2 of the 22 ransomware samples, and in these two cases more than 64% and 45% of files were saved, respectively. Perhaps with more execution time further samples would have been detected, but the key benefit of process killing is to stop damaging software like these ransomware samples, and this algorithm actually saw more files encrypted than when no killing model was used; this is because there is slight variance in the ransomware behaviour and execution time on each run. The Random Forest is the most plausible model, balancing damage prevention and TNR; its delay in classification may be a result of the requirement to collect more features and/or the inference time of the model itself.
Although algorithm execution duration was measured above, due to the batch processing used by the models, the number of processes being classified can be increased by an order of magnitude with a negligible impact on execution time. The data collection and process killing both scale linearly with the number of processes being monitored.
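The shape of this monitoring loop — per-process data collection, a single batched inference call, and killing of flagged processes — can be pictured with a minimal sketch. The feature source, classifier, and kill routine below are stand-in stubs for illustration, not the paper's implementation.

```python
def monitor_tick(pids, collect_features, classify_batch, kill):
    """One iteration of the monitor: O(n) feature collection and
    killing around a single batched inference call over all n
    live processes. Returns the list of PIDs that were killed."""
    feats = [collect_features(pid) for pid in pids]   # O(n) collection
    verdicts = classify_batch(feats)                  # one batched call
    killed = [pid for pid, bad in zip(pids, verdicts) if bad]
    for pid in killed:
        kill(pid)                                     # O(#malicious)
    return killed

# Stub components (hypothetical PIDs and a one-feature "model").
features = {101: [0.1], 202: [0.9], 303: [0.2]}
killed_log = []
out = monitor_tick(
    pids=list(features),
    collect_features=features.__getitem__,
    classify_batch=lambda batch: [f[0] > 0.5 for f in batch],
    kill=killed_log.append,
)
print(out)  # [202]
```

Because the only model invocation is the batched call, adding more processes mainly grows the linear collection and killing stages, consistent with the scaling observed above.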
Some processes demand more computational resources than others, and some malware in our test set locked pages in memory [
These differences in behaviour can cause the evaluation time to lag as demonstrated by the outlier points visible in Figure
Mean time to collect data, analyse data with Random Forest, and kill varying numbers of processes.
The experiments in this paper address a largely unexplored area of malware detection, by comparison with post-trace classification. Real-time processing and response has a number of benefits outlined above and the results presented here give tentative indications of the advantages and challenges of such an approach.
The initial experiments (Section
The next set of experiments (Section
Attempting to improve detection accuracy, three approaches were tested: statistical filtering, reinforcement learning, and a regression model estimating the utility (q-value) of killing a process. Statistical filters using rolling means or alert thresholds were the only approach to improve on the supervised learning model's F1-score. The reinforcement learning agent tended to kill processes too early, and therefore did not explore enough scenarios (and receive the requisite reinforcement) to learn to allow benign processes to continue; this does not mean that future models could not improve upon this result. That possibility is supported by the success of the regression models in maintaining a high true-negative rate, given that these models ascribed a similar utility to killing processes as the reinforcement learning models.
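The regression approach can be sketched through how its training targets might be constructed: each snapshot of a process is assigned an estimated utility (“q-value”) of killing at that moment, positive for malicious processes and negative for benign ones, and a process is killed only when the predicted utility is positive. The specific target scheme below is a hypothetical illustration, not the paper's exact design.

```python
def q_targets(is_malicious, n_snapshots):
    """Utility-of-killing targets for each snapshot of one process.
    Killing malware earlier is worth more (less damage done); killing
    benignware is penalised equally at any time. Illustrative scheme."""
    if is_malicious:
        # value decays as the malware gets more time to do damage
        return [1.0 - t / n_snapshots for t in range(n_snapshots)]
    return [-1.0] * n_snapshots

def should_kill(predicted_q):
    """Kill only when the model expects positive utility from killing."""
    return predicted_q > 0

print(q_targets(True, 4))   # [1.0, 0.75, 0.5, 0.25]
print(q_targets(False, 2))  # [-1.0, -1.0]
```

A regressor fitted to such targets errs towards inaction on benignware (any negative prediction spares the process), which is consistent with the high TNRs the regression models achieved.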
The accuracy metrics tested thus far simply indicate whether a process was ever killed, but do not address whether damage was actually prevented by process killing. If damage is not prevented, there is little point in process killing, and a database of alerts for later analysis would be a better solution, since it eliminates the risk of killing benignware. This is why the final set of experiments in Section
Real-time attack detection has wider applications than endpoint detection, as Alazab et al. [
However, some challenges remain to be solved; behavioural malware analysis research using machine learning regularly reports
There are some key challenges to implementation, which have been outlined in this paper:
- The need for signal separation drives the use of individual processes, and only partial traces can be used.
- The significant drop in accuracy on the unseen test set, even without process killing, demonstrates that additional features may be necessary to improve detection accuracy.
- With the introduction of process killing, the poor performance of the models on either benignware classification (RF, min. 2 alerts: 81% TNR with an 88% TPR on the test set) or malware classification (GBDT regressor, min. 4 alerts: 56% TPR with a 94% TNR on the test set) means that considerable further work is needed before very early-stage real-time detection can be considered for real-world use.
- Real-time detection using full execution traces of processes, however, may be viable. This is useful for handling VM-aware malware, which may only reveal its true behaviour in the target environment.
- Although the more complex approach using DQNs did not outperform the supervised models with additional statistical thresholds, the regression models performed better at correctly classifying benignware. Reinforcement learning could still be useful for real-time detection and automated cyber defence, but the DQN in these experiments did not perform well.
- Despite the theoretical unsuitability of supervised learning models for state-action problems, these experiments demonstrate how powerful supervised learning can be for classification problems, even if the problem is not quite the one the model is attempting to solve. Future work may require a more comprehensive manual labelling effort at the process level, perhaps labelling sub-sections of processes as malicious or benign.
An additional consideration for real-time detection with automated actions is whether it introduces a new denial-of-service vector, using process injection, for example, to trigger process killing. However, such activity may also indicate that an attacker is present and therefore aid the user.
This paper has built on previous work in real-time detection to address some of the key challenges: signal separation, detection with partial execution traces, and computational resource consumption with a focus on preventing harm to the user, since real-time detection introduces this risk.
Behavioural malware detection using virtual machines is a well-established research field yielding high detection accuracy in recent literature [
To the best of our knowledge, previous real-time detection work has used up to 5 simultaneous applications, whereas real users may run far more. This paper has demonstrated that up to 35 simultaneous applications (and nearly 100 simultaneous processes) can be constantly monitored, where previous work [
Automatic actions are necessary in response to detection if the goal is to prevent harm; otherwise, this is equivalent to letting the malware fully execute and simply monitoring its behaviour, since human response times are unlikely to be quick enough for fast-acting malware. From a user perspective, the question is not “What percentage of malware was executed?” or “Was the malware detected in 5 or 10 minutes?” but “How much damage has been done?”.
This paper found that by using simple statistical filters on top of supervised learning models, it was possible to prevent 92% of files from being corrupted by fast-acting ransomware thus reducing the requirements on the user or organisation to remediate the damage, since it was prevented in the first instance (the rest of the attack vector would remain a concern).
This approach does not achieve the detection accuracies of state-of-the-art offline behavioural analysis models but, as stated in the introduction, those models typically use the full post-execution trace of malicious behaviour, and delaying classification until post-execution negates the principal advantages of real-time detection. However, the proposed model presents an initial step towards a fully automated endpoint protection model, which becomes increasingly necessary as adversaries become ever more motivated to evade offline automated detection tools.
Information on the data underpinning the results presented here, including how to access them, can be found in the Cardiff University data catalogue at 10.17035/d.2021.0148229014.
The authors declare that they have no conflicts of interest.
This research was partly funded by the Engineering and Physical Sciences Research Council (EPSRC), grant references EP/P510452/1 and EP/S035362/1. The research was also partly funded by Airbus Operations Ltd.