Real-time malware process detection and automated process killing

Perimeter-based detection is no longer sufficient for mitigating the threat posed by malicious software. This is evident as antivirus (AV) products are replaced by endpoint detection and response (EDR) products, the latter allowing visibility into live machine activity rather than relying on the AV to filter out malicious artefacts. This paper argues that detecting malware in real-time on an endpoint necessitates an automated response due to the rapid and destructive nature of some malware. The proposed model uses statistical filtering on top of a machine learning dynamic behavioural malware detection model in order to detect individual malicious processes on the fly and kill those which are deemed malicious. In an experiment to measure the tangible impact of this system, we find that fast-acting ransomware is prevented from corrupting 92% of files with a false positive rate of 14%. Whilst the false-positive rate currently remains too high to adopt this approach as-is, these initial results demonstrate the need for a detection model which is able to act within seconds of the malware execution beginning; a timescale that has not been addressed by previous work.


Introduction
Our increasingly digitised world broadens both the opportunities and motivations for cyber attacks, which can have devastating social and financial consequences [1]. Malicious software (malware) is one of the most commonly used vectors to propagate malicious activity and exploit code vulnerabilities.
Due to the huge numbers of new malware appearing each day, the detection of malware samples needs to be automated [2]. Signature-matching methods are not resilient enough to handle obfuscation techniques or to catch unseen malware types and, as such, automated methods of generating detection rules, such as machine learning, have been widely proposed, e.g. [3,4,5,6]. These approaches typically analyse samples when the file is first ingested, either using static code-based methods or by observing dynamic behaviours in a virtual environment.
This paper argues that both of these approaches are vulnerable to evasion by the attacker. Static methods may be thwarted by simple code-obfuscation techniques, whether rules are hand-generated [7] or created using machine learning [8]. Dynamic detection in a sandboxed environment cannot continue forever: either it is time-limited, e.g. [9], or it ends after some period of inactivity, e.g. [10]. This fixed period allows attackers to inject benign activity during analysis and wait to carry out malicious activity once the sample has been deemed harmless and passed on to the victim's environment. The pre-execution filtering of malware is the model used by antivirus, but this is insufficient to keep up with the ever-evolving malware landscape and has led to the creation of endpoint detection and response (EDR) products, which allow security professionals to monitor and respond to malicious activity on the victim machine. Real-time malware detection also monitors malware live on the machine, thus capturing any malicious activity on the victim machine even if it was not evident during initial analysis. This paper proposes that once a threat is detected, due to the fast-acting nature of some destructive malware, it is vital to have automated actions to support these detections. In this paper we investigate automated detection and killing of malicious processes for endpoint protection.
There are several key challenges to address in detecting malware on-the-fly on a machine in use by comparison with detecting malicious applications that are detonated in isolation in a virtual machine. These are summarised below:
1. Signal Separation: Detection in real time requires that the malicious and benign activity are separated in order that automated actions can be taken on only the malicious processes.
2. Use of Partial Traces: In order to try and mitigate damage, malware needs to be detected as early as possible but, as shown in previous work [11], there is a trade-off between the amount of data collected and classification accuracy in the first few seconds of an application launching, and the same may be true for individual processes.
3. Quick Classification: The inference itself should be as fast as possible in order to further limit the chance of malicious damage once the process is deemed malicious.
4. Impact of Automated Killing in Supervised Learning: Supervised learning averages the error rate across the entire training set but when the classification results in an action, this smoothing out of errors across the temporal dataset is not possible.
This paper seeks to address these key challenges and provides preliminary results including a measure of 'damage prevented' in a live environment for fast-acting destructiveware. As well as the results from these experiments this paper contributes an analysis of the computational resources against detection accuracy for many of the most popular machine learning algorithms used for malware detection.
The key contributions of this paper are as follows:
• The first general malware detection model to demonstrate damage mitigation in real-time using process detection and killing
• Benchmarking of commonly used ML algorithm implementations with respect to computational resource consumption
• Presentation of real-time malware detection against more user background applications than have previously been investigated; increasing from 5 to 36 (up to 95 simultaneous processes)
The next section outlines related work, followed by a report of the three methodologies that were tested to try and address these challenges (section 3), in which the method for evaluating these models is also explained (see section 6.5). The experimental setup is described in section 5.2.1, followed by results in section 6.

Related Work

Malware detection with static or post-collection behavioural traces
Static sources: Machine learning models trained on static data have shown good detection accuracy, e.g. Chen et al. [5] achieved 96% detection accuracy using statically-extracted sequences of API calls to train a Random Forest model. However, static data has been demonstrated to be quite vulnerable to concept drift [12,13]. Adversarial samples present an additional emerging concern; Grosse et al. [14] and Kolosnjaji et al. [8] demonstrated that static malware detection models achieving over 90% detection accuracy could be thwarted by injecting code or simply altering the padded code at the end of a compiled binary respectively.
Post-collection dynamic data: Dynamic behavioural data is generated by the malware carrying out its functionality. Again, machine learning models have been used to draw out patterns between malicious and benign software using dynamic data. Various dynamic data can be collected to describe malware behaviour. The most commonly used data are API calls made to the operating system, typically recorded in short sequences or by frequency of occurrence. Huang and Stokes's research [3] reports the highest accuracy in recent malware detection literature, using a very large dataset of more than 6 million samples to achieve a detection rate of 99.64% with a neural network trained on the input parameters passed to API calls, their return values, and the co-occurrence of API calls. Other dynamic data sources include dynamic opcode sequences (e.g. Carlin et al. [9] achieve 99% using a Random Forest), hardware performance counters (e.g. Sayadi et al. [15] achieve 94% on Linux/Ubuntu malware using a decision tree), network activity and file system activity (e.g. Usman et al. [16] achieve 93% using a decision tree in combination with threat intelligence feeds and these data sources), and machine activity metrics (e.g. Burnap et al. [17] achieve 94% using a self-organising map). Previous work [18] demonstrated the robustness of machine activity metrics over API calls in detecting malware collected from different sources.
Dynamic detection is more difficult to obfuscate but typically the time taken to collect data is several minutes, making it less attractive for endpoint detection systems. Some progress has been made on early detection of malware. Previous work [11] was able to detect malware with 94% accuracy within 5 seconds of execution beginning. However, as this is a sandbox-based method, malware which is inactive for the first 5 seconds is unlikely to be detected with this approach. Moreover, the majority of dynamic malware detection papers use virtualised environments to collect data.

Real-time malware detection with partial behavioural traces
Previous work has begun to address the four challenges set out in the introduction. Table 1 summarises the related literature and the problems considered by the researchers.
To the best of our knowledge, challenge (1) signal separation has only previously been addressed by Sun et al. [23] using sequential API call data. The authors execute up to 5 benign and malicious programs simultaneously achieving 87% detection accuracy after 5 minutes of execution and 91% accuracy after 10 minutes of execution.
Challenge (2) to detect malware using partial traces as early as possible has not been directly addressed. Some work has looked at early run-time detection; Das et al. [20] used an FPGA as part of a hybrid hardware-software approach to detect malicious Linux applications using system API calls which are then classified using a multilayer perceptron. Their model was able to detect 46% of malware within the first 30% of its execution with a false-positive rate of 2% in offline testing. These findings however were not tested with multiple benign and malicious programs running simultaneously and do not explain the impact of detecting 46% of malware within 30% of its execution trace in terms of benefits to a user or the endpoint being protected. How long does it take for 30% of the malware to execute? What has occurred in that time?
Greater attention has been paid to challenge (3) quick classification, insofar as this problem also encompasses the need for lightweight detection. Some previous work has proposed hardware-based detection for lightweight monitoring. Sayadi et al. [15] [21] use low-level architectural events to train a multilayer perceptron on the more widely used [25] (and attacked) Windows operating system. The model was able to detect 94% of malware with a false positive rate of 7% using partial execution traces of 10,000 committed instructions. The hardware-based detection models, however, are less portable than software-based systems because the same operating system can run on a variety of hardware configurations. Both Sun et al. [23] and Yuan [22] propose two-stage models to address the need for lightweight computation. The first stage comprises a lightweight ML model such as a Random Forest to flag suspicious processes, the second being a deep learning model which is more accurate but more computationally intensive to run. Two-stage models, as Sun et al. [23] note, can get stuck in an infinite loop of analysis in which the first model flags a process as suspicious but the second model deems it benign, and this labelling cycle continues repeatedly. Furthermore, if the first model of the two is prone to false negatives, malware will never be passed to the second model for deeper analysis.
Challenge (4) the impact of automated actions has been discussed by Sun et al. [23]. The authors also propose the two-stage approach as a solution to this problem. The authors apply restrictions to the process whilst the deeper NN analysis takes place followed by the killing of malicious-labelled processes. The authors found that the delaying strategy impacted benignware more than malware and used this two-stage process to account for the irreversibility of the decision to kill a process. The authors did not assess the impact on the endpoint with respect to the time at which the correctly classified malware was terminated.

Methodology: three approaches
As noted above, supervised learning models average errors across the training set but in the case of real-time detection and process killing, a single false positive on a benign process amongst 300 true negatives would cause disruption to the user. The time at which a piece of malware is detected is also important: the earlier the better. Therefore the supervised learning model needs to be adapted to take account of these new requirements.
Three different ways of tackling this issue were attempted, and all three are reported here in the interests of reporting negative results as well as the one which performed the best. These were:
1. Statistical methods to smooth the alert surface and filter out single false positives
2. Reinforcement learning, which is capable of incorporating the consequences of model actions into learning
3. A regression model based on the feedback of a reinforcement learning model, made possible by having the ground-truth labels
Figure 1 gives a high-level depiction of the three approaches tested in this paper.

Statistical Approach: Alert Filtering
It is expected that transitioning from a supervised learning model to a real-time model will see a rise in false positives, since a single alert means that a benign process (and all of its child processes) is terminated, which effectively renders all future data points false positives. Filtering the output of the models, just as the human brain filters out transient electrical impulses in order to separate background noise from relevant data [26], may be sufficient to make supervised models into suitable agents. This is attractive because supervised learning models are already known to perform well for malware detection, as confirmed by the previous paper and other related work [11,27,20,28]. A disadvantage of this approach is that it introduces additional memory and computational requirements, both to calculate the filtered results and to track processes' current and historic scores. Therefore a model which integrates the expected consequences of an action into learning is also tested: reinforcement learning.

Figure 1: High-level depiction of three approaches taken
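To make the idea of alert filtering concrete, the following minimal sketch smooths the per-process scores of a supervised classifier with a rolling mean, so that one isolated high score does not trigger a kill. The class name `AlertFilter`, the window length and the threshold are assumptions chosen for illustration; the text above does not prescribe a particular filter.

```python
from collections import defaultdict, deque


class AlertFilter:
    """Rolling-mean filter over per-process classifier scores (illustrative only)."""

    def __init__(self, window=3, threshold=0.5):
        # window and threshold are hypothetical tuning parameters.
        self.window = window
        self.threshold = threshold
        # One bounded history of recent scores per process ID.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, pid, score):
        """Record a new score for pid and return True if the process should be killed."""
        scores = self.history[pid]
        scores.append(score)
        # Wait until the window is full so a single early alert cannot trigger a kill.
        if len(scores) < self.window:
            return False
        return sum(scores) / len(scores) > self.threshold
```

The per-process history kept here is exactly the extra memory cost noted above, and waiting for a full window delays the earliest possible kill by `window - 1` observations.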

Reinforcement Learning: Q-learning with Deep Q Networks
The proposed automated killing model may be better suited to a reinforcement learning strategy than to supervised learning. Reinforcement learning uses rewards and penalties from the model's environment. The problem that this paper is seeking to solve is essentially a supervised learning problem, but one for which it is not possible to average predictions. There are no opportunities to classify the latter stages of a process if the agent kills the process, and this can be reflected by the reward mechanism of the reinforcement learning model (see Figure 1 above). Therefore reinforcement learning seems like a good candidate for this problem space. Two limitations of this approach are that (1) reinforcement learning models can struggle to converge on a balanced solution, since the models must learn to balance the exploration of new actions with the re-use of known high-reward actions, commonly known as the exploration-exploitation trade-off [29]; and (2) in these experiments, the reward is based on the malware/benignware label at the application level rather than being linked to the actual damage being caused, therefore the signal is a proxy for what the model should be learning. This is used because, as discussed above, the damage caused by different malware is subjective.
For reinforcement learning, loss functions are replaced by reward functions which update the neural network weights to reinforce actions (in context) that lead to higher rewards and discourage actions (in context) that lead to lower rewards; these contexts and actions are known as state-action pairs. Typically the reward is calculated from the perceived value of the new state that the action leads to, e.g. points scored in a game. Often this cannot be pre-labelled by a researcher since there are so many (maybe infinite) state-action pairs. However, in this case all possible state-action pairs can be enumerated, which is the third approach tested (the regression model, outlined in the next section).
The reinforcement model was still tested. Here the reward is +N for a correct prediction and −N for an incorrect prediction, where N is the total number of processes impacted by the prediction, e.g. if there is only one process in a process tree but 5 more will appear over the course of execution, a correct prediction gives a reward of +6, and an incorrect prediction gives a reward of −6.
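The reward rule above can be written as a small function. This is a sketch of the arithmetic only; the function and parameter names are hypothetical.

```python
def reward(predicted_malicious, is_malicious, processes_now, processes_later):
    """Return +N for a correct prediction and -N for an incorrect one.

    N is the total number of processes impacted: those currently in the
    process tree plus those that will appear later in the execution.
    """
    n = processes_now + processes_later
    return n if predicted_malicious == is_malicious else -n
```

With one process in the tree and five still to appear, `reward(True, True, 1, 5)` gives 6 and `reward(True, False, 1, 5)` gives -6, matching the example in the text.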
There are a number of reinforcement learning algorithms to choose from. This paper explores Q-learning [30,31,32,33] to approximate the value or 'quality' (q) of a given action in a given situation. Q-learning approximates q-tables, which are look-up tables of every state-action pair and their associated rewards. A state-action pair is a particular state in the environment coupled with a particular action, e.g. the machine metrics of the process at a given point in time with the action to leave the process running. When the number of state-action pairs becomes large, it is easier to approximate the value using an algorithm. Deep Q networks (DQN) are neural networks that implement Q-learning and have been used in state-of-the-art reinforcement learning arcade game playing; see Mnih et al. [34]. A DQN was the reinforcement algorithm trialled here. Though it did not perform well by comparison with the other methods, and a different RL algorithm may perform better [35], the results are still included in the interests of future work. The following paragraphs explain some of the key features of the DQN.
The DQN tries out some actions, stores the states, actions, resulting states and rewards in a memory and uses these to learn the expected rewards of each available action, with the action that has the highest expected reward being the one that is chosen. Neural networks are well-suited to this problem since their parameters can easily be updated; tree-based algorithms like random forests and decision trees can be adapted to this end but not as easily. Future rewards can be built into the reward function and are usually discounted according to a tuned parameter usually signified by γ.
In Mnih et al.'s [34] formulation, in order to address the exploration-exploitation trade-off, DQNs either exploit a known action or explore a new one, with the chance of choosing exploration falling over time. When retraining the model based on new experiences, there is a risk that previously learnt useful behaviours are lost; this problem is known as catastrophic forgetting [36]. Mnih et al.'s [34] DQNs use two tools to combat this problem. The first is experience replay, by which past state-action pairs are shuffled before being used for retraining so that the model does not catastrophically forget. Second, DQNs utilise a second network, which updates at infrequent intervals in order to stabilise the learning.
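The two mechanisms described above, a decaying exploration rate and a replay memory that is sampled at random, can be sketched without any neural network code. The linear decay schedule, its default values and all names below are assumptions for illustration and are not the settings used in these experiments; the second (target) network is omitted.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Random sampling breaks the temporal order of stored experiences.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Exploration probability, falling linearly from start to end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


def choose_action(q_values, step, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Here `q_values` would hold the network's estimates for the two available actions (leave the process running, or kill it).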
Q-learning may enable a model to learn when it is confident enough to kill a process, using the discounted future rewards. For example, choosing not to kill some malware at time t may have some benefit as it allows the model to see more behaviour at t+1, which gives the model greater confidence that the process is in fact malicious. Other reinforcement learning algorithms exist, but only Q-learning is explored here.
Q-learning approximates rewards from experience, but in this case, all rewards from state-action pairs can actually be pre-calculated. Since one of the actions will kill the process and thus end the 'experience' of the DQN, it could be difficult for this model to gain enough experience. Pre-calculation of rewards may therefore improve the breadth of experience of the model; for this reason a regression model is proposed to predict the Q-value of a given action.

Regression using Q-Values
Unlike classification problems, regression problems can predict a continuous value rather than discrete (or probabilistic) values relating to a set of output classes. Regression algorithms are proposed here to predict the q-value of killing a process. If this value is positive, the process is killed.
Q-values estimate the value of a particular action based on the 'experience' of the agent. Since the optimal action for the agent is always known, it is possible to precompute the '(q-)value' of killing a process and train various ML models to learn this value. It would typically be quicker to train a regression model which tries to learn the value of killing a process than to train a DQN which explores the state-action space and calculates rewards between learning, since the interaction and calculation of rewards is no longer necessary. The regression approach can be used with any machine learning algorithm capable of learning a regression problem, regardless of whether it is capable of partial training.
There are two primary differences between this regression approach and the reinforcement learning DQN approach detailed in the previous section. Firstly, the datasets are likely to be different. Since the DQN generates training data through interacting with its environment, it may never see certain parts of the state-action space, e.g. if a particular process A is always killed during training before time t*, the model is not able to learn from the process A data after t*.
Secondly, only the expected value of killing is modelled by the regressor, whereas the DQN tries to predict the value of both killing and of not killing the process. This means that the equation used to model the value of process killing is only an approximation of the reward function used by the DQN.
The equation used to calculate the value of killing is positive for malware and negative for benignware. In both cases it is scaled by the number of child processes impacted and, in the case of malware, early detection increases the value of process killing (with an exponential decay). Let y be the true label of the process (0 = benign, 1 = malicious), N the number of child processes and t the time in seconds at which the process is killed; then the value of killing a process is:

The equation above negatively scores the killing of benignware in proportion to the number of subprocesses and scores the killing of malware positively in proportion to the number of subprocesses. A bonus reward is scored for killing malware early, with an exponential decay over time.
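As a purely illustrative sketch, one hypothetical function with the properties described above (sign set by the label, magnitude scaled by the processes affected, and an exponentially decaying bonus for killing malware early) might look as follows. The `1 + n_children` scaling, the `bonus` weight and the `decay` rate are assumptions made for this example; they are not the exact form or values used in the experiments.

```python
import math


def kill_value(y, n_children, t, bonus=1.0, decay=1.0):
    """Hypothetical value of killing a process at time t (seconds).

    y: true label (0 = benign, 1 = malicious)
    n_children: number of child processes that would also be killed
    bonus, decay: assumed parameters of the early-detection term
    """
    sign = 1 if y == 1 else -1
    # Assumption: the process itself counts alongside its children.
    base = sign * (1 + n_children)
    # Only malware earns the early-kill bonus, which decays exponentially with t.
    early_bonus = bonus * math.exp(-decay * t) if y == 1 else 0.0
    return base + early_bonus
```

Under this sketch a regressor would be trained to predict `kill_value` from the process features, and a process would be killed whenever the predicted value is positive.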

Evaluation Methodology: Ransomware detection
As noted in the related work above, research to date has not addressed the extent to which damage is mitigated by process killing: Sun et al. [23] presented the only previous work to test process killing, and damage with and without process killing was not assessed. To this end, this paper uses ransomware as a proxy to detect malicious damage, inspired by Scaife et al.'s approach [24]. A brief overview of Scaife et al.'s damage measurement is outlined below.
Early detection is particularly useful for types of malware from which recovery is difficult and/or costly. Cryptographic ransomware encrypts user files and withholds the decryption key until a ransom is paid to the attackers. This type of attack is typically costly to remedy, even if the victim is able to carry out data recovery [37]. Scaife et al.'s work [24] on ransomware detection uses features from file system data, such as whether the contents appear to have been encrypted, and the number of changes made to the file type. The authors were able to detect and block all of the 492 ransomware samples tested with less than 33% of user data being lost in each instance. Continella et al. [38] propose a self-healing system, which detects malware using file system machine activity (such as read/write file counts); the authors were able to detect all 305 ransomware samples tested, with a very low false-positive rate. These two approaches use features selected specifically for their ability to detect ransomware, but this requires knowledge of how the malware operates, whereas the approach taken here seeks to use features which can be used to detect malware in general. The key purpose of this final experiment (section 6.5) is to show that our general model of malware detection is able to detect general types of malware as well as time-critical samples such as ransomware.

Experimental Setup
This section outlines the data capture process and dataset statistics.

Features
The same features as were used in previous work [11] are used here for process detection, with some additional features to measure process-specific data. Despite the popularity of API calls noted in [18], due to these findings and Sun et al.'s [23] difficulties hooking this data in real-time, these were not considered as features to train the model.
At the process level, 26 machine metric features are collected; these were dictated by the attributes available using the psutil [39] Python library. It is also possible to include the 'global' machine activity metrics that were used in the previous papers. Though global metrics will not provide process-level granularity, they may give muffled indications of the activity of a wider process tree. The 9 global metrics are: system level CPU use, user level CPU use, memory use, swap memory use, number of packets received and sent, number of bytes received and sent and the total number of processes running.
The process-level machine activity metrics collected are: CPU use at the user level, CPU use at the system level, physical memory use, swap memory use, total memory use, number of child processes, number of threads, maximum process ID from a child process, disk read, write and other I/O count, bytes read, written and used in other I/O processes, process priority, I/O process priority, number of command line arguments passed to process, number of handles being used by process, time since the process began, TCP packet count, UDP packet   Table 2).

Preprocessing
Feature normalisation is necessary for NNs to avoid over-weighting features with higher absolute values. The test, train and validation sets (x) are all normalised by subtracting the mean (µ) and dividing by the standard deviation (σ) of each feature in the training set: (x − µ) / σ. This sets the range of input values largely between -1 and 1 for all input features, avoiding the potential for some features to be weighted as more important than others during training purely due to the scalar values of those features. This requires additional computational resources but is not necessary for all ML algorithms; this is another reason why the supervised RNN used in [11] may not be well-suited for real-time detection.
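The key detail of the normalisation above is that µ and σ come from the training set only and are then reused for the validation and test sets. A minimal standard-library sketch follows; the function names are hypothetical, and the use of the population standard deviation and the fallback of 1.0 for constant features are assumptions.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Compute the per-feature mean and standard deviation from the training set only."""
    columns = list(zip(*train_rows))
    mus = [mean(col) for col in columns]
    # Assumption: constant features get sigma = 1.0 to avoid division by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return mus, sigmas


def transform(rows, mus, sigmas):
    """Apply (x - mu) / sigma to every feature using the training-set statistics."""
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]
```

In a live deployment `transform` would run on every new observation of every process, which is the additional computational cost mentioned above.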

Data Capture
During data capture, this research sought to improve upon previous work and emulate real machine use to a greater extent than has previously been trialled. The implementation details of the VM, simultaneous process execution and RL simulation are outlined below:

Environment: Machine setup
The following experiments were conducted using a virtual machine (VM) running with Cuckoo Sandbox [40] for ease of collecting data and restarting between experiments, and because the Cuckoo Sandbox emulates human interaction with programs to some extent to promote software activity. In order to emulate the capabilities of a typical machine, the modal hardware attributes of the top 10 'best seller' laptops according to a popular internet vendor [41] were used as the basis of the VM configuration. This resulted in a VM with 4GB RAM, 128GB storage and dual-core processing running Windows 7 64-bit. Windows 7 was the most prevalent computer operating system (OS) globally at the time of designing the experiment [25]. Though Windows 10 is now the most popular OS, the findings in this research should still be relevant.

Simultaneous applications
In typical machine use, multiple applications run simultaneously. This is not reflected by behavioural malware analysis research in which samples are injected individually to a virtual machine for observation. The environment used for the following experiments launches multiple applications on the same machine at slightly staggered intervals as if a user were opening them. Each malware sample is launched both with a small number (1-3) and with a larger number (3-35) of applications. It was not possible to find up-to-date user data on the number of simultaneous applications running on a typical desktop, so here it was elected to launch up to 36 applications (35 benign + 1 malicious) at once, which is the largest number of simultaneous apps for real-time data collection to date. From the existing real-time analysis literature only Sun et al. [23] run multiple applications at the same time, with a maximum of 5 running simultaneously.
Each application may in turn launch multiple processes, causing more than 35 processes to run at once; 95 is the largest number of simultaneous processes recorded, excluding background OS processes.

Reinforcement Learning Simulation
For reinforcement learning, the DQN requires an observation of the resulting state following an action. To train the model, a simulated environment is created from the pre-collected training data whereby the impact of killing or not killing a process is returned as the next state. For process-level elements, killing a process reduces all features to zero. A caveat here is that in reality killing the process may not occur immediately and therefore memory, processing power etc. may still be being consumed at the next data observation. For global metrics, the process-level values for the killed processes (including child processes of the killed process) are subtracted from the global metrics. There is a risk again that this calculation may not correlate perfectly with what would be observed in a live machine environment. In order to observe the model performance, a visualisation was developed to accompany the simulated environment. Figures 2 and 3 show screenshots of the environment visualisation for one malicious and one benign process.
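The next-state calculation described above can be sketched as follows. The dictionary layout, the metric names used as keys and the `num_processes` key are hypothetical; the sketch only illustrates zeroing the killed processes' features and subtracting their values from the matching global metrics.

```python
def simulate_kill(global_metrics, process_metrics, children, pid):
    """Return (next_global, next_process) after killing pid and all of its descendants.

    global_metrics: {metric_name: value}
    process_metrics: {pid: {metric_name: value}}
    children: {pid: [child_pid, ...]}
    """
    # Collect the killed process and every descendant in its process tree.
    killed, stack = set(), [pid]
    while stack:
        p = stack.pop()
        if p not in killed:
            killed.add(p)
            stack.extend(children.get(p, []))

    next_global = dict(global_metrics)
    next_process = {}
    for p, metrics in process_metrics.items():
        if p in killed:
            # Subtract this process's share from any global metric of the same name.
            for name, value in metrics.items():
                if name in next_global:
                    next_global[name] -= value
            if "num_processes" in next_global:
                next_global["num_processes"] -= 1
            # Process-level features of a killed process are reduced to zero.
            next_process[p] = {name: 0 for name in metrics}
        else:
            next_process[p] = dict(metrics)
    return next_global, next_process
```

As the caveat above notes, this assumes the kill takes effect before the next observation, which may not hold on a live machine.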

Dataset
The dataset comprises 3,604 benign executables and 2,792 malicious applications (each containing at least one executable), with 2,877 for training and validation and 3,519 for testing. These dataset sizes are consistent with previous real-time detection dataset sizes (e.g. Das et al. [20] use 168 malicious, 370 benign; Sayadi et al. [15] use over 100 each benign and malicious; Ozsoy et al. [21] use 1,087 malicious and 467 benign; Sun et al. [23] use 9,115 malicious, 877 benign). With multiple samples running concurrently to simulate real endpoint use, there are 24K processes in the training set and 34K in the test set. Overall there are 58K behavioural traces of processes in the training and testing datasets. The benign samples comprise files from VirusTotal [42], from free software websites (later verified as benign with VirusTotal), and from a fresh Microsoft Windows 7 installation. The malicious samples were collected from two different VirusShare [43] repositories.
In Pendlebury et al.'s analysis [13], the authors estimate that in the wild between 6% and 22% of applications are malicious, normalising to 10% for their experiments. Using this estimate of Android malware ratios, a similar ratio was used in the test set, in which 13.5% were malicious.

Malware families
This paper is not concerned with distinguishing particular malware families, but rather with identifying malware in general. However, a dataset consisting of just one malware family would present an unrealistic and easier problem than is found in the real-world. The malware families included in this dataset are reported in Table 3. The malware family labels are derived from the output of around 60 antivirus engines used by VirusTotal [42].
Ascribing family labels to malware is non-trivial since antivirus vendors do not follow standardised naming conventions and many malware families have multiple aliases. Sebastián et al. [44] have developed an open source tool, AVClass, to extract meaningful labels and correlate aliases between different antivirus outputs. AVClass was used to label the malware in this dataset. Sometimes there is no consensus amongst the antivirus outputs or the sample is not recognised as a member of an existing family. AVClass also excludes malware that belongs to very broad classes of malware (e.g. "agent", "eldorado", "artemis") as these are likely to comprise a wide range of behaviours and so may be applied as a default label in cases for which antivirus engines are unsure. In the dataset established in this research, 2,121 of the 2,792 samples were assigned to a malware family. Table 3 gives the number of samples in each family for which there were more than 10 instances found in the dataset. 315 families were detected overall, with 27 families being represented more than 10 times.

Malicious vs. Benign Behaviour
Statistical inspection of the training set reveals that benign applications have fewer sub-processes than malicious applications, with 1.17 processes in the average benign process tree and 2.33 processes in the average malicious process tree.
Malware was also more likely to spawn processes outside of the process tree of the root process, often using the names of legitimate Windows processes. In some cases malware launches legitimate applications, such as Microsoft Excel, in order to carry out a macro-based exploit. Although Excel is not a malicious application in itself, it is malicious in this context, which is why malicious labels are assigned if a malware sample has caused that process to come into being. It is therefore possible to argue that some processes launched by malware are not malicious, because they do not individually cause harm to the endpoint or user, but without the malware they would not be running and so can be considered at least undesirable, even if only in the interests of conserving computational resources.

Train-Test Split
The dataset is split in half with the malicious samples in the test set coming from the more recent VirusShare repository, and those in the training set from the earlier repository. This is to increase the chances of simulating a real deployment scenario in which the malware tested contain new functionality by comparison with those in the training set.
Ideally the benignware should also be split by date across the training and test sets; however, it is not a trivial task to determine the date at which benignware was compiled. It is possible to extract the compile time from the PE header, but the PE author can set this date manually, which had clearly happened in some instances where the compile date was 1970-01-01 or, in one instance, 1970-01-16. In the latter case, the file is first mentioned online in 2016, perhaps indicating a typographic error [45]. Internet sources such as VirusTotal [42] can give an indication of when software was first seen, but if the file is not very suspicious, i.e. it comes from a reputable source, it may not have been uploaded until years after it was first seen "in the wild". Due to the difficulty in dating benignware in the dataset collected for this research, samples were assigned to the training or test set randomly.
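To illustrate why header dates are hard to trust, the compile time can be read from the TimeDateStamp field of the COFF header using only the standard library. This is a minimal sketch and not the extraction code used for this dataset; the function names and the 1980 plausibility cut-off are assumptions chosen for illustration.

```python
import struct
from datetime import datetime, timezone


def pe_compile_time(data: bytes) -> datetime:
    """Return the TimeDateStamp of a PE image's COFF header as a UTC datetime."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # e_lfanew (offset 0x3C) holds the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # The COFF header follows the signature: Machine (2 bytes),
    # NumberOfSections (2 bytes), then the 4-byte TimeDateStamp
    # (seconds since 1970-01-01 UTC).
    (timestamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def looks_implausible(compiled: datetime, earliest_year: int = 1980) -> bool:
    """Flag dates such as 1970-01-01 that predate the PE format (assumed cut-off)."""
    return compiled.year < earliest_year
```

A timestamp of zero decodes to 1970-01-01, which is why that date appears whenever the field has been blanked or was never set; a plausibility check of this kind can flag such files but cannot recover the true compile date.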
For training, an equal number of benign and malicious processes are selected, so that the model does not bias towards one class. 10% of these are held out for validation. In most ML model evaluations, the validation set would be drawn from the same distribution as the test set. However, because it is important not to leak any information about the malware in the test set, since it is split by date, the validation set here is drawn from the training distribution.
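The class balancing and validation hold-out described above can be sketched as follows. This is an illustrative assumption about the mechanics: the text does not state whether balancing is done by downsampling the larger class, and the function name, seed and label encoding (0 benign, 1 malicious) are hypothetical.

```python
import random


def balanced_train_val_split(benign, malicious, val_fraction=0.1, seed=0):
    """Downsample to equal class sizes, then hold out a fraction for validation."""
    rng = random.Random(seed)
    n = min(len(benign), len(malicious))
    # Equal numbers of each class so the model is not biased towards one class.
    selected = [(x, 0) for x in rng.sample(benign, n)]
    selected += [(x, 1) for x in rng.sample(malicious, n)]
    rng.shuffle(selected)
    n_val = int(round(len(selected) * val_fraction))
    # Validation comes from the training distribution only, so nothing
    # about the (later-dated) test malware leaks into model selection.
    return selected[n_val:], selected[:n_val]
```

Only training-period samples are passed in, so the held-out 10% shares the training distribution rather than the test distribution.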

Implementation Tools
Data collection used the psutil [39] Python library to collect machine activity data for running processes and to kill those processes deemed malicious. The RNN and Random Forests were implemented using the PyTorch [46] and Scikit-Learn [47] Python libraries respectively. The model runs with high priority and administrator rights to ensure that polling is maintained when compute resources are scarce.
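A minimal sketch of how one polling step might look with psutil is given below. It is not the implementation used in this paper: the three features collected, the `is_malicious` callback and the `protected_pids` argument are assumptions for illustration, and the decision step is kept separate from the psutil calls so that it can be exercised on its own.

```python
try:
    import psutil  # third-party library cited in the text
except ImportError:  # lets the pure decision logic below run without it
    psutil = None


def collect_snapshot():
    """Return {pid: features} for every process that can currently be read."""
    snapshot = {}
    for proc in psutil.process_iter():
        try:
            with proc.oneshot():  # batches the underlying system calls
                snapshot[proc.pid] = {
                    "cpu_percent": proc.cpu_percent(interval=None),
                    "rss_bytes": proc.memory_info().rss,
                    "num_threads": proc.num_threads(),
                }
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or is protected; skip it
    return snapshot


def select_pids_to_kill(snapshot, is_malicious, protected_pids=frozenset()):
    """Pure decision step: PIDs flagged by the (hypothetical) classifier callback."""
    return [pid for pid, features in sorted(snapshot.items())
            if pid not in protected_pids and is_malicious(features)]


def kill_pids(pids):
    """Kill each PID, ignoring processes that have gone or cannot be touched."""
    for pid in pids:
        try:
            psutil.Process(pid).kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
```

A monitoring loop would call `collect_snapshot`, pass the result through the model via `select_pids_to_kill`, and then call `kill_pids`, repeating at the chosen polling interval; `protected_pids` would typically include the monitor's own PID.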
6 Experimental Results

Supervised Learning for Process Killing
First we demonstrate the unsuitability of a full-trace supervised learning malware detection model, which achieved more than 96% detection accuracy in [11]. The model used is a gated recurrent unit recurrent neural network, since this algorithm is designed to process time-series data. The hyperparameters of this model were configured using a random search (see Table 10 in the Appendix for details). It is expected that supervised malware detection models will not adapt well to process killing due to the averaging of loss metrics as described earlier. Initially this is verified by using supervised learning models to kill processes that are deemed malicious. For supervised classification, the model makes a prediction every time a data measurement is taken from a process. This approach is compared with one taking average predictions across all measurements for a process and for a process tree, as well as with the result of process killing. The models with the highest validation accuracy for classification and killing are compared.
Figure 4 illustrates the difference in validation set and test set F1-score, true positive rate and true negative rate for these 4 levels of classification: each measurement, each process, each process tree, and finally process killing; see Figure 5 for a diagrammatic representation of the first 3 levels. Table 4 reports the F1, TPR and TNR for classification (each measurement of each process) and for process killing.
Table 4: F1-score, true positive rate (TPR) and true negative rate (TNR) (all × 100) on test and validation sets for classification and process killing
Figure 4: F1 scores, true positive rates (TPR) and true negative rates (TNR) for partial-trace detection (process measurements), full-trace detection (whole process), whole application (process tree) and with process level measurements + process killing (process killing) for validation set (left column) and test set (right column)
The highest F1-score on the validation set is achieved by an RNN using process data only. When process killing is applied there is a drop of less than 5 percentage points in the F1-score but more than 15 percentage points are lost from the TNR.
On the unseen test set the highest F1-score is achieved by an RNN using process data + global metrics, but the improvement over the process data + total number of processes is negligible. Overall there is a reduction in F1-score from (97.44, 94.61) to (74.91, 77.66), highlighting the initial challenge of learning to classify individual processes rather than entire applications, especially when accounting for concept drift. Despite the low accuracy, these initial results indicate that the model is discriminating some of the samples correctly and may form a baseline from which to improve.
The test set TNR and TPR for classification on the best-performing model (process data only) are 79.70 and 82.91 respectively, but when process killing is applied, though the F1-score drops by 10 percentage points, the TNR and TPR move in opposite directions with the TNR falling to 59.63 and TPR increasing to 90.24. This is not surprising since a single malicious classification results in a process being classed as malicious. This is true for the best-performing models using either of the two feature sets (see Fig.4 above).
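The asymmetry between classification and killing can be made concrete with a small sketch of the four evaluation levels. The 0.5 decision threshold, the use of scores in [0, 1] and the choice to average a process tree over all of its measurements are assumptions for illustration rather than details confirmed by the experiments.

```python
from statistics import mean


def evaluate_levels(tree, threshold=0.5):
    """tree maps process id -> list of per-measurement malicious scores in [0, 1]."""
    per_measurement = {pid: [s >= threshold for s in scores]
                       for pid, scores in tree.items()}
    per_process = {pid: mean(scores) >= threshold for pid, scores in tree.items()}
    per_tree = mean(s for scores in tree.values() for s in scores) >= threshold
    # Process killing: the first measurement flagged malicious ends the process,
    # so later benign-looking measurements cannot outweigh it.
    killed = {pid: any(flags) for pid, flags in per_measurement.items()}
    return per_measurement, per_process, per_tree, killed
```

A benign process whose scores are mostly low but spike once is classified correctly when averaged over the process or tree, yet is still killed, which is the mechanism behind the falling TNR and rising TPR described above.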

Accuracy vs. Resource consumption
Previous work on real-time detection has highlighted the requirement for a lightweight model (in terms of speed and computational resources). In the previous paper, RNNs were the best-performing algorithm in classifying malware and benignware, but RNNs have many parameters and therefore may consume significant RAM and/or CPU. They also require preprocessing of the data to scale the values, which other ML algorithms such as tree-based algorithms do not. Whilst RAM and CPU usage should be minimised, taking model accuracy into account, inference duration is also an important metric.
Though the models in this paper have not been coded for performance and use common Python libraries, comparing these metrics helps to decide whether certain models are vastly preferable to others with respect to computational resource consumption. The PyRAPL library [48] is used to measure the CPU, RAM and duration used by each model. This library uses Intel processor 'Running Average Power Limit' (RAPL) metrics. Only the data pre-processing and inference are measured, as training may be conducted centrally in a resource-rich environment. Batch sizes of 1, 10, 100 and 1000 samples are tested with 26
Table 5 reports the computational resource consumption and accuracy metrics together. The decision tree with 38 features has the lowest cost to run. The RNN performs best at supervised learning classification on the validation set but only just outperforms the decision tree with 26 features, which is the best-performing model at process killing on the validation set with a 92.97 F1-score. The highest F1-score for process killing on the test set is achieved by a Random Forest with 37 features, scoring 77.85, which is 2 percentage points higher than the RF with 26 features (75.97).
The models all perform at least 10 percentage points better on the validation set indicating the importance of taking concept drift into account when validating models.
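The shape of such a measurement can be sketched with the standard library. Note that this substitutes `time.perf_counter` for PyRAPL, so it captures wall-clock duration only and not the RAPL energy counters; the scaling function, the threshold "model" and the batch generator are hypothetical stand-ins rather than the models evaluated here.

```python
import time


def time_inference(preprocess, predict, make_batch,
                   batch_sizes=(1, 10, 100, 1000), repeats=5):
    """Return {batch_size: best wall-clock seconds} for pre-processing plus inference."""
    timings = {}
    for size in batch_sizes:
        batch = make_batch(size)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            predict(preprocess(batch))  # training is deliberately excluded
            best = min(best, time.perf_counter() - start)
        timings[size] = best
    return timings


# Hypothetical stand-ins: clipped scaling followed by a fixed-threshold "model".
def scale(batch):
    return [[min(max(v / 100.0, 0.0), 1.0) for v in row] for row in batch]


def threshold_model(batch):
    return [sum(row) / len(row) >= 0.5 for row in batch]


def make_batch(size, n_features=26):
    return [[float((i + j) % 100) for j in range(n_features)] for i in range(size)]
```

Calling `time_inference(scale, threshold_model, make_batch)` returns one duration per batch size, which is the quantity that would be compared across models alongside their accuracy.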

How to solve a problem like process killing?
From the results above, it is clear that supervised learning models see a significant drop in classification accuracy when processes are killed as the result of a malicious label. This confirmation of the initial hypothesis justifies the need to examine alternative methods. In the interests of future work and negative result reporting, this paper reports all of the methods attempted and finds that simple statistical manipulations on the supervised learning models perform better than alternative training methods. This section briefly describes the logic of each method and provides a textual summary of the results, with formulae where appropriate. This is followed by a table of the numerical results for each method. In the following section, let $P$ be a set of processes $\{p_0, p_1, \ldots, p_P\}$ in a process tree, let $t^*$ be the time at which a prediction is made, and let $\hat{y}_i$ be the prediction for process $i$ at time $t^*$, where a prediction equal to or greater than 1 classifies malware.

a) Mean predictions
Reasoning: Taking the average prediction across the whole process will smooth out those process killing results.
Not tested: this was not attempted for two reasons: (1) taking the mean at the end of the process means the damage is done; (2) this method can easily be manipulated by an attacker: 50 seconds of injected benign activity requires 50 seconds of malicious activity to achieve a true positive.

b) Rolling mean predictions
Reasoning: Taking the average over a few measurements will eliminate those false positives that are caused by a single false positive prediction within a subset of the execution trace. Window sizes of 2 to 5 are tested. Let $w$ be the window size.
Summary of results: A small but uniform increase in F1-score using a rolling window over 2 measurements on the validation set. Using a rolling window of size 2 on the test set gave a 10 to 20 percentage point increase in true negative rate (to a maximum of 80.77) with 3 percentage points lost from the true positive rate. This was one of the most promising approaches.
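The rolling mean filter can be sketched as a small stateful class. The class name, the 0.5 threshold on scores in [0, 1] and the decision to wait for a full window before allowing a kill are assumptions made for illustration.

```python
from collections import deque


class RollingMeanFilter:
    """Flag a process only when the mean of its last `window` predictions reaches `threshold`."""

    def __init__(self, window=2, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.history = {}  # pid -> deque of the most recent predictions

    def update(self, pid, prediction):
        recent = self.history.setdefault(pid, deque(maxlen=self.window))
        recent.append(prediction)
        # Wait for a full window so one early alert cannot trigger a kill alone.
        if len(recent) < self.window:
            return False
        return sum(recent) / self.window >= self.threshold
```

With a window of 2, an isolated high score followed by a low one averages below the threshold and is ignored, whereas two consecutive high scores trigger the kill.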

c) Alert threshold
Reasoning: Like the rolling mean, single false positives will be eliminated, but unlike the rolling mean, the alerts are cumulative over the entire trace, such that a single alert at the start and another 30 seconds into the process will cause the process to be killed, rather than requiring that both alerts fall within a window of time. Between 2 and 5 minimum alerts are tested.
Summary of results: Again a small increase across all models, with the optimal minimum number of alerts being 2 for maximum F1-score, competitive with the rolling mean approach.
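The cumulative alert threshold differs from the rolling mean only in that alerts never expire, as the following sketch shows. The class name and the 0.5 score threshold used to define an alert are illustrative assumptions.

```python
from collections import Counter


class AlertThresholdFilter:
    """Kill a process once it has accumulated `min_alerts` malicious predictions."""

    def __init__(self, min_alerts=2, threshold=0.5):
        self.min_alerts = min_alerts
        self.threshold = threshold
        self.alerts = Counter()  # pid -> cumulative alert count over the whole trace

    def update(self, pid, prediction):
        if prediction >= self.threshold:
            self.alerts[pid] += 1
        # Unlike a rolling window, alerts never expire, however far apart they occur.
        return self.alerts[pid] >= self.min_alerts
```

An alert on the first measurement and another 30 measurements later still reach the minimum of 2, which a rolling window of size 2 would not register.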

d) Process-tree averaging
Reasoning: The data are labelled at the application level, therefore the average predictions across the process tree should be considered for classification.
Summary of results: Negligible performance increase on validation and test set data (less than 1 percentage point). This is likely because few samples have more than one process executing simultaneously.

e) Process-tree training
Reasoning: The data are labelled at the application level, therefore the sum of resources of each process tree should be classified at each measurement, not the individual processes.
Summary of results: Somewhat surprisingly there was a slight reduction in classification accuracy when using process tree data. One explanation for this may be that the process tree creates noise around the differentiating characteristics that are visible at the process level.

DQN
Reasoning: Reinforcement learning is designed for state-action space learning. Both pre-training the model with a supervised learning approach and not pretraining the model were tested.
Summary of results: Poor performance, typically converging to either kill or not kill everything. The few models which did not converge to a single dominant action did not distinguish malware from benignware well, indicating that they may not have learned anything. Reinforcement learning may help with the problem of real-time malware detection and process killing, but this initial implementation of a DQN did not converge to a better or even competitive solution compared with supervised learning. Perhaps a better formulation of rewards (e.g. damage prevented) would help the agent learn.

Regression on predicted kill value
Reasoning: Though the DQN explores and exploits different state-action pairs and their associated rewards, when the reward from each action is known in the first place and the training set is limited, as it is here, Q-learning can be framed as a regression problem in which the model tries to learn the return (rewards + future rewards). The training is faster and any regression-capable algorithm can be used. Let $N$ be the number of current and future child processes for $p_i$ at $t^*$:
$(y \cdot 2 - 1) \cdot (1 + N) \cdot (1 + (y \cdot e^{-t^*}))$
Summary of results: Improved performance on true negative rate, though this is not perceptible for the highest-scoring F1 models. Since F1-scores reward true positives more than true negatives, this metric can struggle to reflect a balance between the true positive and true negative rates. The highest true negative rate models are all regression models.
(See Tables 11, 12 and 13.)
Table 6 lists the F1, TPR and TNR on the validation and test set for each of the methods described above. The best-performing models on the test and validation sets are reported and the full results can be found in Appendix Table 11. Small improvements are made by some models on the validation F1-score but the test set F1-score improves by 4 percentage points in the best instance.
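The regression target given in the reasoning above can be evaluated directly. In this sketch `y` is assumed to be the ground-truth label (1 for malicious, 0 for benign) and `t` the time of the measurement in seconds; neither is defined explicitly alongside the formula, and the function name is hypothetical.

```python
import math


def kill_value(y, n_children, t):
    """Regression target (2y - 1) * (1 + N) * (1 + y * exp(-t)) for one measurement."""
    return (2 * y - 1) * (1 + n_children) * (1 + y * math.exp(-t))
```

Under these assumptions, killing a benign process always has a negative value that grows in magnitude with the number of child processes, while killing a malicious process has a positive value that is largest at the start of execution and decays towards `1 + N` as time passes.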
In most cases, the models with the highest F1-score on the validation and test sets are not the same. The highest F1-score on the test set is 81.50, from an RF using a minimum alert threshold of 2 and both process-level and global process metrics.

Further experiment: Favouring high TNR
Though the proposed model is motivated by the desire to prevent malware from executing, the best TNR reported amongst the models above is 81.50%. 20% of benign processes being killed would not be acceptable to a user. Whilst this research is a novel attempt at very early-stage real-time malware detection and process killing, one might consider usability and prefer a model with a very high TNR, even if this is at the expense of the TPR. Considering this, the AdaBoost regression algorithm achieves a 100% TNR with a 39.50% TPR on the validation set. The high TNR is retained on the test set, standing at 97.92%, but the TPR drops even further to just 8.40%. The GBDT, also using regression to estimate the value of process killing and coupled with a minimum of 4 alerts, performs well on the test set but does not stand out on the validation set (see Table 7).
Though less than 10% of the test set malicious processes are killed by the AdaBoost regressor, this model may be the most viable despite the low TPR. Future work may examine the precise behaviour and harm caused by malware that is/is not detected. To summarise results, the most-detected families were Ekstak (180), Mikey (80), Prepscram (53 processes) and Zusy (49 processes) of 745 total samples.

Measuring damage prevention in real time
Though a high percentage of processes are correctly identified as malicious by the best-performing model (RF with 2 alerts and 37 features), it may be that the model detects the malware after it has already caused damage to the endpoint. Therefore, instead of looking at the time at which the malware is correctly detected, a live test was carried out with ransomware to measure the percentage of files corrupted with and without the process killing model working. This real-time test also assesses whether malware can indeed be detected in the early stages of execution or whether the data recording, model inference and process killing are too slow in practice to prevent damage.
Ransomware is the broad term given to malware that prevents access to user data (often by encrypting files) and withholds the means for restoring the data (usually a decryption key) from the user until a ransom is paid. It is possible to quantify the damage caused by ransomware using the proportion of modified files, as Scaife et al. [24] have done in developing a real-time ransomware (only) detection system. The damage caused by some malware types is more difficult to quantify owing to its dependence on factors outside the control of the malware. For example, the damage caused by spyware will depend on what information it is able to obtain, so it is difficult to quantify the benefit of killing spyware 5 seconds after execution compared with 5 minutes into execution. Ransomware offers a clear metric for the benefits of early detection and process killing.
Although the RF with a minimum of 2 alerts using both process and global data gave the highest F1-score on the test set (81.50), earlier experiments showed that RFs are not among the most computationally efficient of the models tested. Therefore a decision tree trained on process-only data (26 features) is also used in this test, in case the time-to-classification is important for damage reduction despite the lower F1-score. The DT also has a very slightly higher TPR (see Table 8), so a higher damage prevention rate may be partially due to the model itself rather than just the fewer features being collected and the model classification speed.
22 fast-acting ransomware files were identified from a separate VirusShare [43] repository which (i) do not require an internet connection and (ii) begin encrypting files within the first few seconds of execution. The former condition is set because the malicious server may no longer exist and, for safety, it is not desirable to connect to it if it does. These are the types of malware that, if the proposed model could block them, would spare the user significant damage in a time-frame within which it would be difficult for a human to react.
The 22 samples were executed for 30 seconds each without the process killing model and the number of files modified was recorded. The process was repeated with 4 process killing models: DT with min. 2 alerts and 26 features, RF with min. 2 alerts and 37 features, AdaBoost regressor with 26 features and GBDT regressor with min. 4 alerts and 26 features.
It was necessary to run the killing model with administrator privileges and to write an exception for the Cuckoo sandbox agent process, which enables the host machine to read data from the guest machine, since the models killed this process. The need for this exception highlights that there are benign applications with malicious-like behaviours, perhaps especially those used for networking and security.
Figure 6: Total number of files corrupted by ransomware with no process killing and with three process killing models within the first 30 seconds of execution.
Table 9: Total number of files corrupted by ransomware with no process killing and with three process killing models within the first 30 seconds of execution. Damage reduction is the percentage of files spared compared with when no killing is implemented
Figure 6 and Table 9 give the total number of corrupted files across the 22 samples. The damage prevention column is a proxy metric denoting how many files were not corrupted using a given process killing model by comparison with no model being in place. The 22 samples on average each corrupt 910 files within 30 seconds.
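The damage prevention proxy can be written as a one-line calculation. This is a sketch of the metric as described in the text, with a hypothetical function name; the exact per-model file counts are those reported in Table 9 and are not reproduced here.

```python
def damage_prevention(corrupted_without, corrupted_with):
    """Percentage of files spared relative to running with no killing model."""
    if corrupted_without <= 0:
        raise ValueError("baseline must corrupt at least one file")
    return 100.0 * (1.0 - corrupted_with / corrupted_without)
```

The value is negative when more files are corrupted with a model in place than without, which can happen through run-to-run variance in ransomware behaviour when a model fails to detect a sample.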
The DT model almost entirely eliminates any file corruption with only three being corrupted. The RF saves 92.68% of files. The ordinal ranking of 'damage prevention' is the same as the TPR on the test set but the relationship is not proportional. The same ordinal relationship indicates that the simulated impact of process killing on the collected test set was perhaps a reasonable approximation of measuring at least fast-acting ransomware damage, despite the TPR test set metrics being based on other malware families too.
The DT demonstrates that this architecture is capable of preventing damage, but the TNR on the test set of the DT model is so low (66.19) that this model cannot be preferred to the RF (81.53 TNR), which still prevents over 90% of file damage.
The GBDT prevents some damage, and detects a comparable number of ransomware samples (1 in 5). The AdaBoost regressor detected 2 ransomware samples of the 22, and in these two cases more than 64% and 45% of files were saved respectively. Perhaps with more execution time the remaining samples would be detected, but the key benefit of process killing is to stop damaging software like these ransomware samples, and with this algorithm more files were actually encrypted than when no killing model was used; this is because there is a slight variance in the ransomware behaviour and execution time each time it runs. The Random Forest is the most plausible model, balancing damage prevention and TNR; however, the delay in classification may be a result of the requirement to collect more features and/or the run time of the model itself.

Discussion: Measuring execution time in a live environment
Though algorithm execution duration was measured above, due to the batch processing used by the models, the number of processes being classified can be increased by an order of magnitude with a negligible impact on execution time. The data collection and process killing both have O(n) complexity, where n is the number of processes; therefore the number of processes is expected to impact classification time. The RF with statistical filters has complexity O(nps), where p is the number of trees in the forest and s is the number of alerts considered by the filter, but efficient library implementations of matrix operations mean that the execution time does not scale linearly with n for the RF inference. Given this, a further experiment was carried out with the RF to measure, in a live environment, how long the data collection, model inference and process killing take as the number of processes increases. This was tested by executing more than 1000 processes in the virtual machine whilst the process killing RF runs. Some processes demand more computational resources than others, and some malware in our test set locked pages in memory [49], which prevented the model from having sufficient resources to collect data, leading to tens of seconds during which no data was captured and many processes were launched. With better software engineering practices the model may be more robust against this kind of malicious activity.
These differences in behaviour can cause the evaluation time to lag as demonstrated by the outlier points visible in Figure 7. The data shows a broadly linear positive correlation between the number of processes (being monitored or killed) and the time taken for the data collection and process killing; this confirms the hypothesis that more processes equates to slower processing time. The slowest total processing time was 0.81 seconds (seen with both 17 and 40 simultaneous processes running) but the mean processing time is just under 0.3 seconds with 65 simultaneous processes, fitting comfortably within the 1-second goal time. Additional code optimisation could greatly improve on these initial results which indicate that the processing, even using standard libraries and a high level programming language, can execute reasonably quickly.
Figure 7: Mean time to collect data, analyse data with Random Forest, and kill varying numbers of processes

Implications and Analysis
The experiments in this paper address a largely unexplored area of malware detection, by comparison with post-trace classification. Real-time processing and response have a number of benefits, outlined above, and the results presented here give tentative indications of the advantages and challenges of such an approach.
The initial experiments (Section 6.1) demonstrate that a high-accuracy RNN (as used in [11]) does not maintain high accuracy when used in real-time with an automated response to classify individual processes rather than full application traces, since a single false positive classification of sequential data cannot be outweighed by later correct predictions.
The next set of experiments (Section 6.1) showed that whilst the RNN achieves one of the highest classification accuracies of the set of algorithms tested, it is not one of the best in terms of computational resource consumption or latency. However, a clear best algorithm was not evident either, since the low-resource-consuming algorithms (like the decision tree) did not always achieve high accuracy. Furthermore, all of the supervised learning algorithms were clearly unsuited to process killing, with the highest F1-score from any algorithm being 77.85 on the test set compared with 85.55 for process-level classification alone. This 85.55 F1-score is lower than is seen in many dynamic malware detection research publications that use full-application behavioural traces, indicating the challenges of classification at the process level, where malware and benignware may share functionality.
Attempting to improve detection accuracy, three approaches were tested: statistical filtering, reinforcement learning and a regression model estimating the utility (q-value) of killing a process. Statistical filters using rolling mean or alert thresholds were the only approach to improve on the supervised learning model F1-score. Reinforcement learning tended to kill processes too early and therefore did not explore enough scenarios (and thus receive the requisite reinforcement) to allow benign processes to continue; this does not mean that future models could not improve upon this result. This may be supported by the success of the regression models in maintaining a high true negative rate, given that these models ascribed a similar utility to killing processes as the reinforcement learning models.
The accuracy metrics tested thus far simply indicate whether a process was ever killed, but do not address whether damage was actually prevented by process killing. If damage is not prevented, there is little point to process killing, and a database of alerts for analysis would be a better solution since the risk of killing benignware is eliminated. This is why the final set of experiments in Section 6.5 was conducted to test the detection models in real time and to see whether damage could be prevented, by looking at the number of files corrupted by ransomware with and without process killing. Here we found that it was possible to prevent 92% of files from being encrypted whilst maintaining a true negative rate of 82%. This result does not indicate that the system is ready for real-world deployment, but that further model analysis, perhaps including anomaly detection, could raise the true negative rate to a usable point. This work also demonstrates the damage that certain malware can carry out in a short space of time and reinforces the need for further research in this area, since previous work has either focused solely on ransomware [24] or waited minutes to begin classification [23], by which time it is too late.

Future Work
Real-time attack detection has wider applications than endpoint detection. As Alazab et al. [50] argue, Internet of Things networks in particular could benefit from real-time attack detection using heterogeneous data feeds from different sensors, combined using federated learning approaches.
However, some challenges remain to be solved; behavioural malware analysis research using machine learning regularly reports >95% classification accuracy. Though useful for analysts, behavioural detection should be deployed as part of endpoint defensive systems to leverage the full benefits of a detection model. Dynamic analysis is not typically used for endpoint protection, perhaps because it takes too long in data collection to deliver the quick verdicts required for good user experience. Real-time detection on the endpoint allows for observation of the full trace without the user having to wait. However, real-time detection also introduces the risk that malware will cause damage to the endpoint. This risk requires that processes detected as malicious are automatically killed as early as possible to avoid harm.
There are some key challenges to implementation, which have been outlined in this paper:
• The need for signal separation drives the use of individual processes and only partial traces can be used.
• The significant drop in accuracy on the unseen test set, even without process killing demonstrates that additional features may be necessary to improve detection accuracy.
• With the introduction of process killing, the poor performance of the models on either benignware classification (RF min 2 alerts: TNR 81% with an 88% TPR on the test set) or on malware classification (GBDT regressor min 4 alerts: 56% TPR with a 94% TNR on the test set) means that considerable further work is needed before very early stage real-time detection can be considered for real-world use.
• Real-time detection using full execution traces of processes, however, may be viable. This is useful to handle VM-aware malware which may only reveal its true behaviour in the target environment. Although the more complex approach using DQN algorithms did not outperform the supervised models with some additional statistical thresholds, the regression models had better performance in correctly classifying benignware. Reinforcement learning could still be useful for real-time detection and automated cyber defence models, but the DQN in these experiments did not perform well.
• Despite the theoretical unsuitability of supervised learning models to state-action problems, these experiments demonstrate how powerful supervised learning can be for classification problems, even if the problem is not quite the one that the model is attempting to solve.
• Future work may require a more comprehensive manual labelling effort at the process level and perhaps labelling sub-sections of processes as malicious or benign.
An additional consideration for real-time detection with automated actions is whether this introduces an additional denial-of-service vector using process injection for example to trigger process killing. This may also however indicate that an attacker is present and therefore aid the user.

Conclusions
This paper has built on previous work in real-time detection to address some of the key challenges: signal separation, detection with partial execution traces and computational resource consumption with a focus on preventing harm to the user, since real-time detection introduces this risk.
Behavioural malware detection using virtual machines is a well-established research field yielding high detection accuracy in recent literature [3,20,6,11]. However, as is shown here, fixed-time execution in a sandbox may not reveal malicious functionality. Real-time malware analysis addresses this issue but risks executing malware on the endpoint and requires detection to take place at the process level, which is more challenging as the definition of a malicious process can be unclear. These two reasons may account for the limited literature on real-time detection. Looking forward real-time detection may become more popular if static data manipulation and VM-evasion continue to be used and the costs of malicious execution continue to rise. Real-time detection does not need to be an alternative to these approaches, but could hold complementary value as part of a defence-in-depth endpoint security.
To the best of our knowledge, previous real-time detection work [23] has tested a maximum of 5 simultaneous applications, whereas users may run far more. This paper has demonstrated that up to 35 simultaneous applications (and nearly 100 simultaneous processes) can be constantly monitored. Moreover, these results demonstrated that data collection was a greater limiting factor than the machine learning algorithms, which can easily process 1000 samples with negligible impact on performance. This result is not too surprising, since batch processing allows the algorithms to achieve O(1) complexity, by comparison with O(n) for data collection.
Automatic actions are necessary in response to a detection if the goal is to prevent harm. Otherwise, this is equivalent to letting the malware fully execute and simply monitoring its behaviour, since human response times are unlikely to be quick enough for fast-acting malware. From a user perspective the question is not 'What percentage of malware was executed?' or 'Was the malware detected in 5 or 10 minutes?' but 'How much damage has been done?'. This paper found that, using simple statistical filters on top of supervised learning models, it was possible to prevent 92% of files from being corrupted by fast-acting ransomware. This reduces the requirements on the user or organisation to remediate the damage, since it was prevented in the first instance (the rest of the attack vector would remain a concern). This approach does not achieve the detection accuracies of state-of-the-art offline behavioural analysis models but, as stated in the introduction, these models typically use the full post-execution trace of malicious behaviour. Delaying classification until post-execution negates the principal advantages of real-time detection. However, the proposed model presents an initial step towards a fully automated endpoint protection model, which becomes increasingly necessary as adversaries become more and more motivated to evade offline automated detection tools.
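To make the idea of a statistical filter on top of a supervised model concrete, the following minimal Python sketch smooths per-process classifier scores before any kill decision is taken. It is an illustration under stated assumptions rather than the filter used in the experiments: the rolling-mean rule, the window length, the threshold and the names ScoreFilter and processes_to_kill are all hypothetical, and the scores are assumed to come from some supervised model that outputs a maliciousness estimate in [0, 1] for each process at each time step.

```python
from collections import defaultdict, deque


class ScoreFilter:
    """Rolling-mean filter over per-process classifier scores.

    Hypothetical illustration: window and threshold are assumed values,
    not parameters reported in the paper.
    """

    def __init__(self, window=3, threshold=0.8):
        self.window = window
        self.threshold = threshold
        # One fixed-length score history per process ID.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, pid, score):
        """Record the latest score for pid; return True if it should be flagged."""
        scores = self.history[pid]
        scores.append(score)
        # Require a full window so a single noisy score cannot trigger a kill.
        if len(scores) < self.window:
            return False
        return sum(scores) / len(scores) >= self.threshold


def processes_to_kill(score_filter, snapshot):
    """Return the PIDs whose smoothed score crosses the threshold.

    snapshot maps PID -> latest model score for one monitoring interval.
    """
    return [pid for pid, score in snapshot.items() if score_filter.update(pid, score)]
```

In this sketch a process is only flagged once its average score over a full window reaches the threshold, which trades a short detection delay for fewer kills caused by isolated high scores. The kill action itself is deliberately left out; in practice the returned PIDs would be passed to whatever process-termination mechanism the endpoint provides.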