Modified Decision Tree Technique for Ransomware Detection at Runtime through API Calls

Ransomware (RW) is a distinctive variety of malware that encrypts files or locks the user's system, holding the files hostage and causing huge financial losses to users. In this article, we propose a new model that extracts novel features from the RW dataset and classifies files as RW or benign. The proposed model can detect a large number of RW samples from various families at runtime and scans network, registry, and file system activities throughout execution. API-call series are used to represent the behavior-based features of RW. The technique extracts a fourteen-feature vector at runtime and analyzes it by applying online machine learning algorithms to predict RW. To validate effectiveness and scalability, we test 78550 recent malicious and benign samples and compare the model with random forest and AdaBoost; the testing accuracy reaches 99.56%.


Introduction
Computers have become an integral part of our daily life, and the world cannot imagine life without them. The Internet and computer applications have made daily life easier. This development has also brought several threats to computers, most notably malware [1]. Malware is malicious code; the term combines "mal," from malicious, and "ware," from software.
Through e-mail, attackers send a link or file, and when the user clicks the link or opens the file, malware such as viruses, ransomware (RW), and spyware gets executed [2]. Malicious software consists of code developed by cyber attackers and designed to extensively damage the victim's data. There are numerous types of malware, but the most common are spyware, viruses, and scareware. Spyware is designed to spy on users' activities. It is a hidden application that runs secretly in the background on the victim's computer. This type of malicious software collects information such as credit card details, passwords, and other sensitive information without the user's permission. A computer virus is a common type of malware that attaches itself to other files on the victim's machine. It is downloaded onto, or installs itself in, the computer system and spreads quickly. It also damages the main functionality of the computer system and corrupts or locks the victim's system and files [3]. The third type of malware is scareware, also known as RW, which comes at a high price. It can lock or encrypt user data and deny users access to their data until the demanded ransom is paid.
RW has attacked some of the largest organizations in the world. It is the main type of malware used by cybercriminals, and it is very common. The aim of this malware is to collect money as a ransom. RW encrypts files or locks the user's system, holding the user's files hostage for financial gain [4]. In today's Internet market, RW is the most dangerous and significant security threat and tops the list. The history of RW goes back to 1980 [5]. In the last few years, such attacks have made headlines around the world. They have resulted in a growing number of new families; e.g., Cryptowall 3.0 is known as a costly and effective RW family that caused around $325 million in damage to the industry. The Sony RW attack was also very dangerous and received huge media coverage. North Korea was behind the attack, as the US government confirmed [6][7][8][9].
1.1. Types of Ransomware (RW). Given the constant arrival of new RW and the stories that arise weekly, it is difficult to identify the different strains, as each of them spreads differently. They generally follow similar strategies to take advantage of users' security weaknesses and hold data hostage [10]. There are several forms of RW, some of which are discussed here in detail.

Bad Rabbit RW.
Bad Rabbit is a type of scareware that is at the top of the list. This RW infected different organizations in Eastern Europe and Russia. The RW spreads by presenting itself as a fake Adobe Flash update on compromised websites. When this RW infects a system, the user is directed to a payment page stating that the system has been infected and that a payment of $285 is required [10].

Cerber RW.
It is the most dangerous and powerful RW because it works even when the machine is not connected to the Internet, so disconnecting the PC does not stop it. Cerber's function is to encrypt the files of infected users, who must then pay money to regain access to their files. It is delivered as an infected Microsoft Office document attached to an e-mail sent to the victim. Opening the attached file automatically encrypts the files with the Rivest Cipher 4 (RC4) and Rivest-Shamir-Adleman (RSA) algorithms and modifies them with Cerber extensions [11].

CryptoLocker RW.
Crypto RW is also a special type of malware. It works like a Trojan horse and is likewise used to earn money. It encrypts files on the targeted system, and the users are asked to pay to decrypt their files. These threats reach the user's system through spam e-mails, ads, fake sites, or other malicious methods. Once the system is infected, the Trojan creates several registry entries to store the path of the encrypted files and to run when the system restarts. It encrypts files with specific extensions and creates additional files with instructions on how to obtain the decryption key. This dangerous family tries to convince the user to pay money to get the key, and it uses different kinds of techniques to extract the ransom [12].

Cryptowall RW.
Ransom Cryptowall is a Trojan horse that encrypts files on the compromised computer and asks the user to pay for file decryption. These threats typically arrive on the affected PC through exploit kits, spam e-mails, malicious ads, compromised sites, or other malware. Once the Trojan has entered the compromised system, it creates several registry entries to store the path of the encrypted files and to run when the computer restarts. It encrypts files with specific extensions on the system and creates additional files with instructions on how to obtain the decryption key. This dangerous family attempts to convince the user to pay money to get the key that unlocks their documents, and it uses different techniques to do so [13].
RW is a specialized form of malware that encrypts files and renders them inaccessible until the victim pays a ransom. It is an extremely serious problem, and it is quickly getting worse. The statistics gathered by the FBI's Internet Crime Complaint Center (IC3) for 2018 show that Internet-enabled theft, fraud, and exploitation remain pervasive and were responsible for a staggering $2.7 billion in financial losses [14]. The FBI reports that the IC3 received 351,936 complaints in 2018, an average of more than 900 every day. There has been a dramatic increase in extortion payments, with tens of thousands of ransomware victims paying several hundred dollars each to recover their encrypted files. In some instances, the ransom is larger: the South Korean web hosting company Nayana paid 397.6 Bitcoin (about $1 million) in June 2017, and Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin in February 2016 [15].
This emerging issue needs the attention of the research community to detect and prevent RW families and thereby protect users from huge losses. In this paper, we propose a robust solution that detects RW at runtime by monitoring network, registry, and file system activities. We use the API-call series to represent the behavior-based features of malware. The proposed methodology extracts a 14-feature vector through runtime analysis and applies online machine learning algorithms to classify malware samples in a distributed and scalable architecture.
This paper is organized as follows: Section 2 reviews recent work on RW classification and detection. In Section 3, we present our proposed methodology in detail. Section 4 describes the experiments, the dataset used, the running time of the proposed approach, the evaluation metrics, and the experimental results. In Section 5, we conclude the paper and outline future directions.

Literature Review
In this section, the existing research on the detection and classification of RW is analyzed. The summary of the literature on RW with findings is given in Table 1. The existing computational models for the detection and classification of RW are summarized in Tables 1 and 2. Alhawi et al. [16] presented a machine learning (ML)-based solution for the detection of RW.
The dataset was collected from VirusTotal; it includes both malicious and benign data and contains 264 records covering 9 RW families and 3 types of benign software. Wireshark was used to capture the data, and T-Shark was used to extract the features. The experiment was carried out in WEKA version 3.8.1. The WEKA machine-learning tool splits the dataset for training and testing purposes. The training dataset contained 75,618 samples, and the test dataset contained 48,526 samples. The training and testing datasets were split as 70 percent and 30 percent, respectively. Six different machine learning algorithms were applied. Using the network traffic features of the dataset, the authors obtained a true positive detection rate of 97.1 percent, and using a decision tree classifier, they achieved a zero false positive rate (FPR) and a true positive rate (TPR) of 96.3 percent.
Rhode et al. [17] carried out a study on the detection of RW. To achieve high accuracy, the authors presented a novel approach. The proposed algorithm detects RW files during the first 20 sec of execution. The dataset was collected from VirusTotal and VirusShare and contains 23,145 benign and 2,286 malicious records. Preprocessing converted all alphabetic values into a numerical range to represent RW. Recurrent neural networks (RNNs) were applied to predict RW. The accuracy is 94 percent at 5 sec and 96 percent at 10 sec. The minimum false negative rate (FNR) of the model was 4.5 percent, and the FPR was just 3 percent. The actual value of the model at 20 sec is 93 percent. The experiment was carried out in Python version 2.7, using Keras to implement the RNN model.
Carlin et al. [20] developed a new technique for detecting cryptomining through dynamic analysis. The dataset was collected from VirusShark and consists of 490 samples, of which 194 are benign and 296 are malicious cryptomining HTML files. The random forest (RF) classifier was used and implemented in WEKA version 3.9, with 10-fold cross-validation. The best accuracy of RF is 99.05 percent. The FPR is 99.7 percent, and the FNR is 98.6 percent.
Carlin et al. [21] emphasized the analysis of low-level opcodes, both dynamic and static, to detect malware at runtime, using 1,000 labeled samples to examine the effect of traditional AV labels. The dataset was collected from VirusShare. The authors selected it on the basis of size, modality, and facility. Its 180,000 records are malware, and all records are named by their message digest (MD5) hash with no other metadata. Preprocessing retained only 1,000 opcodes with a 1.0 percent margin. The dataset contains 764 benign and 18,827 malicious samples.
The count-based classifier uses RF and was implemented in WEKA version 3.8. The best accuracy of the RF is 98.4 percent.
Takeuchi et al. [24] introduced RW detection using support vector machines (SVMs). The dataset consists of 588 samples, of which 312 are benign and 276 are RW, and was collected from VirusTotal. The authors encode different sequences of API calls into the same vector representation.

Methodology
In this section, the new methodology is discussed. The main objective of the new methodology is the detection of RW families at runtime. The dataset used in this paper was collected from the VirusTotal website [27]. VirusTotal is an online service that examines files and uniform resource locators (URLs) to help detect worms, viruses, and other kinds of malicious content using website scanners and antivirus engines. The dataset is used to distinguish benign files from malware. The proposed methodological model has different phases, as shown in Figure 1.
First, the selected dataset is preprocessed. In the second phase, useful features are extracted from the preprocessed dataset using API calls. In the third phase, the dataset is divided into training and testing subsets. Finally, for classification, three different machine-learning algorithms, i.e., modified decision tree, random forest, and AdaBoost, are used.

Data Sets.
The dataset was collected from VirusTotal. It consists of 78550 samples, of which 35369 are malware and 43191 are benign. The dataset has a total of 18 features, from which we select the 14 features most relevant for classifying a file as malware or benign. To improve the accuracy and reliability of the results, the 10-fold cross-validation technique is applied to the data [27].
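The 10-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `k_fold_indices`, the seed, and the sample count of 25 are assumptions made for the example and are not taken from the paper's implementation.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical small example: 25 samples split into 10 folds.
splits = list(k_fold_indices(25, k=10))
```

Each sample is used for testing exactly once and for training in the other nine rounds, so the reported score is an average over ten train/test runs rather than the result of a single split.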

Feature Extraction.
In this step, we extract 14 features from the dataset. The details of these features are given in Table 3. The file name and MD5 hash features are dropped from the dataset. The last feature is used as the class label, i.e., benign or malware.
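A minimal sketch of this step, assuming the dataset is available as a CSV file whose last column is the class label. The column names (`FileName`, `md5Hash`, `Benign`) and the two sample rows are hypothetical placeholders; the real dataset's headers and values may differ.

```python
import csv
import io

# Hypothetical column names; the real dataset's headers may differ.
DROPPED = {"FileName", "md5Hash"}   # identifier columns, not used as features
LABEL = "Benign"                    # last column: 1 = benign, 0 = malware

# Made-up rows standing in for the real file.
SAMPLE_CSV = """FileName,md5Hash,Machine,MajorImageVersion,ExportSize,Benign
a.exe,0f3c,332,0,0,0
b.dll,9a1e,34404,6,1024,1
"""

def load_features(text):
    """Return (feature_names, X, y) with the identifier columns dropped."""
    reader = csv.DictReader(io.StringIO(text))
    names = [c for c in reader.fieldnames if c not in DROPPED and c != LABEL]
    X, y = [], []
    for row in reader:
        X.append([float(row[c]) for c in names])
        y.append(int(row[LABEL]))
    return names, X, y

names, X, y = load_features(SAMPLE_CSV)
```

Dropping the file name and hash keeps the classifier from memorizing individual files instead of learning from the header-derived features.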

Training and Testing.
After extracting all feature vectors, we use them together with their class labels to train the model. Then, the trained classifiers can predict the labels of new instances given in the form of feature vectors. Later, the performance of the proposed model is calculated. In this research, we use three different machine-learning algorithms, namely, decision tree, random forest, and AdaBoost.

Classification.
During classification, the dataset is split into training and test datasets. This process plays a key role in RW detection and ML. The training set is used to train the model, and the test set is used to validate the model's results.

Modified Decision Tree.
Algorithm 1 splits a large collection of records into successively smaller subsets by applying a sequence of simple decision rules. It partitions the feature space into subsets, each consisting of a homogeneous group of samples [28]. The outcome is a tree with leaf nodes and decision nodes. The topmost decision node in the tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data [29].
The decision tree uses information gain theory to select the best partitioning attribute from the dataset. Info(X) is calculated using (1). The key advantage of the decision tree is its easy implementation. Decision trees and the underlying principle on which they work are easy to interpret and understand compared with other, more complex machine-learning algorithms.
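As a minimal sketch of the information-gain criterion, assuming Info(X) in (1) is the usual Shannon entropy, $\mathrm{Info}(X) = -\sum_i p_i \log_2 p_i$ over the class proportions $p_i$, the following Python code scores candidate thresholds on a single numeric feature. The feature values and function names are hypothetical and are not taken from Algorithm 1.

```python
import math
from collections import Counter

def info(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(labels)
    weighted = (len(left) / n) * info(left) + (len(right) / n) * info(right)
    return info(labels) - weighted

def best_threshold(values, labels):
    """Return the candidate threshold with the highest information gain."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    return max(candidates,
               key=lambda t: information_gain(values, labels, t),
               default=None)

# Hypothetical export-size values: zero for malware, positive for benign files.
export_size = [0, 0, 0, 512, 1024, 2048]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]
threshold = best_threshold(export_size, labels)
gain = information_gain(export_size, labels, threshold)
```

On this toy data the split at 0 separates the two classes completely, so the gain equals the full entropy of the label set (1 bit). A tree builder would apply the same scoring to every feature and recurse on the resulting subsets.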

Random Forest.
Algorithm 2 is a combination of different decision trees, each with its own nodes but trained on different data, which leads to different leaves (Figure 2). It combines the decisions of multiple decision trees to find the best answer, which represents the average of the decision trees [4]. Random forest is a flexible, easy-to-use machine-learning algorithm that generally produces a good result even without hyperparameter tuning. It can be used for both regression and classification problems [30].
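The two ingredients described above, training each tree on different data and combining the trees' decisions, can be illustrated with a short sketch. It is not Algorithm 2: the three "trees" are hand-written stand-in rules on hypothetical feature names, and for classification the combination is taken to be a majority vote.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, giving each tree different data."""
    return [rng.choice(rows) for _ in rows]

def forest_predict(trees, x):
    """Combine the trees' votes; the most common label wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hand-written stand-ins for trained trees (hypothetical rules); a real
# forest would learn each tree from its own bootstrap sample.
trees = [
    lambda x: "malware" if x["export_size"] == 0 else "benign",
    lambda x: "malware" if x["major_image_version"] == 0 else "benign",
    lambda x: "malware" if x["resource_size"] == 0 else "benign",
]

sample = {"export_size": 0, "major_image_version": 0, "resource_size": 4096}
prediction = forest_predict(trees, sample)

# Each bootstrap sample has the same size as the original data.
resampled = bootstrap_sample(list(range(10)), random.Random(0))
```

Two of the three stand-in trees vote for malware here, so the ensemble's answer follows the majority even though one tree disagrees.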

Table 3: Details of the extracted features.
An RVA in the portable executable (PE) header with a value of zero indicates that the field is not used. All tables and structure fields must be aligned on their natural boundaries, with the possible exception of the debug information.
3. Major image version: It is the file version. This record is user-definable and not connected to the task of the application. Many benign programs have more varieties and a larger image version set. Malware typically has a value of 0.
4. MajorOSVersion (major operating system version): It is the major operating system version required to run .exe files.
5. ExportRVA (export relative virtual address): The relative virtual address (RVA) of the export table. The location is relative to the beginning of the image base. The export address table holds the locations of exported data, entry points, and absolutes. An ordinal value is used to index the export address table.
6. Export size: It presents the size of the export records. Only DLLs, not runtime applications, have export tables. So, the value of this feature may be positive for clean files, which contain many DLLs, and 0 for virus files.
7. IatRVA: It is the relative virtual address of the import address table. The value of this feature is 4096 bytes for most clean files and 0 or a very large value for virus files.
8. Major linker version: The major version number of the linker that produced the file, as recorded in the PE header. The resource size recorded in the PE header is sometimes 0 for malware, because malware sometimes has no resources.
9. Minor linker version: The minor version number of the linker that produced the file.
10. Number of sections: The number of sections in the PE file.
11. Size of stack reserve: The amount of virtual memory to reserve for the initial thread's stack.
12. DLL characteristics: It is a set of flags indicating under which circumstances a dynamic-link library (DLL) initialization function is called.
13. Resource size: It gives the size of the resource section. Some malware records may have no resources. Benign files may have larger resources.
14. Machine: It defines the architecture type of the computer. The program can be run only on the specified machine or on a system that emulates this type.

AdaBoost.
AdaBoost stands for adaptive boosting and combines weak classifiers into a strong classifier. Adaptive boosting is the first practical learning technique for building a strong classifier by combining weaker ones [31]. A tree with just one node and two leaves is called a decision stump (Figure 3). Here, h(x) is a weak classifier. The strong classifier F(x) is computed as a weighted majority vote of the weak hypotheses h(x), where each hypothesis is assigned a weight. Each weak classifier learns by considering one simple feature, and the h(x) built on the most useful feature for classification is selected (Figure 4).
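The AdaBoost idea of a weighted vote over decision stumps can be sketched in a few lines of Python. This is a generic textbook-style variant written for illustration, not the implementation evaluated in this paper; the toy data, the ±1 label encoding (+1 for RW, -1 for benign), and all function names are assumptions.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                # The stump predicts `polarity` at or below the threshold.
                pred = [polarity if row[f] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best

def stump_predict(stump, row):
    _, f, t, polarity = stump
    return polarity if row[f] <= t else -polarity

def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight of this stump
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted majority vote of the stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy one-feature data (hypothetical): +1 = RW, -1 = benign.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
```

No single stump can separate this toy data, because the negative class sits between two groups of positives; the best individual stump misclassifies a third of the points. After three boosting rounds the weighted vote classifies all six training points correctly, which is the sense in which weak classifiers are combined into a strong one.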

Experiments and Results
In this section, the experimental environment, experiments, and results are discussed. The datasets are statistically analyzed to understand the data. Then, different classification techniques are applied to classify the malicious and benign files, and finally, performance evaluation measures are used to assess the classifiers. The dataset used in this study consists of 78550 samples, of which 35369 are malware and 43191 are benign. RW is a type of malware, and benign files are goodware. The dataset is nearly balanced; therefore, it does not need balancing techniques.

Experimental Environment.
All the experiments for this study were conducted on a Core i5 machine with a 2.4 GHz CPU and 8 GB of memory. The decision tree, random forest, and AdaBoost were implemented in Python because of its simplicity and scalability.

Evaluation Metrics.
In this study, different evaluation measures are used to compare the performance of the classifiers. These include accuracy, sensitivity, specificity, and f1. All these measures are based on the confusion matrix given in Table 4.
Accuracy is the most intuitive performance measure. It is the ratio of correctly predicted observations to the total number of observations, and it is calculated using (2). Sensitivity (recall) is the proportion of correctly predicted positive observations among all observations in the actual positive class, and it is calculated using (3). The negative-class prediction power of the classifier is called specificity, which is calculated using (4). Finally, the f1 measure, the harmonic mean of sensitivity and specificity, is calculated using (5):
$$ f1 = \frac{2 \times \text{sensitivity} \times \text{specificity}}{\text{sensitivity} + \text{specificity}}. \quad (5) $$
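The measures above can be computed directly from the confusion-matrix counts. The sketch below assumes the usual definitions (accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) and uses the f1 formula as stated in the text; the labels and predictions are made-up values, and the code assumes both classes are present so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # share of actual positives detected
    specificity = tn / (tn + fp)   # share of actual negatives recognized
    # f1 as stated in the text: harmonic mean of sensitivity and specificity.
    f1 = 2 * sensitivity * specificity / (sensitivity + specificity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Made-up labels: 1 = RW (positive class), 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = evaluate(y_true, y_pred)
```

With three true positives, one false negative, five true negatives, and one false positive, the sketch gives an accuracy of 0.8, a sensitivity of 0.75, and a specificity of 5/6, so f1 is 15/19, or about 0.789.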
(1) Accuracy-Based Analysis.
(2) Sensitivity-Based Analysis. The sensitivity-based comparison under 10-fold cross-validation is shown in Table 6. The experiments show that the sensitivity of the decision tree is the highest.
(3) Specificity-Based Analysis. Table 7 presents the specificity-based comparison of the different classifiers. The experiments show that the decision tree has the highest specificity, with a value of 99.62%. Specificity (precision) is the proportion of correctly classified positive observations over the total predicted positive observations.
(4) f1 Measure-Based Analysis. Table 8 presents the f1-measure-based comparison of the classifiers. The experiments show that the decision tree has the highest f1 value, 99.55%.

Performance Comparison with State-of-the-Art Techniques.
Comparing the performance of the different classifiers on the dataset makes it clear that the proposed technique achieves a higher accuracy than the already developed models. The results in Table 9 show that the modified decision tree has the highest accuracy, 99.56%. AdaBoost has the lowest accuracy, 98.37%. Random forest falls in between, with an accuracy of 99.38%.
Table 10 presents the comparison of the proposed algorithm with multiple other methods, which also clearly shows that the proposed technique achieves a higher accuracy than the already developed models.

Conclusion and Future Direction
In this research, a scheme for RW detection at runtime is developed that uses a preprocessed dataset comprising benign and RW files. Benign files are goodware, and RW is a special type of malware that keeps the data encrypted until a ransom is paid to the attacker. In the experiments, three different algorithms, namely, decision tree, random forest, and AdaBoost, are used to distinguish RW from benign files. Among the three algorithms, the modified decision tree performed best in terms of accuracy, sensitivity, specificity, and f1-measure. Our experimental outcomes demonstrate that the testing and training accuracy of the presented malware classification reaches 99.56%. Researchers have stated some facts about sheltering devices from attack and established some parameters to save data from future attacks. Because RW is a Trojan-type attack and malware, anomaly-based IDS may be used in the future to detect abnormal behaviors of the network, and data mining techniques can be used to detect attack activity.

Data Availability
The data used to support the findings of the study are available upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest associated with the publication of this article.