Using a Subtractive Center Behavioral Model to Detect Malware

In recent years, malware has evolved by using different obfuscation techniques; due to this evolution, the detection of malware has become problematic. Signature-based and traditional behavior-based malware detectors cannot effectively detect this new generation of malware. This paper proposes a subtractive center behavior model (SCBM) to create a malware dataset that captures semantically related behaviors from sample programs. In the proposed model, system paths, where malware behaviors are performed, and malware behaviors themselves are taken into consideration. This way malicious behavior patterns are differentiated from benign behavior patterns. Features that could not exceed the specified score are removed from the dataset. The datasets created using the proposed model contain far fewer features than the datasets created by n-gram and other models that have been used in other studies. The proposed model can handle both known and unknown malware, and the obtained detection rate and accuracy of the proposed model are higher than those of the known models. To show the effectiveness of the proposed model, 2 datasets with score and without score are created by using SCBM. In total, 6700 malware samples and 3000 benign samples are tested. The results are compared with those derived from n-gram and models from other studies in the literature. The test results show that, by combining the proposed model with an appropriate machine learning algorithm, the detection rate, false positive rate, and accuracy are measured as 99.9%, 0.2%, and 99.8%, respectively.


Introduction
Any software that performs malicious activities on victim machines is considered to be malware. Sophisticated malware uses packing and obfuscation techniques to make the analysis and detection processes more difficult [1]. Malware lies at the root of almost all cyber threats and attacks including global threats, advanced persistent threats (APTs), sensitive data theft, remote code execution, and distributed denial of service (DDoS) attacks. In recent years, the number and sophistication of malware attacks, and the economic damage they cause, have been increasing exponentially. According to scientific and business reports, approximately 1 million malware files are created every day. According to Cybersecurity Ventures, cybercrime will cost the world economy approximately $6 trillion annually by 2021 [2]. According to the same report, ransomware cost around $11.5 billion globally in 2019 [2].
Mobile malware is on the rise. According to the McAfee mobile threat report, there is a substantial increase in backdoors, fake applications, and banking Trojans for mobile devices [3]. The number of new mobile malware variants increased by 54% from 2016 to 2017 [4], and most types of unknown and mobile malware are evolved versions of known malware [5]. Moreover, malware attacks related to the healthcare industry, cloud computing, social media, Internet of Things, and cryptocurrencies are also on the rise [2,6].
It is almost impossible to propose a method or system that can detect every new generation of sophisticated malware. The 4 main methods used to detect malware are based on signature, behavior, heuristic, and model checking detection. Each method has advantages and disadvantages.
A signature-based malware detector examines the features that encapsulate the program's structure and uniquely identify the malware. This method detects known malware efficiently, but it cannot detect unknown malware. A behavior-based malware detector observes program behaviors using monitoring tools and determines whether the program is malware or benign. Although program code changes, the behavior of the program remains relatively the same; thus, new malware can be detected with this method [7]. However, some malware does not run properly in a protected environment (e.g., a virtual machine or sandbox), and thus, the malware sample may be incorrectly marked as benign.
In recent years, heuristic-based detection methods have been used frequently. These methods are complex detection methods that apply both experience and different techniques such as rules and machine learning techniques [8]. However, even if the heuristic technique can detect various forms of known and unknown malware [7], it cannot detect new malware that is quite different from existing malware. In model checking-based detection, malware behaviors are manually extracted, and behavior groups are coded using linear temporal logic (LTL) to display a specific feature [9]. Although model checking-based detection can successfully detect some unknown malware that could not be detected with the previous 3 methods, it is insufficient for detecting all new malware.
In this paper, the subtractive center behavior model (SCBM), which captures semantically associated behaviors when creating a dataset, is proposed. In this model, in addition to malware behaviors, system paths where malware behaviors are executed are also considered. The proposed model makes the following contributions:
(i) SCBM is proposed to create a malware dataset with fewer features than known models.
(ii) Instead of directly using system calls as behaviors, system calls are mapped to relevant behaviors.
(iii) Behaviors are divided into groups, and risk scores are calculated based on the system path and active/passive behaviors.
(iv) Features are extracted from behaviors according to the type of resources and instances that have been used. This way malicious behavior patterns are segregated from benign behavior patterns.
(v) The proposed model can handle both known and unknown malware.
(vi) The obtained detection rate and accuracy of the proposed model are higher than those of the known models.
The rest of this paper is organized as follows. Section 2 defines malware and describes trends in malware technologies. Related work is summarized in Section 3. SCBM is explained in Section 4, and the case study is presented in Section 5. The results and discussion are provided in Section 6. Finally, the limitations and future works are given in Section 7, and the conclusion is given in Section 8.

Definition of Malware and Trends in Malware Technologies
Any software that intentionally executes malicious payloads on victim machines is considered to be malware [7]. There are different types of malware including viruses, worms, Trojan horses, rootkits, and ransomware. Common malware types and their primary characteristics can be seen in Table 1. The malware types and families are designed to affect the original victim machine in different ways (e.g., damaging the targeted system, allowing remote code execution, and stealing confidential data). Generally, hackers launch an attack by using malware, which exploits vulnerabilities in existing systems such as buffer overflow, injection, and sensitive data misconfiguration [10]. These days, the classification of malware is becoming more complex because some malware instances can present the characteristics of multiple classes at the same time [11].
Viruses, which are considered to be the first malware that appeared in the wild, were defined as self-replicating automata by John von Neumann in the 1950s. In practice, however, the first virus, called "the Creeper," was created in 1971 by Bob Thomas [12,13]. In the early days, this software was written for simple purposes, but in time, it was replaced by a new generation of malware that targeted large companies and governments. Malware that runs in the kernel mode is more destructive and difficult to detect than traditional malware, and it can be defined as a new generation (next generation) of malware. The comparison between traditional and new generation malware can be seen in Table 2. The inability to implement the operating system control features in the kernel mode makes the detection of new generation malware difficult. This malware can easily bypass protection software that is running in the kernel mode such as antivirus software and firewalls. In addition, by using this software, targeted and persistent cyberattacks that have never been seen before can be launched, and more than one type of malware can be used during the attacks. Examples of traditional versus new generation malware can be seen in Figures 1 and 2.
M represents malware, and (P1, P2, P3, P4) show the running processes that interact with the malware. First, M copies itself into different processes such as P1, P2, and P3. Then, M deletes itself from the system to make itself invisible (Figure 2). In the early days, rootkits used similar techniques to hide themselves from the system. Over time, however, many other kinds of malware (in some cases, rootkits combined with viruses, worms, and Trojan horses) have started to use similar techniques to hide themselves as well. With the help of the processes into which it has recently been copied (P1 ⟶ P2; P1 ⟶ P3; P3 ⟶ P4), the malware connects to a remote system and makes changes to the victim's operating system. Even if the actual malware containing the malicious code has deleted itself from the system, the new version of the malware remains in and affects the system because the actual malware injected itself into different processes such as existing system files, third-party software, and newly created processes, which makes the malware almost impossible to detect. To detect the malicious software shown in Figure 2, M and the P1, P2, P3, and P4 processes must be examined separately, and the relations among these processes should be determined.
In addition, the new generation malware uses common obfuscation techniques such as encryption, oligomorphic, polymorphic, metamorphic, stealth, and packing methods to make the detection process more difficult. This makes it practically impossible to detect all malware with a single detection approach. The well-known obfuscation techniques can be explained as follows:
(1) Encryption: malware uses encryption to hide the malicious code block within its entire code [9]. Thus, malware becomes invisible in the host.
(2) Oligomorphic: a different key is used when encrypting and decrypting the malware payload. Hence, malware that uses the oligomorphic method is more difficult to detect than malware that uses encryption alone.
(3) Polymorphic: malware uses a different key to encrypt and decrypt, as in the oligomorphic and encryption methods. However, the encrypted payload portion contains several copies of the decoder. Thus, polymorphic malware is more difficult to detect than oligomorphic malware.
(4) Metamorphic: the metamorphic method does not use encryption. Instead, it uses dynamic code hiding in which the opcode changes on each iteration when the malicious process is executed [9]. It is very difficult to detect such malware because each new copy has a completely different signature.
(5) Stealth: the stealth method, also called code protection, implements a number of countertechniques to prevent the malware from being analyzed correctly. For example, it can make changes on the system and keep them hidden from detection systems.
(6) Packaging: packaging is an obfuscation technique that compresses malware to prevent detection or hides the actual code by using encryption. Due to this technique, malware can easily bypass firewalls and antivirus software [7]. Packaged malware needs to be unpacked before being analyzed.

Table 1 (surviving rows; the malware type of the first row is missing from the source):
[type missing] | Collects victim's sensitive information and sends it to third parties | Commonly used to access credit card information or to identify user habits
Obfuscated malware | Can be any type of malware | Uses obfuscation techniques to make the detection process more difficult

Related Work
In recent years, there has been a rapid increase in the number of studies on malware analysis and detection. In the early years, signature-based detection was used widely. Over time, researchers have developed new techniques for detecting malware including detection techniques based on behavior, heuristics, and model checking. There is huge demand for methods that effectively detect complex and unknown malware. Thus, we present related research from the literature and examine the pros and cons of each study. The summary of related works can be seen in Table 3.
The similarities determined among features by using system calls were described in [14,20]. Wagener et al. [14] proposed a flexible and automated approach that considered system calls to be program behaviors. They used an alignment technique to identify similarities and calculated the Hellinger distance to compute associated distances. The paper claimed that the classification process can be improved using a phylogenetic tree that represents the common functionalities of malware. They also claimed that obfuscated malware variants that show similar behaviors can be detected. The limitations of the paper can be summarized as follows:
(1) Little information is provided about the malware dataset.
(2) Statistical evaluation of performance is not provided.
(3) Comparison of the proposed method against other methods is not given.
Besides, it is not clear how the phylogenetic tree can improve the performance.
Shan and Wang proposed a behavior-based clustering method to classify malware [20]. Behaviors were generated using system calls, and features within a cluster were shown to be similar. According to the paper, the proposed method can detect 71.1% of unknown malware samples without FPs, while the performance overhead is around 9.1%. The proposed method is complex and not scalable for large datasets, and there are some performance issues on servers. Eliminating these deficiencies would improve the model performance.
A graph-based detection schema was defined in [15,17,21]. Kolbitsch et al. [21] proposed a graph-based detection method in which system calls are converted into a behavior graph, where the nodes represent system calls and the edges indicate transitions among system calls, to show the data dependency. e program graph to be marked is extracted and compared with the existing graph to determine whether the given program is malware. Although the proposed model has performed well for the known malware, it has difficulties detecting unknown malware.
Park et al. proposed a graph method that specifies the common behaviors of malware and benign programs [15]. In this method, kernel objects are determined by system calls, and behaviors are determined according to these objects. According to the paper, the proposed method is scalable and can detect unknown malware with high detection rates (DRs) and false positive (FP) rates close to 0%. In addition, the proposed model is highly scalable regardless of new instances added and robust against system call attacks. However, the proposed method can observe only partial behavior of an executable. Exploring more possible execution paths would improve the accuracy of this method.
Naval et al. [17] suggested a dynamic malware detection system that collects system calls and constructs a graph that finds semantically relevant paths among them. Finding all semantically relevant paths in a graph is an NP-complete problem. Thus, to reduce the time complexity, the authors measured the most relevant paths, which specify malware behaviors that cannot be found in benign samples. The authors claim that the proposed approach outperforms its counterparts because, unlike similar approaches, the proposed approach can detect a high percentage of malware using system call injection attacks. The paper has some limitations: it incurs performance overhead during path computation, it is vulnerable to call-injection attacks, and it cannot identify all semantically relevant paths efficiently. Eliminating these limitations may improve the performance.
Fukushima et al. proposed a behavior-based detection method that can detect unknown and encrypted malware on Windows OS [22]. The proposed framework not only checks for specific behaviors that malware performs but also checks normal behaviors that malware usually does not perform. The proposed scheme's malware DR was approximately 60% to 67% without any FP. The DR is very low; to increase it, more malicious behaviors can be identified, and to prove the effectiveness of the method, the test set should be extended.
Lanzi et al. [23] proposed a system-centric behavior model. According to the authors, the interaction of malware programs with system resources (directory, file, and registry) is different from that of benign programs. The behavioral sequences of the program to be marked are compared with the behavior sequences of the two groups (i.e., malware and benign). The paper claimed that the suggested system detects a significant fraction of malware with a few FPs. The proposed method cannot detect all malicious activities, such as malware that does not attempt to hide its presence or to gain control of the OS and that uses only the computer network for transmission. Including network-related policies and rules for malware that does not modify legitimate applications or the OS execution can improve the performance.
Chandramohan et al. proposed BOFM (bounded feature space behavior modeling), which limits the number of features to detect malware [24]. First, system calls were transformed into high-level behaviors. Then, features were created using the behaviors. Finally, the feature vector is created and machine learning algorithms are applied to the feature vector to determine whether the program is malware or benign. This method ignored the frequency of system calls. Executing the same system call repeatedly can cause DoS attacks. Considering the frequency of system calls can improve DR and accuracy.
A hardware-enhanced architecture that uses a processor and an FPGA (field-programmable gate array) is proposed in [18]. The authors suggested using an FCM (frequency-centralized model) to extract the system calls and construct the features from the behaviors. Features obtained from the benign and malware samples are used to train the machine learning classifier to detect the malware. The paper claimed that the suggested system achieved a high classification accuracy, fast DR, low power consumption, and flexibility for easy functionality upgrades to adapt to new malware samples. However, malware can perform various behaviors, and there is no uniform policy to specify the number of behaviors and features to be extracted before triggering the early prediction. Furthermore, the performance of the proposed method has only been compared with BOFM and n-gram, which is not enough to determine the efficiency of the proposed model.
Ye et al. proposed associative classification postprocessing techniques for malware detection [25]. The proposed system greatly reduces the number of generated rules by using rule pruning, rule ranking, and rule selection. Thus, the technique does not need to deal with a large database of rules, which shortens the detection time and improves the accuracy rate. According to the paper, the proposed system outperformed popular antivirus software tools such as McAfee VirusScan and Norton AntiVirus and data mining-based detection systems such as naive Bayes, support vector machine (SVM), and decision tree. Collecting more API calls, which can provide more information about malware, and identifying complex relationships among the API calls may improve the performance.
A supervised machine learning model is proposed in [26]. The model applied a kernel-based SVM that used a weighting measure, which calculates the frequency of each library call, to detect Mac OS X malware. The DR was 91% with an FP rate of 3.9%. Test results indicated that increasing the sample size increases the detection accuracy and decreases the FPR. Combining static and dynamic features and using other techniques such as fuzzy classification and deep learning can increase the performance.
The method of grouping system calls using MapReduce and detecting malware according to this grouping is described by Liu et al. [27]. According to the authors, most of the studies performed so far were process-oriented, which determines a process as malware only by its invoked system calls. However, most current malware is module-based, which consists of several processes, and it is transmitted to the system via a driver or DLL [28]. In such cases, malware performs actions on the victim's machine by using more than one process instead of its own process. When only one process is analyzed, malware can be marked as benign. However, the proposed method has some limitations: (1) some malware does not require persistent behavior ASEP; (2) persistent malware behaviors can be completed without using system calls; and (3) the cost of data transmission has not been measured. Besides, the results of the proposed method were not compared with other studies in the literature. Eliminating the abovementioned limitations can improve the method performance.
A detection system that combines static and dynamic features was proposed in [16]. This system has three properties: the frequencies (in bytes) of the method, the string information, and the system calls and their parameters. By combining these properties, the feature vector was constructed and classified using classification algorithms. The paper claimed that the detection of the proposed system is reasonable and increases the probability of detecting unknown malware compared to their first study. However, the probability of detecting unknown malware is still low and the FPR is high. Using more distinctive features and training the model with more malware may improve the method performance for unknown malware.
Recent works on malware behaviors are represented in [19,29-31]. Lightweight behavioral malware detection for Windows platforms is explained in [29]. It extracts features from prefetch files and discriminates malware from benign applications using these features. To show the effectiveness of the malware detector on the prefetch datasets, they used LR (logistic regression) and SVM (support vector machine) classifiers. According to the authors, test results are promising, especially the TPR and FPR, for practical malware detection. Choi et al. proposed metamorphic malicious code behavior detection using probabilistic inference methods [30]. It used the FP-growth and Markov logic networks algorithms to detect metamorphic malware. The FP-growth algorithm was used to find API patterns of malicious behaviors from among the various APIs. The Markov logic networks algorithm was used to verify the proposed methodology based on inference rules. According to the test results, the proposed approach outperformed the Bayesian network, with 8% higher category classification. Karbab and Debbabi proposed MalDy (mal die), a portable (plug and play) malware detection and family threat attribution framework using supervised ML techniques [31]. It treats behavioral reports as a sequence of words and applies advanced natural language processing (NLP) and ML techniques to extract relevant security features. According to the test results, MalDy achieved 94% success on Win32 malware reports. A depth detection method on behavior chains (MALDC) is proposed in [19]. The MALDC monitors behavior points based on API calls and uses the calling sequence of those behavior points at runtime to construct behavior chains. Then, it uses the depth detection method based on long short-term memory (LSTM) to detect malicious behaviors from the behavior chains. To verify the performance of the proposed model, 54,324 malware and 53,361 benign samples were collected from Windows systems and tested. MALDC achieved 98.64% accuracy with 2% FPR in the best case.

Table 3: Summary of related works on malware detection methods.
Paper | Feature representation | Goal/success | Year
Wagener et al. [14] | System calls, Hellinger distance, phylogenetic tree | Identify new and different forms of malware | 2008
Park et al. [15] | Creating system call diagrams | Identify different forms of malware | 2013
Islam et al. [16] | Printable strings, API method frequencies | Identify malware with 97% accuracy | 2013
Naval et al. [17] | Diagram of system calls and relations | Detect code insertion attacks | 2015
Das et al. [18] | System call frequencies, n-gram | Identify new and different forms of malware | 2016
Zhang et al. [19] | API calls sequence to construct a behavior chain | It achieved 98.64% accuracy with 2% FPR | 2019

The malware detection schema landscape is changing from computers to mobile devices, and cloud-, deep learning-, and mobile-based detection techniques are becoming popular. However, these detection schemas have some problems, too. For instance, the deep learning-based detection approach is effective at detecting new malware and reduces the feature space sharply [32], but it is not resistant to some evasion attacks. On the other hand, the cloud-based detection approach increases DR, decreases FPs, and provides bigger malware databases and powerful computational resources [33]. However, the overhead between client and server and the lack of real monitoring remain challenges in the cloud environment. Mobile- and IoT-based detection approaches can use both static and dynamic features and improve detection rates on traditional and new generation malware [34]. However, they have difficulty detecting complex malware and are not scalable for large bundles of apps.
In the literature review, the malware detection methods have been summarized. Current studies can be divided into 2 major groups:
(1) Studies that apply certain rules directly to behaviors or features to group similar behaviors and extract the signature (no ML is required at this stage)
(2) Studies that determine behaviors, extract features from behaviors, and apply classification by using ML and data mining algorithms
In current studies, some techniques and methods have been used widely for many years. These techniques and methods can be listed as follows:
(i) Data mining and ML have been used widely for a decade, and cloud and deep learning have been used recently in malware detection
(ii) The n-gram, n-tuple, bag, and graph models have been used to determine the features from behaviors
(iii) Probability and statistical methods such as Hellinger distance, cosine coefficient, chi-square, and distance algorithms are used to specify similarities among features
The current studies explained above have some limitations, which can be summarized as follows:
(i) Many detection methods produce high FPs and require complex and resource-intensive hardware
(ii) Detection rates and accuracies are low
(iii) They cannot effectively handle new and complex malware
(iv) They focus on a specific malware type, family, or OS
(v) They are prone to evasion techniques
(vi) They have difficulty handling all malicious behaviors
(vii) Feature extraction methods are not effective, so the size of the feature set increases over time
As a result, the difficulties in defining behaviors and identifying the similarities and differences among the extracted properties have prevented the creation of an effective detection system. The use of new methods and approaches along with the use of ML and data mining algorithms in malware detection has begun to play a major role in making the extracted features meaningful.
In contrast, the SCBM has a high detection rate and accuracy with low FP. It can handle new and complex malware to a certain degree, and it is resistant to evasion techniques. Besides, the feature extraction method is effective and only specifies the features that can discriminate malware from benign samples. During the feature extraction process, the SCBM assigns a number to each feature, which shows the importance of the feature in the dataset. Thus, the model does not need feature selection techniques before ML, and this makes SCBM faster and less resource-intensive.

Subtractive Center Behavior Model
This section describes the system architecture and explains the proposed model in detail.

Architecture of the Proposed Model.
The system architecture of the proposed malware detection model is summarized in Figure 3.
According to the proposed model, the program samples are first collected and analyzed by relevant dynamic tools. Then, the behavior is determined according to the results of the analysis. After that, behaviors are grouped according to the determined rules, and features are extracted. Finally, the most important features are selected, and the system is trained. Based on the training data, each sample is marked as malware or benign.
During the detection process, the SCBM specifies malicious behavioral patterns that can be seen in malware but are not seen or are rarely seen in benign samples. A scoring system is used to determine the behavioral patterns. For instance, even if the system calls of a malware sample (M) and a benign sample (B) are the same, M = B = {a, b, c, d, e} (in real examples, this is not the case), the behavior patterns will be different. M pattern (candidate) = {ab, ac, ce}, where ab score = 4, ac score = 1, and ce score = 3, while B pattern (candidate) = {ab, ac, ce}, where ab score = 1, ac score = 1, and ce score = 0. After the candidates that do not exceed the specified score are removed, M pattern = {ab, ce}, while B pattern = { }, and we can easily differentiate malware from benign.
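The filtering step in this example can be sketched in a few lines of Python. The threshold of 2 and the names `select_patterns`, `m_candidates`, and `b_candidates` are assumptions made for illustration; the text does not state the threshold, only that ab (score 4) and ce (score 3) survive for the malware sample while candidates scoring 1 or 0 do not.

```python
def select_patterns(candidates, threshold=2):
    """Keep candidate patterns whose score meets the (assumed) threshold."""
    return {name for name, score in candidates.items() if score >= threshold}

# Candidate patterns and scores taken from the example in the text.
m_candidates = {"ab": 4, "ac": 1, "ce": 3}   # malware sample M
b_candidates = {"ab": 1, "ac": 1, "ce": 0}   # benign sample B

m_pattern = select_patterns(m_candidates)
b_pattern = select_patterns(b_candidates)

print(sorted(m_pattern))  # ['ab', 'ce']
print(sorted(b_pattern))  # []
```

Although both samples issue the same system calls, only the malware sample retains any patterns after filtering, which is what makes the two separable.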
To collect the execution trace of each sample, both Process Monitor and Process Explorer are used in this study, but other dynamic tools such as API monitor and different sandboxes can be used as well.
The proposed system is implemented in Python, and classification is performed in Weka. These tools and this language were used to demonstrate the efficiency of the proposed model; other tools can be used with the proposed model and may give better results. Thus, the implementation does not restrict the SCBM.

Proposed Model.
In this study, the SCBM creates a dataset. When the SCBM and the n-gram model are compared, the SCBM contains far fewer features and determines the related processes more clearly than n-gram. In the proposed model, system paths, where malware behaviors are performed, and the malware behaviors themselves are taken into consideration. Based on each malware behavior and related system path, a score is assigned. Features that do not exceed the specified score are removed from the dataset. For example, to run properly, each process accesses certain system files and performs similar actions and behaviors. Those behaviors and the resulting properties are not included in the dataset. Therefore, the datasets created using the proposed model contain far fewer features than the datasets created by n-gram and the models used in other studies. The proposed SCBM consists of the following phases:
(i) Phase 1: convert the actions into behaviors
(ii) Phase 2: divide the behaviors into groups and calculate the risk scores
(iii) Phase 3: group the behaviors according to the types of resources
(iv) Phase 4: group the behaviors based on the same resources but different instances
(v) Phase 5: extract the features from repeated behaviors
(vi) Phase 6: extract the features from different data sources
(vii) Phase 7: calculate the risk scores for each behavior based on active/passive behaviors
The details of these phases are given below.
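For comparison, n-gram features are commonly built by sliding a window of length n over the system-call sequence and counting each distinct window. The sketch below is a generic illustration of that baseline, not the implementation used in the experiments; the function name `ngram_features` and the sample trace are hypothetical.

```python
from collections import Counter

def ngram_features(calls, n):
    """Count every contiguous window of n calls in the trace."""
    windows = (tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))
    return Counter(windows)

# Hypothetical trace, used only to show how the features are formed.
trace = ["CreateFile", "ReadFile", "WriteFile", "CloseFile",
         "CreateFile", "ReadFile", "RegOpenKey", "RegCloseKey"]

features = ngram_features(trace, 2)
print(len(features))                         # 6 distinct 2-grams
print(features[("CreateFile", "ReadFile")])  # 2
```

Every distinct window becomes a feature, regardless of where the behavior takes place or how risky it is, so the feature set grows with n and with the variety of the trace. The SCBM instead discards behaviors whose score does not exceed the specified value.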

Security and Communication Networks
Phase 1: Convert the Actions into Behaviors.
In Algorithm 1, d1, d2, and n represent the input action sequence, output behavior sequence, and input size, respectively. The algorithm takes d1 as an input and generates d2. During this process, AE (active) and PE (passive) behaviors are identified, and sfPs (system file paths) such as self, system, and third party's software are determined. On this basis, ψ and µ, which represent action state and action type, are calculated by using AE, PE, and eST (action state type). Finally, system calls, which cannot define new behaviors, such as rcK: "RegCloseKey," cF: "CloseFile," tE: "Thread Exit," and pE: "Process Exit," are eliminated from the action list, and the rest of the actions are written to the d2 file.
An example system-call sequence and corresponding behaviors are given in Table 4.
The system calls produced by each sample are formulated as S = {a, b, c, d, ..., n}, where S represents the system-call sequence and a, b, c, ..., n represent the individual system calls. Only a subset s ⊂ S is taken into consideration when building behaviors. In this way, the behaviors that define the program are clarified, and the data to be analyzed are reduced significantly before feature extraction.
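The reduction from S to s described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the dictionary field names and the sample trace are hypothetical, and the exclusion list contains only the four system calls named in the text, not the complete list used by Algorithm 1.

```python
# System calls that, per the text, cannot define a new behavior.
# The real exclusion list in Algorithm 1 may be longer (assumption).
NON_DEFINING = {"RegCloseKey", "CloseFile", "Thread Exit", "Process Exit"}


def filter_actions(actions):
    """Return the sub-sequence s of S that is kept for building behaviors."""
    return [a for a in actions if a["call"] not in NON_DEFINING]


# Hypothetical trace; field names and values are illustrative only.
trace = [
    {"call": "CreateFile", "path": "sfile1.exe"},
    {"call": "ReadFile", "path": "sfile1.exe"},
    {"call": "CloseFile", "path": "sfile1.exe"},
    {"call": "RegCloseKey", "path": "key1"},
    {"call": "WriteFile", "path": "tfile2.exe"},
    {"call": "Process Exit", "path": ""},
]

kept = filter_actions(trace)
print([a["call"] for a in kept])
```

Filtering before feature extraction keeps the later phases working on a shorter sequence, which is the data reduction the paragraph above refers to.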

Phase 2: Divide the Behaviors into Groups and Calculate the Risk Scores.
The behaviors identified in the previous phase are divided into three groups: self-generated behaviors, behaviors on third-party software, and behaviors on system software. In this phase, the risk score is calculated for each behavior and its path (Table 5). The risk score ranges from 0 to 4, where 0 means that the related behavior is normal and can be seen in both malware and benign samples, and 4 means that the related behavior is risky, likely to be seen in malware and rarely seen in benign samples (Table 5). The score is assigned based on the path on which the program sample performs the behavior. SGB1 denotes the first type of self-generated behaviors, TPB1 denotes the first type of third-party behaviors, and SB1 denotes the first type of system behaviors. Higher scores are given to system behaviors because more differentiating malicious behaviors are performed on system files. In addition, a score is assigned for active and passive behaviors, as explained in Phase 7. A threshold value is used when excluding behaviors. For instance, suppose that feature $x_i$ in feature set $X$ consists of behaviors $y_1, y_2, \ldots, y_n$. The risk score for the feature path (rsP) is calculated for $x_i$ as follows:

$$x_i(\mathrm{rsP}) = \frac{y_1(\mathrm{rsP}) + y_2(\mathrm{rsP}) + \cdots + y_n(\mathrm{rsP})}{n}. \tag{1}$$

Let $a$ be a specified threshold value. If $x_i(\mathrm{rsP}) \geq a$, then $x_i$ is in the feature set; otherwise, $x_i$ is not in the feature set.
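Equation (1) and the threshold rule can be sketched in a few lines. The feature names, the per-behavior scores, and the threshold value below are hypothetical; the text does not state the threshold actually used.

```python
def path_risk_score(behavior_scores):
    """Mean rsP over the behaviors of one feature, as in Equation (1)."""
    if not behavior_scores:
        raise ValueError("a feature needs at least one behavior")
    return sum(behavior_scores) / len(behavior_scores)


def select_features(features, threshold):
    """Keep only the features whose mean path risk score is >= threshold."""
    selected = {}
    for name, scores in features.items():
        score = path_risk_score(scores)
        if score >= threshold:
            selected[name] = score
    return selected


# Hypothetical features, each mapped to the rsP values of its behaviors.
features = {"f1": [0, 0, 1], "f2": [3, 4, 2], "f3": [1, 3]}
selected = select_features(features, threshold=2.0)
print(selected)
```

With these illustrative values, f1 (mean about 0.33) is dropped, while f2 (mean 3.0) and f3 (mean 2.0) stay in the feature set.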
[...]
    μ ⟵ self
elif (eST == 'ss')
    μ ⟵ system
elif (eST == 'ts')
    μ ⟵ thirdParty
else
[...]
    write.d2()
end if
end if
end if
end for
ALGORITHM 1: Malware behavior creation algorithm.

(1) Phase 2.1: Self-generated behaviors (SGB). When an executed malware/benign sample performs behaviors on its own directory (SGB1), these behaviors are regarded as the least dangerous and are assigned a risk score of 0. In this case, because the program needs to retrieve some data from its own file to run properly, it generates normal behaviors that cannot be categorized as dangerous. However, when an executed malware/benign sample presents registry or network-related behaviors within some files (SGB2), this behavior group is considered slightly more dangerous and is assigned a risk score of 1. The behaviors marked with a risk score of 1 are likely to be included in the dataset, depending on the specified threshold. For instance, a behavior in which a file creates another file and then copies its own content into it is more dangerous than a behavior that retrieves some data from its own file in order to run properly.
(2) Phase 2.2: Third-Party Behaviors (TPBs). Many programs require third-party software to run properly. For instance, in order to compile and run a program written in the Python language, the program will frequently perform behaviors on the file path where this language is installed (TPB1). Such behavior is considered harmless, and the behavior risk score is assigned as 0. However, behaviors on directories and files that are not related to the executed sample (TPB2) are considered dangerous, and the behavior risk score is assigned as 3.

(3) Phase 2.3: System behaviors (SBs).
Programs need to interact with the operating system to work properly. On the Windows operating system, this interaction is typically provided by system DLLs, background processes, Windows services, etc. Most of these interactions are considered normal, while some of them are classified as malicious. If a program contains only interactions that are necessary for it to work properly, these types of behaviors (SB1) are evaluated as not dangerous, and the lowest risk score of 0 is assigned. If the program uses "GDI32.dll" and "shell32.dll" [35], which can be used for both malicious and benign behaviors (SB2), the risk score is assigned as 1. If the program uses "User32.dll" and "kernel32.dll," which are used frequently by malware and sometimes by benign software (SB3), the risk score is assigned as 2. However, if the program frequently calls "Wininet.dll" and "Advapi32.dll," directly calls "Ntdll.dll" instead of "kernel32.dll," or uses high-level methods that are likely to be categorized as dangerous, such as "ReadProcessMemory" and "AdjustTokenPrivileges" [35] (SB4), then the behavior risk score is assigned as 3.
In addition, if the program attempts to interfere with system processes such as "svchost.exe" and "winlogon.exe" and to use these processes to access system databases that contain critical information, then these behaviors (SB5) are also considered malicious, and the behavior risk score is assigned as 4. Furthermore, if files with the same names as system files, such as "svchost.exe," "winlogon.exe," and "smss.exe," have been created in different system paths, or if the file automatically initializes itself each time the system is started (autostart locations such as "hklm\software\...\currentversion\run," "hklm\software\...\currentversion\runonce," and "c:\users\...\startmenu\...\startup"), then these behaviors (SB4) are also considered malicious, and the behavior risk score is assigned as 3.
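The group scores described in Phases 2.1 to 2.3 can be collected in a lookup table. The sketch below is a simplification and not the authors' implementation: it classifies a DLL by name alone, ignores the frequency and call-path conditions mentioned in the text, and treats every unlisted DLL as SB1, which is an assumption.

```python
# Risk score per behavior group, as stated in Phases 2.1 to 2.3.
RISK_SCORE = {
    "SGB1": 0, "SGB2": 1,
    "TPB1": 0, "TPB2": 3,
    "SB1": 0, "SB2": 1, "SB3": 2, "SB4": 3, "SB5": 4,
}

# DLL names mentioned in the text, mapped to system-behavior groups.
# Matching by name only is a simplification (assumption).
DLL_GROUP = {
    "gdi32.dll": "SB2", "shell32.dll": "SB2",
    "user32.dll": "SB3", "kernel32.dll": "SB3",
    "wininet.dll": "SB4", "advapi32.dll": "SB4", "ntdll.dll": "SB4",
}


def dll_risk(dll_name):
    """Return (group, score) for a DLL; unlisted DLLs default to SB1."""
    group = DLL_GROUP.get(dll_name.lower(), "SB1")
    return group, RISK_SCORE[group]


print(dll_risk("Ntdll.dll"))
print(dll_risk("GDI32.dll"))
```

Keeping the scores in one table makes it easy to adjust a single group without touching the classification logic.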

Phase 3: Group the Behaviors according to Types of Resources.
Operating system resources are divided into groups such as file, registry, network, section, and thread; and the same types of resources are generally considered when determining property relationships. For instance, in Table 4, the behaviors of ReadFile (7, "sfile1.exe," malware.exe, 8) and WriteFile (5, "tfile2.exe," sfile1.exe, 9) are directly associated with each other. However, SetValue (10, "key1," tfile2.exe, 11) and ReadFile (

Phase 5: Extract the Features from Repeated Behaviors.
The successive behaviors on the same resource and sample are set to a single property. Behaviors that occur in different locations and under different names are set to the same feature as well, but the importance of the feature increases.

Phase 6: Extract the Features from Different Data Resources.
Behaviors that are on different resources but are indirectly determined as having a relationship also create a property. For example, although their behaviors take place on different resources, WriteFile (5, "tfile2.exe," sfile1.exe ⟶ 9) and SetValue (10, "key1," tfile2.exe ⟶ 11) (Table 4) create a property between them.

Phase 7: Calculate the Risk Scores for Each Behavior Based on Active/Passive Behaviors.
Active behaviors are considered to be more dangerous than passive behaviors, and consequently, a higher level of danger is assigned. For example, while the danger level for ReadFile is set to 0, the danger level of WriteFile is set to 3. The feature creation algorithms are shown in Algorithms 2 and 3.
In Algorithms 2 and 3, the first algorithm uses the abbreviations $d_2$, $d_3$, (rD, tY, aS, and sRY), and pRS, which denote the input file, output file, related file paths, and the risk score of each file path. The second algorithm uses the abbreviations $d_2$, $d_4$, as, ($O$, $O_1$, and $O_2$), π, rdF, and weF, which denote the input file, output file, action state, action values, operation value, "ReadFile," and "WriteFile." In both algorithms, the risk score is first calculated for each behavior, and the features are then constructed from the related behaviors. For example, let B = {a, b, c, d} be a behavior sequence, where a and c are active behaviors while b and d are passive behaviors. In addition, behavior a is related to behaviors b and c, and behavior b is related to behavior d. In this case, the features (F) and their risk scores (rS) are calculated so that, in each risk score, the first value represents the active-passive risk score and the second value represents the path score. After the feature sequences have been generated, the frequency of each feature is calculated. The features that have a risk score above a certain threshold are considered during classification. In this case, the number of features decreases significantly, and classification algorithms produce better results without the use of feature selection algorithms.
Using the SCBM, the malware behaviors in Table 4, the malware features in Table 6, and the feature vector in Table 7 are generated. In Table 6, the Risk IDs column provides information about the features. By looking at the Risk IDs column, the importance and risk score of each feature can be understood. Each entry in the Risk IDs column has the form $I_{a,b}$ followed by A or P, where a represents the property type (self, third party, or system), b represents the level of the property, and A and P represent active and passive, respectively. For example, $I_{1,2}$A can be interpreted as a related process trying to make changes to its own files by using active behaviors, while $I_{3,1}$P can be interpreted as a related process trying to perform operations on system files by using passive behaviors. When the values for Table 7 are obtained by using Table 6, a value of 0 is assigned for missing properties, 1 is assigned for properties that occur once, and x is assigned for properties that occur x times. In addition, risk scores are assigned as a subfeature of the feature, considering behavioral groups and danger levels.
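The frequency encoding described for Table 7 (0 for a missing property, x for a property that occurs x times) can be sketched with a counter. The feature labels and the sample below are hypothetical and do not come from Table 6 or Table 7.

```python
from collections import Counter


def feature_vector(sample_features, vocabulary):
    """Count how often each vocabulary feature occurs in one sample.

    Missing features get 0, and a feature seen x times gets x.
    """
    counts = Counter(sample_features)
    return [counts[name] for name in vocabulary]


# Hypothetical feature vocabulary and one hypothetical sample.
vocabulary = ["F1", "F2", "F3"]
sample = ["F1", "F3", "F3", "F3"]

vector = feature_vector(sample, vocabulary)
print(vector)
```

Fixing the vocabulary order once for the whole dataset ensures that the same column always refers to the same feature across samples.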
When the SCBM and the n-gram model were compared, the test results showed that the number of created features decreases sharply with the SCBM, while the remaining features are more closely related to one another. The dataset constructed by n-gram contained approximately 37-fold more features than the proposed model's dataset, which suggests that machine learning algorithms are likely to perform better on a dataset generated by the proposed model.

Case Study
This section describes the case study and experiments. Test cases were performed on different versions of Windows, namely Windows 7 virtual machines, Windows 8 virtual machines, and Windows 10. For malware analysis, Process Explorer and Process Monitor were used. To show the effectiveness of the proposed model, 2 datasets, one with score and one without score, have been created by using the SCBM, and the results are compared with those of n-gram and other methods from the literature. A dataset with score is a modification of a dataset without score that keeps the features that can precisely represent each sample. In total, 6700 malware and 3000 benign samples have been analyzed.
This section consists of 5 parts: data collection, representation, differentiation of malicious patterns, ML and detection, and model performance and evaluation.

Data Collection.
Malware samples were collected from a variety of sources such as Malware Benchmark [36], ViruSign [37], Malshare [38], Malware [39], KernelMode [40], and Tekdefense [41]. The malware was labeled using VirusTotal [42], which uses approximately 70 antivirus scanners online, and 10 antivirus scanners locally such as Avast, AVG, ClamAV, Kaspersky, McAfee, and Symantec. For this purpose, 6700 malware samples were randomly selected from among 10,000 malware samples and analyzed. The dataset contains different malware types including viruses, Trojans, worms, backdoors, rootkits, ransomware, and packed malware (Figure 4) and contains different malware families such as agent, rooter, generic, ransomlock, cryptolocker, sality, snoopy, win32, and CTB-Locker. The analyzed malware was created between 2000 and 2019 and can be categorized as regular known malware, packed malware, complicated malware, and some zero-day malware. The dataset contains 3000 benign samples from several categories including system tools, games, office documents, sound, multimedia, and other third-party software. The malware signature reported by each scanner was used, and each malware sample was marked at the deepest level possible. For example, a Trojan downloader and a virus downloader were marked as downloader, and a key logger was marked as keylogger instead of spyware. Some of the malware could not be categorized; those malware files were marked as malware.
The majority of the malware tested were Trojan horses, viruses, adware, worms, downloaders, and backdoors. Other types of malware tested were rootkits, ransomware, droppers, injectors, spyware, and packed malware (Figure 4).

Differentiate Malicious Behavior Patterns from Benign.
During the detection process, the SCBM specifies malicious behavioral patterns, which are seen frequently in malware but rarely in benign samples. To do so, the algorithms in Section 4 have been used. To specify the malicious behavior patterns, the following procedures are applied:
(1) The behaviors and the system paths on which the sample program performed them are identified
(2) Scores are calculated for each behavior
(3) Behaviors that do not exceed the specified score are removed from the list
(4) Behavior groups are determined according to the order of the selected behaviors
(5) Classification is performed according to the frequency of the selected behaviors
By using these procedures, one can separate malicious behavior patterns from benign ones even if the system calls of malware and benign samples are the same (in real examples, this is not the case). Example real features from our dataset and their frequencies are shown in Table 8. Table 8 shows that malware and benign samples can be differentiated by grouping the features according to their frequencies and frequency levels. One way to do this is to group the frequencies into the bins {0}, {1 to 20}, {21 to 100}, {101 to 200}, {201 to 300}, and {300+} and to use a decision tree for classification.
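The frequency grouping suggested above can be sketched as a small binning function. The bin indices 0 to 5 are an arbitrary labeling chosen for this illustration, and the last bin is interpreted here as frequencies strictly greater than 300, which is an assumption because the stated bins {201 to 300} and {300+} both mention 300.

```python
# Upper bounds of the non-zero bins {1-20}, {21-100}, {101-200}, {201-300}.
UPPER_BOUNDS = (20, 100, 200, 300)


def frequency_bin(freq):
    """Map a feature frequency to a bin index from 0 to 5."""
    if freq < 0:
        raise ValueError("frequency cannot be negative")
    if freq == 0:
        return 0
    for index, upper in enumerate(UPPER_BOUNDS, start=1):
        if freq <= upper:
            return index
    return 5  # more than 300 occurrences


print([frequency_bin(f) for f in (0, 1, 20, 21, 150, 300, 301)])
```

The binned values could then be passed to a decision tree classifier in place of the raw counts.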

Machine Learning and Detection.
Machine learning (ML) algorithms have been used to discriminate malware from benign samples. Even though ML algorithms have been used in many different areas for a long time, they have not been used sufficiently in malware detection. Thus, in this study, the most appropriate algorithms were used, including Bayesian network (BN), naive Bayes (NB), a decision tree variant (C4.5-J48), logistic model trees (LMT), random forest (RF), k-nearest neighbor (KNN), multilayer perceptron (MLP), simple logistic regression (SLR), and sequential minimal optimization (SMO). It cannot be concluded that one algorithm is more efficient than the others because each algorithm has its own advantages and disadvantages. Each algorithm can perform better than the others under certain distributions of data, numbers of features, and dependencies between properties.
NB does not return good results because its calculations rest on the assumption that the features are not related to each other, and BN is not practically applicable to datasets with many features. On our dataset, the performance of these two algorithms was lower than that of the other ML algorithms. However, some satisfying results have been reported in the literature. SVM and SMO work well in both linearly separable and nonlinear boundary situations, depending on the kernel used, and perform well on high-dimensional data, but the desired performance could not be obtained on the generated datasets. Nevertheless, SVM and SMO perform better than NB and BN. The KNN algorithm requires a lot of storage space, and the MLP algorithm requires a long computation time during the learning phase. These 2 deficiencies reduce the efficiency of these 2 algorithms. However, KNN performance was much higher than NB and BN performance. Although the SLR algorithm is inadequate for solving nonlinear problems and has high bias, which decreases its efficiency, it returned good results on the datasets created with the proposed model. In contrast, decision trees produce scalable and highly accurate results, and they were the best-performing classifiers in the tests on our dataset, which makes them more prominent than the other classifiers. In the literature, except in some cases, they have returned satisfying results as well.

Model Performance and Evaluation.
To evaluate the performance of the ML algorithms, DR, FP rate (FPR), f-measure, and accuracy were used. These values are calculated using the confusion matrix (Table 9), which consists of TP (the number of malicious programs marked as malicious), TN (the number of benign programs marked as benign), FP (the number of benign programs mistakenly marked as malicious), and FN (the number of malicious programs mistakenly marked as benign). DR, FPR, f-measure, and accuracy are calculated from these four values.
To evaluate model and ML performance, holdout, cross-validation, and bootstrap are widely used. For small datasets, cross-validation is the preferable method because the model performs better on previously unknown data, while the holdout method is useful for large datasets because the system can be trained with enough instances.
In this study, both the holdout and cross-validation methods were used to evaluate performance. At the beginning, when the dataset was small, cross-validation returned better results. However, when the dataset had grown, the holdout method also generated favorable results.
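The four measures derived from the confusion matrix can be sketched as follows. The formulas are the standard textbook definitions, which the study is assumed to follow; the counts in the usage example are hypothetical and are not results from this paper, and the sketch assumes that no denominator is zero.

```python
def evaluate(tp, tn, fp, fn):
    """Compute DR, FPR, f-measure and accuracy from confusion-matrix counts.

    Standard definitions are assumed; all denominators must be non-zero.
    """
    dr = tp / (tp + fn)                     # detection rate (recall)
    fpr = fp / (fp + tn)                    # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * dr / (precision + dr)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"DR": dr, "FPR": fpr, "F": f_measure, "ACC": accuracy}


# Hypothetical counts for illustration only.
scores = evaluate(tp=90, tn=95, fp=5, fn=10)
print(scores)
```

With these illustrative counts, DR is 0.90, FPR is 0.05, and accuracy is 0.925.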

Results and Discussion
The summarized test results can be seen in Tables 10-14 and Figures 5 and 6. The test results show the DR, FPR, and accuracy for the n-gram and proposed models. Both the holdout and cross-validation methods perform well on the proposed model.
Thus, when evaluating model performance, a combination of 10-fold cross-validation and holdout with a percentage split (75% training and 25% testing) is used. Similar results were obtained when the parameters were changed. Table 10 shows the comparison of the classification algorithms on the SCBM and the n-gram model that were used to build the dataset.
In Table 10, 400 malware and 300 benign portable executables are tested. In almost all cases, the proposed model achieved better results than 4-gram; similar results were obtained using 2-gram, 3-gram, and 6-gram. For instance, the SLR algorithm achieved 94.6% DR, 6.3% FPR, and 94.5% accuracy on 4-gram, versus 98.5% DR, 4% FPR, and 97.2% accuracy on the SCBM. In the same way, the J48 algorithm achieved 91.4% DR, 9.1% FPR, and 91% accuracy when using 4-gram, versus 99.5% DR, 0.7% FPR, and 99.4% accuracy when using the SCBM. Other classification algorithms showed a similar pattern on the n-gram and SCBM datasets, which shows that the proposed model's results are much better than those of the n-gram models. The n-gram model builds properties from consecutive system calls whether or not they are related. This causes the number of malware features to grow significantly, which increases the training time and makes the detection process challenging.
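The way n-gram builds properties from consecutive system calls, regardless of whether they are related, can be sketched with a sliding window. The call sequence below is hypothetical, and this is a generic n-gram extraction rather than the exact procedure used to build the comparison dataset.

```python
def ngrams(calls, n):
    """Return every window of n consecutive system calls as a tuple."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


# Hypothetical system-call sequence for illustration.
calls = ["CreateFile", "ReadFile", "CloseFile", "RegOpenKey", "SetValue"]

two_grams = ngrams(calls, 2)
four_grams = ngrams(calls, 4)
print(len(two_grams), len(four_grams))
print(two_grams[2])
```

The third 2-gram pairs "CloseFile" with "RegOpenKey" only because the two calls are adjacent in the trace, which illustrates how unrelated calls end up in the same property.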
The test results with and without scores, obtained after 1000 program samples had been analyzed, can be seen in Tables 11 and 12. Both datasets, without score and with score, have been created by using the proposed model. However, the dataset with score contains far fewer features than the dataset without score. Thus, after 1000 programs had been analyzed, we continued to analyze programs only for the dataset with score.
Decision tree classifiers (J48, LMT, and RF) give better results than other classifiers such as SMO, KNN, BN, and NB (Tables 11-13). For example, for J48, DR, FPR, and accuracy were measured as 99.1%, 1.2%, and 99.2%, respectively (Table 12). The test results also indicate that SLR performs better than SMO, KNN, BN, and NB. KNN is slightly better than SMO in terms of FPR and accuracy, and SMO performs better than BN and NB. NB shows lower performance than the other classifiers. Thus, NB is not an appropriate classifier for our dataset. MLP was too slow to classify malware and benign samples on both the n-gram dataset and the dataset of the proposed method. Thus, it was not included in the test results.
The DRs and accuracies increase as the number of analyzed programs increases, while the FPRs decrease (Tables 12 and 13). This shows that the proposed model successfully differentiates malicious from benign patterns. However, n-gram became too slow as the number of analyzed programs increased. Hence, we stopped analyzing more programs to create the dataset with n-gram. The test results also indicate that the proposed model with score-specified malware properties is better than the proposed model without score (Figure 5). The average classification accuracy (cross-validation and holdout with a 75/25% split) can be seen in Figure 5, which shows the accuracy of the classifiers on the datasets with and without scores. With the exception of NB, all classifiers performed much better when the scoring system was used.
We have concluded that using the scoring schema for our dataset eliminated the features that are less important for discriminating malware from benign samples. This is because the SCBM with score also works as a feature selection algorithm and metric, which produces better performance. Feature selection algorithms use dependency, accuracy, distance, and information measures such as information gain and gain ratio to select the more important features from the dataset. The dataset with score outperformed the dataset without score, which uses feature selection algorithms and metrics. Thus, there is no need to use a feature selection algorithm for most of the classifiers before classification. Since decision tree classifiers use a feature selection algorithm by default (feature selection and tree pruning), the difference between the two datasets is small for these classifiers (Figure 5). For example, J48 accuracy is 99.2% with score and 99% without score, LMT accuracy is 98% with score and 97.4% without score, and RF accuracy is 96% with score and 94% without score. However, SMO accuracy is 92% with score and 90% without score, and KNN accuracy is 92.2% with score and 88.4% without score.
Thus, providing fewer but more meaningful features for classification produces better results. It can also be concluded that, for some classifiers, using a feature selection algorithm on the dataset without scoring may increase the detection and accuracy rates.
To evaluate the proposed model more accurately, different numbers of malware and benign samples were tested. Figure 6 shows the average accuracy rate and FPR as the number of analyzed programs increases. The classification accuracy increases and the FPR decreases as the number of analyzed programs increases for all of the ML algorithms used, including J48, LMT, RF, SLR, SMO, KNN, BN, and NB. For example, when 200 programs were analyzed, the accuracy rate was 89%. This accuracy increased as more programs were analyzed, to 94%, 95.3%, and 97% (Figure 6). Meanwhile, the FPR decreased sharply when more programs were analyzed. The FPR was 12% at the beginning, but over time it decreased to 9.7%, 5.9%, and 4.1%. Based on the test results, it can be concluded that the classifier results improve when more programs are analyzed.
To evaluate the efficiency of the proposed model, its DR, FPR, and accuracy are also compared with those of different models from the literature (Table 14). The proposed model produces considerably better results than other models [16,43,45] when the same classifier is used for evaluation. For instance, when J48 is used as the classifier, the DR, FPR, and accuracy are measured as 99.9%, 0.2%, and 99.8%, respectively, for the proposed model, versus 90.9%, 3.8%, and 93.6% for the model from [43] (Table 14). For other classifiers, the proposed model also performed better than the other models. The worst result was obtained for NB (75.6% DR, 15.3% FPR, and 75.62% accuracy for the proposed model), while a DR of 58.1% was obtained for the model in [43] and an FPR of 31% was obtained for the model in [44]. Even though our result was fairly low when using the NB classifier, it was still better than those of other works in the literature.
Furthermore, some important findings emerged during the analysis. These findings should be considered when creating an effective detection system. The key findings of the analysis are as follows:
(i) Most of the new generation malware uses existing processes or newly created processes for malicious purposes
(ii) New generation malware tries to hide itself by creating files that resemble system and third-party software files
(iii) Most malware creates malicious behaviors in temporary file paths
(iv) Malware usually tries to become permanent in the system by locating itself within Windows automatic startup locations
(v) Some malware displays its actual behaviors only when it runs with administrator-level authority
(vi) Most malware creates random files (using meaningless file names)
(vii) Most new generation malware injects itself into Windows system files ("svchost.exe," "winlogon.exe," and "conhost.exe") or copies itself into different file paths with the same or similar names
(viii) Some malware tries to find and disable existing security software (firewall and antivirus programs) as soon as it is executed

Limitations and Future Works
Even though the SCBM detects malware quickly and efficiently, some limitations need to be mentioned. The proposed model has been tested on a uniformly distributed dataset, and more zero-day malware needs to be tested. The malware test cases were performed on virtual machines, on which malware may exhibit only limited behaviors [46]. Thus, running malware on real machines may improve the performance. In addition, the suggested schema has been tested only on our dataset; if raw data from other datasets can be gathered, the schema will be tested on those datasets in the future as well.
The suggested schema will also be integrated with other technologies such as cloud, blockchain, and deep learning to build a more powerful detection system [46].

Conclusion
The SCBM is presented. In the SCBM, malware behaviors and the system paths where malware behaviors are performed are considered. Features that do not exceed the specified score are removed from the dataset.
In this way, malicious behavior patterns were differentiated from benign behavior patterns. Therefore, datasets created using the proposed model contained far fewer features than datasets created by n-gram. To evaluate the performance, the proposed model was combined with an appropriate ML algorithm. The test results showed that the proposed model outperformed n-gram and some models used in other studies. For the proposed model, DR, FPR, and accuracy were 99.9%, 0.2%, and 99.8%, respectively, which are better than those of n-gram and other methods. The test results also indicated that decision tree classifiers (J48, LMT, and RF) and SLR yield better results than classifiers such as SMO, KNN, BN, and NB. BN and NB show lower performance than the other classifiers, which indicates that they are not appropriate classifiers for this dataset. It can be concluded that the proposed method combined with appropriate ML algorithms outperformed the signature-based detection method, the n-gram model, and other behavior-based detection methods. The proposed model performed effectively for both known and unknown malware.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.