Metamorphic Malware and Obfuscation: A Survey of Techniques, Variants, and Generation Kits

Attribution


Introduction
Digital resources and infrastructure have become some of the most crucial concerns in the field of cyber security. As we encourage greater use of the Internet to delegate the tasks of everyday life, we expose ourselves and our information to potential exploitation by malicious actors. The biggest culprit is malware, a portmanteau of malicious software. Malware takes on many forms, but put simply, its ultimate goal is to carry out a series of actions for nefarious purposes. Whether the end goal is espionage, disrupting services, or exploiting systems for financial gain, the costs associated with inaction are increasing every year as new malware variants are deployed on unsuspecting enterprises and victims. Every year, several antivirus (AV) vendors publish annual white papers on the current state of malware worldwide. From a research standpoint, researchers are concerned with three aspects of malware behavior: the ability of malware to disguise its own structure to avoid detection; modification and/or utilization of host operating system (OS) resources; and the communication malware aims to establish externally [1] with so-called command and control (CnC) servers. These aspects of malware behavior can be summarized as follows:
Obfuscation: Malware employs various obfuscation techniques, such as packing and encryption, in order to avoid signature-based detection methods. Obfuscated malware is also cumbersome to disassemble, which makes it difficult to produce accurate control-flow graphs (CFG) when reverse engineering.
Resources: Malware utilizes various resources of the host operating system in order to carry out its predefined objectives. It calls several Windows application programming interfaces (APIs), makes changes to the registry, reads and writes to the file system, and creates and spawns new child processes and threads.
Network: Malware attempts to communicate with an outside CnC server in order to relay information. This communication may serve a larger botnet, relay confidential personal details obtained from surveillance of the target OS, or be used to detect the presence of a sandbox environment in antiemulation and stealth malware.
Malware is widespread worldwide and includes infections on both Macintosh and Windows OSs, affecting businesses, governments, and individuals alike. A total of 20% of individuals have experienced a malware attack in one form or another, a 14% increase from 2018 [2]. Estimates for 2019 put recorded infections at 24 million for Windows and 30 million for Macintosh [3], with Kaspersky noting over 24 million unique malicious objects detected in 2019 alone [4]. While recorded infections span several different types of OS, approximately 94% of malware developed is, in fact, Windows targeted [5,6]. Malware takes on many shapes and sizes and includes archetypes such as Trojans, adware, spyware, viruses, worms, ransomware, rootkits, exploits, cryptojackers, and keyloggers. These all carry out some form of invasion, damage, or disabling of systems for the direct or indirect benefit of the malicious actor. More recently, the availability of free and open source software distributions has posed significant risks, as so-called "script kiddies," users who have little to no experience in writing software themselves, have made use of these tools for nefarious purposes. Ready access to distributions such as Remnux and Kali Linux (Offensive Security, New York City, NY) has made it even easier for users to deploy various reconnaissance and penetration testing tools with out-of-the-box software. As natural language processing (NLP) tools become more sophisticated, chatbots such as ChatGPT can act as personal advisers in red-teaming and blue-teaming drills, and can subsequently also be used by black hats for their own vulnerability campaigns.
Businesses are some of the most susceptible targets of malware attacks, as they are potential victims of ransomware attacks for monetary gain and experience service downtime due to denial of service (DoS) attacks. For example, in late 2019, the average downtime for a ransomware attack was 16.2 days and the average ransom payment was 81,116 USD, almost double the 41,198 USD seen earlier in 2019 [7]. The average cost of a data breach to a business was estimated at 3.8 million USD [8], and the average cost of a DoS attack was placed at 2 million USD [9]. The prevalence of malware in the business environment is evident, with 95% of organizations recording a malicious infection [10] and 81% having been affected by such an infection [11]. While total malware detections have seen a small increase of 1% year over year, the business sector saw a 13% increase in 2019 [3,12]. The top 10 malware variants which target business infrastructure saw triple-digit increases in their number of infections between 2018 and 2019 [3]. Small businesses represent 43% of infected businesses reported, likely due to their inability to mitigate, flag, and respond to infections appropriately [7] and the fact that 37% of businesses spend less than 200,000 USD on information technology (IT) security and 78% do not have a formal incident response plan in place [13,14]. Security experts encourage IT security personnel to adopt the 1-10-60 rule: threats are to be detected within the first minute, investigated within 10 minutes, and an appropriate action taken within the first 60 minutes [15]. Businesses are prime targets for malware due to the financial motivation, with 71% of all breaches being financially motivated and 25% being motivated by espionage [7,8]. Furthermore, North America is one of the leading regions where corporate ransomware is a pressing concern, with 68% of businesses having experienced attacks in the last year [16].
AV vendors are particularly interested in the emergence of new forms of malware because these represent unique instances that have never been seen before and pose a significant threat to security infrastructure. A report by FireEye noted that over 100,000 unique malware signatures are reported each day by AV vendors [10]. Zero-day attacks are of particular concern as they require AV vendors to develop signatures for these new malware instances, which demands significant domain-level knowledge and constant revision of their signature databases. New and emerging threats are evident, with 60% of ransomware variants identified in the last 6 months of 2016 having been developed in the last year [17]. Moreover, a small but mutable subset of malware variants, totaling only 50 malware families, was noted to make up 80% of all successful malware infections [10]. This propensity for malware infections to originate from a small set of malware families is due to the polymorphism built into their development. Polymorphism allows malware to change its signature upon each iteration of its propagation, leading to previously unseen threats and new instances of zero-day attacks [18][19][20][21]. As the stakes increase for both cybercriminals and businesses, so have the tools they develop to penetrate and mitigate threat vectors, respectively. The call for cyber security expertise has never been higher, with 62% of organizations planning to invest more in cyber security in 2020 [22]. The prevalence of polymorphic malware and its variants has expanded how we approach threat mitigation in the field of cyber security. Legacy methods, which classify new malware based on previously known signatures, are no longer effective in identifying polymorphic malware [23], lending credence to the development of a more adaptable, behavioral, and cognitive-based approach to malware detection [24]. The vast majority (93.6%) of malware observed today is polymorphic [25], and the necessary steps must be taken to ensure our intrusion detection systems (IDS) and security information and event management (SIEM) systems are equipped to keep up with the ever-mutating nature of today's malware landscape.
Security and Communication Networks
This review covers several aspects of metamorphic malware, from the limitations of current signature-based methods to the various obfuscation techniques employed by malware. It discusses the constantly evolving threat characteristics of metamorphic malware, which provide the basis for building more sophisticated heuristic and analytical tools based on potential feature sets. In addition, a broad discussion of metamorphic engines and antiarmoring techniques covers the challenges researchers face in isolating malware variants in a controlled environment. We hope to improve the current understanding of metamorphic malware research by making the following core contributions in this work:
(i) Summarize the common obfuscation methods which, in turn, can be used to develop better heuristic techniques for feature engineering in machine learning pipelines.
(ii) Present the inner workings of a metamorphic engine and polymorphism more generally. Understanding how a malicious payload can persist in memory without ever being written to disk will allow researchers to find indicators of compression or encryption when a candidate binary is presented.
(iii) Outline the current metamorphic engines broadly available in the literature, which researchers can use to obfuscate their own binaries and incorporate robustness into their own work.
Section 1.1 covers basic signature techniques used by AV vendors, including the most common scanning techniques and considerations for scanners. Section 2 builds on the limitations of these techniques by introducing malware obfuscation, the routine most commonly used by metamorphic engines in their obfuscation stage. In Section 3, the idea of obfuscation is put into perspective with a deep dive into the metamorphic engine, which involves the ability of malware to unpack, obfuscate, compress, and encrypt its payload on the fly. Finally, Section 4 provides an overview of the most well-studied datasets used in malware research, with Section 4.2 covering popular metamorphic kits that researchers can use to create their own metamorphic binaries.
1.1. Signature Analysis and Creation. Signatures are used to help identify malicious code segments, which either exist as independent executables or are attached to benign files known as benignware. It is imperative that AV vendors constantly update their signature databases in order to cross-reference known malicious binaries with files suspected of being malicious. Although they act as unique fingerprints for malware, signatures are plagued by several fundamental issues. First, signatures are incapable of identifying emerging malware variants. In an environment where approximately 60% of new ransomware are never-before-seen variants according to the most recent estimates [3], this creates a significant shortfall in detection rates for new variants. In addition, given that the vast majority of malware is polymorphic [25], signatures are sometimes not general enough to catch obfuscated instances of previously flagged malware.
The art of file scanning is in and of itself a laborious process, requiring trade-offs between speed and specificity. Longer signatures provide a more specific identification of malware and malware families but are unable to catch the subtleties of minute changes [26]. Short signatures provide better coverage but result in more false positives [27,28]. AV vendors must therefore come up with a series of rules to both generalize their signatures and improve their scanning efficiency. Some of the basic scanning strategies are shown in Table 1 and described in the following:
String scanning is the de facto standard for string-match scanning. The scanner looks up the exact sequence of bytes at any offset.
Wildcards method allows for the use of wildcard variables. In the example shown in Table 1, "??" acts as a placeholder for 2 bytes of any string, while %3 prompts the scanner to look for the subsequent byte sequence in any of the following 3 byte positions. This is extremely effective for catching register swapping and instruction replacement obfuscations.
Mismatch method incorporates the idea of a partial match of any given byte sequence. In the example provided in Table 1, if the scanner allows for up to 1 mismatch, then as long as 2 of the 3 byte sequences are found, the scanner reports a match.
Generic method allows for the detection of malware families through the use of both wildcards and mismatch sequences. This method extracts the core malware artifacts of a malware family, thereby capturing any subtle alterations to the bytecode sequence that may arise in the future. For example, the Win95/Regswap virus uses similar opcodes between generations. Through a combination of wildcard string matching with mismatch, the entire Regswap generation can be flagged based on a few common signatures.
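The string, wildcard, and mismatch strategies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the syntax of Table 1 or of any AV product: the names match_at and scan are hypothetical, and a wildcard is modeled as None standing in for exactly one byte position.

```python
from typing import List, Optional, Sequence

# A pattern is a sequence of byte values; None is a hypothetical
# single-position wildcard (an assumption of this sketch).
Pattern = Sequence[Optional[int]]


def match_at(data: bytes, offset: int, pattern: Pattern,
             max_mismatch: int = 0) -> bool:
    """True if pattern matches data at offset with at most
    max_mismatch non-wildcard positions differing."""
    if offset + len(pattern) > len(data):
        return False
    mismatches = 0
    for i, expected in enumerate(pattern):
        if expected is None:  # wildcard: accept any byte here
            continue
        if data[offset + i] != expected:
            mismatches += 1
            if mismatches > max_mismatch:
                return False
    return True


def scan(data: bytes, pattern: Pattern, max_mismatch: int = 0) -> List[int]:
    """Return every offset in data at which the pattern matches."""
    last_start = len(data) - len(pattern)
    return [offset for offset in range(last_start + 1)
            if match_at(data, offset, pattern, max_mismatch)]


sample = bytes([0x90, 0xB8, 0x01, 0x00, 0xBB, 0x02])
exact_hits = scan(sample, [0xB8, 0x01])                       # string scanning
wildcard_hits = scan(sample, [0xB8, None, 0x00])              # wildcard
mismatch_hits = scan(sample, [0xB8, 0x05, 0x00], max_mismatch=1)  # mismatch
```

With max_mismatch left at 0 and no wildcards, scan reduces to plain string scanning; a generic signature in the sense above would combine both relaxations in one pattern.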
In addition to generating unique signatures as part of building a larger signature database for malware, scanning files requires a dedicated strategy and, in some cases, dedicated hardware. For example, while a signature may be located in any one of the portable executable (PE) sections, such as .idata, it may also be located in the PE file header. In addition, cross-referencing thousands of malicious signatures with an incoming data stream using regex patterns requires intrapacket or interpacket scanning to process them effectively [29]. AV vendors utilize cheaper operations, such as checking the file length, before committing to a more arduous task such as a checksum [28]. In practice, a signature can represent a series of bytes, a whole file, or certain sections. The ways in which AV vendors carry out simple scanning on a binary are described in the following sections.
1.1.1. Top-and-Tail Scanning. This mode of scanning is used to extract signatures from the top and bottom of files. This is especially useful for viruses that attach themselves to the front or back of the targeted host program. Since the address of the main entry point of a program is in its header section, manipulation of this address to point to the attached malicious binary is possible [30]. As an example, the Polimer.512.A virus prepends itself to the front of the executable and shifts the original program content after itself. By contrast, the Vienna virus is 1,881 bytes long and appends itself to the end of the host file.
1.1.2. Entry Point Scanning. This mode of scanning is used to extract signatures from the sequence at program entry points. Malware routinely alters program entry points so as to avoid detection by rerouting the execution flow to a decryptor stub which decrypts the original binary [31]. The Zmorph virus follows such behavior, whereby the decryptor rebuilds the instructions line by line by pushing the result into the stack memory. This can lead to "black hole" scenarios where useless operations are placed early on in the process flow to burden reverse engineering analysis.
In addition, an assembly encoder or an altered JUMP statement can be configured to run encoded information in a "code cave," so as not to increase the file size of the binary. Adding code would normally impact the binary file header values, and any changes will alter relative/absolute offsets, so the pointers need to be changed accordingly. As previously mentioned, the Polimer.512.A virus attaches itself to the infected program and is exactly 512 bytes long. This consistent file size differential raises flags and makes possibly infected files easy to identify.
Viruses such as Win32/Simile are able to avoid changing the entry point of an infected file by altering call instructions which reference ExitProcess() to point to the virus code. Other viruses such as W32/Bistro and W32/SMorph obfuscate their entry point [32]. SMorph is able to use existing API calls in the infected file to call its own import address table containing references to API imports.
1.1.3. Integrity Checking. This mode of scanning can be an extremely powerful tool for detecting manipulation of system files which should never change [27]. A checksum database can be used for reference when performing routine integrity checking of the system and files to detect any alterations [33,34]. Common checksums include MD4, MD5, and CRC32. Checksums are routinely computed over the byte values of suspected areas of a virus body, thereby reducing the total number of checksums required.
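A minimal sketch of such a checksum comparison is given below, using Python's standard hashlib and zlib modules. The function names fingerprint and is_modified and the layout of the baseline record are assumptions made for illustration. The comparison runs from the cheapest test to the most expensive, mirroring the practice of checking the file length before committing to a checksum.

```python
import hashlib
import zlib
from typing import Dict, Union

# Hypothetical baseline record: length, CRC32, and MD5 of a trusted copy.
Baseline = Dict[str, Union[int, str]]


def fingerprint(content: bytes) -> Baseline:
    """Record the values stored in the checksum database."""
    return {
        "size": len(content),
        "crc32": zlib.crc32(content),
        "md5": hashlib.md5(content).hexdigest(),
    }


def is_modified(content: bytes, baseline: Baseline) -> bool:
    """Compare content against its baseline, cheapest check first."""
    if len(content) != baseline["size"]:        # file length
        return True
    if zlib.crc32(content) != baseline["crc32"]:  # fast 32-bit checksum
        return True
    # The full digest is only reached when both cheap checks agree.
    return hashlib.md5(content).hexdigest() != baseline["md5"]


trusted = b"MZ original system file"
baseline = fingerprint(trusted)
unchanged = is_modified(trusted, baseline)
patched = is_modified(b"MZ 0riginal system file", baseline)
```

MD5 and CRC32 are used here only because the text names them; neither is collision resistant, so a deployment concerned with deliberate tampering would likely substitute a stronger digest such as SHA-256.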
Alternatively, certain types of infections, such as companion infections, may attempt to mimic the name of an infected file and redirect the header section of an EXE, which stores the address of the main entry point of a program, to the start of the virus code [35]. The virus may also change the extension to COM, as the Windows OS gives a higher priority to COM over EXE extensions. In order to account for this, products such as McAfee's network security platform can assign a magic number to file types and will flag files whose extensions have been tampered with [36].
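The idea of checking a file's magic number against its extension can be sketched as follows. This is not how any particular vendor implements the check; the MAGIC table is a deliberately tiny, hypothetical mapping, and extension_mismatch is an illustrative name.

```python
import os

# Hypothetical lookup table: leading magic bytes -> extensions we expect.
# A real scanner would carry a far larger table.
MAGIC = {
    b"MZ": {".exe", ".dll", ".sys"},   # DOS/PE executables
    b"%PDF": {".pdf"},
    b"\x89PNG": {".png"},
}


def extension_mismatch(filename: str, header: bytes) -> bool:
    """True if the leading bytes identify a known file type whose usual
    extensions do not include the one in filename."""
    ext = os.path.splitext(filename)[1].lower()
    for magic, allowed in MAGIC.items():
        if header.startswith(magic):
            return ext not in allowed
    return False  # unknown type: nothing to compare against


# An MZ executable renamed to .COM is flagged; the same bytes as .EXE are not.
renamed = extension_mismatch("NOTEPAD.COM", b"MZ\x90\x00")
genuine = extension_mismatch("NOTEPAD.EXE", b"MZ\x90\x00")
```

Returning False for unknown headers keeps the sketch conservative; whether unknown types should instead be escalated is a policy choice outside its scope.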

Obfuscation
This section provides an overview of the common obfuscation techniques employed by malware. Examples of these techniques are provided, along with some actual code snippets from popularized malware variants. Finally, a brief overview is given of encryption and compression, two very important techniques to be familiar with. This section focuses on obfuscations made specifically via changes to the opcodes and operands, which together form the CPU instructions that specify what data are processed and how. Opcode examples include both Intel and AT&T syntax, with the former being readily apparent as the source operand is always on the right side of the instruction and the destination on the left (e.g., mov eax, 1).

Dead-Code Insertion.
Dead-code insertion, sometimes referred to as garbage code insertion, is an obfuscation technique which inserts byte code sequences into a binary without affecting functionality [37][38][39][40]. This obfuscation relies on the fact that instructions can be added to code which do not perform any meaningful function or, in other scenarios, carry out an operation and then perform it in reverse [41,42]. An example of this type of obfuscation is shown in Table 2, where a series of nop instructions is used to pad the instructions. Typically, dead-code insertion is used to carry out one of three functions:
(1) Insertion of a pointless operation such as nop; mov eax, eax; add eax, 0; and eax, −1; or or eax, 0. In practice, these instructions do not change the content of CPU registers or memory, as they are all semantically equivalent to nop; however, they may modify the status of the flag register in the CPU. These instructions also have different opcodes.
(2) Insertion of operations with the purpose of burdening the reverse engineering process by altering values in registers and then reversing the instruction. An example would be incrementing a register with add eax, 1 and then reversing the instruction by decrementing it with sub eax, 1. Other examples would be push and pop, and inc and sub.
Garbage code insertion is used successfully in the implementation of W95/Bistro, a later implementation of W32/Zperm, which utilizes a random block insertion engine placed directly after the virus entry point. Upon entering this block of code, millions of instructions are run, thereby overburdening the emulator before the virus instructions are even executed. Other popular examples of viruses utilizing garbage code insertion are W32/Evol and W32/Zmist. Zmist is notable for its use of the executable trash generator (ETG). W32/Evol in particular is able to use garbage code insertion to produce very different variants with different opcodes and string signatures, thereby evading signature scanning techniques, as no sequence of bytes is similar between two generations. An example of 3 variations of the same code is shown in Table 3.
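From the analyst's side, the two kinds of dead code listed above can be stripped with a simple normalization pass before signatures are compared. The sketch below works on Intel-syntax instruction strings rather than raw bytes. The pattern lists and the name strip_dead_code are assumptions for illustration, and the pass deliberately ignores side effects on the flag register, which a sound implementation would have to track.

```python
import re
from typing import List

# Single instructions that leave registers and memory unchanged
# (they may still alter flags, which this sketch ignores).
NOP_PATTERNS = [
    re.compile(r"^nop$"),
    re.compile(r"^mov (\w+), \1$"),
    re.compile(r"^add \w+, 0$"),
    re.compile(r"^or \w+, 0$"),
    re.compile(r"^and \w+, -1$"),
]

# Adjacent pairs in which the second instruction undoes the first.
INVERSE_PAIRS = [
    (re.compile(r"^add (\w+), (\w+)$"), "sub {0}, {1}"),
    (re.compile(r"^push (\w+)$"), "pop {0}"),
    (re.compile(r"^inc (\w+)$"), "sub {0}, 1"),
]


def strip_dead_code(instructions: List[str]) -> List[str]:
    """Drop nop-equivalents and cancel do/undo instruction pairs."""
    kept: List[str] = []
    for raw in instructions:
        ins = re.sub(r"\s+", " ", raw.strip().lower())
        if any(p.match(ins) for p in NOP_PATTERNS):
            continue
        cancelled = False
        if kept:
            for first, undo in INVERSE_PAIRS:
                m = first.match(kept[-1])
                if m and ins == undo.format(*m.groups()):
                    kept.pop()        # remove the instruction being undone
                    cancelled = True
                    break
        if not cancelled:
            kept.append(ins)
    return kept


padded = [
    "mov eax, 5", "nop", "add ebx, 1", "sub ebx, 1",
    "push ecx", "mov edx, edx", "pop ecx", "add eax, ebx",
]
core = strip_dead_code(padded)
```

Because kept instructions form a stack, a pair is cancelled even when nop-equivalents were interleaved between its two halves, as with the push ecx and pop ecx pair in the example.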
The use of garbage code insertion techniques is useful in avoiding AV scanning for two reasons. First, the garbage code inserted is unique to each virus generation, thereby sidestepping previously seen AV signatures [44]. Second, garbage code from benignware can be inserted into malware to increase the false negative rate. In [45], the authors created binaries with approximately 30% dead code along with 10% benign code and showed that they achieved classification scores similar to benignware. In [46], proportions of garbage code between 5 and 35% were tested for their effectiveness at evading detection, with 10% noted as being sufficient. In an earlier work [47], the authors combined various proportions of garbage code insertion with subroutine reordering for a total of 25 different combinations. Two different obfuscation engines, AVFUCKER and DSPLIT, also known as crypters, were used in [48] to produce obfuscated code with dead-code insertion. Since garbage code insertion can take a wide variety of forms, from single nops to intermeshed garbage code blocks, string scanning is fairly ineffective against this form of obfuscation.

Register Reassignment.
Register reassignment, sometimes referred to as register renaming, is an obfuscation technique which swaps unused registers or memory variables with those currently used by the program [44]. In its simplest form, as demonstrated in Figure 1, register reassignment can replace the eax register with ebx, with no change in functionality.
The downside of register reassignment is that string scanning techniques, such as wildcard or half-byte techniques, can be used to detect any possible combination of registers used. In effect, a constant string remains between generations of register reassignment, rendering them easily flagged by scanners. The virus W95/Regswap (hence the name) made effective use of register reassignment, as demonstrated in Table 4.
In Table 5, the string signatures of version 1 and version 2 have a 60% similarity in their hexadecimal representation [49]. With the help of regular expressions, the accuracy is greatly increased for variations of a similar instruction set [50]. Along with garbage code insertion, these primary obfuscation techniques make it considerably harder to flag new variants of malware.
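The reason regular expressions help here can be shown with a short sketch that rewrites every register name to a placeholder in order of first appearance, so that two generations differing only in register choice collapse to the same canonical form. The instruction sequences below are invented for illustration and are not the Regswap code of Table 4 or Table 5; canonicalize is a hypothetical name, and treating esp and ebp as freely renameable is a simplification.

```python
import re
from typing import Dict, List

# 32-bit general-purpose registers considered by this sketch.
REGISTERS = ("eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp")
REG_RE = re.compile(r"\b(" + "|".join(REGISTERS) + r")\b")


def canonicalize(instructions: List[str]) -> List[str]:
    """Replace each distinct register with r0, r1, ... in order of
    first appearance, erasing the effect of register reassignment."""
    mapping: Dict[str, str] = {}

    def rename(match: "re.Match[str]") -> str:
        reg = match.group(1)
        if reg not in mapping:
            mapping[reg] = "r{}".format(len(mapping))
        return mapping[reg]

    return [REG_RE.sub(rename, ins.lower()) for ins in instructions]


generation_1 = ["mov eax, 5", "add eax, ebx", "mov [ecx], eax"]
generation_2 = ["mov edx, 5", "add edx, esi", "mov [edi], edx"]

canon_1 = canonicalize(generation_1)
canon_2 = canonicalize(generation_2)
same_family = canon_1 == canon_2
```

At the byte level the same effect is obtained with wildcard or half-byte signatures, since register operands typically occupy only a few bits of each encoded instruction.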

Instruction Substitution.
The instruction-substitution technique introduces an additional layer of obfuscation on top of the techniques already discussed. The power of instruction substitution comes from the seemingly endless diversity of substitutions that can be introduced into an existing instruction framework. Table 6 demonstrates an example of a 2-4 instruction substitution (2 instructions are replaced with 4 to perform the same function) [51]. Another instruction substitution would be replacing push eax; mov eax, ebx with push eax; push ebx; pop eax. Semantically, these are equivalent, but the push/pop sequence is in fact slower, as it is quicker to write to a register directly with mov. This exact substitution is utilized by the W95/Zmist virus, along with interchanging xor/sub and or/test instructions.

Instruction substitution is utilized very effectively in several high-profile viruses such as Evol, MetaPHOR, Zperm, and Avron. Since instruction substitutions produce different opcode representations, they render opcode frequency and accompanying n-gram techniques effectively useless. Researchers have attempted to draw from the basic set of fundamental operations in order to track the malware's original intentions. In [52], a clue set was established for the Evol virus on which all rewritten instructions were based. This approach was found to be very effective at characterizing the metamorphic engine Evol uses. A similar approach was taken by [53], where the complex instructions the virus would create were transformed back into their simple representations using their similar semantics. In Table 7, two versions of the W95/Bistro virus are shown, using different instruction substitutions in each generation. Similar to register reassignment, the generations contain similar string signatures, making them susceptible to wildcard and half-byte scanning techniques. While this manuscript is focused on obfuscators based on the Intel x86 instruction set, compile-time instruction set obfuscators can also create semantically similar rule sets for basic operations in other instruction sets [54,55].
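The idea of transforming substituted instructions back into a simple representation can be sketched as a small peephole rewriter. The rules below cover only the substitutions named in this section (push/pop in place of mov, and the xor/sub and or/test interchanges on a single register); they are not the clue set of [52] or the rule set of [53], and the name simplify is an assumption made for illustration.

```python
import re
from typing import List

# One-instruction rewrites to a chosen canonical form. Both directions of
# each interchange have the same effect when source and destination are
# the same register.
SINGLE_RULES = [
    (re.compile(r"^sub (\w+), \1$"), r"xor \1, \1"),   # zero a register
    (re.compile(r"^or (\w+), \1$"), r"test \1, \1"),   # set flags from a register
]
PUSH_RE = re.compile(r"^push (\w+)$")
POP_RE = re.compile(r"^pop (\w+)$")


def simplify(instructions: List[str]) -> List[str]:
    """Rewrite substituted instructions into simple equivalents."""
    out: List[str] = []
    for ins in instructions:
        for pattern, replacement in SINGLE_RULES:
            ins = pattern.sub(replacement, ins)
        pushed = PUSH_RE.match(out[-1]) if out else None
        popped = POP_RE.match(ins)
        if pushed and popped:
            # push X; pop Y  ->  mov Y, X
            out[-1] = "mov {}, {}".format(popped.group(1), pushed.group(1))
        else:
            out.append(ins)
    return out


substituted = ["push eax", "push ebx", "pop eax", "sub ecx, ecx", "or edx, edx"]
simple = ["push eax", "mov eax, ebx", "xor ecx, ecx", "test edx, edx"]
recovered = simplify(substituted)
```

Once both generations are reduced to the same simple form, opcode-frequency and n-gram features computed on the output become comparable again.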

Code Transposition.
Code transposition, sometimes called instruction permutation, is an obfuscation technique which utilizes conditional or unconditional jmp statements to reorder single instructions or blocks of instructions [18]. Since jmp instructions can theoretically be used for every line of instruction, the total number of permutations grows factorially, as m!, with the number of lines rearranged m [44]. Code transposition carries out a very similar function to subroutine reordering, with the exception that there is a change in the process flow; therefore, they are discussed together. Subroutine reordering, also known as block reordering, is an obfuscation technique that reorders the process flow by rearranging blocks of code that form independent subroutines [56]. If a program is divided into n subroutines, then n! permutations of subroutines are available for rearrangement [40,50,57,58]. A simple program with 10 subroutines would therefore be able to produce over 3.6 million possible variants. Subroutine reordering requires that the instruction sets are independent of one another, allowing them to be reordered without having an impact on functionality.
In Table 8, an example of a set of instructions exhibiting multiple forms of obfuscation is shown. In the example, code transposition, subroutine reordering, garbage code insertion, and instruction substitution are all used. Several jmp statements are employed to permute blocks of instructions which can be run independently from each other. Instruction substitution is used to add more sophisticated instructions based on the simple instruction set add eax, 5; mov ecx, eax. jnk insertions are used to add complexity to the existing code and are also added following the jmp F1 statement, where they are never actually executed. This jnk could include code from benignware that would normally fail to compile if it were embedded within the existing obfuscated framework but may confuse scanning techniques nonetheless.
Table 8 also displays another form of obfuscation called subroutine outlining [32]. This obfuscation explicitly turns instruction blocks into subroutines and uses the call instruction to perform an unconditional jump to the location indicated by the label operand. Subroutine inlining carries out the reverse: subroutines are unraveled and placed in order to preserve the process flow. Unlike simple jmp instructions, call preserves the location to return to when the subroutine is completed. This sophisticated form of obfuscation is used by the W95/Zperm and W32/Ghost viruses, with the former employing the real permutation engine to perform subroutine reordering. Zperm divides the code into frames, which are independent subroutines that are then repositioned randomly and connected using branch instructions to preserve process flow. When Zperm initializes, it allocates a 64 Kb buffer filled with zeros and then fills it with obfuscated code and randomly positioned jmp statements [43]. This means that a constant body is never generated between generations and is never present in memory. Similar to Table 8, garbage code is inserted between frames to fool string detection, as in the Zmist virus. W95/Zmist also inserts jmp instructions after every instruction, making it the perfect shield against heuristic detection. In [39], 30% subroutine reordering was used to sidestep a similarity metric developed to compare benignware to malware based on the similarity of their transpositions.
From a security analysis standpoint, it is extremely difficult to know where the virus begins when it is embedded within existing code and is encrypted. Partial emulation is one avenue whereby code can be reconstructed and then used to completely decrypt the virus. But when and how to decrypt during emulation is still a laborious process in and of itself.
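The factorial growth quoted above, and the way an analyst can undo a transposition by following jumps, can both be checked with a short sketch. The block representation (a label, a list of instructions, and the label jumped to next) and the names count_orderings and linearize are assumptions for illustration; conditional branches and call/ret pairs are deliberately left out.

```python
import math
from typing import Dict, List, Optional, Tuple

# (label, instructions, label jumped to next; None ends the program)
Block = Tuple[str, List[str], Optional[str]]


def count_orderings(n_blocks: int) -> int:
    """Number of ways n independent blocks can be laid out."""
    return math.factorial(n_blocks)


def linearize(layout: List[Block], entry: str) -> List[str]:
    """Recover execution order by following each block's trailing jump,
    regardless of where the blocks physically sit in the file."""
    blocks: Dict[str, Tuple[List[str], Optional[str]]] = {
        name: (body, target) for name, body, target in layout
    }
    trace: List[str] = []
    visited = set()
    current: Optional[str] = entry
    while current is not None and current not in visited:
        visited.add(current)          # guard against jump cycles
        body, target = blocks[current]
        trace.extend(body)
        current = target
    return trace


original = [
    ("F1", ["mov eax, 5"], "F2"),
    ("F2", ["add eax, ebx"], "F3"),
    ("F3", ["ret"], None),
]
# Same blocks, different physical order, as a permutation engine might emit.
transposed = [original[2], original[0], original[1]]

variants_for_10 = count_orderings(10)
same_behavior = linearize(original, "F1") == linearize(transposed, "F1")
```

The two layouts differ byte for byte yet linearize to the same trace, which is why control-flow-based normalization is more robust to this obfuscation than string scanning.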

Encryption, Compression, and Metamorphism
Metamorphism, and obfuscation techniques more generally, make up the backbone of most new and emerging malicious threats we see today. As signature-based scanning techniques improved for AV vendors, so did the levels of obfuscation employed by malicious actors to thwart said techniques [49,59]. Along with obfuscation came various forms of armoring, stealth behavior, and antiemulation tactics, which made the job of a security researcher that much more burdensome.
To understand how mutation came to be, it is worth mentioning the earliest forms of obfuscation and how they came into existence. Viruses make use of entry point obscuration (EPO) in order to avoid any consistency in the execution order of the virus code in relation to the infected file. As shown in Figure 2, the file header points to an address that executes virus code, which then points back to the host file so that the virus executes without the user's knowledge.
The CASCADE virus in 1986 became one of the first known viruses to implement encryption, thereby requiring a separate decryption routine to carry out decryption and push the instructions into memory for execution. Since the form of encryption would become apparent as the virus propagated, the decryptor routine itself would have to be mutated, leading to the establishment of the first series of oligomorphic viruses. With techniques such as wildcard and mismatch, a greater swath of possible infections could be characterized by a few unique signatures. Furthermore, since virus code would either append or prepend itself to an existing file, top-and-tail scanning was an effective tool for extracting signatures from select sections of a file. Emulators could also be utilized to uncover the decryption routine used in the encryption, meaning that the decryption routine itself had to be altered in some form or another. Emulators wait as the virus is decrypted one instruction at a time and rebuilds itself by pushing the decrypted instructions onto the stack. Once control is passed to the stack memory, the emulator monitors the stack, and the code can be dumped. Oligomorphic malware was the start of a new breed of malware which would involve obfuscation of the decryption routine itself, meaning viruses were unique between generations. The first oligomorphic virus was the Whale DOS virus, first identified in 1990. In Figure 3(a), an obfuscated, encrypted decryption routine is used to carry out decryption of the virus body and to avoid detection.
However, a major limitation of oligomorphism is that the set of possible decryptors is finite. For example, the W95/Memorial virus had exactly 96 different decryptors to choose from. Once an oligomorphic generator is exhausted, the entirety of its possible generational variance is also exhausted and understood. The natural solution to this problem is to introduce obfuscation into the decryptor routine itself, leading to an infinite number of possible decryption routines [60]. This led to the first generation of polymorphic viruses, such as 1260, and popularized generators such as the Phalcon/Skism mass-produced code generator (PS-MPC) and the virus creation lab (VCL), which are still used to this day.

Polymorphism.
Polymorphic malware was seen as a complete package, equipped with a compiler that could decrypt, obfuscate, and then recompile everything back together. The unencrypted virus body would create a new mutated decryptor using a random encryption algorithm and then allow the decryptor to encrypt itself before linking both sections back together. However, the core problem of emulation remains: the virus code section is decrypted into memory, where it can be detected and flagged by security researchers. Prior generations of obfuscators also suffered from several limitations [61]:
(1) Constant size of virus code between generations (Polimer.512.A or Vienna viruses)
(2) Appending or prepending to the infected host file meant signature scanning could target these sections exclusively
(3) Similar virus code segments between generations mean the virus is subject to entropy analysis
The metamorphic engine was introduced in order to address some of these deficiencies.
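The entropy analysis mentioned in limitation (3) is commonly based on the Shannon entropy of byte values, where values approaching 8 bits per byte are taken as a hint that a region is compressed or encrypted. The sketch below is one simple way to compute it, not the method of [61]; the function names and the 256-byte chunk size are arbitrary assumptions, and no detection threshold is implied.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def chunk_entropies(data: bytes, size: int = 256) -> List[float]:
    """Entropy of consecutive, non-overlapping chunks of data."""
    return [shannon_entropy(data[i:i + size])
            for i in range(0, len(data), size)]


padding = b"\x00" * 256          # a constant region: minimal entropy
uniform = bytes(range(256))      # every byte value once: maximal entropy
profile = chunk_entropies(padding + uniform)
```

Plotting such a profile along the length of a binary makes packed or encrypted regions stand out against ordinary code and data.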

Metamorphism. The introduction of metamorphic viruses brought, for the first time, the idea that no two generations of a virus need have similar signatures, as no constant body is present as there is with polymorphic malware [43]. In Figure 3(b), an example of a metamorphic virus is shown. Unlike polymorphism, the virus code itself is obfuscated, meaning that the entirety of the virus is present in an obfuscated state. This introduces a fundamental issue, since "metamorphics are body-polymorphics" [62]: they have no constant body, and they reinforce the notion that anomaly-based detection is NP-complete [63,64]. The first metamorphic viruses were W95/Regswap in 1998 [65], followed by W32/Ghost, identified in 2000 [66]. W32/Ghost contained 10 submodules, so over 3.6 million variations (10! = 3,628,800) were possible with subroutine reordering.
As the graphic in Figure 3(b) shows, the separation between the decryptor and the virus body is no longer possible, and the level of obfuscation means that encryption is no longer needed. Furthermore, as is typically the case, the decryption routine is scattered in the benign code. The executed code in the virus body mutates entirely along with the decryptor, and it does not need to unpack to create a new constant virus body as polymorphics do [50]. One of the most utilized and effective metamorphic generators is W32/NGVCK, created in 2001. Metamorphic viruses have a sophisticated mutation engine that contains many subprocesses. These will be discussed in the following section.
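The variant count quoted for W32/Ghost follows directly from counting orderings of its submodules. The short sketch below reproduces that arithmetic; the function name `reorderings` is a hypothetical helper introduced here for illustration, and it assumes every submodule can be placed in any position.

```python
import math


def reorderings(num_subroutines: int) -> int:
    """Number of distinct orderings of independently placeable subroutines."""
    return math.factorial(num_subroutines)


# Ten reorderable submodules, as reported for W32/Ghost.
print(reorderings(10))  # 3628800
```

The count grows factorially, which is why even a modest number of reorderable subroutines defeats a signature that depends on subroutine order.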

Metamorphic Engine.
A metamorphic engine is responsible for the obfuscation and reconstruction of the binary so that the file can remain operational. In Figure 4, an illustration of a complete metamorphic engine is shown. Some of the key components of the metamorphic engine are described as follows [65,67]:
Disassembler is responsible for turning the binary code into assembly instructions. This creates an intermediate form that is independent of the CPU architecture for future adoption with different OS and CPU architectures [43]. Within the disassembler, a code analyzer gathers information related to control flow, subroutines, variables, and registers and provides it to a code transformer module.
Shrinker eliminates much of the garbage and other nonsequential code produced by obfuscation in previous generations. This step also carries out code shrinking, a form of code substitution that turns previous 1-to-2 or 1-to-3 instruction substitutions back into their semantically equivalent primitive instructions [68].
Permutor carries out much of the obfuscation using permutations of subroutines, often in a randomized fashion. Insertion of jmp instructions is also common to divert control flow.

Expander performs instruction substitution to convert instructions into another equivalent instruction set. In addition, registers are reassigned and variables are reselected according to fixed probabilities using substitution tables [65,69]. Garbage and other do-nothing code is added, and functions are inlined/outlined [70,71]. Both the permutor and expander steps are quite sophisticated in the metamorphic W32/Etap and W32/Zmist viruses [60].
Assembler restructures the control flow and converts the assembly code back into machine binary code where it can become operational again.
Virus code contains the core instruction set that will execute on all new generations of the virus. It also contains the instructions that coordinate with the mutation engine and other components.
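The relationship between the expander and the shrinker can be illustrated with a toy model that treats instructions as plain strings. This is only a sketch under stated assumptions: the substitution table `EXPANSIONS`, the `GARBAGE` list, and the functions `expand` and `shrink` are hypothetical and are not taken from any real engine, and the listed substitutions ignore CPU flag side effects.

```python
import random

# Hypothetical substitution table: primitive instruction -> equivalent sequences.
# Flag side effects are ignored in this toy model.
EXPANSIONS = {
    "mov eax, 0": [["xor eax, eax"], ["sub eax, eax"], ["push 0", "pop eax"]],
    "add eax, 2": [["inc eax", "inc eax"], ["add eax, 1", "add eax, 1"]],
}
# Hypothetical do-nothing instructions used as garbage.
GARBAGE = ["nop", "xchg ebx, ebx"]


def expand(program, rng):
    """Expander: substitute equivalents and sprinkle garbage after instructions."""
    out = []
    for ins in program:
        choices = EXPANSIONS.get(ins)
        out.extend(rng.choice(choices) if choices else [ins])
        if rng.random() < 0.5:
            out.append(rng.choice(GARBAGE))
    return out


def shrink(program):
    """Shrinker: drop garbage, then collapse known sequences to primitives."""
    cleaned = [ins for ins in program if ins not in GARBAGE]
    reverse = {tuple(seq): prim
               for prim, seqs in EXPANSIONS.items() for seq in seqs}
    out, i = [], 0
    while i < len(cleaned):
        for length in (2, 1):  # try the longest pattern first
            key = tuple(cleaned[i:i + length])
            if key in reverse:
                out.append(reverse[key])
                i += len(key)
                break
        else:  # no pattern matched: keep the instruction unchanged
            out.append(cleaned[i])
            i += 1
    return out


original = ["mov eax, 0", "add eax, 2", "ret"]
variant = expand(original, random.Random(7))
print(variant)
print(shrink(variant) == original)
```

Because the shrinker inverts the expander's table, each generation can be normalized before it is expanded again, which keeps the code from growing without bound across generations.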
The mutation engine does not have to operate at the assembly or source code level; it can also operate at an intermediate representation (IR) bytecode level [70]. In [72,73], morphing techniques are treated as deterministic automata, whereby transitions following a formal grammar are applied to symbols and new mutations are produced. In [69], a template is used to illustrate how simple representations of a formal grammar can produce several possible mutations. The depiction shown in Figure 4 includes all the core components with the exception of a decryption routine. A metamorphic engine with the addition of a decryption routine is shown in Figure 1 and follows a sequence of steps to decrypt, obfuscate, and link everything back together. The steps, in order, are as follows:
(1) First, the decryption routine decrypts the virus body and executes an instance of it.
(2) The decryption routine then decrypts the mutation engine and executes it.
(3) The shrinker component of the mutation engine goes to work to deobfuscate the virus body.
(4) Obfuscation takes place by introducing a new and unique decryption routine using the various techniques discussed in Section 2.
(5) The virus body is then obfuscated by the mutation engine to produce a unique generation using the various techniques discussed in Section 2. The virus body is then encrypted using a unique algorithm, a static key, or a host-specified temporary key. More is given on this in the following section.
(6) Finally, the mutation engine is encrypted.
Once all three components are reobfuscated into seemingly new binaries, with the mutation engine and virus body encrypted, the virus relinks its components and can execute on a new host by decrypting its payload through its newly obfuscated decryption routine.
The authors in [57] provide a detailed summary of the production and considerations for creating a metamorphic generator, as do the authors in [74] for creating a metamorphic worm. One of the more sophisticated metamorphic viruses is W32/Simile, also known as MetaPHOR or Etap. The author, "Mental Driller," referred to the expansion, contraction, and permutation of instructions as the "Accordion Model" [61,67], based on the changing form that garbage code takes when it becomes obfuscated. The Simile virus was also unique in that 90% of the virus code was dedicated to the metamorphic engine itself, with the decryptor being placed at the end of the code section and the virus body being partitioned elsewhere [43,52].
3.5. Encryption. While encryption was briefly touched upon at the beginning of Section 3, obfuscation engines make use of a variety of encryption techniques to avoid detection [49]. The earliest form of encryption was carried out by the CASCADE virus on DOS [40], which used a simple xor (see Figure 5).
The Cascade virus, first identified in the late 1980s, was shown to increase the size of infected files by 1701 or 1704 bytes and mainly comprised its encryption loop and main body. The virus uses a technique called "cascading" to conceal its presence. When the infected files are executed, the virus code is executed first, causing the virus to infect more files and directories. This creates a cascading effect, making it difficult for antivirus programs to detect and remove the virus [75]. The decryption routine in Figure 5 is fairly simple: the stack pointer, sp, acts as the key, and the si register keeps track of the current position in the virus body. As decryption is carried out, si is incremented and sp is decremented by one on each iteration, and the loop jumps back using jnz until sp reaches 0. For example, applying a simple xor operation to each byte using an 8-bit value as the encryption key will produce the encrypted text. The string 2D03 002E, when xor'd with the key 0xFF, will produce D2FC FFD1. Doing so in reverse with the same key will produce the original text, thereby performing encryption and decryption with only one key.
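The single-key xor example above can be checked with a few lines of Python. This is a minimal sketch of byte-wise xor with a fixed 8-bit key, not a reproduction of the Cascade routine in Figure 5 (which uses a changing key held in sp); the function name `xor_bytes` is introduced here for illustration.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with the same 8-bit key; applying it twice is the identity."""
    return bytes(b ^ (key & 0xFF) for b in data)


plain = bytes.fromhex("2D03002E")
cipher = xor_bytes(plain, 0xFF)
print(cipher.hex().upper())                  # D2FCFFD1
print(xor_bytes(cipher, 0xFF).hex().upper()) # 2D03002E
```

Because xor is its own inverse, the same routine and the same key serve for both encryption and decryption, which is what makes such a decryptor so compact.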
Conventional decryption relies on the virus's own decryptor loop to decrypt the virus body. It did not take long for malicious actors to rely on multiple decryptors instead of one, such as the DOS/Whale virus in 1990, which utilized dozens of different decryptors and chose one randomly for each infection. Rather than being performed serially, decryption can also be performed in a random order, as is the case for W32/MetaPHOR, which does so seemingly randomly, with each instruction being decrypted only once. In malware deployments, a crypter is typically used, which carries out encryption for antianalysis and obfuscation purposes. A crypter contains a stub which carries out the decryption and does so while generating a new payload and key with each new generation [48,76]. All of this occurs in memory, and nothing is written to disk. Decryption can take place in the stack, but then the key to it is not writable, as opposed to allocating memory, which is easily flagged by emulators that are monitoring memory. On Intel x86 platforms, 24 bytes or more of modified memory is characteristic of a decryption routine [28]. Once the stub passes control to the virus body after decryption, a new encryption key is created and all executables and .text sections are encrypted with the new key. Depending on the file type, a TEA cipher can be used for EXE files and RC4 for DLLs, as is the case for HackedTeam's core-packer [77]. The key is then stored in the decryptor stub or elsewhere.
Basic encryption can be performed as mentioned previously with a single decryptor key (see Figure 6), using a one-to-one byte mapping, with zero-operand (keyless) instructions such as inc or neg, or with reversible instructions such as add or xor. Alternatively, sliding key encryption makes use of a starting key which updates as it progresses; the update may utilize the characters most recently encrypted (see Figure 6(b)) or be based on an algorithm, as shown in Figure 6(c). Flow encryption determines a key stream in advance, equal in size to the encrypted text, and then encrypts the body instruction by instruction. Key storage can also vary amongst decryptor routines: a key or keys can be located in the decryptor stub itself, hidden among the virus body, generated uniquely from the host system, or randomly generated and not stored at all.
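The two sliding-key variants described above can be sketched as follows. The exact update rules in Figure 6 are not reproduced here; the additive step in `sliding_key_xor` and the feedback rule in `feedback_encrypt`/`feedback_decrypt` are assumptions chosen for illustration, and all function names are hypothetical.

```python
def sliding_key_xor(data: bytes, start_key: int, step: int) -> bytes:
    """Algorithmic sliding key: the key advances by a fixed step per byte.

    XOR is symmetric, so the same call encrypts and decrypts.
    """
    key = start_key & 0xFF
    out = bytearray()
    for b in data:
        out.append(b ^ key)
        key = (key + step) & 0xFF
    return bytes(out)


def feedback_encrypt(data: bytes, start_key: int) -> bytes:
    """Feedback sliding key: the next key is the byte most recently encrypted."""
    key = start_key & 0xFF
    out = bytearray()
    for b in data:
        c = b ^ key
        out.append(c)
        key = c
    return bytes(out)


def feedback_decrypt(data: bytes, start_key: int) -> bytes:
    """Inverse of feedback_encrypt: track the previous ciphertext byte as the key."""
    key = start_key & 0xFF
    out = bytearray()
    for c in data:
        out.append(c ^ key)
        key = c
    return bytes(out)


body = b"AAAA virus body"
enc = sliding_key_xor(body, start_key=0x10, step=3)
print(enc.hex())
print(sliding_key_xor(enc, start_key=0x10, step=3) == body)
print(feedback_decrypt(feedback_encrypt(body, 0x42), 0x42) == body)
```

Unlike a fixed key, a sliding key maps repeated plaintext bytes to different ciphertext bytes, so simple frequency patterns in the body no longer survive encryption.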
The sources for the encryption key can vary: the key can either be hardcoded in one form or another or obtained through the host. In the case of variable key generation, the decryptor can develop the encryption key based on its own function calls. Alternatively, environmental key generation does not involve any descriptors from the viral payload or stub itself but rather retrieves them from the infected host. One example of environmental key generation is the use of a trusted platform module (TPM) chip, which is a hardware component built into many modern computers and devices [78]. The TPM can generate unique encryption keys that are tied to specific physical attributes of the device, such as the device's BIOS, firmware, or other hardware components. This makes it much more difficult for an attacker to access the key and decrypt the protected data even if they are able to physically access the device. In the case of the RDA.Fighter virus family, the virus checks the BIOS address at FFFF:000E, and if it returns advanced technology (AT), as in AT-class computer, the timestamp is retrieved from the CMOS buffer; otherwise, it is retrieved from the system clock. The timestamp is then used to create a 16-bit number that is used to decrypt the next code section using a mirror table lookup as a mask. In addition to time, the current date, timer tick, host filename, and even the hard disk serial number can act as sources for developing the encryption key. As a form of armoring, the key can be stored on a distant web server, and outside of a typical host environment, such as in virtualization or emulation, the virus can disable itself and fail to run.
Decryption is not limited to a single loop or to a single key. For example, the RDA.Fighter virus family utilizes 16 layers of decryption and does so in a backward fashion, making it laborious to automate the disassembly process [28]. Multiple layers of encryption are also utilized by the W32/Harrier and Bradley viruses [79]. To avoid all forms of local or external storage of the key, a random decryption algorithm (RDA) can be used to brute force the key. The key can be any generated word value, and the decoding method will verify the checksum following the decoding procedure to identify when it has successfully found the key. In the RDA.Fighter family, RDA is used as a secondary form of encryption on top of environmental key generation.
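The RDA idea of discarding the key and recovering it by brute force against a checksum can be sketched as below. This is an illustrative model only: the two-byte xor cipher, the use of `zlib.crc32` as the checksum, and the function names `word_xor`, `seal`, and `rda_recover` are assumptions made here and do not describe the RDA.Fighter implementation.

```python
import zlib


def word_xor(data: bytes, key: int) -> bytes:
    """XOR data with a 16-bit (word) key, alternating low and high key bytes."""
    kb = (key & 0xFF, (key >> 8) & 0xFF)
    return bytes(b ^ kb[i % 2] for i, b in enumerate(data))


def seal(body: bytes, key: int):
    """Encrypt the body and keep only its checksum; the key is discarded."""
    return word_xor(body, key), zlib.crc32(body)


def rda_recover(cipher: bytes, checksum: int):
    """Try every word key until the decoded body matches the stored checksum."""
    for key in range(0x10000):
        candidate = word_xor(cipher, key)
        if zlib.crc32(candidate) == checksum:
            return key, candidate
    return None, None


cipher, stored = seal(b"illustrative payload bytes", 0xBEEF)
key, body = rda_recover(cipher, stored)
print(hex(key), body)
```

The search returns the first key whose decoding matches the checksum; with a weak checksum a false match is possible in principle. The cost of the search is also what burdens an emulator, since up to 65,536 trial decodings may be needed before the body becomes visible.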

Compression.
Compression represents an additional level of obfuscation on top of a possible decryption routine and other forms of obfuscation. A packer is defined as a utility which applies some form of compression to the executable, either to reduce file size or to introduce a layer of obfuscation to the PE header. It has been estimated that 80% of all malware uses some form of packer [80], as do 90% of all worms [43]. Two of the most popular packers are Ultimate Packer for eXecutables (UPX, https://upx.github.io/) and ASPACK (https://www.aspack.com/). In addition to significant compression ratios and great performance, these packers work for a variety of executable formats with no memory overhead due to in-place decompression. Packers are ultimately tasked with producing executables that contain decompression code and a compressed payload. Packers compress the code to hinder reverse engineering and bypass firewalls. Malware makes use of packers by initially converting an Image Section (see Figure 7(a)) into a Packed Section and Unpacking Section (see Figure 7(b)). The Unpacking Section is then set to be the initial point of entry once the file is executed. Upon execution, the packed section is decompressed to become the Unpacked Section (Figure 7(c)) and is executed in virtual memory [81]. One of the more devious effects of packers for malware analysis is that the original PE header is hidden, as the visible import functions are those utilized by the packer itself. Since packers such as UPX, ASProtect, PECompact, and Themida are widely used for nonnefarious purposes as well, there is no sure indication that the file is malicious based on the import functions [82-84].
One of the more comprehensive approaches to the detection of malicious packers is entropy analysis [1]. In the work of [85], 28 different packers were used to classify a control flow graph as an image representation through the use of a convolutional neural network (CNN). The work of [86] used CNNs for a similar purpose but categorized 9300 malware variants into 25 malware families based simply on the malware binary. These techniques have the advantage of allowing the neural network to learn which PE sections are important in identifying maliciousness; in doing so, it uses an advanced form of entropy analysis which can identify malware family usage of packers, encryption, and garbage code obfuscation [86]. When compression is coupled with encryption, as is the case with so-called Protectors, the resulting binary has high entropy levels, making it susceptible to classification. In [58], a file segmentation method that utilized entropy with wavelet analysis was used to classify metamorphic malware based on edit distance between file segments. This motivation was derived from the earlier work of [87], which established that the homogeneity of each malware's binary section is characteristic of the complexity of its data order.
Along with this insight, polymorphic malware can be identified using these techniques, albeit with a high rate of false positives [87].
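The entropy measurements referred to above are typically byte-level Shannon entropy, computed over a whole file or over sliding windows. The following is a minimal sketch of that calculation; the window size of 256 bytes and the function names are assumptions for illustration and are not taken from the cited works.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def windowed_entropy(data: bytes, window: int = 256):
    """Entropy of consecutive fixed-size windows, useful for spotting packed regions."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data), window)]


low = b"\x00" * 512             # padding-like region
high = bytes(range(256)) * 2    # uniformly distributed bytes
print(windowed_entropy(low + high))
```

Compressed or encrypted sections tend toward the 8 bits-per-byte ceiling, while plain code and padding sit well below it, which is why a per-window profile can separate a packed payload from its unpacking stub.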
In Figure 8, a historic summary is provided, which is complete with major milestones in obfuscation and new malware deployments.

Metamorphic Datasets, Generation Kits, and Armoring
While metamorphic malware has grown in sophistication, so have the tools available to researchers to thwart it. One such resource is publicly available datasets, such as DARPA99, a popular dataset released to improve intrusion detection systems. Datasets encourage the development of classification tools by leaving to others the details of collecting representative samples in a controlled environment and at scale. Secondly, datasets provide a baseline against which to compare competing algorithms, usually with the aim of increasing true positive rates and decreasing false positives. One of the downsides is that these datasets are typically outdated and are not representative of new and emerging threats. If researchers make raw malicious binaries available, as is the case with the SOREL dataset [88], they cannot do the same for benign binaries due to issues with copyright. One workaround used in SOREL is to dump the entire metadata of the binary and use that metadata dump to create features for a model to learn from. This section will touch on some of the more useful malware datasets used historically and then transition into covering some aspects of malware generation kits and antiarmoring behavior.
The KDD99 dataset was derived from the DARPA dataset [89], with a reduced size and a total of 24 attack types and an additional 14 existing solely in the test dataset [90,91]. Based on the observations of [91], KDD99 was the most widely used dataset in IDS research between the years 2010 and 2015. Several issues arose with the use of KDD99; namely, the time-to-live (TTL) values for benign and malicious packets were different [92,93], and the data rates were not characteristic of real-world networks [94]. Many of these issues were highlighted in the critique carried out by [93], leading to much needed modifications to the existing dataset. In addition, since the KDD99 dataset was large for many trainable models and contained duplicates of attacks such as DoS, the dataset was further reduced to become its most recent version, NSL-KDD [93].
Another dataset containing network traffic is UNSW-NB15. The dataset was created with the IXIA PerfectStorm tool at the Cyber Range Lab of the Australian Center for Cyber Security [95]. A TCP dump tool was used to capture 100 GB of raw traffic, with a total of 49 features generated using a set of tools and algorithms. Other lesser known network datasets for network intrusion detection include CAIDA [96], ISCX 2012 [97], and CICIDS2017 [98]. The CICIDS2017 dataset is unique in that the authors included behavior for Windows (XP, 7, 8, and 10), macOS, iOS, and Linux operating systems, encompassing attacks from botnets, DoS, DDoS, brute force FTP, brute force SSH, Heartbleed, web attack, and infiltration [98]. For a thorough summary of network-based datasets, we refer the reader to the review carried out by [99].
Several datasets have been used to represent the content of the malware binary rather than relying on network activity. One of the more utilized datasets is the Microsoft Malware Classification Challenge dataset, which was popularized in a Kaggle competition in 2015. The raw data of a virus's binary are represented in hexadecimal, with a compilation of metadata retrieved using the IDA disassembler tool. Image representations of malware binaries have also become popular as datasets for image analysis, with the Malimg dataset [100] having the greatest impact in recent years [101-109]. Other alternatives include the Malicia dataset [110], which contains 11,668 malicious binaries from 54 families retrieved from 500 drive-by downloads over 11 months. However, the project was ultimately discontinued in 2016. The Malsign dataset [111] contains 142,000 signed malware and potentially unwanted product (PUP) binaries obtained from 2012 to 2015 for the Windows platform [112].
Mobile and internet-of-things (IoT) security plays a unique but important role in malware security, as these devices make up a larger proportion than ever of how we connect with others and exchange information. The Drebin dataset [113,114] is one of the most used datasets in mobile security, with more than 5500 malware samples belonging to 20 families, collected from 2010 to 2012. The Android adware and general malware dataset (AAGM) [115,116] includes network activity of 1900 adware, general malware, and benignware applications running on Android smartphones. IoTID20 [117] is a more recent dataset of simulated network attacks captured from two smart home devices. The dataset consists of 42 pcap files encompassing simulated attacks produced from Nmap and from the Mirai botnet [118,119].
Several datasets include features extracted directly from PE files, including the ClaMP and EMBER datasets. ClaMP [120] includes features from the DOS header, file header, and optional header of PE files. The integrated dataset includes 68 features: 28 features are from the raw dataset, 26 features are Boolean (file and optional header), and 14 are derived features. A second version of the dataset exists which consists of 56 features. Finally, the largest dataset by far is the EMBER dataset [121], with a total of 1.1 million binary files.

The authors in [122] include additional tools to extract features from the PE files to further encourage the use of the dataset for training on benchmark problems. The EMBER dataset was the largest of such datasets until the introduction of the SOREL dataset in 2020, which expanded from 1.1 million binaries to 20 million binaries, including 10 million disarmed malware samples ready for feature extraction [88]. The Australian Defense Force Academy (ADFA) is the author of two datasets: the Linux dataset (LD) [123,124] and the Windows dataset (WD) [125]. Both datasets provide a comprehensive simulation of a HIDS based on the collection of system calls; however, a significant downside of ADFA-WD is that it was collected solely on Windows XP, which limits its applicability to later generations of the Windows OS [125].
Insider threats are considered one of the more rapidly emerging sources of security vulnerabilities for governments and firms. CERT identified that 15-24% of firms experience an insider incident perpetrated by a business partner [126]. It has also been noted that a quarter of cyber security risks are due to insider threats, meaning that current or close business partners are considered as much of a threat as ransomware from a security standpoint [17]. That is why a dataset such as the CERT insider threat V.2 dataset is so important to our understanding and tracing of threats that exist in network topologies [127]. The dataset includes several synthetic threat scenarios, accompanied by information related to HTTP records, employee info, and log on/off times, among other indicators. A summary of the datasets discussed, along with some information on their makeup, is shown in Table 9.
Virus repositories are also a source of millions of malicious binaries and source code for malware research. theZoo (https://github.com/ytisf/theZoo) from [263] contains hundreds of malicious binaries that are updated on a regular basis as new threats emerge and as virus source code becomes available [264]. VirusTotal (https://www.virustotal.com/gui/home/) contains one of the most comprehensive repositories used in the industry today. Malicious binaries can be uploaded or searched via MD5 hash to provide a detailed summary of the threat and other metadata. VirusTotal also comes equipped with a public and private API that allows threats to be uploaded while returning a detailed report, along with which AV vendors have already developed a signature for the given binary. VirusShare (https://virusshare.com/) is a searchable sample database, boasting more than 34 million malware samples for use by analysts, researchers, and the security community [265]. Other less popular repositories for sharing malware for research purposes include Malshare, VirusBay, and Das Malwerk.

Metamorphic Generation Kits.
Virus generation kits facilitate the creation of the bulk of the new virus signatures seen every day. These kits perform some, if not all, of the types of obfuscation outlined in Section 2 to evade signature-based techniques and are a significant problem for AV vendors and researchers alike. In addition, some kits even let users customize the level of obfuscation and encryption to introduce variation into the malware generation and are able to enact antiemulation and armoring behavior. Some generation kits have been easily flagged by AV vendors since their generated code would contain similar code between generations; therefore, only a few signatures could flag the entire generation, rendering the generation kit obsolete. Depending on the generation kit, COM and EXE viruses can be produced directly, while other kits generate the virus assembly code. For example, Borland Turbo Assembler (TASM) 5.0 can assemble an ASM file into an object file, and TLINK then takes the object files and libraries and links them together to produce virus executables. As demonstrated in Figure 9, disassemblers such as IDA Pro can be used to produce the ASM files [266]. The ASM files can then be used to extract opcodes and other feature sets for use in malware classification [267]. This section will discuss several popular generation kits used in research, with a brief description of some of the obfuscation techniques used by each generator.
The Phalcon/Skism mass-produced code generator (PS-MPC) was developed in 1992 and includes over 25 options for different types of encryption and payload types, as well as options to be memory resident. The generator employs its own decryption routine but lacks options for stealth techniques. PS-MPC generates files that reside in memory long enough to infect all COM and EXE files. The advantage of PS-MPC at the time of creation was the ability to carry out code generation in batches, because the generator is script-driven and operates as a code-morphing engine [43]. While all PS-MPC-generated code is readily flagged by AV vendors today, the generator is still used for research on metamorphic malware [31,51,268-271]. The mass-produced code generation kit (MPCGEN) was first developed in 1993 and was used to create CFG files which were then passed to PS-MPC followed by TASM to produce 32-bit executables. The name "mass-produced" comes from the fact that the process of generating, compiling, and assembling can be carried out for 500 files in as little as 25 minutes. Similarly, MPCGEN is used to produce a high quality and quantity of metamorphic variants for research purposes [51,56,271-275].
Virus creation lab for Windows 32 (VCL32) was created in 1992 and was revamped in 2003. Created by a virus writer named Nowhere Man, a member of a group called NuKE, this generator can produce the assembly source code of viruses. This means the assembly code needs to be assembled and linked before the viruses are active. The versatility of VCL32 comes from being able to customize activation conditions based on date, time of day, number of infected files, computer country code, version of DOS, or the amount of RAM available. VCL32 supports COM file infections and generating companion viruses, as well as various encryption and infection strategies. As a complete package with a GUI and drop-down menus, the most recent version of VCL32, released in 2004, is commonly used in research [31,50,51,56,268,272,274,279,280].
The Next Generation Virus Construction Kit (NGVCK) is one of the more popular virus construction kits available. Developed in 2001 with the most recent version released in 2003, NGVCK has been widely adopted for use in developing 32-bit PE-EXE polymorphic malware, especially in a research environment [31,39,47 ...] ... techniques for obfuscation. In [51], NGVCK was compared to other popular generation kits, including G2, MPCGEN, and VCL32, and was noted to produce the highest rates of obfuscation compared to other kits. A similarity metric was used to compare assembly programs: no similarity was found with G2 and MPCGEN, up to 2.4% similarity was found with VCL32, and normal files had similarities between 0.98% and 1.2%. In [271], only a 10% similarity was found between NGVCK variants when run over multiple iterations, meaning that the kit produces a large amount of variability between uses. An example of two virus variations produced by the NGVCK generation kit is shown in Table 10. Obfuscation produces two semantically similar variants using garbage code insertion, instruction substitution, and subroutine reordering as techniques.
A more recent polymorphic engine was introduced in [69] as the virus and metamorphic worm (MWOR) generation kit. The effectiveness of the generation kit was demonstrated in [270] by its ability to fool common statistical analysis. The kit has also found more recent interest in research as it allows control over the proportion of garbage code and subroutine reordering [270,271,273,282,283,286]. This is extremely effective because inserting a certain amount of garbage code from benign files has demonstrated an improved ability to thwart AV scanners [39]. This section does not provide an exhaustive list of generation kits; on the contrary, these kits represent a small subset of the kits widely distributed. Websites such as VxHeavens were one such source until the website was taken down in March 2012 by Ukrainian police. Repositories containing over 200 generation kits once hosted on VxHeavens can be found circulating online to this day. As discussed, these kits include antiarmoring and antiemulation capabilities. Some of these will be discussed in the next section.
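Pairwise similarity between two generated variants can be approximated on their opcode sequences. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; it is not the similarity metric used in [51] or [271], and the opcode lists are invented for illustration.

```python
from difflib import SequenceMatcher


def opcode_similarity(a, b) -> float:
    """Similarity in [0, 1] between two opcode sequences.

    Uses difflib's ratio 2*M/T, where M is the number of matched opcodes
    and T is the combined length of both sequences.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical opcode traces of two variants of the same routine.
variant_a = ["mov", "add", "cmp", "jnz", "ret"]
variant_b = ["mov", "nop", "add", "push", "pop", "cmp", "jnz", "ret"]
unrelated = ["lea", "xor", "call", "sub"]

print(round(opcode_similarity(variant_a, variant_b), 3))
print(opcode_similarity(variant_a, unrelated))
```

Scores near zero between variants of the same generator, as reported for NGVCK, indicate that garbage insertion and reordering have removed most of the shared opcode runs that signature extraction relies on.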

Anti-Emulation, Stealth, and Code Protection.
Antiemulation is an all-encompassing term that includes the various armoring, stealth, and/or code protection techniques that are used to thwart or burden the reverse engineering of a malware sample. According to Symantec, approximately 28% of malware is VMware-aware [12]. One of the shortcomings of virtual machines and other honeypot deployments is that the environment they are deployed in is static, with several configurations set to default. It is for this reason that antiemulation malware can check the environment for indicators of virtualization and then fail to execute or burden the reverse engineering analysis with cumbersome instructions. This section will cover some of the actions taken by antiemulation malware to exploit its virtual environment and prevent security experts from understanding the full breadth of its behavior. Antiemulation checks fall into four categories: human interaction, configuration-specific, environment-specific, and VMware-specific checks [289,290].

Human Interaction.
Human interaction checks test whether actions routinely carried out by a user are being performed. This includes mouse movements, use of the clipboard, and opening and closing windows. The Cuckoo Sandbox, for example, has a setting which emulates this sort of user activity for each malware submission. Trojan Upclicker is a virus variant that monitors user input in the form of a left click in order to identify sandbox environments. It does this by using the SetWindowsHookEx() and GetLastInputInfo() APIs to determine the rate of user input over time. This identifies sandbox environments because automated analysis does not require the use of a keyboard and mouse [291].

Configuration-Specific.
Configuration-specific checks use time delays or other configuration to execute at a later time and date only if certain conditions are met. The Duqu virus, which was first identified in 2011, included a series of antistealth techniques in the form of delays as a precautionary measure [292]. Code injection only occurs after approximately 10-15 minutes, and the lifespan of Duqu is set by an unknown communication module that removes its hooks, deletes its kernel driver, and removes its registry key once the timer has elapsed [292,293]. The Kelihos botnet and Nap Trojan both make use of SleepEx() and NtDelayExecution() for extended sleep calls, with the Kelihos botnet having affected 41,000 users before being identified and taken down. Hastati has a hardcoded check so that it executes only from 2 pm on March 20, 2013; it does not execute if GetLocalTime() returns an earlier time, which it takes as indicating the presence of a virtualized environment [294].

Environment-Specific.
These checks look at the settings and parameters of the host operating system and hardware and decide whether to execute based on those findings [295]. Virtual machines incorporate virtual hardware which tends to have consistent configurations between VM deployments. Hardware such as network adapters, USB controllers, and audio adapters are all virtualized, meaning that MAC addresses, USB controller types, and SCSI device types are all telling signs of virtualization. The Scoopy Doo tool developed by Tobias Klein uses Windows Script Host to read registry keys located in HKEY_LOCAL_MACHINE\HARDWARE\DEVICEMAP\Scsi\ and HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Class associated with SCSI and can also look up keys that are associated with IO and ports for strings containing "VMware." In another application, malware can utilize the internal processor tick counter via the Read Time-Stamp Counter (RDTSC) instruction. Based on a random bit value that is returned, the decryptor contained within the malware will decode and execute the virus body; otherwise, it will bypass and exit [5,290]. It also performs a check of the physical hard drive serial number and checks if it is set to a default value of 00, which is typical in virtual machines. In the work of [296], the authors looked at antiemulator behavior in Android malware and noted that volume identifiers, network interfaces, and invoking the GPU were all techniques used to obfuscate Dalvik virtual machines. Other evasion techniques, such as exception process timing, IMEI checking, and checking the variability in sensors, have all been traced to emulation evasion in Android malware [297-302].
Alongside the specific checks mentioned above, general antidebugging makes it difficult for researchers to extract signatures or strings to develop systems to protect against them. An example is the Bistro virus, which inserts garbage code and dummy loops before the decryptor stub. As a result, millions of instructions execute before the malware has even unpacked, which burdens the emulator, and Bistro fails to run. During analysis, many malware variants are memory-resident, thereby requiring careful monitoring for the viral payload to load itself into memory before it can be dumped and analyzed [61]. In the past, malware authors have been one step ahead in their efforts to thwart the monitoring of memory dumps or memory snapshotting. An example is the Zmorph virus, whose decryptor rebuilds its instructions line by line by pushing the result into stack memory. One of the earlier adopters of this sort of technique was DOS/DarkParanoid, which contained 10 different encryption functions that it used to encrypt its previously run instructions while only allowing its current instruction to be decrypted at any point in time. Without a conventional decryption loop, it is a true polymorphic memory-resident variant. Other so-called "stealth viruses" employed reconnaissance of the OS by waiting until AV products check-summed programs to check for changes. When a file was read, as opposed to executed as is the case with user input, the virus took that as an indication of check-summing by the AV and removed itself from the target executable. Finally, it waited until the file was closed and then reinfected the file [303]. Using this process, it can follow the AV and infect every file on disk. A thorough summary of antidisassembly, antidebugging, and antiemulation techniques can be found in [43]. For a summary of Android application hardening used by malware authors and developers, we refer the reader to the work of [304].

Approaches to Feature Analysis
Malware features are typically categorized into two types: static and dynamic. Static features incorporate all the unique compositional information of the executable, irrespective of the contextual information of the target system [305-308]. That is to say, the static features of an executable would be the same regardless of what machine the malware is deployed on. Static features typically include the portable executable (PE) structure, assembly code instructions [5], list of DLLs, n-grams, and byte sequences. PE structure features would include information related to PE sections, resources, application programming interface (API) calls, as well as which dynamic link libraries (DLLs) are imported/exported. Most modern antivirus (AV) products employ a signature database which contains known signatures of the static features of malware. Dynamic features, by contrast, include API and DLL call graphs, information gathered from the file system and registry, as well as process and thread activity and the consumption of kernel resources. Dynamic analysis can also include temporal snapshots of process execution, memory, network, and system call logs [309]. Dynamic analysis is OS-specific because, depending on the system resources, account privileges, and other environmental variables, the malware will behave differently and have a different signature as a result.
The ability of malware to mutate has also presented a problem for researchers, rendering many of the legacy static approaches to malware research obsolete. As a result, dynamic analysis has been presented as the de facto standard in classification approaches, as it is impervious to the routine obfuscation and packing carried out by mutating malware. Nowadays, dynamic analysis represents some 51% of the analysis methods in the body of literature examined [306], with a unique combination of feature sets and model architectures being used to perform classification. It has been noted that malware classification is not a trivial problem, with some presenting the identification of a bounded-length mutating virus, or a polymorphic variant of one, as an NP-complete problem [63,310]. Characterizing malware is the fundamental issue of concern, and researchers and practitioners are constantly refining their methods to stay ahead of the curve. Figure 10 provides an illustration of the feature pipeline used for most malware classification approaches. Both static and dynamic features form the bedrock in the characterization of malicious behavior. Any number of these features can be combined to form a hybridized model for feature analysis, which is unofficially the third form of characterization.
Many of these methods are covered in the comprehensive reviews of [308,309], but this work will simply provide a narrow overview of malware detection approaches as they concern API calls. While API calls are just one of the many forms of static and dynamic behavior, they are one of the most consequential and information-rich sources of discrimination. But first, an introduction to the source of APIs, files known as dynamically linked libraries, is required and will be the topic of the next section.

Dynamically Linked Libraries.
Dynamically linked libraries, or DLLs, are libraries of code that are written by vendors such as Microsoft as well as third parties to coordinate and manage resources on the Windows OS. DLLs are fundamentally libraries of code that contain one or more functions, indicated in their Export Address Table (EAT), which identifies which functions are available for export to other processes. DLLs are structurally equivalent to executables, with the exception being that their main function is called DllMain, and they cannot be executed without the use of the helper programs RUNDLL.exe or RUNDLL32.exe, for 16-bit and 32-bit DLLs, respectively. DLLs are useful because they allow multiple processes to share the same library of code loaded into memory, thereby reducing the time required to recompile each process and the memory overhead that would result if the same code segments had to be loaded in memory multiple times. Because each process does not need to include static code for its functions, file sizes stay smaller overall when a process can connect to an already running copy of the library of functions. It also has the advantage of allowing the OS vendor to update a catalogue of core DLLs which can work with subsequent versions of the OS.
When an EXE requests that a DLL be loaded, Windows first checks a set of default directories. There is a known registry key, KnownDLLs, that tells Windows that the well-known DLLs should be found in the System32 path; otherwise, it searches in the .exe directory, the current working directory, the %SystemRoot% directory, the 16-bit system path, and then the directories in the environment PATH. DLL load-order hijacking is the process by which malicious actors place their own malicious DLLs somewhere in this load order so that their payload is loaded instead of a legitimate DLL. For example, ntshrui.dll is loaded by explorer.exe, but it is not a known DLL and can therefore be susceptible to load-order hijacking. DLLs that are fully protected can recursively load other DLLs that are not protected, which forces the next executable to follow the default search order and be prone to hijacking. The tool Dependency Walker (https://www.dependencywalker.com/) can be used to see the dependency tree between loaded DLLs on the OS. Legacy malware would change the Import Address Table (IAT) to point to a new address in memory for the DLL it needs. Changing pointers to new malicious address locations with malicious payloads has since been rectified on newer versions of Windows, as the change becomes apparent: if all the address locations for functions are in the higher memory space 0x7C86 and a single function is loaded at 0x3420, then most likely that IAT entry has been changed with a hook by a rootkit. Alternatively, malware can just modify the DLL inline, requiring no changes in pointers, just the code, leading to a vulnerability commonly known as DLL proxying, which is much harder to detect but can be flagged using integrity checking.
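The load-order reasoning above can be illustrated from a defender's point of view. The following is a minimal sketch, not a description of how Windows itself resolves libraries: it walks a simplified, hypothetical search order over a hypothetical snapshot of directory listings and reports whether a copy of a DLL found earlier in the order would shadow the copy in a trusted directory. The names `resolve_dll`, `is_shadowed`, `listing`, and `search_order` are illustrative assumptions.

```python
from typing import Dict, List, Optional, Set


def resolve_dll(dll_name: str,
                search_order: List[str],
                listing: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first directory in search_order whose listing holds dll_name."""
    wanted = dll_name.lower()
    for directory in search_order:
        names = {name.lower() for name in listing.get(directory, set())}
        if wanted in names:
            return directory
    return None


def is_shadowed(dll_name: str,
                search_order: List[str],
                listing: Dict[str, Set[str]],
                trusted_dir: str) -> bool:
    """True when a copy earlier in the search order would load instead of
    the copy that exists in trusted_dir."""
    resolved = resolve_dll(dll_name, search_order, listing)
    trusted_names = {name.lower() for name in listing.get(trusted_dir, set())}
    return (dll_name.lower() in trusted_names
            and resolved is not None
            and resolved != trusted_dir)


# Hypothetical snapshot: a second ntshrui.dll sits next to explorer.exe.
listing = {
    r"C:\Windows": {"explorer.exe", "ntshrui.dll"},
    r"C:\Windows\System32": {"ntshrui.dll", "kernel32.dll"},
}
# Simplified order (an assumption): the executable's directory, then System32.
search_order = [r"C:\Windows", r"C:\Windows\System32"]

print(resolve_dll("ntshrui.dll", search_order, listing))
print(is_shadowed("ntshrui.dll", search_order, listing, r"C:\Windows\System32"))
print(is_shadowed("kernel32.dll", search_order, listing, r"C:\Windows\System32"))
```

A check of this kind only flags candidates for review; a real audit would also need the KnownDLLs list and signature verification, which are outside this sketch.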
Potentially vulnerable DLLs can be observed using tools such as SysInternals' Process Monitor (Procmon (https://docs.microsoft.com/en-us/sysinternals/downloads/procmon)). In Procmon, if a DLL is not found and it is not core to the functionality of the process, it will return an entry NAME NOT FOUND. Using an out-of-the-box option like Metasploit's (https://www.metasploit.com/) msfvenom will produce a DLL that can be put in place of the missing DLL, thereby running the malicious payload and executing a successful DLL hijacking. Other tools such as the SANS (https://www.sans.org/blog/detecting-dll-hijacking-on-windows/) tool can be used to search for DLLs that appear multiple times, are unsigned, and are in unusual folders. More common in research, the Dependency Walker tool (https://www.dependencywalker.com/) makes it easy to view the mapping of imported DLLs and even to view a hierarchy of all dependencies between modules by looking at the IAT. The authors in [Wang, 2008] separated DLL usage according to implicit dependency, delay-load dependency, and forward dependency, which are all responsible for the static loading of DLLs in 3 tiers of hierarchy. Tier 1 starts from those used by the main program, followed by Tier 2, which contains DLLs invoked by other DLLs that are not in the main executable, with Tier 3 being the entire statically loaded tree. The authors created a one-hot encoded vector indicating whether a particular DLL existed in the program and used that feature mapping for classification. In [311], a similar approach was taken which relied on the DLL dependency tree but incorporated encoding tree string dependencies. The authors looked at all the tiers of DLLs which loaded and created a depth-first representation where the original executable is the root node and all nodes from root to leaf are assigned a unique integer value. They then used CMTreeMiner, which extracts closed frequent subtrees that exist in a particular executable, and one-hot encoded a feature vector indicating whether a particular subtree exists in the executable. Looking at depths of subtrees from 3 to 6, accuracies as high as 98%+ were obtained using random forest and naive Bayes classifiers. The work of [312] did not go into as much depth as [310], but the authors looked at the number of API calls by a DLL in addition to the list of DLLs used and the API calls made. In any case, while DLLs do provide a good proxy of malicious intent, it is in fact the API calls that are made that are the real discriminator. For this reason, researchers turn their focus towards API calls and their usage among malware variants.
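The one-hot DLL encoding described above can be sketched in a few lines. This is an illustrative example under stated assumptions rather than the cited authors' implementation: the sample import lists are invented, and `build_vocabulary` and `one_hot` are hypothetical helper names.

```python
from typing import Iterable, List


def build_vocabulary(samples: Iterable[Iterable[str]]) -> List[str]:
    """Collect every DLL name seen across samples into a sorted vocabulary."""
    vocabulary = set()
    for imports in samples:
        vocabulary.update(name.lower() for name in imports)
    return sorted(vocabulary)


def one_hot(imports: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Emit 1 where the sample imports the DLL at that position, else 0."""
    present = {name.lower() for name in imports}
    return [1 if dll in present else 0 for dll in vocabulary]


# Invented import lists for two executables.
sample_a = ["KERNEL32.dll", "USER32.dll"]
sample_b = ["kernel32.dll", "ws2_32.dll", "advapi32.dll"]

vocabulary = build_vocabulary([sample_a, sample_b])
print(vocabulary)
print(one_hot(sample_a, vocabulary))
print(one_hot(sample_b, vocabulary))
```

Lower-casing the names is a design choice made here because DLL file names are treated as case-insensitive on Windows; the same vector layout could hold subtree indicators instead of DLL names.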

Windows Application Programming Interface.
Windows API calls are interfaces provided by DLLs to access low-level resources [313]. API calls come in two flavors: user-level and kernel-level APIs. User-level APIs operate at Ring 3 and provide the average user just enough privileges to access system resources to perform typical workloads. The actual hardware, on the other hand, runs in kernel mode, which makes use of the kernel-level APIs that are not directly available to users for the sake of the security and stability of the OS. From the stability perspective, a user-level crash results in an error message, while a kernel-level crash results in the OS crashing. From the security side, malware could reside in the kernel and operate at a layer that is invisible to the user or any Ring 3 defenses. Nowadays, it is much less common to see malware residing in the kernel, as the Windows OS has made it more difficult to run code in the kernel and make use of rootkits. Ultimately, to make use of the kernel, all userland code uses Kernel32.dll as a gateway to communicate with Ntdll.dll which, in turn, communicates with the kernel.
The fascination with API calls comes down to the fact that API calls provide a higher resolution of analysis of the operation of any given process. API functions and system calls are related to the services provided by the OS [309,314,315]. As the API is responsible for all system resource management, it is a particularly discriminating feature for malware classification, as it provides the basic functionality for everything from networking to saving files to disk. The usage of APIs and patterns in usage can be very telling. Similar to the overarching view of static and dynamic analysis of behavior, APIs are approached from a static and dynamic perspective as well. In dynamic analysis, the run-time behavior is monitored, and ideally, all code segments are traced to reveal the behavior of the malware. This circumvents the obfuscation techniques of encryption, packing, and polymorphism [316]. Static analysis, on the other hand, can be fooled by adding fake API calls [317] or API calls typical of benign event activity [318]. It is also the case, as mentioned in Section 5.1, that the imported functions of a DLL may or may not ever be called, which can be used as a distraction from the real nefarious purpose of the malware.
Features such as the API call function names, parameters, and return values of an executable can be extracted from the APIs [319]. Monitoring the API calls is an approach to detecting the malicious behavior of software; however, there is no clear distinction between malicious APIs and benign APIs, as all native APIs are a helpful utility given the right context. The next section will outline some of the nefarious usages of APIs by malware authors and how they balance stealthiness with functionality.

Malicious Windows Application Programming Interface Usage.
Broadly speaking, API usage can be categorized into 7 categories based on the functionality they provide to a process [314,320]. Researchers have also made use of similar categories to classify malicious intent [184]. Some of the malicious functionality that APIs can provide to executables includes the following:
File: create a file in sensitive folders; delete or hide files; file directory traversal
Process: inject DLL into a running system process; create mutex to prevent execution
Memory: free up or occupy memory; minimize memory usage
Registry: add or delete system service; autorun, hide, and protect
Network: open and listen on a port, communicate over e-mail service, communicate with CnC server
Windows Service: terminate Windows update, firewall, set up Telnet or SSH
Others: hooking keyboard, hiding window, scan for existing vulnerabilities and configuration
Code injection usually begins with the usage of third-party DLLs or injecting code into a Windows DLL. Malware makes use of Ntdll.dll indirectly to make use of kernel APIs, so checking the stack trace of event activity is important [321]. Malware authors have to balance gaining increased functionality against the cost of raising suspicion, so a careful deliberation of which APIs to use is always in mind [322]. Native Windows API calls that begin with NtQuery are popular for malware, as they include functions such as NtQuerySystemInformation and NtQueryInformationProcess, which provide much more information about the host system. More invasively, early rootkits would make changes to the System Service Descriptor Table (SSDT), which contains addresses to the kernel functions, which would instead be changed to malicious driver functions. If, for example, a typical address of a kernel function is set to 804d7000 for ntoskrnl.exe, then one can look at addresses which are not familiar and contained within the address space typical for  (IDT). The IDT takes care of exception handling, so rerouting the response to interrupts to malicious code would be highly disruptive. As a precaution to prevent changes to native Microsoft DLLs and APIs, Windows Vista was the first Windows version to introduce digitally signed drivers. Some of the example use-cases and APIs used by malware are the following:
(a) File: if software wishes to make use of the file register, it can do so using CreateFile, ReadFile, and WriteFile. Malware can make use of CreateFileMapping or MapViewOfFile, which loads the file into RAM, avoiding writing to disk altogether. Some malware types, like ransomware, perform high-volume file and encryption operations to carry out their function [323].
(b) Process: it is typical for malware to use OpenMutex to check if a mutex exists for a running malware executable. Malware can make use of DLL injection or direct injection. Code can be injected into a running process using VirtualAllocEx and WriteProcessMemory. When the code is injected into an executable such as Explorer.exe, it holds the same privileges as the executable it is injected into. Asynchronous procedure call (APC) injection is a process by which malicious code is attached to the APC queue of a process' thread. WaitForSingleObjectEx is the most common call, with QueueUserAPC being used for queues running on a thread. It can be run from the kernel using KeInitializeApc and KeInsertQueueApc. APC remains a known vulnerability on the MITRE ATT&CK knowledge base [324].
(c) Registry: when it comes to making use of the Windows registry, malware can gain persistence so that it can load whenever Windows restarts [316,325] Hooking uses an API such as SetWindowsHookEx to notify about a key press, while polling is conducted using GetAsyncKeyState and GetForegroundWindow to poll key states during any time period.
Researchers have looked beyond individual API calls and have investigated API call distribution [327]. A summary of some of these classes of API usage used by researchers is shown in Table 11. The issue is that it requires significant domain expertise to create and update a database of API calls for particular malware variants or families. It is also the case that there is significant overlap between malicious and benign API usage, thereby making it difficult to flag malware without raising false positives. The work of [328] developed a similarity metric to trace the similarity between malware variants and Stuxnet based on groups of API calls. It stands to reason that groups of API calls in succession, or the distribution of API calls, can provide further insight into malicious behavior [334]. For this reason, we investigate some of these research methods in the following section.

Classification of Windows Application Programming Interfaces.
The investigation of API calls in the context of feature extraction is sometimes referred to as API call sequences or API call traces. Under either definition, we are concerned with the patterns that arise in the sequence of API calls used one after another. Early adopters of this form of investigation used Hofmeyr API call sequences, whereby behavior profiles were established between two sequences of API calls based on Hamming distance [335]. Originally, UNIX system calls were traced, and the investigators were motivated by the immune system in their attempt to draw an analogy between sequences of system calls and chains of amino acids in the human body. API call sequences have been leveraged in several applications involving malware detection [160,184,316,336-339], as well as in tracing the API call traces during event activity [316,340-343]. Overall,  [315].
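The Hamming-distance comparison mentioned above can be sketched as follows. This is a simplified illustration loosely in the spirit of the sequence-profile idea, not a reproduction of the method in [335]: the window length `k`, the call names, and the `anomaly_score` definition (the largest minimum distance of any window to the normal profile) are assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple


def windows(trace: Sequence[str], k: int) -> List[Tuple[str, ...]]:
    """All contiguous windows of length k taken from a call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def anomaly_score(trace: Sequence[str],
                  profile: Set[Tuple[str, ...]],
                  k: int) -> int:
    """Largest distance from any window of the trace to its nearest
    window in the normal profile (0 means every window is known)."""
    scores = [min(hamming(w, p) for p in profile) for w in windows(trace, k)]
    return max(scores, default=0)


# Invented traces; the normal profile is the set of windows seen in training.
normal_trace = ["open", "read", "mmap", "mmap", "open", "read", "close"]
profile = set(windows(normal_trace, 3))

print(anomaly_score(["open", "read", "close"], profile, 3))
print(anomaly_score(["open", "read", "write", "close"], profile, 3))
```

A trace whose windows all appear in the profile scores 0, while unfamiliar orderings raise the score; where to place the alert threshold is left open here.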

Application Programming Interface Frequency.
One of the more primitive approaches to API analysis is API frequency analysis. It stands to reason that if malware and benignware make use of similar API libraries, then malware must make use of certain libraries or "malicious" APIs more frequently than others. In [319], considering API frequency alone was effective in achieving 97% accuracy in a multicategorical classification problem involving metamorphic malware variants. One takeaway was that incorporating sequential information did improve the accuracy of the models, so frequency analysis is certainly a useful preliminary step in behavioral analysis. The work of [344] developed an end-to-end malware detector based on the frequency of occurrence of opcodes and API calls. Their detector, coined OPEM, demonstrated an increased area under the curve (AUC) and lower FPs with static calls and a hybrid approach. Unfortunately, the authors did not account for obfuscated malware, which tends to be packed and to have polymorphic engines which obfuscate the opcodes. Their hybrid approach, which included the API execution trace, did outperform all other feature sets used in their work [344]. Certain works, like that of [245], decided to use the frequency of a subset of 794 API calls extracted from 500 thousand malware samples. The authors then fused this feature set with other static techniques such as entropy and features extracted from the PE file, such as the total number of assembly instructions in the .data and .rsrc sections. The drawback to these approaches is that taking the most frequent API calls leaves out information on potential edge cases; it is also a fact that frequent API calls by malware are still routine events carried out by benignware, such as reserving memory, creating a file, etc.
The work of [345] approached the problem in a similar fashion, eliminating API calls with low frequency. Again, doing so removes important edge cases and is typically done to reduce the size of the feature vector space and improve training times. These aforementioned works all made use of ML techniques to classify malicious behavior. Other works make use of statistical similarity metrics to differentiate malicious from benign samples by using one or more metrics of comparison. For example, in [304], the authors made use of information gain to select features based on the sequence of opcodes from Android applications.
Based on some key obfuscation techniques discussed thus far, including control flow obfuscation and string encryption, in addition to advanced techniques such as class encryption and reflection, the authors found that several ML approaches were effective in detecting obfuscated samples.
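A frequency-based feature vector of the kind discussed above, including the pruning of low-frequency API calls, can be sketched as follows. The traces, the threshold `min_count`, and the helper names are illustrative assumptions and do not reproduce any cited work.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def select_frequent(traces: Iterable[Sequence[str]], min_count: int) -> List[str]:
    """Keep only API names whose total count across all traces reaches min_count."""
    totals = Counter()
    for trace in traces:
        totals.update(trace)
    return sorted(api for api, count in totals.items() if count >= min_count)


def frequency_vector(trace: Sequence[str], vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary API appears in one trace."""
    counts = Counter(trace)
    return [counts[api] for api in vocabulary]


# Invented API traces from two executables.
trace_a = ["CreateFile", "WriteFile", "WriteFile", "CloseHandle"]
trace_b = ["CreateFile", "RegSetValueEx", "CloseHandle"]

vocabulary = select_frequent([trace_a, trace_b], min_count=2)
print(vocabulary)
print(frequency_vector(trace_a, vocabulary))
print(frequency_vector(trace_b, vocabulary))
```

In this toy data the single registry call is pruned from the vocabulary, which illustrates the drawback noted above: thresholding on frequency can discard exactly the rare call that distinguishes a sample.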
In [346], the cosine similarity was proposed to compare API call frequency between two vectors, representing the similarity in vector space of a known signature to a new malware sample. The expression for cosine similarity is shown in equation (1). The motivation for using cosine similarity is that the measure computes the similarity between two vectors while excluding their magnitude. This has the effect of ignoring the impact of magnitude if one vector were to use an API much more frequently than the other, as the angle θ in equation (1) is indifferent to their magnitude.

$$\cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \qquad (1)$$
The extended Jaccard measure is another similarity metric that is useful in measuring the degree of overlap between two sets [346]. As an extension of Jaccard for use with continuous or count attributes, it is effective in demonstrating the similarity, or the ratio of set intersection, between two sets in the context of set theory. The equation for this relationship is shown in equation (2). The numerator can be seen as expressing the set intersection, while the denominator can be seen as the union, which acts as a form of normalization.

$$J(x, y) = \frac{x \cdot y}{\lVert x \rVert^{2} + \lVert y \rVert^{2} - x \cdot y} \qquad (2)$$
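To make the contrast between the two measures concrete, the following sketch computes both on small count vectors using their standard definitions: the dot product over the product of norms for cosine similarity, and the dot product over the sum of squared norms minus the dot product for the extended Jaccard measure. The vectors are invented, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math
from typing import Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """x.y / (|x| |y|); insensitive to vector magnitude."""
    norm_product = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / norm_product if norm_product else 0.0


def extended_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """x.y / (|x|^2 + |y|^2 - x.y); penalizes differences in magnitude."""
    xy = dot(x, y)
    denominator = dot(x, x) + dot(y, y) - xy
    return xy / denominator if denominator else 0.0


# Invented API count vectors: the second uses every API twice as often.
signature = [1, 1, 2]
sample = [2, 2, 4]

print(cosine_similarity(signature, sample))
print(extended_jaccard(signature, sample))
```

For these two vectors the cosine similarity is 1 up to floating-point rounding, because they point in the same direction, while the extended Jaccard value is 12/18, reflecting the difference in call volume.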
The cosine similarity was used effectively to create a similarity matrix between the rarest 20-30% of raw security events and the events of the training set [160]. This approach was used to significantly reduce the dimensionality of their set by focusing on the similarities between a baseline set of unusual events and their dataset more broadly. In [347], similarity metrics were computed for API sequences that appear frequently, and both assembly instructions and API calls were considered in their work. API calls were noted to be faster owing to their smaller signature; however, the authors noted that the API approach is poorly suited to network applications such as PuTTY and to encrypted files, which show few or no API calls. Their work did rely on unpacked executables, as it was limited to static analysis. In [346], an API call frequency similarity measure was used, followed by a chi-square test to test the representation based on a distribution from a known signature. Families of APIs of known metamorphic mutation engines were categorized and compared to one another and to the same mutation engine using both the cosine similarity and the extended Jaccard measure. An interesting finding was that comparing a similarity metric between variants from the same mutation engine provided a measure of the degree of obfuscation, which was shown to be the largest for the next generation virus creation kit (NGVCK), a well-known mutation engine. The work of [275] completed similar work, whereby a proximity index table was set up to compare the similarities between mutation engine families. Due to the sheer number of possible API calls, feature dimensionality reduction was carried out on the original 1000 or so APIs according to frequency. The authors noted that common APIs were used between mass code generator (MPCGEN) and NGVCK-generated viruses. An approach that included data mining was taken in [320], whereby the calling frequencies of the raw features are calculated to select a subset of features, and then principal component analysis (PCA) is used for dimensionality reduction of the selected features. In total, 24,662 API function calls and 792 DLL features, along with PE header info, were considered in their feature set while considering only the top 30 DLLs according to frequency [320]. To address the issue of high-dimensional data, the authors in [336] developed a string-based malware detection system that focused on the top 3,000 interpretable strings, which included API names, using a max-relevance algorithm. Their feature parser extracted strings from 9,838 executables and classified them as backdoors, spyware, Trojans, and worms, in addition to benignware. While these techniques have proven useful in many controlled scenarios, frequency-based analysis is still prone to malware which can obfuscate itself to avoid heuristic detection. For this reason, sequence analysis is used.

Application Programming Interface Sequences.
The investigation of API sequences has become the de facto standard for many behavioral approaches, as the information contained within sequences is too powerful to rely on API frequency alone. It has also led to the adoption of natural language approaches, which will be discussed in Section 5.4. The work of [316] provided an example of the flow of information surrounding a process that can act as a template for how to carry out sequence analysis of APIs. The three flow paths are as follows:
(1) The API call GetModuleFileName takes a NULL character as its first argument, which returns the malware file path.
(1.1) The path can be passed to CopyFile to open the executable and run its processes.
(1.2) Or, if desired, a process can call CopyFile on itself with the share permission set to NULL, thereby preventing applications from opening and scanning the file.
This example serves to demonstrate that two very different uses of CopyFile can indicate malicious behavior, and only once the whole context is understood can a detection system flag it. An application that performed this successfully was in [337], where 2,727 unique APIs were categorized into 26 groups based on functionality such as hooking, files and directories, registry modification, and others. Based on the sequence of the APIs, critical patterns were uncovered which were essential for core functionality such as screen capturing and DLL injection. Results demonstrated F1 scores as high as 0.999 with a focus on the longest common subsequence between existing malicious signatures and those of unknown variants. A similar approach was taken in [1], where 11 hand-crafted signatures of dynamic and static behaviors were created based on malicious operations spanning registry operations to device operations to kernel operations. These signatures were converted into semantic blocks based on the largest common subsequences between dynamic and static APIs. The work of [348] created a formulation that includes API sequences as part of a temporal domain and pointers passed to API calls as spatial information. The motivation is similar to that of [316], in that an API call such as LocalAlloc takes in uBytes as an argument, which is statistically lower for malicious files than for benign files during allocation of the heap. Capturing this information in the spatial domain, while modeling the sequences of APIs in the temporal domain, was effective in classifying 516 executables with accuracies as high as 0.966. Rather than focusing on API sequences as they pertain to general malicious behavior, researchers have also explored common API sequence usage among malware variants and types. In [330], five classes of malware, including Worm, Trojan-Downloader, Trojan-Spy, Trojan-Dropper, and Backdoor, were associated based on the presence of 26 API categories and sequences. In total, 534 malware variants were hooked and then categorized based on the presence of these API sequences, which were characteristically different for different malware types that pursue different objectives through their API usage. In [349], the authors considered 9 behaviors based on sequences of 2-4 APIs in succession, while [315] looked at combinations of 3 APIs (such as CreateFile, WriteFile, and CloseHandle). The work of [350] obtained a 99.7% detection rate using several API call sets, which included sequences of different lengths.
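Matching by longest common subsequence, as referenced above, can be sketched with the standard dynamic-programming algorithm. The signature and observed trace below are invented, and the coverage ratio at the end is an illustrative scoring choice rather than the metric used in the cited works.

```python
from typing import List, Sequence


def longest_common_subsequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest subsequence common to a and b (order preserved,
    gaps allowed), using the classic O(len(a) * len(b)) table."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    result: List[str] = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Invented signature and an observed trace padded with unrelated calls.
signature = ["CreateFile", "WriteFile", "CloseHandle"]
observed = ["CreateFile", "ReadFile", "WriteFile", "WriteFile", "CloseHandle"]

common = longest_common_subsequence(signature, observed)
print(common)
print(len(common) / len(signature))  # fraction of the signature recovered
```

Because a subsequence tolerates gaps, calls inserted between the signature's calls do not break the match, which is what makes the measure attractive against junk-call insertion.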
When it comes to determining appropriate sets of API calls for classification, researchers have pursued data mining approaches that optimize a set of association patterns towards a particular objective [351]; in this case, the objective is to decide whether a sample is malicious or benign. Several papers have been published in this area, in particular those out of Xiamen University [352][353][354], which focused on malware classification. Ultimately, regardless of the particular mining algorithm used, the idea is to find a set of API calls that supports the objective of separating malware from benignware. In [353], this was performed using a frequent pattern growth algorithm [355]. The goal is to create a frequent pattern tree, which encodes sequences in a tree-like structure similar to a Huffman coding, where parents of a node are encoded as longer extensions of the child sequences. So a given API call API_i would exist as a leaf node, while its parent nodes would contain sequences that contain API_i, such as (API_i, API_j) or (API_i, API_k). This is performed recursively up the tree, with frequencies stored as satellite information at each node, and this is how rules are generated. A new sample is then matched against the rules in descending order of the rules' confidence and support [356]. The motivation is to maximize the likelihood that rules exist which can discriminate one objective from the other. This procedure was further described in [352] and used successfully to generate rules which parse 29,850 Windows PE files, half of which were malicious. In [356], the authors compared frequency mining approaches to ML approaches including SVM, decision trees, and naïve Bayes and noted a 2-9% improvement in classification accuracy. Because these approaches extract the APIs statically from the PE files, they are not effective for packed malware or for APIs which are imported by the executable but never used. In a later paper by Ye and Yu [143], rule pruning was used to remove duplicate rules, and the authors elected to use only the top 100 API calls, as no further improvement was shown beyond 100. Using a linear SVM, an associative classifier, and a novel hierarchical associative classifier, 26 thousand malicious samples were parsed, and a precision as high as 96% was achieved but with a low recall of 34%. A thorough examination of the state of data mining approaches as they pertain to cyber security is given in [357]. While handcrafting sequence signatures can be time-consuming and requires knowledge of specific patterns in API usage, the alternative is to consider all possible subsequences of a given length and the usage patterns of all sequences simultaneously. While data mining does provide a compact representation to do this, more innovative works couple the representation with ML approaches and allow models to discern these rules on their own. For this purpose, the n-gram representation is used.
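The rule generation and matching steps described above can be illustrated with a minimal sketch. It is not the FP-growth algorithm of [355]: for brevity it enumerates small API itemsets by brute force, which is only feasible on toy data, and it is not the implementation used in [352] or [353]. The function names (`mine_rules`, `classify`), the thresholds, and the four-sample corpus are hypothetical; only the ranking of rules by descending confidence and support is taken from the text.

```python
from itertools import combinations

def mine_rules(samples, labels, max_len=2, min_support=0.5):
    """Brute-force stand-in for FP-growth: enumerate API itemsets and score them.

    Returns (itemset, label, support, confidence) tuples, best rule first.
    """
    n = len(samples)
    itemset_counts = {}   # itemset -> number of samples containing it
    labelled_counts = {}  # (itemset, label) -> samples containing it with that label
    for apis, label in zip(samples, labels):
        unique = sorted(set(apis))
        for k in range(1, max_len + 1):
            for itemset in combinations(unique, k):
                itemset_counts[itemset] = itemset_counts.get(itemset, 0) + 1
                key = (itemset, label)
                labelled_counts[key] = labelled_counts.get(key, 0) + 1

    rules = []
    for (itemset, label), hits in labelled_counts.items():
        support = hits / n                           # share of all samples matching the rule
        confidence = hits / itemset_counts[itemset]  # P(label | itemset present)
        if support >= min_support:
            rules.append((itemset, label, support, confidence))
    # Descending confidence, then descending support; itemset breaks ties.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules

def classify(apis, rules, default="benign"):
    """Return the label of the highest-ranked rule that the sample satisfies."""
    present = set(apis)
    for itemset, label, _support, _confidence in rules:
        if present.issuperset(itemset):
            return label
    return default

# Hypothetical toy corpus: API names imported by each sample.
samples = [
    ["CreateFileW", "WriteFile", "RegSetValueExW"],
    ["CreateFileW", "WriteFile", "CreateRemoteThread"],
    ["CreateFileW", "ReadFile"],
    ["CreateFileW", "ReadFile", "WriteFile"],
]
labels = ["malicious", "malicious", "benign", "benign"]

rules = mine_rules(samples, labels)
verdict = classify(["CreateFileW", "WriteFile", "VirtualAllocEx"], rules)
```

On a realistic corpus the enumeration step would be replaced by an FP-tree, which avoids materializing every candidate itemset.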

Application Programming Interface n-Grams.
One of the earliest forms of sequence analysis in the malware domain was carried out in [358]. It was also the first successful application of n-grams, which involves translating a sequence of L APIs into subsequences of length n, and doing so for every possible subsequence that exists in the original API sequence. This has the effect of incorporating information about the sequences of APIs with little preprocessing required. A sequence of length L yields $L - n + 1$ n-grams, where n is the length of the subsequences, assuming a stride of one. So, for an API sequence 10 APIs long, we would have $10 - 5 + 1 = 6$ subsequences for $n = 5$. The number of possible 5-grams would be $|C|^5$, which represents all the ordered sequences of five APIs that can be drawn from the set of APIs C. The authors in [358] looked at short byte string n-grams of the PC boot sector, which was 512 bytes long. They utilized an ML approach that removed the sigmoid activation and stored the weights as 5/6-bit integers. The technique became part of the IBM AV package and was successfully deployed to millions of machines.
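The n-gram count can be checked with a short sketch. The helper name `api_ngrams` and the ten-call trace are hypothetical and serve only to illustrate sliding a window of length n over an API sequence with a configurable stride.

```python
def api_ngrams(calls, n, stride=1):
    """Slide a window of length n over the call sequence and return the n-grams."""
    return [tuple(calls[i:i + n]) for i in range(0, len(calls) - n + 1, stride)]

# Hypothetical trace of L = 10 native API calls.
trace = [
    "NtOpenFile", "NtReadFile", "NtClose", "NtOpenKey", "NtQueryValueKey",
    "NtClose", "NtCreateFile", "NtWriteFile", "NtClose", "NtDelayExecution",
]

grams = api_ngrams(trace, n=5)
n_windows = len(grams)                 # L - n + 1 windows for stride 1
vocabulary = set(trace)                # the API set C observed in this trace
upper_bound = len(vocabulary) ** 5     # |C|^5 possible distinct 5-grams
```

Here `n_windows` equals 10 - 5 + 1 = 6, while `upper_bound` grows as the fifth power of the vocabulary size, which is why the feature space becomes unmanageable for larger n.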
The versatility of n-grams means that one can use smaller n to generate shorter signatures which are noisy but more generalizable, or larger n to create more specific signatures which lead to fewer false positives (FP) at the cost of fewer true positives (TP). The application of n-grams is known to have low FP rates with increasing subsequence length; however, the space complexity of n-gram sequences is exponential in the length of the subsequences, $O(|C|^n)$, or $O(|C|^5)$ for 5-grams [71]. The work of [359] focused on the PE header and body and carried out static analysis using the top 500 most common 4-grams [360], representing DLL names. Results demonstrated that header-only features are as relevant as body information and that, separately, they both have a use case [359]. Similarly, in [361], a 4-gram representation was used to model API sequences. The authors developed average confidence values of benign and malicious activity and used the average confidence of malware as a threshold. This simple thresholding obtained 90% accuracy; however, the work provided no indication of FP rates to support its findings. The work of [342] went one step further and carried out n-gram modeling of API call sequences based on file system, network, and registry activity. This work was unique in that it separated API events by these three categories to provide a further analysis of how each category fares as a discriminator. In all, the authors looked at over 17,900 malicious executables and obtained 92.5% test accuracy. Finally, [345] resorted to 3- and 4-gram representations but focused on the dynamic API usage after process execution. This resulted in 94% accuracy, but when coupled with static feature sets based on frequency, the accuracy improved beyond 97%. The shortfall of n-grams is that subsequences longer than 4 or 5 are impractical to model due to the number of permutations of API calls, which significantly hinders the ability of models to attend to different behaviors. For this reason, we can pursue graph-based approaches in an attempt to consider different behaviors simultaneously.

Graph-Based Approaches.
Graph-based approaches to malware detection have a long history. The earliest applications include the use of control flow graphs (CFG) to evaluate unique control flow sequences of a program. A CFG is a directed graph where the nodes represent individual program instructions or blocks of instructions and the edges represent the control flow between statements [310]. Comparing CFGs amounts to finding a subgraph of one sample's graph that is isomorphic to a graph taken from another sample. This mapping is an instance of the subgraph isomorphism problem, which is NP-complete [362]. In Figure 11, we can see an illustration of the control flow from the Trojan.Emotet virus. This instruction segment belongs to the set of instructions that are responsible for spawning a child process, which depends on the initial call to CreateEvent at the top of Figure 11. When examining such a control flow, the question becomes which segment(s) of instructions are responsible for malicious behavior. While this segment was carefully selected to show the behavior of Emotet, extracting similar segments from the entire malicious execution is cumbersome, especially when they include diversions and dead ends. Extracting such segments as signatures and generalizing these signatures to flag future malware samples is the goal of CFG-based malware classification.
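To make the directed-graph definition concrete, the following sketch splits a toy instruction listing into basic blocks and connects them with control flow edges. It is an illustration under simplifying assumptions and not the procedure of any cited work: the tuple layout `(address, mnemonic, target)` is hypothetical, `jcc` stands in for any conditional jump, calls are treated as falling through to the next instruction, and indirect branches are ignored.

```python
BRANCHES = {"jmp", "jcc", "ret"}  # mnemonics that end a basic block in this sketch

def build_cfg(instructions):
    """Build basic blocks and edges from (address, mnemonic, target) tuples.

    Instructions must be listed in address order; target is None when the
    instruction has no explicit branch destination.
    """
    addrs = [addr for addr, _, _ in instructions]
    leaders = {addrs[0]}                      # first instruction starts a block
    for i, (_, op, target) in enumerate(instructions):
        if op in BRANCHES:
            if target is not None:
                leaders.add(target)           # a branch destination starts a block
            if i + 1 < len(instructions):
                leaders.add(addrs[i + 1])     # so does the instruction after a branch

    blocks, current = {}, None
    for addr, op, target in instructions:
        if addr in leaders:
            current = addr
            blocks[current] = []
        blocks[current].append((addr, op, target))

    starts = sorted(blocks)
    edges = {start: set() for start in starts}
    for idx, start in enumerate(starts):
        _, op, target = blocks[start][-1]
        following = starts[idx + 1] if idx + 1 < len(starts) else None
        if op in ("jmp", "jcc") and target in blocks:
            edges[start].add(target)          # taken branch
        if op not in ("jmp", "ret") and following is not None:
            edges[start].add(following)       # fall-through path
    return blocks, edges

# Hypothetical listing: a conditional skip around an API call.
program = [
    (0, "mov", None),
    (1, "cmp", None),
    (2, "jcc", 5),
    (3, "call", None),   # e.g. a call to CreateEvent
    (4, "jmp", 6),
    (5, "mov", None),
    (6, "ret", None),
]
blocks, edges = build_cfg(program)
```

The resulting `edges` dictionary is an adjacency list keyed by the start address of each block, which is the form most graph comparison routines expect.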
Most applications of CFGs extract some subset of the control flow to compare to other samples and establish a baseline for malicious control flow. One approach used by [363] looked at jmp, jcc, call, ret, and inst instructions and built the CFG based on only these instructions, thereby creating a reduced graph and leaving placeholders for the rest. Based on these, the authors created unique signatures for malware detection. In [364], the authors looked at the system call functions, which included call, jump, and conditional jump expressions in the x86 Intel instruction set. In [365], the authors looked at the most frequent subgraphs and simply excluded the rest. The sample set used by [366] included 25,145 functions which were 5 nodes (simple instructions) large and 15,439 unique functions which were 5 nodes long. Setting the threshold at 5 ensures that only atypical calls and procedures are included. One of the issues associated with CFGs is that the control flow is either (a) similar among all executables, regardless of malicious activity (also known as boilerplate code), or (b) sometimes padded with benign code segments that are never executed but can confuse string-based scanning techniques [366]. This was considered by [367] in their CFG reconstruction based on system call logs extracted using Procmon. To remove boilerplate code, their approach did not look at functions that were not loaded by the dynamic linker. However, this is a double-edged sword, as malware does not rely only on its Import Address Table (IAT) to fetch the APIs it needs; it can load those statically as well. An alternative approach used in [368] looked at contrast subgraphing [369], which is the opposite of graph isomorphism since it looks for the smallest subgraph of G_1 that does not belong in G_2. This approach lends itself well to looking for characteristically significant differences between malware and benignware, rather than developing signatures that look for similarities among classes. Alternatively, one can consider creating signatures as co-opcode graphs that belong to malware families and therefore create high-level signatures that can be used to classify malware families based on co-opcode graph similarity [319]. While opcodes have been investigated extensively, Windows API usage has been shown to perform well at detecting polymorphic variants [143,160,364], but the large size of potential subgraphs remains a limitation of graph-based approaches. Going more in depth, [370] examined not just the API functions used but also their input arguments among file system, registry, socket, and process operations. This provides additional insight into the calling process, such as the bytes written when using WriteFile or the destination key when setting a registry value using RegSetValue. The work of [289] looked at opcode similarity to detect polymorphic variants. The authors developed a weighted directed graph where the edge weights were the probabilities that one opcode followed another. They then computed scores between metamorphic viruses and between viruses and benign files and developed a threshold score for maliciousness. This approach performed well since metamorphic viruses are created with a select few metamorphic engines; therefore, the signatures developed are in fact tracing the obfuscation used by a given mutation engine [364,371]. Another factor to consider when using CFGs is how to establish a comparison between CFGs from malicious and nonmalicious control flows. The authors in [362] examined the detection of metamorphic code based on a cross-comparison of the control flow graphs of known malware. The authors normalized the code to remove dead or unreachable code, removed common subexpressions, removed dead paths, and analyzed indirect control flow transitions to remove longer chains of control flow and avoid misdirections. The authors recorded a 96.5% true positive rate while producing almost no false positives. A Jaccard similarity matrix between system call subsequences was used in [367]. Cosine similarity is another approach [372], but all similarity metrics suffer from drawbacks because they are subject to the selection of subgraph, as discussed earlier. Even with reliable subgraphs that perform well on a particular set of malware, the work of [373] demonstrated that 23 algorithmic graph features, including betweenness centrality, closeness, degree centrality, density, and number of edges and nodes, can be used in adversarial analysis and result in a 100% misclassification rate. Their approach targeted IoT malware, but Android malware is also an ongoing field of study [374][375][376]. With all the shortcomings that come with the graph isomorphism problem, newer advances in this field remove the need for graphs altogether and convert the entire graph into feature vectors [373,377]. Once features are vectorized, this opens the door for other machine learning models to act as discriminators for the classification step.
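The weighted opcode graph and the two similarity measures mentioned above can be sketched as follows. This is an illustration only: the scoring function used in [289] is not reproduced here, and the helper names (`transition_graph`, `edge_vector`, `jaccard`, `cosine`) and the two opcode sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def transition_graph(opcodes):
    """Weighted directed graph: graph[a][b] = P(next opcode is b | current is a)."""
    counts = defaultdict(Counter)
    for current, following in zip(opcodes, opcodes[1:]):
        counts[current][following] += 1
    return {
        a: {b: c / sum(successors.values()) for b, c in successors.items()}
        for a, successors in counts.items()
    }

def edge_vector(graph):
    """Flatten the graph into a sparse vector keyed by (source, destination)."""
    return {(a, b): p for a, successors in graph.items() for b, p in successors.items()}

def jaccard(left, right):
    """Jaccard similarity of two collections, treated as sets."""
    left, right = set(left), set(right)
    union = left | right
    return len(left & right) / len(union) if union else 1.0

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical opcode sequences: the variant inserts a nop as dead code.
original = ["push", "mov", "call", "mov", "jmp"]
variant = ["push", "mov", "nop", "call", "mov", "jmp"]

graph = transition_graph(original)
variant_graph = transition_graph(variant)

edge_overlap = jaccard(edge_vector(graph), edge_vector(variant_graph))
weight_similarity = cosine(edge_vector(graph), edge_vector(variant_graph))
```

Flattening each graph into an edge vector is one simple way to sidestep subgraph matching: two samples are then compared by their edge sets or edge weights rather than by isomorphism.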

Natural Language Processing Approaches.
The use of natural language processing (NLP) approaches applied to API call sequences was a natural extension to developing models that can predict malicious behavior. Malicious behavior is not simply a product of individual API usage or frequency of APIs; rather, it is a consideration of the pattern in the API usage over time. Similar to how word usage and context can indicate whether or not an email is spam, the context of APIs called in succession can tell you something about malicious intent. This has the effect of being able to attend to different behaviors simultaneously and allows the model to learn on its own what malicious behaviors exist. Many popular vectorization techniques used in NLP applications have also been adopted for malware research. Two of these techniques were demonstrated in the work of [378], which used a bag-of-words (BoW) model and term frequency-inverse document frequency (tf-idf). The background specifics of these techniques will be discussed in the next section. Their work created fixed-length vectors from behavioral reports produced in virtual machines and automated the feature extraction step. Finally, an ensemble of ML techniques, such as random forest, k-nearest neighbors (k-NN), support vector machine (SVM), and XGBoost, was used, with majority voting summarizing the end predictions over the models. An application that did involve APIs was carried out in [1], which looked at both dynamic and static behaviors and hand-crafted groups of signatures based on operation. The authors created 11 different types of malicious operations, spanning from registry operations to device I/O to kernel operations. APIs were converted to semantic blocks which looked at the largest common subsequences between dynamic and static behavior. Following the sequencing, tf-idf was used to vectorize the contribution of each API, with a focus on rarely used APIs that drive malicious behavior. In [160], tf-idf was used to convert the sequence of unique event names to a representation for machine learning models, which included both 1-dimensional convolutional neural network (CNN) and long short-term memory (LSTM) architectures. A similar line of work was followed in [379], where an LSTM was used to model sequential API usage of 20 thousand malware samples run on a Windows 7 machine using the Cuckoo sandbox. The authors considered only 342 API calls and limited their investigation to those that were used at least 10 times among all samples in the training set. When coupled with tf-idf, this has the effect of focusing more on rarely used APIs, and by setting the minimum threshold to 10, there are enough training examples for the model to learn the importance of those features. In a more recent work [380], graph neural networks (GNN) were used to identify dynamic malware execution in a sandbox using the techniques developed in [315] and used in [381]. Windows APIs were vectorized with n-grams and tf-idf, with malware execution being performed in sandbox snapshots with different benignware executions to simulate different potential host environments. The use of GNNs allowed the model to learn patterns in API usage by combining learned patterns from neighboring nodes that represent different hierarchies in process execution. This has the effect of learning not only the API usage of a single process but also that of all the processes that are daughter or parent processes of any given running process, thereby magnifying the discriminatory power of the model in identifying malicious behavior.
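A minimal tf-idf sketch shows why rarely used APIs receive larger weights and how a minimum-count filter fits in. The weighting below (term frequency normalized by trace length, inverse document frequency as ln(N/df) + 1) is one common variant and is an assumption; the cited works may use a different formulation. The function name `tfidf`, the `min_count` parameter, and the three traces are hypothetical.

```python
import math
from collections import Counter

def tfidf(traces, min_count=1):
    """Return (vocabulary, vectors) with one tf-idf vector per API trace.

    APIs seen fewer than min_count times across all traces are dropped.
    """
    totals = Counter(api for trace in traces for api in trace)
    doc_freq = Counter()
    for trace in traces:
        doc_freq.update(set(trace))          # count each API once per trace

    vocab = sorted(api for api in doc_freq if totals[api] >= min_count)
    n_docs = len(traces)
    idf = {api: math.log(n_docs / doc_freq[api]) + 1.0 for api in vocab}

    vectors = []
    for trace in traces:
        tf = Counter(trace)
        length = len(trace) or 1             # guard against empty traces
        vectors.append([tf[api] / length * idf[api] for api in vocab])
    return vocab, vectors

# Hypothetical API traces from three sandbox runs.
traces = [
    ["NtCreateFile", "NtWriteFile", "NtClose"],
    ["NtCreateFile", "NtReadFile", "NtClose"],
    ["NtCreateFile", "NtWriteFile", "NtWriteVirtualMemory", "NtClose"],
]
vocab, vectors = tfidf(traces)
weights = dict(zip(vocab, vectors[2]))       # weights for the third trace
```

In the third trace, `NtWriteVirtualMemory` appears in only one of the three traces and therefore outweighs `NtClose`, which appears in all of them. Raising `min_count` mirrors the practice of discarding APIs that occur too rarely for a model to learn from.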
In addition to the form of vectorization, modern NLP models allow the model itself to learn the importance of each word (or API) relative to the context of the surrounding words. For this purpose, word embeddings were developed, which can learn the semantic relationship between words and map that relationship to a vector space [382]. This has the effect of giving words (or APIs) that are closely related vectors with high cosine similarity. A modest application by [383] used 300-dimensional word embeddings followed by a similarity matrix to cluster malware and benignware using k-means. This way, the cluster index was a dense representation of malware and benignware. A more end-to-end approach was used in [381], whereby API stack traces were modeled as an NLP problem. Embedding dimensions of size 50 to 200 were used to map the API stack trace, which included APIs that communicated all the way to the kernel. With the use of a transformer architecture, which learns latent representations of the sequences, F1 scores as high as 96.2% were obtained when considering registry APIs. The authors in [384] looked at developing a semantic transition matrix to segregate API calls which have similar contexts into clusters. This was conducted by capturing the relationship between API calls that represent malware and benignware using Word2Vec [382], a word embedding technique which has more powerful encoding ability than vanilla word embedding approaches. More powerful encoders translate to a better ability to learn context, which was evident in their FP rate of only 1%. A similar use of Word2Vec was followed by an LSTM in [385] to analyze opcodes and API function names. In total, 1369 API function names and opcodes were used, of which 958 were API calls.
Several works have made use of the Windows PE malware API sequence dataset [379], a dataset of API call sequences extracted from 7017 malicious binaries from 8 malware classes: Adware, Backdoors, Downloaders, Droppers, Spyware, Trojans, Viruses, and Worms. For this dataset, [386] achieved poor results, with a 0.38 F1 score, when using a 32-dimensional embedding to represent the API sequences followed by a 2-layer LSTM. Their approach used 342 API calls and discarded those that were used fewer than 10 times. Similarly poor results were obtained in [387], which reported F1 scores ranging from 0.33 to 0.72 for the 8 malware types based on a similar LSTM approach. The work of [388] went one step further and compared an LSTM approach to a transformer and finally to bidirectional encoder representations from transformers (BERT). BERT learns latent representations from context in both directions, before and after each position in the sequence, meaning that it does a better job of encoding the context of the API sequence. The authors of [388] also used the Windows PE malware dataset and found similar issues classifying the 8 classes, with a weighted F1 score of 0.51 on their best performing BERT model. One approach that did find success using BERT was that of [389], who implemented fastText [390], a text vectorizing technique based on n-grams. After removing redundant API calls, such as NtDelayExecution, accuracies as high as 96.76% were obtained using BERT.
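The preprocessing step of removing redundant API calls can be sketched as below. The text does not define "redundant", so this sketch assumes two things that should be read as assumptions rather than as the method of [389]: calls on a small ignore list (here only NtDelayExecution, the one example given) are dropped, and consecutive repeats of the same call are collapsed. The function name `drop_redundant` and the sample trace are hypothetical.

```python
from itertools import groupby

def drop_redundant(calls, ignore=frozenset({"NtDelayExecution"})):
    """Drop ignored APIs, then collapse runs of the same consecutive call."""
    kept = [call for call in calls if call not in ignore]
    return [api for api, _run in groupby(kept)]

# Hypothetical trace with sleep calls and a repeated read.
raw_trace = [
    "NtOpenFile", "NtDelayExecution", "NtDelayExecution",
    "NtReadFile", "NtReadFile", "NtClose",
]
clean_trace = drop_redundant(raw_trace)
```

Shortening traces in this way leaves more of a fixed-length model input available for calls that carry behavioral information.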

Conclusions
This paper provides a systematic review of commonly used obfuscation techniques used by malware variants and mutation engine kits. This survey of the literature touched upon several key indicators of obfuscation employed by malware, which serves to better understand the nature of the reverse-engineering process. Our work makes four core contributions.
We noted the scope of malware and obfuscation worldwide and presented some of the key red flags noted by antivirus (AV) vendors and researchers. The numbers suggest an aggressive increase in the number of threats and the monetary cost associated with breaches, system intrusions, and downtime. In addition, we discussed some of the string scanning techniques that are still very much in use by AV vendors to this day.
We provided an examination of the popular obfuscation techniques used to translate the opcode sequences of malware into semantically equivalent but different instructions. These techniques have been integrated into popular mutation engines for over a decade now and render much of the reverse-engineering and legacy signature-based techniques obsolete if used effectively. This presents a fundamental problem for researchers and practitioners, but it has led to the field of dynamic analysis, which examines the run-time behavior of malicious executables. We also touched upon the structure of metamorphic mutation engines, along with encryption and compression, two very important behaviors that serve as key indicators of maliciousness for a given binary.
We provided a review of popular malware datasets that are commonly used in malware research. These datasets span applications in mobile malware, intrusion detection, networking, and binaries. We also touched upon some antiemulation and antiarmoring tactics in use by malware to protect from examination under virtualized environments.
Finally, we introduced some common approaches to feature analysis and discussed the various ways Windows APIs are categorized and vectorized to identify malicious binaries, especially in the context of identifying obfuscated malware variants.

Figure 1: Graphical illustration for the decryption, obfuscation, and encryption carried out by a metamorphic mutation engine.

Figure 2: Illustration of an appending virus that latches onto the end of a benign file.

Figure 3: An illustration showing the variation in positioning and level of obfuscation found in (a) oligomorphic and (b) metamorphic malware.

Figure 5: Simple xor decryptor which decrypts byte by byte using an increment counter and a jump not zero (jnz) loop.

Figure 6: Illustration of different encryption archetypes, where (a) key is reused for each encrypted block; (b) encrypted block is used as nonce for next encrypted block; and (c) stream cipher is used to encrypt each block.

Figure 7: Overview of the main steps in a packer. Adopted from [81].

Figure 8: Timeline of major malware variants, techniques, and mutation engines.

Figure 10: Summary of the feature pipeline for the classification of malware.

Figure 11: A CFG representation of the disassembled instructions for Trojan.Emotet produced in Ghidra.

Table 2: An example of dead code insertion using nop.

Table 4: An example of simple registry reassignment.

Table 6: A simple example of instruction-substitution.

Table 8: An example of code reordering and code transposition in combination with other obfuscation techniques.

Table 9: Summary of the more prevalent malware datasets publicly available for use by researchers.
, 50, 51, 56, 58, 67, 268-275, 277-288]. Options for encryption include rotate without carry (ROR/ROL), two's complement negation (NEG), one's complement negation (NOT), logical exclusive or (XOR), and addition/subtraction (ADD/SUB). NGVCK can carry out dead code insertion, subroutine reordering, code substitution, and registry renaming, and all are very effective
It uses checks that add the ability for malware to look for specific indicators of virtualization based on the VMware software used by the host. One of the best examples is in the use of VMware Workstation's WinXP Guest virtual hardware, which includes a running VMtools service and 300 references to VMtools in the registry. Another interesting adoption of VMware behavior is Pushdo. Pushdo uses PspCreateProcessNotify() to deregister sandbox routines

Table 10: Options in code obfuscation used by the next generation virus generation kit. Adapted from [286].
d) Networking: certain network API usage can be indicative of malicious intent, as networking APIs provide different levels of flexibility. For example, the APIs in Wininet.dll are higher level APIs for HTTP and HTTPS communications. Malware might use the raw Winsock libraries located in ws2_32.
[326]ons when Secure Boot is enabled. WinLogon Notify launches during log on, sleep, or when the lock screen is open. Adding a malicious DLL to the ServiceDll parameter in the registry allows a malicious service to start its malicious service DLL into a loaded svchost.exe [326]. (

Table 11: Summary of malicious API usage by behavior type.