A Vulnerability Detection System Based on Fusion of Assembly Code and Source Code

Software vulnerabilities are a major cause of network intrusion, so it is vital to detect and fix them in a timely manner. Existing vulnerability detection methods usually rely on single code models, which may miss some vulnerabilities. This paper implements a vulnerability detection system by combining source code and assembly code models. First, code slices are extracted from the source code and assembly code. Second, these slices are aligned by the proposed code alignment algorithm. Third, aligned code slices are converted into vectors and input into a hyper fusion-based deep learning model. Experiments are carried out to verify the system. The results show that the system presents a stable and convergent detection performance.


Introduction
Software vulnerability detection is crucial to cybersecurity defenses. In recent years, the number of software vulnerabilities reported has grown rapidly with the development of the software industry. According to Common Vulnerabilities and Exposures (CVE) [1], 6447 security vulnerabilities were published in 2016, and this number has increased to 9000 in the first half of 2020. These vulnerabilities have led to many cybersecurity incidents. An effective tool to handle these software vulnerabilities is vulnerability detection. For this reason, many scholars and businesses have devoted much of their energies to the research of vulnerability detection.
There are a wide variety of vulnerability detection methods for discovering vulnerabilities in software. They can be broadly divided into dynamic analysis and static analysis methods. The former identify vulnerable behaviors in the process of analyzing and executing software [2,3]. Despite a low false alarm rate, this type of method may miss some vulnerabilities due to the difficulty of analyzing the entire program's behavior [4]. The latter detect software vulnerabilities by analyzing code without executing it. They have high convergence and, thus, may be more popular than the former. Some static analysis-based detectors explore code similarity to detect the vulnerabilities caused by code cloning [5][6][7]. However, these approaches are hard to generalize to other types of vulnerabilities. In contrast, pattern-based, especially machine learning-based, detectors can use a large amount of software code to learn the patterns of universal vulnerabilities [8,9]. Recently, deep learning models have been introduced to extract vulnerability features from program fragments [10][11][12] and have achieved considerable accuracy.
The method based on code metrics [13] selects appropriate metrics to predict vulnerabilities. This method can complete the detection quickly, but with a low accuracy. The graph-based deep learning vulnerability detection method in [14] extracts the graph of the code to predict vulnerabilities. This method has a high detection effectiveness; however, the detection speed is too slow for large-scale code detection. In order to balance the detection speed and accuracy, we use a token sequence-based deep learning model similar to those in [10][11][12][15]. These methods extract the token sequences related to the vulnerability information. Some of them extract token sequences from source codes [10,11], while some use assembly codes. Experimentally, we find that, due to the limitations of employing a single model, these schemes do not perform very stably when facing some vulnerabilities, which reduces their performance. In view of this, we combine the source code model and assembly code model to enhance the detection ability.
Multimodal machine learning is able to process and understand information from multiple modalities. Multimodal fusion is one of the earliest concerns of multimodal machine learning. It processes data from different modalities and obtains more characteristics by extracting multimodal data. Multimodal fusion usually achieves better results than using only a single modality. In audio-visual speech recognition (AVSR), the scheme in [16] fuses visual and audio cues to improve the performance relative to audio-only recognition. The scheme in [17] fuses video features with spectral and prosodic speech information for multimodal laughter detection. Compared to using only video features, performance can be improved by up to 3.7%. In multimodal emotion recognition, the scheme in [18] considers video only, audio only, and audio-visual inputs for the purpose of emotion recognition. The scheme in [19] uses the early fusion of information from facial expressions, body movements, gestures, and speech to recognize emotions. These methods successfully improved the emotion recognition rate over their baselines. As a result, we believe that the multimodal fusion of code information could also be usable for vulnerability detection.
In this paper, a vulnerability detection system is proposed by fusing the assembly code and source code models (code and data are available: https://github.com/onstar99/VulnerabilitySystem). Given the source code of a software program, it is compiled to obtain the corresponding assembly code. Then, code slices are extracted from both the source code and the assembly code. After that, a code alignment algorithm is suggested to connect each source code statement with its corresponding assembly code statements. Then, the aligned codes are combined to form a fused slice. These new slices, together with the original source code slices and assembly code slices, are fed into the hyper fusion-based deep learning model for vulnerability detection. The main contributions of this paper include the following:
(1) We improve the performance of the vulnerability detection by fusing the models of assembly codes and source codes
(2) We suggest a simple but effective alignment method between source codes and assembly codes to quickly align the data slices
(3) We collect a vulnerability dataset composed of source and assembly codes, which can be used to train and verify the proposed multimodal-based vulnerability detection method
The method in [23] scans the character stream of a code and marks its important information according to the lexical rules of a programming language. Then, the character stream of the code is converted into a token sequence representation. Finally, through vectorization processing such as one-hot and word2vec [24], vectors can be obtained and used as the input of a machine learning model.
The authors of [20] extracted the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) from the source code. Then, these are used to generate the code composite graph (CCG), which can be combined with a bidirectional graph neural network for vulnerability detection.

Vulnerability
Source codes may have more intuitive information than assembly codes or executable codes. This information can better help detect vulnerabilities. For example, source codes can provide more syntactic and semantic information, which makes it easier to know how the data and inputs drive the paths of execution [25].

Vulnerability Detection by Assembly Code.
Assembly codes are program files obtained by compiling the source codes. Practically, assembly codes might be easier to obtain in comparison with source code files. Grieco et al. [12] developed and implemented VDiscover. This system extracts different feature sets, including static features of an assembly code and dynamic features during execution, to address large-scale vulnerability discovery. Xu et al. [21] presented a neural network-based cross-platform method to detect the similarity between assembly codes, which is an aid to detecting vulnerabilities. Liu et al. [22] employed the attention mechanism on top of a bidirectional long short-term memory to detect assembly code vulnerabilities. Tian et al. [15] extracted code slices from assembly codes in their vulnerability detection system. A summary of the above studies is presented in Table 1, where we have highlighted the key differences between the works.
It has been found that vulnerability detection on the assembly code level can better solve cross-architecture problems [26]. Moreover, assembly codes are sometimes more sensitive to the semantic errors of a program [27]. This means that the assembly code can be used as a powerful supplement to improve the detector's performance.

System Overview.
The proposed method analyzes the source code and assembly code of software to obtain more comprehensive information. First, we give the definitions of the source code slice and assembly code slice.
Definition 2 (assembly code slice). An assembly code slice $D_i$ is a snippet of semantically related multiline assembly code, denoted as $D_i = (d_{i1}, \ldots, d_{ij})$.
Figure 1 shows an example of the source code and the corresponding assembly code, where assembly code $D$ is compiled from source code $S$ by using the GCC compiler.
The workflow of our system is depicted in Figure 2. It has two phases: the training phase and the testing phase. In the training phase, the system first extracts source code slices from training codes and adds labels to them. Second, the system compiles training codes to assembly codes, from which assembly code slices are extracted and labeled. Third, source code slices and assembly code slices are aligned for further multimodal fusion. Fourth, the system converts source code slices and assembly code slices to vectors. Finally, the system trains a deep learning model by using these vectors. In the testing phase, the system tests the trained model using targeted codes and evaluates its performance.

Multimodal Network Structure.
The designed system fuses assembly codes and source codes to predict the outcome. We use the multimodal hybrid fusion strategy [28] to construct its network, as shown in Figure 3. It first combines assembly code slices and source code slices via early fusion. Then, these combined slices, together with the source code slices and assembly code slices, are fed into three individual networks. At last, the results of the three networks are used to give the final decision through late fusion.

Code Representation.
We use code slices to represent different codes. Given a snippet of the source code, Joern [29] is first used to analyze it and generate a control flow graph (CFG), which represents, in the form of a graph, the paths that may be traversed during program execution. After that, we get the program dependency graph (PDG) from the source code, which is a graphical representation of the control dependency and data dependency among program statements. Finally, the CFG and PDG are traversed forward and backward to extract all affected statements. These statements are combined to form a code slice, providing data for vulnerability detection. Figure 4 shows a running example of how we generate a code slice from the CFG and PDG of a source code. All the affected statements of each function are captured through the CFG and combined into a source code slice based on the PDG.
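The forward and backward traversal can be illustrated with a minimal sketch. It assumes the dependency information is already available as a plain adjacency map keyed by statement line number; the map `deps`, the helper names, and the toy graph are hypothetical and do not reflect Joern's actual output format.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from `start` by following edges in `graph`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def extract_slice(deps, seed):
    """Collect statements that affect, or are affected by, the seed statement.

    `deps` maps a statement line number to the line numbers that depend on it
    (forward edges). Backward edges are derived by reversing the map.
    """
    reverse = {}
    for src, dsts in deps.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    forward = reachable(deps, seed)      # statements affected by the seed
    backward = reachable(reverse, seed)  # statements the seed depends on
    return sorted(forward | backward)


# Toy dependency graph (hypothetical): lines 1 and 2 feed a library call on
# line 4, line 5 uses its result, and line 3 is unrelated.
deps = {1: [4], 2: [4], 3: [], 4: [5]}
slice_lines = extract_slice(deps, seed=4)
print(slice_lines)
```

Here the seed is the line holding the library/API call, and the unrelated statement on line 3 is left out of the slice.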
Besides the assembly and source code slices, we further construct fused code slices by applying early fusion to them. We combine each source code slice $S_i$ and the corresponding assembly code slice $D_i$ to derive a new slice $G_i = (S_i, D_i)$. Moreover, redundant information in $G_i$ caused by data fusion is removed, yielding a new set of code slices as part of the network input.
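As a rough illustration of early fusion, the sketch below places each source statement next to its aligned assembly statements. The text does not say what counts as redundant information, so dropping exact repeats of an assembly statement is purely an assumption made here; the function name and sample statements are hypothetical.

```python
def fuse_slices(source_stmts, aligned_asm):
    """Build a fused slice from a source slice and its aligned assembly.

    `aligned_asm[k]` holds the assembly statements aligned with
    `source_stmts[k]`. Exact repeats of an assembly statement are dropped,
    which is only one possible reading of "redundant information".
    """
    fused = []
    seen_asm = set()
    for stmt, asm_group in zip(source_stmts, aligned_asm):
        fused.append(stmt)
        for asm in asm_group:
            if asm not in seen_asm:
                seen_asm.add(asm)
                fused.append(asm)
    return fused


# Hypothetical aligned pair: two source statements and their assembly groups.
source = ["int buf[4];", "buf[i] = 0;"]
asm = [["sub rsp, 16"], ["mov eax, DWORD PTR [rbp-4]", "sub rsp, 16"]]
fused = fuse_slices(source, asm)
print(fused)
```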

Network Model.
The three types of slices are input into three individual networks having the same structure. Each of them consists of five layers: input layer, bidirectional LSTM (BLSTM) layer, dense layer, softmax layer, and output layer, as shown in Figure 5. The input layer is responsible for inputting vector data. The BLSTM layer extracts vulnerability characteristics from vulnerability samples. It contains 300 LSTM units in each direction (600 LSTM units altogether). The dense layer reduces the dimensionality of the vector. We set up two dense layers to obtain better detection results. To prevent overfitting, we apply dropout with a rate of 0.5 after the dense layers. The softmax layer represents and formats the classification results. At last, the output layer gives a decision. In our experiment, the loss function used is categorical cross-entropy, and the optimizer used is Adamax [30]. During the training phase, we set the batch size to 64 and the number of epochs to 30, which gives a good performance. More settings are detailed in Table 2.
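To make the roles of the softmax layer and the loss concrete, the following sketch computes a softmax over two class scores and the categorical cross-entropy against a one-hot label. It is a plain-Python illustration of the standard formulas, not the Keras implementation used in the experiments; the input scores are made up.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Two classes: index 0 = not vulnerable, index 1 = vulnerable.
probs = softmax([0.0, 0.0])                       # an undecided network
loss = categorical_cross_entropy([0, 1], probs)   # true class is "vulnerable"
print(probs, loss)
```

With equal scores the network assigns probability 0.5 to each class, so the loss equals ln 2; training pushes this value toward zero.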

Late Fusion Model.
The decisions of the three networks may be different. As a result, at the end of our system, these decisions are passed through a voting layer to get the final result. Majority voting is employed in this layer. A late fusion model usually gets better results than a single model, and it is expected to avoid erroneous decisions made by individual models.
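Majority voting over the three binary decisions can be sketched as follows. The function and variable names are illustrative only.

```python
def majority_vote(decisions):
    """Return 1 if more than half of the binary decisions are 1, else 0."""
    return 1 if sum(decisions) * 2 > len(decisions) else 0


# Decisions of the source, assembly, and fused-slice networks for one sample.
source_pred, assembly_pred, fused_pred = 0, 1, 1
final = majority_vote([source_pred, assembly_pred, fused_pred])
print(final)
```

With three voters a tie cannot occur, so a single mistaken network is always outvoted by the other two.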

Training Phase.
The training phase is highlighted in Figure 2. It has 5 steps:
Step 1. Source Code Processing.
(1) Extract source code slices: extract source code slices from a training code according to some vulnerability syntax characteristics.
(2) Add labels: give each source code slice a label as the indicator of vulnerabilities ("1" means the presence of vulnerabilities, and "0" means the absence of vulnerabilities). The method used to determine the vulnerability of a code slice is detailed in Section 3.6.
Step 2. Program Compiling.
(1) Generate an assembly code from the training code using the GCC compiler.
Step 3. Assembly Code Processing.
(1) Extract assembly code slices: use the code slice alignment algorithm detailed in Section 3.5 to analyze the assembly code, and extract assembly code slices corresponding to source code slices.
(2) Add labels: similar to the processing of source code slices, each assembly code slice is assigned a label indicating whether it contains vulnerabilities ("1" means yes and "0" means no).
Step 4. Convert Source Code Slices and Assembly Code Slices into Vectors.
(1) As shown in Figure 6, the system uses word2vec to convert source code slices, assembly code slices, and hybrid code slices into vector forms, which will be input into the deep learning model.
Step 5. Model Training.
(1) Use the data generated from Steps 1-4 to train the network presented in Section 3.2. The parameters and design of the model have been detailed in Section 3.2.

Testing Phase.
As highlighted in Figure 2, the testing phase has 5 steps.
Steps 1-4. Similar to Steps 1-4 in the training phase, except that no labels need to be added.
Step 5. (1) Input the data generated from Steps 1-4 into the trained model, and the system will give a detection result. Result "1" means that vulnerabilities exist in this code slice, and "0" means there is no vulnerability.

Code Alignment.
In this section, we provide the algorithm employed in data alignment, align data. The top level of the algorithm is listed in Algorithm 1. The whole algorithm can be summarized into three stages, namely, (1) pseudocode generation, (2) collection of candidate sets of matched assembly codes, and (3) best match finding. At stage (1), we use IDA Pro [31] to generate pseudocodes, pe_code, from the assembly code (i.e., Line 3 in Algorithm 1), where each statement $p_i$ in pe_code corresponds to several statements $d_{iu}, \ldots, d_{iw}$ in assembly_code. At stage (2), we search for the candidate set of assembly code statements that match $s_{ij}$ (i.e., Lines 4-9 in Algorithm 1). For each $p_i$, if its statement type (e.g., loop statement or assignment statement) is the same as that of $s_{ij}$, the corresponding $d_{iu}, \ldots, d_{iw}$ can be considered as candidate matches of $s_{ij}$. At stage (3), the Hungarian algorithm [32] is used to get a slice $(d_{i1}, \ldots, d_{ij})$ from the candidate set $D_i'$. It is considered as a potential match for $S_i$ and combined into $D_i$. Then, we use string and integer constants, function and library calls, and function declaration information to compute the similarity between $D_i$ and $S_i$. If this similarity is greater than a threshold, a satisfactory match is achieved. Otherwise, this stage is repeated until a satisfactory match is found (i.e., Lines 11-16 in Algorithm 1).
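Stages (2) and (3) can be sketched in simplified form. The sketch assumes that each pseudocode statement is given as a dictionary with a statement type and its associated assembly statements, and it measures similarity with the Jaccard index over extracted constants and call names. The data layout, the Jaccard measure, and the threshold of 0.5 are all assumptions made for illustration; the text does not specify them, and the Hungarian assignment step is omitted here.

```python
def candidate_matches(source_type, pseudo_stmts):
    """Stage 2: gather assembly statement groups whose pseudocode statement
    has the same statement type as the source statement."""
    candidates = []
    for stmt in pseudo_stmts:
        if stmt["type"] == source_type:
            candidates.append(stmt["asm"])
    return candidates


def jaccard(a, b):
    """Similarity of two feature collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_satisfactory(source_features, asm_features, threshold=0.5):
    """Stage 3 acceptance check: similarity must exceed the threshold."""
    return jaccard(source_features, asm_features) > threshold


# Hypothetical pseudocode statements with their assembly statements.
pseudo = [
    {"type": "assignment", "asm": ["mov DWORD PTR [rbp-4], 10"]},
    {"type": "call", "asm": ["call strcpy"]},
    {"type": "assignment", "asm": ["mov DWORD PTR [rbp-8], 0"]},
]
groups = candidate_matches("assignment", pseudo)

# Features: constants and call names found in each slice (hypothetical).
accepted = is_satisfactory({"strcpy", "10"}, {"strcpy", "10", "0"})
print(groups, accepted)
```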

Slice Labeling.
A simple tool is developed here to give each code slice a label indicating the presence of vulnerabilities. All the vulnerability functions have been defined and given by the NIST Software Assurance Reference Dataset (SARD) [33]. This paper manually populates these vulnerability functions according to their vulnerability codes provided by SARD. The procedure of the tool is described as follows:
(1) Obtain the file names, vulnerability locations, vulnerability types, and other information from the vulnerability file.
(2) If a code slice contains or calls vulnerability functions, it is labeled with "1".
(3) Otherwise, it is labeled with "0".
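The labeling rule can be sketched as a lookup of function names against a set of known vulnerability functions. The regular expression used to spot function names and the example function name `bad_copy` are hypothetical; the actual tool and the SARD metadata format are not described in enough detail to reproduce.

```python
import re

# An identifier immediately followed by "(" is treated as a function
# definition or call. This is a simplification of real C parsing.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def label_slice(slice_stmts, vulnerable_functions):
    """Return 1 if any statement defines or calls a known vulnerability
    function, else 0."""
    for stmt in slice_stmts:
        for name in CALL_RE.findall(stmt):
            if name in vulnerable_functions:
                return 1
    return 0


vulnerable = {"bad_copy"}  # hypothetical entry in the vulnerability function database
label_a = label_slice(["char buf[8];", "bad_copy(buf, input);"], vulnerable)
label_b = label_slice(["int n = strlen(s);"], vulnerable)
print(label_a, label_b)
```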

Experimental Results
Our experiments are conducted on a computer with an NVIDIA GeForce GTX 2080 Ti GPU and an Intel Xeon E5-2678 v3 CPU operating at 2.50 GHz. The compiler used is GCC 9.3.0.

Evaluation Metrics.
In this paper, we use five standard indicators suggested in [34] to measure the performance of the vulnerability detection system. We denote TP as the number of vulnerable samples that are detected as vulnerable (i.e., true positives), FP as the number of samples that are not vulnerable but are detected as vulnerable (i.e., false positives), TN as the number of samples that are not vulnerable and are not detected as vulnerable (i.e., true negatives), and FN as the number of vulnerable samples that are not detected as vulnerable (i.e., false negatives). Then, the five indicators, namely, accuracy (A), false positive rate (FPR), false negative rate (FNR), precision (P), and F1-measure (F1), are defined as

$$A = \frac{TP + TN}{TP + FP + TN + FN}, \quad FPR = \frac{FP}{FP + TN}, \quad FNR = \frac{FN}{TP + FN}, \quad P = \frac{TP}{TP + FP}, \quad F1 = \frac{2 \cdot P \cdot (1 - FNR)}{P + (1 - FNR)}. \tag{1}$$
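The five indicators can be computed directly from the four counts. The sketch below assumes the standard definitions of these indicators, with recall written as 1 − FNR; the counts in the example are invented for illustration and are not results from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, FPR, FNR, precision, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    precision = tp / (tp + fp)
    recall = 1 - fnr  # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"A": accuracy, "FPR": fpr, "FNR": fnr, "P": precision, "F1": f1}


# Made-up counts for a test set of 200 samples.
m = detection_metrics(tp=90, fp=10, tn=85, fn=15)
print(m)
```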

Input Preparing.
In this paper, we collect a large number of source codes from the NIST Software Assurance Reference Dataset (SARD) [33]. Then, these codes are compiled with GCC on Windows x64 to obtain the corresponding assembly codes. We randomly select 80% of the source codes and corresponding assembly codes as the training dataset, and the remaining 20% are used for testing.

Extracting Code Slices and Adding Labels
(1) Extracting Code Slices. Source code slices are extracted first. We obtain the CFG and PDG by parsing the source code. After that, all the statements affected by the library/API function calls are extracted based on the CFG and PDG. These statements form source code slices. Then, we use Algorithm 1 to collect assembly code slices corresponding to the source code. In total, 42938 source code slices and assembly code slices are extracted.
(2) Adding Labels. We obtain file names, vulnerability locations, vulnerability types, and other information by analyzing program documents in the SARD project. Then, they are used to build a vulnerability function database, and all the code slices are labeled by the tool described in Section 3.6. In total, we label 16478 code slices with "1" and 26460 code slices with "0". To ensure effective training of the detection system, 16478 code slices are randomly selected from the 26460 code slices labeled with "0". They are then combined with all the code slices labeled with "1" to form a dataset in each training/testing round, of which 13182 code slices labeled with "0" and 13182 code slices labeled with "1" are randomly selected to train the model, and the rest are used for testing.
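The balanced sampling described above can be sketched with the standard library. The slice counts come from the text; the placeholder slice identifiers, the function name, and the fixed random seed are assumptions made so that the example is reproducible.

```python
import random


def balanced_split(positives, negatives, n_train_per_class, seed=0):
    """Undersample negatives to match positives, then split per class."""
    rng = random.Random(seed)
    sampled_neg = rng.sample(negatives, len(positives))  # random, no repeats
    pos = positives[:]
    rng.shuffle(pos)
    train = [(x, 1) for x in pos[:n_train_per_class]]
    train += [(x, 0) for x in sampled_neg[:n_train_per_class]]
    test = [(x, 1) for x in pos[n_train_per_class:]]
    test += [(x, 0) for x in sampled_neg[n_train_per_class:]]
    return train, test


# Placeholder identifiers standing in for the labeled code slices.
positives = [f"pos_{i}" for i in range(16478)]
negatives = [f"neg_{i}" for i in range(26460)]
train, test = balanced_split(positives, negatives, n_train_per_class=13182)
print(len(train), len(test))  # 26364 6592
```

Each round therefore trains on 26364 slices and tests on the remaining 6592, with both classes equally represented in each part.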

Convert Datasets into Vectors.
The neural network only accepts vectors; thus, we need to convert the data into vector form, which is carried out by word2vec. Since the effectiveness of our system depends largely on the quality of the word embeddings produced, we first use the constructed dataset as a corpus to train word2vec. This yields a trained word2vec model that is more suitable for our task. After that, code slices are divided into a series of tokens. Each token is converted through the trained word2vec to obtain a fixed-length vector that can be recognized by the network. In this experiment, the length of each code slice is 50 and the dimension of the word embedding is 100.
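The conversion from a code slice to a fixed-size input can be sketched as tokenization followed by truncation or zero padding to 50 tokens of dimension 100. In this sketch a plain dictionary stands in for the trained word2vec model, and the tokenizer is a simple regular expression; both are assumptions for illustration, as are the handling of unknown tokens and the choice of zero padding.

```python
import re

# Identifiers, integer literals, or any single non-space character.
# Multi-character operators such as "==" are split, which is a simplification.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a code slice into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def slice_to_matrix(tokens, embeddings, length=50, dim=100):
    """Map tokens to vectors, then truncate or zero-pad to `length` rows."""
    zero = [0.0] * dim
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:length]]
    rows += [list(zero) for _ in range(length - len(rows))]
    return rows


tokens = tokenize("buf[i] = 0;")
embeddings = {"buf": [1.0] * 100}  # hypothetical stand-in for word2vec lookups
matrix = slice_to_matrix(tokens, embeddings)
print(tokens, len(matrix), len(matrix[0]))
```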

Network Training.
In this experiment, we use Keras with TensorFlow to implement the system. The length of the input layer is fixed to be 200, and the number of nodes of each BLSTM is 300. Two fully connected layers with LeakyReLU as the activation function are employed. The number of nodes is 300, and the dropout rate is 0.5. The model is optimized by Adamax [30] with a learning rate of 0.002.

Result.
We compare the proposed system with the approaches presented in [35] (denoted as VulDeePecker) and [15] (denoted as BVDetector). VulDeePecker detects vulnerabilities at the source code level, while BVDetector detects vulnerabilities at the assembly code level. The training accuracy and training loss curves for these systems are compared in Figure 7. It can be observed that all the compared systems converge fast. The average accuracy of the compared systems is close to 90%, while the accuracy of our early fusion system can reach 97%. Table 3 lists the number of marked code slices by different schemes. It can be observed that some vulnerabilities can be detected by the model based on source codes, but not by the model based on assembly codes. In contrast, some vulnerabilities can only be detected by the model based on assembly codes. Take the code fragment shown in Figure 8 as an example, which is a vulnerability related to out-of-bounds array access. A system that uses only source code slices may misclassify the vulnerability as normal pointer arithmetic because it is difficult to detect out-of-bounds array access in a calculation. By contrast, the system using assembly code slices can easily detect this vulnerability via memory addresses. As a result, fusing source codes and assembly codes in our model can accurately detect a wider range of vulnerabilities.
The comparison results on the test set are shown in Tables 4 and 5, where a representative static vulnerability mining tool on the market, Flawfinder [36], is also included for comparison. They indicate that our system achieves better results in all aspects compared to the other systems. Compared with VulDeePecker and BVDetector, our hybrid fusion-based system can raise the F1 scores by 10%. The score of our system is also well above the one obtained by Flawfinder. This confirms the effectiveness of the proposed vulnerability detection system.

Conclusion
This paper proposes a system for detecting vulnerabilities in software through deep learning. It combines the vulnerability features at the source code level and assembly code level to improve the ability of vulnerability detection. The system extracts code slices from both the source code and the assembly code of a program. Then, a code alignment algorithm based on string and integer constants, function and library calls, etc., is suggested to align these code slices. The hyper fusion-based deep learning model considers early fusion and late fusion. Early fusion combines aligned source code slices and assembly code slices to generate a new dataset, while late fusion combines the decisions of the deep learning models on the source dataset, assembly dataset, and fused dataset. We implement a prototype and perform systematic experiments. The experimental results show the effectiveness of the proposed system. In future research, we can extend this method to other languages and platforms. Moreover, we could consider additional data besides source codes and assembly codes.

Conflicts of Interest
The authors declare that they have no conflicts of interest.