An Automatic Source Code Vulnerability Detection Approach Based on KELM

Traditional vulnerability detection has mostly relied on rules or source code similarity with manually defined vulnerability features. In practice, these rules and features are difficult to define accurately, cost much expert labor, and perform weakly in practical applications. To mitigate this issue, researchers introduced neural networks to extract features automatically and thereby improve the intelligence of vulnerability detection. The Bidirectional Long Short-Term Memory (Bi-LSTM) network has proved successful for software vulnerability detection. However, due to its complex context information processing and iterative training mechanism, the training cost of Bi-LSTM is heavy. To improve training efficiency, we propose to use the Extreme Learning Machine (ELM). The training process of ELM is noniterative, so the network training can converge quickly. As ELM usually shows weak precision because of its simple network structure, we introduce the kernel method. In the preprocessing stage of this framework, we introduce doc2vec for vector representation and multilevel symbolization for program symbolization. Experimental results show that doc2vec vector representation brings faster training and better generalization performance than word2vec. ELM converges much more quickly than Bi-LSTM, and the kernel method can effectively improve the precision of ELM while preserving training efficiency.


Introduction
As software becomes more and more complicated, software vulnerabilities caused by design flaws and implementation errors become an inevitable problem in engineering [1]. According to statistics released by the Common Vulnerabilities and Exposures (CVE) [2] and National Vulnerability Database (NVD) [3], the number of software vulnerabilities has increased from 1600 to nearly 100000 since 1999 [4]. Software systems containing these vulnerabilities will face serious security risks.
On the one hand, existing vulnerability detection techniques are mostly driven by rules [5][6][7][8][9][10] and code similarity metrics [11,12]. Vulnerability detection rules are usually defined by experienced experts, so the performance of these methods is limited by the experience of those experts. Generally, the features of software vulnerabilities are very difficult to describe accurately, which makes the corresponding detection rules difficult to define accurately and completely. These problems inspired researchers to propose automatic vulnerability detection at the source code level, for which neural networks show great potential [13][14][15][16][17]. Neural networks can automatically extract complex features from input data, avoiding the high cost, instability, and incompleteness of manually constructing features and empirically defining rules. VulDeePecker [16] utilized Bi-LSTM [18] for software vulnerability detection. Zhen Li et al. [17] discussed the performance of different neural networks on vulnerability detection separately, namely, MLP, CNN, LSTM, and Bi-LSTM. All of the above neural networks train the detection model with an iterative training mechanism, which usually costs a lot of time. To solve this problem, we introduce ELM [19], which trains the detection model with a noniterative training mechanism. To improve precision, we then introduce the kernel method.
On the other hand, there are two classical data preprocessing steps in neural network-based automatic vulnerability detection, namely, vector representation and program symbolization. The most common vector representation method is word2vec [20], which converts source code into variable-length vectors that serve as the input of the neural network. However, word2vec usually requires additional work to further preprocess the output vector (e.g., padding zeros). The final vectors usually have a large dimension, which can heavily affect the training efficiency of the detection model. Moreover, word2vec may also lose important semantic information of the source code, which can affect the precision of the detection model. As for program symbolization, the usual way is to symbolize the variables and user-defined functions in the source code at the same time [16,17], which can be seen as a single symbolization level of 2. This approach does not consider the influence of multiple symbolization levels on the performance of the vulnerability detection model.
To alleviate the above problems, we propose a multilevel symbolization method for symbolic representation and introduce doc2vec [21] for vector representation. In detail, we first obtain symbolic representations of the source codes related to vulnerabilities through three symbolizations. Using three levels of symbolization can significantly reduce the noise introduced by irrelevant information of vulnerable codes.
Then, we use doc2vec to automatically transform the symbolic representation of source code into the corresponding vector representation. Compared with word2vec as used in [16], we found that doc2vec is more suitable for vector representation because it can not only transform source code of arbitrary length into a fixed-length feature representation but also capture the semantic information of source code better. These advantages help to improve the precision and training efficiency of the vulnerability detection model.
The rest of this paper is organized as follows. Section 2 discusses the work related to automatic detection of software vulnerabilities. Section 3 describes the details of the proposed automatic software vulnerability detection method. Section 4 gives the details of the experimental environment and parameter configuration, the experimental results, and the corresponding analysis. The conclusions and future work are presented in Section 5.

Vulnerability Detection Techniques.
Existing classical vulnerability detection techniques range from making use of manually defined features [5][6][7][8][9][10] to code similarity metrics [11,12]. However, they share several primary flaws. First, defining vulnerability features is error-prone and labor-intensive. Second, the features can hardly be complete and usually contain only partial information about the vulnerabilities, which may lead to high false-positive and false-negative rates [16]. Moreover, the application of the code similarity method is limited to vulnerabilities caused by code clones.
Vulnerability detection with traditional machine learning techniques such as Decision Tree [22] and Support Vector Machine (SVM) [23] mainly extracts vulnerability features from preclassified vulnerabilities. However, vulnerability detection patterns based on this type of feature usually apply only to specific vulnerabilities. In the paper by Boris Chernis [24], both simple text features (e.g., character count, character diversity, and maximum nesting depth) and complex text features (e.g., character n-grams, word n-grams, and suffix trees) are extracted from the source code and analyzed by using the naive Bayes classifier. Experimental results show that the simple features performed unexpectedly better than the complex features.
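To make the distinction between simple and complex text features concrete, the following is a minimal Python sketch of three simple features and of character n-grams. It is an illustration only and not the implementation used in [24]: the function names are hypothetical, character diversity is assumed to mean the number of distinct characters, and nesting depth is assumed to be measured by curly braces.

```python
def simple_text_features(source: str) -> dict:
    """Compute three simple text features of a source code string.

    Assumptions (not taken from [24]): diversity is the number of
    distinct characters, and nesting depth is counted with braces.
    """
    depth = 0
    max_depth = 0
    for ch in source:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)  # tolerate unbalanced input
    return {
        "char_count": len(source),
        "char_diversity": len(set(source)),
        "max_nesting_depth": max_depth,
    }


def char_ngrams(source: str, n: int) -> list:
    """Return all overlapping character n-grams of the string."""
    return [source[i:i + n] for i in range(len(source) - n + 1)]
```

For example, `simple_text_features("if(a){if(b){x=1;}}")` reports a maximum nesting depth of 2, while `char_ngrams` produces the much larger feature space that the complex features draw on.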
Neural networks can learn complex vulnerability features automatically. Zhen Li [16] presented VulDeePecker, a vulnerability detection system based on deep learning, which initiated the study of using deep learning for vulnerability detection. VulDeePecker collects samples by first extracting code gadgets from the buggy programs and then transforming them into vector representations using word2vec. The detection model is based on Bi-LSTM. Siqi Ma [13] proposed a tool called VuRLE for automatic detection and repair of vulnerabilities. VuRLE uses context patterns to detect vulnerabilities and customizes the corresponding edit patterns to repair them. Jacob A. Harer [14] implemented various machine learning models for detecting bugs that can lead to security vulnerabilities in C/C++ code. Specifically, they used features derived from the build process and the source code. Rebecca L. Russell [15] developed a vulnerability detection tool based on deep feature representation learning that can directly interpret parsed source code. The source code is first transformed into tokens and then embedded as vectors for both CNNs and Recurrent Neural Networks (RNNs). Zhen Li [25] proposed SySeVR, a systematic framework that uses deep learning to detect vulnerabilities by combining syntax-based, semantics-based, and vector representations. SySeVR can accommodate syntax and semantic information pertinent to vulnerabilities. The source code is successively represented by syntax-based, semantics-based, and vector representations. Zhen Li [17] performed a quantitative evaluation of the impacts of different factors (e.g., data dependency and control dependency) on the effectiveness of neural network-based vulnerability detection techniques. Zhen Li [26] presented VulDeeLocator, a deep learning-based fine-grained vulnerability detector. It leverages intermediate code to capture semantic information that cannot be conveyed by source code-based representations and presents a new idea of granularity refinement. Xin Li [27] proposed an automated and intelligent vulnerability detection method for source code based on minimum intermediate representation learning. The sample in the form of source code is first transformed into a minimum intermediate representation and then into a real-valued vector through pretraining on an extended corpus. The vector is fed to three concatenated convolutional neural networks to obtain high-level features of vulnerability.

Preprocessing Method.
The commonly used preprocessing methods for automatic source code vulnerability detection are program symbolization and vector representation. Zhen Li [16,25] first maps variable names to symbolic names (e.g., "V1" and "V2") in a one-to-one fashion, then maps function names to symbolic names (e.g., "F1" and "F2") in a one-to-one fashion, and finally uses word2vec to perform vector representation. Gustavo Grieco [28] uses word2vec to preprocess the dynamic features of source code since it has been used successfully in a variety of text mining applications. Savchenko [29] proposed a system for vulnerability detection based on a deep learning approach, which performs the following steps: source code preprocessing, AST creation, code gadget extraction, and code gadget vectorization using word2vec.

Kernel Method.
The kernel method is often used to solve linearly inseparable problems. Qin-Qin Tao [30] proposed a locality-sensitive support vector machine using kernel combination (LS-KC-SVM) algorithm, which addressed the large appearance variations caused by real-world factors in face detection. Liang [31] proposed an SVM-based method combined with deep quasilinear kernel (DQLK) learning for large-scale image classification. It can train an SVM on a large-scale dataset with less memory space and less training time. Zhang [32] developed a least-squares (LS) SVM-based identification scheme, where the system parameters were estimated in a reproducing kernel Hilbert space. It effectively solves the issue that LS results in low accuracy in ill-conditioned scenarios. Lu Li [33] proposed AdaBoost-WCKELM, which combines ELM, AdaBoost, and the composite kernel method and achieved a good improvement in HSI classification accuracy.
Figure 1 gives an overview of the proposed automatic source code vulnerability detection system, which uses enhanced ELM at the source code level. Starting with the dataset in the form of code gadgets, the system obtains the symbolic representation of each code gadget using multilevel symbolization. Next, it transforms the symbolic representations into low-dimensional vector representations using doc2vec. Finally, it applies the enhanced ELM neural network to train the detection model. For testing, each code gadget is first preprocessed through multilevel symbolization and doc2vec, and its vector representation is then input to the detection model to obtain the detection result. In the subsequent sections, we give the details of the main components of this system.

Symbolic Representation.
A code gadget is composed of several program statements (i.e., lines of code), which are semantically related to each other in terms of data dependency or control dependency [16]. It can be further transformed into a symbolic representation using symbolization. The symbolic representations are then collected as a corpus for training the vector representation tool, such as doc2vec.
The benefit of symbolic representation is that it can result in higher training effectiveness by further reducing the length of the code gadget. In symbolization, vulnerability features of each code gadget, such as local variables, user-defined functions, and data types, are transformed into short, fixed-length symbolic representations, where the same features are mapped to the same symbolic representation. In this work, we deploy three symbolization types: function symbolization (F), variable symbolization (V), and data type symbolization (T). The symbol N used in a symbolic name is a number that represents the index of the first occurrence of the feature; note that multiple functions may be mapped to the same symbolic name when they appear in different code gadgets. Moreover, all the symbolization types preserve the keywords of the C/C++ language.
We build a multilevel symbolization mechanism according to the priority of symbolization shown in Table 1. Level 2 includes two symbolization groups, namely, F + V and F + T. This is because symbolizations V and T may have different effects on the signal-to-noise ratio (SNR) of vulnerability information in different datasets.
We take Sample 0 as an example to show how the symbolization works, where the symbolization group F + V is chosen from level 2. From Figure 2, we can observe that there are 2 user-defined functions, 5 variables, and 2 data types in Sample 0.

Security and Communication Networks
As a result, through three levels of symbolization, Sample 0 is gradually simplified to a generalized symbolic representation, which can effectively characterize different manifestations of the same vulnerability.
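The symbolization of user-defined functions and variables (the group F + V) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the tokenizer is a regular expression rather than a C/C++ parser, an identifier followed by "(" is assumed to be a function call, `C_KEYWORDS` and `LIBRARY_FUNCS` are small hypothetical subsets, and data type symbolization (T) is omitted.

```python
import re

# Illustrative subsets only; a real tool would use complete lists.
C_KEYWORDS = {"char", "int", "void", "if", "else", "for", "while",
              "return", "sizeof"}
LIBRARY_FUNCS = {"malloc", "memset", "strcpy", "free", "printf"}

# String/char literals, identifiers, integers, or any other single symbol.
TOKEN = re.compile(
    r'''"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|[A-Za-z_]\w*|\d+|\S''')
IDENT = re.compile(r"[A-Za-z_]\w*")


def symbolize(code: str, functions: bool = True, variables: bool = True) -> str:
    """Map user-defined functions to F1, F2, ... and variables to V1, V2, ...

    The number in each symbolic name is the index of the first occurrence
    of that name within this code gadget.
    """
    tokens = TOKEN.findall(code)
    func_map, var_map = {}, {}
    out = []
    for i, tok in enumerate(tokens):
        is_ident = IDENT.fullmatch(tok) is not None
        if not is_ident or tok in C_KEYWORDS or tok in LIBRARY_FUNCS:
            out.append(tok)  # keywords, library calls, literals, symbols
            continue
        is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
        if is_call:
            if functions:
                tok = func_map.setdefault(tok, f"F{len(func_map) + 1}")
        elif variables:
            tok = var_map.setdefault(tok, f"V{len(var_map) + 1}")
        out.append(tok)
    return " ".join(out)
```

Calling `symbolize(code, variables=False)` applies function symbolization only, which corresponds to a lower symbolization level; enabling both flags corresponds to the group F + V.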

Vector Representation.
Since the neural network can only accept vectors as input, the symbolic representation of source code needs to be further converted to a vector representation. Currently, the most popular vectorization methods are word2vec [34] and doc2vec [35].
Compared with the one-hot representation, which is high-dimensional and sparse, word2vec outputs a low-dimensional and dense vector representation. This is conducive to improving the training efficiency and precision of the model and has made word2vec widely used for vulnerability detection recently [14,16,17]. However, word2vec has a drawback: it ignores the influence of word order, which carries information about a sentence or a document.
doc2vec was proposed in [35], whose authors introduced an unsupervised algorithm called Paragraph Vector that can learn fixed-length feature representations from texts of arbitrary length, ranging from a sentence to a document. Moreover, the Paragraph Vector can memorize the topic of the paragraph, which enables it to extract global features better than word2vec.
Because word2vec converts words to vector representations in a one-to-one fashion, the length of the converted vector varies with the length of the input text. To satisfy the neural network requirement of fixed-length input, the vector generated by word2vec needs to be further processed to obtain the corresponding fixed-length form. Unlike word2vec, doc2vec can directly output fixed-length vectors from input texts of arbitrary length. Furthermore, doc2vec can also capture more semantic information from the context of the input text than word2vec. In summary, doc2vec shows great potential in source code vector representation.
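The extra processing that word2vec output needs can be sketched as follows: per-token vectors are truncated or zero-padded to a fixed number of tokens and then flattened. This is a minimal Python sketch under assumptions; the function name `to_fixed_length` and its parameters are hypothetical, and the per-token vectors are assumed to have been produced by a trained word2vec model beforehand.

```python
def to_fixed_length(token_vectors, max_tokens, dim):
    """Flatten per-token vectors into one vector of length max_tokens * dim.

    Gadgets longer than max_tokens are truncated; shorter ones are
    padded with zeros, which adds no semantic information.
    """
    flat = []
    for vec in token_vectors[:max_tokens]:
        if len(vec) != dim:
            raise ValueError("every token vector must have length dim")
        flat.extend(vec)
    flat.extend([0.0] * (max_tokens * dim - len(flat)))
    return flat
```

With 50 tokens per gadget and 50 dimensions per token, the flattened input has 2500 dimensions, whereas a doc2vec vector is produced directly at its configured size (e.g., 250).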

Neural Network Model.

ELM.
ELM is a special type of feedforward neural network with a noniterative training mechanism, which was proposed by Huang et al. in 2004 [19]. Unlike traditional neural networks, which use gradient descent techniques to iteratively fine-tune all the parameters of the model, ELM randomly assigns values to some parameters according to certain rules and keeps these parameters frozen throughout the training process, while the other parameters are calculated by the least squares method. In other words, the training mechanism of ELM is noniterative, which can give it a much faster training speed than conventional neural networks on some tasks with relatively large data scale. Here, we take ELM with a single hidden layer as an example to introduce its training mechanism. The network structure of ELM is shown in Figure 3. In Figure 3, d, L, and m refer to the numbers of input layer neurons, hidden layer neurons, and output layer neurons, respectively. ω denotes the input weights connecting the input layer to the hidden layer, b denotes the thresholds of the hidden layer neurons, and β denotes the output weights connecting the hidden layer to the output layer. ω and b are generated randomly from the ranges (−1, 1) and (0, 1), respectively, under a uniform distribution. They are kept frozen throughout the training process of the model.

Given a training data set $\{(x_i, t_i)\}$, $i = 1, \ldots, N$, the ELM model can be represented as

$$H\beta = T, \tag{1}$$

where $T$ is the expected output matrix and $H$ is the hidden layer output matrix. The $i$th row of $H$ is

$$h(x_i) = [g(\omega_1 \cdot x_i + b_1), \ldots, g(\omega_L \cdot x_i + b_L)], \quad i = 1, \ldots, N,$$

which is the output vector of the hidden layer with respect to the input $x_i$. $g(\cdot)$ is the activation function of the ELM, and $\omega_j \cdot x_i$ denotes the inner product of the input weights and the features of the $i$th training sample. The output weights $\beta$ can be obtained by

$$\beta = H^{+}T, \tag{2}$$

where $H^{+}$ refers to the Moore-Penrose generalized inverse of $H$, whose regularized form involves an $N \times N$ identity matrix $I$ and a regularization factor $\lambda$ with a value in $[0, 1]$, and $L$ refers to the number of hidden layer neurons. The ELM output function is

$$f(x) = h(x)\beta. \tag{3}$$

The optimization objective of the ELM model is to minimize the error between $f(x_i)$ and $t_i$ over the training samples, where $f(x_i)$ and $t_i$ refer to the predicted label and the real label of the $i$th sample, respectively.
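The noniterative training described above can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the regularized solution is assumed to take the ridge form β = Hᵀ(HHᵀ + λI)⁻¹T with an N × N identity matrix, the function names are hypothetical, and a small Gauss-Jordan solver stands in for a numerical library.

```python
import math
import random


def solve(a, b):
    """Solve a @ x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [[v / m[i][i] for v in m[i][n:]] for i in range(n)]


def hidden_output(x, omega, bias):
    """h(x): sigmoid of the inner product plus threshold for each hidden neuron."""
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(w_j, x)) + b_j)))
            for w_j, b_j in zip(omega, bias)]


def train_elm(xs, ts, n_hidden, lam=1e-3, seed=0):
    """Train a single-hidden-layer ELM without iteration."""
    rng = random.Random(seed)
    d, n, m_out = len(xs[0]), len(xs), len(ts[0])
    # Input weights and thresholds are random and stay frozen.
    omega = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_hidden)]
    bias = [rng.uniform(0, 1) for _ in range(n_hidden)]
    h = [hidden_output(x, omega, bias) for x in xs]  # N x L matrix H
    # Assumed ridge form: (H H^T + lam * I) alpha = T, then beta = H^T alpha.
    gram = [[sum(p * q for p, q in zip(h[i], h[j])) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, ts)
    beta = [[sum(h[i][j] * alpha[i][k] for i in range(n)) for k in range(m_out)]
            for j in range(n_hidden)]
    return omega, bias, beta


def predict_elm(x, model):
    """Output function f(x) = h(x) beta."""
    omega, bias, beta = model
    hx = hidden_output(x, omega, bias)
    return [sum(hx[j] * beta[j][k] for j in range(len(hx)))
            for k in range(len(beta[0]))]
```

Only the output weights are computed from the data, in a single linear solve, which is why training needs no epochs or learning rate.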

KELM.
The kernel method is an effective way to solve nonlinear problems: it maps the data to a high-dimensional space so that the nonlinear problem can be transformed into a linear one. Combining ELM with the kernel method brings two benefits compared with conventional ELM. First, it removes the dependence of conventional ELM on a manually set number of hidden layer nodes, which yields better stability [36]. Second, the kernel function maps the data to a high-dimensional space in which the distribution of the data is very smooth. The smoothed data make the classification problem easier, so the model can show better effectiveness. The Radial Basis Function (RBF) is the preferred kernel function in our experiments because it has only one hyperparameter, which simplifies model configuration and reduces training cost. The RBF kernel function $K(x, y)$ is a Gaussian function of the distance between two samples, where $x$ and $y$ represent the samples, $c$ represents the single hyperparameter of the Gaussian kernel function, and $\|x - y\|$ denotes the norm of the vector $x - y$. The kernel matrix for ELM can be defined as [37]

$$\Omega_{ELM} = HH^{T}, \quad \Omega_{ELM\,i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j).$$

Equation (2) can then be revised, when $N \geq L$, by expressing it in terms of $\Omega_{ELM}$, and the ELM output function (3) can be rewritten accordingly to give equation (8). From equation (8), we can find that ELM combined with the kernel method avoids the problem that the number of hidden layer nodes in conventional ELM depends on manual setting.
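A kernel ELM with the RBF kernel can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the kernel is assumed to take the common form exp(−γ‖x − y‖²), the output is assumed to be f(x) = [K(x, x₁), …, K(x, x_N)](Ω + λI)⁻¹T, and the names `rbf_kernel`, `train_kelm`, `predict_kelm`, `gamma`, and `lam` are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian kernel; the exact parameterization is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(a, b):
    """Solve a @ x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [[v / m[i][i] for v in m[i][n:]] for i in range(n)]


def train_kelm(xs, ts, gamma=1.0, lam=1e-3):
    """Fit kernel ELM: no hidden layer size has to be chosen."""
    n = len(xs)
    # Kernel matrix with the regularization term added on the diagonal.
    omega = [[rbf_kernel(xs[i], xs[j], gamma) + (lam if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    alpha = solve(omega, ts)  # (Omega + lam * I)^-1 T
    return xs, alpha, gamma


def predict_kelm(x, model):
    """Evaluate the kernel between x and every training sample, then combine."""
    xs, alpha, gamma = model
    k = [rbf_kernel(x, xi, gamma) for xi in xs]
    return [sum(k[i] * alpha[i][c] for i in range(len(xs)))
            for c in range(len(alpha[0]))]
```

The model stores the training samples instead of hidden layer weights, so the only hyperparameters are the kernel parameter and the regularization factor. The linear solve is over an N × N matrix, which is consistent with kernel ELM being slower than conventional ELM on large datasets.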

Experiment and Evaluation
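The goal of our work is to construct an automatic software vulnerability detection model with both superior precision and efficiency. To be specific, we investigate three questions in the experiments: the impact of different neural network models on vulnerability detection (Q1), the effectiveness of the vector representation methods doc2vec and word2vec (Q2), and whether symbolization can further improve the precision of the neural network model (Q3).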

Dataset.
In our experiments, we use three datasets from [16], referred to below as BE-ALL, RM-ALL, and HY-ALL. Each sample is a piece of source code with known vulnerabilities. Table 2 shows the number of samples (i.e., source code files) in each dataset. Each dataset is partitioned into two parts in a proportion of 80% to 20%, where the larger part is used for training and the other part for testing. Each sample in the dataset is in the form of a code gadget with a ground truth label.
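The 80%/20% partition can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and shuffling with a fixed seed before splitting is an assumption, since the text does not state how samples are assigned to the two parts.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Split samples into a training part and a testing part.

    Shuffling with a fixed seed is an assumption made for reproducibility.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Every sample ends up in exactly one of the two parts, so the testing part is never seen during training.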

Evaluation Metrics.
In our experiments, we used the indexes mentioned in [38] to evaluate the effectiveness of the vulnerability detection model, that is, False Positive Rate (FPR), True Positive Rate (TPR), Precision (P), and F1-measure (F1). The value range of these four indicators is [0, 1]. For FPR, the closer its value is to 0, the better the performance of the model is; for the other indicators, the closer their values are to 1, the better the performance of the model is.
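These four indicators can be computed from the confusion counts as in the following Python sketch, which uses the standard definitions FPR = FP / (FP + TN), TPR = TP / (TP + FN), P = TP / (TP + FP), and F1 = 2 · P · TPR / (P + TPR). The function name and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return FPR, TPR, P, and F1 from true/false positive/negative counts."""
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    return {"FPR": fpr, "TPR": tpr, "P": precision, "F1": f1}
```

For instance, 80 true positives, 20 false positives, 180 true negatives, and 20 false negatives give FPR = 0.1 and TPR = P = F1 = 0.8.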
The quality of vector representation can be evaluated by the Cosine Similarity (cosine) between vectors in the vector space, which can be calculated by the following formula:

$$\text{cosine}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|},$$

where A and B refer to vectors. The range of the cosine value is [−1, 1]. The closer the value is to 1, the more similar the two vectors are. Because Cosine Similarity only considers the angle between vectors, it avoids overly large output deviations caused by the different dimensions of the input vectors. This is the main reason why we choose Cosine Similarity as the evaluation metric of vector representation.
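The cosine computation can be sketched in Python as follows; the function name is hypothetical, and rejecting zero vectors is a design choice made here because the angle is undefined for them.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, scaling either vector leaves it unchanged: `[1, 2]` and `[2, 4]` have a cosine of 1, while orthogonal vectors have a cosine of 0.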

Parameters Setting for Neural Networks.
In our experiments, we used two types of neural networks for the vulnerability detection model, namely, Bi-LSTM and ELM. For both, there is only one hidden layer in the network structure. We build the following five configurations. We do not list a configuration for AdaBoost KELM because adding the weight and iteration mechanisms to KELM predictably makes the calculation very complex and greatly reduces efficiency.
(i) word2vec with Bi-LSTM (w + B), which was used in [16]
(ii) doc2vec with Bi-LSTM (d + B)
(iii) doc2vec with ELM (d + E)
(iv) ELM enhanced with AdaBoost (Ada-E)
(v) ELM enhanced with the kernel method (KE)
We have implemented the CPU versions of Bi-LSTM and ELM, and all the models were trained in a PC environment with a CPU. For Bi-LSTM, the batch size, the dropout rate, the number of epochs, and the number of hidden layer neurons were set to 64, 0.5, 2, and 60, respectively, and the optimizer chosen was Root Mean Square Prop (RMSProp). For ELM, the number of hidden layer neurons was set to 5000 and the sigmoid function was used as the activation function. The input weights and the hidden biases of ELM were generated randomly from (−1, 1) and (0, 1), respectively, under a uniform distribution. The details of the parameter configuration of ELM are given as follows.
To determine which activation function is the best choice for the ELM-based detection model, we run an experiment to compare the effectiveness of ELM with five activation functions. The number of neurons is set to 250 and the dataset is HY-ALL. From the results in Figure 4, we can find that ELM with the sigmoid function outperforms ELM with the other activation functions on precision and F1.
In terms of the neuron configuration of ELM, we ran several experiments to analyze the effect of the number of neurons on the precision of ELM, as shown in Table 3. Generally speaking, when the number of neurons ranges from 250 to 12000, the precision of ELM gradually increases with the number of neurons, but when the number of neurons is greater than or equal to 15000, the precision of ELM begins to decline slowly. In particular, when the number is between 250 and 5000, the precision improvement is more obvious, whereas when the number increases from 5000 to 12000, the precision improvement is slight, nearly 0.3%, and the training time increases by 4 times. Considering the trade-off between precision improvement and time consumption, we set the number of neurons to 5000.
The kernel function plays a very important role in KELM and largely determines its precision. We collect three commonly used kernel functions for a comparison experiment. The comparison of the results after fine-tuning is shown in Figure 5. It is clear from the results that RBF shows better overall performance than the other two kernel functions. Thus, the subsequent KELM-related experiments use RBF as the kernel function.

Results for Q1.
Regarding the impacts of different neural network models on the performances of vulnerability detection, we evaluate the precision and efficiency of the above five configurations on all datasets. Table 4 shows the effect of different neural network models on vulnerability detection precision, while Table 5 gives the efficiency of different neural network models on vulnerability detection.
In the experiments, all three datasets are preprocessed with the symbolization group F + V.
According to the results in Table 4, we analyze them from two aspects: precision comparison of conventional Bi-LSTM and ELM and enhanced effect of conventional ELM using kernel function and AdaBoost method.
Compared with ELM, Bi-LSTM is slightly inferior on RM-ALL, a small-scale dataset, but superior on BE-ALL and HY-ALL, the large-scale datasets. This may be due to the fact that the deep learning model is more suitable for large dataset scenarios. Besides, Bi-LSTM shows lower FPR than ELM on all three datasets. This can be explained by the fact that Bi-LSTM can express the long-term dependency information in the input, whereas ELM, being based on a feedforward neural network, is slightly inferior to Bi-LSTM in context processing.
Ada-E outperforms conventional ELM on RM-ALL and HY-ALL, which shows the advantage of ensemble learning, that is, combination enhancement. However, it shows similar P and lower TPR than ELM on RM-ALL, which may be due to the overfitting effect of ensemble learning for high-precision base classifiers. This suggests that when the base classifier already has very high precision, the final classifier generated by AdaBoost does not always achieve higher precision and may even be worse. KE shows the lowest FPR and the highest P on the three datasets compared with the other configurations, which benefits from its effective way of solving nonlinear problems through high-dimensional mapping. It also results in the lowest TPR, but this is acceptable because a high false-positive rate is the primary problem of vulnerability detection tools in practical applications.
From Table 5, we can find that configuration w + B takes the longest time for training and detection on HY-ALL, while configuration d + B costs less than 1/30 of the time of configuration w + B. This is because configuration w + B in [16] outputs vectors with a larger dimension of 2500, which results in a higher computational complexity for Bi-LSTM. Moreover, compared with configuration d + B, configuration d + E further reduces the time cost of training and detection to a few minutes. This can be explained by the fact that the noniterative training mechanism of ELM reduces the computation of parameters. Ada-E improves the precision of ELM by adding the iteration mechanism and introducing the weight mechanism to ELM, but these operations increase the computational complexity. Therefore, the training and detection time of ELM is multiplied accordingly. KE shows a lower efficiency than conventional ELM because it maps the input and output to a higher dimension for calculation, which results in a larger computational complexity.
Thus, we can conclude that the configuration with conventional Bi-LSTM achieves a higher precision, while the configuration with conventional ELM is more efficient. Using AdaBoost and the kernel function can further improve the precision of conventional ELM in vulnerability detection. In particular, the kernel function achieves a very good precision improvement while maintaining higher efficiency than conventional Bi-LSTM.

Results for Q2.
To answer the second question, we evaluate the effectiveness of the two vector representation methods, namely, doc2vec and word2vec. We run experiments with the four samples shown in Figure 6. Sample 2 and Sample 4 are labeled as "vulnerable," while Sample 1 and Sample 3 are not. We collect these four samples from the datasets used in our experiments.
We evaluate the effectiveness of vector representations by using the similarity measure cosine. The output vector dimension of word2vec is set to 2500, where the output vector dimension of one word is set to 50 and the number of words used to represent a paragraph is set to 50. The output vector dimension of doc2vec is set to 250. The reason for using different output vector dimensions for word2vec and doc2vec is that if both were set to the same value (e.g., 250), then either the vector dimension of one word in word2vec would be 5 or the number of words used to represent a paragraph would be 5, which may greatly reduce the effectiveness of the vector representation. As a result, the comparison of the effectiveness of word2vec and doc2vec is carried out under the condition that each uses a proper dimension for its output vector representation.
From the perspective of vulnerability detection, when two similar samples carry different labels, the similarity between their vectors after vectorization should be as small as possible, which is conducive to training the vulnerability detection model with a neural network. From Table 6, we find that, for Sample 1 and Sample 2, word2vec outputs vectors with a higher cosine value than doc2vec, while for Sample 3 and Sample 4, it outputs a lower cosine value than doc2vec. Generally, compared with word2vec, doc2vec can output a nearly similar or better vector representation with a smaller dimension. This can be explained by the fact that, as noted in [16], in order to obtain a fixed-length vector representation, vectors generated by word2vec should be padded with zeros, which may cause the loss of semantic information of the samples. Moreover, a neural network model with low-dimensional input vectors is clearly more efficient. We can also observe that, for the same sample in different datasets, doc2vec outputs more consistent cosine results than word2vec; the biggest output cosine deviation of doc2vec is 0.008, while that of word2vec is 0.032. This shows that doc2vec can perform well on large datasets. This conclusion is also supported by the results in Table 4, where the configurations with doc2vec show better results than the ones with word2vec on HY-ALL.

Results for Q3.
To answer the third question, we take the configurations d + B and d + KE as baselines to discuss whether symbolization can further improve the precision of the neural network model. We run experiments with all three datasets, and for each dataset, we apply symbolization levels 1 to 3 for preprocessing. Table 7 summarizes how the different symbolization levels affect the precision of Bi-LSTM. From the perspective of different datasets, symbolization levels have a bigger impact on the precision of the Bi-LSTM vulnerability detection model with smaller datasets, with a maximum deviation of precision of 3.1% in BE-ALL and 2.6% in RM-ALL. However, with the largest dataset HY-ALL, the maximum precision deviation is 0.9%. This may be because the scale of a dataset affects the generalization performance of the detection model, so the impact of symbolization is gradually reduced as the scale becomes larger. From the perspective of symbolization levels, configuration d + B with symbolization level 1 shows a better and more stable performance than the other symbolization levels, while symbolization level 2 results in an unstable performance and symbolization level 3 shows the worst performance. The main reason is that a high level of symbolization may lose some key vulnerability information in the source code. Moreover, the symbolization group F + T outperforms the symbolization group F + V on all datasets; this may be because the source code contains much code related to data types, and symbolizing it can better capture the vulnerability information. Table 8 summarizes how the different symbolization levels affect the precision of KELM.
[Figure: code samples. (a) data = (char *)malloc(100 * sizeof(char)); goodG2BSource(data); void goodG2BSource(char * &data) memset(data, 'A', 50 - 1); data[50 - 1] = '\0'; char dest[50] = ""; strcpy(dest, data); (b) char * data; data = (char *)malloc(100 * sizeof(char)); if(5 == 5) memset(data, 'A', 100 - 1); data[100 - 1] = '\0'; char dest[50] = ""; strcpy(dest, data);]
From the perspective of different datasets, symbolization levels have a big impact on the precision of the KELM-based vulnerability detection model with dataset HY-ALL, which shows a maximum deviation of precision at 3.2%. However, with smaller datasets BE-ALL and RM-ALL, the maximum deviation of precision is 1.5% and 0.8%, respectively.
This is because the semantic changes of samples generated by different symbolization are smaller in small datasets and larger in large datasets, which causes a larger deviation of precision on the large dataset. From the perspective of symbolization levels, configuration d + KE with the symbolization group F + T shows the best and most stable performance among the symbolization levels, while symbolization level 1 results in better performance than symbolization level 3 on all the datasets. The former phenomenon may be because KELM is better suited to extracting vulnerability information from datasets preprocessed by the F + T symbolization, and the latter can be explained by the reason mentioned above: a high level of symbolization may lose key vulnerability information. Furthermore, to verify the training efficiency of the proposed multilevel symbol representation, we also give a comparative analysis of training time, as shown in Table 9. From Table 9, we can observe two phenomena: first, for the same dataset, the training time decreases approximately linearly as the symbolization level increases from 1 to 3; second, the training time increases as the size of the dataset increases. Meanwhile, compared with symbolization level 1, symbolization level 3 improves training efficiency by about 20% on all three datasets. This indicates that multilevel symbolization can slightly improve efficiency, although this gain is minor compared with its use for improving the precision of neural networks.
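To make the symbolization groups discussed above concrete, the following is a minimal sketch, not the preprocessing used in the experiments. It assumes that F, V, and T denote user-defined function names, variable names, and data types, and that the symbolization level equals the number of groups symbolized. The symbol prefixes (`FUN`, `VAR`, `TYPE`), the keyword, type, and library-function lists, and the function `symbolize` are all hypothetical.

```python
import re

# Hypothetical word lists; a real preprocessor would use a C/C++ parser instead.
C_KEYWORDS = {"if", "else", "for", "while", "do", "return", "sizeof", "switch",
              "case", "break", "continue", "static", "const", "struct"}
C_TYPES = {"void", "char", "short", "int", "long", "float", "double",
           "unsigned", "signed", "size_t"}
LIBRARY_FUNCS = {"malloc", "free", "memset", "memcpy", "strcpy", "strncpy", "printf"}

TOKEN = re.compile(r"""
    "(?:\\.|[^"\\])*"      # string literal (left untouched)
  | '(?:\\.|[^'\\])*'      # character literal (left untouched)
  | \b[A-Za-z_]\w*         # identifier, keyword, or type name
""", re.VERBOSE)
CALL = re.compile(r"\s*\(")
PREFIX = {"F": "FUN", "V": "VAR", "T": "TYPE"}


def symbolize(code, groups):
    """Replace names belonging to the selected groups ('F', 'V', 'T') with symbols."""
    mapping = {}
    counters = {"F": 0, "V": 0, "T": 0}

    def classify(match):
        name = match.group(0)
        if name in C_TYPES:
            return "T"
        if name in C_KEYWORDS or name in LIBRARY_FUNCS:
            return None  # keep keywords and library calls visible
        # An identifier directly followed by "(" is treated as a function name.
        return "F" if CALL.match(match.string, match.end()) else "V"

    def replace(match):
        text = match.group(0)
        if text[0] in "\"'":
            return text  # literals are never symbolized
        kind = classify(match)
        if kind is None or kind not in groups:
            return text
        key = (kind, text)
        if key not in mapping:
            counters[kind] += 1
            mapping[key] = f"{PREFIX[kind]}{counters[kind]}"
        return mapping[key]

    return TOKEN.sub(replace, code)


if __name__ == "__main__":
    snippet = "char * data; data = (char *)malloc(100 * sizeof(char)); goodG2BSource(data);"
    print(symbolize(snippet, {"F"}))            # one group (level 1 under the assumption)
    print(symbolize(snippet, {"F", "T"}))       # two groups (level 2)
    print(symbolize(snippet, {"F", "V", "T"}))  # three groups (level 3)
```

Under these assumptions, a higher level replaces more of the original names, which is consistent with the observation that a high symbolization level may discard information relevant to the vulnerability. The sketch does not handle comments or preprocessor directives.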

Conclusions
We have made the first effort to use ELM to address the training efficiency issue of vulnerability detection models. We then introduced the kernel method to improve the precision of ELM. Experimental results show that ELM with the kernel method achieves an effective combination of efficiency and precision. In particular, for data preprocessing, we find that vector representation using doc2vec performs well on large datasets and that an appropriate symbolization level can effectively improve the precision of vulnerability detection. These experimental conclusions provide researchers and engineers with guidelines for choosing neural networks and data preprocessing methods for vulnerability detection.
This paper has several limitations, which we expect to address in future work. First, although more than one kind of single-layer feedforward neural network could be used for vulnerability detection, we only used ELM in this work. Second, we expect to explore methods other than the kernel method to improve the precision of ELM. Third, the datasets used in our experiments come from a single source; more datasets from different sources could be used to verify the effectiveness of the proposed approach.

Data Availability
Previously reported vulnerability data were used to support this study and are available at https://github.com/CGCLcodes/VulDeePecker. These prior studies (and datasets) are cited at relevant places within the text as reference [16].

Conflicts of Interest
The authors declare that they have no conflicts of interest.