In order to improve software reliability, software defect prediction is applied during software maintenance to identify potential bugs. Traditional defect prediction methods mainly focus on designing static code metrics, which are fed into machine learning classifiers to predict defect probabilities of the code. However, these handcrafted metrics do not capture the syntactic structures and semantic information of programs, even though such information is more significant than manual metrics and can yield a more accurate predictive model. In this paper, we propose a framework called defect prediction via attention-based recurrent neural network (DP-ARNN). More specifically, DP-ARNN first parses programs into abstract syntax trees (ASTs) and extracts them as token vectors. It then encodes these vectors, which serve as inputs of DP-ARNN, by dictionary mapping and word embedding. After that, it automatically learns syntactic and semantic features. Furthermore, it employs the attention mechanism to further generate significant features for accurate defect prediction. To validate our method, we choose seven open-source Java projects in Apache, using F1-measure and area under the curve (AUC) as evaluation criteria. The experimental results show that, on average, DP-ARNN improves F1-measure by 14% and AUC by 7% compared with state-of-the-art methods.
With the continuous expansion of modern software, software reliability has become a key concern. The complex source code of software tends to cause software defects which may lead to software failure. In order to help developers and testers locate software defects in time, software defect prediction has become one of the research directions in the field of data mining of software engineering [
Software defect prediction [
Traditional defect prediction methods mainly consist of two stages: extracting software metrics from historical repositories and constructing a machine learning model for classification. Previous research focuses on designing discriminative artificial metrics to achieve higher model accuracy. These manual metrics are mainly divided into Halstead features [
However, static code attributes sometimes cannot distinguish defective code from clean code, because a clean code snippet and a buggy one may have identical static code attribute values, which makes them hard for classifiers to differentiate. Since the syntactic and semantic information of the two differs, features which contain such structural and semantic information should improve the performance of defect prediction. Programs have their own particular syntactic structures and rich semantic information hidden in ASTs [
To take full advantage of the intrinsic syntax and semantics of programs, this paper proposes a framework called software defect prediction via attention-based recurrent neural network (DP-ARNN), which can capture syntactic and semantic features of programs and use them to improve defect prediction. Specifically, we build a recurrent neural network (RNN) [ The main contributions of this paper are as follows:

- We propose an RNN-based defect prediction framework to learn valuable features which contain syntactic and semantic information of the source code. The empirical studies show that, on average, these deep learning-based features outperform traditional features by 14% on F1-measure and 10% on AUC.
- We apply dictionary mapping and word embedding to convert programs' ASTs into high-dimensional digital vectors, which serve as the inputs of the RNN to learn code context information.
- We leverage the attention mechanism to further generate significant features from the outputs of the RNN, leading to better defect prediction performance. The experimental results show that, compared with a plain RNN, the attention-based RNN yields an average improvement of 3% on F1-measure and 1% on AUC.
The rest of this paper is organized as follows. Section
Software defect prediction is a significant research field in software engineering [
To solve the problem of lack of information in the historical repositories of the same project, more and more papers have studied cross-project software defect prediction. In this field, because of the different domains of training samples and test samples, we need to apply transfer learning techniques. By using the transfer component analysis (TCA+) [
Our proposed DP-ARNN differs from the aforementioned traditional defect prediction methods. Instead of using the static code attributes, we leverage the deep learning technique (i.e., RNN) to automatically generate features from the source code, which can capture syntactic and semantic information of programs, and implement the attention mechanism to generate significant features which can improve the performance of defect prediction.
Datasets of traditional defect prediction are extracted from artificially designed metrics, which may be redundant or only weakly correlated with class labels; both issues can degrade the prediction performance of the model. Besides, manual metrics cannot make full use of code context information to mine the syntactic structure and semantic information of programs.
The syntactic and semantic information of programs can be represented in two ways. One is ASTs and the other is control flow graphs (CFGs) [
The aforementioned deep learning-based methods consider all the hidden features to be equally significant, and they cannot identify discriminative features that contain key syntaxes and semantics. This may lead to inaccurate defect prediction. Hence, in our proposed method, we employ the attention mechanism to capture these key features and give them higher weights. Besides, we choose ASTs of programs as the representation of programs rather than CFGs, because ASTs can better depict the structure of the source code and reserve more information of source code.
In this section, we elaborate our proposed DP-ARNN, a framework which automatically learns syntactic and semantic features from the source code and generates key features from them for precise software defect prediction. Figure
The overall framework of our proposed DP-ARNN. (a) Parsing source code. (b) Mapping string vectors into integer vectors. (c) Generating features via RNN with attention mechanism. (d) Performing defect prediction.
As shown in Figure
In order to represent the source code in each file as a vector, we first need to find the appropriate granularity as the vector representation of the source code. We can extract characters, words, or ASTs from the source code as tokens. According to the former research [
In our experiments, we apply an open-source Python package named javalang which is available at
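The framework parses each source file with javalang and keeps only the selected AST node types as tokens. As a runnable illustration of the same traverse-and-select idea, here is a sketch using Python's built-in `ast` module on Python code; the selected node set and the sample snippet are illustrative stand-ins for the Java node types used in the paper.

```python
import ast

# Node types kept as tokens (illustrative stand-ins for the paper's
# selected Java AST nodes: declarations, control flow, invocations).
SELECTED_NODES = (ast.FunctionDef, ast.ClassDef, ast.If, ast.For,
                  ast.While, ast.Call, ast.Return)

def extract_tokens(source: str) -> list:
    """Parse source code and return selected AST nodes as string tokens."""
    tree = ast.parse(source)
    tokens = []
    for node in ast.walk(tree):          # visit every node of the AST
        if isinstance(node, SELECTED_NODES):
            # Named nodes keep their identifier; others keep the node type.
            name = getattr(node, "name", None)
            tokens.append(name if name else type(node).__name__)
    return tokens

code = """
class Buffer:
    def read(self, n):
        if n > 0:
            return self.data[:n]
        return b""
"""
print(extract_tokens(code))  # ['Buffer', 'read', 'If', 'Return', 'Return']
```

For Java sources, javalang exposes the analogous traversal over its own node classes (e.g., method invocations and class/method declarations), producing one token sequence per file.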
The selected nodes of ASTs.
Representative nodes
Algorithm: traverse the AST; for each visited node, if its type is in the selected set, add the node to the end of the token sequence; finally, return the sequence.
ASTs can effectively store structural and semantic information of the program module. For example, code A in Figure
Source code of two example files. (a) The clean code A. (b) The defective code B.
ASTs of two example files. (a) The clean code A. (b) The defective code B.
Algorithm: create a list, add each selected AST node encountered during traversal to the list, and return the list.
Software defect prediction data are usually class imbalanced: defective instances account for only a small part of all the instances. If they are fed directly into the model for training, the predictions will be biased towards the majority class (i.e., clean instances). According to the research [
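One common way to rebalance such data is random oversampling, duplicating minority (defective) instances until the classes balance. The following is a minimal sketch of that idea; the paper's exact resampling scheme may differ.

```python
import random

def oversample(samples, labels, seed=42):
    """Randomly duplicate minority-class instances until classes balance."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]   # defective
    neg = [i for i, y in enumerate(labels) if y == 0]   # clean
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [samples[i] for i in idx], [labels[i] for i in idx]

X = [[1], [2], [3], [4], [5], [6]]
y = [1, 0, 0, 0, 0, 0]                  # 1 defective vs 5 clean
Xb, yb = oversample(X, y)
print(sum(yb), yb.count(0))             # 5 5: classes now balanced
```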
In order to learn the context information of the source code and generate the key features, we construct a Bi-LSTM network, a variant of standard RNN, with attention mechanism [
The network architecture of DP-ARNN.
Simple digital integers cannot reflect the content information carried by an AST node. Therefore, we adopt word embedding technique to map each positive integer vector into a high-dimensional real vector with fixed size, which can be defined as follows:
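The two encoding steps, dictionary mapping to positive integers (with 0 reserved for padding) and embedding lookup, can be sketched as follows. The embedding matrix is random here, whereas in DP-ARNN it is learned during training; the dimensionality of 30 follows the tuned parameters reported later.

```python
import numpy as np

# Dictionary mapping: each distinct AST token gets a positive integer id;
# 0 is reserved for padding so all sequences share a fixed length.
tokens = ["ClassDecl", "MethodDecl", "If", "MethodInvocation", "Return"]
vocab = {tok: i + 1 for i, tok in enumerate(sorted(set(tokens)))}

def encode(seq, length=8):
    """Map tokens to integer ids, then pad (or truncate) to a fixed length."""
    ids = [vocab[t] for t in seq]
    return ids[:length] + [0] * max(0, length - len(ids))

rng = np.random.default_rng(0)
embed_dim = 30                          # tuned embedding dimensionality
E = rng.normal(size=(len(vocab) + 1, embed_dim))  # row 0 = padding vector

ids = encode(tokens)
vectors = E[ids]                        # lookup: (sequence_length, embed_dim)
print(vectors.shape)                    # (8, 30)
```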
Standard RNN splits sequence data into vectors with fixed length. Each element in a vector denotes a certain moment. For a certain moment
Contextual information of the source code is significant to detect potential bugs. Each program has its own syntaxes and semantics which are context sensitive. Therefore, the occurrence of a defective code segment is usually relevant to either previous or subsequent code, or even to both of them. In most cases, because of the complexity of the source code, it is hard to exactly locate which line of code actually results in the vulnerability. Hence, in order to efficiently capture the defective programming patterns, we implement Bi-LSTM to make full use of both forward and backward information.
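To make the bidirectional idea concrete, here is a minimal numpy sketch in which a plain tanh RNN cell stands in for the LSTM unit: forward and backward hidden states are concatenated at each time step. Weights are random and shapes are illustrative (embedding dimension 30, 40 units per direction, matching the tuned parameters).

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Run a simple tanh RNN over a sequence, returning all hidden states."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

def bidirectional(xs, params_f, params_b):
    """Concatenate forward and backward hidden states at each time step."""
    fwd = rnn_pass(xs, *params_f)
    bwd = rnn_pass(xs[::-1], *params_b)[::-1]  # restore forward order
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(1)
d_in, d_hid, T = 30, 40, 5
make = lambda: (rng.normal(size=(d_hid, d_in)) * 0.1,
                rng.normal(size=(d_hid, d_hid)) * 0.1,
                np.zeros(d_hid))
seq = [rng.normal(size=d_in) for _ in range(T)]
H = bidirectional(seq, make(), make())
print(len(H), H[0].shape)               # 5 (80,): 40 forward + 40 backward
```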
From the output of the Bi-LSTM network, we obtain the hidden features of all time nodes in a sequence. These nodes do not contribute equally to the representation of the sequence meaning. In order to enhance the effect of critical nodes, we embed an attention layer after the Bi-LSTM layer. By applying the attention mechanism, critical nodes which are significant to the meaning of the sequence are aggregated to form a sequence vector. Figure
The process of attention mechanism.
That is, we first input the node annotation
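A numpy sketch of this attention step, assuming the common formulation (project each hidden state through a one-layer tanh network, score it against a learned context vector u, softmax the scores, and take the weighted sum); W, b, and u are random placeholders here but are learned in DP-ARNN.

```python
import numpy as np

def attention(H, W, b, u):
    """Aggregate hidden states H (T, d) into one sequence vector (d,)."""
    U = np.tanh(H @ W + b)              # project each hidden state  (T, d_a)
    scores = U @ u                      # similarity to context vector (T,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()         # softmax attention weights    (T,)
    return alpha @ H, alpha             # weighted sum of hidden states

rng = np.random.default_rng(2)
T, d, d_a = 5, 80, 32                   # 5 steps, Bi-LSTM output dim 80
H = rng.normal(size=(T, d))
W, b, u = rng.normal(size=(d, d_a)), np.zeros(d_a), rng.normal(size=d_a)

vec, alpha = attention(H, W, b, u)
print(vec.shape, round(alpha.sum(), 6))  # (80,) 1.0
```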
In the training phase, we construct two fully connected layers and an output layer. The first fully connected layer normalizes sequence features through a tanh function, and the second fully connected layer with a linear function further extracts features. At last, in the output layer, we put them through a sigmoid function as a logistic regression classifier to compute the defect probability of the program module.
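The prediction head just described can be sketched as a plain forward pass (tanh dense, linear dense, sigmoid output). Layer sizes follow the tuned parameters reported later (16 and 24 hidden nodes); the weights here are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_defect(seq_vec, params):
    """tanh dense -> linear dense -> sigmoid defect probability."""
    (W1, b1), (W2, b2), (w3, b3) = params
    h1 = np.tanh(seq_vec @ W1 + b1)     # 1st fully connected layer (tanh)
    h2 = h1 @ W2 + b2                   # 2nd fully connected layer (linear)
    return sigmoid(h2 @ w3 + b3)        # output layer: defect probability

rng = np.random.default_rng(3)
d = 80                                  # attention output dimension
params = ((rng.normal(size=(d, 16)) * 0.1, np.zeros(16)),   # 16 nodes
          (rng.normal(size=(16, 24)) * 0.1, np.zeros(24)),  # 24 nodes
          (rng.normal(size=24) * 0.1, 0.0))

p = predict_defect(rng.normal(size=d), params)
print(0.0 < p < 1.0)                    # True: a valid probability
```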
In this section, we design experiments to verify the effectiveness of DP-ARNN. The following four research questions (RQs) need to be answered:
RQ1: do the deep learning methods improve the performance of defect prediction compared to traditional methods based on static code metrics?
RQ2: compared with features generated by the classical unsupervised learning methods, do features learned by the deep learning methods better represent syntaxes and semantics of programs?
RQ3: does DP-ARNN outperform the basic deep learning methods, including CNN and RNN?
RQ4: how is the prediction performance of DP-ARNN under different parameter settings?
In our experiments, we use Keras (2.2.4) and TensorFlow (1.11.0) to build the attention-based Bi-LSTM network. The other benchmark methods are implemented mainly with scikit-learn (0.19.2) and Python 3.6. The experiments run on a server with Ubuntu 16.04, a 3.60 GHz Intel i7 CPU, and 8 GB of RAM.
We collect seven open-source Java projects in Apache, each of which contains two versions (i.e., preversion and postversion). Datasets that contain static code metrics and defect annotations of source files in each project come from the metrics repository, a publicly available repository specializing in datasets for software defect prediction research. Specifically, each source file has 20 traditional artificial features, carefully extracted by Jureczko and Madeyski, the contributors of CK features for object-oriented programs [
Description of the 20 static code metrics.
Metric Name | Symbol | Description |
---|---|---|
Weighted methods per class | WMC | The number of methods in the class |
Depth of inheritance tree | DIT | The position of the class in the inheritance tree |
Number of children | NOC | The number of immediate descendants of the class |
Coupling between object classes | CBO | The value increases when the methods of one class access services of another |
Response for a class | RFC | Number of methods invoked in response to a message to the object |
Lack of cohesion in methods | LCOM | Number of pairs of methods that cannot share a reference to an instance variable |
Lack of cohesion in methods, different from LCOM | LCOM3 | If |
Number of public methods | NPM | The number of all the methods in a class that are declared as public |
Data access metric | DAM | Ratio of the number of private (protected) attributes to the total number of attributes |
Measure of aggregation | MOA | The number of data declarations (class fields) whose types are user-defined classes |
Measure of function abstraction | MFA | Number of methods inherited by a class plus number of methods accessible by member methods of the class |
Cohesion among methods of class | CAM | Summation of the number of different types of method parameters in every method divided by the multiplication of the number of different method parameter types in whole class and number of methods |
Inheritance coupling | IC | The number of parent classes to which a given class is coupled |
Coupling between methods | CBM | Total number of new/redefined methods to which all the inherited methods are coupled |
Average method complexity | AMC | The average method size, measured in Java byte codes |
Afferent couplings | Ca | How many other classes use the specific class |
Efferent couplings | Ce | How many other classes are used by the specific class |
Maximum McCabe | Max (CC) | Maximum McCabe’s cyclomatic complexity values of methods in the same class |
Average McCabe | Avg (CC) | Average McCabe’s cyclomatic complexity values of methods in the same class |
Lines of code | LOC | Measures the volume of the code |
Java project dataset information.
Project | Versions (pre, post) | Avg files | Avg defect rate (%) |
---|---|---|---|
Camel | 1.4, 1.6 | 918 | 18.1 |
Lucene | 2.0, 2.2 | 221 | 53.7 |
Poi | 2.5, 3.0 | 413 | 64.0 |
Xerces | 1.2, 1.3 | 447 | 15.7 |
Jedit | 4.0, 4.1 | 309 | 25.0 |
Xalan | 2.5, 2.6 | 844 | 47.3 |
Synapse | 1.1, 1.2 | 239 | 30.3 |
We evaluate the performance of our model using F1-measure and AUC. F1-measure measures the stability of DP-ARNN, and AUC assesses its discrimination ability.
F1-measure is the harmonic mean of the precision and the recall. We define equations (
Normally, precision and recall cannot be optimal at the same time. For example, if we predict all the program files to be defective, recall will reach 100%, but precision will be very low. Therefore, we use the F1-measure (i.e., the harmonic mean of the two metrics) as a trade-off between precision and recall. It ranges from 0 to 1, and higher values are better.
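The computation can be sketched directly from the prediction counts:

```python
def f1_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the defective class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(f1_measure(y_true, y_pred), 3))  # tp=2, fp=1, fn=1 -> 0.667
```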
AUC (i.e., area under ROC curve) is based on the area under the ROC (i.e., receiver operating characteristic) curve to evaluate the distinguishing ability of the prediction model. When evaluating the model classifier, the ROC curve first sets different thresholds for classification. The abscissa of the ROC curve is the value of false positive rate (fpr) and the ordinate is the value of true positive rate (tpr). Each classification threshold generates a coordinate (fpr, tpr), and ROC is the curve formed by these coordinate points. AUC is the area under ROC curve. The value of it ranges from 0 to 1, the higher the better. In addition, AUC is appropriate for evaluating class-imbalanced datasets.
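Equivalently, AUC is the probability that a randomly chosen defective file receives a higher score than a randomly chosen clean one (ties counted as half), which gives a compact way to compute it without explicitly drawing the ROC curve:

```python
def auc(y_true, scores):
    """AUC via pairwise comparison: P(defective scored above clean)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]  # defective scores
    neg = [s for y, s in zip(y_true, scores) if y == 0]  # clean scores
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2]
print(round(auc(y_true, scores), 4))  # 5 of 6 pairs ranked correctly: 0.8333
```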
Besides, we employ the Friedman test [
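For reference, the Friedman statistic over N projects and k methods is computed from the per-project ranks as chi2_F = 12N/(k(k+1)) * (sum_j R_j^2 - k(k+1)^2/4), where R_j is method j's mean rank. A small pure-Python sketch with hypothetical scores:

```python
def friedman_statistic(score_table):
    """Friedman chi-square over N datasets (rows) x k methods (columns)."""
    N, k = len(score_table), len(score_table[0])
    mean_ranks = [0.0] * k
    for row in score_table:
        # Rank within the row (higher score = better = lower rank),
        # averaging the ranks of tied values.
        for j, s in enumerate(row):
            rank = 1 + sum(1 for s2 in row if s2 > s)
            ties = sum(1 for s2 in row if s2 == s)
            mean_ranks[j] += rank + (ties - 1) / 2.0
    mean_ranks = [r / N for r in mean_ranks]
    return 12.0 * N / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0)

# Hypothetical F1 scores: 3 projects (rows) x 3 methods (columns).
table = [[0.70, 0.50, 0.40],
         [0.65, 0.55, 0.45],
         [0.60, 0.52, 0.48]]
print(friedman_statistic(table))  # method 1 always ranks first: 6.0
```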
We select the following five baseline methods to compare with our proposed DP-ARNN:

- RF: random forest (RF) [
- RBM + RF: random forest with hidden features learned by a restricted Boltzmann machine (RBM) [
- DBN + RF: random forest with hidden features generated by a deep belief network (DBN) [
- CNN: a deep learning method based on text sequence convolution, which feeds hidden features learned by the CNN to the final classifier
- RNN: a bidirectional recurrent neural network based on LSTM units to generate syntactic and semantic features for defect prediction
We take the same method to generate the inputs of CNN and RNN, which we have mentioned in Section
The Friedman test is performed on F1-measure among all the methods, whose result is shown in Table
Friedman test among all the 6 methods.
 | | 
---|---|---
Baseline | 5 | 0.05
Test result | 5 |
 | RF | RBM + RF | DBN + RF | CNN | RNN
---|---|---|---|---|---
DP-ARNN | 0.104 | 0.001 | 0.008 | 0.900 | 0.766
F1-measure comparison of different models.
Project | DP-ARNN | RF | RBM + RF | DBN + RF | CNN | RNN
---|---|---|---|---|---|---
Camel |  | 0.396 | 0.310 | 0.330 | 0.473 | 0.506
Lucene |  | 0.604 | 0.600 | 0.623 | 0.711 | 0.672
Poi |  | 0.669 | 0.639 | 0.652 | 0.734 | 0.722
Xerces |  | 0.185 | 0.128 | 0.167 | 0.243 | 0.262
Jedit | 0.560 | 0.550 | 0.468 | 0.500 |  | 0.595
Xalan |  | 0.638 | 0.628 | 0.623 | 0.639 | 0.606
Synapse | 0.477 | 0.414 | 0.303 | 0.360 | 0.424 | 
W/T/L |  | 7/0/0 | 7/0/0 | 7/0/0 | 6/0/1 | 5/0/2
Average |  | 0.494 | 0.439 | 0.465 | 0.546 | 0.550
AUC comparison of different models.
Project | DP-ARNN | RF | RBM + RF | DBN + RF | CNN | RNN
---|---|---|---|---|---|---
Camel |  | 0.677 | 0.674 | 0.654 | 0.732 | 0.766
Lucene | 0.680 | 0.641 | 0.679 | 0.682 | 0.688 | 
Poi |  | 0.636 | 0.657 | 0.668 | 0.745 | 0.764
Xerces |  | 0.576 | 0.579 | 0.560 | 0.671 | 0.730
Jedit | 0.820 | 0.797 | 0.797 | 0.794 | 0.841 | 
Xalan | 0.674 | 0.674 |  |  | 0.674 | 0.654
Synapse | 0.645 |  | 0.646 | 0.657 | 0.632 | 0.648
W/T/L |  | 5/1/1 | 5/0/2 | 4/0/3 | 4/1/2 | 4/0/3
Average |  | 0.669 | 0.673 | 0.670 | 0.712 | 0.728
We first compare three deep learning methods (i.e., CNN, RNN, and DP-ARNN) with two traditional machine learning methods (i.e., RF and RBM + RF). RF is a traditional feature-based method with static code metrics, and RBM + RF is a method which first builds a shallow two-layer network (i.e., a visible layer and a hidden layer) to generate hidden features and then feeds them into RF for classification. This comparison is to verify the superiority of deep learning methods in the field of software defect prediction. We conduct the experiments on the projects listed in Table
Table
Table
Based on the analysis above, we come to a conclusion that deep learning methods are superior to traditional machine learning methods for software defect prediction.
To further demonstrate that features generated by deep learning methods are generally better than typical unsupervised feature extraction methods, we construct an RBM model and a DBN model to extract features from ASTs of programs and feed them into RF for classification. The difference between RBM and DBN is that the former is a two-layer shallow neural network, and the latter is a network that consists of multiple RBMs.
By comparing the average F1-measure of RBM + RF and DBN + RF on the seven projects, we can see that DBN + RF achieves a higher average F1-measure than RBM + RF, indicating that the information in programs' ASTs can be mined more deeply. From the perspective of W/T/L, compared with DBN + RF, DP-ARNN and CNN win 7 times on F1-measure, and RNN wins 6 times, validating the stability of the deep learning-based models. As for AUC, the average values of DP-ARNN, CNN, and RNN are all higher than that of DBN + RF, which means the comprehensive discrimination ability of deep learning methods outperforms that of unsupervised learning methods. These results validate the superiority of features extracted by deep learning methods, especially our proposed DP-ARNN.
In this section, we compare the performance of our proposed DP-ARNN method with other deep learning methods, including CNN and RNN. We construct a convolutional neural network and a recurrent neural network as our deep learning baseline methods. We implement one-dimensional convolution on elements in each encoded vector in CNN. For RNN, we adopt LSTM as the basic unit and then construct a Bi-LSTM network without attention mechanism.
From the perspective of W/T/L, compared with CNN and RNN, our proposed DP-ARNN wins 6 times and 5 times, respectively, on F1-measure. This indicates that, in terms of the stability of software defect prediction, DP-ARNN has better performance than CNN and RNN. As for AUC, Figure
The ROC curves of (a) CNN, (b) RNN, and (c) DP-ARNN, respectively.
These results answer RQ3: compared with the typical convolutional and recurrent neural networks, our proposed DP-ARNN can better learn the key syntactic and semantic features of programs with the help of the attention mechanism, and it performs best.
In this section, we discuss how we tune the key parameters in DP-ARNN to achieve the best defect prediction performance. Considering the cost of training time, we select only a subset of the projects to tune the parameters. We first choose the 90th percentile of AST vector length in the projects as the length of each AST vector. Then we select a suitable dimensionality for the embedding vectors, making a trade-off between model precision and training cost; empirically, it ranges from 20 to 150. After that, we set the batch size to 32 heuristically, and the appropriate number of epochs is determined by early stopping: training stops when the error of the current model on the validation set is worse than that of the previous epoch, and we keep the parameters of the previous epoch as the final parameters of the model. More importantly, there are three crucial parameters in our proposed DP-ARNN: the number of Bi-LSTM units per layer, the number of 1st hidden layer nodes, and the number of 2nd hidden layer nodes. We use F1-measure as the evaluation index. Finally, we calculate the average F1-measure of the projects under different parameter values and choose the values at which the average curve reaches its peak.
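The early-stopping rule described above can be sketched as a small training loop; `train_step` and `val_error` are hypothetical stand-ins for one training pass and the validation-set error.

```python
def train_with_early_stopping(train_step, val_error, max_epochs=20):
    """Stop when validation error worsens; keep the previous parameters."""
    best_params, best_err = None, float("inf")
    for epoch in range(max_epochs):
        params = train_step(epoch)      # one pass over the training data
        err = val_error(params)         # error on the validation set
        if err >= best_err:             # got worse: roll back and stop
            return best_params, epoch
        best_params, best_err = params, err
    return best_params, max_epochs

# Toy stand-ins: validation error falls until epoch 3, then rises.
errors = [0.50, 0.40, 0.32, 0.30, 0.35, 0.45]
params, stopped = train_with_early_stopping(
    train_step=lambda e: f"params@{e}",
    val_error=lambda p: errors[int(p.split("@")[1])])
print(params, stopped)  # params@3 4 (stops at the first worse epoch)
```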
In our experiments, we select
F1-measure of DP-ARNN under different parameter settings. Different numbers of (a) LSTM units, (b) 1st hidden layer nodes, and (c) 2nd hidden layer nodes.
Tuned parameters for DP-ARNN.
Parameter | Description (value) |
---|---|
Embedding_dim | The dimensionality of embedding vectors (30) |
Vector_length | The length of each AST vector (2000) |
Bi-LSTM units | The number of the Bi-LSTM units per layer (40) |
1st hidden layer nodes | The number of 1st hidden layer nodes (16) |
2nd hidden layer nodes | The number of 2nd hidden layer nodes (24) |
Batch_size | The number of training samples that propagated through DP-ARNN at a time (32) |
Epoch | One forward/backward pass of all the training samples (20) |
Monitor | The evaluation criteria on the validation set (val_acc) |
Loss function | The loss function to minimize (binary_crossentropy) |
Optimizer | The loss function solver (RMSprop) |
Activation | Types of activation used in fully connected layers (tanh, linear, and sigmoid) |
As the scale and complexity of modern software continue to increase, software reliability has become an important indicator of software quality. To enhance software reliability, in this paper, we propose a deep learning-based method called DP-ARNN (defect prediction via attention-based recurrent neural network), as an aid to software testing and code review, to predict potential code defects in software. Specifically, DP-ARNN leverages RNN to automatically generate syntactic and semantic features from source code. Furthermore, we employ the attention mechanism to capture crucial features, which further improves our defect prediction performance. Our experiments on seven open-source projects indicate that, on average, DP-ARNN improves on the state-of-the-art baseline methods by 14% on F1-measure and 7% on AUC in software defect prediction.
To further evaluate the generality of DP-ARNN for defect prediction, in the future, we will conduct experiments on more projects, including personal and company projects. Meanwhile, we will apply our method to other programming languages such as Python, JavaScript, and C++ to verify its effectiveness. Moreover, we will try to embed static code attributes into DP-ARNN and test whether the performance of defect prediction can be further improved.
There are two different datasets, including the source code and static code metrics of the seven open-source Java projects. The source code of these projects from Apache is available at
There are no conflicts of interest regarding the publication of this paper.
This work was partially supported by the NSF of China under Grant nos. 61772200 and 61702334, Shanghai Pujiang Talent Program under grants no. 17PJ1401900, Shanghai Municipal Natural Science Foundation under Grant nos. 17ZR1406900 and 17ZR1429700, Educational Research Fund of ECUST under Grant no. ZH1726108, and the Collaborative Innovation Foundation of Shanghai Institute of Technology under Grant no. XTCX2016-20.