Prediction of Defective Software Modules Using Class Imbalance Learning

Software defect predictors are useful for maintaining the quality of software products effectively. Early prediction of defective software modules helps software developers allocate the available resources to deliver high quality software products. The objective of a software defect prediction system is to find as many defective software modules as possible without affecting the overall performance. The learning process of a software defect predictor is difficult due to the imbalanced distribution of software modules between defective and nondefective classes. Misclassifying a defective software module generally incurs a much higher cost than misclassifying a nondefective one. Therefore, taking the misclassification cost into account, we have developed a software defect prediction system using Weighted Least Squares Twin Support Vector Machine (WLSTSVM). This system assigns a higher misclassification cost to the data samples of the defective class and a lower cost to the data samples of the nondefective class. Experiments on eight software defect prediction datasets support the validity of the proposed defect prediction system. The significance of the results has been tested using the nonparametric Wilcoxon signed rank test.


Introduction
Software Development Life Cycle (SDLC) consists of five phases: Analysis, Design, Implementation, Test, and Maintenance. These phases should be carried out effectively in order to deliver a bug-free and high quality software product to the end users. Developing a defect-free software product is a very challenging task because unknown bugs or unforeseen deficiencies can occur even if all the guidelines of software project development are followed carefully. Early prediction of defective software modules helps the software project manager to effectively utilize resources such as people, time, and budget to develop high quality software [1][2][3][4]. Identifying defective software modules is a major concern in the software industry, and doing so facilitates further software evolution and maintenance. Software project managers, quality managers, and software developers monitor, detect, and correct software defects in all phases of the software development life cycle in order to deliver quality software on time and within budget. The quality of a software product is highly correlated with the absence or presence of defects [5,6]. A software defect is an error or deficiency in a software process which occurs due to incorrect programming logic, miscommunication of requirements, lack of coding experience, poor software testing skill, and so forth. Defective software modules generate wrong output and lead to a poor quality software product, which further increases the development and maintenance cost and causes customer dissatisfaction [1,2]. In the last two decades, researchers have addressed the software defect prediction problem by applying several statistical and machine learning techniques. Software defect data suffers from the class imbalance problem due to the skewed distribution of defective and nondefective software modules [7][8][9][10][11]. Most machine learning algorithms assume an equal distribution of data samples in each class and assume that the misclassification cost of each class is equally important. However, the misclassification cost of data samples of the minority class is higher than that of the majority class in most cases [12]. In software defect prediction, predicting a defective software module as nondefective can increase the cost of maintenance, while the opposite case, in which a nondefective module is predicted as defective, involves unnecessary testing activities. The latter is generally more acceptable than the former. Hence, the objective of this research work is to consider the different misclassification cost of each class for the effective prediction of defective software modules.
Software defect prediction requires a binary classifier as it is a two-class classification problem. In recent years, many nonparallel hyperplane Support Vector Machine (SVM) classifiers have been proposed for binary classification [13][14][15]. For example, Mangasarian and Wild proposed the Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM), the first nonparallel hyperplane classifier, which aims to find a pair of nonparallel hyperplanes such that each hyperplane is nearest to one of the two classes and as far as possible from the other class [16]. GEPSVM shows excellent performance on several benchmark datasets, especially the "Cross-Planes" dataset. Later, by combining the concepts of traditional SVM and GEPSVM, Jayadeva et al. proposed a novel nonparallel hyperplane based binary classifier named Twin Support Vector Machine (TWSVM) [13]. TWSVM has shown better performance than SVM and other classifiers, not only in terms of predictive accuracy but also in terms of training time [13,14]. For equally distributed classes, the training process of TWSVM is approximately four times faster than that of SVM, as it solves two smaller Quadratic Programming Problems (QPPs) instead of a single complex QPP as in SVM. TWSVM seeks two nonparallel hyperplanes, one for each class, such that each hyperplane remains in close affinity to its corresponding class while being as far as possible from the other class. Although TWSVM is faster than conventional SVM, it still requires solving two QPPs, which is a complex process. Hence, Arun Kumar and Gopal proposed a binary classifier referred to as Least Squares Twin Support Vector Machine (LSTSVM), which solves two systems of linear equations rather than two QPPs as in TWSVM [17]. It is the least squares variant of TWSVM. LSTSVM has shown its effectiveness over TWSVM in terms of better generalization ability and lower computational time. Therefore, this research work has adopted the LSTSVM classifier for defect prediction in software modules. This study takes the misclassification cost into account and proposes a Weighted Least Squares Twin Support Vector Machine classifier to develop a software defect prediction system that considers the misclassification cost of each class. Experiments on eight software defect prediction datasets taken from the PROMISE repository demonstrate the superiority of our proposed system over existing approaches, including Support Vector Machine (SVM), Cost-Sensitive Neural Network (CBNN), weighted Naive Bayes (NB), Random Forests (RF), Logistic Regression (LR), k-Nearest Neighbor (k-NN), Bayesian Belief Network (BBN), C4.5 Decision Tree, and Least Squares Twin Support Vector Machine (LSTSVM). The effectiveness of the proposed software defect prediction system has also been analyzed by using the nonparametric Wilcoxon signed rank hypothesis test. The statistical inferences are made from the observed differences in the geometric mean.
The paper is organized into five sections. Section 2 summarizes the related work in the field of software defect prediction and class imbalance learning. Section 3 discusses the proposed software defect prediction approach. The results of the experiments are presented and discussed in Section 4, and conclusions are drawn in Section 5.

Related Work
2.1. Class Imbalance Learning. In an imbalanced data distribution, one class (the majority class) contains a large number of data samples compared to the other class (the minority class). Traditional classification algorithms assume a balanced distribution of data samples among classes. The degree of imbalance varies from one problem domain to another, and the correct prediction of data samples in the unusual class becomes more important than the contrary case. In the software defect prediction problem, defective software modules are fewer than nondefective software modules. For this type of problem, software developers take more interest in the correct identification of defective software modules. The failure to identify defective software modules can degrade the software quality. Therefore, a software defect predictor is beneficial if it correctly recognizes the defective software modules.
Class imbalance learning is the process of learning from imbalanced datasets [18]. The challenge of imbalanced data learning is that the unusual class cannot draw as much attention from the learning algorithm as the majority class. For an imbalanced dataset, the learning algorithm generates overly specific classification rules for the unusual class or misses them altogether [18][19][20]. These rules do not generalize well to unseen data and thus are not appropriate for future prediction.
Various solutions have been recommended by researchers to handle the class imbalance problem: data level, algorithmic level, and cost-sensitive solutions. In data level solutions, the training data is manipulated to rebalance the distribution of data among classes, rectifying the effect of class imbalance with resampling techniques such as random oversampling, random undersampling, SMOTE, informed undersampling, and cluster based sampling [20][21][22][23][24][25][26][27]. Data level solutions are more versatile as they are independent of the learning algorithm. In algorithmic level solutions, the learning algorithms modify their training mechanism with the objective of achieving better accuracy on the minority class. One-class learning approaches such as REMED and RIPPER are used to predict the data samples of the minority class [28]. Ensemble learning approaches have also been used for imbalanced data handling. In this approach, a set of classifiers is used for learning and their outputs are combined in order to predict the class of new data samples. Boosting, Random Forest, AdaBoost.NC, SMOTEBoost, and so forth are examples of ensemble learning approaches [29]. Cost-sensitive learning methods consider different misclassification costs for different classes in such a way that the data samples of the minority class gain importance. Cost-Sensitive Decision Tree, Cost-Sensitive Neural Network, and Cost-Sensitive Boosting methods such as AdaCost are some of the approaches proposed to handle the class imbalance learning problem [30][31][32][33]. Cost functions have also been combined with Support Vector Machine and Bayesian classifiers.
2.2. Software Defect Prediction. Researchers have taken great interest in the software defect prediction problem using statistical and machine learning algorithms such as Neural Network, Support Vector Machine, Naive Bayes, Random Forest, Case Based Reasoning, Logistic Regression, and Association Rule Mining [34][35][36][37][38][39][40]. K. O. Elish and M. O. Elish investigated the capability of Support Vector Machine in predicting defective software modules and compared its performance against several statistical and machine learning approaches on four NASA datasets [37]. Czibula et al. developed a system to identify defective software modules using relational association rule mining, which is an extension of association rules [38]. Association rules are used to determine the different types of relations between metrics for defect prediction. Challagulla et al. evaluated the performance of various machine learning approaches and statistical models on four software defect prediction datasets taken from the NASA repository for predicting software quality [41]. Their experiments showed that the combination of 1-rule classification and instance based learning, together with consistency based subset evaluation, achieved the highest defect prediction accuracy among the compared methods. Guo et al. applied Random Forests, an extension of Decision Tree, to identify defective software modules [39]. They performed experiments on five case studies based on NASA datasets and compared the performance of their methodology with the statistical and machine learning approaches of the WEKA and See5 machine learning packages. They concluded that the Random Forest algorithm produced a higher defect prediction rate than the other approaches. Moeyersoms et al. used Data Mining approaches such as Random Forest, Support Vector Regression, C4.5, and Regression Tree [42]. They applied the ALPA rule extraction technique to improve the rule sets in terms of accuracy, fidelity, and recall. Okutan and Yıldız developed a software defect prediction model by using Bayesian Network [43]. This model determines the probabilistic influential relationships between software metrics and defect-prone software modules. Bayesian Network is one of the most widely used approaches to analyze the effect of object-oriented metrics on the number of defects [43][44][45][46][47][48]. Pai and Dugan performed experiments on the KC1 project taken from the NASA repository using Bayesian Network [47]. Fenton et al. used Bayesian Network to predict the defects, quality, and risk of a software system [48]. They analyzed the influence of information variables such as test effectiveness and defects present on the target variable, defects detected. Catal and Diri investigated the effect of dataset size, metrics, and feature selection on the prediction of defective software modules [49]. They conducted experiments on five datasets and found that the Random Forest (RF) algorithm performed better on large datasets while Naive Bayes performed better than RF on small datasets. They also used Artificial Immune System (AIS) algorithms to analyze the effect of the metrics set. Artificial Immune Recognition System (AIRS2Parallel) performs better with method-level metrics while the Immunos2 algorithm shows better results with class-level metrics. They found that the algorithm is a more important component of software defect prediction than the metrics suite. Apart from these basic classification approaches, several optimization approaches such as Genetic Algorithm, Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO) have also been applied to the software defect prediction problem [50][51][52].
The imbalanced distribution of defective and nondefective software modules leads to poor performance of machine learning approaches. To balance the distribution of data samples between classes, various solutions such as oversampling and undersampling methods have been applied. Arar and Ayan proposed a Cost-Sensitive Neural Network based defect prediction system with the objective of handling the class imbalance problem [53]. The Artificial Bee Colony algorithm was used to find the optimal weights. They investigated the performance of their approach on five publicly available datasets taken from the NASA repository. Zheng considered different misclassification costs and developed a software defect prediction model by using Cost-Sensitive Boosting Neural Network [8]. Khoshgoftaar and Gao also studied the impact of data sampling and feature selection on software defect prediction datasets [10,54]. They used a wrapper based feature selection approach to select relevant features and random undersampling to reduce the negative impact of imbalanced data on the performance of the software defect prediction model. Wang and Yao investigated the impact of imbalanced data on software defect prediction learning models [7]. They performed experiments on ten publicly available datasets taken from the PROMISE repository with different types of class imbalance learning approaches such as resampling, ensemble approaches, and threshold moving. Their experiments found that AdaBoost.NC performed better than the other approaches. Jing et al. employed a dictionary learning approach and proposed a cost-sensitive discriminative dictionary learning (CDDL) based software defect prediction model. They analyzed the performance of their model on ten NASA datasets [55].
Apart from these works, various studies have addressed software defect prediction using Data Mining techniques. Researchers have also analyzed the impact of metrics on identifying defect-prone software modules, focusing on the selection of relevant metrics which are useful for defect prediction [52,[56][57][58][59][60][61][62]. The literature indicates that Data Mining plays a crucial role in predicting software defects. The datasets used for defect prediction are highly imbalanced, as the number of defective software modules is usually smaller than the number of nondefective software modules. Therefore, this research work focuses on the imbalanced nature of software defect prediction datasets in order to obtain effective results.

Weighted Least Squares Twin Support Vector Machine
The following conclusions can be drawn from the abovementioned formula:
(1) Cost lies within the range 0 to 1, that is, in the open interval (0, 1), so that the classifier can be trained with convergence.
(3) Lower misclassification cost is assigned to the majority class while the minority class receives higher misclassification cost.
Linear and nonlinear WLSTSVM classifiers are formulated as follows.
3.1. Linear WLSTSVM. Least Squares Twin Support Vector Machine (LSTSVM), proposed by Arun Kumar and Gopal, is a binary classifier which classifies the data samples of two classes by generating a hyperplane for each class [17].
The hyperplanes are constructed in such a way that the data samples of each class lie in close proximity to their corresponding hyperplane while maintaining clear separation from the other hyperplane. For each new data sample, its distance from each hyperplane is calculated and the sample is assigned to the class whose hyperplane lies closer to it. Weighted Least Squares Twin Support Vector Machine is obtained by adding a weight, or misclassification cost, to the formulation of LSTSVM according to (1). Linear WLSTSVM solves the two objective functions (2) and (3) to determine two nonparallel hyperplanes, one for each class.
In the Lagrangian function (5) corresponding to (2), $\alpha > 0$ is a Lagrangian multiplier. The Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions (6) to (9) are obtained by differentiating (5) with respect to $w_1$, $b_1$, $\xi$, and $\alpha$. Equations (6) and (7) lead to (10), which can be rewritten in compact matrix notation as (11). Solving (11) requires a matrix inverse, which sometimes cannot be determined because the matrix is ill-conditioned. To avoid this situation, a regularization term $\epsilon I$ may be added to the matrix before inversion, where $\epsilon > 0$ and $I$ is an identity matrix of suitable dimension; with this term, (11) is rewritten as (12). The Lagrangian multiplier is determined from (8), (9), and (11) as (13). In the same way, the Lagrangian function (14) of (3) is obtained with a Lagrangian multiplier $\beta > 0$, and solving it gives the hyperplane parameters corresponding to class 2 as (15). Hyperplane parameters are obtained using (11).

Algorithm 1.
(1) Define the weight matrix for each class (defective or nondefective) using (1).
(2) Obtain the data matrices, where matrices $X_1$ and $X_2$ comprise the software modules of the defective and nondefective classes or vice versa.
(3) Select the penalty parameters on a validation basis.
(4) Determine the hyperplane parameters using (12) and (15), which are further used to determine the hyperplane for each class.
(5) For a new software module, its class (defective or not) is determined by using the decision function given in (17).
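The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard least squares twin SVM closed form with the slack term of each problem scaled by a per-class cost, the names `fit_linear_wlstsvm` and `predict` are hypothetical, and the class costs in the usage lines are a simple stand-in that lies in (0, 1) and favors the minority class, since formula (1) is not reproduced here.

```python
import numpy as np

def fit_linear_wlstsvm(X1, X2, cost1, cost2, c1=1.0, c2=1.0, eps=1e-8):
    """Return z1 = [w1; b1] and z2 = [w2; b2] for the two hyperplanes.

    cost1 and cost2 are scalar misclassification costs for class 1 and
    class 2 (assumed inputs; the paper derives them from formula (1)).
    """
    E = np.hstack([X1, np.ones((X1.shape[0], 1))])  # [X1 e1]
    F = np.hstack([X2, np.ones((X2.shape[0], 1))])  # [X2 e2]
    I = np.eye(E.shape[1])
    # Hyperplane 1: close to class 1, errors on class 2 weighted by cost2.
    A1 = E.T @ E / c1 + cost2 * (F.T @ F) + eps * I  # eps*I guards against ill-conditioning
    z1 = -np.linalg.solve(A1, cost2 * F.sum(axis=0))
    # Hyperplane 2: close to class 2, errors on class 1 weighted by cost1.
    A2 = F.T @ F / c2 + cost1 * (E.T @ E) + eps * I
    z2 = np.linalg.solve(A2, cost1 * E.sum(axis=0))
    return z1, z2

def predict(X, z1, z2):
    """Assign each row of X to the class whose hyperplane is nearer."""
    d1 = np.abs(X @ z1[:-1] + z1[-1]) / np.linalg.norm(z1[:-1])
    d2 = np.abs(X @ z2[:-1] + z2[-1]) / np.linalg.norm(z2[:-1])
    return np.where(d1 <= d2, 1, 2)

# Toy, synthetic data (not from the NASA datasets): 40 majority, 10 minority samples.
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 0.5, size=(40, 2))  # class 1
X_minor = rng.normal(5.0, 0.5, size=(10, 2))  # class 2
n1, n2 = len(X_major), len(X_minor)
# Assumed stand-in costs: each class is weighted by the other class's share.
cost1, cost2 = n2 / (n1 + n2), n1 / (n1 + n2)
z1, z2 = fit_linear_wlstsvm(X_major, X_minor, cost1, cost2)
labels = predict(np.array([[0.0, 0.0], [5.0, 5.0]]), z1, z2)
```

Solving the two small linear systems with `np.linalg.solve` avoids forming an explicit inverse, and the `eps` term plays the role of the regularization described above.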

3.2. Nonlinear WLSTSVM.
Nonlinear WLSTSVM is obtained by using the kernel trick. A kernel function maps the data samples into a higher-dimensional feature space in order to make separation easier. The WLSTSVM classifier generates the kernel surfaces $K(x^T, C^T)u_1 + b_1 = 0$ and $K(x^T, C^T)u_2 + b_2 = 0$ in that space instead of hyperplanes. Here, $K$ is an appropriately chosen kernel function and $C = [X_1\; X_2]^T$. With suitable matrix notation, the kernel generated surface parameters are obtained in closed form. These parameters generate the kernel surfaces, and the class is assigned to a new data sample depending on its distance from each kernel surface.
The algorithm of the nonlinear WLSTSVM classifier is similar to that of the linear WLSTSVM classifier except that a kernel function must be chosen. The kernel function transforms the data samples into a higher-dimensional feature space, and the kernel generated surface parameters are then calculated using (20) and (22). The class is assigned to new data samples using (24).
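The kernel variant can be sketched the same way. The block below is an illustrative sketch only: it assumes an RBF kernel, scalar per-class costs, the usual least squares twin SVM closed form with kernel matrices in place of the data matrices, and an unnormalized distance $|K(x^T, C^T)u + b|$ in the decision rule. The function names, `gamma`, and `eps` are hypothetical choices, not values from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_wlstsvm(X1, X2, cost1, cost2, c1=1.0, c2=1.0, gamma=0.5, eps=1e-3):
    """Return C and the surface parameters z1 = [u1; b1], z2 = [u2; b2]."""
    C = np.vstack([X1, X2])  # all training samples, class 1 first
    G = np.hstack([rbf_kernel(X1, C, gamma), np.ones((len(X1), 1))])  # [K(X1, C) e1]
    H = np.hstack([rbf_kernel(X2, C, gamma), np.ones((len(X2), 1))])  # [K(X2, C) e2]
    I = np.eye(G.shape[1])
    z1 = -np.linalg.solve(G.T @ G / c1 + cost2 * (H.T @ H) + eps * I,
                          cost2 * H.sum(axis=0))
    z2 = np.linalg.solve(H.T @ H / c2 + cost1 * (G.T @ G) + eps * I,
                         cost1 * G.sum(axis=0))
    return C, z1, z2

def predict_kernel(X, C, z1, z2, gamma=0.5):
    """Assign each row of X to the class with the nearer kernel surface."""
    K = np.hstack([rbf_kernel(X, C, gamma), np.ones((len(X), 1))])
    return np.where(np.abs(K @ z1) <= np.abs(K @ z2), 1, 2)

# Toy, synthetic data: 40 majority (class 1) and 10 minority (class 2) samples.
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 0.5, size=(40, 2))
X2 = rng.normal(5.0, 0.5, size=(10, 2))
X_train = np.vstack([X1, X2])
y_train = np.array([1] * 40 + [2] * 10)
C, z1, z2 = fit_kernel_wlstsvm(X1, X2, cost1=0.2, cost2=0.8)
train_accuracy = (predict_kernel(X_train, C, z1, z2) == y_train).mean()
```

Because the kernel system has one unknown per training sample plus a bias, the regularization term matters more here than in the linear case.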

Dataset Description and Performance Measurements.
In this study, we have performed the experiments on eight benchmark datasets taken from the PROMISE repository [63]. These datasets are NASA MDP software projects which were developed in C/C++ for spacecraft instrumentation, satellite flight control, scientific data processing, and storage management of ground data. The detailed description of each dataset is given in Table 1.
The imbalance ratio represents the ratio of the majority class (number of nondefective software modules) to the minority class (number of defective software modules). It is clear that the software defect prediction datasets are imbalanced, as the number of defective software modules is smaller than the number of nondefective software modules.
A brief description of twenty-one common basic software metrics, selected from the forty metrics of the eight defect prediction datasets, such as lines of code, cyclomatic complexity, volume, difficulty, and number of operators and operands, is provided in Table 2. A more detailed description of the other metrics and of the NASA datasets can be obtained from [63].
Performance evaluation model of proposed software defect prediction system is shown in Figure 1.
True Prediction (True Positive (TP) or True Negative (TN)) refers to the number of software modules which are correctly predicted (as defective or nondefective), while False Prediction (False Positive (FP) or False Negative (FN)) indicates the number of software modules which are incorrectly recognized as defective or nondefective. The performance of the proposed software defect prediction model is evaluated by using the geometric mean. Geometric mean (G-Mean) is a performance evaluation metric proposed by Kubat and Matwin for binary classification problems [64]. It is usually used to evaluate the performance of a classifier on imbalanced data. We have also compared the performance of the proposed software defect predictor using precision and F-measure, which are defined as $\text{precision} = TP/(TP + FP)$ and $F\text{-measure} = 2 \cdot \text{precision} \cdot \text{sensitivity}/(\text{precision} + \text{sensitivity})$, where $\text{sensitivity} = TP/(TP + FN)$.
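As a small worked example, the measures used in this section can be computed directly from the four confusion matrix counts. The sketch assumes the usual convention that the defective class is the positive class, and the counts in the usage line are made up for illustration; `defect_metrics` is a hypothetical helper name.

```python
import math

def defect_metrics(tp, fn, tn, fp):
    """Compute the evaluation measures from confusion matrix counts.

    Assumes the defective class is the positive class.
    """
    sensitivity = tp / (tp + fn)   # share of defective modules found
    specificity = tn / (tn + fp)   # share of nondefective modules kept
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }

# Illustrative counts only: 40 defective and 100 nondefective modules.
m = defect_metrics(tp=30, fn=10, tn=80, fp=20)
```

With these counts the sensitivity is 0.75 and the specificity is 0.8, so the G-Mean is the square root of 0.6; a classifier that ignored the minority class entirely would score a G-Mean of zero regardless of its accuracy.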

Parameters Selection.
The proposed WLSTSVM classifier used for software defect prediction has two penalty parameters, $c_1$ and $c_2$. In this research work, we have analyzed the influence of these penalty parameters, which are used in problems (13) and (16), on the performance of the proposed system. The performance of a classifier is affected by the selection of these parameters. This study has used the Grid Search approach for optimal parameter selection. The penalty parameters are selected from the following range: $c_1, c_2 \in \{10^{-4}, \ldots, 10^{2}\}$.
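A grid search over this range can be sketched as follows. The sketch assumes the grid consists of the seven powers of ten from $10^{-4}$ to $10^{2}$ (the intermediate values are not listed in the text), and `toy_score` is a hypothetical stand-in for the cross-validated G-Mean that a real run would compute.

```python
import math
from itertools import product

def grid_search(score_fn, exponents=range(-4, 3)):
    """Return the (c1, c2) pair with the highest score, and that score."""
    grid = [10.0 ** e for e in exponents]  # assumed: powers of ten only
    best = max(product(grid, grid), key=lambda pair: score_fn(*pair))
    return best, score_fn(*best), grid

def toy_score(c1, c2):
    """Hypothetical score surface peaking at c1 = 1, c2 = 0.1."""
    return -(math.log10(c1) ** 2) - (math.log10(c2) + 1.0) ** 2

best, best_score, grid = grid_search(toy_score)
```

In practice `score_fn` would train the classifier with the given penalty parameters and return the mean G-Mean over the validation folds.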
Figure 2 shows the influence of these parameters on the G-Mean of the proposed software defect prediction system for KC1, KC2, CM1, and PC4 datasets. From the figure, it is clear that the proposed system performs best with high values of $c_1$ and $c_2$ ($c_1 = 1$, $c_2 = 1$, 77.78%) for KC2 dataset. For CM1 dataset, the proposed system achieves the highest geometric mean with a high value of $c_1$ and a low value of $c_2$ ($c_1 = 1$, $c_2 = 0.1$, 71.98%). For KC1 dataset, WLSTSVM gains the highest geometric mean with high values of $c_1$ and $c_2$ ($c_1 = 1$, $c_2 = 10$, 71.73%).
On the other hand, for PC4 dataset, the proposed defect predictor obtains a better geometric mean with low values of $c_1$ and $c_2$ ($c_1 = 0.0001$, $c_2 = 0.01$, 75.05%). It is observed that the impact of these parameters on the G-Mean is different for each dataset and that a proper choice of these parameters can improve the performance of the software defect prediction system to a great extent. Therefore, a proper combination of these parameters is also needed for the other datasets so that the software defect prediction system can achieve better predictive performance.
The experiment is performed by using the 10-fold cross validation method, in which each dataset is randomly divided into ten equal sized subsets. Each time, nine subsets are used as the training data for learning and the remaining subset is used as the testing data for the evaluation of the defect prediction system. This process is repeated ten times so that each of the ten subsets is used exactly once as the testing data. The final performance of the defect prediction system is estimated by averaging the results of the ten folds. Tables 3-7 show the performance comparison in terms of sensitivity, specificity, precision, F-measure, and geometric mean (G-Mean) of our proposed approach with other existing approaches on the 8 software defect prediction datasets.
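The 10-fold procedure described above can be sketched with the standard library alone. This is an illustrative sketch: the function names are hypothetical, the split is a plain random partition without stratification, and `evaluate` stands for whatever routine trains on the training indices and returns a score on the test indices.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal sized subsets
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the per-fold scores returned by evaluate(train, test)."""
    scores = [evaluate(train, test) for train, test in kfold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

splits = list(kfold_indices(25, k=10))
mean_test_size = cross_validate(25, lambda train, test: len(test))
```

On heavily imbalanced data a purely random split can leave a fold with very few defective modules, so a stratified split is a common alternative; the text does not state which variant was used.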
The results include the mean of sensitivity, specificity, precision, F-measure, and geometric mean over the 10 folds. In Tables 3-7, we report the best performance of each approach. Bold figures indicate the best predictive performance for each dataset. From Table 3, it is clear that the proposed WLSTSVM based software defect predictor obtains the highest sensitivity for CM1, PC1, PC4, MC2, and KC2 datasets. WLSTSVM gains the highest precision value ...
Wilcoxon signed rank test performs pairwise comparison of two approaches used for software defect prediction and analyzes the differences between their performances on each dataset [65][66][67]. The differences are ranked according to their absolute values from the smallest to the largest, and average ranks are given in the case of ties. The test stores the sums of ranks in $R^+$ and $R^-$, where $R^+$ is the sum of ranks for the datasets on which the WLSTSVM classifier has shown better performance than the other classifier and $R^-$ is the sum of ranks for the opposite. It determines whether a hypothesis about the comparison of software defect predictors can be rejected at a specified significance level $\alpha$. The $p$ value is also computed for each comparison; it is the lowest level of significance at which the hypothesis is rejected. In this manner, it can be determined whether two software defect predictors are significantly different or the same.

Conclusion
The class imbalance problem often occurs in software engineering and other real world applications, and it deteriorates the performance of machine learning approaches, which assume an equal distribution of data samples among classes and an equally important misclassification cost for each class. It is essential to incorporate misclassification costs into software defect prediction models, as misclassifying a defective software module incurs a much higher cost than misclassifying a nondefective one. In this study, we have therefore developed a software defect prediction system by using Weighted Least Squares Twin Support Vector Machine (WLSTSVM). In this approach, a misclassification cost is assigned to the software modules of each class in order to compensate for the negative effect of imbalanced data on the performance of software defect prediction. The performance of the proposed WLSTSVM classifier is compared with nine algorithms on eight software defect prediction datasets. Experimental results demonstrate the effectiveness of our approach for the software defect prediction task. This study also performs a statistical analysis of the performance of each classifier by using the Wilcoxon signed rank test. The test shows that the differences between WLSTSVM and the compared approaches are statistically significant. Parameter selection is an important issue to be addressed in the future, as the parameters affect the prediction results to a certain extent. Selection of relevant features is another concern which should be addressed to improve the performance of the software defect prediction system.

Figure 2: The influence of penalty parameters on G-Mean of WLSTSVM.
Only a few studies have considered the misclassification cost of defective and nondefective software modules. This research work has used Weighted Least Squares Twin Support Vector Machine (WLSTSVM) to develop an effective software defect prediction model in which a different misclassification cost, or weight, is assigned to each class according to its sample distribution. Let the training dataset contain $N$ data samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, N$, denotes a feature vector and $y_i \in \{1, 2\}$ represents the corresponding class label. Suppose the sizes of class 1 and class 2 are $N_1$ and $N_2$, respectively, where $N = N_1 + N_2$. Let matrices $X_1 \in \mathbb{R}^{N_1 \times n}$ and $X_2 \in \mathbb{R}^{N_2 \times n}$ consist of the data samples of class 1 and class 2, respectively. The appropriate selection of cost is an important issue. The weight or misclassification cost is determined for each class according to formula (1).

In the linear formulation, $w_1$ and $w_2$ are two normal vectors to the hyperplanes and $b_1$ and $b_2$ are bias terms. $c_1$ and $c_2$ represent nonnegative penalty parameters. $e_1 \in \mathbb{R}^{N_1}$ and $e_2 \in \mathbb{R}^{N_2}$ are vectors of 1's, and $\xi \in \mathbb{R}^{N_2}$ and $\eta \in \mathbb{R}^{N_1}$ are slack variables. $D_1 \in \mathbb{R}^{N_2 \times N_2}$ and $D_2 \in \mathbb{R}^{N_1 \times N_1}$ represent the diagonal matrices containing the misclassification cost for the data samples of class 2 and class 1, respectively, according to (1). The first term of the objective function in (2) measures the sum of squared distances of the data samples of class 1 from the hyperplane. Minimizing it keeps the hyperplane in close proximity to class 1. The second term of the objective function minimizes the misclassification error due to the data samples of class 2. In this way, the hyperplane is kept near the data samples of class 1 and as far as possible from the data samples of class 2. The Lagrangian function corresponding to (2) is given by (5).

With $C = [X_1\; X_2]^T$, the nonlinear WLSTSVM classifier is constructed as

$$\min_{(u_1, b_1, \xi)} \; \frac{1}{2}\left\| K(X_1, C^T)u_1 + e_1 b_1 \right\|^2 + \frac{c_1}{2}\,\xi^T D_1 \xi \quad \text{s.t.} \quad -\left(K(X_2, C^T)u_1 + e_2 b_1\right) + \xi = e_2,$$

$$\min_{(u_2, b_2, \eta)} \; \frac{1}{2}\left\| K(X_2, C^T)u_2 + e_2 b_2 \right\|^2 + \frac{c_2}{2}\,\eta^T D_2 \eta \quad \text{s.t.} \quad \left(K(X_1, C^T)u_2 + e_1 b_2\right) + \eta = e_1.$$

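Table 2: Details of software metrics.

G-Mean measures the balanced performance of a software defect prediction approach in the imbalanced data distribution scenario. It is defined as $G\text{-Mean} = \sqrt{\text{sensitivity} \times \text{specificity}}$.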

Table 3: Comparison on the basis of sensitivity values on eight datasets.

Table 4: Comparison on the basis of specificity values on eight datasets.

Table 5: Comparison on the basis of precision values on eight datasets.

Table 6: Comparison on the basis of F-measure values on eight datasets.

Table 7: Comparison on the basis of geometric mean values on eight datasets.

Table 8: Result of Wilcoxon signed rank test.

Beyond establishing whether two software defect predictors are significantly different or the same, the test also determines how different they are. In order to conduct the Wilcoxon signed rank test, we have performed pairwise comparison of software defect predictors in which the performance of WLSTSVM is compared with every other approach. Ranks and the $p$ value are calculated for each case. The statistical inferences are made from the observed differences in the geometric mean, as it evaluates the balanced performance of a classifier in the imbalanced learning scenario. The results obtained from the Wilcoxon signed rank test are shown in Table 8. It is observed from the table that the $p$ value is less than 0.05 in all the cases; that is, the proposed software defect predictor outperforms all of them with a high degree of confidence in each case.
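The ranking step of the Wilcoxon signed rank test described above can be illustrated with a short sketch. It computes only the rank sums $R^+$ and $R^-$, not the $p$ value; it drops zero differences, which is one common convention and an assumption here; and the G-Mean values in the usage lines are invented for illustration, not taken from Table 7. The name `signed_rank_sums` is hypothetical.

```python
def signed_rank_sums(scores_a, scores_b, ndigits=10):
    """Return (R+, R-) for paired scores, using average ranks for ties."""
    # Round to suppress floating point noise before comparing magnitudes.
    diffs = [round(a - b, ndigits) for a, b in zip(scores_a, scores_b)]
    diffs = [d for d in diffs if d != 0]  # assumed convention: drop zeros
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus

# Hypothetical G-Mean values of two predictors on five datasets.
proposed = [0.75, 0.70, 0.68, 0.72, 0.80]
baseline = [0.70, 0.72, 0.65, 0.70, 0.70]
r_plus, r_minus = signed_rank_sums(proposed, baseline)
```

For these values the differences are 0.05, -0.02, 0.03, 0.02, and 0.10, so the two differences of magnitude 0.02 share the average rank 1.5, giving $R^+ = 13.5$ and $R^- = 1.5$. The smaller of the two sums is then compared against the critical value for the chosen significance level.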