This paper investigates how kernel methods can be used for software defect prediction, since class imbalance can greatly reduce the performance of defect predictors. Two classifiers, the asymmetric kernel partial least squares classifier (AKPLSC) and the asymmetric kernel principal component analysis classifier (AKPCAC), are proposed to solve the class imbalance problem. They are obtained by applying a kernel function to the asymmetric partial least squares classifier and the asymmetric principal component analysis classifier, respectively; in both cases the kernel is the Gaussian function. Experiments on NASA and SOFTLAB data sets using the F-measure, Friedman’s test, and Tukey’s test confirm the validity of our methods.
1. Introduction
Software defect prediction is an essential part of software quality analysis and has been extensively studied in the domain of software-reliability engineering [1–5]. However, as pointed out by Menzies et al. [2] and Seiffert et al. [4], the performance of defect predictors can be greatly degraded by the class imbalance of real-world data sets. Here “class imbalance” means that the majority of defects in a software system are located in a small percentage of the program modules. Current approaches to the class imbalance problem can be roughly categorized as data-level or algorithm-level, as reported in [4]. That study shows that the algorithm-level method AdaBoost almost always outperforms even the best data-level methods in software defect prediction. AdaBoost is a typical adaptive algorithm that has received great attention since Freund and Schapire’s proposal [6]. It attempts to reduce the bias generated by majority-class data by updating the weights of instances dynamically according to the errors of previous learning rounds. Other studies improved dimension-reduction methods for the class imbalance problem by means of partial least squares (PLS) [7], linear discriminant analysis (LDA) [8], and principal component analysis (PCA) [9, 10]. Although PLS was not inherently designed for classification and discrimination, it is widely used in many areas that require class discrimination. The authors of [7] reported that PLS is rarely followed by an actual discriminant analysis on the scores and that the classification rule is rarely given a formal interpretation, yet the method often produces good separation. Building on this work, Qu et al. investigated the effect of PLS in unbalanced pattern classification and showed that, beyond dimension reduction, PLS generates favorable features for classification.
Thereafter, they proposed an asymmetric partial least squares (APLS) classifier to deal with the class imbalance problem and illustrated that it outperforms other algorithms because it can extract favorable features for unbalanced classification. PCA, in turn, is an effective linear transformation that maps high-dimensional data to a lower dimensional space. Based on PCA, the authors of [11] proposed kernel principal component analysis (KPCA), which performs a nonlinear mapping Φ(x) to transform an input vector into a higher dimensional feature space; a kernel function κ(x, y) is introduced to avoid computing the nonlinear mapping explicitly, and linear PCA is then applied in this feature space.
While both APLS and KPCA are of great value, each has its own disadvantages. The APLS classifier is a bilinear classifier, in which the data are mapped to a bilinear subspace; this is, to some degree, obscure and not easy to implement. The KPCA regression model does not consider the correlation between the principal components and the class attribute, and PCA dimension reduction is inevitably affected by an asymmetric class distribution. In this paper, we propose two kernel-based learning methods to solve the class imbalance problem, the asymmetric kernel partial least squares classifier (AKPLSC) and the asymmetric kernel principal component analysis classifier (AKPCAC). The former nonlinearly extracts favorable features and retrieves the loss caused by the class imbalance problem, while the latter is more adaptive to imbalanced data sets.
We should also explain the relationship between this paper and our previous papers [12, 13], in which the AKPLSC and AKPCAC were first proposed, respectively. As we continued this work, we found some errors; because of them, the AKPLSC and AKPCAC of [12, 13] show superiority on only part of the data sets. We carefully rectified the source code and retested both classifiers on the whole collection of data sets using statistical tools such as Friedman’s test and Tukey’s test. The outcomes show that our classifiers indeed outperform the others, namely, APLSC, KPCAC, AdaBoost, and SMOTE. This paper presents the corrected theory and experimental results in more detail.
2. State of the Art
In software defect prediction, L={(x1,y1),(x2,y2),…,(xℓ,yℓ)}⊂X×Y denotes the labeled example set of size ℓ and U={xℓ+1,xℓ+2,…,xℓ+u}⊂X denotes the unlabeled example set of size u. For labeled examples, Y={+1,-1}: defective modules are labeled “+1” and nondefective modules are labeled “−1”. Software defect data sets are highly imbalanced; that is, the examples of the minority class (defective modules) are heavily underrepresented in comparison to the examples of the majority class (nondefective modules). Many algorithms have therefore been proposed to cope with this problem, as described below.
2.1. Software Defect Predictor Related to Partial Least Squares
Linear partial least squares (PLS) [7] is an effective linear transformation, which performs the regression on the subset of extracted latent variables. Kernel PLS [14] first performs nonlinear mapping, Φ:{xi}i=1n∈ℝN→Φ(x)∈𝔽, to project an input vector to a higher dimensional feature space, in which the linear PLS is used.
Given the class centers M, the class-region radii r, and the overlap parameter η, the relationship between the two classes can be expressed as M+1-M-1=η(r+1+r-1). The parameter η indicates the level of overlap between the regions of the two classes (the smaller the value of η, the higher the overlap).
APLSC can be expressed as Y^=sign(∑i=1kmiti-b), which is derived from the regression model of the linear PLS, y^=∑i=1kmiti, where k is the number of the latent variables, ti is the ith score vector of testing data, mi indicates the direction of ith score, and the bias b is equal to m1(M+1-r+1η).
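As a concrete illustration, the bias can be computed from the first-component scores of the two classes. The sketch below is a hypothetical minimal implementation, not the authors’ code: it assumes m1 = 1, takes a class radius to be the largest distance of that class’s scores from its center (an assumed definition), and uses the overlap relation in the form consistent with the closed-form bias of equation (4) below.

```python
import numpy as np

def aplsc_bias(t1, y):
    """Sketch of the APLSC bias b = M+1 - r+1*eta (with m1 = 1 assumed),
    estimated from first-component scores t1 and labels y in {+1, -1}."""
    t_pos, t_neg = t1[y == 1], t1[y == -1]
    M_pos, M_neg = t_pos.mean(), t_neg.mean()
    r_pos = np.abs(t_pos - M_pos).max()   # assumed radius: max deviation from center
    r_neg = np.abs(t_neg - M_neg).max()
    eta = (M_pos - M_neg) / (r_pos + r_neg)   # overlap parameter
    # equals (M_pos*r_neg + M_neg*r_pos) / (r_neg + r_pos), the closed form of (4)
    return M_pos - r_pos * eta
```

On toy scores with class centers 3 and −2 and both radii equal to 1, this returns the closed-form value (3·1 + (−2)·1)/2 = 0.5.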
APLSC suffers from high overlap, especially when the data sets are not linearly separable [15]. One way to solve this overlap problem is to use a kernel method. Kernel PLS [14] corresponds to solving the following eigenvalue equation:
(1)ΦΦTΨΨTτ=λτ,
where Φ and Ψ denote the matrix of mapped X-space data Φ(x) and the matrix of mapped Y-space data Ψ(y) in the feature space 𝔽, respectively. Nonlinear feature selection methods can reduce the overlap between the two classes, but the class imbalance problem still makes them fail to distinguish the minority class [15]. In order to retrieve the loss caused by class imbalance, we derive the bias b^ of the kernel PLS classifier (KPLSC) [14].
Different from the APLSC, the kernel PLS regression is y^=∑i=1ℓαiκ(xi,x), where ℓ is the size of the labeled example set, κ(xi,x) is a kernel function, and αi is the ith dual regression coefficient. Consequently, we may combine the APLSC with kernel PLS to obtain the asymmetric kernel PLS classifier, as will be seen in Section 3.1.
2.2. Kernel Principal Component Analysis Classifier for Software Defect Prediction
Principal Component Analysis (PCA) [10] is an effective linear transformation, which maps high-dimensional data to a lower dimensional space. Kernel principal component analysis (KPCA) [11] first performs nonlinear mapping Φ(x) to transform an input vector to a higher dimensional feature space. And then linear PCA is used in this feature space.
For both algorithms [10, 11], the input data are centered in the original space and in the transformed high-dimensional space; that is, ∑i=1ℓxi=0 and ∑i=1ℓΦ(xi)=0, where ℓ is the number of labeled instances and xi is the ith instance of the data set. In PCA, the correlation matrix C=(1/ℓ)∑i=1ℓxixi′ is diagonalized, while in KPCA the correlation matrix CΦ=(1/ℓ)∑i=1ℓΦ(xi)Φ(xi)′ is diagonalized. The latter is equivalent to solving the eigenvalue problem λV=CΦV, where λ is an eigenvalue and V is a matrix of eigenvectors; it can also be written as ℓλα=Kα, where K=ℓCΦ is the kernel matrix.
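The eigendecomposition step can be sketched as follows. This is a minimal illustration, not the exact implementation used in the experiments; it assumes the Gaussian kernel exp(−∥x−z∥²) that is adopted later in Section 4, and it centers the kernel matrix in feature space rather than centering the raw data.

```python
import numpy as np

def rbf(x, z):
    # Gaussian kernel, as adopted in Section 4
    return np.exp(-np.linalg.norm(x - z) ** 2)

def kpca(X, n_components, kernel=rbf):
    """Minimal KPCA sketch: build the kernel matrix, center it in
    feature space, and eigendecompose it."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one   # feature-space centering
    lam, V = np.linalg.eigh(Kc)                  # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:n_components]   # keep the leading components
    return lam[idx], V[:, idx]
```

The leading eigenvalues are nonnegative (up to floating-point error), since the centered kernel matrix remains positive semidefinite.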
The kernel principal component regression algorithm has been proposed by Rosipal et al. [11]. The standard regression model in the transformed feature space can be written as
(2)f(x)=∑k=1pwkβ(x)k+b,
where p is the number of components, wk is the kth primal regression coefficient, and b is the regression bias. Here β(x)k=VkΦ(x), where Vk is the kth eigenvector of V; V and Λ are the eigenvector and eigenvalue matrices of the correlation matrix, respectively.
2.3. Data Set
There are many data sets for machine learning experiments, such as the UCI repository [16] and the PROMISE data repository [17]. Since the contributors maintain these data sets continuously, the metrics listed in Table 1 may vary over time; we use the versions updated in June 2012, which differ from the data collection from real-world software engineering projects used in our previous papers [12, 13]. Which data set should be used depends on the area of machine learning where it will be applied. In this paper, the experimental data sets come from NASA and SOFTLAB and can be obtained from PROMISE [17], as shown in Tables 1 and 2. These software modules were developed in different languages, at different sites, by different teams. The SOFTLAB data sets (ar3, ar4, and ar5) are drawn from three controller systems for a washing machine, a dishwasher, and a refrigerator, respectively; they are all written in C. The rest are from NASA projects and are written in C++, except for kc3, which is written in Java. All the metrics are computed according to [17].
Table 1: Data sets.

Project | Modules | Attributes | Size (loc) | %Defective | Description
ar3     | 63      | 30         | 5,624      | 11.11      | Embedded controller
ar4     | 107     | 30         | 9,196      | 18.69      | Embedded controller
ar5     | 36      | 30         | 2,732      | 22.22      | Embedded controller
cm1     | 327     | 38         | 14,763     | 12.84      | Spacecraft instrument
kc1     | 2109    | 21         | 42,965     | 15.46      | Storage management
kc2     | 521     | 22         | 19,259     | 20.54      | Storage management
kc3     | 194     | 40         | 7,749      | 18.56      | Storage management
mw1     | 253     | 38         | 8,341      | 10.67      | A zero gravity experiment
pc1     | 705     | 38         | 25,924     | 8.65       | Flight software
Table 2: Metrics used in our experiment.

Type        | Number | Metric
Loc         | 5      | Halstead's count of blank lines; McCabe's line count of code; Halstead's line count; Halstead's count of lines of comments; line count of code and comment
Halstead    | 12     | unique operators; unique operands; total operators; total operands; total operators and operands; volume; program length; difficulty; intelligence; effort; volume on minimal implementation; time estimator
BranchCount | 1      | branch count
Others      | 18     | global data complexity; cyclomatic density; decision count; decision density; global data density; essential density; design density; loc executable; parameter count; percent comments; normalized cyclomatic complexity; modified condition count; multiple condition count; node count; maintenance severity; condition count; call pairs; edge count
3. Design of the Asymmetric Classifiers Based on the Kernel Method
3.1. The Asymmetric Kernel Partial Least Squares Classifier (AKPLSC)
As we illustrated in Section 2.1, APLSC can be expressed as Y^=sign(∑i=1kmiti-b) and the kernel PLS regression is y^=∑i=1ℓαiκ(xi,x); thus the AKPLSC can be well characterized as
(3)Y^=sign(∑i=1ℓαiκ(xi,x)-b^),
where αi is dual regression coefficient, which can be obtained from kernel PLS, as shown in Algorithm 1 and b^ is the bias of the classifier.
Algorithm 1: AKPLSC.
Input: Labeled and unlabeled data sets, L and U; number of components, k.
Output: Asymmetric Kernel Partial Least Squares Classifier, H;
Method:
(1) Kij=κ(xi,xj),i,j=1,…,ℓ,xi,xj∈L;
(2) K1=K,Y^=Y, where K is the kernel matrix, Y is the label vector.
(3) for j=1,…,k do
(4) βj=βj/∥βj∥, where βj is a projection direction.
(5) repeat
(6) βj=Y^Y^′Kjβj
(7) βj=βj/∥βj∥
(8) until convergence
(9) τj=Kjβj, where τj is the score
(10) cj=Y^′τj/∥τj∥2, where cj is the direction of the score
(11) Y^=Y^-τjcj′, where Y^ is the deflation of Y
(12) Kj+1=(I-τjτj′/∥τj∥2)Kj(I-τjτj′/∥τj∥2)
(13) end for
(14) B=[β1,…,βk],T=[τ1,…,τk]
(15) α=B(T′KB)-1T′Y, where α is the vector of dual regression coefficients
(16) Calculate b^ according to (4);
(17) H(x)=sign(∑i=1ℓαiκ(xi,x)-b^),x∈U;
(18) return H;
End Algorithm AKPLSC.
Since kernel PLS puts most of the information on the first dimension, the bias in the AKPLSC can be computed similarly to [15]:
(4)b^=c1*(M+1-r+1η)=c1*M+1r-1+M-1r+1r-1+r+1,
where c1 indicates the direction of the first score τ1, and the centers (M+1, M-1) and radii (r+1, r-1) are computed from τ1, which can be obtained from (1). Then we move the origin to the center of mass by employing data centering, as reported in [14]:
(5)K=K-1ℓJJ′K-1ℓKJJ′+1ℓ2(J′KJ)JJ′,
where J is a vector with all elements equal to 1. After data centering, the AKPLSC can be described as shown in Algorithm 1.
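A minimal Python sketch of the dual kernel-PLS part of Algorithm 1 might look as follows. It assumes K is the (centered) kernel matrix and y the ±1 label vector; the constant start vector for the power iteration and the use of a pseudoinverse in step (15) are implementation choices of this sketch, not prescribed by the algorithm.

```python
import numpy as np

def akplsc_dual(K, y, k, n_iter=50):
    """Sketch of steps (1)-(15) of Algorithm 1. Returns the dual
    coefficients alpha, the score matrix T, and c1 (the direction of
    the first score, used in the bias of equation (4))."""
    l = K.shape[0]
    Kj = K.astype(float).copy()
    Yd = y.reshape(-1, 1).astype(float)              # deflated label vector
    B, T, c1 = [], [], None
    for j in range(k):
        beta = np.full((l, 1), 1.0 / np.sqrt(l))     # start vector (a choice of this sketch)
        for _ in range(n_iter):                      # power iteration, steps (5)-(8)
            beta = Yd @ (Yd.T @ (Kj @ beta))
            beta /= np.linalg.norm(beta)
        tau = Kj @ beta                              # score, step (9)
        c = (Yd.T @ tau).item() / (tau.T @ tau).item()   # score direction, step (10)
        if j == 0:
            c1 = c
        Yd = Yd - tau * c                            # deflate Y, step (11)
        P = np.eye(l) - (tau @ tau.T) / (tau.T @ tau).item()
        Kj = P @ Kj @ P                              # deflate K, step (12)
        B.append(beta)
        T.append(tau)
    B, T = np.hstack(B), np.hstack(T)
    # step (15); a pseudoinverse is used for numerical safety in this sketch
    alpha = B @ np.linalg.pinv(T.T @ K @ B) @ (T.T @ y.reshape(-1, 1))
    return alpha.ravel(), T, c1
```

The bias b^ of (4) is then computed from c1 together with the centers and radii of the two classes on the first score τ1 = T[:, 0]. A side effect of the double-sided deflation in step (12) is that the extracted scores are mutually orthogonal.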
3.2. The Asymmetric Kernel Principal Component Analysis Classifier (AKPCAC)
The KPCA regression model does not consider the correlation between principal components and the class attribution. PCA dimension reduction is inevitably affected by asymmetric distribution [15]. We analyze the effect of class imbalance on KPCA. Considering the class imbalance problem, we propose an asymmetric kernel principal component analysis classifier (AKPCAC), which retrieves the loss caused by this effect.
Suppose that SbΦ=∑i=12ni(ui¯-u¯)(ui¯-u¯)′ denotes the between-class scatter matrix and SwΦ=∑i=12∑j=1ni(Φ(xji)-ui¯)(Φ(xji)-ui¯)′ denotes the within-class scatter matrix, where ui¯=(1/ni)∑j=1niΦ(xji) is class-conditional mean vector, u¯ is mean vector of total instances, Φ(xji) is the jth instances in the ith class, and ni is the number of instances of the ith class. The total noncentralized scatter matrix in the form of kernel matrix is
(6) K = ∑i=12∑j=1ni(Φ(xji)-u¯)(Φ(xji)-u¯)′
      = ∑i=12∑j=1ni(Φ(xji)-ui¯+ui¯-u¯)(Φ(xji)-ui¯+ui¯-u¯)′
      = SwΦ + SbΦ + ∑i=12∑j=1ni(Φ(xji)-ui¯)(ui¯-u¯)′ + ∑i=12∑j=1ni(ui¯-u¯)(Φ(xji)-ui¯)′.
The third term of (6) can be rewritten as
(7) ∑i=12∑j=1ni(Φ(xji)-ui¯)(ui¯-u¯)′ = ∑i=12(∑j=1ni(Φ(xji)-ui¯))(ui¯-u¯)′ = ∑i=12(∑j=1niΦ(xji)-niui¯)(ui¯-u¯)′.
Note that niui¯=∑j=1niΦ(xji), so the third and fourth terms of (6) are equal to zero. Thus, we have the relation K=SbΦ+SwΦ=SbΦ+PΣPΦ+NΣNΦ, where P is the number of positive instances, N is the number of negative instances, and ΣPΦ and ΣNΦ are the positive and negative covariance matrices, respectively. Since the class distribution has a great impact on SwΦ, class imbalance also affects the diagonalization problem of KPCA.
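The scatter decomposition above can be checked numerically. The snippet below uses the identity feature map Φ(x) = x in place of the kernel-induced map, which is enough to exercise the algebra on synthetic two-class data.

```python
import numpy as np

# Numeric check of the decomposition: total scatter = Sb + Sw,
# with the identity feature map standing in for Phi.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((5, 3))   # class 1, n1 = 5
X2 = rng.standard_normal((8, 3))   # class 2, n2 = 8
u1, u2 = X1.mean(axis=0), X2.mean(axis=0)       # class-conditional means
u = np.vstack([X1, X2]).mean(axis=0)            # total mean
Sw = sum(np.outer(x - u1, x - u1) for x in X1) + \
     sum(np.outer(x - u2, x - u2) for x in X2)  # within-class scatter
Sb = 5 * np.outer(u1 - u, u1 - u) + 8 * np.outer(u2 - u, u2 - u)  # between-class
S_total = sum(np.outer(x - u, x - u) for x in np.vstack([X1, X2]))
ok = np.allclose(S_total, Sb + Sw)
```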
In order to combat the class imbalance problem, we propose the AKPCAC, which is based on the kernel method and considers the correlation between principal components and the class distribution. The imbalance ratio can be denoted ∑i=1ℓI(yi,+1)/∑i=1ℓI(yi,-1)=P/N, the ratio of positive to negative instances in the training data, where I(·,·) is an indicator function: I(x,y)=1 if x=y and zero otherwise. We assume that future test examples are drawn from the same distribution, so the imbalance ratio of the training data equals that of the test data. Then, we have
(8)∑i=1ℓ(y^i-b^)I(yi,+1)∑i=1ℓ(y^i-b^)I(yi,-1)=∑i=1ℓI(yi,+1)∑i=1ℓI(yi,-1)=PN,
where b^ is the bias of the classifier and y^i is the regression result of xi. y^i can be computed by regression model equation (2). Note that the regression is conducted on the p principal components. Solving this one variable equation, we get
(9)b^=N(∑i=1ℓ(y^iI(yi,+1)))-P(∑i=1ℓ(y^iI(yi,-1)))N2-P2.
Based on the principal components, (9) describes the deviation of the classifier. This deviation may be caused by class imbalance, noise, or other unintended factors; in order to remove its harmful effect, we compensate for it. By transforming the regression model (2), the classifier can be written as
(10)H(x)=sign(∑k=1pwkβ(x)k+b^)=sign(∑k=1pwk∑i=1ℓαikκ(xi,x)+b^)=sign(∑i=1ℓciκ(xi,x)+b^),
where {ci=∑k=1pwkαik}, i=1,2,…,ℓ.
AKPCAC is summarized in Algorithm 2. Since the AKPCAC was designed to reduce the effect of class imbalance on classification, it inherits the advantage of the kernel method, namely that it can realize quite general high-dimensional feature space mappings. We have also illustrated how the unreliable dimensions of KPCA can be removed; thereby, the imbalance problem of PCA-based classification is also solved.
Algorithm 2: AKPCAC.
Input: The set of Labeled samples, L={(x1,y1),(x2,y2),…,(xℓ,yℓ)};
The set of unlabeled samples, U={xℓ+1,xℓ+2,…,xℓ+u};
Output: Kernel Principal Component Classifier, H;
Method:
(1) Kij=κ(xi,xj),i,j=1,…,ℓ;
(2) K=K-(1/ℓ)JJ′K-(1/ℓ)KJJ′+(1/ℓ2)(J′KJ)JJ′, where J is a vector with all elements equal to 1.
(3) [V,Λ]=eig(K);
(4) α=∑j=1p(1/λj)(Vj′Ys)Vj; %Ys={y1,y2,…,yℓ} is the label vector
(5) Calculate b^,H(x) according to (9), (10);
(6) return H;
End Algorithm AKPCAC.
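A compact sketch of Algorithm 2 together with the bias of equation (9) is given below. It assumes a raw (uncentered) kernel matrix K, labels in {+1, −1}, P ≠ N, and p no larger than the rank of the centered kernel matrix; the resulting classifier is H(x) = sign(∑i ciκ(xi,x) + b^) as in (10).

```python
import numpy as np

def asymmetric_bias(yhat, y):
    """Bias b^ of equation (9); P and N are the positive/negative counts
    (assumes P != N, i.e., genuinely imbalanced data)."""
    P, N = np.sum(y == 1), np.sum(y == -1)
    s_pos = yhat[y == 1].sum()
    s_neg = yhat[y == -1].sum()
    return (N * s_pos - P * s_neg) / (N ** 2 - P ** 2)

def akpcac_fit(K, y, p):
    """Sketch of Algorithm 2: center K, take the p leading eigenpairs,
    form the dual coefficients, and compute the asymmetric bias."""
    l = K.shape[0]
    J = np.ones((l, 1))
    Kc = K - (J @ J.T @ K) / l - (K @ J @ J.T) / l \
         + (J.T @ K @ J).item() * (J @ J.T) / l ** 2   # step (2): centering
    lam, V = np.linalg.eigh(Kc)
    idx = np.argsort(lam)[::-1][:p]                    # p leading components
    c = np.zeros(l)
    for j in idx:                                      # step (4): dual coefficients
        c += (V[:, j] @ y) / lam[j] * V[:, j]
    bhat = asymmetric_bias(Kc @ c, y)                  # step (5), via equation (9)
    return c, bhat
```

For example, with regression outputs summing to 3 over two positive instances and to −6 over three negative instances, equation (9) gives b^ = (3·3 − 2·(−6))/(3² − 2²) = 4.2.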
4. Experimental Result
The experiments are conducted on the data sets from NASA and SOFTLAB. The Gaussian kernel function K(x,y)=exp(-∥x-y∥2) is adopted for both AKPLSC and AKPCAC. The efficiency is evaluated by the F-measure and Friedman’s test, as explained below.
4.1. Validation Method Using F-Measure
The F-measure is widely used for assessing a test’s accuracy. It considers both the precision P and the recall R to compute the score: P is the number of correct positive results divided by the number of all returned positive results, and R is the number of correct positive results divided by the number of positive results that should have been returned. For clarity, we give a short explanation of the F-measure below. There are four possible outcomes of a predictor:
TP: true positives are modules classified correctly as defective modules;
FP: false positives refer to nondefective modules incorrectly labeled as defective;
TN: true negatives correspond to correctly classified nondefective modules;
FN: false negatives are defective modules incorrectly classified as nondefective.
Thereby, the precision is defined as P=TP/(TP+FP) and the recall is R=TP/(TP+FN).
The general formula of the F-measure is
(11)Fβ=(1+β2)PRβ2P+R,
where β is a positive real number. According to the definition of P and R, (11) can be rewritten as
(12)Fβ=(1+β2)TP(1+β2)TP+β2FN+FP.
Generally, three F-measures are commonly used: F1 (which balances P and R), F0.5 (which puts more emphasis on P), and F2 (which weights R higher than P). In this paper, F1=2PR/(P+R) is used to evaluate the efficiency of the different classifiers. The F-measure can be interpreted as a weighted harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
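Equation (12) translates directly into code; for example:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-measure computed from confusion counts, as in equation (12)."""
    b2 = beta ** 2
    return (1.0 + b2) * tp / ((1.0 + b2) * tp + b2 * fn + fp)
```

With tp = 8, fp = 2, fn = 4 (precision 0.8, recall 2/3), `f_beta(8, 2, 4)` gives F1 = 16/22 ≈ 0.727.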
We compare the F-measure values of different predictors including AKPLSC, AKPCAC, APLSC [15], KPCAC [11], AdaBoost [4], and SMOTE [18]. The results are listed in Table 3. For each data set, we perform a 10×5-fold cross validation.
Table 3: Statistical F-measure (mean ± std) values of six classifiers on all data sets. The winning method in each row is marked with an asterisk (*).

Data set | APLSC         | KPCAC         | AdaBoost      | SMOTE         | AKPCAC        | AKPLSC
ar3      | 0.498 (0.062) | 0.394 (0.115) | 0.357 (0.026) | 0.415 (0.003) | *0.569 (0.052) | 0.500 (0.009)
ar4      | 0.421 (0.023) | 0.422 (0.058) | 0.412 (0.013) | 0.457 (0.000) | *0.474 (0.024) | 0.421 (0.001)
ar5      | 0.526 (0.058) | 0.562 (0.055) | 0.435 (0.012) | 0.555 (0.001) | *0.625 (0.025) | 0.553 (0.022)
cm1      | *0.254 (0.033) | 0.069 (0.031) | 0.234 (0.007) | 0.237 (0.000) | 0.089 (0.031) | 0.235 (0.002)
kc1      | 0.408 (0.005) | 0.309 (0.006) | 0.325 (0.005) | 0.428 (0.000) | *0.462 (0.009) | 0.433 (0.235)
kc2      | 0.431 (0.021) | 0.433 (0.014) | *0.534 (0.002) | 0.520 (0.000) | 0.523 (0.011) | 0.480 (0.036)
kc3      | 0.382 (0.036) | 0.231 (0.025) | 0.386 (0.002) | 0.340 (0.000) | 0.234 (0.045) | *0.412 (0.000)
mw1      | 0.285 (0.011) | 0.276 (0.080) | 0.224 (0.002) | 0.290 (0.000) | *0.371 (0.037) | 0.268 (0.025)
pc1      | 0.215 (0.019) | 0.212 (0.025) | 0.354 (0.001) | 0.387 (0.000) | 0.254 (0.015) | *0.398 (0.001)
From the table we can see clearly that the AKPLSC and the AKPCAC are superior to the other four classifiers on most of the data sets, which validates the contributions of this paper.
4.2. Validation Method Using Friedman’s Test and Tukey’s Test
Friedman’s test is a nonparametric statistical test developed by Friedman [19, 20]. It is used to detect differences between algorithms/classifiers across multiple test attempts: each block (row) is ranked separately, and the rank values are then considered by column. In this section, we present a multiple comparison of the AUC values of the six classifiers using Friedman’s test.
At first, we make two hypotheses:
H0: the six classifiers have equal classification probability;
H1: at least two of them have different probability distribution.
In order to determine which hypothesis should be rejected, we compute the statistic:
(13)Fr=12bk(k+1)∑i=1kRi2-3b(k+1),
where b is the number of blocks (rows), k is the number of classifiers, and Ri is the sum of ranks in the ith column. The null hypothesis is rejected when Fr>χα2. In our experiment, the degrees of freedom are k-1=5 and we set α=0.05; thus Fr=18.9683>χα2=11.0705, which implies that H0 should be rejected.
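Equation (13) is easy to verify from the rank sums T1, …, T6 listed in Table 4:

```python
def friedman_statistic(rank_sums, b):
    """Friedman statistic Fr of equation (13): b blocks (data sets),
    k = len(rank_sums) classifiers, Ri = column rank sums."""
    k = len(rank_sums)
    return 12.0 / (b * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * b * (k + 1)

# Rank sums T1..T6 from Table 4, over b = 9 data sets
fr = friedman_statistic([43, 37, 40, 31, 14, 24], b=9)
```

This yields Fr ≈ 18.9683 > 11.0705, reproducing the rejection of H0.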
Friedman’s test only tells us that at least two of the classifiers perform differently; it does not indicate which one performs best. In that case, a post hoc test should be performed. There are many post hoc tests, such as Fisher’s least significant difference (LSD), the Student–Newman–Keuls (SNK) test, the Bonferroni–Dunn test, Tukey’s test, and Nemenyi’s test, which is very similar to Tukey’s test for ANOVA. In this paper, Tukey’s test [21] is applied.
Tukey’s test is a single-step multiple comparison procedure and statistical test. It is used in conjunction with an ANOVA to find means that are significantly different from each other. It compares all possible pairs of means and is based on a studentized range distribution.
Tukey’s test involves two basic assumptions:
the observations being tested are independent;
there is equal within-group variance across the groups associated with each mean in the test.
Obviously, our case satisfies the two requirements.
The steps of the Tukey multiple comparison with equal sample size can be summarized in Algorithm 3.
Algorithm 3: Tukey multiple comparison.
Input: α, p, v, MSE, nt, and the samples, where α is the error rate, p is the number of means, s=√MSE, v is the degrees of freedom associated with MSE, and nt is the number of observations in each sample.
Output: The minimum significant difference ω and a deduction;
Method:
(1) Choose a proper error rate α;
(2) Calculate the statistic ω according to
    ω = qα(p,v)·s/√nt,
where qα(p,v) is the critical value of the studentized range statistic, which can be found in statistics textbooks;
(3) Compute and rank all the p means;
(4) Draw a deduction based on the ranks at the confidence level (1-α).
End Algorithm Tukey Multiple Comparison.
In this paper, we set α=0.05. Since we compare 6 classifiers over 9 data sets, n=54, p=6, v=n-p=48, and nt=9; qα(p,v)≈4.2, which can be found in the studentized range table. To find the value of ω, it remains to determine s and MSE, which can be calculated as
(14) SY = ∑i=1nyi,  MY = SY/n,  YS = ∑i=1nyi2,
     CM = (∑i=1nyi)2/n = SY2/n,
     SS = ∑i=1nyi2 - CM = YS - CM,
     SST = ∑i=1pTi2/ni - CM,
     SSE = SS - SST,
     MSE = SSE/(n-p),
where yi is the corresponding AUC value in Table 4 and Ti is the sum of the AUC values in each column. We obtain MSE=0.0051, s=0.0713, and ω=0.0998. The comparison of means is listed in Table 5, from which we can see the following.
(i) The difference T6-T1=0.1058 is greater than the critical value ω, which shows that the AKPCAC is significantly better than the APLSC.
(ii) Compared to the remaining classifiers, however, the two newly proposed methods show no significant difference.
(iii) Nevertheless, the AKPCAC and AKPLSC have the largest and second largest means, which implies that both outperform the rest, although not significantly.
(iv) Compared to the AKPLSC, the AKPCAC is slightly more powerful, which supports our claim that the AKPCAC is more adaptive to feature space mappings over imbalanced data sets.
(v) The deductions are made at the confidence level (1-0.05).
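The computation of (14) can be reproduced from the AUC values of Table 4. The script below recovers MSE ≈ 0.0051, s ≈ 0.0713, and ω ≈ 0.0998, using q0.05(6, 48) ≈ 4.2.

```python
import numpy as np

# AUC values from Table 4 (rows ar3..pc1; columns APLSC, KPCAC, AdaBoost, SMOTE, AKPCAC, AKPLSC)
auc = np.array([
    [0.626, 0.588, 0.580, 0.590, 0.699, 0.682],   # ar3
    [0.563, 0.600, 0.555, 0.610, 0.671, 0.579],   # ar4
    [0.626, 0.710, 0.614, 0.651, 0.722, 0.640],   # ar5
    [0.611, 0.724, 0.589, 0.550, 0.681, 0.650],   # cm1
    [0.682, 0.592, 0.627, 0.700, 0.800, 0.768],   # kc1
    [0.591, 0.601, 0.796, 0.635, 0.732, 0.610],   # kc2
    [0.598, 0.569, 0.698, 0.612, 0.658, 0.713],   # kc3
    [0.587, 0.602, 0.534, 0.654, 0.725, 0.639],   # mw1
    [0.692, 0.718, 0.769, 0.753, 0.841, 0.882],   # pc1
])
n, p = auc.size, auc.shape[1]                  # n = 54 observations, p = 6 means
nt, v, q = auc.shape[0], auc.size - auc.shape[1], 4.2   # q = q_0.05(6, 48)
cm = auc.sum() ** 2 / n                        # correction for the mean
ss = (auc ** 2).sum() - cm                     # SS = YS - CM
sst = (auc.sum(axis=0) ** 2).sum() / nt - cm   # treatment sum of squares
mse = (ss - sst) / v                           # MSE = SSE / (n - p)
s = np.sqrt(mse)
omega = q * s / np.sqrt(nt)                    # minimum significant difference
```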
Table 4: Comparison of AUC between the six classifiers. The ranks in parentheses are used in the computation of Friedman’s test.

Data set     | APLSC     | KPCAC     | AdaBoost  | SMOTE     | AKPCAC    | AKPLSC
ar3          | 0.626 (3) | 0.588 (5) | 0.580 (6) | 0.590 (4) | 0.699 (1) | 0.682 (2)
ar4          | 0.563 (5) | 0.600 (3) | 0.555 (6) | 0.610 (2) | 0.671 (1) | 0.579 (4)
ar5          | 0.626 (5) | 0.710 (2) | 0.614 (6) | 0.651 (3) | 0.722 (1) | 0.640 (4)
cm1          | 0.611 (4) | 0.724 (1) | 0.589 (5) | 0.550 (6) | 0.681 (2) | 0.650 (3)
kc1          | 0.682 (4) | 0.592 (6) | 0.627 (5) | 0.700 (3) | 0.800 (1) | 0.768 (2)
kc2          | 0.591 (6) | 0.601 (5) | 0.796 (1) | 0.635 (3) | 0.732 (2) | 0.610 (4)
kc3          | 0.598 (5) | 0.569 (6) | 0.698 (2) | 0.612 (4) | 0.658 (3) | 0.713 (1)
mw1          | 0.587 (5) | 0.602 (4) | 0.534 (6) | 0.654 (2) | 0.725 (1) | 0.639 (3)
pc1          | 0.692 (6) | 0.718 (5) | 0.769 (3) | 0.753 (4) | 0.841 (2) | 0.882 (1)
Sum of ranks | T1=43     | T2=37     | T3=40     | T4=31     | T5=14     | T6=24
Table 5: Tukey’s test result for the six classifiers.

                      | APLSC  | KPCAC  | SMOTE  | AdaBoost | AKPLSC | AKPCAC
Rank                  | 1      | 2      | 3      | 4        | 5      | 6
Mean: Ti              | 0.6196 | 0.6338 | 0.6394 | 0.6402   | 0.6848 | 0.7254
Difference: Ti+1-Ti   | —      | 0.0142 | 0.0057 | 0.0008   | 0.0446 | 0.0407
5. Conclusion
In this paper, we introduce kernel-based asymmetric learning for software defect prediction. To eliminate the negative effect of the class imbalance problem, we propose two algorithms, the asymmetric kernel partial least squares classifier and the asymmetric kernel principal component analysis classifier. The former is derived from the regression model of linear PLS, while the latter is derived from the kernel PCA method. The AKPLSC extracts feature information nonlinearly and retrieves the loss caused by class imbalance; the AKPCAC is more adaptive to feature space mappings over imbalanced data sets and performs better. The F-measure, Friedman’s test, and a post hoc test using Tukey’s method are used to verify the performance of our algorithms, and experimental results on the NASA and SOFTLAB data sets validate their effectiveness.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The work of this paper was supported by the National Natural Science Foundation of China (Grant no. 61300093) and the Fundamental Research Funds for the Central Universities in China (Grant no. ZYGX2013J071). The authors are extremely grateful to the anonymous referees of the initial version of this paper for their valuable comments, all of which have been incorporated and which significantly improved the present version.
References

[1] T. M. Khoshgoftaar, E. B. Allen, and J. Deng, “Using regression trees to classify fault-prone software modules.”
[2] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors.”
[3] Y. Ma, G. Luo, X. Zeng, and A. Chen, “Transfer learning for cross-company software defect prediction.”
[4] C. Seiffert, T. M. Khoshgoftaar, and J. Van Hulse, “Improving software-quality predictions with data sampling and boosting.”
[5] L. Guo, Y. Ma, B. Cukic, and H. Singh, “Robust prediction of fault-proneness by random forests,” in Proceedings of the 15th International Symposium on Software Reliability Engineering (ISSRE '04), November 2004, pp. 417–428.
[6] Y. Freund and R. Schapire, “Experiments with a new boosting algorithm,” in Proceedings of the 13th International Conference on Machine Learning, 1996, pp. 148–156.
[7] M. Barker and W. Rayens, “Partial least squares for discrimination.”
[8] J.-H. Xue and D. M. Titterington, “Do unbalanced data have a negative effect on LDA?”
[9] X. Jiang, “Asymmetric principal component and discriminant analyses for pattern classification.”
[10] I. T. Jolliffe, Principal Component Analysis.
[11] R. Rosipal, M. Girolami, L. J. Trejo, and A. Cichocki, “Kernel PCA for feature extraction and de-noising in nonlinear regression.”
[12] Y. Ma, G. Luo, and H. Chen, “Kernel based asymmetric learning for software defect prediction.”
[13] Y. Ma, G. Luo, and H. Chen, “Kernel based asymmetric learning for software defect prediction.”
[14] R. Rosipal, L. J. Trejo, and B. Matthews, “Kernel PLS-SVC for linear and nonlinear classification,” in Proceedings of the 20th International Conference on Machine Learning (ICML '03), August 2003, pp. 640–647.
[15] H.-N. Qu, G.-Z. Li, and W.-S. Xu, “An asymmetric classifier based on partial least squares.”
[16] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2013, http://archive.ics.uci.edu/ml/.
[17] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, The PROMISE Repository of empirical software engineering data, West Virginia University, Department of Computer Science, 2012, http://promisedata.googlecode.com/.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique.”
[19] M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance.”
[20] M. Friedman, “A comparison of alternative tests of significance for the problem of m rankings.”
[21] M. William and S. Terry.