Cross-Project Defect Prediction Based on Two-Phase Feature Importance Amplification

As a typical application of computational intelligence in software engineering, cross-project defect prediction (CPDP) uses labeled data from other projects (source projects) to build models that predict defects in the current projects (target projects), helping testers quickly locate defective modules. However, class imbalance and differing data distributions among projects make CPDP a challenging topic. To address these two problems, we propose a two-phase feature importance amplification (TFIA) CPDP model, which tackles them in the domain adaptation phase and the classification phase. In the domain adaptation phase, the differences in data distribution among projects are reduced by filtering both source and target projects, and correlation-based feature selection with greedy best-first search amplifies the importance of features with strong feature-class correlation. In the classification phase, Random Forest works as the classifier to further amplify the importance of highly correlated features and establish a model that is sensitive to them. We conducted both ablation experiments and comparison experiments on the widely used AEEEM database. Experimental results show that TFIA yields significant improvement on CPDP. The performance of the TFIA CPDP model is stable and efficient in all experiments, which lays a solid foundation for its further application in practical engineering.


Introduction
Software defects may cause unexpected disasters in the application scenarios, threatening the security of software and even people's lives [1]. During the whole life cycle of software, defects that are found in the later period will cost more than those found in the earlier period to be repaired. Testers should find out the defects in time, helping developers fix them as soon as possible [2]. But focusing on all modules of the software will cost a large amount of time and manpower, which goes against our original intention and makes preidentification of potentially defective modules an urgent issue [3,4].
Software defect prediction (SDP) is a method that can assist testers in quickly identifying potentially defective modules [5] so as to reduce the time spent by testers on troubleshooting and save testing costs as well [4]. The typical SDP usually uses historical data of software to build prediction models for defect pattern recognition in new releases of the modules from the projects [3,6,7]. As brand-new projects lack historical data, especially historical data with defective or clean labels, SDP is unable to work well on them. To solve this problem, researchers proposed a method called cross-project software defect prediction (CPDP). CPDP builds prediction models with historical data from other projects (source projects) and predicts defects in the current projects (target projects) [8,9].
Although CPDP addresses the lack of historical data, there are still two major problems to be solved. First, due to a variety of factors such as project functionality and developers' habits, data distribution varies greatly from project to project. Differences in data distribution between projects are evident even when the same metrics are used to evaluate projects [10], which makes CPDP models built on source projects ineffective on the target project [11]. Therefore, reducing the differences in data distribution between source and target projects is important to the performance of CPDP models [12,13]. Second, the number of modules with defects is often much smaller than that of modules without defects, which is called the class imbalance problem [14]. The class imbalance problem can also affect the performance of CPDP models, because they may have a preference for the majority class when classifying [15,16].
To address the above problems, in this paper we propose a two-phase feature importance amplification (TFIA) CPDP model. Specifically, TFIA divides CPDP into two phases. In the domain adaptation phase, TFIA reduces differences in data distribution between source and target projects. After a certain proportion of samples from the target projects is added to the source projects, the correlation-based feature selection method with a greedy best-first search strategy is applied to amplify the importance of features with high feature-class correlation. The subset made up of the finally selected features is used to filter the source and target projects for the purpose of removing redundant features as well as interfering items and reducing the dimensionality of the data. In the classification phase, resampling is conducted on the source project's data to balance the numbers of clean and defective samples, and an ensemble learning method, Random Forest, is adopted to build the prediction model. The reason for adopting Random Forest is that ensemble learning methods have shown their advantages in solving the class imbalance problem [17]. In our case, Random Forest uses a random sampling method; thus the trained model has small variance and high generalization ability, which helps mitigate the class imbalance problem. Random Forest also evaluates the relationship between features and classes during the training process [18], resulting in a ranking of feature-class correlation [19] and further amplifying the importance of features with strong feature-class correlation for defect recognition.
In summary, our contributions are twofold.
(1) We propose a two-phase feature importance amplification CPDP model (TFIA). TFIA reduces the effect of the data distribution and class imbalance problems in the domain adaptation phase and the classification phase.
(2) We conducted experiments on the AEEEM database to verify the validity of the method. In addition, we performed a number of ablation experiments to carefully analyse and dissect the detailed components of our method, and we conducted comparison experiments to compare TFIA with other methods.

Related Work
CPDP usually includes two phases, called domain adaptation and classification. Domain adaptation is mainly addressed by data processing methods, such as data migration and feature selection. Classification is mainly handled by machine learning classifiers [20]. Research on domain adaptation addresses the difference in data distribution between the source and target projects, while research on classification methods improves the prediction performance from the perspective of algorithms. Some researchers also consider hybrid methods.

Domain Adaptation Phase.
Turhan et al. [21] proposed the Burak filter based on the k-nearest neighbor method. The Burak filter calculates the Euclidean distance between each sample in the target project datasets and all samples in the source project datasets and, for each target sample, selects the k source samples with the smallest Euclidean distance as training samples. Finally, the Naive Bayes method was used for prediction on the processed dataset. The Burak filter improves the performance of the CPDP model by filtering the samples of the source project datasets based on the target project datasets. Peters et al. [22] proposed the Peters filter, which starts from the source project datasets. For each sample in the source project datasets, the closest sample in the target project datasets was found by comparing Euclidean distances and was marked. Then, for each marked target sample, the source sample with the smallest distance to it was selected as training data for the prediction process. Pan et al. [23] proposed TCA, which maps the features of the source and target projects to a latent space in which they are most similar, reducing the effect of differences in data distribution. Finally, Logistic Regression was used for prediction. He et al. [24] simplified the training set with the TDSelector method and then classified it with Logistic Regression. Sun et al. [25] proposed a source project selection method based on collaborative filtering (CFPS) to filter source projects, which achieves good results with SMO and Random Forest as classifiers. Alsawalqah et al. [26] proposed a SMOTE-ensemble method to address the class imbalance problem at the data level and the algorithm level.
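As a rough illustration of the Burak filter described above, the following Python sketch keeps, for every target sample, its k nearest source samples by Euclidean distance and returns the union of the selected indices. The function names, the list-of-lists data layout, and the toy values are assumptions made for this example and are not taken from [21].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def burak_filter(source, target, k):
    """Return the sorted indices of source samples kept as training data.

    For each target sample, the k source samples with the smallest
    Euclidean distance are selected; duplicates are merged.
    """
    selected = set()
    for t in target:
        ranked = sorted(range(len(source)), key=lambda i: euclidean(source[i], t))
        selected.update(ranked[:k])
    return sorted(selected)


# Hypothetical one-dimensional metric values, for illustration only.
source_samples = [[0.0], [1.0], [10.0], [11.0]]
target_samples = [[0.2], [10.4]]
kept = burak_filter(source_samples, target_samples, k=1)
```

With k = 1, each target sample contributes its single nearest source sample, so source samples that are far from every target sample are dropped from the training set.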

Hybrid Method.
Yuan et al. [27] used TrAdaBoost to determine weights for samples based on the Burak filter and used weighted support vector machines to build the model to improve the CPDP model. Chao et al. [28] proposed a two-phase CPDP method called TPTL, using a source project estimator to select source projects with a data distribution similar to that of the target project and using two improved TCA+ models for prediction. Cong [29] proposed a DA-KTSVMO method using kernel twin support vector machines to improve the data distribution and using a quantum particle swarm optimization algorithm to optimize the method for prediction. Zhang et al. investigated seven composite algorithms; they believed that composite algorithms can improve the performance of CPDP models [30]. The method proposed by Chen et al. combines the data gravitation method and TrAdaBoost to reduce the effects of class imbalance in the source project [31]. Xu et al. proposed a multisource TrAdaBoost algorithm. The proposed method uses semisupervised high-density-based clustering and a small amount of labeled target project data to obtain a large amount of labeled source project data. When using TrAdaBoost for integration, only the base classifier trained from the source project data most relevant to the target project is selected. In the process of training, this method can ensure that the knowledge transferred is most relevant to the target project but ignores the influence of other source project data on the target project defect prediction model [32].

Methodology
In this section, we first introduce the framework of our proposed method TFIA. Then we show the details of the important steps in our proposed method. Figure 1 is the flowchart of TFIA. TFIA adds a certain proportion of labeled samples from the target project datasets to the source project datasets, uses a correlation-based feature selection method, and searches for a subset of features using a heuristic greedy best-first search strategy. Then TFIA filters the features of the source and target projects, resamples the source project datasets, trains Random Forest on the source project datasets, and finally predicts on the target project datasets.
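The two data-preparation steps named above (adding labeled target samples to the source data and resampling the source data) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the remaining target samples are assumed to be held out for prediction, and balancing is done by random oversampling of the minority class, which is only one possible resampling scheme and not necessarily the one used in the experiments.

```python
import random


def add_target_samples(source, target, proportion, seed=0):
    """Move a random `proportion` of labeled target samples into the source set.

    Returns (augmented_source, remaining_target). Holding the remaining
    target samples out for prediction is an assumption of this sketch.
    """
    rng = random.Random(seed)
    indices = list(range(len(target)))
    rng.shuffle(indices)
    n_moved = round(proportion * len(target))
    moved = [target[i] for i in indices[:n_moved]]
    remaining = [target[i] for i in indices[n_moved:]]
    return source + moved, remaining


def balance_by_oversampling(samples, seed=0):
    """Duplicate random samples of smaller classes until all classes match the largest.

    `samples` is a list of (features, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    largest = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(largest - len(group)))
    return balanced


# Hypothetical toy data: 8 clean and 2 defective source modules, 10 target modules.
source_data = [([i], "clean") for i in range(8)] + [([i], "defective") for i in range(2)]
target_data = [([i], "clean") for i in range(7)] + [([i], "defective") for i in range(3)]
augmented, held_out = add_target_samples(source_data, target_data, proportion=0.4)
balanced = balance_by_oversampling(augmented)
```

Only the source side is rebalanced here; the held-out target samples keep their natural class ratio so that evaluation reflects the real imbalance.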

Framework of TFIA.
TFIA reduces the difference in data distribution between source and target projects from the feature perspective and the sample perspective, respectively. TFIA enhances the importance of features with strong feature-class correlation in model training. A resampling approach is used to reduce the impact of the class imbalance problem. TFIA uses Random Forest as a classifier to train models with small variance and strong generalization ability to improve the accuracy of the CPDP model in identifying defective modules.
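In TFIA, we use Relief to evaluate the feature-feature and feature-class correlations. Relief [33] is a feature weighting algorithm that is sensitive to feature interactions. The Relief value of feature f can be formulated as

$$\mathrm{Relief}_f = \frac{\mathrm{Gini}' \times \sum_{f_v \in f} p(f_v)^2}{\left(1 - \sum_{c \in C} p(c)^2\right) \sum_{c \in C} p(c)^2}, \tag{1}$$

where p is the probability, C is the class variable with C = {defective, clean}, f_v is a certain value of the sample's feature f, and Gini' is another attribute quality measure that can be calculated by

$$\mathrm{Gini}' = \left[\sum_{c \in C} p(c)\,\bigl(1 - p(c)\bigr)\right] - \sum_{f_v \in f} \left(\frac{p(f_v)^2}{\sum_{f_v \in f} p(f_v)^2} \sum_{c \in C} p(c \mid f_v)\,\bigl(1 - p(c \mid f_v)\bigr)\right). \tag{2}$$

We use Relief to calculate the feature-feature and feature-class correlations in Section 3.2 and Section 3.3.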
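To make the Relief-based correlation more concrete, the sketch below computes it for one discrete feature against the class labels. It assumes the probabilistic form of Relief, in which Gini' is the prior class impurity minus a weighted impurity of the class given each feature value, with weights proportional to the squared probability of each feature value. The function name and the toy data are hypothetical, and continuous metrics would first have to be discretized, which is not shown.

```python
from collections import Counter


def relief_correlation(feature_values, class_labels):
    """Relief-style correlation between one discrete feature and the class."""
    n = len(feature_values)
    class_counts = Counter(class_labels)
    value_counts = Counter(feature_values)
    joint_counts = Counter(zip(feature_values, class_labels))

    sum_pc2 = sum((cnt / n) ** 2 for cnt in class_counts.values())
    sum_pv2 = sum((cnt / n) ** 2 for cnt in value_counts.values())
    if sum_pc2 == 1.0:
        # Only one class is present, so the measure is undefined; return 0 by convention.
        return 0.0

    # Prior impurity of the class variable.
    prior_impurity = sum((cnt / n) * (1 - cnt / n) for cnt in class_counts.values())

    # Impurity of the class given each feature value, weighted by p(f_v)^2.
    weighted_impurity = 0.0
    for value, v_cnt in value_counts.items():
        impurity = 0.0
        for label in class_counts:
            p_c_given_v = joint_counts[(value, label)] / v_cnt
            impurity += p_c_given_v * (1 - p_c_given_v)
        weighted_impurity += ((v_cnt / n) ** 2 / sum_pv2) * impurity

    gini_prime = prior_impurity - weighted_impurity
    return gini_prime * sum_pv2 / ((1 - sum_pc2) * sum_pc2)


# Hypothetical toy data: one feature that determines the class, one that is independent of it.
labels = ["clean", "clean", "defective", "defective"]
informative = relief_correlation([0, 0, 1, 1], labels)
uninformative = relief_correlation([0, 1, 0, 1], labels)
```

In this toy case the informative feature scores 1 and the independent feature scores 0, which matches the intended reading of the measure as a correlation strength.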

Domain Adaptation Phase.
In order to amplify the role of features with strong feature-class correlation in the classifier and to deemphasize features with strong feature-feature correlation, we perform feature selection. In the feature selection phase, we use the filtering method called correlation-based feature selection (CFS) [34]. In this paper, the heuristic search strategy used by CFS is greedy best-first search [35]. CFS evaluates the value of a subset of features by considering the individual predictive ability of each feature and the degree of redundancy between them [36].
As shown in Figure 2, CFS first computes the feature-class and feature-feature correlation matrices from the source project dataset and then searches the feature subset space using greedy best-first search.
To prevent the best-first search from exploring the entire feature subset search space, we follow the setting of [34] to impose a termination criterion. The search terminates if five consecutive fully expanded subsets show no improvement over the current best subset. CFS filters features by a feature subset evaluation function, intending to find a subset of features with low feature-feature correlation and strong feature-class correlation, thus sifting out redundant features [37]. The value of the feature subset evaluation function is called the merit, which can be calculated by

$$M_S = \frac{k\,\overline{r_{fc}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}},$$

where M_S is the merit of a feature subset S with k selected features, $\overline{r_{fc}}$ is the average feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature correlation. The correlation r is measured by Relief as described in Section 3.1. The algorithm is described as follows.
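A minimal Python sketch of this search is given below, assuming the standard CFS merit (k times the average feature-class correlation, divided by the square root of k + k(k-1) times the average feature-feature correlation) and precomputed, non-negative correlations. The function names, the `patience` parameter name, and the toy correlation values are hypothetical and are not taken from the paper's own algorithm listing.

```python
import heapq
import math
from itertools import count


def merit(subset, r_fc, r_ff):
    """CFS merit of a feature subset given precomputed correlations."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_fc = sum(r_fc[i] for i in subset) / k
    pairs = [(i, j) for i in subset for j in subset if i < j]
    mean_ff = sum(r_ff[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * mean_fc / math.sqrt(k + k * (k - 1) * mean_ff)


def cfs_best_first(r_fc, r_ff, patience=5):
    """Forward best-first search; stops after `patience` consecutive
    expansions that do not improve on the best subset found so far."""
    n_features = len(r_fc)
    tie_breaker = count()  # keeps heap entries comparable when merits are equal
    start = frozenset()
    open_heap = [(0.0, next(tie_breaker), start)]
    visited = {start}
    best_subset, best_merit = start, 0.0
    stale = 0
    while open_heap and stale < patience:
        _, _, node = heapq.heappop(open_heap)  # subset with the highest merit
        improved = False
        for f in range(n_features):
            child = node | {f}
            if f in node or child in visited:
                continue
            visited.add(child)
            m = merit(child, r_fc, r_ff)
            heapq.heappush(open_heap, (-m, next(tie_breaker), child))
            if m > best_merit + 1e-12:
                best_subset, best_merit, improved = child, m, True
        stale = 0 if improved else stale + 1
    return sorted(best_subset), best_merit


# Hypothetical correlations: features 0 and 1 are both predictive but redundant.
feature_class = [0.8, 0.7, 0.1]
feature_feature = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]
selected, selected_merit = cfs_best_first(feature_class, feature_feature)
```

In this toy case only feature 0 is kept: adding the redundant feature 1 or the weakly predictive feature 2 lowers the merit, which is the behaviour the evaluation function is designed to produce.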

Classification Phase.
Random Forest is an ensemble learning method for pattern recognition [19,38]. It has been demonstrated that Random Forest has good performance in CPDP [27,39,40] due to its high tolerance to outliers and noise. Random Forest is also less prone to overfitting [18]. Figure 3 shows the process of Random Forest in this paper. Random Forest uses the bootstrap method to randomly draw multiple sample sets from the original samples and builds a Decision Tree for each sample set. The final conclusion is then obtained by combining the results of all decision trees. The voting decision process of Random Forest is given by

$$H(x) = \arg\max_{c \in C} \sum_{i} I\bigl(h_i(x) = c\bigr),$$

where H(x) denotes the combined classification model, h_i denotes a single Decision Tree, C is the set of class labels, and I(·) is the indicator function.
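The voting rule can be illustrated with a short Python sketch: each tree casts one vote, and the class with the most votes is returned. The function name and the stand-in "trees" (plain threshold functions) are hypothetical and only serve to show the argmax over indicator sums.

```python
def forest_predict(trees, x, class_labels=("defective", "clean")):
    """Majority vote: return the class c that maximizes the number of trees with h_i(x) == c."""
    votes = {c: sum(1 for h in trees if h(x) == c) for c in class_labels}
    # On a tie, max() keeps the class that comes first in class_labels.
    return max(class_labels, key=lambda c: votes[c])


# Hypothetical stand-ins for trained decision trees.
toy_trees = [
    lambda x: "defective" if x[0] > 5 else "clean",
    lambda x: "defective" if x[1] > 2 else "clean",
    lambda x: "clean",
]
prediction_a = forest_predict(toy_trees, (10, 3))
prediction_b = forest_predict(toy_trees, (1, 1))
```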
Sampling with replacement is performed on the source project dataset D processed in the domain adaptation phase to obtain a randomly selected subset $D_{T_i}$ as the training set. The sample size of each training set is the same as that of the original data.
This sampling method of Random Forest ensures the variability of the training sets. For ease of explanation, we assume that there are n samples in dataset D. The probability of each sample being picked in a single draw is 1/n. After n draws, the probability that a given sample in D is never picked is $(1 - 1/n)^n$. As n tends to infinity, $\lim_{n \to \infty} (1 - 1/n)^n = 1/e \approx 0.368$. We can therefore assume that about 37% of the samples in D will not appear in a given training set, which guarantees the variability of the training sets. Then λ features are randomly selected from the source project dataset to construct a Decision Tree, and each node is split based on (2).
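The 37% figure can be checked numerically. The following Python sketch compares the closed-form probability with a simulated bootstrap draw; the function names and the chosen sample sizes are illustrative assumptions.

```python
import math
import random


def never_picked_probability(n):
    """Probability that a given sample is missed by n draws with replacement."""
    return (1 - 1 / n) ** n


def out_of_bag_fraction(n, seed=0):
    """Fraction of n samples that a single simulated bootstrap sample misses."""
    rng = random.Random(seed)
    picked = {rng.randrange(n) for _ in range(n)}
    return 1 - len(picked) / n


limit = math.exp(-1)  # 1/e, approximately 0.368
analytic = never_picked_probability(1000)
simulated = out_of_bag_fraction(20000)
```

Both values should lie close to 1/e, so roughly a third of the source samples are left out of each tree's training set.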
In a single Decision Tree, the Gini' metric is calculated for each attribute, and the attribute with the minimum Gini' metric is selected to split the current node. The Decision Tree is constructed recursively until the stop criterion is reached. The algorithm is described as follows.

Experimental Design
For our method TFIA, in this section, we raise four research questions and set up experiments to discuss and analyse each question.

Research Questions.
RQ1: Does the feature-level filtering approach proposed in this paper have any impact on the performance of the model?
In TFIA, we use CFS-based filtering to reduce the data distribution difference between the source and target project datasets. Therefore, we intend to analyse the role of the filtering method in the whole CPDP model.
RQ2: In the classification phase, does the choice of classifier affect the overall performance?
In TFIA, feature-class correlation is also amplified in the classification phase as we use Random Forest as the classifier. Classifiers behave differently on different types of data, so we want to investigate how other classifiers would perform in our model, such as the linear classifiers Logistic Regression and Support Vector Classification, the Bayes' formula-based classifier Naive Bayes, and the tree-based classifier Decision Tree.
RQ3: How does the proportion of labeled target project samples added to the source projects affect the prediction performance?
In the domain adaptation phase of TFIA, a certain percentage of labeled samples from the target projects is required. We want to analyse how different proportions of labeled target project samples affect the overall prediction performance.
RQ4: Compared with the classical methods and the latest research, does the method proposed in this paper improve the CPDP model's performance?
Various methods have been proposed by researchers for the CPDP problem, such as the classical methods TCA [23], the Peters filter [22], and the Burak filter [21], and the newer method ALTRA [27]. We seek to analyse how the performance obtained by TFIA differs from that of these methods.

Dataset.
In this paper, we use the AEEEM database, a widely used dataset in the field of software defect prediction research. The AEEEM database was collected and compiled by D'Ambros et al. [41]. Table 1 shows the details of the AEEEM database, including the project names, the project types, the numbers of modules, the numbers of defective modules, and the ratios of defective modules.

Experimental Environment.
All of our code is written in Python 3.7. The GPU used for the experiments is an NVIDIA TITAN V and the CPU is an Intel i9-9920X. Classification is performed with WEKA [42] using default parameters.

Performance Measure.
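The confusion matrix is used to store the correct and incorrect decisions made by the classifier. Let TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives in the confusion matrix, with defective modules as the positive class. Recall is the ratio of correctly predicted defective modules to all truly defective modules, calculated as

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

F1-measure is often used to evaluate the performance of a classification model. With $\mathrm{Precision} = TP / (TP + FP)$, F1-measure can be calculated by

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

Area under the ROC Curve.
Area under the ROC curve (AUC) is used to evaluate the degree of discrimination obtained by the model. The value of AUC ranges in [0, 1]. The AUC of a random prediction is 0.5. The advantage of AUC is that, unlike precision and recall, it is insensitive to the decision threshold. The higher the AUC, the better the prediction.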
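The measures above can be computed with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, labels are encoded as 1 (defective) and 0 (clean) by assumption, and AUC is computed through its rank interpretation, that is, the probability that a randomly chosen defective module receives a higher score than a randomly chosen clean one, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) with `positive` as the defective class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn


def recall_score(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def precision_score(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def f1_score(tp, fp, fn):
    precision, recall = precision_score(tp, fp), recall_score(tp, fn)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def auc_score(y_true, scores, positive=1):
    """Rank-based AUC over all (defective, clean) pairs of modules."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for five modules.
truth = [1, 1, 0, 0, 0]
predicted = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2]
tp, fp, tn, fn = confusion_counts(truth, predicted)
f1 = f1_score(tp, fp, fn)
auc = auc_score(truth, scores)
```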

Matthews Correlation Coefficient.
Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classification. It was introduced in biochemistry by Brian W. Matthews in 1975 [43].
MCC takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes [44]. MCC is essentially a correlation coefficient between the observed and predicted binary classifications. It returns a value between -1 and +1. MCC = +1 indicates a perfect prediction, MCC = 0 indicates that the model is no better than a random prediction, and MCC = -1 indicates complete inconsistency between prediction and observation.
The formula for MCC is

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN are the confusion-matrix counts of true positives, true negatives, false positives, and false negatives.
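A direct Python transcription of the MCC formula is sketched below. The function name is hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here rather than something stated in the text.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """MCC computed from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef_from_counts(tp=2, tn=3, fp=0, fn=0)
inverted = matthews_corrcoef_from_counts(tp=0, tn=0, fp=3, fn=2)
mixed = matthews_corrcoef_from_counts(tp=1, tn=2, fp=1, fn=1)
```

The three calls cover the two extremes and one intermediate case: a perfect prediction, a completely inverted prediction, and a partially correct one.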

Results and Discussion
In this section, we present the experimental results and give the answers to the research questions from Section 4.

Answer for RQ1.
To answer RQ1, we designed two groups of experiments on the AEEEM database, where we formed a total of 20 source-target project pairs from the five projects EQ, JDT, ML, LC, and PDE. The first group of experiments performed no filtering on these 20 source-target project pairs. The second group of experiments was filtered by CFS with the greedy best-first search strategy used in this paper. Both groups of experiments used Random Forest as the classifier, randomly added 40% of the target project data to the source project dataset, and processed the source project datasets by resampling. Table 2 shows the results of the evaluation metrics obtained on the 20 source-target project pairs without (Model 1) and with (Model 2) the filtering method proposed in this paper.
From Table 2, it can be observed that the overall effect of the prediction models trained from the data processed by the filtering method is significantly improved (the data in bold), with an average F1-measure improvement of about 156.54%, an AUC improvement of about 45.74%, and an MCC improvement of about 173.46%. Numerically, in all of these 20 source-target project pairs, the models with filtering work better than those without filtering. Figure 4 shows the boxplots of F1-measure, AUC, and MCC of the model without filtering (Model 1) and the model with filtering (Model 2) based on the data from Table 2. As can be seen from Figure 4, the prediction models trained from the data processed by the filtering have higher numerical intervals in the overall distribution than those trained from the data without filtering. In terms of box size and whisker length, the overall performance of the model after filtering is also less volatile and more stable. Therefore, we can conclude that the filtering method proposed in this paper is effective in improving the performance of the CPDP model on these datasets.
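For concreteness, the experimental setup can be sketched in Python: five projects give 5 × 4 = 20 ordered source-target pairs, and a relative improvement is computed as the change divided by the baseline. The helper name and the numeric values passed to it are hypothetical; whether the reported percentages are ratios of averages or averages of per-pair ratios is not specified here, so the function shows only the basic calculation.

```python
from itertools import permutations

PROJECTS = ["EQ", "JDT", "ML", "LC", "PDE"]

# Every ordered (source, target) pair with distinct projects.
project_pairs = list(permutations(PROJECTS, 2))


def relative_improvement(baseline, improved):
    """Improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Hypothetical metric values, not taken from Table 2.
example_gain = relative_improvement(baseline=0.2, improved=0.5)
```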
Computational Intelligence and Neuroscience

Answer for RQ2.
To answer RQ2, models for the 20 source-target project pairs from the AEEEM database were trained and validated using Naive Bayes (NB), Logistic Regression (LR), Decision Tree (DT), Support Vector Classification (SVC), and Random Forest (RF) as classifiers. All the datasets were processed by the filtering method proposed in this paper. Every set of experiments randomly added 40% of the samples from the target project datasets to the source project datasets, and the source project datasets were processed by resampling. Table 3 shows the F1-measure, AUC, and MCC of the models with the different classifiers. The data in bold show the classifier with the best performance in each set of experiments. The values of F1-measure show that, on the data processed by our filtering method, RF has the best results, DT is the second best, and LR, NB, and SVC are similar to each other and clearly behind RF and DT. The situation is the same for AUC and MCC. A possible explanation is that, in the first phase, our filtering method amplifies the proportion of features that have a strong correlation with the class, which benefits DT and the Decision Tree-based ensemble method RF. The other three classifiers are relatively insensitive to the dataset with amplified features, so their results are worse than those of DT and RF. Figure 5 shows the boxplots of F1-measure, AUC, and MCC of the models with the different classifiers. From Figure 5, it can be seen that the prediction models constructed with Random Forest and Decision Tree as classifiers achieve higher performance metrics and less overall volatility. This indicates that the choice of classifier has an impact on the performance of the prediction model.
On the dataset processed by the filtering method in this paper, classifiers that are more sensitive to feature importance have better results.

Answer for RQ3.
To analyse the effect of the proportion of target project samples added to the source project datasets on the prediction results, we added different proportions of labeled target project samples to the source project datasets, starting from 0% in steps of 10%, and trained with TFIA. Every source project dataset was processed by resampling. Table 4 shows the average F1-measure, AUC, and MCC values over the 20 source-target project pairs for the different proportions of target project samples added to the source projects. We then smoothed the values to plot Figure 6, from which we can see that the change in the three measures becomes very small at around 70%, indicating that the model performance reaches its best at around 70%. However, in a practical prediction environment, it is not feasible to manually label 70% of the target project samples, so we only consider the growth rate of the indicators in this paper. It can be seen that the AUC starts to level off at around 40%, and the improvement rates of F1-measure and MCC start to become smaller, so we believe that adding 40% of the labeled samples of the target projects to the source projects gives better results on these 20 source-target project pairs.

Answer for RQ4.
To answer RQ4, we used TCA, the Burak filter, the Peters filter, and ALTRA on the 20 source-target project pairs in the AEEEM database for prediction and compared each of them with our proposed method. To ensure the consistency of the experimental conditions, 40% of the samples of the target project datasets were randomly added to the source project datasets, and the source project datasets were processed by resampling. Table 5 shows the F1-measure, AUC, and MCC of ALTRA (Model 1), TCA (Model 2), the Peters filter (Model 3), the Burak filter (Model 4), and TFIA (Model 5).
As can be seen from the data in Table 5, compared with the other methods, the method proposed in this paper shows significant improvements in F1-measure, AUC, and MCC. Taking F1-measure as an example, its value is improved by about 126.62% over TCA, 98.64% over the Burak filter, 86.46% over the Peters filter, and 31.12% over ALTRA. From Figure 7, we can see that the method in this paper has smaller boxes and shorter whiskers than the other methods, indicating that TFIA performs more consistently on the datasets. Although ALTRA performs better than TCA, the Burak filter, and the Peters filter on the average F1-measure, the fluctuation of ALTRA is larger. Therefore, we can conclude that the method proposed in this paper possesses good stability and prediction performance on the AEEEM database compared to these methods.

Threats to Validity.
In this paper, the threats to validity are mainly divided into internal and external validity. The threat to internal validity comes from the setting of classifier parameters. In this paper, we use the default parameters given in the references and provided by WEKA, which may lead to some differences in the classification results. To mitigate this difference, we use the same default parameters provided by WEKA for the classification phase of our proposed method.
The threat to external validity mainly comes from the dataset used. We use five project datasets from the publicly available AEEEM database and combine them into 20 source-target project pairs for our experiments. If other data sources are selected, especially real-world engineering datasets, different experimental results may be obtained. Experiments on more scenarios will be included in our future research plan.

Conclusion and Future Work
To tackle the problems of class imbalance and different data distribution among projects in cross-project defect prediction (CPDP), which is a typical application of pattern recognition, we propose a model based on two-phase feature importance amplification (TFIA) in this paper. In the domain adaptation phase, TFIA reduces the differences in data distribution between the source and target projects by adding a certain percentage of samples from the target project to the source project. Meanwhile, correlation-based feature selection (CFS) with a greedy best-first search strategy is utilized to obtain a feature subset with weak feature-feature correlation and strong feature-class correlation. The source and target project datasets filtered by this feature subset are input to the following classification phase. As tree-based classifiers are sensitive to features with high feature-class correlation, Random Forest, the ensemble method based on Decision Tree for pattern recognition, is used in the classification phase. Multiple decision trees can amplify the importance of features and complement each other to improve the performance of the whole prediction model.
We conducted four experiments to validate TFIA on 20 project pairs from the widely adopted AEEEM dataset. The first experiment demonstrates that the domain adaptation approach we designed is effective and significantly improves the overall model performance. The second experiment demonstrates that Random Forest outperforms the other classifiers on the dataset processed by our domain adaptation approach. The third experiment analyses the influence of the proportion of samples added from the target projects on the model. It shows that adding 40% of the target project data greatly improves the model performance while remaining realistic for practical application scenarios. In the fourth experiment, we compare TFIA with four other methods, and the experimental results show that TFIA provides better and more stable performance.
In this paper, our method has been shown to improve CPDP, but some parts still need to be studied further. We will pursue two directions in our future research. The first is to validate the effectiveness of TFIA on other datasets and make it more scalable and robust, since our experiments use subprojects from the same root project. Secondly, since the parameters used in this paper are the defaults and parameters have a large impact on the prediction model, we will study the effect of adjusting parameters on the model performance. We will also look for the most suitable parameters and for methods that can automatically adjust the parameters according to the real application scenarios.

Data Availability
A publicly available dataset is used for this study (AEEEM).