Tsbagging: A Novel Cross-Project Software Defect Prediction Algorithm Based on Semisupervised Clustering

Software defect prediction (SDP) is an important technology which is widely applied to improve software quality and reduce development costs. It is difficult to train an SDP model when the software to be tested only has limited historical data. Cross-project defect prediction (CPDP) has been proposed to solve this problem by using source project data to train the defect prediction model. Most CPDP methods build defect prediction models based on the similarity of feature space or data distance between different projects. However, when the target project has a small amount of labeled data, these methods usually do not consider this part of the data. Therefore, when the distributions of the source project and the target project are quite different, it is difficult for these methods to achieve good prediction performance. To solve this problem, this paper proposes a CPDP method based on semisupervised clustering (namely, Tsbagging). Tsbagging has two stages: in the first stage, we cluster the source project data based on the limited labeled data in the target project and assign different weights to the source project data according to the clustering results. In the second stage, we use the bagging method to train the prediction model based on the weights assigned in the first stage. The experimental results show that the performance achieved by Tsbagging is better than that of other existing SDP methods.


Introduction
SDP is a very important technology in the software testing stage. It can quickly predict defects and provide guidance for allocating test resources and manpower in the early stage of software development [1,2]. At present, most SDP methods use machine learning technology to build defect prediction models [3-5]. For example, Lessmann et al. [6] and Shepperd et al. [7] applied traditional machine learning algorithms, such as decision tree, naive Bayes (NB), neural network, and support vector machine (SVM), to SDP and achieved good results. Elish et al. [8] compared 8 machine learning methods on the NASA dataset, and the results showed that SVM was better than the other algorithms. In addition, researchers have also improved traditional machine learning methods based on the characteristics of SDP. For example, Wang et al. [9] proposed a Compressed C4.5 Model (CCM) based on the C4.5 decision tree by using the Spearman rank correlation coefficient. Ji et al. [10] applied the Kolmogorov-Smirnov test to the datasets from the PROMISE database and found that the feature space of those datasets was not normally distributed. Therefore, they improved NB and proposed an SDP method based on kernel density estimation. Moreover, Ji et al. [11] proposed a weighted naive Bayes method (WNB). WNB used the information diffusion model instead of the default probability density function of the normal distribution to calculate the probability density of each feature. This method also achieved good performance. Li et al. [12] proposed CoForest and ACoForest. CoForest is a sampling method based on semisupervised learning. On the basis of random forest classification, this method can find optimal sampling examples by random sampling. ACoForest further enhanced the prediction performance of CoForest based on active learning. Studies demonstrated that these two methods performed better than other traditional machine learning methods. Bejjanki et al. [13] proposed a new class imbalance reduction algorithm (CIR) for the class imbalance problem in defect prediction datasets. This method makes the defect data and nondefect data in an unbalanced dataset symmetrical by considering the distribution characteristics of the unbalanced dataset. Experiments showed that CIR has better defect prediction performance than traditional methods.
Most of the above research studies have focused on within-project defect prediction (WPDP). However, sometimes, few local training data are available in the target software in which defects must be detected, because past local defective modules are expensive to label for module development in an unfamiliar domain [4,14]. To solve this problem, CPDP methods use source project data to train the model that predicts defects in the target project. Zimmermann et al. [15] constructed 622 cross-project task combinations from 12 projects to evaluate CPDP model performance. The experimental results showed that traditional defect prediction methods have difficulty achieving good CPDP performance. This is because these methods assume that the training and test data have similar distributions, whereas data from different projects have different distributions. Ma et al. [16] assigned different weights to the data in the source project by calculating the similarity between those data and the target project and then trained a weighted Bayes classifier (namely, TNB) according to these weights. Nam et al. [17] proposed TCA+. TCA+ uses transfer component analysis to find a latent feature space for data from different projects; once the latent space is determined, it maps the source project and target project into that space to eliminate the difference in feature distribution between projects. Turhan et al. [18] proposed a CPDP method based on nearest neighbors (namely, NN). NN constructs a training dataset by selecting, from the source project, the nearest neighbors of the target project data and uses this training dataset to train the defect prediction model. Kawata et al. [19] proposed a CPDP method based on DBSCAN [20]; they use DBSCAN to find subclusters and then select the subclusters that contain at least one record of the target project data to build the training dataset. Ryu et al. [21] proposed a CPDP method (VCB + SVM).
Firstly, VCB + SVM calculates the similarity weight of each instance in the source project based on the dataset of the target project and then constructs the defect prediction model based on the support vector machine and a boosting method. Chen et al. [22] proposed a CPDP method, collective transfer defect prediction (CTDP), based on multisource transfer learning. Firstly, CTDP extends the source project dataset by using different normalizations and TCA. Then, it builds several base classifiers on the extended source project dataset. According to the contribution of each classifier to the target project, CTDP uses the particle swarm optimization algorithm to adaptively weight these base classifiers to construct an ensemble classifier. Finally, CTDP uses the ensemble classifier to predict defects. He et al. [23] proposed a CPDP method based on multisource transfer learning (FSS + bagging). FSS + bagging has three steps. Firstly, FSS + bagging calculates the similarity between each candidate source project and the target project and selects the k most similar source projects. Next, for each selected source project, FSS + bagging removes some unstable features to reduce the data distribution difference between the source project and the target project. Finally, FSS + bagging uses bagging to get the final defect prediction results. To solve the class imbalance in CPDP, Limsettho et al. [24] proposed a new oversampling method, CDE-SMOTE, which uses CDE to estimate the class distribution of the target project and uses SMOTE to modify the class distribution of the training data until it is similar to the distribution of the target project. Although the above methods have achieved good performance in CPDP, most of them build CPDP models based on the similarity of feature space or data distance between the source project and the target project. However, when the target project has limited labeled data, those methods often do not take this label information into account. Turhan et al.
[25] found that mixing a limited amount of target project data with source project data selected by the NN method to train the classifier can significantly improve the performance of CPDP. However, this method also does not consider the limited label information when selecting the source project data. Therefore, when the data distribution between the target project and the source project is very different (especially when concept shift occurs), those methods have difficulty achieving good performance. Chen et al. [26] proposed a two-stage transfer learning method, double transfer boosting (DTB), to solve this problem. In the first stage, DTB uses the data gravitation method proposed by Ma et al. [16] to assign weights to the source project data; in the second stage, DTB uses TrAdaBoost [27] to build the classifier based on the weights assigned in the first stage. However, DTB does not consider the impact of concept shift when assigning weights, which affects the final defect prediction performance. This paper proposes a CPDP method based on semisupervised clustering (namely, Tsbagging). In Tsbagging, we first propose a semisupervised clustering algorithm (namely, TSCluster) to cluster the source project data based on limited target project data. Next, we train several base classifiers based on the clustering results and integrate these classifiers by bagging to predict defects. We compare the performance of Tsbagging with other SDP methods on 42 defect prediction datasets. The experimental results show that Tsbagging has better performance than the other methods.

Nearest Neighbor Filter.
Since the CPDP method can use source project data to help train the defect prediction model when the target project only has limited software historical data, it has always been a research focus in the field of SDP. However, an important problem in CPDP is how to select data similar to the target project from the source project. The NN (nearest neighbor filter) is a commonly used technology for selecting data from the source project [18,25]. The specific steps are as follows: firstly, for each data point in the target project, NN selects its k nearest neighbors from the source project based on the Euclidean distance. If there are n data points in the target project, the number of initially selected nearest neighbors is k × n. Then, NN removes duplicate data from the selected nearest neighbors to build the final training dataset. Finally, NN uses the final training dataset to train the SDP model. Moreover, Turhan found that NN + WP (WP means within project), which is trained on the limited target data together with the source data selected by NN, can achieve better performance than NN. However, both NN and NN + WP select data from the source project based only on the distances between data of different projects. When the data distribution between the target project and the source project is very different, especially in the case of concept shift, it is difficult for NN to accurately select data similar to the target project from the source project.
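
The selection step above can be sketched as follows. This is a minimal illustration of an NN-style filter, not the authors' implementation; the helper name and the toy arrays are invented for illustration:

```python
import numpy as np

def nn_filter(source_X, target_X, k=10):
    """NN-filter sketch: for every target instance, keep its k nearest
    source instances (Euclidean distance), then drop duplicates."""
    selected = set()
    for t in target_X:
        d = np.linalg.norm(source_X - t, axis=1)    # distances to all source rows
        selected.update(np.argsort(d)[:k].tolist())  # indices of the k nearest
    return np.sort(np.fromiter(selected, dtype=int)) # deduplicated row indices

# toy data: 6 source rows, 2 target rows, k = 2 -> at most 4 unique rows kept
rng = np.random.default_rng(0)
src = rng.normal(size=(6, 3))
tgt = rng.normal(size=(2, 3))
idx = nn_filter(src, tgt, k=2)
assert len(idx) <= 4
```

Deduplication is what makes the final training set size at most k × n rather than exactly k × n.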

Concept Shift.
Concept shift is also called concept drift [28]. According to Moreno-Torres's theory [28], concept shift is defined as follows: P_S(x) = P_T(x), but P_S(y | x) and P_T(y | x) are different, where P_S(y | x) and P_T(y | x) represent the posterior probabilities in the source project and the target project, respectively, and P_S(x) and P_T(x) represent the marginal distributions of the source and target projects, respectively. Moreno-Torres considers that nonstationary environments can lead to concept shift, in which the relationship between the input and class variables changes. Kanayake et al. [29] found that concept shift can occur in SDP and influence the defect prediction quality. In CPDP, different development environments or application fields can lead to concept shift. When concept shift occurs, because P_S(y | x) and P_T(y | x) are different, even if a source project data point is close to a target project data point, their labels may be very different. Therefore, when concept shift occurs, it is difficult for the NN method to accurately select appropriate data from the source project. As shown in Figure 1, all data are distributed in four areas A, B, C, and D, in which circular points represent nondefect data, triangular points represent defect data, red points represent target project data, and blue points represent source project data. As a result of concept shift, the labels of the target project data differ from those of their nearest neighbors in areas A and D. If NN is used to select data from the source project, data whose distribution differs greatly from that of the target project will be selected.
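
A tiny synthetic illustration of the definition above: the marginal distributions coincide (P_S(x) = P_T(x)) while the labeling rules are inverted (P_S(y | x) ≠ P_T(y | x)), so any distance-based selection picks wrongly labeled neighbors. The data and labeling rules here are invented for illustration:

```python
import numpy as np

# Identical marginals, opposite labeling rules: a deliberate concept shift.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))

y_source = np.where(X[:, 0] > 0, 1, -1)   # source rule: positive half is defective
y_target = np.where(X[:, 0] > 0, -1, 1)   # target rule is inverted

# The nearest source neighbor of any target point carries the WRONG label,
# so a purely distance-based filter is misled on every instance.
disagree = np.mean(y_source != y_target)
assert disagree == 1.0
```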

Semisupervised Clustering.
Semisupervised clustering can use additional information to achieve better clustering performance than unsupervised clustering. At present, there are mainly two types of semisupervised clustering methods. One is based on connected and unconnected constraints. A connected constraint means that two data points belong to the same cluster, and an unconnected constraint means that they do not belong to the same cluster. For example, Wagstaff et al. [30] proposed the constrained k-means algorithm. This algorithm assigns an appropriate cluster to each data point by checking whether the data violate the given connected and unconnected constraint pairs during the clustering process of k-means. The other type of semisupervised clustering method is based on given cluster label information. This method aids clustering by taking the given cluster labels as constraint seeds. For example, Basu et al. [31] proposed the constrained seed k-means method; this method initializes the cluster centers of k-means according to the given cluster label information and does not change the cluster membership of the seed samples during clustering. In this paper, we use a limited amount of target project data as constraint seeds to cluster the source project data selected by the NN method and exclude the data that differ greatly from the target project data in order to achieve better prediction performance.

Tsbagging
In this section, we propose a CPDP method, namely, Tsbagging. Tsbagging mainly consists of a clustering stage and an ensemble stage. In the clustering stage, we propose a semisupervised method (TSCluster) to cluster the data of the source project and assign different weights to these data according to the clustering results. In the ensemble stage, based on the weights given in the clustering stage, we use bagging [32] to train several classifiers and integrate them to predict defects. In the following sections, we describe Tsbagging in detail.

Problem Description.
In this paper, we define three sets: D_S = {(x_i^S, y_i^S) | i = 1, 2, ..., Sn}, D_WP = {(x_j^WP, y_j^WP) | j = 1, 2, ..., Wn}, and D_Test = {x_t^Test | t = 1, 2, ..., Tn}, where D_S represents the source project dataset, D_WP represents the labeled dataset in the target project (WP means within project), and D_Test represents the dataset to be tested in the target project. Note that the data from D_Test and D_WP belong to the same distribution, while D_S belongs to a different distribution from D_Test and D_WP. x_i^S, x_j^WP, and x_t^Test are m-dimensional vectors. Each dimension in a vector represents the measured value of a metric for a program module. y_i^S and y_j^WP ∈ {−1, 1}, where −1 represents that a module is a nondefect-prone module and 1 means that a module is a defect-prone module. Sn, Wn, and Tn represent the data amounts of D_S, D_WP, and D_Test, respectively, and Wn ≪ Tn.

Overall Flow of Tsbagging.
The overall process of Tsbagging is shown in Figure 2. Tsbagging consists of a clustering stage and an ensemble stage. In the clustering stage, firstly, we use the NN filter to select the candidate dataset (namely, FCCData) from D_S, and then we propose a two-step clustering method. In the first clustering, we use k-means to cluster the data from the different classes in D_WP, respectively, so that in the final clustering result, the data in each cluster belong to the same class, and this class label is regarded as the cluster label. For example, in Figure 2, the result of the first clustering contains six clusters. A red circle indicates that all the data in the cluster are defective data, and a green circle indicates that all the data in the cluster are nondefective data. In the second clustering, we use the semisupervised method based on the given cluster label information to cluster FCCData. In this clustering process, we take the clusters produced by the first clustering as constraint seeds, and we do not change the cluster membership of the data from D_WP. For example, in Figure 2, there are still six clusters in the second result. After the second clustering, we assign weights to the data in each cluster based on the clustering hypothesis to build the final training dataset (namely, CCTrainData). The clustering hypothesis is that if all data in a dataset obey the same distribution, data from the same cluster are more likely to have the same label. Conversely, if the labels of data from the same cluster are different, they do not obey the same distribution. So, the idea of weight assignment is as follows: in each cluster, if the label of a data point from D_S is the same as the cluster label, we believe that the data point obeys the clustering hypothesis; that is, it may obey the data distribution of the target project, so we give it a higher weight. If the label of a data point from D_S is different from the cluster label, we believe that the data point violates the clustering hypothesis.
This means that the data point may not obey the data distribution of the target project, so it is given a lower weight. When all the data from D_S have been given weights, we use the bagging method to train the final classifier from CCTrainData to predict D_Test. We describe the algorithm in detail in the following sections.

Clustering Stage.
The clustering stage has two steps. In the first step, we use the NN method to select the candidate dataset (FCCData). In the second step, we use the proposed TSCluster to cluster FCCData and assign weights to the data in FCCData according to the clustering results. The flow of TSCluster is shown in Algorithm 1.
TSCluster has two clustering processes. In the first clustering, we cluster the defect data and nondefect data in D_WP, respectively, and construct the initial cluster seeds for the second clustering process according to the clustering results. Specifically, we first select the defect dataset and the nondefect dataset from D_WP, respectively. Next, we use the k-means method to cluster these datasets to build a defect cluster set and a nondefect cluster set (where the defect cluster set has pk clusters and the nondefect cluster set has nk clusters). Then, these two cluster sets are merged to build the initial cluster seeds (namely, InitClusterSet) for semisupervised clustering. The specific process is shown in Algorithm 2.
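
The first clustering step can be sketched as follows, using scikit-learn's KMeans as a stand-in for the paper's k-means; the helper name, the guard for tiny classes, and the toy data are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_seed_clusters(X_wp, y_wp, pk=5, nk=5, seed=0):
    """First clustering (sketch): run k-means separately on the defect
    (y = 1) and nondefect (y = -1) parts of D_WP, so every seed cluster
    is pure; the shared class label becomes the cluster label."""
    seeds = []  # list of (center, cluster_label, member_rows)
    for label, k in ((1, pk), (-1, nk)):
        part = X_wp[y_wp == label]
        k = min(k, len(part))                     # guard against tiny classes
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(part)
        for j in range(k):
            seeds.append((km.cluster_centers_[j], label, part[km.labels_ == j]))
    return seeds

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
y = np.r_[np.ones(20), -np.ones(20)]
seeds = build_seed_clusters(X, y, pk=3, nk=3)
assert len(seeds) == 6                            # pk + nk seed clusters
assert all(lbl in (-1, 1) for _, lbl, _ in seeds)
```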
In the second clustering, based on an idea similar to constrained seed k-means [31], we use the InitClusterSet constructed by the first clustering as constraint seeds to cluster the FCCData dataset. Specifically, in each iteration of the clustering process, we use InitClusterSet to initialize the new cluster set and then calculate the distance between each data point of FCCData and each cluster center; then, we assign each data point to the nearest cluster. It is worth noting that in this process, we do not change the cluster membership of the data from D_WP. When all of FCCData's data are assigned, we update the centers of all clusters; if no cluster center is updated, the clustering terminates and the final cluster set (finalClusterSet) is obtained.
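
The constrained assignment loop described above might look like the following sketch; the data layout, the convergence test, and the toy one-dimensional example are assumptions, not the paper's exact Algorithm 1:

```python
import numpy as np

def constrained_assign(centers, seed_members, X_new, iters=100):
    """Second clustering (sketch), in the spirit of constrained seed
    k-means: seed rows from D_WP never change cluster; only the filtered
    source rows (X_new) are (re)assigned each round."""
    centers = [c.copy() for c in centers]
    assign = np.zeros(len(X_new), dtype=int)
    for _ in range(iters):
        # assign every new row to its nearest current center
        d = np.stack([np.linalg.norm(X_new - c, axis=1) for c in centers])
        assign = d.argmin(axis=0)
        moved = False
        for j in range(len(centers)):
            pool = [seed_members[j], X_new[assign == j]]
            pool = np.vstack([p for p in pool if len(p)])
            c = pool.mean(axis=0)                 # seeds + assigned rows
            if not np.allclose(c, centers[j]):
                centers[j], moved = c, True
        if not moved:                             # all centers stable
            break
    return centers, assign

# toy: two seed clusters around -2 and +2 on one feature
seeds = [np.array([[-2.0]]), np.array([[2.0]])]
centers = [s.mean(axis=0) for s in seeds]
Xs = np.array([[-1.8], [-2.2], [1.9], [2.4]])
_, assign = constrained_assign(centers, seeds, Xs)
assert assign.tolist() == [0, 0, 1, 1]
```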
After the two clustering steps, we assign different weights to the data in each cluster. The specific steps are as follows: firstly, for the jth cluster in the final cluster set (finalClusterSet[j]), we assign a cluster label (namely, Clusterlabel) and set Clusterlabel = y_init, where (x_init, y_init) ∈ InitClusterSet[j]; then, according to the clustering assumption (i.e., data in the same cluster are more likely to have the same label), we assign a higher weight to data whose label is the same as Clusterlabel and a lower weight to data whose label is different from Clusterlabel. The specific calculation formula is as follows:

weight(x_i) = w_ij, if y_i = Clusterlabel; weight(x_i) = 1/w_ij, otherwise, where w_ij = 1 + ϵ/(1 + dis(x_i, u_j)), (1)

where (x_i, y_i) ∈ finalClusterSet[j] ∩ FCCData, u_j is the cluster center of finalClusterSet[j], dis(·) is the Euclidean distance between x_i and u_j, and ϵ is an amplification factor; in this paper, ϵ = 2. In equation (1), the weight of data with the same label as Clusterlabel is higher than the weight of data with a different label from Clusterlabel. For a data point (x_i, y_i), when y_i = Clusterlabel, the closer it is to the cluster center, the more likely it is to obey the data distribution of the target project; therefore, its weight is higher. On the other hand, when y_i ≠ Clusterlabel, the closer it is to the cluster center, the less likely it is to obey the data distribution of the target project; therefore, its weight is lower. After all data have been assigned weights, we add these data to a final training dataset (namely, CCTrainData), and then, in the ensemble stage, we use CCTrainData to train the classifier.
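
The weighting rule can be sketched for a single cluster as below. The exact form of the paper's equation (1) is not reproduced here; the hypothetical w = 1 + eps/(1 + dist) merely matches the described behavior (same-label rows weigh more than 1 and more near the center; different-label rows weigh less than 1 and less near the center):

```python
import numpy as np

def cluster_weights(cluster_rows, cluster_labels, center, cluster_label, eps=2.0):
    """Weighting sketch for one final cluster, using a hypothetical
    distance-decreasing w: same-label rows get w, others get 1/w."""
    dist = np.linalg.norm(cluster_rows - center, axis=1)
    w = 1.0 + eps / (1.0 + dist)                  # > 1, larger near center
    return np.where(cluster_labels == cluster_label, w, 1.0 / w)

center = np.zeros(2)
rows = np.array([[0.1, 0.0], [3.0, 0.0], [0.1, 0.0]])
labels = np.array([1, 1, -1])
wts = cluster_weights(rows, labels, center, cluster_label=1)
assert wts[0] > wts[1] > 1.0     # same label: closer -> heavier
assert wts[2] < 1.0 < wts[0]     # different label: down-weighted
```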

Ensemble Stage.
In the ensemble stage, we use the bagging method to train several classifiers (in this paper, we use NB as the base classifier) and integrate these classifiers to predict the target data. Firstly, we sample data from CCTrainData to build several datasets according to the weights assigned in the clustering stage. It should be noted that a data point with a higher weight has a higher probability of being sampled. Each dataset has CClen data points, where CClen is the amount of data in CCTrainData. Then, we combine these datasets with D_WP to train the base classifiers and add these base classifiers to a classifier set (Classifierpool). Finally, when all classifiers are trained, we use plurality voting to ensemble all base classifiers in the Classifierpool to predict the data in D_Test. The specific ensemble method is as follows:

H(x) = c_j, where j = argmax_{j ∈ {0, 1}} Σ_{i=1}^{Classifiersize} I(h_i(x) = c_j), (2)

where c_j is the predicted label, with c_0 = −1 or c_1 = 1; Classifiersize is the number of base classifiers in the Classifierpool; h_i(x) ∈ {−1, 1} is the prediction function of the ith classifier; and I(·) is an indicator function: I(x) = 1 when x is true; otherwise, I(x) = 0. The overall process of Tsbagging is shown in Algorithm 3.
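
The ensemble stage can be sketched as below, using scikit-learn's GaussianNB as the base learner. The sampling and voting follow the description above, but details such as the random source, the function name, and the toy data are assumptions:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def tsbagging_predict(X_cc, y_cc, w_cc, X_wp, y_wp, X_test, size=20, seed=0):
    """Ensemble-stage sketch: per round, draw |CCTrainData| rows with
    probability proportional to the clustering-stage weights, append the
    labeled target rows D_WP, train an NB base classifier, and combine
    the pool by plurality voting over the -1/+1 labels."""
    rng = np.random.default_rng(seed)
    p = np.asarray(w_cc, float) / np.sum(w_cc)
    votes = np.zeros(len(X_test))
    for _ in range(size):
        idx = rng.choice(len(X_cc), size=len(X_cc), replace=True, p=p)
        X_tr = np.vstack([X_cc[idx], X_wp])
        y_tr = np.concatenate([y_cc[idx], y_wp])
        votes += GaussianNB().fit(X_tr, y_tr).predict(X_test)
    return np.where(votes >= 0, 1, -1)

# toy: two well-separated classes; test points sit on the class centers
rng = np.random.default_rng(3)
X_cc = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y_cc = np.r_[-np.ones(30), np.ones(30)]
pred = tsbagging_predict(X_cc, y_cc, np.ones(60), X_cc[:4], y_cc[:4],
                         np.array([[-2.0, -2.0], [2.0, 2.0]]))
assert pred.tolist() == [-1, 1]
```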

Experiment
In this section, we describe the experiments we conducted to verify the performance of Tsbagging.

Datasets.
In this experiment, we use seven projects from the public PROMISE database [33] to build cross-project defect prediction datasets. The detailed description of these datasets is shown in Table 1. Each record in these datasets represents the test result of a program module and contains 20 attributes and one label. The detailed description of these attributes is shown in Table 2 [34].
The label values are −1 and 1; 1 represents a defective module, and −1 represents a nondefective module. Similar to references [35-37], in each experiment, we select two different datasets from the seven datasets; one is used as the target project and the other as the source project to build the CPDP experimental dataset (for example, we use poi-2.0 as the target project and synapse-1.2 as the source project). Therefore, we have created 42 experimental datasets.
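
The 42 experimental datasets arise from the ordered (source, target) pairs of distinct projects, which can be checked with a short sketch (the project names here are placeholders for the seven PROMISE projects, which are listed in Table 1):

```python
from itertools import permutations

# ordered pairs of distinct projects: 7 * 6 = 42 cross-project datasets
projects = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
pairs = list(permutations(projects, 2))
assert len(pairs) == 42
```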

Performance Measures.
In this paper, we use F1-measure, MCC, g-measure, and balance [36, 38-40] as performance measures. These measures are defined in terms of TP (defective modules predicted as defective), FN (defective modules predicted as nondefective), FP (nondefective modules predicted as defective), and TN (nondefective modules predicted as nondefective).

Recall is a measure of completeness, describing the proportion of truly defective modules detected among the total number of defective modules:

recall = TP / (TP + FN).

Precision is a measure of exactness, which defines the proportion of truly defective modules among the modules predicted to be defective:

precision = TP / (TP + FP).

PF shows the proportion of all the modules without defects that are predicted to be defective:

PF = FP / (FP + TN).

F1-measure is the harmonic average of recall and precision. An algorithm with a higher F1-measure value (namely, F1 value) implies better performance:

F1-measure = (2 × recall × precision) / (recall + precision).

Matthews correlation coefficient (MCC) takes the measurements of TP, FP, TN, and FN into account in a more comprehensive way. Its range of values is [−1, 1], where 1 denotes a perfect prediction and −1 indicates complete disagreement between actual and predicted values. A larger value means better classifier performance:

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

G-measure [14] is the harmonic average of recall and (1 − PF). An algorithm with a higher g-measure value implies better performance:

g-measure = (2 × recall × (1 − PF)) / (recall + (1 − PF)).

The measure balance is introduced by calculating the Euclidean distance from the real (recall, PF) point to (1, 0), since the point (recall = 1, PF = 0) is the optimal point where all defects are detected without any being missed:

balance = 1 − sqrt((0 − PF)^2 + (1 − recall)^2) / sqrt(2).
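
For reference, the four measures follow directly from a confusion matrix; this sketch restates the standard formulas above in code:

```python
import math

def sdp_measures(tp, fn, fp, tn):
    """F1, MCC, g-measure, and balance from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    pf = fp / (fp + tn)                           # false-positive rate
    f1 = 2 * recall * precision / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    g = 2 * recall * (1 - pf) / (recall + (1 - pf))
    balance = 1 - math.sqrt(pf ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return f1, mcc, g, balance

# perfect prediction: recall = 1, PF = 0 -> every measure at its optimum
f1, mcc, g, balance = sdp_measures(tp=10, fn=0, fp=0, tn=10)
assert (f1, mcc, g, balance) == (1.0, 1.0, 1.0, 1.0)
```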

Experimental Design.
In this section, we design experiments to verify the performance of Tsbagging through the following four questions:

RQ1.
Can Tsbagging effectively overcome the impact of different data distributions between different projects, so as to obtain better prediction performance than traditional defect prediction methods? To research this question, we compare Tsbagging with other traditional SDP methods and calculate their F1-measure, MCC, g-measure, and balance values. It is worth noting that traditional SDP methods do not take the impact of data distribution differences between the source project and the target project into account when training prediction models. If the performance of Tsbagging is better than that of the traditional SDP methods, then Tsbagging can effectively overcome the impact of different data distributions between different projects. Specifically, we compare Tsbagging with naive Bayes (NB) [18], NB + WP, and Adaboost [41], where Adaboost and NB only use D_S to train prediction models, while Tsbagging and NB + WP use D_S and D_WP to train prediction models, where the amount of D_WP is 10%, 15%, and 25% of the target project data, respectively.

RQ2. Can Tsbagging achieve better prediction performance than other CPDP methods?
In order to research this question, we compare Tsbagging with other CPDP methods and calculate their F1-measure, MCC, g-measure, and balance values. Specifically, we compare Tsbagging with NN, the DBSCAN filter [19], VCB + SVM [21], NN + WP, and DTB, in which NN, the DBSCAN filter, and VCB + SVM only use D_S to train prediction models, while Tsbagging, DTB [26], and NN + WP [25] use D_S and D_WP to train prediction models, where the amount of D_WP is also 10%, 15%, and 25% of the target project data, respectively. It is worth noting that NN, the DBSCAN filter, VCB + SVM, and NN + WP do not take the impact of concept shift into account, and DTB also does not take the impact of concept shift into account when it assigns weights. If the performance of Tsbagging is better than that of the other CPDP methods, then Tsbagging can effectively overcome the impact of concept shift.

RQ3.
Are the experimental results of RQ1 and RQ2 statistically significant? Similar to the literature [39,40], this paper uses hypothesis testing to statistically analyze the experimental results of RQ1 and RQ2, so as to verify whether the conclusions are statistically significant. Specifically, we use the Wilcoxon rank sum test to judge whether the experimental results of Tsbagging have significant differences compared with the other methods. In this section, experiments are carried out on 42 datasets and repeated 10 times on each dataset. Therefore, this paper makes a statistical analysis of these 420 data points. The confidence level used by the Wilcoxon rank sum test is 0.05. Moreover, we use box plots to display the distribution of those 420 points in more detail. A box plot has five numerical aspects: minimum, lower quartile, median, upper quartile, and maximum, and it is often used in SDP experiments [42-44].
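
The statistical procedure can be sketched with SciPy's two-sided rank sum test; the score vectors below are synthetic stand-ins for two methods' 420 per-run values, not the paper's results:

```python
import numpy as np
from scipy.stats import ranksums

# hypothetical per-run scores for two methods, judged at alpha = 0.05
rng = np.random.default_rng(4)
scores_a = rng.normal(0.60, 0.05, 420)    # stand-in for method A's 420 F1 values
scores_b = rng.normal(0.50, 0.05, 420)    # stand-in for method B's 420 F1 values
stat, p = ranksums(scores_a, scores_b)
assert p < 0.05                           # the difference is significant
```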

RQ4.
How do we determine the number of clusters (i.e., nk and pk) for Tsbagging?
In order to study this problem, we measure the prediction performance of Tsbagging under different nk and pk and then find a reasonable value range for nk and pk. Specifically, we calculate the F1-measure, MCC, g-measure, and balance of Tsbagging under nk = pk = 3, 5, 7, and 10, respectively. Tsbagging also uses 10%, 15%, and 25% of the target project data, respectively.
It should be noted that in each experiment, we randomly take 10%, 15%, and 25% of the data from the target project as D_WP and the remaining data as D_Test. We repeat the experiment 10 times on each dataset and take the average value of these 10 experiments as the final result. In addition, as suggested by Turhan et al. [18], we apply a "log-filter" to all numeric values, i.e., replacing N with ln(N). This spreads out skewed curves more evenly across the space from the minimum to the maximum values. This "spreading" can significantly improve the effectiveness of data mining, as the distribution of log-filtered feature values fits the normal distribution assumption better [47]. In this experiment, all methods are implemented with the well-known machine learning framework Weka 3.9 [45], and the running environment is Java 11 and Windows 10. The parameter settings of each method are shown in Table 4.
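
The log-filter preprocessing can be sketched as follows; the flooring of zero values is an assumption made here, since ln(0) is undefined and zero-valued metrics are common in defect data:

```python
import numpy as np

def log_filter(X, floor=1e-6):
    """Log-filter sketch in the spirit of Turhan et al.: replace each
    numeric value N with ln(N), clipping values below a small floor."""
    return np.log(np.maximum(np.asarray(X, float), floor))

X = np.array([[1.0, 10.0], [0.0, 100.0]])
out = log_filter(X)
assert out[0, 0] == 0.0                   # ln(1) = 0
assert out[1, 0] == np.log(1e-6)          # zeros clipped to the floor
```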

Experimental Results
In this section, we answer the four research questions proposed above.

Results for RQ1.
In order to answer RQ1, we calculated the prediction performance of Tsbagging and the other SDP methods on different D_WP. The experimental results are shown in Tables 5-7. From the experimental results, it can be seen that NB and Adaboost only use the source project data to train the prediction model, so it is difficult for them to achieve good prediction performance. NB + WP uses a limited amount of target project data during training. Therefore, the performance of NB + WP is better than that of NB and Adaboost. However, NB + WP still does not take the impact of distribution differences between different projects into account, so its performance is still lower than that of Tsbagging. Moreover, on the 10% target project dataset, the largest increases in F1, MCC, g-measure, and balance are 18.5%, 52.7%, 24.5%, and 13.5%, respectively. On the 15% target project dataset, the largest increases in F1, MCC, g-measure, and balance are 21.6%, 59.6%, 25.6%, and 14.1%, respectively. On the 25% target project dataset, the largest increases in F1, MCC, g-measure, and balance are 24%, 71.4%, 26.7%, and 15.2%, respectively. In summary, Tsbagging can use the TSCluster method to find the data that obey the distribution of the target project, so as to effectively overcome the impact caused by the difference in data distribution between projects and achieve better prediction performance.

Results for RQ2.
In order to answer RQ2, we compare Tsbagging with other CPDP methods and calculate their F1, MCC, g-measure, and balance on different datasets. The experimental results are shown in Tables 8-10. The experimental results show that because NN, the DBSCAN filter, and VCB + SVM only use D_S to train the prediction model, when the distribution of the source project and the target project is quite different, especially when concept shift occurs, it is difficult for NN, the DBSCAN filter, and VCB + SVM to achieve good prediction performance; the DBSCAN filter and VCB + SVM even show negative transfer [46] (that is, the CPDP method's prediction performance is lower than that of the traditional SDP methods). DTB and NN + WP use a limited amount of target project data to train the prediction model, so their performance is better than that of NN, the DBSCAN filter, and VCB + SVM. However, NN + WP and DTB also do not consider the impact of concept shift when selecting D_S data and assigning weights, so their F1, MCC, g-measure, and balance are lower than those of Tsbagging. Moreover, on the 10% target project dataset, the largest increases in F1, MCC, g-measure, and balance are 23.2%, 60.6%, 26.4%, and 18.6%, respectively. On the 15% target project dataset, the largest increases in F1, MCC, g-measure, and balance are 26.6%, 74.1%, 36.1%, and 19%, respectively. On the 25% target project dataset, the largest increases in F1, MCC, g-measure, and balance are 28.6%, 84.4%, 38%, and 20.3%, respectively. In summary, Tsbagging can overcome the impact of concept shift by using semisupervised clustering to achieve better performance than other CPDP methods.

Results for RQ3.
From Figures 3-14, we can see that the median and lower quartile of the MCC, g-measure, and balance of Tsbagging are higher than those of all the other methods, and the median and lower quartile of its F1 are higher than those of most of the other methods (the medians of DTB and Tsbagging are at roughly the same level, but the lower quartile of Tsbagging is significantly higher than that of DTB).
erefore, it can be seen that the prediction performance of Tsbagging is better than other defect prediction methods. In conclusion, the experimental results of RQ1 and RQ2 are statistically significant.
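The first-stage weighting that these comparisons refer to, in which source instances consistent with the limited labeled target data receive larger training weights, can be illustrated with a simplified sketch. This is an illustrative nearest-neighbor approximation, not the paper's exact TScluster algorithm, and the `high`/`low` weight values are arbitrary:

```python
import numpy as np

def stage1_weights(X_src, y_src, X_tgt, y_tgt, high=2.0, low=0.5):
    """Give a source instance a larger training weight when its nearest
    labeled target instance carries the same label, i.e., when it appears
    to follow the target project's distribution."""
    weights = np.empty(len(X_src))
    for i, x in enumerate(X_src):
        nearest = int(np.argmin(np.linalg.norm(X_tgt - x, axis=1)))
        weights[i] = high if y_src[i] == y_tgt[nearest] else low
    return weights

# two source modules: one consistent with the labeled target data, one not
X_src = np.array([[0.0, 0.0], [10.0, 10.0]])
y_src = np.array([0, 0])
X_tgt = np.array([[0.0, 1.0], [9.0, 9.0]])   # limited labeled target data
y_tgt = np.array([0, 1])
print(stage1_weights(X_src, y_src, X_tgt, y_tgt))  # first gets 2.0, second 0.5
```

The second stage then trains a bagging ensemble with these instance weights, so source data that contradict the target labels contribute less to every base classifier.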

Results for RQ4.
In order to answer RQ4, we calculate the F1, MCC, g-measure, and balance values of Tsbagging under different nk and pk, where nk and pk each range over 3, 5, 7, and 10. The experimental results are shown in Table 14 and Figures 15-17.
The Y-axis and X-axis of Figures 15-17 represent a performance index and its corresponding value, respectively. From the experimental results, it can be seen that the prediction performance of Tsbagging does not change significantly across different nk and pk, and it remains better than that of the other methods. Therefore, it is reasonable for Tsbagging to take any of the values 3, 5, 7, and 10 for nk and pk.
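A sensitivity study of this shape can be reproduced with a simple grid loop. In the sketch below, the callable passed to `sensitivity_grid` is a hypothetical stand-in for a full Tsbagging training-and-evaluation run; its near-constant return value merely mirrors the observation that performance barely changes with nk and pk:

```python
from itertools import product

def sensitivity_grid(evaluate, values=(3, 5, 7, 10)):
    """Evaluate every (nk, pk) combination from the parameter study and
    report the spread (max - min) of the resulting scores."""
    scores = {(nk, pk): evaluate(nk, pk) for nk, pk in product(values, values)}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread

# stand-in evaluation: the score varies only slightly with nk and pk
scores, spread = sensitivity_grid(lambda nk, pk: 0.70 + 0.001 * (nk - pk))
print(len(scores))  # 16 combinations
```

A small spread across all 16 combinations is what supports the conclusion that Tsbagging is not sensitive to the choice of nk and pk.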

Internal Validity.
We compared our method with other SDP methods. To avoid potential faults as much as possible during the implementation of the experiment, we implemented these models based on the Java machine learning library Weka.

External Validity.
External validity refers to the degree to which research results can be generalized to other situations. The promise dataset, the most commonly used dataset in cross-project defect prediction research, is selected as the experimental data to ensure that the experimental results are representative. At the same time, the prediction performance is evaluated in terms of four well-known performance measures (i.e., F1-measure, g-measure, balance, and MCC) to ensure the generality of the experiment.

Statistical Validity.
The parameter settings of each algorithm are as follows:

Algorithm     | Parameters
--------------|-----------
Adaboost      | The number of iterations is 20, and the base classifier is NB
NN            | The number of K-nearest neighbors is 10, and NB is used as the base classifier
DBSCAN filter | The number of minimum records is set to 10, and the distance is set to 10
VCB + SVM     | The number of iterations is 5
Tsbagging     | ϵ = 2, the number of clustering iterations is 100, pk = 5, nk = 5, classifiersize = 20, and the base classifier is NB
DTB           | The number of iterations is 20, and the base classifier is NB

In this paper, we use the Wilcoxon rank sum test to statistically analyze the experimental results. The Wilcoxon rank sum test is a nonparametric test: it places no requirements on the distribution of the dataset, and it is convenient for testing whether two groups of results differ in distribution. The results of the Wilcoxon rank sum test show that the p value on all datasets is less than 0.05. Therefore, it can be said that our experiment is statistically significant. In addition, we use box plots to analyze the experimental results. A box plot shows the overall distribution of the prediction results, and the results also show that Tsbagging achieves better defect prediction performance than the other methods.
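As an illustration of the significance test used here, the two-sided Wilcoxon rank-sum test can be sketched via its normal approximation. The sketch assumes no tied values, and the per-dataset F1 scores below are hypothetical, not values from the paper:

```python
import math
from statistics import NormalDist

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Assumes no tied values; returns the p value."""
    n1, n2 = len(a), len(b)
    pooled = sorted(a + b)
    w = sum(pooled.index(v) + 1 for v in a)          # rank sum of sample a
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))

# hypothetical per-dataset F1 scores for two methods
method_a = [0.71, 0.68, 0.74, 0.69, 0.72, 0.705, 0.73, 0.67, 0.75, 0.695]
method_b = [0.62, 0.60, 0.65, 0.58, 0.63, 0.61, 0.64, 0.59, 0.66, 0.605]
p = rank_sum_test(method_a, method_b)
print(p < 0.05)  # the difference is statistically significant
```

In practice the same test is available as `scipy.stats.ranksums`; the point of the sketch is that a p value below 0.05 lets us reject the hypothesis that the two methods' scores come from the same distribution.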

Conclusion and Future Work
This paper proposes a CPDP method, Tsbagging, based on semisupervised clustering. Tsbagging combines semisupervised clustering with transfer learning to overcome the shortcomings of traditional SDP methods in CPDP. The experimental results show that Tsbagging has advantages in prediction performance compared with other defect prediction methods. In future work, we will try to integrate information from multiple source projects for knowledge transfer based on a multisource transfer learning framework, so as to help the classifier achieve better performance in CPDP.