Combining Imbalance Learning Strategy and Multiclassifier Estimator for Bug Report Classification

Since a large number of bug reports are submitted to the bug repository every day, efficiently assigning bug reports to the correct developer is a considerable challenge. Because of the large differences between the components of different projects, current bug classification mainly relies on the component field of the bug report to dispatch bug reports to a designated developer or developer community. Unfortunately, the component information of a bug report is filled in by the bug submitter, often left at its default value, and is therefore frequently incorrect. Thus, an automatic technique that can identify high-impact bug reports can help developers become aware of them early, rectify them quickly, and minimize the damage they cause. In this paper, we propose a method that combines imbalanced learning strategies, namely, random undersampling (RUS), random oversampling (ROS), the synthetic minority oversampling technique (SMOTE), and the AdaCost algorithm, with the multiclass classification methods OVO and OVA to solve the bug reports component classification problem. We investigate the effectiveness of different combinations, i.e., variants, each of which includes a specific imbalance learning strategy and a specific classification algorithm. We perform an analytical study on five open bug repositories (Eclipse, Mozilla, GCC, OpenOffice, and NetBeans). The results show that different variants perform differently for bug reports component identification and that the best-performing variant combines the imbalanced learning strategy RUS with the OVA method based on the SVM classifier.


Introduction
There are many studies on bug report prediction that have been performed to help reduce software quality issues [1][2][3][4][5]. Ensuring software quality requires a great deal of effort in the testing and debugging process. However, in many cases, developers' resources and time are limited, so many bugs accumulate unfixed in the bug repository.
Anvik et al. reported their personal communication with a Mozilla triager who explained, "everyday, almost 300 bugs appear that need triaging, which is far too much for only the Mozilla programmers to handle" [6]. Therefore, it is especially important to find an effective method for improving the efficiency of bug allocation and resolution. Many developer recommendation methods have been proposed to solve the bug classification problem by recommending suitable developers for bug reports, thereby improving the efficiency of bug fixes. Xie et al. proposed a developer recommendation method based on a topic model, using historical bug-solving records to build topic models, modeling developers' interest and expertise in bug-solving activities, and providing a helpful developer recommendation list for new bug reports [7]. On this basis, Xia et al. proposed a bug report-based analysis and a developer-based analysis that recommend a suitable developer list for new bug reports by calculating relevance scores [8]. Many researchers have proposed bug prediction techniques to prioritize software testing and debugging; these techniques can identify flawed components for developers, and considerable research has been conducted on defect prediction. Such techniques build prediction models based on features such as lines of code, code complexity, and the number of modified files [9][10][11]. Yang et al. proposed using deep learning techniques to predict changes in bug reports, extracting a set of expressive features from the initial features through deep belief network algorithms and constructing machine learning classifiers based on these features [12].
However, bug classification still has many problems and faces many challenges. Large-scale and low-quality bug data in bug repositories can hinder the usage of automatic bug classification techniques. Since software bug data are free-form text data, it is necessary to generate well-processed bug data to facilitate their application [13][14][15]. Tamrawi et al. proposed a caching model based on fuzzy sets and on developers' repair expertise. The fuzzy set is defined over the technical terms related to the bug report activities in which developers have previously participated, and developers are ranked by calculating a score between these technical terms and the bug report [16]. Alenezi et al. proposed an efficient bug classification method based on a term selection method and a naive Bayesian classifier to construct a prediction model. This method improves the efficiency of bug classification by reducing the dimensionality of the terms [17]. Some researchers have proposed automatic bug-dispatching techniques, such as the support-vector machine (SVM) [18], the k-nearest neighbor algorithm (KNN) [19], and naive Bayes multinomial (NBM) [20], to ensure that bug reports are assigned to the appropriate developers and to improve the accuracy of bug allocation. Xia et al. used different combinations of imbalanced learning strategies and text classification algorithms to identify high-impact bug reports, addressing the class imbalance problem in bug reports through imbalanced data processing.
Because of the large differences in component categories across different open source projects in the bug repository, it is clear from our analysis that managers rely on the component categories of bug reports to dispatch bug reports to a specific developer or developer group. The multiclass classification method OVO does not require retraining all classifiers when samples are added; only the classifiers associated with the added samples must be retrained. The multiclass classification method OVA only needs to train as many classifiers as there are classes, and its training time is relatively short. In this paper, we propose an automatic bug reports component classification method that combines the imbalanced learning strategies random undersampling (RUS), random oversampling (ROS), the synthetic minority oversampling technique (SMOTE), and the AdaCost algorithm with the multiclass classification methods OVO and OVA to solve the automatic bug reports component classification problem. According to the mechanism of bug report classification, each bug report component is assigned to one or several specific developers. We recommend the appropriate developers, implementing bug report classification through bug component classification. Since different classification algorithms have different sensitivities to different categories, we explore the effectiveness of different variants to find the optimal variant [21]; each variant contains an imbalanced learning strategy (ILS) and a classification method. The remainder of this paper is organized as follows. The background knowledge and motivations are discussed in Section 2. The design of our approach is discussed in Section 3.
The experimental design and results are presented in Section 4, and the conclusions are discussed in Section 5.

Background Knowledge.
Since the number of bugs submitted daily is large and a human triager has difficulty grasping all the knowledge needed to assign them properly [22][23][24][25], manually classifying bugs is time consuming and error prone. Existing work uses text-based classification methods to assist in bug classification, for example, [26][27][28][29], and to help prevent misclassification when recommending the correct developer. In such an approach, the summary and description of the bug report are extracted as textual content, and the developer who can fix the bug is used as the classification label. Then, the appropriate developer is predicted for each new bug report. Since the number of bug reports submitted to the bug repository is very large, during the bug classification process, developers resolve as many high-impact and high-severity bug reports as possible. Severity has thus become a key factor in determining the priority of bug fixes, and a number of methods for predicting bug report severity labels have been proposed.
Xuan et al. used a modified REP algorithm and the K-nearest neighbor algorithm to predict the severity of bug reports and recommend fixers [30]. This method uses a topic model to find the topic to which each bug belongs, introduces topics to enhance the similarity function REP, and uses a K-nearest neighbor algorithm to search for historical bug reports similar to the new bug report. Fixer recommendation is then realized based on features extracted from the nearest neighbors of the new bug for severity prediction. Zhang et al. proposed a new automated method, SEVERIS, which helps test engineers assign severity levels to bug reports [31]. Menzies and Marcus compared text classification algorithms such as naive Bayes, naive Bayes multinomial, K-nearest neighbor, and support-vector machine to determine which algorithm is most suitable for bug report severity level prediction. The results show that the naive Bayes multinomial algorithm has the highest classification accuracy [32]. Lamkanfi et al. proposed a new method utilizing information retrieval that analyzes the severity labels assigned to historical bug reports. Based on the BM25 document similarity function, a severity label is predicted for the new bug report [33]. Bug reports exhibit serious class imbalance. Tian et al.
proposed a new sampling technique, CR-SMOTE, to enhance the classification of bug reports with realistically imbalanced severity distributions [34]. This method uses the RSMOTE sampling method combined with the ELM algorithm to achieve bug severity prediction. In subsequent research, Chen proposed a fuzzy integral fusion multi-RSMOTE method to address the randomness problem of RSMOTE in solving data distribution imbalance [35,36]. To address dataset imbalance, Guo et al. proposed an enhanced oversampling approach called CR-SMOTE to enhance the classification of bug reports with a realistically imbalanced severity distribution [37]. Pan et al. proposed an approach to empirically investigate the static and evolving topological properties of weighted software networks by using weighted k-core decomposition [1]. Pan et al. also explored the structural properties of the multilayer software network at the class level by progressively merging layers, where each coupling type, such as inheritance, implementation, and method call, defines a specific layer [38]. Jiang et al. proposed the ROSF method, which combines information retrieval and supervised learning to recommend top-k code snippets for a given free-form query in two stages [39,40]. On this basis, Pan et al. proposed a novel approach to identify key class candidates in object-oriented software [41][42][43]. Chai et al. proposed an approach to cluster mashup services and determine the cluster number based on a genetic algorithm [44,45]. To reduce the time developers spend analyzing bug reports, Jiang et al. used crowdsourced data to infer and summarize the valid attributes of bug reports [14]. To improve the quality of detected bug reports, Chen et al. proposed a new framework, the test report augmentation framework (TRAF), to help developers better understand and fix bug reports [14,46,47,48].

Motivations.
We find that real data always contain noise and redundancy [49][50][51]. Noisy data can mislead data analysis techniques, while redundant data can increase the cost of data processing [52]. In the bug repository, all bug reports are written by developers in natural language. As the scale grows, low-quality bugs accumulate in the bug repository. Such large-scale and low-quality bug data may undermine the effectiveness of bug fixing [53,54]. Figure 1 shows the distribution of components in the top 10,000 bug reports for the five datasets. Because of the large number of component categories, it is difficult to achieve accurate bug report classification with current classification methods. Table 1 shows that each developer handles an average of 2-3 components. In Table 1, the second column represents how many developers, on average, can resolve each component, and the third column represents how many components each developer can resolve on average. Therefore, the rational allocation of components directly affects the classification efficiency of bug reports. The current component information is filled in by the bug reporter, is often left at its default value, and cannot be used directly. Therefore, the focus should be on the allocation of the bug reports component.
Taking the OpenOffice dataset as an example, 27 components in the dataset are assigned to 1 to 5 developers, 3 components are assigned to 6 to 10 developers, 7 components are assigned to 11 to 20 developers, 6 components are assigned to 21 to 30 developers, and 10 components are assigned to more than 30 developers. Each component is assigned to approximately 20 developers, and, on average, each developer handles three components. Therefore, we can more accurately assign an appropriate developer to a bug report by identifying the bug report component.

Methodology
In this section, we present the overall model for bug component allocation problems and detail the algorithms in the overall framework.

3.1.
Overview. Inspired by the motivation in Section 2, we propose a new method that combines an imbalance learning strategy (ILS) with a multiclass classification method for the bug reports component classification problem [34,41]. (1) Preprocessing of data: the textual content of the bug reports is extracted and converted into feature vectors. (2) Imbalanced processing of data: because of the serious category imbalance in the dataset, we use four imbalanced learning strategies, RUS, ROS, SMOTE, and AdaCost, to process the data and obtain a balanced dataset, making the classification results more accurate. (3) Multiclass classification of data: we use the multiclass classification methods OVO and OVA on the balanced dataset to classify the bug component and solve the bug component classification problem. Since different classification algorithms are sensitive to different categories, we analyze the effectiveness of different variants, that is, an imbalanced learning strategy combined with a specific classification algorithm, on the classification of bug reports components. Figure 2 shows the overall framework of our model.

Imbalanced Learning Strategy.
Currently, strategies for solving the class imbalance problem fall mainly into two directions. One direction starts from the training set and reduces the class imbalance of the dataset by changing the sample distribution of the training set. The other direction starts from the learning algorithm and modifies the algorithm to improve its efficiency when solving the class imbalance problem. Many data sampling techniques have been introduced in the literature [30,41,55]. In our study, we choose the RUS, ROS, and SMOTE methods, which are based on changing the dataset, and the AdaCost algorithm, which is based on cost sensitivity.

Random Undersampling Method.
The random undersampling (RUS) method directly undersamples the majority samples in the training set; that is, it removes some samples of the majority class so that the numbers of positive and negative examples are close, and then learns.
That is, some samples are randomly selected from the majority class S_maj to form a sample set E, and then the sample set E is removed from S_maj to obtain a new majority class set S_new = S_maj − E.
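As an illustration, the RUS step above can be sketched as follows; the function name and the toy dataset are our own, and any classifier could then be trained on the balanced output.

```python
# Random undersampling (RUS) sketch: randomly remove majority-class
# samples (the set E) so that S_new = S_maj - E matches the minority size.
import numpy as np

def random_undersample(X, y, majority_label, rng=None):
    rng = np.random.default_rng(rng)
    maj_idx = np.flatnonzero(y == majority_label)
    min_count = np.min(np.bincount(y))           # size of the smallest class
    keep = rng.choice(maj_idx, size=min_count, replace=False)
    rest = np.flatnonzero(y != majority_label)   # keep all non-majority rows
    idx = np.sort(np.concatenate([keep, rest]))
    return X[idx], y[idx]

# Toy data: 7 majority (label 0) vs. 3 minority (label 1) samples.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_new, y_new = random_undersample(X, y, majority_label=0, rng=42)
```

After balancing, both classes contain the same number of samples.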

Random Oversampling Method.
The random oversampling (ROS) method directly oversamples the minority samples in the training set; that is, it adds minority samples so that the numbers of positive and negative examples are close, and then learns.
That is, some samples are randomly selected from the minority class S_min, the sample set E is generated by copying the selected samples, and E is added to S_min, expanding the original dataset to obtain a new minority class set S_new = S_min + E.
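The ROS step above can be sketched in the same way; again the function name and toy data are our own.

```python
# Random oversampling (ROS) sketch: duplicate randomly chosen minority
# samples (the set E) so that S_new = S_min + E matches the majority size.
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_count = np.max(np.bincount(y))           # size of the largest class
    extra = rng.choice(min_idx, size=maj_count - len(min_idx), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Toy data: 6 majority (label 0) vs. 2 minority (label 1) samples.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_new, y_new = random_oversample(X, y, minority_label=1, rng=42)
```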

Synthetic Minority Oversampling Technique.
The basic idea of the SMOTE method is, for each minority class sample x_i, to randomly select a sample x̂_i from its nearest neighbors (x̂_i is also a minority class sample) and then randomly select a point on the line segment between x_i and x̂_i as a newly synthesized minority class sample. The specific procedure for synthesizing new minority samples with the SMOTE method is as follows: (i) For each sample x_i in the minority class, the Euclidean distance from x_i to every sample in the minority class set S_min is calculated, and its k-nearest neighbors are obtained. (ii) A sampling ratio is set according to the sample imbalance ratio to determine the sampling magnification N. For each minority sample x_i, several samples are selected randomly from its k-nearest neighbors; assuming that the selected neighbor is x̂_i, a new sample is constructed with x_i according to the following formula: x_new = x_i + rand(0, 1) × (x̂_i − x_i).
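The synthesis formula above can be sketched as follows; the function name, the neighbor count k, and the toy minority set are our own choices for illustration.

```python
# SMOTE sketch: for a random minority sample x_i, pick one of its k
# nearest minority neighbors x_hat and synthesize
#   x_new = x_i + delta * (x_hat - x_i),  delta ~ U(0, 1).
import numpy as np

def smote_samples(S_min, k=2, n_new=5, rng=None):
    rng = np.random.default_rng(rng)
    S_min = np.asarray(S_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(S_min))
        x_i = S_min[i]
        d = np.linalg.norm(S_min - x_i, axis=1)   # Euclidean distances
        neighbors = np.argsort(d)[1:k + 1]        # skip x_i itself
        x_hat = S_min[rng.choice(neighbors)]
        delta = rng.random()                      # random point on the segment
        synthetic.append(x_i + delta * (x_hat - x_i))
    return np.array(synthetic)

# Four minority samples at the corners of the unit square.
S_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new = smote_samples(S_min, k=2, n_new=5, rng=0)
```

Every synthetic sample lies on a segment between two existing minority samples, so here each coordinate stays within [0, 1].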

AdaCost Algorithm.
The AdaCost algorithm learns a classifier by iteration and updates the sample weights according to the performance of the current classifier. The weight update strategy greatly increases the weights of costly misclassified samples and only slightly decreases the weights of costly correctly classified samples; the overall idea is that high-cost samples retain high weights. The sample weights are updated according to the following formulas [56]: β+ = −0.5C_i + 0.5, where β+ is the value of β when the sample is correctly classified, and β− = 0.5C_i + 0.5, where β− is the value of β when the sample is misclassified.
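A minimal sketch of one AdaCost weight-update round, using the β functions above; the cost vector, predictions, and α here are illustrative placeholders, not values from our experiments.

```python
# AdaCost weight update (one boosting round): misclassified samples are
# scaled by exp(+alpha * beta_minus), correctly classified ones by
# exp(-alpha * beta_plus), with beta_plus = -0.5*C + 0.5 and
# beta_minus = 0.5*C + 0.5 as in the text.
import numpy as np

def adacost_update(w, y_true, y_pred, cost, alpha):
    correct = (y_true == y_pred)
    beta = np.where(correct, -0.5 * cost + 0.5, 0.5 * cost + 0.5)
    sign = np.where(correct, -1.0, 1.0)
    w_new = w * np.exp(sign * alpha * beta)
    return w_new / w_new.sum()                   # renormalize to sum to 1

w = np.full(4, 0.25)                             # uniform initial weights
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, -1, -1, 1])                # samples 1 and 3 misclassified
w_new = adacost_update(w, y_true, y_pred, cost=np.full(4, 0.5), alpha=1.0)
```

After the update, the misclassified samples carry more weight than the correctly classified ones, so the next weak learner focuses on them.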

Multiclass Classification Method

OVO Method.
Assuming there are m categories, the OVO method creates a binary classifier for every pair of categories, obtaining k = m × (m − 1)/2 classifiers. When classifying new data, the k classifiers are applied in turn, and each classification result is equivalent to one vote for the predicted class. After all k classifiers have been applied, which is equivalent to k rounds of voting, the class with the most votes is selected as the final classification result. The following is a description of the structure of the algorithm.
In line 1, y is initialized to null. Lines 2-4 indicate that a classifier is designed between every two classes, so m × (m − 1)/2 classifiers need to be designed. Lines 5-7 indicate that the classification results are obtained. Lines 8-13 represent the voting strategy: if the sample is assigned to a class, one vote is added for that class. Line 14 indicates that the class with the most votes is taken as the class of the unknown sample. Line 17 indicates that the class with the most votes is returned.
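The pairwise voting scheme described above can be sketched as follows; `fit_binary` is a hypothetical stand-in for any pairwise base learner (SVM, KNN, NBM), implemented here as a trivial nearest-centroid rule to keep the sketch self-contained.

```python
import numpy as np
from itertools import combinations

def fit_binary(X, y):
    """Trivial pairwise learner: predict the class with the nearer centroid."""
    c0, c1 = np.unique(y)
    m0, m1 = X[y == c0].mean(axis=0), X[y == c1].mean(axis=0)
    return lambda x: c0 if np.linalg.norm(x - m0) <= np.linalg.norm(x - m1) else c1

def ovo_predict(X, y, x_new):
    classes = np.unique(y)
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):        # m*(m-1)/2 pairwise classifiers
        mask = (y == i) | (y == j)
        clf = fit_binary(X[mask], y[mask])
        votes[clf(x_new)] += 1                   # each classifier casts one vote
    return max(votes, key=votes.get)             # class with the most votes wins

# Three toy classes; the query point lies nearest to class 1.
X = np.array([[0, 0], [0, 1], [5, 0], [5, 1], [0, 5], [1, 5]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
label = ovo_predict(X, y, np.array([4.5, 0.5]))
```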

OVA Method.
Assuming there are n categories, the OVA method establishes n binary classifiers; each classifier separates one of the categories from all the remaining categories. When making predictions, we use the n binary classifiers to classify the data, obtain the probability that the data belong to each class, and select the category with the highest probability as the final prediction result. The following is a description of the structure of the algorithm.
In line 1, y is initialized to null. Lines 2-4 indicate that there are n groups of classifications, i.e., n classifiers. Lines 5-11 indicate that each group's classification result h(i)θ(x) is obtained and the value with the highest probability is selected as the prediction result. Line 12 indicates that the class with the maximum classification value is returned.

Therefore, we need to reduce the datasets. We read the first 1,000 rows of data in the five open source projects; Table 3 shows the word frequencies in the five datasets and the total number of bug reports in the original datasets. To more accurately identify the category of bug reports, we removed the bug report columns with word frequencies less than 5 and the bug reports belonging to categories with fewer than 50 reports. Table 4 shows the categories of the processed datasets and the number of columns after word frequency reduction.
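A minimal OVA sketch under the same idea; the per-class score is a hypothetical negative distance to the class centroid, standing in for the probability output h(i)θ(x) of an SVM, KNN, or NBM classifier.

```python
import numpy as np

def ova_predict(X, y, x_new):
    scores = []
    classes = np.unique(y)
    for c in classes:                            # n classifiers for n classes
        centroid = X[y == c].mean(axis=0)        # "one class vs. the rest" score
        scores.append(-np.linalg.norm(x_new - centroid))
    return classes[int(np.argmax(scores))]       # highest-scoring class wins

# Three toy classes; the query point lies nearest to class 1.
X = np.array([[0, 0], [0, 1], [5, 0], [5, 1], [0, 5], [1, 5]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
label = ova_predict(X, y, np.array([4.5, 0.5]))
```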

Accuracy.
Accuracy is the number of correctly classified samples divided by the total number of samples. Generally, the higher the accuracy, the better the classifier. We formally define the accuracy as follows: Accuracy = (TP + TN)/(TP + TN + FP + FN).

Precision.
Precision is the proportion of samples predicted as positive that are actually positive. We formally define the precision as follows: Precision = TP/(TP + FP).

F-Measure. F-measure is also known as F-score.
When the precision (P) and recall (R) indicators conflict, an indicator that considers both is needed. The F-measure is the weighted harmonic mean of precision and recall: F1 = 2 × P × R/(P + R).
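The evaluation measures above can be computed together; this is a minimal sketch treating one class as positive, with the function name and toy labels our own.

```python
# Accuracy over all predictions, plus precision, recall, and F-measure
# for a chosen positive class.
def evaluate(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

acc, p, r, f = evaluate([1, 1, 0, 0], [1, 0, 1, 0], positive=1)
```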

Experimental Results
In this section, the experimental results are discussed in relation to the specific research questions. From Tables 5-9, it can be seen that the OVA multiclass classification method based on the SVM classifier yields the most obvious improvement for the five datasets. The experimental results of the OVA method based on the SVM classifier improved by 0.5928, 0.6015, 0.5928, and 0.5787, respectively, for the Mozilla dataset; by 0.6539, 0.6685, 0.6539, and 0.6153, respectively, for the GCC dataset; by 0.7892, 0.8033, 0.7892, and 0.7781, respectively, for the Eclipse dataset; by 0.5205, 0.5321, 0.5205, and 0.4964, respectively, for the OpenOffice dataset; and by 0.6160, 0.6260, 0.6160, and 0.6051, respectively, for the NetBeans dataset. Therefore, for the Mozilla, GCC, NetBeans, OpenOffice, and Eclipse datasets, the OVA method based on the SVM classifier is more efficient in solving the bug reports component classification problem.

RQ2: What Is the Impact of Imbalanced Learning Strategies on the Multiclass Classification OVO Method in Solving Bug Reports Component Allocation Problems?

Specifically, this question explores whether an imbalanced learning strategy has an impact on the OVO classification method. To answer this question, we use the imbalanced learning strategies RUS, ROS, SMOTE, and AdaCost to process the Mozilla, GCC, Eclipse, OpenOffice, and NetBeans datasets and then use the multiclass classification method OVO based on the SVM, KNN, and NBM classifiers to solve the bug reports component classification problem. For the Mozilla dataset (Table 10), the corresponding improvements were 0.5392, 0.6477, 0.5392, and 0.5340, respectively. From Table 11, for the GCC dataset, the combination of the ROS and OVO method based on the SVM classifier had the largest improvement, increasing by 0.9016, 0.9035, 0.9016, and 0.9009, respectively. The combination of the ROS and OVO method based on the NBM classifier also improved greatly compared with other combinations, increasing by 0.8601, 0.8589, 0.8601, and 0.8589, respectively. The combination of the RUS and OVO method based on the KNN classifier had the lowest improvement, 0.1623, 0.3543, 0.1623, and 0.0903, respectively. The combination of the AdaCost and OVO method based on the SVM classifier had a higher improvement than the combination of the SMOTE and OVO method, increasing by 0.7167, 0.7127, 0.7167, and 0.6820, respectively. The combination of the SMOTE and OVO method based on the NBM classifier was more efficient than the combination of the SMOTE and OVO method based on the SVM classifier, with improvements of 0.6271, 0.6244, 0.6271, and 0.6220, respectively. From Table 12, for the Eclipse dataset, the combination of the ROS and OVO method based on the SVM classifier had the greatest improvement, increasing by 0.9288, 0.9415, 0.9288, and 0.9316, respectively. The combination of the ROS and OVO method based on the NBM classifier also improved greatly compared with other combinations, increasing by 0.9170, 0.9181, 0.9170, and 0.9154, respectively.
The combination of the RUS and OVO method based on the KNN classifier had the lowest improvement, 0.1713, 0.4548, 0.1713, and 0.1713, respectively. The combination of the SMOTE and OVO method based on the NBM classifier had a higher improvement than the combinations of the SMOTE and AdaCost with the OVO method based on the SVM classifier, increasing by 0.7144, 0.7288, 0.7144, and 0.7147, respectively. From Table 13, for the OpenOffice dataset, the combination of the ROS and OVO method based on the SVM classifier had the greatest improvement, increasing by 0.9204, 0.9346, 0.9204, and 0.9225, respectively. The combination of the ROS and OVO method based on the NBM classifier also improved greatly compared with other combinations, increasing by 0.8489, 0.8496, 0.8489, and 0.8437, respectively. The combination of the RUS and OVO method based on the KNN classifier had the lowest improvement, 0.1052, 0.1896, 0.1052, and 0.0958, respectively.

RQ3: How Much Improvement Does the Classification of Bug Reports Component Have in Combination with the Imbalanced Learning Strategies and the OVA Method?
To answer this question, we use the imbalanced learning strategies RUS, ROS, SMOTE, and AdaCost to process the Mozilla, GCC, Eclipse, OpenOffice, and NetBeans datasets and then use the multiclass classification method OVA based on the SVM, KNN, and NBM classifiers to solve the problem of bug reports component classification. We combine the ILS with the OVA method based on the SVM, KNN, and NBM classifiers, pairing an imbalanced learning strategy with a classification method for the five datasets. Then, we use accuracy, precision, recall, and F1 as evaluation criteria, and we build the classifiers with reference to the combinations of RQ2. Tables 15-19 show the results of our experiments using the ILS combined with the OVA method for the Mozilla, GCC, NetBeans, OpenOffice, and Eclipse datasets. The combination of the SMOTE and OVA method based on the SVM classifier had a higher improvement than the combination of the SMOTE and OVA method based on the NBM classifier, and the combination of the AdaCost and OVA method based on the SVM classifier increased by 0.4414, 0.4764, 0.4414, and 0.4305, respectively. From Table 19, for the NetBeans dataset, the combination of the ROS and OVA method based on the SVM classifier had the greatest improvement, increasing by 0.9197, 0.9178, 0.9197, and 0.9172, respectively. The combination of the ROS and OVA method based on the NBM classifier also improved greatly compared with other combinations, increasing by 0.7381, 0.7463, 0.7381, and 0.7350, respectively. The combination of the RUS and OVA method based on the KNN classifier had the lowest improvement, 0.1476, 0.1810, 0.1476, and 0.1170, respectively. The combination of the AdaCost and OVA method based on the SVM classifier had the highest accuracy, precision, and recall values compared with the combinations of the SMOTE and OVA method based on the SVM and NBM classifiers, which increased by 0.5346, 0.5869, and 0.5346, respectively.
Therefore, the combination of the imbalanced learning strategy RUS and the OVA method based on the SVM and NBM classifiers had higher efficiency in solving the bug reports component classification problem compared with other combinations for the Mozilla, GCC, NetBeans, OpenOffice, and Eclipse datasets.

Conclusion
In this article, we proposed a new method that combines imbalanced learning technologies with multiclass classification methods to solve the bug reports component classification problem. We use four imbalanced processing strategies, RUS, ROS, SMOTE, and AdaCost, to process the data and obtain a balanced dataset. Then, we use the multiclass classification methods OVO and OVA, based on the NBM, KNN, and SVM classifiers, on the balanced dataset to classify the bug reports component and solve the bug reports component classification problem. We explored the optimal performance of bug reports component classification across different combinations of imbalanced learning strategies and classification algorithms. We can better solve the bug reports classification problem by using bug component classification to determine the appropriate developer for a bug report. In our work, we not only reduce the word dimension of the original training set, which improves the quality of the training set, but also improve the classification performance for bug reports.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.