Investigating Tree Family Machine Learning Techniques for a Predictive System to Unveil Software Defects



Introduction
Software engineering (SE) is a discipline concerned with all aspects of software development, from the early stage of software specification through to maintenance after the software has gone into use [1]. Within SE, software defect prediction (SDP) is one of the most significant and active research areas and plays an important role in software quality assurance (SQA) [2,3]. A software defect (SD) is a flaw or deficiency in a software system that causes it to produce an unexpected result. The rising complexity and dependencies of software systems have made it harder to deliver software with low cost, high quality, and maintainability, which increases the chance of introducing software defects (SDs) [4]. Generally, SDs are found in the testing stage of the Software Development Life Cycle (SDLC) [5]. An SD can also be the situation in which the final software product does not meet the client's expectations or requirements [6], which reduces the quality of the software product and increases the development cost.
SDP is an important activity for assuring the quality of a software system; it helps keep development cost adequate and improves quality by identifying defect-prone instances before testing [4]. It also involves classifying software components in new versions of a software system, which makes the testing process more efficient by concentrating testing and evaluation on the components classified as defective [7]. Defects adversely affect software reliability and quality [8].
SDP in the early phase of the SDLC is regarded as the most challenging aspect of SQA [9]. In SE, bug fixing and testing are very costly and require a massive amount of resources. Forecasting defects in software development has been examined by numerous studies over the last decades. Among all these studies, machine learning (ML) techniques are considered the best approach to SDP [7,10,11].
Given the above issues, various researchers have built and evaluated SDP models using diverse classification techniques. Still, it is quite challenging to draw any general conclusion about the usability of these techniques. Overall, it has been found that, despite some differences among the studies, no single SDP technique delivers higher accuracy than the other techniques across different datasets. Most researchers have used different evaluation measures to achieve a higher accuracy, but to the best of our knowledge, no one has worked on reducing the error rate, which is also an important factor for any prediction model [12,13].
A question may be raised about the reason for selecting tree family machine learning (TF-ML) techniques. The motive is that tree-based techniques are considered to be among the best and most commonly used supervised learning methods. They give predictive models ease of interpretation, stability, and high accuracy [14]. Unlike linear models, TF-ML techniques map nonlinear relationships quite well. They are flexible enough to solve several kinds of problems (regression or classification). These techniques also work for both categorical and continuous input and output variables [15,16]. Tree-based methods are one of the fastest ways to identify the most significant variables and the relations between two or more variables. They can create new features that have better power to predict the target variable. They require less data cleaning than several other modeling techniques and are, to a fair degree, not influenced by outliers and missing values [17][18][19].
Hereinafter, Section 2 presents the literature survey, Section 3 covers the methodology and techniques, experimental outcomes are discussed in Section 4, and Section 6 gives the overall conclusion.

Literature Survey
This section gives a brief overview of existing techniques in the field of SDP. Several researchers have employed ML techniques for SDP at the initial phase of software development, and some of these studies are discussed here. Czibula et al. [11] presented a model based on Relational Association Discovery (RAD) for SDP. They carried out all their experiments on NASA datasets including MC2, KC1, KC3, JM1, MW1, PC3, PC4, CM1, PC1, and PC2. The model was compared with other models using accuracy, probability of detection (PD), specificity, precision, and area under the ROC curve as assessment measures, and the outcomes showed that RAD performed better than the other employed techniques.
Li et al. [20] proposed a framework for SDP named Defect Prediction through Convolutional Neural Network (DP-CNN). They evaluated DP-CNN on seven open-source projects, namely, Camel, jEdit, Lucene, Xalan, Xerces, Synapse, and POI, in terms of F-measure (FM) in defect prediction. Overall outcomes illustrate that, on average, DP-CNN improved on the state-of-the-art method by 12%. Jacob and Raju [21] introduced a hybrid feature selection (HFS) method for SDP. They also performed their analysis on NASA datasets including PC1, PC2, PC3, PC4, CM1, JM1, KC3, and MW1. The outcomes of HFS are benchmarked against Naïve Bayes (NB), neural networks (NNs), Random Forest (RF), Random Tree (RT), and J48. Benchmarking is carried out using sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC). The analysis shows that HFS outperforms the other techniques, improving classification accuracy from 82% to 98%.
Bashir et al. [22] presented a combined framework to improve the SDP model using data sampling (DS), ranker feature selection (FS) techniques, and iterative partition filter (IPF) to overcome class imbalance, high dimensionality, and noise, respectively. Seven ML techniques including NB, RF, KNN, MLP, SVM, J48, and decision stump are employed on CM1, JM1, KC2, MC1, PC1, and PC5 datasets for evaluation. The outcomes are assessed using receiver operating characteristic (ROC) performance evaluation. Overall, the proposed model outperformed the other models in the experiments. In another study [7], the author proposed a new approach for SDP utilizing hybridized gradual relational association rules and artificial neural network (HyGRAR) to classify defective and nondefective objects. Experiments were performed on ten different open-source datasets, namely, JEdit 4.0, JEdit 4.2, JEdit 4.3, Ant 1.7, Tomcat 6.0, AR1, AR3, AR4, AR5, and AR6. For model evaluation, accuracy, sensitivity, specificity, and precision measures were utilized. The author concluded that HyGRAR achieved better outcomes than most of the previously proposed approaches.
Alsaeedi and Khan [8] compared supervised learning techniques, including Bagging, SVM, Decision Tree (DT), and RF, and ensemble classifiers on different NASA datasets, namely, KC2, KC3, PC1, PC3, PC4, PC5, JM1, CM1, MC1, and MC2. The base learners and ensemble classifiers are evaluated using G-measure (GM), specificity, F-score, recall, precision, and accuracy. The experimental results showed that RF, AdaBoost with RF, and DS with bagging outperformed the other employed techniques.
A comparative study of several ML techniques for SDP is performed in [9] on twelve NASA datasets, namely, PC1, PC2, PC3, PC4, PC5, MC1, MC2, JM1, CM1, MW1, KC1, and KC3, while the classification techniques include One Rule (OneR), NB, MLP, DT, RBF, kStar (K*), SVM, KNN, PART, and RF. The performance of each technique is evaluated via MCC, ROC area, recall, precision, FM, and accuracy. The outcomes suggest that neither accuracy nor ROC area can be used as an effective performance measure, as neither of them responded to the class imbalance problem.
Malhotra and Kamal [6] evaluated the efficiency of ML classifiers for SDP on twelve NASA datasets by employing sampling methods and cost-sensitive classifiers. They examined five existing methods, namely, J48, RF, NB, AdaBoost, and Bagging, as well as the SPIDER3 method for SDP. They compared the performance based on accuracy, sensitivity, specificity, and precision. The outcomes show an improvement in the predictive capability of ML classifiers with the use of oversampling methods. Moreover, the proposed SPIDER3 method shows promising results.
Manjula and Florence [23] developed a hybrid model based on the genetic algorithm (GA) and deep neural network (DNN). GA is used for feature optimization, while DNN is used for classification. The performance of the proposed technique is compared with NB, RF, DT, Immunos, ANN-artificial bee colony (ABC), SVM, majority vote, AntMiner+, and KNN. All the experiments are carried out on datasets that include KC1, KC2, CM1, PC1, and JM1 and assessed via recall, F-score, sensitivity, precision, specificity, and accuracy. The experimental outcomes showed that the proposed technique outperformed the other techniques in terms of accuracy.
Researchers have used various techniques to overcome the limitations of SDP on a variety of datasets. In each study, different evaluation measures are used to evaluate the proposed techniques. The overall summary of the literature discussed above is given in Table 1. As shown in Table 1, each study has used different evaluation measures to achieve a higher accuracy, but no one has made an effort to decrease the error rate, which is a significant factor.

Datasets Description.
Each dataset consists of a number of attributes along with a known output class. The datasets contain numerical data, while the total numbers of attributes and instances differ, as presented in Table 2, where the first column lists the datasets and the second and third columns give the number of attributes and the number of instances, respectively. The fourth and fifth columns give the number of defective modules and the number of nondefective modules, while the last column shows the type of data in each dataset. Table 3 lists all attributes (software metrics) for each dataset utilized in this research, where "-" means that the attribute is not part of the dataset while "Y" indicates the presence of the attribute in the dataset.

Performance Measurement Parameters.
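This section describes how each performance measurement parameter is calculated, with a short description. In the equations, $|y_i - y|$ is the absolute error, $n$ is the number of errors, $T_j$ is the target value for record $j$, $P_{ij}$ is the value predicted by the specific model $i$ for record $j$ (out of $n$ records), $\bar{T}$ is the mean of the target values, TP is the number of true-positive classifications, FN is the number of false-negative classifications, TN is the number of true-negative classifications, and FP is the number of false-positive classifications.
(A) Mean absolute error (MAE) is the average of the absolute errors. It can be found as
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - y\right|.$$
(B) Root mean squared error (RMSE) is the square root of the average of the squared errors. It can be found as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y\right)^2}.$$
(C) Relative absolute error (RAE) is measured relative to a simple predictor, which is just the average of the actual values. It can be found as
$$\mathrm{RAE} = \frac{\sum_{j=1}^{n}\left|P_{ij} - T_j\right|}{\sum_{j=1}^{n}\left|T_j - \bar{T}\right|}.$$
(D) Root relative squared error (RRSE) is the square root of the relative squared error (RSE), which reduces the error to the same dimensions as the quantity being predicted. It can be found as
$$\mathrm{RRSE} = \sqrt{\frac{\sum_{j=1}^{n}\left(P_{ij} - T_j\right)^2}{\sum_{j=1}^{n}\left(T_j - \bar{T}\right)^2}}.$$
(E) Specificity (also called the true-negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy instances that are correctly identified as not having the condition). It can be calculated as
$$\mathrm{Specificity} = \frac{TN}{TN + FP}.$$
(F) Precision is the number of true-positive predictions divided by the total number of positive predictions. It is also called the positive predictive value (PPV). It can be calculated as
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
(G) Recall is the ratio of true-positive modules to the total number of actually positive modules. It can be found as
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
(H) F-measure is also called the F-score. It conveys the balance between precision and recall. It can be measured as
$$\mathrm{FM} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
(I) G-measure conveys the balance between the specificity and the recall. It can be calculated as
$$\mathrm{GM} = \frac{2 \times \mathrm{recall} \times \mathrm{specificity}}{\mathrm{recall} + \mathrm{specificity}}.$$
(J) Matthews correlation coefficient (MCC) takes all four cells of the confusion matrix into account. It can be calculated as
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
(K) Accuracy indicates how much of the forecast is correct and can be calculated as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Summarization of Employed Techniques.
ML techniques are now widely used to extract significant knowledge from massive volumes of data in diverse areas. ML applications cover numerous real-world situations such as cybersecurity, bioinformatics, detecting communities in social networks, and software process improvement to produce high-quality software systems [7]. ML-based as well as TF-ML-based solutions for SDP have also been investigated [6,10,34]. The following subsections briefly discuss the TF-ML techniques employed in this research.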

Credal Decision Tree.
Credal Decision Tree (CDT) is a technique for designing classifiers based on imprecise probabilities and uncertainty measures [18]. During the construction of a CDT, to avoid producing an overly complicated decision tree, a new criterion was introduced: stop when the total uncertainty increases due to the splitting of the decision tree. The function used to measure total uncertainty is briefly described in [14,19].

Cost-Sensitive Decision Forest.
CS-Forest uses cost-sensitive pruning as a substitute for the pruning used by C4.5. C4.5 prunes a tree if the expected number of misclassifications for future records does not increase significantly due to the pruning. CS-Forest, in contrast, prunes a tree if the expected classification cost for future records does not increase significantly due to the pruning. Moreover, unlike Cost-Sensitive Decision Tree (CS-Tree), CS-Forest allows a tree to first grow completely and then be pruned [40].

Decision Stump.
DS is utilized as a base learner to construct ensemble models. DS is an ML model consisting of a one-level decision tree. This ensemble learning performs 1000 repetitions to reach optimal performance [41]. A DS is essentially a decision tree with a single split. A stump stands in contrast to a tree that has multiple layers: it stops after the first split. DS is commonly utilized on large data. Stumps can also serve to create simple yes/no decision models for smaller datasets [39].
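To make the idea of a one-level tree concrete, the following is a minimal sketch of a decision stump on a single numeric software metric. It is an illustration only, not the implementation used in the experiments; the function names `fit_stump` and `predict_stump` and the toy lines-of-code values are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, int, int]  # (threshold, label if x <= threshold, label otherwise)


def fit_stump(xs: List[float], ys: List[int]) -> Stump:
    """Pick the single threshold split with the fewest training errors."""
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct attribute values to split")
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return t, left, right


def predict_stump(stump: Stump, x: float) -> int:
    """Apply the one and only split of the stump."""
    t, left, right = stump
    return left if x <= t else right


# Hypothetical toy data: lines of code per module and a defect flag (1 = defective).
loc = [10, 20, 35, 80, 120, 200]
defective = [0, 0, 0, 1, 1, 1]
stump = fit_stump(loc, defective)
```

Because the model has only one split, it is cheap to train, which is why stumps are a common base learner inside ensembles.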

Forest by Penalizing Attributes.
The Forest-PA technique uses bootstrap samples and penalized attributes. It aims to construct a set of highly accurate decision trees by exploiting the strength of all nonclass attributes present in a dataset, unlike certain existing techniques that use only a subset of the nonclass attributes. At the same time, to promote strong diversity, Forest-PA imposes penalties (disadvantageous weights) on the attributes that participated in the latest tree when generating the subsequent trees. Forest-PA also has a mechanism to gradually increase the weights of the attributes that have not been tested in the subsequent tree(s) [42].

Hoeffding Tree.
HT is known as a streaming decision tree induction technique. The name is derived from the Hoeffding bound that is used in tree induction. The basic idea is that the Hoeffding bound gives a certain level of confidence in the best attribute on which to split the tree, which can be the basis for building the best model [39].
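As a rough illustration of how the Hoeffding bound can guide a split decision, the sketch below uses the standard form of the bound, $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, where $R$ is the range of the split criterion, $\delta$ is the allowed error probability, and $n$ is the number of instances seen at the node. The function names and the example values are assumptions made for illustration, not settings taken from the experiments.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean lies within epsilon of the observed
    mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split once the best attribute leads the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as more instances arrive, so a node that cannot yet separate its two best attributes simply waits for more data before splitting.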

Decision Tree (J48).
This is the basic C4.5 Decision Tree (DT) used for classification problems [37]. It relies on the gain ratio, a variant of information gain (IG) that is usually used to overcome the effect of bias. The attribute with the maximum gain ratio is selected as the splitting attribute when building the tree. Gain ratio- (GR-) based DT performs well compared to IG in terms of accuracy [43].
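The difference between information gain and gain ratio can be sketched in a few lines. The code below uses the usual definitions (gain ratio = information gain divided by the split information of the attribute) for nominal attributes; the function names and the toy values are illustrative assumptions, and the handling of numeric attributes and missing values in C4.5 is left out.

```python
import math
from collections import Counter
from typing import Hashable, List


def entropy(values: List[Hashable]) -> float:
    """Shannon entropy (base 2) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(attribute: List[Hashable], labels: List[int]) -> float:
    """Reduction in class entropy after partitioning by the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(attribute: List[Hashable], labels: List[int]) -> float:
    """Information gain normalized by the split information of the attribute."""
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return information_gain(attribute, labels) / split_info


labels = [1, 1, 0, 0]
two_valued = ["high", "high", "low", "low"]   # hypothetical complexity level
id_like = ["a", "b", "c", "d"]                # one distinct value per module
```

Both toy attributes separate the classes perfectly and therefore have the same information gain, but the ID-like attribute is penalized by its larger split information, which is the bias that the gain ratio is meant to counter.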

Logistic Model Tree.
LMTs are classification trees with logistic regression functions at the leaves. This technique can deal with binary and multiclass target variables, nominal and numeric attributes, and missing values. LMT is a classification model with an associated supervised training technique. It combines decision tree learning and logistic regression. Logistic model trees use a decision tree that has linear regression models at its leaves to provide a piecewise linear regression model [39].

Random Forest.
RF covers a set of techniques that construct an ensemble, a so-called forest, of decision trees from a randomized variant of tree induction techniques [44]. RF works by building a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. It is considered one of the most powerful techniques and is highly effective for both classification and regression problems [45].
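The voting step of a forest can be illustrated with a small sketch. The three lambda functions below are hypothetical stand-ins for trained trees (a real forest would learn each tree from a bootstrap sample with random attribute subsets); only the aggregation by majority vote is shown.

```python
from collections import Counter
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], int]


def forest_predict(trees: List[Tree], x: Sequence[float]) -> int:
    """Return the class predicted by most trees (the mode of the votes).
    Ties go to the class that received its first vote earliest."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees over two metrics.
toy_trees: List[Tree] = [
    lambda x: int(x[0] > 50),
    lambda x: int(x[1] > 10),
    lambda x: int(x[0] + x[1] > 100),
]
```

With an odd number of trees and two classes, as in this sketch, a tie cannot occur.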

Random Tree.
RT is a supervised ensemble learning technique that creates numerous individual learners. It uses a bagging idea to construct a random set of data for building a tree. In a standard tree, every node is split using the best split among all variables. In the RF, each node is split using the best split among a subset of predictors randomly selected at that node [46].

REP-Tree.
REP-T uses regression tree logic and produces numerous trees in different iterations. Afterward, it chooses the best one from all created trees. REP-T constructs a decision/regression tree using entropy as an impurity measure and prunes it using reduced-error pruning. It sorts the values of numeric attributes only once [46].

Experimental Study
This section provides an experimental study for SDP employing ten ML techniques using 10-fold cross-validation, which is a standard approach for assessment [47]. 10-fold cross-validation splits the complete data into ten subsets of equal size; one subset is used for testing, while the others are used for training. This process is repeated until each subset has been used for testing.
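The splitting procedure can be sketched as follows. This is a minimal, unstratified illustration; the function name `k_fold_indices` and the fixed seed are assumptions, and the tooling used for the experiments may split the data differently (for example, with stratification).

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_instances: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists so that every instance is tested once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_instances is not a multiple of k.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(121, k=10))
```

Each technique would be trained on the `train` indices and evaluated on the `test` indices of every split, and the ten results would then be aggregated.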
The overall experiments are divided into three phases, namely, experimental scenarios 1, 2, and 3. Experimental scenario 1 covers the analysis of CCI, ICI, TPR, and FPR, while experimental scenario 2 presents the performance analysis of the absolute and squared error rates, namely, MAE, RAE, RMSE, and RRSE. Experimental scenario 3 describes the performance achieved on the accuracy measures, namely, specificity, precision, recall, F-measure, G-measure, MCC, and accuracy.

Experimental Scenario 1: Instances Classification and True-Positive and False-Positive Rates.
This section presents the experiments carried out to find the correctly classified instances (CCI) and incorrectly classified instances (ICI), as well as the true-positive rate (TPR) and false-positive rate (FPR) of each classifier on each individual dataset. Tables 4 and 5 show the CCI and ICI analyses, while Tables 6 and 7, respectively, present the TPR and FPR values of each technique on each individual dataset. In each of these tables, the first row lists the datasets utilized, while the first column lists the techniques employed. The rest of the table gives the outcomes for CCI, ICI, TPR, and FPR, respectively. The CCI results show that RF achieved the best classification on five datasets, namely, AR3, PC1, PC2, PC3, and PC4; CDT and HT do the same on three datasets, DS and REP-T on two datasets, while CS-Forest and RT each do so on only one dataset. In the case of ICI, each technique showed the same performance, as CCI and ICI are complementary to each other. Figure 2 shows the overall analysis of CCI and ICI. However, the situation changes for TPR and FPR. For TPR, RF shows the best performance on five datasets; CDT, DS, HT, and LMT each outperform on three datasets, REP-T on two datasets, while Forest-PA and J48 each do the same on only one dataset. For FPR, Forest-PA performs well on four datasets; DS, HT, and REP-T each outperform on three datasets; CDT and RF each show the best performance on two datasets, while J48 and LMT each outperform on only one dataset.
The overall analysis showed that RF performs well on CCI, ICI, and TPR, while on FPR, Forest-PA outperformed the other techniques. Figure 3 shows the overall analysis of TPR and FPR.

Experimental Scenario 2 (Error Rate Analysis and Results Discussion).
This section describes all the error rates achieved using TF-ML techniques on different datasets. Tables 8 and 9 show the absolute errors MAE and RAE, respectively. Firstly, on MAE, J48 outperformed the other techniques and achieved the best results on five datasets; RT achieved the best results on four datasets and RF on two datasets, while CDT, DS, Forest-PA, and HT each performed well on only one dataset. Secondly, on RAE, as on MAE, J48 outperformed the other techniques, achieving the best results on four datasets, RT on three datasets, and RF on two, while HT surpassed the other techniques on one dataset. Figures 4 and 5 show the error bars with standard deviation of MAE and RAE, respectively.
An error bar is a graphical representation of the variability of data and is used on graphs to indicate the uncertainty or error in a reported measurement. It gives an overall indication of how accurate a measurement is or, conversely, how far from the reported value the true (error-free) value might be. Here, these analyses show that J48 and RT perform best in reducing the absolute error rates.
Tables 10 and 11 present the analysis of the squared errors RMSE and RRSE. In each table, the first row lists the datasets, while the first column lists the employed techniques. The remaining table cells show the outcomes of each employed technique on the individual datasets. On both squared error measures, RRSE and RMSE, RF achieved better results on five datasets, namely, AR3, KC3, PC1, PC3, and PC4. LMT outperforms the other techniques on two datasets, while CDT, DS, Forest-PA, and REP-T each do the same on only one dataset. Figures 6 and 7 show the error bars with standard deviation of RMSE and RRSE. These outcomes show that RF performs best in reducing the squared error rate.
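For readers who want to reproduce the four error rates on their own predictions, the following is a minimal sketch using the usual definitions of MAE, RMSE, RAE, and RRSE, with the mean of the actual values as the simple reference predictor. The function name `error_rates` and the toy numbers are illustrative assumptions, not values from Tables 8-11.

```python
import math
from typing import Dict, List


def error_rates(predicted: List[float], actual: List[float]) -> Dict[str, float]:
    """MAE, RMSE, RAE, and RRSE for paired predictions and actual values.
    Assumes the actual values are not all identical (nonzero baseline error)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Baseline errors of a predictor that always outputs the mean actual value.
    abs_base = sum(abs(a - mean_actual) for a in actual)
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_base,
        "RRSE": math.sqrt(sq_err / sq_base),
    }


# Hypothetical predicted defect probabilities against 0/1 defect labels.
rates = error_rates([0.9, 0.2, 0.1, 0.8], [1, 0, 0, 1])
```

RAE and RRSE are ratios here; multiplying them by 100 gives percentages. Values below 1 mean the model beats the mean predictor.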

Experimental Scenario 3 (Accuracy Analysis and Results Discussion).
To measure the performance of any technique, accuracy is considered one of the most important evaluation measures. In this section, we present different accuracy-related measures, namely, specificity, precision, recall, F-measure, G-measure, MCC, and accuracy. All these measures depend on the values of the confusion matrix shown in Table 12. Prediction is possible for two classes, i.e., class 1 (positive) and class 2 (negative). Class 1 means that there is a defect in the software, while class 2 means that there is no defect in the software. Here, TP covers the cases predicted as positive that actually have the defect. FP covers the cases predicted as positive that do not actually have the defect; this is also called a type 1 error. FN covers the cases predicted as negative that actually do have the defect; this is also known as a type 2 error. TN covers the cases predicted as negative that do not have the defect.
Table 13 presents the specificity assessment of all employed techniques on the various datasets. In this table, column one lists the techniques, while the rest of the columns give the specificity achieved on each individual dataset. In place of some values, there is the message "#DIV/0!," which is due to a "0" value in the confusion matrix: when an equation requires a division and the divisor is "0," the result is undefined and this message appears. In the tables, we use "?" instead of "#DIV/0!." The analysis of Table 13 shows that, on specificity, DS outperforms on AR3 and KC3, J48 outperforms on AR1 and PC4, LMT outperforms on KC2 and MW1, while RF shows better performance on CM1 and PC3. CS-Forest and Forest-PA outperform the other techniques on PC2 and PC1, respectively.
Table 14 presents the overall analysis of precision achieved by the TF-ML techniques on each dataset. The outcomes show that, on some individual datasets, several techniques tie for the best result. On the AR1 dataset, three techniques perform well, namely, CDT, HT, and REP-T, while on the CM1, PC1, and PC3 datasets, DS and HT produce the same results, and on the PC2 dataset, seven techniques show the same results, outperforming the remaining three techniques. On the AR3, KC2, KC3, MW1, and PC4 datasets, CDT, LMT, DS, HT, and J48, respectively, outperform the other techniques.
Table 15 shows the recall assessment of each technique. The analysis shows that CDT produces better results only on the KC3 dataset, while RF does the same only on the AR3 dataset. CS-Forest, however, outperforms the other techniques and shows the best performance on eight datasets, namely, AR1, CM1, KC2, MW1, PC1, PC2, PC3, and PC4. This analysis recommends the CS-Forest technique with respect to recall. For a better understanding of the analysis of specificity, precision, and recall, Figures 8-10 show the error bars and standard deviation lines for specificity, precision, and recall, respectively.
The F-measure analysis is presented in Table 16, where RF generates better results on four datasets, DS and HT each generate better outcomes on three datasets, while CDT, J48, LMT, and REP-T each do the same on two datasets. Forest-PA outperforms the other techniques on only one dataset. On the PC2 dataset, seven techniques produce the same results, as was also the case for precision and MAE. A question may arise here as to why so many techniques produce the same good results on the PC2 dataset. The answer is that the PC2 dataset contains a very small proportion of defective modules, only 0.5%. Among the rest of the datasets, none has fewer than 6.9% defective modules, which is why the performance of each technique differs on these datasets.
Accuracy assessments of all employed TF-ML techniques are shown in Table 19. The outcomes achieved on the AR3, KC3, and PC2 datasets show that CDT outperforms the other techniques in terms of achieving a higher accuracy. Moreover, on the AR1 and PC2 datasets, REP-T also outperforms the other techniques while achieving the same results as CDT. On the PC2 dataset, in addition to CDT and REP-T, DS, Forest-PA, J48, and RF also produce the same results. This is due to the very small number of defective modules in this dataset. On the CM1 and KC3 datasets, DS beats the other employed techniques in terms of accuracy, and on the CM1 dataset, HT also outperforms the other techniques, producing the same results as DS. Furthermore, on AR3, PC1, PC3, PC4, and PC2 as well, RF achieves a higher accuracy than the other utilized techniques. The overall accuracy analysis suggests RF for better accuracy. Moreover, the error bars and standard deviation lines for a better understanding of F-measure, G-measure, MCC, and accuracy are presented in Figures 11-14, respectively. The average accuracy is shown in Figure 15 for a better understanding of the outcomes of each technique across all utilized datasets.
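The measures of this scenario, including the undefined cases that show up as "?" in the tables, can be sketched from the four confusion-matrix counts as follows. The function names and the example counts are hypothetical; `None` plays the role of the "?" entry whenever a denominator is zero.

```python
import math
from typing import Dict, Optional


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Safe division: None stands for an undefined value (zero denominator)."""
    return numerator / denominator if denominator != 0 else None


def harmonic(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Harmonic mean 2ab / (a + b); undefined if an input is undefined or a + b = 0."""
    if a is None or b is None or a + b == 0:
        return None
    return 2 * a * b / (a + b)


def confusion_measures(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_measure": harmonic(precision, recall),
        "g_measure": harmonic(recall, specificity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical highly imbalanced case: the classifier predicts "nondefective" for all.
all_negative = confusion_measures(tp=0, fp=0, fn=2, tn=398)
# Hypothetical balanced case.
balanced = confusion_measures(tp=40, fp=10, fn=20, tn=30)
```

In the first example, accuracy is very high even though no defective module is found, while precision, F-measure, and MCC are undefined. This is the kind of situation in which several techniques can tie on a dataset with very few defective modules.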

Discussion on Overall Performance
A popular way to compare the overall performance of classifiers is to count (w) the number of datasets on which an algorithm is the overall winner, also known as the Count of Wins test. We have used 10 datasets, and no technique has given the best results for at least 10 datasets at α = 0.05, according to the critical values in Table 3 of [48]. Since the Count of Wins test is also considered to be a weak testing procedure, we provide a detailed matrix in Table 20 for Scenario 2. Table 20 shows the detailed analysis of all evaluated absolute and squared error rates. It covers the absolute errors MAE and RAE on the employed datasets and TF-ML techniques. The results show that J48 and RT outperform the other techniques. However, on the squared errors RMSE and RRSE, the RF technique beats the rest of the techniques in terms of reducing squared error. Moreover, Table 21 shows the overall outcomes of experimental scenario 3. For specificity, the outcomes show the best performance of J48, DS, RF, and LMT on two datasets each. On precision, HT beats the other techniques on five datasets and DS on four datasets, while CDT beats the others on only two datasets. On recall, CS-Forest shows the best performance on eight datasets, while on F-measure, RF outperforms the rest of the techniques, and on G-measure, RF, LMT, and CS-Forest each perform better on two datasets. On MCC, CS-Forest outperforms on four datasets and J48 and RF on two datasets each. Finally, on accuracy, RF produces the best results on five datasets, CDT on three datasets, and DS on two datasets. Overall, the performance of RF on error rate as well as on accuracy is better than that of the rest of the utilized TF-ML techniques.
Generally, the more trees there are in a forest, the more robust the forest appears. In the same way, in the RF classifier, a higher number of trees in the forest gives more accurate results. In other words, RF is an ensemble of numerous trees that is well suited to handling categorical data, with the final decision obtained by majority voting over the outcomes of the individual trees [33,45,49]. RF not only delivers a binary classification of data points but also delivers the probability of each point belonging to the defective or nondefective category [50]. It is considered one of the most powerful techniques, as it is highly effective for both regression and classification [44].

Friedman Two-Way Analysis of Variance by Ranks.
To compare all applied ML techniques on multiple datasets, we have applied the statistical procedure described by Sheskin [51] and by García and Herrera [52]. The Friedman two-way analysis of variance by ranks (Friedman [53]) is used with rank-order data in a hypothesis testing situation. A significant test indicates that there is a significant difference between at least two of the techniques in the set of k techniques. The Friedman test checks whether the measured average ranks are significantly different from the mean rank (in our case, $R_j = 3.96$). The chi-square ($\chi^2$) distribution is used to approximate the Friedman test statistic [51]. Friedman's statistic is $\chi^2 = 139.7985$.
To reject the null hypothesis, the computed $\chi^2$ value must be equal to or greater than the tabled (table of the chi-square distribution) critical chi-square value at the prespecified level of significance [51]. The number of degrees of freedom is $df = k - 1$; thus, $df = 10 - 1 = 9$. For $df = 9$, the tabled critical chi-square value at $\alpha = 0.05$ is $\chi^2_{0.05} = 16.92$. Since the computed value 139.7985 is greater than $\chi^2_{0.05} = 16.92$, the alternative hypothesis is supported at $\alpha = 0.05$. It can be concluded that there is a significant difference among at least two of the ten ML techniques. This result can be summarized as follows: $\chi^2(9) = 139.7985$, $p < 0.05$. Since the critical value is lower than the computed $\chi^2$, we can continue with the post hoc tests to identify the significant pairwise differences among all the techniques. The results are shown in Table 22, where z is the corresponding statistic and p is the p value for each hypothesis. z is computed using the following equation:
$$z = \frac{R_i - R_j}{SE},$$
where $R_i$ is the average rank of the ith technique and the standard error is
$$SE = \sqrt{\frac{k(k+1)}{6n}} = 0.175.$$
The fifth and sixth columns give the results of Nemenyi's and Holm's procedures. The second-to-last column lists the differences between the average ranks of the ith and jth techniques. The last column shows the critical difference (CD): the performance of two techniques is significantly different if the corresponding average ranks differ by at least the CD. CD can be calculated using the following equation:
$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6n}},$$
where the critical value $q_{\alpha}$ is given in Table 5(b) of Demsar [48]. The notations "significant" and "insignificant" indicate whether the difference in the average ranks ($R_i - R_j$) is greater or less than the value of CD, respectively. Greater means a significant difference between the two techniques. Here, the value of CD is 0.485. In Table 22, the family of hypotheses is ordered by p value.
As can be seen, Nemenyi's procedure rejects the first 28 hypotheses, whereas Holm's procedure also rejects the next 3 hypotheses, since the corresponding p values are smaller than the adjusted α values of Nemenyi's and Holm's procedures. Therefore, we conclude that the performance of HT and LMT is comparable, and RT resulted in lower performance. Besides, the obtained value CD = 0.485 indicates that any difference between the average ranks of two techniques that is equal to or greater than 0.485 is significant. In the pairwise comparisons in Table 22, the difference between average ranks exceeds CD = 0.485 for the first 33 pairs. Thus, it can be concluded that there is a significant difference between the average ranks of the first 33 pairs of techniques.
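To illustrate why Holm's procedure can reject more hypotheses than Nemenyi's, the following Python sketch counts rejections under both adjustments. It is a simplified sketch with hypothetical function names and made-up p values: Nemenyi's adjustment is taken as a single threshold α/m over all m pairwise hypotheses, and Holm's as the step-down thresholds α/m, α/(m − 1), and so on, applied to the p values in increasing order.

```python
def nemenyi_rejections(p_values, alpha=0.05):
    """Count hypotheses whose p value is below the single adjusted level alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return sum(1 for p in p_values if p < threshold)


def holm_rejections(p_values, alpha=0.05):
    """Step-down Holm: compare the i-th smallest p value with alpha / (m - i), stop at first failure."""
    ordered = sorted(p_values)
    m = len(ordered)
    rejected = 0
    for i, p in enumerate(ordered):
        if p < alpha / (m - i):
            rejected += 1
        else:
            break  # all remaining (larger) p values are retained as well
    return rejected


# Made-up p values for five pairwise hypotheses, for illustration only.
p_values = [0.001, 0.011, 0.02, 0.04, 0.5]
print(nemenyi_rejections(p_values), holm_rejections(p_values))
```

Because Holm's thresholds grow as hypotheses are rejected, it rejects at least as many hypotheses as the single-threshold adjustment.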

Threats to Validity.
In this section, we discuss the factors that may affect the validity of this research work.

Internal Validity.
The analysis in this paper is based on several well-known evaluation measures that have been used in various past studies. Among these measures, some assess the error rate, while others assess accuracy. The threat is that replacing the measures used here with newer evaluation measures may reduce the measured accuracy. Furthermore, the techniques used in this research could be replaced with newer techniques that can be hybridized with each other and may produce better outcomes than the employed techniques.

External Validity.
We conducted experiments on various datasets. A threat to validity may arise if we apply the proposed techniques to other real data collected from different software development organizations (for example, through surveys), or replace these datasets with other datasets, which may affect the outcomes and increase the error rates. Likewise, the proposed techniques may not produce better predictions on other SDP datasets. Hence, this study concentrated on the AR1, AR3, CM1, KC2, KC3, MW1, PC1, PC2, PC3, and PC4 datasets to measure the performance of the utilized techniques.

Construct Validity.
In this study, different TF-ML techniques are benchmarked against each other on various datasets using several assessment measures. The techniques used in this study were selected for their advanced features relative to other techniques that researchers have used in recent decades. However, the threat is that newer techniques, if applied, may outperform the techniques studied here. Furthermore, any change in the dataset splitting (increasing or decreasing the number of folds K) may change the current outcomes. It is also possible that using newer evaluation measures would yield better outcomes than those currently achieved.

Conclusion
Nowadays, SDP using ML techniques is regarded as one of the emerging research areas. Although identifying software defects at an early stage of the SDLC is a challenging task, it can contribute to the delivery of high-quality software systems. This paper considered ten widely used, publicly available datasets to compare ten well-known TF-ML techniques: CDT, CS-Forest, DS, Forest-PA, HT, J48, LMT, RF, RT, and REP-T, which are broadly used for SDP. The performance is evaluated using different measures such as MAE, RAE, RMSE, RRSE, specificity, precision, recall, FM, GM, MCC, and accuracy. The overall results of this paper favor the RF technique, which provided the best results in terms of reducing error rates as well as increasing accuracy on five datasets, namely AR3, PC1, PC2, PC3, and PC4, where the accuracy is 92.0635%, 93.688%, 99.5885%, 90.1472%, and 90.6722%, respectively. However, CDT and DS each achieve the best accuracy on three datasets. CDT achieves accuracies of 92.562%, 81.9588%, and 99.5885% on AR1, KC3, and PC2, respectively, while DS achieves 90.1606%, 81.9588%, and 99.5885% on CM1, KC3, and PC2, respectively. The outcomes presented in this research can be used as a baseline by other studies and researchers, so that the outcomes of any proposed technique, model, or framework can be benchmarked and easily verified. For future work, class imbalance issues in these datasets ought to be addressed. Furthermore, to improve performance, feature selection and ensemble learning techniques should also be explored.
Data Availability
The datasets used in this research are taken from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/.

Conflicts of Interest
The authors declare that they have no conflicts of interest related to this study.