Empirical Study of Homogeneous and Heterogeneous Ensemble Models for Software Development Effort Estimation

Accurate estimation of software development effort is essential for effective management and control of software development projects. Many software effort estimation methods have been proposed in the literature including computational intelligence models. However, none of the existing models proved to be suitable under all circumstances; that is, their performance varies from one dataset to another. The goal of an ensemble model is to manage each of its individual models’ strengths and weaknesses automatically, leading to the best possible decision being taken overall. In this paper, we have developed different homogeneous and heterogeneous ensembles of optimized hybrid computational intelligence models for software development effort estimation. Different linear and nonlinear combiners have been used to combine the base hybrid learners. We have conducted an empirical study to evaluate and compare the performance of these ensembles using five popular datasets. The results confirm that individual models are not reliable as their performance is inconsistent and unstable across different datasets. Although none of the ensemble models was consistently the best, many of them were frequently among the best models for each dataset. The homogeneous ensemble of support vector regression (SVR), with the nonlinear combiner adaptive neurofuzzy inference systems-subtractive clustering (ANFIS-SC), was the best model when considering the average rank of each model across the five datasets.


Introduction
Software development effort estimation is one of the core tasks in software project management. It is defined as "the process of predicting the effort required to develop a software system" [1]. It is usually measured by the number of person-hours that were spent in developing the software from specification until delivery. The success of a software development project highly depends on accurate estimation of its development effort, among other factors. One of the most common factors of software project failure is inaccurate estimates of needed resources [2]. Overestimation results in wasting of resources, whereas underestimation results in schedule/budget overruns and/or quality compromise. Many software effort estimation methods have been proposed in the literature since the 1980s. In recent years, the application of computational intelligence models in estimating software development effort has been receiving increasing attention in order to improve the estimation accuracy. However, none of the existing models proved to be suitable under all circumstances. The performance of these models is unreliable; that is, it varies from one dataset to another. Therefore, there is a need to build estimation models that are reliable and provide high accuracy. Ensembles of hybrid computational intelligence models are candidates for this objective. The goal of an ensemble model is to manage each of its individual models' strengths and weaknesses automatically, leading to the best possible decision being taken overall.
In this paper, we have developed different homogeneous and heterogeneous ensembles of optimized hybrid computational intelligence models for software development effort estimation. Different linear and nonlinear combiners have been used. We have conducted an empirical study to evaluate and compare the performance of these ensembles using five popular datasets. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the computational intelligence and ensemble models that have been developed. Section 4 reports the conducted empirical study and discusses its results. Section 5 provides concluding remarks and directions for future work.

Related Work
Software effort estimation methods can be grouped into three general approaches [3,4]: expert judgment, algorithmic models, and computational intelligence. Expert judgment makes estimations based on the experience of experts on similar projects. The accuracy of this method greatly depends on the degree to which a new project falls within the experience and ability of the expert. A recent experiment found a high degree of inconsistency in expert judgment-based estimates of software development effort [5]. Moreover, the process of deriving an estimate is not explicit and thus not repeatable [3]. Jørgensen [6] conducted a review of studies on expert estimation of software development effort.
Algorithmic models represent the relationship between characteristic(s) of a software project, usually software size, and its development effort. These models are parametric in nature, with a formula of standard form that is parameterized from historical data. Examples of such models include the constructive cost model (COCOMO) [7], function points analysis [8], and software lifecycle management (SLIM) [9]. Algorithmic models are unable to capture the complex set of relationships between effort and its drivers. Moreover, they need to be calibrated or adjusted to local circumstances [4,10].
Computational intelligence models, in recent years, have been widely applied to software effort estimation. Examples include neural networks [4,11-13], Bayesian networks [14], fuzzy logic [3], regression trees (RT) [15,16], case-based reasoning [10,17,18], genetic programming [19,20], and support vector regression [15,21]. Some advantages of computational intelligence models include their ability to model the complex set of relationships between effort and its drivers and their ability to learn from historical project data [10]. Wen et al. [1] performed a systematic literature review of software development effort estimation based on computational intelligence models.
Recently, a few research studies have investigated the use of homogeneous ensemble models for software effort estimation. Braga et al. [22] investigated the use of bagging predictors for estimation of software project effort using two versions of a NASA dataset. In particular, they applied bagging to linear regression, multilayer perceptron (MLP), M5P regression trees, M5P model trees, and support vector regression (SVR). They concluded that bagging was able to improve the performance of all models except SVR. Kultur et al. [23,24] proposed an ensemble of neural networks with associative memory (ENNA) for estimating software development effort. They used MLP as the neural network model, and the training sets were generated by bootstrapping. The outputs of the base learners were combined by taking the average of the largest cluster obtained using the adaptive resonance theory algorithm. The results showed that ENNA is significantly better than a neural network in terms of accuracy and robustness. Minku and Yao [25,26] investigated bagging ensembles with MLPs, with Radial Basis Function (RBF) networks, and with regression trees. In addition, they investigated a random ensemble with MLPs and a negative correlation learning ensemble with MLPs. The outputs of the base learners were combined by taking the average. They observed that the bagging ensemble of regression trees performed well in comparison to the other approaches.
A few other studies have recently investigated the use of heterogeneous ensemble models for software effort estimation. Kocaguneli et al. [27] evaluated a heterogeneous ensemble of multiple learners by averaging across them. However, no improvement of the estimation accuracy of software effort was achieved. Kocaguneli et al. [28] evaluated ensembles of preprocessed estimation methods. They used simple linear combination schemes, namely, the mean, median, and inverse-ranked weighted mean (IRWM). Elish [29] evaluated the extent to which the voting ensemble model, with the median combination rule, offers reliable and improved estimation accuracy over five individual models (MLP, RBF, RT, K-nearest neighbor (KNN), and SVR) in estimating software development effort. In three out of the five datasets that were used in that study, the ensemble model outperformed the individual models. In the other two datasets, the ensemble model achieved the second best performance.
This paper differs from the above related works on the use of ensemble models for software effort estimation in several aspects. This paper investigates and compares both homogeneous and heterogeneous ensembles of hybrid computational intelligence models. Furthermore, in addition to simple linear combiners, this paper investigates and compares several nonlinear combiners. A comparison between this paper and related works is provided in Table 1.

Hybrid Computational Intelligence and Ensemble Models
A hybrid computational intelligence (HCI) model combines at least two computational intelligence (CI) techniques. For example, the combination of an artificial neural network (ANN) with a fuzzy inference system (FIS) results in a hybrid neurofuzzy system. HCI models are defined as any effective combination of CI techniques, in a sequential or parallel manner, that performs better than simple CI techniques [30]. The main challenge of an HCI model is how efficiently its components collaborate. Another important factor of an HCI model is the speed of the process and the time needed to produce a generalized high-performance decision model.
In this paper, we have used nonlinear (categorical) principal component analysis (PCA) together with the different CI models, which results in HCI models. PCA was first introduced by Pearson in 1901 and has become a popular tool in data analysis. PCA finds the directions in which a cloud of data points is stretched most. The objective of PCA is to perform dimensionality reduction while preserving as much of the randomness (variance) in the high-dimensional space as possible. PCA performs a mapping of the data to a lower dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. The basic idea behind using PCA for feature selection prior to regression is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients. In the proposed ensemble models, PCA seeks to replace more or less correlated variables by uncorrelated combinations (projections) of the original variables. PCA is also used to perform dimension reduction and variable selection based on the resulting variable loadings.
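As a rough illustration of the variance-maximizing projection described above, the following sketch computes the first principal component of a small two-feature dataset by power iteration on the covariance matrix. It uses ordinary linear PCA, not the nonlinear (categorical) PCA used in this paper, and the data values and function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Leading eigenvector of cov (unit length), found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: two strongly correlated features.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
cov, centered = covariance_matrix(data)
pc1 = first_principal_component(cov)

# Project each centered observation onto the first component.
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

The variance of `scores` is at least as large as the variance of either original feature, which is the sense in which the low-dimensional representation retains the most variance.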

The Ensemble Model.
An ensemble model employs a group of multiple learning algorithms and combines their outputs, acting as a single decision maker. Figure 1 shows an ensemble of $N$ CI models. Every CI model has limitations, and so different learning algorithms suit different problems. The principle of an ensemble is that the combined results of the ensemble model should have better overall accuracy than any individual member. Numerous studies showed that ensemble accuracy significantly exceeds that of a single model [31-33]. The goal of an ensemble model is to manage each individual model's strengths and weaknesses automatically, leading to the best possible decision being taken overall. In fact, one of the motivations for using the ensemble approach is to avoid overfitting that is caused by high variance. Ensemble models take a combination of several hypotheses, which tends to cancel out overfitting errors. Variance reduction methods such as bagging can help and are typically used for this purpose. Indeed, the best results are often obtained by bagging overfitted classifiers.
Typically, the proposed ensembles are constructed in two steps. First, a number of base learners are produced, which can be generated in a parallel style (bagging) or in a sequential style (boosting), where the generation of a base learner has influence on the generation of subsequent learners. Then, the base learners are combined, where the most popular combination schemes are majority voting for classification and weighted averaging for regression. Simple averaging often provides an effective method of combining continuous outputs from individual CI models using
$$y_{en} = \frac{1}{N} \sum_{i=1}^{N} y_i,$$
where $y_i$ ($i = 1, 2, \ldots, N$) is the output of CI member $i$ and $y_{en}$ is the combined output. This formula assumes that the outputs from all CI models are of equal weight. In practice, some outputs may have greater weights than others, and individual CI outputs may be weighted to reflect this fact. The combined output is then a weighted sum of the outputs of all members, as given by
$$y_{en} = \sum_{i=1}^{N} w_i y_i,$$
where $w_i$ is the weight of member $i$. To assign the weights of the members of the ensemble model, we predicted the whole training data to measure each member's performance in terms of RMSE and assigned the highest weight to the model having the lowest RMSE.
To combine the outputs of the ensemble, we have used linear and nonlinear approaches. For the linear approach, we have used the averaging and weighted averaging methods. For the nonlinear approach, we have used CI models as combiners, namely, ANN, SVR, FIS, and adaptive neurofuzzy inference system (ANFIS), where the latter two were created with fuzzy C-means clustering (FCM) and subtractive clustering (SC). For the ANN combiner, we have used an MLP with two neurons in the hidden layer with a log-sigmoidal activation function and a tan-sigmoidal activation function in the output layer, trained with either the trainlm or the trainrp training algorithm. For SVR, we have used a "Gaussian" type kernel with a kernel parameter value of 0.5. The other parameters are, for example, C = 0.5, lambda = 1 − 7, and epsilon = 0.0001. For combining using FIS, we have used FCM with 6 clusters and a radius of 0.3 for SC.
We developed three homogeneous ensemble models, each of which has three PCA-based CI models of type MLP, SVR, or ANFIS, respectively. Moreover, in the proposed ensembles, we optimized their parameters using an evolutionary algorithm based on the genetic algorithm (GA). To improve the efficiency of the PCA approach, the GA has been used to select the features that would increase the performance in both the training phase and the test phase. The GA was used to select the most important features in order to improve the running time and accuracy of the models, whereas PCA was used for feature extraction.
Table 2 lists the homogeneous and heterogeneous ensembles that have been investigated in this paper. A total of 32 different ensemble models have been evaluated and compared. For each ensemble, the table provides its base learner(s), combination type, and rule. The naming convention (abbreviation) for these ensemble models is as follows: EnsembleType-BaseLearner(s)-[CombinationRule]. EnsembleType is either HM (homogeneous) or HT (heterogeneous). BaseLearner(s) is either MLP, SVR, or ANFIS for HM ensembles, or MLP, SVR, and ANFIS for HT ensembles. There are eight combination rules: two linear (averaging and weighted averaging) and six nonlinear (MLP, SVR, FIS-FCM, FIS-SC, ANFIS-FCM, and ANFIS-SC).
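The two linear combiners can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the exact weight formula is not given here, so the sketch assumes weights proportional to the inverse of each member's training RMSE (which gives the highest weight to the member with the lowest RMSE). All numeric values and function names are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def simple_average(member_outputs):
    """Equal-weight combiner: mean of the members' outputs for one project."""
    return sum(member_outputs) / len(member_outputs)

def inverse_rmse_weights(rmses):
    """ASSUMPTION: weights proportional to 1/RMSE, normalized to sum to 1.
    Requires every RMSE to be strictly positive."""
    inverse = [1.0 / r for r in rmses]
    total = sum(inverse)
    return [x / total for x in inverse]

def weighted_average(member_outputs, weights):
    """Weighted-sum combiner for one project."""
    return sum(w * y for w, y in zip(weights, member_outputs))

# Hypothetical training efforts and the three members' predictions for them.
actual = [100.0, 200.0, 300.0]
train_predictions = {
    "MLP": [110.0, 190.0, 310.0],
    "SVR": [105.0, 205.0, 295.0],
    "ANFIS": [130.0, 170.0, 330.0],
}
rmses = [rmse(actual, p) for p in train_predictions.values()]
weights = inverse_rmse_weights(rmses)

# Hypothetical outputs of the three members for one new project.
new_outputs = [120.0, 150.0, 180.0]
y_avg = simple_average(new_outputs)
y_wavg = weighted_average(new_outputs, weights)
```

In this toy case the SVR member has the lowest training RMSE, so the weighted estimate is pulled toward its output, whereas the simple average treats all three members equally.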

Selecting Training Set for Each Ensemble Member.
At first, we divided each dataset into training and testing sets. Around 80% of each dataset was used for training and the remaining 20% was used for testing. The ensemble members are actually trained using 80% of the training set, and the rest is used for model validation. After the first run of the algorithm, in each of the following runs we selected the same amount of actual training data (that is, 80% of the whole training set), choosing the data that were poorly predicted by the CI model in the previous run.
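One possible reading of this selection procedure is sketched below. The random split, the use of absolute error to decide which examples are "poorly predicted", and all names and data are assumptions made for illustration; the stand-in model is not one of the CI models used in this study.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of items and split it into roughly 80% / 20% parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def hardest_examples(train, predict, fraction=0.8):
    """Keep the given fraction of (x, y) pairs with the largest absolute error."""
    ranked = sorted(train, key=lambda xy: abs(xy[1] - predict(xy[0])), reverse=True)
    keep = int(round(fraction * len(ranked)))
    return ranked[:keep]

# Hypothetical projects as (size, effort) pairs.
projects = [(float(s), 10.0 * s + (5.0 if s % 2 else -5.0) * (s / 10))
            for s in range(1, 51)]

# 80% of the whole dataset for training, 20% for testing.
train_all, test = split_80_20(projects)

# First run: 80% of the training set to fit a member, the rest for validation.
fit_part, validation = split_80_20(train_all, seed=1)

def toy_model(size):
    """Stand-in for a trained CI model (hypothetical)."""
    return 10.0 * size

# Following run: same amount of data, chosen among the poorly predicted cases.
next_fit_part = hardest_examples(train_all, toy_model)
```

Training successive members on the cases the previous member predicted poorly is one way to make the members differ from each other.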

Empirical Evaluation
We conducted an empirical study to evaluate and compare the performance of the homogeneous and heterogeneous ensemble models under investigation in estimating software development effort. This section discusses the conducted empirical study and its results.
4.1. Datasets. Five well-known datasets were used in this empirical study. These datasets, which are described next, have been widely used in the literature. A summary of their characteristics is provided in Table 3.
4.1.5. Desharnais Dataset. The Desharnais dataset [36] consists of 77 software projects from a Canadian software house. It has eight independent variables: "team experience," "manager's experience," "length of project," "transactions," "entities," "adjusted function points," "development environment," and "programming language." The dependent variable is the software development effort measured in person-hours.

Performance Evaluation Metrics.
In order to assess and compare the different estimation models, three performance evaluation metrics were considered. The first metric is the mean magnitude of relative error (MMRE), which is calculated as follows:
$$\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - \hat{x}_i|}{x_i},$$
where $x_i$ and $\hat{x}_i$ are the actual and estimated values of observation $i$, respectively, in a dataset of $n$ observations. The second metric is PRED(25), which is a measure of the percentage of observations whose magnitude of relative error (MRE) is less than 0.25. A good estimation model will minimize MMRE and maximize PRED(25).
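These two metrics can be sketched in Python as follows, assuming the usual definition MRE = |actual − estimated| / actual and the strict "less than 0.25" threshold stated above. The effort values and function names are hypothetical.

```python
def mre(actual, estimated):
    """Magnitude of relative error for one observation (actual must be nonzero)."""
    return abs(actual - estimated) / actual

def mmre(actuals, estimates):
    """Mean magnitude of relative error over all observations."""
    return sum(mre(a, e) for a, e in zip(actuals, estimates)) / len(actuals)

def pred(actuals, estimates, level=0.25):
    """Percentage of observations whose MRE is below the given level."""
    hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) < level)
    return 100.0 * hits / len(actuals)

# Hypothetical actual and estimated efforts (person-hours) for four projects.
actuals = [100.0, 200.0, 400.0, 800.0]
estimates = [110.0, 160.0, 600.0, 800.0]

model_mmre = mmre(actuals, estimates)    # MREs are 0.1, 0.2, 0.5, and 0.0
model_pred25 = pred(actuals, estimates)  # three of the four MREs are below 0.25
```

Because MMRE is a mean, a single badly estimated project (the third one here) can dominate it, while PRED(25) only counts how many projects fall within the tolerance.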
The third performance metric is a recently proposed evaluation function (EF) [37], which combines MMRE and PRED(25) as defined in [37].
4.3. Evaluation of Hybrid Models. This section evaluates whether the hybridization of an individual model improves its estimation performance. If so, we use its hybrid version in the development of the ensemble models; otherwise, we use it as it is. Specifically, the performance of the individual SVR model was compared to that of the hybrid PCA-SVR model and the hybrid PCA-GA-SVR model. The individual MLP and ANFIS models were also compared against their hybrid versions. Table 4 reports the performance of the individual and hybrid models based on the EF metric in each of the five datasets. Figures 2-6 show histograms of the models' performance for the five datasets, respectively.
In the Albrecht dataset, as observed from Table 4 and Figure 2, the hybrid PCA-SVR model performed better than both the individual SVR model and the hybrid PCA-GA-SVR model. However, the individual MLP model performed better than both the hybrid PCA-MLP model and the hybrid PCA-GA-MLP model. The individual ANFIS model also performed better than its hybrid versions. For this dataset, we accordingly developed homogeneous and heterogeneous ensembles of the hybrid PCA-SVR model and the individual MLP and ANFIS models.
In the Miyazaki dataset, as observed from Table 4 and Figure 3, the hybrid PCA-GA-SVR model performed better than both the individual SVR model and the hybrid PCA-SVR model. The hybrid PCA-MLP model performed better [...]
In the Desharnais dataset, as observed from Table 4 and Figure 6, the individual SVR model performed better than both the hybrid PCA-SVR model and the hybrid PCA-GA-SVR model. Similarly, the individual MLP and ANFIS models performed better than their hybrid models. For this dataset, we accordingly developed homogeneous and heterogeneous ensembles of individual models rather than using their hybrid versions.

Evaluation of Homogeneous and Heterogeneous Ensemble Models.
This section evaluates and compares the estimation performance of the homogeneous and heterogeneous ensemble models under investigation. Tables 5, 6, 7, 8, and 9 report the performance of the individual models, the homogeneous ensemble models, and the heterogeneous ensemble models in estimating software development effort using the Albrecht, Miyazaki, Maxwell, COCOMO, and Desharnais datasets, respectively. Figures 7, 9, 11, 13, and 15 show histograms of the EF measure for each model's performance in each of the five datasets, respectively. Figures 8, 10, 12, 14, and 16 show five plots of the MMRE versus the PRED(25) values that were achieved by each model in each dataset, respectively. Relatively accurate estimation models appear in the top left corner of these plots. The following subsections discuss the results based on each dataset and then the overall results based on all datasets.
(vii) None of the heterogeneous ensembles was among the top 10 models in the Albrecht dataset. However, at least two of the heterogeneous ensembles were among the top 10 models in each of the other datasets.
(viii) None of the ensemble models with the nonlinear combiner [FIS-FCM] performed well; none of them was among the top 10 models in any dataset.
(ix) None of the ensemble models with the nonlinear combiner [SVR], except the HM-SVR-[SVR] model, was among the top 10 models in any dataset. The HM-SVR-[SVR] model was ranked 10th in the Albrecht dataset and was not among the top 10 models in the other four datasets.

Figure 2: Performance of individual and hybrid models based on EF metric using Albrecht dataset.

Figure 3: Performance of individual and hybrid models based on EF metric using Miyazaki dataset.

Figure 4: Performance of individual and hybrid models based on EF metric using Maxwell dataset.

Figure 5: Performance of individual and hybrid models based on EF metric using COCOMO dataset.

Figure 6: Performance of individual and hybrid models based on EF metric using Desharnais dataset.

4.4.1. Results Based on Albrecht Dataset. Among the individual models, the MLP model achieved the best performance in terms of MMRE, PRED(25), and EF. By comparing the performance of the homogeneous ensembles of MLP, it can be observed that HM-MLP-[WtAvg] and HM-MLP-[FIS-SC] were the best models. Moreover, only three homogeneous ensembles of MLP (i.e., HM-MLP-[WtAvg], HM-MLP-[FIS-SC], and HM-MLP-[MLP]) achieved better EF than the individual MLP model, whereas the other ensembles of MLP were worse than it. By comparing the performance of the homogeneous ensembles of SVR, it can be noticed that HM-SVR-[ANFIS-SC] was the best, followed by HM-SVR-[ANFIS-FCM]. Furthermore, all homogeneous ensembles of SVR except HM-SVR-[FIS-SC] improved the performance of the individual SVR model in terms of EF. By comparing the performance of the homogeneous ensembles of ANFIS, it can be observed that HM-ANFIS-[MLP] was the best among them in terms of EF. In addition, only three homogeneous ensembles of ANFIS (i.e., HM-ANFIS-[Avg], HM-ANFIS-[FIS-SC], and HM-ANFIS-[ANFIS-SC]) achieved worse EF than the individual ANFIS model, whereas the other ensembles of ANFIS were better than it. Among the heterogeneous ensemble models, HT-(MLP, SVR, ANFIS)-[ANFIS-FCM] and HT-(MLP, SVR, ANFIS)-[ANFIS-SC] achieved relatively better performance than the other heterogeneous ensembles. It is interesting to observe that the individual MLP model performed better than all the heterogeneous ensembles. The distribution of the top 10 models, in terms of EF, is as follows: 1 individual model (MLP), 5 ensembles of MLP, 3 ensembles of SVR, and 1 ensemble of ANFIS. None of the heterogeneous ensembles was among the top 10 models.
4.4.2. Results Based on Miyazaki Dataset. Among the individual models, the ANFIS model achieved the best performance in terms of MMRE and EF, and the SVR model was the best in terms of PRED(25). By comparing the performance of the homogeneous ensembles of MLP, it can be observed that the HM-MLP-[ANFIS-FCM] model was the best. Moreover, only three homogeneous ensembles of MLP (i.e., HM-MLP-[MLP], HM-MLP-[FIS-SC], and HM-MLP-[ANFIS-FCM]) achieved better EF than the individual MLP model, whereas the other ensembles of MLP were worse than it. By comparing the performance of the homogeneous ensembles of SVR, it can be noticed that HM-SVR-[ANFIS-SC] was the best. Furthermore, all homogeneous ensembles of SVR except HM-SVR-[SVR] improved the performance of the individual SVR model in terms of EF. By comparing the performance of the homogeneous ensembles of ANFIS, it can be observed that HM-ANFIS-[Avg] and HM-ANFIS-[WtAvg] were the best models among them in terms of EF. In addition, only three homogeneous ensembles of ANFIS (i.e., HM-ANFIS-[Avg], HM-ANFIS-[WtAvg], and HM-ANFIS-[ANFIS-FCM]) achieved better EF than the individual ANFIS model, whereas the other ensembles of ANFIS were worse than it. Among the heterogeneous ensemble models, HT-(MLP, SVR, ANFIS)-[Avg], HT-(MLP, SVR, ANFIS)-[WtAvg], and HT-(MLP, SVR, ANFIS)-[MLP] achieved relatively better performance than the other heterogeneous ensembles. The distribution of the top 10 models, in terms of EF, is as follows: 1 ensemble of MLP, 3 ensembles of SVR, 3 ensembles of ANFIS, and 3 heterogeneous ensembles. None of the individual models was among the top 10 models.
4.4.3. Results Based on Maxwell Dataset. Among the individual models, the MLP model achieved the best performance in terms of MMRE, whereas the SVR model was the best in terms of PRED(25). Both models achieved the best EF value. By comparing the performance of the homogeneous ensembles of MLP, it can be observed that HM-MLP-[ANFIS-SC] was the best model based on the PRED(25) and EF metrics. Moreover, only three homogeneous ensembles of MLP (i.e., HM-MLP-[Avg], HM-MLP-[FIS-SC], and HM-MLP-[ANFIS-SC]) achieved better EF than the individual MLP model, whereas the other ensembles of MLP were worse than it. By comparing the performance of the homogeneous ensembles of SVR, it can be noticed that HM-SVR-[MLP] was the best, followed by HM-SVR-[ANFIS-SC], in terms of EF. Furthermore, all other homogeneous ensembles of SVR performed worse than the individual SVR model in terms of EF. By comparing the performance of the homogeneous ensembles of ANFIS, it can be observed that HM-ANFIS-[ANFIS-SC] was the best among them in terms of MMRE, PRED(25), and EF. In addition, only two homogeneous ensembles of ANFIS (i.e., HM-ANFIS-[FIS-FCM] and HM-ANFIS-[ANFIS-FCM]) achieved worse EF than the individual ANFIS model, whereas the other ensembles of ANFIS were better than it. Among the heterogeneous ensemble models, HT-(MLP, SVR, ANFIS)-[FIS-SC] and HT-(MLP, SVR, ANFIS)-[ANFIS-SC] achieved relatively better performance than the other heterogeneous ensembles. The distribution of the top 10 models, in terms of EF, is as follows: 3 ensembles of [...]

4.4.5. Results Based on Desharnais Dataset. Among the individual models, the SVR model achieved the best performance in terms of MMRE, PRED(25), and EF. By comparing the performance of the homogeneous ensembles of MLP, it can be observed that HM-MLP-[FIS-SC], HM-MLP-[ANFIS-FCM], and HM-MLP-[ANFIS-SC] were the best models based on the EF metric. Moreover, all homogeneous ensembles of MLP except HM-MLP-[MLP] improved the performance of the individual MLP model in terms of EF. By comparing the performance of the homogeneous ensembles of SVR, it can be noticed that HM-SVR-[Avg] and HM-SVR-[WtAvg] were the best. Furthermore, all other homogeneous [...]

Table 1: Comparison of related works on the use of ensemble models for software effort estimation.
Ensembles. A heterogeneous ensemble consists of members having different base learning algorithms. We developed one heterogeneous ensemble model having PCA-based CI models of type MLP, SVR, and ANFIS. At first, we provided the input to the MLP. We then selected the training data that were poorly predicted by the MLP and used them to train the SVR; later on, the training data that were poorly predicted by the SVR were provided to the ANFIS for training. In this way, the members become diverse by having different training datasets. A homogeneous ensemble consists of members having a single type of base learning algorithm. In this case, the ensemble members can differ in their structure.

Table 2: Investigated homogeneous and heterogeneous ensemble models.
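As a quick consistency check on the naming convention EnsembleType-BaseLearner(s)-[CombinationRule], the sketch below enumerates the abbreviations of the investigated ensembles. The short rule labels "Avg" and "WtAvg" follow the model names used in the results discussion; the variable and function names are hypothetical.

```python
# Three homogeneous base-learner choices and one heterogeneous combination.
ENSEMBLE_BASES = {
    "HM": ["MLP", "SVR", "ANFIS"],
    "HT": ["(MLP, SVR, ANFIS)"],
}

# Two linear rules followed by six nonlinear rules.
COMBINATION_RULES = [
    "Avg", "WtAvg",
    "MLP", "SVR", "FIS-FCM", "FIS-SC", "ANFIS-FCM", "ANFIS-SC",
]

def ensemble_names():
    """Build every EnsembleType-BaseLearner(s)-[CombinationRule] abbreviation."""
    return [
        f"{ensemble_type}-{base}-[{rule}]"
        for ensemble_type, bases in ENSEMBLE_BASES.items()
        for base in bases
        for rule in COMBINATION_RULES
    ]

names = ensemble_names()
```

Four base-learner configurations combined with eight rules give the 32 ensemble models evaluated in this paper.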

Table 3: Characteristics of datasets.

Table 4: Performance of individual and hybrid models based on EF metric.
[...] measured by the number of hours of the work carried out by the software supplier from specification until delivery.
4.1.4. COCOMO Dataset. The COCOMO dataset [7] consists of 63 software projects including business, scientific, systems, real-time, and support software projects. It has 16 independent variables that measure product, project, computer, and personnel attributes. The dependent variable is the software development effort measured in person-hours.

Table 7: Models' performance using Maxwell dataset.

[...] SC] achieved relatively better performance than the other heterogeneous ensembles. The distribution of the top 10 models, in terms of EF, is as follows: 4 ensembles of MLP, 2 ensembles of SVR, 2 ensembles of ANFIS, and 2 heterogeneous ensembles. None of the individual models was among the top 10 models.

Table 8: Models' performance using COCOMO dataset.

Overall Results Based on All Datasets. The ranking of the models' performance in each dataset, based on the EF metric, is provided in Table 10, where the top 10 models in each dataset are highlighted. Based on this table, we provide the following observations across the five datasets.

Table 10: Ranking of models based on EF metric (top 10 models are highlighted).
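The average-rank comparison across datasets can be sketched as follows. The sketch assumes that a higher EF is better and that tied models share the average of their ranks; the tie rule is an assumption, and the EF values and model names below are hypothetical placeholders, not results from this study.

```python
def rank_descending(scores):
    """Map each model to its rank (1 = highest score); ties share the mean rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with the model at position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2.0
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks

def average_ranks(ef_by_dataset):
    """Average each model's per-dataset rank over all datasets."""
    per_dataset = [rank_descending(scores) for scores in ef_by_dataset.values()]
    models = list(per_dataset[0].keys())
    return {m: sum(r[m] for r in per_dataset) / len(per_dataset) for m in models}

# Hypothetical EF values (NOT results from this study).
ef_by_dataset = {
    "dataset_A": {"model_1": 0.60, "model_2": 0.55, "model_3": 0.40},
    "dataset_B": {"model_1": 0.30, "model_2": 0.45, "model_3": 0.45},
}
avg = average_ranks(ef_by_dataset)
best = min(avg, key=avg.get)  # model with the lowest (best) average rank
```

A model that is never first in any single dataset can still have the best average rank, which matches the observation that no ensemble was consistently the best while several were frequently near the top.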