Global Optimization Ensemble Model for Classification Methods

Supervised learning is the data mining process of deducing rules from training datasets. A broad array of supervised learning algorithms exists, each with its own advantages and drawbacks. Several basic issues affect the accuracy of a classifier when solving a supervised learning problem, such as the bias-variance tradeoff, the dimensionality of the input space, and noise in the input data. Because of these issues, there is no globally optimal method for classification, and no generalized improvement method exists that can increase the accuracy of any classifier while addressing all of them. This paper proposes a global optimization ensemble model for classification methods (GMC) that can improve the overall accuracy for supervised learning problems. Experimental results on various public datasets show that the proposed model improved the accuracy of the classification models by 1% to 30%, depending on the algorithm complexity.


Introduction
According to Han and Kamber [1], "Data Mining is known to be a part of knowledge discovery (KDD) process in which data is analysed and summarized from different perspectives and converted into useful information. It helps in extracting the hidden and valid data which has the potential of being transformed into useful information." It is similar to the machine learning process, and the part of it considered here can be termed supervised learning. Supervised learning is the data mining process of deducing rules from training datasets. A broad array of supervised learning algorithms exists, each with its own advantages and drawbacks.
In classification, the first step is to divide the data into two portions known as the training set and the testing set [2]. In these datasets, one attribute must be defined as the class. According to Han et al. [2], the two steps of the classification task are model construction and model usage. The model is built with the help of the training dataset, and the trained model is then used to classify unseen records as precisely as possible. While the training dataset is used to build and train the model, the testing dataset is used to validate and test the model accuracy [3]. This brings us to some of the basic issues that affect the accuracy of a classifier when solving a supervised learning problem: the bias-variance tradeoff, the curse of dimensionality, and noise in the dataset all contribute to decreased accuracy. Bias arises when the classifier cannot represent the true function, that is, when the classifier underfits the data: whatever dataset it is trained on, it is systematically inaccurate when predicting the right outcome for a specific input value. In contrast, variance occurs when the algorithm overfits the data: for a specific input value, it gives a different outcome every time the training dataset is changed. Another problem that can affect the accuracy of a classifier is the dimensionality, that is, the number of attributes or features in a dataset. If a large number of attributes is fed to a classification algorithm even though the decision depends on only a subset of them, the performance of the classifier will be clouded by high variance due to the high dimension of the dataset. Therefore, if a high-dimensional dataset is being used, the classifier must make a tradeoff between higher bias and lower variance. The classification results are also altered by noise in the data, that is, redundant records, incorrect records, missing records, outliers, and so forth.
All these problems affect the accuracy of a classifier. Usually the improvements made to a classifier or ensemble model are limited to a very narrow spectrum and cannot be applied to another classifier under the same conditions. Classification accuracy is normally improved through ensemble models like bagging (which averages the predictions of a number of classification models), boosting (which uses a voting scheme over a number of classification models), or a combination of classifiers from different or the same families, as discussed in Section 2.
Therefore, in this paper we propose a global optimization model based on the idea of ensemble models for classification methods and show through experimental results that our model improves the classification accuracy of various classifiers on various public datasets. Section 2 of the paper reviews related work. The design of the proposed model is given in Section 3. Section 4 explains the implementation. Section 5 gives the results and analysis. Section 6 contains the conclusion and future work.

Literature Review
As mentioned earlier, no global optimization ensemble model has so far been presented that can help improve the classification and prediction accuracy for supervised learning problems, which are generally affected by a spectrum of issues like dimensionality, accuracy rate, data quality, and so forth. Although no global solution exists for these problems, other efforts have been made to resolve these issues, and all of them are either algorithm specific or data specific. Every approach has tackled the problem of classification accuracy rate from a different angle and perspective. One such work is [16], where Dash et al. compared various classification techniques, namely support vector machine (SVM) with polynomial kernel, support vector machine with RBF kernel, radial basis function network (RBFN), and multilayer perceptron network (MLP), with and without feature extraction. It was found that, for the construction of a high-performance classification model for microarray datasets, the partial least squares (PLS) regression method is a more suitable feature selection method than a hybrid dimensionality reduction scheme, and that feature selection combined with various classification techniques can yield better results. Lin and Chen [17] combined a particle swarm optimization (PSO) based approach with the commonly used classification technique linear discriminant analysis (LDA). This research also emphasizes the importance of feature selection and its positive effect on classification accuracy. The authors compared the performance of this combined model, called PSOLDA, with many other feature selection techniques like forward selection, back propagation selection, and so forth, and showed through experimental results that for many public datasets the proposed combined model (PSOLDA) has a higher classification accuracy rate. Bryll et al. [18] developed a new wrapper method, attribute bagging (AB), to improve classification accuracy, implementing a two-stage method in which first a suitable size was provided for the training data and then a subset of attributes was randomly selected for the voting scheme. This method was compared with bagging, which was used with some decision tree algorithms and some rule induction algorithms, and it was found that AB performs better in terms of accuracy and stability. The authors conclude that attribute partitioning is better than data partitioning for improving accuracy in an ensemble method. Abbott [19] compared boosting with an ensemble of models across algorithm families. These combined models used voting as the selection scheme, and the author reports that boosting performs better because it focuses on complicated cases in the data and takes into account the confidence value of a particular classification decision. Sohn and Lee [20] tried to improve the classification accuracy of algorithms like neural networks and decision trees by applying different approaches including bagging, boosting, and clustering. For the particular problem of road traffic accident classification, however, clustering followed by classification was found to be more effective. Smith and Martinez [21] suggested that outliers and noise should be eliminated from the dataset, as this yields better results in terms of classification accuracy: by removing or filtering these instances, the dataset is cleaned of the cases that could be misclassified. As there is no general definition or guide as to what noise is and what an outlier is, the identification of these two elements in any dataset is difficult. Furthermore, PRISM was found to be one of the best algorithms for finding cases that could be outliers.
The dimensionality reduction problem has been an interesting topic for researchers in a diverse spectrum of fields like image detection, voice detection, microarrays, neural network patterns, and so forth. As discussed by Zamalloa et al. [22], Liu et al. [23], and Raymer et al. [24], the genetic algorithm (GA) is a popular method under research and has been found to be quite effective for feature selection and classification accuracy improvement. All of this GA-related research is data specific or algorithm specific. In [22] the performance of GA is compared with other feature reduction and extraction techniques like linear discriminant analysis (LDA) and principal component analysis (PCA); for one dataset GA was found to perform better, while for the other dataset LDA and PCA showed promising results. In [23] the genetic algorithm is combined with the boosting technique in order to improve the accuracy of classification. The improved version assigns higher weight to the misclassified instances in order to shift the focus to them in the next iteration. This process tends to achieve higher accuracy with fewer evaluations than the original GA. In [24] the genetic algorithm is implemented in combination with a K-nearest neighbor classifier; feature extraction, feature reduction, and classifier training are all done simultaneously, and the results are compared with other industry standard feature extraction and reduction techniques like linear discriminant analysis and sequential floating forward feature selection.
Despite all this extensive work on ensemble methods, the feature reduction problem, and various classification algorithms for improving the accuracy rate in classification, there is no global optimization ensemble model suggested so far that can improve the accuracy of classification methods with any dataset. Therefore in this paper we design and implement such a global optimization model.
The Scientific World Journal

Design
The idea was to implement the concept of ensemble model in order to create an improved global model. Figure 1 shows the overall design of the ensemble model.
Layer 1 provides an antidote for the curse of dimensionality. As discussed in the literature review, dimensionality reduction or feature reduction is necessary in order to improve the classification accuracy. Therefore in our model the first layer contains the dataset, a preprocessing operator, and a feature reduction operator. For feature selection, the genetic algorithm (GA) is used, as it has been shown to produce better results than other feature reduction techniques [23][24][25].
The maximal fitness of the GA is set to infinity, as there is no absolute maximum for the fitness function. This means the GA keeps selecting the best individuals until the stop criterion is met, which in this case is the maximum number of generations. The roulette wheel selection scheme was used for selecting individuals because it does not ignore or discard any individual: each individual is given a chance of being chosen, as even the weakest individuals might be hiding valuable information. As we are striving for a global solution, a selection method that preserves diversity and is fast to converge is appropriate. The crossover type was set to shuffle; shuffle crossover is related to uniform crossover. A single crossover position (as in single-point crossover) is selected, but before the variables are exchanged, they are randomly shuffled in both parents. After recombination, the variables in the offspring are unshuffled. This removes positional bias, as the variables are randomly reassigned each time crossover is performed. The parameter values chosen for the GA are shown in Table 1.
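The two GA operators described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the RapidMiner implementation used in the paper: the function names, the representation of an individual as a 0/1 feature mask, and the example fitness values are all assumptions made for the example.

```python
import random


def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)  # degenerate case: uniform choice
    pick = rng.random() * total        # a point on the wheel, in [0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]              # guard against floating-point round-off


def shuffle_crossover(parent_a, parent_b, rng):
    """Shuffle gene positions, apply single-point crossover, then unshuffle."""
    n = len(parent_a)
    if n != len(parent_b) or n < 2:
        raise ValueError("parents must have equal length of at least 2")
    order = list(range(n))
    rng.shuffle(order)                 # the same permutation for both parents
    shuffled_a = [parent_a[i] for i in order]
    shuffled_b = [parent_b[i] for i in order]
    point = rng.randint(1, n - 1)      # single crossover position
    mixed_a = shuffled_a[:point] + shuffled_b[point:]
    mixed_b = shuffled_b[:point] + shuffled_a[point:]
    child_a, child_b = [None] * n, [None] * n
    for position, original_index in enumerate(order):
        child_a[original_index] = mixed_a[position]   # undo the shuffle
        child_b[original_index] = mixed_b[position]
    return child_a, child_b


# Hypothetical feature masks (1 = attribute kept) and fitness values.
rng = random.Random(42)
mask_a = [1, 1, 1, 1, 0, 0, 0, 0]
mask_b = [0, 1, 0, 1, 0, 1, 0, 1]
parent = roulette_wheel_select([mask_a, mask_b], [0.82, 0.79], rng)
child_a, child_b = shuffle_crossover(mask_a, mask_b, rng)
```

Because the same permutation is applied to both parents and then undone, every gene in a child still comes from the same position in one of the parents; only the set of positions exchanged is randomized, which is what removes the positional bias of plain single-point crossover.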
Parameter optimization for the operators in each layer was done by implementing a global optimization operator using grid search. This methodology involves setting up a grid in the decision space and evaluating the objective function at each grid point. The point which corresponds to the best value of the objective function is considered to be the optimum solution. Across all the layers, a total of 5 parameters were optimized using grid search. For each parameter, 11 candidate values were proposed; this means that, for optimizing these 5 parameters, a total of 11^5 = 161051 combinations were tested. Table 2(a) shows all the parameters and their optimized values.
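A minimal sketch of exhaustive grid search, assuming a toy objective, may help make the combination count concrete. The parameter names `p1` to `p5`, the candidate values, and `toy_objective` are hypothetical stand-ins; in the paper the objective would be the accuracy returned by the full model for a given parameter setting.

```python
import itertools


def grid_search(objective, grid):
    """Evaluate the objective at every grid point and keep the best one.

    Returns (best_params, best_score, number_of_points_evaluated).
    """
    names = list(grid)
    best_params, best_score, evaluated = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, evaluated


# Hypothetical grid: 5 parameters with 11 candidate values each.
grid = {f"p{i}": [j / 10 for j in range(11)] for i in range(1, 6)}


def toy_objective(params):
    # Assumed stand-in for model accuracy; it peaks when every value is 0.3.
    return -sum((value - 0.3) ** 2 for value in params.values())


best_params, best_score, evaluated = grid_search(toy_objective, grid)
print(evaluated)  # 11 ** 5 = 161051 grid points
```

The cost grows exponentially with the number of parameters, which is why the grid is limited to a handful of parameters and a coarse set of values per parameter.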
In layer 2, the partition into training and testing datasets was done using X-fold (k-fold) cross-validation. The dataset is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k − 1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once and gets to be in a training set k − 1 times. Besides, the variance of the resulting estimate is reduced as k is increased. A stratified sampling scheme was used in cross-validation (CV), with the number of iterations set to 10 as shown in Table 2(a). In stratified sampling, random subsets are created, but the distribution of classes in those subsets is the same as in the whole dataset. Thus this type of sampling reduces variance. For example, suppose we have a dataset of 180 employees and we want a sample of 40 employees. The first step is to calculate the percentage of male and female employees in each group; this percentage is the final ratio of records in each category in our sample of 40 employees.
Layer 3 handles the bias-variance tradeoff. Accuracy improvement is done by implementing bootstrap aggregation (bagging). Bagging is a machine learning ensemble meta-algorithm which reduces variance, and in some cases bias, and helps avoid overfitting. Although it is usually applied to decision tree models, it can be used with any type of model. Bagging is a special case of the model averaging approach. The parameter settings for bagging are shown in Table 2(a). We use bagging instead of boosting for the following reason. The error decomposes as error = noise + bias + variance. Bagging can reduce both bias and variance, but mostly it reduces just variance, and it hardly ever increases error. For high-bias classifiers it can reduce bias, and for high-variance classifiers it can reduce variance. Boosting, in the early iterations, is primarily a bias-reducing method; in later iterations, it appears to be primarily a variance-reducing method. It may increase error and margins, and it does not cope well with noisy data. That is the reason why we chose bagging instead of boosting for the bias-variance tradeoff.
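Layers 2 and 3 can be illustrated together with a small, self-contained sketch: stratified k-fold partitioning followed by bagging with a majority vote. This is a sketch under stated assumptions rather than the RapidMiner operators used in the paper. The 1-nearest-neighbour base learner, the toy one-dimensional data, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, rng):
    """Split record indices into k folds with similar class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal each class round-robin
    return folds


def train_1nn(X, y):
    """Toy base learner (assumption): 1-nearest-neighbour on scalar features."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda i: abs(X[i] - x))
        return y[nearest]
    return predict


def bagging_predict(train_fn, X, y, x_new, n_models, rng):
    """Train n_models on bootstrap samples and return the majority vote."""
    n = len(X)
    votes = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        model = train_fn([X[i] for i in sample], [y[i] for i in sample])
        votes.append(model(x_new))
    return Counter(votes).most_common(1)[0][0]


def cross_validated_accuracy(X, y, k, n_models, rng):
    """Each record is tested exactly once, by a bagged model trained on the rest."""
    correct = 0
    for test_indices in stratified_folds(y, k, rng):
        held_out = set(test_indices)
        train_indices = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_indices]
        y_train = [y[i] for i in train_indices]
        for i in test_indices:
            predicted = bagging_predict(train_1nn, X_train, y_train, X[i], n_models, rng)
            correct += predicted == y[i]
    return correct / len(X)


# Toy data (assumption): 60 records of class "a" followed by 40 of class "b".
rng = random.Random(0)
X = [float(i) for i in range(100)]
y = ["a"] * 60 + ["b"] * 40
folds = stratified_folds(y, 10, rng)
accuracy = cross_validated_accuracy(X, y, k=10, n_models=11, rng=rng)
```

With 10 folds, each fold of this toy dataset holds 6 records of class "a" and 4 of class "b", mirroring the 60/40 ratio of the whole dataset, which is the property stratified sampling is meant to guarantee.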
Classifiers were placed in layer 4, with parameter configuration done according to the dataset. All classifier parameters were set to obtain the optimal model in order to reduce the bias. The settings used for each classifier are shown in Table 2(b).

Implementation
Implementation and testing are done on a Core i3 processor with 4 GB RAM, while coding is done using XML. Preprocessing is performed on every dataset according to the requirements of the classifier used, in order to remove noise from the data and perform type conversions. The model is implemented and tested in RapidMiner 5 [15].
Step 1 (algorithm selection). As we are optimizing the model for supervised learning problems, the linear and nonlinear classifiers listed in Table 3 were selected, implemented, and tested.
Step 2 (data set selection). Datasets from various fields are selected, such as banking, medicine, and census data. Selection of the datasets was based on [...]. In total, 7 datasets from various fields are used for experimentation. The classifiers used in the implementation and the details of the datasets used are given in Tables 3 and 4, respectively.
A suitable classifier for each dataset is selected, as indicated in Table 5.
Step 4 (classification using global optimization ensemble model for classification methods (GMC)). All the classifiers are now encapsulated in the proposed generic optimization ensemble model and executed for results. Parameters of all the classifiers are the same as in Step 3. Now the improved results consisting of optimized classification accuracy are recorded for every classifier and compared with the previous result in order to calculate the improvement percentage.
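The comparison in Step 4 reduces to simple arithmetic. The text does not state whether the improvement percentage is an absolute difference in accuracy (percentage points) or a relative gain, so the sketch below computes both. The function name is hypothetical; the example values are the decision tree averages quoted in the Results section (91.1% before and 93.2% after applying GMC).

```python
def improvement(baseline_accuracy, gmc_accuracy):
    """Return (absolute gain in percentage points, relative gain in percent).

    Both accuracies are expected as percentages, for example 91.1 for 91.1%.
    """
    absolute = gmc_accuracy - baseline_accuracy
    relative = 100.0 * absolute / baseline_accuracy
    return absolute, relative


# Decision tree averages quoted in the Results section.
absolute, relative = improvement(91.1, 93.2)
print(f"absolute gain: {absolute:.2f} points, relative gain: {relative:.2f}%")
```

For this example both readings fall within the 1% to 3% range that the text reports for decision trees, so the quoted figures alone do not settle which definition was used.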

Results
The results for each dataset and the corresponding accuracy comparison between simple classification and the GMC model are given in this section. Table 6 shows that, for the cancer dataset, the GMC model improved classification accuracy by 1.13% to 29.76%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers. Table 7 shows that, for the heart disease dataset, the GMC model improved classification accuracy by 2.4% to 14.54%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers. Table 8 shows that, for the wine dataset, the GMC model improved classification accuracy by 3.92% to 19.67%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers.
As shown in Table 9, for the adult income dataset, the GMC model improved classification accuracy by 1% to 6.5%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers.
As shown in Table 10, for the sonar dataset, the GMC model improved classification accuracy by 4.82% to 15.36%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers.
As shown in Table 11, for the educational dataset, the GMC model improved classification accuracy by 8% to 26%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers.
As shown in Table 12, for the diabetes dataset, the GMC model improved classification accuracy by 1.39% to 10.55%, depending on the classifier and the bias-variance tradeoff that its inner complexity offers.
It can be seen that for K-NN the improvement in accuracy is as high as 30%. This is a significant increase in accuracy and shows the effectiveness of the GMC model. For other algorithms, such as decision trees or logistic regression, the increase in classification accuracy varies between 1% and 3%. Although this may seem to indicate a shortcoming of the GMC model, this is not the case. During the experimentation it was noted that for some supervised techniques, such as decision trees, the accuracy of the classifier was already very high (90% or more); therefore the room for further improving the classifier was limited. Thus in this case the GMC model could only increase the accuracy by a small amount. For example, in the case of decision trees, the GMC model increased the average accuracy from 91.1% to 93.2%. For other supervised learning algorithms, however, there was a large gap within which the classifier accuracy could be further increased. This explains the significant increase in classifier accuracy for algorithms such as K-NN. Thus, although the amount by which the GMC model improves classifier accuracy depends on the algorithm, it increased the accuracy of the classifier in all cases.

Conclusion and Further Work
In order to address the basic issues of supervised learning problems, namely dimensionality reduction, the bias-variance tradeoff, and noise, we used the concept of ensemble models to design an optimized global ensemble model for classification methods (GMC). The model was designed in layers, with each layer addressing one of the basic issues of supervised learning. We showed through experimentation that if classifiers are enclosed in our model, their accuracy improves by 1% to 30%, depending upon the algorithm complexity and its capability of handling bias and variance. Our model yielded better results than when the classifiers were used alone or in combination.
The model can be further optimized for extremely large datasets in real time. In that case the optimization will focus on the reduction of execution time as well as further improvement in accuracy. Parallel processing can be introduced into the model to minimize execution time. Many optimization techniques are available, and a separate study could compare those techniques and their effect on the global model. Furthermore, research can be carried out on this model for unsupervised learning problems with datasets related to more diverse fields.