Forest Pruning Based on Branch Importance

A forest is an ensemble whose members are decision trees. This paper proposes a novel strategy for pruning a forest to enhance the ensemble's generalization ability and reduce its size. Unlike conventional ensemble pruning approaches, the proposed method evaluates the importance of each branch of a tree with respect to the whole ensemble, using a new metric called importance gain. The importance of a branch is defined by considering both ensemble accuracy and the diversity of ensemble members, so the metric reasonably evaluates how much the ensemble accuracy can improve when a branch is pruned. Our experiments show that the proposed method can significantly reduce ensemble size and improve ensemble accuracy, regardless of whether the ensemble is constructed by an algorithm such as bagging or obtained by an ensemble selection algorithm, and regardless of whether each decision tree is pruned or unpruned.


Introduction
Ensemble learning is a very important research topic in machine learning and data mining. The basic heuristic is to create a set of learners and aggregate the prediction of each learner for classifying examples. Many approaches such as bagging [1], boosting [2], and COPEN [3] have been proposed to create ensembles, and the key to the success of these approaches is that base learners are accurate and diverse [4].
Ensemble methods have been applied in many areas, such as image detection [5][6][7] and imbalanced learning problems [8]. However, an important drawback of ensemble learning approaches is that they tend to train unnecessarily large ensembles. Large ensembles need a large amount of memory to store the base learners and a long response time for prediction. Moreover, a large ensemble may reduce generalization ability instead of increasing performance [9]. Much research has therefore been carried out to tackle this problem, mainly focusing on ensemble selection, that is, selecting a subset of ensemble members for prediction. Examples include ordered-based ensemble selection methods [10][11][12] and greedy heuristic based ensemble selection methods [13][14][15][16][17][18][19][20][21]. The results indicate that a well-designed ensemble selection method can reduce ensemble size and improve ensemble accuracy.
Besides ensemble selection, an ensemble whose members are decision trees can be pruned in two ways: (1) pruning individual members separately and combining the pruned members for prediction and (2) repeatedly pruning individual members while considering the overall performance of the ensemble. For the first strategy, many decision tree pruning methods, such as those used in CART [22] and C4.5 [23], have been studied. Although pruning can simplify model structure, whether pruning can improve model accuracy is still a controversial topic in machine learning [24]. The second strategy coincides with the expectation of improving model generalization ability globally. However, it has not been extensively studied. This paper focuses on the second strategy and calls it forest pruning (FP).
The major task of forest pruning is to define an effective metric that evaluates the importance of a given branch of a tree. Traditional metrics cannot be applied to forest pruning, since they consider only the influence on a single decision tree when a branch is pruned. Therefore, we need a new metric for pruning a forest. Our contributions in this paper are as follows: (i) we introduce a new ensemble pruning strategy to prune decision tree based ensembles; (ii) we propose a novel metric to measure the improvement of forest performance when a certain node grows into a subtree; (iii) we present a new ensemble pruning algorithm that uses the proposed metric to prune a decision tree based ensemble. The ensemble can be learned by a certain algorithm or obtained by some ensemble selection method. Each decision tree can be pruned or unpruned. Experimental results show that the proposed method can significantly reduce the ensemble size and improve its accuracy. This result indicates that the metric proposed in this paper reasonably measures the influence on ensemble accuracy when a certain node grows into a subtree.
The rest of this paper is structured as follows. Section 2 provides a survey of ensembles of decision trees. Section 3 presents the formal description of forest pruning and motivates this study with an example. Section 4 introduces a new forest pruning algorithm. Section 5 reports and analyzes experimental results, and Section 6 concludes the paper with brief remarks and future work.

Forests
A forest is an ensemble whose members are learned by a decision tree learning method. Two kinds of approaches are often used to train a forest: traditional ensemble methods and methods specially designed for forests.
Bagging [1] and boosting [2] are the two most often used traditional methods to build forests. Bagging takes bootstrap samples of objects and trains a tree on each sample. The classifier votes are combined by majority voting. In some implementations, classifiers produce estimates of the posterior probabilities for the classes. These probabilities are averaged across the classifiers and the most probable class is assigned, which is called "average" or "mean" aggregation of the outputs. Bagging with average aggregation is implemented in Weka and used in the experiments in this paper. Since each individual classifier is trained on a bootstrap sample, the data distribution seen during training is similar to the original distribution. Thus, the individual classifiers in a bagging ensemble have relatively high classification accuracy. The factor encouraging diversity between these classifiers is the proportion of different examples in the training set. Boosting is a family of methods, of which AdaBoost is the most prominent member. The idea is to boost the performance of a "weak" classifier (which can be a decision tree) by using it within an ensemble structure. The classifiers in the ensemble are added one at a time so that each subsequent classifier is trained on data which have been "hard" for the previous ensemble members. A set of weights is maintained across the objects in the data set so that objects that have been difficult to classify acquire more weight, forcing subsequent classifiers to focus on them.
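The difference between majority voting and average aggregation can be illustrated with a minimal Python sketch. The function names and the three probability vectors below are hypothetical and chosen only for illustration; this is not the Weka implementation.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def average_aggregate(prob_vectors):
    """Average the per-class probability vectors produced by the trees."""
    n_trees = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(p[k] for p in prob_vectors) / n_trees for k in range(n_classes)]


def predict_average(prob_vectors):
    """'Average' aggregation: the most probable class after averaging."""
    return argmax(average_aggregate(prob_vectors))


def predict_majority(prob_vectors):
    """Majority voting: each tree votes for its own most probable class."""
    votes = [argmax(p) for p in prob_vectors]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of three trees on one example with two classes.
tree_outputs = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
print(predict_majority(tree_outputs))  # two weak votes for class 0
print(predict_average(tree_outputs))   # one confident tree tips the average
```

In this toy case the two rules disagree: two trees weakly prefer class 0, while the averaged probabilities favor class 1.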
Random forest [25] and rotation forest [26] are two important approaches specially designed for building forests. Random forest is a variant of bagging. The forest is again built on bootstrap samples. The difference lies in the construction of the decision tree. The feature used to split a node is selected as the best feature among a set of randomly chosen features, where the size of this set is a parameter of the algorithm. This small alteration appeared to be a winning heuristic in that diversity was introduced without much compromising the accuracy of the individual classifiers. Rotation forest randomly splits the feature set into subsets (the number of subsets is a parameter of the algorithm), and Principal Component Analysis (PCA) [27] is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, axis rotations take place to form the new features, and rotation forest builds each tree using the whole training set in the new feature space.

Problem Description and Motivation
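Similarly, for each example $\mathbf{x}_j$ to be classified, ensemble $F$ returns a vector $(p_{j1}, p_{j2}, \ldots, p_{jK})$ indicating that $\mathbf{x}_j$ belongs to label $k$ with probability $p_{jk}$, where $p_{jk}$ is the average, over the trees in $F$, of the probabilities that the individual trees assign to label $k$ for $\mathbf{x}_j$. The prediction of $F$ on $\mathbf{x}_j$ is $F(\mathbf{x}_j) = \arg\max_k p_{jk}$. Now, our problem is, given a forest $F$ with $M$ decision trees, how to prune each tree to reduce $F$'s size and improve its accuracy, where $F$ is either constructed by some algorithm or obtained by some ensemble selection method.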

Motivation.
First, let us look at an example, which shows the possibility that forest pruning can improve ensemble accuracy. Let $F = \{T_0, T_1, \ldots, T_9\}$ be a forest with ten decision trees, where $T_0$ is shown in Figure 1. Suppose that [...]. Obviously, for $T_0$, we cannot prune the children of node $v$, since treating $v$ as a leaf would lead to more examples incorrectly classified by $T_0$.
Assume that $F$'s predictions on $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_9$ are as follows: [...] where $p_{jk}$ is the probability of $\mathbf{x}_j$ associated with label $k$. From $F$'s predictions shown above, we have that $\mathbf{x}_6$ is incorrectly classified by $F$.
It is easy to see that the forest obtained after pruning the children of node $v$ in $T_0$ correctly classifies all of the ten examples.
This example shows that if a single decision tree is considered, maybe it should not be pruned any more. However, for the forest as a whole, it is still possible to prune some branches of the decision tree, and this pruning will probably improve the ensemble accuracy instead of reducing it.
Although the example above is constructed by us, similar cases arise frequently when ensembles are studied further. It is this observation that motivates us to study forest pruning methods. However, more effort is needed to turn possibility into feasibility. Further discussion of this problem is presented in the next section.

Forest Pruning Based on Branch Importance

The Proposed Metric and Algorithm Idea.
To avoid getting trapped in detail too early, we assume that $I(v, F, \mathbf{x}_j)$ has been defined, which is the importance of node $v$ when forest $F$ classifies example $\mathbf{x}_j$. If $\mathbf{x}_j \notin E(v)$, then $I(v, F, \mathbf{x}_j) = 0$. Otherwise, the details of the definition of $I(v, F, \mathbf{x}_j)$ are presented in Section 4.2.
Let $T \in F$ be a tree and let $v \in T$ be a node. The importance of $v$ with respect to forest $F$ is defined as
$$I(v, F) = \sum_{\mathbf{x}_j \in D'} I(v, F, \mathbf{x}_j) = \sum_{\mathbf{x}_j \in E(v)} I(v, F, \mathbf{x}_j),$$
where $D'$ is a pruning set and $E(v)$ is the set of the examples in $D'$ reaching node $v$ from root($T$). $I(v, F)$ reflects the impact of node $v$ on $F$'s accuracy.
Let $L(v)$ be the set of leaf nodes of branch($v$), the branch (subtree) with $v$ as the root. The contribution of branch($v$) to $F$ is defined as
$$I(\mathrm{branch}(v), F) = \sum_{v' \in L(v)} I(v', F),$$
which is the sum of the importance of the leaves in branch($v$). Let $v \in T$ be a nonterminal node. The importance gain of $v$ to $F$ is defined by the importance difference between branch($v$) and node $v$, that is,
$$\mathrm{IG}(v, F) = I(\mathrm{branch}(v), F) - I(v, F).$$
IG($v$, $F$) can be considered as the importance gain of branch($v$), and its value reflects how much improvement of the ensemble accuracy is achieved when $v$ grows into a subtree. If IG($v$, $F$) > 0, then this expansion helps to improve $F$'s accuracy. Otherwise, it does not help to improve $F$'s accuracy, or even reduces it.
The idea of the proposed method of pruning an ensemble of decision trees is as follows. For each nonterminal node $v$ in each tree $T$, calculate its importance gain IG($v$, $F$) on the pruning set. If IG($v$, $F$) is smaller than a threshold, prune branch($v$) and treat $v$ as a leaf. This procedure continues until no decision tree can be pruned further.
Before presenting the specific details of the proposed algorithm, we introduce how to calculate $I(v, F, \mathbf{x}_j)$ in the next subsection.
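4.2. Con($v$, $F$, $\mathbf{x}_j$) Calculation. Let $h$ be a classifier and let $F$ be an ensemble. Partalas et al. [28,29] identified that the predictions of $h$ and $F$ on an example $\mathbf{x}_j$ with label $y_j$ can be categorized into four cases:
$$\begin{aligned} e_{tf} &: h(\mathbf{x}_j) = y_j \wedge F(\mathbf{x}_j) \neq y_j, \\ e_{tt} &: h(\mathbf{x}_j) = y_j \wedge F(\mathbf{x}_j) = y_j, \\ e_{ft} &: h(\mathbf{x}_j) \neq y_j \wedge F(\mathbf{x}_j) = y_j, \\ e_{ff} &: h(\mathbf{x}_j) \neq y_j \wedge F(\mathbf{x}_j) \neq y_j. \end{aligned}$$
They concluded that considering all four cases is crucial to design ensemble diversity metrics.
Based on the four cases above, Lu et al. [11] introduced a metric, $IC_i^{(j)}$, to evaluate the contribution of the $i$th classifier to $F$ when $F$ classifies the $j$th instance. Partalas et al. [28,29] introduced a measure called Uncertainty Weighted Accuracy, UWA($h$, $F$, $\mathbf{x}_j$), to evaluate $h$'s contribution when $F$ classifies example $\mathbf{x}_j$. Similar to the discussion above, we define
$$\begin{aligned} e_{tf}(v) &= \{\mathbf{x}_j \in E(v) : v_m = y_j \wedge f_m \neq y_j\}, \\ e_{tt}(v) &= \{\mathbf{x}_j \in E(v) : v_m = y_j \wedge f_m = y_j\}, \\ e_{ft}(v) &= \{\mathbf{x}_j \in E(v) : v_m \neq y_j \wedge f_m = y_j\}, \\ e_{ff}(v) &= \{\mathbf{x}_j \in E(v) : v_m \neq y_j \wedge f_m \neq y_j\}, \end{aligned} \tag{8}$$
where $v_m$ and $f_m$ denote the labels of $\mathbf{x}_j$ predicted by node $v$ and by $F$, as specified next.
In the following discussions, we assume that $v \in T$ and $\mathbf{x}_j \in E(v)$. Let $f_m$ and $f_s$ be the subscripts of the largest element and the second largest element in $\{p_{j1}, \ldots, p_{jK}\}$, respectively. Obviously, $f_m$ is the label of $\mathbf{x}_j$ predicted by ensemble $F$. Similarly, let $v_m$ be the subscript of the largest element in $\{p^{v}_{j1}, \ldots, p^{v}_{jK}\}$, the class probabilities that node $v$ assigns to $\mathbf{x}_j$. If $v$ is a leaf, then $v_m$ is the label of $\mathbf{x}_j$ predicted by decision tree $T$. Otherwise, $v_m$ is the label of $\mathbf{x}_j$ predicted by $T'$, where $T'$ is the decision tree obtained from $T$ by pruning branch($v$). For simplicity, we call $v_m$ the label of $\mathbf{x}_j$ predicted by node $v$ and say node $v$ correctly classifies $\mathbf{x}_j$ if $v_m = y_j$.
We define $I(v, F, \mathbf{x}_j)$, also written Con($v$, $F$, $\mathbf{x}_j$), based on the four cases in formula (8). For $\mathbf{x}_j \in e_{tf}(v)$, Con($v$, $F$, $\mathbf{x}_j$) is defined as
$$\mathrm{Con}(v, F, \mathbf{x}_j) = \frac{(p^{v}_{j v_m} - p^{v}_{j f_m})/M}{p_{j f_m} - p_{j v_m} + 1/M}, \tag{9}$$
where $M$ is the number of base classifiers in $F$. Here, $v_m = y_j$ and $f_m \neq y_j$, then $p^{v}_{j v_m} \geq p^{v}_{j f_m}$, $p_{j f_m} \geq p_{j v_m}$, and thus $0 \leq \mathrm{Con}(v, F, \mathbf{x}_j) \leq 1$. Since $p^{v}_{j v_m}$ is the contribution of node $v$ to the probability that $F$ correctly predicts $\mathbf{x}_j$ as belonging to class $v_m$, while $p^{v}_{j f_m}$ is the contribution of node $v$ to $p_{j f_m}$, the probability that $F$ incorrectly predicts $\mathbf{x}_j$ as belonging to class $f_m$, $(p^{v}_{j v_m} - p^{v}_{j f_m})/M$ can be considered as the net importance of node $v$ when $F$ classifies $\mathbf{x}_j$. $p_{j f_m} - p_{j v_m}$ is the weight of $v$'s net contribution, which reflects the importance of node $v$ for classifying $\mathbf{x}_j$ correctly. The constant $1/M$ is to avoid $p_{j f_m} - p_{j v_m}$ being zero or too small.
For $\mathbf{x}_j \in e_{tt}(v)$, Con($v$, $F$, $\mathbf{x}_j$) is defined as
$$\mathrm{Con}(v, F, \mathbf{x}_j) = \frac{(p^{v}_{j f_m} - p^{v}_{j f_s})/M}{p_{j f_m} - p_{j f_s} + 1/M}. \tag{10}$$
Here, $0 \leq \mathrm{Con}(v, F, \mathbf{x}_j) \leq 1$. In this case, both $v$ and $F$ correctly classify $\mathbf{x}_j$. We treat $(p^{v}_{j f_m} - p^{v}_{j f_s})/M$ as the net contribution of node $v$ to $F$ and $\mathbf{x}_j$, and $p_{j f_m} - p_{j f_s}$ as the weight of $v$'s net contribution.
For $\mathbf{x}_j \in e_{ft}(v)$, Con($v$, $F$, $\mathbf{x}_j$) is defined as
$$\mathrm{Con}(v, F, \mathbf{x}_j) = -\frac{(p^{v}_{j v_m} - p^{v}_{j f_m})/M}{p_{j f_m} - p_{j v_m} + 1/M}. \tag{11}$$
It is easy to prove $-1 \leq \mathrm{Con}(v, F, \mathbf{x}_j) \leq 0$. This case is opposed to the first case. In this case, we treat $-(p^{v}_{j v_m} - p^{v}_{j f_m})/M$ as the net contribution of node $v$ to $F$ and $\mathbf{x}_j$, and $p_{j f_m} - p_{j v_m}$ as the weight of $v$'s net contribution.
For $\mathbf{x}_j \in e_{ff}(v)$, Con($v$, $F$, $\mathbf{x}_j$) is defined as
$$\mathrm{Con}(v, F, \mathbf{x}_j) = -\frac{(p^{v}_{j v_m} - p^{v}_{j y_j})/M}{p_{j f_m} - p_{j y_j} + 1/M}, \tag{12}$$
where $y_j \in \{1, \ldots, K\}$ is the label of $\mathbf{x}_j$, and $-1 \leq \mathrm{Con}(v, F, \mathbf{x}_j) \leq 0$. In this case, both $v$ and $F$ incorrectly classify $\mathbf{x}_j$, namely, $v_m \neq y_j$ and $f_m \neq y_j$. We treat $-(p^{v}_{j v_m} - p^{v}_{j y_j})/M$ as the net contribution of node $v$ to $F$ and $\mathbf{x}_j$, and $p_{j f_m} - p_{j y_j}$ as the weight of $v$'s net contribution.

Algorithm 1: The procedure of forest pruning.
(1) for each $\mathbf{x}_j \in D'$ do
(2)     compute $p_{jk}$, $1 \leq k \leq K$;
(3) for each $T_i \in F$ do
(4)     for each node $v$ in $T_i$ do
(5)         $I_v \leftarrow 0$;
(6)     for each $\mathbf{x}_j \in D'$ do
(7)         $q_{jk} \leftarrow$ the probability that $T_i$ assigns label $k$ to $\mathbf{x}_j$, $1 \leq k \leq K$;
(8)         let $P$ be the path along which $\mathbf{x}_j$ travels;
(9)         for each node $v \in P$ do
(10)            $I_v \leftarrow I_v + I(v, F, \mathbf{x}_j)$;
(11)    PruningTree(root($T_i$));
(12)    for each $\mathbf{x}_j \in D'$ do
(13)-(14)   update $p_{jk}$, $1 \leq k \leq K$, for the pruned $T_i$;
Procedure PruningTree($v$), final step: prune subtree($v$) and set $v$ to be a leaf.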
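The four-case definition of Con($v$, $F$, $\mathbf{x}_j$) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes each case has the form "signed net contribution divided by (weight + 1/M)", which is consistent with the bounds and the roles of the net contribution and the weight described in this subsection. The function and argument names are hypothetical.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def second_argmax(values):
    """Index of the second largest value (requires at least two classes)."""
    top = argmax(values)
    rest = [k for k in range(len(values)) if k != top]
    return max(rest, key=lambda k: values[k])


def node_contribution(p_node, p_forest, y, n_trees):
    """Assumed form of Con(v, F, x_j).

    p_node   -- class probabilities that node v assigns to x_j
    p_forest -- class probabilities that the forest F assigns to x_j
    y        -- index of the true label of x_j
    n_trees  -- number of base classifiers M in F
    """
    v_m = argmax(p_node)    # label predicted by node v
    f_m = argmax(p_forest)  # label predicted by the forest
    if v_m == y and f_m != y:        # node correct, forest wrong
        sign, net = 1.0, p_node[v_m] - p_node[f_m]
        weight = p_forest[f_m] - p_forest[v_m]
    elif v_m == y and f_m == y:      # both correct
        f_s = second_argmax(p_forest)
        sign, net = 1.0, p_node[f_m] - p_node[f_s]
        weight = p_forest[f_m] - p_forest[f_s]
    elif f_m == y:                   # node wrong, forest correct
        sign, net = -1.0, p_node[v_m] - p_node[f_m]
        weight = p_forest[f_m] - p_forest[v_m]
    else:                            # both wrong
        sign, net = -1.0, p_node[v_m] - p_node[y]
        weight = p_forest[f_m] - p_forest[y]
    # The constant 1/M keeps the denominator away from zero.
    return sign * (net / n_trees) / (weight + 1.0 / n_trees)


# Hypothetical two-class example with M = 10 trees.
print(node_contribution([0.8, 0.2], [0.6, 0.4], y=0, n_trees=10))
```

Under this assumed form, a node that agrees with the true label receives a value in [0, 1] and a node that disagrees receives a value in [-1, 0], and a small forest margin (weight) makes the magnitude larger.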

Algorithm.
The specific details of forest pruning (FP) are shown in Algorithm 1, where $D'$ is a pruning set containing $N$ instances, $p_{jk}$ is the probability that ensemble $F$ predicts $\mathbf{x}_j \in D'$ associated with label $k$, $q_{jk}$ is the probability that the current tree $T_i$ predicts $\mathbf{x}_j \in D'$ associated with label $k$, $I_v$ is a variable associated with node $v$ to save $v$'s importance, and br($v$) is a variable associated with node $v$ to save the contribution of branch($v$).
FP first calculates the probability of $F$'s prediction on each instance $\mathbf{x}_j$ (lines (1)-(2)). Then it iteratively deals with each decision tree $T_i$ (lines (3)-(14)). Lines (4)-(10) calculate the importance of each node $v \in T_i$, where $I(v, F, \mathbf{x}_j)$ in line (10) is calculated using one of the equations (9)-(12) based on the four cases in equation (8). Line (11) calls PruningTree($v$) to recursively prune $T_i$. Since forest $F$ has been changed after pruning $T_i$, we adjust $F$'s prediction in lines (12)-(14). Lines (3)-(14) can be repeated many times until no decision tree can be pruned further. Experimental results show that forest performance is stable after this iteration is executed 2 times.
The recursive procedure PruningTree($v$) adopts a bottom-up fashion to prune the decision tree with $v$ as the root. After pruning branch($v$) (subtree($v$)), $v$ saves the sum of the importance of the leaf nodes in branch($v$). Then $I(\mathrm{branch}(v), F)$ is equal to the sum of the importance of the leaves of the tree with $v$ as root. The essence of using $T_i$'s root to call PruningTree is to traverse $T_i$. If the current node $v$ is a nonleaf, the procedure calculates $v$'s importance gain IG, saves into $v$ the importance sum of the leaves of branch($v$) (lines (2)-(7)), and determines whether to prune branch($v$) based on the difference between IG and the threshold value (lines (8)-(9)).
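The bottom-up recursion described above can be sketched in Python. This is a minimal sketch, not the authors' code: the `Node` class, its field names, and the default threshold of 0 are assumptions, and node importances are assumed to have been accumulated beforehand on the pruning set. After a branch is pruned, the node is treated as a leaf, so its branch contribution is taken to be its own importance.

```python
class Node:
    """Hypothetical tree node holding its accumulated importance I(v, F)."""

    def __init__(self, importance, children=None):
        self.importance = importance
        self.children = children or []
        self.branch_importance = importance  # plays the role of br(v)


def pruning_tree(v, threshold=0.0):
    """Prune the subtree rooted at v bottom-up; return br(v)."""
    if not v.children:
        v.branch_importance = v.importance
        return v.branch_importance
    # Children are processed first, so br(v) sums the importance of the
    # leaves that remain after the descendants have been pruned.
    v.branch_importance = sum(pruning_tree(c, threshold) for c in v.children)
    gain = v.branch_importance - v.importance  # importance gain IG(v, F)
    if gain < threshold:
        v.children = []                        # prune branch(v): v becomes a leaf
        v.branch_importance = v.importance
    return v.branch_importance


# Hypothetical importances: branch b does not pay for itself, the root does.
b = Node(0.8, [Node(0.3), Node(0.2)])
root = Node(1.0, [Node(0.5), b])
pruning_tree(root)
print(len(root.children), len(b.children))
```

In the example, IG(b) = 0.5 - 0.8 < 0, so b is collapsed into a leaf, while IG(root) = 1.3 - 1.0 > 0, so the root keeps its two children.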

Discussion.
Suppose pruning set $D'$ contains $N$ instances, forest $F$ contains $M$ decision trees, and $d_{\max}$ is the depth of the deepest decision tree in $F$. Let $|T_i|$ be the number of nodes in decision tree $T_i$, and $t_{\max} = \max_{1 \leq i \leq M}(|T_i|)$. The running time of FP is dominated by the loop from lines (4) to (19). The loop from lines (5) to (7) traverses $T_i$, which can be done in $O(t_{\max})$; the loop from lines (8) to (14) searches a path of $T_i$ for each instance in $D'$, which has complexity $O(N d_{\max})$; the main operation of PruningTree(root($T_i$)) is a complete traversal of $T_i$, whose running time is $O(t_{\max})$; the loop from lines (16) to (18) scans a linear list of length $N$ in $O(N)$. Since $t_{\max} \ll N d_{\max}$, processing one tree costs $O(N d_{\max})$, and we conclude that the running time of FP over the $M$ trees is $O(N M d_{\max})$. Therefore, FP is a very efficient forest pruning algorithm.
Unlike traditional metrics such as those used by CART [22] and C4.5 [23], the proposed measure uses a global evaluation. Indeed, this measure involves the prediction values that result from a majority voting of the whole ensemble. Thus, the proposed measure is based on not only individual prediction properties of ensemble members but also the complementarity of classifiers.
From equations (9), (10), (11), and (12), our proposed measure takes into account both the correctness of the predictions of the current classifier and the predictions of the ensemble, and the measure deliberately favors classifiers with a better performance in classifying the samples on which the ensemble does not work well. Besides, the measure considers not only the correctness of classifiers, but also the diversity of ensemble members. Therefore, using the proposed measure to prune an ensemble leads to significantly better accuracy results.

Experimental Setup.
19 data sets, whose details are shown in Table 1, are randomly selected from the UCI repository [30], where #Size, #Attrs, and #Cls are the size, attribute number, and class number of each data set, respectively. We design four experiments to study the performance of the proposed method (forest pruning, FP):
(i) The first experiment studies FP's performance versus the number of times FP is run. Here, four data sets, that is, autos, balance-scale, German-credit, and pima, are selected as representatives, and each data set is randomly divided into three subsets of equal size, where one is used as the training set, one as the pruning set, and the other one as the testing set. We repeat 50 independent trials on each data set. Therefore, a total of 200 trials are conducted.
(ii) The second experiment evaluates FP's performance versus the forest's size (number of base classifiers). The experimental setup of the data sets is the same as in the first experiment.
(iii) The third experiment aims to evaluate FP's performance on pruning ensembles constructed by bagging [1] and random forest [25]. Here, tenfold cross-validation is employed: each data set is divided into ten folds [31,32]. For each fold, the other nine folds are used to train the model, and the current one is used to test the trained model. We repeat the tenfold cross-validation 10 times, and thus 100 models are constructed on each data set. Here, we set the training set as the pruning set. Besides, algorithm rank is used to further test the performance of the algorithms [31][32][33]: on a data set, the best performing algorithm gets the rank of 1.0, the second best performing algorithm gets the rank of 2.0, and so on. In case of ties, average ranks are assigned.
(iv) The last experiment evaluates FP's performance on pruning the subensemble obtained by an ensemble selection method. EPIC [11] is selected as the representative ensemble selection method. The original ensemble is a library with 200 base classifiers, and the size of the subensembles is 30. The setup of the data sets is the same as in the third experiment.
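The ranking rule used in the third experiment (rank 1.0 for the best algorithm, average ranks for ties) can be sketched as follows. The function name and the accuracy values are hypothetical and serve only to illustrate the tie handling.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank algorithms on one data set; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next algorithm has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


# Hypothetical accuracies of four algorithms on one data set.
print(average_ranks([0.90, 0.85, 0.90, 0.80]))  # [1.5, 3.0, 1.5, 4.0]
```

The two algorithms tied for first place share the average of ranks 1 and 2, and the remaining algorithms keep ranks 3 and 4.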
In the experiments, bagging is used to train original ensemble, and the base classifier is J48, which is a Java implementation of C4.5 [23] from Weka [34]. In the third experiment, random forest is also used to build forest. In the last three experiments, we run FP two times.
Computational Intelligence and Neuroscience

Experimental Results.
The first experiment investigates the relationship between the performance of the proposed method (FP) and the number of times FP is run. In each trial, we first use bagging to learn 30 unpruned decision trees as a forest and then iteratively run lines (3)-(14) of FP many times to trim the forest. Further details of the experimental setup are given in Section 5.1. The corresponding results are shown in Figure 2, where the top four subfigures show how the number of forest nodes varies as the iteration number increases, and the bottom four show how ensemble accuracy varies. Figure 2 shows that FP significantly reduces forest size (almost 40%∼60% of the original ensemble) and significantly improves accuracy. However, the performance of FP is almost stable after two iterations. Therefore, we set the iteration number to 2 in the following experiments.
The second experiment investigates the performance of FP on pruning forests of different scales. The number of decision trees grows gradually from 10 to 200. Further details of the experimental setup are given in Section 5.1. The experimental results are shown in Figure 3, where the top four subfigures compare the sizes of pruned and unpruned ensembles as the number of decision trees grows, and the bottom four compare ensemble accuracy. As shown in Figure 3, for each data set, the rate of forest nodes pruned by FP remains stable and the accuracy improvement achieved by FP is also basically unchanged, no matter how many decision trees are constructed.
The third experiment evaluates the performance of FP on pruning ensembles constructed by an ensemble learning method. The setup details are given in Section 5.1. Tables 2, 3, 4, and 5 show the experimental results of the compared methods: Table 2 reports the mean accuracy and the ranks of the algorithms, Table 3 reports the average ranks using the nonparametric Friedman test [32] (using the STAC Web Platform [33]), Table 4 reports the comparison results of the post hoc Bonferroni-Dunn test (using the STAC Web Platform [33]) at the 0.05 significance level, and Table 5 reports the mean node numbers and standard deviations. Standard deviations are not provided in Table 2 for clarity. The "FP" column of Table 2 gives the results of the pruned forests, and "bagging" and "random forest" give the results of the unpruned forests constructed by bagging and random forest, respectively. In Tables 3 and 4, Alg1, Alg2, Alg3, Alg4, Alg5, and Alg6 indicate FP pruning bagging with unpruned C4.5, bagging with unpruned C4.5, FP pruning bagging with pruned C4.5, bagging with pruned C4.5, FP pruning random forest, and random forest, respectively. From Table 2, FP significantly improves ensemble accuracy on most of the 19 data sets, regardless of whether the individual classifiers are pruned or unpruned and regardless of whether the ensemble is constructed by bagging or random forest. Besides, Table 2 shows that FP always ranks among the best three methods on these data sets. Tables 3 and 4 validate the results in Table 2: Table 3 shows that the average rank of FP is much smaller than that of the other methods, and Table 4 shows that, compared with the other methods, FP performs significantly better. Table 5 shows that forests pruned by FP are significantly smaller than those built by bagging and random forest, regardless of whether the individual classifiers are pruned or not.
The last experiment is to evaluate the performance of FP on pruning subensembles selected by ensemble selection method EPIC. Table 6 shows the results on the 19 data sets, where left and right are the accuracy and size, respectively. As shown in Table 6, FP can further significantly improve the

Conclusion
An ensemble of decision trees is also called a forest. This paper proposes a novel ensemble pruning method called forest pruning (FP). FP prunes the branches of trees based on the proposed metric called branch importance, which indicates the importance of a branch (or a node) with respect to the whole ensemble. In this way, FP reduces ensemble size and improves ensemble accuracy. The experimental results on 19 data sets show that FP significantly reduces forest size and improves its accuracy on most of the data sets, regardless of whether the forests are ensembles constructed by some algorithm or subensembles selected by some ensemble selection method, and regardless of whether each forest member is a pruned decision tree or an unpruned one.

Table 6: The performance of FP on pruning subensembles obtained by EPIC on bagging. • represents that FP is significantly better (or smaller) than EPIC in pairwise t-tests at the 95% confidence level, and a second marker denotes that FP is significantly worse (or larger) than EPIC.

Dataset       Error rate        Size
              PF      EPIC      PF      EPIC
Australian