A Comparison Study on Rule Extraction from Neural Network Ensembles, Boosted Shallow Trees, and SVMs

One way to make the knowledge stored in an artificial neural network more intelligible is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem. Many techniques have been introduced to generate rules from single neural networks, but very few were proposed for ensembles. Moreover, experiments were rarely assessed by 10-fold cross-validation trials. In this work, based on the Discretized Interpretable Multilayer Perceptron (DIMLP), experiments were performed on 10 repetitions of stratified 10-fold cross-validation trials over 25 binary classification problems. The DIMLP architecture allowed us to produce rules from DIMLP ensembles, boosted shallow trees (BSTs), and Support Vector Machines (SVMs). The complexity of rulesets was measured with the average number of generated rules and the average number of antecedents per rule. Across the 25 classification problems, the most complex rulesets were generated from BSTs trained by “gentle boosting” and “real boosting.” Moreover, we clearly observed that the less complex the rules were, the better their fidelity was. In fact, rules generated from decision stumps trained by modest boosting were, for almost all the 25 datasets, the simplest with the highest fidelity. Finally, in terms of average predictive accuracy and average ruleset complexity, some of our results proved to be competitive with those reported in the literature.


Introduction
The explanation of neural network responses is essential for their acceptance. As an example, physicians cannot trust any model without some form of explanation. An intuitive way to give insight into the knowledge embedded within neural network connections and neuron activations is to extract symbolic rules. However, producing rules from Multilayer Perceptrons (MLPs) is an NP-hard problem [1].
In the context of classification, the format of a symbolic rule is given as follows: "if tests on antecedents are true then class $C$," where "tests on antecedents" are in the form $x_i \le t_i$ or $x_i \ge t_i$, with $x_i$ as an input variable and $t_i$ as a real number. Class $C$ designates a class among several possible classes. The complexity of the extracted rules is often described with two parameters: the number of rules and the number of antecedents per rule. Rulesets of low complexity are preferred to those of high complexity, since at first sight fewer rules and fewer antecedents are better understood. Another reason for this preference is that rule bases with lower complexity also reduce the risk of overfitting on new data. Nevertheless, Freitas clarified that the comprehensibility of rules is not necessarily related to a small number of rules [2]. He proposed a new measure denoted as prediction-explanation size, which strongly depends on the average number of antecedents per rule. Another measure of rule transparency is consistency. Specifically, an extracted ruleset is deemed to be consistent if, under different training sessions, the rule extraction algorithm produces rulesets which classify samples into the same classes. Finally, a rule is redundant if it conveys the same information or less general information than the information conveyed by another rule.
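As a minimal sketch of this rule format, the following Python fragment represents a rule as a list of threshold tests plus a class label and computes the two complexity parameters mentioned above. The names `Rule`, `covers`, and `complexity`, as well as the example thresholds and class labels, are illustrative assumptions and do not come from the original work.

```python
from dataclasses import dataclass
from typing import List, Tuple

# An antecedent is (input index, operator, threshold); the operator is "<=" or ">=".
Antecedent = Tuple[int, str, float]


@dataclass
class Rule:
    antecedents: List[Antecedent]
    label: str

    def covers(self, x: List[float]) -> bool:
        """Return True when every antecedent test holds for sample x."""
        for i, op, t in self.antecedents:
            if op == "<=" and not x[i] <= t:
                return False
            if op == ">=" and not x[i] >= t:
                return False
        return True


def complexity(ruleset: List[Rule]) -> Tuple[int, float]:
    """Return (number of rules, average number of antecedents per rule)."""
    n = len(ruleset)
    avg = sum(len(r.antecedents) for r in ruleset) / n if n else 0.0
    return n, avg


# Hypothetical two-rule ruleset over two input variables.
rules = [
    Rule([(0, "<=", 2.5)], "A"),
    Rule([(0, ">=", 2.5), (1, ">=", 1.0)], "B"),
]
```

With this hypothetical ruleset, `complexity(rules)` yields two rules with 1.5 antecedents per rule on average.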
In unordered rules, "else if" is replaced again by "if tests on antecedents are true then conclusion." Thus, a sample can activate more than one rule. Long ordered rulesets are difficult to understand since they potentially include many implicit antecedents, specifically those negated by "else if." Generally, unordered rulesets present more rules and antecedents than ordered ones, since all rule antecedents are explicitly provided, thus being more transparent than ordered rulesets. Each rule of an unordered ruleset represents a single piece of knowledge that can be examined in isolation, since all antecedents are explicitly given. With a great number of unordered rules, one would try to accurately understand the meaning of each rule with respect to the data domain. Getting the global picture could take a long time; nevertheless, one could be interested only in some parts of the whole knowledge, for instance, those rules with the highest number of covered samples.
The Discretized Interpretable Multilayer Perceptron (DIMLP) represents a special feedforward neural network architecture from which crisp symbolic rules are extracted in polynomial time [3]. This particular Multilayer Perceptron (MLP) model can be used to learn any classification problem, and rule extraction is also performed for DIMLP ensembles. Furthermore, special DIMLP architectures were also defined to produce fuzzy rules [4].
Decision trees are widely used in Machine Learning. They represent transparent models because symbolic rules are easily extracted. However, when they are combined in an ensemble, rule extraction becomes harder [5]. Here, we propose generating rules from ensembles of shallow decision trees with the help of DIMLP ensembles. In practical terms, each rule extracted from a tree is inserted into a single DIMLP network; then, all the rules generated from a tree ensemble are represented by a DIMLP ensemble. Finally, rule extraction is performed to obtain a ruleset representing the knowledge embedded within the decision tree ensemble. According to the No Free Lunch Theorem, no model is better than any other in general [6]. Hence, if a connectionist model is more accurate than a direct rule learner such as RIPPER [7], then it is worth extracting rules to understand the classifications, even if this involves extra computing time.
Authors who generated rules from single neural networks or Support Vector Machines (SVMs) very rarely assessed their techniques by tenfold cross-validation. Our experiments are based on ten repetitions of stratified tenfold cross-validation trials over 25 binary classification problems. Note that the total number of training trials is equal to 42500. Moreover, we compare the complexity of the rules generated from DIMLP ensembles, boosted shallow trees (BSTs), and SVMs. For SVMs we define the Quantized Support Vector Machine (QSVM), which is a DIMLP architecture trained by an SVM learning algorithm [16]. Our purpose is not to determine which model is the best for these classification problems, but to characterize the complexity of the rules produced by the models. Our results could serve as a basis for researchers who would like to compare their rule extraction techniques applied to connectionist models by 10-fold cross-validation. In the following sections we present the DIMLP model that allows us to produce rules from BSTs and SVMs and then the experiments, followed by the conclusion.
1.1. State of the Art. Since the earliest work of Gallant on rule extraction from neural networks [17], many techniques have been introduced. In the 1990s, Andrews et al. introduced a taxonomy aiming at characterizing rule extraction techniques [18]. Essentially, rule extraction algorithms belong to three categories: decompositional, pedagogical, and eclectic. In decompositional techniques, rules are extracted at the level of hidden and output neurons by analyzing weight values. Here, a basic requirement is that the computed output from each hidden and output unit must be mapped into a binary outcome which corresponds to the notion of a rule consequent. The basic idea of the pedagogical approach is to view rule extraction as a learning task where the target concept is the function computed by the network and the input attributes are simply the network's input neurons. Weight values are not taken into account in this category of techniques. Finally, the eclectic approach takes into account elements of both decompositional and pedagogical techniques. A few years later, Duch et al. published a survey article on this topic [9]. More recently, Diederich published a book on techniques to extract symbolic rules from Support Vector Machines (SVMs) [19] and Barakat and Bradley reviewed a number of rule extraction techniques applied to SVMs [20].

Rule Extraction from Neural Network Ensembles.
Many rule extraction techniques from single neural networks have been introduced, but only a few authors have started to extract rules from neural network ensembles. Bologna proposed the Discretized Interpretable Multilayer Perceptron (DIMLP) to generate unordered symbolic rules from both single networks and ensembles [21,22]. With the DIMLP architecture, rule extraction is performed by determining the precise location of axis-parallel discriminative hyperplanes.

Zhou et al. introduced the REFNE (Rule Extraction from Neural Network Ensemble) algorithm [23], which utilizes the trained ensembles to generate instances and then extracts symbolic rules from those instances. Attributes are discretized during rule extraction and the algorithm also uses particular fidelity evaluation mechanisms. Moreover, rules have been limited to only three antecedents. For Johansson, rule extraction from ensembles is an optimization problem in which a trade-off between accuracy and comprehensibility must be taken into account [14]. He used a genetic programming technique to produce rules from ensembles of 20 neural networks. Ao and Palade extracted rules from ensembles of Elman networks and SVMs by means of a pedagogical approach to predict gene expression in microarray data [24]. More recently, Hara and Hayashi proposed two-MLP ensembles using the "Recursive-Rule eXtraction" (Re-RX) algorithm [25] for data with mixed attributes [26]. Re-RX utilizes C4.5 decision trees and backpropagation to train MLPs recursively. Here, the rule antecedents for discrete attributes are disjoint from those for continuous attributes. Subsequently, Hayashi et al. presented the "three-MLP Ensemble" by the Re-RX algorithm [27].

Rule Extraction from Ensembles of Decision Trees.
Basically, rule extraction techniques applied to ensembles of decision trees belong to two distinct groups. In the first, the purpose is to reduce the number of decision trees by increasing their diversity. Techniques for the optimization of diversity are reported in [28]; as an example, Gashler et al. improved the ensemble diversity by combining different decision tree algorithms [29].
Techniques in the second group concentrate on the rules extracted during the ensemble construction. A well-known representative technique in this group is RuleFit [30]. The base learners are rules extracted from a large number of CART decision trees [31]. Specifically, these trees are trained on random subsets of the learning set, the main idea being to define a linear function including rules and features that approximates the whole ensemble of decision trees. At the end of the process this linear function represents a regularized regression of the ensemble responses with a large number of coefficients equal to zero. Node Harvest is another rule-based representative technique [32]. Its purpose is to find suitable weights for rules by performing a minimization on a quadratic program with linear inequality constraints. Finally, in [33], the rule extraction problem is viewed as a regression problem using the sparse group lasso method [34], such that each rule is assumed to be a feature, where the aim is to predict the response. Subsequently, most of the rules are removed by trying to keep accuracy and fidelity as high as possible.

Rule Extraction from Support Vector Machines.
To produce rules from SVMs, a number of techniques applied a pedagogical approach [35][36][37][38]. As a first step, training samples are relabeled according to the target class provided by the SVM. Then, the new dataset is learned by a transparent model, such as a decision tree, which approximately learns what the SVM has learned. As a variant, only a subset of the training samples is used as the new dataset: the support vectors [39]. Before the training of a decision tree algorithm, Martens et al. generate additional learning examples close to randomly selected support vectors [38]. In another technique, Barakat and Bradley generate rules from a subset of the support vectors using a modified covering algorithm, which refines a set of initial rules determined by the most discriminative features [40].
Fu et al. proposed a method aiming at determining hyperrectangles whose upper and lower corners are defined by determining the intersection of each of the support vectors with the separating hyperplane [41]. This is achieved by solving an optimization problem depending on the Gaussian kernel. Núñez et al. determined prototype vectors for each class [15,42]. With the use of the support vectors, these prototypes are translated into ellipsoids or hyperrectangles. An iterative process is defined in order to divide ellipsoids or hyperrectangles into more regions, depending on the presence of outliers and the SVM decision boundary. Similarly, Zhang et al. introduced a clustering algorithm to define prototypes from the support vectors [43]. Then, small hyperrectangles are defined around these prototypes and progressively grown until a stopping criterion is met. Note that for these last two methods the comprehensibility of the rules is low, since all input features are present in the rule antecedents.

Material and Methods
In this section we present the models used in this work, which are DIMLP ensembles, Quantized Support Vector Machines, and boosted shallow trees. The rule extraction process of the last two models has been made possible by transforming them into particular DIMLP architectures.
2.1. The DIMLP Model. DIMLP differs from MLP in the connectivity between the input layer and the first hidden layer. Specifically, each neuron of the first hidden layer receives a connection from only one input neuron and from the bias neuron, as shown in Figure 1. After the first hidden layer, neurons are fully connected. Note that very often DIMLPs are defined with two hidden layers, the number of neurons in the first hidden layer being equal to the number of input neurons.
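In the first hidden layer the activation function is a staircase function $S(t)$ with $q$ stairs that approximates the sigmoid function $\sigma(t) = 1/(1 + \exp(-t))$:
$$S(t) = \sigma(R_{\min}) \quad \text{if } t \le R_{\min};$$
$R_{\min}$ represents the abscissa of the first stair. By default $R_{\min} = -5$.
$$S(t) = \sigma(R_{\max}) \quad \text{if } t \ge R_{\max};$$
$R_{\max}$ represents the abscissa of the last stair. By default $R_{\max} = 5$. Otherwise, if $R_{\min} < t < R_{\max}$ we have
$$S(t) = \sigma\left(R_{\min} + \left[q \cdot \frac{t - R_{\min}}{R_{\max} - R_{\min}}\right] \cdot \frac{R_{\max} - R_{\min}}{q}\right).$$
Square brackets indicate the integer part function and $q = 1, \ldots, Q$. The step function $\mathrm{step}(t)$ is a particular case of the staircase function with only one step:
$$\mathrm{step}(t) = \begin{cases} 1 & \text{if } t > 0, \\ 0 & \text{otherwise.} \end{cases}$$
If we would like to obtain a better approximation of the sigmoid function we could change these values and increase the number of stairs. The activation function in the hidden layers above the first one is again a sigmoid. Note that the step/staircase activation function makes it possible to precisely locate possible discriminative hyperplanes.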
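The staircase activation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the parameter names `q`, `r_min`, and `r_max`, and the exact quantization (the input is floored to the left edge of its stair before the sigmoid is applied) are assumptions made here, guided by the description above and its defaults of −5 and 5 for the first and last stairs.

```python
import math


def sigmoid(t: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def staircase(t: float, q: int = 50, r_min: float = -5.0, r_max: float = 5.0) -> float:
    """Piecewise-constant approximation of the sigmoid with q stairs (assumed form)."""
    if t <= r_min:
        return sigmoid(r_min)          # saturated below the first stair
    if t >= r_max:
        return sigmoid(r_max)          # saturated above the last stair
    width = (r_max - r_min) / q        # width of one stair
    k = math.floor((t - r_min) / width)  # integer part: index of the stair containing t
    return sigmoid(r_min + k * width)
```

Because the sigmoid has a maximum slope of 0.25, the approximation error inside the interval is at most a quarter of the stair width, which is 0.05 for 50 stairs over [−5, 5]; increasing the number of stairs tightens this bound.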
As an example, in Figure 1, assuming two different classes, the first is selected when $y_1 > \sigma(0) = 0.5$ (black circles) and the second when $y_1 \le \sigma(0) = 0.5$ (white squares). Hence, two possible hyperplane splits are located at $-w_{10}/w_1$ and $-w_{20}/w_2$, respectively. The extracted unordered rules are expressed with respect to these two thresholds.
The training of a DIMLP network having step activation functions in the first hidden layer was performed by simulated annealing [8], since the gradient is undefined with step activation functions. When the number of stairs was sufficient to approximate the sigmoid function well, a modified backpropagation algorithm was used [8]. The default number of stairs in the staircase activation function was equal to 50.

Rule Extraction.
Each neuron of the first hidden layer creates a number of virtual parallel hyperplanes that is equal to the number of stairs of its staircase activation function. As a consequence, the rule extraction algorithm corresponds to a covering algorithm for which the goal is to determine whether a virtual hyperplane is virtual or effective. A distinctive feature of this rule extraction technique is that fidelity, which is the degree of matching between network classifications and rules' classifications, is equal to 100% with respect to the training set.
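Fidelity, as defined above, is simply an agreement rate between two label sequences. A minimal sketch follows, assuming both the network and the ruleset have already produced one class per sample; the function name `fidelity` and the example labels are illustrative.

```python
from typing import Sequence


def fidelity(model_classes: Sequence[str], rule_classes: Sequence[str]) -> float:
    """Fraction of samples on which rules and model give the same class."""
    pairs = list(zip(model_classes, rule_classes))
    return sum(m == r for m, r in pairs) / len(pairs)


# Hypothetical classifications of four samples by the model and by the rules.
example = fidelity(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

On the training set the extraction technique described here would give a value of 1.0 by construction; values below 1.0 can only appear on samples not used for extraction.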
Here we describe the general idea behind the rule extraction algorithm, since more details are described in [3]. The relevance of a discriminative hyperplane corresponds to the number of points viewing this hyperplane as the transition to a different class. In the first step of the rule extraction algorithm the relevance of discriminative hyperplanes is estimated from all training examples and DIMLP responses.
Once the relevance of discriminative hyperplanes has been established, a special decision tree is built according to the strongest relevant hyperplane criterion. In other terms, during tree induction in a given region of the input space, the hyperplane having the largest number of points viewing this hyperplane as the transition to a different class is added to the tree.
Each path between the root and a leaf of the obtained decision tree corresponds to a rule. At this stage rules are disjoint and generally their number is large, as well as their number of antecedents. Therefore, a pruning strategy is applied to all rules according to the most enlarging pruned antecedent criterion. This heuristic implies that at each step the pruning algorithm removes the rule antecedent whose removal most increases the number of covered examples without changing DIMLP classifications. Note that at the end of this stage rules are no longer disjoint and unnecessary rules are removed.
When it is no longer possible to prune any antecedent or any rule, all thresholds of the remaining antecedents are modified according to the most enlarging criterion, again to increase the number of examples covered by each rule. More precisely, for each attribute new threshold values are determined according to the list of discriminative hyperplanes. At each step, the new antecedent threshold which most increases the number of covered examples without altering DIMLP classifications is retained.
The general algorithm is summarized as follows:
(1) Determine relevance of discriminative hyperplanes using available examples.
(2) Build a decision tree according to the highest relevant hyperplane criterion.
(3) Prune rule antecedents according to the most enlarging pruned antecedent criterion.
(4) Remove unnecessary rules.
(5) Modify antecedent thresholds according to the most enlarging criterion.
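The antecedent pruning step can be illustrated with a small greedy loop. The sketch below is a simplified reading of the "most enlarging pruned antecedent" idea for a single rule, not the DIMLP implementation described in [3]: a removal is accepted only if every sample covered by the shortened rule is still assigned the rule's class by the model, and among acceptable removals the one covering the most samples is kept. All names (`covers`, `prune_rule`, `toy_model`) and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Antecedent = Tuple[int, str, float]  # (input index, "<=" or ">=", threshold)


def covers(antecedents: Sequence[Antecedent], x: Sequence[float]) -> bool:
    """True when sample x satisfies every antecedent."""
    return all(x[i] <= t if op == "<=" else x[i] >= t for i, op, t in antecedents)


def prune_rule(antecedents: Sequence[Antecedent], label: str,
               samples: Sequence[Sequence[float]],
               model: Callable[[Sequence[float]], str]) -> List[Antecedent]:
    """Greedily drop antecedents while the rule stays faithful to the model."""
    current = list(antecedents)
    while current:
        best, best_cov = None, -1
        for k in range(len(current)):
            candidate = current[:k] + current[k + 1:]
            covered = [x for x in samples if covers(candidate, x)]
            # Accept only removals that keep all covered samples in the rule's class.
            if all(model(x) == label for x in covered) and len(covered) > best_cov:
                best, best_cov = candidate, len(covered)
        if best is None:
            break  # no antecedent can be removed without changing classifications
        current = best
    return current


# Toy model: class "A" when the first input is at least 0.5, otherwise "B".
def toy_model(x: Sequence[float]) -> str:
    return "A" if x[0] >= 0.5 else "B"


toy_samples = [(0.6, 0.2), (0.7, 0.9), (0.2, 0.1), (0.1, 0.8)]
pruned = prune_rule([(0, ">=", 0.5), (1, "<=", 0.3)], "A", toy_samples, toy_model)
```

In this toy case the test on the second input is redundant with respect to the model, so pruning leaves only the test on the first input and the rule covers two samples instead of one.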

DIMLP Ensembles.
We implemented DIMLP ensemble learning by bagging [44] and arcing [45]. Bagging and arcing are based on resampling techniques. For the first training method, assuming a training set of size $p$, bagging selects for each classifier included in the ensemble $p$ samples drawn with replacement from the original training set. Hence, for each DIMLP network many of the generated samples may be repeated while others may be left out. In this way, a certain diversity of each single network proves to be beneficial with respect to the whole ensemble of combined classifiers.
Arcing associates a probability with each sample of the original training set. The samples of each classifier are chosen according to these probabilities. Before learning, all training samples have the same probability of belonging to a new training set ($= 1/p$). Then, after the first classifier has been trained, the probability of sample selection in a new training set is increased for all unlearned samples and decreased for the others.
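Both resampling schemes can be sketched with the Python standard library. The bagging part follows the description above directly. For arcing, the text does not state the update rule, so the sketch assumes Breiman's arc-x4 weighting, in which the selection probability of a sample is proportional to one plus the fourth power of the number of times it has been misclassified so far; whether this is the exact variant used in the experiments is an assumption. All function names are illustrative.

```python
import random
from typing import List, Sequence


def bagging_indices(p: int, rng: random.Random) -> List[int]:
    """Draw p sample indices with replacement from a training set of size p."""
    return [rng.randrange(p) for _ in range(p)]


def arc_x4_probabilities(miscounts: Sequence[int]) -> List[float]:
    """Selection probabilities from misclassification counts (arc-x4, assumed)."""
    weights = [1 + m ** 4 for m in miscounts]
    total = sum(weights)
    return [w / total for w in weights]


def weighted_indices(probs: Sequence[float], rng: random.Random) -> List[int]:
    """Draw as many indices as there are samples, according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=len(probs))


rng = random.Random(0)
bag = bagging_indices(10, rng)
uniform = arc_x4_probabilities([0, 0, 0, 0])   # before any training: all equal
skewed = arc_x4_probabilities([0, 1, 2])        # harder samples become more likely
```

Before any classifier is trained all counts are zero, which reproduces the uniform starting probabilities mentioned above.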
Rule extraction from ensembles can still be performed, since an ensemble of DIMLP networks can be viewed as a single DIMLP network with one more hidden layer. For this unique DIMLP network, weight values between subnetworks are equal to zero. Figure 2 illustrates three different kinds of DIMLP ensembles. Each "box" in this figure is transparent, since it can be translated into symbolic rules. The ensemble resulting from different types of combinations is again transparent, since it is still a DIMLP network with one more layer of weights.

Classification Strategy of the Rules.
For the training set the degree of matching between DIMLP classifications and rules, also denoted as fidelity, is equal to 100%. With unordered rules, an unknown sample not belonging to the training set activates zero, one, or several rules. Thus, several activated rules of different classes lead to an ambiguous decision process. As a remedy, classifications provided by DIMLPs are taken into account to disambiguate the classification process. We summarize the possible situations for an unclassified sample not belonging to the training set:
(i) No activated rules: the classification is provided by the DIMLP network (thus, no explanation is provided).
(ii) One or several rules belonging to the same class corresponding to the one provided by the DIMLP network: thus, rule(s) and network agree.
(iii) One or several rules belonging to different classes: if the class provided by DIMLP is represented in the rule(s), we only take into account this (these) rule(s) to explain the classification and discard the other(s).
(iv) One or several rules belong to one or several classes, but the class provided by DIMLP is not represented in the rule(s). Thus, rule(s) and network disagree and the classification provided by the rules is wrong.
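The four situations can be written as a small decision function. The sketch below assumes that the rules activated by a sample are available as (rule identifier, rule class) pairs and that the network's class is known; the function name `resolve` and the situation labels are illustrative, not part of the original method.

```python
from typing import List, Tuple


def resolve(activated: List[Tuple[str, str]], model_class: str) -> Tuple[str, List[str]]:
    """Return (situation, identifiers of the rules kept as explanation)."""
    if not activated:
        return "no_rule", []                      # (i) network decides, no explanation
    matching = [rid for rid, cls in activated if cls == model_class]
    classes = {cls for _, cls in activated}
    if not matching:
        return "disagree", []                     # (iv) network class not represented
    if classes == {model_class}:
        return "agree", matching                  # (ii) all rules match the network
    return "disambiguated", matching              # (iii) keep only the matching rules
```

For instance, a sample that activates rule `r1` of class A and rule `r2` of class B while the network answers B falls into situation (iii), and only `r2` is kept as the explanation.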
Predictive accuracy is the proportion of correctly classified samples of an independent testing set. With respect to the rules it can be calculated by following three distinct strategies:
(i) Classifications are provided by the rules. If a sample does not activate any rule the class is provided by the model without explanation.
(ii) Classifications are provided by the rules, when rules and model agree. In case of disagreement, no classification is provided. Moreover, if a sample does not activate any rule the class is provided by the model.
(iii) Classifications are provided by the rules, when rules and model agree. In case of disagreement, the classification is provided by the model without any explanation. Moreover, if a sample does not activate any rule, the class is again provided by the model without explanation.
By following the first strategy, the unexplained samples are only those that do not activate any rule. For the second one, in case of disagreement between rules and models no classification response is provided; in other words the classification is undetermined. Finally, the predictive accuracy of rules and models is equal in the last strategy, but with respect to the first strategy we have a supplemental proportion of uncovered samples, those for which rules and models disagree.
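The three strategies differ only in what happens when rules and model disagree, which the following sketch makes explicit. It assumes each test sample is summarized by its true class, the model's class, and the class given by the activated rules (or `None` when no rule fires). It also assumes that, under the second strategy, accuracy is computed over the samples that actually receive a classification; that normalization is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

# (true class, model class, rule class or None when no rule is activated)
Sample = Tuple[str, str, Optional[str]]


def predict(model_cls: str, rule_cls: Optional[str], strategy: int) -> Optional[str]:
    """Class returned for one sample under strategy 1, 2, or 3."""
    if rule_cls is None:
        return model_cls                  # no rule fires: the model decides
    if strategy == 1:
        return rule_cls                   # rules always decide
    if rule_cls == model_cls:
        return rule_cls                   # agreement: rules decide
    return None if strategy == 2 else model_cls  # disagreement


def accuracy(samples: List[Sample], strategy: int) -> float:
    """Accuracy over the samples that receive a classification."""
    outcomes = [(truth, predict(m, r, strategy)) for truth, m, r in samples]
    decided = [(t, p) for t, p in outcomes if p is not None]
    return sum(t == p for t, p in decided) / len(decided)


# Hypothetical test set of four samples.
data: List[Sample] = [("A", "A", "A"), ("A", "A", "B"), ("B", "A", "B"), ("B", "B", None)]
```

On this toy data the third strategy reproduces the model's own accuracy, as stated above, while the second strategy scores only the two samples on which a classification is provided.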

Quantized Support Vector Machines (QSVMs).
Functionally, SVMs can be viewed as feedforward neural networks. Here, we focus on how an SVM is transformed into a QSVM, which is a DIMLP network with specific neuron activation functions. Since QSVM is also a DIMLP network, rules can be extracted by performing the DIMLP rule extraction algorithm. QSVM is trained by a standard SVM training algorithm, for which details are provided in [46] or [47].
The classification decision function of an SVM model is given by
$$C(x) = \operatorname{sign}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right), \tag{6}$$
$\alpha_i$ and $b$ being real values, $y_i \in \{-1, 1\}$ corresponding to the target values of the support vectors, and $K(x_i, x)$ representing a kernel function with $x_i$ as the vector components of the support vectors. The sign function is
$$\operatorname{sign}(x) = \begin{cases} 1 & \text{if } x > 0, \\ -1 & \text{otherwise.} \end{cases}$$
The following kernels are used: dot (linear), polynomial, and Gaussian. Specifically, for the dot and polynomial cases we have
$$K(x_i, x) = (x_i \cdot x)^d,$$
with $d = 1$ for the dot kernel and $d = 3$ for the polynomial kernel. The Gaussian kernel is
$$K(x_i, x) = \exp\left(-\gamma \lVert x_i - x \rVert^2\right),$$
with $\gamma > 0$, a parameter.
We define a Quantized Support Vector Machine as a DIMLP network with two hidden layers. The activation function of the neurons in the second hidden layer is related to the SVM kernel. Figure 3 presents a QSVM with a Gaussian activation function in the second hidden layer. Neurons in the first hidden layer have a staircase activation function. The role of neurons of the first hidden layer is to perform a normalization of the input variables. This normalization is carried out through weight values depending on the training data before the learning phase. Note that during training these weights remain unchanged. Let us assume that we have the same number of input neurons and hidden neurons in the first hidden layer. These weights are defined as
(i) $w_k = 1/\sigma_k$, with $\sigma_k$ as the standard deviation of input $k$,
(ii) $w_{k0} = -\mu_k/\sigma_k$, with $\mu_k$ as the average on the training set of input $k$.
With a dot kernel, the activation function in the second hidden layer corresponds to the identity function, while it is a cubic polynomial with a polynomial kernel. The number of neurons in this layer is equal to the number of support vectors, with the incoming weight connections corresponding to the components of the support vectors. Specifically, a weight between the first and second hidden layers, denoted as $v_{jk}$ in Figure 3, corresponds to the $k$th component of the $j$th support vector. Weights between the second hidden layer and the output neuron, denoted as $u_j$ in Figure 3, correspond to the $\alpha_j$ coefficients in (6). Finally, the activation function of the output neuron is a sign function.
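To make the layer roles concrete, here is a minimal sketch of the two fixed ingredients: the normalization weights of the first hidden layer and the kernel-based decision of the upper layers. It deliberately omits the staircase quantization of the first hidden layer, uses the population standard deviation (the text does not say which estimator is used), and folds each target value into its coefficient. All names and numeric values are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple


def normalization_weights(column: Sequence[float]) -> Tuple[float, float]:
    """Weight 1/sigma and bias -mu/sigma for one input variable."""
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / len(column))
    return 1.0 / sigma, -mu / sigma


def gaussian_kernel(u: Sequence[float], v: Sequence[float], gamma: float) -> float:
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x: Sequence[float], support_vectors: List[Sequence[float]],
                 coeffs: Sequence[float], bias: float,
                 kernel: Callable[[Sequence[float], Sequence[float]], float]) -> int:
    """Sign of the weighted kernel sum; coeffs[i] stands for alpha_i * y_i."""
    s = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + bias
    return 1 if s > 0 else -1


# Hypothetical values: one input column and a two-support-vector machine.
w, w0 = normalization_weights([1.0, 2.0, 3.0])
svs = [[0.0, 0.0], [2.0, 2.0]]
kern = lambda u, v: gaussian_kernel(u, v, gamma=0.5)
near_first = svm_decision([0.1, 0.1], svs, [1.0, -1.0], 0.0, kern)
near_second = svm_decision([1.9, 1.9], svs, [1.0, -1.0], 0.0, kern)
```

Applying `w * x + w0` to an input value is the usual standardization, so the mean of the training column maps to zero before the kernel layer sees it.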

Ensembles of Shallow Decision Trees.
A binary decision tree is made of nodes and branches. At each node, a test on an attribute is performed; depending on its predicate value the path continues to the left or to the right branch (if any), until a terminal node, also denoted as a leaf, is reached. Shallow trees have a very limited number of nodes; they represent "weak" learners with limited power of expression. As an example, a tree with a single node performs a test on only one attribute. Such a shallow tree is also called a decision stump. The key idea behind ensembles of shallow decision trees is to obtain strong classifiers by training weak learners by boosting [48]. Three variants of boosting are used in this work to train boosted shallow trees (BSTs):
(i) Modest Adaboost [49],
(ii) Gentle Adaboost [50],
(iii) Real Adaboost [51].
A single decision tree is built according to a splitting criterion. Specifically, at each step the most informative attribute that splits the training set accurately is determined. Many possible criteria can be used to determine the best splitting attribute; for more details see [31,52]. Once training is completed, BSTs are transformed into DIMLP ensembles. Specifically, for each BST, a path from a root to a leaf represents a symbolic rule. Then, each rule is inserted into a unique DIMLP network. Note also that all the rules extracted from a BST could be inserted into a DIMLP, but for simplicity we will show the former rule insertion technique. We assume here that DIMLPs have a unique hidden layer with an activation function which is a sigmoid (cf. (5)).
Figure 4 exhibits a shallow decision tree with two nodes.Following the paths between the root and the leaves, we obtain three rules.
Each rule is inserted into a single DIMLP. Note that rule antecedents are present in the weight values between the input layer and the hidden layer (see Figure 5).
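Reading rules off a shallow tree amounts to enumerating its root-to-leaf paths. The sketch below uses a hypothetical two-node tree (it is not the tree of Figure 4, whose tests are not reproduced here) encoded as nested tuples, with the convention that the left branch is taken when the tested input is at most the threshold; the encoding and the function name `tree_to_rules` are assumptions.

```python
from typing import List, Tuple, Union

# Internal node: (input index, threshold, left subtree, right subtree);
# the left subtree is followed when x[index] <= threshold. A leaf is a class label.
Tree = Union[str, Tuple[int, float, "Tree", "Tree"]]
Test = Tuple[int, str, float]


def tree_to_rules(tree: Tree, path: Tuple[Test, ...] = ()) -> List[Tuple[List[Test], str]]:
    """Return one (antecedents, class) pair per root-to-leaf path."""
    if isinstance(tree, str):
        return [(list(path), tree)]
    i, t, left, right = tree
    return (tree_to_rules(left, path + ((i, "<=", t),))
            + tree_to_rules(right, path + ((i, ">", t),)))


# Hypothetical shallow tree with two internal nodes, hence three leaves.
shallow = (0, 0.5, "c1", (1, 2.0, "c2", "c1"))
extracted = tree_to_rules(shallow)
```

A tree with two internal nodes has three leaves, so three rules are obtained, in line with the count given above; each of them would then be inserted into its own DIMLP.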
Without loss of generality we formulate the rule insertion algorithm for classification problems of two classes, vector (1, 0) coding the first class and vector (0, 1) coding the second.

Rule Insertion Algorithm
(1) For all BSTs generate the list of rules $R$ with their corresponding class by following all the paths between roots and leaves.
Boosting algorithms provide for each weak learner coefficients that are inserted in the combination layer (cf. Figure 2). Note that for DIMLP ensembles trained with bagging or arcing these weights are equal to $1/N$, with $N$ being the number of networks in the ensemble.

Results
In the experiments we use 25 datasets representing classification problems of two classes. Table 1 summarizes their main characteristics in terms of number of samples, number of input features, type of features, and source. We have four types of inputs: Boolean, categorical, integer, and real. The public sources of the datasets are
(i) UCI: Machine Learning Repository at the University of California, Irvine: https://archive.ics.uci.edu/ml/datasets.html [53],
(ii) KEEL: http://sci2s.ugr.es/keel/datasets.php [54],
(iii) LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
The complexity of boosted shallow trees was controlled according to the parameter defining the number of splits for each shallow tree (cf. Section 2.3). This parameter varies from one to four. Note that when this value is equal to one we obtain decision stumps.
The default number of neurons in the first hidden layer is equal to the number of input neurons and the number of neurons in the second hidden layer is empirically defined in order to obtain a number of weight connections that is less than the number of training samples. Finally, the default number of DIMLPs in an ensemble is equal to 25, since it has been empirically observed that for bagging and arcing the most substantial improvement in accuracy is achieved with the first 25 networks [44].
For QSVMs, default learning parameters are those defined in the libSVM library (this software is available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/). The number of stairs in the staircase function was set to 200, in order to guarantee a sufficient number of quantized levels in the input values. We used nu-SVM [55]; note that our goal was not to optimize the predictive accuracy of the models but just to use default configurations in order to assess the accuracy and complexity of the models. With respect to all the defined models and datasets, the total number of training and rule extraction runs is equal to 42500 (= 17 ⋅ 25 ⋅ 100).
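The count can be decomposed from quantities given elsewhere in the text, assuming that the 17 models are the twelve boosted shallow tree configurations (three boosting variants, each with one to four splits), the two DIMLP ensembles, and the three QSVMs, and that the factor 100 stands for ten repetitions of tenfold cross-validation:

```latex
\[
\underbrace{(3 \times 4 + 2 + 3)}_{17\ \text{models}}
\times \underbrace{25}_{\text{datasets}}
\times \underbrace{(10 \times 10)}_{\text{repetitions} \times \text{folds}}
= 17 \times 25 \times 100 = 42500 .
\]
```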

Overall Results.
Figure 6 gives a general view of the logarithm of the complexity of the rulesets (y-axis) generated from the models (x-axis). Here, complexity corresponds to the total number of rule antecedents per ruleset. With respect to the x-axis, indexes 1 to 4 indicate BST-M with the split parameter varying from 1 to 4, indexes from 5 to 8 are related to BST-G, indexes from 9 to 12 indicate BST-R, and finally indexes from 13 to 17 correspond to DIMLP-B, DIMLP-A, QSVM-L, QSVM-P3, and QSVM-G, respectively. For each boxplot, the central mark is the median obtained by cross-validation trials and the edges of the box are the 25th and 75th percentiles. Overall, with respect to the 25 datasets used in the experiments, the lowest median complexity is obtained by BST-M1, while the top medians are given by BST-G3, BST-G4, BST-R3, and BST-R4. Moreover, it clearly appears that the median complexity increases as the number of splits in the shallow trees grows from one to three.
Figure 7 illustrates the average predictive accuracy of the extracted rulesets (y-axis) with respect to each model (x-axis). It is worth noting that BST-R4 and DIMLP-B reach the highest medians, with DIMLP-B obtaining a better 25th percentile.
Figure 8 shows boxplots of the average fidelity of the extracted rulesets. Qualitatively, BST-M obtains the best results with respect to median fidelity, while BST-G and BST-R give the lowest fidelity results. As a qualitative rule, the lower the complexity of the extracted rulesets the higher the fidelity, and vice versa. This observation is also illustrated in Figure 9. Specifically, with respect to the 25 classification problems used in the experiments, each point of this figure represents the average fidelity of the extracted rulesets versus the average number of antecedents per ruleset. It is worth noting that from left to right (with respect to the x-axis), red "+" indicates BST-M1, BST-M2, BST-M3, and BST-M4. Thus, ruleset complexity increases with the number of splits of the shallow trees. Similarly, we can see the same trend for the triangles related to BST-Gs and BST-Rs. Based on the 17 models, a linear regression is also shown. Hence, we can clearly see a trend in which fidelity decreases as ruleset complexity increases.

Detailed Results.
Table 2 gives for each dataset the average predictive accuracy obtained by the best model (column three), as well as the average predictive accuracy of the best extracted rulesets (column five). The difference of these average accuracies is reported in column six. The last three columns indicate the average fidelity, the average number of generated rules, and the average number of antecedents per rule, respectively. It is worth noting that the average predictive accuracy of rulesets is rarely better than the predictive accuracy provided by the best model, because the power of expression of rules is somewhat limited with respect to that of the original models. However, for many datasets, ruleset average predictive accuracy is quite close to that provided by the best model. Results shown in Table 3 are similar to those provided by Table 2. The only difference resides in the way that the average predictive accuracy of the rulesets is measured. Specifically, here, we only take into account the samples for which the rules and the model from which they were generated agree. In that case, the average predictive accuracy of the rules is always equal to or higher than that provided by the model. Intuitively, it means that if rules and models agree then results are more reliable.
The purpose of Figure 10 is to show the average difference in predictive accuracy between a model and its generated rulesets over the 25 classification problems. The lower part of this figure concerns this average difference when rules and network agree.
Tables 4 and 5 present the detailed results of rulesets' average predictive accuracy and standard deviations. Note that the classification decision was determined by the neural network model when a testing sample was not covered by any rule. Moreover, in the case of conflicting rules (i.e., rules of two different classes), the selected class is again the one determined by the model. Tables 6 and 7 show the average complexity in terms of average number of rules and average number of antecedents per ruleset. Finally, Tables 8 and 9 illustrate average fidelity results with their standard deviations.
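The decision procedure described above (fall back to the model when no rule covers a sample or when rules of different classes fire) might be sketched as follows. The rule representation, the function names, and the example rules are hypothetical choices made for illustration, not the data structures of the DIMLP implementation.

```python
from typing import List, Sequence, Tuple

Antecedent = Tuple[int, str, float]   # (input index, ">=" or "<", threshold)
Rule = Tuple[List[Antecedent], int]   # (antecedents, class)


def antecedent_holds(x: Sequence[float], antecedent: Antecedent) -> bool:
    index, op, threshold = antecedent
    if op == ">=":
        return x[index] >= threshold
    if op == "<":
        return x[index] < threshold
    raise ValueError(f"unsupported operator: {op!r}")


def rule_fires(x: Sequence[float], rule: Rule) -> bool:
    antecedents, _ = rule
    return all(antecedent_holds(x, a) for a in antecedents)


def classify_with_rules(x: Sequence[float], rules: List[Rule],
                        model_class: int) -> Tuple[int, str]:
    """Return (class, source), where source says how the class was decided."""
    fired_classes = {rule[1] for rule in rules if rule_fires(x, rule)}
    if not fired_classes:
        return model_class, "uncovered"   # no rule covers the sample
    if len(fired_classes) > 1:
        return model_class, "conflict"    # rules of different classes fire
    return next(iter(fired_classes)), "rules"


# Two invented rules on a two-input problem.
rules = [([(0, ">=", 0.5)], 1), ([(1, "<", 0.2)], 0)]
covered = classify_with_rules([0.7, 0.9], rules, model_class=0)
uncovered = classify_with_rules([0.1, 0.5], rules, model_class=1)
conflict = classify_with_rules([0.7, 0.1], rules, model_class=0)
```

Returning the source of the decision alongside the class makes it easy to count how many test samples remain without an explanation.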
In Table 10 our purpose is to illustrate the impact of DIMLP ensembles with respect to single DIMLPs. We focus on average predictive accuracy and average complexity of the generated rulesets. Columns four and seven are related to single architectures. Complexity, which is given in terms of number of rules and average number of antecedents per rule, is in bold when the product of these two components is the lowest. Note that for single architectures, 10% of the samples are used to decide when to stop training (with 80% of samples used for training). With respect to single DIMLPs, bagging tends to reduce average complexity of the generated rulesets, since in 22 problems out of 25 it was lower. Conversely, for DIMLP ensembles trained by arcing, average complexity was higher in 20 problems. Finally, average predictive accuracy of rulesets produced by ensembles was higher than or equal to that provided by single DIMLPs in 22 problems out of 25.

Related Work.
Among several published works on the knowledge extracted from ensembles, very few are based on cross-validation trials. Note that, for the Breast Cancer comparison of Table 11, a fair comparison of the complexity of the extracted rules is difficult, since some techniques such as Re-RX generate ordered rules, while DIMLP-B extracts unordered rules. For the predictive accuracy, DIMLP-B obtains the highest average. With the use of G-REX [14], a genetic programming technique, Johansson presented a number of results on the extraction of decision trees from ensembles of 20 neural networks, based on one repetition of 10-fold cross-validation. Table 12 presents these results, with columns three and four depicting the results provided by Trepan [14], which is a general technique for knowledge extraction [13]. Our results with DIMLP-Bs (based on 10 repetitions of stratified 10-fold cross-validation) are shown in the last three columns. Average fidelity of DIMLP-Bs is always greater than that obtained by G-REX and Trepan (it is considerably higher in five of the classification problems). With the exception of one classification problem, the average predictive accuracy values of our models and rulesets are slightly higher than those of G-REX and Trepan.
In [15] rule extraction from SVMs is reported based on ten repetitions of stratified tenfold cross-validation. Table 13 illustrates the comparison with our results obtained by QSVMs. Note that the average number of antecedents is not reported, because their number in [15] is equal to the number of inputs. Thus, we generate less complex rulesets, on average, while our predictive accuracy is better or very close. Finally, we obtain better average fidelity.
3.5. Discussion. SVMs are very often used as single models, because with boosting they tend to overfit the data. Shallow trees are weak learners; thus they have to be trained in ensembles. For DIMLPs trained by bagging, we observed that the complexity of the extracted rulesets was somewhat lower than that of rulesets produced by a single network in 22 problems out of 25. In contrast, ensembles trained by arcing showed increased complexity in the extracted rulesets in 20 problems out of 25. Concerning the impact of model architecture, this work showed that for boosted decision trees the extracted rulesets tend to be more complex, on average, when the number of splits is increased (see Figure 9 with BST-M, BST-G, and BST-R, with the number of splits in a decision tree varying from 1 to 4).
With respect to rulesets, the lower the fidelity, the higher the complexity. Conversely, the higher the fidelity, the lower the complexity. Since the best average predictive accuracy is in some cases provided by the most complex rulesets, we also have a clear trade-off between accuracy and complexity. Another compromise to take into account is the proportion of covered samples with respect to predictive accuracy. Specifically, Table 2 showed that very often the average predictive accuracy of rulesets is lower than that of the models from which they are generated. In case of disagreement between rules and models, if rules are ignored, more samples are left without explanation, but the remaining rules will have better predictive accuracy, on average (cf. Table 3).
Let us suppose that a physician is in a realistic situation in which a patient diagnosis is provided by an ensemble of DIMLPs. If the patient symptoms (i.e., the inputs) are not covered by any rule, the physician cannot explain the response given by the neural ensemble. Hence, a first possibility would be to perform rule extraction again by including the new patient data. However, this solution has two drawbacks. The first is the duration of rule extraction, which is short for all the datasets used in this work but would be prohibitive with big data. The second drawback is that, after reextraction of the rules, the new ruleset could have changed considerably, so that it could take time for the physician to understand it.
To minimize the number of times a new sample remains unexplained, we can increase fidelity. The basic idea consists of aggregating the rules extracted from several models. With the use of unordered rules representing single pieces of knowledge, even if their number is greater than that obtained with a single model, their comprehension could be possible in a reasonable amount of time. In the next experiment we consider combinations of five models (out of 17) by majority voting, although the number of extracted rules then increases roughly fivefold. When rules of different classes are activated, we ignore the rules whose class differs from the majority voting response (this corresponds to the first strategy in Section 2.1.4). This approach was applied to 10 classification problems. Table 14 shows the obtained results for all the possible combinations of five aggregated rulesets, which number 6188. The second column represents the average, over the 6188 possible combinations, of the rulesets' average predictive accuracies (with the standard deviation). Columns three and four show the minimal and maximal rulesets' predictive accuracy, and the last column is the average of the average fidelity. It is worth noting that this last value is always above 99.6% and that the average of the average ruleset accuracy is greater than the best corresponding values shown in Table 2 (fifth column).
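The count of 6188 is the number of ways to choose 5 models out of 17, and the aggregation step can be sketched briefly. The code below is an illustrative reading of the majority-voting strategy, not the authors' implementation; `majority_vote`, `aggregate`, and the example predictions are hypothetical, and `None` is used here to mark a model whose ruleset fires no rule on the sample.

```python
from collections import Counter
from itertools import combinations
from math import comb
from typing import List, Optional, Sequence, Tuple

N_MODELS, GROUP_SIZE = 17, 5

# Number of distinct groups of five models among the 17 classifiers.
n_groups = comb(N_MODELS, GROUP_SIZE)
n_enumerated = sum(1 for _ in combinations(range(N_MODELS), GROUP_SIZE))


def majority_vote(model_preds: Sequence[int]) -> int:
    """Class predicted by most models (no ties with five models and two classes)."""
    return Counter(model_preds).most_common(1)[0][0]


def aggregate(model_preds: Sequence[int],
              rule_classes: Sequence[Optional[int]]) -> Tuple[int, List[int]]:
    """Return the voted class and the indexes of models whose rules support it.

    Rules whose class differs from the majority vote are ignored.
    """
    vote = majority_vote(model_preds)
    supporting = [i for i, c in enumerate(rule_classes) if c == vote]
    return vote, supporting


# Invented example: five models, two of which have no activated rule.
vote, supporting = aggregate([1, 1, 0, 1, 0], [1, None, 0, 1, None])
```

A sample stays unexplained only when `supporting` is empty, which becomes less likely as more rulesets are pooled; this is the mechanism by which aggregation can raise fidelity at the price of a larger rule base.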

Conclusion
In this work, the DIMLP model was used to extract unordered rules from ensembles of DIMLPs, boosted shallow trees, and Support Vector Machines. Experiments were performed on 25 datasets by 10 repetitions of 10-fold cross-validation. We measured the predictive accuracy of the generated rulesets, their complexity, and their fidelity. For the 17 classifiers used in this study, we emphasized a strong relationship between average complexity and average fidelity of the extracted rulesets. As a result, we obtained a spectrum of models showing a clear trade-off between fidelity and complexity. At one end lie the decision stumps trained by modest Adaboost, for which the least complex rulesets are generated, also bringing the best fidelity, on average. At the other end lie models with the highest complexity and lowest fidelity, corresponding to BSTs trained by real Adaboost and gentle Adaboost. The average complexity of rulesets produced by BSTs increases with the number of splitting nodes. Another trade-off is between the covering of testing samples by rules and predictive accuracy. We clearly pointed out that the average predictive accuracy is better when we ignore the test samples for which models and rules disagree. Intuitively, this can be explained by the fact that when models and rules disagree the classification is somewhat more uncertain. By aggregating the responses of several models it was possible to increase both fidelity and predictive accuracy. Nevertheless, this also increased complexity.
Very few works have systematically assessed symbolic rules generated from connectionist models by cross-validation. Hence, our work could be useful in the future to researchers who would like to compare their results. So far, the comparison with a work in which rules were extracted from MLP ensembles was in our favour for both fidelity and predictive accuracy in eight out of nine classification problems. Moreover, with respect to two datasets from which rules were generated from SVMs, we obtained better fidelity, with predictive accuracy being greater in one of the problems and slightly worse in the other. Lastly, we would like to encourage researchers to perform systematic experiments by 10-fold cross-validation to assess their rule extraction algorithms applied to neural networks.

Table 11 (partial rows):
[9] 96.3 (0.2) 3 -
FSM (10 CV) [9] 96.5 12 -
MINERVA (10 CV) [10] 94.5 (1.5) 4.2 3.3
NeuroLinear + GRG (10 CV) [11] 96.0 2 -
Re-RX + J48graft (10 × 10 CV) [12] 95.

2.1.1. DIMLP Architecture. The activation function in the output layer is a sigmoid function given as $\sigma(x) = \frac{1}{1 + \exp(-x)}$.

Figure 1: A DIMLP network creating two discriminative hyperplanes. The activation function of neurons $h_1$ and $h_2$ is a step function, while for the output neuron it is a sigmoid.

3.1. Models and Learning Parameters. Our experiments are based on 10 repetitions of stratified 10-fold cross-validation trials. Training sets were normalized by Gaussian normalization. Specifically, the input variable averages and standard deviations calculated on a training set were used to normalize the input variables in a testing set. The following models were trained on the 25 datasets:
(i) Boosted shallow trees trained by modest boosting (BST-M)
(ii) Boosted shallow trees trained by gentle boosting (BST-G)
(iii) Boosted shallow trees trained by real boosting (BST-R)
(iv) DIMLP ensembles trained by bagging (DIMLP-B)
(v) DIMLP ensembles trained by arcing (DIMLP-A)
(vi) QSVM with dot kernel (QSVM-L)
(vii) QSVM with polynomial kernel of third degree (QSVM-P3)
(viii) QSVM with Gaussian kernel (QSVM-G).
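The normalization step described above, in which statistics are estimated on the training fold only and then reused on the testing fold, can be sketched as follows. This is a minimal standard-library illustration; the function names are hypothetical, and the use of the population standard deviation and the guard for constant columns are assumptions not stated in the text.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def fit_gaussian_normalizer(train_rows: Sequence[Sequence[float]]
                            ) -> Tuple[List[float], List[float]]:
    """Per-input mean and standard deviation, computed on the training fold only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Assumption: a constant input keeps a unit scale to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_normalizer(rows: Sequence[Sequence[float]],
                     means: Sequence[float], stds: Sequence[float]) -> Matrix:
    """Normalize any fold (training or testing) with the training statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Tiny invented fold: the second input is constant in the training set.
train = [[1.0, 10.0], [3.0, 10.0]]
test = [[2.0, 12.0]]
means, stds = fit_gaussian_normalizer(train)
train_norm = apply_normalizer(train, means, stds)
test_norm = apply_normalizer(test, means, stds)
```

Fitting the statistics inside each cross-validation fold, rather than on the whole dataset, keeps information from the testing samples out of the training procedure.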

Figure 4: A shallow decision tree with two splitting nodes. Three rules are obtained from the paths between the root and the leaves.

Figure 5: Three symbolic rules represented by three DIMLP networks with a step activation function in the hidden layer and a sigmoid function in the output layer (see the rules in the text). The weight values indexed 1 and 2 are constants denoting thresholds of rule antecedents.

Figure 6: Boxplots of the average log-complexity of the extracted rulesets (y-axis) with respect to each model (x-axis). Complexity corresponds to the number of rule antecedents per ruleset. With respect to the x-axis, indexes 1 to 4 indicate BST-M with the split parameter varying from 1 to 4, indexes 5 to 8 are related to BST-G, indexes 9 to 12 indicate BST-R results, and finally indexes 13 to 17 correspond to DIMLP-B, DIMLP-A, QSVM-L, QSVM-P3, and QSVM-G, respectively.

Figure 7: Boxplots of the average predictive accuracy of the extracted rulesets (y-axis) with respect to each model (x-axis).

Figure 8: Boxplots of the average fidelity of the extracted rulesets.

Figure 9: Plot of average fidelity versus average number of antecedents per ruleset.

Figure 10: Average difference in predictive accuracy between a model and its generated rulesets. The lower part with negative values is obtained when rules and network agree.
For each rule $R_k$ in a ruleset $R$, let $n_k$ be the number of antecedents of $R_k$; then let us define a DIMLP$_k$ network with $n_k$ inputs, $n_k$ neurons in the hidden layer, and two output neurons.
(3) For each DIMLP$_k$ coding a unique rule $R_k$ in $R$ and for the $i$th antecedent $a_i$ in $R_k$, such as $x_i \geq t_i$ ($t_i$ being a constant), $w_{0i} = -t_i$ and $w_i = 1$, with $w_{0i}$ being the weight value between the bias neuron and hidden neuron $i$ and $w_i$ being the weight value between input neuron $i$ and hidden neuron $i$.
(4) For each DIMLP$_k$ coding a unique rule $R_k$ in $R$ and for each antecedent $a_i$ in $R_k$, such as $x_i < t_i$, $w_{0i} = t_i$ and $w_i = -1$.
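The weight assignment of steps (3) and (4) can be checked with a small sketch. It assumes a step function that fires when its input is non-negative, so that an antecedent of the form "input below threshold" is treated as "below or equal" exactly at the threshold; the output layer is replaced by a plain conjunction of the hidden activations, since its weights are not given here. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Antecedent = Tuple[int, str, float]     # (input index, ">=" or "<", threshold)
HiddenUnit = Tuple[int, float, float]   # (input index, bias weight, input weight)


def encode_antecedent(op: str, threshold: float) -> Tuple[float, float]:
    """Return (bias weight, input weight) for the hidden neuron of one antecedent."""
    if op == ">=":
        return -threshold, 1.0   # step (3): bias = -t, weight = 1
    if op == "<":
        return threshold, -1.0   # step (4): bias = t, weight = -1
    raise ValueError(f"unsupported operator: {op!r}")


def encode_rule(antecedents: Sequence[Antecedent]) -> List[HiddenUnit]:
    """One hidden neuron per antecedent, each connected to a single input."""
    return [(i, *encode_antecedent(op, t)) for i, op, t in antecedents]


def step(z: float) -> int:
    # Assumed convention: the neuron fires when its net input is non-negative.
    return 1 if z >= 0 else 0


def rule_network_fires(x: Sequence[float], hidden: Sequence[HiddenUnit]) -> bool:
    """Stand-in for the output layer: the rule holds when every hidden neuron fires."""
    return all(step(bias_w + input_w * x[i]) == 1 for i, bias_w, input_w in hidden)


# Invented rule: first input at least 0.5 and second input below 0.2.
hidden = encode_rule([(0, ">=", 0.5), (1, "<", 0.2)])
fires = rule_network_fires([0.7, 0.1], hidden)
blocked_by_second = rule_network_fires([0.7, 0.9], hidden)
blocked_by_first = rule_network_fires([0.3, 0.1], hidden)
```

Because each hidden neuron sees only one input, its step activation reproduces exactly one axis-parallel threshold test, which is what makes the rule and the network interchangeable.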

Table 1: Datasets used in the experiments.

Table 2: Comparison between the average predictive accuracy obtained by the best model (column three) and the average predictive accuracy of the best extracted rulesets (column five). The last three columns indicate the average fidelity, the average number of generated rules, and the average number of antecedents per rule, respectively.
Table 11 presents rule extraction results with respect to the Breast Cancer classification problem. Only the last two rows concern rule extraction from ensembles.

Table 3: Comparison between the average predictive accuracy obtained by the best model (column three) and the average predictive accuracy of the best extracted rulesets (column five) when rules and model agree on classification. The last three columns indicate the average fidelity, the average number of generated rules, and the average number of antecedents per rule, respectively.

Table 4: Average predictive accuracy and standard deviations of the extracted rules for boosted shallow trees trained by modest boosting and gentle boosting.

Table 5: Average predictive accuracy and standard deviations of the extracted rules for BST-R, DIMLP ensembles, and QSVMs.

Table 6: Average complexity of the extracted rules given by the number of rules and antecedents per rule for BST-M and BST-G.

Table 7: Average complexity of the extracted rules given by the number of rules and antecedents per rule for BST-R, DIMLPs, and QSVMs.

Table 8: Average fidelity and standard deviations of the extracted rules for BST-M and BST-G.

Table 9: Average fidelity and standard deviations of the extracted rules for BST-R, DIMLP ensembles, and QSVMs.

Table 11: Comparison of rule extraction algorithms for the Breast Cancer classification problem, based on 10-fold cross-validation on single networks or ensembles of neural networks (last two rows).

Table 12: Comparison on neural network ensembles with respect to Trepan [13] and G-REX [14]. Our results in the last three columns are those provided by DIMLP-B.

Table 13: Rule extraction comparison for SVMs.

Table 14: Average predictive accuracy and average fidelity of rulesets by aggregating five models.