Extracting Credible Dependencies for Averaged One-Dependence Estimator Analysis

Of the numerous proposals to improve the accuracy of naive Bayes (NB) by weakening its conditional independence assumption, the averaged one-dependence estimator (AODE) demonstrates remarkable zero-one loss performance. However, selecting superparent attributes indiscriminately incurs considerable computational cost and can harm classification accuracy. In this paper, to extract the most credible dependencies, we present a new type of seminaive Bayesian operation, which selects superparent attributes by building a maximum weighted spanning tree and removes highly correlated children attributes by functional dependency and canonical cover analysis. Our extensive experimental comparison on UCI data sets shows that this operation efficiently identifies possible superparent attributes at training time and eliminates redundant children attributes at classification time.


Introduction
Bayesian networks (BNs) are a key research area of knowledge discovery and machine learning. A BN consists of two parts: a qualitative part and a quantitative part. The qualitative part denotes the graphical structure of the network, while the quantitative part consists of the conditional probability tables (CPTs) in the network. Although BNs support efficient inference algorithms, the quantitative part is a complex component, and learning an optimal BN structure from existing data has been proven to be an NP-hard problem. The graphical structure of naive Bayes (NB) is simple and definite because of the conditional independence assumption between attributes, making NB efficient and effective [1,2]. However, violations of this conditional independence assumption can make the classification of NB suboptimal. Numerous algorithms have been proposed to retain the desirable simplicity and efficiency of NB while alleviating the problems of the independence assumption. The averaged one-dependence estimator (AODE) [3,4] utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. A superparent attribute is indiscriminately selected from the attribute set as the parent of all the other attributes in each ODE. By averaging the estimates of all of the three-dimensional estimators, AODE makes a weaker conditional independence assumption than NB. Previous studies that compared different variations of NB techniques show that AODE is significantly better than other NB techniques in terms of zero-one loss reduction [5]. Since its introduction in 2005, AODE has enjoyed considerable popularity because of its capability to improve the accuracy of NB [5].
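To make the averaging concrete, the following is a minimal sketch of the AODE decision rule for discrete attributes, using raw relative frequencies without smoothing. It is an illustration under assumed data layout (list of (attribute-tuple, class) pairs), not the authors' implementation.

```python
def aode_predict(train, classes, x):
    """Minimal AODE sketch: average one-dependence estimators, each
    using attribute i as the superparent of all other attributes.
    Superparents whose joint frequency with the class is zero are
    skipped (a stand-in for AODE's minimum-frequency condition)."""
    n = len(x)
    t = len(train)
    scores = {}
    for y in classes:
        total = 0.0
        for i in range(n):
            # P(y, x_i): joint frequency of class and superparent value
            den = sum(1 for xs, c in train if c == y and xs[i] == x[i])
            if den == 0:
                continue
            prob = den / t
            for j in range(n):
                if j == i:
                    continue
                # P(x_j | y, x_i), estimated by relative frequency
                num = sum(1 for xs, c in train
                          if c == y and xs[i] == x[i] and xs[j] == x[j])
                prob *= num / den
            total += prob
        scores[y] = total
    return max(scores, key=scores.get)
```

On a toy data set where the class equals the first attribute, the rule recovers that attribute's value.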
Another strategy to remedy violations of the attribute independence assumption is to eliminate highly correlated attributes. Backward sequential elimination (BSE) [6] uses a simple heuristic wrapper approach that selects a subset of the available attributes to minimize zero-one loss on the training set. BSE is effective especially for data sets with highly correlated attributes. Forward sequential selection (FSS) [7] uses the reverse search direction to BSE.

Figure 1: The network structure of NB and AODE.

However, both FSS and BSE have high computational overheads, especially on learning algorithms with high classification time complexity, because they apply the algorithms repeatedly until no accuracy improvement occurs. Subsumption resolution (SR) [8] identifies pairs of attribute values such that one appears to subsume (be a generalization of) the other and deletes the generalization. Near-subsumption resolution (NSR) [8] is a variant of SR; it extends SR by deleting not only generalizations but also near-generalizations. For different instances, SR and NSR may find different attributes to remove, making them much more flexible than BSE and FSS. Since generalization mainly deals with pairs of attributes, however, it offers no solution for more complicated situations such as loop relationships. In this paper, we present a new type of seminaive Bayesian operation, which selects parent (SP) attributes by building a maximum weighted spanning tree and removes children (RC) attributes by functional dependency and canonical cover analysis. This algorithm thus combines the advantages of BSE, FSS, and SR. The remainder of the paper is organized as follows. Section 2 introduces the basic ideas of NB, AODE, and related background theory. Section 3 introduces the SP and RC techniques for attribute selection and elimination with AODE and presents the theoretical justification. Section 4 shows the experimental results on UCI data sets and a detailed analysis of different attribute selection techniques. The final section concludes the paper.

Related Research Work
Then, the following equation is often calculated in practice rather than (1):

P̂(y | x1, …, xn) ∝ P̂(y) ∏i P̂(xi | y).

The corresponding network structure is depicted in Figure 1(a). One advantage of NB is avoiding model selection, because selecting between alternative models can be expected to increase variance and allow a learning system to overfit the training data [3]. In consequence, changes in the training data will not lead to any change in the structure of NB, which leads in turn to lower variance [4]. For approaches with a definite model form such as NB, the underlying conditional probability tables will change correspondingly when the training data changes, resulting in relatively gradual changes in the pattern of classification.
Numerous techniques have sought to enhance the accuracy of NB by relaxing the conditional independence assumption while retaining the efficiency and efficacy of one-dependence classifiers. Among them, the averaged one-dependence estimator (AODE) [3,4] utilizes a restricted class of one-dependence estimators (ODEs) and aggregates the predictions of all qualified estimators within this class. A superparent attribute (e.g., Xi) is selected as the parent of all the other attributes in each ODE, since

P̂(y | x1, …, xn) ∝ Σi P̂(y, xi) ∏j≠i P̂(xj | y, xi).

The corresponding network structure of AODE is depicted in Figure 1(b). AODE maintains the robustness and much of the efficiency of NB and at the same time exhibits significantly higher classification accuracy for many data sets. Therefore, it has the potential to be a valuable substitute for NB over a considerable range of classification tasks.
2.2. Related Background Theory. In the following discussion, capital letters (X, Y, Z, …) are used to denote attributes or sets of attributes. Lowercase letters represent the specific values taken by the corresponding attributes (e.g., x represents X = x). P(⋅) denotes the probability and P̂(⋅) denotes the estimate of P(⋅). Given a relation R (in a relational database), attribute Y of R is functionally dependent on attribute X of R, and X of R functionally determines Y of R (in symbols X → Y), if each X-value is associated with precisely one Y-value. Armstrong (1974) proposed in [9] a set of axioms (or, more precisely, inference rules) to infer all the functional dependencies (FDs) on a relational database, which represent the expert knowledge of the organizational data and their interrelationships. The axioms mainly include the following rules.
(i) Augmentation rule: if X → Y holds and Z is a set of attributes, then XZ → YZ.
(ii) Transitivity rule: if X → Y and Y → Z hold, then X → Z.
(iii) Union rule: if X → Y and X → Z hold, then X → YZ.
(iv) Decomposition rule: if X → YZ holds, then X → Y and X → Z.
(v) Pseudotransitivity rule: if X → Y and WY → Z hold, then WX → Z.
Based on the aforementioned rules, we use the FD rules of probability in [10,11] to link FD and probability theory. The FD-probability link includes the following rules.
(i) Representation equivalence rule of probability: suppose data set T consists of two attribute sets {X, Y} and Y can be inferred from X; that is, the FD X → Y holds; then the following joint probability distribution holds: P(X) = P(X, Y).

(ii) Augmentation rule of probability: if X → Y holds and Z is a set of attributes, then the following joint probability distribution holds: P(X, Z) = P(X, Z, Y).

(iii) Transitivity rule of probability: if X → Y and Y → Z hold, then the following joint probability distribution holds: P(X) = P(X, Z).

(iv) Pseudotransitivity rule of probability: if X → Y and WY → Z hold, then the following joint probability distribution holds: P(W, X) = P(W, X, Z).

In the 1940s, Claude E. Shannon introduced information theory, the theoretical foundation of modern digital communication. Although Shannon was principally concerned with the problem of electronic communications, the theory has much broader applicability. Two commonly used definitions from information theory are described as follows.
Definition 1. Mutual information I(X; Y) measures the information quantity that is transferred between attributes X and Y:

I(X; Y) = Σx Σy p(x, y) log ( p(x, y) / (p(x) p(y)) ),

where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y, respectively. High mutual information indicates a strong relationship between X and Y, and zero mutual information between two random variables means they are independent.

Definition 2. Conditional mutual information I(X; Y | Z) measures the information shared between X and Y when Z is known:

I(X; Y | Z) = Σx Σy Σz p(x, y, z) log ( p(x, y | z) / (p(x | z) p(y | z)) ).
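Estimating mutual information from paired samples can be sketched as follows; this is illustrative code (base-2 logarithm, plug-in frequency estimates), not part of the original paper.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = sum_{x,y} p(x,y) *
    log2( p(x,y) / (p(x) p(y)) ) from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * log2(pj / ((px[x] / n) * (py[y] / n)))
    return mi
```

Two identical binary variables share one bit of information; two independent ones share none.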
Proof. By applying the augmentation rule and the decomposition rule, from X2 → X1 we can obtain X2 → X1X2. By applying the representation equivalence rule of probability and the augmentation rule of probability, we can obtain P(X2) = P(X1, X2). Then, combining these equalities yields the claim of Theorem 3.

Mathematical Problems in Engineering
We can also prove Theorem 3 from the viewpoint of information theory. End of the proof.
For AODE, (4) takes the corresponding averaged form. Thus, if the FD X2 → X1 is neglected, the contribution of X1 to classification will be counted repeatedly in each ODE, and the classification result may be wrong.

Attribute Selection and Elimination
AODE makes a weaker attribute conditional independence assumption than that of NB. It selects each attribute in turn as the superparent of one ODE submodel, and the other attributes are supposed to be conditionally independent given the superparent and the class. Previous studies have demonstrated that AODE has a considerably lower bias than that of NB, with moderate increases in variance and time complexity [5]. The same attribute may play different roles (either parent or child) in different ODE submodels. In the following discussion, we repair harmful interdependencies from two viewpoints: (1) select parent (P) attributes (SP) by building a maximum weighted spanning tree; (2) remove children (C) attributes (RC) by functional dependency analysis.

How to Select Parent Attributes
SP selects the branch nodes of the maximum weighted spanning tree (MST) as the P attributes. The learning procedure can be summarized as follows.
(1) Use conditional mutual information (CMI) to measure the weights of edges between each pair of attributes. The P attributes selected must satisfy the criterion that they either appear as branch nodes in the MST or appear as leaf nodes but with a strong relationship with other attributes. Figures 2(a), 2(b), and 2(c) show the original spanning tree, the procedure of selecting edges, and the final MST, respectively. As shown in Figure 2(c), attributes B, C, and F are branch nodes and can be used as P attributes. In addition, A, D, and E are leaf nodes with corresponding CMIs of 7, 6, and 2, respectively. The CMIs are then sorted into descending order. In this paper, if the sum of the CMIs of the first m leaf nodes is greater than 85% of the sum of the CMIs of all leaf nodes, we suppose that they represent the most important marginal relationships and can also be selected as P attributes. For example, since 7 + 6 = 13 > 0.85 × (7 + 6 + 2) = 12.75, A and D can be used as P attributes. This criterion helps to ensure that strong, and only strong, relationships among attributes will be retained. By contrast, AODE [4] indiscriminately uses each attribute as a superparent even if some attributes may be independent of others. Besides, SP supports incremental learning because it may reselect the subset of attributes when a new training instance becomes available.
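The leaf-selection criterion above can be sketched as follows; the function name and argument layout are illustrative, and the test reproduces the worked example (branch nodes B, C, F; leaf CMIs 7, 6, 2 with the 85% cutoff).

```python
def select_parent_attributes(branch_nodes, leaf_cmi, threshold=0.85):
    """Branch nodes of the MST are always P attributes; leaf nodes are
    then added in descending CMI order until the leaves already chosen
    account for at least `threshold` of the total leaf CMI."""
    parents = list(branch_nodes)
    total = sum(leaf_cmi.values())
    acc = 0.0
    for leaf, w in sorted(leaf_cmi.items(), key=lambda kv: -kv[1]):
        if acc >= threshold * total:
            break  # the retained leaves already cover the threshold
        parents.append(leaf)
        acc += w
    return parents
```

With leaves A (7), D (6), and E (2), the first two leaves cover 13 of 15 (more than 85%), so E is excluded.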
At training time SP needs only to form the tables of joint attribute-value and class frequencies required to estimate the probabilities P̂(y), P̂(y, xi), and P̂(y, xi, xj), which are in turn required for estimating P̂(xi | y), P̂(xi, xj | y), and P̂(xi | y, xj). Calculating the estimates requires a simple scan through the data, an operation of time complexity O(t n^2), where t is the number of training instances and n is the number of attributes. To build the maximum weighted spanning tree, SP must first calculate the CMI, requiring consideration, for each pair of attributes, of every pairwise combination of their respective values in conjunction with each class value. The time complexity of building the MST is O(n^2 log n). The resulting time complexity is O(t n^2 + k(nv)^2 + n^2 log n) and the space complexity is O(k(nv)^2), where k is the number of classes and v is the average number of values per attribute. At classification time SP needs only to store the probability tables, with space complexity O(k(nv)^2). This compression over the table required at training time is achieved by storing probability estimates for each attribute value conditioned by the parent selected for that attribute and the class. The time complexity of classifying a single instance is O(k n^2).

How to Eliminate Children Attributes
Kohavi and Wolpert [12] presented a bias-variance decomposition of expected misclassification rate, a powerful tool from sampling-theory statistics for analyzing supervised learning scenarios. Suppose y and ŷ are the true class label and the class generated by a learning algorithm, respectively; the zero-one loss function is defined as

loss(y, ŷ) = 1 − δ(y, ŷ),

where δ(y, ŷ) = 1 if ŷ = y and 0 otherwise. The bias term measures the squared difference between the average output of the target and the algorithm. This term is defined as follows:

bias_x = (1/2) Σy [P(y | x) − P̂(ŷ = y | x)]^2,

where x is the combination of any attribute values. The variance term is a real-valued nonnegative quantity and equals zero for an algorithm that always makes the same guess regardless of the training set. The variance increases as the algorithm becomes more sensitive to changes in the training set. It is defined as follows:

variance_x = (1/2) (1 − Σy P̂(ŷ = y | x)^2).

Moore and McCabe [13] illustrated bias and variance through shooting arrows at a target, as described in Figure 3. The perfect model can be regarded as the bull's eye on a target and the learned classifier as an arrow fired at the bull's eye. Bias and variance describe what happens when an archer fires many arrows at the target. Bias means that the aim is off and the arrows land consistently off the bull's eye in the same direction. Variance means that the arrows are scattered. Large variance means that repeated shots are widely scattered on the target: they do not give similar results but differ widely among themselves.
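A minimal sketch of the two Kohavi-Wolpert terms at a single test point follows, assuming the target distribution P(y | x) and a list of predictions from classifiers trained on repeatedly drawn training sets are given; all names are illustrative.

```python
from collections import Counter

def kw_bias_variance(true_dist, predictions):
    """Kohavi-Wolpert decomposition at one test point x.
    true_dist:   {class: P(y | x)}, the target distribution.
    predictions: classes output by classifiers trained on repeated
                 training-set draws, giving the estimate P̂(ŷ=y | x).
    bias_x = 0.5 * sum_y (P(y|x) - P̂(ŷ=y|x))**2
    var_x  = 0.5 * (1 - sum_y P̂(ŷ=y|x)**2)
    """
    n = len(predictions)
    phat = Counter(predictions)
    classes = set(true_dist) | set(phat)
    bias = 0.5 * sum((true_dist.get(y, 0.0) - phat.get(y, 0) / n) ** 2
                     for y in classes)
    var = 0.5 * (1 - sum((phat.get(y, 0) / n) ** 2 for y in classes))
    return bias, var
```

A classifier that always makes the same correct guess has zero bias and zero variance; one that guesses between two classes against a deterministic target splits the error between the two terms.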
It is reported that removing redundant children attributes from within ODEs can help to decrease both bias and zero-one loss [3,14]. Subsumption resolution (SR) [8] identifies pairs of attribute values such that one can replace the other. SR mainly considers the one-one pair relationship. However, four basic relationships exist in the real world: one-one, one-many, many-many, and many-one. These four relationships can be grouped into two sets: one-one and many-one. Thus, SR cannot resolve interdependencies when a loop appears in the many-one relationship. The data presented in Table 1 show a loop example with four attributes {X1, X2, X3, X4} and class label Y. Consider the first instance {X1 = x1, X2 = x2, X3 = x3, X4 = x4}. The loop relationship is described in Figure 4(a), where "→" represents the one-one relationship and "[" represents the many-one relationship. After SR, only attribute X3 is used for classification, and NB will misclassify the first instance, even though it occurs in the training data. For different testing instances, different correlated attributes will be deleted. These instances will be illustrated from the viewpoint of FD. The attribute values of the first instance can be replaced by three FDs, from which the following results can be generated by applying the union and augmentation rules. As Figure 4(c) shows, the arc from X4 to X1 is removed to avoid a loop relationship. Thus, from the two attribute values of {X1 = x1, X3 = x3} we can infer and obtain the other two attribute values. Correspondingly, the first instance will be correctly classified.

Figure 4: Loop relationship between attributes of a training instance.
It should be noted that SP selects P attributes from the probabilistic viewpoint by calculating CMI, while RC removes C attributes from the logical viewpoint by inferring FDs from the training data. That is, the learning procedure of SP + RC is divided into two parts: SP roughly describes the basic structure of each submodel, which uses P attributes as superparents and the other attributes as children; and for different instances, RC further refines the model by deleting redundant children attributes, thus making the final model much more flexible and robust. Consider an extreme case in which the CMIs of all attributes are small and equal; then all the attributes will be selected as P attributes, and after applying SP the structure will be just the same as AODE. But for different testing instances, different FDs can help to make each submodel express the key dependencies. For example, suppose FD1 = {X1 → X2} holds for instance-1 and FD2 = {X2 → X3} holds for instance-2; Figures 5, 6, and 7 show the original AODE structure after applying SP and the corresponding structures for instance-1 and instance-2, respectively.
Discovering FDs from existing databases is an important issue. It has long been investigated and has recently been addressed from a data mining viewpoint in a novel and efficient way. Rather than exhibiting the set of all functional dependencies that hold in a relation, related work aims to discover a smaller cover equivalent to this set. This problem is known as FD inference. Association rules can be used to discover the relationships and potential associations of items or attributes in huge data sets. These rules can be effective in uncovering unknown relationships, providing results that can serve as a basis for forecasting and decision making. They have proven to be useful tools for enterprises striving to improve their competitiveness and profitability.

Experimental Study
We expect AODE with SP and RC to exhibit low zero-one loss and low bias. Thus, we compare the performance of the system with the following attribute selection methods. First is P attribute addition (PAA), which starts with the P set and C set initialized to the empty and full sets, respectively; it adds one P attribute to each ODE at each step. Second is C attribute addition (CAA), which begins with the P set and C set initialized to the full and empty sets, respectively; it adds one C attribute to every ODE at each step. Third is P attribute elimination (PAE), which starts with both sets initialized to the full set and deletes one P attribute from every ODE at each step. Fourth is C attribute elimination (CAE), which deletes one C attribute from every ODE at each step. Fifth and sixth are SR and NSR, respectively.
Table 2 summarizes the characteristics of each data set, including the numbers of instances, attributes, and classes. Missing values for qualitative attributes are replaced with modes, and those for quantitative attributes are replaced with means, computed from the training data. We estimate the base probabilities P̂(y), P̂(y, xi), and P̂(y, xi, xj) using the Laplace estimate as follows [15]:

P̂(y) = (F(y) + 1) / (K + k),
P̂(y, xi) = (F(y, xi) + 1) / (Ki + k·vi),
P̂(y, xi, xj) = (F(y, xi, xj) + 1) / (Kij + k·vi·vj),

where F(⋅) is the frequency with which a combination of terms appears in the training data, K is the number of training instances for which the class value is known, Ki is the number of training instances for which both the class and attribute Xi are known, and Kij is the number of training instances for which all of the class and attributes Xi and Xj are known; k is the number of class values, k·vi is the number of attribute value combinations of Y and Xi, and k·vi·vj is the number of attribute value combinations of Y, Xi, and Xj. As NB and AODE require discrete-valued data, all data were discretized using minimum description length (MDL) discretization [16]. Classifiers are formed from each data set, and bias, variance, and zero-one loss are estimated from the performance of those classifiers on the same data set. Experiments are performed on a dual-processor 3.1 GHz Windows XP computer with 3.16 GB RAM. All algorithms are applied to the 22 data sets described in Table 2.
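The Laplace estimates above can be sketched as follows; `laplace_estimates` and its argument layout are assumptions for illustration, not the authors' code.

```python
from collections import Counter

def laplace_estimates(train, k, v):
    """Laplace-smoothed base probabilities for AODE.
    train: list of (attribute-tuple, class) pairs.
    k:     number of class values.
    v[i]:  number of values of attribute i.
    P̂(y)      = (F(y) + 1) / (K + k)
    P̂(y, x_i) = (F(y, x_i) + 1) / (K + k*v_i)   (here K_i = K: no
    missing values, so all instances have class and attribute known)."""
    K = len(train)
    Fy = Counter(c for _, c in train)

    def p_y(y):
        return (Fy[y] + 1) / (K + k)

    def p_y_xi(y, i, xi):
        F = sum(1 for xs, c in train if c == y and xs[i] == xi)
        return (F + 1) / (K + k * v[i])

    return p_y, p_y_xi
```

By construction the estimates sum to one over all class values (and over all class/attribute-value combinations), which is a useful sanity check.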
4.1. Zero-One Loss, Bias, and Variance Results. Table 3 presents the zero-one loss for each data set, estimated by 50 runs of twofold cross-validation to give an accurate estimate of the average performance of each algorithm. The advantage of this technique is that it uses the full training data as both the training set and the testing set; moreover, every case in the data is used the same number of times in each of the roles of training and testing. Tables 4 and 5 provide the bias and variance results, respectively. The zero-one loss, bias, or variance across multiple data sets provides a gross measure of relative performance.
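Repeated twofold cross-validation can be sketched as follows; this is an illustrative helper, not the authors' experimental harness.

```python
import random

def repeated_twofold_cv(data, runs=50, seed=0):
    """Yield (train, test) splits for `runs` runs of twofold
    cross-validation: each run shuffles the indices, splits them in
    half, and uses each half once for training and once for testing,
    so every instance serves equally often in both roles."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    for _ in range(runs):
        rng.shuffle(idx)
        half = len(idx) // 2
        a, b = idx[:half], idx[half:]
        for tr, te in ((a, b), (b, a)):
            yield [data[i] for i in tr], [data[i] for i in te]
```

Each run produces two complementary splits, so every instance appears exactly once as a test case per run.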
The basic relationships among attributes can be clearly observed by building the MST. If one attribute is connected with several other attributes, the attribute is supposed to have crossed functional zones and will be selected as a P attribute to retain complementarity. If one attribute is connected with only one other attribute, its independence characteristics may be obvious, and it will be reconsidered according to the weight of its CMI. Besides, RC helps to detect situations in which the relationships that hold in the MST need to be refined. Table 3 shows that the advantage of SP + RC is significant compared with SR and NSR in terms of zero-one loss. However, SR and NSR have a significant advantage over CAA, PAA, CAE, and PAE. The disappointing performances of CAA, PAA, CAE, and PAE can be ascribed to their susceptibility to being trapped in poor selections by local minima during the first several additions or deletions.
The records in Table 4 show that all the attribute selection algorithms applying SR, NSR, or FDs have a significant advantage in bias over CAE and PAE. In addition, CAE and PAE outperform CAA and PAA. However, comparing SP + RC with SP again shows no obvious difference. This result indicates that SP plays the main role in classification and that its effect differs greatly across data sets. The same result can also be inferred by comparing SP + RC with SR. The training sets, containing only 25% of each data set for bias-variance evaluation, are small because the data sets are primarily small. The bias of SP + RC decreases as training set size increases because more data lead to more accurate probability distribution estimates and hence to more appropriate P attribute selection. Of these algorithms, RC, SR, and NSR have the weakest sensitivity to changes in the training data because they can utilize the testing set to infer rules for C attribute elimination. By contrast, PAA, PAE, CAA, and CAE perform model selection, and their biases differ greatly with different training data.
With respect to variance, as Table 5 shows, SP + RC does not exhibit an obvious advantage over the other algorithms. Low-variance algorithms tend to enjoy an advantage with small data sets, whereas low-bias algorithms tend to enjoy an advantage with large data sets. Cross-data-set experimental studies of the traditional form presented above also support this hypothesis. The main reason may be that the relationship inferred from the MST can overfit the training data, because SP needs to calculate CMI to construct the MST, which requires enough instances to achieve precise probability estimation.
By contrast, canonical cover analysis [17] can work with a limited number (e.g., 100) of instances. The reason for SP's outstanding performance on zero-one loss reduction is that it fully utilizes the probabilistic dependency relationships in the training data. For example, the elimination ratio of SP is as high as 61.5% for the data set "anneal," yet the corresponding zero-one loss is lower than that of PAA and PAE. The main reason may be that, with as many as 38 attributes and only 894 instances, some attributes may have crossed functional zones, and only a few attributes may play the decisive role. After calculating and comparing the sum of CMI between one attribute and all the other attributes, most of the eliminated attributes turn out to have weak relationships with other attributes or to be nearly independent of them.
SP selects P attributes based on the MST; attributes with strong relationships among them will be selected first. If any attribute is removed by mistake, the classification results will not be affected greatly. However, for different training sets, especially very small ones, the conditional distribution estimates may differ greatly and different MST structures may be obtained; different P attributes will then be selected for classification. The number of FDs extracted by RC is smaller than that extracted by SR because numerous C attributes are eliminated during SP; this is especially visible for the data set "audio," which has fewer C attributes and much more complicated FDs for removing C attributes. In the W/D/L records, the advantage in zero-one loss is significant for SP versus PAE and SP versus PAA, but not for SP + RC versus SP. This result shows that the advantage of SP + RC comes from SP rather than from RC.
With an increasing number of attributes, more RAM is needed to store the joint probability distributions. An important restriction of our algorithm is that the number of attributes on the left side of an FD should be no more than 2. To observe the effect of SP + RC and SR on each data set, we calculate the C attribute elimination ratio by the following criterion:

Ratio = Σi mi / (n · s),

where mi is the number of C attributes eliminated for the ith instance, n is the number of attributes, and s is the size of the data set. Table 8 shows the comparison results of the Ratio of SP + RC with the other three C attribute elimination algorithms, CAA, CAE, and SR. Table 3 shows that SP + RC has a significant advantage in zero-one loss over SR and NSR, while SR and NSR outperform CAA and CAE. Comparing Table 3 with Table 8 reveals that both RC and SR can help to decrease zero-one loss. However, the effectiveness of RC relies greatly on SP, while SR can always improve the performance of AODE. If SP removes P attributes by mistake, some valuable FDs will not be extracted by RC. However, if just redundant P nodes are eliminated, RC can extract more reliable FDs than SR because RC considers all possible situations of SR.
For example, on the data set "hypothyroid" the Ratio of RC is 32%, which indicates that RC uses only approximately 30% of all attributes as C attributes. One reason for this high ratio is that SP has already eliminated 21% of the attributes of the data set. For the data set "anneal," the Ratio is also as high as 34%, yet the zero-one loss is much higher than that of the other three algorithms. This means SP has removed some P attributes by mistake, so RC cannot extract FDs that depend on those deleted P attributes. Hence, the experimental results of SP + RC could be improved further if we could find other methods to keep more valuable P attributes for classification.

Conclusion and Future Work
AODE provides an attractive framework by averaging all models from a restricted class of one-dependence classifiers: the class of all such classifiers in which all other attributes depend on a common attribute and the class attribute. The current work aims to improve accuracy by weakening the attribute independence assumption, via MST and FD analysis, without high computational overheads.
Overall, this study developed a classification learning technique that retains the simplicity and direct theoretical foundation of AODE while reducing computational overhead without incurring a negative effect on classification performance. The P attribute of AODE can also be considered a parent of the class attribute. Therefore, we hypothesize that the success of AODE and its variations may be attributed to the fact that AODE not only aggregates all models of a restricted class but also extends NB to handle the parent of the class attribute. If this hypothesis can be proven, we may be able to design a novel and perhaps more effective Bayesian classifier than AODE by constructing the Markov blanket of the class attribute.

Figure 3: Bias and variance in shooting arrows at a target.
This follows by applying the union rule. As shown in Figure 4(b), X2 disappears, and the arc that once connected X2 and X4 now extends to connect X1 and X4.
Sort the edges into descending order by CMI. Let E be the set of edges comprising the MST; initialize E = ∅.
(2) Find the remaining edge with the greatest weight and add it to E if and only if it does not form a cycle in E. If no remaining edges exist, exit and report the MST to be disconnected. (3) If E has n − 1 edges (where n is the number of vertices in the MST), stop and output E. Otherwise go to step (2).
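The steps above are essentially Kruskal's algorithm run in descending weight order; a minimal sketch with a union-find structure follows (names are illustrative).

```python
def maximum_spanning_tree(vertices, edges):
    """Kruskal-style construction of a maximum weighted spanning tree:
    sort edges by weight descending and greedily add an edge iff it
    does not close a cycle (tracked with union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        # find the root representative, with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:          # no cycle: accept the edge
            parent[ru] = rv
            mst.append((u, v, w))
        if len(mst) == len(vertices) - 1:
            break             # n - 1 edges: the tree is complete
    return mst
```

On a triangle with weights 3, 2, and 1, the two heaviest edges form the maximum spanning tree.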

Table 1: A loop relationship example.
Deleting xi from a Bayesian classifier should not be harmful when xi is a generalization of xj; that is, P(xi | xj) = 1.0. Only the attribute value xj is then necessary for classification; that is, P(y | x1, …, xn) = P(y | x1, …, xi−1, xi+1, …, xn). Such deletion may improve a classifier's estimates if the classifier makes unwarranted assumptions about the relationship of xi to the other attributes when estimating intermediate probability values, such as NB's independence assumption. Since P(xi | xj) = 1.0 can be represented as the FD xj → xi, SR and FD have the same meaning, viewed from different viewpoints.
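Detecting such generalizations from training data can be sketched as follows; the helper and its `min_count` guard are illustrative assumptions (practical SR implementations additionally require a minimum support for the value counts).

```python
def subsumed_attributes(train, instance, min_count=1):
    """For a test instance, find attribute indices i whose value is a
    generalization of some other attribute j's value: every training
    instance carrying x_j also carries x_i, i.e. P(x_i | x_j) = 1,
    which is the FD x_j -> x_i. Such x_i can be deleted before
    classification. train: list of (attribute-tuple, class) pairs."""
    n = len(instance)
    deleted = set()
    for j in range(n):
        if j in deleted:
            continue           # keep one of a mutually-implying pair
        nj = [xs for xs, _ in train if xs[j] == instance[j]]
        if len(nj) < min_count:
            continue
        for i in range(n):
            if i == j or i in deleted:
                continue
            if all(xs[i] == instance[i] for xs in nj):
                deleted.add(i)  # x_i is a generalization of x_j
    return deleted
```

In the test data below, attribute 1 always takes the value 'g' whenever attribute 0 takes 'a', so attribute 1 is deleted for the instance ('a', 'g').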

Table 7: Win/draw/loss comparison of elimination ratio of P attribute.

The canonical cover Fc is computed as follows: set Fc = F; repeat: use the union rule to replace any dependencies in Fc of the form X1 → Y1 and X1 → Y2 with X1 → Y1Y2; find a functional dependency X → Y in Fc with an extraneous attribute either in X or in Y; if an extraneous attribute is found, delete it from X → Y in Fc; until (Fc does not change).
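The loop above can be sketched as follows, with extraneous attributes detected via attribute closures; this is an illustrative textbook-style implementation, not the authors' code.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the FD set (list of
    (lhs, rhs) pairs of attribute sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def canonical_cover(fds):
    """fds: list of (set, set) pairs. Repeatedly apply the union rule
    and drop extraneous attributes until the FD set stabilizes."""
    fc = [(frozenset(l), frozenset(r)) for l, r in fds]
    while True:
        # union rule: X -> Y1 and X -> Y2 become X -> Y1Y2
        merged = {}
        for l, r in fc:
            merged[l] = merged.get(l, frozenset()) | r
        new = list(merged.items())
        # extraneous attribute on the left:
        # A in X is extraneous in X -> Y if Y ⊆ closure(X - A)
        out = []
        for l, r in new:
            for a in l:
                if r <= closure(l - {a}, new):
                    l = l - {a}
                    break
            out.append((l, r))
        new = out
        # extraneous attribute on the right:
        # A in Y is extraneous if A ∈ closure(X) once X -> Y is
        # weakened to X -> (Y - A)
        out = []
        for l, r in new:
            for a in r:
                rest = [(l2, r2 if (l2, r2) != (l, r) else r - {a})
                        for l2, r2 in new]
                if a in closure(l, rest):
                    r = r - {a}
                    break
            out.append((l, r))
        new = out
        if set(new) == set(fc):
            return new
        fc = new
```

On the textbook set {A → BC, B → C, A → B, AB → C}, the canonical cover reduces to {A → B, B → C}.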

Table 8: Win/draw/loss comparison of elimination ratio of C attribute.