A Bayesian Classifier Learning Algorithm Based on Optimization Model

The naive Bayes classifier is a simple and effective classification method, but its attribute independence assumption prevents it from expressing the dependence among attributes and limits its classification performance. In this paper, we summarize the existing improved algorithms and propose a Bayesian classifier learning algorithm based on optimization model (BC-OM). BC-OM uses the chi-squared statistic to estimate the dependence coefficients among attributes, with which it constructs the objective function as an overall measure of the dependence for a classifier structure. Therefore, the problem of searching for an optimal classifier can be turned into finding the maximum value of the objective function over the feasible region. In addition, we prove the existence and uniqueness of the numerical solution. BC-OM offers a new perspective for research on extended Bayesian classifiers. Theoretical and experimental results show that the new algorithm is correct and effective.


Introduction
With the development of information technology, in particular the progress of network, multimedia, and communication technology, the analysis and processing of massive data have become more and more important. Since the Bayesian network as a classifier has a solid mathematical basis and takes the prior information of samples into consideration, it is now one of the most active areas in machine learning and data mining. Moreover, it has been applied to a wide range of tasks such as natural spoken dialog systems, vision recognition, medical diagnosis, genetic regulatory network inference, and so forth [1][2][3][4][5][6][7][8].
Naive Bayes (NB) [9][10][11] is a simple and effective classification model. Although its performance can be comparable with that of other classification methods, such as decision trees and neural networks, its attribute independence assumption limits its practical application. Extending its structure is a direct way to overcome the limitation of naive Bayes [12][13][14], since attribute dependencies can be explicitly represented by arcs. Tree-augmented naive Bayes (TAN) [9] is an extended tree-like naive Bayes in which the class node directly points to all attribute nodes and an attribute node can have only one parent from another attribute node. On this basis, Cheng et al. presented the Bayesian-network-augmented naive Bayes (BAN) [15,16], which further expanded the tree-like structure of the TAN classifier and allowed a dependency relation between any two attribute nodes. In constructing BAN, they use a scoring function based on the minimum description length principle. Unfortunately, the search for the best network is performed in the space of all possible networks, and the number of elements in this space increases exponentially with the number of nodes; finding the best structure is NP-hard [17,18].
Based on the above analysis, this paper presents a Bayesian classifier learning algorithm based on optimization model (BC-OM) for the first time, inspired by constraint-based Bayesian network structure learning methods [19][20][21][22]. We discuss the classification principles of Bayesian classifiers from a new view. Because chi-squared tests are a standard tool for measuring the dependency between pairs of variables [23], BC-OM first introduces the chi-squared statistic to define the dependence coefficients of variables. Then, it uses the dependence coefficients to construct an overall measure of the dependence in a classifier structure, from which the objective function for our optimization model can be derived. Therefore, the problem of searching for an optimal classifier can be turned into finding the maximum value of the objective function over the feasible region. Theoretical and experimental results show that the proposed algorithm is not only effective in improving the accuracy but also has a high learning speed and a simple solving procedure.
The remainder of this paper is organized as follows. Section 2 reviews the existing Bayesian network classifiers. We describe our algorithm and its theoretical proofs in Section 3. Section 4 details the experimental procedures and results of the proposed algorithm. Finally, in Section 5, we conclude and outline our future work.

Background
In this section, we first describe some of the notation used and then discuss previous work that is relevant to this paper. We use boldface capital letters such as $\mathbf{V}$, $\mathbf{E}$, $\mathbf{X}$ for sets of variables. General variables are denoted by italic capital letters, possibly indexed (for example, $X$, $X_i$); specific values taken by these variables are denoted by the corresponding lowercase letters (for example, $x$, $x_i$). We use the same italic capital letters for the graph nodes that correspond to the random variables.
Classification is a basic task in data analysis and pattern recognition that requires the construction of a classifier, that is, a function that assigns a class label to instances described by a set of attributes. The induction of classifiers from data sets of preclassified instances is a central problem in machine learning. Let $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$ represent the variable set that corresponds to the training data set $D$. We assume that $X_1$ is the class variable and $\{X_2, X_3, \ldots, X_n\}$ is the set of attribute variables. Bayesian networks are often used for classification problems, in which the main task is to construct the classifier structure $G$ from a given set of training data with class labels and then compute the posterior probability $P(x_1 \mid x_2, x_3, \ldots, x_n)$, where $x_i$ is the value that $X_i$ takes. Thus, it only needs to predict the class with the highest posterior probability, that is,
$$x_1^{*} = \arg\max_{x_1} P(x_1 \mid x_2, x_3, \ldots, x_n).$$
According to Bayes' theorem, this posterior is proportional to $P(x_1) \cdot P(x_2, x_3, \ldots, x_n \mid x_1)$, so the classifiers differ in how they factorize the second term. In TAN, $P(x_2, x_3, \ldots, x_n \mid x_1) = \prod_{i=2}^{n} P(x_i \mid x_j, x_1)$, where $x_j \in \{x_2, x_3, \ldots, x_{(i-1)}\}$. It is an efficient extension of naive Bayes.
BAN is a specific case of the general Bayesian network classifier, in which the class node also directly points to all attribute nodes, but there is no limitation on the arcs among attribute nodes (except that they do not form any directed cycle). It is clear that TAN and BAN are useful to model correlations among attribute nodes that cannot be captured by naive Bayes. They embody a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. In addition, existing algorithms use the same idea to construct the structure of a Bayesian classifier: they first learn the dependence relationships among attribute variables using a Bayesian network structure learning algorithm and then add the class variable as the root node of the network. This is equivalent to learning the best Bayesian network among those in which $X_1$ is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile.
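As an illustration of the decision rule discussed above, the following minimal sketch estimates $P(x_1)$ and $P(x_i \mid x_1)$ by counting and predicts the class that maximizes $P(x_1) \prod_i P(x_i \mid x_1)$, that is, the naive Bayes special case. The function names `train_naive_bayes` and `predict` are hypothetical, no smoothing is applied, and the code is not taken from the paper's implementation.

```python
from collections import Counter


def train_naive_bayes(rows, labels):
    """Count class frequencies and (attribute, value, class) frequencies."""
    class_counts = Counter(labels)
    cond_counts = Counter()
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(i, value, label)] += 1
    return class_counts, cond_counts


def predict(row, class_counts, cond_counts):
    """Return the class maximizing P(x_1) * prod_i P(x_i | x_1)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / total  # estimate of P(x_1)
        for i, value in enumerate(row):
            # estimate of P(x_i | x_1); unseen combinations count as 0
            score *= cond_counts[(i, value, label)] / n_label
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

TAN and BAN would replace the inner product by factors that also condition on attribute parents; only the factorization changes, not the arg max rule.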
Based on the above analysis, this paper presents an optimization model to learn the structure of a Bayesian classifier, which is inspired by constraint-based Bayesian network structure learning methods. It is the first time that the problem of structural learning for a Bayesian classifier is transformed into a related mathematical programming problem by defining an objective function and a feasible region. We also propose a new method to measure the dependence relationships between attributes. The theoretical basis of this method is established by Theorem 1 [24].
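Theorem 1. Given a data set $D$ and a variable set $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$, if the hypothesis that $X_i$ and $X_j$ are conditionally independent given $X_k$ is true, then the statistic
$$U^2_{ij|k} = 2 \sum_{a,b,c} N^{abc}_{ijk} \log \frac{N^{abc}_{ijk} N^{c}_{k}}{N^{ac}_{ik} N^{bc}_{jk}}$$
approximately follows a $\chi^2(l)$ distribution with $l = (r_i - 1)(r_j - 1) r_k$ degrees of freedom, where $r_i$, $r_j$, and $r_k$ are the numbers of configurations of $X_i$, $X_j$, and $X_k$, and the $N$ terms are the corresponding numbers of cases in $D$.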
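The statistic of Theorem 1 can be computed directly from contingency counts. The sketch below is one possible implementation under stated assumptions: the data set is a list of tuples of discrete values, the name `g2_statistic` and the symbol $U^2_{ij|k}$ are notational assumptions of this sketch, and the numbers of configurations $r_i$, $r_j$, $r_k$ are taken to be the numbers of distinct values observed in the data.

```python
import math
from collections import Counter


def g2_statistic(data, i, j, k):
    """Return (u2, dof) for testing X_i independent of X_j given X_k.

    data is a list of tuples of discrete values; i, j, k are column indices.
    """
    n_abc = Counter((row[i], row[j], row[k]) for row in data)
    n_ac = Counter((row[i], row[k]) for row in data)
    n_bc = Counter((row[j], row[k]) for row in data)
    n_c = Counter(row[k] for row in data)

    u2 = 0.0
    for (a, b, c), n in n_abc.items():  # empty cells contribute nothing
        u2 += n * math.log(n * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
    u2 *= 2.0

    r_i = len({row[i] for row in data})
    r_j = len({row[j] for row in data})
    r_k = len({row[k] for row in data})
    return u2, (r_i - 1) * (r_j - 1) * r_k
```

A value of the statistic well above the chi-squared critical value for the returned degrees of freedom argues against conditional independence.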

A Bayesian Classifier Learning Algorithm Based on Optimization Model

Optimization Model Design.
In this subsection, we give some basic concepts and theorems that are the foundation of the method proposed in this paper.
A Bayesian classifier is a graphical representation of a joint probability distribution that includes two components. One is a directed acyclic graph $G = (\mathbf{V}, \mathbf{E})$, where the node set $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$ represents the class and attribute variables, and the edge set $\mathbf{E}$ represents direct dependency relationships between variables. The other is a joint probability distribution $\Theta = \{\theta_i \mid \theta_i = P(X_i \mid \mathrm{pa}(X_i)),\ i = 1, 2, \ldots, n\}$ that quantifies the effect that $\mathrm{pa}(X_i)$ has on the variable $X_i$ in $G$, where $\mathrm{pa}(X_i) = \{X_j \mid X_j \to X_i \in \mathbf{E}\}$. We assume that $X_1$ is the class node and $\{X_2, X_3, \ldots, X_n\}$ is the set of attribute nodes. The structure of $G$ reflects the underlying probabilistic dependence relations among the nodes and a set of assertions about conditional independencies. The problem of data classification can be stated as follows: the learning goal is first to find the classifier structure $G$ that best matches $D$ and estimate the parameters using the training data set $D$, and then to assign class labels to test instances.
Since $G$ is a directed acyclic graph, it can be represented by a binary node-node adjacency matrix $X = (x_{ij})$. Entry $(i, j)$ is 1 if there is a directed arc from node $i$ to node $j$, and 0 otherwise. That is,
$$x_{ij} = \begin{cases} 1, & \text{if } X_i \to X_j \in \mathbf{E}, \\ 0, & \text{otherwise.} \end{cases}$$
Let $R = X + X^2 + \cdots + X^n$ be the sum of powers of the adjacency matrix. Entry $r_{ij} \in R$ is equal to the number of directed paths from node $X_i$ to node $X_j$ in the graph [25].
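The sum of powers of the adjacency matrix is easy to compute for small graphs. The following sketch uses plain nested lists; the names `path_count_matrix` and `is_acyclic` are hypothetical, and the letters $X$ and $R$ for the two matrices are notational assumptions. A nonzero diagonal entry of the result signals a directed cycle, which is how acyclicity can be tested.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def path_count_matrix(x):
    """Return X + X^2 + ... + X^n for an n x n 0/1 adjacency matrix X."""
    n = len(x)
    power = [row[:] for row in x]
    total = [row[:] for row in x]
    for _ in range(n - 1):
        power = mat_mul(power, x)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total


def is_acyclic(x):
    """A directed graph is acyclic iff no node reaches itself."""
    r = path_count_matrix(x)
    return all(r[i][i] == 0 for i in range(len(x)))
```

For the three-node structure with arcs 0→1, 0→2, and 1→2, the entry for the pair (0, 2) is 2, because node 2 is reached both directly and through node 1.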
We wish to be able to use a mathematical programming formulation, and this formulation requires that we are able to measure the impact of adding or removing a single arc from the network.In order to approximate the impact of adding such an arc, we define the dependence coefficient.
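Definition 2. Given a data set $D$ and a variable set $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$, we define the dependence coefficient $c_{ij}$ between variables $X_i$ and $X_j$ as
$$c_{ij} = \min_{k \neq i, j} \left\{ U^2_{ij|k} - \chi^2_{ij|k,\alpha} \right\},$$
where $U^2_{ij|k}$ is the $U^2$ statistic of $X_i$ and $X_j$ given $X_k$, and $\chi^2_{ij|k,\alpha}$ is the critical value at the significance level $\alpha$ of a $\chi^2$ distribution with $(r_i - 1)(r_j - 1) r_k$ degrees of freedom.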
Obviously, $c_{ij}$ is a conservative estimate of the degree of dependence between two nodes. If $c_{ij} > 0$, then, regardless of the other variable involved, there is statistically significant dependence between $X_i$ and $X_j$, so there should be an arc between them. If $c_{ij} < 0$, then there is at least one way of conditioning the relationship so that significant dependence is not present. We define $C = (c_{ij})_{n \times n}$ as the dependence coefficient matrix corresponding to the variable set $\mathbf{V}$.
According to the measure $F$ of Definition 4, if $X_i$ and $X_j$ are conditionally independent, then $c_{ij} < 0$ by Lemma 3, and hence adding an arc between $X_i$ and $X_j$ decreases the value of $F$. Thus, we wish to find the feasible solution that maximizes $F$. The optimal solution corresponds to the best classifier structure. We next explain what constitutes a feasible network.
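A dependence coefficient of this kind can be sketched in a few lines. The code below assumes that the coefficient is the minimum, over every third variable $X_k$, of the conditional statistic minus the chi-squared critical value, and that the data set has at least three variables. Because the Python standard library has no chi-squared quantile function, the critical value is approximated with the Wilson-Hilferty formula, which is a substitution made for this sketch and not the paper's procedure. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist


def g2(data, i, j, k):
    """Conditional likelihood-ratio statistic and its degrees of freedom."""
    n_abc = Counter((r[i], r[j], r[k]) for r in data)
    n_ac = Counter((r[i], r[k]) for r in data)
    n_bc = Counter((r[j], r[k]) for r in data)
    n_c = Counter(r[k] for r in data)
    stat = 2.0 * sum(
        n * math.log(n * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
        for (a, b, c), n in n_abc.items()
    )
    levels = [len({r[v] for r in data}) for v in (i, j, k)]
    return stat, (levels[0] - 1) * (levels[1] - 1) * levels[2]


def chi2_critical(dof, alpha=0.05):
    """Approximate upper-alpha chi-squared quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    h = 2.0 / (9.0 * dof)
    return dof * (1.0 - h + z * math.sqrt(h)) ** 3


def dependence_coefficient(data, i, j, alpha=0.05):
    """Smallest margin of the statistic over the critical value, over all k."""
    margins = []
    for k in range(len(data[0])):
        if k in (i, j):
            continue
        stat, dof = g2(data, i, j, k)
        margins.append(stat - chi2_critical(dof, alpha))
    return min(margins)  # raises ValueError with fewer than three variables
```

Taking the minimum means that a single conditioning variable that removes the dependence is enough to make the coefficient negative, which matches the conservative reading given in the text.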
Given the variable set $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$, $X_1$ is the class node and $\{X_i,\ i = 2, \ldots, n\}$ is the set of attribute nodes. A directed network is a feasible classifier structure if and only if the following conditions are satisfied: (1) for any attribute node $X_i \in \mathbf{V}$, $i = 2, \ldots, n$, there is no directed edge from $X_i$ to $X_1$; (2) for any node $X_i \in \mathbf{V}$, $i = 1, 2, \ldots, n$, there is no directed path from $X_i$ to $X_i$, namely, the graph is acyclic; (3) there exists at least one attribute node $X_i \in \mathbf{V}$, $i = 2, \ldots, n$, that is dependent on the class node $X_1$, namely, there is an attribute node $X_i$ such that $X_i$ can be reached from $X_1$ by a directed path.
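In order to incorporate the requirements of the above three conditions into a mathematical programming formulation, we express them by the following constraints, where $x_{ij}$ is the adjacency matrix entry for the arc $X_i \to X_j$ and $r_{ij}$ counts the directed paths from $X_i$ to $X_j$:
(1) $\sum_{i=2}^{n} x_{i1} = 0$;
(2) $\sum_{i=1}^{n} r_{ii} = 0$;
(3) $\sum_{j=2}^{n} r_{1j} \geq 1$.
The feasible classifiers are those that satisfy constraints (1)-(3). Thus, learning the best Bayesian classifier can be transformed into the following related mathematical programming problem, where the objective function is a global dependence measure of the network, and the feasible region is the set of classifiers with reachability constraints (1)-(3):
$$\max F = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_{ij} \quad \text{s.t. constraints (1)-(3)}, \quad x_{ij} \in \{0, 1\}. \tag{OM}$$
Algorithm 1: BC-OM.
(1) Input: data set $D$; variable set $\mathbf{V} = (X_1, X_2, \ldots, X_n)$ ($X_1$ is the class node and the others are attribute nodes).
(2) For any pair of variables $X_i$ and $X_j$ contained in $\mathbf{V}$, calculate the dependence coefficient $c_{ij}$ by Definition 2.
(3) Solve the mathematical programming problem (OM) and obtain the optimal solution $X^{*} = (x_{ij})$.
(4) Build the classifier structure $G^{*} = (\mathbf{V}, \mathbf{E}^{*})$ from $X^{*} = (x_{ij})$.
(5) For any variable ...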
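For very small networks, a model of this kind can be solved by exhaustive enumeration, which makes the role of each constraint concrete. The sketch below assumes a linear objective $\sum c_{ij} x_{ij}$ and encodes the three feasibility conditions as: no arc into the class node (index 0 here), no node reaching itself, and at least one attribute reachable from the class node. It is an illustration only, since it enumerates $2^{n(n-1)}$ matrices; the paper does not say that (OM) is solved this way, and the function names are hypothetical.

```python
from itertools import product


def path_counts(x):
    """Return X + X^2 + ... + X^n as nested lists."""
    n = len(x)
    power = [row[:] for row in x]
    total = [row[:] for row in x]
    for _ in range(n - 1):
        power = [[sum(power[i][k] * x[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total


def is_feasible(x):
    n = len(x)
    if any(x[i][0] for i in range(1, n)):   # (1) no arc into the class node
        return False
    r = path_counts(x)
    if any(r[i][i] for i in range(n)):      # (2) acyclic
        return False
    return sum(r[0][j] for j in range(1, n)) >= 1  # (3) class reaches an attribute


def solve_om_brute_force(c):
    """Maximize sum(c[i][j] * x[i][j]) over feasible 0/1 matrices x."""
    n = len(c)
    cells = [(i, j) for i in range(n) for j in range(n) if i != j]
    best_value, best_x = None, None
    for bits in product((0, 1), repeat=len(cells)):
        x = [[0] * n for _ in range(n)]
        for (i, j), bit in zip(cells, bits):
            x[i][j] = bit
        if not is_feasible(x):
            continue
        value = sum(c[i][j] * x[i][j] for i, j in cells)
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    return best_value, best_x
```

With a symmetric coefficient matrix, an arc between two attributes has the same value in either direction, so the optimum is generally not unique unless further conditions hold.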

BC-OM Algorithm and Its Correctness.
In this subsection, we present the main algorithm of this paper. Our method starts by finding the best Bayesian classifier by solving the above optimization model. Second, we use the d-separation rule of Bayesian networks to delete irrelevant or redundant attributes in the network, which have a low degree of dependence with the class variable. The parameters of the modified network can then be estimated. Third, classification is done by applying the obtained classifier to predict the class labels of test data. We prove the correctness of the proposed method under the faithfulness assumption for the data distribution.
Let $G = (\mathbf{V}, \mathbf{E})$ be a directed acyclic graph, where $\mathbf{V}$ is the node set and $\mathbf{E}$ the set of directed edges. A path $\rho$ between two distinct nodes $X_1$ and $X_l$ is a sequence of distinct nodes in which the first node is $X_1$, the last one is $X_l$, and two consecutive nodes are connected by an edge, that is, $\rho = X_1 e_1 X_2 \cdots e_{l-1} X_l$, where $e_i$ denotes $X_i \to X_{i+1}$ or $X_i \leftarrow X_{i+1}$ for $i = 1, 2, \ldots, (l - 1)$.
Definition 5. A path $\rho$ is said to be d-separated by a set $\mathbf{Z}$ in a directed acyclic graph $G$ if and only if (1) $\rho$ contains a "head-to-tail meeting" $X_i \to X_m \to X_j$ or a "tail-to-tail meeting" $X_i \leftarrow X_m \to X_j$ such that the middle node $X_m$ is in $\mathbf{Z}$, or (2) $\rho$ contains a "head-to-head meeting" $X_i \to X_m \leftarrow X_j$ such that the middle node $X_m$ is not in $\mathbf{Z}$ and no descendant of $X_m$ is in $\mathbf{Z}$. In particular, two distinct sets of nodes $\mathbf{X}$ and $\mathbf{Y}$ are said to be d-separated by a set $\mathbf{Z}$ in $G$ if $\mathbf{Z}$ d-separates every path from any node in $\mathbf{X}$ to any node in $\mathbf{Y}$ [26].
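D-separation can be tested without enumerating paths. The sketch below swaps in the equivalent moralized ancestral graph criterion: restrict the graph to the ancestors of the three node sets, connect parents that share a child, drop edge directions, remove the conditioning set, and check whether the two sets are still connected. The function name `d_separated` and the parent-dictionary representation are assumptions of this sketch and do not come from the paper.

```python
def d_separated(parents, xs, ys, zs):
    """Return True if zs d-separates xs from ys in the DAG.

    parents maps each node to the set of its parent nodes; nodes without
    parents may be omitted from the dictionary.
    """
    # Step 1: smallest ancestral set containing xs, ys, and zs.
    relevant = set(xs) | set(ys) | set(zs)
    stack = list(relevant)
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in relevant:
                relevant.add(p)
                stack.append(p)

    # Step 2: moralize (link co-parents) and drop edge directions.
    neighbors = {v: set() for v in relevant}
    for child in relevant:
        ps = list(parents.get(child, ()))
        for p in ps:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for a in ps:
            for b in ps:
                if a != b:
                    neighbors[a].add(b)

    # Step 3: search from xs while never entering a node of zs.
    blocked = set(zs)
    targets = set(ys)
    frontier = [x for x in xs if x not in blocked]
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in targets:
            return False
        for nb in neighbors[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                frontier.append(nb)
    return True
```

In a pruning step of the kind described above, an attribute that is d-separated from the class node by some set of the remaining attributes would be a candidate for removal.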
In this paper, we assume that all the distributions are compatible with $G$ [27]. We also assume that all independencies of a probability distribution of variables in $\mathbf{V}$ can be checked by d-separations of $G$, called the faithfulness assumption [26]. The faithfulness assumption means that all independencies and conditional independencies among variables can be represented by $G$. Now we formally describe our method in the following Algorithm 1.
From the detailed steps of BC-OM, we can see that the BC-OM classifier relaxes the restrictions on condition variables and further meets the needs of practical application. Since its network structure is similar to that of BAN, BC-OM does not need to build all possible networks in which the class node is a root, and it removes irrelevant or redundant nodes from the network before estimating the network parameters, which greatly reduces the calculation for the posterior probability of the class variable. In fact, the training process of BC-OM is different from that of other BN classifiers. Its main task is to solve the mathematical programming problem (OM). To create the dependence coefficient matrix corresponding to (OM), BC-OM needs to compute the conditional statistics $U^2_{ij|k}$. Moreover, just as for other constraint-based algorithms, the main cost of BC-OM is the number of conditional independence tests for computing the dependence coefficients of any two variables in step 2. The number of conditional independence tests is $C_n^2 \cdot C_{n-2}^1$, and the computing complexity is $O(n^3)$. The total complexity of BC-OM is bounded by $O(n^3 \cdot N)$, where $n$ is the number of variables in the network and $N$ is the number of cases in data set $D$.
In principle, BC-OM is a structure-extension-based algorithm. In BC-OM, we essentially extend the structure of TAN by relaxing the parent set of each attribute node. Thus, the resulting structure is more complex than TAN but simpler than BAN. Therefore, BC-OM is a good tradeoff between model complexity and accuracy compared with TAN and BAN. Next, we prove the correctness of the BC-OM algorithm under the faithfulness assumption.
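The test count quoted above can be expanded as a short worked calculation: one test for each unordered pair of variables and each third variable used for conditioning. The value $n = 10$ is chosen only for illustration.

```latex
C_n^2 \cdot C_{n-2}^1
  = \frac{n(n-1)}{2} \cdot (n-2)
  = \frac{n(n-1)(n-2)}{2}
  = O(n^3),
\qquad
n = 10:\quad \frac{10 \cdot 9 \cdot 8}{2} = 360 \text{ tests.}
```

Each test requires one pass over the $N$ cases to build its contingency counts, which is consistent with the overall bound $O(n^3 \cdot N)$.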
The next two results establish the existence and uniqueness properties of the solution to (OM).
Theorem 6. Let $\mathbf{X} = \{(x_{ij})_{n \times n} \mid x_{ij} \in \{0, 1\}\}$. There always exists an $X \in \mathbf{X}$ such that $X$ is a feasible point of (OM).
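Proof. Consider the set of variables $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$, where $X_1$ is the class variable and $\{X_2, \ldots, X_n\}$ are the attribute variables. We give a matrix $X = (x_{ij})$ as follows:
$$x_{ij} = \begin{cases} 1, & \text{if } i = 1 \text{ and } j \in \{2, \ldots, n\}, \\ 0, & \text{otherwise.} \end{cases}$$
Obviously, the adjacency matrix $X$ always satisfies constraints (1)-(3). In fact, the graph represented by $X$ is the naive Bayes classifier. Thus, $X$ is a feasible solution of (OM).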
According to Theorem 6, there exists a feasible classifier that satisfies constraints (1)-(3). Theorem 7 further shows that such a classifier is unique under a certain condition.
Theorem 7. Let $X^{*}$ be the optimal solution of (OM), and let $C_1 = \{c_{ij} \mid x_{ij} = 1\}$ and $C_2 = \{c_{ij} \mid x_{ij} = 0\}$ be the coefficient sets, where $x_{ij}$ is the element of $X^{*}$. $X^{*}$ is the unique solution of (OM) if and only if any element in $C_1$ cannot be expressed as the sum of any number of elements in $C_2$.
Proof. Arguing by contradiction, suppose without loss of generality that $X^{(1)}$ and $X^{(2)}$ are two optimal solutions of (OM). The value of the objective function is the same for both solutions, that is, $\sum_{i,j} c_{ij} (x^{(1)}_{ij} - x^{(2)}_{ij}) = 0$.
Proof. Without loss of generality, suppose $(x_2, \ldots, x_n)$ is an example to be classified. The classifier represented by $G$ is given as follows: ... We write the right side of (11) ... We get the results.
Theorem 8 reveals that it is effective and correct to remove redundant or irrelevant attributes using d-separation rule, and the performance of Bayesian classifier can be improved.

Experimental Results
We run our experiments on 20 data sets from the UCI repository of Machine Learning datasets [28], which represent a wide range of domains and data characteristics. Table 1 describes the 20 data sets, which are ordered by ascending number of samples. In our experiments, missing values are replaced with the modes and means of the corresponding attribute values from the available data. For example, if the sex of someone is missing, it can be replaced by the mode (the value with the highest frequency) of the sexes of all the others. Besides, we manually delete three useless attributes: the attribute "ID number" in the dataset "Glass", the attribute "name" in the dataset "Hayes-roth", and the attribute "animal name" in the dataset "Zoo".
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB of memory, and Windows XP. Our implementation is based on the BayesNet Toolbox for Matlab [29], which provides source code to perform several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BC-OM with naive Bayes, TAN, and BAN in terms of classifier accuracy. The accuracy of each model is based on the percentage of successful predictions on the test sets of each data set.
In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross validation. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compare related algorithms via a two-tailed t-test with a 95 percent confidence level. According to statistical theory, we speak of two results for a data set as being "significantly different" only if the probability of significant difference is at least 95 percent [30].
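The requirement that every algorithm sees the same folds can be met by generating the index splits once from a fixed seed and reusing them. The sketch below is a minimal standard-library version of 10 runs of 5-fold cross validation; the function name `repeated_kfold`, the seed value, and the round-robin fold assignment are assumptions of this sketch and not details reported in the paper.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_runs=10, seed=0):
    """Return a list of (run, fold, train_indices, test_indices) tuples.

    The same seed always yields the same splits, so different classifiers
    can be trained and tested on identical partitions.
    """
    rng = random.Random(seed)
    splits = []
    for run in range(n_runs):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal shuffled indices round-robin into n_folds groups.
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            splits.append((run, f, train_idx, test_idx))
    return splits
```

Accuracies collected per split for two classifiers are then paired observations, which is the setting in which a paired two-tailed t-test applies.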
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set, and the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row, the best of the four classifier results is displayed in bold. If another classifier's performance is not significantly different from the best, it is also highlighted, but if the differences between all four classifiers are not statistically significant, then none of them is highlighted. From our experiments, we can see that BC-OM is best in 6 cases. NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BC-OM is best. Although the performance of BC-OM and TAN becomes similar as the sample size increases, BC-OM has a higher accuracy on average. From a general point of view, the number of highlighted entries in the sixth column of Table 2 grows from the first data set to the last one. This means that the advantage of BC-OM becomes more evident as the data size increases.
Table 3 shows the results of the two-tailed t-test, in which each entry $w/t/l$ means that the model in the corresponding row wins in $w$ data sets, ties in $t$ data sets, and loses in $l$ data sets, compared to the model in the corresponding column. From Table 3, we can see that BC-OM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BC-OM with NB, TAN, and BAN, respectively. In each scatter plot, a point represents a data set, where the $x$ coordinate of a point is the percentage of misclassifications according to NB, TAN, or BAN, and the $y$ coordinate is the percentage of misclassifications according to BC-OM. Thus, points below the diagonal line correspond to data sets on which BC-OM performs better. From Figures 2 and 3, we can see that BC-OM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BC-OM performs well against the other two classifiers in terms of both accuracy and the percentage of misclassifications. Figure 4 also shows BC-OM outperforming BAN, though the difference in performance is not as marked as in Figures 2 and 3. In other words, the performance of BC-OM and BAN is similar in terms of the percentage of misclassifications. However, BC-OM has a higher accuracy and a simpler graph structure, which suggests that BC-OM is able to handle very large data sets and is a more promising classifier.

Conclusions
In many real-world applications, classification is often required to make optimal decisions. In this paper, we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model: BC-OM. We conducted a systematic experimental study on a number of UCI datasets. The experimental results show that BC-OM performs better than the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations it is useful to model correlations among attributes. BC-OM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. Considering its simplicity, BC-OM is a promising model that could be used in many fields.
In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the data set. We believe that the use of more sophisticated methods could improve the performance of the current BC-OM and strengthen its advantage. This is the main research direction for our future work.

Figure 2: Relative errors of BC-OM and NB.
Figure 1 schematically illustrates the structures of the Bayesian classifiers considered in this paper. In naive Bayes, each attribute node has the class node as its parent but does not have any parent among the attribute nodes. Computing $P(x_2, x_3, \ldots, x_n \mid x_1)$ reduces to computing $\prod_{i=2}^{n} P(x_i \mid x_1)$. Because the values of $P(x_1)$ and $P(x_i \mid x_1)$ can be easily estimated from training examples, naive Bayes is easy to construct. However, its prerequisites of conditional independence and data completeness limit its practical application. TAN takes the naive Bayes structure and adds edges to it: the class node directly points to all attribute nodes, and an attribute node can have only one parent from another attribute node. Computing $P(x_2, x_3, \ldots, x_n \mid x_1)$ is equivalent to computing $\prod_{i=2}^{n} P(x_i \mid x_j, x_1)$. Maximizing $P(x_1 \mid x_2, x_3, \ldots, x_n)$ is equivalent to maximizing $P(x_1) \cdot P(x_2, x_3, \ldots, x_n \mid x_1)$. The difference between the existing Bayesian classifiers is the way in which they compute $P(x_2, x_3, \ldots, x_n \mid x_1)$.
In the statistic $U^2_{ij|k} = 2 \sum_{a,b,c} N^{abc}_{ijk} \log\left[ N^{abc}_{ijk} N^{c}_{k} / (N^{ac}_{ik} N^{bc}_{jk}) \right]$ of Theorem 1, which has $l = (r_i - 1)(r_j - 1) r_k$ degrees of freedom, $r_i$, $r_j$, and $r_k$ represent the number of configurations for the variables $X_i$, $X_j$, and $X_k$, respectively. $N^{abc}_{ijk}$ is the number of cases in $D$ where $X_i = a$, $X_j = b$, and $X_k = c$; $N^{ac}_{ik}$ is the number of cases in $D$ where $X_i = a$ and $X_k = c$, with $N^{bc}_{jk}$ defined analogously; and $N^{c}_{k}$ is the number of cases in $D$ where $X_k = c$.

Table 1: Descriptions of UCI datasets used in the experiments.

Table 2: The detailed experimental results on accuracy and standard deviation.

Table 3: The compared results of two-tailed t-test on accuracy with the 95 percent confidence level.