Bayesian Prediction Model Based on Attribute Weighting and Kernel Density Estimations

Although the naïve Bayes learner has been proven to show reasonable performance in machine learning, it often suffers from a few problems when handling real-world data. The first problem is the conditional independence assumption; the second is the use of frequency estimators. We therefore propose methods to address these two problems within the naïve Bayes framework. An attribute weighting method handles the conditional independence assumption issue, while our proposed smooth kernel weakens the negative effects of the frequency estimators. In this paper, we propose a compact Bayes model in which a smooth kernel augments weights on likelihood estimation. We also choose an attribute weighting method that employs a mutual information metric to cooperate with the framework. Experiments have been conducted on UCI benchmark datasets, and the accuracy of our proposed learner has been compared with that of standard naïve Bayes. The experimental results demonstrate the effectiveness and efficiency of the proposed learning algorithm.


Introduction
The naïve Bayes classifier is a supervised learning method based on the Bayes rule of probability theory. It runs on labeled training examples and is driven by a strong assumption, known as the naïve Bayes (conditional) independence assumption, that all attributes in the training examples are independent of one another given the class. The naïve Bayes classifier offers high performance and rapid classification speed, and it has proven effective especially on large training sets with many attributes, mainly because of its independence assumption [1].
In practice, classification performance is affected by the attribute independence assumption, which is usually violated in the real world. However, because the attractive advantages of efficiency and simplicity both stem from this assumption, many researchers have proposed effective methods to further improve the performance of the naïve Bayes classifier by weakening the attribute independence assumption without sacrificing its advantages. We categorize some typical previous methods for relaxing the naïve Bayes assumption and give brief reviews in Section 2. We have found, however, that attribute weighting has drawn relatively little attention among these methods, especially when attribute weighting is combined with a kernel method in a principled way.
Although Chen and Wang [2] proposed an attribute weighting method with a kernel, their weighting scheme generates a series of parameters from least-squares cross-validation, which is less interpretable than our proposed method. In contrast, we propose an attribute weighting algorithm based on an attribute weighting framework with a kernel method. Our method makes the weights embedded in the kernel relatively interpretable; thus we can flexibly choose different metrics and methods to measure the weights within our attribute weighting framework. The contributions of this paper are threefold:
(i) We briefly survey ways to improve naïve Bayes, focusing especially on naïve Bayes weighting methods.
(ii) We propose a novel attribute weighting framework called Attribute Weighting with Smooth Kernel Density Estimation, AW-SKDE for short. The AW-SKDE framework employs a smooth kernel that lets the weights dominate the probabilistic estimation of the likelihood, which enables the combination of kernel methods and weighting methods. Once the kernel is set up, we can generate a set of weights directly by various methods cooperating with the kernel.
(iii) On top of the AW-SKDE framework, we propose a learner called AW-SKDE MI, in which the mutual information criterion measures the dependency between an attribute and the class label.
Our experimental results show that mutual information criterion based on AW-SKDE framework exhibits superior performance compared to standard naïve Bayes classifier.
The paper is organized as follows. We briefly survey ways to improve naïve Bayes in Section 2. In Section 3, we introduce the background of our study. In Section 4, we first propose our attribute weighting framework based on kernel density estimation; we then propose a method employing the mutual information criterion for attribute weighting based on the proposed framework. In Section 5, we describe the experiments and results in detail. Lastly, we draw conclusions and describe future research in Section 6.

Related Work
A number of methods that weaken the attribute independence assumption of naïve Bayes have been proposed in recent years. Jiang et al. [3] surveyed methods for improving naïve Bayes. These methods are broadly divided into five main categories: structure extension, feature selection, data expansion, local learning, and attribute weighting. We give a brief review following this categorization.
For data expansion, Kang and Sohn [4] presented an algorithm called the propositionalized attribute taxonomy learner, PAT-learner for short. In PAT-learner, the training data set is first disassembled into small pieces based on attribute values; then PAT-learner builds a new data set called the PAT-Table, using the divergence between the distributions of class labels associated with the corresponding attributes in the disassembled data set. Kang and Kim [5] also proposed a Bayes learner based on PAT-learner, called the propositionalized attribute taxonomy guided naïve Bayes learner (PAT-NBL). They use the propositionalized data set and the PAT-Table generated by PAT-learner to build naïve Bayes classifiers.
Wong [6] focused on attribute discretization to improve naïve Bayes. Wong proposed a hybrid method for continuous attributes and noted that discretizing the continuous attributes of a data set with different methods can improve the performance of the naïve Bayes learner. Wong also provides a nonparametric measure to evaluate the level of dependence between a continuous attribute and the class.
For structure extension, Webb et al. [7] proposed a method called aggregating one-dependence estimators, AODE for short. In AODE, the conditional probability of a test instance given the class is conditioned on one attribute value that occurs in the test instance. After the training stage, AODE outputs an averaged one-dependence estimator. AODE is a lazy structure-extension method for Bayesian networks. Jiang et al. [3] proposed hidden naïve Bayes (HNB), which is also a kind of structure extension method.
As for attribute weighting methods, there are two ways to obtain attribute weights. The first is to construct a function parameterized by attribute weights and fit it to the training data by estimating the weights. Zaidi et al. [8] proposed a weighted naïve Bayes algorithm called weighting to alleviate the naïve Bayes independence assumption, WANBIA for short. Within the WANBIA framework, the authors describe two methods to obtain the attribute weights: WANBIA CLL, which maximizes the conditional log-likelihood function, and WANBIA MSE, which minimizes the mean squared error function.
Chen and Wang [2] also proposed an algorithm that minimizes a mean squared error function to obtain the attribute weights. In another paper, Chen and Wang [9] proposed a method called subspace weighting naïve Bayes (SWNB), a naïve Bayes weighting method for high-dimensional data. Using a local feature-weighting technique, SWNB can describe the different contributions of attributes in the training data set and outputs an optimal set of attribute weights fitted to a logit-normal prior distribution.
Many other methods can be categorized as attribute weighting. Lee et al. [10] calculate attribute weights via the Kullback-Leibler divergence between the attribute and the class label. Wu and Cai [11] proposed decision tree-based attribute weighted AODE, DTWAODE for short. DTWAODE generates a set of attribute weights directly, with the weight value decreasing with the attribute's depth in the decision tree. Omura et al. [12] proposed a weighting method called confidence weight for naïve Bayes, where the confidence weight is derived from the probabilities of the majority class in the training data set.

Background
In this section, we explain the machine learning concepts used in this paper, including the naïve Bayes classifier, naïve Bayes attribute weighting, and kernel density estimation for categorical attributes in naïve Bayes. The symbols used in this paper are summarized in the Notations section.

Naïve Bayes Classifier.
In supervised learning, consider a training data set D = {x^(1), ..., x^(n)} composed of n instances, where each instance x = ⟨x_1, ..., x_m⟩ ∈ D (an m-dimensional vector) is labeled with a class label c ∈ C. For the posterior probability of c given x, we have

p(c | x) = p(c) p(x | c) / p(x).  (1)

However, the likelihood p(x | c) cannot be estimated directly from D because in practice the data are insufficient. Naïve Bayes uses the attribute independence assumption to alleviate this problem; under the assumption, p(x | c) factorizes as

p(x | c) = ∏_{j=1}^{m} p(x_j | c).  (2)

In the training phase, only p(x_j | c) and p(c) need to be estimated for each class c ∈ C and each attribute value x_j ∈ X_j. The estimation uses the frequency of x_j given c for p(x_j | c) and the frequency of c for p(c).
In the classification phase, given a test instance t = ⟨t_1, ..., t_m⟩, where t_j is the value of attribute X_j in the test instance, the naïve Bayes classifier outputs a class label prediction for t based on the frequency estimates of p(t_j | c) and p(c) generated in the training phase. The naïve Bayes classifier is

ĉ(t) = argmax_{c ∈ C} p(c) ∏_{j=1}^{m} p(t_j | c).  (3)

As mentioned above, the naïve Bayes assumption conflicts with most real-world applications (it is rare that attributes in the same data set have no relationships with one another). Therefore, many researchers have proposed ways to relax the naïve Bayes assumption effectively, as reviewed in Section 2.
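To make the training and classification phases concrete, the following is a minimal sketch of a categorical naïve Bayes using frequency estimates, per (1)-(3). This is an illustration, not the authors' implementation; the class name, data, and attribute values are hypothetical.

```python
from collections import Counter, defaultdict

class CategoricalNB:
    """Minimal naive Bayes for categorical attributes with frequency estimates."""
    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)            # F(c): frequency of each class
        self.cond_counts = defaultdict(Counter)   # F_c(x_j): counts of x_j given c
        for x, c in zip(X, y):
            for j, v in enumerate(x):
                self.cond_counts[(j, c)][v] += 1
        return self

    def predict(self, t):
        best_c, best_p = None, -1.0
        for c, nc in self.class_counts.items():
            p = nc / self.n                       # p(c)
            for j, v in enumerate(t):
                p *= self.cond_counts[(j, c)][v] / nc   # p(t_j | c) by frequency
            if p > best_p:
                best_c, best_p = c, p
        return best_c

# toy data: two categorical attributes, two classes
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "no", "yes", "yes"]
clf = CategoricalNB().fit(X, y)
print(clf.predict(("rainy", "mild")))  # -> yes
```

Note how an unseen attribute-class combination drives the whole product to zero; this is exactly the weakness of the frequency estimator that smoothing and kernel methods address later in the paper.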
In this paper, we focus on attribute weighting methods combined with a kernel density estimation technique, applied to the naïve Bayes learner in order to relax the conditional independence assumption.

Naïve Bayes Attribute Weighting.
Generally, the naïve Bayes attribute weighting scheme can be formulated in several forms. First, with a weight w_j assigned to each attribute X_j, the classifier is

ĉ(t) = argmax_{c ∈ C} p(c) ∏_{j=1}^{m} p(t_j | c)^{w_j}.  (4)

If the weight depends on both the attribute and the class, the corresponding formula is

ĉ(t) = argmax_{c ∈ C} p(c) ∏_{j=1}^{m} p(t_j | c)^{w_{j,c}}.  (5)

The following form is used when the weight depends on the attribute value:

ĉ(t) = argmax_{c ∈ C} p(c) ∏_{j=1}^{m} p(t_j | c)^{w_{j,t_j}}.  (6)

Referring back to (4), when w_j = w for all j, the formula becomes

ĉ(t) = argmax_{c ∈ C} p(c) ∏_{j=1}^{m} p(t_j | c)^{w}.  (7)

It is worth mentioning that the naïve Bayes classifier is the special case of (7) in which every attribute X_j has the same weight w_j = w = 1. In other words, the naïve Bayes classifier ignores the relative importance of attributes. From an information-theoretic perspective, the naïve Bayes classifier gives up the chance of extracting more information from D to reduce the entropy of the class. This is one reason why attribute weighting methods yield more accurate classification results than the plain naïve Bayes classifier.
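The effect of the per-attribute weighting in (4) can be sketched in log space, where the weights become multipliers on the log-likelihood terms. This is an illustrative sketch with hand-picked probabilities, not the paper's learned weights.

```python
import math

def weighted_nb_score(prior, likelihoods, weights):
    """log p(c) + sum_j w_j * log p(t_j | c), the log of formula (4)."""
    return math.log(prior) + sum(w * math.log(p)
                                 for w, p in zip(weights, likelihoods))

# two classes with equal priors, two attributes
lik_a = [0.6, 0.4]   # p(t_j | c = a)
lik_b = [0.2, 0.5]   # p(t_j | c = b)

# uniform weights w_j = 1 recover standard naive Bayes: class a wins
std = weighted_nb_score(0.5, lik_a, [1.0, 1.0]) > weighted_nb_score(0.5, lik_b, [1.0, 1.0])

# down-weighting the first attribute (w_1 = 0.1) flips the decision to class b
wtd = weighted_nb_score(0.5, lik_a, [0.1, 1.0]) > weighted_nb_score(0.5, lik_b, [0.1, 1.0])

print(std, wtd)  # -> True False
```

The flip shows why weighting matters: an attribute deemed unreliable can be nearly silenced without discarding it entirely.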
In our approach, we follow (4) in assigning a weight w_j to each attribute X_j. However, instead of using w_j as an exponent, we incorporate w_j into p(t_j | c) so that it works in a more general form. In our method the weight operates inside the kernel, as shown in (13) and described in Section 4.1.
From an information-theoretic perspective, attribute weighting tries to find out which attributes give more information for classification than others. If an attribute X_j in data set D provides more information than the other attributes for reducing the entropy of the class label C, then X_j is assigned a higher weight.

Kernel Density Estimation for Naïve Bayes Categorical Attributes.
In the naïve Bayes learner discussed in Section 3.1, the likelihood p(x_j | c) is often estimated by F_c(x_j^(i)), the frequency of x_j^(i) given c, where x_j^(i) is the value of attribute X_j at the i-th instance in data set D. From a statistical perspective, such a nonsmooth estimator has the least sample bias, but it also has a large estimation variance [2, 13]. Aitchison and Aitken [14] proposed a kernel function, and Chen and Wang [2] proposed a variant smooth kernel function as an alternative to the frequency estimator. The kernel function in [2] is defined as follows.
Given a test instance t = ⟨t_1, ..., t_m⟩, where t_j is the value of attribute X_j in the test instance, the kernel K(t_j, c, λ_j) of (8) is defined for t_j given c; it reduces to an indicator when λ_j = 0. Here λ_j (= h_j · b_c) is the bandwidth, with b_c = 1/√n_c, h_j ∈ [0, 1], and n_c the number of instances in D with class c.
In [2], (8) is used to estimate p(t_j | c); the resulting estimate is written p̂(t_j | c, λ_j) in place of p̂(t_j | c) (note that p(c) is still estimated by frequency). A least-squares cross-validation cost function is minimized to obtain a series of bandwidths λ_j for each X_j in class c, and the classifier is then formulated by substituting p̂(t_j | c, λ_j) into (3).
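To make the role of the bandwidth concrete, the following sketch uses the classical Aitchison-Aitken kernel for one categorical attribute. This is an assumption for illustration: the variant kernel in [2] differs in its exact form, and the function names here are ours.

```python
def aitchison_aitken(x, t, lam, cardinality):
    """Smooth categorical kernel: an indicator when lam = 0, flatter as lam grows."""
    if x == t:
        return 1.0 - lam
    return lam / (cardinality - 1)

def kernel_likelihood(values_given_c, t, lam, cardinality):
    """Smoothed estimate of p(t | c): average the kernel over class-c instances."""
    return sum(aitchison_aitken(x, t, lam, cardinality)
               for x in values_given_c) / len(values_given_c)

vals = ["a", "a", "b", "c"]                  # attribute values among class-c instances
print(kernel_likelihood(vals, "a", 0.0, 3))  # lam = 0: plain frequency 2/4 = 0.5
print(kernel_likelihood(vals, "a", 0.3, 3))  # lam > 0: estimate shrunk toward uniform
```

With λ = 0 the estimate equals the raw frequency (least bias, most variance); as λ grows the estimate is pulled toward the uniform distribution, trading bias for lower variance.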

AW-SKDE Framework and AW-SKDE MI Learner
As mentioned earlier, in this section we propose an attribute weighting framework for categorical attributes called Attribute Weighting with Smooth Kernel Density Estimation, AW-SKDE for short. Based on the AW-SKDE framework, we then propose a learner named AW-SKDE MI, in which mutual information attribute weighting is applied.

4.1. AW-SKDE Framework.
In (8), we pose an assumption: if a certain attribute X_j is more important for classification given the class label, in other words, if X_j provides more information to reduce the indeterminacy of the class c, then the estimate of p(x_j | c) should be closer to the frequency estimate F_c(x_j); otherwise, it should be closer to the uniform value 1/|X_j|. We let the bandwidth λ_j = (1 − w_j)² × b_c, where w_j ∈ [0, 1], b_c = 1/√n_c, and n_c is the number of instances labeled C = c. Our proposed variation of (8), given in (13), embeds this weight-driven bandwidth in the kernel, and p̂(t_j | c, λ_j) then estimates p(t_j | c) as before. Hence, the AW-SKDE framework classifies by

ĉ(t) = argmax_{c ∈ C} p(c) ∏_{j=1}^{m} p̂(t_j | c, λ_j).

The AW-SKDE framework incorporates a smooth kernel so that the probabilistic estimation of the likelihood is dominated by the weights. This enables a natural combination of kernel methods and weighting methods. Once the kernel is set up, we can generate a set of weights estimated by various methods cooperating with the kernel.
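The bandwidth rule λ_j = (1 − w_j)² × b_c with b_c = 1/√n_c stated above can be sketched directly; this transcribes only the stated formula (variable names are ours), not the full kernel estimate.

```python
import math

def bandwidth(w_j, n_c):
    """lambda_j = (1 - w_j)^2 * b_c, with b_c = 1 / sqrt(n_c)."""
    b_c = 1.0 / math.sqrt(n_c)
    return (1.0 - w_j) ** 2 * b_c

# a fully trusted attribute (w_j = 1) gets zero bandwidth:
# its likelihood estimate stays at the raw frequency
print(bandwidth(1.0, 100))  # -> 0.0

# a weight of 0 gives the largest bandwidth b_c,
# pulling the estimate toward the uniform 1/|X_j|
print(bandwidth(0.0, 100))  # -> 0.1
```

This is how the weights "dominate" the likelihood estimation: the weight does not rescale the probability directly but controls how far the kernel smooths it away from the frequency estimate.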
Our approach generates a set of attribute weights w_j ∈ [0, 1] by employing the mutual information between X_j and C. Intuitively, if one attribute has more mutual information with the class label, it provides more classification ability than the other attributes and should therefore be assigned a larger weight. The average weight w_avg of each attribute X_j is derived from the mutual information I(X_j; C), defined as

I(X_j; C) = Σ_{x_j ∈ X_j} Σ_{c ∈ C} p(x_j, c) log ( p(x_j, c) / ( p(x_j) p(c) ) ).

We also incorporate the split information used in C4.5 [15], w_split, into our weighting scheme to avoid favoring attributes with many values. The split information for each X_j, where x_j^(i) is the value of attribute X_j at the i-th instance (as described in the Notations section), is

SplitInfo(X_j) = − Σ_{x_j ∈ X_j} p(x_j) log p(x_j).

The weight of X_j is then defined in terms of w_avg and w_split. We feed AW-SKDE MI with a training data set D. In the training stage, we generate w_avg, w_split, and the resulting weight for each X_j. In the classification phase, given a test instance t, the AW-SKDE MI classifier is formed and a class prediction is output. The learning algorithm of AW-SKDE MI is described in Algorithm 1.
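The two quantities above can be estimated empirically from counts, as sketched below. This shows only the mutual information and the C4.5-style split information; the paper's exact combination of w_avg and w_split into the final weight is not reproduced here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X; C) = sum_{x,c} p(x,c) * log( p(x,c) / (p(x) p(c)) ), in nats."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((cnt / n) * math.log((cnt / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), cnt in pxy.items())

def split_info(xs):
    """C4.5 split information: -sum_x p(x) * log2 p(x) over attribute values."""
    n = len(xs)
    return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(xs).values())

xs = ["a", "a", "b", "b"]
ys = ["yes", "yes", "no", "no"]    # the attribute perfectly predicts the class
print(mutual_information(xs, ys))  # log 2 ~ 0.693 nats, the maximum here
print(split_info(xs))              # 1.0 bit for two equally frequent values
```

Dividing by the split information penalizes many-valued attributes, which would otherwise accumulate spuriously high mutual information; this is the same correction gain ratio applies in C4.5.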
During the training phase, AW-SKDE MI only needs to construct conditional probability tables (CPTs), which contain the joint probabilities of attribute values and class labels. In terms of time complexity, the calculations of I(X_j; C), w_avg, w_split, and the final weights w_j each take time at most quadratic in the quantities of Table 1, and the classification phase is linear in the number of attributes. The time complexities of AW-SKDE MI and naïve Bayes are summarized in Table 1.
Here, we also present a framework named Attribute Weighting with Light Smooth Kernel Density Estimation, simply AW-LSKDE, which does not consider the bandwidth.

Under AW-LSKDE, the estimate p̂(t_j | c) is obtained from the kernel without a bandwidth term. We also build an attribute weighting naïve Bayes learner with the mutual information metric based on this AW-LSKDE framework, called AW-LSKDE MI. The method for obtaining the weight of X_j is the same as that of the AW-SKDE MI learner. Unfortunately, the AW-LSKDE framework does not give us encouraging results; the experimental results of the AW-LSKDE MI learner can be found in Table 3, together with an analysis.

Experimental Results
In order to compare AW-SKDE MI, AW-LSKDE MI, and naïve Bayes in terms of classification accuracy, we conducted experiments on UCI Machine Learning Repository benchmark data sets [16]. The data sets used are listed in Table 2. Note that we preprocessed each data set by removing missing values and discretizing numerical attribute values.
In our implementation, all probabilities, including p(C = c) and p(X_j = x_j, C = c), are estimated via Laplacian smoothing, where n is the number of training examples for which the class value is known, n_j is the number of training examples for which both attribute X_j and the class are known, and count(·) denotes the corresponding count. Dividing the estimate of p(X_j = x_j, C = c) by the estimate of p(C = c) yields the conditional probability p(X_j = x_j | C = c). To compare the performance of the algorithms, we adopted a t-test with 10-fold cross-validation. We applied our algorithms and standard naïve Bayes to the same training and test sets, and performance is evaluated by classification accuracy.
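The Laplacian smoothing used for these estimates can be sketched as standard add-one smoothing, consistent with the description above. The exact denominators in the paper's dropped equation may differ; the forms and names below are our assumption.

```python
def laplace_prior(count_c, n, num_classes):
    """Smoothed class prior: (count(C = c) + 1) / (n + |C|)."""
    return (count_c + 1) / (n + num_classes)

def laplace_conditional(count_xc, n_c, cardinality):
    """Smoothed conditional: (count(X_j = x_j, C = c) + 1) / (n_c + |X_j|)."""
    return (count_xc + 1) / (n_c + cardinality)

# an attribute value never seen with class c still gets nonzero probability,
# which keeps the naive Bayes product in (3) from collapsing to zero
print(laplace_conditional(0, 10, 3))  # -> 1/13, not 0
print(laplace_prior(4, 10, 2))        # -> 5/12
```

Adding 1 to every count and the matching cardinality to every denominator keeps each estimate a proper probability distribution while removing zeros.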
Table 3 shows the comparison of accuracies among standard naïve Bayes, AW-SKDE MI learner, and AW-LSKDE MI learner.
Within the seventeen UCI data sets, the AW-SKDE MI learner shows four better, six even, and seven worse results compared with naïve Bayes, while the AW-LSKDE MI learner has only one better result. Note that accuracies are estimated using 10-fold cross-validation with 95% confidence intervals. AW-SKDE MI performs notably well on the anneal data set, and its mean accuracy of 84.81 is slightly better than naïve Bayes' 84.78. These experimental results suggest that our new attribute weighting model AW-SKDE MI is efficient and effective. The AW-LSKDE MI learner performed poorly because ignoring the bandwidth parameter of the kernel method results in a relatively larger bias.

Conclusions and Future Work
In this paper, a novel attribute weighting framework called Attribute Weighting with Smooth Kernel Density Estimation (AW-SKDE) has been proposed. The AW-SKDE framework enables the likelihood estimation to be dominated by attribute weights. Based on AW-SKDE, the AW-SKDE MI learner has been proposed to exploit mutual information. We conducted experiments on seventeen UCI benchmark data sets and compared the accuracy of standard naïve Bayes, AW-SKDE MI, and AW-LSKDE MI. The experimental results suggest that our new learner, AW-SKDE MI, is efficient and effective, while AW-LSKDE MI underperformed due to the relatively larger bias of its algorithm.
Even though AW-SKDE MI shows comparable results, as shown in Table 3, it does not decisively outperform naïve Bayes. In future work, we plan to improve the AW-SKDE framework and investigate more effective attribute weighting methods beyond the mutual information measurement between attributes and the class label.

Mathematical Problems in Engineering


Notations

X_j: The j-th attribute in a data set
|X_j|: The cardinality of attribute X_j
x_j^(i): The value of X_j at the i-th instance
D = {x^(1), ..., x^(n)}: Training data set consisting of n instances
x = ⟨x_1, ..., x_m⟩: An instance (m-dimensional vector), x ∈ D
C: Class label set, C = {c_1, ..., c_|C|}
c: An element of C, c ∈ C
t = ⟨t_1, ..., t_m⟩: A test instance (m-dimensional vector)
p(a): The unconditioned probability of event a
p(a | b): The conditional probability of a given b
p̂(·): An estimation of p(·)
F_c(·): The frequency of · given c
w_j ∈ [0, 1]: The weight value of attribute X_j
I(X_j; C): The mutual information between X_j and C

Table 1:
Time complexity (m: the number of attributes, n: the number of training examples, k: the number of classes, and v: the average number of values for an attribute).

Table 2 :
Description of data sets used in the experiments.

Table 3 :
Experimental results in terms of classifiers' accuracy.Note that accuracies are estimated using 10-fold cross-validation with 95% confidence interval.