Predicting Facial Biotypes Using Continuous Bayesian Network Classifiers

Bayesian networks are useful machine learning techniques that are able to combine quantitative modeling, through probability theory, with qualitative modeling, through graph theory for visualization. We apply Bayesian network classifiers to the facial biotype classification problem, an important stage during orthodontic treatment planning. For this, we present adaptations of classical Bayesian network classifiers to handle continuous attributes; also, we propose an incremental tree construction procedure for tree-like Bayesian network classifiers. We evaluate the performance of the proposed adaptations and compare them with other continuous Bayesian network classifier approaches as well as support vector machines. The results under the classification performance measures, accuracy and kappa, showed the effectiveness of the continuous Bayesian network classifiers, especially when a reduced number of attributes was used. Additionally, the resulting networks allowed visualizing the probability relations amongst the attributes under this classification problem, a useful decision-making tool for orthodontists.


Introduction
In orthodontics, it is essential to know the changes that occur during facial growth when planning a treatment, especially in children and adolescents, because the amount and direction of growth can significantly alter the need for different treatment mechanics [1,2]. Normally, clinicians use radiographs or photographs to compute angular, linear, or proportional measurements of the face and skull to obtain growth patterns or facial biotypes [3]. One of the most popular methods to determine the facial biotype is the VERT index proposed by Ricketts [4]. The VERT index is computed using five different features (or attributes) that allow the facial morphology to be analyzed [5]. Based on the VERT index, the biotypes can be classified into Dolichofacial (long and narrow face), Brachyfacial (short and wide face), and an intermediate type called Mesofacial [3,5]. These three biotypes are shown in Figure 1.
It has been described that some attributes used in the VERT index can alter the index in patients in whom the sagittal relationship between the jaws is altered, leading to possible diagnostic errors [3]. For this reason, automatically determining the facial biotype using attributes that are not altered by the sagittal position of the jaws would eliminate the errors observed with the use of the VERT index. Thus, in this work, we propose a machine learning approach to automatically classify a patient's biotype using alternative attributes.
In recent years, we have seen great advances in the field of machine learning in relation to predictive modeling, in particular, supervised learning algorithms for classification and regression problems, such as random forests (RF) [6], support vector machines (SVM) [7], and neural networks with random weights, such as feedforward neural networks with random weights (RWSLFN) [8], random vector functional link neural networks (RVFLN) [9], and extreme learning machines (ELM) [10]. All of these models are achieving extraordinary performances in several applications, including orthodontics, such as the automatic Dent-landmark detection in 3D cone-beam computed tomography dental data [11], a method that objectively evaluates orthodontic treatment need and treatment outcome from the lay perspective [12], pattern classification for finding facial growth abnormalities [13], and an automated diagnostic imaging system for orthodontic treatment in dentistry [14], just to mention a few. While high accuracy in the predictions and good generalization power are the main goals in several applications, the use of machine learning in medical treatment planning additionally requires that these models be simple to interpret, so that they can be used as a tool for decision-making. The algorithms mentioned before, although very powerful from a quantitative point of view, are somewhat limited from a qualitative aspect, in the sense that, for example, a trained SVM classifier does not give explicit classification rules or a simple visual interpretation of how the attributes interact in order to obtain the classification of a new data point. This issue has been tackled by other types of machine learning techniques, where the qualitative aspect plays a key role, such as inductive learning algorithms [15][16][17][18] and decision trees [19]. These techniques are known as white box models (as opposed to the black box models mentioned before) since the prediction process is open to the user.
An interesting machine learning model that combines probability theory (quantitative) with graph theory for visualization (qualitative) is Bayesian networks, introduced by J. Pearl [20], and in particular for this work, Bayesian network classifiers [21]. A Bayesian network (BN) is a directed acyclic graph (DAG), whose nodes represent discrete attributes and whose edges represent probabilistic relationships among them. Additionally, each node has an associated conditional probability table, indicating the conditional probability of each discrete value of the node given each value of its parent nodes in the network (graph). The structure of the graph encodes the assertion that each attribute (node) is conditionally independent of its nondescendants, given its parents in the graph (this is known as the Markov condition). Therefore, given that a Bayesian network satisfies the Markov condition, the joint probability distribution of all the attributes can be computed in a factorized form. Bayesian networks have been applied in the domain of dentistry, for example, in a decision-making system for the treatment of dental caries [22], the assessment of tooth color changes due to orthodontic treatment [23], the evaluation of the relative role and possible causal relationships among various factors affecting the diagnosis and final treatment outcome of impacted maxillary canines [24], a ranking in efficacy to establish the best technique for coronally advanced flap-based root coverage procedures [25], a minimally invasive technique for lateral maxillary sinus elevation together with the identification of the relationship between the involved factors [26], and the development of a clinical decision support system to help general practitioners assess the need for orthodontic treatment in patients with permanent dentition [27].
Learning Bayesian networks from data has two components that must be handled: (1) the structure of the network and (2) the parameters (conditional probability tables). It has been proven that learning Bayesian networks is NP-complete [28]. Therefore, several approximate learning approaches have been devised in order to simplify the learning process [29][30][31][32].
In this paper, we consider the problem of facial biotype classification using Bayesian network classifiers with continuous attributes. The rest of the paper is organized as follows. Section 2 presents a general overview of Bayesian network classifiers; then in Section 3 we describe the dataset used in this work, the continuous attribute adaptation for common Bayesian network classifiers, an incremental tree construction procedure for tree-like Bayesian networks, other continuous Bayesian network classifier approaches, and the simulation setup to test and compare the classifiers. The results and discussion appear in Section 4; the final conclusions are given in Section 5.

Background
Probabilistic classification consists in computing a posterior probability given an input data point. We will use the standard notation in Bayesian networks, where random variables (attributes) are denoted by capital letters, e.g., $X$, and particular values with lower-case letters, e.g., $x$. Let us consider a training set $D$ consisting of $N$ data points, each one characterized by $n$ attributes $X_1, \dots, X_n$ and their respective output $C$ or class label (with $K$ classes). Given a new input data point $\mathbf{x} = (x_1, \dots, x_n)$, this can be classified using the Bayes rule,

$$P(C = c \mid \mathbf{x}) = \frac{1}{Z}\, P(C = c)\, P(\mathbf{x} \mid C = c), \tag{1}$$

with $Z$ the normalizing constant. From (1), we notice that there are two probabilities that can influence the resulting prediction. The first one is $P(C = c)$ (with $c = 1, \dots, K$), the a priori probability of the class value $c$. The second one, $P(\mathbf{x} \mid C = c)$, is called the likelihood and corresponds to the joint probability distribution of the attributes conditioned to the class $C$. There are several methods to compute the joint probability distribution, in particular, using Bayesian networks, thus giving way to Bayesian network classifiers. The simplest approach is to consider "naively" that the attributes are independent amongst them given the class, which yields the naive Bayesian (NB) network classifier [33].
The prediction is computed by

$$\hat{c} = \arg\max_{c}\; P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c). \tag{2}$$

An example of the Bayesian network representation (with $n = 5$) of this classifier is shown in Figure 2(a). Given the difficulty of learning Bayesian networks from data, as discussed before, learning strategies have considered restrictions on the type of structure of the network. That is the case with the seminal work by Chow and Liu [34], who developed a learning algorithm for approximating the joint distribution by a tree structure, i.e., a network with $n - 1$ edges, where one node acts as the root (no incoming edges, only outgoing edges), and all the rest of the nodes have only one parent node. Let $\Pi_i$ represent the parent node of the attribute $X_i$ (for $i = 1, \dots, n$); also let $i^*$ be the index of the node which acts as the root; therefore, $\Pi_{i^*} = \emptyset$. Under this scheme, the training set $D$ is partitioned according to the different class labels. Then for each partition, a tree structure is learned to model the corresponding joint probability distribution $P_c$ (with $c = 1, \dots, K$). The prediction is computed by

$$\hat{c} = \arg\max_{c}\; P(C = c) \prod_{i=1}^{n} P_c(X_i = x_i \mid \Pi_i). \tag{3}$$

This model is also known as the Chow-Liu (CL) classifier. An example of the CL classifier (for $n = 5$ and $K = 2$) is shown in Figure 2(b). Notice that given that $C \in \{1, \dots, K\}$, i.e., there are $K$ different class labels, the CL classifier must learn $K$ tree structures. An alternative to this is the model called the tree augmented naive Bayes classifier or TAN [21], which learns only one tree structure for all the classes. Under this model, $\Pi_i = \{X_j, C\}$; i.e., for each node $X_i$, the parent set $\Pi_i$ is composed of two nodes: $X_j$ (with $j \neq i$) and the class variable $C$, with the exception of $X_{i^*}$ (the attribute root node), where $\Pi_{i^*} = \{C\}$. The prediction using the TAN classifier can be obtained by

$$\hat{c} = \arg\max_{c}\; P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid \Pi_i). \tag{4}$$

An example of the TAN classifier (with $n = 5$) is shown in Figure 2(c).
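As an illustration of the naive Bayes prediction rule, the following minimal Python sketch evaluates the class score in log space and returns the maximizing class. It is not the authors' R implementation: the function names, the toy Gaussian densities, and all numbers are hypothetical and only serve to make the rule concrete.

```python
import math

def predict_naive_bayes(x, priors, cond_density):
    """Return the class c maximizing log P(c) + sum_i log p(x_i | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, x_i in enumerate(x):
            # Floor the density so that log() never receives zero.
            score += math.log(max(cond_density(i, x_i, c), 1e-300))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy setup (hypothetical numbers): two attributes, unit-variance Gaussians.
MEANS = {"Brachyfacial": [0.0, 0.0], "Dolichofacial": [3.0, 3.0]}

def toy_density(i, x_i, c):
    """Class-conditional density of attribute i under class c."""
    return math.exp(-0.5 * (x_i - MEANS[c][i]) ** 2) / math.sqrt(2 * math.pi)

priors = {"Brachyfacial": 0.5, "Dolichofacial": 0.5}
print(predict_naive_bayes([0.2, -0.1], priors, toy_density))
```

Working with log scores avoids numerical underflow when many attribute densities are multiplied; the normalizing constant is omitted because it does not change the maximizing class.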
It is interesting to notice that while TAN was presented as a solution to the strong independence assumption in the naive Bayes classifier, in the tests presented in the TAN paper [21] there are cases where naive Bayes outperformed TAN. Could it be that, because TAN forces a tree structure amongst the attributes, there are edges in the network which should not exist but are there only to satisfy the tree structure? With this in mind, in this paper, we propose an incremental tree construction procedure which may lead to an incomplete tree structure, known as a forest.

Dataset Description and Preprocessing. The dataset consists of 182 lateral teleradiographs from Chilean patients. For each one, cephalometric analysis was performed to compute 31 continuous attributes (see Appendix) that characterize the craniofacial morphology. This dataset has been used previously to identify craniofacial patterns through clustering analysis [41]. For this work, each lateral teleradiograph has been manually classified and validated by orthodontists into one of the three classes (Brachyfacial, Dolichofacial, and Mesofacial). A visualization of the correlation matrix of the 31 attributes is shown in Figure 3, where we can appreciate that several attributes are highly correlated (more than 0.8 in absolute value).
Highly correlated attributes essentially capture the same information, and therefore we can reduce the number of attributes by keeping only one attribute from each highly correlated set. For example, from Figure 3 we notice that Ri10 and Mc3 are highly correlated (a correlation of 0.95); this is not surprising since both attributes indicate the sagittal position of the maxilla with respect to the skull, using different cephalometric landmarks. Therefore, we may drop Ri10 in further analyses and use only Mc3. By assuming a threshold of 0.8 for the absolute value of the correlation, we excluded the following attributes: Mc5, Mc6, Ri10, Ri18, Ja8, Ja10, and Ja11. Thus, the number of attributes of the dataset is now 24. For these remaining attributes, we visualized their discriminatory power by performing a principal component analysis (PCA) projection of the 24-dimensional data points to a 2-dimensional space; each point is then labeled according to its class (facial biotype). The resulting visualization is shown in Figure 4.
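The correlation-based filtering can be sketched as follows. This is a minimal Python illustration under a stated assumption: attributes are scanned in order and an attribute is dropped when its absolute Pearson correlation with an already kept attribute exceeds the threshold. The source does not state how the retained member of each correlated set was chosen, so the greedy order here is hypothetical, as are the toy columns.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.8):
    """Keep an attribute only if |r| <= threshold with every kept attribute."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): "b" is almost a multiple of "a", "c" is not.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
kept = drop_correlated(columns)
print(kept)
```

In practice the attribute to keep from each correlated group may be decided with domain knowledge, as in the Ri10 versus Mc3 example.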
From Figure 4, we notice that while the attributes have sufficient discriminatory power to separate the Brachyfacial class from the Dolichofacial class, the third class, Mesofacial, lies just between the other two, making this a difficult classification problem.

Continuous Bayesian Network Classifiers.
As explained in the Introduction, we will consider Bayesian networks for this facial biotype classification problem. Given that Bayesian networks were originally formulated for discrete random variables, and our dataset has continuous variables (attributes), we need to address this issue. A typical approach is to discretize the continuous attributes and then proceed as usual. While this is a practical solution, finding an ideal discretization is not straightforward, and therefore valuable information may be lost during this process. In what follows, we describe the continuous adaptation for the naive Bayes, TAN, and an incremental tree construction version of TAN, through the implementation in R, the open source software environment for statistical computing and graphics [42], that we used in our work.

Continuous
This is a nonnegative quantity that measures the information that $X_j$ provides about $X_i$ when the value of $C$ is known. For continuous variables, the mutual information between two attributes is given by

$$I(X_i; X_j) = \iint p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\, dx_i\, dx_j. \tag{6}$$

Then, the conditional mutual information for the continuous case can be computed by

$$I(X_i; X_j \mid C) = \sum_{c=1}^{K} P(c)\, I(X_i; X_j \mid C = c). \tag{7}$$

So, $I(X_i; X_j \mid C = c)$ can be computed for each class value $c$ (with $c = 1, \dots, K$) by (6) using all the training examples where $C = c$. We estimate (6) using the knnmi function available in the parmigene package in R [45]. This function estimates the mutual information between two attributes using entropy estimates from k-nearest neighbors distances [46]. Once we have computed the conditional mutual information for each pair of attributes, we construct the fully connected graph with the graph.full function in the igraph package in R [47]. Then the tree structure is obtained from the fully connected graph by using the minimum.spanning.tree function (also in igraph), which uses Prim's algorithm. Since we are interested in the maximum spanning tree, we use minimum.spanning.tree with the negative values of the conditional mutual information as weights. The resulting tree is undirected. To obtain the directed tree, we identify the pair of attributes with the highest edge weight (conditional mutual information), take one of the attributes of this pair as the root, and then set the direction of all the remaining edges to be outward from it. To finally obtain the TAN classifier, we add an edge from $C$ to each attribute $X_i$. We are now in a position to compute (4) for a given data point. The priors $P(C)$ can be computed as usual through relative frequencies.
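The tree construction described above can be sketched in a few lines of Python. This is an illustrative stand-in for the igraph calls, not the authors' code: a small Prim-style minimum spanning tree is run on negated weights, and the edges are then oriented outward from one endpoint of the heaviest edge. The function names and the toy weight matrix are hypothetical, and which endpoint becomes the root is an arbitrary choice here.

```python
def minimum_spanning_tree(cost):
    """Prim's algorithm on a dense symmetric cost matrix; returns edges."""
    n = len(cost)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        u, v = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: cost[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges

def directed_tan_tree(cmi):
    """Return {node: parent} for the maximum spanning tree of cmi."""
    n = len(cmi)
    negated = [[-w for w in row] for row in cmi]  # max tree via min tree
    undirected = minimum_spanning_tree(negated)
    # Root: one endpoint of the edge with the highest weight.
    root = max(undirected, key=lambda e: cmi[e[0]][e[1]])[0]
    neighbours = {i: [] for i in range(n)}
    for u, v in undirected:
        neighbours[u].append(v)
        neighbours[v].append(u)
    parent, queue = {root: None}, [root]
    while queue:  # breadth-first orientation away from the root
        u = queue.pop(0)
        for v in neighbours[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

# Toy conditional mutual information matrix (hypothetical values).
cmi = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.5, 0.3],
    [0.1, 0.5, 0.0, 0.4],
    [0.2, 0.3, 0.4, 0.0],
]
parents = directed_tan_tree(cmi)
print(parents)
```

Adding the class variable as an extra parent of every attribute then gives the TAN parent sets.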
Then the terms $p(X_i \mid \Pi_i)$ in the product are computed as follows. For the root attribute $X_{i^*}$ we have that $\Pi_{i^*} = \{C\}$; thus, we can use the kernel density approach described for the naive Bayes classifier. For the rest of the terms in the product, we will have $\Pi_i = \{X_j, C\}$ given by the tree structure.
We estimate the joint probability with a two-dimensional kernel approach. In particular, we use the function kde2d in the MASS package in R [48]. This function performs a two-dimensional kernel density estimation with an axis-aligned bivariate normal kernel, evaluated on a square grid. Then, to obtain specific values from this density, we use the interp.surface function from the fields package in R [49]. This function uses bilinear weights to interpolate values on a rectangular grid to desired values. Finally, this joint probability estimate is normalized by $p(X_j \mid C)$, which can be computed using the same approach used for the naive Bayes classifier.
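The ratio of a two-dimensional to a one-dimensional kernel estimate can be illustrated with a minimal Python sketch. It is a simplification and not the kde2d plus interp.surface pipeline: the densities are evaluated directly at the query point instead of on a grid followed by bilinear interpolation, the bandwidths are passed in by hand, and all names and numbers are hypothetical.

```python
import math

def gaussian_kde_1d(sample, h):
    """One-dimensional Gaussian kernel density estimate with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm

def gaussian_kde_2d(pairs, hx, hy):
    """Two-dimensional estimate with an axis-aligned Gaussian kernel."""
    norm = len(pairs) * 2 * math.pi * hx * hy
    def density(x, y):
        return sum(
            math.exp(-0.5 * (((x - sx) / hx) ** 2 + ((y - sy) / hy) ** 2))
            for sx, sy in pairs
        ) / norm
    return density

def conditional_density(pairs, hx, hy):
    """p(x_i | x_j) = p(x_i, x_j) / p(x_j) for the cases of one class."""
    joint = gaussian_kde_2d(pairs, hx, hy)
    marginal = gaussian_kde_1d([xj for _, xj in pairs], hy)
    return lambda xi, xj: joint(xi, xj) / marginal(xj)

# Toy (x_i, x_j) pairs of a single class (hypothetical numbers).
pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.5), (3.0, 3.0)]
cond = conditional_density(pairs, hx=0.5, hy=0.5)
# Riemann sum over x_i for a fixed parent value x_j = 1.0.
area = sum(cond(i * 0.01, 1.0) for i in range(-1000, 1301)) * 0.01
print(round(area, 4))
```

Because the marginal uses the same bandwidth as the second axis of the joint kernel, the resulting conditional integrates to one over $x_i$, which is a convenient sanity check for this kind of estimate.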

Continuous Incremental Tree Construction Augmented Naive Bayes Classifier. We propose an alternative learning procedure for the TAN classifier, which we call incremental tree construction augmented naive Bayes (ITCAN). One of the limitations of the TAN model is that the resulting structure will always be a tree, even if some edges have very low weights (conditional mutual information). With ITCAN, we identify partial TAN solutions where some nodes (attributes) might end up with only the incoming edge from the class. The ITCAN learning procedure with a training set is as follows:
(1) Evaluate the accuracy of a naive Bayes classifier using $k$-fold cross validation. Let this value be $A_0$.
(2) Learn the TAN tree structure as described in Section 3.2.2.
(5) For each $h$ in the list:
(a) $G \leftarrow G + e_h$
(b) Evaluate the accuracy of the $G$ classifier using $k$-fold cross validation. Let this value be $A_h$.
From the above learning procedure, if $h^* = 0$, then the resulting model is the naive Bayes classifier. If $h^* = n - 1$, then the resulting model is the TAN classifier. For any other value of $h^*$, the resulting structure will be a forest, a midway solution between naive Bayes and TAN. For the results presented later on, we use $k = 5$ in the $k$-fold cross validation in the ITCAN learning procedure.
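A possible reading of the ITCAN procedure is sketched below in Python. Several points are assumptions, since they are not spelled out in the text available here: the TAN edges are taken in decreasing order of conditional mutual information, the starting structure is the naive Bayes one (no attribute-to-attribute edges), and the selected index is the one with the highest cross-validated accuracy, with ties resolved in favour of fewer edges. The `evaluate` callback stands for a $k$-fold cross validation routine and is hypothetical, as are the edge weights and accuracies in the example.

```python
def itcan_select(tan_edges, evaluate):
    """Pick the number of TAN edges that maximizes evaluate(structure).

    tan_edges: list of (parent, child, weight) tuples from the TAN tree.
    evaluate:  callable returning the cross-validated accuracy of the
               classifier built from a list of attribute edges.
    """
    # Assumption: edges are tried from the strongest to the weakest.
    ordered = sorted(tan_edges, key=lambda e: e[2], reverse=True)
    structure = []
    best_h, best_acc = 0, evaluate(structure)  # h = 0 is naive Bayes
    for h, edge in enumerate(ordered, start=1):
        structure.append(edge)
        acc = evaluate(list(structure))
        if acc > best_acc:  # strict: ties keep the smaller structure
            best_h, best_acc = h, acc
    return best_h, ordered[:best_h]

# Hypothetical edges and accuracies, only to exercise the function.
edges = [("Ja5", "Ri13", 0.40), ("Ja5", "Ri15", 0.10), ("Ri15", "Mc3", 0.25)]
fake_cv_accuracy = {0: 0.60, 1: 0.65, 2: 0.70, 3: 0.68}
best_h, chosen = itcan_select(edges, lambda s: fake_cv_accuracy[len(s)])
print(best_h, chosen)
```

With these toy accuracies the selected structure has two edges, that is, a forest between naive Bayes (no edges) and the full TAN tree (all three edges).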
There have been other approaches to search for Bayesian network models bounded by the naive Bayes and the TAN classifiers; one example is the Forest-Augmented Bayesian Network (FAN) algorithm [50]. While ITCAN learns the TAN tree structure once, the FAN algorithm uses another approach. It first computes the conditional mutual information between all pairs of attributes; then it constructs the fully connected graph using the negative value of the conditional mutual information as weights between the attributes. But now, instead of finding the minimum weighted spanning tree (like TAN), it searches for the minimum weighted spanning forest containing exactly $m$ edges (with $m \geq 0$ defined by the user). So, to explore the complete range of structures, the user must apply FAN $n$ times ($m = 0, \dots, n - 1$). Another difference is that when FAN transforms the undirected forest into a directed forest, it does so by choosing a root vertex for every tree in the forest. This procedure could yield different structures when compared to ITCAN, which uses the edges from the unique TAN structure.

Other Continuous Bayesian Network Classifiers Approaches.
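In [51], conditional Gaussian network (CGN) classifiers were introduced. Of particular interest for this work are the Gaussian NB (gNB) and the Gaussian TAN (gTAN). In the case of gNB, the probabilities in the product term in (2) are approximated by

$$p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{i|c}} \exp\left(-\frac{(x_i - \mu_{i|c})^2}{2\sigma_{i|c}^2}\right), \tag{9}$$

where $\mu_{i|c}$ and $\sigma_{i|c}$ are the mean and the standard deviation, respectively, of attribute $X_i$, computed by using only the examples that have a class value $C = c$. For gTAN, the probabilities in the product term in (4) are approximated by

$$p(x_i \mid x_j, c) = \frac{1}{\sqrt{2\pi v_{i|c}}} \exp\left(-\frac{(x_i - m_{i|c})^2}{2 v_{i|c}}\right), \tag{10}$$

where $m_{i|c}$ and $v_{i|c}$ are defined by

$$m_{i|c} = \mu_{i|c} + \beta_{ij|c}\,(x_j - \mu_{j|c}), \tag{11}$$

$$v_{i|c} = \sigma_{i|c}^2 - \beta_{ij|c}^2\, \sigma_{j|c}^2, \tag{12}$$

where we have considered $X_j$ as the parent attribute of $X_i$. $\beta_{ij|c}$ is the regression coefficient of $X_i$ on $X_j$ conditioned to the class value $C = c$, defined by

$$\beta_{ij|c} = \frac{\sigma_{ij|c}}{\sigma_{j|c}^2}, \tag{13}$$

where $\sigma_{ij|c}$ is the covariance between the variables $X_i$ and $X_j$ conditioned to $C$ and $\sigma_{j|c}^2$ is the variance of $X_j$ conditioned to $C$.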
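The conditional Gaussian term used by gTAN can be made concrete with a short Python sketch. It assumes the standard conditioning of a bivariate Gaussian (conditional mean shifted by the regression coefficient times the parent deviation, conditional variance reduced by the explained part) and uses maximum likelihood moments (division by the number of cases); the function name and the toy data are hypothetical.

```python
import math

def conditional_gaussian(xi_sample, xj_sample):
    """Fit p(x_i | x_j) as a Gaussian from the cases of a single class.

    Returns (beta, v, density), where beta is the regression coefficient
    of X_i on X_j, v the conditional variance, and density(x_i, x_j)
    the fitted conditional density.
    """
    n = len(xi_sample)
    mu_i, mu_j = sum(xi_sample) / n, sum(xj_sample) / n
    var_i = sum((a - mu_i) ** 2 for a in xi_sample) / n
    var_j = sum((b - mu_j) ** 2 for b in xj_sample) / n
    cov_ij = sum((a - mu_i) * (b - mu_j) for a, b in zip(xi_sample, xj_sample)) / n
    beta = cov_ij / var_j
    v = var_i - beta ** 2 * var_j

    def density(x_i, x_j):
        m = mu_i + beta * (x_j - mu_j)  # conditional mean
        return math.exp(-0.5 * (x_i - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

    return beta, v, density

# Toy class partition (hypothetical): x_i is roughly 2 * x_j.
xj = [1.0, 2.0, 3.0, 4.0]
xi = [2.1, 3.9, 6.2, 7.8]
beta, v, dens = conditional_gaussian(xi, xj)
print(beta, v)
```

The sketch breaks down when the two attributes are perfectly correlated within a class, because the conditional variance then becomes zero.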
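It is also important to point out that, under this approach, the conditional mutual information is computed by

$$I(X_i; X_j \mid C) = -\frac{1}{2} \sum_{c=1}^{K} P(c) \log\left(1 - \rho_c^2(X_i, X_j)\right), \tag{14}$$

where $\rho_c(X_i, X_j) = \sigma_{ij|c} / \sqrt{\sigma_{i|c}^2\, \sigma_{j|c}^2}$ is the correlation coefficient between $X_i$ and $X_j$ conditioned to the class value $C = c$.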
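Under the Gaussian assumption, the conditional mutual information only needs the within-class correlation coefficients. The Python sketch below assumes the usual closed form for Gaussian variables, a class-weighted sum of $-\tfrac{1}{2}\log(1 - \rho_c^2)$, with class weights given by relative frequencies; the function names and toy pairs are hypothetical.

```python
import math

def correlation(pairs):
    """Pearson correlation of a list of (x_i, x_j) pairs."""
    n = len(pairs)
    mean_i = sum(a for a, _ in pairs) / n
    mean_j = sum(b for _, b in pairs) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in pairs)
    var_i = sum((a - mean_i) ** 2 for a, _ in pairs)
    var_j = sum((b - mean_j) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_i * var_j)

def gaussian_cmi(pairs_by_class):
    """Class-weighted sum of -0.5 * log(1 - rho_c^2)."""
    total = sum(len(pairs) for pairs in pairs_by_class.values())
    cmi = 0.0
    for pairs in pairs_by_class.values():
        rho = correlation(pairs)
        cmi += (len(pairs) / total) * (-0.5) * math.log(1.0 - rho ** 2)
    return cmi

# Toy data (hypothetical): one class whose correlation is -0.3.
toy = {"Mesofacial": [(1, 5), (2, 1), (3, 4), (4, 2), (5, 3)]}
value = gaussian_cmi(toy)
print(value)
```

The quantity is zero for uncorrelated attributes and grows without bound as the absolute correlation approaches one.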
Another approach to handle continuous attributes is described in [52], where kernel density estimation is adopted (similar to the approach presented in this paper), giving way to the so-called flexible classifiers. The flexible naive Bayes (fNB) classifier uses an approach similar to the one described in Section 3.2.1, where the conditional probabilities are computed with Gaussian kernels. One difference is the smoothing parameter $h$ (used by the kernel density estimator) in fNB, which is given by the normal rule:

$$h = \left(\frac{4}{d + 2}\right)^{1/(d+4)} N^{-1/(d+4)}, \tag{15}$$

where $d$ is the number of continuous variables in the density function to be estimated and $N$ is the number of cases from which the estimator is learned. In our proposal, the smoothing parameter considered (used by the density function in R) is a rule of thumb described in [53]:

$$h = 0.9\, A\, N^{-1/5}, \tag{16}$$

with

$$A = \min\left(\hat{\sigma},\ \frac{\mathrm{IQR}}{1.34}\right), \tag{17}$$

where $\hat{\sigma}$ is the sample standard deviation and IQR is the interquartile range. The flexible tree augmented naive Bayes (fTAN) computes the conditional probabilities in the product term of (4) using (8) and employing a 2-dimensional Gaussian density with identity covariance matrix, similar to the continuous TAN proposed here, but fTAN uses (15) to compute the bandwidth for the kernel, whereas our proposal uses (16) with the factor 0.9 changed to 4.24. Also, fTAN estimates the conditional mutual information in the following way:

$$\hat{I}(X_i; X_j \mid C) = \sum_{c=1}^{K} P(c)\, \frac{1}{N_c} \sum_{l=1}^{N_c} \log \frac{\hat{f}\left(x_i^{(c:l)}, x_j^{(c:l)} \mid c\right)}{\hat{f}\left(x_i^{(c:l)} \mid c\right) \hat{f}\left(x_j^{(c:l)} \mid c\right)}, \tag{18}$$

where the super-index $(c:l)$ refers to the $l$th case in the partition induced by the value $c$, and $N_c$ is the number of cases verifying that $C = c$. The densities $\hat{f}$ are computed using the kernels described previously. On the other hand, in our proposal, we estimate the conditional mutual information using entropy estimates from k-nearest neighbors distances [46]. Overall, when comparing with these previous continuous formulations (CGN and flexible), we notice that our proposal, based on kernel density estimates, resembles the flexible classifiers of [52], but with alternative implementations and using currently available R functions.
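The two bandwidth choices can be compared numerically with a small Python sketch. It assumes the textbook forms of both rules: the normal reference rule for $d$ variables, and the rule of thumb used as the default of R's density function, $0.9 \min(\hat{\sigma}, \mathrm{IQR}/1.34)\, N^{-1/5}$, with R's default quartile definition (which corresponds to the "inclusive" method of Python's statistics module). R's fallbacks for degenerate samples are omitted, and the function names are hypothetical.

```python
import statistics

def normal_rule(d, n):
    """Normal reference bandwidth for d continuous variables and n cases."""
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))

def rule_of_thumb(sample, factor=0.9):
    """factor * min(sd, IQR / 1.34) * n^(-1/5).

    With factor=4.24 this gives the scaling mentioned for the
    two-dimensional estimate.
    """
    n = len(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    return factor * spread * n ** (-0.2)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
h_thumb = rule_of_thumb(data)
h_normal = normal_rule(d=1, n=len(data))
print(h_thumb, h_normal)
```

Note that the normal rule as written carries no scale term, so it is meaningful for standardized attributes, whereas the rule of thumb adapts to the spread of the sample.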

Simulation Setup. We will compare the classification performance of the described continuous Bayesian network classifiers; in particular, we will compare our implementations, namely, cNB, cTAN, and cITCAN, with the conditional Gaussian networks approach (gNB, gTAN, and gITCAN), as well as the flexible approach (fNB, fTAN, and fITCAN). Also, we will consider the discrete versions: dNB, dTAN, and dITCAN. For this we will use the discretize function from the bnlearn package in R [54]. Finally, we will also consider a black box classifier, SVM. In particular, we use the svm function with default settings from the e1071 package in R [55].
To compute the classification performance, we randomly sample 70% of the dataset examples to generate a training set and use the remaining 30% as a test set. We train the thirteen classifiers on the same training set and then compute the accuracy (the fraction of correct predictions) and the kappa statistic using the test set. The kappa statistic compares the accuracy of the trained model with the accuracy of a random model. To interpret the kappa value, we use the common characterization proposed in [56]: values ≤ 0 indicate poor agreement, 0–0.2 slight, 0.21–0.4 fair, 0.41–0.6 moderate, 0.61–0.8 substantial, and 0.81–1 almost perfect agreement.
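The two performance measures can be computed as in the following minimal Python sketch. It assumes the usual definition of Cohen's kappa, $(p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed accuracy and $p_e$ the agreement expected by chance from the label frequencies; the function name and the toy labels are hypothetical.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two label lists."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal label frequencies.
    expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n) for label in labels
    )
    return observed, (observed - expected) / (1.0 - expected)

# Toy test set (hypothetical): B, D, M stand for the three biotypes.
y_true = ["B", "B", "D", "D", "M", "M"]
y_pred = ["B", "B", "D", "M", "M", "D"]
acc, kappa = accuracy_and_kappa(y_true, y_pred)
print(acc, kappa)
```

In this toy case four of six predictions are correct and the chance agreement is one third, so the kappa value falls in the moderate interval of the characterization above.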
We run the data splitting procedure 50 times and then report the average and the standard deviation of the accuracy and the kappa value over the runs. To statistically compare the performance of all the algorithms, we consider the Friedman test followed by a post hoc test to evaluate the pairwise performance when all the algorithms are compared to each other; in particular, we use the Nemenyi test. Further details of the process for the comparison of multiple algorithms are given in [57].

Results and Discussions
The classification performance results for the thirteen classifiers are shown in Table 1. On average, the best performance was obtained by SVM, while among the Bayesian network classifiers, cITCAN obtained the best performance. Also, considering the kappa value, only SVM and cITCAN fall in the moderate interval of classification agreement with the true classes, whereas most of the other classifiers are in the fair interval. The worst performance was obtained by fTAN (and the second worst by fITCAN); this could be due to the conditional mutual information estimation, for which probably not enough samples were available.
We considered as the null hypothesis that all the algorithms performed the same and that the observed differences were merely random. We conducted the Friedman test in order to analyze whether there are statistically significant differences among the algorithms. All the algorithms are ranked for each dataset (run) separately, where the best performing algorithm is the one obtaining the lowest rank. Table 2 shows the average rank for each algorithm.
The Friedman statistic is given by the following:

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right], \tag{19}$$

where $R_j$ is the average rank of the $j$-th algorithm, $k$ is the number of algorithms, and $N$ is the number of datasets. The statistic is distributed according to $\chi^2$ with $k - 1$ degrees of freedom. For the comparison of all the algorithms with the Friedman test, the $\chi_F^2$ statistic is 300.2 and the $p$ value is <2.2e-16, which rejects the null hypothesis that all the algorithms have the same performance.
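The ranking step and the Friedman statistic can be reproduced with a short Python sketch. It assumes the standard form of the statistic, $\frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$, with average ranks assigned to ties; the function names and the toy accuracy table are hypothetical.

```python
def average_ranks(scores):
    """scores[run][algorithm], higher is better; rank 1 is the best.

    Tied algorithms within a run share the average of their ranks.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average rank of the tied block
            for t in range(pos, end + 1):
                totals[order[t]] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic for an N runs by k algorithms table."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)

# Toy accuracies (hypothetical): 4 runs, 3 algorithms, same order each run.
toy_scores = [[0.9, 0.8, 0.7], [0.85, 0.8, 0.6], [0.7, 0.65, 0.6], [0.95, 0.9, 0.5]]
stat = friedman_statistic(toy_scores)
print(average_ranks(toy_scores), stat)
```

The statistic is then compared against a chi-square distribution with $k - 1$ degrees of freedom to obtain the $p$ value.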
Then, a post hoc test is performed to evaluate the pairwise performance when all the algorithms are compared to each other. The Nemenyi test with $\alpha = 0.05$ was applied, and the results are presented in Table 3. When comparing SVM with all the other classifiers, we notice that the null hypothesis cannot be rejected when compared to cNB, gNB, cITCAN, and gITCAN, since there are no statistically significant differences between them, whereas for our second best classifier, cITCAN, we notice that the null hypothesis cannot be rejected when compared to cNB, gNB, gTAN, gITCAN, and SVM.
Figure 5 shows the best cTAN model obtained throughout the 50 runs. We notice that Ja5 and Ri15 are the two attributes with the most outgoing edges (apart from the obvious class node Biotype), conditioning the probabilities of the other attributes. In particular, Ja5 is the parent node of Ri13, Ri16, Ja13, Ja6, and Ri15. This can be explained, in part, by the following: Ja5 as well as Ri13 corresponds to the length of the anterior cranial base using different landmarks. Ja13 and Ja6 correspond to variables given by the posterior cranial length. In this case, the relationship is explained by the fact that the growth of the anterior cranial base (Ja5) and posterior cranial base (Ja13 and Ja6) depends on a common factor, which is the growth of the brain; therefore, there is a linear proportionality between both structures. There is no biologically direct relationship to explain the relation between Ja5 and Ri15 or Ri16, except that, as in any biological system, there is a proportional and compensatory relationship between the structures tending to maintain the functionality and stability of the systems.
On the other hand, Ri15 is the parent node of Ri21, Mc7, Ri20, and Mc3. In this case, a greater or smaller mandibular size is directly related to a larger or smaller size of all its components, such as the width of the symphysis (Ri21) and width of the condyle (Ri20), which explains the relationship between these variables and the size of the mandibular body (Ri15). However, there is no biologically direct relation to explain the relationship between attribute Ri15 and Mc3 or Mc7. Attribute Mc3 indicates the sagittal position of the maxilla, which is independent of the size of the mandibular body (Ri15), and Mc7 is a vertical relationship (lower facial height) that is not directly influenced by the mandibular size.
The best cITCAN model obtained throughout the 50 runs is shown in Figure 6. We notice that it is a forest, where only 5 edges are retained from the total of 23 in the cTAN model (without counting the outgoing edges of the class variable). Here we observe that the influence of Ja5 on Ri13 and Ri15 is still required.
We explore the possibility of improving the classification performance by identifying the most relevant attributes for classification and then repeating the simulations with a reduced number of attributes. For this, the importance function from the randomForest package in R [58] was used. This function computes the importance of each attribute based on the Gini importance, a measure used to quantify the node impurity during the tree inference process (in decision trees or random forests). The result is shown in Figure 7.
We observe that Ja4 is the attribute with the most discriminatory power. We proceed to select the top 4 attributes, i.e., Ja4, Ja12, Mc7, and Mc3. In particular, the first three correspond to measurements that describe vertical dimensions, which are directly related to the determination of the biotype, since the primary difference between biotypes is the relationship between the vertical dimensions of the anterior and posterior regions of the craniofacial complex. It is noteworthy that attribute Mc3 is among those of higher importance, since it indicates the sagittal position of the maxilla with respect to the skull, a characteristic that is independent of and not directly related to the characteristics that allow the differentiation of biotypes.
With these four attributes, we repeat the performance evaluations and the statistical tests using the same 50 runs. Following the same statistical tests as before, Table 5 shows the average rank for each algorithm. For the comparison of all the algorithms with the Friedman test, the $\chi_F^2$ statistic is 372.66 and the $p$ value is <2.2e-16, which rejects the null hypothesis that all the algorithms have the same performance. As before, a post hoc test was performed to evaluate the pairwise performance when all the algorithms are compared to each other. The Nemenyi test with $\alpha = 0.05$ was applied, and the results are presented in Table 6.
When comparing cITCAN with all the other classifiers, we notice that the null hypothesis cannot be rejected when compared to cNB, gNB, cTAN, gTAN, gITCAN, and SVM, since there are no statistically significant differences between them, whereas for our second ranked classifier, cNB, we notice that the null hypothesis cannot be rejected when compared to gNB, cTAN, gTAN, cITCAN, gITCAN, and SVM.
The resulting network structures for cTAN and cITCAN (for the simulations with only four attributes) are shown in Figures 8 and 9, respectively.
Overall, dropping irrelevant attributes contributed to the improvements of the classification performances of all the models.

Conclusion
We have presented adaptations of popular Bayesian network classifiers (naive Bayes and TAN) to handle continuous attributes. Additionally, we have proposed an incremental tree construction procedure for TAN (ITCAN) that may yield forest structures that model the posterior class distribution more effectively, thus yielding competitive classification performances. We have applied these models to the facial biotype classification problem. Through classification performance measures and comparisons with other continuous Bayesian network classifier approaches, we showed that these models can obtain competitive results when compared to a black-box model such as SVM. Also, the resulting network structures help to shed light on the probability relations amongst the attributes, which contributes to the understanding of their role in the classification process. As an application in the context of medical informatics, trained Bayesian network classifiers for facial biotype classification can be used by orthodontists as an initial automatic screening process. Then, based on the posterior probability of the assigned class for each patient, a threshold can be defined such that classifications with posterior probabilities below it would require manual validation by the orthodontist.
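The screening idea in the last two sentences can be sketched as follows in Python. The threshold value, the function name, and the example posteriors are all hypothetical; the source does not specify how the threshold should be chosen.

```python
def screen(posteriors, threshold=0.8):
    """Return (predicted_class, needs_manual_validation).

    posteriors: dict mapping each biotype to its posterior probability.
    threshold:  hypothetical cut-off; predictions whose posterior is
                below it are flagged for review by the orthodontist.
    """
    label = max(posteriors, key=posteriors.get)
    return label, posteriors[label] < threshold

confident = screen({"Brachyfacial": 0.91, "Mesofacial": 0.07, "Dolichofacial": 0.02})
doubtful = screen({"Brachyfacial": 0.48, "Mesofacial": 0.45, "Dolichofacial": 0.07})
print(confident, doubtful)
```

A lower threshold reduces the number of cases sent for manual validation at the cost of accepting less certain automatic classifications.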

Figure 4: PCA projection of the 24-dimensional data points.

Figure 8: The cTAN classifier for the facial biotype dataset using only four attributes.

Figure 9: The cITCAN classifier for the facial biotype dataset using only four attributes.
The first probability in (1), $P(C = c)$ (with $c = 1, \dots, K$), is known as the a priori probability for the class value $c$ and represents the class distribution in $D$. The computation of this probability is simple, since it consists in counting the number of training examples in $D$ for which $C = c$ and then dividing this value by $N$. The second probability, $P(X_1, \dots, X_n \mid C = c)$, is the likelihood.

Continuous Naive Bayes Classifier. The classification under this model is computed by (2). Here, we need to estimate the class priors $P(C)$ and the conditional probabilities $p(X_i \mid C)$ for $i = 1, \dots, n$. The class priors are straightforward and can be computed by the relative frequency of each class value (Brachyfacial, Dolichofacial, and Mesofacial) in the training set. For the conditional probabilities, we partition the training set examples according to their class; then for each partition we use the kernel density estimator with Gaussian kernels to compute the desired densities. The kernel density estimator function in R is called density. Then we use the approx function in R, which performs linear interpolation from the estimated density, to obtain the value of $p(X_i = x_i \mid C)$ for a specific value $x_i$.

TAN finds the tree structure by applying the maximum weighted spanning tree algorithm (Kruskal's algorithm [43] or Prim's algorithm [44]) over a fully connected undirected graph of the attributes, where the weights are given by the conditional mutual information measure. For the discrete case, given two attributes $X_i$ and $X_j$ ($i \neq j$) with their values $x_i$ and $x_j$, respectively, and the class variable $C$, this measure is computed by

$$I(X_i; X_j \mid C) = \sum_{x_i, x_j, c} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}. \tag{5}$$

Therefore, we need to estimate conditional probabilities such as $p(X_i \mid X_j, C)$. Using the product rule, we have that $p(X_i, X_j) = p(X_i \mid X_j)\, p(X_j)$. So, if we partition the training data set according to the class, we can estimate the joint probability $p(X_i, X_j)$ and the marginal probability $p(X_j)$ for each partition; then

$$p(X_i \mid X_j, C) = \frac{p(X_i, X_j \mid C)}{p(X_j \mid C)}. \tag{8}$$
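The combination of a kernel density estimate on a grid followed by linear interpolation, used for the one-dimensional class-conditional densities, can be mimicked with the Python sketch below. It is only an analogue of R's density and approx functions: the grid size of 512 points and the extension of three bandwidths beyond the data follow R's documented defaults, while returning zero outside the grid is an assumption of this sketch (approx returns a missing value there by default). The function names are hypothetical.

```python
import bisect
import math

def kde_on_grid(sample, h, n_points=512, cut=3.0):
    """Evaluate a Gaussian kernel density estimate on an even grid."""
    lo, hi = min(sample) - cut * h, max(sample) + cut * h
    step = (hi - lo) / (n_points - 1)
    xs = [lo + i * step for i in range(n_points)]
    norm = len(sample) * h * math.sqrt(2 * math.pi)
    ys = [sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm for x in xs]
    return xs, ys

def interpolate(xs, ys, x):
    """Linear interpolation of the gridded density at x."""
    if x < xs[0] or x > xs[-1]:
        return 0.0  # assumption: treat points outside the grid as zero density
    k = min(bisect.bisect_right(xs, x), len(xs) - 1)
    t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
    return ys[k - 1] + t * (ys[k] - ys[k - 1])

# A single observation at 0 with h = 1 reproduces the standard normal curve.
xs, ys = kde_on_grid([0.0], h=1.0)
value = interpolate(xs, ys, 0.5)
print(value)
```

Evaluating the density once on a grid and interpolating afterwards keeps prediction cheap, since the kernel sum does not have to be recomputed for every new data point.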

Table 1: Performance measures for each classifier (with 24 attributes).

Table 2: The average ranks for all the algorithms (with 24 attributes).

Table 3: Nemenyi test for single models (with 24 attributes) in terms of accuracy (%).

Figure 7: Attribute importance ranking based on the Gini importance measure.

Table 4: Performance measures for each classifier (with 4 attributes).

In relation to the kappa values, we notice that now cNB, gNB, fNB, dNB, cTAN, gTAN, cITCAN, gITCAN, and SVM are in the moderate interval of classification agreement with the true classes, with cITCAN obtaining the highest value. The worst accuracy and kappa value were obtained by the fITCAN classifier.

Table 5: The average ranks for all the algorithms (with 4 attributes).