Predicting Mycobacterium tuberculosis Complex Clades Using Knowledge-Based Bayesian Networks

We develop a novel approach for incorporating expert rules into Bayesian networks for classification of Mycobacterium tuberculosis complex (MTBC) clades. The proposed knowledge-based Bayesian network (KBBN) treats sets of expert rules as prior distributions on the classes. Unlike prior knowledge-based support vector machine approaches which require rules expressed as polyhedral sets, KBBN directly incorporates the rules without any modification. KBBN uses data to refine rule-based classifiers when the rule set is incomplete or ambiguous. We develop a predictive KBBN model for 69 MTBC clades found in the SITVIT international collection. We validate the approach using two testbeds that model knowledge of the MTBC obtained from two different experts and large DNA fingerprint databases to predict MTBC genetic clades and sublineages. These models represent strains of MTBC using high-throughput biomarkers called spacer oligonucleotide types (spoligotypes), since these are routinely gathered from MTBC isolates of tuberculosis (TB) patients. Results show that incorporating rules into problems can drastically increase classification accuracy if data alone are insufficient. The SITVIT KBBN is publicly available for use on the World Wide Web.


Introduction
Tuberculosis (TB) represents a reemerging serious health threat worldwide. TB is caused by the Mycobacterium tuberculosis complex (MTBC) bacterium. One-third of the world population is latently or actively infected with TB. Molecular epidemiology now plays a crucial role in the tracking and control of TB [1]. DNA fingerprinting methods have made it possible to distinguish between cases of recent transmission of TB and reactivation of latent infections. This has enabled the tracking of transmission routes and the timely identification of outbreaks. Thus, knowledge about the genotypes of prevailing strains has revolutionized traditional approaches to the epidemiology of TB. Moreover, the predominance of certain strains or groups of strains in certain host populations has been clearly observed [2]. Studies of the genetic and biogeographic diversity of the MTBC have revealed differences in the virulence, immunogenicity, and drug resistance of strains [2]. This has consequences for the development of control measures for TB.
The classification of MTBC strains into genetic groups or clades is important to help track transmission patterns and develop a better understanding of pathologic specificities in TB. Phylogeographic clades have been defined based on genetic similarities between strains and observed associations between groups of similar MTBC genotypes with host populations [3]. A variety of molecular techniques including the analysis of phylogenetically informative single nucleotide polymorphisms (SNPs) and long sequence polymorphisms (LSPs) are used to genotype MTBC strains [4]. Classification based on SNPs and LSPs is considered to be the gold standard. However, studies of such variations in DNA sequences of MTBC strains are not performed frequently for public health purposes. Spacer oligonucleotide typing (spoligotyping) and mycobacterial interspersed repetitive units-variable number of tandem repeats (MIRU-VNTR) typing are two polymerase chain reaction (PCR)-based DNA fingerprinting methods routinely used in the United States for genotyping all identified culture-positive TB cases. Large databases of spoligotypes have been collected. Each spoligotype for a strain is determined by the presence or absence of 43 specific spacers in the DR region, producing a 43-bit number. Each spacer separates two direct repeats. These strains have been assigned sublineage labels using mixed expert-based and bioinformatics approaches derived from visual rules applied to spoligotypes as shown in Figure 1. These visual rules are based on the identification of characteristic deletions of one or more adjacent spacers in spoligotypes. Certain inferred mutations (deletions of blocks of adjacent spacers) in progenitor strains are considered to be lineage defining [5]. These deletions are conserved in all descendent strains since studies have shown that the mechanism of mutation observed in the direct repeat (DR) region involves loss of spacers, and spacers are rarely gained [6].
Additionally, the existence of these sublineages has been independently verified by clustering based on spoligotype and MIRU types of strains [7,8]. Therefore, while it has been established that strains of TB belong to distinct sublineages, the definitions of these sublineages based on spoligotypes are not clear. The visual rules for a sublineage are generalizations of spoligotype patterns that belong to the sublineage. However, directly applying visual rules to spoligotype patterns can lead to multiple assignments of sublineage labels since spoligotype patterns may match patterns prescribed by more than one rule, and sometimes spoligotype patterns do not exactly match the patterns specified by any rule. This is an inherent limitation of a rule-based system, wherein rules need to be broad enough to capture general patterns but narrow enough to delineate classes. Additionally, spoligotyping is based on polymorphisms in a single locus, the DR region, and therefore has the potential for convergent evolution. Relying on specific subsequences within the spoligotypes for the study of genetic diversity is hence error prone.
This paper presents a hierarchical probabilistic graphical model, the knowledge-based Bayesian network (KBBN), that encodes rules of thumb and large training databases to classify data into given classes. Expert knowledge is modeled in the top level of variables in the BN representing the rules. The middle-level variables represent the class, and the lower level represents various features of interest. KBBN uses the strategy of not directly modeling the dependency of the rules on the features. This greatly reduces the model parameter space, which helps reduce the amount of data required for training while capturing the knowledge in the rules.
The overall goal of this paper is to construct a model for predicting clades based on spoligotypes as determined by SITVIT using published rules from SPOLDB4 and other sources and to make this model available via the World Wide Web. For MTBC clade classification in this model, the visual rules of thumb are the top-level variables, the clades are the classes, and the 43 spacers that constitute the spoligotypes are the features. The KBBN for MTBC sublineages builds on the conformal Bayesian network previously designed for that domain. The structure of the KBBN encodes the knowledge base captured in the rules of thumb, helping to improve overall accuracy while overcoming potential problems such as ambiguous, inaccurate, or incomplete rules. A secondary goal is to assess the effectiveness of the KBBN in the MTBC lineage classification task. Thus we perform extensive experiments on SITVIT as well as on an additional testbed, CDC. The CDC test set consists of data and rules from the United States Centers for Disease Control and Prevention.
This paper is organized as follows. We first examine prior Bayesian networks for MTBC classification and then introduce the KBBN approach to incorporate rules. We then examine the rules of thumb and data associated with MTBC clades. We present results for the KBBN-SITVIT model and assess its accuracy. Finally, computational studies examine how KBBN can improve accuracy over Bayesian networks on the two KBBN testbeds (SITVIT and CDC). The results show that KBBN is quite resilient to incomplete, inaccurate, or ambiguous rules and can obtain better performance than BN using less data.
Previously in the MTBC domain, other approaches to incorporate advice in the form of rules have been shown to improve discriminative learning models of MTBC major lineages and other problems [9]. However, those methods are limited to rules expressed in a less intuitive polyhedral form that requires preprocessing of data and rules.
The proposed KBBN model allows the existing rules of thumb to be incorporated with no modification, resulting in improved classification over the predictions made with the rules or Bayesian networks alone. Also, unlike visual rules applied directly, the flexibility offered by the KBBN enables it to handle the following common problems with rules of thumb.
(i) Incompleteness: rules only exist for some of the classes or only partially cover a class.
(ii) Ambiguity: multiple rules of thumb for different classes apply to the same exemplar. This frequently occurs if there is no precedence associated with the rules.
(iii) Inaccuracy: rules may incorrectly classify some exemplars.
Visual rules with precedence have been established for six major MTBC lineages [10]. A prior online knowledge-based support vector machine (SVM) approach combined these visual rules and precedence into a set of rules expressed in polyhedral form [9]. The method produced a high-accuracy SVM using much less data. However, as discussed in Section 5, this elegant work has several practical limitations that we sought to overcome in this study. First, expressing rules and precedence as polyhedral rules [9] can be challenging for a large number of rules. Second, the method works best with linear SVMs, but linear SVMs do not capture the underlying complexity of the biomarkers and their mechanism of evolution. This can be overcome by using nonlinear SVMs (SVMs using degree-three polynomial kernels work very well), but then incorporating the polyhedral rules becomes even more challenging. Third, the complexity of training increases with the introduction of rules. Thus, the proposed design of the KBBN has the following salient features: it (i) incorporates rules easily without modification and without imposing precedence, (ii) models known properties of the domain such as biomarkers and their mutation mechanisms, (iii) provides an efficient training method for classes with and without rules, (iv) achieves high prediction accuracy, (v) overcomes ambiguity, incompleteness, and inaccuracy of the rules, and (vi) provides additional information about the effectiveness of each rule.
The overall approach produces a high quality model for predicting SITVIT clades which has been made available for use by other researchers.

Bayesian Network Background
A Bayesian network (BN) is a graphical representation of a probability distribution. Formally speaking, a BN is a directed acyclic graph $G = (V, E)$ consisting of a set of nodes $V = \{X_i \mid i = 1, \ldots, n\}$ to represent the variables and a set of directed links $E$ to connect pairs of nodes [11]. Each node has a conditional probability distribution that quantifies the probabilistic relation between the node and its parents such that for a network of $n$ nodes
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i)).$$
Therefore, one can compute the full joint probability distribution from the information in the network. In other words, a well-represented Bayesian network can capture the complete nature of the relationship among a set of variables.
The SPOTCLUST Bayesian network was the first generative model used for analysis of MTBC sublineages [7]. SPOTCLUST uses mixture models based on spoligotypes to identify strain families of MTBC. SPOTCLUST models the asymmetric evolution of spacers using a Bayesian network with "hidden parents" [7]. The hidden parents of a lineage generate the members of the lineage. They capture the evolution of spoligotypes without generating the full phylogeny. A spacer in the hidden parent may be lost with small probability. A spacer that is absent in the parent is almost never gained. The design models the evolution mechanism of the DR region, allowing the Bayesian network to capture the deletions that are known to characterize spoligotype lineages. The hidden parent technique of SPOTCLUST is used for the spoligotype-associated parts of the KBBN model.
The conformal Bayesian network (CBN), originally designed for predicting major MTBC lineages, is another generative model for analysis of both spoligotype and MIRU type data for MTBC strains [9,12] (the spoligotype CBN is shown in Figure 2(a)). CBN captures the domain knowledge about the properties of spoligotypes and MIRU and uses this information to classify MTBC strain genotyping data into major lineages. CBN reflects the known mutation mechanisms of the spoligotypes and MIRU. With rare exceptions, ancestral strains have 2 or more repeats at MIRU24. Thus the top-level variable, $M_{24}$, indicates whether MIRU24 is less than two (indicating one of the modern lineages with high probability) or at least two (indicating one of the ancestral lineages with high probability).
One can think of the MIRU CBN model "generating" the data as follows. The value of locus MIRU24 generates the lineage, which in turn determines the number of repeats in the remaining MIRU loci. Thus, patterns in the occurrences of repeats at each locus for each lineage are captured. The lineage also generates the hidden parents of the lineage, which in turn generate the spoligotype spacers. The MIRU24 value determines the lineage priors.
We tried using the CBN model as designed for major lineages to classify MTBC genotyping data into sublineages. But using the single rule (if $M_{24} \geq 2$, then the lineage is ancestral), as in the original CBN, was not enough to generate a good where the random variable $C$ represents the sublineage class, the random variable $S_\Omega = \{S_i \mid i \in \Omega\}$ with $\Omega = \{1, \ldots, 43\}$ represents the spoligotype spacers, and $R_\Psi = \{R_j \mid j \in \Psi\}$ represents the set of binary rules indicating whether each specific rule is fired. The spacer variables $S_i$ and class variable $C$ are assumed to follow binomial and multinomial distributions, respectively. The conditional probabilities of $C$ given $R_\Psi$ are represented as a table which maps the set of possible combinations of rules fired in the data to the probability of each class. Laplacian smoothing is used.
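The class-given-rules table described above can be sketched as follows. This is a minimal illustration, assuming add-one (Laplace) smoothing over the observed combinations of fired rules; the function name `fit_class_given_rules` and the parameter `alpha` are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

def fit_class_given_rules(rule_patterns, labels, classes, alpha=1.0):
    """Estimate P(class | combination of fired rules) with add-alpha smoothing.

    rule_patterns: one tuple of 0/1 rule indicators per training record.
    labels:        the class label of each training record.
    classes:       list of all class labels (needed for the smoothing term).
    """
    counts = defaultdict(Counter)
    for pattern, label in zip(rule_patterns, labels):
        counts[tuple(pattern)][label] += 1

    def prob(pattern, cls):
        # Unseen rule combinations fall back to a uniform distribution.
        seen = counts.get(tuple(pattern), Counter())
        total = sum(seen.values())
        return (seen[cls] + alpha) / (total + alpha * len(classes))

    return prob
```

With smoothing, a class never receives probability zero for a rule combination, so the data can still override a rule that fires for the wrong class.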
For spoligotypes, we followed the SPOTCLUST model [7]. It captures the fact that spacers are lost but almost never gained by introducing a variable for the unobserved hidden parent ($H_i$) and for each spacer ($S_i$), both of which follow a binomial distribution. Given a 43-dimensional spoligotype $S$ and its spacer position $i$, $S_i = 1$ if the spacer is present, and $S_i = 0$ if the spacer is absent. The probabilities of the spacer given the parent, $P(S_i \mid H_i)$, are assumed to be known. As in Vitol et al., 2006 [7], we considered the probability of losing a spacer as $10^{-1}$ and the probability of gaining a spacer as $10^{-7}$.
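As a small worked illustration of the hidden-parent idea, the probability that a spacer is observed can be obtained by summing over the two states of its hidden parent. The sketch below assumes the loss and gain probabilities quoted above; the function and constant names are hypothetical.

```python
LOSS = 1e-1  # P(spacer absent | spacer present in hidden parent), as quoted in the text
GAIN = 1e-7  # P(spacer present | spacer absent in hidden parent), as quoted in the text

def spacer_present_prob(parent_present_prob, loss=LOSS, gain=GAIN):
    """Marginal P(spacer = 1) after summing out the hidden parent.

    parent_present_prob is P(hidden parent has the spacer) for one class.
    """
    kept = parent_present_prob * (1.0 - loss)
    gained = (1.0 - parent_present_prob) * gain
    return kept + gained
```

The asymmetry is visible directly: a spacer that the parent certainly lacks is observed with probability of only `GAIN`, whereas a spacer the parent certainly has can still be missing with probability `LOSS`.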
The KBBN assumes that the spoligotype hidden parents are conditionally independent given the sublineage. The conditional independence assumption of spacers is a model simplification previously made in the SPOTCLUST BN model. This conditional independence of the biomarkers in the BN model enables KBBN to conform to the set of available biomarkers without any expensive missing value computations. None of the genotyping variables in the BN are treated as unobserved except for the hidden parent spacers, which are always unobserved.
Using Bayes' rule, one can predict the sublineage for new data by determining the sublineage with maximum probability:
$$\hat{c} = \arg\max_{c} P(C = c \mid S_\Omega, R_\Psi). \quad (3)$$
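A minimal sketch of this maximum-probability prediction is given below. It assumes a factorization in which the class prior comes from the fired rules and the spacers are conditionally independent given the class, with each hidden parent summed out; the function names, argument layout, and toy inputs are hypothetical rather than the authors' implementation.

```python
import math

def spacer_log_likelihood(spoligotype, parent_probs, loss=1e-1, gain=1e-7):
    """log P(spacers | class), summing out each hidden parent independently.

    spoligotype:  string of '0'/'1' characters, one per spacer.
    parent_probs: P(hidden parent has spacer i) for this class, one per spacer.
    """
    total = 0.0
    for bit, q in zip(spoligotype, parent_probs):
        p_present = q * (1.0 - loss) + (1.0 - q) * gain
        total += math.log(p_present if bit == "1" else 1.0 - p_present)
    return total

def predict(spoligotype, class_prior, parent_probs_by_class):
    """Return the class maximizing log prior + log likelihood.

    class_prior: P(class | fired rules), assumed strictly positive (smoothed).
    """
    scores = {
        c: math.log(class_prior[c])
        + spacer_log_likelihood(spoligotype, parent_probs_by_class[c])
        for c in class_prior
    }
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when 43 small probabilities are multiplied together.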

Data Domains and Biology Rules
This study focused on creating a predictive model for clades that emulated SITVITWEB, a publicly available international Note that some lineages have been reclassified while the KBBN model was under development. Two LAM sublineages were recently raised to lineage level: LAM10-CAM as the Cameroon lineage [14] and LAM7-TUR as the Turkey lineage [15,16]. Some spoligotype patterns previously classified as H3 and H4 sublineages were relabeled "Ural" [17]. The latter include patterns belonging to the H4 sublineage that were relabeled "Ural-2" and some patterns previously classified as the H3 sublineage but with an additional specific signature (presence of spacer 2, absence of spacers 29 to 31 and 33 to 36), which are now relabeled "Ural-1." With their definitive reclassification pending, we hereby refer to these as H4-Ural-2, H3, and H3-Ural-1. Spoligotype patterns labeled as EAI and EAI5 were merged into a single group called EAI since one rule covers both patterns.
A sample of SpolDB4 rules is presented in Figure 1. Each line corresponds to a rule. The underlined portions of the spoligotype must match exactly while the portions not underlined can take any value. Note that in Brudey et al. [3] the rules are expressed using the octal coding of spoligotypes; here we express them in binary for simplicity. While these rules establish characteristic patterns for sublineages of MTBC, they are not exclusive and in some cases overlap. Up to 4 rules fired per example. The mode of the number of rules fired per record was 2. In practice, a precedence or order is introduced over the rules using expert knowledge so that unambiguous sublineage predictions are generated. However, this precedence has not been published for sublineages and is up to the individual user of the rules. The SpolDB4 rules have continued to evolve as new lineages such as H3-URAL-1 are added and refined, and thus the exact rules that we used are provided in Supplement 1 in the Supplementary Materials available online at http://dx.doi.org/10.1155/2014/398484.
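A rule of this kind can be sketched as a set of spacer positions that must be present or absent, with all other positions free. The example below is illustrative only: it encodes the signature quoted in the text for Ural-1 (presence of spacer 2, absence of spacers 29 to 31 and 33 to 36) and is not the exact rule from Supplement 1; the helper names are hypothetical.

```python
def make_rule(present=(), absent=()):
    """Build a rule as {1-based spacer position: required bit}."""
    rule = {pos: "1" for pos in present}
    rule.update({pos: "0" for pos in absent})
    return rule

def rule_fires(rule, spoligotype):
    """True if every constrained position matches; other positions are free."""
    if len(spoligotype) != 43:
        raise ValueError("a spoligotype must have exactly 43 spacers")
    return all(spoligotype[pos - 1] == bit for pos, bit in rule.items())

# Illustrative signature only, based on the description in the text.
URAL1_LIKE = make_rule(present=[2], absent=[29, 30, 31, 33, 34, 35, 36])
```

Because unconstrained positions can take any value, several rules may fire for the same spoligotype, which is the source of the ambiguity discussed above.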

CDC-Sublineage and CDC Rules.
The second dataset, CDC-Sublineage, examines 1286 MTBC isolates genotyped by spoligotyping and labeled with 8 sublineages. Dr. Lauren Cowan of the CDC was interviewed to obtain 8 rules of thumb. The data is a subset of 31,482 MTBC isolates genotyped by spoligotyping and 12-locus mycobacterial interspersed repetitive units (MIRU) typing with known lineages from a set collected by the CDC as part of routine TB surveillance in the United States from 2004 to 2009. Since only spoligotypes are used in the rules, the data for training were restricted to spoligotypes with labeled sublineages.
There is one rule per sublineage. This dataset was preprocessed by adding an array of 8 bits, one bit per rule. The value of a bit was set to 1 if the rule was fired and zero otherwise. Note that the sublineage sizes are unequal. Overall, the minimum sublineage size was 39, the maximum sublineage size was 356, and the median was 138 records. The rules were ambiguous and no precedence was imposed. In some cases no rules fire for a record. A maximum of 2 rules is fired for each record. The mode of the number of rules fired per record was also 2. If multiple rules fire for a record and the sublineages determined conflict or if no rules fire, the record is considered to be misclassified. Details of the CDC rules can be found in Supplement 2.
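The preprocessing step and the rules-only baseline described above can be sketched as follows. The rule predicates in the usage are toy placeholders, since the actual CDC rules are in Supplement 2; the function names are hypothetical.

```python
def rule_bits(spoligotype, rules):
    """Return one 0/1 indicator per rule.

    rules: list of (sublineage_label, predicate) pairs, where predicate
    takes a spoligotype string and returns True when the rule fires.
    """
    return [1 if fires(spoligotype) else 0 for _, fires in rules]

def rules_only_predict(bits, rule_labels):
    """Rules-only prediction: a label if exactly one sublineage is implied.

    Returns None when no rule fires or when the fired rules conflict,
    which the text counts as a misclassification.
    """
    fired = {label for bit, label in zip(bits, rule_labels) if bit}
    return fired.pop() if len(fired) == 1 else None
```

The bit array serves both purposes: it is the input to the rules-only baseline and the set of top-level rule variables in the KBBN.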

SITVIT Experimental Results
In this section, we examine the effectiveness of the KBBN model for prediction of SITVIT classification results. Our experiments consist of two parts: (1) in-sample accuracy of the SITVIT KBBN model trained using all available data and (2) out-of-sample accuracy of the SITVIT KBBN model trained on SITVIT-Train and tested on the much larger SITVIT-Test set. The accuracy of the results was measured using the F-measure on the testing data (harmonic mean of precision and recall) averaged over the classes. The F-measure was selected since it effectively captures performance on the unbalanced multiclass data sets studied here. Reporting class accuracies/errors can be misleading for unbalanced classes such as those in the TB data. The minimum and maximum class sizes are reported in Table 1.
The F-value was computed as
$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where recall is the percentage of the isolates in a given clade correctly identified as being in that clade and precision is the percentage of isolates predicted to be in a clade that are actually in the clade.
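The class-averaged F-measure can be sketched as below. This is a minimal illustration assuming the harmonic mean of per-class precision and recall, averaged with equal weight over the classes present in the true labels; a class with no correct predictions is given an F-value of 0, which is a convention chosen here rather than one stated in the paper.

```python
def macro_f_measure(y_true, y_pred):
    """Average over classes of the harmonic mean of precision and recall."""
    scores = []
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2.0 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, a rare clade with poor precision lowers the average as much as a large one would, which is why this measure suits unbalanced data.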

SITVIT KBBN Model Accuracy
The SITVIT KBBN model was trained to predict 69 sublineages using the combined SITVIT-Train and SITVIT-Test data extracted from SITVITWEB along with the SPOLDB4 rules. Overall the model is very accurate; it correctly classifies 94.3% of all of the spoligotypes, achieving an average F-value of 0.93 across the 69 clades. Table 2 presents the in-sample results of SITVIT-KBBN for each clade. The errors that do occur primarily come from lack of specificity, not sensitivity. The model achieves a sensitivity of greater than 82% on all of the clades, but the specificity is below 82% on 13 clades. The T clade, which is known to be ill defined, contributes errors leading to reduced specificity in a wide variety of clades including LAM6, T1-RUS, T3-ETH, T3-OSA, and AFRI. Within the M. africanum clades, AFRI is primarily confused with other M. africanum clades (AFRI 1, AFRI 2, and AFRI 3), which is an acceptable error. A few Cameroon, H3, and T isolates are mistakenly identified as AFRI. Many BOV isolates are assigned by the model to BOV 3, indicating that a more expansive definition of BOV 3 may be warranted. There are some minor confusions within the Haarlem sublineages H1, H2, and H3 combined with the new H4-URAL-2 and H3-URAL-1 sublineages. About 16% of H3 is assigned to other classes. This suggests that further refinement of the definition of these sublineages will be ongoing. Microti, PINI, and PINI2 have lower F-values, but this is partially due to the fact that these sublineages have only a few exemplars. More data is needed for these rare lineages to improve the model. The F-value of ZERO is reduced by 6 CAS isolates misclassified as ZERO. The overall specificity averaged over the clades is 0.909 and the sensitivity is 0.965.

or the set of rules used may be varied. The accuracy of the results was measured using the F-measure on the testing data (harmonic mean of precision and recall) averaged over the classes. To facilitate a fair comparison, the data were constructed so that there are at least 10 records per class. In the SITVIT domain, this required removing clades that do not commonly infect human beings (e.g., PINI1 and PINI2). We refer to this subset of the SITVIT-Train dataset as SITVIT-CV. SITVIT-CV had 45 classes and 2593 records. The minimum sublineage size was 11, and the maximum sublineage size was 390 with a mode of 21 records.

Comparison with Other Techniques
We designed several sets of experiments on the two datasets SITVIT-CV and CDC-Sublineage to determine whether incorporating rules improved the performance of the Bayesian network over the performance of the BN or rules alone. The results were gathered for KBBN, BN, and the rules used alone. In addition, linear and nonlinear SVM results were provided as a baseline for comparison. The SVM implementation in WEKA (http://www.cs.waikato.ac.nz/ml/weka/) was used. The SVM kernels and parameters were selected using a grid search over 9-fold cross-validated accuracy on the training set. The degree-three polynomial kernel and radial basis function kernels were found to work best. All SVM data was normalized before training. Also, we are interested in the nature of the misclassification because it tells us about the potential inaccuracies in the definition of the lineages.
Table 4 compares the results of KBBN, BN, Rules-only, and SVM (nonlinear and linear) on the two testbeds. The rules themselves have very poor overall accuracy, but they led to improvements over the baseline BN accuracy on both datasets with statistically significant improvements on CDC-Sublineage and SITVIT-CV. The SVM results indicate that KBBN's accuracy is competitive with state-of-the-art nonlinear and linear classification methods. But note that KBBN, being a generative method, has many advantages over SVM such as availability of posterior probabilities of each class given the observation that can be interpreted as the confidence of the prediction, easier interpretation, and ease of incorporation of domain knowledge.

Effectiveness of Rules in Bayesian Network
Next, we designed several sets of experiments to determine how the quality and quantity of rules and data affected the performance of KBBN. The basic underlying experimental design was the same for experiments across the two testbeds.
Our hypothesis was that KBBN can learn the concept faster with less data by adding rules. We wanted to show that rules can improve learning, especially when less data is available. For each dataset, first we used 10-fold stratified cross validation. Next each training set was divided into 9 parts providing models using 1/9, 2/9, ..., or 9/9 of the training set and tested on the corresponding test set. The test sets were kept the same for different training set sizes. We measured the F-value for different training set sizes with or without the rules and compared the result with the case of using no rules at all (i.e., the BN case) or only rules. The results are presented in Figure 3. Similar smaller testing set studies on CDC-Sublineage and SITVIT-CV found that KBBN always performs better than or as well as BN for all training set sizes.
To further examine the effect of incorporating rule sets and using incomplete rules, we performed two sets of experiments described in the following section: (1) using increasing percentages of the available rules and (2) using subsets of rules, removing rules for a given class at a time.

Removal of Rule Sets for a Class
In these experiments, we examined the effect of removing all the rules associated with a given class. We examined the KBBN accuracy and recorded the average F-value across all classes after all the rules corresponding to a single class are removed. Again, 10-fold stratified cross validation was performed. The results are presented in Figure 4. "All (BN)" is when no rules are used in KBBN, which is equivalent to BN performance. Clearly, KBBN can lead to significant improvements compared to when no rules exist for entire classes of MTBC. We leave a more comprehensive study of when rules are most helpful for problems in other domains to future work.

Quality of Rules
KBBN can provide us with information about the quality of each rule. We studied posterior probabilities of rules given the class to provide insight into the utility and accuracy of each rule. The probability $P(R_j \mid C)$ is of great interest because it tells us how good rule $R_j$ is for a given class $C$. The posterior probability of the rules given the classes for the CDC-Sublineage data is presented in Table 5. The table includes a row for "No rule" indicating the probability of no rule getting fired. When no rule is fired a regular BN is used instead of KBBN. Note that the probabilities within columns may sum to more than 1 since rules are not mutually exclusive.
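A table of this kind can be estimated by simple counting, as in the sketch below. It assumes each record carries one 0/1 indicator per rule and a class label, and it reports relative frequencies without smoothing; the function name and the "No rule" key mirror the description in the text but are otherwise hypothetical.

```python
from collections import defaultdict

def rule_given_class(rule_bits, labels, rule_names):
    """Estimate P(rule fires | class), plus P(no rule fires | class).

    rule_bits:  one list of 0/1 indicators per record, aligned with rule_names.
    labels:     the class label of each record.
    Returns {class: {rule name or "No rule": relative frequency}}.
    """
    class_size = defaultdict(int)
    fired = defaultdict(lambda: defaultdict(int))
    for bits, label in zip(rule_bits, labels):
        class_size[label] += 1
        if not any(bits):
            fired[label]["No rule"] += 1
        for name, bit in zip(rule_names, bits):
            if bit:
                fired[label][name] += 1
    rows = list(rule_names) + ["No rule"]
    return {
        c: {name: fired[c][name] / class_size[c] for name in rows}
        for c in class_size
    }
```

Entries for one class can add up to more than 1 because a single record may fire several rules at once.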
For CDC, the rule for LAM exactly corresponds to the class LAM on this data, since $P(\mathrm{Rule} = \mathrm{LAM} \mid \mathrm{Class} = \mathrm{LAM}) = 1$ and all other probabilities in the LAM row or column are 0. The rules for S and X correctly fire for their respective classes, but they also fire incorrectly for other lineages as indicated by the other entries in the S and X rows. The rules for Haarlem and Manila correctly predict their corresponding sublineages, but the fact that "No Rule" occurs 29.6% and 24.3% of the time, respectively, indicates that these rules fail to cover all members of their class. For the India class, the India rule is quite accurate, but the rules can be ambiguous as indicated by the multiple entries in the India column. Most Vietnam isolates are not covered by any rules, and for those that are covered the rules may be ambiguous.
We provide the posterior probability distribution of each rule given the sublineage for the SITVIT-CV dataset as a heat map in Figure 5. Good rules only have red on the diagonal. A rule fires for multiple classes if it has multiple red entries in a row. The rule set is ambiguous for a class if there are multiple red entries within a given class column. Notice that the rules that are fired for many classes with high probability

Figure 1: Example rules from SpolDB4. The rule column represents characteristic patterns specified by the visual rules as underlined subsequences in the spoligotype patterns. Each line corresponds to a rule. The underlined portions of the spoligotype must match exactly while the portions not underlined can take any value. All of these rules fire for the spoligotype 1101111111110111111100001111111100001111111, while three of the rules fire for 1111111111110011111100001111111100001111111.

Figure 2: (a) The spoligotype conformal Bayesian network uses a single rule based on the number of repeats at the MIRU24 locus as the first level of a hierarchical Bayesian network. It uses the 43 spacers as features. CBN predicts the major lineage with high accuracy. (b) The KBBN uses multiple rules based on the presence of characteristic deletions as the first level of a hierarchical Bayesian network. As with the CBN, it uses the 43 spoligotype spacers.

Figure 3: The result of adding rules to different training set sizes for the (a) SITVIT-CV and (b) CDC-Sublineage testbeds.

Figure 5: The heat map represents the posterior probability of each rule given the sublineage for the SITVIT dataset. A strong association of a rule in predicting a sublineage is shown with a red square while a blue square represents no relation. Here H includes URAL-1 and URAL-2 and LAM includes Turkey and Cameroon sublineages.

Table 1: SITVIT and CDC MTBC testbeds.

Different datasets were created for training, testing, and cross validation. To validate the approach we also used a dataset of isolates collected by the CDC for cross validation studies. The following two sections describe the datasets in detail. Table 1 summarizes them.

SITVIT Testbed
SITVIT-Train and SITVIT-Test are based on SITVIT, an MTBC genotyping markers database provided by the Institut Pasteur de la Guadeloupe, and on the SpolDB4 rules that are published in Brudey et al., 2006 [3], plus one rule recently developed for the URAL1 clade. KBBN was trained on the SITVIT-Train dataset of 2714 records, each corresponding to a spoligotype and clade. There were 69 classes, the minimum sublineage size was 1, and the maximum sublineage size was 390. To test this model while keeping all classes we used SITVIT-Test, a large dataset based on SITVIT with 7949 records, each record corresponding to a spoligotype and clade. This dataset contained the same 69 classes as SITVIT-Train with different class distributions and again with the minimum class size of 1. SITVIT-Train and SITVIT-Test do not overlap so the total SITVIT dataset consists of 10633 distinct spoligotypes. To enable 10-fold cross validation (CV) with at least one spoligotype per class, the SITVIT-CV dataset was created which consists of the SITVIT-Train data restricted to the 45 classes with at least 11 spoligotypes each.

Table 3: Results of the F-measures of KBBN based on the out-of-sample test. The KBBN model was trained on SITVIT-Train (with 2714 records) and tested on SITVIT-Test (with 7949 records). Overall average F-measure is 0.939.

To assess the out-of-sample predictive accuracy of the KBBN SITVIT model we trained the model on SITVIT-Train and tested it on SITVIT-Test. The model was very accurate overall, achieving an average out-of-sample test F-value of 0.939, almost identical to the in-sample estimate of 0.930 reported above. The average recall (percentage of the isolates in a given clade correctly identified as being in that clade) across all lineages is 97.5%, and the average precision (the percentage of isolates predicted to be in a clade that are actually in the clade) across all lineages is 91.9%. As shown in Table 3, the results for each clade are very similar to those reported in Table 2. The T clade and small rarer clades such as PINI variants and Microti account for much of the decrease in precision.

Model Validation
The next set of experiments evaluates the effectiveness of the KBBN approach with respect to other techniques and the effectiveness of incorporating rules. All experiments were done on both the SITVIT and CDC datasets to ensure that the results are not an artifact of a single dataset. For each dataset, first we used 10-fold stratified cross validation. Each training set was divided into 10 parts with 9 parts available as the training data for creation of models and 1 part held out as an independent test set. For all experiments the same test sets were employed, but the training dataset

Table 4: Average F-measure of KBBN, BN, Rules-only, and SVM (nonlinear and linear) on two testbeds. While using Rules-only provides poor results, KBBN is able to provide results that are significantly better or at least not worse than BN and SVM on both domains. Results significantly different from KBBN at the 5% significance level are shown in bold.