Performance Evaluation of the Machine Learning Algorithms Used in Inference Mechanism of a Medical Decision Support System

Decision support systems increasingly support the decision making process in cases of uncertainty and lack of information, and they are widely used in fields such as engineering, finance, and medicine. Medical decision support systems help healthcare personnel select the optimal method during the treatment of patients. Decision support systems are intelligent software systems that support decision makers in their decisions. The design of a decision support system consists of four main components: inference mechanism, knowledge base, explanation module, and active memory. The inference mechanism constitutes the basis of a decision support system. Various methods can be used in these mechanisms, including decision trees, artificial neural networks, statistical methods, and rule-based methods. In decision support systems, these methods can be used separately or combined in a hybrid system. In this study, synthetic data sets with 10, 100, 1000, and 2000 records have been produced to reflect the probabilities on the ALARM network. The accuracy of 11 machine learning methods for the inference mechanism of a medical decision support system is compared on these data sets.


Introduction
A decision support system (DSS) is a computer-based information system that supports organizational and business decision making activities. Medical decision support systems, which are variants of decision support systems, are intelligent software systems designed to improve clinical diagnosis and to support healthcare personnel in their decisions. Intelligent decision support systems use artificial intelligence techniques to help healthcare personnel select the best method for both diagnosis and treatment, especially when the information about the treatment is incomplete or uncertain. These systems can work in both active and passive modes. In passive mode, they are used only when they are required. In active mode, they make recommendations as well. The approaches to the inference mechanism, which constitutes the most important part of a medical decision support system, can be divided into two groups: rule-based systems and data-driven systems. Rule-based systems are constructed on a knowledge base formed by if-then structures, so the information base consists of rules. The operation logic of the system is to find the relevant rules on the basis of the available information, fire them, and continue to search for rules until a result has been obtained.
Rule-based systems have some strong features as well as some disadvantages. For example, the performance of the system decreases and its maintenance becomes difficult when the number of rules grows large. Examples of such medical decision support systems are MYCIN [1,2], TRAUMAID [3], and RO2SE [4].

The Scientific World Journal
Data-driven systems, on the other hand, operate on large data stacks and support the decision making process using data mining methods. Several studies on data-driven systems can be found in the literature, including work on Bayes networks [5], rough sets [6], and artificial neural networks [7]. Data-driven systems are more flexible than rule-based systems, and they have the ability to learn by themselves.
In our previous study [8], the ALARM network structure was used to generate synthetic data on the same data set. The results of that study show that the rule-based method was 25% more successful than the "Bayesian network based" method for all sizes of the data sets. Moreover, when both methods were combined and used together, the success rate rose to 80%; that is, much higher rates were obtained than by applying these methods individually.
In this study, the accuracy of 11 machine learning methods that can be used in the inference mechanism of medical decision support systems is compared on various data sets.

Decision Support Systems
Decision support systems (DSSs) are interactive computer-based systems or subsystems that are designed to help decision makers decide and complete the decision process operations, and also to identify and solve problems using communication technologies, information, documents, and models. They provide data storage and retrieval but enhance the traditional information access and retrieval functions with support for model building and model-based reasoning. They support framing, modeling, and problem solving. Typical application areas of DSSs are healthcare, management and planning in business, the military, and any area in which management will encounter complex decision situations. DSSs are typically used for strategic and tactical decisions faced by upper-level management: decisions with a reasonably low frequency and high potential consequences, in which the time taken for thinking through and modeling the problem pays off generously in the long run [10].
Generally, decision support systems should include the following features. Decision support systems and their relevant operation methods can be divided into four main subjects: inference mechanism, knowledge base, explanation module, and active memory. The inference mechanism constitutes the basis of a decision support system. In this part, the results are generated in consideration of the current information and/or the information that was entered into the system by the user. The generated results may be a decision or they may include guiding information. The second part is the knowledge base, which holds the expert information used when the decision support system is making inferences. The active memory holds the information supplied by the user and/or by the current inference processes. Finally, the explanation module, which may not be present in every decision support system, generates an accuracy validation and an explanation for the results generated by the inference mechanism and the knowledge base [11]. These subjects and their relations are shown in Figure 1.
In rule-based systems, the knowledge base is formed by the rule group. Using the generated rules, results are obtained for the various circumstances of the problem at hand. The rules forming the knowledge base are written in an if-then structure. An inference system developed using rule-based methods consists of the if-then rules, the facts, and an interpreter that interprets the facts using the rules in the system [12].
Two methods are used to process the rules in rule-based methods: forward chaining and backward chaining. In forward chaining, the results are obtained from the preliminary facts with the help of the rules. Backward chaining starts with a hypothesis (or target) and searches for the rules that lead to that hypothesis. The rules that are reached generate subrules, and the process continues in this way.
When a result has been estimated and this estimate should be verified, backward chaining should be used instead of forward chaining.
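The two chaining strategies can be illustrated with a minimal Python sketch. The rules and fact names below are hypothetical and chosen only for illustration; they do not come from any of the cited systems. The backward chaining function also assumes that the rule set contains no cycles.

```python
# Hypothetical rules: (set of premises, conclusion). Illustration only.
RULES = [
    ({"fever", "cough"}, "respiratory_infection"),
    ({"respiratory_infection", "low_oxygen"}, "order_chest_xray"),
]


def forward_chain(facts, rules):
    """Fire rules on the known facts until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules):
    """Start from a hypothesis and look for rules that lead to it.

    Each premise of a matching rule becomes a subgoal (assumes acyclic rules).
    """
    if goal in facts:
        return True
    return any(
        conclusion == goal
        and all(backward_chain(p, facts, rules) for p in premises)
        for premises, conclusion in rules
    )


derived = forward_chain({"fever", "cough", "low_oxygen"}, RULES)
```

Forward chaining returns every fact that follows from the inputs, whereas backward chaining only answers whether one chosen hypothesis can be supported, which is why it suits the verification case.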
To generate the rule set of a rule-based inference system, people who are experienced with the problem should contribute to the design of the system. In the rule development phase, these experts usually help by identifying the faults and defects in the system's estimations, using the planned system as a reference [13].
The designer usually develops simple interfaces so that experts can contribute in the development phase. In the beginning of the process, the experts start testing the system as if they were using it for operational purposes. The questions that the system asks within the scope of its limited information are answered by the same experts.
The aim is to test the system in order to improve it. The expert who answered the questions evaluates the system by looking at the results it generates and then tries to correct the identified defects and faults by using the rule development tool. In this way, the rule set of an inference system that uses rule-based methods can be generated by an expert on the problem.
Data-driven systems examine large data pools in organizations. These systems usually work together with systems that collect data, such as data warehouses. Data-driven systems take part in the decision making process through online analytical processing (OLAP) and data mining methods. These systems work on very large datasets. The relations in these datasets are analyzed electronically and used to make predictions about future data relations. Data-driven systems use a bottom-up procedure to explain the characteristics of the data system [14].

Machine Learning Algorithms
Machine learning is about learning to make predictions from examples of desired behavior or past observations. Learning methods have found numerous applications in performance modeling and evaluation [15]. The basic definitions of machine learning are given below. The label of an example will be predicted. The space of possible labels is denoted by $Y$.
A learning problem is some unknown data distribution $D$ over $X \times Y$, where $X$ is the feature space, coupled with a loss function $\ell(y', y)$ measuring the loss of predicting $y'$ when the true label is $y$.
A learning algorithm takes a set of labeled training examples of the form $(x, y) \in X \times Y$ and produces a predictor $f : X \to Y$. The goal of the algorithm is to find $f$ minimizing the expected loss $E_{(x,y) \sim D}\, \ell(f(x), y)$. There are two base learning problems, defined for any feature space $X$. In binary classification, examples are categorized into two categories [15].

Definition 1.
A binary classification problem is defined by a distribution $D$ over $X \times Y$, where $Y = \{0, 1\}$. The goal is to find a classifier $h : X \to Y$ minimizing the error rate on $D$:
$$e(h, D) = \Pr_{(x,y) \sim D}[h(x) \neq y].$$
By fixing an unlabeled example $x \in X$, a conditional distribution $D|x$ over $Y$ is found.
Regression is another basic learning problem, where the goal is to predict a real-valued label $y$.
The loss function typically used in regression is the squared error loss between the predicted and actual labels.

Definition 2.
A regression problem is defined by a distribution $D$ over $X \times \mathbb{R}$. The goal is to find a function $f : X \to \mathbb{R}$ minimizing the squared loss [15]:
$$\ell(f, D) = E_{(x,y) \sim D}\,(f(x) - y)^2.$$
The machine learning algorithms that are used in the study are explained below.
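As a small illustration of the two loss notions, the following Python sketch estimates the error rate of a classifier and the squared loss of a regressor on a finite sample. The sample values, the threshold classifier, and the linear predictor are hypothetical; on a finite sample these are empirical averages that only approximate the expectations over the unknown distribution.

```python
def error_rate(classifier, examples):
    """Fraction of (x, y) pairs for which classifier(x) differs from y."""
    return sum(1 for x, y in examples if classifier(x) != y) / len(examples)


def mean_squared_loss(predictor, examples):
    """Average of (predictor(x) - y)^2 over the (x, y) pairs."""
    return sum((predictor(x) - y) ** 2 for x, y in examples) / len(examples)


def threshold_classifier(x):
    """Hypothetical binary classifier: predict 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0


# Hypothetical samples, for illustration only.
binary_sample = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]
regression_sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]

sample_error = error_rate(threshold_classifier, binary_sample)
sample_loss = mean_squared_loss(lambda x: 2.0 * x, regression_sample)
```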

C4.5 Decision Tree.
A decision tree is basically a classifier that shows all possible outcomes and the paths leading to those outcomes in the form of a tree structure. Various algorithms for inducing a decision tree are described in the existing literature, for example, CART (classification and regression trees) [16], OC1 [17], ID3, and C4.5 [18]. These algorithms build a decision tree recursively by partitioning the training data set into successively purer subsets [19]. C4.5 [18] uses the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets. C4.5 examines the normalized information gain (difference in entropy) that results from choosing a feature for splitting the data [20]:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}, \qquad \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|},$$
where SplitInfo represents the potential information provided by dividing the dataset $D$ into $v$ partitions corresponding to the outputs of attribute $A$, and $\mathrm{Gain}(A)$ is how much gain would be achieved by branching on $A$.
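The normalized information gain can be computed for a single nominal attribute with a short Python sketch. It is a simplified illustration, not the full C4.5 procedure (no continuous attributes, missing values, or pruning), and the toy attribute values and class labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one nominal attribute.

    values[i] is the attribute value of record i and labels[i] its class.
    """
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after.
    gain = entropy(labels) - sum(
        len(p) / n * entropy(p) for p in partitions.values()
    )
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum(
        len(p) / n * log2(len(p) / n) for p in partitions.values()
    )
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: the attribute separates the two classes perfectly.
ratio = gain_ratio(["a", "a", "b", "b"], ["yes", "yes", "no", "no"])
```

An attribute whose values are unrelated to the class yields a gain of zero, so its ratio is zero as well.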

Multilayer Perceptron (MLP).
The multilayer perceptron (MLP) [21], also referred to as a multilayer feed-forward neural network, is the most widely used neural network method. It belongs to the class of supervised neural networks. The MLP topology consists of three sequential layers of processing nodes: an input layer, one or more hidden layers, and an output layer which produces the classification results. An MLP structure is shown in Figure 2.
The principle of the network is that when data are presented at the input layer, the network nodes perform calculations in the successive layers until an output value is obtained at each of the output nodes. This output signal should be able to indicate the appropriate class for the input data. A node in an MLP can be modeled as one or more artificial neurons, each of which computes the weighted sum of its inputs in the presence of the bias and passes this sum through a nonlinear activation function. This process is defined as follows [7]:
$$v_j = \sum_{i=1}^{p} w_{ji} x_i + \theta_j, \qquad y_j = \varphi_j(v_j),$$
where $v_j$ is the linear combination of the inputs $x_1, x_2, \ldots, x_p$, $\theta_j$ is the bias (adjustable parameter), $w_{ji}$ is the synaptic connection weight between the input $x_i$ and the neuron $j$, $\varphi_j(\cdot)$ is the activation function (usually a nonlinear function) of the $j$th neuron, and $y_j$ is the output. The hyperbolic tangent and the logistic sigmoid function can be used as the nonlinear activation function. In most applications, the widely used logistic sigmoid function is applied as follows:
$$\varphi(v) = \frac{1}{1 + \exp(-a v)},$$
where $a$ represents the slope of the sigmoid [22]. The bias term $\theta_j$ contributes to the left or right shift of the sigmoid activation function, depending on whether $\theta_j$ takes a positive or negative value.
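The forward computation of a node (weighted sum of the inputs plus bias, passed through a logistic sigmoid) and of a small network can be sketched in Python as follows. The 2-2-1 network and its hand-picked weights are hypothetical, and the data layout (a list of layers, each a list of weight and bias pairs) is an assumption made for this sketch.

```python
from math import exp


def sigmoid(v, a=1.0):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + exp(-a * v))


def neuron_output(inputs, weights, bias, a=1.0):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(v, a)


def mlp_forward(inputs, layers):
    """Propagate inputs through the layers and return the output activations.

    layers is a list of layers; each layer is a list of (weights, bias)
    pairs, one pair per node.
    """
    activations = inputs
    for layer in layers:
        activations = [neuron_output(activations, w, b) for w, b in layer]
    return activations


# Hypothetical 2-2-1 network with hand-picked weights.
layers = [
    [([1.0, -1.0], 0.0), ([0.5, 0.5], -0.5)],  # hidden layer, two nodes
    [([1.0, 1.0], 0.0)],                       # output layer, one node
]
output = mlp_forward([1.0, 1.0], layers)
```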

Backpropagation Learning Algorithm.
Learning in a MLP is an unconstrained optimization problem, which is subject to the minimization of a global error function depending on the synaptic weights of the network. For a given training data consisting of input-output patterns, values of synaptic weights in a MLP are iteratively updated by a learning algorithm to approximate the desired value. This update process is usually performed by backpropagating the error signal layer by layer and adapting synaptic weights with respect to the magnitude of error signal [23].
The first backpropagation learning algorithm for use with MLP structures was presented by [21]. The backpropagation algorithm is one of the simplest and most general methods for the supervised training of MLP. This algorithm uses a gradient descent search method to minimize a mean square error between the desired output and the actual outputs. Backpropagation algorithm is defined as follows [7,24].
(i) Initialize all the connection weights with small random values from a pseudorandom sequence generator.
(ii) Repeat until convergence (either when the error $E$ is below a preset value or until the gradient $\partial E(t)/\partial W$ is smaller than a preset value):
$$\Delta W(t) = -\eta \frac{\partial E(t)}{\partial W}, \qquad W(t+1) = W(t) + \Delta W(t),$$
where $t$ is the iteration number, $W$ represents all the weights in the network, and $\eta$ is the learning rate and merely indicates the relative size of the change in weights. The error $E$ can be chosen as the mean square error function between the actual output $y_j$ and the desired output $d_j$; $d$ and $y$ are the desired and the network output vectors of length $N$:
$$E = \frac{1}{N} \sum_{j=1}^{N} (d_j - y_j)^2.$$
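The two steps can be sketched for the simplest possible case, a single sigmoid neuron trained by batch gradient descent on the mean square error. This is a minimal illustration under stated assumptions, not a full multilayer implementation: the learning rate, stopping tolerance, iteration cap, random seed, and the logical-OR training set are all hypothetical choices.

```python
import random
from math import exp


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


def predict(weights, bias, xs):
    return sigmoid(sum(w * x for w, x in zip(weights, xs)) + bias)


def mse(weights, bias, data):
    """Mean square error between desired and actual outputs."""
    return sum((d - predict(weights, bias, xs)) ** 2 for xs, d in data) / len(data)


def train(data, eta=0.5, max_iter=5000, tol=1e-3, seed=0):
    rng = random.Random(seed)
    # Step (i): small random initial weights.
    weights = [rng.uniform(-0.1, 0.1) for _ in data[0][0]]
    bias = rng.uniform(-0.1, 0.1)
    # Step (ii): repeat until the error falls below the preset value.
    for _ in range(max_iter):
        if mse(weights, bias, data) < tol:
            break
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xs, d in data:
            y = predict(weights, bias, xs)
            # Derivative of the error with respect to the weighted sum.
            delta = -2.0 * (d - y) * y * (1.0 - y) / len(data)
            for i, x in enumerate(xs):
                grad_w[i] += delta * x
            grad_b += delta
        # Move against the gradient, scaled by the learning rate.
        weights = [w - eta * g for w, g in zip(weights, grad_w)]
        bias -= eta * grad_b
    return weights, bias


# Hypothetical training set: logical OR of two binary inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias = train(data)
```

In a multilayer network the same update is applied to every weight, with the error derivative propagated backwards layer by layer.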

Support Vector Machines (SVMs).
The support vector machine (SVM) [25] is a type of learning machine based on statistical learning theory. SVMs are supervised learning methods that have been widely and successfully used for pattern recognition in different areas [26].
In recent years in particular, SVMs with linear or nonlinear kernels have become one of the most promising learning algorithms for classification as well as regression [27]. The problem that SVMs try to solve is to find an optimal hyperplane that correctly classifies data points by separating the points of two classes as much as possible [28].
For training vectors $\vec{x}_i$ with class labels $y_i \in \{-1, +1\}$, let $\vec{z}_i = \Phi(\vec{x}_i)$ be the corresponding vectors in feature space, where $\Phi(\vec{x}_i)$ is the implicit kernel mapping, and let $K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j)$ be the kernel function, implying a dot product in the feature space [29].
$K(\vec{x}_i, \vec{x}_j)$ represents the desired notion of similarity between the data $\vec{x}_i$ and $\vec{x}_j$. $K(\vec{x}_i, \vec{x}_j)$ needs to satisfy Mercer's condition in order for $\Phi$ to exist [28].
There are a number of kernel functions which have been found to provide good generalization capabilities [30].
The most commonly used kernel functions are as follows:
$$\text{Linear:} \quad K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j,$$
$$\text{Polynomial:} \quad K(\vec{x}_i, \vec{x}_j) = (\gamma\, \vec{x}_i \cdot \vec{x}_j + c)^d,$$
$$\text{Gaussian (RBF):} \quad K(\vec{x}_i, \vec{x}_j) = \exp\left(-\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2}\right),$$
$$\text{Sigmoid:} \quad K(\vec{x}_i, \vec{x}_j) = \tanh(\gamma\, \vec{x}_i \cdot \vec{x}_j + c),$$
where $\gamma > 0$ and $c$ are kernel parameters, $d$ is the degree of the kernel and a positive integer, and $\sigma$ is the standard deviation and a positive real number. The optimization problem for a soft-margin SVM is
$$\min_{\vec{w}, b, \xi} \; \frac{1}{2} \|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i \tag{7}$$
subject to the constraints $y_i(\vec{w} \cdot \vec{z}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where $\vec{w}$ is the normal vector of the separating hyperplane in feature space, and $C > 0$ is a regularization parameter controlling the penalty for misclassification. Equation (7) is referred to as the primal equation. From the Lagrangian form of (7), we derive the dual problem
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j)$$
subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. This is a quadratic optimization problem that can be solved efficiently using algorithms such as sequential minimal optimization (SMO) [31].
Typically, many $\alpha_i$ go to zero during optimization, and the remaining $\vec{x}_i$ corresponding to those $\alpha_i > 0$ are called support vectors. To simplify notation, from here on we assume that all nonsupport vectors have been removed, so that $n$ is now the number of support vectors, and $\alpha_i > 0$ for all $i$. With this formulation, the normal vector of the separating plane $\vec{w}$ is calculated as
$$\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{z}_i.$$
Note that because $\vec{z}_i = \Phi(\vec{x}_i)$ is defined implicitly, $\vec{w}$ exists only in feature space and cannot be computed directly. Instead, the classification $f(\vec{q})$ of a new query vector $\vec{q}$ can only be determined by computing the kernel function of $\vec{q}$ with every support vector:
$$f(\vec{q}) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(\vec{q}, \vec{x}_i) + b\right),$$
where the bias term $b$ is the offset of the hyperplane along its normal vector, determined during SVM training [29].

Figure 3: A simple Naïve-Bayes structure.
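Classifying a query vector by evaluating the kernel against every support vector can be sketched in Python as follows. The Gaussian kernel is one possible choice, and the support vectors, multipliers, and bias below are hypothetical values picked by hand; they are not the result of actual SVM training.

```python
from math import exp


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel with standard deviation sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-sq_dist / (2.0 * sigma ** 2))


def svm_decision(query, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Sign of the kernel expansion over the support vectors plus the bias."""
    score = sum(
        a * y * kernel(query, sv)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + bias
    return 1 if score >= 0 else -1


# Hypothetical support vectors, multipliers, labels, and bias.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [-1, 1]
bias = 0.0

near_positive = svm_decision((1.9, 2.1), support_vectors, alphas, labels, bias)
near_negative = svm_decision((0.1, -0.1), support_vectors, alphas, labels, bias)
```

The cost of one classification grows with the number of support vectors, since the kernel is evaluated once per support vector.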

Naïve Bayes.
Naïve-Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining [32]. A Naïve-Bayes Bayesian network is a simple structure that has the classification node as the parent node of all other nodes. This structure is shown in Figure 3.
No other connections are allowed in a Naïve-Bayes structure. Naïve-Bayes has been used as an effective classifier for many years. It has two advantages over many other classifiers. First, it is easy to construct, as the structure is given a priori (and hence no structure learning procedure is required). Second, the classification process is very efficient. Both advantages are due to its assumption that all the features are independent of each other. Although this independence assumption is obviously problematic, Naïve-Bayes has surprisingly outperformed many sophisticated classifiers over a large number of datasets, especially where the features are not strongly correlated [33].
The procedure of learning Naïve-Bayes (Figure 3) is as follows.
(1) Let the classification node be the parent of all other nodes.
(2) Learn the parameters (recall these are just the empirical frequency estimates) and output the Naïve-Bayes Bayesian network [34].
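The two learning steps can be sketched in Python for nominal attributes. Parameters are plain empirical frequencies, as in the procedure above, so no smoothing is applied; classification picks the class that maximizes the class prior multiplied by the per-attribute conditional frequencies. The records and class labels are hypothetical.

```python
from collections import Counter, defaultdict


def learn_naive_bayes(records, labels):
    """Count class frequencies and attribute values conditioned on the class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for record, c in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, c)][value] += 1
    return class_counts, value_counts


def classify(record, class_counts, value_counts):
    """Return the class with the largest prior times conditional frequencies."""
    total = sum(class_counts.values())

    def score(c):
        p = class_counts[c] / total
        for i, value in enumerate(record):
            p *= value_counts[(i, c)][value] / class_counts[c]
        return p

    return max(class_counts, key=score)


# Hypothetical training records: (temperature, cough) -> class.
records = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "no")]
labels = ["sick", "sick", "healthy", "healthy"]
class_counts, value_counts = learn_naive_bayes(records, labels)
```

Because unseen attribute values receive a frequency of zero, practical implementations usually add a smoothing term; that is left out here to stay close to the stated procedure.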
Typically, an example $E$ is represented by a tuple of attribute values $(x_1, x_2, \ldots, x_n)$, where $x_i$ is the value of attribute $X_i$. Let $C$ represent the classification variable, and let $c$ be the value of $C$ [32]. The Naïve-Bayes classifier is defined as follows:
$$c_{\mathrm{NB}}(E) = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c).$$

Instance-Based Learning.
Instance-based learning (IBL) [35] algorithms have several notable characteristics. They employ simple representations for concept descriptions, have low incremental learning costs, have small storage requirements, can produce concept exemplars on demand, can learn continuous functions, and can learn nonlinearly separable categories. IBL algorithms have been successfully applied to many areas such as speech recognition, handwritten letter identification, and thyroid disease diagnosis. All IBL algorithms consist of the following three components [36].
(1) Similarity function: given two normalized instances, this yields their numeric-valued similarity.
(2) Classification function: given an instance $i$ to be classified and its similarity with each saved instance, this yields a classification for $i$.
(3) Memory updating algorithm: given the instance being classified and the results of the other two components, this updates the set of saved instances and their classification records.
The IB1 (one nearest neighbor) algorithm, which is the simplest instance-based learning algorithm, is explained below.

IB1 (One Nearest Neighbor).
IB1 [35] is an implementation of the simplest similarity-based learner, known as nearest neighbor. IB1 simply finds the stored instance closest (according to the Euclidean distance metric) to the instance to be classified. The new instance is assigned to the retrieved instance's class. Equation (12) shows the distance metric employed by IB1:
$$D(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}. \tag{12}$$
Equation (12) gives the distance between two instances $x$ and $y$; $x_k$ and $y_k$ refer to the $k$th feature value of instances $x$ and $y$, respectively.
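A one-nearest-neighbor classifier based on the Euclidean distance can be sketched in a few lines of Python. The sketch assumes numeric features that have already been normalized and contain no missing values; the stored instances are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))


def ib1_classify(query, stored):
    """Return the class of the stored instance closest to the query.

    stored is a list of (feature_vector, class_label) pairs.
    """
    _, label = min(stored, key=lambda item: euclidean(query, item[0]))
    return label


# Hypothetical stored instances.
stored = [((0.0, 0.0), "A"), ((1.0, 1.0), "B")]
predicted = ib1_classify((0.2, 0.1), stored)
```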

Simple Logistic Regression.
Logistic regression is one of the most widely used techniques for solving binary classification problems. In logistic regression, the posterior probabilities $p_k^{*}$, $k \in \{1, 2\}$, are represented as in the following:
$$p_1^{*} = \frac{\exp(\eta)}{1 + \exp(\eta)}, \qquad p_2^{*} = \frac{1}{1 + \exp(\eta)}, \tag{13}$$
where $\eta$ is a function of an input $\vec{x}_0$. For example, $\eta$ is a linear function of the input $\vec{x}_0$, that is,
$$\eta = \vec{w} \cdot \vec{x}_0 + b,$$
and the parameters $\vec{w}$, $b$ are estimated by the maximum likelihood method. In general, $\eta$ can be an arbitrary function of $\vec{x}_0$. Note that with an appropriate choice of $\eta$, the model in (13) can represent some kinds of binary classification systems, such as neural networks and LogitBoost [38].
LogitBoost with simple regression functions as base learners is used for fitting the logistic models. The optimal number of LogitBoost iterations to perform is cross-validated, which leads to automatic attribute selection. This method is called "simple logistic" [39,40]. The LogitBoost algorithm is defined below.

LogitBoost Algorithm.
The LogitBoost algorithm [41] is based on the observation that AdaBoost [42] is in essence fitting an additive logistic regression model to the training data. An additive model is an approximation to a function of the form
$$F(x) = \sum_{m=1}^{M} c_m f_m(x),$$
where the $c_m$ are constants to be determined and the $f_m$ are basis functions. If it is assumed that $F(x)$ is the mapping that is looked for to fit as our strong aggregate hypothesis and the $f_m(x)$ are our weak hypotheses, then it can be shown that the two-class AdaBoost algorithm is fitting such a model by minimizing the criterion
$$J(F) = E\left[e^{-y F(x)}\right],$$
where $y$ is the true class label in $\{-1, 1\}$. LogitBoost minimizes this criterion by using Newton-like steps to fit an additive logistic regression model to directly optimize the binomial log-likelihood $-\log(1 + \exp(-2 y F(x)))$ [43].

Boosting.
Boosting [44] is a meta-algorithm which can be viewed as a model averaging method. It is the most widely used ensemble method and one of the most powerful learning ideas introduced in the last twenty years. Originally designed for classification, it can also be profitably extended to regression. One first creates a "weak" classifier; that is, it suffices that its accuracy on the training set is only slightly better than random guessing. A succession of models is built iteratively, each one being trained on a dataset in which points misclassified (or, with regression, those poorly predicted) by the previous model are given more weight. Finally, all of the successive models are weighted according to their success and then the outputs are combined using voting (for classification) or averaging (for regression), thus creating a final model. The original boosting algorithm combined three weak learners to generate a strong learner [45].
In a practical situation the label $y_t$ may be hidden, and the task is to estimate it using the vector of features $\vec{x}_t$. Let us consider the simplest linear decision function
$$u_t = u(\vec{x}_t) = \sum_{j=0}^{\ell} w_j x_{tj},$$
where $x_{t0}$ is a constant term. A decision rule can be defined as a function of the decision function and a threshold parameter $\Delta$:
$$f_t = f(u_t, \Delta) = \begin{cases} +1, & u_t \geq \Delta, \\ -1, & \text{otherwise}. \end{cases}$$
Let us consider minimizing the criterion
$$\sum_{t=1}^{n} \xi(\vec{x}_t, y_t)\, e^{-y_t u(\vec{x}_t)}, \tag{19}$$
where the weight function is given below:
$$\xi(\vec{x}_t, y_t) = \exp\{-y_t F(\vec{x}_t)\}. \tag{20}$$
It is assumed that the initial values of the ensemble decision function $F(\vec{x}_t)$ are set to zero. Advantages of the exponential compared with the squared loss function were discussed in [46]. Unfortunately, it is not possible to optimize the step size in the case of an exponential target function. It is essential to maintain a low value of the step size in order to ensure stability of the gradient-based optimization algorithm. As a consequence, the whole optimization process may be very slow and time-consuming. The AdaBoost algorithm was introduced in [42] in order to facilitate the optimization process.
The following Taylor approximation is valid under the assumption that the values of $u(\vec{x}_t)$ are small:
$$\exp\{-y_t u(\vec{x}_t)\} \approx \frac{1}{2}\left[(y_t - u(\vec{x}_t))^2 + 1\right].$$
Therefore, a quadratic-minimization (QM) model is applied in order to minimize (19). Then, the value of the threshold parameter $\Delta$ for $u_t$ is optimized and the corresponding decision rule $f_t \in \{-1, +1\}$ is found.
Next, we return to (19), written for the decision rule as
$$\sum_{t=1}^{n} \xi(\vec{x}_t, y_t)\, e^{-c\, y_t f(\vec{x}_t)},$$
where the optimal value of the parameter $c$ may be easily found:
$$c = \frac{1}{2} \log \frac{A}{B},$$
and where
$$A = \sum_{y_t = f(\vec{x}_t)} \xi(\vec{x}_t, y_t), \qquad B = \sum_{y_t \neq f(\vec{x}_t)} \xi(\vec{x}_t, y_t).$$
Finally, for the current boosting iteration, we update the function $F$,
$$F_{\mathrm{new}}(\vec{x}_t) = F(\vec{x}_t) + c\, f(\vec{x}_t),$$
and recompute the weight coefficients $\xi$ according to (20) [47].
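One boosting iteration of this kind can be sketched in Python, under the assumption that labels and weak-rule predictions take values in {-1, +1}. Each example is weighted by the exponential of minus its label times its current ensemble score, the step size is half the logarithm of the ratio between the total weight of correctly classified examples (A) and that of misclassified examples (B), and the ensemble scores are then updated. The labels and predictions are hypothetical, and the sketch assumes the weak rule makes at least one correct and one incorrect prediction so that both A and B are positive.

```python
from math import exp, log


def example_weights(F, labels):
    """Exponential weight of each example given the ensemble scores F."""
    return [exp(-y * f) for y, f in zip(labels, F)]


def boosting_step(F, predictions, labels):
    """Return the updated ensemble scores and the step size c."""
    weights = example_weights(F, labels)
    # Total weight of correctly (A) and incorrectly (B) classified examples.
    A = sum(w for w, y, p in zip(weights, labels, predictions) if y == p)
    B = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    c = 0.5 * log(A / B)
    return [f + c * p for f, p in zip(F, predictions)], c


# Hypothetical labels and weak-rule predictions; the last one is wrong.
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]
F0 = [0.0, 0.0, 0.0, 0.0]  # ensemble scores start at zero

F1, step = boosting_step(F0, predictions, labels)
new_weights = example_weights(F1, labels)
```

After the update, the misclassified example carries a larger weight than the correctly classified ones, so the next weak rule concentrates on it.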

Bagging.
Bagging predictors [48] is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy [48].

Reduced Error Pruning Tree.
Reduced error pruning (REP) was introduced by Quinlan [50] in the context of decision tree learning. It has subsequently been adapted to rule set learning as well [51]. REP produces an optimal pruning of a given tree, that is, the smallest tree among those with minimal error with respect to a given set of pruning examples [51,52]. The REP algorithm works in two phases. First, the set of pruning examples is classified using the given tree to be pruned. Counters that keep track of the number of examples of each class passing through each node are updated simultaneously. In the second phase, a bottom-up pruning phase, those parts of the tree that can be removed without increasing the error of the remaining hypothesis are pruned away [53]. The pruning decisions are based on the node statistics calculated in the top-down classification phase.

ZeroR (Zero Rule).
Zero rule (ZeroR, 0-R) is a trivial classifier, but it gives a lower bound on the performance for a given dataset, which should be significantly improved upon by more complex classifiers. As such, it is a reasonable test of how well the class can be predicted without considering the other attributes [54].
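The zero rule baseline amounts to always predicting the most frequent training class, as the following Python sketch shows. The class labels are hypothetical.

```python
from collections import Counter


def zero_r(train_labels):
    """Return a classifier that ignores its input and predicts the majority class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _record: majority


# Hypothetical class labels: 7 "normal" records and 3 "low" records.
labels = ["normal"] * 7 + ["low"] * 3
predict = zero_r(labels)
baseline_accuracy = sum(predict(None) == y for y in labels) / len(labels)
```

The resulting accuracy equals the share of the majority class, which is the figure any other classifier has to beat.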

ALARM Network Structure and Datasets
In order to compare the performances (in terms of accuracy) of the machine learning methods in the scope of this study, the network structure known in the literature as the ALARM (a logical alarm reduction mechanism) network [5], which is widely used in scientific studies, is used. The ALARM network is a network structure that was prepared by using real patient information for many variables and shows the probabilities derived from real-life circumstances. The ALARM network calculates the probabilities of different diagnoses based on the current evidence, and it has recently been used by many researchers. In total there are 37 nodes in the ALARM network, and the relationships and conditional probabilities among them have been defined. The medical information has been coded in a graphical structure with 46 arcs, 16 findings, and 13 intermediate variables that relate the examination results to 8 diagnosis problems. Two algorithms have been applied to this Bayes network. The first is a message-passing algorithm, developed by Pearl [55], that updates the probabilities in multiply connected networks using conditioning methods. The second is the exact inference algorithm developed by Lauritzen and Spiegelhalter [56] for local probability calculations in the graphical structure. There are three types of variables in the ALARM network: diagnoses, measurements, and intermediate variables.
(1) Diagnoses and other qualitative information are at the top of the network. These variables have no predecessors, and they are deemed mutually independent. Each node is associated with a detailed set of values that represents the severity and the presence of a certain disease.
(2) Measurements represent any current quantitative information. All continuous variables are represented categorically with a set of discrete intervals that divides the value range.
(3) Intermediate variables represent elements that cannot be measured directly.
The probabilities in the Bayes network can represent both objective and subjective information. The ALARM network includes statistical data, logical conditional probabilities that are calculated from the equations relevant to the variables, and a certain number of subjective valuations, and it is commonly used to form the network structure over synthetic data.
A conditional probability must be obtained for each node given every combination of values of its predecessor nodes. The structure of the ALARM network and the defined variables are shown in Figure 4.
In order to compare the performances of the algorithms mentioned in Section 3, synthetic data sets with 10, 100, 1000, and 2000 records have been produced to reflect the probabilities on the ALARM network. For these operations, the NETICA 3.18 [57] software has been used, based on the ALARM network structure. The conditional probability diagram for the ALARM network structure and a variable defined in the structure are shown in Figure 5. Some of the synthetic data has been taken as test data.
Each record in the generated data shows probable values for each of the 37 variables that were defined on this network. Each record consists of values for the intermediate variables as well as for 12 input and 11 output variables. The tests that were carried out send the input variable values of each record to the relevant module and keep the resulting list as a separate file. The accuracy of the results is decided by comparing them with the variable values of the relevant record in the test data. For each record, 11 probable results have been obtained.
The open source software JavaBayes [58] was applied to each of the generated synthetic data sets separately. The 11 output variables for one record of the 100-record data set are shown in Table 1. JavaBayes uses a generalized version of the "variable elimination" method as its inference algorithm [59]. It generated 110 output values for the 10-record data set, 1100 for the 100-record data set, 11000 for the 1000-record data set, and 22000 for the 2000-record data set.
In Table 1, for each data set only 11 output variables for one record are presented. In this table, first column shows the variable name (disease name) and the second column shows the accuracy and they are calculated by the software using Bayes theorem, third column shows the real situations in the ALARM network, fourth column shows the results, generated by the software, and fifth column shows the comparison between the real situation and the results generated by the software. In the fifth column, if the real situation and the results generated by the software are the same POSITIVE and if the real situation and the results generated by the software are not the same NEGATIVE result will be generated. POSITIVE values show correct diagnosis, and NEGATIVE values show incorrect diagnosis.
For example, in Table 1, the accuracy of the MinVol variable has been calculated as 0.9136 by the software. Because this value is not the same with the real situation,  the correct diagnosis has not been obtained. Similarly, for HREKG variable, the accuracy has been calculated as 0.8228 by the software. Because this value is the same with the real situation, the correct diagnosis has been obtained. Similar interpretations are also valid for other data sets. Each sample generated by ALARM network includes 12 independent and 11 depended variables. So we formed 11 classification datasets having 12 inputs and one output. The class labels for these 11 datasets are given at Table 2.
To see the effect of sample size, we generated datasets with 10, 100, 1000, and 2000 samples for each of the 11 classification datasets. In total, we have 44 (= 11 * 4) classification datasets.
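The construction of the 44 classification datasets can be sketched as below. This is an illustrative sketch under stated assumptions: the variable names `in1`..`in12` and `out1`..`out11` and the function name are hypothetical placeholders, and the records are dummy values rather than samples drawn from the ALARM network.

```python
def split_by_output(records, input_names, output_names):
    """Build one classification dataset per output variable.

    Each dataset row pairs the input values of a record with a single class label.
    """
    datasets = {}
    for out in output_names:
        datasets[out] = [([rec[name] for name in input_names], rec[out]) for rec in records]
    return datasets


# Hypothetical placeholder names for the 12 inputs and 11 outputs.
input_names = [f"in{i}" for i in range(1, 13)]
output_names = [f"out{i}" for i in range(1, 12)]

all_datasets = {}
for size in (10, 100, 1000, 2000):
    # Dummy records standing in for the generated synthetic samples.
    records = [{name: "NORMAL" for name in input_names + output_names} for _ in range(size)]
    for out, rows in split_by_output(records, input_names, output_names).items():
        all_datasets[(out, size)] = rows
```

With four sample sizes and 11 output variables, `all_datasets` holds 44 entries, one per (output variable, sample size) pair.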

Experimental Design
We used 11 machine learning algorithms from the WEKA library [60] for the classification of these 44 datasets. The algorithms are given in Table 3.
The default design parameters were selected for the NB, MLP, SL, SMO, IBK, J48, and RT algorithms. For the meta-algorithms (boosting, bagging, and random forest), the ensemble size was set to 100 to ensure maximum accuracy.

Experimental Results
The performance of each classification algorithm was evaluated using 5 runs of 10-fold cross validation. In each 10-fold cross validation, each dataset is randomly split into 10 equal-size segments, and the results are averaged over 50 (5 * 10) trials. The classification results are divided into 4 groups according to the datasets' sample size. Tables 4, 5, 6, and 7 show the averaged classification accuracies for the experiments having 10, 100, 1000, and 2000 samples, respectively. Figure 6 shows how the classification accuracy changes with the datasets' sample size. The J48 decision tree is used as the classifier in Figure 6.
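The evaluation protocol of 5 runs of 10-fold cross validation can be sketched as follows. This is a minimal index-splitting sketch, not WEKA's implementation: the function name and the seed are assumptions, and the plain random split used here is not stratified by class.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_runs=5, seed=0):
    """Yield (train_indices, test_indices) for n_runs repetitions of n_folds-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random split for every run
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield train, test


splits = list(repeated_kfold(100))
```

Each of the 50 splits yields one accuracy value, and the reported accuracy of an algorithm on a dataset is the mean of these 50 values.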
As can be seen in Tables 4-7 and Figure 6, the results become more accurate as the sample size increases, as expected. The zero rule defines the accuracy obtained by chance. It assigns the most frequent class to all samples. In the BP and PAP datasets, none of the algorithms outperformed the zero rule. This means that these datasets cannot be learned by any of the algorithms.
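The zero rule baseline can be sketched in a few lines. The function name and the example labels are hypothetical and serve only to illustrate the majority-class prediction.

```python
from collections import Counter


def zero_rule_accuracy(train_labels, test_labels):
    """Predict the most frequent training class for every test sample."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority)
    return hits / len(test_labels)


# Hypothetical labels: "A" is the majority class in the training part.
baseline = zero_rule_accuracy(["A", "A", "B"], ["A", "B", "A", "A"])
```

A classifier whose accuracy does not exceed this baseline on a dataset has not extracted any usable relation between the inputs and the class label.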
We compared the accuracies of all classification algorithms in a pairwise manner in Table 8. To compare the performances of two algorithms, we employed a statistical significance test (paired t-test) at the 0.05 significance level. The win/loss records in Table 8 are the numbers of wins and losses of the algorithm in the row over the algorithm in the column. The number of ties is 11 minus the sum of wins and losses. For example, J48 won over MLP on 5 datasets, and the two algorithms have similar performances on the other 6 datasets. Only the datasets having 2000 samples were used for this comparison.
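The win/loss/tie decision for one pair of algorithms on one dataset can be sketched as below. This is a sketch under assumptions: it uses the plain paired t-test on the 50 per-trial accuracies, the constant 2.0096 is the two-sided 0.05 critical value of the t distribution with 49 degrees of freedom taken from a standard table, and the function names and example accuracies are hypothetical.

```python
import math
from statistics import mean, stdev

# Assumed two-sided critical value at the 0.05 level for 49 degrees of freedom.
T_CRIT_DF49 = 2.0096


def paired_t(a, b):
    """Paired t statistic for two equally long lists of per-trial accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        # All differences are identical: no evidence if zero, unbounded otherwise.
        return 0.0 if mean(diffs) == 0 else math.copysign(math.inf, mean(diffs))
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def outcome(a, b, t_crit=T_CRIT_DF49):
    """Return 'win', 'loss', or 'tie' for algorithm a against algorithm b."""
    t = paired_t(a, b)
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"


def ties(wins, losses, n_datasets=11):
    """Number of datasets on which neither algorithm is significantly better."""
    return n_datasets - wins - losses


# Hypothetical per-trial accuracies over 50 trials.
acc_a = [0.9 + 0.001 * (i % 5) for i in range(50)]
acc_b = [0.8 + 0.001 * ((i * 3) % 5) for i in range(50)]
```

Repeating `outcome` over the 11 datasets with 2000 samples and counting the results gives one cell of the win/loss table.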
In addition to the statistical difference test, we also compared the classification algorithms according to their average ranks.
In the average rank comparison, the algorithms were ordered according to their performances on each of the datasets. Their ranks were then averaged over the 11 datasets. The average ranks and the sums of wins and losses in Table 8 are given in Table 9.
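The average rank computation can be sketched as follows. The handling of equal accuracies is an assumption: the sketch gives tied algorithms the mean of the ranks they span, which is a common convention but is not stated in the text. The function names and the accuracy values are hypothetical.

```python
def ranks(accuracies):
    """Rank algorithms on one dataset: rank 1 is the most accurate.

    Algorithms with equal accuracy share the mean of the ranks they occupy.
    """
    order = sorted(accuracies, key=accuracies.get, reverse=True)
    result = {}
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and accuracies[order[j + 1]] == accuracies[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
        for name in order[i:j + 1]:
            result[name] = shared
        i = j + 1
    return result


def average_ranks(per_dataset):
    """Average each algorithm's rank over all datasets."""
    per_dataset_ranks = [ranks(acc) for acc in per_dataset]
    names = per_dataset[0].keys()
    return {n: sum(r[n] for r in per_dataset_ranks) / len(per_dataset_ranks) for n in names}


# Hypothetical accuracies on two datasets, for illustration only.
example = [
    {"J48": 0.9, "NB": 0.8, "MLP": 0.8},
    {"J48": 0.7, "NB": 0.9, "MLP": 0.6},
]
avg = average_ranks(example)
```

A lower average rank means that the algorithm was, on average, closer to the top of the ordering across the datasets.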
According to Table 9, J48 (C4.5 decision tree) is the best ranked algorithm for our 11 datasets. The second one is bagging. According to the sum of wins, the best one is again J48.
To show the statistically meaningful differences between the average ranks, we also applied the Nemenyi test [61]. According to the Nemenyi test, the performances of two classifiers are significantly different if the corresponding average ranks differ by at least the critical difference (CD) calculated by

$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}. \tag{26}$$

In (26), $k$ is the number of classifiers compared, $N$ is the number of datasets, $q_{\alpha}$ is the critical value, and $\alpha$ is the significance level. In our experiments, the critical value ($q_{0.05}$) is 3.219 for 11 classifiers [62]. The critical difference (CD) is 3.129 * sqrt((11 * 12)/(6 * 11)) = 4.424. According to the Nemenyi test (at $\alpha$ = 0.05), there are no statistical differences between J48 and the algorithms having an average rank of at most 4.424 + 3.55 = 7.974 (NB, SL, SMO, BG, RF, and RT).
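The critical difference arithmetic can be checked with a short sketch. The function name is hypothetical. Because the text quotes 3.219 as the critical value while the arithmetic uses 3.129, the sketch takes the critical value as a parameter and evaluates both, without deciding which one was intended.

```python
import math


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference of the Nemenyi test for k classifiers and n_datasets datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Both critical values that appear in the text, with k = 11 classifiers and N = 11 datasets.
cd_with_3129 = nemenyi_cd(3.129, 11, 11)  # about 4.425
cd_with_3219 = nemenyi_cd(3.219, 11, 11)  # about 4.552
```

With k = N = 11 the square root reduces to sqrt(2), so the critical difference is simply the critical value multiplied by about 1.414.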

Conclusion
In cases of uncertainty and lack of information, the most important part of a decision support system, which supports the decision making process, is the inference mechanism. Data mining methods such as SVM, MLP, and decision trees can be used in the inference mechanism. These methods can be used separately or as a hybrid system, which consists of a combination of them.
In this study, the ALARM network structure, which is widely used in scientific studies, has been used to generate the synthetic data. This network structure has been prepared using real patient information for many variables and shows the probabilities derived from real-life circumstances.
In this study, the performances of 11 machine learning algorithms (SVM, MLP, C4.5, etc.) are tested on 44 synthetic data sets (11 different dependent variables and 4 different dataset sizes). To compare the algorithms, we applied two different tests (statistical difference and average rank). The C4.5 decision tree is the best algorithm according to both tests for our 44 datasets. The datasets having more samples can be predicted better than those having fewer samples.
In a future study, a comparison of the performances of hybrid methods, which are combinations of rule-based and data-driven methods, and of other machine learning systems will be carried out.