Maximum likelihood classifier (MLC) and support vector machines (SVM) are two commonly used approaches in machine learning. MLC is based on Bayesian theory for estimating the parameters of a probabilistic model, whilst SVM is an optimization-based nonparametric method in this context. Recently, it has been found that SVM is in some cases equivalent to MLC when the learning process is modelled probabilistically. In this paper, MLC and SVM are combined in learning and classification, which helps to yield a probabilistic output for SVM and facilitates soft decision making. In total, four groups of data are used for evaluation, covering sonar, vehicle, breast cancer, and DNA sequences. The data samples are characterized as Gaussian/non-Gaussian distributed and balanced/unbalanced, and these characteristics are then used for performance assessment in comparing the SVM and the combined SVM-MLC classifier. Interesting results are reported to indicate how the combined classifier may work under various conditions.
Maximum likelihood classification (MLC) is one of the most commonly used approaches in signal classification and identification, which has been successfully applied in a wide range of engineering applications including classification for digital amplitude-phase modulations [
Based on the principles of Bayesian statistics, MLC provides a parametric approach to decision making, where the model parameters need to be estimated before they are applied for classification. In contrast, SVM is a nonparametric approach whose theoretical background is supervised machine learning. Owing to the differences between these two classifiers, their performance can differ considerably. Taking remote sensing as an example, in Pal and Mather [
Furthermore, there is a growing trend to combine the principle behind MLC, Bayesian theory, with SVM for improved classification. In Ren [
In this paper, analysis and evaluations of SVM and MLC are emphasized, using data from various applications. Since the selected data satisfy certain conditions in terms of specific sample distributions, we aim to find out how the performance of the classifiers is connected to the particular data distributions. As a consequence, the work and the results shown in the paper are valuable for understanding how these classifiers work, which can then provide insightful guidance on how to select and combine them in real applications.
The remaining parts of the paper are organized as follows. Section
In this section, the principles of the two classifiers, SVM and MLC, are discussed. By comparing their theoretic background and implementation details, the two classifiers are characterized in terms of their performances during the training and testing processes. This in turn has motivated our work in the following sections.
Let
For a given sample
Based on Bayesian theory, we have
Since
Applying logarithm operation to the right side of (
Again we can ignore the constant in (
As can be seen,
In a particular case when
Based on (
Moreover, in a special case when
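The Gaussian discriminant derived above can be sketched in code. The following is a minimal illustration rather than the paper's implementation; the function names and the toy two-dimensional data are our own, and the constant term is dropped as in the derivation:

```python
import numpy as np

def fit_gaussian(samples):
    # Maximum likelihood estimates of the class mean vector and covariance matrix
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)
    return mu, sigma

def log_discriminant(x, mu, sigma, prior):
    # Log-domain Gaussian discriminant; the constant term is dropped,
    # as in the derivation above
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior)

# Toy two-class example with well-separated clusters
rng = np.random.default_rng(0)
class0 = rng.normal(0.0, 1.0, size=(200, 2))
class1 = rng.normal(3.0, 1.0, size=(200, 2))
models = [fit_gaussian(class0), fit_gaussian(class1)]
x = np.array([2.8, 3.1])
scores = [log_discriminant(x, mu, sigma, 0.5) for mu, sigma in models]
label = int(np.argmax(scores))  # the sample lies near the class 1 cluster
```

Working in the log domain, as in the derivation, avoids the numerical underflow that evaluating the Gaussian densities directly would cause for high-dimensional data.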
SVM was originally developed for two-class classification problems. In Cortes and Vapnik [
Note that the decision function in (
Hence, the optimal hyperplane to separate the training data with a maximal margin is defined by
To determine this optimal hyperplane, we need to maximize
Eventually, the parameters
For any nonzero
Eventually if we combine (
For problems which are not linearly separable, the discrimination function is extended as
Another important step is to introduce the
Though SVM was initially developed for two-class problems, it has been extended to deal with multiclass classification, based either on combining decision results from multiple two-class classifications or on optimization for multiclass learning. Some useful further reading can be found in [
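As a minimal illustration of combining binary decisions for the multiclass case, the sketch below uses a one-vs-rest scheme with hypothetical pre-trained linear decision functions; the weights and biases are made up for illustration, and the class with the largest margin wins:

```python
import numpy as np

# Hypothetical pre-trained binary decision functions f_k(x) = w_k . x + b_k,
# one per class in a one-vs-rest scheme; the values here are made up
weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
biases = np.array([0.0, 0.0, 0.5])

def one_vs_rest_predict(x):
    # Evaluate every binary margin and pick the class with the largest one
    margins = weights @ x + biases
    return int(np.argmax(margins))
```

For example, `one_vs_rest_predict(np.array([2.0, 0.1]))` returns class 0, since the first decision function yields the largest margin for that point.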
MLC and SVM are two useful tools for classification problems, where both of them rely on supervised learning in determining the model and parameters. However, they are different in several ways as summarized below.
Firstly, MLC is a parametric approach with the basic assumption that the data satisfy a Gaussian distribution. In contrast, SVM is a nonparametric approach that places no requirement on the prior distribution of the data, and various kernels can be empirically selected to deal with different problems.
Secondly, for MLC the model parameters,
Thirdly, MLC can be applied straightforwardly to both two-class and multiclass problems, yet an additional extension is needed for SVM to deal with multiclass problems, as it was initially developed for two-class classification.
Finally, a posterior class probability for the predicted results can be generated intuitively from MLC, which is a valuable indicator showing how likely it is that a sample belongs to a given class. For SVM, however, this is not an easy task, though some extensions have been introduced to provide such an output based on the predicted value from SVM. In Platt [
The parameters
In addition, in Lin et al. [
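Platt-style calibration of SVM outputs can be sketched as follows. This is a simplified illustration using plain gradient descent on the cross-entropy loss, not the regularized fitting procedure of Platt's original method, and the decision values below are made up:

```python
import numpy as np

def platt_scaling(f, y, iters=5000, lr=0.01):
    # Fit P(y=1 | f) = 1 / (1 + exp(A*f + B)) by gradient descent on the
    # cross-entropy loss; f holds SVM decision values, y holds 0/1 labels
    A, B = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        A -= lr * np.sum((y - p) * f)  # dL/dA = sum((y - p) * f)
        B -= lr * np.sum(y - p)        # dL/dB = sum(y - p)
    return A, B

# Made-up decision values: positive margins correspond to class 1
f = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A, B = platt_scaling(f, y)
prob = lambda t: 1.0 / (1.0 + np.exp(A * t + B))
```

After fitting, `prob` maps a raw SVM margin to a calibrated probability of class 1, giving the SVM the kind of soft output that MLC produces natively.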
Although there are significant differences between SVM and MLC, the probabilistic model above has uncovered the connection between these two classifiers. Actually, in Franc et al. [
In our experiments, four different datasets, SamplesNew, svmguide3, sonar, and splice, are used. Among these four datasets, SamplesNew is a dataset of suspicious microcalcification clusters extracted from [
Four datasets used in our experiments.
Dataset      Features   Balance status   Distribution of feature values   Samples (class 0/class 1)   Skewness Max   Skewness Min   Skewness Mean
SamplesNew   39         Unbalanced       Approx. non-Gaussian             748 (115/633)               7.577          n/a            2.343
svmguide3    21         Unbalanced       Approx. non-Gaussian             1284 (947/337)              10.074         n/a            2.181
Sonar        31         Balanced         Approx. Gaussian                 209 (97/102)                1.123          n/a            0.214
Splice       60         Balanced         Approx. Gaussian                 1269 (653/616)              0.672          n/a            n/a
In our approach, a combined classifier using SVM and MLC is applied, which contains the following three stages. In Stage
The open source library libSVM [
In our experiments, the training ratios are set at three different levels, namely 80%, 65%, and 50%. There is no overlap between training data and testing data. At a given training ratio, the training data are randomly selected, and this is repeated five times, leading to five groups of test results. Finally, the average performance over these five experiments is used for comparison.
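The repeated random-split evaluation described above can be sketched as follows; `evaluate` is a hypothetical callback that trains a classifier on the training part and returns its accuracy on the held-out part:

```python
import numpy as np

def repeated_holdout(X, y, train_ratio, evaluate, runs=5, rng=None):
    # Average test accuracy over `runs` random, non-overlapping train/test
    # splits at the given training ratio
    rng = rng or np.random.default_rng(0)
    accuracies = []
    for _ in range(runs):
        order = rng.permutation(len(y))
        cut = int(train_ratio * len(y))
        tr, te = order[:cut], order[cut:]
        accuracies.append(evaluate(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(accuracies))
```

Because each permutation is split into disjoint index ranges, no sample appears in both the training and the testing part of any single run.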
The correctly classified samples, which lie in two classes, that is, class 0 and class 1, are used to build two probability-based models, in the way discussed for MLC. In other words, samples correctly classified into class 0 are used to determine the mean vector and the corresponding covariance matrix of class 0, and samples correctly classified into class 1 are used to determine the mean vector and the corresponding covariance matrix of class 1. Note that not all samples in class 0 or class 1 are used in calculating the related MLC models: those which cannot be correctly classified by SVM are treated as outliers and, for robustness, ignored in MLC modeling.
After MLC modeling, for each sample
With the estimated MLC models and the optimal threshold
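The three stages of the combined classifier can be sketched as follows. This is a simplified illustration, not the paper's code: a made-up linear rule stands in for the trained SVM of Stage 1, and the decision threshold is fixed at zero here, whereas the paper determines an optimal threshold from the training data:

```python
import numpy as np

# Stage 1 stand-in: any function mapping a sample to a 0/1 label would do;
# a made-up linear rule replaces the trained SVM for illustration
def svm_predict(x):
    return int(x.sum() > 2.0)

def fit_mlc_from_svm(X, y):
    # Stage 2: fit per-class Gaussian models using only the samples the SVM
    # classifies correctly; misclassified ones are treated as outliers
    keep = np.array([svm_predict(x) == label for x, label in zip(X, y)])
    models = []
    for c in (0, 1):
        Xc = X[keep & (y == c)]
        models.append((Xc.mean(axis=0), np.cov(Xc, rowvar=False)))
    return models

def mlc_predict(x, models, threshold=0.0):
    # Stage 3: decide by the difference of class log-likelihoods against a
    # threshold (fixed at zero here; the paper searches for an optimal value)
    logliks = []
    for mu, sigma in models:
        d = x - mu
        _, logdet = np.linalg.slogdet(sigma)
        logliks.append(-0.5 * logdet - 0.5 * d @ np.linalg.solve(sigma, d))
    return int(logliks[1] - logliks[0] > threshold)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
models = fit_mlc_from_svm(X, y)
label = mlc_predict(np.array([3.0, 3.0]), models)  # falls in the class 1 cluster
```

Filtering by the SVM's decisions before fitting the Gaussians is what makes the MLC models robust to the outliers the SVM cannot separate.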
For the four datasets discussed in Section
In this group of experiments, a combined classifier using a linear SVM and the MLC is employed, and the relevant results are presented in Figure
Comparing training (a) and testing results (b) using linear SVM and the combined classifier for the four datasets under three different training ratios.
Firstly, for the three datasets sonar, splice, and svmguide3, we can clearly see that the combined solution yields significantly improved training results, especially for the first two datasets. This demonstrates that the combined classifier can indeed achieve more accurate modeling of the datasets. In addition, possibly due to overfitting, the experimental results show that a larger training ratio does not necessarily improve the training performance.
However, the testing results are somewhat different. For the sonar dataset, which is balanced and appears nearly Gaussian distributed, the combined classifier yields much improved testing results, especially at training ratios of 80% and 50%. Such results are not surprising, as MLC is ideal for modeling Gaussian-like distributed datasets. For the splice dataset, which is balanced and also nearly Gaussian distributed, slightly improved testing results are likewise produced by the combined classifier at training ratios of 80% and 50%, but the testing results at a training ratio of 65% are slightly worse than those from the SVM. For the more challenging svmguide3 dataset, which is unbalanced and non-Gaussian distributed, although the combined classifier yields improved testing results at a training ratio of 50%, the results at the other two training ratios, perhaps due to overfitting, are inferior to those from the SVM. Indeed, MLC inherently has difficulty modeling non-Gaussian distributed datasets, which explains why the combined classifier contributes less on these datasets.
In this group of experiments, the RBF kernel is used for the SVM in the combined classifier as it is popularly used in various classification problems [
Comparing training (a) and testing results (b) using the RBF-kernelled SVM and the combined classifier for the four datasets under three different training ratios.
First of all, the RBF-kernelled SVM (RSVM) produces much improved results compared with the linear SVM, especially in training. In fact, the combined classifier generates better results than the SVM only on the SamplesNew dataset, slightly worse results on the sonar and splice datasets, and much degraded results on the svmguide3 dataset.
Regarding the testing results, although the combined classifier generates comparable or slightly worse results on the SamplesNew and svmguide3 datasets, RSVM yields better results on the splice and sonar datasets. The reason is that results from the nonlinear kernel in RSVM cannot be directly refined using MLC. Also, the results from the combined classifier occasionally seem more sensitive to the training ratio, especially for the splice dataset, perhaps because the threshold to be determined depends to some extent on the training data used.
In this group of experiments, using the challenging svmguide3 dataset, we analyze how various strategies to rebalance the unbalanced data may affect classification performance. In an unbalanced dataset, samples from one class may be overrepresented compared with those in another class. As a result, we can either oversample the minority class or subsample the majority class to balance the number of samples represented in the training set for better modeling of the data. The test samples, on the other hand, remain unbalanced, as it is assumed that we have no label information for them.
For oversampling, samples from the minority class are randomly duplicated and inserted into the dataset, and this replication continues until the entire training set becomes balanced. Different from oversampling, subsampling randomly discards samples from the majority class until the training set becomes balanced. Since performance may be affected by which samples are duplicated or discarded, this process is repeated over ten times and the average performance is then recorded for comparison.
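The two rebalancing strategies can be sketched as follows; the function and argument names are our own, assuming a binary 0/1 labelling:

```python
import numpy as np

def rebalance(X, y, mode="oversample", rng=None):
    # Balance a binary training set by duplicating minority samples
    # ("oversample") or randomly discarding majority samples ("subsample")
    rng = rng or np.random.default_rng()
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    minority, majority = (idx0, idx1) if len(idx0) < len(idx1) else (idx1, idx0)
    if mode == "oversample":
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        keep = np.concatenate([majority, minority, extra])
    else:  # subsample
        keep = np.concatenate([minority, rng.choice(majority, size=len(minority), replace=False)])
    keep = rng.permutation(keep)
    return X[keep], y[keep]
```

Oversampling preserves all of the original information at the cost of duplicated samples, which is one route to the overfitting discussed below, while subsampling discards majority-class information but keeps every training sample unique.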
Using three different training ratios at 80%, 65%, and 50%, results of balanced learning for the svmguide3 dataset are summarized in Figure
Results of balanced learning for the svmguide3 dataset, using linear SVM (a) and RSVM (b).
When linear SVM is used, as shown in the first row of Figure
For the RBF-kernelled SVM, the training results from SVM with oversampling are apparently among the best, though the testing results are inferior to those from unbalanced training. This indicates that the training process has overfitted in this context. In fact, the testing results from the combined classifier are slightly worse, that is, somewhat degraded, compared with those from the SVM classifier. Again, this is caused by the inconsistency between the nonlinear SVM and the linear nature of MLC.
SVM and MLC are two typical classifiers commonly used in many engineering applications. Although there is a trend to combine MLC with SVM to provide a probabilistic output for SVM, the conditions under which the combined classifier works effectively need to be explored. In this paper, comprehensive results on four different datasets are presented to answer this question. First of all, it is found that the combined classifier works under certain constraints, such as a linear SVM, a balanced dataset, and nearly Gaussian-distributed data. When an RBF-kernelled SVM is used, the combined classifier may produce degraded results due to the inconsistency between the nonlinear kernel in SVM and the linear nature of MLC. In addition, for a challenging dataset, balanced learning may improve the training results but not necessarily the testing results. The reason is that the combined SVM-MLC classifier rests on three assumptions, namely Gaussian-distributed data, interclass separability, and model consistency between training data and testing data. Although the third assumption holds in most cases, the precondition of separable Gaussian-distributed data is a rather strict constraint that is rarely satisfied. As a result, this introduces a fundamental difficulty in combining the two classifiers. Under certain circumstances, however, the combined classifier can indeed significantly improve classification performance. It is worth noting that when more groups are introduced in modelling a given dataset, the efficacy can be severely degraded due to the inconsistency of statistical distributions between groups. Future work will focus on combining other classifiers such as neural networks for applications in medical imaging [
The authors declare that there is no conflict of interests regarding the publication of this paper.