Robustification of Naïve Bayes Classifier and Its Application for Microarray Gene Expression Data Analysis

The naïve Bayes classifier (NBC) is one of the most popular classifiers for class prediction or pattern recognition from microarray gene expression data (MGED). However, it is very much sensitive to outliers with the classical estimates of the location and scale parameters. It is one of the most important drawbacks for gene expression data analysis by the classical NBC. The gene expression dataset is often contaminated by outliers due to several steps involved in the data generating process from hybridization of DNA samples to image analysis. Therefore, in this paper, an attempt is made to robustify the Gaussian NBC by the minimum β-divergence method. The role of minimum β-divergence method in this article is to produce the robust estimators for the location and scale parameters based on the training dataset and outlier detection and modification in test dataset. The performance of the proposed method depends on the tuning parameter β. It reduces to the traditional naïve Bayes classifier when β → 0. We investigated the performance of the proposed beta naïve Bayes classifier (β-NBC) in a comparison with some popular existing classifiers (NBC, KNN, SVM, and AdaBoost) using both simulated and real gene expression datasets. We observed that the proposed method improved the performance over the others in presence of outliers. Otherwise, it keeps almost equal performance.


Introduction
Classification is a supervised learning approach for separation of multivariate data into various sources of populations. It has been playing significant roles in bioinformatics by class prediction or pattern recognition from molecular OMICS datasets. Microarray gene expression data analysis is one of the most important OMICS research wings for bioinformatics [1]. There are several classification and clustering approaches that have been addressed previously for analyzing MGED [2][3][4][5][6][7][8][9][10][11]. The Gaussian linear Bayes classifier (LBC) is one of the most popular classifiers for class prediction or pattern recognition. However, it is not so popular for microarray gene expression data analysis, since it suffers from the inverse problem of its covariance matrix in presence of large number of genes (p) with small number of patients/samples (n) in the training dataset. The Gaussian naïve Bayes classifier (NBC) overcomes this difficulty of Gaussian LBC by taking the normality and independence assumptions on the variables. If these two assumptions are violated, then the nonparametric version of NBC is suggested in [12]. In this case the nonparametric classification methods work well but they produce poor performance for small sample sizes or in presence of outliers. In MGED the small samples are conducted because of cost and limited specimen availability [13]. There are some other versions of NBC also [14,15]. However, none of them are so robust against outliers. It is one of the most important drawbacks for gene expression data analysis by the existing NBC. The gene expression dataset is often contaminated by outliers due to several steps involved in the data generating process from hybridization of DNA samples to image analysis. Therefore, in this paper, an attempt is made to robustify the Gaussian NBC by the minimum -divergence method within two steps. At step-1, the minimum -divergence method [16][17][18] attempts to estimate the parameters for the Gaussian NBC based on the training dataset. At step-2, an attempt is made to detect the outlying data vector from the test dataset using the -weight function. Then an attempt is made to propose criteria to detect the outlying components in the test data vector and the modification of outlying components by the reasonable values. It will be observed that the performance of the proposed method depends on the tuning parameter and it reduces to the traditional Gaussian NBC when → 0. Therefore, we call the proposed classifier as -NBC.
An attempt is made to investigate the robustness performance of the proposed -NBC in a comparison with several versions of robust linear classifiers based on M-estimator [19,20], MCD (Minimum Covariance Determinant), and MVE (Minimum Volume Ellipsoid) estimators [21,22], Orthogonalized Gnanadesikan-Kettenring (OGK) estimator including MCD-A, MCD-B, and MCD-C [23], and Feasible Solution Algorithm (FSA) classifiers [24][25][26]. We observed that the proposed -NBC outperforms existing robust linear classifiers as mentioned earlier. Then we investigate the performance of the proposed method in a comparison with some popular classifiers including Support Vector Machine (SVM), k-nearest neighbors (KNN), and AdaBoost; those are widely used in gene expression data analysis [27][28][29]. We observed that the proposed method improves the performance over the others in presence of outliers. Otherwise, it keeps almost equal performance.

Methodology
2.1. Naïve Bayes Classifier. The naïve Bayes classifiers (NBCs) [30] are a family of probabilistic classifiers depending on the Bayes' theorem with independence and normality assumptions among the variables. The common rule of NBCs is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. Assume that we have a training sample of vectors {x = ( 1 , 2 , . . . , ) ; = 1, 2, . . . , } of size for = 1, 2, . . . , , where x denotes the jth observation of the ith variable in the kth population/class ( ). Then the NBCs assign a class label̂= C for some k as follows: For the Gaussian NBC, the density function (x | , ) of kth population/class ( ) can be written as where ={ ,Λ }, and here = ( 1 , 2 , . . . , ) , is the mean vector and the diagonal covariance matrix is

Maximum Likelihood Estimators (MLEs) for the Gaussian NBC.
We assume that the prior probabilities ( ) are known and the maximum likelihood estimators (MLEs)â ndΛ of and Λ are obtained based on the training dataset as follows: It is obvious from (1)-(2) that the Gaussian NBC depends on the mean vectors ( ) and diagonal covariance matrix (Λ ); those are estimated by the maximum likelihood estimators (MLEs) as given in (4)-(6) based on the training dataset. Therefore, MLE based Gaussian NBC produces misleading results in presence of outliers in the datasets. To get rid of this problem, an attempt is made to robustify the Gaussian NBC by minimum -divergence method [16][17][18].
The minimum -divergence estimator is defined bŷ For the Gaussian density = { , Λ } and the minimum -divergence estimatorŝ, andΛ , for the mean vector and the diagonal covariance matrix Λ , respectively, are obtained iteratively as follows: The formulation of (10)-(12) is straightforward as described in the previous works [17,18]. The function in (12) is called the -weight function, which plays the key role for robust estimation of the parameters. If tends to 0, then (10) are reduced to the classical noniterative estimates of mean and diagonal covariance matrix as given in (4) and (6), respectively. The performance of the proposed method depends on the value of the tuning parameter and initialization of the Gaussian parameters = { , Λ }.

Parameters Initialization and Breakdown Points of the
Estimates. The mean vector is initialized by the median vector, since mean and median are same for normal distribution and the median (Me) is highly robust against outliers with 50% breakdown points to estimate central value of the distribution. The median vector of kth class/population is defined as The diagonal covariance matrix Λ is initialized by the identity matrix (I). The iterative procedure will converge to the optimal point of the parameters, since the initial mean vector would belong to the center of the dataset with 50% breakdown points. The proposed estimators can resist the effect of more than 50% breakdown points if we can initialize the mean vector by a vector that belongs to the good part of the dataset and the variance-covariance Λ by the identity (I) matrix. More discussion about high breakdown points for the minimum -divergence estimators can be found in [18].

-Selection Using T-Fold Cross Validation (CV) for
Parameter Estimation. To select the appropriate by CV, we fix the tuning parameter to 0 . The computation steps for selecting appropriate by T-fold cross validation is given below.

Outlier Identification
Using -Weight Function. The performance of NBC for classification of an unlabeled data vector x using (1) not only depends on the robust estimation of the parameters but also depends on the values of x weather it is contaminated or not. The data vector x is said to be contaminated if at least one component of x = { 1 , 2 , . . . , } is contaminated by outlier. To derive a criterion of whether the unlabeled data vector x is contaminated or not, we consider -weight function (12) and rewrite it as follows: The values of this weight function lie between 0 and 1. This weight function produces larger weight (but less than 1) if x ∈ and smaller weight (but greater than 0) if x ∉ or contaminated by outlier. Therefore, the -weight function (15) can be characterized as The threshold value can be determined based on the empirical distribution of -weight function as discussed in [31] and by the quantile values of , (x |̂, ,Λ , ) for = 1, 2, . . . , with probability where is the probability for selecting the cut-off value and the value of should lie between 0.00 and 0.05. In this paper, heuristically we choose = 0.03 to fix the cut-off value for detection of outlying data vector using (18). This idea was first introduced in [31]. Then the criteria whether the unlabeled data vector x is contaminated or not can be defined as follows: However, in this paper, we directly choose the threshold value of as follows: With heuristically = 0.10, where D is the training dataset including the unclassified data vector x, (19) was also used in the previous works in [16,18] to choose the threshold value for outlier detection.

Classification by the Proposed -NBC.
When the unlabeled data vector x is usual, the appropriate label/class of x can be determined using the minimum -divergence estimators Table 1: Gene expression data generating model.

Gene group Individual Normal
Patient . If the unlabeled data vector x is unusual/contaminated by outliers, then we propose a classification rule as follows. We compute the absolute difference between the outlying vector and each of mean vectors as Compute sum of the smallest r components of d as = Then the unlabeled test data vector x can be classified aŝ If the outlying test vector x is classified in to class , then its ith component is said to be outlying if > ( = 1, 2, . . . , ). Then we update x by replacing its outlying components with the corresponding mean components from the mean vector̂, of kth population. Let x * be the updated vector of x. Then we use x * instead of x to confirm the label/class of x using (1).

Simulated Dataset 1.
To investigate the performance of our proposed ( -NBC) classifier in a comparison with four popular classifiers (KNN, NBC, SVM, and AdaBoost), we generated both training and test datasets from = 2 multivariate normal distributions with different mean vectors ( , = 1, 2) of length = 10 but common covariance matrix (Λ = Λ; = 1, 2). In this simulation study, we generated N 1 = 40 samples from the first population and N 2 = 42 samples from the second population for both training and test datasets. We computed the training error and test error rate for all five classifiers using both original and contaminated datasets with different mean vectors {( 1 , 2 = 1 + ); = 0, . . . , 9}, where the other parameters remain the same for each dataset. For convenience of the presentation, we distinguish the two mean vectors in such a way in which the second mean vector is generated by adding t with each of the components of the first mean vector.

Simulated Dataset 2.
To investigate the performance of the proposed classifier ( -NBC) in a comparison of the classical NBC for the classification of object into two groups, let us consider a model for generating gene expression datasets as displayed in Table 1 which was also used in Nowak and Tibshirani [32]. In Table 1, the first column represents the gene expressions of normal individuals and the second column represents the gene expressions of patient individuals. First row represents the genes from group A and second row represents the genes from group B. To randomize the gene expression, Gaussian noise is added from (0, 2 ). First we generate a training gene-set using the data generating model (Table 1) where the scalar number Ω is the common difference between two corresponding mean components of 1 and 2 , respectively. Similarly, for generating the training and test datasets, we consider the 1 = 30, 2 = 30, and 3 = 30 ( = 1 + 2 + 3 ) samples from = 3. It is carried out with different means and common variance-covariance matrix of multivariate normal populations ( 1 , Λ 1 ), ( 2 , Λ 2 ), and ( 3 , Λ 3 ). In this case we consider = + Ω with Ω = 0, 1, . . . , 10 and = 1, 2, 3 such that 1 = 2 = 3 for Ω = 0; otherwise 1 ̸ = 2 ̸ = 3 , where the scalar number Ω is the common difference among the corresponding mean components of 1 , 2 , and 3 , respectively.

Head and Neck Cancer Gene Expression Dataset.
To demonstrate the performance of the proposed classifier ( -NBC) in a comparison with four popular classifiers (KNN, NBC, SVM, and AdaBoost) with the real gene expression dataset, we considered the head and neck cancer (HNC) gene expression dataset from the previous work [33]. The term head and neck cancer denotes a group of biologically comparable cancers originating from the upper aero digestive tract, including the following parts of human body: lip, oral cavity (mouth), nasal cavity, pharynx and larynx, and paranasal sinuses. This microarray gene expression dataset contains 12626 genes, where 594 genes are differentially expressed and the rest of the genes are equally expressed.

Simulation Results of Dataset 1.
We have used the simulated dataset 1 to investigate the performance of the proposed method with the performance of the other popular classifiers such as classical NBC, SVM, KNN, and AdaBoost. Figures  1(a)-1(f) represent the test error rate estimated by these five classifiers against the common mean differences in absence of outliers (original dataset) and in presence of 5%, 10%, 15%, 20%, and 25% outliers in test dataset, respectively. From Figure 1(f) it is evident that in absence of outlier every method produces almost the same result, whereas, in presence of different levels of outliers (see Figures 1(a)-1(e)), the proposed method outperformed the other methods by producing low test error rate. Table 2 is summarized with different performance measures (accuracy, sensitivity, specificity, positive predicted value (PPV), negative predicted value (NPV), prevalence, detection rate, detection prevalence, Matthews correlation coefficient (MCC), and misclassification error rate). All these performance measures are computed by the five methods (NBC, KNN, SVM, AdaBoost, and proposed).
From Table 2 we observed that the proposed method produces better results than the other classifiers (NBC, SVM, KNN, and AdaBoost), since it produces higher values of accuracy (>97%), sensitivity (>95%), specificity (>94%), PPV (>94%), NPV (>94%), and MCC (>94%) and lower values of prevalence and MER (<4%). The proportion test statistic [34] has been used to test the significance of several proportions produced by the five classifiers for each of the performance measures. The column 7 of Table 2 represents the values of this test statistic. Since all the values except MER are less than 0.01, so we can conclude that the performance results are highly statistically significant. The MER ( value < 0.05) is also statistically significant at 5% level of significance. So we may conclude from simulated dataset 1 that our proposed method performed better than the other classical methods for the contaminated dataset. It keeps equal performance in absence of outliers for the original dataset.

Simulation Results of Dataset 2.
To investigate the performance of the proposed classifier ( -NBC) in a comparison of the classical NBC for the classification of objects into two groups, we considered the simulated dataset 2. Figures 2(a) and 2(b) show training and test datasets in absence of outliers, respectively. Here genes are randomly allocated in the test dataset. Figures 3(a) and 3(b) show the results of classified test dataset by classical and proposed NBC, respectively.
From classification results we observed that both the naïve Bayes procedures and proposed method produce almost the same results with low misclassification error rates in absence of outliers. To investigate the robustness performance of our proposed method in a comparison with the conventional naïve Bayes procedure for classification, we randomly contaminated 30% genes by outliers in the test gene-sets (Figures 4(a)-4(c)).

Head and Neck Cancer Gene Expression Data Analysis.
We also investigated the performance of the proposed method in real microarray gene expression dataset. The normalized Head and Neck cancer (HNC) dataset is considered here [33]. The RNA sample was extracted from the 22 normal and 22 cancer tissues for generating the HNC dataset. The Affymetrix GeneChip was used for processing RNA samples and finally got the quantified CELL file format. The Robust Multichip Analysis (RMA) and quantile normalization methods were used for processing the CELL files. The HNC dataset was 12,642 probe sets, 44 samples, and 42 significantly differentially expressed probe sets. The detailed discussion is shown in [33] for preprocessing of HNC dataset. We first select the differentially expressed (DE) genes whose posterior probability is more than 0.9; otherwise the genes are equally expressed (EE) using bridge R package [35] which is shown in Figure 6 that shows 594 differentially expressed genes from 12626 genes. We have performed the Anderson-Darling (A-D) normality test [36,37] for the HNC dataset. The results show that a few numbers of DE genes (5%) for both normal and cancer groups break the normality assumption at 1% level of significance. Also we checked the independence assumption of DE genes using the mutual information [38]. We found that the mutual information for HNC dataset is 0.044 which is almost close to zero for both normal and cancer groups. So we may conclude that the DE genes almost satisfy the independence assumption. Therefore, we may assume that the HNC dataset almost satisfies the normality and independence assumption of NBC for a given class/groups.
For classification problem, we have considered half of the differentially expressed genes (594/2 = 297) as training gene-set and we identified their group using hierarchical clustering (HC). Figure 7 represents the dendrogram of HC of half of the differentially expressed genes for training data. The rest of the 297 differentially expressed genes are considered as a test gene-set. Then we employed both classical NBC and robust NBC ( -NBC) in this dataset to classify cancer genes (see Figures 8(a)-8(d)). We observed that from Figure 8 the traditional naïve Bayes procedure can not find the group of gene properly whereas our proposed method ( -NBC) performs better for identifying the gene group in the HNC dataset. Figure 8(d) shows that the proposed classifier shows better performance for classifying the samples than the classical method (Figure 8(c)).
We also computed different performance measures (accuracy, sensitivity, specificity, positive predicted value (PPV), negative predicted value (NPV), prevalence, detection rate, detection, prevalence, Matthews correlation coefficient (MCC), and misclassification error rate) by the five classification methods (NBC, KNN, SVM, AdaBoost, and proposed) using HNC dataset (Table 5). From Table 5 we have observed    that the proposed classifier produces better results than the other classifiers (NBC, SVM, KNN, and AdaBoost). The proportion test [34] has shown that the values <0.01 for the different performance results excluding MCC and MER. Then we may say that they are highly statistically significant. The MCC and MER are statistically significant at 5% level of significance because of the values < 0.05. Hence, the performances of the proposed methods in real HNC data analysis are better than classical and other methods. Also this data set is contaminated by outliers reported in [31]. So we consider this dataset to investigate the performance of the proposed method in a comparison of some popular existing classifiers. We observed that the proposed method outperforms the others for this HNC dataset.

Discussion
In this paper, we discussed the robustification of Gaussian NBC using the minimum -divergence method within two steps. For both simulated and real data analysis, at first, the mean vectors and the diagonal covariance matrices were computed by the minimum -divergence estimators for the Gaussian NBC based on the training dataset. Then outlying test data vectors were detected from the test dataset using the -weight function and outlying components in each test data vector were replaced by the corresponding values of their estimated mean vectors. Then the modified test data vectors were used as the input data vectors in the proposed -NBC for their class prediction or pattern recognition. The rest of the data vectors from the test dataset were directly used as the input data vectors in the proposed -NBC for their class prediction or pattern recognition. We observed that the performance of the proposed method depends on the tuning parameter and the initialization of the Gaussian parameters. Therefore, in this paper, we also discussed the initialization procedure for the Gaussian parameters and the -selection procedure using cross validation in Sections 2.3.2 and 2.3.3, respectively. The classifier reduces to the traditional Gaussian NBC when → 0. Therefore, we call the proposed classifier -NBC. We investigated the robustness performance of the proposed -NBC in a comparison of several robust versions of linear classifiers based on MCD, MVE, and OGK estimators taking the smaller number of variables/genes (p) with larger number of patients/samples (n) in the training dataset, since these types of robust classifiers also suffer from the inverse problem of its covariance matrix in presence of large number of variables/genes (p) with small number of patients/samples ( ) in the training dataset. We observed that the proposed -NBC outperforms the existing robust linear classifiers as early mentioned in presence of outliers. Otherwise, it keeps almost equal performance. Then we investigated the performance of the proposed method in a comparison of some popular classifiers including Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and AdaBoost which are widely used for gene expression data analysis [27][28][29].
In that comparison, we used both simulated and real gene expression datasets. We observed that the proposed method improves the performance over the others in presence of outliers. Otherwise, it keeps almost equal performance as before. The main advantage of the proposed classifier over the others is that it works well for both conditions of (i) < and (ii) > , and it can resist the effect of 50% breakdown points. If the dataset does not satisfy the normality assumptions, then the proposed method may show weaker performance than others in absence of outliers. However, the nonnormal dataset can transform to the normal dataset by some suitable transformation like Box-Cox transformation [39]. Then the proposed method would be useful to tackle the outlying problems. The proposed method may also suffer from the correlated observations. In that case, correlated observations can be transforming to the uncorrelated observations using standard principal component analysis (PCA) or singular value decomposition (SVD) based PCA. Then the proposed method would be more useful to tackle the outlying problems as before. However, in our current studied in this paper, we investigated the performance of the proposed classifier ( -NBC) in a comparison of some popular existing classifiers (NBC, KNN, SVM, and AdaBoost) including some robust linear classifiers (MCD, MVE, OGK, MCD-A, MCD-B, MCD-C, and FSA) using both simulated and real gene expression datasets, where simulated datasets satisfied the normality and independent assumptions. We observed that the proposed method improved the performance over the others in presence of outliers. Otherwise, it keeps almost equal performance. Usually gene expression datasets are often contaminated by outliers due to several steps involved in the data generating process from hybridization to image analysis. Therefore the proposed method would be more suitable for gene expression data analysis.

Conclusion
The accurate sample class prediction or pattern recognition is one of the most significant issues for MGED analysis. The naïve Bayes classifier is an important and widely used method for the class prediction in bioinformatics. However, this method suffers from outlying problems to estimate the location parameters in the MGED analysis. To overcome this we proposed -NBC for estimating the robust location and scale parameters. In the simulation studies 1 and 2, we showed that, in presence of outliers, the proposed -NBC outperforms other popular classifiers while datasets were generated from the multivariate and univariate normal distribution, respectively, and it keeps equal performance with the other classifiers, in absence of outliers. We also investigated the robustness performance of the proposed -NBC in a comparison of linear classifier using some popular robust estimators in the simulation study 3. From this simulation study we observed that the proposed -NBC outperforms existing robust linear classifiers. Finally we applied in the real HNC dataset; our proposed -NBC showed better performance than the other traditional classifiers. Therefore, we may conclude that, in presence of outliers, our proposed -NBC outperforms other methods using both simulated and real datasets.

Additional Points
Supplementary Materials. The source code is written in R which is available in the Supplementary Material, available online at https://doi.org/10.1155/2017/3020627.

Conflicts of Interest
The authors declare that they have no conflicts of interest.