Modified Mahalanobis Taguchi System for Imbalance Data Classification

The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model, named the Modified Mahalanobis Taguchi System (MMTS), is formulated based on minimizing the distance between the MTS Receiver Operating Characteristics (ROC) curve and the theoretical optimal point. To validate its classification efficacy, MMTS has been benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi Systems (PTM), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study in the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA).


Introduction
Classification is one of the supervised learning approaches in which a new observation needs to be assigned to one of the predetermined classes or categories. If the number of the predetermined classes is more than two, it is a multiclass classification problem; otherwise, the problem is known as the binary classification problem. At present, these problems have found applications in different domains such as product quality [1] and speech recognition [2].
The classification accuracy depends on both the classifier and the data types. The classifier types can be categorized according to supervised versus unsupervised learning, linear versus nonlinear hyperplane, and feature selection versus feature extraction based approach [3]. On the other hand, Sun et al. [4] reported that the parameters affecting the classification are the overlapping between data (i.e., class separability), small sample size, within-class concept (i.e., a single class may consist of various subclasses, which do not necessarily have the same size), and the data distribution for each class. If the data distribution of one class is different from the distributions of the others, then the data is considered imbalanced. The border that separates balanced from imbalanced data is vague; for example, the imbalance ratio, which is the ratio of major to minor class observations, has been reported to range from 100:1 to 10000:1 [5].
The assumption of an equal number of observations in each class is fundamental to the common classification methods such as decision tree analysis, Support Vector Machines, discriminant analysis, and neural networks [6]. Imbalanced data occurs often in real-life applications such as text classification [7]. Treating applications that have imbalanced data with the common classifiers leads to bias in the classification accuracy (i.e., the predictive accuracy for the minority class will be much lower than for the majority class) and/or to the minority observations being considered noise or outliers, so that the classifier ignores them.
To handle the problem of classifying imbalanced data, the research community uses data approaches, algorithmic approaches, or both. In the data approach, the main idea is to balance the class density, randomly or informatively (i.e., targeted), by eliminating majority class observations (downsampling), replicating minority class observations (oversampling), or doing both. In the algorithmic approach, the main idea is to adapt the classifier algorithms towards the small class. A combination of the data-level and algorithmic-level approaches is also used and is known as cost-sensitive learning.
Computational Intelligence and Neuroscience
The problems reported [4] with the data approach are as follows: deleting significant information for certain instances in the case of downsampling, bringing noise into the original data in the case of oversampling, determining the appropriate sample size for within-class concept data, specifying the ideal class distribution, and using clear criteria for selecting samples.
The problem reported [4] with the algorithmic approach is that it requires a deep understanding of the classifier itself and of the application area (i.e., why a classifier deteriorates when imbalanced data occurs).
Finally, the problem with the cost-sensitive learning approach is that it assumes prior knowledge of the costs of the different error types and imposes a higher cost on the minority class to improve the prediction accuracy. In most cases, the cost matrices are difficult to determine in practice.
While data and algorithmic approaches constitute the majority of the efforts in the area of imbalanced data, several other approaches have also been proposed, which will be reviewed in Literature Review.
To overcome the pitfalls of the data and algorithmic approaches to imbalanced data classification, the classification algorithm needs to be capable of dealing with imbalanced data directly, without resampling, and should have a systematic foundation for determining the cost matrices or the threshold. One of the promising classifiers is the Mahalanobis Taguchi System (MTS). It has shown good classification results for imbalanced data without resampling, it does not require any distribution assumption for the input variables, and it can be used to measure the degree of abnormality (i.e., the degree of abnormality is proportional to the magnitude of the Mahalanobis Distance for the positive observations). Unfortunately, it lacks a systematic foundation for threshold determination [8].
The Receiver Operating Characteristics (ROC) based approach has been reported in the research domain [9] for Support Vector Machines (SVMs) and random forests (RF) as a cost function to trade off the required metrics (i.e., sensitivity versus specificity). Three operating point selection criteria, shortest distance, harmonic mean, and antiharmonic mean, have been compared, and the results in [9] showed that there is no difference among the classifiers' performances. To the best of the authors' knowledge, no previous work has been reported on using an ROC based approach to find the optimum threshold for the Mahalanobis Taguchi System (MTS); therefore, a Modified Mahalanobis Taguchi System (MMTS) methodology is proposed in this paper.
The aim of this work is to enhance the Mahalanobis Taguchi System (MTS) classifier performance by providing a scientific, rigorous, and systematic method using the ROC curve for determining the threshold that discriminates between the classes.
The organization of the paper is as follows: Section 2 reviews the previous work on imbalanced data classification methods, the Mahalanobis Taguchi System, and its applications. In Section 3, the proposed Modified Mahalanobis Taguchi System (MMTS) methodology is described. In Section 4, results are presented for the comparison of the proposed MMTS algorithm with the Probabilistic Mahalanobis Taguchi System (PTM), Naive Bayes (NB), and Support Vector Machine (SVM) on several datasets. Section 5 presents a case study to demonstrate the applicability of the proposed research. Finally, in Section 6, the results obtained from this research are summarized.

Literature Review
In this section, an overview of the imbalance classification approaches, the Mahalanobis Taguchi System concept, its different areas of applications, weakness points, and its variants is presented.
Solutions to deal with the imbalanced learning problem can be summarized into the following approaches [10]: sampling (sometimes called the data level approach), algorithmic, and cost-sensitive approaches.
The data level approach [11] mainly restores a balanced distribution between the classes through resampling techniques. It includes the following types:
(1) Random undersampling/oversampling of the negative/positive observations
(2) Targeted undersampling/oversampling of the negative/positive observations
(3) A mix of the above two approaches
The problems reported in data approaches are as follows:
(i) Determining the best class distribution or imbalance ratio for given observations: Weiss and Provost [12] investigated the relation between the classifier performance and the class distribution; the results showed that a balanced class distribution does not necessarily produce optimal classification performance.
(ii) Undersampling the negative data can lead to losing important information, whereas oversampling the positive data may cause noise interference [13].
(iii) The uncertain criterion for selecting samples for the within-class concept: that is, the class itself consists of several subclasses (i.e., how oversampling and/or undersampling should be performed when a class contains subclasses).
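The random variants of the resampling types listed above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy observations, and the fixed seeds are hypothetical and do not come from the cited works.

```python
import random

def random_undersample(negatives, target_size, seed=0):
    """Keep a random subset (without replacement) of the majority class."""
    rng = random.Random(seed)
    return rng.sample(negatives, target_size)

def random_oversample(positives, target_size, seed=0):
    """Replicate minority observations, drawn with replacement, up to target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(positives) for _ in range(target_size - len(positives))]
    return positives + extra

# Hypothetical toy data: 100 majority (negative) and 5 minority (positive) items.
negatives = list(range(100))
positives = [1000, 1001, 1002, 1003, 1004]

under = random_undersample(negatives, target_size=len(positives))
over = random_oversample(positives, target_size=len(negatives))
print(len(under), len(over))
```

Undersampling discards 95 of the 100 negative items here, while oversampling only repeats the same five positive items, which is the information loss and replication issue raised in items (ii) and (iii).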
Algorithmic level approach solutions are based upon biasing the algorithm towards the positive class. The algorithmic level approach has been used in many popular classifiers such as decision trees, Support Vector Machines (SVMs), association rule mining, back-propagation (BP) neural network, one-sample learning, active learning methods, and the Mahalanobis Taguchi System (MTS).
The adaptation of the decision tree classifier to suit imbalanced data can be accomplished by adjusting the probabilistic estimate of the tree leaf or by developing new trimming approaches [14].
Support Vector Machines (SVMs) showed good classification results for slightly imbalanced data [15], while for highly imbalanced data researchers [16,17] reported poor classification performance, since SVMs try to reduce the total error, which produces results shifted towards the negative (majority) class. To handle imbalanced data, there are proposals such as using penalty constants for different classes, found in Lin et al. [18], or changing the class border based on kernel adjustment, as in Wu and Chang [19]. Therefore, in this paper, SVM was selected as one of the benchmarked algorithms to compare with ours; the results showed that the SVM classification performance degrades considerably at high imbalance ratios, which supports the previous findings of the researchers (more details will be presented in Results).
Association rule mining is a recent classification approach combining association mining and classification into one approach [20][21][22]. To handle imbalanced data, multiple minimum supports must be specified for the different classes to reflect their varied frequencies [23].
On the other hand, one-class learning [24,25] uses the target class only to determine whether a new observation belongs to this class or not. BP neural networks [26] and SVMs [27] have been examined as one-class learning approaches. In the case of highly imbalanced data, one-class learning showed good classification results [28]. Unfortunately, the drawbacks of one-class learning algorithms are that the required training data is relatively larger than that for multiclass approaches and that it is hard to reduce the dimension of the features used for separation.
The active learning approach is used to handle problems related to unlabeled training data. Research on active learning for imbalanced data reported by Ertekin et al. [29] is based on an iterative approach that trains the classifier on the data near the classification boundary instead of the whole training dataset, since the imbalance ratio for the data near the boundary is different from that away from the boundary. Unfortunately, one of the pitfalls of this approach is that it can be computationally expensive [30].
The problem with the algorithmic approach is that it requires extensive knowledge of the specific classifier (i.e., why the algorithm fails to detect the positive cases); understanding the application domain is also critical (i.e., the effect of misclassification on the domain).
Cost-sensitive methods use both data and algorithmic approaches, where the objective is to optimize (i.e., minimize) the total misclassification cost while giving a positive class a higher misclassification cost [31,32].
Cost-sensitive methods use different costs or penalties for different misclassification types. For example, let $C_{\text{pos,neg}}$ be the cost of wrongly classifying a positive instance as a negative one, while $C_{\text{neg,pos}}$ is the cost of the contrary case. In imbalanced data classification, detecting the positive instances is usually more important than detecting the negative ones; hence, the cost of misclassifying a positive instance outweighs the cost of misclassifying a negative one (i.e., $C_{\text{pos,neg}} > C_{\text{neg,pos}}$), with the cost of correct classification equal to zero (i.e., $C_{\text{pos,pos}} = C_{\text{neg,neg}} = 0$). Different types of cost-sensitive approaches have been reported in the literature:
(i) Modifying the weights of the data space: in this approach, the training data density is modified using the misclassification cost criteria, in a way that the density is adjusted towards the costly class.
(ii) Making the classifier objective cost-sensitive: instead of minimizing the misclassification error, the objective is tuned to reduce the misclassification cost [32].
(iii) Using a risk minimization approach: in a binary C4.5 (i.e., decision tree) classifier, the assignment of a class type to a leaf is based on the most frequent class that reaches the leaf, while for the cost-sensitive classifier, the assignment of the class label is based on minimizing the classification cost [33].
The problem with the cost-sensitive approach is that it is based on prior knowledge of the cost matrix for the misclassification types, which in most cases is unavailable.
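The risk minimization idea in item (iii) can be illustrated with a short sketch: a label is chosen by comparing expected misclassification costs instead of class frequencies. The function name, the probability, and the cost values below are hypothetical and serve only to show the effect of an asymmetric cost matrix.

```python
def min_cost_label(p_positive, cost_fn, cost_fp):
    """Return the label with the lower expected misclassification cost.

    p_positive: estimated probability that the observation is positive.
    cost_fn:    cost of labelling a true positive as negative.
    cost_fp:    cost of labelling a true negative as positive.
    Correct decisions are assumed to cost zero.
    """
    expected_cost_if_negative = p_positive * cost_fn
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    if expected_cost_if_positive < expected_cost_if_negative:
        return "positive"
    return "negative"

# Hypothetical leaf in which only 20% of the observations are positive.
label_equal = min_cost_label(0.2, cost_fn=1.0, cost_fp=1.0)
label_weighted = min_cost_label(0.2, cost_fn=10.0, cost_fp=1.0)
print(label_equal, label_weighted)
```

With equal costs the majority label wins; once missing a positive is assumed to be ten times as costly, the same leaf is labelled positive. The result depends entirely on the assumed costs, which is the practical difficulty noted above.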

Mahalanobis Taguchi System (MTS).
MTS is a multivariate supervised learning approach that aims to classify a new observation into one of two classes (i.e., healthy and unhealthy classes). MTS was used previously in predicting weld quality [3], exploring the influence of chemical constitution on hot rolling manufactured products [34], and selecting the significant features in automotive handling [35]. The MTS approach starts with collecting a considerable number of observations from the investigated dataset, followed by separating the unhealthy observations (i.e., positive or abnormal) from the healthy ones (i.e., negative or normal). The Mahalanobis Distance (MD) is calculated first using the negative observations and is then scaled (i.e., the calculated MD is divided by the number of features used), which results in an average MD of around one for the negative observations. The scaled MDs for the positive dataset are supposed to be different from those for the negative dataset. Since many features are used to calculate the MD, and the probability of having significant features in a multivariable dataset is high, a Taguchi orthogonal array is used to screen these features. The criterion for selecting the appropriate features is to select the features that yield high MD values for the positive observations. It is worth noting that MTS constructs a continuous scale from the observations of a single class, unlike other classification techniques, in which learning is done directly from both the positive and the negative observations. This characteristic helps the MTS classifier deal with imbalanced data problems. Determining the optimal threshold is a critical step for an effective MTS classifier. To determine the appropriate threshold, a loss function approach was proposed in [36]; however, it is not a practical approach because of the difficulty in specifying the relative cost [37].
In order to overcome this problem, Su and Hsiao [6] used Chebyshev's theorem to specify the threshold and called their method the "probabilistic thresholding method" (PTM) for the MTS, whereas in the MTS the threshold is assumed to be one. It has been shown in [6] that the PTM classifier outperformed the MTS classifier; therefore, it has been selected to be benchmarked against the proposed classifier. Unfortunately, the PTM method is based on previously assumed parameters, and the accuracy of its classification results was lower than that of the benchmarked classifiers (this is one of the findings of this research, which will be discussed in Results).

Algorithm 1: MMTS algorithm.
Prerequisite: Obtain healthy (negative) and unhealthy (positive) observations. Split the obtained data into two groups: training and validation.
Initialization, let: Training mode = True; Threshold T* = 1; Selected features F* = All features; threshold optimization indicator = False.
(1) IF Training mode == True
(2)   While threshold optimization indicator == False Do
(3)     Calculate MD (i.e., by using the correlation matrix of the negative observations and F*)
(4)     F* ← Selected features (i.e., use the Taguchi approach for feature selection and update F*)
(5)     Recalculate MD (i.e., recalculate the Mahalanobis Distance using the new features F*)
(6)     Classify observations based on the threshold T* and the selected features F*
(7)     IF MD < T*
(8)       Observation is classified as negative
(9)     Else
(10)      Observation is classified as positive
(11)    End
(12)    Calculate the true positive rate (TP rate) and the false positive rate (FP rate)
(13)    Calculate the fitness function
(14)    IF the threshold optimization termination criteria is reached
(15)      threshold optimization indicator = True
(16)      Select the threshold T* and the features F* that result in the minimum fitness function
(17)    Else
(18)      Use the genetic algorithm to find the threshold T* that will minimize the fitness function
(19)    End
(20)  End While
(21)  Training mode = False, the optimum threshold = T*, and the optimum features = F*
(22) Else
(23)  Using the threshold T* and the features F*, calculate the Mahalanobis Distance MD
(24)  IF MD < T*
(25)    Observation is classified as negative
(26)  Else
(27)    Observation is classified as positive
The other research area in the MTS is related to the modification of the Taguchi method rather than to the threshold determination. Due to the lack of a statistical foundation [37] for the Taguchi method, the Mahalanobis Genetic Algorithm (MGA) [3] and the Mahalanobis Taguchi System using Particle Swarm Optimization (PSO) [38] have been used. Both methods deal with the Taguchi system (orthogonal array) part, while the threshold determination still lacks a solid foundation or is hard to carry out in practice.
Finally, the aim of this research is to enhance the Mahalanobis Taguchi System (MTS) classifier performance by providing a scientific, rigorous, and systematic method of determining the binary classification threshold that discriminates between the two classes, which can be applied to the MTS and its variants (e.g., MGA).

Modified Mahalanobis Taguchi System (MMTS)
The proposed model, Algorithm 1, provides an easy, reliable, and systematic way to determine the threshold for the Mahalanobis Taguchi System (MTS) and its variants (e.g., the Mahalanobis Genetic Algorithm, MGA) in order to carry out the classification process effectively. The currently used approaches either are difficult to use in practice, such as the loss function [36], due to the difficulty in evaluating the cost in each case, or are based on previously assumed parameters [6].
The proposed model is based on using the Receiver Operating Characteristics (ROC) curve [39] for the MTS threshold determination. As shown in Figure 1, point (FP rate = 0, TP rate = 1) represents the optimum theoretical solution (best performance) for any classifier. The closer a classifier's performance is to this point, the better it is. The curve drawn in the figure represents the MTS classifier performance for different threshold values. Changing the threshold changes the location of the operating point on the curve (see the labeled points in Figure 1). Therefore, the problem of finding the optimum threshold can be reformulated as the problem of finding the point on the curve that is closest to point (FP rate = 0, TP rate = 1).
MMTS can be summarized in the following steps.
Step 1 (construction of the initial model stage). Assume there are two classes: negative (the one with the majority of the observations) and positive (the one with the minority of the observations). A set of data is sampled from both classes. Using the negative observations only, reference Mahalanobis Distances are calculated using (1) with all features. The Mahalanobis Distances (MDs) for the positive observations are also calculated by using the same equation with all features and with the inverse of the correlation matrix of the negative observations. New features are then selected by using the orthogonal array approach, and the MDs for the negative and the positive observations are recalculated. An arbitrary threshold is assumed (i.e., one), and accordingly the true positive rate, the false positive rate, and the fitness function can be estimated.
Step 2 (optimization stage). If the stopping criteria (i.e., the fitness function value is zero, the maximum number of iterations is reached, and/or the difference between successive fitness values is less than a certain value) are not yet met, an optimization method (i.e., a genetic algorithm) is invoked to obtain a better threshold value that minimizes the fitness function. Accordingly, new features are selected using the orthogonal array approach, and the true positive rate, the false positive rate, and the fitness function are updated.
If the stopping criteria are met, then the training stage is done, and the model is ready for testing observations.
Step 3 (testing stage). In this stage, the optimum threshold and the associated features determined in the previous stage are used, and the Mahalanobis Distance for the new observation is calculated based on those parameters. If the Mahalanobis Distance for this observation is less than the optimum threshold, then it is classified as negative; otherwise, it is classified as positive. After this overview of how the MMTS algorithm works, the detailed calculation of the Mahalanobis Distance, the true positive and false positive rates, and the fitness function is presented in the following subsections.

Mahalanobis Distance (MD).
In order to demonstrate the MTS threshold determination mathematically, let us assume that negative data (also called healthy or normal observations) and positive data (also called unhealthy or abnormal observations) are available, where the number of positive observations is $m$, the number of negative observations is $n$, and both positive and negative observations consist of $k$ variables (or features).
Given a sample of size $n$, the Mahalanobis Distance (MD) for the $i$th observation can be calculated by
$$MD_i = \frac{1}{k} Z_i^{T} C^{-1} Z_i, \tag{1}$$
where $i = 1, \dots, n$, $j = 1, \dots, k$, $k$ is the total number of features (or variables), $Z_i = (z_{i1}, \dots, z_{ik})^{T}$ is the normalized vector obtained by normalizing the values of $x_{ij}$, that is, $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$, where $\bar{x}_j$ and $s_j$ are the average and the sample standard deviation of variable $j$, respectively, $Z_i^{T}$ is the transpose of $Z_i$, and $C^{-1}$ is the inverse of the correlation matrix of the negative variables.
Using (1) with $C^{-1}$, $\bar{x}_j$, and $s_j$ (the inverse of the correlation matrix, the mean, and the sample standard deviation of feature $j$ for the negative data, respectively), the MD of the positive observations can be calculated. The next step is to determine the threshold $T$ that will be used to discriminate the negative observations from the positive ones based on the MD magnitude, which means that a new observation can be classified as either positive or negative according to the following criterion: if $MD < T$, the observation is negative; otherwise, it is positive.
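The scaled Mahalanobis Distance described above can be sketched in plain Python as follows. This is a minimal illustration, not the MATLAB implementation used in the paper: the function names `fit_reference`, `solve`, and `scaled_md` and the toy observations are hypothetical, and the sketch solves a linear system instead of forming the inverse of the correlation matrix explicitly.

```python
from statistics import mean, stdev

def fit_reference(negatives):
    """Per-feature mean, sample std, and correlation matrix of the negative data."""
    n, k = len(negatives), len(negatives[0])
    means = [mean([row[j] for row in negatives]) for j in range(k)]
    stds = [stdev([row[j] for row in negatives]) for j in range(k)]
    z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in negatives]
    corr = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(k)]
            for a in range(k)]
    return means, stds, corr

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def scaled_md(obs, means, stds, corr):
    """Mahalanobis Distance divided by the number of features."""
    k = len(obs)
    z = [(obs[j] - means[j]) / stds[j] for j in range(k)]
    y = solve(corr, z)  # y equals (inverse correlation matrix) times z
    return sum(zi * yi for zi, yi in zip(z, y)) / k

# Hypothetical negative (healthy) observations with two features.
negatives = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 5.5), (6.0, 5.0)]
means, stds, corr = fit_reference(negatives)
md_negatives = [scaled_md(obs, means, stds, corr) for obs in negatives]
md_positive = scaled_md((1.0, 6.0), means, stds, corr)  # hypothetical abnormal point
print(sum(md_negatives) / len(md_negatives), md_positive)
```

On the reference data the average scaled MD comes out close to one, as expected for the negative observations, while the hypothetical abnormal point, which breaks the correlation pattern of the reference data, receives a much larger distance.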
The contribution of this paper lies mainly in establishing a reliable and systematic threshold for classification. A rough method for determining the threshold is to plot the MDs of the positive and negative observations versus their order and to decide upon the threshold manually. This method is not accurate, especially when the MD values overlap.

Proposed Threshold Determination.
The essential classifier performance can be explained by examining the confusion matrix (Table 1). The ratio of negative to positive observations (left to right columns in Table 1) represents the class distribution (i.e., the imbalance ratio). In that sense, any performance metric that uses both columns, such as the accuracy and the error rate ((14) and (15), respectively), will be sensitive to the imbalanced data issue. To overcome this problem, the Receiver Operating Characteristic (ROC) curves are recommended by the research community.
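A small worked example shows why accuracy is sensitive to the class distribution. It uses the standard definition of accuracy, (TP + TN) divided by the total number of observations; the counts below are hypothetical and describe a classifier that labels every observation as negative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all observations that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def true_positive_rate(tp, fn):
    """Fraction of the positive observations that are detected."""
    return tp / (tp + fn)

# Hypothetical dataset: 2000 negatives and 5 positives (imbalance ratio 400),
# with a classifier that predicts "negative" for everything.
tp, fn = 0, 5
tn, fp = 2000, 0

acc = accuracy(tp, tn, fp, fn)
tpr = true_positive_rate(tp, fn)
print(round(acc, 4), tpr)
```

The accuracy is above 99% even though no positive observation is detected, because it mixes both columns of the confusion matrix; the true positive rate, which uses the positive column only, exposes the failure.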
From the confusion matrix, Table 1, the following can be defined for a threshold $T$:
(i) TN($T$) is the total number of observations classified as negative from the pool of the negative observations (i.e., the negative observations whose MD < $T$).
(ii) FN($T$) is the total number of observations classified as negative from the pool of the positive observations (i.e., the positive observations whose MD < $T$).
(iii) FP($T$) is the total number of observations classified as positive from the pool of the negative observations (i.e., the negative observations whose MD ≥ $T$).
(iv) TP($T$) is the total number of observations classified as positive from the pool of the positive observations (i.e., the positive observations whose MD ≥ $T$).
Now, the true positive rate and the false positive rate at the threshold $T$ can be defined as
$$\mathrm{TP}(T)_{\mathrm{rate}} = \frac{\mathrm{TP}(T)}{\mathrm{TP}(T) + \mathrm{FN}(T)}, \tag{2}$$
$$\mathrm{FP}(T)_{\mathrm{rate}} = \frac{\mathrm{FP}(T)}{\mathrm{FP}(T) + \mathrm{TN}(T)}. \tag{3}$$
Using TP($T$) rate and FP($T$) rate for different values of the threshold $T$, the ROC for the MMTS can be constructed.
The ROC plot is an $xy$-plot in which TP($T$) rate (2) is plotted on the vertical axis and FP($T$) rate (3) is plotted on the horizontal axis.
Since TP($T$) rate uses the right column in the confusion matrix and FP($T$) rate uses the left column in the confusion matrix, they are unaffected by the imbalanced data problem. The ROC is beneficial because it provides a tool to show the advantages (represented by true positives) versus the disadvantages (represented by false positives) of the classifier relating to data density. Figure 1 represents the MTS classifier ROC curve, created by changing the MTS threshold (i.e., each point on the curve represents a different threshold for the MTS classifier). The point on the curve (i.e., threshold) that is closest to point (0, 1) is considered the optimum threshold among the candidates. Mathematically, this can be converted into the following optimization model.
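The construction of ROC points from the Mahalanobis Distances can be sketched as follows, using the rule that an observation is negative when its MD is below the threshold and positive otherwise. The function name `rates` and the MD values are hypothetical.

```python
def rates(md_negative, md_positive, threshold):
    """Return (FP rate, TP rate) for the rule: MD < threshold means negative."""
    tp = sum(1 for md in md_positive if md >= threshold)
    fn = len(md_positive) - tp
    fp = sum(1 for md in md_negative if md >= threshold)
    tn = len(md_negative) - fp
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical scaled MDs for six negative and four positive observations.
md_negative = [0.4, 0.7, 0.9, 1.1, 1.3, 2.5]
md_positive = [1.2, 3.0, 4.5, 6.0]

thresholds = (0.5, 1.0, 2.0, 5.0)
roc = [rates(md_negative, md_positive, t) for t in thresholds]
for t, (fp_rate, tp_rate) in zip(thresholds, roc):
    print(t, round(fp_rate, 3), round(tp_rate, 3))
```

Raising the threshold moves the operating point towards the lower left of the ROC plot: both rates decrease, and each rate is computed within a single class, so the class proportions do not enter.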

Nonlinear Optimization Model.
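The following optimization model is used to determine the optimum threshold $T$ that discriminates between the negative and the positive observations, by minimizing the Euclidean distance between the MMTS ROC curve and the theoretical optimum point (i.e., TP rate = 1, FP rate = 0):
$$d(T) = \sqrt{\left(\mathrm{FP}(T)_{\mathrm{rate}} - \mathrm{FP}^{\mathrm{opt}}_{\mathrm{rate}}\right)^{2} + \left(\mathrm{TP}(T)_{\mathrm{rate}} - \mathrm{TP}^{\mathrm{opt}}_{\mathrm{rate}}\right)^{2}}, \tag{4}$$
where $d(T)$ is the Euclidean distance between the optimum point and the point on the ROC curve that corresponds to the threshold $T$, $\mathrm{FP}^{\mathrm{opt}}_{\mathrm{rate}}$ is the false positive rate at the optimum point, which is equal to zero, $\mathrm{TP}^{\mathrm{opt}}_{\mathrm{rate}}$ is the true positive rate at the optimum point, which is equal to one, $\mathrm{FP}(T)_{\mathrm{rate}}$ is the false positive rate at the threshold $T$, and $\mathrm{TP}(T)_{\mathrm{rate}}$ is the true positive rate at the threshold $T$.
Accordingly, the optimization model becomes
$$\min_{T} \; \sqrt{\left(\mathrm{FP}(T)_{\mathrm{rate}} - \mathrm{FP}^{\mathrm{opt}}_{\mathrm{rate}}\right)^{2} + \left(\mathrm{TP}(T)_{\mathrm{rate}} - \mathrm{TP}^{\mathrm{opt}}_{\mathrm{rate}}\right)^{2}} \tag{5}$$
subject to
$$\mathrm{TP}^{\mathrm{opt}}_{\mathrm{rate}} = 1, \tag{6}$$
$$\mathrm{FP}^{\mathrm{opt}}_{\mathrm{rate}} = 0, \tag{7}$$
$$0 \le \mathrm{TP}(T)_{\mathrm{rate}} \le 1, \tag{8}$$
$$0 \le \mathrm{FP}(T)_{\mathrm{rate}} \le 1. \tag{9}$$
The optimization model is nonlinear: the objective function is the Euclidean distance between points on the MMTS ROC curve and the optimum point (i.e., TP rate = 1, FP rate = 0). The first two constraints ((6) and (7)) are the theoretical optimum values of the true and false positive rates, while the last two constraints (inequalities (8) and (9)) are the lower and upper bounds of the true positive rate and the false positive rate.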
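The objective above can be illustrated with a short sketch that picks the threshold whose ROC point lies closest to (FP rate = 0, TP rate = 1). Note the substitution: the paper searches for the threshold with a genetic algorithm, whereas this sketch simply scans the observed MD values as candidate thresholds, which is enough to show the fitness function at work. All names and MD values are hypothetical.

```python
import math

def rates(md_negative, md_positive, threshold):
    """Return (FP rate, TP rate) for the rule: MD < threshold means negative."""
    tp = sum(1 for md in md_positive if md >= threshold)
    fp = sum(1 for md in md_negative if md >= threshold)
    return fp / len(md_negative), tp / len(md_positive)

def fitness(md_negative, md_positive, threshold):
    """Euclidean distance from the ROC point at this threshold to (0, 1)."""
    fp_rate, tp_rate = rates(md_negative, md_positive, threshold)
    return math.hypot(fp_rate - 0.0, tp_rate - 1.0)

def best_threshold(md_negative, md_positive):
    """Exhaustive scan over observed MDs (a stand-in for the genetic algorithm)."""
    candidates = sorted(set(md_negative) | set(md_positive))
    return min(candidates, key=lambda t: fitness(md_negative, md_positive, t))

# Hypothetical scaled MDs for six negative and four positive observations.
md_negative = [0.4, 0.7, 0.9, 1.1, 1.3, 2.5]
md_positive = [1.2, 3.0, 4.5, 6.0]

t_star = best_threshold(md_negative, md_positive)
d_star = fitness(md_negative, md_positive, t_star)
d_default = fitness(md_negative, md_positive, 1.0)  # conventional MTS threshold of one
print(t_star, d_star, d_default)
```

For this toy data the default threshold of one yields a distance of 0.5 to the optimum point, while the scanned threshold halves it, which is the kind of improvement the optimization model aims for.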

Taguchi System.
Since more features mean a higher monitoring cost and more processing time, it is important to exclude unnecessary features in order to have an efficient classifier. The MTS approach uses orthogonal array (OA) experiments to screen the important features. Each factor in the orthogonal array design can be evaluated independently of all other factors since the design is balanced (i.e., the factor levels are weighted equally) (readers are referred to Woodall et al. [37] for further information about OAs). The metric of the Taguchi orthogonal array is the signal-to-noise (S/N) ratio, which in our case uses "the larger the better" criterion and is calculated for each treatment, that is, for each run (row) of the orthogonal design, whose number depends on the total number of features. Based on the S/N ratios, the mean gain of each feature is calculated as the average S/N ratio of the runs in which the feature is included minus the average S/N ratio of the runs in which it is excluded. A feature is included if it has a positive gain; otherwise, it is excluded.
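The feature screening step can be sketched as follows. Several assumptions are made and should be checked against the original equations: the larger-the-better S/N ratio is taken here as minus ten times the base-10 logarithm of the mean of 1/MD over the positive observations (some texts write the same quantity with a squared distance), level 1 in the design means the feature is used and level 2 means it is left out, and the MD values per run are hypothetical rather than computed from data.

```python
import math

def sn_larger_the_better(mds):
    """Larger-the-better S/N ratio (dB) of the positive-class MDs of one run."""
    return -10.0 * math.log10(sum(1.0 / md for md in mds) / len(mds))

def feature_gains(design, sn_ratios):
    """Gain per feature: mean S/N of runs using it minus mean S/N of runs omitting it."""
    gains = []
    for j in range(len(design[0])):
        used = [sn for row, sn in zip(design, sn_ratios) if row[j] == 1]
        unused = [sn for row, sn in zip(design, sn_ratios) if row[j] == 2]
        gains.append(sum(used) / len(used) - sum(unused) / len(unused))
    return gains

# Standard L4 two-level orthogonal array for three features.
design = [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]

# Hypothetical MDs of two positive observations, recomputed for each run.
md_per_run = [[8.0, 10.0], [9.0, 12.0], [2.0, 2.5], [1.5, 2.0]]

sn_ratios = [sn_larger_the_better(mds) for mds in md_per_run]
gains = feature_gains(design, sn_ratios)
selected = [j for j, gain in enumerate(gains) if gain > 0]
print([round(g, 2) for g in gains], selected)
```

In this toy case the first feature has a large positive gain, the second a small positive gain, and the third a negative gain, so only the first two features would be kept.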

Results
In this section, a description of the datasets used in this study, a brief overview of the benchmarked classifiers, an overview of the metrics used for imbalanced data classifiers, and the classifiers' performance results for the different datasets are presented.

Dataset.
The imbalance ratio (IR) threshold, that is, the ratio of negative to positive observations that separates balanced from imbalanced datasets in binary or multiclass problems, is still an open question for the research community. In this paper, we investigated a wide range of IR, from 1.25 up to 2088, considering a dataset to be imbalanced if its IR is equal to or higher than 1.25. Table 2 contains a description of the properties of the selected datasets. All the datasets (except for the welding dataset) were obtained from the UCI machine learning repository [41]. Because this study explores the effect of the imbalance ratio on the classification results, the datasets were selected to cover a wide range of IR. However, the imbalance ratio is not the only cause of degradation in classifier performance. The maximum Fisher's Discriminant Ratio is also considered a major factor in classifier performance degradation. A low value of this ratio means that the observations are mixed together and the overlapped regions are large, and therefore it is difficult to discriminate between these observations. Estimates of the different metrics were obtained by means of 10 repetitions; for each repetition, the data was randomly partitioned, with 35% used as the training set and the remainder as the testing set. MMTS and the benchmarked algorithms were evaluated on each of the ten repetitions simultaneously.
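The imbalance ratio and the repeated 35% training partition can be sketched as below. The labels are hypothetical, and a plain random partition is assumed; whether the original partitions were stratified by class is not stated in the text.

```python
import random

def imbalance_ratio(labels):
    """Ratio of negative (0) to positive (1) observations."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return negatives / positives

def random_split(n_obs, train_fraction=0.35, seed=0):
    """Randomly partition observation indices into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    cut = round(train_fraction * n_obs)
    return indices[:cut], indices[cut:]

# Hypothetical dataset at the lower end of the investigated range (IR = 1.25).
labels = [0] * 125 + [1] * 100
ir = imbalance_ratio(labels)

# Ten repetitions, each with its own random partition.
splits = [random_split(len(labels), seed=rep) for rep in range(10)]
train, test = splits[0]
print(ir, len(train), len(test))
```

Every classifier would then be trained and evaluated on the same ten partitions, so that the averaged metrics are comparable across methods.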

Benchmarked Classifiers Used in the Study.
In this section, an overview of the benchmarked classifiers, with their parameters, and the machine specifications used for analysis will be presented.

Support Vector Machines (SVMs).
The first work regarding SVMs was published by Cortes and Vapnik [42], followed by significant contributions from other researchers [43]. SVMs have shown good classification performance for rare and noisy data, which makes them favorable in a number of applications, from cancer detection [44] to text classification [45]. The idea of the SVM classifier is based on establishing the most appropriate hyperplane that separates the class observations from each other (Figure 2). The most appropriate hyperplane is the one with the largest margin width parallel to the hyperplane with no interior points.
[Figure 2. Labels: support vector, slabs, separating hyperplane, margin.]
More details about SVMs methodology can be found in [46].

Mahalanobis Taguchi System (MTS) Based on Probabilistic Thresholding Method (PTM).
In the PTM method, Chebyshev's theorem is employed to determine the threshold (12) that separates the normal observations from the abnormal ones; see [6]. The threshold is computed from the mean and the standard deviation of the MDs of the negative data, a small constant, and the portion of the negative observations whose MDs are less than the lowest MD of the positive observations.

Naive Bayesian Classifier.
Bayes' theorem is at the core of the Naive Bayesian classifier (NB), in which class conditional independence is assumed. This assumption means that the influences of the features on a given class are independent of each other. Mathematically,
$$P(\mathbf{X} \mid C) = \prod_{i=1}^{n} P(x_i \mid C), \qquad (13)$$
where $\mathbf{X} = (x_1, x_2, \ldots, x_n)$ is a variable vector of size $n$ and $C$ is the class. Even with such an unrealistic assumption, Naive Bayes has achieved noticeable success compared with more sophisticated classifiers, for example, in text classification [47], medical diagnosis [48], and systems performance management [49].
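The class conditional independence assumption can be illustrated with a short Python sketch in which the class score is the log prior plus a sum of per-feature log likelihoods. This is only an illustration: the function names are hypothetical, and each feature is modeled with a Gaussian density for simplicity, which is an assumption of this sketch and not the kernel distribution used in the experiments.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels):
    """Estimate the class priors and per-feature (mean, variance) pairs."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        columns = list(zip(*members))
        # A small variance floor avoids division by zero for constant features.
        params = [(mean(col), max(pvariance(col), 1e-9)) for col in columns]
        model[c] = (len(members) / len(rows), params)
    return model


def log_scores(model, x):
    """Unnormalized log posterior: log P(C) + sum_i log P(x_i | C)."""
    scores = {}
    for c, (prior, params) in model.items():
        s = math.log(prior)
        for xi, (mu, var) in zip(x, params):
            s += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        scores[c] = s
    return scores


def predict(model, x):
    """Return the class with the largest unnormalized log posterior."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)


rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict(model, [1.1, 2.1]), predict(model, [5.1, 5.9]))
```

Working with sums of logarithms instead of products of probabilities is a common design choice that avoids numerical underflow when the number of features is large.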

Experimental Settings.
The parameter settings for the examined classifiers were selected from the suggestions of the corresponding authors as follows:
(i) MMTS: the MMTS does not need any tuning parameters, which is one of the important benefits of using MMTS over the traditional MTS.
(ii) PTM: the small-value parameter is set to 0.05, based on the recommendation in [6].
(iii) SVM: a linear kernel function was used to map observations from the data space to the kernel space.
(iv) NB: a kernel distribution was selected to fit the class conditional feature distributions.
It is worth mentioning that no parameter tuning was performed for any of the examined classifiers; consequently, baseline comparisons among the classifiers with their default settings were established, which leads to the selection of the most robust classifier [50].
Finally, MATLAB R2013a was used for the data analysis on an HP machine with an Intel(R) Core(TM) i7 CPU at 2.2 GHz and 4.00 GB of RAM. For the genetic algorithm, the following parameters were used in the implementation: a population size of 20 chromosomes, with the number of bits corresponding to the number of features; a crossover fraction of 0.8; a mutation rate of 0.01; a limit of 100 generations; and, as the stopping criterion, a cumulative change in the fitness function value of less than $10^{-6}$ over 50 iterations.

Metrics.
Several metrics such as accuracy (14), error (15), specificity (16), precision (17), sensitivity or recall (18), G-means (19), and F-measure (20) are used by the research community as comprehensive assessments of classifier performance. The most important of these metrics are the sensitivity and the specificity. Sensitivity (sometimes called recall) can be seen as the accuracy on the positive observations: that is, the proportion of positive observations that were classified correctly. Specificity, on the other hand, can be understood as the accuracy on the negative observations: that is, the proportion of negative observations that were classified correctly.
Unfortunately, the examination of the accuracy and error rates ((14) and (15)) reveals that these metrics are not sensitive to the data distribution [10]. For example, suppose that a given dataset consists of ninety percent negative observations and ten percent positive ones. If the classifier ignores the positive observations and classifies all instances as negative, it has ninety percent accuracy (i.e., an error rate of 10 percent), which looks like a good classification performance for the entire dataset, yet it fails to detect any positive instance, as if the positive class did not exist. In this context, it can be seen that the accuracy and error rate metrics are biased towards one class at the expense of the other.
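$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad (14)$$
$$\text{Error} = 1 - \text{Accuracy}, \qquad (15)$$
$$\text{Specificity} = \frac{TN}{TN + FP}, \qquad (16)$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad (17)$$
$$\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN}, \qquad (18)$$
$$G\text{-means} = \sqrt{\text{Specificity} \times \text{Sensitivity}}, \qquad (19)$$
$$F\text{-measure} = \frac{(1 + \beta^{2}) \times \text{Precision} \times \text{Recall}}{\beta^{2} \times \text{Precision} + \text{Recall}}, \qquad (20)$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positive, true negative, false positive, and false negative observations, respectively.
In order to overcome the above problem, several metrics such as G-means [51] (19), the area under the Receiver Operating Characteristic curve (AUC-ROC) [52], and F-measure [19] (20) are used to assess the performance of imbalance data classifiers.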
The most commonly used metrics for evaluating imbalance data classification performance are G-means and F-measure. The latter uses a weighted importance of the recall and the precision (controlled by $\beta$, whose default value is 1), which results in a better assessment than the accuracy metric but is still biased towards one class [10]. Therefore, G-means will be used as the main metric for the analysis.
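As an illustration of how these metrics behave, the following minimal Python sketch computes them from confusion-matrix counts using their standard definitions, which are assumed here to coincide with (14)-(20). The function name and its arguments are hypothetical and are not part of the original study.

```python
import math


def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Standard confusion-matrix metrics; zero denominators yield 0.0."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_means = math.sqrt(specificity * sensitivity)
    denom = beta ** 2 * precision + sensitivity
    f_measure = (1 + beta ** 2) * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "specificity": specificity,
        "precision": precision,
        "sensitivity": sensitivity,
        "g_means": g_means,
        "f_measure": f_measure,
    }


# 90 negatives and 10 positives, with every observation predicted as negative.
all_negative = classification_metrics(tp=0, fp=0, tn=90, fn=10)
print(all_negative["accuracy"], all_negative["g_means"])
```

For the 90%/10% example discussed above, a classifier that labels every observation as negative obtains an accuracy of 0.9 but a G-means of 0, which shows why G-means is the more informative metric when the positive class is rare.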

Classification Results.
In this section, the classification results of MMTS are compared with those of the other four investigated classification algorithms: Support Vector Machines (SVMs), Probabilistic Mahalanobis Taguchi System (PTM), Naive Bayes (NB), and Mahalanobis Taguchi System (MTS) (based on the previously assumed threshold equal to one). In order to investigate the robustness of the studied classifiers with respect to class imbalance, fourteen different UCI [41] datasets and one dataset (welding) from El-Banna et al. [40] were used. Table 3 summarizes the median G-means values of the five classifiers for the investigated data, with the upper and lower limits of the 95% confidence interval based on the nonparametric Wilcoxon Signed Rank Test. In order to discriminate among the classifiers' performances, a nonparametric pairwise Wilcoxon test was performed to test the null hypothesis that two classifiers have equal medians against the alternative hypothesis that the first classifier's median is larger than the second one's; the results of these comparisons are summarized in the ranking score of each classifier for each dataset. Based on this table, one can observe the following:
(i) The MMTS classifier has a higher classification performance than MTS across all fourteen investigated datasets.
(ii) The MMTS has a superior classification performance compared with the other benchmarked classifiers when the imbalance ratio (IR) is high (i.e., IR ≥ 463).
(iii) The MMTS and SVM have equal classification performance when the imbalance ratio (IR) is medium (i.e., 189 ≤ IR ≤ 417).
(iv) The SVM has a superior classification performance compared with the other benchmarked classifiers when the imbalance ratio (IR) is low (i.e., 1 ≤ IR ≤ 189).
(v) The MMTS has the most robust classification performance over the investigated IR range (i.e., the MMTS ranks first eight times and second six times).
(vi) The NB has the lowest classification performance compared with the other benchmarked classifiers over the investigated IR range.
(vii) The effect of the F1-ratio is dominated by the imbalance ratio (IR) effect (i.e., the IR is more important than the F1-ratio).

MMTS versus Modified SVMs and NB Classifiers.
Many published works [16,19,53,54] pointed out that the classification performance of SVMs drops significantly when dealing with imbalance data; therefore, modified SVM classifiers have been suggested to overcome this issue at both the data and the algorithmic levels. At the data level, the Synthetic Minority Oversampling Technique (SMOTE) [11] has been applied successfully to handle the imbalance data issue, while at the algorithmic level, Adaptive Conformal Transformation (ACT) [54] and Kernel Boundary Alignment (KBA) [19] are among the most popular modified SVM classifiers for imbalance data handling. Therefore, in order to assess the MMTS classification performance against imbalance data classifiers, the UCI datasets and the corresponding classification results of SVMs, SMOTE, ACT, and KBA reported in [19] were used, and the MMTS classifier was run under the same experimental settings so that its results could be compared with those of the benchmarked classifiers.
Using the classification results obtained from [19] and the tests performed with the MMTS classifier, the G-means performance metrics are reported in the form of 95% confidence intervals in Table 4. It can be seen that the G-means of the MMTS classifier is higher than those of the benchmarked classifiers at a relatively high imbalance ratio (i.e., for the Abalone dataset), while for the yeast dataset, the MMTS G-means was lower than those of KBA and ACT but higher than those of SVM and SMOTE. Finally, MMTS had the lowest performance among the classifiers for the car dataset.
Using the same datasets as in [19], the MMTS classification performance is also compared with that of modified NB algorithms, namely, Tree Augmented Naive Bayes (TAN), Hidden Naive Bayes (HNB), Averaged One-Dependence Estimators (AODE), and Weighted Average of One-Dependence Estimators (WAODE). Table 5 shows that the MMTS G-means results for the examined datasets have the highest values compared with the others.

Case Study
The case presented here is from the manufacturing sector, in the area of resistance spot welding. Due to its low cost and simplicity, resistance spot welding is the dominant joining process in the auto industry. The reasons for choosing spot welding over other joining processes can be summarized as follows: it is an inexpensive and fast process, it is applicable to joining different types of materials (coated steel, low carbon steel, aluminum, etc.) with varying thicknesses, and it is relatively robust to the different noise factors existing in the plant, such as fit-up variations. Despite the above-mentioned advantages, weld quality cannot be estimated with high certainty due to factors such as tip wear, sheet metal debris, and variation in the power supply; therefore, it is common practice in the auto industry to add extra welds to increase confidence in the structural integrity of the welded assembly [40].
Recently, worldwide competition has pushed automotive OEMs to improve their productivity, reduce non-value-added activities, and reduce cost. Therefore, the auto industry is extremely concerned with eliminating these redundant welds. To achieve the objective of using the optimum number of welds that sustains the required strength of the structure, weld quality must be achieved.
To achieve an acceptable weld quality, nondestructive weld assessment should be performed. This assessment can be translated into the problem of classifying the dynamic resistance profile (input signal) for those welds into normal or abnormal welds.
The welding data used for this case, summarized in Table 6, were obtained under conditions similar to those used in El-Banna et al. [40]. The experimental setup, the materials used, and all the other related information can be found in the same reference. The data consisted of 3,294 welds, of which 3,288 were normal welds and the others were expulsion welds, performed with an alternating current (AC) constant current controller. Each weld has 28 features, which represent the dynamic resistance values in the 28 half cycles of the welding time. The welds were performed by an AC welding machine with a capacity of 180 kVA and 680 lb of welding force provided by a pneumatic gun. An HWPAL25 truncated electrode type with a 6.4 mm face diameter was used, with a welding time of 14 cycles and 11.3 kA as the initial input secondary current. Tip dressing was performed 10 times (approximately every 300 welds) in order to return the electrode tip to its original diameter by removing the excess material. The constant current control applied a current stepper of one ampere per weld to compensate for the increase in the electrode diameter, which is known as the mushrooming effect.
Implementation.
The first step after obtaining the dataset was to split it into training and testing groups. In this case, the training data consisted of 1,153 observations (i.e., a training ratio of 35%), of which two were expulsion welds (i.e., positive observations) and the others were normal welds (i.e., negative observations).
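The split described above can be sketched in Python as follows. The sketch assumes a stratified random partition, in which 35% of each class is drawn for training; the paper does not state whether stratification was applied, so this is only one plausible reading. The function name, the seed, and the 0/1 label coding are hypothetical.

```python
import random


def stratified_split(labels, train_fraction=0.35, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(train_fraction * len(idx))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return train_idx, test_idx


# Class counts taken from the welding data: 3,288 normal and 6 expulsion welds.
labels = [0] * 3288 + [1] * 6
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), sum(labels[i] for i in train_idx), len(test_idx))
```

Under this assumption, the training set contains 1,151 normal welds and 2 expulsion welds, that is, 1,153 observations in total, which agrees with the counts reported above. Repeating the call with ten different seeds would give ten random partitions of the kind used for the repetitions.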
The MMTS and the other benchmarked algorithms, in addition to the Mahalanobis Genetic Algorithm (MGA) [3], were run over the welding data. Table 7 shows the results for the 10 repetitions in terms of the following metrics: specificity, sensitivity, precision, G-means, and F-measure. In addition, the suggested threshold is reported for the MMTS and PTM algorithms. As mentioned before, G-means is used as the main metric, but the results for the other metrics are reported here for future researchers to use.
In order to determine whether there is a significant difference among the classifiers' performances (i.e., the G-means values in Table 7), the nonparametric Kruskal-Wallis test is used. The p value obtained from performing this test on the welding data is 0.000, which reveals that there is at least one classifier whose performance is significantly different from the others. In order to rank the classifiers, the pairwise Mann-Whitney test is used. Table 8 shows the p values obtained from comparing the performances of each pair of classifiers using the Mann-Whitney test, together with the resulting classifier ranks. It can be seen clearly that the MMTS outperforms the other classifiers. This result is also emphasized by the ROC curves and the area under the curve (AUC) values for the examined classifiers (Figure 3).

Conclusions
The Mahalanobis Taguchi System (MTS) is one of the most promising binary classification approaches to handling the imbalance data problem. Unfortunately, the MTS suffers from the lack of a systematic rigorous method for determining the threshold to discriminate between the two classes. In this paper, a nonlinear optimization model with the objective of minimizing the Euclidean distance between MTS classifier ROC curve and the theoretical optimal point (i.e., TP rate = 100% and FP rate = 0%) is used to determine this threshold.
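The idea of choosing the threshold whose ROC point lies closest to the theoretical optimal point can be sketched as follows. This is not the nonlinear optimization model of the paper: it replaces the optimizer with an exhaustive search over the observed Mahalanobis distances, and it assumes that an observation is classified as positive when its MD exceeds the threshold. The function names and the toy MD values are hypothetical.

```python
import math


def roc_point(threshold, md_neg, md_pos):
    """(FP rate, TP rate) when MD > threshold is classified as positive."""
    fpr = sum(1 for d in md_neg if d > threshold) / len(md_neg)
    tpr = sum(1 for d in md_pos if d > threshold) / len(md_pos)
    return fpr, tpr


def closest_to_ideal_threshold(md_neg, md_pos):
    """Search the observed MDs for the threshold whose ROC point is
    nearest, in Euclidean distance, to (FP rate, TP rate) = (0, 1)."""
    candidates = sorted(set(md_neg) | set(md_pos))
    best_t, best_dist = None, float("inf")
    for t in candidates:
        fpr, tpr = roc_point(t, md_neg, md_pos)
        dist = math.hypot(fpr, 1.0 - tpr)
        if dist < best_dist:
            best_t, best_dist = t, dist
    return best_t, best_dist


# Toy Mahalanobis distances with a small overlap between the classes.
md_neg = [0.5, 0.8, 1.0, 1.2, 1.5]
md_pos = [1.4, 3.0, 4.2]
print(closest_to_ideal_threshold(md_neg, md_pos))
```

In this toy example the selected threshold is 1.2, at which all positive observations are detected and one of the five negative observations is misclassified, giving a distance of 0.2 from the optimal point.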
In order to assess the suggested algorithm, the MMTS has been benchmarked with several popular algorithms: Mahalanobis Taguchi System (MTS), Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi System (PTM), Synthetic Minority Oversampling Technique (SMOTE) with SVM, Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms, over benchmark datasets with a wide range of imbalance ratios (i.e., 1.25 ≤ IR ≤ 2088). The results showed that the MMTS has a superior performance for high imbalance ratios (i.e., IR ≥ 463), while for medium imbalance ratios (i.e., 189 ≤ IR ≤ 417), the MMTS has a classification performance equal to that of the SVMs. For low imbalance ratios (IR ≤ 189), the SVM was the best among the classifiers. It has been noticed that the effect of the maximum Fisher's Discriminant Ratio (F1-ratio) is dominated by the imbalance ratio (IR) effect (i.e., IR is more important than the F1-ratio). MMTS showed a very robust classification performance across the range of imbalance ratios; it also showed better classification performance compared with KBA and ACT (i.e., state-of-the-art modified SVM classifiers for imbalance data), HNB, NBtree, and other modified Naive Bayes classifiers when the imbalance ratio is relatively high.
In order to demonstrate the MMTS applicability, a case study in the welding area was used. The results showed that the MMTS classifier outperformed the benchmarked classifiers and MGA. The case results emphasize that the MMTS is one of the most suitable classification algorithms when there is a high imbalance ratio.
For future research work, the problems of multiclass imbalanced data and the mixed data need to be tackled thoroughly.

Disclosure
Permanent address of Mahmoud El-Banna is as follows: Industrial Engineering Department, German Jordanian University, P.O. Box 35247, Amman 11180, Jordan.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.