Comparing the Linear and Quadratic Discriminant Analysis of Diabetes Disease Classification Based on Data Multicollinearity

Linear and quadratic discriminant analysis are two fundamental classification methods used in statistical learning. The moments (MM), maximum likelihood (ML), minimum volume ellipsoid (MVE), and t-distribution methods are used to estimate the parameters of the independent variables under the multivariate normal distribution in order to classify binary dependent variables. The MM and ML methods are popular and effective methods that approximate the distribution parameters from the observed data. In contrast, the MVE and t-distribution methods rely on resampling algorithms, reliable tools with high resistance to outliers. This paper starts by explaining the concepts of linear and quadratic discriminant analysis and then presents the four methods used to create the decision boundary. In our simulation study, the independent variables were generated from a multivariate normal distribution with a fixed correlation coefficient, inducing multicollinearity, and a basic logistic regression model was used to construct the binary dependent variable. For the application to the Pima Indians diabetes dataset, the classification of diabetes was the dependent variable and eight independent variables were used. This paper aimed to determine the highest average percentage of accuracy. Our results showed that the MM and ML methods performed well with many independent variables under linear discriminant analysis (LDA), whereas the t-distribution method under quadratic discriminant analysis (QDA) performed better with few independent variables.


Introduction
Logistic regression is a standard statistical algorithm used to classify binary dependent variables based on independent variables. Nevertheless, a critical assumption of logistic regression is that there is a linear decision surface between the dependent and independent variables. Furthermore, logistic regression requires little or no multicollinearity between the independent variables. Data that are not linearly separable, as well as multicollinearity, are often found in real-world situations, and it is hard to capture such complicated relationships using logistic regression. More powerful and concise algorithms, such as discriminant analysis [1], can easily exceed its performance.
Discriminant analysis is used for classification, dimension reduction, and data visualization. Classification was explored in a study of discriminant analysis [2] that has been applied to many classification problems [3,4]. When performing discriminant analysis, users address classification problems in which observations from two or more groups, described by one or more independent variables, are assigned to one of the groups based on their measured characteristics. Medical scientists investigate how groups (characterized by blood pressure, blood glucose levels, and age) differ across independent variables. One study used discriminant analysis on patients who had previously suffered a heart attack [5] to classify whether a patient would survive based on other variables. Two discriminant analyses are of interest: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
LDA can be interpreted from decision boundaries in two ways. The first is a probabilistic interpretation, and the second is Fisher's interpretation [6]. The first interpretation helps in considering the assumptions of LDA, while the second provides a better understanding of how LDA performs classification. Fisher [7] introduced an LDA that identified a linear combination of the independent variables with a maximum class separation ratio. LDA's ability to successfully distinguish between tumor classes [8] using gene expression data was an essential feature of a new approach to cancer classification. LDA assumes that the covariance of all classes is equal, and the decision boundary is a linear function. When each class has an individual covariance matrix, the method is QDA, and the decision boundary takes the form of a quadratic function. LDA and QDA [9] are recommended when using NIR data to classify an ill-posed problem. Principal component analysis [10] was applied with LDA and QDA to discriminate between healthy control and cancer samples based on MS data sets. Seven classification methods [11] (binary logistic, probit, and cumulative probit regression, LDA, QDA, artificial neural networks, and naïve Bayes classification) were tested for skeletal sex estimation; consequently, LDA may be preferred over the other methods in that setting. LDA and QDA were proposed for flow regime identification [12], combining responses from a non-intrusive optical sensor for the vertical upward gas-liquid flow of air and water. The LDA and QDA methods use the multivariate normal distribution of the independent variables as a classification rule. The prior probability, mean, and covariance matrix of each class form the discriminant function for the boundary between classes. The two classes are assumed to follow the normal distribution, the most common and default distribution in real-world applications.
Therefore, if the normal distribution is considered for the two classes and the covariance matrices are assumed to be equal, the decision boundary takes the form of a linear discriminant function. When the covariance matrices are assumed to be unequal, the decision boundary takes the form of a quadratic discriminant function.
This study aimed to investigate the parameter estimations for LDA and QDA. We considered the commonly used moments (MM) and maximum likelihood (ML) methods, as well as the iterative minimum volume ellipsoid (MVE) and t-distribution methods, with the conditional distribution taken to be the multivariate normal distribution.
Traditionally, LDA and QDA are estimated using the first and second theoretical moments, the so-called MM method introduced by Bowman and Shenton [13] through the concept of the central limit theorem. This method estimates population parameters by expressing the population moments as functions of the parameters of interest. These functions are then set equal to the sample moments, with the number of equations matching the number of parameters, and the resulting equations are solved to approximate the parameters of interest. The ML method [14] is a widely used technique for estimating the parameters of a probability distribution function based on observed data. The parameters are estimated by maximizing the likelihood function of the observations.
The logarithm of the likelihood function is then taken. Taking the derivative with respect to a parameter is straightforward, and setting it to zero yields the ML estimators in a pleasant closed form.
Rousseeuw [15,16] studied a resampling algorithm using the minimum volume ellipsoid method. This method generates a robust estimation of multivariate location and scale. It has low bias and is expedient for outlier inspection in multivariate data, often through robust distances based on the MVE estimate. The MVE's properties include its breakdown value, affine equivariance, and efficiency. The t-distribution method [17] uses finite mixture models to classify multivariate data sets. Its advantages are that it is convenient for modeling and yields improved determination. Multivariate normal data have traditionally been used because of their easy computation. A mixture of t-distributions can be fitted by ML iteration using the expectation-maximization (EM) algorithm [18].
The debate between machine learning and discriminant analysis (LDA and QDA) has raised one of the critical questions in classification. Generally speaking, discriminative classifiers model the probability distribution function directly and thus need fewer parameters, which alleviates the difficulty of parameter estimation. Machine learning is concerned with algorithms that learn how to assign a class label to an example from the problem domain. On the other hand, because the class probability distributions are known, LDA and QDA have unique advantages in dealing with multicollinear data. The ML, MM, MVE, and t-distribution methods approximate the mean and covariance of the multivariate normal distribution, which play an important role in classification.
This study compares the MM, ML, MVE, and t-distribution methods based on LDA and QDA. These methods were used to classify binary dependent variables in relation to the multicollinearity of the independent variables. The performance of the four methods was determined by their percentage of accuracy. Additionally, we clarify some theoretical concepts through a simulation study and real data.

The Concept of Linear and Quadratic Discriminant Methods

The probability density function of the multivariate normal distribution, x ∼ N(μ, Σ), is written as:

f(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{-(1/2)(x - μ)^T Σ^{-1} (x - μ)}. (1)

Taking the natural logarithm of (1) and multiplying each side by -2 leaves the quadratic form (x - μ)^T Σ^{-1} (x - μ) plus constants. When the two classes share a common covariance matrix Σ, the term x^T Σ^{-1} x is identical for both classes and cancels, so the discriminant reduces to a linear term (equation (4)). The LDA discriminant function for the two classes is:

δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + ln π_k, k = 1, 2,

and the class of instance x is defined by:

class(x) = argmax_k δ_k(x). (6)

The LDA Algorithm.
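As a concrete illustration of the LDA rule above, the following Python sketch pools the two class covariances and evaluates the linear discriminant δ_k(x). The function names are ours, not from the paper, and the class parameters are estimated with the ordinary sample mean and covariance:

```python
import numpy as np

def lda_fit(X1, X2):
    """Estimate the LDA parameters from two class samples (rows = observations).
    Returns class means, the pooled covariance, and the prior proportions."""
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance under the equal-covariance assumption of LDA.
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    priors = np.array([n1, n2]) / (n1 + n2)
    return mu1, mu2, S, priors

def lda_predict(x, mu1, mu2, S, priors):
    """Assign x to the class with the largest linear discriminant
    delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + ln pi_k."""
    Sinv = np.linalg.inv(S)
    deltas = [x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)
              for mu, pi in zip((mu1, mu2), priors)]
    return int(np.argmax(deltas))  # 0 -> first class, 1 -> second class
```

For two well-separated Gaussian clouds, points near each class mean are assigned to that class.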

Quadratic Discriminant Analysis
The natural logarithm taken from equation (7) is multiplied on each side by -2. To obtain equation (8) in the quadratic form x^T Ax + b^T x + c = 0, we bring the expression to the QDA discriminant function:

δ_k(x) = -(1/2) ln|Σ_k| - (1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k) + ln π_k, k = 1, 2.

The classification term is shown in equation (6). The QDA Algorithm.
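The quadratic discriminant can be sketched in the same way, now with a per-class covariance matrix (again, the function names are illustrative, not from the paper):

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """Quadratic discriminant delta_k(x) = -0.5 ln|Sigma_k|
    - 0.5 (x - mu_k)^T Sigma_k^-1 (x - mu_k) + ln pi_k."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)  # stable log-determinant
    return -0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma) @ d + np.log(prior)

def qda_predict(x, params):
    """params: one (mu_k, Sigma_k, pi_k) tuple per class."""
    scores = [qda_discriminant(x, mu, S, pi) for mu, S, pi in params]
    return int(np.argmax(scores))
```

Unlike LDA, the covariance matrices do not cancel, so the decision boundary between two classes is a quadratic surface.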

The Estimation Methods
The classification of these methods focuses on the probability density function of the multivariate normal distribution from equation (1), x ∼ N(μ, Σ). This section estimates the parameters of the multivariate normal distribution under the four methods.

Moments (MM) Method.
The MM method equates the sample and theoretical moments. E(X^k) is the kth moment of the distribution and M_k = (1/n) Σ_{i=1}^n X_i^k is the kth sample moment, for k = 1, 2, ....

For the multivariate normal distribution, the first sample moment about the origin, M_1 = (1/n) Σ_{i=1}^n x_i = x̄, approximates the first theoretical moment E(X) = μ, and the second sample moment about the origin, M_2 = (1/n) Σ_{i=1}^n x_i x_i^T, approximates the second theoretical moment E(X X^T) = Σ + μμ^T. Equating the first moments gives the moments estimator of μ:

μ̂_mm = (1/n) Σ_{i=1}^n x_i.

Equating the second moments and substituting μ̂_mm, the moments estimator of Σ is:

Σ̂_mm = (1/n) Σ_{i=1}^n x_i x_i^T - μ̂_mm μ̂_mm^T = (1/n) Σ_{i=1}^n (x_i - μ̂_mm)(x_i - μ̂_mm)^T.
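A minimal sketch of the moments estimators above, assuming a data matrix X with one observation per row:

```python
import numpy as np

def mm_estimates(X):
    """Method-of-moments estimates for a multivariate normal sample X (n x p):
    equate the first two sample moments to E(X) = mu and E(X X^T) = Sigma + mu mu^T."""
    n = X.shape[0]
    mu_mm = X.mean(axis=0)                  # first sample moment
    M2 = (X.T @ X) / n                      # second sample moment about the origin
    Sigma_mm = M2 - np.outer(mu_mm, mu_mm)  # Sigma = E(X X^T) - mu mu^T
    return mu_mm, Sigma_mm
```

The second form of Σ̂_mm shows this equals the biased (divide-by-n) sample covariance, i.e. `np.cov(X, rowvar=False, bias=True)`.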

Maximum Likelihood (ML) Method.
From the probability density function, the likelihood function is created as L(μ, Σ) = Π_{i=1}^n f(x_i; μ, Σ). Taking the logarithm gives the log-likelihood function:

ln L(μ, Σ) = -(np/2) ln(2π) - (n/2) ln|Σ| - (1/2) Σ_{i=1}^n (x_i - μ)^T Σ^{-1} (x_i - μ).

Taking the derivative with respect to μ is straightforward, and setting it to zero yields a pleasant result:

μ̂_ml = (1/n) Σ_{i=1}^n x_i = x̄.

The ML estimator of the mean is the sample mean. The covariance matrix is approximated by taking the derivative of the log-likelihood with respect to Σ^{-1}. Finally, setting it to zero yields the ML estimator:

Σ̂_ml = (1/n) Σ_{i=1}^n (x_i - x̄)(x_i - x̄)^T.

It can be concluded that the ML and the MM estimators take the same form and give the same values.
The MM and ML Algorithm.
(2) Select the sample sizes of each class, n_1 and n_2.
(3) Calculate the mean of each class.

Minimum Volume Ellipsoid (MVE) Method.
This method [20] depends on the MVE, which covers at least h of the n observations. The multivariate data are defined as x = (x_1, x_2, ..., x_p) with p independent variables and x_i = (x_{i1}, ..., x_{ip})^T, i = 1, 2, ..., n. Computing the MVE for a data set x requires examining all C(n, h) ellipsoids that contain h = ⌊(n + p + 1)/2⌋ observations of x to find the smallest ellipsoid volume. The MVE algorithm defines candidate ellipsoids from subsets of p + 1 observations of x. For each subset of size p + 1, indicated by J = {i_1, ..., i_{p+1}} ⊂ {1, ..., n}, the sample mean and sample covariance matrix are computed by:

μ̄_J = (1/(p + 1)) Σ_{i∈J} x_i, Σ_J = (1/p) Σ_{i∈J} (x_i - μ̄_J)(x_i - μ̄_J)^T.

The covariance matrix Σ_J is nonsingular when the subset of p + 1 observations is in general position. If the subset is not in general position, further observations from x are included until a subset with a nonsingular sample covariance matrix is obtained. The MVE Algorithm.
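Since examining all C(n, h) ellipsoids is infeasible, the resampling algorithm draws random (p + 1)-subsets instead. The following is a rough Python sketch of that idea; the subset count and the inflation rule are our simplifications, not the exact algorithm of [15,16]:

```python
import numpy as np

def mve_estimate(X, n_subsets=500, seed=0):
    """Resampling approximation to the minimum volume ellipsoid (MVE).
    Draw random (p+1)-subsets, inflate each candidate ellipsoid until it
    covers h = (n + p + 1) // 2 points, and keep the smallest-volume one."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = (n + p + 1) // 2
    best_vol, best = np.inf, None
    for _ in range(n_subsets):
        J = rng.choice(n, size=p + 1, replace=False)
        mu_J = X[J].mean(axis=0)
        Sigma_J = np.cov(X[J], rowvar=False)
        if np.linalg.matrix_rank(Sigma_J) < p:   # singular subset: skip it
            continue
        # Squared Mahalanobis distance of every point to the candidate center.
        diff = X - mu_J
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma_J), diff)
        m2 = np.sort(d2)[h - 1]                  # inflate to cover h points
        vol = np.sqrt(np.linalg.det(m2 * Sigma_J))  # volume up to a constant
        if vol < best_vol:
            best_vol, best = vol, (mu_J, m2 * Sigma_J)
    return best  # (robust location, robust scatter)
```

Because the winning ellipsoid only needs to cover h points, a minority of outliers cannot drag the location estimate away from the data bulk.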

The t-Distribution Method.
The modeling of a mixture of t-distributions [17] is considered a robust estimator that fits the data set by ML via the expectation-maximization (EM) algorithm [18]. The t-distribution provides a longer-tailed distribution that is close to the normal distribution for large sample sizes. The data set is denoted by x = (x_1, x_2, ..., x_p) with p independent variables and sample size n. In a normal mixture model, each observation is assumed to be a realization of a p-dimensional random vector with the g-component normal mixture probability density function:

f(x; Ψ) = Σ_{j=1}^g π_j φ(x; μ_j, Σ_j),

where the proportions π_j are nonnegative and sum to one, and φ(x; μ_j, Σ_j) denotes the p-variate multivariate normal probability density function with mean μ_j and covariance matrix Σ_j (j = 1, ..., g). Here Ψ = (π_1, ..., π_{g-1}, θ^T)^T, where θ is composed of the components of the μ_j and the distinct components of the Σ_j (j = 1, ..., g). The t mixture model is obtained by replacing each normal component with a t-distribution with location parameter μ_j, positive definite covariance matrix Σ_j, and ν_j degrees of freedom. The implementation of the EM algorithm for ML estimation in the case of a single-component t-distribution has been discussed in McLachlan et al. [21]. At the (k + 1)th iteration of the EM algorithm, π^{(k+1)}, θ^{(k+1)}, and ν^{(k+1)} can be computed independently of each other, with the weighted sample mean and sample covariance matrix updated in the M-step. It can be seen that μ^{(k+1)} and Σ^{(k+1)} are weighted versions of the ordinary sample mean and sample covariance matrix.
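A minimal sketch of the single-component case, with the degrees of freedom ν held fixed for brevity (McLachlan et al. also update ν). The E-step weights u_i shrink toward zero for outlying points, so the M-step mean and covariance are robust:

```python
import numpy as np

def t_fit_em(X, nu=4.0, n_iter=50):
    """EM for a single multivariate t-distribution with fixed nu.
    E-step: weights u_i = (nu + p) / (nu + d_i), with d_i the squared
    Mahalanobis distance. M-step: weighted sample mean and covariance."""
    n, p = X.shape
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)  # normal-theory start
    for _ in range(n_iter):
        diff = X - mu
        d = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
        u = (nu + p) / (nu + d)                        # E-step weights
        mu = (u[:, None] * X).sum(axis=0) / u.sum()    # weighted sample mean
        diff = X - mu
        Sigma = (u[:, None] * diff).T @ diff / n       # weighted sample covariance
    return mu, Sigma
```

With a small ν the fit is strongly robust; as ν grows, the weights approach 1 and the estimates revert to the normal-theory sample mean and covariance.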

Simulation and Results
The objective of this study was to classify binary dependent variables using LDA and QDA via the MM, ML, MVE, and t-distribution methods. The independent variables (x) were simulated from the multivariate normal distribution in sets of two, four, six, and eight independent variables, with constant correlation (ρ) values of 0.1, 0.5, and 0.9. The multivariate normal distribution of the independent variables (x) consisted of mean (μ) and covariance matrix (Σ), with Σ_{ii} = σ_i^2 on the diagonal and Σ_{ij} = ρσ_iσ_j (i ≠ j) off the diagonal. The mean (μ) was fixed at zero, and the standard deviations (σ_i, i = 1, 2, ..., p) were fixed at 6.
The regression coefficients were denoted by β = (β_0, β_1, ..., β_p)^T for the sets of two, four, six, and eight independent variables. Finally, the dependent variables (y) were obtained from the logit function π_i = 1/(1 + e^{-x_i^T β}) of the logistic regression model.
If π_i ≥ 0.5, the dependent variable was defined as y_i = 1, and y_i = 0 when π_i < 0.5. The R program was used to simulate data with 1,000 replications and sample sizes of 200, 300, 400, and 500. The MM, ML, MVE, and t-distribution methods approximated the LDA and QDA parameters to predict the dependent variable. A confusion matrix (Table 1) was used to report the classification performance on the estimated data compared with the real data, with the percentage of accuracy computed as the number of correct classifications divided by the total number of cases. The average percentages of accuracy are reported in Table 2. Additionally, the MM and ML methods with LDA had the highest average percentages in Tables 3-5. Meanwhile, the t-distribution method had the second highest average percentage of accuracy, with only a minor difference from the MM and ML methods. When the number of independent variables increased, the average percentage of accuracy decreased. When the sample sizes were large, the average percentage of accuracy was good. However, when the correlation coefficients increased, the average percentage of accuracy changed only slightly.
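The paper ran the simulation in R; the following is a minimal Python sketch of the same data-generating design. The β values here are illustrative assumptions, since the paper does not list the coefficients it used:

```python
import numpy as np

def simulate(n=200, p=4, rho=0.5, sigma=6.0, beta=None, seed=0):
    """One replication of the simulation design: x ~ N(0, Sigma) with
    constant correlation rho and standard deviation sigma, and binary y
    from the logistic model with threshold 0.5."""
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = np.ones(p + 1)                 # assumed: intercept and slopes of 1
    # Constant-correlation covariance: sigma^2 on the diagonal,
    # rho * sigma^2 off the diagonal.
    Sigma = np.full((p, p), rho * sigma**2)
    np.fill_diagonal(Sigma, sigma**2)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    pi = 1.0 / (1.0 + np.exp(-(beta[0] + X @ beta[1:])))
    y = (pi >= 0.5).astype(int)               # y_i = 1 when pi_i >= 0.5
    return X, y
```

Repeating this for each (n, p, ρ) cell and scoring each estimator's predictions against y reproduces the structure of Tables 2-5.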

Application of a Real Dataset
We applied LDA and QDA to classify healthy persons and diabetic patients using the Pima Indians diabetes dataset from the Applied Physics Laboratory of Johns Hopkins University. This dataset was obtained from the University of California, Irvine website (https://archive.ics.uci.edu/ml/datasets/). The binary dependent variable was 0 or 1, with 0 indicating a healthy person and 1 indicating a diabetic patient. The independent variables were the number of pregnancies (x_1), age in years (x_2), diabetes pedigree function (x_3), triceps skin fold thickness (x_4), 2-h serum insulin (x_5), plasma glucose concentration (x_6), diastolic blood pressure (x_7), and body mass index (kg/m²) (x_8). The data consisted of 768 records: 500 healthy person records and 268 diabetic patient records. Table 6 presents the descriptive statistics for the diabetes dataset.
Pearson's correlation analysis was used to determine whether there was a relationship between the eight continuous independent variables. The correlation between two variables was computed as:

r = Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) / √(Σ_{i=1}^n (x_i - x̄)^2 Σ_{i=1}^n (y_i - ȳ)^2).

The correlation coefficients of the independent variables are presented in Table 7 and Figure 1. The hypothesis test for the significance of the relationship is defined as H_0: ρ = 0 against H_1: ρ ≠ 0, and the significance of Pearson's correlation was computed with the t-statistic

t = r√(n - 2) / √(1 - r^2),

with n - 2 degrees of freedom (df).
Finally, a p-value of the t-statistic less than 0.05 indicated a significant relationship between two variables, as shown in Table 7. Based on our findings, a significant positive relationship at a moderate level was found in most cases, such as between x_1 and x_6 and between x_3 and x_8, and a significant relationship at a strong level was found in some cases, such as between x_1 and x_2 and between x_4 and x_5. Most of the independent variables had significant relationships, except for the pairs (x_2, x_3), (x_2, x_4), and (x_2, x_5). The Pearson correlation matrix from Table 7 is redrawn in Figure 1 and is easier to read using different shades: a light shade signifies a moderate correlation, and a dark shade denotes a strong correlation. Most of the independent variables show a light shade, which means there were correlations among the independent variables, i.e., multicollinearity problems. The MM, ML, MVE, and t-distribution methods via LDA and QDA were used to compute the accuracy percentages in Table 8. The sets of two, four, six, and eight independent variables mirrored those of the simulation data, with the two, four, and six independent variables selected where the correlations were significant at the 0.05 level.
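The correlation and its t-statistic can be sketched as follows; this is a plain reimplementation of the standard formulas, not code from the paper:

```python
import numpy as np
from math import sqrt

def pearson_test(x, y):
    """Pearson correlation r and its t-statistic with df = n - 2 for
    testing H0: rho = 0 against H1: rho != 0."""
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    r = (xc @ yc) / sqrt((xc @ xc) * (yc @ yc))   # sample correlation
    t = r * sqrt(n - 2) / sqrt(1 - r**2)          # t-statistic, df = n - 2
    return r, t
```

The two-sided p-value then comes from the t-distribution with n - 2 degrees of freedom (e.g. via `scipy.stats.t.sf`); a pair is flagged as significantly correlated when that p-value is below 0.05.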
From Table 8, it is apparent that the MM and ML methods showed equal accuracy percentages in all cases and the highest accuracy percentages in most cases. However, the accuracy percentage of the t-distribution method was good for two independent variables using QDA. Therefore, the MM and ML methods with LDA outperformed the others for four, six, and eight independent variables, matching the simulation results. When the number of independent variables increased from two to four, the accuracy percentage changed only slightly; however, the larger sets of independent variables demonstrated excellent performance on the diabetes dataset. The simulation results are summarized in Tables 2-5. The number of independent variables and the sample size influenced the average accuracy percentage. QDA outperformed for small numbers of independent variables, but larger numbers of independent variables were more appropriate for LDA. However, the average accuracy percentages for larger numbers of independent variables were lower than those for small numbers. Increasing the correlation coefficient did not affect the classification, as the average accuracy percentage showed only slightly different values. If the sample size increased, the accuracy of all methods increased in all cases.
For the real data results in Table 8, the MM, ML, and t-distribution methods showed the highest accuracy percentages for two independent variables. We found that the independent variables of the real data were skewed (Figure 2). The Shapiro-Wilk test [22] confirmed that all independent variables were non-normal. However, the MM and ML methods supported classification with many independent variables. Because the simulation dataset was generated from the multivariate normal distribution, the real data results differed from the simulation study in some cases. Overall, it is clear that the MM and ML methods are the most suitable for classifying diabetes.
The Pima Indians diabetes dataset is popular for classification analysis and for improving medical diagnosis. Gupta et al. [23] compared the Gaussian process, LDA, QDA, stochastic gradient descent, ridge regression classifier, support vector machines, k-nearest neighbors, decision tree, naïve Bayes, logistic regression, random forest, and ELM with multiquadric, RBF, and sigmoid activation functions. The results suggested that logistic regression performs better than the other techniques. Chang et al. [24] proposed an e-diagnosis system based on machine learning algorithms (naïve Bayes classifier, random forest classifier, and J48 decision tree models) trained and tested on the Pima Indians diabetes data. It was concluded that a naïve Bayes model works well with a more fine-tuned selection of features for binary classification. Abedin et al. [25] studied a hierarchical ensemble model that combined two trained classifiers, a decision tree and a logistic regression model, and fed their outputs to a neural network.
The proposed model achieved improved classification accuracy on the Pima Indians diabetes database.

Conclusion
This paper proposes the classification of binary data by applying the MM, ML, MVE, and t-distribution methods based on LDA and QDA for data with multicollinearity. The solutions reveal the advantages and disadvantages of these methods. QDA has the advantage when the number of independent variables is small, while MM and ML outperformed in classification across all correlation coefficients, with only slight differences from the other methods. The LDA results showed that MM and ML perform identically when provided with four to eight independent variables. The classification performance of these methods did not depend on the correlation coefficient; however, large sample sizes showed good performance across all cases. For the real data, eight independent variables from the medical dataset were available to classify diabetes patients, and we selected sets of two, four, six, and eight independent variables with large correlations. These results showed that the t-distribution method was efficient for two independent variables via QDA, while the MM and ML methods were effective at classifying four to eight variables based on LDA. Therefore, we concluded that the MM and ML methods can classify in the presence of multicollinearity. Medical data classification has been extended to a novel random vector functional link [26] and a novel random vector functional link with an ε-insensitive Huber loss function [27]. A fuzzy-based Lagrangian twin parametric-margin support vector machine [28] reduced the effect of outliers in medical data.

Data Availability
The data used to support the findings of this study are available from the author on request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.