Gaussian Mixture Models Based on Principal Components and Applications

Data scientists use various machine learning algorithms to discover patterns in large data sets that can lead to actionable insights. In general, high-dimensional data are reduced to a set of principal components so as to highlight similarities and differences. In this work, we model the reduced data with a bivariate Gaussian mixture model. We discuss a heuristic for detecting important components by choosing the initial values of the location parameters in three different ways: cluster means from k-means, cluster means from hierarchical clustering, and the default values in the "mixtools" R package. The parameters of the model are obtained via an expectation maximization algorithm. The Bayesian information criterion is evaluated for each technique, demonstrating that all of them are efficient with respect to computation capacity. The effectiveness of the discussed techniques is demonstrated through a simulation study and real data sets from different fields.


Introduction
In real data such as engineering data, efficient dimension reduction is required to reveal underlying patterns of information. Dimension reduction can be used to convert data sets containing millions of features into manageable spaces for efficient processing and analysis. Unsupervised learning is the main approach to reducing dimensionality. Conventional dimension reduction approaches can be combined with statistical analysis to improve the performance of big data systems [1]. Many dimension reduction techniques have been developed by statistical and artificial intelligence researchers. Principal component analysis (PCA), introduced in 1901 by Pearson [2], is one of the most popular of these methods. The main purpose of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. Among the many PCA methods, singular value decomposition is used in numerical analysis and Karhunen-Loève expansion in electrical engineering. Eigenvector analysis and characteristic vector analysis are often used in the physical sciences. In image analysis, the Hotelling transformation is often used for principal component projection.
In recent years, there has been increasing interest in PCA mixture models. Mixture models provide a useful framework for the modelling of complex data with a weighted component distribution. Owing to their high flexibility and efficiency, they are used widely in many fields, including machine learning, image processing, and data mining. However, because the component distributions in a mixture model are commonly formalized as probability density functions, implementations in high-dimensional spaces are constrained by practical considerations.
PCA mixture models are based on a mixture-of-experts technique, which models a nonlinear distribution through a combination of local linear submodels, each with a fairly simple distribution [3]. For the selection of the model, a PCA mixture model was proposed by Kim, Kim, and Bang [4], which has a more straightforward expectation maximization (EM) calculation, does not require a Gaussian error term for each mixture component, and uses an efficient technique for model order selection. The researchers applied the proposed model to the classification of synthetic data and eye detection [4].
For multimode processes, the Gaussian mixture model (GMM) was developed to estimate the probability density function of the process data under normal operating conditions. However, in the case of high-dimensional and collinear process variables, learning from process data with GMM can be difficult or impossible. A novel multimode monitoring approach based on the PCA mixture model was proposed by Xu, Xie, and Wang [5] to address this issue. In this method, first, the PCA technique is applied directly to each Gaussian component's covariance matrix to reduce the dimension of process variables and to obtain nonsingular covariance matrices. Then, an EM algorithm is used to automatically optimize the number of mixture components. A novel process monitoring scheme for the detection of multimode processes was developed using the resulting PCA mixture model. The monitoring performance of the proposed approach has been evaluated through case studies [5].
In recent years, hyperspectral imaging has become an important research subject in the field of remote sensing. An important application of hyperspectral imaging is the identification of land cover areas. The rich content of hyperspectral data enables forests, urban areas, crop species, and water supplies to be recognized and classified. In 2016, Kutluk, Kayabol, and Akan [6] proposed a supervised classification and dimensionality reduction method for hyperspectral images, using a mixture of probabilistic PCA (PPCA) models. The proposed mixture model simultaneously allows the reduction of dimensionality and spectral classification of the hyperspectral image. Experimental findings obtained using real hyperspectral data indicate that the proposed approach results in better classification than the state-of-the-art methods [6].
In the field of face recognition, Ahmadkhani and Adibi [7] proposed a supervised version of the PPCA mixture model. This model provides a number of local linear manifolds underlying the data samples. The underlying manifolds are used for face recognition applications to achieve dimensionality reduction without loss of information.
In this work, we reduce the dimensions of data by applying a PCA technique and then model the reduced data, or principal component scores, using a GMM. Then, we obtain estimates of the parameters using an EM algorithm. Finally, we compare the selection of initial values for the location parameters in the mixture model using three different techniques: k-means, hierarchical clustering, and the default values in the "mixtools" package. The rest of this paper is organized as follows. Section 2 briefly defines the concept of PCA. In Section 3, mixture densities are discussed. Section 4 provides the probability density function of the Gaussian mixture distribution and the EM algorithm that is used to estimate the mixture's parameters. Section 5 presents a PCA mixture model based on the proposed scenario. In Section 6, the experimental results are presented in three subsections, and the main conclusions are listed in Section 7.

Principal Component Analysis
Suppose we have p-dimensional vectors and need to reduce them to a q-dimensional subspace. The reduction can be achieved by projecting the original vectors onto the q principal components, which span the subspace. Suppose that x is a vector of p random variables. To find the principal components, we seek linear functions of x with maximum variance; most of the variation in x will be accounted for by m principal components, where m ≪ p. The PCA method determines the correlation between the principal components and the data variables; a high correlation indicates important variables. Let Σ be the known covariance matrix of x. For k = 1, 2, ..., p, the kth principal component is z_k = α_k′x, where α_k is an eigenvector of Σ corresponding to the kth largest eigenvalue λ_k. If α_k is chosen to have unit length, α_k′α_k = 1, then var(z_k) = λ_k. The technique of Lagrange multipliers can then be used: for the first component, we maximize

α_1′Σα_1 − λ(α_1′α_1 − 1), (1)

and differentiation with respect to α_1 gives Σα_1 = λα_1, so λ must be an eigenvalue of Σ; choosing λ as large as possible makes α_1 the eigenvector corresponding to the largest eigenvalue of Σ, that is, the first principal component of x is α_1′x. In general, the kth principal component maximizes var(α_k′x) subject to being uncorrelated with the earlier components. For the second component, the constraint of uncorrelatedness with α_1′x can be expressed using either of the following equations:

α_1′Σα_2 = 0, (2)
α_1′α_2 = 0. (3)

If we choose equation (3), we can write a Lagrangian to maximize var(α_2′x):

α_2′Σα_2 − λ(α_2′α_2 − 1) − ϕα_2′α_1. (4)

Differentiation of this quantity with respect to α_2 gives us

Σα_2 − λα_2 − ϕα_1 = 0. (5)

Next, left multiplying α_1′ into this expression, we have

α_1′Σα_2 − λα_1′α_2 − ϕα_1′α_1 = 0, (6)

where, as mentioned above, the first two terms equal zero and α_1′α_1 = 1, resulting in ϕ = 0. Therefore, Σα_2 − λα_2 = 0, or (Σ − λI_p)α_2 = 0, is another eigenvalue equation, and we use the same strategy of choosing α_2 to be the eigenvector associated with the second largest eigenvalue, which yields the second principal component of x, namely, α_2′x [8, 9].
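The eigendecomposition route described above can be sketched in a few lines; the data matrix, its dimensions, and the choice q = 2 below are made-up illustrations, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of p = 4 correlated variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Centre the data and form the sample covariance matrix Sigma.
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# alpha_k is the k-th column of eigvecs; var(z_k) = lambda_k.
q = 2
scores = Xc @ eigvecs[:, :q]          # principal component scores
explained = eigvals[:q].sum() / eigvals.sum()
```

The projection `scores` plays the role of the reduced data used in the rest of the paper, and `explained` is the share of total variance retained.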

Mixture Densities
A mixture density is defined as a weighted sum of K component densities [9, 10]. Denote the jth component density by p(x; θ_j), where θ_j indicates the component parameters. We use π_j to denote the weighting factor or "mixing proportion" of the jth component in the combination, with the constraints that π_j ≥ 0 and Σ_{j=1}^{K} π_j = 1; π_j represents the probability that a data sample belongs to the jth mixture component. A K-component mixture density is then defined as

p(x; Θ) = Σ_{j=1}^{K} π_j p(x; θ_j). (7)

The mixture model has a vector of parameters, Θ = (π_1, ..., π_K, θ_1, ..., θ_K). We consider a mixture density to model a process by selecting a "source" j according to the multinomial distribution (π_1, ..., π_K) and then drawing a sample from the corresponding component density p(x; θ_j). Therefore, the probability of selecting source j and datum x is π_j p(x; θ_j), and equation (7) gives the marginal probability of selecting datum x. We can think of the source that generated a data vector x as "missing information"; that is, given a data point x, we want to infer which source it is likely to belong to. Section 4 presents the EM algorithm, which is used to iteratively estimate this missing information [11, 12].
In mixture models, the hidden variable is a latent variable, denoted by Z. We encode it as a K-dimensional binary vector z in a 1-of-K representation, satisfying z_k ∈ {0, 1} and Σ_k z_k = 1. We define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x | z). Generally, in the mixture model, we first choose a sample z from a multinomial distribution and then draw an observation x from a distribution that depends on z, i.e.,

p(x) = Σ_z p(z) p(x | z). (8)

The marginal distribution over z is specified in terms of the mixing coefficients π_k, such that p(z_k = 1) = π_k.
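The two-stage generative scheme above (draw the latent z, then draw x given z) can be sketched for a univariate two-component mixture; the weights, means, and standard deviations below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.3, 0.7])        # mixing coefficients: pi_k >= 0, sum to 1
mu = np.array([-2.0, 3.0])       # component means (illustrative)
sd = np.array([0.5, 1.0])        # component standard deviations

n = 20000
# Step 1: choose the latent source z from the multinomial distribution pi.
z = rng.choice(2, size=n, p=pi)
# Step 2: draw x from the component density p(x | z).
x = rng.normal(mu[z], sd[z])

# The marginal mean of x is sum_k pi_k * mu_k = 0.3*(-2) + 0.7*3 = 1.5.
```

Discarding `z` after sampling is exactly what makes it "missing information": only `x` is observed, and EM must infer the source.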

Mixtures of Gaussians
The probability density function of a Gaussian random vector x is defined as

N(x | μ_x, Σ_x) = (2π)^{−N/2} |Σ_x|^{−1/2} exp{−(1/2)(x − μ_x)′Σ_x^{−1}(x − μ_x)}, (9)

where μ_x is a vector of means, (μ_{x1}, ..., μ_{xN}), and Σ_x is an N × N covariance matrix. The Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form

p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k). (10)

Now, the conditional distribution of x given a particular value of z is a Gaussian,

p(x | z_k = 1) = N(x | μ_k, Σ_k), (11)

and the marginal distribution of x is obtained by summing the joint distribution over all possible states of z, which again gives equation (10). An important derived quantity is the "posterior probability" of a mixture component for a given data vector [11]:

γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | μ_j, Σ_j). (12)

In the example shown in Figure 1, the resulting distribution is bimodal, suggesting that the data come from two different sources. In the figure, the red and green lines indicate the two components of the Gaussian mixture distribution.
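The posterior probability in equation (12) is a weighted, normalized ratio of component densities. A minimal sketch for a univariate two-component mixture follows (the bivariate case only changes the density formula); the parameter values and the query point x = 1.9 are made up for illustration.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Univariate normal density; the bivariate case is analogous."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Illustrative two-component mixture (not the paper's fitted values).
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
var = np.array([1.0, 1.0])

x = 1.9
# gamma_k = pi_k N(x | mu_k, var_k) / sum_j pi_j N(x | mu_j, var_j)
weighted = pi * normal_pdf(x, mu, var)
gamma = weighted / weighted.sum()
```

Since x = 1.9 lies close to the second mean, `gamma` assigns most of the posterior mass to component 2.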

EM for Gaussian Mixtures.
The EM algorithm is an estimation method used to find maximum likelihood estimators when a data set has missing values or latent variables. In this work, we assume a GMM with a fixed number of components K that is known a priori. The EM algorithm proceeds as follows:

(1) Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k, and evaluate the initial value of the log-likelihood.

(2) E step: evaluate the responsibilities using the current parameter values:

γ(z_{nk}) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j). (13)

(3) M step: re-estimate the parameters using the current responsibilities:

μ_k^{new} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n, (14)
Σ_k^{new} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk})(x_n − μ_k^{new})(x_n − μ_k^{new})′, (15)
π_k^{new} = N_k / N, (16)

where N_k = Σ_{n=1}^{N} γ(z_{nk}).

(4) Evaluate the log-likelihood,

ln p(X | μ, Σ, π) = Σ_{n=1}^{N} ln{Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k)}, (17)

and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to step 2 [13].
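The loop above can be sketched end to end for the univariate case (the paper's bivariate fits use the "mixtools" package; this simplified version, including its quantile-based initialization, is an assumption for illustration only):

```python
import numpy as np

def em_gmm_1d(x, K=2, iters=500, tol=1e-8):
    """Minimal EM for a univariate K-component GMM (a sketch, not the
    paper's bivariate implementation)."""
    n = len(x)
    # Step 1: initialize means (at spread-out data quantiles), variances,
    # and mixing coefficients.
    mu = np.quantile(x, np.linspace(0.05, 0.95, K))
    var = np.full(K, x.var())
    pi = np.full(K, 1.0 / K)
    ll_old = -np.inf
    for _ in range(iters):
        # Step 2 (E step): responsibilities gamma[n, k], equation (13).
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # Step 3 (M step): re-estimate parameters, equations (14)-(16).
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / n
        # Step 4: log-likelihood, equation (17), and convergence check.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - ll_old < tol:
            break
        ll_old = ll
    return mu, var, pi, ll

# Two well-separated sources; EM should recover means near 0 and 5.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 700)])
mu, var, pi, ll = em_gmm_1d(x)
```

Because each EM iteration never decreases the log-likelihood, the loop stops when successive values differ by less than `tol`.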

PCA of Gaussian Mixture Model
In this section, we present the steps of the proposed method. These steps are also illustrated in Figure 2.
(1) Use the PCA technique to reduce the dimensionality of a p-dimensional data set. To find the principal components, we first obtain the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors define the principal components, and each eigenvalue gives the variance explained by the corresponding component. The total number of principal components corresponds to the total number of variables in the data set. (6) Use the EM algorithm to estimate the unknown parameters: the mixing proportions between the Gaussians and the mean and covariance of each component. (7) Use the Bayesian information criterion (BIC) to assess the fit of the model; BIC is a criterion for model selection among a finite set of models, where the model with the lowest BIC is preferred [14].
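The selection rule in step (7) can be sketched as follows; the log-likelihood value and sample size used at the end are purely illustrative, and the parameter count is for a two-component bivariate GMM.

```python
import math

def bic(log_lik, n_params, n_obs):
    """BIC = k ln(n) - 2 ln(L-hat); the model with the lowest BIC is preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

# A two-component bivariate GMM has 2 mean vectors (4 values), 2 symmetric
# 2x2 covariance matrices (3 free entries each), and 1 free mixing
# proportion, for 11 free parameters in total.
k = 2 * 2 + 2 * 3 + 1
example = bic(-1683.0, k, 500)   # illustrative log-likelihood and n
```

Adding components raises the log-likelihood but also the parameter count, so BIC trades fit against complexity when comparing candidate mixture models.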

Experimental Results
To study the effectiveness of the proposed method, we consider two scenarios. In the first scenario, the mixture model is fitted to the reduced data produced by the PCA method. In the second scenario, a clustering method is applied to the reduced data, and then the mixture model is fitted to the new data by taking the cluster means as initial values for the means in the mixture model. We use different types of data sets. The method is implemented on scaled data sets, and the results are illustrated in the following sections. We use the "stats", "mclust", and "mixtools" R packages to implement this method [15-17].

Simulation Case.
We implement the proposed method on simulated data with different sample sizes: 50, 100, and 500. Consider a data set of four variables, X_1, X_2, X_3, and X_4, constructed using Z_1, Z_2, and Z_3, the scaled variables of X_1, X_2, and X_3, respectively. The mean values of the variables in the simulated data are μ_1 = 10.003, μ_2 = −0.1651, μ_3 = −1.287, and μ_4 = 0.032. The implementation includes a graphical plot of the data set, displayed in Figure 3, which shows a pairs plot of the simulated data and its three-dimensional surface. As a first step, we applied the PCA method; the results are summarized in Table 1, which presents the total variance of the components. In practice, PCA describes the data in a few variables without loss of information. As shown in Figure 4(a) and Table 1, two components explained 93% of the total variance. We took those two components to form a new data set, denoted RD, which contained 93% of the information in the original data. The empirical distribution of the RD data is presented in Figure 4(b).
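The component-selection rule used above, retaining the smallest number of components whose cumulative share of variance reaches a target such as 93%, can be sketched as follows; the eigenvalues are made up for illustration.

```python
import numpy as np

def n_components_for(eigvals, target=0.93):
    """Smallest q whose top-q eigenvalues explain >= target of the variance."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratios = np.cumsum(vals) / vals.sum()
    return int(np.searchsorted(ratios, target) + 1)

# Illustrative eigenvalues: the first two explain 95% of the variance,
# so a 93% target keeps q = 2 components.
print(n_components_for([2.8, 1.0, 0.15, 0.05]))   # -> 2
```

The same rule with a stricter target keeps more components, mirroring the trade-off visible in a scree plot.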
Hence, we fitted a two-component bivariate mixture model to the RD data. Figure 5(a) displays the fit of the two-component GMM on the new data (with 500 data points); the plot specifies each component's mean and sigma values. To estimate the density parameters for each component, we used an EM algorithm for mixtures of bivariate data. Table 2 presents the estimates of the model parameters for each component and for the different sample sizes. We computed the BIC of the model for the three cases presented in Table 2 and observed that the BIC value grew with increasing sample size. Figure 5(b) shows a plot of log-likelihood versus number of iterations; the log-likelihood levelled off as the number of iterations increased, indicating that the EM method reached convergence. The second scenario involves the selection of the initial values for the means in the mixture model; this was done by applying the k-means method to the reduced data. Then, the centres (means) of the clusters were taken as initial values. Thus, the RD data were partitioned into two clusters using the k-means method, denoted PC1 and PC2, as shown in Figure 6(a). The cluster means for PC1 were 1.343257 and −1.210463, whereas those for PC2 were −0.02877005 and 0.02592586. In the next step, the bivariate GMM was fitted to the RD data using the cluster means as initial values. A visualization of the fitted bivariate GMM is presented in Figure 6(b). A summary of the results is given in Table 3. The resulting BIC value was 3389.031, the same as the BIC for the GMM displayed in Table 2.
Parameter estimates were compared for the mixture and clustering methods. The k-means method computes the conventional Euclidean distance for the given data, whereas GMM computes a weighted distance by incorporating the variance into its measurement calculations. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centres of the latent Gaussians.
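The initialization strategy compared in this scenario, running k-means first and handing its centres to the mixture fit, can be sketched as below. In R this corresponds to passing the centres as the initial means of the GMM (e.g., to an EM routine from "mixtools"); the two-cluster data and the seeding of the centres with one point from each cluster are illustrative assumptions.

```python
import numpy as np

def kmeans(X, init, iters=100):
    """Plain Lloyd's k-means; `init` supplies the starting centres."""
    centres = init.astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest centre (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned points; keep a
        # centre in place if it has no assigned points.
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centres[k] for k in range(len(centres))])
        if np.allclose(new, centres):
            break
        centres = new
    return centres, labels

rng = np.random.default_rng(2)
# Two well-separated bivariate clusters standing in for the PC scores.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
# Seed the centres with one point from each cluster (a simplification).
centres, labels = kmeans(X, X[[0, -1]])
# `centres` would then serve as the initial means of the bivariate GMM.
```

Because k-means ignores covariance, its centres are only a starting point; the subsequent GMM fit refines them while also estimating the covariance of each component.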

Forensic Glass Fragment Data.
In this section, the proposed method is implemented using forensic glass fragment (FG) data (available in an R package). The FG data include 214 observations and 10 variables. Consider the four selected variables, magnesium (Mg), aluminium (Al), silicon (Si), and potassium (K), and the types of fragments (WinF, WinNF, Veh, Con, Tabl, and Head). Note that the variable "type" is used only to classify the data. Figure 7(a) shows a pairwise plot of the four measurements of the scaled data set.
As a first step, we applied the PCA method to the data to obtain a linear estimate of dimensionality. The results are summarized in Table 4. As shown in the scree plot in Figure 7(b), three components explained 89% of the total variance and two components explained 70% of it. We therefore consider the first two components in the following.
Hence, the two-component bivariate mixture model was fitted to the FG data, as shown in Figure 8(a). The mean and sigma values for each component are also shown in Figure 8(a). To estimate the density parameters for each component, we used the EM algorithm for mixtures of bivariate data. Note that the initial values were chosen using the "mixtools" package. Table 5 presents the estimates of the model parameters for each component, as well as the BIC. Figure 8(b) shows a plot of log-likelihood versus number of iterations; the log-likelihood levelled off as the number of iterations increased, indicating that the EM method reached convergence.
Next, we studied the selection of initial values for the location parameters in the mixture model based on the centres of clusters obtained with different clustering methods: k-means and hierarchical clustering. First, the k-means method was applied to the reduced FG data, and then the centres (means) of the clusters were taken as initial values. A visualization of the resulting data for PC1 and PC2 is given in Figure 9(a). The cluster centres (means) for PC1 were −0.36697 and 1.45934, and those for PC2 were 0.35596 and −1.41558. Then, the bivariate GMM was fitted to the reduced FG data using the cluster means as initial values. The results are summarized in Table 6, and a plot of the fitted mixture model is shown in Figure 9(b). The resulting BIC was 1174.485.
Second, the initial values for the location parameters were selected using hierarchical clustering with the "mclust" package. As a first step, hierarchical clustering was applied to the reduced FG data. A visualization of the resulting data for PC1 and PC2 is given in Figure 10(a). The resulting centres were 1.49372 and −0.51297 for PC1, while those for PC2 were −0.58542 and 0.20104. In the next step, the bivariate GMM was fitted to the reduced FG data using the cluster means as initial values, as shown in Figure 10(b). Table 7 shows the results for the mixture model, and the resulting BIC was 1111.483.
We observe that the selection of initial values for location parameters using clustering methods provided good results, similar to those obtained by selection of initial values with the "mixtools" package.

Applications to Real Data.
In this section, the proposed method is implemented on a real data set obtained from the "Knoema" website [18]. Knoema is one of the most comprehensive sources of global decision-making data. The data set of cancer incidence in 100 countries in a specific year (2016) includes 3168 observations and 32 variables. In the first step, we used the PCA approach to approximate the dimensionality of the data in a linear manner. The results for the first five principal components are summarized in Table 8. As shown in the scree plot in Figure 11, 93% of the overall variance was described by the first two components. We thus consider the first two components in the following. The data were then fitted by the two-component bivariate mixture model, as shown in Figure 12(a), which gives the mean and sigma values for each component. We used the EM algorithm for bivariate data mixtures to determine the density parameters for each component. The initial values were those suggested by the "mixtools" package. The estimates of the model parameters are provided in Table 9 for each component, as well as the BIC. Figure 12(b) shows a plot of log-likelihood versus number of iterations; the log-likelihood levelled off as the number of iterations increased, and the EM method reached convergence.

Then, we used the centres of clusters obtained with different clustering techniques (k-means and hierarchical clustering) to analyse the choice of initial values for the location parameters of the mixture model. First, the k-means method was applied to the reduced data, and then the centres (means) of the clusters were taken as initial values. A visualization of the resulting data for PC1 and PC2 is given in Figure 13(a). The cluster centres (means) for PC1 were 14.19303 and −1.0799, while those for PC2 were −3.22915 and 0.24569. Then, the bivariate GMM was fitted to the reduced data using the cluster means as initial values. The results are summarized in Table 10, and a plot of the fitted mixture model is shown in Figure 13(b). The resulting BIC was 556.9627.
Second, a hierarchical clustering procedure was used to specify the initial values of the location parameters; this was achieved using the "mclust" package. The hierarchical clustering of the reduced data was implemented as a first step. Figure 14(a) provides a visualization of the data obtained for PC1 and PC2: the resulting centres were 9.39997 and −0.58877 for PC1, while those for PC2 were 2.25038 and −0.14095. In the next step, the bivariate GMM was fitted to the reduced data using the cluster means as initial values; the fit is presented in Figure 14(b). Table 11 shows the results of the mixture model, with a BIC of 556.9629.
We observe that the selection of initial values for location parameters using clustering methods provided good results, similar to those obtained when the initial values were selected using the "mixtools" package.

Conclusions
This work aimed to study the applications of PCA in mixture models. First, we discussed the use of the well-known PCA technique for dimension reduction and applied it to high-dimensional data sets. Then, for the reduced data (which contained only two variables), we treated the two variables jointly and fitted a two-component bivariate GMM. We used an EM algorithm to estimate the model parameters. This approach is suitable for large data sets with high dimension and can mitigate the problem of overfitting. We compared three different techniques for the selection of initial values of the location parameters in the mixture model: two clustering methods, k-means and hierarchical clustering, and the default values from the "mixtools" package. With all three techniques, EM convergence was reached and similar BIC values were obtained.
Data Availability

The data were taken from the Knoema website, one of the most comprehensive sources of global decision-making data in the world (world and regional statistics, national data, maps, rankings) (retrieved from https://knoema.com/Atlas).

Disclosure
This paper was a component of a Master's thesis by the first author under the supervision of the second author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 14: Cancer incidence data: (a) cluster criterion and centres (using "mclust") and (b) fitted GMM.