A Smoothed Matrix Multivariate Elliptical Distribution-Based Projection Method for Feature Extraction

Big data has the traits such as “the curse of dimensionality,” high storage cost, and heavy computation burden. Self-representation-based feature extraction methods cannot effectively deal with the image-level structural noise in the data, so how to character a better relationship of reconstruction representation is very important. Recently, sparse representation with smoothed matrix multivariate elliptical distribution (SMED) using structural information to handle low-rank error images caused by illumination or occlusion has been proposed. Based on SMED, we present a new method named SMEDP for feature extraction. SMEDP firstly utilizes SMED to automatically construct an adjacency graph and then obtains an optimal projection matrix by maximizing the ratio of the local scatter matrix and the total scatter matrix in the PCA subspace. Experiments on the COIL-20 object database, ORL face database, and CMU PIE face database prove that SMEDP works well and can achieve considerable visual and recognition performance than the relevant methods.


Introduction
Big data analytics as the premise and the foundation of artificial intelligence can easily cause the problem of "curse of dimensionality" [1], high storage cost, and heavy computation burden [2]. How to deal with such high-dimensional data is always a vital issue in many applications, including machine intelligence, data mining, pattern recognition, and image processing [3][4][5]. Feature extraction can effectively solve the above problems by reducing the size of a data set while preserving relevant information. us, it has become a research hotspot [6][7][8].
Principle component analysis (PCA) [9]and linear discriminant analysis (LDA) [10] are two commonly used linear feature extraction methods. e former projects the highdimensional data by maximizing the variance while the latter maximizing the ratio of the between-class scatter matrix and the within-class scatter matrix. Although they are simple and effective for some practical applications, they may fail to discover the underlying manifold structures of the observed data that are nonlinear.
To discover the intrinsic low-dimensional embedding of the observed data, researchers developed many manifold learning methods. Isometric feature mapping (ISOMAP) [11], Laplacian eigenmaps (LE) [12], and locally linear embedding (LLE) [13] are the most representative methods. ey try to preserve the intrinsic geometrical distribution of the high-dimensional data. ISOMAP takes the pairwise geodesic distance between samples as the feature of manifold structure, and LE builds the neighborhood information graph to maintain the neighborhood relationships among the samples, while LLE believes that local relations characterize the manifold structure of data. Although they are excellent in some respects, there are still some problems, such as sensitivity to noise and "out-of-sample" problem.
us, as improvement methods, neighborhood preserving embedding (NPE) [14], locality preserving projections (LPP) [15], and so on have been proposed. NPE and LPP are the linearized versions of LE and LLE, and they focus on the local information of the data and linearize the nonlinear structure to handle the "out-of-sample" problem. But apparently, the k-nearest graph or the ε-ball graph they use is very sensitive to parameter settings and noise interference [16].
Numerous researchers study it constantly to solve these problems. ey found that the methods mentioned above can be unified with a graph embedding framework [17]. It is critical to construct adjacency graphs in this framework, which can effectively reveal the essential structure of the data. More recently, sparse representation (SR)-based methods are adopted to automatically construct adjacency graphs, since it has the advantages of sparsity, robustness to noise, and datum-adaptive neighborhood. For instance, Qiao et al. [18] developed a sparsity preserving projections (SPP) by using a modified SR-based classification (SRC) [19] to automatically build the graph. And then, the projections are obtained such that the sparse reconstructive relationship of the data can be best preserved. But it is a time-consuming process since SPP aims to solve the L 1 -norm optimization problem. erefore, Yang et al. [20] solved the L 2 -norm optimization problem instead of the L 1 -norm optimization to reduce the running time, and proposed a collaborative representation-based projection (CRP). It benefits from the collaborative representation classifier (CRC) [21] and can obtain a comparable result with SPP. Note that both SPP and CRP choose their neighborhood automatically and contain natural discriminating information, hence can be more practical than some manifold learning methods. e limitations that still exist though SPP and CRP work well in some applications. First, they are all vector-based methods. at means they need to stretch the image matrix into a vector in advance, which will ignore the spatial structural information of the data. Second, for many error images caused by illumination variations or occlusions, the assumption of independence of error variables may not be appropriate in some practical situations. Many SR works attempt to handle the above problems. Yang et al. [22] and Xu et al. [23] studied the illumination variations and occlusions caused error images and observed that the structure of error variables is low rank or approximately low rank [24]. ey pointed out that low-rank constraints can be used to describe the structure noise, whereas it is NP-hard in solving the rank minimization problem [25]. A general way to address such an issue is to substitute the rank function with some tractable surrogate.
To this end, Yang et al. utilized nuclear norm to characterize the illumination variations and occlusions caused error images and developed a nuclear norm-based matrix regression model (NMR) [22]. Based on NMR, Yang et al. [26] explored NMR to construct the graph and gave a NMRbased projection (NMRP). NMRP has shown its promise in feature extraction. Qian et al. [27] integrated error detection and error support into one model and provided a robust nuclear norm regularized regression (RNR). Xie et al. [28] described the structural noise and the mixed noise by the robust nuclear norm and proposed a novel robust matrix regression (RMR). Significantly, these methods all assumed that each pixel of noise follows Gaussian or Laplacian distribution independently, which is not flexible for practical application. erefore, some researchers are devoted to the study of low-rank representation (LRR) [29]. Fang et al. [30] improved the latent low-rank representation (LatLRR) [31], which is an excellent LRR-based feature extraction method. ey proposed an approximate low-rank projection learning (ALP) and an extended ALPL (EALPL) to learn an optimal projection that can extract more discriminant features than LatLRR.
en, combining with rigid regression, they extended LatLRR to the supervised scenario and proposed a supervised ALPL (SALPL). Zhang et al. [32] introduced orthogonal matrix, row sparsity constraint on the projection matrix, and weighted truncated Schatten p-norm (WTSN) [32] to improve the robustness of the LRR model. However, these methods cannot handle the "out-of-sample" problem.
Recently, Qiu et al. [33] used a matrix multivariate elliptical distribution to characterize the error image and developed a sparse representation with smoothed matrix multivariate elliptical distribution (SMED). Compared with Gaussian or Laplacian distribution, a matrix multivariate elliptical distribution is more suitable for describing the structure noise. Besides, it introduces regularization terms to smooth the objective function, which makes the model easily to get a globally optimal solution. In short, SMED considers both of the global structure and the local structure based on sparse and low-rank representation. us, in this paper, we plan to explore SMED to automatically construct an adjacency graph and then learn an optimal projection matrix in the PCA subspace. e proposed method is named a smoothed matrix multivariate elliptical distribution-based projections (SMEDP). Since the projection matrix can be got directly, SMEDP can solve the "out-of-sample" problem well. Figure 1 is the flowchart of our method, and the major contributions of this paper include: (1) in order to avoid the adverse effects produced by neglecting structural information, we propose a method named SMEDP. SMEDP can learn an adaptive graph with a modified SMED. (2) A method of iteratively reweighted least squares is designed to obtain the globally optimal weight matrix, which can be used to calculate the final projection matrix. (3) Comprehensive experiments have been carried out to indicate the superiority of SMEDP when compared with several relevant methods. e rest of the paper is organized as follows. Section 2 gives a brief review of SMED. Section 3 introduces the SMEDP for feature extraction in detail. In Section 4, extensive experiments on three databases are presented to demonstrate the effectiveness of SMEDP. Section 5 draws the conclusions and gives the prospect of the future work.

SMED
e idea of SR-based methods is to derive a series of appropriate representation coefficients. In the following, we give an overview of SMED [33].
Given a dataset of n training image matrices {X 1 Computational Intelligence and Neuroscience and an image matrix Y ∈ R p×q , we can represent Y linearly using training image matrices, that is, where X i c represents the i-th training sample from the cth class, α = [α 1 , α 2 ,. . ., α n ]∈R n is a vector of representation coefficients, and E is the structural noise.
To simplify the formula, linear mapping from R n to R p × q can be defined as follows: (2) (3) is a matrix variate-based linear model. Qiu et al. [33] have studied the error image and found that the correlations between pixels in E can be effectively alleviated by the proposed matrix multivariate elliptical distribution. Subsequently, a sparse representation with smoothed matrix multivariate elliptical distribution (SMED) model is developed to preserve the spatial information about the structural noise.
SMED assumes that E follows the matrix multivariate elliptical distribution and the set of representation coefficients follows the independent Gaussian distribution (l = 2) or Laplacian distribution (l = 1). en, we have Based on the definitions of nuclear norm and l p -norm, (4) is equivalent to: where λ � 2/β, k > 0, and 0 < l ≤ 2. Since both two terms in (5) are nonsmooth, SMED introduces regularization terms to smooth them: where μ is a penalty parameter, I ∈ R q × q is an identity matrix, and 1∈R n is an all ones vector. We obtain the optimal solution α * by solving the (6), and the corresponding reconstruction error of the i-th class is defined as follows: where Y represents the reconstructed image of Y, Y i means the reconstruction of Y in class i, and δ i (α * ) is a vector with only nonzero entries associated with the i-th class.

Algorithm Derivation of SMEDP
It is essential to construct adjacency graphs in graph-based feature extraction methods. Since SMED can better handle with variations of structural noise than several relevant methods, such as SRC and CRC, we explore SMED to automatically construct an adjacency graph and then seek to find a rule to get an optimal projection matrix. Subsequently, a method called smoothed matrix multivariate elliptical distribution-based projection method (SMEDP) is developed.

Constructing an Adjacency Graph.
In this section, we will introduce how to construct an adjacency graph. Firstly, where m = p × q is adopted as the vertices of the graph. Secondly, smoothed matrix regression model based on the modified SMED is presented, and the method of iteratively reweighted least squares is used to solve the model to obtain the optimal weight vector α i * for each X i (i = 1, 2, . . ., n). e edges of the graph are composed of a weight matrix α whose elements are α i * . Subsequently, Algorithm 1 describes each step of the procedure in detail.
e smoothed matrix regression model based on the modified SMED is defined as follows: Computational Intelligence and Neuroscience where the parameters k and l in (6) are set to 1, , . . ., α in ] T , and the element α ij (i≠j) in α i represents the contribution of each X i to reconstructX j . Motivated by SMED, we provide details of using the method of iteratively reweighted least squares to solve Equation (8) [33], and it can be reformulated as follows: where where T=(Σ j = 1 n |α ij | 2 + μ 2 ) -1/2 . With the derived c(α i ) and δ (α i ), we get the derivative of ξ(α i , μ). By setting it to 0, we obtain that: Or equivalently, As a well-known Sylvester equation, the computational complexity of (13) is O(n 3 ) [34][35][36]. We can use MATLAB command lyap to get the weight matrix α = [α 1 , α 2 ,. . ., α n ]∈ R n×n . Algorithm 1 shows the detailed steps for solving α.

Calculating the Projection Matrix.
In this section, we will introduce how to calculate the projection matrix to improve classification performance. Firstly, we utilize PCA to reduce the dimensions of the raw data. Secondly, we seek to find an optimal projection matrix by maximizing the ratio of the local scatter matrix and the total scatter matrix in the PCA subspace. Subsequently, the final projection matrix is obtained based on the optimal projection matrix and the PCAtransformation matrix.
According to the ideas of CRP [20] and NMRP [26], the local scatter matrix is defined as follows: (14) where x i � Vec(X i ) means the vector form of X i . e total scatter matrix is expressed as follows: where x is the mean vector of the training samples and the matrix X is augmented by x. So, the objective function of SMEDP is expressed as follows: Or equivalently, where α β = α+α T -α T α. en, generalized eigenvector decomposition is adopted to solve Equation (17). e optimal projection matrix P can be calculated by where P = [p 1 , p 2 , . . ., p d ] is the first d largest eigenvalues of Equation (18).

SMEDP.
In this subsection, we summarize the details of SMEDP by Algorithm 2.

Experiments
We conduct experiments on COIL-20 object database, ORL face databases, and CMU PIE face databases to evaluate the performance of SMEDP. All the experiments are implemented in the same environment (CPU: Intel i7+ Lenovo, RAM: 8 GB, MATLAB: 2020b).

Experiments on COIL-20 Object
Database. COIL-20 object database [37] contains 20 objects. For each object, 72 images are taken from different view directions. We use a subset of the COIL-20 for our experiments. is subset includes 50 images of 10 objects (each object has 5 images). All images were in grayscale and resized to 40 × 40 pixels. Figures 2 and 3 show some images.
We test the visual performance of SMEDP and compare it with the classical methods, that is, PCA; some manifold learning methods, that is, LLE, ISOMAP, and LPP; and SR-based methods, that is, SPP, CRP, and NMRP. e number of nearest neighbor point k in LEE, ISOMAP, and LPP is, respectively, set to 15, 20, and 20. We select Justin Romberg's l1qc_logbarrier to calculate L1-norm in SPP. Besides, the parameters in SPP, CRP, NMRP, and SMEDP are determined by grid search. In experiments, we start by using PCA to reduce the dimensions of the data to 70, and then adopt the above methods to convert the 70-dimensional representation to a 2-dimensional map. Figure 4 shows the resulting map, where the color of a feature point in the scatterplot represents the class information of data.
From Figure 4, we can find that: (1) PCA confuses the projected data since it has a large variance. (2) LEE and ISOMAP fail to effectively separate data points, which are of different classes. Although they take into account the local structure of the data, the Euclidean distance measure or the geodesic distance measure cannot determine the real local structure. (3) e sparsity properties could enhance the visual performance, so the separation degree of SPP, CRP, and NMRP is better than LEE and ISOMAP. About SPP and CRP, CRP generates a smaller intraclass scatter in our experiment. is result demonstrates that it is the collaborative representation that plays an important role in the effectiveness of SRC. (4) e matrix-based methods usually surpass the vector-based methods (e.g., NMRP and our SMEDP method, which make use of the structural information outperform SPP). Notably, they enhance the performance of intraclass clustering. (5) NMRP can properly embed most classes, but some classes are still mixed. Its performance is inferior to CRP since the nuclear norm used in the matrix regression model is still a loose approximation to the natural rank constraint. (6) Due to the introduced regularization term in matrix regression model, our SMEDP method can achieve better results than others. Since SMEDP ignores the class label information, it also exists some overlapping phenomena among classes. We can try to absorb supervised information into the method to overcome the limitation further.

Experiments on ORL Face
Database. ORL face database [38] consists 40 people. For each people, 400 images are taken from various illumination and facial expressions. All images were in grayscale and resized to 32 × 32 pixels for  Input: a dataset of n training image matrices{X i c }∈R m×n , m � p×q, and an image matrix Y ∈ R p×q Initialize t � 1, R � I ∈ R n×n , T �e ∈ R n , λ > 0, μ > 0 and ε � 1×e −3 while not converged do (1) Update α i(t+1) as follows: Update R t+1 and T t+1 as follows: Computational Intelligence and Neuroscience 5 experiments in this subsection. Figure 5 shows sample images of one person.

Parameter Selection.
Feature extraction methods need to select insensitive parameters [39]. We need to tune three parameters, that is, λ, μc, and ρ, for SMEDP. So, an experiment on parameter sensitivity is conducted in this subsection. Since it is difficult to tune the three parameters simultaneously, the strategy of grid search [39] is employed to select the parameter. e core idea of grid search is fixing one parameter and tuning the others. Specifically, 8 images per class are randomly selected in the ORL database for training and the remaining for testing. e number of principal components is set to 290 during the PCA step. en, we perform SMEDP to convert the 290-dimensional representation to a 110-dimensional map. Finally, we use the NN classifier with cosine distance for classification. e experiment is run 50 times, and the results are reported in Figure 6.
Two cases are used to evaluate the effects of different parameter settings: (1) Fix ρ � 1.1, tune λ and μc from {0.0001, 0.001, 0.01, 0.1, 1}. Figure 6(a) shows the results, and it is obvious that the parameter λ and μc do affect the performance of SMEDP. When λ � 0.01, 0.1, 1 and μc � 0.01, 0.001, 0.0001, the performance of SMEDP changes a little and the recognition rate is more than 90%. When λ � 0.1 and μc � 0.01, SMEDP achieves the best effect; (2) fix λ � 0.  Figure 6(b) gives the average recognition rate curves of SMEDP with different values of ρ. From Figure 6(b), we can find that SMEDP is robust with different ρ, which means parameter ρ has a certain degree of influence on the recognition rate, but the influence is insignificancy. To facilitate the analysis of data, the part of Figure 6(b), where the recognition rate ranges from 87% to 93%, is shown in Figure 6(c). Besides, from the results, we can find that ρ with a value of 1.1 produces better performance. Overall, the above analysis provides a guidance of parameters selection for SMEDP in follow-up experiments.

Classification Results with Different Expressions and
Illuminations. We construct classification experiments to evaluate the robustness of SMEDP with different expressions and illuminations. We also compare SMEDP with PCA, LPP, SPP, CRP, and NMRP. e parameters of each method are set as follows: (1) nearest neighbor point k in LPP is set to 20; (2) Justin Romberg's l1qc_logbarrier is used to calculate L1-norm and λ is set to 1e-3 in SPP; (3) in CRP and NMRP, λ is set to 0.1 and 1, respectively; (4) for SMEDP, the parameters λ, μc, and ρ are set as 0.1, 0.01, and 1.1, respectively. We run the experiments 50 times. In each run, l (� 5, 6, 7, and 8) images are randomly selected from each subject as a training set and the rest 10-l images are treated as a test set. Besides, SPP, CRP, NMRP, and SMEDP all apply a PCA preprocess step to reduce the dimensions of the raw data. For l (�5, 6, 7, and 8), the numbers of principal components are set to 170, 210, 250, and 290, respectively. Finally, the numbers of subspace dimensions by all the methods except SPP are varied from 50 to the numbers of principal components at intervals of 30, 40, 50, and 60. Since SPP is sensitive to large variations in pose, the numbers of subspace dimensions for ORL database are varied from 30 to 50 to maintain a relatively high level of recognition rate. Figure 7   shows the average recognition results of different methods, and Table 1 (the best performing method has been highlighted) gives the average maximal recognition rate of each method.
From the results, we can see that SMEDP performs the best, which verifies that it can preserve the primary discriminant information for classification. Specifically, SMEDP can outperform the others by about 2% improvement on l (� 5, 6) and about 1% improvement on l (� 7, 8). Since NMRP and SMEDP get an adjacent weight matrix by matrix regression, their results are closer. Moreover, CRP, which is based on vector regression, achieves a performance comparable to NMRP in most cases. is is mainly because collaborative representation can improve the discriminating capability to some extent. Since the limitations of PCA, LPP, and SPP mentioned above, the classification performance of them is poor.

Experiments on CMU PIE Face
Database. CMU PIE face database [40] consists 41,368 images of 68 people. All images are taken from different poses, illumination, and facial expressions for each individual. We choose C09 (one of five subsets of near frontal poses, which contains 1632 images) for the test. All images were in grayscale and resized to 64 × 64 pixels. Figure 8 shows sample images of one person.

Classification Results with Different Illuminations.
For evaluating the robustness of SMEDP to the data with different illuminations, we conduct experiments and systematically compare it with the methods mentioned in Section 4.2.2. Specifically, we choose the first l (�7, 9,11,13) images of one person for training and the rest 24-l images for testing. It is worth noting that LPP, SPP, CRP, NMRP, and SMEDP all involve the selection of parameters, and they are adjusted based on grid search. In this subsection, they are set as follows: (1) nearest neighbor point k in LPP is set to 20; (2) in SPP, Justin Romberg's l1qc_logbarrier is used to calculate L1-norm and λ is set to 1e−4; (3) in CRP and NMRP, λ is set to 0.01 and 0.1, respectively; (4) for SMEDP, λ, μc, and ρ are set as 0.01, 0.01, and 1.1, respectively. PCA is first used to preprocess the images. We set the dimensions of the PCA subspace to 300, 250, 200, and 150, which correspond to l (�7, 9,11,13), respectively. Figure 9, Tables 2 and  3 show the results. Note that numbers in bold mean the best results in Tables 2 and 3. From the experimental results, it can be seen that the recognition rate of these methods generally increases with the growth of the feature dimensions. Our proposed SMEDP always obtains the best results, which demonstrates its robustness to illumination variations. One can also note that SMEDP can outperform the others by about 2% improvements on 150 dimensions when l � 7 and 11. Besides, NMRP and CRP are comparable to SMEDP in some cases since the former takes advantage of matrix structure and the latter uses collaborative representation to preserve the primary discriminant information. SPP is superior to PCA and LPP, but inferior to SMEDP and NMRP. e superiority is distinctive because the sparsity properties are considered in the construction of graph.    Figure 10 shows the results. From Figure 10, it is evident that the recognition rates of PCA and LPP are low whether the samples under the occlusions of black blocks or white blocks. It means that they cannot handle the structural noise very well. e proposed SMEDP is superior to some SR-based feature extract  methods (SPP and CRP). But the performance of NMRP is very close to SMEDP when the occlusion level is no more than 30%, that is, because they are all based on matrix regression, which can get more discriminative information from corrupted data. As the level of occlusion rises, NMRP no longer has an advantage, and the smoothed matrix multivariate elliptical distribution plays a role in SMEDP. Overall, this experiment confirms SMEDP is more robust than the others with block occlusions.

Conclusions
A method called smoothed matrix multivariate elliptical distribution-based projections (SMEDP) is proposed in this paper. Specifically, we use a modified SMED to automatically construct an adjacency graph and then learn an optimal projection matrix in the PCA subspace. Since SMED considers both the global structure and the local structure based on sparse and low-rank representation, SMEDP can get more discriminative information from corrupted data. Experimental results on three databases show that SMEDP achieves considerable visual and recognition performance. Specifically, its robustness to illumination variations and block occlusions. However, each method has its own strengths and weaknesses. SMEDP separates graph embedding and projection learning into two independent processes, which makes it difficult to learn the optimal projection matrix once the graph embedding is invalid. Furthermore, although SMEDP contains the natural discriminating information, it ignores the label information of the data, which will still affect the discriminative ability of the model. In the future, we will consider to simultaneously perform graph embedding and projection learning in a unified objective function and introduce the class label information in graph construction to extend SMEDP to the supervised scenario. We believe that the improvement of SMEDP will have better performance for pattern recognition task.