Sparse Principal Component Analysis via Fractional Function Regularization

In this paper, we describe a novel approach to sparse principal component analysis (SPCA) based on a nonconvex, sparsity-inducing fraction penalty function (FP-SPCA). First, SPCA is reformulated as a fraction-penalized regression model. Second, an algorithm for solving the model is proposed and its convergence is guaranteed. Finally, numerical experiments on a synthetic data set show that the FP-SPCA method is more adaptable than SPCA and achieves a better tradeoff between sparsity and explained variance.


Introduction
Principal component analysis (PCA) [1] is widely used in data analysis, dimension reduction, image processing, and feature extraction [2]. It seeks linear combinations of the original variables such that the derived variables capture the maximal variance of the original variables. The principal components of a data matrix can be obtained from its singular value decomposition (SVD). Let X be an n × p matrix, where n and p are the number of observations and the number of variables, respectively. Suppose the columns of X all have zero mean and the singular value decomposition of X is X = UDV⊤; then the columns of the matrix UD are the principal components of X, and the columns of V are the corresponding loadings. The PCA method is popular due to the following two properties: first, the principal components capture the maximum variability of the columns of X, which guarantees minimal information loss; second, the principal components are uncorrelated, so each of them can be used without considering the others.
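For illustration, a minimal NumPy sketch of this SVD route to the principal components and loadings (the data here are arbitrary random numbers; only the centering and the decomposition follow the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # n = 100 observations, p = 5 variables
X = X - X.mean(axis=0)                     # center each column, as assumed above

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^T
pcs = U * d                                # columns of UD: the principal components
loadings = Vt.T                            # columns of V: the corresponding loadings

# The i-th principal component can equivalently be obtained as X @ V[:, i]
assert np.allclose(pcs, X @ loadings)
```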
At the same time, however, PCA has its own drawback: each principal component of X is a linear combination of all p variables, and the elements of the loading vectors (the columns of V) are usually all nonzero, which makes the obtained PCs difficult to interpret.
Based on the abovementioned desirable properties and drawback, some scholars consider it a wise option to keep the dimension-reduction property while reducing the number of explanatory variables involved. For this purpose, one method is to set to zero every loading whose absolute value is smaller than a threshold. The same problem also arises in multiple linear regression, where the response variable is explained by a linear combination of the explanatory variables: which explanatory variables account for most of the information in the response? To answer this question, researchers have proposed several approaches. The first type comprises lasso-based approaches [3][4][5][6]. In addition, Efron et al. [7] proposed LARS, which is sensitive to sample noise, in 2004, and Friedman et al. [8] put forward a pathwise coordinate optimization approach in 2007. The second type is the lasso-bootstrap methods [9, 10]. Other methods include overview articles [11][12][13], the variable inclusion and shrinkage method of Radchenko et al. [14], the nonconcave penalized likelihood method of Fan and Li [15], and the Dantzig selector of Candès and Tao [16]. Among them, the lasso [5] is a well-known variable selection technique which produces an accurate and sparse model. Later, Zou and Hastie [4] proposed the elastic net in 2005, an adaptive method which has the advantages of both the lasso and ridge regression and can be considered a generalization of the two. In 2006, Zou et al. [3] proposed sparse principal component analysis (SPCA), which uses the lasso and ridge regression to produce modified principal components with sparse loadings. They show that PCA can be formulated as a regression-type optimization problem, and sparse loadings are obtained by imposing l1-norm and l2-norm regularization on the regression coefficients of the model. In 2009, Leng and Wang [17] proposed simple adaptive sparse principal component analysis, which uses the adaptive lasso penalty instead of the lasso penalty in SPCA. Moreover, they replace the least-squares objective function in SPCA by a general least-squares objective function, which makes it possible to study many related SPCA estimators under a unified theoretical framework and yields general adaptive sparse principal component analysis. SPCA produces a sparse model with only a few active coefficients, the other coefficients being set to zero, and the resulting model is more interpretable than PCA; in short, SPCA can identify structural information of the data matrix. The idea of making the model coefficients sparse is not new: Jolliffe et al. [18] first proposed a sparse PCA method using l1-norm regularization, which led to a variety of sparsity-inducing methods.
The success of SPCA in producing more interpretable models motivates the method proposed in this paper: a fractional function penalty method for sparse PCA. The idea is to exploit the fractional function [19] to penalize the model coefficients so as to obtain sparse coefficients, which makes the PCA more interpretable.
The rest of the paper is organized as follows. In Section 2, the definition and derivation of principal components are reviewed. Sparse principal components are presented in Section 3. In Section 4, we propose the fractional function penalized principal components.
Numerical experiments and conclusions are formulated in Sections 5 and 6, respectively.
In this paper, unless otherwise specified, a scalar is denoted by a lowercase letter x and a vector by a bold lowercase letter x = (x1, x2, . . . , xn)⊤. A matrix is denoted by a bold capital letter M, and I denotes the identity matrix. The transpose of a real matrix M is denoted by M⊤. The Frobenius norm of a column vector or matrix is denoted by ‖ · ‖, and the spectral norm of a matrix M is denoted by ‖M‖2.

Principal Components and Sparse Principal Components
Let x = (x1, x2, . . . , xp)⊤ be a vector of p random variables; the variances of the p random variables and the covariances between them are of interest. If p is large, a key alternative approach is to seek fewer variables that preserve most of the information carried by the variances and covariances of these p variables. One such technique is principal component analysis (PCA), which concerns the variance of the variables more than their covariance. PCA first looks for a linear function xα1 with maximum variance, where α1 = (α11, α12, . . . , α1p)⊤; that is, α1 maximizes Var(xα) subject to α⊤α = 1. Then, in the same way, one can find a linear function xα2 such that xα2 and xα1 are uncorrelated, so that at the k-th step a linear function xαk can be found which has maximum variance and is uncorrelated with the former k − 1 linear functions. The function xαk is the k-th principal component (PC) [1]. Many estimation problems in array signal processing start from the same maximization problem, although some perturbations may exist [20][21][22]. Having defined the PCs, the remaining question is how to find them. Suppose the vector of random variables x has a known covariance matrix Σ whose (i, j)-th entry is the covariance of xi and xj when i ≠ j and the variance of xi when i = j; then the loading vectors αk are the eigenvectors of Σ. When Σ is unknown, the more realistic approach is to substitute the sample covariance matrix S for Σ.
In fact, because Σ is often unknown, we consider only the case of S in what follows.
To obtain the PCs of the sampling matrix Xn×p, where n and p are the numbers of observations and variables of the random vector x, respectively, consider the coefficient vector α1 ∈ R^p that maximizes the sample variance of Xα1, where α1 is normalized. Let xi denote the i-th column of the sampling matrix X, and assume xi is zero-centered, i = 1, 2, . . . , p.
The problem can be reformulated as maximizing α1⊤X⊤Xα1 subject to α1⊤α1 = 1 (3). Solving (3), we obtain the relationship (X⊤X)α1 = λ1α1; furthermore, the sample variance of Xα1 equals λ1, where λ1 is the largest eigenvalue of the matrix X⊤X and α1 is the eigenvector of X⊤X corresponding to λ1. Xα1 is called the first PC of the sampling matrix X.
In the same way, the k-th PC of X is Xαk and its sample variance is λk, where λk is the k-th largest eigenvalue of X⊤X and αk is the corresponding eigenvector.
In general, if the n × p matrix X is known and xi, the i-th column of X, is zero-centered, the singular value decomposition of X is X = UDV⊤. Then the principal components of the matrix X, denoted by Y, are (UD)r, where (UD)r is the matrix composed of the first r columns of UD and r is the rank of X. Thus, we get yi = Xvi [1, 3], where yi and vi are the i-th columns of the matrices Y and V, respectively.
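The eigenvalue route described above can also be checked numerically; the following sketch (with arbitrary random data) verifies that the first loading vector is the top eigenvector of X⊤X and that the corresponding (unnormalized) sample variance equals the largest eigenvalue, i.e., the squared largest singular value:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)

# Eigen-decomposition route of this section: (X^T X) a_k = lambda_k a_k
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]            # sort eigenvalues in decreasing order
evals, evecs = evals[order], evecs[:, order]

a1 = evecs[:, 0]                           # loading vector of the first PC
pc1 = X @ a1                               # first principal component X a_1

# Its (unnormalized) sample variance equals the largest eigenvalue lambda_1,
# which is also the square of the largest singular value of X.
assert np.isclose(pc1 @ pc1, evals[0])
assert np.isclose(evals[0], np.linalg.svd(X, compute_uv=False)[0] ** 2)
```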
Although principal component analysis is widely used in various technical fields, it has an obvious drawback: each principal component is a linear combination of all the original variables, which often makes the result difficult to interpret. To address this problem, Zou et al. [3] proposed sparse principal component analysis. They first cast PCA as a linear regression problem.

Linear Regression and LASSO.
Consider the matrix Xn×p with n observations of the p variables, let y = (y1, y2, . . . , yn)⊤ be the response vector, and let xj denote the j-th column of X, j = 1, 2, . . . , p. Assume y and the xj have zero means. The problem of evaluating the PCs of X can then be cast as the linear regression problem (5), minβ ‖y − Xβ‖². This model is simple, and the result is easy to obtain in some cases; however, the result is difficult to interpret [3]. Based on model (5), Tibshirani [5] proposed the lasso method (6), minβ ‖y − Xβ‖² + λ‖β‖1, where β = (β1, β2, . . . , βp)⊤ and λ ≥ 0. Model (6) can produce an accurate and sparse result at the same time, which strengthens its interpretability. However, the lasso method has some limitations, the most relevant one being that the number of variables selected by the lasso is limited by the number of observations: given Xn×p with p > n, the lasso can select at most n explanatory variables.
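As a small illustration of this limitation, the sketch below fits a lasso in a p > n setting using scikit-learn; the penalty weight alpha and the data-generating coefficients are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 20, 50                              # p > n regime discussed above
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + 0.1 * rng.standard_normal(n)

# Lasso as in (6); alpha is an arbitrary penalty weight chosen for demonstration.
lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
print(np.sum(lasso.coef_ != 0))            # number of selected variables, at most n = 20
```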
To overcome the abovementioned drawback, Zou and Hastie [4] proposed the elastic net model (7), minβ ‖y − Xβ‖² + λ2‖β‖² + λ1‖β‖1, where β = (β1, β2, . . . , βp)⊤, λ1 ≥ 0, and λ2 ≥ 0. The elastic net model shares the properties of the lasso and ridge regression: when λ2 = 0 it reduces to the lasso, and when λ1 = 0 it becomes ridge regression. When p > n, by choosing a proper λ2, the elastic net model can potentially include all variables.
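The following sketch illustrates the relation between (6) and (7) with scikit-learn. Note that scikit-learn parameterizes the elastic net as alpha·(l1_ratio·‖β‖1 + 0.5·(1 − l1_ratio)·‖β‖²) with the squared loss scaled by 1/(2n), so it matches (7) only up to this reparameterization; l1_ratio = 1 recovers the lasso exactly:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n, p = 30, 60
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(n)

enet = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10000).fit(X, y)   # mix of l1 and l2
lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
pure_l1 = ElasticNet(alpha=0.05, l1_ratio=1.0, max_iter=10000).fit(X, y)

# With l1_ratio = 1 the elastic net coincides with the lasso (the lambda_2 = 0 case of (7)).
assert np.allclose(lasso.coef_, pure_l1.coef_, atol=1e-6)
print(np.sum(lasso.coef_ != 0), np.sum(enet.coef_ != 0))   # the elastic net may keep more variables
```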

Ridge Regression and the Naive Elastic Net.
In models (6) and (7), sparsity is induced entirely by the l1-norm; the l2-norm plays no role in it. Jolliffe et al. [18] proposed the SCoTLASS method (8), which maximizes αi⊤X⊤Xαi subject to αi⊤αi = 1 and ‖αi‖1 ≤ t, where αi = (αi1, αi2, . . . , αip)⊤ and t is a tuning parameter. This method obtains sparse loadings by imposing an l1-norm constraint on principal component analysis (PCA).
The SCoTLASS method has two drawbacks: one is the difficulty of choosing the parameter t, and the other is its high computational cost. Therefore, Zou et al. [3] considered a modified method. They first show that principal component analysis (PCA) can be expressed as a ridge regression problem.
Suppose the singular value decomposition of Xn×p is UDV⊤. Let Ui and Dii be the i-th column of the matrix U and the i-th diagonal element of the matrix D, respectively; then Xvi = UiDii. The ridge estimate of β is given by the following expression:
β̂ridge = arg minβ ‖Xvi − Xβ‖² + λ‖β‖², (9)
and β̂ridge/‖β̂ridge‖ = vi. This conclusion shows the relationship between principal component analysis and the regression method. The term λ‖β‖² in (9) guarantees the uniqueness of β̂ridge when the inverse of X⊤X does not exist. Furthermore, Zou and Hastie [4] added an l1-norm penalty to model (9) and obtained the following optimization problem:
β̂nen = arg minβ ‖Xvi − Xβ‖² + λ‖β‖² + λ1‖β‖1. (10)
They call v̂i = β̂nen/‖β̂nen‖ an approximation of vi, and Xv̂i the i-th approximated principal component. To distinguish this model from model (7), model (10) is called the naive elastic net.
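This relationship is easy to verify numerically. The sketch below (arbitrary random data, λ = 10 chosen arbitrarily) computes the ridge estimate in (9) in closed form and checks that, after normalization, it recovers the i-th loading vector vi:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 6))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
i, lam = 0, 10.0
y_i = U[:, i] * d[i]                                   # the i-th principal component X v_i

# Ridge estimate beta = (X^T X + lam I)^{-1} X^T y_i; after normalization it recovers v_i.
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_i)
v_hat = beta / np.linalg.norm(beta)

assert np.allclose(v_hat, Vt[i]) or np.allclose(v_hat, -Vt[i])   # equal up to sign
```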

Sparse Principal Components Based on PCA.
Based on the general conclusion at the end of Section 2.1, Zou et al. [3] proposed a two-stage analysis: first perform PCA, and then find suitable sparse approximations of the PCs.
Consider the first r principal components of Xn×p. Zou et al. [3] show that, for any λ > 0, the ridge-type criterion minimizing ∑_{i=1}^{n} ‖xi − AB⊤xi‖² + λ ∑_{j=1}^{r} ‖βj‖² subject to A⊤A = Ir recovers the ordinary loadings, in the sense that the minimizing β̂j is proportional to vj.
Then, we can perform SPCA, based on PCA and the regression method, by adding an l1-norm penalty to the regression coefficients to induce sparse loadings. Thus, the following criterion is obtained:
min_{A,B} ∑_{i=1}^{n} ‖xi − AB⊤xi‖² + λ ∑_{j=1}^{r} ‖βj‖² + ∑_{j=1}^{r} λ1,j ‖βj‖1, subject to A⊤A = Ir, (12)
where xi is the i-th column of the matrix X⊤, βj is the j-th column of the matrix B, and r = rank(X).

Fraction Function-Penalized Method.
Recently, Li et al. [19] proposed a fraction penalty function. They study a nonconvex sparsity-promoting penalty pa(·) (13), parameterized by a > 0, that relaxes the l0-norm used in compressed sensing [15, 16, 23, 24] to recover sparse signals; the function is nonconvex and sparsity promoting. They studied the properties of this fractional function, derived a closed-form expression for the optimal solution of the fraction function penalty problem, and proposed an iterative thresholding algorithm to solve it. A series of experiments shows that, compared with the soft thresholding and half thresholding algorithms, the proposed algorithm achieves better sparse signal recovery both with and without measurement noise.
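For reference, the fraction function is taken here to have the form pa(t) = a|t|/(a|t| + 1), which is the form we understand [19] to use (an assumption, since the explicit formula is not reproduced above); the short sketch below evaluates it for several values of a to show how it interpolates between l1-like and l0-like behaviour:

```python
import numpy as np

def fraction_penalty(t, a):
    """Fraction function p_a(t) = a|t| / (a|t| + 1), assumed form of the penalty in [19]."""
    return a * np.abs(t) / (a * np.abs(t) + 1.0)

t = np.linspace(-3, 3, 7)
for a in (0.5, 2.0, 10.0):
    print(a, np.round(fraction_penalty(t, a), 3))
# Large a pushes p_a(t) toward the l0 indicator 1{t != 0}; small a makes p_a(t)/a behave like |t|.
```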
Based on the good performance of the fraction function in sparsity promotion in compressed sensing [23, 25], in this paper we adapt this method to sparse principal component analysis (SPCA), i.e., we replace the l1-norm penalty in SPCA with the fraction function penalty.
Combining (12) and (13), i.e., replacing the l1-norm penalty in (12) with the fraction function penalty (13), we obtain problem (14). We now introduce the following fact for the purpose of solving problem (14).
Owing to A⊤A = Ir, let A⊥ be a matrix such that (A, A⊥) is orthonormal, so that ‖X − XBA⊤‖² splits into a term not involving B plus ∑_{j=1}^{r} ‖Xαj − Xβj‖². Problem (14) can then be transformed into (19), which can be solved by an alternating algorithm: if A is given, B = (β1, β2, . . . , βr) can be estimated columnwise from (20), and if B is given, problem (14) reduces to (21), where xi is the i-th column of the matrix X⊤. According to the Procrustes rotation [26][27][28], the estimate of A can then be obtained from the SVD of the matrix (X⊤X)B. In (20), let A = (α1, α2, . . . , αr) be given by αj = vj for j = 1, 2, . . . , r, where vj is the j-th column of V, the matrix in the singular value decomposition X = UDV⊤. Let yj = Xvj; then (20) can be reformulated as (22), a penalized regression problem in βj alone.

Transformation of the Problem
With suitable definitions of X*, y*j, cj, and β*j (incorporating the ridge term with parameter λ), (22) can be rewritten as (24). The relationship between β*j and βj can be explained as follows. According to the expressions of X*, y*j, cj, and β*j and the definition of Pa(βj), and owing to λ > 0, it is easy to obtain that H(βj) < F(βj), where F(βj) denotes the objective of (22) and G(β*j) that of (24). Both G(β*j) and H(βj) are continuous functions with respect to β*j and βj, respectively, so as λ → 0 we have β*j → βj. So, we conclude that βj can be approximated well by β*j when the positive λ is small enough.

Remark 1.
The reason why we substitute G(β*j) for F(βj) is that model (24) is easier to solve than model (22). Now, we will show that the optimal solution of the minimization problem (29) can be expressed via a thresholding operation. For any cj, μ ∈ (0, +∞) and z ∈ R^p, let Cμ(β*j, z) denote the surrogate function associated with (29).

The Algorithm for the Problem
We have the following results.

Lemma 1.
For any fixed parameters μ, a, cj ∈ (0, +∞), the minimizer of Cμ(β*j, z) with respect to z is given componentwise by the thresholding operator g_{cjμ}(·), where, according to [19], ϕ(s) = arc cos(27cjμa²/ · · · ).
Proof. Cμ(β*j, z) can be expressed as a sum of separable one-dimensional terms; that is to say, minimizing Cμ(β*j, z) over z, for given μ, cj, and a, is equivalent to solving p one-dimensional problems, and the i-th component β*ji minimizes Cμ(β*j, z) if and only if β*ji solves the corresponding one-dimensional fraction-penalized problem. According to Lemma 10 in [19], we complete the proof. Furthermore, we can get Theorem 1.
Theorem 1. Suppose β*j = (β*j1, β*j2, . . . , β*jp)⊤ is an optimal solution to (29), cj is a positive value, and μ satisfies 0 < μ ≤ ‖X*‖2⁻², and then each component β*ji is given by the thresholding operator g_{cjμ}(·) of Lemma 1 acting on the i-th component of β*j + μX*⊤(y*j − X*β*j).
Proof. According to 0 < μ ≤ ‖X*‖2⁻², for β*j ∈ R^p, β*j is a local minimizer of Cμ(β*j, β*j) as long as β*j is a solution to (29). According to Lemma 10 in [19] and Lemma 1, we complete the proof.
With the expression of β*j in Theorem 1, we can obtain an alternating thresholding algorithm for problem (29), that is,
(β*j)^{n+1} = g_{cjμ}((β*j)^n + μX*⊤(y*j − X*(β*j)^n)), (40)
where g_{cjμ}(·) is the thresholding operator in Lemma 1. We call this algorithm the alternating iterative FP-SPCA thresholding algorithm.

Remark 2.
The argument · in g_{cjμ}(·) may be a real number, as in Lemma 1, or a vector of real numbers. If the argument is a vector, g_{cjμ}(·) denotes the vector obtained by applying g_{cjμ}(·) to each component of its argument in turn.
It is well known that the solution to problem (29) depends on the choice of the regularization parameter cj. Before giving a proper choice of cj, we need the following definition.

Definition 1.
The nonincreasing rearrangement of the vector βj ∈ R^p is the vector [βj] ∈ R^p for which [βj]1 ≥ [βj]2 ≥ · · · ≥ [βj]p and there is a permutation π: {1, 2, . . . , p} → {1, 2, . . . , p} with [βj]i = |βjπ(i)| for all i ∈ {1, 2, . . . , p}, where βjπ(i) is the π(i)-th component of the vector βj.
Similar to the choice of the regularization parameter in the FP algorithm (see Scheme 2 in [19]), here we can approximate β*j by the current iterate (β*j)^n, and cj can then be chosen accordingly from the nonincreasing rearrangement of (β*j)^n, where ε is a small positive number, such as 0.1, 0.01, or 0.001, and k is the sparsity level of β*j. In this way, the iterative algorithm is adaptive and free of manual regularization parameter tuning. Noticing that 0 < μ < ‖X*‖2⁻² is required in Theorem 2, we take μ = (1 − ε)/‖X*‖2², where ε ∈ (0, 1). With the abovementioned parameter-choosing method and the iterative formula (40), we have the following alternating iterative FP-SPCA thresholding algorithm. If A is given, then for each j = 1, 2, . . . , r, y*j is the same as in formula (24) and (β*j)^{n+1} is obtained by the iterative formula (40). If B is fixed, then according to (21) it remains to minimize ∑_{i=1}^{n} ‖xi − AB⊤xi‖² = ‖X − XBA⊤‖² with respect to A, where A⊤A = Ir. According to the reduced-rank form of the Procrustes rotation in [26], if the singular value decomposition of the matrix (X⊤X)B is U1D1(V1)⊤, the solution is A = U1(V1)⊤ [3]. The alternating iterative FP-SPCA thresholding algorithm is described in Algorithm 1.
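A schematic sketch of this alternating scheme is given below. It is only a skeleton: the exact FP thresholding operator g_{cjμ}(·) of Lemma 1 and the adaptive choice of cj are not reproduced, so a soft-thresholding placeholder with a fixed threshold is substituted, and the ridge-augmented matrix X* of (24) is replaced by X itself for brevity. The A-step, however, follows the reduced-rank Procrustes rotation described above (SVD of (X⊤X)B, then A = U1(V1)⊤):

```python
import numpy as np

def fp_threshold(z, thr):
    # Placeholder for the componentwise operator g_{c_j mu}(.) of Lemma 1; soft
    # thresholding is substituted here only to keep the skeleton runnable.
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def fp_spca_skeleton(X, r, thr=0.1, n_iter=50):
    """Alternating scheme: thresholded gradient steps on B, Procrustes rotation for A."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:r].T.copy()                     # initialize A with the ordinary PCA loadings
    B = A.copy()
    XtX = X.T @ X
    mu = 1.0 / np.linalg.norm(X, 2) ** 2    # step size respecting mu <= ||X||_2^{-2}
    for _ in range(n_iter):
        for j in range(r):                  # B-step (A fixed): one thresholded step per column
            grad = XtX @ (B[:, j] - A[:, j])
            B[:, j] = fp_threshold(B[:, j] - mu * grad, thr)
        # A-step (B fixed): reduced-rank Procrustes, A = U1 V1^T from the SVD of (X^T X) B.
        U1, _, V1t = np.linalg.svd(XtX @ B, full_matrices=False)
        A = U1 @ V1t
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0
    return B / norms, A                     # normalized sparse loadings and rotation A

X = np.random.default_rng(5).standard_normal((100, 10))
X = X - X.mean(axis=0)
V_sparse, A = fp_spca_skeleton(X, r=2)
print(np.round(V_sparse, 2))
```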
Finally, we discuss the convergence of the iterative FP-SPCA thresholding algorithm to a stationary point of the iteration (40).

Theorem 2.
Let {(β*j)^n} be the sequence generated by the iterative FP-SPCA thresholding algorithm (40) with 0 < μ < ‖X*‖2⁻². Then {(β*j)^n} converges to a stationary point of the iteration (40).
Proof.
The proof is similar to the proofs of Theorem 4.1 in [29] and Lemma 2 in [30], so it is omitted.

Numerical Experiments
In this section, we present numerical results of the FP-SPCA method and compare them with the SPCA method [3]. The simulations are conducted on a personal computer (Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz, 16.0 GB RAM) with the MATLAB R2014a programming platform.

Experimental Preparation.
In order to compare with the SPCA method, we use the synthetic data of [3]. The synthetic data have three hidden factors, V1, V2, and V3, where V1, V2, and the noise term ε are independent. The 10 observable variables are generated from these factors by adding independent noise terms ε_i^j, j = 1, 2, 3 and i = 1, 2, . . . , 10. We use the matrix X = (X1, X2, . . . , X10) to perform PCA, SPCA, and FP-SPCA. The three hidden factors have variances 290, 300, and 283.7875, and the numbers of variables associated with them are 4, 4, and 2, so V1 and V2 are nearly equally important, and both are more important than V3. These facts suggest that we only need to consider the relatively important variables through proper sparse representations. The two derived variables should recover the factors V1 and V2 well using (X1, X2, X3, X4) and (X5, X6, X7, X8), respectively.
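For reproducibility, the sketch below generates data with this structure. The coefficients −0.3 and 0.925 in the definition of V3 are taken from the construction in Zou et al. [3] and are an assumption insofar as they are not written out explicitly above; they are consistent with the stated variance, since 0.09·290 + 0.925²·300 + 1 = 283.7875:

```python
import numpy as np

def make_synthetic(n=1000, seed=0):
    """Three hidden factors as in Zou et al. [3]: V1 ~ N(0, 290), V2 ~ N(0, 300),
    V3 = -0.3*V1 + 0.925*V2 + e with e ~ N(0, 1), giving Var(V3) = 283.7875.
    X1..X4 load on V1, X5..X8 on V2, X9..X10 on V3, each with independent N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    V1 = rng.normal(0.0, np.sqrt(290.0), n)
    V2 = rng.normal(0.0, np.sqrt(300.0), n)
    V3 = -0.3 * V1 + 0.925 * V2 + rng.normal(0.0, 1.0, n)
    factors = [V1] * 4 + [V2] * 4 + [V3] * 2
    X = np.column_stack([f + rng.normal(0.0, 1.0, n) for f in factors])
    return X - X.mean(axis=0)

X = make_synthetic()
print(np.round(np.var(X, axis=0), 1))       # roughly 291 (x4), 301 (x4), 284.8 (x2)
```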
The SPCA and FP-SPCA computations were carried out using the method proposed by Zou et al. [3] and the fractional penalty method proposed in [19], respectively. We compared the performance of the two methods on the synthetic data matrix X and summarize the results in the two cases below.

Parameter Selection and Results of the Experiments
Case 1. In this situation, the method proposed by Zou et al. [3], called SPCA, and our proposed method, called FP-SPCA, perform equally well for the almost equally important hidden factors. That is, for most of the generated data matrices, SPCA and FP-SPCA give the same loadings but different adjusted variances. Table 1 reports a numerical result of PCA, SPCA, and FP-SPCA and summarizes the loadings and adjusted variances. The parameters are (λ, λ1,j, ε) = (0.01, 3.7, 0.01) in SPCA and (λ, a, ε, ε) = (0.01, 1.2, 0.01, 0.001) in FP-SPCA. We can see from Table 1 that both SPCA and FP-SPCA give the same orthogonal loadings of PC1 and PC2 after normalization, but the adjusted variance obtained from FP-SPCA is larger than that from SPCA. That is to say, PC1 and PC2 obtained by the FP-SPCA method carry more information than PC1 and PC2 obtained by SPCA. Numerical simulations using other generated data matrices give the same result.
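For completeness, the adjusted variance reported in the tables is assumed to be the quantity defined by Zou et al. [3] via the QR decomposition of the (possibly correlated) sparse components; a sketch of that computation, with arbitrary example data and loadings, is given below:

```python
import numpy as np

def adjusted_variance(X, V_sparse):
    """Adjusted variance of sparse components Y = X V, computed via the QR
    decomposition Y = QR so that shared variance is not counted twice;
    returned as a fraction of the total variance of the centered data."""
    Y = X @ V_sparse
    _, R = np.linalg.qr(Y)
    return np.diag(R) ** 2 / np.sum(X ** 2)

X = np.random.default_rng(6).standard_normal((200, 10))
X = X - X.mean(axis=0)
V = np.zeros((10, 2))
V[:4, 0] = 0.5                              # two sparse, unit-norm loading vectors
V[4:8, 1] = 0.5
print(np.round(100 * adjusted_variance(X, V), 2))   # adjusted variance in percent (AV)
```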
Case 2. For some of the generated matrices, according to our experimental results, FP-SPCA works but SPCA does not work well. That is, by adjusting the parameters λ and λ1,j in the SPCA method, we cannot obtain sparse loadings of PC1 or PC2. The parameters and the corresponding numerical results are reported in Table 2, where the capital letter N indicates that no sparse loadings of PC1 and PC2 were obtained for the parameter values listed above and to the left of the N, and NaN (not a number) is interpreted in the same way. We can also see from Table 2 that the loadings of PC1 or PC2 are meaningless when λ1,j is between 3.5 and 4.0, and numerical simulations show the same meaningless result when λ1,j is increased considerably further. Table 3 gives the loadings of PC1 and PC2 obtained with the SPCA method when (λ1,j, λ, ε) = (1.8, 0.5, 0.01) and with the FP-SPCA method when (λ, a, ε, ε) = (0.001, 1.3, 0.01, 0.001). In a word, the FP-SPCA method can find the sparse loadings of PC1 and PC2, while the SPCA method finds it difficult, at best, to obtain a sparse result in this case.
From Cases 1 and 2, it can be seen that the FP-SPCA method is more adaptable than the SPCA method. The reason FP-SPCA is preferable is that the parameter a in its penalty function makes it more flexible and easier to tune in order to obtain the desired result.

Conclusions
In this paper, the FP-SPCA method is proposed based on the SPCA method [3, 4], with a fractional penalty function [19] replacing the l1-norm in the SPCA method. Numerical simulations show that the proposed FP-SPCA method is more adaptable and flexible than the SPCA method. However, FP-SPCA has its limitations: during the numerical simulations, we found a few generated matrices for which FP-SPCA fails to produce sparse loadings of the principal components and the SPCA method does not work either. How to deal with this problem, and how to compare the proposed method with other existing penalty methods, are remaining questions worth investigating further.

Data Availability
The data used to support the findings of this study are available from the corresponding author on request.

Conflicts of Interest
The authors declare that there are no known conflicts of interest associated with this publication.

Table 3: Loadings and variance from the numerical results of the PCA, SPCA, and FP-SPCA methods in Case 2, where the FP-SPCA method works better than SPCA in obtaining sparse loadings. AV† (%) denotes the adjusted variance.