A Folded Concave Penalty Regularized Subspace Clustering Method to Integrate Affinity and Clustering

Most sparse or low-rank-based subspace clustering methods divide the processes of obtaining the affinity matrix and the final clustering result into two independent steps. We propose to integrate the affinity matrix and the data labels into one minimization model, so that they can interact with and promote each other and finally improve the clustering performance. Furthermore, the block diagonal structure of the representation matrix is the most preferred for subspace clustering. We define a folded concave penalty (FCP) based norm to approximate the rank function and apply it to the combination of the label matrix and the representation vector. This FCP-based regularization term can enforce the block diagonal structure of the representation matrix effectively. We minimize the difference of the $\ell_1$ norm and the $\ell_2$ norm of the label vector to make it have only one nonzero element, since each data point belongs to only one subspace. The index of that nonzero element is associated with the subspace from which the data point comes and can be determined by a variant of graph Laplacian regularization. We conduct experiments on several popular datasets. The results show that our method achieves better clustering results than several state-of-the-art methods.


Introduction
In machine learning and data mining, clustering is one of the most important topics in unsupervised learning. Given a set of data points, clustering aims to partition these points into several groups, each of which is called a cluster, such that data points in the same group have higher similarities than those in different groups. In the past decades, an enormous number of clustering algorithms have been proposed, such as K-means and its variants [1,2], spectral clustering [3], nonnegative matrix factorization (NMF) [4], and subspace clustering [5]. In this paper, we focus on methods for subspace clustering.
We live in an era of high-dimensional data explosion. However, such high-dimensional data contain redundant information, so their intrinsic dimension is often much smaller. In many computer vision and machine learning problems, one often assumes that the data are drawn from a union of multiple low-dimensional linear subspaces. Thus, subspace clustering of such data has been studied extensively. Nowadays, many works have pointed out that the linear subspace assumption may not always hold for real high-dimensional data, which may be better modeled by nonlinear manifolds [6-8].
The common strategy employed in these works is to use the logarithm mapping to project the data onto the tangent space at each data point, which is a linear space, so that all strategies applicable to linear spaces can be used. These methods ultimately still need to establish a model in a linear space. Therefore, research on modeling methods in linear spaces remains essential. The existing methods for subspace clustering can be roughly divided into four groups: statistical-learning-based methods, factorization-based methods, algebra-based methods, and sparsity-based methods (e.g., Sparse Subspace Clustering (SSC) [9] and Low-Rank Representation (LRR) [10]). In this paper, we focus on the fourth group.
In recent years, many methods based on deep learning [11] have been proposed. These methods have obtained extremely competitive results in many image and computer vision tasks. A hot topic of discussion in computational imaging is whether it is time to discard the classic methods and fully replace them with deep learning-based methods. On the one hand, a prerequisite for deep learning-based methods is a huge number of samples; however, there exist situations in which no data or only a small amount of data can be obtained, and in such cases knowledge-based modeling methods are more suitable. On the other hand, classic methods have a clear structure and theoretical guarantees.
They are based on knowledge of the problem we are trying to solve rather than on seeking the best performance by intuitively choosing architectures or by trial and error. To the best of our knowledge, most existing learning-based clustering methods only learn autoencoder features [12]; the final clustering is still obtained by applying k-means or SSC. Another challenge for deep learning is that a clustering model trained on some datasets may not be effective on other datasets, whereas classic methods have better generalization ability.
Let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m\times n}$ $(m < n)$ be a collection of $m$-dimensional data drawn from the union of linear subspaces $\{S_i\}_{i=1}^{c}$. Each $S_i$ includes $n_i$ data points of $X$. The task of subspace clustering is to cluster the data in $X$ according to those independent subspaces. For each $j \in \{1, 2, \ldots, n\}$, consider the data point $x_j$ as a linear combination of all data in $X$, i.e., $x_j = Xz_j$, where $z_j = (Z_{kj})_{k=1}^{n}$ is called a representation vector; we assume $Z_{jj} = 0$ to avoid the trivial solution. Suppose $x_j \in S_i$ $(i \in \{1, 2, \ldots, c\})$; then only the coefficients associated with the data from $S_i$ are nonzero, and the others are all zero. Letting $A_j = \{k : Z_{kj} \neq 0\}$, we want to find the solution to the following problem:

$$\arg\min_{z_j} \|z_j\|_0 \quad \text{s.t. } x_j = Xz_j,\ Z_{jj} = 0, \qquad (1)$$

which means to find the oracle solution [9]. According to the above statement, there must be at least one matrix $Z = (Z_{ij})_{n\times n} \in \mathbb{R}^{n\times n}$ satisfying

$$X = XZ, \quad \text{s.t. } Z_{jj} = 0\ (j = 1, 2, \ldots, n). \qquad (2)$$

Equation (2) actually has infinitely many solutions. Any solution is called a representation matrix. When handling 2D data, with each data point being a matrix, [13] shows that the strategy of converting all data to vectors severely damages the inherent structural information and correlations of the original data. They proposed to learn a 2D projection matrix such that the most expressive structural information is retained in the spanned subspace. In our work, we still consider vectorized data for simplicity, since our method is also suitable for the projected vectorized data, and it is not within the scope of this paper to discuss how to deal with 2D data.
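To make the self-expressiveness idea concrete, the following sketch builds a representation matrix column by column using a lasso relaxation of problem (1); the two synthetic subspaces, the lasso solver, and the regularization weight are illustrative assumptions, not the method proposed in this paper.

```python
# Sketch of the self-expressive model x_j = X z_j with Z_jj = 0 (lasso relaxation of (1)).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Two independent 2-dimensional subspaces of R^10, 20 points each.
U1, U2 = rng.standard_normal((10, 2)), rng.standard_normal((10, 2))
X = np.hstack([U1 @ rng.standard_normal((2, 20)), U2 @ rng.standard_normal((2, 20))])
X /= np.linalg.norm(X, axis=0)                     # normalize columns

n = X.shape[1]
Z = np.zeros((n, n))
for j in range(n):
    others = np.delete(np.arange(n), j)            # enforce Z_jj = 0 by excluding x_j
    lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
    lasso.fit(X[:, others], X[:, j])
    Z[others, j] = lasso.coef_

# Nonzero coefficients should concentrate on columns from the same subspace.
mass_same = np.abs(Z[:20, :20]).sum() + np.abs(Z[20:, 20:]).sum()
print("fraction of coefficient mass within the true subspaces:",
      mass_same / np.abs(Z).sum())
```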
A good representation matrix should have the properties of sparsity between subspaces and density within a subspace, which means that each query data point $x_j$ is represented by a small number of subspaces, and once a subspace is selected, more data from that subspace are favored in the representation. The work of [14] introduced a family of new regression models and estimated a representation model while accounting for discriminativeness between clusters. Here we achieve this property of the representation matrix by forcing it to have block diagonal structure, since the block diagonal structure of $Z$ directly induces a segmentation of the data (each block corresponds to a cluster). Reference [15] stated that, under ideal conditions, i.e., the data are noiseless and the subspaces are independent (none of the subspaces can be represented by the other subspaces), as long as the regularizer for $Z$ satisfies the EBD (Enforced Block Diagonal) conditions, the optimal representation matrix is block diagonal. However, when $X$ contains noise, which is inevitable in any application, $Z$ may not be a strictly block diagonal matrix. Therefore, it is difficult to decide how large the representation coefficient between two data points should be to group them into the same subspace. Many previous subspace clustering methods first find a matrix $Z$ and then, using $(|Z| + |Z^T|)/2$ as an affinity matrix, apply spectral clustering such as normalized cuts to obtain the subspace clustering result, just as the classic SSC and LRR and many later variants have done. All these methods divide the solution of the representation matrix and the final clustering result into two independent steps. We propose to integrate the affinity matrix and the data labels into a single model so that they interact with and promote each other and finally improve the clustering performance.
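For reference, this is the conventional two-step pipeline described above: symmetrize a given representation matrix into an affinity and hand it to spectral clustering. The toy block-structured Z and the scikit-learn call are illustrative.

```python
# Sketch of the two-step pipeline: representation matrix -> affinity -> spectral clustering.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
# Toy representation matrix with an (approximately) block diagonal structure:
# points 0-19 and 20-39 mostly represent each other within their own block.
Z = np.zeros((40, 40))
Z[:20, :20] = 0.5 * rng.random((20, 20))
Z[20:, 20:] = 0.5 * rng.random((20, 20))
Z += 0.01 * rng.random((40, 40))                  # small cross-subspace noise
np.fill_diagonal(Z, 0.0)                          # Z_jj = 0

W = (np.abs(Z) + np.abs(Z.T)) / 2                 # symmetric affinity matrix
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)
print(labels)
```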
Furthermore, in terms of the penalty on the representation matrix $Z$, most methods based on SSC and LRR seek the sparsest or lowest-rank $Z$. In fact, for subspace clustering, it is more important for $Z$ to have block diagonal structure than to be the sparsest or lowest-rank matrix. We give a low-rank-based regularization term to enforce the block diagonal structure of $Z$ directly. This regularization term is applied to a combination of the semisupervised label matrix and the representation vector $z_j$ rather than to $Z$ itself. Note that the rank function of a matrix is the $\ell_0$-norm of its singular value vector, and solving such an $\ell_0$ minimization problem is usually difficult or even NP-hard. The standard approach is to replace the rank function with the convex nuclear norm [16,17]. It has been proved that, under certain incoherence assumptions on the singular values of the matrix, solving the convex nuclear norm regularized problem leads to a near-optimal low-rank solution [18]. On the other hand, it has been pointed out that the nuclear norm is not an accurate rank approximation. Recent works develop more accurate approximations to the rank function, such as the log-determinant rank approximation [19,20], which significantly improves the learning performance. In addition, the nuclear norm of a matrix is the $\ell_1$-norm of its singular value vector. Fan and Li pointed out that the $\ell_1$-norm penalty overpenalizes large entries of vectors; therefore, the nuclear norm overpenalizes large singular values. Moreover, they proposed three criteria for good penalty functions [21]: unbiasedness, sparsity, and continuity at the origin. The $\ell_1$-norm satisfies both sparsity and continuity, but it is biased.
Recently, nonconvex penalties have drawn more and more attention in sparse learning problems, because one of the possible solutions of a nonconvex penalization problem can overcome the drawbacks of the unique solution of a convex penalization problem. As a common practice, the $\ell_1$-norm can be replaced by the $\ell_q$-norm with $0 < q < 1$ if a sparser solution is expected [22,23]. However, there is no theoretical guarantee that the $\ell_q$-norm reduces the modeling bias of the $\ell_1$-norm. Based on the three properties above, Fan and Li proposed a new penalty function called the smoothly clipped absolute deviation (SCAD) penalty [21]. Later, Zhang proved that the so-called minimax concave penalty (MCP) [24] also possesses these three properties and achieves better performance than SCAD. Both SCAD and MCP are nonconvex and nearly unbiased. Extensive experiments [21,24-30] have demonstrated the superiority of SCAD and MCP over the $\ell_1$-norm penalty. Furthermore, folded concave penalty (FCP) methods, including SCAD and MCP, have been shown to have the strong oracle property for high-dimensional sparse estimation. As described above, we do expect to obtain the oracle solution of problem (1).
In this paper, we present an FCP regularized subspace clustering model, which we refer to as model (3). We define a subspace-dependent label matrix $G = [g_1, g_2, \ldots, g_n] \in \mathbb{R}^{c\times n}$, where $g_j = (g_{1j}, g_{2j}, \ldots, g_{cj})^T \in \mathbb{R}^c$ $(j = 1, 2, \ldots, n)$ is a label vector for the data point $x_j$; here, $c$ is the number of subspaces and $n$ is the number of data points. Since we do not know $c$, the dimension of $g_j$ only needs to be set larger than $c$ and at most equal to $\min\{m, n\}$. Here, we still use $c$ to denote the dimension of $g_j$. In our work, we mainly discuss the semisupervised case; i.e., some data already have labels and the others do not. The index sets for all data and for the labeled data are denoted as $P$ and $S$, respectively; then $P\backslash S$ represents the index set for the unlabeled data. For every $j \in S$, we assume $g_{ij} = 1$ when $x_j$ is in the $i$th class and $g_{ij} = 0$ otherwise. The label vectors for $j \in P\backslash S$ need to be solved. Encouraged by the good properties of the FCP penalty, we define an FCP-based norm $\|\cdot\|_{P_{\lambda,a}}$ to approximate the rank function. The last term of our model (3) enforces rank minimization of the matrix product $G\,\mathrm{Diag}(z_j)$. The parameters $\alpha$, $\beta$, and $c$ balance the roles of the three regularization terms: the larger a parameter, the more important the corresponding term in the minimization problem.
We give three regularization terms. The first one minimizes the difference of the $\ell_1$ norm and the $\ell_2$ norm of the label vector $g_j$ so that it has only one nonzero element. The clustering result for each data point is induced by the index of that nonzero element. The second one is a variant of the graph Laplacian regularization, which captures the nonlinear structures of the data. This term makes data from the same subspace have the same label as much as possible. Therefore, the nonzero element of each label vector can properly correspond to the subspace from which the data point is drawn. We apply a low-rank constraint to the combination of the label matrix and the representation vector, as indicated in the third regularizer, to enforce $Z$ to better satisfy the block diagonal structure. In our work, the clustering result, namely the label of each data point, is directly solved from the model rather than obtained through spectral clustering methods. The labels and the representation matrix are contained in two regularization terms. The first part of the third term of (3) makes the nonzero element of each label vector accurately correspond to the subspace from which each query data point comes if the representation matrix $Z$ has the block diagonal structure. Vice versa, if each label has only one nonzero element associated with the correct subspace, the second part of the third term and the fourth term make $Z$ better satisfy the block diagonal structure. Therefore, the labels and the representation matrix interact and promote each other during the whole computing process. The problem can be solved within the Alternating Direction Method of Multipliers (ADMM) framework. The resulting nonconvex FCP minimization problem can be solved by the Local Linear Approximation (LLA) method [23], which minimizes a surrogate function that upper bounds the objective function. The surrogate function is constructed by linearizing the penalty function, and LLA is guaranteed to decrease the objective function value in each iteration. Due to the nonconvexity of the FCP problem, there are usually multiple local solutions, and the oracle property is established only for one of the local solutions. Breheny and Huang [25] have shown that, with a Lasso-based initialization, LLA can avoid local maxima and saddle points, and with high probability an oracle solution can be obtained by one-step LLA. In addition, once the oracle estimator is obtained, the LLA algorithm converges; namely, it produces the same estimator in the next iteration. We use the result of singular value thresholding as the initialization, which corresponds to the Lasso-based solution when applied to the nuclear norm minimization problem.
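The sketch below shows what one-step LLA with (weighted) singular value thresholding can look like for a generic proximal subproblem involving the FCP matrix norm; the subproblem form, the MCP derivative, and all parameter values are illustrative assumptions rather than the paper's exact update.

```python
# One-step LLA for a proximal subproblem of the form
#   min_J ||J||_{P_{lambda,a}} + (mu/2) ||J - M||_F^2   (illustrative form, not (3) itself).
import numpy as np

def mcp_grad(sigma, lam, a):
    """Derivative of the MCP penalty at nonnegative sigma (zero beyond a*lam)."""
    return np.maximum(lam - sigma / a, 0.0)

def weighted_svt(M, weights, mu):
    """Shrink each singular value of M by its own weight / mu."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - weights / mu, 0.0)) @ Vt

def lla_one_step(M, lam=1.0, a=5.0, mu=1.0):
    # Initialization: plain singular value thresholding (the Lasso/nuclear-norm solution).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    J0 = (U * np.maximum(s - lam / mu, 0.0)) @ Vt
    # One LLA step: linearize the concave penalty at J0, then solve the weighted problem.
    s0 = np.linalg.svd(J0, compute_uv=False)
    return weighted_svt(M, mcp_grad(s0, lam, a), mu)

M = np.random.default_rng(2).standard_normal((8, 12))
print(np.linalg.matrix_rank(lla_one_step(M, lam=1.5)))
```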
Some notations used in this work are defined as follows.
$\|X\|_* = \sum_i \sigma_i(X)$ is the nuclear norm, where $\sigma_i(X)$ is the $i$th singular value of $X$. $|X|$ represents the element-wise absolute value of $X$. $X^T$ denotes the transpose of $X$. $\mathrm{Diag}(X)$ denotes the diagonal matrix with diagonal entries $X_{ii}$. $\mathrm{diag}(X)$ is the vector with entries $X_{ii}$. $\mathrm{tr}(X)$ is the trace of the square matrix $X$. $I$ and $0$ denote the identity matrix and the zero matrix. $X \ge 0$ means that all entries of $X$ are nonnegative. The remainder of this paper is organized as follows. In Section 2.1, we first give the three regularization terms; then the low-rank-based semisupervised subspace clustering model is proposed. In Section 2.2, we define an MCP- and SCAD-based norm to approximate the rank function, which yields an FCP-based nonconvex minimization model for subspace clustering. In Section 3, to solve the proposed model, we present an algorithm that combines several approaches such as ADMM, LLA, and weighted singular value thresholding. In Section 4, we conduct a series of experiments with several datasets to demonstrate the superiority of our method. In Section 5, we conclude this paper with a summary and future plans.

The Low-Rank Model Integrating Affinity and Clustering.
We define $G_i = [g_{j_1}, g_{j_2}, \ldots, g_{j_{n_i}}] \in \mathbb{R}^{c\times n_i}$ as the submatrix of $G$ composed of the label vectors of all the data from subspace $S_i$. The ideal $G_i$ is such that the elements in its $i$th row are all one and those in the other rows are all zero. With the expression $x_j = Xz_j$ $(j = 1, 2, \ldots, n)$ and the ideal structure of $G_i$, we can automatically seek the block diagonal structure of $Z$ by minimizing the rank of the matrix product $G\,\mathrm{Diag}(z_j)$ (refer to [31,32] for the detailed interpretation).
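A toy check of this rank argument, with a hypothetical one-hot label matrix G: the rank of $G\,\mathrm{Diag}(z_j)$ equals one exactly when the support of $z_j$ stays inside the subspace carrying the same label as $x_j$.

```python
# Rank of G Diag(z_j) drops to 1 when the support of z_j is confined to one labeled subspace.
import numpy as np

c, n = 3, 9
labels = np.repeat(np.arange(c), 3)               # three subspaces, three points each
G = np.eye(c)[:, labels]                          # ideal one-hot label matrix, shape (c, n)

z_good = np.zeros(n); z_good[[0, 1, 2]] = [0.4, 0.3, 0.3]   # support inside subspace 0
z_bad = np.zeros(n);  z_bad[[0, 1, 4]] = [0.4, 0.3, 0.3]    # one coefficient leaks to subspace 1

for name, z in [("block diagonal support", z_good), ("leaking support", z_bad)]:
    print(name, "-> rank(G Diag(z)) =", np.linalg.matrix_rank(G @ np.diag(z)))
```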
This sparsity between subspaces and density within a subspace implied by the block diagonal structure are preferred for the aim of subspace segmentation. So, we first give a minimization model as below. Here, for every $j \in P\backslash S$, we set $I_j = \{i \mid g_{ij} \neq 0,\ i = 1, 2, \ldots, c\}$, and $\#I_j$ denotes the number of elements in $I_j$.

For every $j \in P\backslash S$, $\#I_j = 1$ and $I_j$ corresponds to the subspace from which the data point $x_j$ is drawn.

It is rational for the label vector $g_j$ to have only one nonzero element since, in real applications, each data point belongs to only one subspace. For example, one face image cannot belong to two persons. Note that $g_j$ has only one nonzero element if and only if $\|g_j\|_1 - \|g_j\|_2 = 0$. Under the constraint $g_j\,(j \in P\backslash S) \in \Delta_+$, its unique nonzero element has the value one. For simplicity of computation, we relax $\|g_j\|_1 - \|g_j\|_2 = 0$ by minimizing the difference of the $\ell_1$ norm and the $\ell_2$ norm of $g_j$ and using it as a regularization term. The problem becomes a minimization over $g_j\,(j \in P\backslash S) \in \Delta_+$ and $Z$ such that, for every $j \in P\backslash S$, $I_j$ corresponds to the subspace from which the data point $x_j$ is drawn. An illustration of this $\ell_1 - \ell_2$ criterion is given below.
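A small numerical illustration of the $\ell_1 - \ell_2$ criterion on the simplex (the vectors below are illustrative):

```python
# ||g||_1 - ||g||_2 vanishes exactly when g has (at most) one nonzero entry,
# which is why it pushes each label vector toward a single active class.
import numpy as np

one_hot = np.array([0.0, 1.0, 0.0, 0.0])
spread  = np.array([0.25, 0.25, 0.25, 0.25])          # both lie on the simplex Delta_+
for g in (one_hot, spread):
    print(g, "->", np.linalg.norm(g, 1) - np.linalg.norm(g, 2))
```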

Now the problem is how to ensure that the nonzero element of $g_j$ properly corresponds to the subspace from which the query data point comes. To do this, we use the label-consistency term (6), which weights the label differences $\|g_i - g_j\|_2^2$ by the similarities $|Z_{ij}| + |Z_{ji}|$; its second part is additionally weighted by a parameter $p$ for pairs of labeled data. We denote the elements of $Z$ by $Z_{ij}$ $(i, j \in \{1, 2, \ldots, n\})$. The quantity $|Z_{ij}| + |Z_{ji}|$ describes the similarity between $x_i$ and $x_j$. We analyze the following cases:

(1) For the case $i \in P\backslash S,\ j \in S$, or $i \in S,\ j \in P\backslash S$, one of $x_i$ and $x_j$ already has a label. Taking $i \in P\backslash S,\ j \in S$ for instance: since $x_j$ already has a label, $x_i$ should belong to the same subspace as $x_j$ whenever $|Z_{ij}| + |Z_{ji}|$ is large enough. This can be achieved by the first part of formula (6). In this case, the index of the nonzero element of the unknown label $g_i$ can be properly correlated with the subspace to which the data point belongs.

(2) When neither $x_i$ nor $x_j$ has a label, the first part of (6) also works. Since both $g_i$ and $g_j$ have only one nonzero element, the indexes of their nonzero elements must coincide whenever $|Z_{ij}| + |Z_{ji}|$ is large enough. Therefore, the data points $x_i$ and $x_j$ can be clustered into the same subspace, although which subspace it is cannot yet be decided. Figure 1 shows that labels can be properly propagated from labeled data to unlabeled data, so that the index of the nonzero element of each label corresponds to the accurate subspace even in this case.

(3) If both $x_i$ and $x_j$ have labels, $g_i$ and $g_j$ do not need to be solved, and only the representation coefficients are unknown. If $x_i$ and $x_j$ are in the same subspace, namely $\|g_i - g_j\|_2^2 = 0$, $|Z_{ij}|$ and $|Z_{ji}|$ are allowed to be large. If $x_i$ and $x_j$ are not in the same subspace, i.e., $\|g_i - g_j\|_2^2 = 2$, $|Z_{ij}|$ and $|Z_{ji}|$ had better be zero. To this end, we use the second part of (6), in which the parameter $p$ needs to take a large value.

Using all the data points $x_j$ $(j = 1, 2, \ldots, n)$ as nodes, $x_i$ and $x_j$ are connected when $|Z_{ij}| + |Z_{ji}|$ is large enough and unconnected when $|Z_{ij}| + |Z_{ji}|$ is small; we thus obtain a graph over all data. Since, for each data point, its most similar data point must come from the same subspace, each node must have a connection with at least one node in the same subspace. When there is at least one labeled data point in each subspace, the labels can be propagated from the labeled nodes to the unlabeled nodes through formula (6). Therefore, connected nodes can share the same label. This is illustrated in Figure 1, in which there are only two classes and two labeled data points (one for each subspace). A small numerical sketch of such a label-consistency term follows.
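The sketch below evaluates a label-consistency term of this kind for a toy Z and G; the exact weighting between labeled and unlabeled pairs in formula (6) is not reproduced, so the function is only indicative.

```python
# Evaluating a label-consistency term of the form
#   sum_{i,j} (|Z_ij| + |Z_ji|) * ||g_i - g_j||_2^2,
# which is small when strongly connected points share the same label.
import numpy as np

def label_consistency(Z, G):
    W = np.abs(Z) + np.abs(Z.T)                       # pairwise similarity from Z
    sq_norms = (G ** 2).sum(axis=0)                   # ||g_j||^2 for every column
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2 * G.T @ G   # ||g_i - g_j||^2
    return float((W * D2).sum())

G = np.eye(2)[:, [0, 0, 1, 1]]                        # two classes, one-hot labels
Z_consistent = np.array([[0, .5, 0, 0], [.5, 0, 0, 0], [0, 0, 0, .5], [0, 0, .5, 0]])
Z_mixed      = np.array([[0, 0, .5, 0], [0, 0, 0, .5], [.5, 0, 0, 0], [0, .5, 0, 0]])
print(label_consistency(Z_consistent, G), label_consistency(Z_mixed, G))  # 0.0 vs. 8.0
```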
So far, we obtain the complete low-rank-based semisupervised subspace clustering model (7). The minimization problem (7) is difficult to solve due to the combinatorial nature of the rank function. The standard approach is to replace the rank function with the nuclear norm. Considering that the matrix nuclear norm is prone to overpenalize large singular values and thus usually leads to a biased estimation, and given the advantages of FCP over the $\ell_1$ norm described in many works, we utilize the FCP of the singular value vector of a matrix to approximate the rank function; thus, model (7) is transformed into our model (3) presented in the Introduction, where the matrix norm $\|\cdot\|_{P_{\lambda,a}}$ is defined as

$$\|I\|_{P_{\lambda,a}} = \sum_{k=1}^{c} P_{\lambda,a}\big(\sigma_k(I)\big), \qquad (8)$$

where $\sigma_k(I)$ is the $k$th singular value of the matrix $I \in \mathbb{R}^{c\times n}$ with $c \le n$. We choose the function $P_{\lambda,a}(\cdot)$ as $P^{\mathrm{MCP}}_{\lambda,a}(\cdot)$ or $P^{\mathrm{SCAD}}_{\lambda,a}(\cdot)$, defined as follows. MCP is of the form

$$P^{\mathrm{MCP}}_{\lambda,a}(t) = \begin{cases} \lambda|t| - \dfrac{t^2}{2a}, & |t| \le a\lambda,\\[4pt] \dfrac{a\lambda^2}{2}, & |t| > a\lambda. \end{cases} \qquad (9)$$

SCAD is of the form

$$P^{\mathrm{SCAD}}_{\lambda,a}(t) = \begin{cases} \lambda|t|, & |t| \le \lambda,\\[4pt] \dfrac{2a\lambda|t| - t^2 - \lambda^2}{2(a-1)}, & \lambda < |t| \le a\lambda,\\[4pt] \dfrac{\lambda^2(a+1)}{2}, & |t| > a\lambda. \end{cases} \qquad (10)$$

The matrix FCP norm defined in (8) satisfies the following properties.

Proposition 1.
For $a \in (1, \infty)$ (MCP) or $a \in (2, \infty)$ (SCAD), the following hold:

(1) $\|I\|_{P_{\lambda,a}} \ge 0$, with equality iff $I = 0$;
(2) $\|I\|_{P_{\lambda,a}} \le \lambda\|I\|_*$, and $\lim_{a\to\infty}\|I\|_{P_{\lambda,a}} = \lambda\|I\|_*$;
(3) $\|I\|_{P_{\lambda,a}}$ is unitarily invariant, that is, $\|UIV\|_{P_{\lambda,a}} = \|I\|_{P_{\lambda,a}}$ whenever $U$ and $V$ are orthonormal.

Property (1) is obvious. The second property can be easily verified since both the SCAD and MCP penalty functions are upper bounded by the Lasso penalty function and tend to the Lasso penalty function as $a \to \infty$. The third property holds because the singular values do not change from $I$ to $UIV$ whenever $U$ and $V$ are orthonormal. The fourth property holds because of the concavity of the FCP penalty function. It is worth noting that the matrix FCP norm is not a true norm since it does not satisfy the triangle inequality.
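For concreteness, here are the scalar MCP and SCAD penalties in their standard parameterizations, together with the matrix quantity of (8) applied to the singular values; the comparison against the nuclear norm illustrates property (2) for $\lambda = 1$.

```python
# Scalar MCP and SCAD penalties and the induced matrix FCP norm (8).
import numpy as np

def mcp(t, lam, a):
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)

def scad(t, lam, a):
    t = np.abs(t)
    small = lam * t
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    large = lam**2 * (a + 1) / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))

def fcp_matrix_norm(M, lam=1.0, a=5.0, penalty=mcp):
    """||M||_{P_{lambda,a}}: the FCP applied to the singular values, cf. (8)."""
    return float(penalty(np.linalg.svd(M, compute_uv=False), lam, a).sum())

M = np.random.default_rng(3).standard_normal((5, 8))
print(fcp_matrix_norm(M, penalty=mcp), fcp_matrix_norm(M, penalty=scad),
      np.linalg.svd(M, compute_uv=False).sum())   # both FCP values stay below the nuclear norm
```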

Optimization
To solve problem (3), the original problem is converted into an equivalent problem (11). The solution of problem (11) is difficult to obtain directly. We adopt the Augmented Lagrangian Multiplier (ALM) scheme to derive an unconstrained optimization problem (12) in the variables $g_j\,(j \in P\backslash S) \in \Delta_+$, $Z$, $Q$, $y_j\,(j \in P\backslash S)$, and $J_j\,(j \in P)$, where $Y_j \in \mathbb{R}^{c\times n}\,(j \in P)$, $F \in \mathbb{R}^{n\times n}$, and $u_j \in \mathbb{R}^c\,(j \in P\backslash S)$ are Lagrangian multipliers and $\mu > 0$ is the penalty parameter. Instead of optimizing all arguments simultaneously, since $g_j\,(j \in P\backslash S)$, $Z$, $Q$, $y_j\,(j \in P\backslash S)$, and $J_j\,(j \in P)$ are separable, we solve for them individually and iteratively for $k = 0, 1, 2, \ldots$.
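Before the individual updates are detailed, the following toy ADMM loop (for a simple lasso problem, not for problem (12)) illustrates the alternating primal updates, the multiplier update, and the increasing penalty-parameter schedule used below; all constants are illustrative.

```python
# Toy ADMM for  min_z (1/2)||X z - x||_2^2 + tau*||q||_1  s.t.  z = q.
import numpy as np

def lasso_admm(X, x, tau=0.1, rho=1.1, mu=1e-3, mu_max=1e9, iters=200):
    m, n = X.shape
    z, q, f = np.zeros(n), np.zeros(n), np.zeros(n)   # primal copies and multiplier
    XtX, Xtx = X.T @ X, X.T @ x
    for _ in range(iters):
        z = np.linalg.solve(XtX + mu * np.eye(n), Xtx + mu * q - f)   # z-subproblem (linear system)
        v = z + f / mu
        q = np.sign(v) * np.maximum(np.abs(v) - tau / mu, 0.0)        # q-subproblem (soft-thresholding)
        f = f + mu * (z - q)                                          # multiplier update
        mu = min(rho * mu, mu_max)                                    # increasing penalty parameter
    return q

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 50))
true_z = np.zeros(50); true_z[:3] = [1.0, -2.0, 1.5]
print(np.round(lasso_admm(X, X @ true_z)[:5], 2))     # approximately [1, -2, 1.5, 0, 0]
```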

(1) By fixing $g^k_j\,(j \in P\backslash S)$, $Q^k$, $y^k_j\,(j \in P\backslash S)$, and $J^k_j\,(j \in P)$, we optimize every column $z_j\,(j \in P)$ of $Z$ through a subproblem in which $f^k_j$ is the $j$th column of $F^k$ and $q^k_j$ is the $j$th column of $Q^k$. Its solution can be obtained by solving a linear system.

(2) By fixing $Z^{k+1}$, $Q^k$, $y^k_j\,(j \in P\backslash S)$, and $J^k_j\,(j \in P)$, we optimize every $g_j\,(j \in P\backslash S)$ through a subproblem that admits a closed-form solution. Here, $J^k_{i,j}$ and $Y^k_{i,j}$ denote the $j$th columns of the matrices $J^k_i$ and $Y^k_i$, respectively, and $Z^{k+1}_{ji}$ is the $j$th element of the vector $z_i$. The updated $g^{k+1}_j$ is then projected onto the set $\Delta_+ = \{g = (g_k)_{k=1}^c : g_k \ge 0,\ \sum_{k=1}^c g_k = 1\}$ by the algorithm presented in [33] (see the sketch after this list).
(3) For $i \in P\backslash S$ or $j \in P\backslash S$, the optimized $Q^{k+1}_{ij}$ can be obtained by a weighted soft-thresholding algorithm. When $i \in S$ and $j \in S$, since $g_i$ and $g_j$ are known, $Q_{ij}$ is given by a closed-form expression.

(4) To update $y_j\,(j \in P\backslash S)$, subproblem (18) is solved, which can be done with the soft-thresholding operator.

(5) To update $J_j\,(j \in P)$, a subproblem involving the FCP matrix norm is solved, which is handled by LLA using the result of singular value thresholding as the initial value for $J_j$.

The Lagrangian multipliers are then updated. Steps (14)-(23) are repeated until the convergence conditions are attained. In this algorithm, the penalty parameter $\mu_k$ starts with $\mu_0 = 10^{-9}$ and is updated as $\mu_{k+1} = \min(\rho\mu_k, 10^{9})$ at each iteration with $\rho = 1.1$.
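Two building blocks used in steps (2)-(4) are sketched below: a standard sort-based Euclidean projection onto $\Delta_+$ (the routine of [33] may differ in detail) and the element-wise soft-thresholding operator, whose weighted variant handles the Q-update; the threshold values shown are placeholders.

```python
# Projection onto the probability simplex Delta_+ and (weighted) soft-thresholding.
import numpy as np

def project_simplex(v):
    """Project v onto {g : g_k >= 0, sum_k g_k = 1} (sort-based routine)."""
    u = np.sort(v)[::-1]                                   # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def soft_threshold(x, tau):
    """prox of tau*||.||_1; tau may be a scalar or an entry-wise (weighted) threshold."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

print(project_simplex(np.array([0.9, 0.6, -0.2, 0.1])))           # [0.65, 0.35, 0, 0]
print(soft_threshold(np.array([-1.5, -0.2, 0.8]), 0.3))           # [-1.2, 0, 0.5]
print(soft_threshold(np.array([-1.5, -0.2, 0.8]), np.array([0.1, 0.5, 0.5])))  # weighted
```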

Evaluation Metrics.
We use three evaluation metrics to verify the effectiveness of the proposed method: clustering accuracy (CA), normalized mutual information (NMI), and purity. All experiments are run ten times, each time using $n$ randomly chosen subjects. For each subject, we randomly select $t$ images, of which $m$ images are prelabeled in our method. Then the average CA, NMI, and purity are reported. We compute the clustering accuracy as

$$\text{clustering accuracy} = \frac{\text{number of correctly clustered data}}{n(t - m)}. \qquad (24)$$
Clustering accuracy measures the accuracy rate of clustering. NMI measures the quality of the clusters. Purity measures the extent to which each cluster contains samples from primarily the same subject. Higher values of these metrics indicate better clustering quality. More details of the last two metrics can be found in [45]. The best results are shown in bold font. Experiments verify that there is no obvious difference between MCP and SCAD when they are applied to our model, so we do not list them separately. In all of our experiments, we use the intensity feature of each image and stretch it into a column of the data matrix X.
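A common way to evaluate (24) in code is to match predicted cluster indices to ground-truth subjects with the Hungarian algorithm before counting correct assignments; in the semisupervised setting of this paper, the count would be taken over the $n(t-m)$ unlabeled images only. The helper below is a generic sketch.

```python
# Clustering accuracy: fraction of correctly clustered points after the best
# one-to-one matching between predicted cluster indices and ground-truth subjects.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    k = max(true_labels.max(), pred_labels.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cost[t, p] += 1                                # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)          # maximize matched count
    return cost[rows, cols].sum() / len(true_labels)

print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))   # 1.0
```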

Parameter Selection.
The parameters of all the methods are tuned to obtain the best clustering results. Our method involves six parameters: $a$, $\lambda$, $\alpha$, $\beta$, $c$, and $p$. Among these, $a$ and $\lambda$ are associated with the FCP penalty functions; in our experiments, $a = 5$ and $\lambda = 1$ give the best results. $\alpha$, $\beta$, and $c$ balance the roles of the three regularization terms. The best choice for $\alpha$ is $12.5735 \pm 0.0115$, for $\beta$ is $100 \pm 0.8$, and for $c$ is $7.005 \times 10^{-3} \pm 2.5 \times 10^{-5}$. The parameter $p$ needs to be set large enough to penalize large representation coefficients when two labeled data points have different labels. We fix $p = 20$, which achieves the best results in all the following experimental settings.

Initialization and Stopping Criterion.
All the variables $Z$, $Q$, $g_j\,(j \in P\backslash S)$, $y_j\,(j \in P\backslash S)$ and the Lagrange multipliers $Y_j\,(j \in P)$, $F$, $u_j\,(j \in P\backslash S)$ are initialized with all elements equal to zero. $J_j\,(j \in P)$ is initialized to $G^0\,\mathrm{Diag}(z^0_j)$, where $G^0$ consists of the known labels and the initial unknown labels $g^0_j\,(j \in P\backslash S)$. We stop the iterations when the three errors $\sum_{j\in P}\|G\,\mathrm{Diag}(z_j) - J_j\|_F$, $\|Z - Q\|_F$, and $\sum_{j\in P\backslash S}\|y_j - g_j\|_2$ are all less than $10^{-6}$.

Performances.
The average CA, NMI, and purity of all the methods on the two face image datasets, i.e., Extended Yale B and ORL, are reported in Tables 1 and 2. [...] images already have labels for our method. Similar to the above datasets, the result of our method is the best among all eleven methods.
To further verify the effectiveness of the proposed method, we conduct experiments on two datasets containing handwritten digits, i.e., MNIST and Alphadigits. The Alphadigits database also contains 26 letters. For the MNIST database, we randomly choose $t$ images for each digit 0-9, where $t \in \{30, 40, 50\}$, and then apply all the methods to cluster the images. For our method, $m \in \{5, 7, 8\}$ of the $t$ images for each digit are prelabeled. For the Alphadigits database, we use $n \in \{5, 10\}$ random choices of subjects. For each subject, we use all the images, among which $m = 6$ images are prelabeled for our method. The CA, NMI, and purity are reported in Tables 4 and 5. Our method obtains the best clustering results among all the compared methods.

Conclusion
We propose a novel nonconvex formulation for the subspace segmentation problem. In our work, the labels of all data are directly solved from the model rather than obtained with spectral clustering algorithms. We give two regularization constraints on the label vectors. One forces each label vector to have only one nonzero element by minimizing the difference of its $\ell_1$ norm and $\ell_2$ norm. The other makes data from the same subspace share the same label as much as possible, so that the index of the nonzero element of each label vector can indicate the subspace from which the data point comes. Due to the many advantages of FCP over the $\ell_1$ norm, we present an FCP-based low-rank approximation norm and apply it to the combination of the semisupervised label matrix and the representation vector. This term enforces the representation matrix to better satisfy the block diagonal structure. The labels and the representation matrix are contained in two regularization terms; therefore, they can interact and promote each other during the computing process. We give a solving algorithm based on ADMM, LLA, weighted singular value thresholding, weighted soft thresholding, and related techniques.
In the future, we will continue to study estimation or learning methods for the parameters of our proposed model, because currently those parameters have to be tuned to achieve good results. It may also be worthwhile to integrate classical knowledge-based approaches into deep learning architectures, so that the algorithm enjoys both the flexibility of deep learning-based methods and the clear structure of the classical approaches.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.