Subspace Clustering with Sparsity and Grouping Effect

Subspace clustering aims to group a set of data drawn from a union of subspaces according to the subspace from which each point was drawn. It has become a popular method for recovering the low-dimensional structure underlying high-dimensional datasets. State-of-the-art methods construct an affinity matrix based on the self-representation of the dataset and then use a spectral clustering method to obtain the final clustering result. These methods show that sparsity and the grouping effect of the affinity matrix are important in recovering the low-dimensional structure. In this work, we propose a weighted sparse penalty and a weighted grouping effect penalty in modeling the self-representation of data points. The experimental results on the Extended Yale B, USPS, and Berkeley 500 image segmentation datasets show that the proposed model is more effective than state-of-the-art methods in revealing the subspace structure underlying high-dimensional datasets.


Introduction
Many high-dimensional datasets such as face images, motion trajectories, and text can be well approximated by multiple low-dimensional subspaces [1]. Recovering the low-dimensional structure underlying these datasets leads to the subspace clustering problem, which aims to group a set of data drawn from a union of subspaces according to the subspace (or cluster) from which each point was drawn. It has shown promising applications in face clustering [2], motion segmentation [3], image segmentation [4], and system identification [5].
Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be a set of sufficiently sampled data vectors, where $X$ is the data matrix with each column corresponding to a data point, $d$ is the feature dimension, and $n$ is the number of data vectors. Assume that the data points are drawn from a union of $K$ subspaces $\{S_i\}_{i=1}^{K}$ of unknown dimensions. Subspace segmentation aims to cluster the data according to the underlying subspaces from which they are drawn. Figure 1 presents an example of the subspace clustering problem.
Related Works.
Most recent works focus on spectral clustering-based methods [6-12]. They first learn an affinity matrix capturing the similarity between pairs of sample points from the dataset and then use a spectral clustering method such as the normalized cut (Ncut) [13] to obtain the final clustering result. The affinity matrix is generally learnt by using the self-representation of the dataset under the assumption that each data point can be approximated by a linear or affine combination of other data points. To ensure that the affinity matrix has some expected properties, the previous methods [6-12] solve the following minimization problem:

$$\min_{Z, E} \; \|Z\|_p + \lambda \|E\|_q \quad \text{s.t.} \quad X = XZ + E, \;\; \operatorname{diag}(Z) = 0, \tag{1}$$

where $\|Z\|_p$ is a proper norm of the representation matrix $Z = [z_1, \ldots, z_n] = [z_{ij}] \in \mathbb{R}^{n \times n}$ and is used to enforce the expected property on $Z$, $\|E\|_q$ measures the error matrix $E$ (noise or corruption), and $\lambda$ is a tradeoff parameter. The constraint $\operatorname{diag}(Z) = 0$ is optional and is used to avoid the trivial solution $Z = I$. Once the optimal solution $Z^*$ of (1) is obtained, it is used to construct the affinity matrix $A$.
By the theory of spectral clustering, the affinity matrix, which measures the similarity between data points, should be nonnegative and symmetric. A commonly used definition is

$$A = \frac{|Z^*| + |Z^*|^{T}}{2}, \tag{2}$$

where $|\cdot|$ denotes the element-wise absolute value of the matrix $Z^*$. Sparse subspace clustering (SSC) [6, 7] pursues the sparsest representation for each data point by using $\ell_1$-norm regularization. Low-rank representation (LRR) [2, 8] seeks the lowest-rank representation of all data points by employing the nuclear norm $\|\cdot\|_*$ of the coefficient matrix, which can capture the global structure of the data. Least squares regression (LSR) [9] uses Frobenius norm regularization. Correlation adaptive subspace segmentation (CASS) [10] uses the trace Lasso, defined by $\|X \operatorname{diag}(z)\|_*$, to regularize each column of $Z$ and thereby gains a tradeoff between the $\ell_1$-norm and the $\ell_2$-norm depending on the correlation of the data points. In particular, when the data are highly correlated (i.e., $X^T X$ is close to $\mathbf{1}\mathbf{1}^T$), it is close to the $\ell_2$-norm, while when the data are almost uncorrelated (i.e., $X^T X$ is close to $I$), it behaves like the $\ell_1$-norm. The smooth representation (SMR) [11] and SSC with the weighting function (W-SSC) [12] use spatial information to design the regularization term of $Z$ to improve the clustering results.

The previous works pursue two properties of the self-representation and the resulting affinity matrix to ensure the success of subspace clustering. SSC [6, 7] and W-SSC [12] pursue sparsity. LRR [2], LSR [9], CASS [10], and SMR [11] are shown to have the grouping effect (GE, Definition 1), which tends to encourage clustering highly correlated data together. Under some ideal conditions, sparsity results in zero connection between data points from different subspaces and tends to group them into different clusters. However, SSC has an instability problem: if the data from the same subspace are highly correlated or clustered, it selects only one of the several correlated data points at random and ignores the others. Moreover, datasets in practice do not necessarily satisfy the ideal conditions. These issues limit the performance of SSC. W-SSC compresses the coefficients corresponding to data points spatially far away from the data point under consideration by introducing a weight depending on the spatial distance between data points. However, SSC and W-SSC do not take the GE into consideration and both have the instability problem. LRR [2], LSR [9], CASS [10], and SMR [11] have the GE but are insufficient in enforcing sparsity.
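The interpolation behavior of the trace Lasso described above can be checked numerically on two extreme cases. The sketch below is only an illustration, assuming NumPy; the helper name `trace_lasso` and the two toy data matrices are hypothetical and are not taken from [10].

```python
import numpy as np

def trace_lasso(X, z):
    """Trace Lasso ||X diag(z)||_*: nuclear norm of X with column j scaled by z[j]."""
    return np.linalg.norm(X * z, ord="nuc")  # X * z scales the columns of X

z = np.array([0.5, -2.0, 1.5, 0.0])

# Uncorrelated unit-norm columns: X^T X = I, so the penalty equals the l1-norm.
X_uncorrelated = np.eye(4)
tl_uncorrelated = trace_lasso(X_uncorrelated, z)

# Identical unit-norm columns: X^T X = 1 1^T, so the penalty equals the l2-norm.
u = np.ones(4) / 2.0                       # unit-norm vector
X_correlated = np.tile(u[:, None], (1, 4))
tl_correlated = trace_lasso(X_correlated, z)

print("uncorrelated:", tl_uncorrelated, "l1-norm:", np.abs(z).sum())
print("correlated:  ", tl_correlated, "l2-norm:", np.linalg.norm(z))
```

In the first case $X \operatorname{diag}(z)$ is diagonal, so its singular values are the $|z_j|$; in the second it is the rank-one matrix $u z^T$, whose only nonzero singular value is $\|z\|_2$.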

Contributions.
In this work, we pursue both sparsity and the grouping effect in modeling the self-representation of data points. Specifically, for the representation vector $z_i$ of a data point $x_i$, sparsity is enforced by minimizing the $\ell_1$-norm of only those coefficients that correspond to data points far away from $x_i$. We also seek the GE by enforcing close representation vectors for close data points.
We present an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve our model; two of its subproblems have analytical solutions.
We test our method on the face dataset Extended Yale B and the handwritten digit dataset USPS. The experimental results show that the proposed model is more effective than state-of-the-art methods in revealing the cluster structure underlying these high-dimensional datasets. We also test our method on the Berkeley 500 image segmentation dataset; both the visual segmentation results and the segmentation metrics show that our method performs better than state-of-the-art methods.

Self-Representation with Sparsity and Grouping Effect
To make zero connection between data points from different subspaces, only $\sum_{j \notin \Omega_i} |z_{ji}|$ should be minimized, where $\Omega_i$ is the set of indices $j$ such that $x_j$ and $x_i$ are drawn from the same subspace and $|z_{ji}|$ denotes the absolute value of $z_{ji}$. However, in the scenario of subspace clustering, $\Omega_i$ is unknown in advance. SSC pursues sparsity by minimizing

$$\|z_i\|_1 = \sum_{j \in \Omega_i} |z_{ji}| + \sum_{j \notin \Omega_i} |z_{ji}|.$$

So it is possible for SSC to make $z_{ji} = 0$ even if $j \in \Omega_i$, which causes the instability problem. It is also possible that $z_{ji} \neq 0$ for some $j \notin \Omega_i$, thus causing a nonzero connection between clusters. W-SSC compresses the coefficient $z_{ji}$ corresponding to the data point $x_j$ by introducing a weight depending on the spatial distance between $x_i$ and $x_j$. However, SSC and W-SSC do not take the GE into consideration and both have the instability problem.
In [11], the authors define the following "grouping effect."

Definition 1 (grouping effect [11]). Given a set of -

The works in [9-11] confirm that the grouping effect tends to group highly correlated data into the same cluster, and experiments show that the grouping effect improves the clustering results on several commonly used datasets [14, 15].

Taking both sparsity and the GE into consideration in the self-representation of a data point, we propose to solve the minimization problem (5) for each data point; collecting (5) over all data points gives the matrix problem (6), in which $\odot$ denotes the element-wise product of two matrices. In case of noisy or corrupted data, we reformulate the problem as (7), where $\alpha > 0$ and $\beta > 0$ are tuning parameters that balance the effect of the corresponding terms. The constraint $\operatorname{diag}(Z) = 0$ enforces that each data point is represented as a linear combination of the other data points.
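The grouping-effect penalty admits a trace form: for a symmetric weight matrix $W = [w_{ij}]$ with graph Laplacian $L = D - W$ ($D$ the diagonal degree matrix), one has $\frac{1}{2}\sum_{i,j} w_{ij}\|z_i - z_j\|_2^2 = \operatorname{tr}(Z L Z^T)$ when the $z_i$ are the columns of $Z$. The sketch below checks this numerically. It assumes NumPy and a symmetric binary $k$-nearest-neighbor weighting, which is one possible choice and not necessarily the weighting used in the paper; the helper name `knn_weights` is hypothetical.

```python
import numpy as np

def knn_weights(X, k):
    """Symmetric binary k-nn weights: w_ij = 1 if x_j is among the k nearest
    neighbors of x_i or vice versa (an assumed weighting, for illustration)."""
    n = X.shape[1]
    sq = np.sum(X**2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # squared distances
    np.fill_diagonal(dist, np.inf)                     # exclude self-neighbors
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                          # symmetrize

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 12))    # d = 5 features, n = 12 points (columns)
Z = rng.standard_normal((12, 12))   # arbitrary representation matrix
W = knn_weights(X, k=3)
L = np.diag(W.sum(axis=1)) - W      # graph Laplacian L = D - W

# Left side: (1/2) * sum_ij w_ij ||z_i - z_j||^2, z_i being the i-th column of Z.
diff = Z[:, :, None] - Z[:, None, :]     # diff[:, i, j] = z_i - z_j
pairwise = np.sum(diff**2, axis=0)       # pairwise[i, j] = ||z_i - z_j||^2
lhs = 0.5 * np.sum(W * pairwise)

# Right side: trace form with the Laplacian.
rhs = np.trace(Z @ L @ Z.T)
print(lhs, rhs)
```

The identity relies on $W$ being symmetric and on the $z_i$ being columns of $Z$; with rows instead of columns the trace would read $\operatorname{tr}(Z^T L Z)$.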

Minimization Algorithm.
To solve problem (7), we first convert it into the equivalent problem (8), which involves an auxiliary matrix $C$. We solve this problem by using the Alternating Direction Method of Multipliers (ADMM) [16]. The augmented Lagrangian [17] is given by (9), where $Y_1$ and $Y_2$ are matrices of Lagrange multipliers and $\mu > 0$ is a penalty parameter. To find a saddle point of the augmented Lagrangian, we update each of $Z$, $C$, $E$, $Y_1$, and $Y_2$ alternately while keeping the other variables fixed.
Update for Z. We update $Z$ by solving subproblem (10), where $U = C + Y_2/\mu$ is evaluated at the current iterates; the solution of (10) is given by (11) and (12), and $Z$ is obtained by using the soft-thresholding operator.

Update for C. We update $C$ by solving subproblem (13). This is a smooth convex program, and its optimality condition is the linear matrix equation (14), in which $B = 2\alpha L$. Equation (14) is a standard Sylvester equation [18], which has a unique solution.

Algorithm 1: Solving problem (7) by ADMM.
Input: data matrix $X$; model parameters $\alpha$, $\beta$, and $k$.
Initialization: $\mu_{\max} = 10^{10}$, $\rho = 1.1$, $\varepsilon = 10^{-5}$.
Output: the optimal coefficient matrix $Z^*$.
While not converged do
(1) fix the others and update $Z$ by Eqs. (11) and (12);
(2) fix the others and update $C$ by solving Eq. (14);
(3) fix the others and update $E$ by Eq. (16) or (17);
(4) update the Lagrange multipliers $Y_1$ and $Y_2$.
End while
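To make the structure of the $Z$- and $C$-updates concrete, the following worked derivation shows how a soft-thresholding step and a Sylvester equation with $B = 2\alpha L$ arise. It is a sketch under stated assumptions, not a reproduction of the paper's equations (8)-(14): the splitting ($X = XC + E$, $C = Z$), the sparsity weights $\bar W$, the squared Frobenius error term, and the sign conventions of the multipliers are all assumed here.

```latex
% Assumed augmented Lagrangian (illustrative splitting; the paper's (8)-(9) may differ).
\begin{aligned}
\mathcal{L}_\mu ={}& \|\bar W \odot Z\|_1 + \alpha\,\operatorname{tr}(C L C^{T}) + \beta\,\|E\|_F^2
   + \langle Y_1,\, X - XC - E\rangle + \langle Y_2,\, C - Z\rangle \\
 & + \tfrac{\mu}{2}\left(\|X - XC - E\|_F^2 + \|C - Z\|_F^2\right). \\[6pt]
% Z-update: complete the square in Z, with U = C + Y_2/\mu as in the text.
Z\text{-update:}\quad & \min_{Z,\ \operatorname{diag}(Z)=0}\ \|\bar W \odot Z\|_1 + \tfrac{\mu}{2}\,\|Z - U\|_F^2,
   \qquad U = C + Y_2/\mu, \\
 & Z_{ij} = \operatorname{sign}(U_{ij})\,\max\!\left(|U_{ij}| - \bar w_{ij}/\mu,\ 0\right)\ (i \neq j),
   \qquad Z_{ii} = 0. \\[6pt]
% C-update: set the gradient with respect to C to zero (L is symmetric).
C\text{-update:}\quad & 2\alpha\, C L - X^{T} Y_1 + Y_2 - \mu X^{T}(X - XC - E) + \mu\,(C - Z) = 0 \\
 \Longleftrightarrow\;& \underbrace{\mu\,(X^{T}X + I)}_{A}\, C \;+\; C\,\underbrace{2\alpha L}_{B}
   \;=\; \underbrace{X^{T}\!\left(\mu\,(X - E) + Y_1\right) + \mu Z - Y_2}_{D}.
\end{aligned}
```

Under these assumptions $A$ is positive definite and $B$ is positive semidefinite, so $A$ and $-B$ share no eigenvalue, which is the standard condition for the Sylvester equation $AC + CB = D$ to have a unique solution.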
Update for E. We update $E$ by solving subproblem (15). Here we use the $\ell_1$-norm or the squared Frobenius norm to handle corruptions or noise. If the $\ell_1$-norm is used, the solution of (15) is given by (16); if the squared Frobenius norm is used, the solution is given by (17).

Update for $Y_1$ and $Y_2$. We update $Y_1$ and $Y_2$ with the standard ADMM multiplier step, which is a simple gradient ascent step.

The procedure for solving problem (7) is outlined in Algorithm 1. When the optimal solution $Z^*$ of problem (7) is obtained, we use (2) to construct the affinity matrix $A$ and normalize each column of $A$ by dividing it by the maximum value of that column. Finally, a spectral clustering method is applied to obtain the clustering results.

The computational burden lies mainly in Algorithm 1. Let $t$ denote the number of iterations and let the matrix $X$ have size $d \times n$. The complexity of updating $Z$, $E$, and the multipliers $Y_1$ and $Y_2$ is, respectively, $O(n^2)$, $O(dn^2)$, $O(dn^2)$, and $O(n^2)$. To update $C$, we use the Bartels-Stewart algorithm [18] to solve the Sylvester equation, which has a computational complexity of $O(n^3)$. So the total computational complexity is $O(tn^2 + tdn^2 + tn^3)$. Assuming that $d \le n$, the complexity is $O(tn^3)$.
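The alternating scheme described in this section can be sketched in a few lines of NumPy. What follows is a minimal illustration under several assumptions that are not confirmed by the text: the splitting $X = XC + E$, $C = Z$; binary neighbor weights $W$ with sparsity weights $1 - W$; the squared Frobenius error term; an initial penalty of 0.1; and a Sylvester solve by eigendecomposition (valid because both coefficient matrices are symmetric here) in place of the Bartels-Stewart algorithm. The names `soft_threshold` and `admm_sketch` are hypothetical.

```python
import numpy as np

def soft_threshold(U, T):
    """Element-wise soft-thresholding of U with (entry-wise) thresholds T."""
    return np.sign(U) * np.maximum(np.abs(U) - T, 0.0)

def admm_sketch(X, W, alpha=0.1, beta=0.5, mu=0.1, mu_max=1e10, rho=1.1,
                eps=1e-5, max_iter=500):
    """Illustrative ADMM for the assumed problem
         min ||(1 - W) o Z||_1 + alpha tr(C L C^T) + beta ||E||_F^2
         s.t. X = XC + E,  C = Z,  diag(Z) = 0."""
    d, n = X.shape
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    W_bar = 1.0 - W                         # penalize only unconnected pairs
    # Diagonalize the symmetric Sylvester coefficients once.
    s, P = np.linalg.eigh(X.T @ X + np.eye(n))   # A = mu * P diag(s) P^T
    l, Q = np.linalg.eigh(L)                     # B = 2 alpha * Q diag(l) Q^T
    l = np.maximum(l, 0.0)                       # clip round-off negatives
    Z = np.zeros((n, n)); C = np.zeros((n, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((n, n))
    residual = np.inf
    for _ in range(max_iter):
        # Z-update: weighted soft-thresholding, then enforce diag(Z) = 0.
        Z = soft_threshold(C + Y2 / mu, W_bar / mu)
        np.fill_diagonal(Z, 0.0)
        # C-update: solve A C + C B = D in the eigenbases of A and B.
        D = X.T @ (mu * (X - E) + Y1) + mu * Z - Y2
        C = P @ ((P.T @ D @ Q) / (mu * s[:, None] + 2 * alpha * l[None, :])) @ Q.T
        # E-update: closed form for the squared Frobenius norm.
        E = (mu * (X - X @ C) + Y1) / (2 * beta + mu)
        # Multiplier updates (dual ascent) and penalty increase.
        R1 = X - X @ C - E
        R2 = C - Z
        Y1 += mu * R1
        Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        residual = max(np.abs(R1).max(), np.abs(R2).max())
        if residual < eps:
            break
    return Z, C, E, residual

# Toy data: two 2-dimensional subspaces in R^8, 15 unit-norm points each.
rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(2)]
X = np.hstack([B @ rng.standard_normal((2, 15)) for B in bases])
X /= np.linalg.norm(X, axis=0)
n = X.shape[1]
sim = X.T @ X
np.fill_diagonal(sim, -np.inf)                 # exclude self-neighbors
nearest = np.argsort(-sim, axis=1)[:, :4]      # 4 nearest neighbors (assumed k)
W = np.zeros((n, n))
W[np.repeat(np.arange(n), 4), nearest.ravel()] = 1.0
W = np.maximum(W, W.T)

Z, C, E, residual = admm_sketch(X, W)
print("final constraint residual:", residual)
```

The eigendecompositions cost $O(n^3)$ once, after which each $C$-update needs only matrix products; this is a convenience of the symmetric case in this sketch and says nothing about how the authors implemented the solver.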

Experiments and Analysis
In this section, we evaluate the performance of the proposed method on several publicly available datasets, including the Extended Yale B face dataset [14], the handwritten digits dataset USPS [15], and the Berkeley 500 image segmentation dataset [19]. We also compare our method with several state-of-the-art methods: SSC [7], LRR [2], LSR [9], CASS [10], and SMR [11]. We use the source codes provided by the authors and tune the parameters of each method to achieve the best performance for fair comparison. We define the affinity matrix by using (2).
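The affinity construction used for all methods can be sketched as follows. The snippet assumes NumPy and the common form $A = (|Z| + |Z|^T)/2$ for definition (2), followed by the column-wise normalization described in the algorithm section; the function name `affinity_from_representation` is hypothetical.

```python
import numpy as np

def affinity_from_representation(Z, normalize_columns=True):
    """Symmetric nonnegative affinity from a representation matrix Z,
    optionally dividing each column by its maximum entry."""
    A = 0.5 * (np.abs(Z) + np.abs(Z).T)
    if normalize_columns:
        col_max = A.max(axis=0)
        col_max[col_max == 0] = 1.0   # leave all-zero columns untouched
        A = A / col_max               # divides column j by col_max[j]
    return A

Z = np.array([[0.0, 2.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]])
A_sym = affinity_from_representation(Z, normalize_columns=False)
A_norm = affinity_from_representation(Z)
print(A_sym)
print(A_norm)
```

Note that dividing each column by its own maximum generally breaks the symmetry of $A$; whether the result is re-symmetrized before spectral clustering is not stated in the text, so the sketch leaves it as is.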
For the Extended Yale B and USPS datasets, we use the subspace clustering error to evaluate the performance of the methods:

$$\text{clustering error} = \frac{N_{\text{error}}}{N_{\text{total}}},$$

where $N_{\text{error}}$ denotes the number of misclassified points and $N_{\text{total}}$ denotes the total number of points. For the Berkeley 500 image segmentation dataset, three metrics for comparing pairs of image segmentations are used: PRI (probabilistic Rand index) [20], VOI (variation of information) [21], and GCE (global consistency error) [22]. The value of PRI lies in [0, 1], and a higher value indicates better segmentation performance. The value of VOI lies in [0, +∞), and a lower value indicates better segmentation performance. The value of GCE lies in [0, 1], and a lower value indicates better performance.
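Because cluster labels are only defined up to a permutation, counting misclassified points requires matching predicted clusters to ground-truth classes first. The sketch below assumes the best one-to-one matching is used (the text does not say how the matching is done) and finds it by brute force; the function name `clustering_error` is hypothetical.

```python
from itertools import permutations

def clustering_error(true_labels, pred_labels):
    """N_error / N_total, minimized over one-to-one relabelings of the
    predicted clusters (brute force, suitable for a handful of clusters)."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    if len(pred_ids) > len(true_ids):
        raise ValueError("more predicted clusters than ground-truth classes")
    n_total = len(true_labels)
    best_correct = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))   # predicted id -> true id
        correct = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_correct = max(best_correct, correct)
    return (n_total - best_correct) / n_total

print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0]))  # labels swapped only
print(clustering_error([0, 0, 1, 1], [0, 1, 1, 1]))  # one point misassigned
```

The number of permutations grows factorially with the number of clusters, so for ten or more clusters an assignment solver such as the Hungarian algorithm would be the practical choice.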

Datasets and Experimental Settings
Extended Yale B. Extended Yale B is a face dataset that consists of 192 × 168-pixel cropped face images of 38 human subjects under varying poses and illuminations. Each subject has 64 frontal faces. To reduce the computational cost and memory requirements of all algorithms, each face image is downsampled to 32 × 32 pixels and rearranged into a 1024-dimensional vector. To study the effect of the number of subjects on the clustering performance of different methods, we divide the 38 subjects into four groups: 1 to 10, 11 to 20, 21 to 30, and 31 to 38. For the first three groups (each of which has 10 subjects), we test all methods on 2, 3, 5, 8, and 10 subjects; for the last group (which has 8 subjects), we perform experiments on 2, 3, 5, and 8 subjects. Finally, the mean and median clustering errors for each number of subjects are reported for each method to evaluate their performance.
USPS. USPS is a database of handwritten digits. It consists of 10 classes, corresponding to the 10 handwritten digits 0-9. There are 9298 images, each of 16 × 16 pixels. We use the first 100 images of each digit in the experiments.
Berkeley 500. Berkeley 500 is an image dataset consisting of 500 natural images for segmentation. It covers a variety of scene categories, such as portraits, animals, landscapes, and beaches. It also provides ground truth segmentations of all the images obtained by several human subjects. We select 50 images from this dataset for the image segmentation experiments. We first partition each image into superpixels using the method presented in [23]. Then we extract the Color Histogram (CH), Local Binary Pattern (LBP), and Histogram of Gradient (HOG) features of each superpixel to obtain the data matrix $X$. Finally, we use our subspace clustering method to cluster the superpixels into several regions.

Parameter Setting.
There are three model parameters, $\alpha$, $\beta$, and $k$, affecting the performance of our method. In the following, we analyze the effect of these three parameters on the clustering performance. We choose the first 10 subjects of Extended Yale B and USPS as the test data. We first test the influence of $k$. Figure 2 shows the clustering error rate as the neighborhood size $k$ varies from 3 to 9 on the Extended Yale B database, with $\alpha = 0.1$ and $\beta = 0.5$. Our method performs best with $k = 6$. On USPS, we obtain the same result. So we take $k = 6$ in all experiments.
Figure 3 shows how the clustering error rate of the proposed method depends on the parameter $\alpha$ for several fixed values of $\beta$ on the datasets Extended Yale B (a) and USPS (b). We applied our method to the first 10 subjects of Extended Yale B and all subjects of USPS. For Extended Yale B, one can observe that for $\beta = 0.2$ and $\beta = 0.5$ the proposed method performs well with $\alpha \in [0, 1.5]$. We set $\beta = 0.5$ because the error rate behaves more stably with respect to $\alpha$, and we set $\alpha = 0.1$ because the error rate attains its smallest value in this case. For USPS, the clustering error rate behaves relatively stably with respect to $\alpha$ when $\beta = 1$ or $\beta = 1.2$. We set $\beta = 1$ and $\alpha = 0.1$ because this leads to the smallest error rate.

Experimental Results.
It should be noted that the Extended Yale B dataset is challenging for subspace clustering due to heavy corruptions in the data. To show the effectiveness of our method, we use all the face images in the test. We use the $\ell_1$-norm to measure the representation error matrix $E$, and we set $\alpha = 0.1$, $\beta = 0.5$, and $k = 6$. We divide the 38 subjects into four groups: 1 to 10, 11 to 20, 21 to 30, and 31 to 38. For the first three groups (each of which has 10 subjects), we test all methods on 2, 3, 5, 8, and 10 subjects; for the last group (which has 8 subjects), we perform experiments on 2, 3, 5, and 8 subjects. Finally, we calculate the mean and median clustering error rates over the four groups. The clustering error rates of all compared methods are presented in Table 1. The affinity matrices of 5 subjects obtained by all methods are illustrated in Figure 4. Our method obtains the affinity matrix that is closest to block diagonal and achieves the lowest error rate in the case of multiple subjects. The clustering error rates of LRR, LSR, CASS, and SMR increase quickly as the number of subjects increases. This is mainly because their affinity matrices are not strictly block diagonal, which leads to grouping data from different subspaces into the same cluster. Due to the sparsity between clusters, the performance of SSC degrades very slowly as the number of subjects increases. Our method, however, behaves most stably with respect to the number of subjects and consistently obtains the lowest clustering error rate among all methods.
For the USPS dataset, we perform the clustering experiment on all data points of all subjects. We use the Frobenius norm to measure the representation error matrix, and we set $\alpha = 0.1$, $\beta = 1$, and $k = 6$. Table 2 presents the clustering error rates, which show that our method performs best and greatly reduces the clustering error rate relative to the other methods.
For the Berkeley 500 image segmentation dataset, we select 50 images randomly for image segmentation and use three metrics (PRI, VOI, and GCE) for objective assessment of the methods. Because of the variety of structures underlying natural images, we tune the parameters to achieve the best results. Table 3 shows the values of the three metrics averaged over the 50 images. Our method obtains a PRI of 0.8745, a VOI of 1.0068, and a GCE of 0.0886, which distinctly outperforms the other methods. The detailed values of the three metrics on the 50 images are shown in Figure 5. Some examples of the segmentation results obtained by our method are shown in Figure 6 for visual assessment. It can be seen that the segmentation results produced by our method are similar to the ground truth and are visually satisfying.
In all, the experimental results on the three datasets show that, by taking both sparsity and the grouping effect into consideration, our method is more effective in recovering the subspace structure underlying high-dimensional datasets.

Conclusion
State-of-the-art methods show that sparsity and the grouping effect of the affinity matrix are useful in clustering. In this work, we enforce both requirements in modeling the self-representation of data points. The experimental results on the Extended Yale B, USPS, and Berkeley 500 image segmentation datasets confirm the value of both sparsity and the grouping effect and show that the proposed model is more effective than state-of-the-art methods in revealing the cluster structure underlying high-dimensional datasets.

Figure 2: The clustering error as a function of the neighborhood size $k$.

Figure 4: Affinity matrices of 5 subjects from the Extended Yale B dataset obtained by different methods.

Figure 5: The values of the three metrics, PRI (higher is better), VOI (lower is better), and GCE (lower is better), on the 50 images produced by SSC, LRR, LSR, CASS, SMR, and our method. (a) PRI, (b) VOI, and (c) GCE.

Figure 6: Some segmentation results on the Berkeley 500 dataset obtained by our method.
Modeling.
Let $\Omega_i$ be the set of indices $j$ such that $x_j$ and $x_i$ are drawn from the same subspace or cluster, and consider the representation $z_i = (z_{1i}, z_{2i}, \ldots, z_{ni})^T$ with $z_{ii} = 0$ (the $i$th column of $Z$) for the data point $x_i$. In problem (5), $N_k(x_i)$ denotes the set of $k$-nearest neighbors of $x_i$ and $G$ is the $k$-nn graph of the dataset. The first term of (5), a weighted sum of the $|z_{ji}|$, only minimizes the coefficients corresponding to data points that are not connected. The second term enforces close representation vectors for connected, close data points. These two terms enforce the representation matrix and the resulting affinity matrix to have both sparsity and the GE. Sparsity encourages grouping data from different subspaces into different clusters, and the GE encourages grouping highly correlated data together. Considering that $\frac{1}{2}\sum_{i,j} w_{ij} \|z_i - z_j\|_2^2 = \operatorname{tr}(Z L Z^T)$, where $L = D - W$ and $D$ is a diagonal matrix with diagonal entries $d_{ii} = \sum_j w_{ij}$, and collecting problem (5) for all data points, we obtain problem (6).

Table 1: The clustering error rate (%) on the Extended Yale B dataset.

Table 2: The clustering error rate (%) on the USPS dataset.

Table 3: The average values of PRI, VOI, and GCE on the Berkeley 500 image segmentation dataset.

and the Fundamental Research Funds for the Central Universities (Grant no. NSIY21) for supporting their research works.