LogDet Rank Minimization with Application to Subspace Clustering

Low-rank matrices are desired in many machine learning and computer vision problems. Most recent studies use the nuclear norm as a convex surrogate of the rank operator. However, the nuclear norm simply adds all singular values together, and thus the rank may not be well approximated in practical problems. In this paper, we propose using a log-determinant (LogDet) function as a smooth and closer, though nonconvex, approximation to rank for obtaining a low-rank representation in subspace clustering. An augmented Lagrange multipliers strategy is applied to iteratively optimize the LogDet-based nonconvex objective function on potentially large-scale data. By making use of the angular information of the principal directions of the resultant low-rank representation, an affinity graph matrix is constructed for spectral clustering. Experimental results on motion segmentation and face clustering data demonstrate that the proposed method often outperforms state-of-the-art subspace clustering algorithms.


Introduction
Matrix rank minimization [1] is ubiquitous in machine learning, computer vision, control, signal processing, and system identification. For instance, low-rank representation based subspace clustering [2][3][4] and matrix completion [5,6] methods have achieved great success recently. Subspace clustering [7] is one of the fundamental topics with numerous applications, for example, image representation [8,9], face clustering [3,10], and motion segmentation [11,12]. It is assumed that high-dimensional data is more likely a union of low-dimensional subspaces rather than one individual subspace. For example, different subspaces are needed to describe the trajectories of different moving objects in a video sequence. Subspace clustering is an intrinsically difficult problem, since we need to simultaneously cluster all data points into multiple groups and find a low-dimensional subspace fitting each group of points.
Subspace clustering has been an active research topic over the past decades. Four main categories of methods have been proposed [10]: iterative, algebraic, statistical, and spectral clustering-based methods. The first three kinds of approaches are sensitive to initialization, noise, and outliers; in addition, they are difficult to optimize [10]. Spectral clustering-based methods have achieved promising performance, where the key is to learn a good affinity matrix of data points. For instance, the algorithms of local subspace affinity (LSA) [13], locally linear manifold clustering (LLMC) [14], and spectral local best-fit flats (SLBF) [15] use local information around each point to construct the affinity matrix, while the spectral curvature clustering (SCC) [16] method preserves the global structures of the whole dataset in deriving the affinity matrix. Subsequently, $k$-means [17] or Normalized Cuts (NCuts) [18,19] is applied to the affinity matrix to obtain clustering results.
Recently, some spectral clustering-based methods, such as sparse subspace clustering (SSC) [10] and low-rank representation (LRR) [3], have been proposed and obtain state-of-the-art results in subspace clustering. SSC represents each data point as a sparse linear combination of the other points and solves an $\ell_1$-norm regularized minimization problem for sparsity. SSC shows promising results if the subspaces are either independent or disjoint [20].
The basic idea of LRR is to learn a low-rank representation of data by capturing the global Euclidean structure of the whole data. In this scheme, each data point is represented as a linear combination of the examples in the data matrix itself, and a convex nuclear norm minimization is used as a surrogate of the rank function to obtain the desired low-rank representation. Though its optimization is well studied and has a global optimum, its performance may be far from optimal in real applications because the nuclear norm might not be a good approximation to the rank function. Compared to the rank function, to which all nonzero singular values contribute equally, the nuclear norm treats those values differently by simply adding them together. As a result, the nuclear norm may be dominated by a few very large singular values and deviate significantly from the true value of the rank. Several papers have considered this problem of using the nuclear norm and designed methods to alleviate it by either thresholding or removing some of the singular values; for instance, singular value thresholding [21] and truncated nuclear norm [6] both considerably enhance the performance of matrix completion.
In this paper, we propose using a log-determinant (LogDet) function for rank approximation and study its minimization in subspace clustering. Different from the nuclear norm-based approaches, which minimize the sum of all singular values, our approach aims to minimize the rank by making the contribution of a large singular value much closer to one while keeping the contribution of a small singular value close to zero. In this way, we can get a closer and more robust approximation to the rank function than the nuclear norm. Since the LogDet function is nonconvex, we apply the method of augmented Lagrange multipliers (ALM) to solve the associated optimization for potentially large-scale applications, in which the subproblem for minimizing the LogDet function in each iteration has a closed-form solution.
To demonstrate the effectiveness of our LogDet minimization method, we apply it to subspace clustering. By employing a rather simple formulation based on the LogDet function, we obtain a low-rank representation for subspace clustering. Subsequently, we exploit the angular information of principal directions of such a representation to further enhance the separation ability of the affinity matrix. In summary, our main contributions of this work include the following.
(i) More accurate and robust rank approximation is used to obtain the low-rank representation, which is able to capture the global structure of the dataset.
(ii) An iterative optimization algorithm is designed for minimizing this rank approximation-based objective function. Theoretical analysis shows that our algorithm converges to a stationary point. Specifically, the proposed optimization method is applied to subspace clustering.
(iii) Angular information of principal directions of the low-rank representation is employed to further exploit the intrinsic local geometrical structure relevant to the membership of data points.
(iv) Extensive experiments demonstrate the effectiveness of the proposed LogDet minimization method for rank approximation. Particularly, when used for subspace clustering, our simple formulation shows favorable performance compared to other state-of-the-art methods, although we do not explicitly account for outliers in our model. This demonstrates the robustness of our approach.
The remainder of the paper is organized as follows: Section 2 provides a brief review of LRR and SSC. In Section 3, we present the proposed approximation and design an efficient optimization scheme. We give convergence analysis in Section 4. Experimental results are shown in Section 5. Finally, conclusions are drawn in Section 6.

Review of LRR and SSC
In this section, we give a brief review of SSC and LRR. Let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be a set of $d$-dimensional data points drawn from an unknown union of $k$ linear subspaces $S_1, S_2, \ldots, S_k$. The task of subspace clustering is to segment the data points into the $k$ subspaces.
LRR tries to seek the lowest rank representation among many possible linear combinations of the bases in a given dictionary, which typically is the data matrix itself. The problem can be formulated as
$$\min_{Z} \ \operatorname{rank}(Z) \quad \text{s.t.} \quad X = XZ, \qquad (1)$$
where $Z = [z_1, z_2, \ldots, z_n]$ is the coefficient matrix with each $z_i$ being the representation of $x_i$. The above problem is NP-hard due to the combinatorial nature of the rank function.
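The tightest convex relaxation of the rank function [22] is the nuclear norm. For a matrix $Z \in \mathbb{R}^{m \times n}$, its nuclear norm is defined as $\|Z\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i(Z)$, where $\sigma_i(Z)$ means the $i$th singular value of $Z$. Using this relaxation, LRR solves the following problem:
$$\min_{Z} \ \|Z\|_* \quad \text{s.t.} \quad X = XZ. \qquad (2)$$
After obtaining $Z$, the affinity matrix $W$ is defined as
$$W = |Z| + |Z^T|. \qquad (3)$$
Then the spectral clustering algorithm, Normalized Cuts [18], is used to produce the final segmentation.
SSC aims to find a sparse representation of $X$ by solving the following convex optimization problem:
$$\min_{Z,E,F} \ \|Z\|_1 + \lambda_e \|E\|_1 + \frac{\lambda_f}{2}\|F\|_F^2 \quad \text{s.t.} \quad X = XZ + E + F, \ \operatorname{diag}(Z) = 0, \qquad (4)$$
where $\|E\|_1 = \sum_i \sum_j |E_{ij}|$, $E$ is a sparse matrix containing the gross error, $\|F\|_F^2 = \sum_i \sum_j F_{ij}^2$, and $F$ is a matrix of fitting residuals. After obtaining $Z$, subsequent procedures are similar to LRR.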
Computational Intelligence and Neuroscience

LogDet Rank Approximation and Its Minimization Algorithm
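A function $f: \mathbb{R}^n \to \mathbb{R}$ is absolutely symmetric if $f(x)$ is invariant under arbitrary permutations and sign changes of the elements of $x$. Based on this function $f(\cdot)$, we have the following theorem [23].
Theorem 1. The unitarily invariant function $F(X) = f(\sigma(X))$, where $\sigma(X)$ denotes the vector of singular values of $X$, is differentiable at $X$ if and only if $f$ is differentiable at $\sigma(X)$. And the gradient of $F(X)$ at $X$ is
$$\nabla F(X) = U \operatorname{diag}(\theta) V^T, \qquad (5)$$
where $\theta = \nabla f(\sigma(X))$ and $X = U \operatorname{diag}(\sigma(X)) V^T$ is the SVD of $X$. Equation (5) can be obtained directly from Theorem 3.1 of [23].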
In this work, we utilize the unitarily invariant function LogDet to achieve a closer, though not convex, rank relaxation than the nuclear norm. We apply the method of ALM to the minimization problem associated with the LogDet rank approximation. To explain our method, we specifically consider using LogDet as a rank surrogate in subspace clustering. We first obtain a low-rank representation of high-dimensional data based on the LogDet optimization. Then we construct an affinity graph matrix for spectral clustering by using the angular information of the principal directions of the low-rank representation.

LogDet Rank Minimization.
We use $\operatorname{LogDet}(I + Z^T Z)$ as a surrogate of the rank function of $Z$. It is obvious that $\operatorname{LogDet}(I + Z^T Z) = \sum_{i=1}^{n} \log(1 + \sigma_i^2(Z))$. It can be easily verified that $\log(1 + \sigma_i^2(Z)) \le \sigma_i(Z)$ for any $\sigma_i(Z) \ge 0$ (the function $g(\sigma) = \sigma - \log(1 + \sigma^2)$ satisfies $g(0) = 0$ and $g'(\sigma) = (1 - \sigma)^2 / (1 + \sigma^2) \ge 0$), so we always have $\operatorname{LogDet}(I + Z^T Z) \le \|Z\|_*$; particularly, if there are large nonzero singular values, the LogDet function will be much smaller than the nuclear norm since $\log(1 + \sigma_i^2(Z)) \ll \sigma_i(Z)$ for a large $\sigma_i(Z) > 1$. It is noted that, for small nonzero singular values, their contribution to the LogDet function will be significantly reduced compared to the nuclear norm. Because small nonzero singular values are often regarded as being from noise in the data, the LogDet function reduces the effect of noise more than the nuclear norm does.
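The relation between the two surrogates can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `rank_surrogates` and the test matrix (a rank-3 signal plus small dense noise) are illustrative choices, not part of the method described above. It evaluates both surrogates from the singular values and cross-checks the singular-value formula against the determinant definition.

```python
import numpy as np

def rank_surrogates(singular_values):
    """Return (nuclear norm, LogDet surrogate) for a vector of singular values."""
    s = np.asarray(singular_values, dtype=float)
    return float(np.sum(s)), float(np.sum(np.log1p(s ** 2)))

rng = np.random.default_rng(0)
# Rank-3 signal with large singular values plus small dense noise (illustrative data).
signal = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))
Z = signal + 1e-3 * rng.standard_normal((50, 50))
s = np.linalg.svd(Z, compute_uv=False)

nuclear, logdet = rank_surrogates(s)
# Contribution of the 47 small, noise-induced singular values to each surrogate.
noise_nuclear, noise_logdet = rank_surrogates(s[3:])
# Cross-check the singular-value formula against log det(I + Z^T Z).
_, logdet_direct = np.linalg.slogdet(np.eye(50) + Z.T @ Z)

print(f"nuclear norm = {nuclear:.3f}, LogDet = {logdet:.3f}, direct = {logdet_direct:.3f}")
print(f"noise part: nuclear = {noise_nuclear:.2e}, LogDet = {noise_logdet:.2e}")
```

In this sketch the noise-induced singular values contribute roughly in proportion to $\sigma_i$ to the nuclear norm but only about $\sigma_i^2$ to the LogDet surrogate, which is the noise-damping behavior described above.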
It is worthwhile to note that a similar function $\operatorname{LogDet}(Z + \delta I)$ was proposed in [24] to approximate rank, and iterative linearization was used to find a local minimum. However, $\delta$ is a very small constant (e.g., $10^{-6}$), which leads to a biased approximation for small singular values.
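This LogDet function is differentiable with respect to the singular values by Theorem 1, and even though it is nonconvex, its minimization is rather simple by using our optimization method. To explain its minimization, we consider its specific application to subspace clustering. By employing the above LogDet function, we simply formulate subspace clustering as the following unconstrained nonconvex minimization problem:
$$\min_{Z} \ \operatorname{LogDet}(I + Z^T Z) + \lambda \|X - XZ\|_F^2, \qquad (6)$$
where $I \in \mathbb{R}^{n \times n}$ is the identity matrix. The first term of (6) is to minimize the rank of $Z$, while the second is a relaxation of $X = XZ$, which is referred to as the self-expressiveness of $X$, with $Z$ representing the similarity between data points. Because the LogDet function is not convex in $Z$, we resort to the ALM technique to solve (6), by rewriting (6) as follows:
$$\min_{Z, J} \ \operatorname{LogDet}(I + J^T J) + \lambda \|X - XZ\|_F^2 \quad \text{s.t.} \quad Z = J. \qquad (7)$$
We turn to minimizing the following augmented Lagrangian function:
$$L(Z, J, Y, \mu) = \operatorname{LogDet}(I + J^T J) + \lambda \|X - XZ\|_F^2 + \operatorname{tr}\left(Y^T (J - Z)\right) + \frac{\mu}{2} \|J - Z\|_F^2, \qquad (8)$$
where $\mu > 0$ is a penalty parameter and $Y$ is the Lagrangian dual variable. With a sufficiently large $\mu$, the objective function converges to the objective function in (6). This can be solved by updating $Z$, $J$, and $Y$ alternately while fixing the other variables. Specifically, assume that at the $t$th iteration we have obtained $Z^t$, $J^t$, and $Y^t$; then, for the $(t+1)$th iteration, optimization problem (8) can be updated via the following four steps.
Step 1. Compute $Z^{t+1}$. Fix $J^t$ and $Y^t$ and then calculate $Z^{t+1}$:
$$Z^{t+1} = \arg\min_{Z} \ \lambda \|X - XZ\|_F^2 + \operatorname{tr}\left((Y^t)^T (J^t - Z)\right) + \frac{\mu^t}{2} \|J^t - Z\|_F^2, \qquad (9)$$
which has a closed-form solution:
$$Z^{t+1} = \left(2\lambda X^T X + \mu^t I\right)^{-1} \left(2\lambda X^T X + \mu^t J^t + Y^t\right). \qquad (10)$$
Step 2. Compute $J^{t+1}$. Fix $Z^{t+1}$ and $Y^t$ and minimize $L(Z^{t+1}, J, Y^t, \mu^t)$ as follows:
$$J^{t+1} = \arg\min_{J} \ \operatorname{LogDet}(I + J^T J) + \frac{\mu^t}{2} \left\| J - \left(Z^{t+1} - \frac{Y^t}{\mu^t}\right) \right\|_F^2. \qquad (11)$$
This can be converted to scalar minimization problems due to the following theorem. As we notice, this can also be rewritten as a special case of the problem in a recent work [25].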
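Returning to Step 1, the $Z$-subproblem is an unconstrained least-squares problem, so its closed-form solution is a single linear solve. The sketch below is illustrative only and rests on stated assumptions: the fitting term is taken as $\lambda\|X - XZ\|_F^2$ (no factor of 1/2), the multiplier term as $\operatorname{tr}(Y^T(J - Z))$, and the function name `update_Z` is hypothetical. Under a different scaling or sign convention the coefficients change accordingly.

```python
import numpy as np

def update_Z(X, J, Y, lam, mu):
    """Closed-form minimizer over Z of
        lam * ||X - X Z||_F^2 + tr(Y^T (J - Z)) + (mu / 2) * ||J - Z||_F^2
    (assumed form of the Step 1 subproblem)."""
    n = X.shape[1]
    G = X.T @ X  # Gram matrix, n x n
    # Setting the gradient to zero gives (2*lam*G + mu*I) Z = 2*lam*G + mu*J + Y.
    lhs = 2.0 * lam * G + mu * np.eye(n)
    rhs = 2.0 * lam * G + mu * J + Y
    return np.linalg.solve(lhs, rhs)

# Small demonstration on random inputs (illustrative sizes and parameters).
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((20, 30))
J_demo = rng.standard_normal((30, 30))
Y_demo = rng.standard_normal((30, 30))
Z_demo = update_Z(X_demo, J_demo, Y_demo, lam=10.0, mu=0.3)

# The gradient of the subproblem should vanish at the returned Z.
grad = 2.0 * 10.0 * X_demo.T @ (X_demo @ Z_demo - X_demo) - Y_demo + 0.3 * (Z_demo - J_demo)
print("gradient norm at the solution:", np.linalg.norm(grad))
```

Using `np.linalg.solve` rather than forming an explicit inverse is a numerical design choice of this sketch; the system matrix is symmetric positive definite whenever $\mu > 0$.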


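Theorem 2. For a unitarily invariant function $F(Z) = f(\sigma(Z))$, the optimal solution to the problem
$$\min_{Z} \ F(Z) + \frac{\mu}{2} \|Z - A\|_F^2 \qquad (12)$$
is $Z^* = U \Sigma_Z^* V^T$, where $A = U \Sigma_A V^T$ is the SVD of $A$ and $\Sigma_Z^* = \operatorname{diag}(\sigma^*)$ is obtained by solving the scalar minimization problems
$$\sigma^* = \arg\min_{\sigma \ge 0} \ f(\sigma) + \frac{\mu}{2} \|\sigma - \sigma_A\|_2^2. \qquad (13)$$
Proof. Let $A = U \Sigma_A V^T$ be the SVD of $A$; then $\Sigma_A = U^T A V$. Denoting $B = U^T Z V$, which has exactly the same singular values as $Z$, that is, $\Sigma_B = \Sigma_Z$, we have
$$F(Z) + \frac{\mu}{2} \|Z - A\|_F^2 \qquad (14)$$
$$= F(B) + \frac{\mu}{2} \|B - \Sigma_A\|_F^2$$
$$\ge F(\Sigma_B) + \frac{\mu}{2} \|\Sigma_B - \Sigma_A\|_F^2. \qquad (20)$$
In the above, the equality holds because $F$ and the Frobenius norm are unitarily invariant, and the inequality can also be obtained by the Hoffman-Wielandt inequality. Therefore, (20) is a lower bound of (14), where $\Sigma^*$ is obtained by minimizing (20). Note that equality in the inequality step is attained if $B = \Sigma_B$. Because $\Sigma_Z = \Sigma_B = B = U^T Z V$, the SVD of $Z$ is $Z = U \Sigma_Z V^T$, which is the minimizer of problem (12). Hence the proof is completed.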
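The first-order optimality condition is that the gradient of (13) with respect to each singular value should vanish. Thus, for subproblem (11), we have
$$\frac{2\sigma_i}{1 + \sigma_i^2} + \mu^t (\sigma_i - \sigma_{A,i}) = 0, \quad i = 1, \ldots, n,$$
where $\sigma_{A,i}$ is the $i$th singular value of $A = Z^{t+1} - Y^t / \mu^t$ and $A = U \Sigma_A V^T$ is the SVD of $A$. Multiplying by $1 + \sigma_i^2$ gives $\mu^t \sigma_i^3 - \mu^t \sigma_{A,i} \sigma_i^2 + (\mu^t + 2)\sigma_i - \mu^t \sigma_{A,i} = 0$, so the above equation is cubic and gives three roots. In addition, we need to enforce the nonnegativity of $\sigma_i$. It is easily seen that there exists at least one nonnegative root. And there is a unique minimizer $\sigma_i^* \in [0, \sigma_{A,i})$ if $\mu^t > 1/4$: the second derivative of $\log(1 + \sigma^2)$ is bounded below by $-1/4$, so the scalar objective is strictly convex in that case. Finally, we obtain the update of variable $J$ with $J^{t+1} = U \operatorname{diag}(\sigma_1^*, \ldots, \sigma_n^*) V^T$.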
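The scalar subproblem can be solved directly from the cubic stationarity equation. The following is a minimal sketch, assuming NumPy and assuming the scalar objective $\log(1 + \sigma^2) + \frac{\mu}{2}(\sigma - \sigma_A)^2$; the function names `logdet_shrink` and `update_J` are hypothetical. It finds the roots with `numpy.roots`, keeps the nonnegative real ones (plus the boundary point 0), and returns the candidate with the smallest objective value.

```python
import numpy as np

def logdet_shrink(sigma_a, mu):
    """Minimize log(1 + s^2) + (mu / 2) * (s - sigma_a)^2 over s >= 0."""
    # Stationarity: 2s / (1 + s^2) + mu * (s - sigma_a) = 0, i.e. the cubic
    # mu*s^3 - mu*sigma_a*s^2 + (mu + 2)*s - mu*sigma_a = 0.
    roots = np.roots([mu, -mu * sigma_a, mu + 2.0, -mu * sigma_a])
    is_real = np.abs(roots.imag) < 1e-8 * (1.0 + np.abs(roots.real))
    real = roots[is_real].real
    # Candidates: nonnegative real roots and the boundary point s = 0.
    candidates = np.append(real[real >= 0.0], 0.0)
    objective = np.log1p(candidates ** 2) + 0.5 * mu * (candidates - sigma_a) ** 2
    return float(candidates[np.argmin(objective)])

def update_J(A, mu):
    """Apply the scalar shrinkage to each singular value of A and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_new = np.array([logdet_shrink(si, mu) for si in s])
    return (U * s_new) @ Vt  # equals U @ diag(s_new) @ Vt

# Example: shrink a few singular values with mu = 0.3 (> 1/4, so the minimizer is unique).
for sigma_a in [0.5, 2.0, 10.0]:
    print(sigma_a, "->", logdet_shrink(sigma_a, mu=0.3))
```

Comparing objective values over all candidates, instead of returning the first nonnegative root, keeps the sketch valid even when $\mu \le 1/4$ and several real roots exist.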
Problem (6) is nonconvex, and it is difficult to give a rigorous mathematical argument for convergence to a (local) optimum. We will provide a theoretical proof that our algorithm converges to an accumulation point and that this accumulation point is a stationary point. Our empirical experiments confirm the convergence of the proposed method on the benchmark datasets. The experimental results are promising, despite the fact that the solution obtained by the proposed optimization method may be a local optimum.

Affinity Graph Matrix Construction.
Now we will construct an affinity matrix $W$ for subspace clustering. The optimal $Z^*$ may not accurately describe the relationship between samples if the data is severely corrupted. Therefore, in general, it is not a good idea to construct $W$ by directly using $Z^*$. In the spirit of [3,12], we construct an affinity matrix in the following way.
Algorithm 1, step (3): Update the Lagrange multiplier $Y$ and the penalty parameter $\mu$: $Y^{t+1} = Y^t + \mu^t (J^{t+1} - Z^{t+1})$, $\mu^{t+1} = \rho \mu^t$. Repeat until the stopping criterion is satisfied. Return $Z^* = Z^{t+1}$.
Algorithm 2, step (2): Compute the skinny SVD $Z^* = U^* \Sigma^* (V^*)^T$.
A suitable $\alpha$ needs to balance within-cluster cohesiveness and between-cluster separability. In this paper, we set $\alpha$ to be 2. Then we have the same postprocessing as LRR. (For LRR, we use (12) in [3] rather than (3) to construct $W$. We also confirmed with an author of [3] that the power 2 of (12) is a typo and it should be 4.) As $U^*$ or $V^*$ spans the principal directions of $Z^*$, we employ the angle information or powered correlation coefficients of the examples, because their lengths may be affected significantly by the noise or outliers in the data. Now using the resultant affinity matrix, we can apply a spectral clustering algorithm to do segmentation. In this paper, we simply perform NCuts [18] on $W$. The proposed subspace clustering procedure is summarized in Algorithm 2.
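An angle-based affinity of this kind can be sketched as follows. This is an illustration under explicit assumptions rather than the exact construction: in the spirit of the LRR postprocessing mentioned above, it assumes the rows of $M = U^* (\Sigma^*)^{1/2}$ are normalized to unit length and that $W_{ij}$ is the cosine between rows $i$ and $j$ raised to the power $2\alpha$ (power 4 for $\alpha = 2$). The function name `angular_affinity` and the rank tolerance `tol` are hypothetical.

```python
import numpy as np

def angular_affinity(Z, alpha=2.0, tol=1e-8):
    """Affinity from angles between rows of M = U * sqrt(Sigma) (assumed form)."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    keep = s > tol * max(s.max(), 1e-300)      # skinny SVD: drop negligible directions
    M = U[:, keep] * np.sqrt(s[keep])          # weighted principal directions
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    M = M / np.maximum(norms, 1e-12)           # keep only angular information
    cosines = np.clip(M @ M.T, -1.0, 1.0)
    return np.abs(cosines) ** (2.0 * alpha)    # powered correlation coefficients

# Toy representation with two perfectly separated groups of 3 and 2 points.
Z_toy = np.zeros((5, 5))
Z_toy[:3, :3] = 1.0
Z_toy[3:, 3:] = 1.0
W_toy = angular_affinity(Z_toy, alpha=2.0)
print(np.round(W_toy, 3))
```

The resulting symmetric matrix can then be passed to any spectral clustering routine that accepts a precomputed affinity, for example NCuts as used in this paper. Raising the cosines to a higher power suppresses weak between-cluster similarities at the cost of also weakening moderate within-cluster ones, which is the trade-off governed by $\alpha$.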

Experiments and Analysis
In this section, we conduct experiments on the subspace clustering task with both synthetic and real data.

Experiments with Synthetic Data.
We construct 5 independent subspaces whose bases $\{U_i\}_{i=1}^{5}$ are generated by a random rotation matrix $T$ through $U_{i+1} = T U_i$, $1 \le i \le 4$, where $U_1 \in \mathbb{R}^{100 \times 4}$ is a random orthogonal matrix [2]. We sample 20 data vectors from each subspace by $X_i = U_i Q_i$, $1 \le i \le 5$, where $Q_i$ is a $4 \times 20$ i.i.d. $N(0, 1)$ matrix. Some randomly chosen data vectors are corrupted; for example, a chosen data vector $x$ is corrupted by adding Gaussian noise with zero mean and variance $0.2\|x\|$. We then use SCLD to segment the data into 5 clusters. The subspace clustering error rate, defined as (number of misclassified points)/(total number of points), is used to assess the performance. We report the clustering error rate (averaged over 30 trials) with different corruption levels in Figure 1. Without any corruption, SCLD can cluster all data points correctly.
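The data generation described above can be sketched as follows, assuming NumPy. The function name `make_synthetic`, the `corrupt_frac` parameter, and the use of a QR factorization to draw the orthogonal matrices are illustrative assumptions; in particular, the QR-based matrix is a random orthogonal matrix standing in for the random rotation $T$.

```python
import numpy as np

def make_synthetic(n_subspaces=5, dim=100, sub_dim=4, per_subspace=20,
                   corrupt_frac=0.2, seed=0):
    """Sample points from independent subspaces and corrupt a fraction of them."""
    rng = np.random.default_rng(seed)
    # Random orthogonal T (dim x dim) and column-orthonormal U_1 (dim x sub_dim).
    T, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    U, _ = np.linalg.qr(rng.standard_normal((dim, sub_dim)))
    blocks, labels = [], []
    for i in range(n_subspaces):
        Q = rng.standard_normal((sub_dim, per_subspace))  # i.i.d. N(0, 1) coefficients
        blocks.append(U @ Q)                              # X_i = U_i Q_i
        labels += [i] * per_subspace
        U = T @ U                                         # U_{i+1} = T U_i
    X = np.hstack(blocks)
    labels = np.array(labels)
    # Corrupt randomly chosen columns with zero-mean Gaussian noise of variance 0.2*||x||.
    n = X.shape[1]
    idx = rng.choice(n, size=int(round(corrupt_frac * n)), replace=False)
    for j in idx:
        variance = 0.2 * np.linalg.norm(X[:, j])
        X[:, j] += rng.normal(0.0, np.sqrt(variance), size=dim)
    return X, labels, idx

X_syn, labels_syn, corrupted = make_synthetic(corrupt_frac=0.2)
print(X_syn.shape, len(corrupted))
```

Varying `corrupt_frac` corresponds to the different corruption levels over which the error rate is reported.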

Experiments with Real Data.
In this section, we evaluate the effectiveness and robustness of SCLD on benchmark datasets, Extended Yale B (EYaleB) [27,28] and Hopkins 155 [29]. We compare the proposed method SCLD with several state-of-the-art subspace clustering algorithms: LRR [3], SSC [10], LRSC [4,30], and local subspace affinity (LSA) [13]. For these methods, we use the parameters given by the respective authors. For our method, we also tune $\lambda$ to obtain the best performance. Generally, $\lambda$ should be relatively large if the data are slightly corrupted. $\mu$ and $\rho$ have little influence on the clustering results, so we just set $\mu^0 = 0.3$ to ensure the uniqueness of the minimizer and use $\rho = 1.1$ empirically. Other parameters are shown in Table 1. The experiments are conducted on Windows 7 with 16 GB memory and an Intel Core i5-2300 CPU.
[31]. The images are resized to 48 × 42 pixels and each vectorized image is regarded as a data point. Figure 2 shows some example images from the database.
(1) First Experiment Scenario. As done in [2], we test the algorithms on the first 10 classes of EYaleB, which consists of 640 frontal face images. More than half of the images are corrupted by shadow and noise. We use this heavily corrupted data to test the effectiveness of our method. As shown in Table 2, SCLD significantly enhances the performance. Specifically, it improves the clustering accuracy by at least 17% when compared to the other algorithms. Since the only difference between our approach and LRR is rank approximation, this improvement is due to LogDet.
(2) Second Experiment Scenario. For a fair comparison, we have followed the experimental setup of [10]. We divide the 38 subjects into four groups: subjects 1 to 10, 11 to 20, 21 to 30, and 31 to 38. We consider all choices of $n \in \{2, 3, 5, 8, 10\}$ subjects for the first three groups. For the last group, we consider all choices of $n \in \{2, 3, 5, 8\}$. We run our subspace clustering algorithm on each set of $n$ subjects. For all experiments, the stopping criterion is triggered by a relative difference of $10^{-5}$ between two successive iterations or by a maximum of 100 iterations.
The results are presented in Table 3. For other methods, we cited the results from Table 5 of [10]. SCLD consistently has low clustering error rates and is more stable than the other methods, whose error rates increase drastically as the number of subjects increases to 8 and 10. As shown in Figure 2, there are many sparse within-sample outliers in the face images, for example, shadows. Although LRR uses a regularization term to account for corruptions, the regularization term does not appear to be well suited to EYaleB. LSA has inferior performance possibly because it does not explicitly exploit the low-rank structure of the data.
(3) Third Experiment Scenario. In this section, we compare SCLD with other algorithms that use RPCA [32] as a preprocessing step. In practice, we do not know the clustering of the data beforehand and hence we apply RPCA to the collection of all data points for each trial prior to clustering. As shown in Table 4, SCLD is still superior to the other methods even though they apply RPCA to deal with sparse outlying entries. Compared to Table 3, only the clustering error rates of LRSC were reduced in some cases. We can conclude that applying RPCA to all data points simultaneously is not effective in improving clustering performance. This is due to the fact that RPCA seeks a common low-rank subspace, which will decrease the principal angles between subspaces and decrease the distance between data points in different subjects [10].

Motion Segmentation.
Motion segmentation refers to segmenting the trajectories associated with different moving objects into different groups according to their motions in a video sequence. Because different motions can be treated as different subspaces, we use the Hopkins 155 Dataset to validate SCLD. This dataset is slightly corrupted, as shown in Figure 3. It consists of 155 sequences of two or three motions and 1 sequence of 5 motions; the latter is regarded as an outlier. Each sequence is regarded as a separate clustering problem.
The experimental results are reported in Table 5. We also used the results in Table 1 of [10]. It can be seen that SCLD produces superior results compared to the other methods. For all 155 sequences, the error rate is as low as 1.79%. If we use all 156 sequences, the overall error rate of our proposed algorithm will be 1.87%. We report the average computation time for every sequence at the bottom of Table 5. The computational cost of LRSC is much lower than the other methods, while LRR, SSC, and SCLD are comparable.
To test the influence of the parameter $\lambda$ in our algorithm, we show the clustering error rates of SCLD for different $\lambda$ over all 155 sequences in Figure 4. As we can see, when $\lambda$ was between 1 and 200, the clustering error varied between 1.79%
To test the dependence of SCLD on initialization, we apply two other initializations. First, we use the solutions from LRR as the initial guess for SCLD. Second, we just generate some random numbers. We find that we can still get the same results. Actually, it is recommended to use convex relaxation solutions as initialization for nonconvex formulations [33,34].

Conclusion
In this paper, we propose using a log-determinant function (LogDet) as a rank approximation to recover the low-rank representation of high-dimensional data. When applied to subspace clustering, the proposed algorithm, called SCLD, exploits both global and local structures of the data through the LogDet rank approximation and the angle-based affinity matrix. Consequently, it captures more intrinsic information of the data that benefits subspace clustering. Our extensive experimental results show that it outperforms other low-rank representation algorithms based on the nuclear norm. Therefore, LogDet appears to be an effective rank approximation function well suited to subspace clustering applications. Although our model is simple and does not explicitly model outliers, it is resilient to various corruptions. Our future research will consider modeling corruptions explicitly.