A Novel Convex Clustering Method for High-Dimensional Data Using Semiproximal ADMM

Clustering is an important ingredient of unsupervised learning; classical clustering methods include K-means clustering and hierarchical clustering. These methods may suﬀer from instability because of their tendency prone to sink into the local optimal solutions of the nonconvex optimization model. In this paper, we propose a new convex clustering method for high-dimensional data based on the sparse group lasso penalty, which can simultaneously group observations and eliminate noninformative features. In this method, the number of clusters can be learned from the data instead of being given in advance as a parameter. We theoretically prove that the proposed method has desirable statistical properties, including a ﬁnite sample error bound and feature screening consistency. Furthermore, the semiproximal alternating direction method of multipliers is designed to solve the sparse group lasso convex clustering model, and its convergence analysis is established without any conditions. Finally, the eﬀectiveness of the proposed method is thoroughly demonstrated through simulated experiments and real applications.


Introduction
Clustering is an important ingredient of unsupervised learning. It assigns samples into different clusters by minimizing the differences in the same cluster and maximizing the differences between different clusters. As an exploratory data analysis technique, it has been applied in many fields, such as image processing, energy engineering, and social networks [1][2][3]. To date, a wide variety of clustering methods have been introduced, including classical clustering methods such as K-means clustering [4,5] and hierarchical clustering [6,7]. However, the convexity of the corresponding optimization models cannot be guaranteed in general, so their global optimal solutions are hard to find. In addition, the performances of these methods are strongly dependent on their initial settings. To overcome the aforementioned disadvantages, convex clustering (CC), which provides a convex optimization perspective toward the task of clustering via shrinkage or regularization techniques, has been considered by many researchers, e.g., [8][9][10][11]. Its corresponding optimization model is given by where A is a given data matrix with n observations and p features; ‖ · ‖ F denotes the Frobenius norm of the matrix; c 1 ≥ 0 is a tuning parameter that controls the balance between the model fit and the number of clusters; Θ � ℓ � (ℓ 1 , ℓ 2 ) | 1 ≤ ℓ 1 < ℓ 2 ≤ n is the index set; ω ℓ ≥ 0, ℓ ∈ Θ are given weights that are generally chosen based on the given data matrix A; X ℓ 1 · (X ·ℓ 1 ) is the ℓ 1 th row (column) of matrix X; and the most common choices of q are 1, 2, and ∞, which ensure the convexity of model (1). Let X be an optimal solution of (1); then, X i· � X j· indicates that the observations A i· and A j· are assigned to the same cluster. Note from (1) that if we set c 1 � 0, this will lead to the trivial solution X � A, i.e., the partition of the observations into singletons. As c 1 increases, some rows of X become identical. If we set c 1 to be sufficiently large, all rows of X become identical, which indicates that all observations are lumped into a single cluster. Hence, the clustering path can be obtained by solving (1) with a range of tuning parameters, and then the number of clusters is determined by learning from the data rather than being given in advance.
is fascinating characteristic of convex clustering has attracted increasing attention, see, e.g., [12][13][14][15][16][17][18][19][20][21]. e theoretical results on cluster recovery of (1) with ω ℓ � 1 had been established in [12,13]. Later, Sun et al. [14] extended these results to the general weighted convex clustering model. Furthermore, in [15], the authors analyzed the statistical properties of (1) with ω ℓ � 1, such as an unbiased estimator of the degrees of freedom and a finite sample bound for the prediction error. e large sample behavior of (1) with l 1 fusion penalization is presented in [16]. In [17], the authors present conditions that guarantee that the convex clustering solution path recovers a tree and explicitly describes how the affinity parameters modulate the structure of the recovered tree. Meanwhile, some researchers concentrated on the solution algorithms of (1). For example, Chi and Lange [10] adopted the Alternating Direction Method of Multipliers (ADMM) and Alternating Minimization Algorithm (AMA) to solve (1). More recently, in [14,18], the semismooth Newton-based augmented Lagrangian method was applied to (1) and outperformed ADMM and AMA in efficiency, scalability, and robustness. Moreover, the idea of convex clustering was applied to process missing data [19], binary data [20], and outlier detection [21].
Although convex clustering (1) has desirable theoretical results and performs well computationally, Wang et al. [22] pointed out that the clustering performance may be destroyed in high-dimensional scenarios where the number of features is much more than observations. e noninformative features included in the clustering are the key reason behind why convex clustering (1) is disheartening. Hence, eliminating some noninformative features is indispensable for high-dimensional data clustering. is motivates us to propose a more reasonable convex clustering method that can perform cluster analysis and feature screening simultaneously.
In this paper, we propose the Sparse Group Lasso Convex Clustering (SGLCC) method by adopting the sparse group lasso penalty [23]. e optimization problem is summarized as where the tuning parameter c 2 controls the number of informative features and u � (u 1 , u 2 , . . . , u p ) T ∈ R p is the weight vector. e form of the penalty term is much like the elastic-net penalty [24], and ‖X ·j ‖ 2 and ‖X ·j ‖ 1 denote the Euclidean norm and ℓ 1 norm of vector X ·j , respectively. e parameter 0 ≤ α ≤ 1 controls the balance between the group lasso [25,26] and the lasso [27]. Obviously, convex clustering (1) is a special case of (2) if we set c 2 � 0. Unlike (1), model (2) can perform variable selection. When α � 0, model (2) reduces to sparse convex clustering (SCC) [22]. SCC uses the group lasso penalty to promote group sparsity, that is, all components of the solution X ·j are either zero or nonzero. However, it cannot induce ubiquitous within-group sparsity in genetic data. For example, a biological pathway may be implicated in the progression of a particular type of cancer, but not all genes in the pathway need to be active [28]. e sparse group lasso penalty is designed to achieve such within-group sparsity. erefore, we believe that the proposed clustering model (2) can significantly improve the clustering performance of model (1). It is worth noting that solving the optimization problem (2) is more challenging than convex clustering (1) and SCC. If the optimization algorithms utilized in [10,22] are adopted to solve (2) directly, the computation efficiency may be poor because the subproblem might have no closed-form solution and the step length is too small. Hence, we developed the steps in this paper along with the theory, algorithm, and applications of SGLCC. e main contributions of the paper are summarized as follows: (1) For the proposed SGLCC, we obtain a finite sample error bound and feature screening consistency under mild conditions. Our results are not only applicable to the more practical SGLCC model but also subsume the known results of (1) and SCC as special cases.
(2) We adopt the semiproximal alternating direction method of multipliers to solve (2), and its convergence analysis is established without any additional conditions. e subproblem per iteration of this algorithm has a closed-form solution and can be implemented efficiently.
(3) e experimental results on both synthetic and real datasets illustrate that SGLCC provides superior clustering performance and feature selection abilities to other clustering methods. e remainder of this paper is organized as follows. In Section 2, we summarize some preliminaries and notations. We study the finite sample error bound and feature screening consistency of our proposed method in Section 3. In Section 4, we introduce an efficient optimization algorithm for solving (2). After that, in Section 5, we conduct numerical experiments to verify the effectiveness of the proposed method in synthetic datasets and four microarray datasets. Conclusions are given in Section 6. e proof of the main results can be found in the Supplementary Materials (available here).

Preliminaries
In this section, we introduce some preliminaries that are used later in the paper.
For convenience, we start with some necessary notations. For a vector x ∈ R n and q ≥ 1, ‖x‖ q denotes the ℓ q norm of x and 1 n ∈ R n denotes a vector with all components equal to one. Let |x| � (|x 1 |, |x 2 |, . . . , |x n |) T . For a matrix X ∈ R n×m , X i· (X ·j ) is the ith row (the jth column) of X, ‖X‖ F denotes 2 Mathematical Problems in Engineering the Frobenius norm of X, ‖X‖ 2 denotes the spectral norm of X, and the vectorization of X is defined by vec(X) � (X 11 , X 21 , . . . , X n1 , X 12 , . . . , X nm ) T . e symbols o and ⊗ denote the Hadamard product and Kronecker product, respectively. e linear operator H: R n×p ⟶ R |Θ|×p is given by e ℓth row of H is denoted as H ℓ· � [e ℓ 1 − e ℓ 2 ] T , ℓ � (ℓ 1 , ℓ 2 ) ∈ Θ, and e ℓ 1 is a vector with the ℓ 1 th component equal to one and the rest equal to zero. e adjoint operator of H is given by Next, some definitions and results from convex analysis [29] are provided. Proximal mapping is important for designing optimization algorithms and has been well studied. For a proper closed convex function g: R n ⟶ (−∞; +∞], the proximal mapping Prox tg (x) for g at any x ∈ R n with t > 0 is defined by In particular, if g is an indicator function of the set C (g(x) � 0 if x ∈ C; otherwise g(x) � ∞); then, the proximal mapping reduces to the projection operator onto C. e Fenchel conjugate of g is defined by A key property that connects the proximal mapping and Fenchel conjugate is the so-called Moreau decomposition: For example, if g(x) � ‖x‖ q with q ≥ 1, from Moreau decomposition, we obtain where B � x | ‖x‖ q * ≤ 1 and (1/q) + (1/q * ) � 1.

Statistical Properties
In this section, we study the finite sample error bound and feature screening consistency of sparse group lasso convex clustering (2). We start with the following conditions that facilitate the technical proofs.
is the mean vector and ε ∈ R np is an error vector of independent sub-Gaussian random variables with mean zero and variance δ 2 A2: only s(s ≪ p) features are informative; the informative feature set and its complementary set are then denoted as I and I c , respectively e condition A1 implies that the data matrix A is composed of sub-Gaussian random variables and is commonly used in the literature, see, e.g., [15,22]. In high-dimensional scenarios, only a few features are active in general. us, condition A2 has been widely used in high-dimensional data analysis, see, e.g., [22,30]. For simplicity, as described in the literature [12,13,15,22], we consider the case with ω ℓ � 1.

Bounds for Prediction Error.
e following theorems provide the finite sample bounds for prediction error of model (2), and our theoretical results subsume the existing results for CC and SCC as special cases.
����������� � (log(p|Θ|))/n, then there exists positive constants b 1 and b 2 such that Here, From eorem 1, we can observe that the average prediction error is bounded, and c 2 tends to zero as n, p ⟶ ∞. We know that ‖vec( and c 4 decays to zero as n, p ⟶ ∞. Additionally, by including Mathematical Problems in Engineering e next theorem presents the finite sample bound for the prediction error of model (2) with q � 2.
We only discuss the first term on the right-hand side of (11) in eorem 2 because all other terms are the same as (8) in eorem 1. We find that ‖H ℓ· X * ‖ 2 � O(1) by condition A2.

Remark 1
(1) In high-dimensional scenarios, the number of features p is considerably larger than the number of observations n. (1) hold automatically, and this condition can be found in [15].
(3) eorems 1 and 2 subsume the known results for CC and SCC by taking c 2 � 0 and α � 0, respectively. In addition, we obtain encouraging results provided in Corollary 1 when α � 1.
e following corollary follows directly from eorems 1 and 2 by letting α � 1.

Corollary 1.
Suppose that conditions A1-A2 are satisfied. Let X be the solution of model (2) with α � 1. We obtain the following statements.
As we can see from Corollary 1, SGLCC with α � 1 has prediction consistency and only need some conditions that are easy to satisfy.

Feature Screening Consistency.
Feature screening consistency is a desirable property in high-dimensional scenarios where many noninformative features need to be eliminated. is section demonstrates the asymptotic feature screening consistency of SGLCC.
eorems 3 and 4 give the asymptotic consistency of feature selection of SGLCC with q � 1 and q � 2, respectively. In this sense, SGLCC possesses the ability to correctly eliminate the noninformative features that distort clustering performance. Hence, we believe that the clustering performance of SGLCC is promising.

Algorithmic Design
In this section, we adopt the semiproximal alternating direction method of multipliers to solve (2), and its global convergence can be established. For simplicity, we focus on designing an algorithm to solve (2) with q � 2. e same algorithmic design and implementation can be applied to cases q � 1 and q � ∞ without difficulty.

Optimality Conditions.
By ignoring the terms with ω ℓ � 0, model (2) can be rewritten as min X∈R n×p where E � ℓ � (ℓ 1 , ℓ 2 ) | ω ℓ > 0 . For the convenience of mathematical expression, we define the linear operator J: R n×p ⟶ R |E|×p and its adjoint operator J * : R |E|× p ⟶ R n×p , respectively, by where J ℓ· � [e ℓ 1 − e ℓ 2 ] T . Furthermore, model (13) can be reformulated as the following linearly constrained convex optimization problem with a separable structure: where e Lagrangian function associated with (16) is defined by where λ ∈ R |E|×p , μ ∈ R n×p , and ξ ∈ R n×p are Lagrangian multipliers. For a given parameter σ > 0, the augmented Lagrangian function for (16) is defined by e Karush-Kuhn-Tucker (KKT) conditions for (16) are given by (16). Obviously, F ≠ ∅ and F(X, Y, Z, W) ≥ 0. en, we know from Corollaries 28.2.2 and 28.3.1 in [29] that (X, Y, Z, W) is an optimal solution for (16) if and only if there exists a Lagrangian multiplier (λ, μ, ξ) such that (X, Y, Z, W, λ, μ, ξ) satisfies the KKT conditions (19). Hence, we could design a stopping criterion based on the KKT conditions in terms of guaranteeing the optimality of a point generated by the algorithm proposed in Section 4.2.

Algorithm and Convergence Analysis.
Although the optimization problem (16) is a convex optimization problem with linear constraints, it is difficult to minimize all four variables at the same time. Hence, we divide all four variables into two groups and update the two groups of variables alternately. According to the separable structure, we employ the semiproximal alternating direction method of multipliers (sPADMM) to solve (16) and view X as one block and V � (Y, Z, W) as another.
Next, we discuss the iteration schemes of sPADMM for solving the optimization problem (16). Given the kth iteration (X k , V k , λ k , μ k , ξ k ) and two self-adjoint positive semidefinite operators P 1 and P 2 , first update variables X and V as follows: en, update Lagrangian multipliers: Here, τ is step length and ‖X − X k ‖ 2 . It is well known that the sPADMM reduces to the classical ADMM introduced by [31,32] when P 1 � 0 and P 2 � 0. A comprehensive review of sPADMM can be found in Appendix B of [33].

Remark 2.
e choice of the two self-adjoint positive semidefinite linear operators P 1 and P 2 depends greatly on the problem. e general principle is that both P 1 and P 2 should be as small as possible, while the corresponding subproblems are still relatively easy to solve.
Each block of (20) involves solving a convex optimization problem. Next, we show the closed-form solutions of each block. First, we deal with the first block in (20):

Mathematical Problems in Engineering
e X-subproblem is a smooth, strongly convex optimization that has a unique global minimizer X k+1 . Finding the minimizer X k+1 is equivalent to finding the solution of the following linear system: where M � σ(J T J + P 1 ) + (1 + 2σ)I n . To solve this linear system efficiently, we design an appropriate P 1 . In our experiment, let P 1 � nI n − 1 n 1 T n − J T J, which is a positive semidefinite matrix. en, M � (1 + 2σ + nσ)I n − σ1 n 1 T n , and by the Sherman-Morrison-Woodbury formula [34]. Hence, the closed-form solution for this linear system can be obtained as For the second block in (20), we let the linear operator P 2 � 0 since the linear operator of V in (16) is an identity operator that ensures the well-defined V-subproblem: which can be divided into three parts: Together with the definition of proximal mapping and Moreau decomposition, we show the closed-form solutions of the above subproblems. For any ℓ ∈ E, the Y-subproblem has a closed-form solution that is given by Similarly, for any j � 1, 2, . . . , p, the solution of the Z-subproblem is given by e last subproblem is a soft threshold operator for each column, that is, for any j � 1, 2, . . . , p, Based on the above analysis, the algorithm framework of sPADMM for solving (16) is summarized in Algorithm 1.
e following theorem provides the convergence analysis for Algorithm 1. e proof is inspired by eorem B.1 in [33], but no additional conditions are required in our proof.

Theorem 5.
If sequence (X k , V k , λ k , μ k , ξ k ) is generated from Algorithm 1, then sequence (X k , V k ) converges to an optimal solution for (16) and any accumulation point of (X k , V k , λ k , μ k , ξ k ) such that KKT conditions (19) hold.
Proof. We first justify that the conditions required in eorem B.1 in [33] automatically hold for (16).
(a) e existence of a solution for (16): note that the objective function F(X, Y, Z, W) tends to infinity as ‖(X, Y, Z, W)‖ F ⟶ ∞. erefore, the level set is bounded by eorem 4.11 in [35]. Hence, we further know that the minimizer of (16) can be attained by eorem 4.10 in [35]. (b) e constraints of (16) can be rewritten as Hence, we find that AA * � J T J + 2I > 0, so σ(P 1 + AA * ) is positive definite. Moreover, σ(P 2 + BB * ) is positive definite because B * is the identity operator.
Based on these ingredients, the remainder of the proofs can be obtained easily (see Appendix B of [33] for details).

Numerical Results
e aim of this section is to illustrate the practical performance of sparse group lasso convex clustering (SGLCC), which performs well in theory. We conduct experiments both on synthetic and real datasets and compare the proposed method to K-means clustering, CC [10] and SCC [22]. e clustering results of K-means clustering can be obtained by built-in functions (kmeans) in MATLAB. All numerical experiments were performed in MATLAB R2017b on a personal computer with a 3.30 GHz processor and 4 GB of RAM.
In our experiments, we stop sPADMM based on the relative KKT residual of (16), which is given by where and Tol is a given tolerance. If the requirement ε ≤ Tol is not satisfied after a maximum number of iterations (maxiter), we also terminate the algorithm. In our experiments, we set Tol � 10 − 3 and maxiter � 10000. Moreover, we use the same method to select the weights ω ℓ and the adaptive weights u j in all experiments. For the weights ω ℓ , we select the most popular and practical method proposed by [10]: , ℓ 2 is among ℓ 1 ′ sm − nearest neighbors, 0, otherwise, where ϕ > 0 and m < n. We set ϕ � 0.5 and m � 5 in all experiments. Inspired by [36], we let the adaptive weights ·j is the estimator of X ·j in (1). As can be seen from models (1) and (2), the tuning parameters c 1 and c 2 , α control the number of clusters and the number of informative features, respectively. In our experiments, the principle for selecting c 1 and c 2 , α is to make the results match the true number of clusters and the true number of informative features as much as possible, and the Rand index as large as possible. is principle is also used in the literature [21] to determine the tuning parameters.

Synthetic Data.
In this section, we evaluate the performance of SGLCC on synthetic datasets by comparing the results with those obtained with K-means clustering, CC and SCC. We consider three synthetic datasets. Each synthetic dataset consists of n � 200 observations with either K � 2 or 4 clusters. e number of features is p and ranges from 2000 to 5000, where the percentage of informative features is 2%. e cluster label L i (i � 1, 2, . . . , n) is uniformly sampled from 1, 2, . . . , K { }. Without loss of generality, we assume that the first s(2%p) features are informative and the remaining features are noninformative. e first s informative features are generated from an s-dimensional normal distribution with mean x K (L i ) and covariance Σ s , whose (i, j) entry is Σ s (i, j) � ρ |i− j| , 1 ≤ i, j ≤ s, and the noninformative features are generated from a (p − s)-dimensional standard normal distribution. Here, where I(·) is a function whose value is one if the event happens and zero otherwise. Let us illustrate this definition with an example. Specifically, if K � 2, the first s informative features of the observations with cluster labels 1 and 2 are generated from s-dimensional normal distributions with mean x 2 (1) � 1 s and x 2 (2) � −1 s , respectively. In this part, (1) Initialization Let τ � 1.618 and σ > 0. Given a start point (X 0 , V 0 , λ 0 , μ 0 , ξ 0 ), for k � 0, 1, 2, · · · , (2) repeat (3) Step 1. Update X k+1 according to (25) (4) Step 2. Update V k+1 according to (28), (29) and (30)  (5) Step 3. Update the Lagrangian multipliers by (21) (6) until Stopping criterion is satisfied.
ALGORITHM 1: sPADMM for solving (16).   we consider the following three synthetic datasets. Case I: We repeat each experiment 50 times and evaluate the performance through four common criteria for cluster analysis. e Rand index [37] and Fowlkes-Mallows index [38] are two measures of the similarity between the estimated clustering result and the underlying true clustering assignment. eir values range from 0 to 1, and a larger value indicates better performance. For simplicity, we shall abbreviate the Rand index and Fowlkes-Mallows index as RI and FMI, respectively. Moreover, the false negative rate (FNR) and false positive rate (FPR) are reported to evaluate the performance of feature screening. All experimental results for the synthetic datasets are summarized in Tables 1-3.
From Table 1, several interesting observations can be made. (a) When the feature dimensions are high (p � 2000, 3000, 4000, 5000), CC does not perform well, and its clustering results worsen as p increases. (b) Because CC and K-means do not have the ability to eliminate noninformative features, their FNR and FPR are zero and one, respectively. (c) e clustering results of SGLCC and SCC are significantly better than those of CC when the feature dimensions are high because the penalty term is incorporated in convex clustering (1). (d) SGLCC has more stable performance than K-means and CC because it has the smallest standard deviation (SD). (e) SGLCC selects informative features with greater accuracy than the SCC, that is, with a lower FPR and almost the same FNR. e phenomenon mentioned above can also be observed from the results for Case II, see Table 2 for details. It is worth noting that SCC and SGLCC have similar performance for Case II. is reflects the ability of SCC to conduct feature selection, especially for data with group sparsity and dimensions that are not very large. However, compared with that of SGLCC, the performance of SCC becomes unstable as the number of dimensions increases. e reason behind this difference is that SGLCC allows a more nuanced selection of informative features. e experimental results for Case III are reported in Figure 1 and Table 3. As we can see from Figure 1, SGLCC is significantly better than SCC when there are highly correlated variables in the dataset, and the improvement becomes increasingly evident as p increases. e main reason behind this may be the form of the sparse group lasso penalty, which, much like the elastic net, is good at handling the collinearity problem.  [39]. In our experiments, each feature from all of the datasets is centered. e Rand index of K-means, hierarchical clustering (Hclust),  spectral clustering (Sclust), CC, SCC, and SGLCC for all five real datasets are reported in Table 5, and the best results are highlighted in bold. Here, the clustering results of Hclust and Sclust are obtained through two built-in functions in MATLAB, namely, clusterdata and spectralcluster.

Real
As we can see from Table 5, the performance of CC is unsatisfactory and worse than that of K-means for Brain A, Colon, and ARB. A similar phenomenon can also be observed in [15,22]. SGLCC is significantly superior to K-means, Hclust, Sclust, and CC in term of clustering performance, which implies that feature screening is an indispensable part of high-dimensional data analysis. Moreover, SGLCC is better than SCC because a more reasonable penalty term is used. In summary, our proposed SGLCC method has the largest Rand index among the six methods.

Conclusion
In this paper, we propose the Sparse Group Lasso Convex Clustering method for high-dimensional datasets, which can determine the number of clusters by learning from the data. SGLCC not only divides the observation set but also eliminates noninformative features. For the proposed clustering method, we provide the theoretical finite sample bounds of the prediction error and feature screening consistency. Moreover, an efficient sPADMM is designed to implement our proposed estimator, and each subproblem yields closed-form solutions. Convergence analysis of the sPADMM is established without any additional conditions. e experimental results illustrate that SGLCC provides superior clustering performance and feature selection abilities than other compared clustering methods. In particular, our method is also very effective for real data applications.

Conflicts of Interest
e authors declare no conflicts of interest.