An Implementable First-Order Primal-Dual Algorithm for Structured Convex Optimization



Introduction
In this paper, we consider the following separable optimization problem:

min { θ1(x) + θ2(y) : Ax + By = b, x ∈ X, y ∈ Y },  (1)

where A ∈ R^{m×n1} and B ∈ R^{m×n2} are given matrices, b ∈ R^m is a given vector, and X ⊂ R^{n1} and Y ⊂ R^{n2} are nonempty closed convex sets; θ1 : R^{n1} → R and θ2 : R^{n2} → R are convex functions (not necessarily smooth). Throughout this paper, we assume that the solution set of (1) is nonempty. Problems of this type arise in many applications, ranging from machine learning to compressed sensing; we refer to, for example, [1-9] for a few examples of applications.
To solve (1), one can use the classical augmented Lagrangian method (ALM). Starting from any initial iterate (x^0, y^0, λ^0), ALM iterates via the following procedure:

(x^{k+1}, y^{k+1}) = argmin { L_A(x, y; λ^k) : x ∈ X, y ∈ Y },  (2a)
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} − b),  (2b)

where the augmented Lagrangian function L_A(x, y; λ) associated with problem (1) is given by

L_A(x, y; λ) = θ1(x) + θ2(y) − λ^T(Ax + By − b) + (β/2)‖Ax + By − b‖²,  (3)

λ ∈ R^m is the Lagrangian multiplier vector associated with the linear constraint, and β > 0 is a penalty parameter for the violation of the linear constraint. ALM enjoys very nice convergence properties and has been shown to be equivalent to a proximal point algorithm applied to the dual of (1) [10]. A noticeable feature of ALM is that it treats (1) as a generic minimization problem and completely ignores the nice separable structure of the objective function. The minimizations of the two functions θ1 and θ2 in (2a) are strongly coupled because of the quadratic term (β/2)‖Ax + By − b‖². Hence, the implementation of ALM ((2a) and (2b)) can be computationally challenging.
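As a concrete illustration, the ALM loop ((2a)-(2b)) can be sketched on a toy instance of (1) with θ1(x) = x²/2, θ2(y) = y²/2, and the scalar constraint x + y = b; the instance and all names below are illustrative, not from the paper. For this quadratic case, the joint minimization (2a) has a closed form:

```python
# Toy instance: theta1(x) = x^2/2, theta2(y) = y^2/2, constraint x + y = b.
# Setting the gradient of L_A to zero gives, by symmetry,
# x = y = (lam + beta*b) / (1 + 2*beta).
b = 4.0
beta = 1.0   # penalty parameter
lam = 0.0    # Lagrange multiplier
for _ in range(100):
    x = y = (lam + beta * b) / (1.0 + 2.0 * beta)   # step (2a), closed form here
    lam = lam - beta * (x + y - b)                  # step (2b)
# converges to the optimum x* = y* = b/2 = 2 and lam* = 2
```

In general, of course, (2a) has no closed form, which is exactly the coupling difficulty discussed above.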
To utilize the separable structure of the problem, the well-known alternating direction method (ADM) essentially splits the ALM subproblem into two subproblems with respect to x and y in a Gauss-Seidel manner. More specifically, at each iteration, ADM takes the following form:

x^{k+1} = argmin { L_A(x, y^k; λ^k) : x ∈ X },  (4a)
y^{k+1} = argmin { L_A(x^{k+1}, y; λ^k) : y ∈ Y },  (4b)
λ^{k+1} = λ^k − β(Ax^{k+1} + By^{k+1} − b).  (4c)

ADM can be interpreted as the Douglas-Rachford splitting method applied to the dual problem [11] and as a proximal point method [12]. We refer to, for example, [13-17] for recent studies of ADM and its variants. Compared to ALM, ADM minimizes the augmented Lagrangian (3) with respect to x and y alternatingly at each iteration, rather than with respect to both x and y simultaneously. This splitting procedure makes it possible to exploit the special separable structure of the objective functions. Thus, ADM appears to be a natural fit for solving very large scale distributed machine learning and big-data related optimization problems.
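On the same kind of toy instance as before (θ1(x) = x²/2, θ2(y) = y²/2, constraint x + y = b; an illustrative example, not the paper's), the ADM splitting replaces the joint minimization by two scalar minimizations, each solvable in closed form:

```python
b = 4.0
beta = 1.0
y, lam = 0.0, 0.0
for _ in range(200):
    # (4a): minimize L_A over x with y, lam fixed -> x + beta*(x + y - b) - lam = 0
    x = (lam + beta * (b - y)) / (1.0 + beta)
    # (4b): minimize L_A over y with the new x fixed
    y = (lam + beta * (b - x)) / (1.0 + beta)
    # multiplier update
    lam = lam - beta * (x + y - b)
# converges to x* = y* = 2 and lam* = 2
```

Each update touches only one block of variables, which is what makes the scheme amenable to distributed implementation.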
When A and B in (1) are both identity matrices, ADM can be very efficient. The first two subproblems ((4a) and (4b)) then correspond to evaluating the resolvent operators of the component functions θ1 and θ2, respectively. Here, the proximal operator of a function θ : R^n → R is defined by

prox_{γθ}(a) := argmin { θ(x) + (1/(2γ))‖x − a‖² : x ∈ R^n },  (5)

where a ∈ R^n and γ > 0. In most popular applications of sparse optimization, the proximal operator can be computed exactly and efficiently (e.g., for θ(x) = ‖x‖_1 it reduces to the component-wise soft-thresholding, or shrinkage, operator). When A (resp., B) is not an identity matrix, however, the resulting ADM subproblems may not be easily solvable, since they involve inverting A (resp., B), and there is no efficient way of doing so directly. This difficulty can greatly reduce the efficiency of ADM. As a result, first-order algorithms that preserve both the alternating computation feature of ADM and the advantage of not involving the inverses of A and B are highly desirable.
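For instance, under the convention in (5), the proximal operator of the ℓ1 norm is the component-wise soft-thresholding operator; a minimal NumPy sketch:

```python
import numpy as np

def prox_l1(a, gamma):
    # Soft-thresholding: the closed-form proximal operator of the l1 norm under
    # prox_{gamma*||.||_1}(a) = argmin_x ||x||_1 + (1/(2*gamma)) * ||x - a||^2.
    # Components with |a_i| <= gamma are set to zero; the rest shrink toward 0.
    return np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0)

# e.g. prox_l1([3, -0.5, 1], 1) = [2, 0, 0]
```

No linear system is solved: the cost is linear in the dimension of a, which is the "easy prox" situation exploited throughout the paper.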
In this paper, we focus on the specific scenario of (1) in which neither A nor B is an identity matrix. We propose a new first-order primal-dual algorithm for (1). We show that our algorithm can be viewed as a customized proximal point method with a special proximal regularization matrix, and that it falls within the framework of contraction-type methods. The proposed algorithm has three main advantages. First, it is easily implementable: the main computational effort of each iteration is to evaluate the proximal operators of the component objective functions. Second, the involved subproblems are solved consecutively in the ADM manner, which makes the algorithm amenable to distributed optimization. Finally, it uses only first-order information and requires no matrix inversion or solution of linear systems. Hence, the method is well suited for solving large scale distributed optimization problems.
The paper is organized as follows. In Section 2, we review some preliminaries. In Sections 3 and 4, we present the new method and analyze its convergence. In Section 5, we conduct numerical experiments to compare the proposed method with APGM for solving the stable principal component pursuit problem. We conclude the paper in Section 6. In the next section, we reformulate (1) in a variational form, which is useful for the subsequent algorithmic illustration and convergence analysis.

Variational Characterization of (1)
The Lagrangian function associated with (1) is

L(x, y, λ) = θ1(x) + θ2(y) − λ^T(Ax + By − b),

where λ ∈ R^m is the Lagrangian multiplier. Under the convexity assumptions on (1), finding optimal solutions of (1) and its dual is equivalent to finding a saddle point of L. More precisely, let (x*, y*, λ*) be a saddle point of L.
We have

L(x*, y*, λ) ≤ L(x*, y*, λ*) ≤ L(x, y, λ*)  for all (x, y, λ) ∈ X × Y × R^m.

Then we can directly read off the optimality conditions in variational form. More specifically, w* = (x*, y*, λ*) ∈ Ω := X × Y × R^m is a saddle point of L if and only if it satisfies the following mixed variational inequality (VI):

θ(u) − θ(u*) + (w − w*)^T F(w*) ≥ 0  for all w ∈ Ω,  (9a)

where

u = (x, y),  θ(u) = θ1(x) + θ2(y),  (9b)

F(w) = ( −A^T λ, −B^T λ, Ax + By − b ).  (9c)

Hence, a solution of ((9a)-(9c)) yields a solution of (1). Note that the mapping F(·) is said to be monotone with respect to Ω if

(w − w̃)^T (F(w) − F(w̃)) ≥ 0  for all w, w̃ ∈ Ω.

Consequently, it can be easily verified that F(·) is monotone; indeed, the affine part of F has a skew-symmetric coefficient matrix.
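The monotonicity of F can also be checked numerically: with F(w) = (−Aᵀλ, −Bᵀλ, Ax + By − b), the inner product (w − w̃)ᵀ(F(w) − F(w̃)) vanishes identically because the affine part of F is skew-symmetric. A small sketch (the dimensions and random data are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n1, n2 = 5, 4, 3
A = rng.standard_normal((m, n1))
B = rng.standard_normal((m, n2))
b = rng.standard_normal(m)

def F(x, y, lam):
    # the VI mapping (9c): F(w) = (-A^T lam, -B^T lam, A x + B y - b)
    return np.concatenate([-A.T @ lam, -B.T @ lam, A @ x + B @ y - b])

w1 = [rng.standard_normal(n1), rng.standard_normal(n2), rng.standard_normal(m)]
w2 = [rng.standard_normal(n1), rng.standard_normal(n2), rng.standard_normal(m)]
diff = np.concatenate([p - q for p, q in zip(w1, w2)])
val = diff @ (F(*w1) - F(*w2))
# skew-symmetry of the linear part makes val exactly 0, hence >= 0 (monotone)
```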
Under the aforementioned nonemptiness assumption on the solution set of (1), the solution set of ((9a)-(9c)), denoted by W*, is also nonempty.

Proximal Point Algorithmic Framework.
In this subsection, we review the classical proximal point algorithm (PPA) for solving the VI ((9a)-(9c)). PPA, which was proposed by Martinet [20] and further studied by Rockafellar [10], plays a fundamental role in optimization. For a given iterate w^k ∈ Ω, PPA generates the new iterate w^{k+1} via the following procedure:

θ(u) − θ(u^{k+1}) + (w − w^{k+1})^T { F(w^{k+1}) + G(w^{k+1} − w^k) } ≥ 0  for all w ∈ Ω,

where G is a positive definite matrix playing the role of a proximal regularization parameter. A simple choice is G = γ·I, where γ > 0 and I is the identity matrix, which regularizes the proximal term w^{k+1} − w^k in a uniform way. We refer the reader to, for example, [21-24] for some special choices of G in different scenarios.
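When the VI comes from minimizing a single convex function f and G = γI, the PPA step reduces to the classical proximal iteration w^{k+1} = argmin f(w) + (γ/2)‖w − w^k‖². A scalar sketch with the illustrative choice f(x) = (x − 3)²/2 (not a function from the paper):

```python
gamma = 2.0   # proximal parameter, i.e. G = gamma * I
x = 0.0
for _ in range(200):
    # closed-form proximal step for f(x) = (x - 3)^2 / 2:
    # optimality gives (x_new - 3) + gamma * (x_new - x) = 0
    x = (3.0 + gamma * x) / (1.0 + gamma)
# iterates converge to the minimizer x* = 3; larger gamma means heavier damping
```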

The Main Algorithm
3.1. Assumption. Before we present our new algorithm, we need to make the following assumption.

Assumption 1. For any given a ∈ R^n and γ > 0, the proximal operator of h(u) (see (5)) has a closed-form solution, or it can be solved efficiently to high precision.
Whenever the assumption holds, we say that the proximal operator of h is "easy" to evaluate. Note that h is separable across the two variables, that is, h(u) = θ1(x) + θ2(y); according to definition (5), we have

prox_{γh}(a1, a2) = ( prox_{γθ1}(a1), prox_{γθ2}(a2) ),

where a1 ∈ R^{n1} and a2 ∈ R^{n2}. Hence, under the assumption, the proximal operators of the component objective functions θ1 and θ2 are also "easy" to evaluate.
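The splitting of a separable prox can be illustrated with the (purely illustrative) choices θ1 = ‖·‖_1 and θ2(y) = ‖y‖²/2, under the same convention as (5):

```python
import numpy as np

# prox of a separable h(u) = theta1(x) + theta2(y) splits into two
# independent proximal steps, one per block.
def prox_l1(a, g):
    # closed-form prox of the l1 norm: soft-thresholding
    return np.sign(a) * np.maximum(np.abs(a) - g, 0.0)

def prox_sqnorm(a, g):
    # closed-form prox of theta2(y) = ||y||^2 / 2:
    # argmin ||y||^2/2 + (1/(2g))||y - a||^2  ->  y = a / (1 + g)
    return a / (1.0 + g)

a1, a2, gamma = np.array([2.0, -0.3]), np.array([4.0]), 1.0
prox_h = (prox_l1(a1, gamma), prox_sqnorm(a2, gamma))
# each block is handled independently: ([1, 0], [2])
```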

Motivation.
The motivation for our algorithm is directly related to the linearized augmented Lagrangian method proposed in [25] and the customized proximal point algorithm proposed in [21] for convex problems with linear constraints.
In order to obtain closed-form subproblem solutions, we first add a proximal regularization matrix G (specified below) to the variational characterization ((9a)-(9c)) of the problem. Then, as in the algorithm of [21], we obtain a proximal point scheme (13a)-(13b). However, (13b) is still not implementable. Inspired by [25], in order to alleviate the computation required by the subproblems, we linearize the quadratic term in (13b) at the current iterate and compensate with a proximal term, where r > 0 is a proximal parameter. With a simple manipulation, we obtain an approximation to (13b); in this case, the closed-form solutions of the resulting subproblems can be easily obtained.
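The effect of the linearization can be sketched on the single-block model min ‖x‖₁ + (β/2)‖Ax − c‖², an illustrative stand-in for one coupled subproblem (A, c, β, r are all assumed names, not the paper's data): linearizing the quadratic at x^k and adding (r/2)‖x − x^k‖² turns each update into one ℓ1 proximal step, with no inversion of A.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
beta = 1.0
r = 1.01 * beta * np.linalg.norm(A.T @ A, 2)   # proximal parameter, r > beta*||A^T A||

def soft(v, t):
    # soft-thresholding = proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def obj(x):
    return np.abs(x).sum() + 0.5 * beta * ((A @ x - c) ** 2).sum()

x = np.zeros(10)
start = obj(x)
for _ in range(300):
    # gradient step on the linearized quadratic, then an l1 prox:
    # x <- prox_{||.||_1 / r}( x - (beta/r) * A^T (A x - c) )
    x = soft(x - (beta / r) * (A.T @ (A @ x - c)), 1.0 / r)
```

With r above β‖AᵀA‖ the objective decreases monotonically, mirroring the role the condition on the proximal parameters plays in the convergence analysis below.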

Description of the Algorithm.
In this section, we formally present Algorithm 1.

Remark 2.
From the algorithm, we can see that the minimizations ((*) and (**)) each require the evaluation of the proximal operator of one of the component objective functions θ1 and θ2. It is clear that the implementation of the proposed method is simple under our assumption.
Step 2. If a termination criterion is not met, go to Step 1.

Global Convergence
In this section, we show that the proposed method is in some sense equivalent to the proximal point algorithm with a special proximal regularization parameter.Then its convergence can be easily established under the analytic framework of contraction type methods.
Proof. First, by deriving the optimality condition for the subproblem (*), we obtain (17). It can be further rewritten as (18). Similarly, the optimality condition for the subproblem (**) is (19), which can be further rewritten as (20). Substituting (***) into (19), we obtain (21). In addition, it follows from (***) that (22) holds. Combining (18), (21), and (22), we get (23) with w = (x, y, λ)^T ∈ Ω. Using the notation of G, the assertion (16) is proved.
Note that when β > 0, r > β‖AᵀA‖, and s > β‖BᵀB‖, G is a positive definite matrix. Lemma 3 implies that the proposed algorithm can be viewed as a customized version of the PPA, with G as the metric proximal parameter. Henceforth, we denote the proposed algorithm by CPPA. We now state and prove the contractive property of our algorithm.
Proof. Note that w* ∈ Ω; it follows from (16) that (26) holds, which can be further written as (27). On the other hand, using the fact that F(·) is a monotone operator, we have (28). Combining (28) and (27), we obtain (29). Since w* is a solution of (9a) and w^{k+1} ∈ Ω, we have (30). Thus, we get (31). Since G is positive definite under the requirements (24) on the parameters, (31) can be further written as (32). On the other hand, estimating ‖w^{k+1} − w*‖_G in the same way yields the contraction inequality. Since w^{k_j} → w^∞, for the ε > 0 given above there is an integer k_l > k_0 such that (39) holds. Therefore, for any k > k_l, it follows from (38) and (39) that ‖w^k − w^∞‖_G < ε. This implies that the sequence {w^k} converges to w^∞ ∈ Ω*.

Application to the Stable Principal Component Pursuit Problem
In this section, we apply the proposed method to solve the stable principal component pursuit problem with nonnegative constraints (SPCP). The problem tested is from Example 2 of [19]. Our code was written in Matlab R2009b, and all experiments were performed on a laptop with an Intel Core 2 Duo 2.0 GHz CPU and 2 GB of memory.
Let C = L + S + N be a given observation matrix, where L is a low-rank and nonnegative matrix, S is a sparse matrix, and N is a noise matrix. SPCP, which arises in image processing, seeks to recover L and S by solving the following nonsmooth convex optimization problem:

min_{L,S,N} ‖L‖_* + τ‖S‖_1 + I(L)  subject to  L + S + N = C, ‖N‖_F ≤ δ,  (41)

where ‖·‖_* is the nuclear norm (defined as the sum of all singular values), ‖·‖_1 and ‖·‖_F denote the ℓ1 norm and the Frobenius norm of a matrix, respectively, and I(·) is the indicator function of the nonnegative orthant R_+^{m×n}. In the above model, N denotes the noise matrix, and τ and δ are fixed parameters.
Following the procedure described in [19], by introducing an auxiliary variable and grouping the unknowns into two big blocks, model (41) can easily be reformulated as a problem (42) that fits the setting of (1). The proposed algorithm ((*)-(***)) is therefore applicable to (42), and we obtain the iterative scheme ((43)-(46)). The main advantage of CPPA applied to SPCP is that the minimizations generated in ((43)-(46)) are all simple enough to have closed-form solutions. For completeness, we elaborate on the strategy for solving the resulting subproblems at each iteration.
(i) The L-subproblem (43) amounts to evaluating the proximal operator of the nuclear norm function and is given by the matrix shrinkage operation, where the matrix shrinkage operator MatShrink(C, ζ) (ζ > 0) is defined as

MatShrink(C, ζ) := U Diag(max{σ − ζ, 0}) V^T,

and U Diag(σ) V^T is the SVD of the matrix C.
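A minimal NumPy rendering of this matrix shrinkage (singular value thresholding) operator, under the SVD convention above:

```python
import numpy as np

def mat_shrink(C, zeta):
    # MatShrink(C, zeta) = U * Diag(max(sigma - zeta, 0)) * V^T,
    # where U * Diag(sigma) * V^T is the SVD of C: every singular value
    # is soft-thresholded by zeta, which is the prox of the nuclear norm.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ np.diag(np.maximum(s - zeta, 0.0)) @ Vt

# e.g. a rank-1 matrix with singular value 1, shrunk by 0.4,
# becomes the same matrix scaled to singular value 0.6
```

The per-iteration cost is dominated by one SVD, which is the standard price of nuclear-norm proximal steps.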
For the numerical experiments, we follow [19] to randomly generate the data of (41). We consider the scenario m = n. For given m and r < m, we generate L* = R1^T R2, where R1 and R2 are independent r × m full-row-rank matrices whose elements are independently and identically distributed (i.i.d.) uniformly in [0, 1]. Note that in this experiment, L* is a component-wise nonnegative and low-rank matrix to be recovered. The support of the sparse matrix S* was chosen uniformly at random, and the nonzero entries of S* are generated i.i.d. uniformly in [−500, 500]. The entries of the noise matrix N* were generated as i.i.d. Gaussian with standard deviation 10^{-4}; we set C := L* + S* + N*.
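The data generation just described can be sketched as follows (the dimensions m and r and the 5% sparsity level are illustrative placeholders; the actual sizes are set per scenario in Table 1):

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 50, 5
# low-rank, component-wise nonnegative part: L* = R1^T R2, uniform [0,1] entries
R1 = rng.uniform(0.0, 1.0, size=(r, m))
R2 = rng.uniform(0.0, 1.0, size=(r, m))
L_star = R1.T @ R2
# sparse part: random support, nonzeros i.i.d. uniform in [-500, 500]
S_star = np.zeros((m, m))
support = rng.random((m, m)) < 0.05
S_star[support] = rng.uniform(-500.0, 500.0, size=int(support.sum()))
# noise part: i.i.d. Gaussian with standard deviation 1e-4
N_star = 1e-4 * rng.standard_normal((m, m))
C = L_star + S_star + N_star   # observed matrix
```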
In the following, we compare CPPA with APGM [18, 19], since both are designed based on the "simple" assumption and share the same conditions for convergence. For the penalty parameter β, we take β = 0.01. For the other parameters required by these methods, we choose γ = 2.618, r = s, and τ := 1/√m. For both CPPA and APGM, the initial iterate is chosen as L^0 = S^0 = −C, with the remaining variables and the multipliers Λ1^0 = Λ2^0 initialized to zero. The stopping criterion is that the relative change of the iterates falls below the tolerance tol = 10^{-4}. We denote Rank_r := r/m, so that the rank of L* is m · Rank_r, and Card_r := card(S*)/m², so that the cardinality of S* is m² · Card_r. Some preliminary numerical results are reported in Table 1. Since these are synthetic examples with random data, we ran each scenario 10 times and averaged the results over the ten runs. Specifically, we report the number of iterations (Iter.), the relative error of the low-rank matrix (rel_L), the relative error of the sparse matrix (rel_S), and the CPU time.