An Inexact Penalty Decomposition Method for Sparse Optimization

The penalty decomposition (PD) method is an effective and versatile method for sparse optimization and has been successfully applied to compressed sensing, sparse logistic regression, sparse inverse covariance selection, low-rank minimization, image restoration, and related problems. However, as the penalty parameter increases, the sequence of penalty subproblems that the PD method must solve can be time consuming. In this paper, an acceleration of the penalty decomposition method is proposed for the sparse optimization problem. For each penalty parameter, the proposed method finds only inexact solutions to the subproblems. Computational experiments on a number of test instances demonstrate the effectiveness and efficiency of the proposed method in accurately generating sparse and redundant representations of one-dimensional random signals.


Introduction
In this paper, we consider solving the following sparse optimization problem by an inexact penalty decomposition (iPD) method:

min_{x∈X} l(x) + λ‖x‖_0, (1)

where λ ≥ 0 controls the sparsity of the solution, X ⊂ R^n is a closed convex set in the n-dimensional Euclidean space R^n, l: R^n → R is a continuously differentiable convex function, and ‖x‖_0 denotes the number of nonzero components of x. Sparse optimization addresses problems whose solutions are sparse or compressed. It has attracted considerable attention in the past decade owing to its broad applications, such as signal and image processing [1-3], linear regression [4], inverse problems [5], model selection [6], and machine learning [6, 7]. In those applications, most of the information of interest admits, or can be coded by, a much lower-dimensional representation even though its ambient dimension is high.
In this paper, an inexact penalty decomposition (iPD) method is proposed for the sparse optimization problem (1). The iPD method finds only inexact solutions to the penalty subproblems for each penalty parameter. In more detail, for the first (convex) subproblem, the iPD method takes just one gradient step and then turns to the second (nonconvex) subproblem, which can be solved by the iterative hard thresholding method [26]. After these two steps, the penalty parameter is updated. Computational experiments on a number of random instances demonstrate the effectiveness of the proposed method in accurately generating sparse and redundant representations of one-dimensional random signals. The rest of this paper is organized as follows. Section 2 is the preliminary, in which some notations and the basic PD method are described. Section 3 presents the iPD method. Computational experiments are presented in Section 4, and conclusions are drawn in Section 5.
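The two steps sketched above can be illustrated in a few lines of Python. This is a minimal sketch under our own naming, not the paper's implementation: one gradient step on the smooth part, followed by an exact hard-thresholding step, which solves min_y λ‖y‖_0 + (ρ/2)‖v − y‖² componentwise.

```python
import numpy as np

def hard_threshold(v, lam, rho):
    # Componentwise minimizer of lam*||y||_0 + (rho/2)*||v - y||^2:
    # keep v_i when (rho/2)*v_i^2 > lam, i.e. v_i^2 > 2*lam/rho; zero it otherwise.
    y = v.copy()
    y[v ** 2 <= 2.0 * lam / rho] = 0.0
    return y

def ipd_sweep(y, grad_l, lam, rho, L):
    # One inexact sweep of the iPD idea (our sketch): a single gradient step
    # on the smooth part of the penalty function, then hard thresholding.
    x_new = y - grad_l(y) / L          # gradient step with step length 1/L
    y_new = hard_threshold(x_new, lam, rho)
    return x_new, y_new
```

For instance, with l(x) = ‖x‖²/2 (so grad_l is the identity) and L = 2, one sweep halves the iterate and then zeroes its small components.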

Notations.
In this subsection, some notations are presented to simplify the presentation. The transpose of a vector x ∈ R^n is denoted by x^T. Unless stated otherwise, all norms are the Euclidean norm, denoted by ‖·‖_2. P_X(·) denotes the projection onto a set X. Given a vector x ∈ R^n, the nonnegative part of x is denoted by x_+, i.e., x_+ = max(x, 0). The index set of nonzero components of a vector x is denoted by S(x) = {i : x_i ≠ 0} (called the support set), and S_k := S(x^k). The size of S(x) is denoted by s = |S(x)|. Now, let us consider problem (1). It is easy to verify that problem (1) is equivalent to the following problem:

min_{x∈X, y∈R^n} l(x) + λ‖y‖_0 s.t. x = y, (2)

and the corresponding penalty function of problem (2) is defined as

p_ρ(x, y) = l(x) + λ‖y‖_0 + (ρ/2)‖x − y‖_2^2,

where ρ > 0 is the penalty parameter. For simplicity, we also write p_{ρ_k}(x, y) for the penalty function with parameter ρ_k.
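As a concrete check of the penalty function above, it can be evaluated directly; the sketch below uses our own function names and assumes the splitting just stated.

```python
import numpy as np

def p_rho(x, y, l, lam, rho):
    # p_rho(x, y) = l(x) + lam*||y||_0 + (rho/2)*||x - y||_2^2
    return l(x) + lam * np.count_nonzero(y) + 0.5 * rho * np.sum((x - y) ** 2)
```

When x = y and x is feasible, the quadratic term vanishes and the penalty reduces to l(x) + λ‖x‖_0, the objective of problem (1).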

The PD Method.
In this subsection, we review the PD method proposed in [22]. First, the outline of the PD method is presented in Algorithm 1. Then, we explain, by means of a random example, why the PD method is time consuming.

Remark 1
(i) The termination condition in Step 8 of Algorithm 1 is used to establish the global convergence of the PD method.
In practice, the inner termination criterion is based on the relative change of the sequence (x^{k,i}, y^{k,i}), such as the sequence satisfying

max{‖x^{k,i} − x^{k,i−1}‖_∞, ‖y^{k,i} − y^{k,i−1}‖_∞} ≤ ϵ_I

for some ϵ_I > 0. In addition, the PD method terminates the outer iterations when

‖x^k − y^k‖_∞ ≤ ϵ_O

holds for some ϵ_O > 0.
where [·]_j denotes the j-th entry of a vector, j ∈ {1, 2, . . . , n}. In Step 5 of Algorithm 1, minimizing the function p_{ρ_k}(x, y^k) with respect to x is a convex problem, and many efficient methods exist for this purpose if X is simple. However, for each penalty parameter, the PD method solves the penalty subproblems several times until some termination condition is reached, which is time consuming.

Computational Intelligence and Neuroscience
Consider a special case: compressed sensing [31]. One important task of compressed sensing is to find the sparsest solution to an underdetermined linear system, which is formulated as

min_x ‖x‖_0 s.t. Ax = b, (9)

where A ∈ R^{m×n} is the sensing matrix and b ∈ R^m is the observation data. For this special problem, f(x) = (1/2)‖Ax − b‖_2^2 and F_ρ(x) = ‖x‖_0 + (ρ/2)‖Ax − b‖_2^2. The value of f(x) is called the data fidelity, and it measures the feasibility of a solution x. F_ρ(x) is the penalty function of problem (9). Example 1. We generate a sparse vector x* with length n = 5000 and s = 100 nonzero components. These components independently follow the standard Gaussian distribution, and their locations are assigned randomly in x*. Then, we create a Gaussian random matrix A of size 1000 × 5000 and let b = Ax*. Then, we solve this instance by the PD method package, and the process data are presented in Figure 1. Figure 1 shows that the value of f(x) decreases slowly; it decreases steeply only during the first few steps for each penalty parameter.
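An instance of the kind used in Example 1 can be generated as follows. This is a sketch in our own notation (the seed and variable names are our choices), following the construction described above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (our choice)
n, m, s = 5000, 1000, 100

# Sparse ground truth: s standard-Gaussian values at random positions.
x_star = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_star[support] = rng.standard_normal(s)

A = rng.standard_normal((m, n))  # Gaussian sensing matrix
b = A @ x_star                   # noiseless observations

def fidelity(x):
    # Data fidelity f(x) = (1/2)*||Ax - b||_2^2 from the text.
    return 0.5 * float(np.sum((A @ x - b) ** 2))
```

By construction the ground truth is feasible, so its data fidelity is zero, while a dense random vector typically has a large fidelity value.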
There are many almost-null steps during the process, and the value of the penalty function F_ρ(x) jumps up sharply whenever the penalty parameter is updated. Hence, we can take just one or a few iterations for each penalty parameter to save time. In Section 3, we improve the PD method based on these observations.

The Proposed Method
In this section, we describe the iPD method. From the outline of Algorithm 1, we find that, for each penalty parameter ρ_k, the block coordinate descent method needs to alternately solve two minimization subproblems many times, and the example in Section 2 shows that there are many almost-null steps for each penalty parameter. Hence, the original PD method may be time consuming if the convergence of the block coordinate descent method is slow.
Motivated by the analysis in Section 2 and the above demonstration, we accelerate the PD method by solving each of the two penalty subproblems only once before updating the penalty parameter. For the first penalty subproblem, a gradient step is taken, and its step length is determined by a backtracking line search. Now, we present the outline of the accelerated penalty decomposition method as follows.
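The backtracking gradient step can be sketched as below. We assume the standard sufficient-decrease test p(x) ≤ p(y) + ⟨g, x − y⟩ + (L/2)‖x − y‖²; the paper's exact test and constants may differ, and the function names are ours.

```python
import numpy as np

def gradient_step(y, grad_p, p, L0, c_inc=2.0, max_tries=50):
    # Backtracking search for a step length 1/L_k on the smooth part p.
    g = grad_p(y)
    L = L0
    for _ in range(max_tries):
        x = y - g / L                  # trial gradient step
        d = x - y
        if p(x) <= p(y) + g @ d + 0.5 * L * (d @ d):
            break                      # L is a valid local Lipschitz estimate
        L *= c_inc                     # inflate L, i.e., shorten the step
    return x, L
```

For a quadratic p(x) = ‖x‖²/2 the test first passes at L = 1, so starting from L0 = 0.5 the loop doubles L once and then takes the full Newton-length step to the minimizer.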
Together with x^{k+1} = y^k − ∇_x p_{ρ_k}(y^k, y^k)/L_k, this implies that the sufficient-decrease condition holds, which means that the while loop in Algorithm 2 terminates once L_k ≥ (L_p + η)/2. Let L_k be the final value of L_k after the while loop. Then, L_k/c_inc ≤ (L_p + η)/2 holds, i.e., L_k ≤ c_inc(L_p + η)/2. Let n_k be the number of iterations of the while loop at the k-th iteration. Since L_k = c_inc^{n_k} L_k^0, where L_k^0 is the initial value of L_k in the line search, one can get that n_k ≤ log_{c_inc}(c_inc(L_p + η)/(2 L_k^0)).

Experiments
In this section, we implement the proposed iPD method to solve the compressed sensing problem. To verify its efficiency empirically, a large number of computational experiments are performed on one-dimensional random signals. We mainly compare the performance of the iPD method with that of the original PD method [22]. All experiments were performed on a personal computer with an Intel(R) Core(TM) i7-7700HQ CPU (2.80 GHz) and 8 GB of memory, using MATLAB (version R2018b).
ALGORITHM 2: The inexact PD method (main loop):
(5) x^{k+1} = y^k − ∇_x p_{ρ_k}(y^k, y^k)/L_k;
(6) end while
(7) y^{k+1} ∈ argmin_y p_{ρ_k}(x^{k+1}, y);
(8) ρ_{k+1} = σ_k ρ_k;
(9) L_{k+1} = min(L_k/c_dec, L_min);
(10) k ← k + 1;
(11) until some termination condition is reached
(12) x̄ ← y^k.

We compare the performance of the methods by the CPU time (in seconds) required, the size of the support set of the reconstructed signal x̄, the mean squared error (MSE) with respect to x*, defined as

MSE = (1/n)‖x̄ − x*‖_2^2,

the data fidelity, defined as

DF = ‖Ax̄ − b‖_2,

and NS, the number of successfully recovered instances. We say a signal x̄ is successfully recovered if the positions of its nonzero components are the same as those of x* and the corresponding MSE value is less than 10^{−4}.
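The evaluation criteria above can be coded directly. The sketch below uses our own function names and assumes the per-entry-average convention for the MSE stated above.

```python
import numpy as np

def mse(x_rec, x_star):
    # Mean squared error of the recovered signal (per-entry average).
    return float(np.mean((x_rec - x_star) ** 2))

def data_fidelity(x_rec, A, b):
    # DF = ||A x_rec - b||_2
    return float(np.linalg.norm(A @ x_rec - b))

def is_recovered(x_rec, x_star, tol=1e-4):
    # Success as defined above: identical support and MSE below the tolerance.
    same_support = np.array_equal(x_rec != 0, x_star != 0)
    return bool(same_support and mse(x_rec, x_star) < tol)
```

Note that a reconstruction with tiny but nonzero values off the true support fails the support test even if its MSE is small, which matches the strict success criterion used here.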

Data Generation and Parameter Setting.
Each instance is generated randomly with size (m, n, s), where m × n is the dimension of matrix A and s is the sparsity level, e.g., m = 1000, n = 5000, and s = 100. The elements of matrix A follow the standard Gaussian distribution. The nonzero entries of x* are drawn from the same distribution at s randomly chosen coordinates. Finally, the vector b is generated by b = Ax*. Unless otherwise stated, all parameters of the PD method are set to their defaults, and the parameters of the iPD method are set as in Table 1.

4.2.
Compare with the Original PD Method. Firstly, we compare the iteration process of the iPD method with that of the PD method on a random instance. All parameters are set as before, and the problem size is m = 1000, n = 5000, and s = 100. Figure 2 shows the data fidelity and the penalty function value over the iterations. From Figure 2(a), we find that the iPD method does not take many null steps, and the data-fidelity values generated by the iPD method decrease much faster than those of the original PD method. Furthermore, the iPD method requires only about 150 steps, while the original PD method requires about 400 steps, and the running time of the iPD method is about 7 seconds, less than half the time required by the original PD method. Moreover, the penalty function value generated by the iPD method is much more stable than that of the original PD method.
In the second experiment, we compare the accelerated PD method with the original PD method at different sampling numbers m. We fix the dimension n = 5000 and the sparsity level s = 100. For each sampling number m, 100 instances are generated, and the averaged performance of the two methods is presented in Figure 3.
From Figure 3(a), we see that the accelerated PD method requires no more than 10 seconds, while the original PD method requires much more time, and the time required by the accelerated PD method is stable across the different sampling numbers. Figure 3(b) shows that the recovery rate of the accelerated PD method is higher than that of the original PD method when m is larger than 600. When the sampling number is larger than 700, the accelerated PD method recovers all signals successfully. We also find that the MSE and DF values generated by the accelerated PD method are lower than those generated by the original PD method. The averaged number of nonzero components also shows that the accelerated PD method performs better.

Table 1: Parameter settings in the acceleration of the PD method.
Parameter | Value
In the next experiment, we compare the accelerated PD method with its original version for solving the compressed sensing problem with different sparsity levels s. All parameters are set to the same values as stated before. The averaged computational results on 100 instances are presented in Table 2.
From Table 2, we find that the PD method does not work well when the sparsity level is greater than 150, especially when it is greater than 200, whereas the sparsity level recovered by the iPD method can reach 200. When both methods recover the sparse signals, the iPD method needs only about one third of the time required by the PD method. Moreover, the recovery rate of the iPD method is higher than that of the original PD method. From the MSE and DF values, we see that the signals recovered by the iPD method are more exact than those recovered by the PD method. When s = 100, one instance is not recovered exactly by the iPD method because it contains several very small components and one of them is not recovered.

Conclusions
In this paper, we have proposed an acceleration of the penalty decomposition method for the sparse optimization problem. The proposed method does not solve the penalty subproblems exactly; it solves each penalty subproblem only once before updating the penalty parameter. Computational experiments on a number of random compressed sensing instances show that this method enhances the performance of the penalty decomposition method. The experiments demonstrate that the proposed method indeed improves the original PD method, recovering better solutions in less running time.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.