A General Solution to Least Squares Problems with Box Constraints and Its Applications

. The main contribution of this paper is presenting a flexible solution to the box-constrained least squares problems. This solution is applicable to many existing problems, such as nonnegative matrix factorization, support vector machine, signal deconvolution, and computed tomography reconstruction. The key concept of the proposed algorithm is to replace the minimization of the cost function at each iteration by the minimization of a surrogate, leading to a guaranteed decrease in the cost function. In addition to the monotonicity, the proposed algorithm also owns a few good features including the self-constraint in the feasible region and the absence of a predetermined step size. This paper theoretically proves the global convergence for a special case of below-bounded constraints. Using the proposed mechanism, some valuable algorithms can be derived. The simulation results demonstrate that the proposed algorithm provides performance that is comparable to that of other commonly used methods in numerical experiment and computed tomography reconstruction.


Introduction
Solving the linear system  =  is a classic inverse problem, where  ∈   and  ∈   are vectors and  ∈  × is a matrix.This problem is applicable in many fields, including nonnegative matrix factorization (NMF), support vector machine (SVM), signal deconvolution, and medical image reconstruction (e.g., computed tomography (CT)) [1][2][3][4][5].In many cases, we need to impose a box constraint on the problem.A classical approach to choosing  is to minimize the least squares (LS) error between  and : where  1 and  2 are given constant vectors and ‖ ⋅ ‖ denotes the Euclidean distance.
In fact, (1) can be easily converted into an equivalent form by a linear transformation  =  −  1 , which is convenient to study.In the following, we only focus on such LS problems and still denote variables with  for simple notation; subsequently, the readers may generalize all the results of this paper to the generic form by themselves: where  is a constant vector.
If we denote  = +∞, then it is the widely used case, that is, the nonnegativity constrained LS problem: min ≥0  () = ‖ − ‖ 2 . ( There are many methods for solving nonnegativity constrained problems.A simple method for solving such problems is by using Of course, we must truncate the negative pixel values to satisfy the nonnegativity constraint.However, we will encounter two difficulties: perhaps    is not invertible or    is too large.A viable example for the minimization of ( 3) is 2 Mathematical Problems in Engineering the gradient-based method.The steepest descent method is perhaps the simplest technique to implement, which takes the negative gradient as the descent direction: +1 =   −  () ∇ (  ) , where the superscript () denotes the th iteration and () is the step size.
Let () =  be a constant, where 0 <  < 2/‖∇ 2 (  )‖  , and ‖⋅‖  denotes the maximum eigenvalue for a matrix; then, (5) becomes the projected Landweber (PL) method [6]: In addition, Lantéri et al. [7,8] provided a general multiplicative algorithm, which was also a gradient-based method.Let   (  ) be a positive function with positive values for any    > 0; then, − diag[     (  )]∇(  ) is also a descent direction.Therefore, the algorithm in a modified gradient form can be written as Again, let   (  ) and   (  ) be two positive functions for any    > 0, which satisfy −∇  (  ) =   (  ) −   (  ).Taking   (  ) = 1/  (  ), Lantéri et al. obtained a multiplicative update The conjugate gradient (CG) method is also a popular method, and it is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation.
The above approaches are not effective for processing the upper-bounded constraints, and, generally, a truncation is imposed to the update rule, which may lead to a divergent modification.
In this paper, we consider a specific application of the surrogate-based methods [9] to a specific type of LS problem with box constraints.We derive a multiplicative update rule to iteratively and monotonically (in the sense of decreasing the cost function) solve the problem, similar to the EM algorithm [10].The nonnegativity constraints will be satisfied automatically, and the upper-bounded constraints are performed by an upper truncation.Meanwhile, the algorithm still has the monotonicity and does not require an adjustable step size.We provide a rigorous global convergence proof for the case of only nonnegativity constraints, which can be easily generalized to the below-constrained case.We demonstrate the power of this mechanism by deriving many existing and new algorithms.We also present some computer simulation results to show the desirable behavior of the proposed algorithm with respect to the convergence rate and the stability compared with existing methods.

Methodology
A note about our notation: all vectors will be column vectors unless transposed to a row vector by a prime superscript .For a matrix or vector ,  ≥ 0 means that any component of  is equal to or greater than 0. For a matrix ,   * and  *  represent the th row and th column of , respectively.
As mentioned in many articles [1,2,9,11], a surrogate as defined below is useful in algorithm derivations and convergence proofs.
Definition 1 (surrogate).Denote ( |   ) as a surrogate for Clearly, Ψ() is decreasing under the update There are two obvious and important properties for a surrogate: additivity and transitivity.For the former, the sum of two surrogates is a surrogate of the sum of the two original functions.For the latter, the surrogate of a surrogate of a function is a surrogate of this function.
In the following, we proceed step by step: firstly, assume that  ≥ 0 and  = +∞; secondly, remove the restriction of  ≥ 0; and thirdly, remove the restriction of  = +∞.
2.1. ≥ 0 and  = +∞.Let − =  + − − , where  + and  − are nonnegative vectors; then, () = ‖ +  + − − ‖ 2 .Note that we decompose − instead of  for convenient derivation, which can be seen below.We construct a surrogate ( |   ) by the convexity of ().Denote that satisfy   * ,   ≥ 0 and   * + ∑  =1   = 1.They can be the convex combination coefficients such that It is easy to verify that (  |   ) = (  ).If considering Jensen's inequality and the convex combination coefficients   , then ( |   ) ≥ () is proven by the following inequality: Take the partial derivatives of ( |   ); then, we can solve the one-dimensional equations ( |   )/  = 0 to obtain a multiplicative update rule: where It is easy to verify that  mid (  |   ) = (  ).By the convexity of (), it is clear that Following the same process as in Section 2.1, we can construct surrogates for ( |   ) and f( |   ).Let then, Note that  +  * and  −  * are relative to the constant terms, which do not work for the optimization problems; thus, we ignore them.We can now obtain a surrogate for () at   : Take the partial derivatives of ( |   ) and f( |   ): Then, we solve the one-dimensional equations ( |   )/  = 0 to obtain a multiplicative update rule: 2.3.Any  and .As shown, (21) will ensure the monotonic decrease of the cost function and satisfy the nonnegativity constraints.In the following, it will be modified by a truncation at the end of each iteration to satisfy the box constraints: We will show that the truncation still ensures the monotonic decrease of the cost function.
It is easy to see the separability of the variables of (19); that is, (,   ) = ∑  =1   (  ,   ), and every separated function   (  ,   ) has a quadratic form: We further simplify it into a general form: where  > 0,  > 0, and ) such that (  ) strictly monotonously decreases on that open interval.Thus, the proof is established.
In fact, we know that  = min{, +∞}; therefore, we may provide a generic solution to (2), which is the main result of this paper.(2) truncate  by (22).
It is easy to generalize the constraints from 0 ≤  ≤  to  1 ≤  ≤  2 by the transformation  =  −  1 , which is left for readers to complete by themselves.

Convergence
The convergence proof with box constraints is very difficult; thus, we only focus on the below-bounded constraints, which can be viewed as a special case of box constraints with  = +∞.Additionally, we consider the equivalence between the general below-constrained form and the nonnegative constrained one by a linear transformation  =  −  1 ; then, we will theoretically prove the global convergence of the latter case for simplification, which is easy to extend to that of the former.In theory, the Kuhn-Tucker (KT) point will be a global solution if the cost function is convex.It is easy to see the convexity of ().By Theorem 2.19 in [12], the KT conditions of (3) ( ≥ 0) are as follows: We entertain several important and reasonable assumptions.
The first assumption forces the iterations to be positive, but the limit may be zero.The second condition is reasonable because (  )  = 0 suggests   = 0 for any .Thus, the equation  =  is irrelevant to   , and, then,   in  is removable.The third one is indeed restrictive; however, it is important for the following derivation.

Lemma 4. The set of accumulation points of a bounded sequence {𝑍
Proof.This is Theorem 28.1 from Ostrowski [18].The reader is kindly referred to this paper for the proof.

Lemma 5. The iteration sequence
Proof.Because  is a strictly convex function,    is a positive definite symmetric matrix, and, thus, it can be factorized into    =    by Cholesky's method, where  is an invertible matrix.We assume that  * is the sole minimum, and, then, we expand  on  * using a Taylor series: Let  = ( −  * ); then, for any constant , it is easy to know the boundedness of  = { :    < } such that { : Proof.Take into account that (see the second one of Assumption 3) . ( Because {(  )} monotonically decreases and is bounded from below, {(  ) − ( +1 )} → 0; therefore, {‖  −  +1 ‖} → 0. ( Lemma 8.At each iteration, one knows that Proof.We consider that It is easy to verify the correctness of (31) if replacing ∇(  ) by (32).
The global convergence will be proven from the three theorems below.Theorem 9. Let {   } →  * be any convergent subsequence; then,  * meets the first KT condition (25).Proof.According to Lemmas 4, 5, and 6, the set of accumulation points of {  } is connected and compact.If we can prove that the number of accumulation points is finite, then the desired result follows because a finite set can be connected only if it consists of a single point [16].
To prove the existence of a finite number of accumulation points, we consider any accumulation point  * .Given an integer set Ω = {1, 2, . . ., T}, where T is the total number of components of , Ω * = { :  *  = 0} is a subset of Ω.Let  Ω * be the restrictions of  to the set { :   = 0,  ∈ Ω * }, which is a strictly convex function of the reduced variables.It follows that  Ω * has a unique minimum (Theorem 9: ( * )/  = 0 if  *  > 0).This means that an accumulation point must correspond to a subset of Ω.The number of subsets of Ω is finite, and, thus, the number of accumulation points is also finite.

Mathematical Problems in Engineering
In Theorem 9, we prove that every accumulation point meets the first KT condition, by which the full sequence convergence is provided in Theorem 10.Naturally, the limit of {  } satisfies the first KT condition.In the following, we will show that the second KT condition is satisfied.

Solutions to Some Existing and New Examples
4.1.NMF.NMF has been widely used in pattern recognition, machine learning, and data mining.It decomposes the nonnegative matrix  by the product of two other nonnegative matrices  and . Lee and Seung [1] have proposed using the square of the Euclidean distance to measure the similarity between  and : Minimizing them under the constraints  ≥ 0 and  ≥ 0, Lee and Seung [1] alternated between solving an optimization problem in the variables  and then solving another one in the variables .As can be seen,  and  have the same roles.
To take , for instance, with   > 0 and  > 0 given, denote   =       /(  )  such that the surrogate function can be constructed: The update rule can be obtained by solving (,   )/  = 0. Reversing the roles of  and , we can similarly construct the surrogate for  and acquire the update rule: where  is the input data,  is the class label,  is the penalty parameter, and  represents the Lagrangian multiplier that needs to be optimized.Hsieh et al. [21] handled the constraint    = 0 by removing it, which corresponded to removing the "intercept" from the classifier.For a simplified expression, we denote  = , and, then, we obtain the following simple formula: We can solve this optimization problem using the proposed method.Let  1 =  2 = 1/2 and  =  + −  − , where  + and  − are two nonnegative matrices; then, where We, respectively, construct surrogates for ( |   ) and f( |   ).Let Mathematical Problems in Engineering 7 then, We now obtain a surrogate for () at   : We solve the one-dimensional equation ( |   )/   = 0 and truncate it to obtain an update: 4.3.Nonnegative Image Deblurring.In many optical devices, the process of image blurring can be considered to be the result of convolution by a point spread function (PSF) [22,23].It is assumed that the degraded image  is in the form  = , where  is the true image and  is the PSF matrix consisting of PSFs at every pixel.As noted in [24,25], a simple way to approach the deconvolution problem is to find the least squares (LS) estimation between  and .It is well known that the problem of restoring the original image from the noisy and degraded version is an ill-posed inverse problem: small perturbations in the data may result in an unacceptable result [26].The Tikhonov regularization method [27,28] is a popular method that generally leads to a unique solution, which is formulated as Note that we must approve negative values existing in the matrix .
In [17], we assume that  and  are a nonnegative matrix and vector and that there is no restriction on ; then, we, respectively, construct surrogate functions for ‖ − ‖ 2 and ‖‖ 2 using the method proposed in this paper such that we obtain a multiplicative update rule as follows: If  = 0, it is De Pierro's ISRT (Image Space Reconstruction Technique) [4]: 4.4.CT Reconstruction with  1 Regularization.In CT reconstruction, let  be the system probability matrix,  projections, and  a predetermined matrix.Then, a classical approach is to choose an  (X-ray attenuation image) such that the LS error [5] between  and  is minimized: where  ≥ 0 serves as a penalty parameter.Many  matrices have been used, such as the first-or second-order derivative matrix or wavelet basis matrix.Note that both  and  are a nonnegative matrix and vector; however, we must accept negative values existing in the matrix .
The alternating direction method of multipliers (ADMM) [29,30] has been developed for solving optimization problems.This method decomposes the original problem into three subproblems and then sequentially solves them at each iteration.
Introducing the additional variable  = , a fixed penalty parameter , and the Lagrangian multiplier , a unified framework can then be given to solve the  1 -norm regularized LS reconstruction problem [29].( The -update is the most difficult problem that minimizes (,  +1 ,   ) with nonnegativity constraints.We can, respectively, construct surrogate functions for ‖ − ‖ 2 and ‖ −  +1 +   ‖ 2 2 ; then minimize the sum of the surrogates to update ; however, there is a high overlap ratio with the above, so we leave them to the readers.

Experimental Results
We compare the performance of the proposed surrogate method, with those of the PL method (6) and the CG method on simulated data and CT projection data.For the PL method, we use  = 1/(‖‖ 1 ‖‖ ∞ ) [31], where ‖ ⋅ ‖ 1 and ‖ ⋅ ‖ ∞ denote the 1-and ∞-norm of a matrix.The code for the CG method comes from [32], which is slightly modified to meet our criteria.
The experiments are performed on a HP Compaq PC with a 3.00 GHz Core i5 CPU and 4 GB memory.The algorithms are implemented in MATLAB 7.0.All of the algorithms are initiated by the same uniform image for a fair comparison.The mean square error (MSE) is used to measure the similarity to the true solution, as given below: The following criterion is applied to stop the iterative process: where  is a difference tolerance.

Numerical Simulation.
Here, we randomly generate  ∈  500 and  ∈  100×500 , where the former are uniformly distributed on [0 1] and the latter are on [−1 1].Then we obtain  by  = .We will minimize () = ‖ − ‖ 2 under the box constraints 0 ≤  ≤ 1.In the experiment, we generate , , and  once, but we randomly generate start points (uniformly distributed on [0 1]) 50 times to obtain a reliable averaged result.Figure 1 presents comparison of MSE and  versus iteration number.As shown, CG rapidly finishes iteration to a static solution, so that () = 0 after that, which can not be shown with a log-scaled -axis.However, such a phenomenon does not mean a good result.Instead, it is the worst result among the three algorithms by observing the MSE curve.We can conclude that CG is not suitable to the box-constrained LS problem.PL and the proposed method always decrease the curves of MSE and .However, the proposed method shows a superior performance than the PL method.
Table 1 shows the quantitative comparison of the three algorithms; then we can draw the same conclusion as above.This table further proves that the proposed method is rapid and stable, and CG is not suitable to the box-constrained LS problem.
Table 2 shows the computational time of the algorithms.As can be seen, they have similar computational complexity; however, the proposed algorithm requires only a little more time because of the nonnegative decomposition of  and .In fact, from the above theoretical analysis, we can observe that the proposed algorithm is also some gradient-based method; thus, it has a similar computational complexity with the CG and PL methods.

CT Reconstruction. CT reconstruction is a medical imaging technique for creating a meaningful diagnostic image.
A computer can take the input from the CT machine, run it through the formula, and return a set of images for a physician to examine.Here, we only consider the simplest mathematical model min ≥0 ‖−‖ 2 to pursue the structural image.A simulated thorax phantom with 128×128 grids and 0.5 mm pixels, as shown in Figure 2, is used in the following experiments.There are many advantages to using simulated phantoms, including prior knowledge of the pixel values and the ability to control noise.The total attenuation value is approximately 5 × 10 7 .For this case, the proposed algorithm is called ISRT, which is widely used in the field of CT reconstruction [4].
The system matrix is obtained using the "angle of view" method [10].From the system matrix, we forward project the phantom on the sinogram with 128 radial bins (bin size of 0.5 mm) and 128 angular views evenly spaced over .The noisy projections are obtained by the Poisson-random formula   = Poisson( *  ), where  *  is the noise-free projection.
Figure 3 shows the reconstructions with 50 iterations for every algorithm.As shown, the CG's image suffers from serious noise artifacts.The PL algorithm provides a reconstructed image that is slightly blurred compared with that of ISRT.The proposed ISRT provides the sharpest edge and clearest interior region.
Figure 4 quantitatively compares the histograms of the reconstructed images.The CG's result still shows a clear turbulence around the true value.PL and ISRT provide two comparable curves, and neither is significantly better than the other.
We also present a comparison of the MSE curves and  with increasing iteration number in Figure 5.As shown, CG exhibits an unsteady reconstruction property in which more iterations may lead to a worse result.Both PL and ISRT overcome this shortcoming so that more iterations always provide considerably better results.In addition, ISRT will provide a more rapid convergence than PL.

Conclusion
We present a special application of a surrogate-based method for solving box-constrained LS problems.A new update rule is developed to iteratively update variables, and this rule exhibits desirable properties, including monotonic decrease of the cost function, self-constraining in the feasible region, and no need to impose a step size.This algorithm covers many existing algorithms, as well as new ones, as the special examples.Targeting the special case of only belowbounded constraints, we provide a rigorous theoretical global convergence proof.We use the simulated data to evaluate the performance of the algorithm, demonstrating that the proposed algorithm provides a stabler and faster convergence than the PL and CG approaches.
In this paper, we only provide a global convergence proof for the below-bounded case, not the box-constrained case; thus, proving the convergence for the latter case will be the focus of future work.
) 2.2.Any  but  = +∞.Let  =  + −  − ,  1 +  2 = 1, and  1 ,  2 > 0, where  + and  − are two nonnegative matrices; then, we can construct the surrogate  mid ( |   ) for () as follows: [20] surface.Several practical applications of SVMs use nonlinear kernels, such as the polynomial and radial basis function kernels.However, in applications such as text classification, linear SVMs are still used because it has been observed that many text classification problems are linearly separable[19].Most literature on large-scale SVM training have targeted the linear SVM problem, citing this fact.A dual form of linear SVM is usually used because of the simple structure, which can be formulated as[20] ) 4.2.Variation of Linearized SVM.SVM attempts to separate points belonging to two given sets in real Euclidean space

Table 1 :
Performance comparison of the three algorithms.

Table 2 :
Total running time of the three algorithms with 300 iterations and 50 random start points.