An Approximate Quasi-Newton Bundle-Type Method for Nonsmooth Optimization



1. Introduction
In this paper we are concerned with the unconstrained minimization of a real-valued, convex function f : Rn → R, namely, min_{x∈Rn} f(x) (1), where in general f is nondifferentiable. A number of attempts have been made to obtain convergent algorithms for solving (1). Fukushima and Qi [1] propose an algorithm for solving (1) under semismoothness and regularity assumptions; the proposed algorithm is shown to have a Q-superlinear rate of convergence. An implementable BFGS method for general nonsmooth problems is presented by Rauf and Fukushima [2], and global convergence is obtained under the assumption of strong convexity. A superlinearly convergent method for (1) is proposed by Qi and Chen [3], but it requires the semismoothness condition. He [4] obtains a globally convergent algorithm for convex constrained minimization problems under certain regularity and uniform continuity assumptions. Among methods for nonsmooth optimization problems, some have a superlinear rate of convergence; see, for instance, Mifflin and Sagastizábal [5] and Lemaréchal et al. [6]. They propose two conceptual algorithms with superlinear convergence for minimizing a class of convex functions, and the latter demands that the objective function f be differentiable in a certain subspace (the subspace along which ∂f(x) has 0 breadth at a given point x), but sometimes it is difficult to decompose the space. Besides the methods mentioned above, there is a quasi-Newton bundle-type method proposed by Mifflin et al. [7]; it has a superlinear rate of convergence, but the exact values of the objective function and its subgradients are required. In this paper, we present an implementable algorithm by combining bundle and quasi-Newton ideas with the Moreau-Yosida regularization, and the proposed algorithm can be shown to have a superlinear rate of convergence. An obvious advantage of the proposed algorithm lies in the fact that we only need approximate values of the objective function and its subgradients.
It is well known that (1) can be solved by means of the Moreau-Yosida regularization F : Rn → R of f, which is defined by F(x) = min_{z∈Rn} {f(z) + (2λ)−1 ‖z − x‖2} (2), where λ is a fixed positive parameter and ‖·‖ denotes the Euclidean norm or its induced matrix norm on Rn×n. The problem of minimizing F(x), that is, min_{x∈Rn} F(x) (3), has the same solution set as (1). Moreover, F is convex and continuously differentiable with Lipschitz continuous gradient G(x) = λ−1(x − p(x)) ∈ ∂f(p(x)), where p(x) is the unique minimizer of (2) and ∂f is the subdifferential mapping of f. Hence, by Rademacher's theorem, G is differentiable almost everywhere, and the B-subdifferential ∂BG(x), formed by the limits of ∇G(xi) over sequences xi → x at which G is differentiable, is nonempty and bounded for each x. We say G is BD-regular at x if all matrices V ∈ ∂BG(x) are nonsingular. It is reasonable to pay more attention to the problem (3) since F has such good properties. However, because the Moreau-Yosida regularization itself is defined through a minimization problem involving f, the exact values of F and its gradient G at an arbitrary point x are difficult or even impossible to compute in general. Therefore, we attempt to explore the possibility of utilizing approximations of these values.
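For intuition, the quantities F, p, and G of (2) can be written in closed form for the one-dimensional example f(x) = |x|, whose proximal point p(x) is the soft-thresholding map. The sketch below is purely illustrative (the example f = |·| and the function names are ours, not from the paper):

```python
import math

def prox_abs(x, lam):
    """p(x) = argmin_z { |z| + (2*lam)**-1 * (z - x)**2 }, i.e. soft thresholding."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def moreau_envelope(x, lam):
    """F(x) = f(p(x)) + (2*lam)**-1 * (p(x) - x)**2 with f = |.| (the Huber function)."""
    p = prox_abs(x, lam)
    return abs(p) + (p - x) ** 2 / (2.0 * lam)

def grad_envelope(x, lam):
    """G(x) = lam**-1 * (x - p(x)), the gradient of F at x."""
    return (x - prox_abs(x, lam)) / lam
```

For λ = 1 this reproduces the Huber function: F(2) = 1.5 with G(2) = 1, and F(x) = x²/(2λ), G(x) = x/λ on |x| ≤ λ; note that G is Lipschitz even though f is nonsmooth at 0.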
Several attempts have been made to combine the quasi-Newton idea with the Moreau-Yosida regularization to solve (1); for related works on this subject, see Chen and Fukushima [9] and Mifflin [10]. In particular, Mifflin et al. [7] consider using bundle ideas to approximate the values of f linearly in order to approximate F, in which the exact values of f and one of its subgradients at some points are needed. In this paper we assume that for given x ∈ Rn and ε ≥ 0, we can find some fx ∈ R and ga(x, ε) ∈ Rn such that f(x) ≥ fx ≥ f(x) − ε and f(z) ≥ fx + ⟨ga(x, ε), z − x⟩ for all z ∈ Rn, which means that ga(x, ε) ∈ ∂εf(x). This setting is realistic in many applications; see Kiwiel [11]. Let us see some examples. Assume that f is strongly convex with modulus μ > 0, that is, f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) − (μ/2)t(1 − t)‖x − y‖2 for all x, y ∈ Rn and t ∈ [0, 1], and that f(x) = g(V(x)) with V : Rn → Rm continuously differentiable and g : Rm → R convex. By the chain rule we have ∂f(x) = {∑i=1..m λi∇Vi(x) | λ = (λ1, λ2, . . . , λm) ∈ ∂g(V(x))}. Now assume that we have an approximation ∇hV(x) of ∇V(x) such that ‖∇hV(x) − ∇V(x)‖ ≤ ρ(h), h > 0. Such an approximation may be obtained by using finite differences.
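The composite recipe above can be sketched in a few lines of Python for the special case g(y) = max_i y_i: a forward-difference Jacobian stands in for ∇V, and a vertex of ∂g(V(x)) (a unit vector at a maximizing index) supplies the multiplier λ. All names here are our own illustration; the finite-difference step h controls the resulting accuracy εh:

```python
def jacobian_fd(V, x, h=1e-6):
    """Forward-difference approximation of the Jacobian of V at x
    (row i holds the approximate gradient of V_i)."""
    v0 = V(x)
    J = []
    for i in range(len(v0)):
        row = []
        for j in range(len(x)):
            xp = list(x)
            xp[j] += h
            row.append((V(xp)[i] - v0[i]) / h)
        J.append(row)
    return J

def approx_subgrad(V, x, h=1e-6):
    """Approximate subgradient of f(x) = max_i V_i(x) via the chain rule:
    the multiplier is the unit vector at a maximizing index i*."""
    v0 = V(x)
    i_star = max(range(len(v0)), key=lambda i: v0[i])
    return jacobian_fd(V, x, h)[i_star]
```

For V(x) = (x1², x2²) at x = (1, 2), f(x) = 4 and the routine returns approximately (0, 4), an εh-subgradient of f with εh = O(h) near this point.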
By the definition of ∂εf(x), the bound εh depends on x; we obtain gh(x) = ∑i=1..m λi∇hVi(x) ∈ ∂εh f(x) for λ ∈ ∂g(V(x)). From the local boundedness of ∂g(V(x)), we infer that εh > 0 is locally bounded. Thus, gh(x) is an εh-subgradient of f at x; see Hintermüller [12]. As for the approximate function values, if f is a max-type function of the form f(x) = sup_{u∈U} fu(x) (11), where each fu : Rn → R is convex and U is an infinite set, then it may be impossible to calculate f(x). However, for any positive ε one can usually find in finite time an ε-solution to the maximization problem (11), that is, an element uε ∈ U satisfying fuε(x) ≥ f(x) − ε. Then one may set fx = fuε(x). On the other hand, in some applications, calculating fx for a prescribed ε ≥ 0 may require much less work than computing the exact value f(x). This is, for instance, the case when the maximization problem (11) involves solving a linear or discrete programming problem by the methods of Gabasov and Kirilova [13]. Several authors have tried to solve (1) by assuming that the values of the objective function and its subgradients can only be computed approximately. For example, Solodov [14] considers the proximal form of a bundle algorithm for (1), assuming the values of the function and its subgradients are evaluated approximately, and it is shown how these approximations should be controlled in order to satisfy the desired optimality tolerance. Kiwiel [15] proposes an algorithm for (1) that utilizes approximate evaluations of the objective function and its subgradients; global convergence of the method is obtained. Kiwiel [11] introduces another method for (1); it requires only approximate evaluations of f and its subgradients, and this method converges globally. It is evident that bundle methods with superlinear convergence for solving (1) by using approximate values of the objective and its subgradients are seldom obtained. Compared with the methods mentioned above, the method proposed in this paper is not only implementable but also has a superlinear rate of convergence under some additional assumptions, and it should be
noted that we only use the approximate values of the objective function and its subgradients, which makes the algorithm easier to implement. Some notations are listed below for presenting the algorithm. (i) ∂f(x) = {ξ ∈ Rn | f(z) ≥ f(x) + ξT(z − x), ∀z ∈ Rn}, the subdifferential of f at x; each such ξ is called a subgradient of f at x. (ii) ∂εf(x) = {η ∈ Rn | f(z) ≥ f(x) + ηT(z − x) − ε, ∀z ∈ Rn}, the ε-subdifferential of f at x; each such η is called an ε-subgradient of f at x. (iii) p(x) = argmin_{z∈Rn} {f(z) + (2λ)−1 ‖z − x‖2}, the unique minimizer of (2). (iv) G(x) = λ−1(x − p(x)), the gradient of F at x. This paper is organized as follows: in Section 2, to approximate the unique minimizer p(x) of (2), we introduce the bundle idea, which uses approximate values of the objective function and its subgradients. The approximate quasi-Newton bundle-type algorithm is presented in Section 3. In the last section, we prove the global convergence and, under additional assumptions, the Q-superlinear convergence of the proposed algorithm.
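The max-type case can be made concrete with the toy problem f(x) = sup_{u∈[0,1]} (ux − u²), an example of our own choosing. Each piece is quadratic in u with curvature 2, so sampling u on a grid of spacing √ε yields an element uε with fuε(x) ≥ f(x) − ε, and the slope in x of the selected piece (here simply uε) is an ε-subgradient of f at x:

```python
import math

def approx_value_and_subgrad(x, eps):
    """Finite sampling of the infinite index set U = [0, 1].  A grid of
    spacing sqrt(eps) makes the sampling error at most eps for the pieces
    f_u(x) = u*x - u**2 (quadratic in u), so the returned value is an
    eps-approximation of f(x) and the returned u an eps-subgradient."""
    n = max(2, math.ceil(1.0 / math.sqrt(eps)))
    best_u = max((i / n for i in range(n + 1)), key=lambda u: u * x - u * u)
    return best_u * x - best_u * best_u, best_u
```

At x = 1 the exact maximizer is u = 1/2 with f(1) = 1/4; the sketch recovers both to within the requested tolerance, and the returned value never exceeds f(x), matching the one-sided condition f(x) ≥ fx ≥ f(x) − ε.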

2. The Approximation of p(x)
Let x = xk and s = z − xk, where xk is the current iterate point of the AQNBT algorithm presented in Section 3; then (2) has the form F(xk) = min_{s∈Rn} {f(xk + s) + (2λ)−1 ‖s‖2}. (12) Now we consider approximating f(xk + s) by using the bundle idea. Suppose we have a bundle Jk generated sequentially starting from xk, possibly together with a subset of the previous set used to generate xk. The bundle includes the data (zi, fi, ga(zi, εi)), i ∈ Jk, where zi ∈ Rn, fi ∈ R, and ga(zi, εi) ∈ Rn satisfy f(zi) ≥ fi ≥ f(zi) − εi, f(z) ≥ fi + ⟨ga(zi, εi), z − zi⟩, ∀z ∈ Rn. (13) Suppose that the elements in Jk are arranged in the order of their entering the bundle; without loss of generality we may suppose Jk = {1, . . . , j}. εi is updated by the rule εi+1 = γεi, 0 < γ < 1, i ∈ Jk. The condition (13) means ga(zi, εi) ∈ ∂εi f(zi), i ∈ Jk. By using the data in the bundle we construct a polyhedral function fa(xk + s) defined by fa(xk + s) = max_{i∈Jk} {fi + ⟨ga(zi, εi), xk + s − zi⟩}. (14) Obviously fa(xk + s) is a lower approximation of f(xk + s), so fa(xk + s) ≤ f(xk + s). We define a linearization error by α(xk, zi, εi) = fxk − fi − ⟨ga(zi, εi), xk − zi⟩, i ∈ Jk, (15) where fxk ∈ R satisfies f(xk) ≥ fxk ≥ f(xk) − εk. (16) Then fa(xk + s) can be written as fa(xk + s) = fxk + max_{i∈Jk} {−α(xk, zi, εi) + ⟨ga(zi, εi), s⟩}. (17) Let Fa(xk) = min_{s∈Rn} {fa(xk + s) + (2λ)−1 ‖s‖2}. (18) The problem (18) can be dealt with by solving the following quadratic programming: min_{(s,v)∈Rn×R} {v + (2λ)−1 ‖s‖2} subject to −α(xk, zi, εi) + ⟨ga(zi, εi), s⟩ ≤ v, i ∈ Jk. (19) As the iterations proceed, the number of elements in the bundle Jk increases. When the size of the bundle becomes too big, it may cause serious computational difficulties in the form of unbounded storage requirements. To overcome these difficulties, it is necessary to compress the bundle and clean the model. Wolfe [16] and Lemaréchal [17] first introduced the aggregation strategy, which requires storing only a limited number of subgradients; see Kiwiel and Mifflin [18-20]. The aggregation strategy is a synthesis mechanism that condenses the essential information of the bundle into one single couple (ĝk, α̂k) (defined below). The corresponding affine function, inserted in the model when there is compression, is called the aggregate linearization (defined below). This function summarizes all the information generated up to iteration k. Suppose jmax is the upper bound of the number of elements in Jk, k = 1, 2, . . . . If |Jk| reaches the prescribed jmax, two or more of those elements are deleted from the bundle Jk; that is, two or more linear pieces in the constraints of (19) are discarded (notice that different selections of discarded linear pieces may result in different speeds of convergence), and the aggregate linearization associated with the aggregate ε-subgradient and linearization error is introduced into the bundle. Define the aggregate linearization as f̂k(xk + s) = fxk − α̂k + ⟨ĝk, s⟩, where ĝk = ∑_{i∈Jk} λi ga(zi, εi), α̂k = ∑_{i∈Jk} λi α(xk, zi, εi). The multiplier λ = (λi) ∈ R|Jk| is the optimal solution of the dual problem for (19); see Solodov [14]. By doing so, the surrogate aggregate linearization maintains the information of the deleted linear pieces, and at the same time the problem (19) is manageable since the number of elements in Jk is limited. Suppose s(xk) solves the problem (19), and let pa(xk) = xk + s(xk) be an approximation of p(xk); set zj+1 = pa(xk). Let Fa(xk) = f(pa(xk))_a + (2λ)−1 ‖s(xk)‖2, where f(pa(xk))_a ∈ R is chosen to satisfy f(pa(xk)) ≥ f(pa(xk))_a ≥ f(pa(xk)) − ε. The results stated below are fundamental and useful in the subsequent discussions.
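The constructions above can be illustrated on a toy instance. In the sketch below, f(x) = |x| with an exact oracle (εi = 0, so fxk = f(xk)); the bundle, the cutting-plane model fa, the linearization errors α(xk, zi, εi), and the aggregate couple (ĝk, α̂k) are our own minimal instantiation of the formulas, with arbitrary convex multipliers standing in for the dual solution of (19):

```python
def f(x):
    """Illustrative objective: f(x) = |x|, exact oracle (eps_i = 0)."""
    return abs(x)

def subgrad(z):
    """One subgradient of |.| at z."""
    return 1.0 if z >= 0 else -1.0

xk = 0.3
# bundle data (z_i, f_i, g_i) as in (13)
bundle = [(z, f(z), subgrad(z)) for z in (-1.0, 0.5, 2.0)]

def model(s):
    """Cutting-plane model f_a(xk + s) = max_i { f_i + g_i * (xk + s - z_i) }, cf. (14)."""
    return max(fi + gi * (xk + s - zi) for zi, fi, gi in bundle)

# linearization errors alpha_i = f(xk) - f_i - g_i * (xk - z_i), nonnegative by convexity
alphas = [f(xk) - fi - gi * (xk - zi) for zi, fi, gi in bundle]

# aggregate couple: any convex combination of (g_i, alpha_i) yields a valid
# lower linearization (the dual solution of (19) is one such combination)
lams = [0.2, 0.5, 0.3]
g_hat = sum(l * gi for l, (zi, fi, gi) in zip(lams, bundle))
a_hat = sum(l * a for l, a in zip(lams, alphas))

def aggregate_lin(s):
    """f_hat(xk + s) = f(xk) - a_hat + g_hat * s  <=  f_a(xk + s)  <=  f(xk + s)."""
    return f(xk) - a_hat + g_hat * s
```

One can check numerically that each αi ≥ 0, that the model never exceeds f, and that the single aggregate linearization stays below the full model, which is exactly why compression loses no validity, only sharpness.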
(P4) Suppose that xk is not the minimizer of f. If (24) is never satisfied, then F(xk) − Fa(xk) → 0 as the new point zj+1 is appended into the bundle Jk infinitely.
(P6) If xk does not minimize f, then we can find one solution s(xk) of (18) such that (24) holds.

4. Convergence Analysis
In this section we prove the global convergence of the algorithm described in Section 3 and, furthermore, under the assumptions of semismoothness and regularity, we show that the proposed algorithm has a Q-superlinear convergence rate. Following the proof of Theorem 3 in Mifflin et al. [7], we can show that, at each iteration k, pa(xk) is well defined, and hence the stepsize tk > 0 can be determined finitely in Step 4. We assume the proposed algorithm does not terminate in finitely many steps, so the sequence {xk}∞k=1 is an infinite sequence. Since the sequence {εk}∞k=1 satisfies ∑∞k=1 εk < ∞, there exists a constant c such that ∑∞k=1 εk ≤ c. Let L = {x ∈ Rn | f(x) ≤ f(x1) + 2c}. By making a slight change of the proof of Lemma 1 in Mifflin et al. [7], we have the following lemma.