Diagonal Hessian Approximation for Limited Memory Quasi-Newton via Variational Principle

This paper proposes diagonal matrices that approximate the (inverse) Hessian by parts, using a variational principle analogous to the one employed in constructing quasi-Newton updates. Our derivation is inspired by the least-change secant updating approach: we let the diagonal approximation be the sum of two diagonal matrices, where the first carries information about the local Hessian, while the second is chosen so as to induce positive definiteness of the diagonal approximation as a whole. Numerical results are also presented to illustrate the effectiveness of our approximating matrices when incorporated within the L-BFGS algorithm.


Introduction
Our investigation begins by seeking effective ways to diagonally scale an identity matrix, which is often used to initialize the L-BFGS method. For this purpose, it is useful to state the weak quasi-Newton equation

$$s_k^T B_{k+1} s_k = s_k^T y_k, \qquad (1)$$

where the $n$-dimensional vector $s_k = x_{k+1} - x_k$ denotes the step between two points $x_k$ and $x_{k+1}$, $y_k = g_{k+1} - g_k$ denotes the gradient change corresponding to the gradients $g_k$ and $g_{k+1}$ at the two points, and $B_{k+1}$ is a full $n \times n$ matrix that approximates the Hessian of $f$. Both $s_k$ and $y_k$ are used explicitly, and $O(n^2)$ storage is required for the matrix $B_{k+1}$. If $B_{k+1}$ is further chosen to be a diagonal matrix, say $D_{k+1}$, then we can establish the so-called quasi-Cauchy relation (Dennis and Wolkowicz, 1993 [1]), by which

$$s_k^T D_{k+1} s_k = s_k^T y_k. \qquad (2)$$

Different from the standard quasi-Newton setting, which requires $O(n^2)$ storage, here only $O(n)$ storage is required for the diagonal update that satisfies the quasi-Cauchy relation.
In addition, we suppose the matrix $D_{k+1}$ to be positive definite, so that it defines a metric.
The first example of a diagonal update satisfying the quasi-Cauchy relation is the well-known Oren-Luenberger (1974) scaling matrix, given by

$$D_{k+1} = \frac{s_k^T y_k}{s_k^T s_k}\, I, \qquad (3)$$

where $I$ is the identity matrix. Expression (3) is obtained from the quasi-Cauchy relation under the further restriction that the diagonal matrix be a scalar multiple of the identity. Therefore, scaling matrices derived from the quasi-Cauchy relation are a natural generalization of Oren-Luenberger scaling.
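As a quick concrete check of (3), the following NumPy sketch (function name ours) computes the Oren-Luenberger scalar and verifies that the scaled identity satisfies the quasi-Cauchy relation:

```python
import numpy as np

def oren_luenberger_scaling(s, y):
    """Scalar gamma such that D = gamma * I satisfies the
    quasi-Cauchy relation s^T D s = s^T y."""
    return float(s @ y) / float(s @ s)

s = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
gamma = oren_luenberger_scaling(s, y)      # (1*3 + 2*4) / (1 + 4) = 2.2
D = gamma * np.eye(2)
assert abs(s @ D @ s - s @ y) < 1e-12      # quasi-Cauchy relation holds
```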
In general, a procedure for obtaining diagonal updating formulae via the quasi-Cauchy relation can be summarized as follows. Suppose that $D_k > 0$ is a positive definite diagonal matrix and $D_{k+1}$ is the updated version of $D_k$, which is also diagonal. The update $D_{k+1}$ is required to satisfy the quasi-Cauchy relation. It is also essential that the deviation between $D_k$ and $D_{k+1}$ be minimized under some variational principle, which in turn encourages numerical stability. Very often the Frobenius matrix norm is employed to measure this deviation.

Quasi-Cauchy Relation
The performance of the L-BFGS method depends on a good approximation of the actual Hessian. In the basic implementation of the L-BFGS method, the correction pairs $\{s_i, y_i\}$ are used to correct the initial matrix $H_k^0$. The choice of $H_k^0$ often influences the behaviour of the method and is worth investigating. Throughout this section, when we mention a direct initial matrix, we mean a matrix that is a rough approximation to the Hessian; otherwise, an initial matrix is an approximation to the inverse Hessian.
Our approach is inspired by [5, 6], which employed a variational technique analogous to the one used to derive the Powell Symmetric Broyden (PSB) quasi-Newton update (see, e.g., Dennis and Schnabel [7]). The resulting update for approximating the Hessian diagonally is derived as

$$D_k = D_{k-1} + \frac{s_{k-1}^T y_{k-1} - s_{k-1}^T D_{k-1} s_{k-1}}{\operatorname{tr}(E_{k-1}^2)}\, E_{k-1}, \qquad E_{k-1} = \operatorname{diag}\!\left(s_{k-1,1}^2, \ldots, s_{k-1,n}^2\right),$$

where $s_{k-1,i}$ denotes the $i$-th component of the vector $s_{k-1}$, and $\operatorname{tr}$ is the trace operator.
Note that when $s_{k-1}^T y_{k-1} < s_{k-1}^T D_{k-1} s_{k-1}$, the resulting $D_k$ is not necessarily positive definite. Thus, like its PSB counterpart in the quasi-Newton setting, the foregoing update may suffer from loss of positive definiteness and is not appropriate for use within a quasi-Newton-based algorithm.
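The least-change diagonal update above, and the possible loss of positive definiteness just noted, can be illustrated in a few lines. This is a minimal sketch (function name ours) storing only the diagonal, so $O(n)$ memory suffices:

```python
import numpy as np

def quasi_cauchy_update(d, s, y):
    """Least-change diagonal update: minimizes ||D_k - D_{k-1}||_F subject
    to the quasi-Cauchy relation s^T D_k s = s^T y.  The diagonal is stored
    as the vector d, so only O(n) memory is needed."""
    lam = (s @ y - s @ (d * s)) / np.sum(s**4)   # multiplier of E = diag(s_i^2)
    return d + lam * s**2

s = np.array([1.0, 1.0, 1.0])
y = np.array([0.0, 0.0, 6.0])
d1 = quasi_cauchy_update(np.ones(3), s, y)
assert abs(s @ (d1 * s) - s @ y) < 1e-12     # quasi-Cauchy relation holds
# positive definiteness is NOT guaranteed: with y = -s (so s^T y < s^T D s)
# the correction drives every diagonal entry negative
assert np.all(quasi_cauchy_update(np.ones(3), s, -s) < 0)
```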
In this study, we seek an efficient diagonal Hessian approximation by letting the diagonal approximating matrix be a combination of two diagonal matrices. This gives us the freedom to incorporate curvature information into one of these matrices, while the property of hereditary positive definiteness is carried over to the second matrix.
To begin, suppose that the Hessian matrix of the objective function has positive diagonal elements. Let us divide the Hessian matrix into two parts:

$$\nabla^2 f(x) = \nabla^2_D f(x) + \nabla^2_O f(x),$$

where $\nabla^2_D f(x)$ is a diagonal matrix consisting of the diagonal entries of the Hessian, and $\nabla^2_O f(x)$ resembles the actual Hessian except that its diagonal entries are all zero. Thus, we intend to form two diagonal approximating matrices to approximate each part of the Hessian, respectively; that is, $\Psi_1 \approx \nabla^2_D f(x)$ and $\Psi_2 \approx \nabla^2_O f(x)$. Since the diagonal entries of the actual Hessian are assumed positive, an excellent choice is to let a positive definite diagonal matrix, say $\Psi_1$, approximate $\nabla^2_D f(x)$. Meanwhile, $\nabla^2_O f(x)$ is expected to be dense and expensive to compute and is approximated by $\Psi_2$. To preserve positive definiteness, we introduce an additional term $\nu I$, that is, a Levenberg-Marquardt-like step, to maintain positive definiteness of $\Psi_2$, which is expressed as $\Psi_2 = \nu I + \Psi_3$. Thus, the technique for calculating $B_k^0$ subject to the weak quasi-Newton relation is as follows.
Theorem 1. Suppose that $s_{k-1} \neq 0$; then the optimal solution of the following minimization problem:

$$\min_{\Psi_3 \ \mathrm{diagonal}} \ \frac{1}{2}\|\Psi_3\|_F^2 \quad \text{subject to} \quad s_{k-1}^T(\Psi_1 + \nu I + \Psi_3)s_{k-1} = s_{k-1}^T y_{k-1} \qquad (8)$$

is given by

$$\Psi_3 = \lambda E_{k-1}, \qquad \lambda = \frac{s_{k-1}^T y_{k-1} - s_{k-1}^T \Psi_1 s_{k-1} - \nu\|s_{k-1}\|^2}{\operatorname{tr}(E_{k-1}^2)}, \qquad (9)$$

where $E_{k-1} = \operatorname{diag}(s_{k-1,1}^2, \ldots, s_{k-1,n}^2)$.

Proof. Since the objective function in (8) is convex and the feasible set is also convex, (8) has a unique solution. Its Lagrangian function is given by

$$L(\Psi_3, \lambda) = \frac{1}{2}\|\Psi_3\|_F^2 + \lambda\left(s_{k-1}^T y_{k-1} - s_{k-1}^T(\Psi_1 + \nu I + \Psi_3)s_{k-1}\right), \qquad (10)$$

where $\lambda$ is the Lagrange multiplier associated with the constraint. Differentiating (10) with respect to each diagonal element of $\Psi_3$ and setting the result to zero yields

$$(\Psi_3)_{ii} = \lambda\, s_{k-1,i}^2, \quad i = 1, \ldots, n, \qquad (11)$$

that is, $\Psi_3 = \lambda E_{k-1}$. Substituting (11) into the constraint gives

$$\lambda \operatorname{tr}(E_{k-1}^2) = s_{k-1}^T y_{k-1} - s_{k-1}^T \Psi_1 s_{k-1} - \nu\|s_{k-1}\|^2, \qquad (13)$$

since $s_{k-1}^T E_{k-1} s_{k-1} = \sum_i s_{k-1,i}^4 = \operatorname{tr}(E_{k-1}^2)$. Finally, solving (13) for $\lambda$ and substituting into (11) yields (9).

A direct result of Theorem 1 leads to the following diagonal preconditioning formula:

$$B_k^0 = \Psi_1 + \nu I + \Psi_3. \qquad (15)$$

An analogue of the above approach can be used to derive the initial inverse Hessian approximation $H_k^0$, which is more useful for algorithmic purposes. Once again, letting the inverse Hessian of the objective be separated by parts, the initial approximating matrix $H_k^0$ can be expressed as a sum $\Delta_1 + \nu I + \Delta_3$ of diagonal matrices. Since it is our intention to derive an initial approximation for the L-BFGS method, an excellent choice of $\Delta_1$ for approximating the diagonal entries of the inverse Hessian is the diagonalized inverse BFGS formula of Gilbert and Lemaréchal [8], updated from a multiple of the identity matrix; that is,

$$\Delta_{1,i} = \left[(I - \rho_k s_k y_k^T)\,\gamma_k I\,(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T\right]_{ii}, \qquad \rho_k = \frac{1}{s_k^T y_k}, \qquad (18)$$

where we choose $\gamma_k = s_k^T y_k / y_k^T y_k$, the Oren-Luenberger scalar. Thus, the diagonal matrix $\Delta_1$ is obtained from (18). Note that, to safeguard against very small or very large $\|s_{k-1}\|$, we impose the following additional condition: unless $\epsilon_2 \le \|s_{k-1}\| \le \epsilon_1$, for some small and large positive $\epsilon_2$ and $\epsilon_1$, we set $H_k^0 = I$. Here, we used $\epsilon_1 = 10^8$ and $\epsilon_2 = 10^{-8}$.
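The closed form of Theorem 1 can be checked numerically. The sketch below (function name ours, and the closed form as reconstructed above) builds the correction $\Psi_3 = \lambda\,\mathrm{diag}(s_i^2)$ and verifies that the diagonal of $B_k^0 = \Psi_1 + \nu I + \Psi_3$ satisfies the weak quasi-Newton relation exactly:

```python
import numpy as np

def theorem1_psi3(s, y, psi1, nu):
    """Diagonal correction Psi3 = lam * diag(s_i^2) solving
    min ||Psi3||_F  s.t.  s^T (Psi1 + nu*I + Psi3) s = s^T y,
    where psi1 holds the diagonal entries of Psi1."""
    lam = (s @ y - s @ (psi1 * s) - nu * (s @ s)) / np.sum(s**4)
    return lam * s**2

s = np.array([1.0, 1.0])
y = np.array([4.0, 4.0])
psi1 = np.array([1.0, 1.0])                     # positive diagonal part
nu = 1.0
b0 = psi1 + nu + theorem1_psi3(s, y, psi1, nu)  # diagonal of B_k^0, cf. (15)
assert abs(s @ (b0 * s) - s @ y) < 1e-12        # weak quasi-Newton relation
```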
By interchanging the roles of $s_{k-1}$ and $y_{k-1}$ in Theorem 1, one can obtain the formula for $H_k^0$ at step $k$ as follows:

$$H_k^0 = \Delta_1 + \nu I + \Delta_3, \qquad (19)$$

where $\Delta_1$ is given by (18) and $\Delta_3$ solves the variational problem of Theorem 1 with $s_{k-1}$ and $y_{k-1}$ interchanged, so that $H_k^0$ satisfies the weak secant equation $y_{k-1}^T H_k^0 y_{k-1} = y_{k-1}^T s_{k-1}$. For the purposes of numerical illustration, the diagonal formula (19) is used to initialize the L-BFGS method, although the potential use of (15) should not be neglected. Furthermore, one can observe that the term $\nu I$ is isolated in the solution of the variational problem, and its value does not affect the quality of the solution. This allows us to choose the value of $\nu$ freely so as to ensure that $H_k^0$ is positive definite while still satisfying the weak secant equation.
Note that maintaining positive definiteness of $H_k^0$ is crucial for the L-BFGS method to generate descent directions. For this purpose, the following lemma suggests a possible choice of $\nu$.

Lemma 2. Assume that $s_{k-1} \neq 0$ and that $\nu$ is chosen as in (20). Then $H_k^0$ given by (19) is positive definite.

Proof. Note that to keep $H_k^0$ positive definite, we should choose $\nu$ such that every diagonal entry $\Delta_{1,i} + \nu + \Delta_{3,i}$ is positive, which in turn implies a lower bound on $\nu$. When this bound is already satisfied we let $\nu = 1$; otherwise $\nu$ is taken as in (20). It is clear that in both cases $H_k^0$ maintains positive definiteness.
Hence, $H_k^0$ can be expressed in the form given in (25). We can now set up the basic algorithm of our L-BFGS method using $H_k^0$ as in (25).
Remark 3. The LBFGS-USD algorithm is exactly the L-BFGS algorithm of Liu and Nocedal [9], except that $H_k^0$ is computed by (25).
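The overall scheme can be sketched end to end: build the diagonal $H_k^0 = \Delta_1 + \nu I + \Delta_3$ from (18) and (19), then use it to start the standard L-BFGS two-loop recursion. This is a simplified sketch under stated assumptions: function names are ours, and the positivity fallbacks stand in for the exact choice of $\nu$ in Lemma 2, whose formula (20) is not reproduced here.

```python
import numpy as np

def diag_initial_H0(s, y, nu=1.0, eps1=1e8, eps2=1e-8):
    """Sketch of the diagonal initial matrix H_k^0 = Delta1 + nu*I + Delta3,
    returned as the vector of its diagonal entries.  Delta1 is the diagonal
    of the inverse BFGS update of gamma*I (cf. (18)); Delta3 comes from the
    variational problem with the roles of s and y interchanged, so that
    y^T H_k^0 y = y^T s.  The fallbacks below are simplified stand-ins for
    the choice of nu in Lemma 2."""
    sy = float(s @ y)
    ns = np.linalg.norm(s)
    if not (eps2 <= ns <= eps1) or sy <= 0:
        return np.ones_like(s)                    # safeguard: H_k^0 = I
    gamma = sy / float(y @ y)                     # Oren-Luenberger scalar
    rho = 1.0 / sy
    # diagonal of (I - rho*s*y^T) * gamma*I * (I - rho*y*s^T) + rho*s*s^T
    delta1 = gamma * (1.0 - 2.0 * rho * s * y + rho**2 * (y @ y) * s**2) \
             + rho * s**2
    lam = (sy - y @ (delta1 * y) - nu * (y @ y)) / np.sum(y**4)
    h0 = delta1 + nu + lam * y**2                 # weak secant holds exactly
    if np.min(h0) <= 0.0:                         # simplified positivity guard
        return gamma * np.ones_like(s)
    return h0

def lbfgs_direction(g, s_list, y_list, h0_diag):
    """Standard L-BFGS two-loop recursion, started from the diagonal matrix
    diag(h0_diag) instead of a scalar multiple of the identity."""
    q = g.astype(float).copy()
    rhos = [1.0 / float(y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = h0_diag * q                               # apply H_k^0
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r                                     # search direction
```

Whenever the returned diagonal is strictly positive, the resulting direction is a descent direction, which is exactly the property Lemma 2 is designed to guarantee.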

Convergence Analysis
We now establish the convergence of the LBFGS-USD algorithm. The following standard assumptions are made on the objective function.
(a) The objective function $f$ is twice continuously differentiable. (b) The level set $\mathcal{L} = \{x \in \mathbb{R}^n \mid f(x) \le f(x_0)\}$ is convex, and there exist positive constants $m_1$ and $m_2$ such that

$$m_1 \|z\|^2 \le z^T \nabla^2 f(x)\, z \le m_2 \|z\|^2$$

for all $z \in \mathbb{R}^n$ and $x \in \mathcal{L}$.
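Assumption (b), combined with the mean value theorem, yields the curvature bounds used repeatedly in the analysis below; a short derivation (the averaged-Hessian notation $\bar{G}_k$ is ours):

```latex
\bar{G}_k = \int_0^1 \nabla^2 f(x_k + \tau s_k)\, d\tau
\quad\Longrightarrow\quad y_k = \bar{G}_k s_k,
\qquad
m_1 \|s_k\|^2 \;\le\; s_k^T \bar{G}_k s_k \;=\; y_k^T s_k \;\le\; m_2 \|s_k\|^2 .
```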
Theorem 4 suggests that as long as $H_k^0$ is chosen such that $\|H_k^0\|$ is bounded for all $k$, the corresponding L-BFGS algorithm restarted with $H_k^0$ generates a sequence $\{x_k\}$ that converges globally and R-linearly. For this purpose, we give the following result, which ensures that $\|H_k^0\|$, where $H_k^0$ is given by (25), is bounded above and below by positive constants.

Lemma 5. Let $x_0$ be a starting point for which $f$ satisfies assumptions (a)-(b). Consider the sequence $\{x_k\}$ generated by the L-BFGS algorithm subject to the diagonal initial approximation $H_k^0$ given by (25). If $s_k \neq 0$ for all $k$, then $\|H_k^0\|$ is bounded above and below for all $k$.

Proof. Let $\bar{G}_{k-1} = \int_0^1 \nabla^2 f(x_{k-1} + \tau s_{k-1})\, d\tau$, so that $y_{k-1} = \bar{G}_{k-1} s_{k-1}$ by the mean value theorem. Assumption (b) then implies

$$m_1 \|s_{k-1}\|^2 \le y_{k-1}^T s_{k-1} \le m_2 \|s_{k-1}\|^2, \qquad m_1 \le \frac{\|y_{k-1}\|^2}{y_{k-1}^T s_{k-1}} \le m_2.$$

First, we show that every component of $\Delta_1$ in (18) is bounded, so that $\|\Delta_1\|$ is bounded. It is convenient to bound each part of $\Delta_1$ separately: (i) the Oren-Luenberger scalar satisfies $1/m_2 \le \gamma_{k-1} = s_{k-1}^T y_{k-1}/\|y_{k-1}\|^2 \le 1/m_1$; (ii) the cross term involving $s_{k-1,i}\, y_{k-1,i}$, where $s_{k-1,i}$ is the $i$-th component of $s_{k-1}$, is bounded using the Cauchy-Schwarz inequality together with the bounds above; (iii) in a similar way, the remaining term involving $s_{k-1,i}^2/(s_{k-1}^T y_{k-1})$ is bounded. It follows that each component $\Delta_{1,i}$ is bounded. Then, by the definition of $H_k^0$ and the choice of $\nu$ in Lemma 2, each diagonal entry $H^0_{k,i}$ is bounded above and below by positive constants. Therefore, $\|H_k^0\|$ is bounded above and below for all $k$.

Numerical Experiences
Our tests used a large set of unconstrained minimization problems consisting of 50 test problems; the list of problems is given in Table 1. These test problems are selected from Moré et al. [2], CUTE [3], Toint [4], and various other test function collections such as [10]. The subroutines for the test problems are available at http://camo.ici.ro/forum/SCALCG/evalfg.for (accessed January 2012). The methods tested are as follows: (1) LBFGS-I: the L-BFGS method with initial matrix $H_k^0 = I$; (2) LBFGS-$\gamma$I: the L-BFGS method with initial matrix $H_k^0 = \gamma_k I$, where $\gamma_k$ is the Oren-Luenberger scaling at step $k$; (3) LBFGS-USD: the L-BFGS method with initial matrix $H_k^0$ given by (19).

Discussion
In general, Figure 1 indicates that the new diagonal initial approximating matrix is substantially better than both standard initializations of the L-BFGS method in terms of iteration counts, function/gradient calls, and CPU time. To study the effect of our initial approximation further, Tables 2, 3, and 4 give the ratio of function/gradient calls to iteration counts for the L-BFGS method with the standard initial matrices and with our diagonal initial approximating matrix. Our diagonal initial approximation, LBFGS-USD, performs better in that its ratio is close to one, so it is more likely to accept the unit step length than the LBFGS-I (unpreconditioned) method. Moreover, LBFGS-USD requires, in general, as much as 38% and 19% fewer iterations than LBFGS-I and LBFGS-$\gamma$I, respectively. In terms of function/gradient evaluations, LBFGS-USD needs 46% and 16% fewer calls, respectively. Finally, LBFGS-USD requires 33% and 6% less CPU time (in seconds), respectively. In conclusion, the numerical results for a broad class of test problems show that the LBFGS-USD algorithm is efficient and vastly superior in solving small- to large-scale problems.
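The comparison in Figure 1 is based on performance profiles. For readers unfamiliar with the tool, a minimal Dolan-Moré-style sketch follows (function name and data layout are ours): given a table of costs per problem and solver, it reports, for each ratio $\tau$, the fraction of problems each solver solves within a factor $\tau$ of the best solver.

```python
import numpy as np

def performance_profile(T):
    """Dolan-More performance profile.  T[i, j] = cost (e.g. iterations)
    of solver j on problem i; np.inf marks a failure.  Returns a grid of
    ratios tau and, for each tau, the fraction of problems each solver
    solves within a factor tau of the best solver on that problem."""
    ratios = T / T.min(axis=1, keepdims=True)
    taus = np.linspace(1.0, np.max(ratios[np.isfinite(ratios)]), 50)
    prof = np.array([[np.mean(ratios[:, j] <= t) for j in range(T.shape[1])]
                     for t in taus])
    return taus, prof

# toy data: 3 problems x 2 solvers; solver 1 fails on the last problem
T = np.array([[10.0, 12.0], [20.0, 15.0], [np.inf, 30.0]])
taus, prof = performance_profile(T)
assert prof[-1, 1] == 1.0        # solver 2 eventually solves every problem
```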

Concluding Remarks
We proposed a technique that exploits the structure of the Hessian by approximating it in diagonal form. Under standard assumptions on the objective function, the convergence of the L-BFGS scheme with the diagonal initial approximation is R-linear. Based on our numerical results, we believe that the following conclusions can be made about the diagonal approximation derived in this study. (i) Our diagonal approximating matrix maintains positive definiteness in a very simple way and requires only $O(n)$ storage. (ii) The numerical experiments show that our diagonal initial approximating matrix is generally more effective for the L-BFGS method than the standard initial matrices in the literature.

Figure 1 :
Figure 1: Performance profile of LBFGS-USD based on iterations, function/gradient calls, and CPU time for all values of the memory parameter m.

Table 1 :
Selected test problems.

Table 2 :
Ratio of function/gradient calls over iteration counts for m = 3.

Table 3 :
Ratio of function/gradient calls over iteration counts for m = 5.

Table 4 :
Ratio of function/gradient calls over iteration counts for m = 7.