A Modified Conjugacy Condition and Related Nonlinear Conjugate Gradient Method

The conjugate gradient (CG) method has played a special role in solving large-scale nonlinear optimization problems due to the simplicity of its iterations and its very low memory requirements. In this paper, we propose a new conjugacy condition similar to that of Dai and Liao (2001). Based on this condition, a related nonlinear conjugate gradient method is given. Under some mild conditions, the given method is globally convergent under the strong Wolfe-Powell line search for general functions. The numerical experiments show that the proposed method is very robust and efficient.


Introduction
The conjugate gradient (CG) method has played a special role in solving large-scale nonlinear optimization problems due to the simplicity of its iterations and its very low memory requirements. In fact, the CG method is not among the fastest or most robust optimization methods for nonlinear problems available today, but it remains very popular with engineers and mathematicians who are interested in solving large-scale problems. The conjugate gradient method is designed to solve the following unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^n} f(x), \tag{1}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth, nonlinear function whose gradient will be denoted by $g(x)$. The iterative formula of the conjugate gradient method is given by
$$x_{k+1} = x_k + \alpha_k d_k, \tag{2}$$
where $\alpha_k$ is a step-length computed by carrying out a line search, and $d_k$ is the search direction defined by
$$d_k = \begin{cases} -g_k, & k = 0, \\ -g_k + \beta_k d_{k-1}, & k \ge 1, \end{cases} \tag{3}$$
where $\beta_k$ is a scalar and $g_k$ denotes the gradient $\nabla f(x_k)$. Different conjugate gradient methods correspond to different ways of computing $\beta_k$. If $f$ is a strictly convex quadratic function, namely,
$$f(x) = \tfrac{1}{2} x^T H x + b^T x, \tag{4}$$
where $H$ is a positive definite matrix, and if $\alpha_k$ is the exact one-dimensional minimizer along the direction $d_k$, then the method with (2) and (3) is called the linear conjugate gradient method. Otherwise, it is called a nonlinear conjugate gradient method. The most important feature of the linear conjugate gradient method is that the search directions satisfy the following conjugacy condition:
$$d_i^T H d_j = 0, \quad i \ne j. \tag{5}$$
For nonlinear conjugate gradient methods, (5) does not hold, since the Hessian $\nabla^2 f(x)$ changes at different iterations.
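As a concrete illustration of the linear case (our sketch in Python, not the paper's code; all names are ours), the following runs (2)-(3) on a small quadratic of the form (4) with the exact line search and then verifies the conjugacy condition (5) numerically:

```python
import numpy as np

def linear_cg(H, b, x0, tol=1e-10, max_iter=50):
    """Method (2)-(3) for the quadratic (4): f(x) = 0.5*x'Hx + b'x."""
    x = x0.copy()
    g = H @ x + b                        # gradient of (4)
    d = -g
    directions = []
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = -(g @ d) / (d @ H @ d)   # exact one-dimensional minimizer
        x = x + alpha * d
        g_new = H @ x + b
        beta = (g_new @ g_new) / (g @ g) # FR; PR/HS/DY coincide in this case
        directions.append(d)
        d = -g_new + beta * d
        g = g_new
    return x, directions

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T + 6 * np.eye(6)              # symmetric positive definite
x_min, ds = linear_cg(H, rng.standard_normal(6), np.zeros(6))
# Conjugacy check (5): d_i' H d_j should be (numerically) zero for i != j.
print(max(abs(ds[i] @ H @ ds[j]) for j in range(len(ds)) for i in range(j)))
```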
Some well-known formulae for $\beta_k$ are the Fletcher-Reeves (FR), Polak-Ribière (PR), Hestenes-Stiefel (HS), and Dai-Yuan (DY) formulae:
$$\beta_k^{FR} = \frac{\|g_k\|^2}{\|g_{k-1}\|^2}, \qquad \beta_k^{PR} = \frac{g_k^T y_{k-1}}{\|g_{k-1}\|^2}, \qquad \beta_k^{HS} = \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, \qquad \beta_k^{DY} = \frac{\|g_k\|^2}{d_{k-1}^T y_{k-1}},$$
where $y_{k-1} = g_k - g_{k-1}$ and $\|\cdot\|$ denotes the Euclidean norm. Their corresponding conjugate gradient methods are abbreviated as the FR, PR, HS, and DY methods. In the past two decades, the convergence properties of these methods have been intensively studied by many researchers (e.g., [1][2][3][4][5][6][7][8][9]). Although all these methods are equivalent in the linear case, namely, when $f$ is a strictly convex quadratic function and $\alpha_k$ is determined by exact line search, their behaviors for general objective functions may be far different.
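In code, these formulae are one-liners; the fragment below is our illustrative sketch (numpy arrays g, g_prev, d_prev standing for $g_k$, $g_{k-1}$, $d_{k-1}$):

```python
import numpy as np

def classical_betas(g, g_prev, d_prev):
    """The FR, PR, HS, and DY update parameters of Section 1."""
    y = g - g_prev                       # y_{k-1} = g_k - g_{k-1}
    return {
        "FR": (g @ g) / (g_prev @ g_prev),
        "PR": (g @ y) / (g_prev @ g_prev),
        "HS": (g @ y) / (d_prev @ y),
        "DY": (g @ g) / (d_prev @ y),
    }
```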
For general functions, Zoutendijk [10] proved the global convergence of the FR method with exact line search. (Here and throughout this paper, by global convergence we mean that the sequence generated by the corresponding method will either terminate after finitely many steps or contain a subsequence converging to a stationary point of the objective function from a given initial point.) Although one would be satisfied with its global convergence properties, the FR method performs much worse than the PR and HS methods in real computations. Powell [11] analyzed a major numerical drawback of the FR method: if a small step is generated away from the solution point, the subsequent steps may also be very short. On the other hand, in practical computation, the HS method resembles the PR method, and both methods are generally believed to be the most efficient conjugate gradient methods, since these two methods essentially perform a restart if a bad direction occurs. However, Powell [12] constructed a counterexample showing that the PR and HS methods without restarts can cycle infinitely without approaching the solution. This example reveals a drawback of these two methods: they are not globally convergent for general functions. Therefore, over the past few years, much effort has been devoted to finding new formulae for conjugate gradient methods that are not only globally convergent for general functions but also robust and efficient in numerical performance.
Recently, using a new conjugacy condition, Dai and Liao [13] proposed two new methods. Interestingly, one of their methods is not only globally convergent for general functions but also performs better than the HS and PR methods. In this paper, similar to Dai and Liao's approach, we propose a new conjugacy condition. Based on the proposed condition, a new formula for computing $\beta_k$ is given. We then analyze the convergence properties of the given method and also carry out numerical experiments, which show that the given method is robust and efficient.
The remainder of this paper is organized as follows. In Section 2, after a short description of Dai and Liao's conjugacy condition and related methods, the motivations of this paper are presented. Guided by these motivations, we propose a new conjugacy condition and a related method at the end of Section 2. In Section 3, a convergence analysis for the given method is presented. In the last section, we perform numerical experiments on a set of large-scale problems and make some numerical comparisons with existing methods.

Motivations, New Conjugacy Condition, and Related Method
For a general nonlinear function $f$, we know by the mean value theorem that there exists some $\tau \in (0, 1)$ such that
$$d_k^T y_{k-1} = \alpha_{k-1} d_k^T \nabla^2 f\left(x_{k-1} + \tau \alpha_{k-1} d_{k-1}\right) d_{k-1},$$
where $y_{k-1} = g_k - g_{k-1}$. Therefore, it is reasonable to replace (5) with the following conjugacy condition:
$$d_k^T y_{k-1} = 0. \tag{9}$$
Recently, an extension of (9) has been studied by Dai and Liao in [13]. Their approach is based on quasi-Newton techniques. Recall that, in the quasi-Newton method, an approximation matrix $B_{k-1}$ of the Hessian $\nabla^2 f(x_{k-1})$ is updated such that the new matrix $B_k$ satisfies the following quasi-Newton equation:
$$B_k s_{k-1} = y_{k-1}, \tag{10}$$
where $s_{k-1} = x_k - x_{k-1}$. The search direction $d_k$ in the quasi-Newton method is calculated by
$$d_k = -B_k^{-1} g_k. \tag{11}$$
Combining these two equations, we obtain
$$d_k^T y_{k-1} = -g_k^T B_k^{-1} y_{k-1} = -g_k^T s_{k-1}. \tag{12}$$
The above relation implies that (9) holds if the line search is exact, since in this case $g_k^T d_{k-1} = 0$ and hence $g_k^T s_{k-1} = \alpha_{k-1} g_k^T d_{k-1} = 0$. However, practical numerical algorithms normally adopt inexact line searches instead of the exact line search. For this reason, it seems more reasonable to replace conjugacy condition (9) with the condition
$$d_k^T y_{k-1} = -t g_k^T s_{k-1}, \tag{13}$$
where $t \ge 0$ is a scalar.
To ensure that the search direction $d_k$ satisfies conjugacy condition (13), one only needs to multiply (3) by $y_{k-1}$ and use (13), yielding
$$\beta_k^{DL1} = \frac{g_k^T \left(y_{k-1} - t s_{k-1}\right)}{d_{k-1}^T y_{k-1}}. \tag{14}$$
It is obvious that
$$\beta_k^{DL1} = \beta_k^{HS} - t \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}. \tag{15}$$
For simplicity, we call the method with (2), (3), and (14) the DL1 method. Dai and Liao proved that the conjugate gradient method with (14) is globally convergent for uniformly convex functions. For general functions, Powell [12] constructed an example showing that the PR method may cycle without approaching any solution point if the step-length $\alpha_k$ is chosen to be the first local minimizer along $d_k$.
Since the DL1 method reduces to the PR method in the case that $g_k^T d_{k-1} = 0$ holds, the method with (14) need not converge for general functions. To obtain global convergence, Dai and Liao restricted the first part of $\beta_k^{DL1}$ to be nonnegative, replacing (14) by
$$\beta_k^{DL} = \max\left\{ \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, 0 \right\} - t \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}. \tag{17}$$
We call the method with (2), (3), and (17) the DL method. Dai and Liao showed that the DL method is globally convergent for general functions under sufficient descent condition (31) and some suitable conditions. Besides, the numerical experiments in [13] indicate the efficiency of this method. Similar to Dai and Liao's approach, Li et al. [14] proposed another conjugacy condition and related conjugate gradient methods, and they proved that the proposed methods are globally convergent under some assumptions.
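As a sketch in our notation (illustrative names, numpy arrays; s_prev denotes $s_{k-1} = x_k - x_{k-1}$), the two Dai-Liao parameters (14) and (17) can be coded as:

```python
import numpy as np

def beta_dl1(g, g_prev, d_prev, s_prev, t):
    """Formula (14): beta = g'(y - t*s) / (d'y)."""
    y = g - g_prev
    return (g @ (y - t * s_prev)) / (d_prev @ y)

def beta_dl(g, g_prev, d_prev, s_prev, t):
    """Formula (17): the HS part of (15) is truncated at zero."""
    y = g - g_prev
    return max((g @ y) / (d_prev @ y), 0.0) - t * (g @ s_prev) / (d_prev @ y)
```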
Recently, based on a modified secant condition given by Zhang et al. [15], Yabe and Takano [16] derived an update parameter $\beta_k^{YT}$, in which the vector $y_{k-1}$ is replaced by a vector constructed from the modified secant condition with a constant $\rho \ge 0$, and they showed that the corresponding YT+ scheme is globally convergent under some conditions.

Motivations.

From the above discussion, Dai and Liao's approach is effective; the main reason is that the search directions $d_k$ generated by the DL1 method or the DL method contain not only gradient information but also some information on the Hessian $\nabla^2 f(x)$. From (15) and (17), $\beta_k^{DL1}$ and $\beta_k^{DL}$ are formed by two parts; the first part is $\beta_k^{HS}$ and the second part is $-t(g_k^T s_{k-1} / d_{k-1}^T y_{k-1})$. So we may also regard the DL1 and DL methods as modified forms of the HS method, obtained by adding some information on the Hessian $\nabla^2 f(x)$, which is contained in the second part.
From the structure of (17), we know that the parameter $\beta_k^{DL}$ may be negative, since the second part $-t(g_k^T s_{k-1} / d_{k-1}^T y_{k-1})$ may be less than zero. In conjugate gradient methods, if $\beta_k < 0$ and $|\beta_k|$ is large, then the generated directions $d_k$ and $d_{k-1}$ may tend to be opposite. This type of method is susceptible to jamming.
On the other hand, in conjugate gradient methods, the following strong Wolfe-Powell line search is often used to determine the step size $\alpha_k$:
$$f(x_k + \alpha_k d_k) \le f(x_k) + \delta \alpha_k g_k^T d_k, \tag{21}$$
$$\left| g(x_k + \alpha_k d_k)^T d_k \right| \le \sigma \left| g_k^T d_k \right|, \tag{22}$$
where $0 < \delta < \sigma < 1$; a typical choice of $\sigma$ is $\sigma = 0.1$.
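A small checker (our sketch; f and grad are callables returning the function value and gradient, arrays are numpy) makes (21)-(22) operational:

```python
def satisfies_strong_wolfe(f, grad, x, d, alpha, delta=1e-4, sigma=0.1):
    """Return True iff the step alpha satisfies (21) and (22)."""
    g0_d = grad(x) @ d                    # directional derivative g_k' d_k
    armijo = f(x + alpha * d) <= f(x) + delta * alpha * g0_d         # (21)
    curvature = abs(grad(x + alpha * d) @ d) <= sigma * abs(g0_d)    # (22)
    return armijo and curvature
```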
From the structure of (17), we know that $\beta_k^{DL}$ depends on the directional derivative $g_k^T d_{k-1}$, which is determined by the line search. For the PRP+ algorithm with the strong Wolfe-Powell line search, in order to make sufficient descent condition (31) hold, one often uses the strategies of Lemaréchal [17], Fletcher [18], or Moré and Thuente [19] to make the directional derivative $|g_k^T d_{k-1}|$ sufficiently small. Under such a strategy, the second part of $\beta_k^{DL}$ tends to vanish. This means that the DL method is strongly line-search-dependent.
The above discussions motivate us to propose a modified conjugacy condition and a related conjugate gradient method, which should possess the following properties:
(1) nonnegativity: $\beta_k \ge 0$;
(2) the new formula contains not only gradient information but also some Hessian information;
(3) the formula should be less line-search-dependent.

The Modified Conjugacy Condition and Related Method.
From the above discussion, it seems reasonable to replace conjugacy condition (13) with the modified conjugacy condition (23). To ensure that the search direction $d_k$ satisfies condition (23), one only needs to multiply (3) by $y_{k-1}$ and use (23), yielding the formula (25) for $\beta_k^{MDL}$. For simplicity, we call the method with (2), (3), and (25) the MDL method. Similar to Gilbert and Nocedal's [4] approach, we propose the restricted parameter $\beta_k^{MDL+}$ given in (26), and we call the method with (2), (3), and (26) the MDL+ method. The corresponding nonlinear conjugate gradient algorithm is given below.
Algorithm 1 (MDL+).
Step 1. Given $x_0 \in \mathbb{R}^n$ and $\varepsilon \ge 0$, set $d_0 = -g_0$ and $k := 0$; if $\|g_0\| \le \varepsilon$, stop.
Step 2. Compute $\alpha_k$ by the strong Wolfe-Powell line search (21)-(22).
Step 3. Set $x_{k+1} = x_k + \alpha_k d_k$ and compute $g_{k+1}$; if $\|g_{k+1}\| \le \varepsilon$, stop.
Step 4. Compute $\beta_{k+1}$ by (26) and generate $d_{k+1}$ by (3).
Step 5. Set $k := k + 1$ and go to Step 2.
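The loop structure of Algorithm 1 can be sketched as follows (our Python sketch, not the paper's Fortran 77 code). Since (26) is a displayed formula in the paper, the update rule is passed in as beta_fn; SciPy's line_search, which enforces the strong Wolfe conditions with c1, c2 playing the roles of $\delta$, $\sigma$, stands in for the paper's line search:

```python
import numpy as np
from scipy.optimize import line_search

def cg_driver(f, grad, x0, beta_fn, eps=1e-5, max_iter=1000):
    """Skeleton of Algorithm 1; beta_fn(g, g_prev, d_prev, s_prev) -> beta."""
    x, g = x0, grad(x0)
    d = -g                                        # Step 1
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:              # stopping test
            break
        alpha = line_search(f, grad, x, d, gfk=g, c1=1e-4, c2=0.1)[0]  # Step 2
        if alpha is None:                         # line search failed
            break
        x_new = x + alpha * d                     # Step 3: iteration (2)
        g_new = grad(x_new)
        beta = beta_fn(g_new, g, d, x_new - x)    # Step 4: e.g. (26)
        d = -g_new + beta * d                     # direction (3)
        x, g = x_new, g_new                       # Step 5
    return x
```

For example, cg_driver(f, grad, x0, lambda g, gp, dp, sp: beta_dl(g, gp, dp, sp, 0.1)) runs the DL rule from the earlier sketch as a stand-in for (26).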

Convergence Analysis
In the convergence analysis of conjugate gradient methods, we often make the following basic assumptions on the objective function.
Assumption A. (i) The level set $\Omega = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is bounded. (ii) In some neighborhood $N$ of $\Omega$, $f$ is continuously differentiable and its gradient is Lipschitz continuous; namely, there exists a constant $L > 0$ such that
$$\|g(x) - g(y)\| \le L \|x - y\|, \quad \forall x, y \in N.$$
Under the above assumptions on $f$, there exists a constant $\bar{\gamma} \ge 0$ such that
$$\|g(x)\| \le \bar{\gamma}, \quad \forall x \in \Omega.$$
We say the descent condition holds if, for each search direction $d_k$,
$$g_k^T d_k < 0. \tag{30}$$
In addition, we say the sufficient descent condition holds if there exists a constant $c > 0$ such that, for each search direction $d_k$, we have
$$g_k^T d_k \le -c \|g_k\|^2. \tag{31}$$
Under Assumption A, based on the Zoutendijk condition in [10], namely
$$\sum_{k \ge 0} \frac{\left(g_k^T d_k\right)^2}{\|d_k\|^2} < \infty,$$
Dai et al. in [20] proved the following lemma for any conjugate gradient method with the strong Wolfe-Powell line search.

Lemma 2. Suppose that Assumption A holds. Consider any method of the form (2)-(3), where $d_k$ satisfies the descent condition (30) and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. Then either $\liminf_{k \to \infty} \|g_k\| = 0$ or
$$\sum_{k \ge 0} \frac{\|g_k\|^4}{\|d_k\|^2} < \infty.$$
Proof. First, note that $g_k \ne 0$; otherwise (31) would be false. Therefore $\beta_k^{MDL+}$ is well defined. In addition, relation (40) and Lemma 2 provide the bound used below. Now, we divide formula $\beta_k^{MDL+}$ into two parts as in (49). On the other hand, line search condition (22) gives (50). Relations (22), (31), and (50) then imply the desired estimate, and the proof is completed.
Gilbert and Nocedal [4] introduced property (*), which is very important in the convergence analysis of conjugate gradient methods. In fact, with Assumption A, (40), and (50), if (31) holds with some constant $c > 0$, then the method with $\beta_k^{MDL+}$ possesses property (*).

We say that the method has property (*) if there exist constants $b > 1$ and $\lambda > 0$ such that, for all $k$, $|\beta_k| \le b$ and, whenever $\|s_{k-1}\| \le \lambda$,
$$|\beta_k| \le \frac{1}{2b}.$$
In fact, by (31), (40), and (50), we have (55). Combining (55) with (27), (28), and (29), we obtain (56). Note that $b$ can be defined such that $b > 1$. As a result, we can define $\lambda$ so that the first inequality in (56) yields $|\beta_k^{MDL+}| \le 1/(2b)$ whenever $\|s_{k-1}\| \le \lambda$.

Let $\mathbb{N}^*$ denote the set of positive integers. For $\lambda > 0$ and a positive integer $\Delta$, denote
$$\mathcal{K}_{k,\Delta}^{\lambda} := \left\{ i \in \mathbb{N}^* : k \le i \le k + \Delta - 1, \ \|s_{i-1}\| > \lambda \right\},$$
and let $|\mathcal{K}_{k,\Delta}^{\lambda}|$ denote the number of elements in $\mathcal{K}_{k,\Delta}^{\lambda}$. From the above property (*), we can prove the following theorem.

Theorem 6. Suppose that Assumption A holds. Consider the MDL+ method, where $d_k$ satisfies condition (31) with $c > 0$ and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. If (40) holds, then there exists $\lambda > 0$ such that, for any $\Delta \in \mathbb{N}^*$ and any index $k_0$, there is an index $k \ge k_0$ with
$$\left| \mathcal{K}_{k,\Delta}^{\lambda} \right| > \frac{\Delta}{2}.$$

The proof of this theorem is similar to the proof of Lemma 3.5 in [13], so we omit it. According to the above lemmas and theorems, we can prove the following convergence result for the MDL+ method.
Theorem 7. Suppose that Assumption A holds. Consider the MDL+ method, where $d_k$ satisfies condition (31) with $c > 0$ and $\alpha_k$ is obtained by the strong Wolfe-Powell line search. Then we have
$$\liminf_{k \to \infty} \|g_k\| = 0.$$

Proof. Suppose, for contradiction, that (40) holds. Let $\lambda > 0$ be given by Theorem 6 and define $\Delta := \lceil 8D/\lambda \rceil$, the smallest integer not less than $8D/\lambda$, where $D$ denotes the diameter of the level set $\Omega$. By Theorem 6, we can find an index $k_0 \ge 1$ such that
$$\left| \mathcal{K}_{k_0,\Delta}^{\lambda} \right| > \frac{\Delta}{2}.$$
With this $\Delta$ and $k_0$, arguing as in [13], one obtains $\Delta < 8D/\lambda$, which contradicts the definition of $\Delta$. The proof is completed.

Numerical Results
In this section, we report the performance of Algorithm 1 (MDL+) on a set of test problems. The codes were written in Fortran 77 in double precision arithmetic. All tests were performed on the same PC (Intel Core i3 CPU M370 @ 2.4 GHz, 2 GB RAM). The experiments were performed on a set of 73 nonlinear unconstrained problems collected by Neculai Andrei; some of the problems are from the CUTE [21] library. For each test problem, we performed 10 numerical experiments with the number of variables $n = 1000, 2000, \ldots, 10000$.
In order to assess the reliability of the MDL+ algorithm, we also tested this method against the DL and HS methods on the same problems. All these algorithms terminate when $\|g_k\| \le 10^{-5}$. We also force the routines to stop if the number of iterations exceeds 1000 or the number of function evaluations reaches 2000. The parameters $\delta$ and $\sigma$ in the Wolfe-Powell line search conditions (21) and (22) are set to $10^{-4}$ and $10^{-1}$, respectively. For the DL method, $t = 0.1$, the same as in [13]. We also tested the MDL+ algorithm with different parameters $t$ and found that $t = 0.05$ is the best choice.
The comparison data contain the number of iterations, the function and gradient evaluations, and the CPU time. To assess the performance of the MDL+, HS, and DL methods, we use the performance profile of Dolan and Moré [22] as an evaluation tool. Dolan and Moré [22] introduced the notion of a performance profile as a means to evaluate and compare the performance of a set of solvers $S$ on a test set $P$. Assuming that there exist $n_s$ solvers and $n_p$ problems, for each problem $p$ and solver $s$, they defined $t_{p,s}$ as the computing cost (iterations, function and gradient evaluations, or CPU time) required to solve problem $p$ by solver $s$.
Requiring a baseline for comparisons, they compared the performance of solver $s$ on problem $p$ with the best performance by any solver on this problem; that is, they used the performance ratio
$$r_{p,s} = \frac{t_{p,s}}{\min\{t_{p,s} : s \in S\}}.$$
Suppose that a parameter $r_M \ge r_{p,s}$ for all $p, s$ is chosen, and set $r_{p,s} = r_M$ if and only if solver $s$ does not solve problem $p$. Then they defined
$$\rho_s(\tau) = \frac{1}{n_p} \left| \left\{ p \in P : r_{p,s} \le \tau \right\} \right|,$$
so that $\rho_s(\tau)$ is the probability for solver $s$ that a performance ratio $r_{p,s}$ is within a factor $\tau \ge 1$ of the best possible ratio.
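As a sketch (ours, not from [22]), $\rho_s(\tau)$ can be computed directly from a cost table whose rows are problems and whose columns are solvers, with failures encoded as np.inf so that their ratios exceed any finite $\tau$, mimicking the $r_M$ convention:

```python
import numpy as np

def performance_profile(costs, taus):
    """costs: (n_p, n_s) array with np.inf marking failures; assumes every
    problem is solved by at least one solver (others removed beforehand).
    Returns an (n_s, len(taus)) array of rho_s(tau) values."""
    best = costs.min(axis=1, keepdims=True)       # best cost per problem
    ratios = costs / best                         # performance ratios r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= tau) for tau in taus]
                     for s in range(costs.shape[1])])

# Example: 2 problems, 3 solvers; solver 3 fails on problem 1.
print(performance_profile(np.array([[10., 12., np.inf],
                                    [5., 4., 6.]]), [1.0, 1.5, 2.0]))
```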
The function $\rho_s$ is thus the distribution function for the performance ratio; it is nondecreasing and piecewise constant. That is, for each method being analyzed, we plot the fraction $\rho$ of the problems for which that method is within a factor $\tau$ of the best.
For the test problems, if all three methods fail to terminate successfully, then the problem is removed from the test set. In case one method fails but another method terminates successfully, the performance ratio of the failed method is set to $r_M$ (the maximum of the performance ratios).
The performance profiles based on iterations, function and gradient evaluations, and CPU time of the three methods are plotted in Figures 1, 2, and 3, respectively. From Figure 1, which plots the performance profile based on iterations, when $\tau = 1$ the HS method performs better than the MDL+ and DL methods. As $\tau$ increases, for $\tau \ge 1.3$, the profile of the MDL+ method outperforms those of the HS and DL methods. This means that, from the iteration point of view, the HS method is better than the MDL+ and DL methods on a subset of problems, but over the whole test set the MDL+ method is much more robust than the HS and DL methods.
From Figure 2, which plots the performance profile based on function and gradient evaluations, it is easy to see that, for all $\tau \ge 1$, the MDL+ method performs much better than the HS and DL methods. This is an interesting phenomenon since, when $\tau \le 1.3$, the profile of the HS method based on iterations outperforms the MDL+ method. This means that, during the iteration process, the function and gradient evaluations required by the MDL+ method are much fewer than those of the HS and DL methods. From this point of view, the CPU time consumed by the MDL+ method should be much less than that of the HS and DL methods, since the CPU time mainly depends on the function and gradient evaluations. Figure 3 confirms that the CPU time consumed by the MDL+ method is much less than that of the HS and DL methods.
