A Transformation of Accelerated Double Step Size Method for Unconstrained Optimization

A reduction of the original double step size iteration to a single step length scheme is derived under a proposed condition that relates the two step lengths in the accelerated double step size gradient descent scheme. The proposed transformation is numerically tested. The obtained results confirm substantial progress in comparison with the classically defined single step size accelerated gradient descent method with respect to all analyzed characteristics: number of iterations, CPU time, and number of function evaluations. Linear convergence of the derived method is proved.


Introduction and Background
The SM iteration from [1] is defined by the iterative process

x_{k+1} = x_k − t_k γ_k^{-1} g_k, (1)

where x_{k+1} is the new iterative point, x_k is the previous iterative point, g_k = ∇f(x_k) is the gradient vector (taken as the search direction), t_k is a step length, and γ_k > 0 is the acceleration parameter. In [1] it is verified that the accelerated gradient SM iteration (1) outperforms the gradient descent (GD) method as well as Andrei's accelerated gradient descent (AGD) method from [2]. Double direction and double step size accelerated methods for solving unconstrained optimization problems, denoted by ADD and ADSS, respectively, are presented in [3, 4]. These two algorithms can be generally formulated through the merged expression

x_{k+1} = x_k + α_k s_k + β_k d_k, (2)

where x_k is the previous iterative point, the real values α_k and β_k denote two step lengths, and the vectors s_k and d_k are two search directions. The values of the step lengths are determined by backtracking line search techniques. The gradient is basically used for defining a search direction, but some new suggestions for deriving a descending vector direction are given in [3, 5]. Taking the substitutions s_k = −γ_k^{-1} g_k and β_k = α_k^2 into (2) produces the ADD iterative scheme from [3]:

x_{k+1} = x_k − α_k γ_k^{-1} g_k + α_k^2 d_k, (4)

where γ_k represents the acceleration parameter of the iteration (4). The benefits of the acceleration properties arising from the usage of the parameter γ_k are explained in [3]. The so-called nonaccelerated version of the ADD method (NADD for short) is defined in order to numerically verify the acceleration property of the parameter γ_k. Three methods, SM, ADD, and NADD, are numerically compared in [3]. The results show the substantial efficiency of the ADD scheme in comparison with its nonaccelerated counterpart NADD. The derivation of the direction vector d_k is explained in Algorithm 3.2 of [3]. The ADD method outperforms its competitor, the SM method from [1], with respect to the number of iterations.

Mathematical Problems in Engineering
By replacing the vectors s_k and d_k in (2) by −γ_k^{-1} g_k and −g_k, respectively, the next iteration is defined as

x_{k+1} = x_k − α_k γ_k^{-1} g_k − β_k g_k. (5)

The previous scheme is denoted as the ADSS model and is proposed in [4]. In the same paper, a considerable improvement in the performance of this accelerated gradient descent method, compared to the accelerated gradient descent SM method from [1], is numerically confirmed.
The main contribution of the present paper is a transformation of the double step size iterative scheme (5) for unconstrained optimization into an appropriate accelerated single step size scheme (called TADSS for short). Convergence properties of the introduced method are investigated. A particular contribution is the numerical confirmation that the TADSS algorithm, developed from the double step size ADSS model (5), is evidently more efficient than the accelerated SM method obtained in the classical way. Somewhat surprisingly, numerical experiments show that the TADSS method even overcomes the initial ADSS method.
The paper is organized in the following way. The reduction of the double step size ADSS model into the single step size iteration TADSS and the presentation of the defined accelerated gradient descent model are given in Section 2. Section 3 contains the convergence analysis of the derived algorithm for uniformly convex and strictly convex quadratic functions. The results of numerical experiments, as well as a comparative analysis of the developed method and its forerunners, are presented in Section 4.

Transformation of ADSS Scheme into a Single Step Size Iteration
The very good numerical results obtained in [4] motivated further research on this topic. The idea is to investigate the properties of a single step size method developed as a reduction of the double step size ADSS model. This reduction is defined by an additional assumption which represents a trade-off between the two step length parameters α_k and β_k in the ADSS scheme:

β_k = 1 − α_k. (6)

Substituting assumption (6) into expression (5), which defines the ADSS iteration, leads to the iterative process

x_{k+1} = x_k − (α_k(γ_k^{-1} − 1) + 1) g_k. (7)

The iteration (7) is denoted as the transformed ADSS method, or TADSS method for short. The defined TADSS iteration represents not only a reduction of the double step size ADSS model to a corresponding single step size method, but also a modification of the single step size SM iteration from [1]. This modification can be described as the substitution of the product t_k γ_k^{-1} from the SM iteration (1) by the multiplying factor α_k(γ_k^{-1} − 1) + 1 of the gradient in the TADSS iteration (7).
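The reduction above is a purely algebraic identity and can be checked directly. The following sketch (with illustrative values; the function names are ours, not the authors') confirms that one ADSS step with β_k = 1 − α_k coincides with the corresponding TADSS step.

```python
import numpy as np

# One ADSS step (5) and one TADSS step (7); with beta = 1 - alpha the two
# updates must coincide, since alpha/gamma + (1 - alpha) = alpha*(1/gamma - 1) + 1.
def adss_step(x, g, gamma, alpha, beta):
    return x - alpha * (1.0 / gamma) * g - beta * g

def tadss_step(x, g, gamma, alpha):
    psi = alpha * (1.0 / gamma - 1.0) + 1.0   # the composite step factor psi_k
    return x - psi * g

x = np.array([1.0, -2.0, 0.5])     # illustrative point
g = np.array([0.3, 0.1, -0.4])     # illustrative gradient
gamma, alpha = 2.5, 0.7
print(np.allclose(adss_step(x, g, gamma, alpha, 1.0 - alpha),
                  tadss_step(x, g, gamma, alpha)))   # prints True
```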
For the sake of simplicity, we use the notation ψ_k = α_k(γ_k^{-1} − 1) + 1 whenever possible. The value of the acceleration parameter γ_{k+1} in the (k+1)th iteration can be derived by using Taylor's expansion, similarly as described in [1, 3, 4]:

f(x_{k+1}) ≈ f(x_k) − ψ_k g_k^T g_k + (1/2) ψ_k^2 g_k^T ∇^2 f(ξ) g_k. (8)

The vector ξ in (8) satisfies

ξ = x_k + θ(x_{k+1} − x_k), θ ∈ [0, 1]. (9)

Further, it is reasonable to replace the Hessian ∇^2 f(ξ) in (8) by the diagonal matrix γ_{k+1} I, where γ_{k+1} is an appropriately chosen real number. This replacement implies

f(x_{k+1}) = f(x_k) − ψ_k ‖g_k‖^2 + (1/2) γ_{k+1} ψ_k^2 ‖g_k‖^2. (10)

The relation (10) allows us to compute the acceleration parameter γ_{k+1}:

γ_{k+1} = 2 [f(x_{k+1}) − f(x_k) + ψ_k ‖g_k‖^2] / (ψ_k^2 ‖g_k‖^2). (11)

Next, the natural inequality γ_{k+1} > 0 is inevitable; this condition is required in order to fulfill the second-order necessary and second-order sufficient conditions. The choice γ_{k+1} = 1 is reasonable in the case when the inequality γ_{k+1} < 0 appears for some k. This choice yields ψ_{k+1} = 1 and produces the next iterative point as

x_{k+2} = x_{k+1} − g_{k+1}, (12)

which evidently represents the classical gradient descent step.
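A minimal sketch of the update (11), together with the safeguard γ_{k+1} = 1 for a nonpositive outcome, may look as follows (the function name and argument names are ours):

```python
# Acceleration parameter update (11): gamma_{k+1} makes the local quadratic
# model reproduce the realized decrease; a nonpositive value falls back to 1,
# which turns the next step into a classical gradient descent step.
def update_gamma(f_new, f_old, g_norm_sq, psi):
    gamma_next = 2.0 * (f_new - f_old + psi * g_norm_sq) / (psi**2 * g_norm_sq)
    return gamma_next if gamma_next > 0.0 else 1.0
```

For a one-dimensional quadratic f(x) = (1/2)λx^2 the update reproduces the curvature λ exactly, in line with the Rayleigh-quotient property established in Section 3.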
We now consider the (k+1)th iterative point, x_{k+2}, which is given by

x_{k+2} = x_{k+1} − ψ_{k+1} g_{k+1},  ψ_{k+1} = α_{k+1}(γ_{k+1}^{-1} − 1) + 1. (14)

Examine the function Φ_{k+1}(α):

Φ_{k+1}(α) = f(x_{k+1}) − ψ(α)‖g_{k+1}‖^2 + (1/2) γ_{k+1} ψ(α)^2 ‖g_{k+1}‖^2,  ψ(α) = α(γ_{k+1}^{-1} − 1) + 1, (15)

defined as the finite part of the Taylor expansion of the objective function under the assumption γ_{k+2} = γ_{k+1}. This function is convex when γ_{k+1} > 0, and its derivative Φ'_{k+1}(α) is calculated in the following way:

Φ'_{k+1}(α) = (γ_{k+1}^{-1} − 1)(γ_{k+1}ψ(α) − 1)‖g_{k+1}‖^2 = ((1 − γ_{k+1})^2/γ_{k+1})(α − 1)‖g_{k+1}‖^2. (16)

Since the inequality γ_{k+1} > 0 is achieved, the following is valid:

Φ'_{k+1}(α) < 0 for α < 1,  Φ'_{k+1}(1) = 0. (17)

Therefore, the function Φ_{k+1}(α) decreases while Φ'_{k+1}(α) < 0 and achieves its minimum at α = 1. According to the criterion given by (17), the desirable values for α lie within the interval (−∞, 1].

Now, (7) is a kind of gradient descent process when ψ_k = α_k(γ_k^{-1} − 1) + 1 > 0. Since γ_k > 0, it is easy to verify the following condition for the step length α_k in the case γ_k > 1:

α_k < γ_k/(γ_k − 1). (18)

Since γ_k/(γ_k − 1) > 1 when γ_k > 1, while the desirable values of α_k already satisfy α_k ≤ 1, this fraction is not a binding upper bound for α_k. On the other hand, in the case γ_k < 1 the coefficient γ_k^{-1} − 1 is positive, so ψ_k > 0 holds for every α_k > 0 and (18) imposes no restriction. Further analysis of (10) in the case γ_k < 1 starts from the descent requirement

f(x_{k+1}) − f(x_k) = −ψ_k (1 − (1/2) γ_{k+1} ψ_k) ‖g_k‖^2 < 0, (19)

which implies

ψ_k < 2/γ_{k+1}. (20)

Taking into account γ_k < 1, it is not difficult to verify that the criterion (20) restricts the desirable values of α_k to the interval

(0, γ_k(2 − γ_{k+1})/(γ_{k+1}(1 − γ_k))]. (21)

The final conclusion is that the upper bound for α_k in the case γ_k < 1, γ_{k+1} < 1 is defined as the minimum between the upper bound 1 from (17) and the upper bound from (21):

α_k ≤ min{1, γ_k(2 − γ_{k+1})/(γ_{k+1}(1 − γ_k))}. (22)

According to the previous discussion, the iterative step α_k is derived by the backtracking line search procedure presented in Algorithm 1.
Algorithm 1 (calculation of the step size α_k by the backtracking line search which starts from the upper bound defined in (22)). Requirement: objective function f(x), the search direction d_k at the point x_k, and real numbers 0 < σ < 0.5 and β ∈ (σ, 1).
(1) Set α equal to the upper bound defined in (22).
(2) While f(x_k + α d_k) > f(x_k) + σ α g_k^T d_k, take α := β α.
(3) Return α_k = α.
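Assuming the standard Armijo form suggested by the stated requirements (σ the descent parameter, β the shrinking factor), the procedure for the gradient direction d_k = −g_k can be sketched as follows (the function and its argument names are ours):

```python
import numpy as np

# Backtracking line search: start from the upper bound alpha_max of (22)
# and multiply by beta until the Armijo condition with parameter sigma holds
# along the descent direction d = -g.
def backtracking(f, x, g, alpha_max, sigma=1e-4, beta=0.8):
    alpha = alpha_max
    fx = f(x)
    while f(x - alpha * g) > fx - sigma * alpha * (g @ g):
        alpha *= beta
    return alpha
```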
Finally, the TADSS algorithm of the defined accelerated gradient descent scheme (7) is presented.
(1) Set k = 0, choose an initial point x_0 and a real number 0 < ε ≪ 1, compute f(x_0) and g_0 = ∇f(x_0), and take γ_0 = 1.
(2) If ‖g_k‖ < ε, then go to Step 8; else continue with the next step.
(3) Find the step size α_k by applying Algorithm 1.
(4) Compute x_{k+1} using (7).
(5) Determine the acceleration parameter γ_{k+1} using (11).
(6) If γ_{k+1} < 0, then take γ_{k+1} = 1.
(7) Set k := k + 1 and go to Step 2.
(8) Return x_k and f(x_k).
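Putting the pieces together, an end-to-end reading of the TADSS scheme (7) can be sketched as below. The simplified starting bound α = 1, the cap on the inner loop, and all names are our own assumptions, not the authors' implementation.

```python
import numpy as np

# Sketch of the TADSS loop: backtracking on alpha_k, composite factor
# psi_k = alpha_k*(1/gamma_k - 1) + 1, update (7) for x and (11) for gamma.
def tadss(f, grad, x0, eps=1e-6, max_iter=10000, sigma=1e-4, beta=0.8):
    x, gamma = np.asarray(x0, dtype=float), 1.0
    for _ in range(max_iter):
        g = grad(x)
        gg = g @ g
        if np.sqrt(gg) < eps:
            break
        fx = f(x)
        psi_of = lambda a: a * (1.0 / gamma - 1.0) + 1.0
        alpha = 1.0                      # simplified stand-in for bound (22)
        for _ in range(50):              # capped so the sketch cannot hang
            if f(x - psi_of(alpha) * g) <= fx - sigma * psi_of(alpha) * gg:
                break
            alpha *= beta
        psi = psi_of(alpha)
        x_new = x - psi * g              # the TADSS step (7)
        gamma = 2.0 * (f(x_new) - fx + psi * gg) / (psi**2 * gg)   # update (11)
        if gamma <= 0.0:
            gamma = 1.0                  # safeguard: next step is plain GD
        x = x_new
    return x
```

On a mildly anisotropic convex quadratic the sketch drives the gradient norm to zero in a handful of iterations.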

Convergence of TADSS Scheme
This section contains the convergence analysis of the TADSS method. In the first part of the section a set of uniformly convex functions is considered. The proofs of the following statements can be found in [6, 7] and are omitted here.
Proposition 3 (see [6, 7]). If the function f : R^n → R is twice continuously differentiable and uniformly convex on R^n, then:
(1) the function f has a lower bound on the level set L_0 = {x ∈ R^n : f(x) ≤ f(x_0)}, where x_0 ∈ R^n is available;
(2) the gradient g is Lipschitz continuous on an open convex set B which contains L_0, that is, there exists L > 0 such that ‖g(x) − g(y)‖ ≤ L‖x − y‖ for all x, y ∈ B.
The decrease of the analyzed function in each iteration is estimated by the next lemma, which is restated and proved in [1]; the same estimate can be obtained in a similar way for the iteration (7). Theorem 6, proved in [1], confirms the linear convergence of the constructed method.
Lemma 5. For a twice continuously differentiable function f which is uniformly convex on R^n, and for the sequence {x_k} generated by the iteration (7), the following inequality is valid:

f(x_k) − f(x_{k+1}) ≥ μ ‖g_k‖^2,

where μ > 0 is a constant determined by the parameters of the backtracking procedure and the constants from Proposition 3; consequently, the sequence {x_k} converges to x* at least linearly.
In the following, the case of strictly convex quadratic functions is analyzed. This set of functions is given by

f(x) = (1/2) x^T A x − b^T x, (29)

where A is a real n × n symmetric positive definite matrix and b ∈ R^n. It is assumed that the eigenvalues of the matrix A are ordered as λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n. Since the convergence of most gradient methods is quite difficult to analyze in general, in many research articles of this profile the convergence analysis is restricted to the set of convex quadratics [8–10]. The convergence of the TADSS method is analyzed under similar presumptions.
Lemma 7. Apply the gradient descent method defined by (7), in which the parameters γ_k and α_k are given by relation (11) and Algorithm 1, to the strictly convex quadratic function f expressed by (29), where A ∈ R^{n×n} is a symmetric positive definite matrix. Then the following inequalities hold:

λ_1 ≤ γ_{k+1} ≤ λ_n, (30)

where λ_1 and λ_n are, respectively, the smallest and the largest eigenvalues of A.
Proof. Considering expression (29), the difference between the function values at the current and the previous point is

f(x_{k+1}) − f(x_k) = (1/2) x_{k+1}^T A x_{k+1} − b^T x_{k+1} − (1/2) x_k^T A x_k + b^T x_k. (31)

Applying expression (7), that is, x_{k+1} = x_k − ψ_k g_k, the following is obtained:

f(x_{k+1}) − f(x_k) = −ψ_k g_k^T A x_k + (1/2) ψ_k^2 g_k^T A g_k + ψ_k b^T g_k. (32)

Using the fact that the gradient of the function (29) is g_k = A x_k − b, in conjunction with the equality b^T g_k = g_k^T b, one can verify the following:

f(x_{k+1}) − f(x_k) = −ψ_k ‖g_k‖^2 + (1/2) ψ_k^2 g_k^T A g_k. (33)

Substituting (33) into (11), the parameter γ_{k+1} becomes

γ_{k+1} = g_k^T A g_k / (g_k^T g_k). (34)

The last relation confirms that γ_{k+1} is the Rayleigh quotient of the real symmetric matrix A at the vector g_k, so the inequalities

λ_1 ≤ g_k^T A g_k / (g_k^T g_k) ≤ λ_n (35)

hold, which proves (30). Note also that, following the estimation proved in [1], the step length produced by the backtracking procedure of Algorithm 1 is bounded below in terms of a Lipschitz constant L of the gradient. For the quadratic function (29), the largest eigenvalue λ_n has the property of a Lipschitz constant: since the matrix A is symmetric and g(x) = Ax − b, it holds that ‖g(x) − g(y)‖ = ‖A(x − y)‖ ≤ λ_n ‖x − y‖, so L can be replaced by λ_n in that estimation.
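The Rayleigh-quotient identity (34) and the bounds (35) are easy to verify numerically; the matrix, the point, and the step factor below are arbitrary illustrative choices.

```python
import numpy as np

# For f(x) = 0.5 x^T A x - b^T x the update (11) reproduces the Rayleigh
# quotient of A at g, independently of the step factor psi, and therefore
# lies between the extreme eigenvalues of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda z: 0.5 * (z @ A @ z) - b @ z
x = np.array([0.7, -0.3])
g = A @ x - b
psi = 0.37                               # any admissible step factor
x_new = x - psi * g
gamma = 2.0 * (f(x_new) - f(x) + psi * (g @ g)) / (psi**2 * (g @ g))
rayleigh = (g @ A @ g) / (g @ g)
lam = np.linalg.eigvalsh(A)
print(np.isclose(gamma, rayleigh), lam[0] <= gamma <= lam[-1])   # True True
```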
Remark 8. Comparing the estimates of the corresponding lemmas in [1, 3, 4] with the estimate derived in the previous lemma for the TADSS method, it can be concluded that the estimate provided by the TADSS scheme involves only the eigenvalues λ_1 and λ_n, and not the parameter β from the backtracking procedure.
Theorem 9. Let the additional assumption λ_n < 2λ_1 on the eigenvalues of the matrix A be imposed, and let f be the strictly convex quadratic function given by (29). Assume that {v_1, v_2, …, v_n} is an orthonormal set of eigenvectors of the symmetric positive definite matrix A, and suppose that {x_k} is the sequence of values constructed by Algorithm 2. The gradients of the convex quadratic (29) are g_k = A x_k − b and can be expressed as

g_k = Σ_{i=1}^{n} c_i^k v_i (42)

for some real constants c_1^k, c_2^k, …, c_n^k and each integer k. Then the application of the gradient descent method (7) to the goal function (29) satisfies the following two statements:

‖g_{k+1}‖ ≤ δ ‖g_k‖ for some 0 < δ < 1, (43)

lim_{k→∞} ‖g_k‖ = 0. (44)

Proof. Taking into account (7), one can verify

g_{k+1} = A x_{k+1} − b = A(x_k − ψ_k g_k) − b = (I − ψ_k A) g_k, (45)

and by taking (42) we get

g_{k+1} = Σ_{i=1}^{n} (1 − ψ_k λ_i) c_i^k v_i. (46)

In order to prove (43), it is enough to show that |1 − ψ_k λ_i| ≤ δ for each i. So, two cases have to be analyzed. In the first one, it is supposed that ψ_k λ_i ≤ 1. Since the minimizing value α_k = 1 gives ψ_k = γ_k^{-1}, inequality (30) yields ψ_k ≥ 1/λ_n, and applying this leads to

1 − ψ_k λ_i ≤ 1 − λ_1/λ_n =: δ_1 < 1. (47)

In the other case, it is assumed that ψ_k λ_i ≥ 1. From ψ_k ≤ 1/λ_1, which follows from (30) in the same way, together with the assumption λ_n < 2λ_1, the following conclusion arrives:

ψ_k λ_i − 1 ≤ λ_n/λ_1 − 1 =: δ_2 < 1. (48)

Expression (42) implies ‖g_k‖^2 = Σ_{i=1}^{n} (c_i^k)^2, so (46) gives ‖g_{k+1}‖ ≤ δ ‖g_k‖ with δ = max{δ_1, δ_2}, which proves (43). The fact that the parameter δ from (43) satisfies 0 < δ < 1 confirms expression (44).
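The key recursion used in the proof, g_{k+1} = (I − ψ_k A) g_k, can likewise be checked on a small example (illustrative data):

```python
import numpy as np

# For the quadratic (29) one step x_new = x - psi*g maps the gradient by
# g_new = (I - psi*A) g, so every eigen-coefficient of g is scaled by
# (1 - psi*lambda_i), which is the contraction used in Theorem 9.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([0.3, -0.2])
x = np.array([1.0, 2.0])
psi = 0.6
g = A @ x - b
x_new = x - psi * g
g_new = A @ x_new - b
print(np.allclose(g_new, (np.eye(2) - psi * A) @ g))   # prints True
```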

Numerical Experience
Numerical results obtained by implementations of the TADSS, ADSS, and SM methods on 22 unconstrained test functions, proposed in [2, 11], are presented and investigated. We chose most of the functions from the set of test functions used in [3, 4] and, as proposed in these papers, ran the experiments with a large number of variables in each function: 1000, 2000, 3000, 5000, 7000, 8000, 10000, 15000, 20000, and 30000. The stopping criteria are the same as in [1, 3, 4]. The backtracking procedure uses the parameter values σ = 0.0001 and β = 0.8. Three main indicators of efficiency are observed: the number of iterations, the CPU time, and the number of function evaluations. First, we compare the performance of the TADSS scheme with the ADSS method. The reason for this choice is obvious: the TADSS scheme is a one-step version of the ADSS method, so the intention to examine the behavior of TADSS against its forerunner is natural. The obtained numerical values are displayed in Table 1 and refer to the number of iterative steps, the CPU time of execution in seconds, and the number of evaluations of the objective function.
The obtained numerical results generally confirm the advantage of TADSS with respect to all three tested indicators. More precisely, regarding the number of iterative steps, TADSS shows better results on 17 out of 22 functions, ADSS outperforms TADSS in 4 out of 22 experiments, and for the extended three exponential terms function both methods require the same number of iterations. The results concerning the consumed CPU time confirm that both methods, TADSS and ADSS, are very fast: in 9 out of 22 cases TADSS is faster than ADSS, in 2 out of 22 tests ADSS is faster than TADSS, and in the remaining cases both methods take the same time. According to the results displayed in Table 2, it can be concluded that, although TADSS outperforms ADSS in 17 out of 22 tests with regard to the number of iterations, the average results show a slight advantage of ADSS on this matter. Considering the average number of evaluations, the situation is the opposite, in favor of TADSS. The average CPU time consumed by TADSS is about three times smaller than that of ADSS. Generally, it can be concluded that the one-step variant of the ADSS method, the constructed TADSS scheme, behaves slightly better than the original ADSS iteration, especially with respect to the speed of execution. Some additional experiments have been carried out in further numerical research. These tests compare the TADSS and SM iterations. As mentioned before, both schemes, TADSS and SM, are accelerated gradient descent methods with one iterative step size parameter. We chose this additional numerical comparison in order to confirm that the accelerated single step size TADSS algorithm, derived from the accelerated double step size ADSS model, gives better performance with respect to all three analyzed aspects than the classically defined accelerated single step size SM method. Table 3, with 30 displayed test functions, verifies the previous assertion.
It can be observed from the displayed numerical outcomes that the TADSS method provides better results than the SM method considering the number of iterations in 24 out of 30 tests, while the number of opposite cases is 5 out of 30; for the NONSCOMP test function, both models need the same number of iterations. Concerning the CPU time, both algorithms give the same results for 10 test functions. The TADSS method is faster than SM for 19 test functions, while the SM method is faster than TADSS for one test function only. The greatest progress is obtained with respect to the number of evaluations of the objectives. On this matter, the TADSS algorithm obtains better results in 27 out of 30 test functions, while the opposite holds for two test functions only; for the NONSCOMP test function both compared iterations give the same number of evaluations. From Table 3 we can also notice that for 2 out of 30 test functions the SM runs last longer than the time limit constant defined in [3], while for all 30 test functions the execution time of the TADSS algorithm is far below this limit. The results arranged in Table 4 give an even more general view of the benefits provided by applying the TADSS method compared to the SM method. The average values over the 28 test functions which both methods completed within the time limit are presented in the table.
The results presented in the previous table confirm that the TADSS method needs approximately 25 times fewer iterations and even 35 times fewer evaluations of the objective function than the SM method. Finally, the TADSS runs are even 107 times shorter than the SM runs.
The codes for the presented numerical experiments are written in the Visual C++ programming language and run on an Intel 2.2 GHz workstation.

Conclusion
The accelerated single step size gradient descent algorithm, called TADSS, is defined as a transformation of the accelerated double step size gradient descent model ADSS proposed in [4]. More precisely, the TADSS scheme is derived from the

Table 1 :
Summary of numerical results for TADSS and ADSS tested on 22 large scale test functions.

Table 2 :
Average numerical outcomes of 220 tests of each method: the 22 test functions, each tried out in 10 numerical experiments (one for each number of variables).

Table 3 :
Numerical results for 30 test functions tested by the TADSS and the SM methods.

Table 4 :
Average values of numerical results for the TADSS and the SM methods calculated on 280 tests for each method.