Effective Algorithms for Solving Trace Minimization Problem in Multivariate Statistics

This paper develops two novel and fast Riemannian second-order approaches for solving a class of matrix trace minimization problems with orthogonality constraints, which are widely applied in multivariate statistical analysis. The existing majorization method is guaranteed to converge, but its convergence rate is at best linear. A hybrid Riemannian Newton-type algorithm with both global and quadratic convergence is proposed first. A Riemannian trust-region method based on the proposed Newton method is then provided. Numerical tests and applications to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model are given to demonstrate the efficiency of the proposed methods. Comparisons with some recent Riemannian gradient-type methods and some existing Riemannian second-order algorithms in the MATLAB toolbox Manopt are also presented.


Introduction
Many multivariate statistical techniques are based on fitting a model to observed data. In most cases, these models are fitted by least squares, and as a consequence, one has to minimize a least squares loss function possibly subject to certain constraints on the parameters in the model. Least squares loss functions can often be expressed in terms of matrix trace functions. Sometimes, these have straightforward closed-form solutions, but also often such solutions are not available. In such cases, one may resort to iterative algorithms.
The main purpose of the present paper is to provide several efficient algorithms that can be used for optimizing such matrix trace functions.
A class of functions that would include all those of interest in multivariate statistical analysis can probably not be given, but in practice many statistical problems can be reformulated as special cases of the optimization of the general function [1][2][3]

f(X) = c + tr(AX) + Σ_{j=1}^m tr(B_j X C_j X^T),  (1)

where A ∈ R^{p×n} (p ≤ n), B_j ∈ R^{n×n}, C_j ∈ R^{p×p}, j = 1, . . . , m; X is an unknown n × p matrix, c is a constant that does not depend on X, and tr(M) means the sum of the diagonal elements of M. The matrices A, B_j, and C_j are at this point arbitrary but of the required orders. In fact, the problem of least squares fitting of a model M_X (involving a matrix X of model parameters) to a data matrix Y, that is, the minimization of σ(X) = ‖Y − M_X‖_F^2 over X (with ‖·‖_F denoting the Frobenius norm), can often be written in a form like that of (1). To give an example, consider the case where the data matrix Y has submatrices Y_k and the model M_X has submatrices of the form F_k X G; then

σ(X) = Σ_{k=1}^m ‖Y_k − F_k X G‖_F^2 = Σ_{k=1}^m tr(Y_k^T Y_k) − 2 Σ_{k=1}^m tr(G Y_k^T F_k X) + Σ_{k=1}^m tr(GG^T X^T F_k^T F_k X).  (2)
Then, minimization of (2) is equivalent to minimization of (1) by letting c = Σ_{k=1}^m tr(Y_k^T Y_k), A = −2 Σ_{k=1}^m G Y_k^T F_k, B_k = F_k^T F_k, and C_k = GG^T, k = 1, . . . , m. The function f(X) can be minimized over arbitrary X, but in many practical applications of multivariate statistics, constraints are imposed, the most frequent one being columnwise orthonormality: X^T X = I_p [1][2][3][4][5][6][7][8][9][10], where I_p denotes the identity matrix of size p. Another common constraint is that of reduced rank; that is, the rank of X is forced to be less than or equal to r (r ≤ min(n, p)). The present paper discusses only the case of columnwise orthonormality. Therefore, the considered matrix trace minimization problem with constraints can be mathematically formulated as the following problem.

Problem 1. minimize f(X) = c + tr(AX) + Σ_{j=1}^m tr(B_j X C_j X^T) subject to X ∈ R^{n×p}, X^T X = I_p,  (3)

where A ∈ R^{p×n}, B_j ∈ R^{n×n}, and C_j ∈ R^{p×p}, j = 1, . . . , m, are given, c is a constant, and 1 ≤ p ≤ n.
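As a minimal illustration (not from the paper), the objective and feasibility of Problem 1 can be evaluated directly; this sketch assumes the form f(X) = c + tr(AX) + Σ_j tr(B_j X C_j X^T), and the function names are ours:

```python
import numpy as np

def trace_objective(X, A, Bs, Cs, c=0.0):
    """Evaluate f(X) = c + tr(A X) + sum_j tr(B_j X C_j X^T)."""
    val = c + np.trace(A @ X)
    for B, C in zip(Bs, Cs):
        val += np.trace(B @ X @ C @ X.T)
    return val

def is_feasible(X, tol=1e-10):
    """Check the columnwise orthonormality constraint X^T X = I_p."""
    p = X.shape[1]
    return np.linalg.norm(X.T @ X - np.eye(p)) < tol
```

For instance, the Q factor of a QR decomposition of a random n × p matrix is a feasible point at which f can be evaluated.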
Next, we present several practical applications of Problem 1 in multivariate statistics. The first example was proposed in [1,4] for a least squares fit of a model for linear dynamical systems. Following their notation, let B be a fixed n × n matrix, D, F, and H be fixed p × p matrices, X and Y be fixed n × p matrices, and Z be a variable n × p matrix. One of their subproblems is to minimize a function f_1(Z) over Z, subject to the constraint Z^T Z = I_p. This function can be expanded as a sum of matrix trace terms plus a constant c_1 not depending on Z, and it is readily verified that f_1(Z) is a special case of f(X) with X replaced by Z. The second example is the least squares fitting of the "DEDICOM model" [1,5], which can be written as minimizing ‖F − XRX^T‖_F^2, where F is an arbitrary fixed n × n matrix and X and R are variable matrices of orders n × r and r × r, respectively, with X columnwise orthonormal. A more detailed description of this model is given in the Numerical Experiments section. Kiers et al. [6] have given an alternating least squares algorithm for fitting this model. The DEDICOM function, with R considered fixed, can be rewritten as c_2 − 2 tr(F^T X R X^T), where c_2 is a constant not depending on X. This is a special case of f(X) with A = 0, B_1 = −2F^T, C_1 = R. The third example is fitting the "orthonormal INDSCAL model" (see [7], pp. 110, 118 and [1,10]). Let S_j be a fixed symmetric matrix of order n × n, X be a variable matrix of order n × p, and D_j be a variable diagonal matrix of order p, j = 1, . . . , m. Fitting the orthonormal INDSCAL model is equivalent to minimizing the function f_3 = Σ_{j=1}^m ‖S_j − X D_j X^T‖_F^2 subject to X^T X = I_p. As in Berge [8], f_3 can be minimized by alternatingly updating X and D_j, j = 1, . . . , m. The function f_3, with D_j, j = 1, . . . , m, considered fixed, can be rewritten as c_3 − 2 Σ_{j=1}^m tr(S_j X D_j X^T), where c_3 is a constant with respect to X. This is the particular case of f(X) with A = 0, B_j = −2S_j, C_j = D_j, j = 1, . . . , m.
As a final example, we consider the simultaneous components analysis problem [3,9] of minimizing a least squares function f_4 over Q_j and S, subject to S^T S = I_p. By means of an alternating least squares algorithm, the authors updated all parameter matrices in turn. The function f_4, with Q_j, j = 1, . . . , m, considered fixed, can be expanded in trace form, which, ignoring the constant c_4, is a special case of f(X) with X replaced by S, A = −2 Σ_j Q_j^T, B_j = R_j^{−1}, and C_j = Q_j^T Q_j, j = 1, . . . , m. Concerning numerical approaches for solving Problem 1, Kiers et al. [1][2][3] have proposed the well-known majorization algorithm, which is based on majorizing f(X) by a different function g(X) whose minimum is easily obtained. The majorizing function is chosen such that the parameter set at which the majorizing function attains its minimum also decreases the function to be minimized. For a more detailed description of such strategies, see [1][2][3]. However, since each iteration of the majorization method requires a full eigenvalue decomposition and the rate of convergence is at best linear, the method can be very slow. Averaged computational results of 10 random tests of the majorization method on Problem 1 are reported in Table 1, where the constant c = 0 and m = 2, and the given matrices A, B_1, C_1, B_2, and C_2 are all generated randomly by means of (in MATLAB style) A = randn(p, n), B_1 = randn(n, n), C_1 = randn(p, p), B_2 = randn(n, n), and C_2 = randn(p, p). In Table 1, "n, p" denotes the dimension parameters, and "CT." and "IT." denote the averaged total computing time in seconds and the averaged number of iterations to achieve the stopping criteria, respectively. "Grad." denotes the averaged norm of the Riemannian gradient grad f(X_k), "f_diff = |f(X_{k+1}) − f(X_k)|" denotes the averaged difference of consecutive objective function values, and "X_feasi" denotes the averaged feasibility value ‖X_k^T X_k − I_p‖_F.
We can observe from Table 1 and the convergence histories of the objective function values (not reported here) that the existing majorization algorithm is a monotonically convergent algorithm for minimizing the matrix trace function iteratively. However, it requires a large number of iterations to converge to within a given tolerance because its rate of convergence is at best linear.
To overcome the slow linear rate of convergence, in this work we reconsider Problem 1 under the framework of Riemannian optimization, noting that the feasible set St(n, p) = {X ∈ R^{n×p} | X^T X = I_p}, with p ≤ n, is the well-known Stiefel manifold, an (np − (1/2)p(p + 1))-dimensional embedded submanifold of the vector space R^{n×p}. Riemannian optimization refers to optimization on Riemannian manifolds and has been extensively studied over decades. Optimization over the Stiefel manifold is an important special case, which has recently aroused considerable research interest due to its wide applications in different fields such as the linear eigenvalue problem, the orthogonal Procrustes problem, the nearest low-rank correlation matrix problem, Kohn-Sham total energy minimization, and the singular value decomposition. Since optimization over the Stiefel manifold can be viewed as a general nonlinear optimization problem with constraints, many standard algorithms [11] in the Euclidean space can be generalized to the manifold setting directly and have been explored and successfully applied to various applications, e.g., the Riemannian steepest descent method [12], the Riemannian curvilinear search method with Barzilai-Borwein (BB) steps [13], the Riemannian Dai nonmonotone-based conjugate gradient method [14,15], the Riemannian Polak-Ribière-Polyak-based nonlinear conjugate gradient method [16,17], and the Riemannian Fletcher-Reeves-based conjugate gradient method [18]. However, gradient-type algorithms often perform reasonably well but might converge slowly once the generated iterates are close to an optimal solution. Usually, fast local convergence cannot be expected if only gradient information is used. In the Euclidean space R^n, it is well known that higher rates of convergence can be achieved by using second-order information on the cost function.
The classical choice is Newton's method; it plays a central role in the development of numerical techniques for optimization because of its simple formulation and its quadratic convergence properties. The history of Newton's method on manifolds can be traced back to Gabay [19], who proposed a formulation of the method on embedded submanifolds of R^n. Edelman et al. [20] proposed a formulation of Newton's method on the Stiefel manifold. Recently, Aihara and Sato [21] presented a matrix-free implementation of Riemannian Newton's method on the Stiefel manifold. Sato [22] developed Riemannian Newton's method for the joint diagonalization problem on the Stiefel manifold, where the dimension of Newton's equation is reduced so that it can be effectively solved by means of the Kronecker product and the vec and veck operators. Very recently, Hu et al. [23] proposed a regularized Newton method for optimization problems on Riemannian manifolds. They used a second-order approximation of the objective function in the Euclidean space to form a sequence of quadratic subproblems while keeping the manifold constraints. Numerical experiments and comparisons with other state-of-the-art methods indicated that their proposed algorithm is very promising. Another popular second-order algorithm on the Stiefel manifold is the Riemannian trust-region (RTR) algorithm [24][25][26][27][28], which has been successfully applied to various applications. In particular, Yang et al. [29] presented the RTR algorithm for H_2 model reduction of bilinear systems, where the H_2 error norm is treated as a cost function on the Stiefel manifold. Sato [30] developed the RTR algorithm for the joint singular value decomposition of multiple rectangular matrices, which is formulated as a Riemannian optimization problem on the product of two Stiefel manifolds.
Motivated by these works, in this paper we are interested in extending Riemannian Newton's method and the Riemannian trust-region method to the underlying matrix trace minimization problem. We first derive explicit expressions for the Riemannian gradient and Hessian of the objective function of Problem 1. Specifically, we use the Kronecker product and the vectorization operators to reduce the dimension of the involved Newton's equation and to transform the equation into a standard symmetric linear system Hx = b.
The resultant system can be efficiently solved by means of direct inversion or some well-known Krylov subspace methods, such as the conjugate residual method (page 182 of [31]). Because Newton's method is not guaranteed to converge globally, we need to prepare a suitable starting point that is sufficiently close to an optimal solution. Here, we apply OptStiefelGBB, a state-of-the-art algorithm proposed by Wen and Yin [13], which has the benefit of requiring very little memory and has been proved to be globally convergent, to obtain a suitable initial point. The resulting hybrid Riemannian Newton-type algorithm is globally and quadratically convergent. In the Euclidean space R^n, it is well known that the pure Newton method converges only locally, and it cannot distinguish between local minima, local maxima, and saddle points. Compared to the pure Newton-type algorithm, the advantage of the trust-region algorithm is its more stable behavior [25]. For the RTR algorithm, convergence to stationary points is guaranteed for all initial points. By utilizing the Taylor expansion on the Stiefel manifold, the RTR algorithm constructs a trust-region subproblem on the tangent space. Here, we propose a new trust-region subproblem, based on our expression of the Riemannian Hessian of the objective function, which has a lower computational cost than the original subproblem. The new trust-region subproblem can be solved by the classical truncated conjugate gradient method, which is popular due to its good properties and relatively cheap computational cost. This paper is organized as follows. In Section 2, some basic geometric properties of the Stiefel manifold are given, and representation matrix formulas for the Riemannian gradient and Hessian of the objective function are derived. A hybrid Newton-type algorithm with global and quadratic convergence is provided in Section 3. A Riemannian trust-region-based algorithm for Problem 1 is described in Section 4.
Numerical experiments, applications to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model, and comparisons with the Riemannian gradient-type algorithms mentioned above and some existing second-order Riemannian algorithms in the MATLAB toolbox Manopt are reported in Section 5. The conclusion is presented in Section 6.

Preliminaries
In this section, we first recall some notations, definitions, and basic properties of Riemannian manifolds used throughout the paper. The tangent space at x on a manifold M is denoted by T_x M. For manifolds M and N and a mapping f: M → N, the differential of f at x ∈ M is denoted by Df(x), which is a mapping from T_x M to T_{f(x)} N. Given a smooth function f on a manifold M ⊂ R^{n_1×n_2}, the symbol f̄ denotes the extension of f to the ambient Euclidean space R^{n_1×n_2}. The symbols ∇ and grad denote the Euclidean and Riemannian gradients, respectively; i.e., given a smooth function f on a manifold M ⊂ R^{n_1×n_2}, ∇ and grad act on f̄ and f, respectively. The symbol Hess denotes the Riemannian Hessian. The concept of a retraction, a smooth map from the tangent bundle of M into M that approximates the exponential map to the first order, is given as follows.
Definition 1 (see [24], Definition 4.1.1). A retraction on a differentiable manifold M is a smooth mapping R from the tangent bundle TM onto M satisfying the following two conditions (here R_x denotes the restriction of R to T_x M): (i) R_x(0_x) = x, where 0_x denotes the zero element of T_x M; (ii) DR_x(0_x) = id_{T_x M}, the identity mapping on T_x M.

Given a retraction R and a smooth manifold M, the general feasible algorithm framework on the manifold can be expressed as x_{k+1} = R_{x_k}(t_k ξ_k), where t_k is the step size at the k-th iterate x_k and ξ_k ∈ T_{x_k} M is a tangent vector. We next introduce some basic geometric properties of the involved Stiefel manifold St(n, p); the reader is referred to [20,24] for more details. The tangent space T_X St(n, p) at X ∈ St(n, p) can be expressed as T_X St(n, p) = {Z ∈ R^{n×p} | X^T Z + Z^T X = 0}. Since the manifold St(n, p) is an embedded submanifold of the matrix Euclidean space R^{n×p}, a natural metric for T_X St(n, p) is the Euclidean metric ⟨Y, Z⟩_X = tr(Y^T Z), which induces the norm ‖Z‖_X = ‖Z‖_F, where ‖·‖_F is the Frobenius norm of a matrix. In what follows, we denote by ⟨·, ·⟩ and ‖·‖_F the Riemannian metric and its induced norm on St(n, p), respectively. Under this Riemannian metric, the orthogonal projection P_X at X ∈ St(n, p) onto T_X St(n, p) is expressed as

P_X(M) = M − X sym(X^T M),  (15)

where sym(M) denotes the symmetric part of M, i.e., sym(M) := (M + M^T)/2. We next introduce several different retraction operators on the Stiefel manifold at the current point X for a given step size t and descent direction ξ (one can refer to [27] for more details).
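As a quick numerical check of the projection P_X(M) = M − X sym(X^T M), here is a small numpy sketch (the function names are ours); the projected matrix should satisfy the tangency condition X^T Z + Z^T X = 0, and the projection should be idempotent:

```python
import numpy as np

def sym(M):
    """Symmetric part of a square matrix."""
    return 0.5 * (M + M.T)

def proj_tangent(X, M):
    """Orthogonal projection of M onto T_X St(n, p): P_X(M) = M - X sym(X^T M)."""
    return M - X @ sym(X.T @ M)
```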
(1) Exponential map [20]: the iterate moves along a geodesic, where QR = −(I_n − XX^T)ξ denotes the QR decomposition of −(I_n − XX^T)ξ. This scheme needs to calculate the exponential of a 2p-by-2p matrix and a QR decomposition of an n-by-p matrix. (2) Cayley transformation [13]: the iterate is obtained by a Cayley transform of a skew-symmetric matrix built from X and ξ (see [13] for the formula). When p < n/2, this scheme is much cheaper than the exponential map. (3) Polar decomposition (see [24], p. 58): R_X(tξ) = (X + tξ)(I_p + t²ξ^Tξ)^{−1/2}. The computational cost is lower than that of the Cayley transformation, but the Cayley transformation gives a better approximation to the exponential map. (4) QR decomposition (see [24], p. 59): R_X(tξ) = qf(X + tξ), where qf(·) denotes the Q factor of the QR decomposition of the matrix in parentheses. It can be seen as an approximation of the polar decomposition. The main cost is the QR decomposition of an n-by-p matrix.
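Of the four retractions, the polar and QR variants admit particularly short implementations; the following numpy sketch (our code, not the paper's) realizes R_X(ξ) = qf(X + ξ) and R_X(ξ) = (X + ξ)(I + ξ^Tξ)^{−1/2}:

```python
import numpy as np

def retract_qr(X, xi):
    """QR retraction: qf(X + xi), with the R factor forced to have a positive diagonal."""
    Q, R = np.linalg.qr(X + xi)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s  # flips column signs so the map is continuous and fixes X at xi = 0

def retract_polar(X, xi):
    """Polar retraction: orthogonal factor of the polar decomposition of X + xi, via SVD."""
    U, _, Vt = np.linalg.svd(X + xi, full_matrices=False)
    return U @ Vt
```

Both maps return a point on St(n, p) for any tangent ξ and reduce to X when ξ = 0.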
The Riemannian gradient and Hessian of an objective function are basic concepts in Riemannian optimization; we next derive the representation matrix formulas of the Riemannian gradient grad f and the Riemannian Hessian Hess f of the objective function in Problem 1. The Riemannian gradient grad f(X) of an objective function f at X ∈ St(n, p) is defined to be the unique tangent vector which satisfies ⟨grad f(X), ξ⟩ = Df(X)[ξ] for all ξ ∈ T_X St(n, p). The Hessian Hess f(X) of f at X is defined to be the linear transformation of the tangent space T_X St(n, p) given by the covariant derivative of grad f evaluated at X: Hess f(X)[ξ] = ∇_ξ grad f, where the covariant derivative is defined through the Levi-Civita connection ∇ on St(n, p) (see [24], Definition 5.5.1).
In what follows, we define f̄(X) to be the function with the same form as f(X) but defined on all of R^{n×p}; that is, f̄(X) = c + tr(AX) + Σ_{j=1}^m tr(B_j X C_j X^T) for X ∈ R^{n×p}. The following two lemmas derive the explicit expressions of grad f and Hess f.

Lemma 1.
The Riemannian gradient of f at X ∈ St(n, p) can be expressed as

grad f(X) = ∇f̄(X) − X sym(X^T ∇f̄(X)),  (23)

where

∇f̄(X) = A^T + Σ_{j=1}^m (B_j X C_j + B_j^T X C_j^T).  (24)

Proof. Since St(n, p) is a Riemannian submanifold of R^{n×p} endowed with the induced metric, grad f(X) equals the orthogonal projection of the Euclidean gradient ∇f̄(X) onto T_X St(n, p). On the other hand, by matrix trace function differentiation, the Euclidean gradient of f̄ at X is computed as (24). Hence, by using the projection P_X given in (15), we obtain (23).
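Lemma 1 can be checked numerically; the sketch below (our code) assumes the objective form f̄(X) = c + tr(AX) + Σ_j tr(B_j X C_j X^T), computes the Euclidean gradient A^T + Σ_j (B_j X C_j + B_j^T X C_j^T), and projects it onto the tangent space. A finite-difference test along a tangent direction then validates the formula:

```python
import numpy as np

def egrad(X, A, Bs, Cs):
    """Euclidean gradient of fbar(X) = c + tr(A X) + sum_j tr(B_j X C_j X^T)."""
    G = A.T.copy()
    for B, C in zip(Bs, Cs):
        G += B @ X @ C + B.T @ X @ C.T
    return G

def rgrad(X, A, Bs, Cs):
    """Riemannian gradient: projection of egrad onto the tangent space at X."""
    G = egrad(X, A, Bs, Cs)
    return G - X @ (0.5 * (X.T @ G + G.T @ X))
```

For a tangent direction ξ, the directional derivative of f̄ at X along ξ should agree with ⟨grad f(X), ξ⟩, since the projection is orthogonal.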

Lemma 2. Let ξ be a tangent vector at X ∈ St(n, p). The Riemannian Hessian of f at X is a linear map on T_X St(n, p) given by

Hess f(X)[ξ] = P_X(∇²f̄(X)[ξ]) − ξ sym(X^T ∇f̄(X)),  (25)

where ∇²f̄(X)[ξ] = Σ_{j=1}^m (B_j ξ C_j + B_j^T ξ C_j^T).
Proof. For X ∈ R^{n×p}, a simple calculation shows that the Euclidean Hessian of f̄ acts on ξ as ∇²f̄(X)[ξ] = Σ_{j=1}^m (B_j ξ C_j + B_j^T ξ C_j^T). By using the classical expression of the Riemannian connection on a Riemannian submanifold of a Euclidean space (see [32], §5.3.3) and choosing ∇_ξ ζ := P_X(Dζ(X)[ξ]), the Riemannian Hessian Hess f of f on St(n, p) can be expressed through ∇²f̄ and the projection map (15); that is, Hess f(X)[ξ] = P_X(D(grad f)(X)[ξ]). Note that grad f(X) = P_X(∇f̄(X)). We view the right-hand side of this relation as a product of two matrix functions of X, rather than a composition of a map and a function. Then, (27) can be written as [32] Hess f(X)[ξ] = P_X(DP_X[ξ](∇f̄(X))) + P_X(∇²f̄(X)[ξ]), in which we used the relation P_X P_X = P_X in the last equality of (28). This relation reduces the computational cost of P_X(DP_X[ξ](∇f̄(X))) in the last equality of (28); indeed, it follows from (29) and (30) that this term equals −ξ sym(X^T ∇f̄(X)), which yields (25). □

We also recall the vec and veck operators and some useful properties that will be used in the sequel. For any M = (a_ij) ∈ R^{m×n}, the vec operator is defined as

vec(M) = (a_11, . . . , a_m1, a_12, . . . , a_m2, . . . , a_1n, . . . , a_mn)^T ∈ R^{mn}.  (33)

For any N = (b_ij) ∈ AS^{n×n}, where AS^{n×n} denotes the set of all n-by-n skew-symmetric matrices, the veck operator stacks the entries below the main diagonal column by column:

veck(N) = (b_21, . . . , b_n1, b_32, . . . , b_n2, . . . , b_{n,n−1})^T ∈ R^{n(n−1)/2}.

The above definitions yield the following inner product equivalences: (i) tr(M_1^T M_2) = vec(M_1)^T vec(M_2) for M_1, M_2 ∈ R^{m×n} (35), and tr(N_1^T N_2) = 2 veck(N_1)^T veck(N_2) for N_1, N_2 ∈ AS^{n×n} (36). (ii) There exists an mn × mn vec-permutation matrix T_{(m,n)} [33] such that vec(M^T) = T_{(m,n)} vec(M), which can be written in terms of the matrices E_ij, where E_ij is the n × m matrix with a 1 in position (i, j) and zeros elsewhere. From [33], we have T_{(m,n)} T_{(n,m)} = I_mn; because T_{(m,n)} is a permutation matrix and hence orthogonal, we also have T_{(m,n)}^T = T_{(n,m)}. Finally, the matrix K_n ∈ R^{n²×(n(n−1)/2)}, whose columns are built from the vectors e_j ⊗ e_i − e_i ⊗ e_j (i > j), where e_i is the ith column of I_n, relates the vec and veck operators on skew-symmetric matrices; K_n is column orthogonal up to scaling.
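The vec and veck operators and the inner-product identities (35) and (36) are easy to verify numerically; this sketch assumes that veck stacks the strictly-below-diagonal entries column by column, since the definition in the source is only partially legible:

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

def veck(N):
    """Stack the below-diagonal entries of a skew-symmetric N, column by column."""
    n = N.shape[0]
    return np.concatenate([N[j + 1:, j] for j in range(n - 1)])
```

With these definitions, tr(M_1^T M_2) = vec(M_1)^T vec(M_2), and for skew-symmetric matrices the diagonal vanishes while each off-diagonal pair appears twice, giving tr(N_1^T N_2) = 2 veck(N_1)^T veck(N_2).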

Riemannian Newton's Method for Problem 1
Since we have already obtained the matrix expressions of grad f and Hess f and the other requisites for Riemannian optimization algorithms, in this section we develop Riemannian Newton's method for Problem 1. In Newton's method, the search direction ξ^{(k)} ∈ T_{X^{(k)}} St(n, p) at X^{(k)} ∈ St(n, p) is determined as the solution of Newton's equation

Hess f(X^{(k)})[ξ^{(k)}] = −grad f(X^{(k)}).  (43)

After ξ^{(k)} is obtained, the candidate for the new iterate is given by X^{(k+1)} = R_{X^{(k)}}(ξ^{(k)}), where the step size is set to one for simplicity. By substituting (23) and (25) into (43), the complete Riemannian Newton's equation for Problem 1 at X ∈ St(n, p) can be written in the explicit form (45). Now, the problem is how to solve this Newton's equation. Note that it is complicated and difficult to solve because it must be solved for ξ ∈ T_X St(n, p) for given X ∈ St(n, p); i.e., ξ must satisfy ξ^T X + X^T ξ = 0. To overcome these difficulties, we wish to obtain the representation matrix of Hess f as a linear transformation on T_X St(n, p) for arbitrarily fixed X, so that we can rewrite (45) as a standard linear equation, which admits many quite effective numerical approaches. To this end, we introduce the following lemma.
Lemma 3 (see [24], p. 42). An equivalent form of the tangent space to St(n, p) at X ∈ St(n, p) is given by

T_X St(n, p) = {XU + X_⊥V | U ∈ AS^{p×p}, V ∈ R^{(n−p)×p}},  (46)

where X_⊥ is an arbitrary n × (n − p) matrix that satisfies X^T X_⊥ = 0 and X_⊥^T X_⊥ = I_{n−p}.

From Lemma 3, one can easily check that the vector space of all tangent vectors has dimension np − (1/2)p(p + 1). Let ξ ∈ T_X St(n, p) be the search direction at X ∈ St(n, p). From Lemma 3, ξ can be expressed as

ξ = XU + X_⊥V,  (47)

with U ∈ AS^{p×p} and V ∈ R^{(n−p)×p}. Since Hess f(X)[ξ] is also a tangent vector in T_X St(n, p), there exist unique matrices U_H ∈ AS^{p×p} and V_H ∈ R^{(n−p)×p} which satisfy

Hess f(X)[ξ] = XU_H + X_⊥V_H.  (48)

The following proposition shows that we can write U_H and V_H in terms of U and V.

Proposition 1. Let ξ = XU + X_⊥V ∈ T_X St(n, p) with U ∈ AS^{p×p} and V ∈ R^{(n−p)×p}, and let U_H and V_H be given by (49) and (50). Then the Hessian Hess f(X) of the objective function acts on ξ as Hess f(X)[ξ] = XU_H + X_⊥V_H.
Proof. From (25) and (48), we have XU_H + X_⊥V_H = Hess f(X)[ξ]. Multiplying this equality from the left by X^T and X_⊥^T and using the relations X^T X = I_p, X_⊥^T X_⊥ = I_{n−p}, and X^T X_⊥ = 0 yields (52) and (53). By substituting (47) into (52) we obtain (49), and by substituting (47) into (53) we obtain (50). □

Since Hess f(X)[ξ] is a linear transformation with respect to ξ = XU + X_⊥V, it is also linear with respect to the pair of matrices (U, V) with U ∈ AS^{p×p} and V ∈ R^{(n−p)×p}. Meanwhile, Hess f(X)[ξ] lies in the tangent space T_X St(n, p) and can be uniquely expressed as in (48). Setting K := (1/2)p(p − 1) + (n − p)p = np − (1/2)p(p + 1), we can therefore regard the Hessian Hess f(X) as a linear transformation on R^K that maps one K-dimensional vector to another. The following lemma gives the representation matrix of this linear transformation, in which we use the vec and veck operators to take matrices and skew-symmetric matrices into vectors.
Lemma 4. Let H denote the linear transformation that maps (veck(U)^T, vec(V)^T)^T to (veck(U_H)^T, vec(V_H)^T)^T, where U_H and V_H are given in equations (49) and (50). Then the representation matrix H_A of H is given by the partitioned matrix (55), with blocks as specified in (56).

Proof. The result follows from equations (49) and (50) by applying the veck and vec operators and using the properties (35) and (36) together with the vec-permutation matrix. This completes the proof.

The following proposition considers the structure of the representation matrix H_A.

Proposition 2.
The representation matrix H_A is always symmetric.
Proof. Let ξ and η be two tangent vectors at X, expressed as ξ = XU_1 + X_⊥V_1 and η = XU_2 + X_⊥V_2, where U_1, U_2 ∈ AS^{p×p} and V_1, V_2 ∈ R^{(n−p)×p}. From the induced metric on T_X St(n, p), together with (35) and (36), one can express ⟨η, Hess f(X)[ξ]⟩ as a bilinear form in the vectors (veck(U_i)^T, vec(V_i)^T)^T with matrix H_A. Since the Hessian Hess f(X) is symmetric with respect to the induced inner product on T_X St(n, p) [24], we have ⟨η, Hess f(X)[ξ]⟩ = ⟨Hess f(X)[η], ξ⟩. Hence, we obtain H_A = H_A^T, because ξ and η can be arbitrary tangent vectors at X. □
We are now in a position to solve Newton's equation (45). From (48), Newton's equation (45) can be rewritten as

XU_H + X_⊥V_H = −grad f(X).  (63)

By multiplying (63) from the left by X^T and X_⊥^T, (63) is seen to be equivalent to the pair of equations (64). Applying the vec operation to the two equations of (64) and invoking Lemma 4, Newton's equation (45) can be represented as a standard symmetric linear system (66), whose coefficient matrix is the symmetric representation matrix H_A of Hess f(X) given in (55). We have thus derived an equivalent form of Newton's equation (45), namely a standard symmetric linear equation. After we solve this symmetric linear equation (66), we obtain veck(U) and vec(V); we can then easily reshape these into U ∈ AS^{p×p} and V ∈ R^{(n−p)×p} and calculate the solution ξ = XU + X_⊥V of Newton's equation (45).
Based on the above discussion, the corresponding Riemannian Newton's method for solving Problem 1 can be described as follows.
If we know a good approximate solution of the problem in advance, Newton's method works effectively. This is because Newton's method generates locally, but quadratically, convergent sequences in general (Theorem 6.3.2 of [24]; see also [34]).

Theorem 1 (local convergence). Consider Algorithm 1 with retraction R as described in Section 2. Let X* ∈ St(n, p) be a critical point of f, i.e., grad f(X*) = 0, and assume that Hess f(X*) is nondegenerate. Then there exists a neighborhood U of X* in St(n, p) such that, for all X^{(0)} ∈ U, Algorithm 1 generates an infinite sequence {X^{(k)}} converging quadratically to X*.
Remark 1. In Step 3 of Algorithm 1, we can arbitrarily fix X_⊥^{(k)} satisfying (X^{(k)})^T X_⊥^{(k)} = 0 and (X_⊥^{(k)})^T X_⊥^{(k)} = I_{n−p}; such a matrix can be computed by a full QR decomposition. In practice, calling MATLAB's qr function on X^{(k)} ∈ St(n, p) returns an orthogonal matrix Q = [Q_1, Q_2] ∈ R^{n×n} whose first p columns Q_1 span the same subspace as the columns of X^{(k)}. Therefore, we can choose Q_2 as X_⊥^{(k)}.
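Remark 1 translates directly to numpy, where a full ("complete") QR decomposition plays the role of MATLAB's qr; the helper name is ours:

```python
import numpy as np

def orth_complement(X):
    """Return X_perp with X^T X_perp = 0 and X_perp^T X_perp = I_{n-p},
    taken from the trailing n - p columns of a full QR decomposition of X."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")  # Q = [Q1, Q2], Q1 spans range(X)
    return Q[:, p:]
```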

Remark 2. The symmetric linear equation (69) can be solved by direct inversion if H_A^{(k)} is invertible. However, if the dimension of the problem is large, direct inversion becomes impractical. In this case, some Krylov subspace methods such as the conjugate residual (CR) method (page 182 of [31]) can be used to solve (69). The CR method solves linear equations of the form Ax = b, where A is an invertible and symmetric matrix and b is nonzero. Note that the system matrix is only required to be symmetric, not symmetric positive definite. We should point out that other solvers can also be applied to solve (69), because there are no specific structural requirements on the variable x. When Hess f(X) is positive definite, the conjugate gradient (CG) method can be used. Moreover, if we regard (69) as just a linear equation of standard form, the representation matrix H_A need not be treated as symmetric, and more Krylov subspace methods can be used, such as the generalized minimal residual (GMRES) method and the biconjugate gradient stabilized (BiCGSTAB) method. Specifically, we use the CR method for all the numerical experiments in Section 5.
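For concreteness, here is a textbook conjugate residual loop (cf. page 182 of [31]) for a dense symmetric system Hx = b; this is an illustrative sketch, not the implementation used in the paper's experiments:

```python
import numpy as np

def conjugate_residual(H, b, tol=1e-10, maxiter=None):
    """Minimal conjugate residual (CR) iteration for a symmetric system H x = b."""
    n = b.size
    if maxiter is None:
        maxiter = 10 * n
    x = np.zeros(n)
    r = b.copy()          # residual b - H x for x = 0
    p = r.copy()          # search direction
    Hr = H @ r
    Hp = Hr.copy()
    rHr = r @ Hr
    for _ in range(maxiter):
        alpha = rHr / (Hp @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        if np.linalg.norm(r) < tol:
            break
        Hr = H @ r
        rHr_new = r @ Hr
        beta = rHr_new / rHr
        rHr = rHr_new
        p = r + beta * p
        Hp = Hr + beta * Hp
    return x
```

In the Newton step, H would be the representation matrix H_A^{(k)} and b the corresponding right-hand side; here we only test the solver on a generic symmetric positive definite system.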

Remark 3.
In general, the Newton vector ξ^{(k)}, the solution of (43), is not necessarily a descent direction of f(X). Indeed, we have Df(X^{(k)})[ξ^{(k)}] = ⟨grad f(X^{(k)}), ξ^{(k)}⟩ = −⟨ξ^{(k)}, Hess f(X^{(k)})[ξ^{(k)}]⟩, which is not guaranteed to be negative without additional assumptions on the operator Hess f(X). A sufficient condition for ξ^{(k)} to be a descent direction is that Hess f(X^{(k)}) be positive definite, i.e., ⟨ξ, Hess f(X)[ξ]⟩ > 0 for all ξ ≠ 0. The positive definiteness of the representation matrix H_A can be used to check the positive definiteness of the Hessian Hess f(X), which in turn can be used to check whether a critical point obtained by Algorithm 1 is a local minimum. We note that, for any tangent vector ξ = XU + X_⊥V, the quantity ⟨ξ, Hess f(X)[ξ]⟩ can be expressed as a quadratic form induced by H_A on the vector (veck(U)^T, vec(V)^T)^T. Therefore, if the symmetric matrix H_A is positive definite, then Hess f(X) is positive definite, and X is a local minimum. By using the partitioned form (55) of H_A, we can easily derive a necessary condition for the positive definiteness of H_A, namely that its diagonal blocks, in particular H_11, be positive definite.

Remark 4. Similar to unconstrained optimization in the Euclidean space, ‖grad f(X^{(k)})‖_F < ε for some constant ε > 0 is a reasonable stopping criterion for Problem 1. In fact, the Lagrangian of Problem 1 is L(X, Λ) = f̄(X) − (1/2)tr(Λ(X^T X − I_p)), where Λ is a symmetric matrix of Lagrange multipliers. Then the first-order optimality conditions in the Euclidean sense are X^T X = I_p and ∇f̄(X) − XΛ = 0. From the second equation, we have Λ = X^T ∇f̄(X); since Λ must be symmetric, we further obtain Λ = ∇f̄(X)^T X. Under the condition X^T X = I_p, (∂/∂X)L(X, Λ) = 0 is equivalent to grad f(X) = 0, as follows from (15), (23), and (24). Thus, first-order critical points in the Euclidean sense coincide with stationary points in the Riemannian sense.
Because Newton's method is not guaranteed to converge globally, we need to prepare a suitable starting point that is sufficiently close to an optimal solution. Here, we apply a state-of-the-art Riemannian gradient-type algorithm, OptStiefelGBB, to obtain a suitable initial point. OptStiefelGBB, proposed by Wen and Yin [13], is simple and has the benefit of requiring very little memory. The retraction adopted in OptStiefelGBB is the Cayley transformation (17), and an initial step size computed by the Barzilai-Borwein (BB) method [35] is used to speed up the convergence. Since the BB step size does not necessarily decrease the objective value at every iteration, which may invalidate convergence, a nonmonotone line search based on a strategy in [36] is also used to guarantee global convergence.
To speed up the convergence, we propose a hybrid method for Problem 1. The strategy is as follows: we first run OptStiefelGBB until ‖grad f(X^{(k)})‖_F is sufficiently small; then we switch to Newton's method. One possible switching criterion is ‖grad f(X^{(k)})‖_F ≤ ε_1, where ε_1 > 0. The algorithm is stated as follows.

ALGORITHM 1: Riemannian Newton's method for Problem 1.
(1) Choose an initial point X^{(0)} ∈ St(n, p).
(. . .) Compute the partitioned matrix H_A^{(k)} by Lemma 4 and the vector g^{(k)} by (68); compute x^{(k)} ∈ R^{np−(1/2)p(p+1)} satisfying the symmetric linear equation (69).
(. . .) Update the iterate by a retraction R on St(n, p).
(9) end for
Since the cost function f(X) in Problem 1 is smooth and the Stiefel manifold St(n, p) is compact, OptStiefelGBB applied to Problem 1 generates a sequence converging globally to a critical point of f [13]. On the other hand, according to Theorem 1, Newton's method generates a quadratically convergent sequence if the initial point is sufficiently close to an optimal solution. Therefore, the proposed hybrid algorithm is globally and quadratically convergent provided ε_1 is sufficiently small.

Riemannian Trust-Region Method for Problem 1
With the preliminary results discussed in Section 2, we now state the Riemannian trust-region method for solving Problem 1. The trust-region method is an iterative method for minimizing a cost function. At each iteration, a quadratic model of the cost function is built. This model is assumed to be adequate in a region (the trust region) around the current iterate. Then, an update is computed as the minimizer of the model in the trust region. The quality of the trial update is evaluated; the update is consequently accepted or rejected, and the trust-region radius is adjusted.
In a Euclidean space E, if f: E → R and ⟨·, ·⟩ is the inner product in E, the trust-region subproblem for finding the update ξ ∈ E for the current iterate x_k ∈ E is given by

min_{ξ ∈ E} m_k(ξ) = f(x_k) + Df(x_k)[ξ] + (1/2)⟨∇²f(x_k)[ξ], ξ⟩ subject to ‖ξ‖ ≤ Δ_k,

where Df(x_k)[z] denotes the directional derivative of f at x_k in the direction z and Δ_k is the trust-region radius. The quality of the model m_k is evaluated by means of the quotient

ρ_k = (f(x_k) − f(x_k + ξ_k)) / (m_k(0) − m_k(ξ_k)).

If ρ is close to 0 or negative, then the model is very inaccurate; i.e., the step must be rejected, and the trust-region radius must be reduced. If ρ is larger but still small, the step is accepted, but the trust-region radius is reduced. Finally, if ρ is close to 1, then there is a good correspondence between the model and the cost function; the step is accepted, and the trust-region radius can be increased.
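The acceptance test and radius update described above can be condensed into a small helper; the 1/4 and 3/4 thresholds and the shrink/expand factors follow the classical scheme, while ρ′ plays the role described in the paper (the function itself is illustrative):

```python
def tr_radius_update(rho, delta, hit_boundary, delta_max, rho_prime=0.1):
    """Classical trust-region update: shrink on poor agreement (rho < 1/4),
    expand on very good agreement (rho > 3/4) when the step reached the
    trust-region boundary; accept the step iff rho > rho_prime."""
    if rho < 0.25:
        delta = 0.25 * delta
    elif rho > 0.75 and hit_boundary:
        delta = min(2.0 * delta, delta_max)
    return delta, rho > rho_prime
```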
In the Riemannian trust-region method, at the k-th iterate X(k), by utilizing the Taylor expansion on the manifold, we consider the trust-region subproblem (77) on the tangent space T_X(k) St(n, p); from (47) and (48), its quadratic model is expressed through the representation matrices derived in Section 2. On the other hand, since grad f(X(k)) ∈ T_X(k) St(n, p), it follows from Lemma 3 that grad f(X(k)) can also be uniquely expressed in the coordinates (U(k), V(k)) with U(k) ∈ AS^(p×p) and V(k) ∈ R^((n−p)×p). Then, the Riemannian trust-region subproblem (77) can be written in these coordinates as the reduced subproblem (80). In order to provide guidance for selecting the new trust-region radius Δ(k), we introduce the quotient

ρ(k) = (f(X(k)) − f(R_X(k)(ξ(k)))) / (m_X(k)(0_X(k)) − m_X(k)(ξ(k))),

which is also used to judge the acceptance or rejection of the candidate R_X(k)(ξ(k)) = R_X(k)(X(k)U(k) + X(k)⊥V(k)). Due to the fact that m_X(k)(0_X(k)) = f(X(k)), the numerator of ρ(k) is exactly the actual decrease in f.

As in the Euclidean space setting, the ratio ρ(k) is compared with the constants 1/4 and 3/4, and the result determines the trust-region radius for the next iteration. In addition, a constant ρ′ ∈ [0, 1/4) is used as an acceptance threshold for ρ(k): the trust-region step is taken as the next iterate if ρ(k) > ρ′ and is rejected otherwise. Specifically, we present the Riemannian trust-region method for Problem 1 as follows.
The trust-region subproblem (80) is solved iteratively and forms the inner iteration of Algorithm 3. Note that (80) is a minimization problem in AS^(p×p) × R^((n−p)×p) and could thus be solved by many classical approaches. A widely used approach is the truncated conjugate gradient algorithm (tCG), for the following reason: tCG is a Krylov subspace-based solver in which, if the initial guesses satisfy U0 ∈ AS^(p×p) and V0 ∈ R^((n−p)×p), then the sequences U_j and V_j generated by tCG always satisfy U_j ∈ AS^(p×p) and V_j ∈ R^((n−p)×p), and hence the approximate solution remains in the correct space. The pseudocode of tCG is presented in Algorithm 4 for completeness. Note that we use superscript indices without round brackets to denote the evolution of U and V within the inner iteration, while superscripts with round brackets are used for the outer iteration. As suggested by Absil et al. [24], the algorithm can be stopped either after a fixed number of iterations or by the criterion ‖r_j‖ ≤ ‖r_0‖ min(‖r_0‖^θ, κ), where κ, θ > 0 are real parameters and r_j is the inner residual. In our numerical testing, we set κ = 0.1 and θ = 1.
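For illustration, a minimal Euclidean Steihaug-type tCG with the κ-θ stopping rule above might look as follows; it works on a plain vector s rather than on the (U, V) coordinates of subproblem (80), so it is a sketch of the idea, not the paper's Algorithm 4:

```python
import numpy as np

def to_boundary(s, d, delta):
    """Follow d from s to the trust-region boundary ||s + tau d|| = delta."""
    a, b, c = d @ d, 2.0 * (s @ d), s @ s - delta ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return s + tau * d

def truncated_cg(H, g, delta, kappa=0.1, theta=1.0, max_iter=100):
    """Steihaug-type truncated CG for min <g,s> + 0.5 <s,Hs>,
    ||s|| <= delta, stopped early by the residual rule
    ||r_j|| <= ||r_0|| * min(||r_0||^theta, kappa)."""
    s = np.zeros_like(g)
    r = g.copy()                   # residual of H s + g = 0 at s = 0
    d = -r
    r0 = np.linalg.norm(r)
    for _ in range(max_iter):
        Hd = H @ d
        dHd = d @ Hd
        if dHd <= 0:               # negative curvature: stop on boundary
            return to_boundary(s, d, delta)
        alpha = (r @ r) / dHd
        if np.linalg.norm(s + alpha * d) >= delta:
            return to_boundary(s, d, delta)   # step would leave the region
        s = s + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) <= r0 * min(r0 ** theta, kappa):
            return s               # inner stopping criterion met
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return s
```

With θ = 1 the stopping rule enforces a superlinearly tightening residual tolerance, which is what yields the local convergence rate quoted in Theorem 3.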

Remark 5.
Fix k, and let ξ_j ∈ T_X(k) St(n, p) be the jth iterate of the tCG method for the subproblem (77). In the existing RTR method, which directly solves the subproblem (77), ξ_j is updated as a tangent vector in T_X(k) St(n, p) at every inner iteration. In contrast, if the tCG method for (80) terminates with U_j and V_j, we only have to compute ξ_j = X(k)U_j + X(k)⊥V_j once at the end; we do not have to compute ξ_i for i < j. Furthermore, in the existing RTR method, hess f(X(k))[δ_j] for some δ_j = X(k)M_j + X(k)⊥N_j ∈ T_X(k) St(n, p) must be computed at each iteration of the tCG method, whereas the proposed Algorithm 4 works only with the coordinate matrices M_j and N_j. Therefore, if the number of iterations in the inner tCG method needed for solving the trust-region subproblems is sufficiently large, the proposed Algorithm 4 may have a shorter total computational time than the existing RTR method which directly solves the subproblem (77). These facts imply that our proposed method can reduce the computational cost. The convergence of the general RTR-tCG method on a smooth Riemannian manifold has been extensively analyzed in [24,37], where both global convergence and the local convergence rate have been established under appropriate assumptions. Therefore, to understand the performance of the RTR-tCG method for solving Problem 1, we only need to check these assumptions, which leads to the following conclusions.

Mathematical Problems in Engineering 13
Note that the cost function f(X) and the adopted retraction as described in Section 2 are smooth, and the involved manifold St(n, p) is a smooth and compact Riemannian manifold; hence, the assumptions required for the convergence analysis of the RTR-tCG method hold.

Theorem 3 (local convergence speed). Consider Algorithms 3 and 4 with retraction R as in Section 2 and the stopping criterion (83) in Algorithm 4 with θ > 0. Let X* ∈ St(n, p) be a (nondegenerate) local minimum of f. Then, there exists c > 0 such that, for all sequences X(k) generated by the algorithm converging to X*, there exists K > 0 such that for all k > K,

dist(X(k+1), X*) ≤ c dist(X(k), X*)^min{θ+1, 2},

where "dist" denotes the Riemannian distance on St(n, p) (see [24], p. 46).

Numerical Experiments
In this section, we report the numerical performance of the proposed algorithms for Problem 1. All the numerical experiments were completed on a personal computer with an Intel(R) Core(TM)2 Quad 2.33 GHz CPU and 3.00 GB of RAM, equipped with MATLAB R2019b. We report numerical experiments and an application to the least squares fitting of the DEDICOM model and the orthonormal INDSCAL model. To illustrate the efficiency of our algorithms, we also compare Algorithms 2 and 3 with some recent Riemannian conjugate gradient (CG)-type methods, which are all applicable to Problem 1 with necessary modifications, and some existing Riemannian second-order algorithms in the MATLAB toolbox Manopt.
Throughout our experiments, in the implementation of Algorithm 2, the parameter ε1 in (74) is set to ε1 = 0.1. In the implementation of the inner iteration of Algorithm 2, i.e., the CR method applied to the symmetric linear equation (69), the initial matrix is set to the zero matrix of suitable size at each repetition of Newton's method, and the iteration is stopped when the relative residual 2-norm ‖r(k)_L‖2/‖g(k)‖2 becomes less than 10^−6.
Here, r(k)_L denotes the Lth CR approximation at the kth repetition of Newton's method. The maximum number of CR iterations is set to 1000. In the implementation of Algorithm 3, OptStiefel is also applied to obtain a suitable initial point, and the parameters are set to Δ(0) = √3, ρ′ = 0.1, and the maximum radius Δ̄ = ∞. The stopping criterion for Algorithms 2 and 3 is that the norm of the Riemannian gradient falls below ε = 10^−6. For all the compared Riemannian gradient-type algorithms, we use the same stopping criterion as that in OptStiefelGBB [13]: we let the algorithms run up to K iterations and stop at the first iterate satisfying the criterion, for some constants ε, εX, εf ∈ (0, 1) and T, K ∈ N+. The default values of εX, εf, T, and K are 10^−6, 10^−8, 5, and 2000, respectively. Throughout, "IT.," "CT.," "Grad.," and "Obj." denote the number of iterations, the total computing time in seconds, the norm of the Riemannian gradient grad f(X(k)), and the objective function value at the final iterate, respectively.

Numerical Experiments and Application to the Least Squares Fitting of the DEDICOM Model. DEDICOM is a model proposed by Harshman [5] for the analysis of asymmetric data. According to the DEDICOM model, a square data matrix F, containing entries F_ij representing the (asymmetric) relation of object i to object j, is decomposed as

F = XRX^T + N,

where X is an n by p (p < n) matrix of weights for the n objects on p dimensions or aspects, R is a square matrix of order p representing (asymmetric) relations among the p dimensions, and N is an error matrix with entries n_ij representing the part of the relation of object i to object j that is not explained by the model. The objective of fitting this model to the data is to explain the data by means of relations among as small a number of dimensions as possible. These dimensions can be considered as "aspects" of the objects. The "loadings" of the objects on these aspects are given by the matrix X, which is constrained to be columnwise orthonormal. The entries in the matrix X indicate the importance of the aspects for the objects. The dimensionality of R and X, and hence the number of aspects to be determined, is to be based on some external criterion defined by the user. The DEDICOM model has to be fit in the least squares sense over matrices X and R of order n by p and p by p, respectively. The loss function that is to be minimized can be written as

f(X, R) = ‖F − XRX^T‖F²

over the matrices X and R, subject to the constraint X^T X = I_p.
Kiers and ten Berge [1,2] have given an algorithm for this problem based on the alternating least squares method, which can be interpreted as alternately updating R and X. Because X^T X = I_p, the minimum of f over R for fixed X is given by R = X^T FX. Minimizing f over X, for fixed R, is equivalent to minimizing −2 tr(F^T XRX^T), which is a particular case of Problem 1 with A = 0, B1 = −2F^T, and C1 = R. An algorithm based on majorization was provided in [2] for this subproblem on X. Because the majorization algorithm requires a large number of iterations to converge within a given tolerance, we here test the numerical performance of our proposed algorithms for updating X. Without loss of generality, the square data matrix F in the DEDICOM model is given by F = randn(n, n), and the fixed R is given by R = rand(p, p). The dimensions n and p were set to 500 and 5, respectively. The iterations of Algorithms 2 and 3 were started with X(0) = qf(rand(n, p)) or X(0) = qf(X* + ϑ·rand(n, p)), where the parameter ϑ was set to 10, 10², and 10³, and X* is a solution of the problem generated by the majorization algorithm. Figure 1 shows the convergence histories of the proposed Algorithm 2 after switching to Newton's method (Algorithm 1) and of Algorithm 3 after OptStiefel is performed to generate a suitable initial point, for different choices of the initial matrix X(0). The number of iterations is plotted on the horizontal axis, and the log10 of the norm of the Riemannian gradient grad f(X(k)) is plotted on the vertical axis. More numerical results are reported in Table 2, in which the term "IT." for Algorithms 2 and 3 means the number of iterations required by OptStiefelGBB to satisfy the switching criterion plus the number of extra iterations required by Algorithm 2 or Algorithm 3 to achieve the stopping criterion, and "Obj." means the objective function value −2tr(F^T X(k) R X(k)T) at the final iterate.
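For concreteness, the two closed-form ingredients of the alternating scheme (the DEDICOM loss and the R-update for fixed orthonormal X) can be sketched in NumPy; this is an illustrative fragment with our own function names, not the authors' MATLAB code, and the X-update itself (a case of Problem 1) is what the Riemannian algorithms handle:

```python
import numpy as np

def dedicom_loss(F, X, R):
    """Least squares DEDICOM loss ||F - X R X^T||_F^2."""
    return np.linalg.norm(F - X @ R @ X.T, 'fro') ** 2

def update_R(F, X):
    """Minimizer of the loss over R for fixed columnwise orthonormal X
    (X^T X = I_p): R = X^T F X."""
    return X.T @ F @ X
```

When F is generated exactly as X R X^T for an orthonormal X, the R-update recovers R and the loss vanishes, which is a quick sanity check of the closed form.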
From Figure 1 and Table 2, we observe that once X(k) arrives in the appropriate region determined by ε1, Algorithms 2 and 3 generate sequences which converge quickly. We can also see that the effectiveness of Algorithms 2 and 3 was not affected by different choices of the initial matrix X(0), and the convergence speed of these two algorithms is not very sensitive to the choice of X(0). [Figure 1: Convergence histories of Algorithm 2 after switching to Newton's method and Algorithm 3 after OptStiefel is performed to generate a suitable initial point, with different initial matrices X(0): (a) X(0) = qf(rand(n, p)); (b) X(0) = qf(X* + 10·rand(n, p)); (c) X(0) = qf(X* + 10²·rand(n, p)); (d) X(0) = qf(X* + 10³·rand(n, p)).] This is in accordance with the global convergence of these two algorithms established in Proposition 3 and Theorem 2, respectively. For this reason, we use X(0) = qf(rand(n, p)) in the subsequent examples. Next, we also show the detailed numerical results when applying the CR method to the symmetric linear equation (69) directly in the implementation of step 7 of Algorithm 2. Figure 2 shows the convergence histories of the CR method for (69). The number of iterations of the CR method is plotted on the horizontal axis, and the log10 of the relative residual norm ‖r(k)_L‖2/‖g(k)‖2 is plotted on the vertical axis. More numerical results are reported in Table 3. We can observe from Figure 2 and Table 3 that the CR method is effective for solving the Newton equation (69) as a symmetric linear system, and that Newton's method generates locally quadratically convergent sequences.
Finally, we numerically compare Algorithm 3, in which the modified trust-region subproblems (80) are solved by Algorithm 4, with the existing trust-region method (denoted by existing RTR) in the MATLAB toolbox Manopt [38]. We fix n = 500. For each p ∈ {5, 10, . . . , 50}, we compute the total computing time needed for convergence. Table 4 and Figure 3 show that the performance of the proposed Algorithm 3 is slightly better than that of the existing method in terms of computational time. This advantage becomes more pronounced as the matrix dimension increases.

Numerical Comparison with Latest Riemannian CG Methods for the Least Squares Fitting of the Orthonormal INDSCAL Model.
The orthonormal INDSCAL model is used to analyze a three-way array consisting of m symmetric n × n slices S_j [7,8,10]. In the model, these slices S_j are the doubly centered dissimilarity matrices. The orthonormal INDSCAL model decomposes each slice as

S_j = XD_jX^T + E_j,

where X is an n × r matrix (assumed to be columnwise orthonormal), D_j is a diagonal matrix, and E_j is an n × n matrix containing the errors of the model fit. That means all slices share a common loading matrix X and differ from each other only by the diagonal elements of D_j, called idiosyncratic saliences. The orthonormal INDSCAL model, which appeared initially in the psychometric literature, is now widely used in the social sciences, marketing research, etc. The orthonormal INDSCAL problem seeks (X, D1, D2, . . . , Dm) such that the model fits the data in the least squares sense; i.e., for given and fixed n × n symmetric matrices S_j, j = 1, 2, . . . , m, it minimizes the function

f(X, D1, . . . , Dm) = Σ_{j=1}^m ‖S_j − XD_jX^T‖F² over St(n, r) × D(r)^m,

where D(r)^m = D(r) × · · · × D(r) and D(r) denotes the set of all r × r diagonal matrices. This optimization problem has no direct analytical solution. The standard numerical solution of this problem is given by an alternating least squares algorithm [1,8,10]. Minimizing f over X, for fixed D_j, j = 1, 2, . . . , m, is equivalent to minimizing −2 Σ_{j=1}^m tr(S_j X D_j X^T), which is also a particular case of Problem 1 with A = 0, B_j = −2S_j, C_j = D_j, j = 1, . . . , m. A majorization-based algorithm was provided in [1] for updating this subproblem on X. Here, we test our proposed algorithms for this X subproblem.
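The INDSCAL loss and the closed-form update of the idiosyncratic saliences D_j for fixed columnwise orthonormal X can be sketched as follows; this is an illustrative NumPy fragment with our own function names, using the fact that minimizing each term over diagonal D_j keeps only the diagonal of X^T S_j X:

```python
import numpy as np

def indscal_loss(S_list, X, D_list):
    """f = sum_j ||S_j - X D_j X^T||_F^2 for the orthonormal INDSCAL model."""
    return sum(np.linalg.norm(S - X @ D @ X.T, 'fro') ** 2
               for S, D in zip(S_list, D_list))

def update_saliences(S_list, X):
    """For fixed columnwise orthonormal X (X^T X = I_r), the optimal
    diagonal D_j keeps only the diagonal of X^T S_j X."""
    return [np.diag(np.diag(X.T @ S @ X)) for S in S_list]
```

As a sanity check, when the slices are generated exactly from the model with a common orthonormal X, the salience update recovers the D_j and the loss vanishes.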
To illustrate the efficiency of our algorithms, we also compare them with the existing Riemannian trust-region algorithm (denoted by existing RTR) and the existing Riemannian BFGS algorithm (denoted by existing RBFGS) in the MATLAB toolbox Manopt [38], as well as some recent Riemannian conjugate gradient (CG)-type methods, which are all applicable with necessary modifications. To this end, we first introduce the iterative framework of each compared Riemannian CG method. For solving Problem 1, starting from X(0) ∈ St(n, p) with ξ(0) = −grad f(X(0)), the iterate is updated by the scheme

X(k+1) = R_X(k)(t_k ξ(k)), ξ(k+1) = −grad f(X(k+1)) + β_{k+1} T(ξ(k)),

where t_k > 0 is the step size and T is a vector transport on St(n, p) [24], i.e., a smooth mapping from the product of tangent bundles T St(n, p) ⊕ T St(n, p) to the tangent bundle T St(n, p), where ⊕ is the Whitney sum. There are several expressions for updating the parameter β_{k+1}; two of the most popular are the β of the Fletcher-Reeves method,

β_{k+1} = ‖grad f(X(k+1))‖² / ‖grad f(X(k))‖²,

and the β of the Polak-Ribière method,

β_{k+1} = ⟨grad f(X(k+1)), grad f(X(k+1)) − T(grad f(X(k)))⟩ / ‖grad f(X(k))‖².

In this experiment, we compare Algorithms 2 and 3 with the following Riemannian CG methods. We note that the differences between these Riemannian CG methods are mainly due to their specific ways of updating the parameters t_k and β_{k+1} and constructing retractions and vector transports.
with β_{k+1} as in (2) and θ_{k+1} defined through the inner product ⟨grad f(X(k+1)), ·⟩ as in the corresponding literature. (iii) "RCG-Cayley": the Riemannian version of Dai's nonlinear CG method used in [14], designed to solve Problem 1 by using the Cayley transformation (17) to construct the retraction and the differentiated retraction to construct the vector transport. The step size t_k is determined by a nonmonotone line search condition instead of the Wolfe conditions, with h_k = min{h − 1, k}, where h is some positive integer.
The parameter β_{k+1} is computed accordingly. (iv) "RCG-QR": a modification of "RCG-Cayley" using the QR-based retraction and the corresponding differentiated retraction as a vector transport (see p. 173 of [24]). (v) "RCG-Pol": another modification of "RCG-Cayley" using the polar decomposition (18) to construct a retraction and, naturally, the corresponding differentiated retraction as a vector transport (see p. 173 of [24]).
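A bare-bones instance of the Riemannian CG framework above, using the QR-based retraction qf(·) and the tangent-space projection as a simple vector transport, might look as follows; the fixed step size is a simplification standing in for the line searches of the actual methods, so this is a sketch rather than any of the compared algorithms:

```python
import numpy as np

def qf(A):
    """QR-based retraction factor: Q with the signs fixed so that the
    diagonal of R is nonnegative."""
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))

def proj_tangent(X, Z):
    """Orthogonal projection of Z onto the tangent space of St(n, p) at X."""
    XtZ = X.T @ Z
    return Z - X @ (XtZ + XtZ.T) / 2

def riemannian_cg(egrad, X, t=1e-2, max_iter=2000, tol=1e-8):
    """Fletcher-Reeves Riemannian CG sketch on St(n, p) with QR
    retraction; projection plays the role of the vector transport T."""
    g = proj_tangent(X, egrad(X))
    xi = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        X_new = qf(X + t * xi)                       # retraction step
        g_new = proj_tangent(X_new, egrad(X_new))    # Riemannian gradient
        beta = np.sum(g_new * g_new) / np.sum(g * g) # Fletcher-Reeves
        xi = -g_new + beta * proj_tangent(X_new, xi) # transported direction
        X, g = X_new, g_new
    return X
```

On the model problem of minimizing tr(X^T A X) over St(n, 1), whose minimum is the smallest eigenvalue of A, even this simplified iteration converges when started in a reasonable neighborhood.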
In our experiments, the number of slices is m = 2. We consider the S_i to be correlation matrices; in this case, INDSCAL can be seen as a simultaneous principal component analysis of several correlation matrices [10]. We use the diagonal matrices D1 = diag(p, p − 1, . . . , 1) and D2 = 0.5·diag(1, 2, . . . , p). All starting points X(0) were generated randomly by X(0) = qf(rand(n, p)). In particular, similar to [16,18], a reasonable initial step length is used in the "RGPRR-CG" and "RFR-CG" methods. All the parameters in the compared Riemannian CG methods are chosen in a standard way, as in the corresponding literature. Figures 4 and 5 show the convergence histories of the considered algorithms for different problem sizes (n, p). The number of iterations is plotted on the horizontal axis, and the log10 of the norm of the Riemannian gradient grad f(X(k)) is plotted on the vertical axis. The convergence curves generated by Algorithm 2 or Algorithm 3 are composed of two pieces: an initial part generated by OptStiefelGBB and a second, nearly vertical segment generated by Newton's method or the trust-region method. These curves show that switching from OptStiefel to Newton's method or the trust-region method drastically accelerates convergence. The convergence curve of OptStiefel alone shows that the sequence generated by OptStiefel converges very slowly. If the method is switched to Newton's method or the trust-region method after the iterate X(k) enters the ε1-neighborhood of the target, the generated sequence reaches the final point within the required accuracy in a few steps. More numerical results are reported in Table 5, where the term "IT." for Algorithms 2 and 3 is the same as that in Table 3, and "Obj." means the objective function value −2 Σ_{j=1}^m tr(S_j X(k) D_j X(k)T) at the final iterate.
We can observe from Table 5 that Algorithm 3, with OptStiefel used to generate a suitable initial point, works very effectively for the orthonormal INDSCAL fitting problem and always performs much better than the existing Riemannian BFGS algorithm and all the compared Riemannian gradient-type methods in terms of total computing time and accuracy of the solution. On the other hand, we note that the OptStiefel phase of Algorithm 3 acts like a preconditioner: in most cases, this preconditioning brings a significant reduction in the number of iteration steps compared with the existing RTR method.
We can also see from Table 5 that, when the problem size is large, Algorithm 2 uses much more computing time than the other methods.
This is because Newton's method involves an inner iteration, the CR method applied to the symmetric linear equation (69), which consumes considerable time to obtain an approximate solution, especially when the system dimension is large. Even though Newton's method takes the longest time per iteration among the considered algorithms, its overall convergence is very fast.

Conclusion
We have dealt with the trace function minimization problem with orthogonality constraints as an optimization problem on the manifold St(n, p) and have developed a Riemannian Newton method and a Riemannian trust-region method for the problem. To reduce the difficulty of solving Newton's equation, we have computed the representation matrix of the Hessian of the objective function using vectorization operators and have reduced Newton's equation to a standard symmetric linear equation of lower dimension. In addition, we have proposed a new trust-region method in which the trust-region subproblems are solved by the truncated conjugate gradient method based on our expressions for the Hessian of the objective function. Furthermore, we have performed numerical experiments verifying that the proposed algorithms almost always perform much better than several Riemannian gradient-type methods.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.