The Alternating Direction Method of Multipliers for Sufficient Dimension Reduction

The minimum average variance estimation (MAVE) method has proven to be an effective approach to sufficient dimension reduction (SDR). In this study, we apply the computationally efficient alternating direction method of multipliers (ADMM) to the optimization problem underlying MAVE. Under some assumptions, we prove that the iterative sequence generated by ADMM converges to a stationary point of the associated augmented Lagrangian function. We also present numerical simulations on synthetic data to demonstrate the computational efficiency of the algorithm.


Introduction
It is well known that dimension reduction is a highly efficient tool in the visualization and statistical analysis of high-dimensional data. It is assumed that the predictor vector X ∈ R^p affects the response variable Y ∈ R only through a few linear combinations β_1^T X, β_2^T X, ..., β_d^T X, with d < p. Thus, all the information of X about Y is summarized by B^T X, with B = (β_1, β_2, ..., β_d). The column vectors of B span the subspace S(B), which we call a sufficient dimension reduction (SDR) subspace (see Li [1], Cook [2]). To estimate S(B), we perform dimension reduction with the smallest possible value of d.
In the literature of dimension reduction, two problems have been widely studied. The first asks how the conditional distribution of Y | X varies with X. Specifically, it seeks a few linear combinations B^T X such that

$$Y \perp\!\!\!\perp X \mid B^\top X, \tag{1}$$

that is, Y is independent of X given B^T X. Then, we can find a subspace S(B) satisfying (1). If the intersection of all such SDR subspaces is itself an SDR subspace, we call it the central subspace (CS) and denote it by S_{Y|X}. In many real applications, E(Y | X) is of most interest to researchers, so the second problem seeks a matrix B ∈ R^{p×d} satisfying

$$E(Y \mid X) = E(Y \mid B^\top X). \tag{2}$$

We call the intersection of all SDR subspaces satisfying (2) the central mean space (CMS), and denote it by S_{E(Y|X)}, provided it is still an SDR subspace. More discussion of the CS and the CMS is given by Cook [2] and Cook and Li [3].
Various methods have been proposed to recover S_{Y|X} or S_{E(Y|X)}. For example, Li and Duan [4] studied ordinary least squares; Li [1] studied sliced inverse regression; Cook and Weisberg [5] put forward sliced inverse variance estimation; Li [6] studied principal Hessian directions; Xia et al. [7] considered minimum average variance estimation (MAVE); Li and Wang [8] considered directional regression; density MAVE was studied by Xia [9]; sliced regression-based MAVE was considered by Wang and Xia [10]; and efficient semiparametric estimation was studied by Ma and Zhu [11-13]. Most of these approaches use the principle of inverse regression, with the well-known limitation that a linearity condition is required on the covariates. MAVE and some other semiparametric methods do not impose strong hypotheses on the design of the covariates and thus have wider applicability. Furthermore, in many situations, they yield more accurate estimators of the reduced dimensional space; see Ma and Zhu [11], Xia et al. [7], and Xia [9]. However, MAVE is not robust to outliers in the response because of its use of least squares, and it is computationally intensive due to its use of kernel smoothing. The MAVE-type methods estimate the column space of B ∈ R^{p×d} by minimizing the loss criterion described later in (14) under the orthogonality constraint B^T B = I_d. Thus, the implementation of the MAVE-type methods involves nonconvex optimization problems, owing to the nonconvexity of the orthogonality constraint. Therefore, heuristic algorithms with no convergence guarantees are often used in the literature.
Recently, many researchers have adopted the ADMM algorithm in high-dimensional statistics and machine learning, see Yin et al. [14], Xue et al. [15], Zhang and Zou [16], Gu et al. [17], and Kapla et al. [18], and in many other areas, such as matrix completion, tensor completion, and sparse recovery, see Li et al. [19, 20], Liu et al. [21], and Shi et al. [22]. The main advantages of the ADMM algorithm are its flexibility in simplifying a diversity of optimization problems and its good convergence properties, see Boyd et al. [23]. In this study, we demonstrate that the ADMM algorithm can be adapted to solve the optimization problem of the aforementioned MAVE-type methods. Moreover, using Wang et al. [24]'s theory for ADMM on nonconvex problems, we prove that the proposed algorithm converges to a stationary point. To our knowledge, the proposed ADMM algorithm is the first one for MAVE with such a convergence property; details are given in Theorem 2. In addition, Zhang et al. [25] proposed regularization with case-specific parameters to achieve robust estimation and outlier detection simultaneously.
In the rest of this article, we present the proposed ADMM algorithm and prove its convergence properties in Section 2. Numerical simulations illustrating the proposed algorithm are conducted in Section 3. Finally, a brief discussion concludes in Section 4. Some technical details are relegated to the Appendix.

The Proposed Algorithm Based on ADMM
The following ADMM algorithm is applicable to all the MAVE-type approaches. For simplicity, we only present the ADMM algorithm for estimating the CMS. Xia et al. [7] considered the following model:

$$Y = g(B^\top X) + \varepsilon, \qquad E(\varepsilon \mid X) = 0, \tag{3}$$

where g is an unknown link function. For given B^T X, the conditional variance is

$$\sigma^2(B^\top X) = E\big\{[Y - E(Y \mid B^\top X)]^2 \mid B^\top X\big\}. \tag{4}$$

Thus, from properties of the conditional expectation, we get

$$E[Y - E(Y \mid B^\top X)]^2 = E[\sigma^2(B^\top X)],$$

so minimizing the left-hand side over B with B^T B = I_d is equivalent to minimizing E[σ^2(B^T X)]. For any given X_0, the local linear expansion of E(y_i | B^T X_i) at X_0 can be expressed as

$$E(y_i \mid B^\top X_i) \approx a + b^\top B^\top (X_i - X_0),$$

where a ∈ R and b ∈ R^d. The residuals we want to find are therefore

$$y_i - a - b^\top B^\top (X_i - X_0), \qquad i = 1, \ldots, n.$$

In the spirit of local linear smoothing, it is natural to estimate σ^2(B^T X_0) by the weighted criterion

$$\sum_{i=1}^n \big[y_i - a - b^\top B^\top (X_i - X_0)\big]^2 \omega_{i0},$$

where the weights ω_{i0} satisfy Σ_{i=1}^n ω_{i0} = 1; two choices of ω_{i0} were given in Xia et al. [7]. The estimates of a and b are the minimizers of this criterion, and the estimate of σ^2(B^T X_0) is its minimum value:

$$\hat\sigma^2(B^\top X_0) = \min_{a,\,b}\ \sum_{i=1}^n \big[y_i - a - b^\top B^\top (X_i - X_0)\big]^2 \omega_{i0}. \tag{13}$$

Combining the above, the MAVE method estimates B by solving the minimization problem

$$\min_{B:\,B^\top B = I_d}\ \sum_{j=1}^n \sum_{i=1}^n \big[y_i - a_j - b_j^\top B^\top (X_i - X_j)\big]^2 \omega_{ij}. \tag{14}$$

Xia et al. [7] solved (14) by an iterative algorithm over B and (a_j, b_j), j = 1, ..., n. Although this iterative algorithm is feasible, it is difficult to establish its convergence property due to the nonconvexity of the orthogonality constraint B^T B = I_d. Zhu [26] studied optimization problems on the Stiefel manifold and mentioned that problem (14) can be solved in that framework. Many structured convex optimization problems arise in machine learning, computer vision, and statistics, such as Zanni et al.
[27]. For solving various convex or nonconvex problems that arise in the fields of machine learning, computer vision, and statistics, the alternating direction method of multipliers (ADMM) has been a powerful and successful method since Gabay and Mercier [28] introduced it, and its convergence properties for convex objective functions have been extensively studied. Many researchers have successfully applied ADMM to such problems; see, for example, the survey paper of Boyd et al. [23], as well as Liu et al. [29] and Cascarano et al. [30].
Recently, many researchers have successfully used variants of ADMM to solve nonconvex problems. For example, Hong et al. [31] discussed the convergence properties of variants of ADMM applied to nonconvex problems and established their iteration complexity. For multiblock separable optimization problems with linear constraints and possibly nonconvex component functions, Guo et al. [32] showed that each cluster point of the iteration sequence generated by ADMM is a critical point. Jiang et al. [33] studied the iteration complexity of two linearized variants of the multiblock proximal ADMM. In particular, for minimizing a possibly nonconvex and nonsmooth objective function subject to coupled linear equality constraints, Wang et al. [24] studied the convergence of ADMM. Jia et al. [34] considered the convergence rate of ADMM for separable nonconvex optimization problems.
In this study, we develop the following ADMM algorithm to optimize problem (14) under the framework of Wang et al. [24]. For the general ADMM scheme, we refer the reader to the original paper of Gabay and Mercier [28].
Let S = {B ∈ R^{p×d} : B^T B = I_d}. Then, we can reformulate the problem of minimizing (14) as

$$\min_{B,\,a,\,b,\,A}\ h(B, a, b) + I_S(A), \qquad \text{subject to } B - A = 0, \tag{15}$$

where h(B, a, b) denotes the objective function in (14) and I_S is the indicator function of the set S, i.e., I_S(A) = 0 if A ∈ S and I_S(A) = +∞ otherwise. The function h is proper, differentiable, and nonconvex.
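To make the objective concrete, the loss h(B, a, b) can be evaluated as below. This is a minimal Python sketch; the normalized Gaussian kernel weights are an illustrative stand-in for the ω_{ij} of Xia et al. [7], and all names are ours.

```python
import numpy as np

def mave_loss(B, a, b, X, y, bw=0.5):
    """Evaluate the MAVE criterion h(B, a, b): a double sum of weighted
    squared residuals of local linear fits anchored at each X_j.

    a has shape (n,), b has shape (n, d). Normalized Gaussian kernel
    weights play the role of omega_ij (an illustrative choice)."""
    n = X.shape[0]
    total = 0.0
    for j in range(n):
        Z = (X - X[j]) @ B                           # B^T (X_i - X_j), (n, d)
        K = np.exp(-0.5 * np.sum((Z / bw) ** 2, axis=1))
        w = K / K.sum()                              # weights sum to one in i
        resid = y - a[j] - Z @ b[j]                  # local linear residuals
        total += np.sum(w * resid ** 2)
    return total
```

A convenient sanity check: for a noiseless single index model, choosing (a_j, b_j) as the exact local linear coefficients drives the loss to zero.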
The inner product of A and B is written, as usual, as ⟨A, B⟩ = Tr(A^T B), where Tr(·) is the trace operator. Then, the function

$$L_\rho(B, a, b, A, \Theta) = h(B, a, b) + I_S(A) + \langle \Theta, B - A \rangle + \frac{\rho}{2}\,\|B - A\|_F^2 \tag{16}$$

is the augmented Lagrangian function of (15), where ρ > 0 is the penalty parameter, Θ ∈ R^{p×d} is the Lagrange multiplier, and ‖·‖_F is the Frobenius norm.
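The augmented Lagrangian is straightforward to evaluate once A is feasible (so that the indicator term vanishes). A small sketch, with the loss h supplied as a precomputed value:

```python
import numpy as np

def aug_lagrangian(h_val, B, A, Theta, rho):
    """L_rho(B, a, b, A, Theta) for feasible A (A^T A = I_d), so the
    indicator term I_S(A) is zero; h_val = h(B, a, b) is precomputed."""
    inner = np.sum(Theta * (B - A))                      # <Theta, B - A>
    penalty = 0.5 * rho * np.linalg.norm(B - A, "fro") ** 2
    return h_val + inner + penalty
```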
We can compute the estimates of (B, a, b, A, Θ) through the following ADMM algorithm.
For given values of A^(m), B^(m), and Θ^(m) at step m, the iterative process is as follows. First, the (a, b)-update

$$(a_j^{(m+1)}, b_j^{(m+1)}) = \arg\min_{a_j,\,b_j} L_\rho(B^{(m)}, a, b, A^{(m)}, \Theta^{(m)}), \qquad j = 1, \ldots, n, \tag{17}$$

is equivalent to the weighted least-squares problem (13) with B fixed at B^(m), and hence has an explicit solution. Next, by the definition of h(B^(m), a, b) and simple algebraic manipulation, the B-update is

$$B^{(m+1)} = \arg\min_{B} L_\rho(B, a^{(m+1)}, b^{(m+1)}, A^{(m)}, \Theta^{(m)}). \tag{18}$$

After discarding the terms independent of B, the objective of (18) becomes

$$\sum_{j=1}^n \sum_{i=1}^n \big[y_i - a_j^{(m+1)} - (b_j^{(m+1)})^\top B^\top (X_i - X_j)\big]^2 \omega_{ij} + \langle \Theta^{(m)}, B \rangle + \frac{\rho}{2}\,\|B - A^{(m)}\|_F^2. \tag{19}$$

Since (19) is a quadratic function of B, B^(m+1) has an explicit form. Specifically, setting the gradient of (19) with respect to B to zero and applying the vectorization operator vec(·), which stacks the columns of a matrix into a vector, we obtain vec(B^(m+1)) as the solution of a linear system. The A-update,

$$A^{(m+1)} = \arg\min_{A \in S}\ \Big\|A - \Big(B^{(m+1)} + \frac{1}{\rho}\Theta^{(m)}\Big)\Big\|_F^2, \tag{20}$$

is the projection of B^(m+1) + Θ^(m)/ρ onto S. Its solution is given in the following lemma.
Lemma 1. For any matrix C ∈ R^{p×d} with rank d, its projection P_S(C) onto S is defined as the solution of min_{A ∈ S} ‖A − C‖²_F. Then, we have

$$P_S(C) = C\,(C^\top C)^{-1/2},$$

equivalently, P_S(C) = U V^T, where C = U D V^T is the thin singular value decomposition of C.

The proof of Lemma 1 is given in the Appendix. Applying Lemma 1 to the update step of A, we obtain

$$A^{(m+1)} = P_S\Big(B^{(m+1)} + \frac{1}{\rho}\Theta^{(m)}\Big).$$
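The projection of Lemma 1 is easy to implement and to check numerically: the closed form C(C^T C)^{-1/2} agrees with U V^T from the thin SVD. A small sketch:

```python
import numpy as np

def proj_stiefel_svd(C):
    # P_S(C) = U V^T, where C = U D V^T is the thin SVD of C.
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt

def proj_stiefel_sqrt(C):
    # Equivalent closed form P_S(C) = C (C^T C)^{-1/2}, valid for rank(C) = d.
    w, V = np.linalg.eigh(C.T @ C)            # eigendecomposition of C^T C
    inv_sqrt = (V / np.sqrt(w)) @ V.T         # (C^T C)^{-1/2}
    return C @ inv_sqrt
```

Both forms return an orthonormal matrix, and the SVD form is the numerically safer choice when C is nearly rank deficient.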

Finally, the update of Θ is

$$\Theta^{(m+1)} = \Theta^{(m)} + \rho\,(B^{(m+1)} - A^{(m+1)}). \tag{21}$$

The stopping rule is ‖P_{B^(m+1)} − P_{B^(m)}‖_F ≤ ε for some small value ε, where P_C = C(C^T C)^{-1} C^T for any full-column-rank matrix C. We summarize the above procedure in Algorithm 1.
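Putting the pieces together, the full update cycle with the projection-based stopping rule can be sketched as follows. This is a minimal Python sketch in which the toy smooth loss h(B) = ½‖B − M‖²_F stands in for the MAVE criterion (so the (a, b)-step is absorbed and the B-update has a closed form); the function names and the choice of h are ours, not the paper's.

```python
import numpy as np

def stiefel_proj(C):
    # Projection onto S = {A : A^T A = I_d} (Lemma 1), via the thin SVD.
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt

def proj_matrix(C):
    # P_C = C (C^T C)^{-1} C^T, used in the stopping rule.
    return C @ np.linalg.solve(C.T @ C, C.T)

def admm_toy(M, rho=10.0, eps=1e-8, max_iter=500):
    """ADMM for min_B h(B) + I_S(A) subject to B = A, with the toy loss
    h(B) = 0.5 * ||B - M||_F^2 standing in for the MAVE criterion."""
    p, d = M.shape
    A = np.zeros((p, d))
    Theta = np.zeros((p, d))
    B_old = None
    for _ in range(max_iter):
        # B-update: argmin_B h(B) + <Theta, B - A> + (rho/2)||B - A||_F^2
        B = (M - Theta + rho * A) / (1.0 + rho)
        # A-update: projection of B + Theta/rho onto the Stiefel manifold
        A = stiefel_proj(B + Theta / rho)
        # dual update
        Theta = Theta + rho * (B - A)
        # stopping rule: column spaces of successive B iterates agree
        if B_old is not None and np.linalg.norm(
                proj_matrix(B) - proj_matrix(B_old)) <= eps:
            break
        B_old = B
    return A
```

For this toy loss, the minimizer over S is exactly P_S(M), so the output can be checked directly against Lemma 1.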
We next establish the convergence property of the proposed ADMM algorithm.

Theorem 2. For any sufficiently large ρ, the sequence {a^(m), b^(m), A^(m), B^(m), Θ^(m)} generated by Algorithm 1 has at least one limit point, and each limit point is a stationary point of the associated augmented Lagrangian function L_ρ(B, a, b, A, Θ).
The above result is based on the ADMM theory for nonconvex problems established by Wang et al. [24]. The proof of our theorem is relegated to the Appendix.

Numerical Simulations
Numerical simulation studies are conducted to examine the numerical performance of the algorithm; Matlab is used for all numerical tests. The estimation accuracy is assessed by two different measures: the norm distance, i.e., the canonical distance ‖P_B̂ − P_B‖_F between the projection matrices of the estimated and the true subspaces, and the trace correlation Tr(P_B P_B̂)/d. The performance of the ADMM algorithm is studied in the first two examples, for single index models (d = 1) and multiple index models (d > 1). The third example explores how different penalty parameters ρ affect the estimation accuracy.
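The two accuracy measures can be computed as follows. This is a small Python sketch for illustration (the simulations in the paper use Matlab):

```python
import numpy as np

def proj(C):
    # Orthogonal projection matrix onto the column space of C.
    return C @ np.linalg.solve(C.T @ C, C.T)

def norm_distance(B_hat, B):
    # Frobenius distance between estimated and true projection matrices.
    return np.linalg.norm(proj(B_hat) - proj(B), "fro")

def trace_correlation(B_hat, B):
    # Tr(P_B P_Bhat) / d; equals 1 iff the two d-dimensional spaces coincide.
    d = B.shape[1]
    return np.trace(proj(B) @ proj(B_hat)) / d
```

Both measures depend on B and B̂ only through their column spaces, so they are invariant to the particular bases chosen.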
Example 1. In this example, two single index models are considered.
(i) The components of the vector X ∈ R² are independent and uniformly distributed on [0, 1], and the data are generated by the single index model of [35]. (ii) The components of the vector X ∈ R³ are independent and uniformly distributed on [0, 1], and the data are generated by the single index model investigated by Carroll et al. [36].
In both models, the error ε has the standard normal distribution.

Example 2. In this example, two multiple index models are studied.
(i) X ∼ N(0, I_7), and the data are generated from a multiple index model with error ε ∼ N(0, 1). (ii) X ∼ N(0, I_6), and the data are generated from the multiple index model of Xia et al. [7], with ε ∼ N(0, 1).
The simulation results are shown in Tables 1 and 2.
Tables 1 and 2 report the two different measures of estimation accuracy, their corresponding standard errors (in parentheses), and the running times. The simulations were run on an Intel(R) Core(TM) i7-5600U CPU at 2.60 GHz with 8 GB of memory. The results suggest that, according to both the norm distance and the trace correlation, the ADMM algorithm can provide slightly more accurate estimates than the MAVE method. The running times also show that the ADMM algorithm is slightly faster than the MAVE method of Xia et al. [7] in most cases.

Example 3. This example examines the performance of the ADMM algorithm for different penalty parameters.
For Example 3, 1000 replications are simulated with n = 200, and the results are displayed in Table 3. For simplicity, we only present the means and standard deviations of the Frobenius norm distance and the running times. The means and standard deviations of the Frobenius norm distance and the running times differ very little when ρ is taken as 0.01, 0.5, 5, 10, 20, and 50, respectively. These results reveal that the ADMM algorithm is almost insensitive to the choice of penalty parameter. Furthermore, the runs with penalty parameters ρ = 10 and 20 are slightly faster than the others.

Concluding Remarks
Xia and his students have proposed an efficient direct algorithm for MAVE and contributed it to the R package "MAVE." A detailed description can be found at https://CRAN.R-project.org/package=MAVE.
In this study, we develop an ADMM algorithm to solve the MAVE-type methods and establish its convergence properties. The computational efficiency of the algorithm has been demonstrated by numerical experiments. It is noteworthy that the proposed ADMM algorithm is applicable to all the MAVE-type approaches. For example, for survival data, Xia et al. [37] proposed the hMAVE method, which can be regarded as a censored version of Xia et al. [7]'s MAVE method; in this setting, our ADMM algorithm can also be used. It has been shown that the ADMM algorithm is quite flexible in handling many large-scale statistical problems, see Boyd et al. [23]. Therefore, it is desirable to apply our ADMM algorithm to sufficient dimension reduction for high- or ultrahigh-dimensional data.

Appendix
We first give the proof of Lemma 1.
Proof of Lemma 1. We use the method of Lagrange multipliers. Consider the Lagrangian function

$$L(A, V) = \|A - C\|_F^2 + \langle V, A^\top A - I_d \rangle,$$

where V ∈ R^{d×d} is the matrix of Lagrange multipliers with V = V^T. Setting the gradient with respect to A to zero gives 2(A − C) + 2AV = 0, that is, A(I_d + V) = C. The constraint A^T A = I_d then yields (I_d + V)^T (I_d + V) = C^T C, and since I_d + V is symmetric, taking the positive definite square root (which corresponds to the minimizer) gives I_d + V = (C^T C)^{1/2}. Therefore, P_S(C) = C(C^T C)^{-1/2}.

Table 1: Simulation results for Example 1: the means and standard deviations (in parentheses) of the two different measures and the running times.

Table 2: Simulation results for Example 2: the means and standard deviations of the two different measures and the running times.

Table 3: Simulation results for Example 3: the ADMM algorithm for the different penalty parameters.