Kernel Sliced Inverse Regression: Regularization and Consistency



Introduction
The goal of dimension reduction in the standard regression/classification setting is to summarize the information in the p-dimensional predictor variable X that is relevant to predicting the univariate response variable Y. The summary S(X) should have d ≪ p variates and ideally should satisfy the following conditional independence property: Y and X are independent given S(X). Thus, any inference on Y involves only the summary statistic S(X), which is of much lower dimension than the original data X.
Linear methods for dimension reduction focus on linear summaries of the data, S(X) = (β_1ᵀX, ..., β_dᵀX). The d-dimensional subspace S = span(β_1, ..., β_d) is defined as the effective dimension reduction (e.d.r.) space in [1], since S summarizes all the predictive information on Y. A key result in [1] is that, under some mild conditions, the e.d.r. directions {β_j}_{j=1}^d correspond to the eigenvectors of the matrix Σ⁻¹Γ, where Σ = cov(X) and Γ = cov(E[X | Y]). Thus, the e.d.r. directions or subspace can be estimated via an eigenanalysis of this matrix, which is the foundation of the sliced inverse regression (SIR) algorithm proposed in [1, 2]. Further developments include sliced average variance estimation (SAVE) [3] and principal Hessian directions (PHD) [4].
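As a concrete illustration of the eigenanalysis above, here is a minimal SIR sketch in Python; the slicing scheme, toy model, and function name are illustrative, not taken from the paper.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    """Sliced inverse regression: eigenvectors of Sigma^{-1} Gamma, where
    Sigma = cov(X) and Gamma = cov(E[X | Y]) is estimated by slicing y."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    Gamma = np.zeros((p, p))
    # Equal-frequency slices of the sorted response.
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Xc[idx].mean(axis=0)          # slice mean of the inverse regression
        Gamma += (len(idx) / n) * np.outer(m, m)
    # Sigma is assumed invertible here (n > p); solve the generalized problem.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, Gamma))
    top = np.argsort(-evals.real)[:n_dirs]
    return evecs.real[:, top]

# Toy single-index model: y depends on X only through its first coordinate,
# so the leading SIR direction should align with the first basis vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(500)
B = sir_directions(X, y, n_dirs=1)
```

This is exactly the invertible-Σ setting; the sections below show why this recipe breaks down in a high- or infinite-dimensional feature space.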
The aforementioned algorithms cannot be applied in high-dimensional settings, where the number of covariates p is greater than the number of observations n, since the sample covariance matrix is singular. Recently, an extension of SIR has been proposed in [5] which can handle the case p > n based on the idea of partial least squares.
A common premise in high-dimensional data analysis is that the intrinsic structure of the data is in fact low dimensional, for example, that the data are concentrated on a manifold. Linear methods such as SIR often fail to capture this nonlinear low-dimensional structure. However, there may exist a nonlinear embedding of the data into a Hilbert space where a linear method can capture the low-dimensional structure. The basic idea of kernel methods is the application of a linear algorithm to data mapped into a feature space induced by a kernel function. If projections onto this low-dimensional structure can be computed by inner products in this Hilbert space, the so-called kernel trick [6, 7] can be used to obtain simple and efficient algorithms. Since the embedding is nonlinear, linear directions in the feature space correspond to nonlinear directions in the original data space. Nonlinear extensions of classical linear dimension reduction methods using this approach include kernel principal component analysis [6], kernel Fisher discriminant analysis [8], and kernel canonical correlation analysis [9]. This idea was applied to SIR in [10, 11], resulting in the kernel sliced inverse regression (KSIR) method, which allows for the estimation of nonlinear e.d.r. directions.
There are numeric, algorithmic, and conceptual subtleties in a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR, the p-dimensional data are projected into a Hilbert space H through a feature map φ : X → H, and the nonlinear features are supposed to be recovered by the eigenfunctions of the operator cov(φ(X))⁻¹ cov(E[φ(X) | Y]). However, this operator is actually not well defined in general, especially when H is infinite dimensional and the covariance operator cov(φ(X)) is not invertible. In addition, the key utility of representer theorems in kernel methods is that optimization in a possibly infinite dimensional Hilbert space H reduces to solving a finite dimensional optimization problem. In the KSIR algorithm developed in [10, 11], the representer theorem has been implicitly used and seems to work well in empirical studies. However, it is not theoretically justifiable since the operator is estimated empirically from observed data. Moreover, the computation of eigenfunctions of an empirical estimate of this operator is often ill-conditioned and results in computational instability. In [11], a low rank approximation of the kernel matrix is used to overcome this instability and to reduce computational complexity. In this paper, our aim is to clarify the theoretical subtleties in KSIR and to motivate two types of regularization schemes that overcome the computational difficulties arising in KSIR. Consistency is proven, and the practical advantages of regularization are demonstrated via empirical experiments.

Mercer Kernels and Nonlinear e.d.r. Directions
The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs) and in particular Mercer kernels [12]. Given predictor variables X ∈ X ⊆ R^p, a Mercer kernel is a continuous, symmetric, positive semidefinite function K(·, ·) : X × X → R with the spectral decomposition K(x, x′) = Σ_j μ_j ψ_j(x) ψ_j(x′), where {ψ_j} are the eigenfunctions and {μ_j} are the corresponding nonnegative, nonincreasing eigenvalues. An important property of Mercer kernels is that each kernel K uniquely corresponds to an RKHS of the form H_K = {f = Σ_j c_j ψ_j : Σ_j c_j²/μ_j < ∞}, (5) where the cardinality of Λ := {j : μ_j > 0} is the dimension of the RKHS, which may be infinite [12, 13]. Given a Mercer kernel, there exists a unique map or embedding φ from X to a Hilbert space defined by the eigenvalues and eigenfunctions of the kernel: φ(x) = (√μ_j ψ_j(x))_{j∈Λ}. The Hilbert space induced by this map with the standard inner product K(x, u) = ⟨φ(x), φ(u)⟩ is isomorphic to the RKHS (5), and we will denote both Hilbert spaces as H [13].
In the case where H is infinite dimensional, φ : X → ℓ².
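The relationship between the kernel, its spectral decomposition, and the feature map can be checked numerically on a sample: eigenvectors of the Gram matrix act as empirical versions of the eigenfunctions ψ_j, and the resulting finite-dimensional feature map reproduces the kernel through inner products. A sketch, with the Gaussian kernel and the data chosen purely for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Mercer kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
K = gaussian_kernel(X, X)

# Eigenvectors of the Gram matrix are empirical versions of the
# eigenfunctions; order the eigenvalues nonincreasingly as in the text.
lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]

# Empirical feature map phi(x_i) = (sqrt(mu_j) psi_j(x_i))_j: its inner
# products reproduce the kernel, <phi(x_i), phi(x_k)> = K(x_i, x_k).
Phi = U * np.sqrt(np.clip(lam, 0.0, None))
err = np.abs(Phi @ Phi.T - K).max()
```

The reconstruction error `err` is at the level of floating-point noise, which is the finite-sample analogue of the isomorphism described above.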
The random variable X ∈ X induces a random element φ(X) in the RKHS. Throughout this paper, we will use Hilbert space valued random variables, so we now recall some basic facts. Let ξ be a random element in H with E‖ξ‖ < ∞, where ‖·‖ denotes the norm in H induced by its inner product ⟨·,·⟩. The expectation E(ξ) is defined to be the element of H satisfying ⟨E(ξ), f⟩ = E⟨ξ, f⟩ for all f ∈ H. Let P_X denote the measure of the random variable X. Throughout, we assume the following condition. Assumption 1: the kernel is bounded on the support of P_X; that is, K(X, X) is bounded almost surely.
Under Assumption 1, the random element φ(X) has a well-defined mean and a covariance operator because ‖φ(X)‖² = K(X, X) is bounded (a.s.). Without loss of generality, we assume Eφ(X) = 0, where 0 is the zero element in H. The boundedness also implies that the covariance operator Σ = E[φ(X) ⊗ φ(X)] is compact and has the spectral decomposition Σ = Σ_j λ_j e_j ⊗ e_j, (8) where λ_j and e_j ∈ H are the eigenvalues and eigenfunctions, respectively.
We assume the following model for the relationship between X and Y: Y = f(⟨β_1, φ(X)⟩, ..., ⟨β_d, φ(X)⟩, ε), (9) with β_j ∈ H and the distribution of ε independent of X. This model implies that the response variable Y depends on X only through the d-dimensional summary statistic S(X) = (⟨β_1, φ(X)⟩, ..., ⟨β_d, φ(X)⟩)ᵀ. We call {β_j}_{j=1}^d the nonlinear e.d.r. directions and S = span(β_1, ..., β_d) the nonlinear e.d.r. space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting. Proposition 2. Assume the following linear design condition for H: for any f ∈ H, there exists a vector b ∈ R^d such that E[⟨f, φ(X)⟩ | S(X)] = bᵀ S(X). (11) Then, for the model specified in (9), the inverse regression curve E[φ(X) | Y] is contained in the span of (Σβ_1, ..., Σβ_d), where Σ is the covariance operator of φ(X).
Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is less strict than linearity in Euclidean space. Moreover, it is pointed out in [1, 10] that even if the linear design condition is violated, the bias of the SIR variates is usually not large.
An immediate consequence of Proposition 2 is that the nonlinear e.d.r. directions are the eigenvectors corresponding to the largest eigenvalues of the generalized eigendecomposition problem Γβ = λΣβ, (12) where Γ denotes the covariance operator of the inverse regression curve E[φ(X) | Y], or equivalently from an eigenanalysis of the operator Σ⁻¹Γ. In the infinite dimensional case, a technical difficulty arises since this operator is not defined on the entire Hilbert space H. So for the operator Σ⁻¹Γ to be well defined, we need to show that the range of Γ is indeed contained in the range of Σ. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17]. In those analyses, extra conditions are needed for such operators to be well defined.
In KSIR, this issue is resolved automatically by the linear design condition, and no extra conditions are required, as stated by the following theorem; see Appendix A for the proof.
Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold: (i) the operator Γ is of finite rank r_Γ ≤ d. Consequently, it is compact and has the spectral decomposition Γ = Σ_{j=1}^{r_Γ} ν_j η_j ⊗ η_j, where ν_j and η_j are the eigenvalues and eigenvectors, respectively. Moreover, η_j ∈ range(Σ) for all j = 1, ..., r_Γ; (ii) the eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator Σ⁻¹Γ, which takes the following form

Regularized Kernel Sliced Inverse Regression
The discussion in Section 2 implies that nonlinear e.d.r. directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are some computational challenges to this idea, such as estimating an infinite dimensional covariance operator and the fact that the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and adding a regularization term to kernel SIR.

3.1. Estimating the Nonlinear e.d.r. Directions. Given n observations {(x_1, y_1), ..., (x_n, y_n)}, our objective is to obtain estimates (β̂_1, ..., β̂_d) of the e.d.r. directions. We first formulate a procedure almost identical to the standard SIR procedure except that it operates in the feature space H. This highlights the immediate relation between the SIR and KSIR procedures.
(1) Without loss of generality, we assume that the mapped predictor variables have mean zero, that is, Σ_{i=1}^n φ(x_i) = 0, for otherwise we can subtract φ̄ = (1/n) Σ_{i=1}^n φ(x_i) from each φ(x_i). The sample covariance is estimated by Σ̂ = (1/n) Σ_{i=1}^n φ(x_i) ⊗ φ(x_i). (2) Bin the y values into slices S_1, ..., S_L and compute the mean vector of the mapped predictor variables for each slice, m_l = (1/n_l) Σ_{i∈S_l} φ(x_i), where n_l is the number of observations in slice S_l. Compute the sample between-slice covariance operator Γ̂ = (1/n) Σ_{l=1}^L n_l m_l ⊗ m_l. (3) Estimate the SIR directions β̂ by solving the generalized eigendecomposition problem Γ̂β̂ = λΣ̂β̂. (19) This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the e.d.r. directions themselves but the projections onto these directions, that is, the d summary statistics, which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be computed efficiently from only the kernel K, not the map φ.
The key quantity in this alternative formulation is the centered Gram matrix K defined by the kernel function K(·, ·), with entries K_ik = ⟨φ(x_i) − φ̄, φ(x_k) − φ̄⟩. Note that the rank of K is less than n, so K is always singular. Given the centered Gram matrix K, the following generalized eigendecomposition problem can be used to compute the KSIR variates: KJKc = λK²c, (22) where c denotes the n-dimensional generalized eigenvector and J denotes the n × n matrix with J_ik = 1/n_l if i and k are in the lth slice consisting of n_l observations and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems, (22) and (19), in the recovery of the KSIR variates (v_1, ..., v_d).
This result was proven in [10], in which the algorithm was further reduced to solving JKc = λKc (24) by canceling K from both sides of (22).
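The two ingredients of (22) and (24), the centered Gram matrix K and the slice-averaging matrix J, can be formed explicitly. A minimal sketch; the polynomial kernel and equal-frequency slicing of the sorted response are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def centered_gram(G):
    """Centered Gram matrix K = H G H with H = I - (1/n) 1 1^T, so that
    K_ik = <phi(x_i) - phibar, phi(x_k) - phibar>."""
    n = G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ G @ H

def slice_matrix(y, n_slices):
    """J_ik = 1/n_l if i and k fall in the same slice of size n_l, else 0."""
    n = len(y)
    J = np.zeros((n, n))
    for idx in np.array_split(np.argsort(y), n_slices):
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    return J

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 2))
y = X[:, 0] + 0.1 * rng.standard_normal(60)
G = (1 + X @ X.T) ** 2        # second-order polynomial kernel Gram matrix
K = centered_gram(G)
J = slice_matrix(y, n_slices=5)
```

Note that K here has rank at most the dimension of the polynomial feature space, far below n, which is exactly the singularity of K that the text points out and that motivates the regularization of the next subsection.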
Remark 5. It is important to remark that when Σ̂ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of Σ̂, which is a requirement of the representer theorem, that is, that the β̂_j are linear combinations of the φ(x_i). Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).
Remark 6. It is necessary to clarify the difference between (12) and (19). In (12), it is natural to assume that β is orthogonal to the null space of Σ. To see this, let β = β_0 + β_⋆ with β_0 belonging to the null space and β_⋆ orthogonal to it. Then E⟨β_0, φ(X)⟩² = ⟨β_0, Σβ_0⟩ = 0, that is, ⟨β, φ(X)⟩ = ⟨β_⋆, φ(X)⟩ (a.s.). It means that β_0 does not contribute to the KSIR variates and thus can be set to 0. However, in (19), if β̂ is an eigenvector and β̂_⋆ is its component orthogonal to the null space of Σ̂, the identity ⟨β̂, φ(x)⟩ = ⟨β̂_⋆, φ(x)⟩ is in general not true for a new point x different from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice.
In this sense, the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.

Regularization and Stability.
Setting aside the theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the e.d.r. space. This can be addressed either by thresholding the eigenvalues of the estimated covariance operator Σ̂ or by adding a regularization term to (22) or (24). We motivate two types of regularization schemes. The first one is the traditional ridge regularization. It is used in both linear SIR and functional SIR [18-20] and solves the eigendecomposition problem Γ̂β̂ = λ(Σ̂ + sI)β̂. (26) Here, and in the sequel, I denotes the identity operator and s is a tuning parameter. Assuming the representer theorem, its kernel form is given as JKc = λ(K + sI)c, (27) where the tuning parameter absorbs the sample-size factor. Another type of regularization is to regularize (22) directly: KJKc = λ(K² + sI)c. (28) The following proposition, whose proof is in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of Σ̂Γ̂β̂ = λ(Σ̂² + sI)β̂. (29) Proposition 7. Let ĉ_j be the eigenvectors of (28), and let β̂_j be the eigenvectors of (29). Then, the following holds for the regularized KSIR variates: ⟨β̂_j, φ(x_i)⟩ = (Kĉ_j)_i for i = 1, ..., n. This algorithm is termed the Tikhonov regularization. For linear SIR, it is shown in [21] that the Tikhonov regularization is more efficient than the ridge regularization.
The conclusion follows from the observation that β̂_j = (1/(sλ̂_j))(Γ̂β̂_j − λ̂_j Σ̂β̂_j) for the ridge regularization and β̂_j = (1/(sλ̂_j))(Σ̂Γ̂β̂_j − λ̂_j Σ̂²β̂_j) for the Tikhonov regularization. To close, we remark that KSIR is computationally advantageous even for linear models when p ≫ n, because the eigendecomposition problem involves n × n matrices rather than the p × p matrices of the standard SIR formulation.

Consistency of Regularized KSIR.
In this subsection, we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which the rate of convergence is O_p(n^{−1/4}). An important observation from the proof is that the rate of convergence of the e.d.r. directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the e.d.r. space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.
Note that various consistency results are available for linear SIR [22-24]. These results hold only for the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14], a thresholding method is considered, which selects a finite number of eigenvectors and uses results from finite rank operators. Their proof of consistency requires stronger and more complicated conditions than ours. The consistency of functional SIR with ridge regularization is proven in [19], but it is of a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.
In the following, we state the consistency result for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted. This theorem is a direct corollary of the following theorem, which is proven in Appendix C. Theorem 10. Define the projection operator Π_m = Σ_{j=1}^m e_j ⊗ e_j and its complement Π_m^⊥ = I − Π_m for each m ≥ 1, where {e_j}_{j=1}^∞ are the eigenvectors of the covariance operator Σ as defined in (8), with the corresponding eigenvalues denoted by λ_j.

Application to Simulated and Real Data
In this section, we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods.The comparisons are used to address two questions: (1) does regularization improve the performance of kernel sliced inverse regression, and (2) does the nonlinearity of kernel sliced inverse regression improve the prediction accuracy?
We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping φ of an RKHS is not available, we do not know the true e.d.r. directions or subspace. So in that case, we use prediction accuracy to evaluate the goodness of RKSIR.

Importance of Nonlinearity and Regularization. Our first example illustrates that both the nonlinearity and regularization of RKSIR can significantly improve prediction accuracy.
The regression model has ten predictor variables x = (x_1, ..., x_10), each following a standard normal distribution x_j ∼ N(0, 1). A univariate response is given by model (35), with noise ε ∼ N(0, 0.1²). We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to compute the e.d.r. directions. In RKSIR, we used an additive Gaussian kernel with bandwidth parameter set to 2. After projecting the training samples onto the estimated e.d.r. directions, we train a Gaussian kernel regression model on the new variates. Then, the mean square error is computed on 1000 independent test samples. This experiment is repeated 100 times with all parameters set by cross-validation. The results are summarized in Figure 1. Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter. KSIR without regularization performs much worse than RKSIR. Figure 1

(b) displays the prediction accuracy of various dimension reduction methods.
RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of the nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation seems to agree with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.
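The additive Gaussian kernel used in this experiment can be sketched as follows, assuming it is the sum of coordinate-wise one-dimensional Gaussian kernels with bandwidth 2; this form and the role of the parameter are a hedged reading, since the paper's displayed formula is not reproduced here.

```python
import numpy as np

def additive_gaussian_kernel(A, B, sigma=2.0):
    """Additive Gaussian kernel (assumed form): a sum over coordinates,
    K(x, x') = sum_j exp(-(x_j - x'_j)^2 / (2 sigma^2))."""
    d2 = (A[:, None, :] - B[None, :, :]) ** 2      # per-coordinate squared gaps
    return np.exp(-d2 / (2 * sigma ** 2)).sum(-1)  # sum of coordinate kernels

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 10))   # ten N(0, 1) predictors, as in the model
G = additive_gaussian_kernel(X, X)
```

An additive kernel restricts the feature space to sums of one-dimensional functions, which matches a regression surface built from univariate nonlinear factors; being a sum of Mercer kernels, it is itself positive semidefinite.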

Effect of Regularization.
This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.
For this model, it is known that SIR will miss the direction along the second variable x_2, so we focus on the comparison of KSIR and RKSIR in this example.
If we use the second-order polynomial kernel K(x, u) = (1 + xᵀu)², whose feature space contains all monomials of degree at most two, then x_1 + x_2² can be captured in one e.d.r. direction. Ideally, the first KSIR variate v = ⟨β_1, φ(X)⟩ should be equivalent to x_1 + x_2² modulo shift and scale. So for this example, given estimates of the KSIR variates at the n data points {v̂_i}_{i=1}^n = {⟨β̂_1, φ(x_i)⟩}_{i=1}^n, the error of the first e.d.r. direction can be measured by the least squares fit of v̂ with respect to (x_1 + x_2²). We drew 200 observations from the model specified in (37) and then applied the two dimension reduction methods KSIR and RKSIR. The mean and standard errors over 100 repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as the anisotropy increases and that the regularization helps to reduce this instability.
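The error measure above can be sketched as a least squares fit; the helper name edr_error is ours, and a perfect variate, equal to x_1 + x_2² up to shift and scale, gives error 0.

```python
import numpy as np

def edr_error(v_hat, target):
    """Least squares error of a KSIR variate against the true index: fit
    v_hat ~ a + b * target and return the relative residual variance."""
    A = np.column_stack([np.ones_like(target), target])
    coef, _, _, _ = np.linalg.lstsq(A, v_hat, rcond=None)
    resid = v_hat - A @ coef
    return (resid ** 2).mean() / v_hat.var()

rng = np.random.default_rng(6)
x1, x2 = rng.standard_normal(200), rng.standard_normal(200)
target = x1 + x2 ** 2
v_good = 3.0 * target - 1.0        # a perfect variate up to shift and scale
err = edr_error(v_good, target)
```

Normalizing by the variance of the variate makes the measure invariant to the arbitrary scale of the estimated direction, which is why "modulo shift and scale" in the text is the right equivalence.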

Importance of Nonlinearity and Regularization in Real Data. When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For the case of multiclass classification, it is natural to use SIR and to consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has improved the performance of KFDA [25]. In this example, we show that adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set, the classification of handwritten digits.
The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwritten digits {0, 1, 2, ..., 9} as training data and 10,000 images as test data. Each image consists of p = 28 × 28 = 784 gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space. We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on these data to examine the effect of regularization and nonlinearity. Each draw of the training set consisted of 100 observations of each digit. We then computed the top 10 e.d.r. directions using these 1000 observations and 10 slices, one for each digit. We projected the 10,000 test observations onto the e.d.r. directions and used a k-nearest neighbor (kNN) classifier with k = 5 to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR, we used a Gaussian kernel with the bandwidth parameter set as the median pairwise distance between observations. The regularization parameter was set by cross-validation.
The mean and standard deviation of the classification accuracy over 100 iterations of this procedure are reported in Table 1. The first interesting observation is that linear dimension reduction does not capture the discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy, and coupling regularization with nonlinearity increases it further. This improvement is dramatic for the digits 2, 3, 5, and 8.
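The digit experiment itself needs the external MNIST files, but the class-per-slice SIR plus 5-NN evaluation described above can be sketched on synthetic classes; the small ridge term, the synthetic Gaussian classes, and all names are illustrative stand-ins, not the paper's exact RSIR/RKSIR implementation.

```python
import numpy as np

def sir_class_directions(X, labels, n_dirs=2, s=1e-3):
    """SIR for classification: one slice per class, with a small ridge term
    added to the covariance for numerical stability."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n + s * np.eye(p)
    Gamma = np.zeros((p, p))
    for c in np.unique(labels):
        m = Xc[labels == c].mean(axis=0)
        Gamma += (labels == c).mean() * np.outer(m, m)
    w, V = np.linalg.eig(np.linalg.solve(Sigma, Gamma))
    return V.real[:, np.argsort(-w.real)[:n_dirs]]

def knn_predict(Ztr, ytr, Zte, k=5):
    """k-nearest-neighbour majority vote (k = 5 as in the experiment)."""
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in ytr[idx]])

rng = np.random.default_rng(5)
# Three well-separated Gaussian classes in 20 dimensions as a synthetic
# stand-in for the digit data, which is not bundled here.
means = np.zeros((3, 20))
means[0, 0], means[1, 1], means[2, 2] = 4.0, 4.0, 4.0
Xtr = np.vstack([rng.standard_normal((60, 20)) + means[c] for c in range(3)])
ytr = np.repeat(np.arange(3), 60)
Xte = np.vstack([rng.standard_normal((30, 20)) + means[c] for c in range(3)])
yte = np.repeat(np.arange(3), 30)
B = sir_class_directions(Xtr, ytr)
acc = (knn_predict(Xtr @ B, ytr, Xte @ B) == yte).mean()
```

With one slice per class, Γ̂ has rank at most the number of classes minus one, so two directions suffice for three classes, mirroring the 10-slice, 10-direction setup used for the digits.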

Discussion
The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework. Therefore, the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.
An RKHS has also been introduced into supervised dimension reduction in [26], where conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore, their method estimates linear e.d.r. subspaces, while in [10, 11] and in this paper, the RKHS is used to model the e.d.r. subspace, which leads to nonlinear dimension reduction.
There are several open issues in the regularized kernel SIR method, such as the selection of kernels, regularization parameters, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood for linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.
There are some interesting connections between KSIR and functional SIR, which was developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR, the observable data are functions and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR, the observable data are typically not functions but are mapped into a function space in order to characterize nonlinear structures. This suggests that the computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS or by using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR differ from those for functional SIR.

B. Proof of Proposition 7
We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where the dimension of H is infinite and a matrix form does not make sense.

and U are similarly defined. The above formulations of Φ and Φᵀ coincide with the definition of Σ̂ as a covariance operator. Since the rank of Σ̂ is less than n, it is compact and has the representation Σ̂ = Σ_{j=1}^{r} λ̂_j ê_j ⊗ ê_j, where r ≤ n is the rank.

Figure 1: Dimension reduction for model (35): (a) mean square error for RKSIR with different regularization parameters; (b) mean square error for various dimension reduction methods (the blue line represents the mean square error without dimension reduction); (c)-(d) RKSIR variates versus nonlinear factors.

Figure 2: Error in e.d.r. as a function of the anisotropy parameter.

Table 1: Mean and standard deviations for error rates in classification of digits.

It is shown in [14, 28, 29] that each β̂_j lies in span(φ(x_1), ..., φ(x_n)).

C.1. Preliminaries. In order to prove Theorems 9 and 10, we use the properties of Hilbert-Schmidt operators, of covariance operators for Hilbert space valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to them; for details, see [27-29] and the references therein. Given a separable Hilbert space H of dimension d_H, a linear operator A on H is said to belong to the Hilbert-Schmidt class if ‖A‖²_HS := Σ_j ‖A e_j‖² < ∞, where {e_j} is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm ‖·‖_HS. Given a bounded operator B on H and a Hilbert-Schmidt operator A, the operators AB and BA both belong to the Hilbert-Schmidt class and the following holds: ‖AB‖_HS ≤ ‖B‖ ‖A‖_HS and ‖BA‖_HS ≤ ‖B‖ ‖A‖_HS. The covariance operator Σ is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class. A well-known result from the perturbation theory for linear operators states that if a sequence of linear operators A_n converges to A in the Hilbert-Schmidt norm and the eigenvalues of A are nondegenerate, then the eigenvalues and eigenvectors of A_n converge to those of A with the same rate of convergence as that of the operators.

C.2. Proof of Theorem 10. We will use the following result from [14]: ‖T̂ − T‖_HS ≤ ‖T̂ − T_1‖_HS + ‖T_1 − T_2‖_HS + ‖T_2 − T‖_HS. (C.7) For the first term, observe that ‖T̂ − T_1‖_HS ≤ ‖(Σ̂² + sI)⁻¹‖ ‖Σ̂Γ̂ − ΣΓ‖_HS, where Γ̂ = Σ_j ν̂_j η̂_j ⊗ η̂_j. Since η_j ∈ range(Σ), there exists ũ_j such that η_j = Σũ_j. Then ‖ũ_j‖ ≤ ‖Π_m(ũ_j)‖ + ‖Π_m^⊥(ũ_j)‖.