
Kernel sliced inverse regression (KSIR) is a natural framework for nonlinear dimension reduction using the mapping induced by kernels. However, there are numeric, algorithmic, and conceptual subtleties in making the method robust and consistent. We apply two types of regularization in this framework to address computational stability and generalization performance. We also provide an interpretation of the algorithm and prove consistency. The utility of this approach is illustrated on simulated and real data.

The goal of dimension reduction in the standard regression/classification setting is to summarize the information in the

Linear methods for dimension reduction focus on linear summaries of the data,

A common premise held in high-dimensional data analysis is that the intrinsic structure of data is in fact low dimensional, for example, the data is concentrated on a manifold. Linear methods such as SIR often fail to capture this nonlinear low-dimensional structure. However, there may exist a nonlinear embedding of the data into a Hilbert space, where a linear method can capture the low-dimensional structure. The basic idea in applying kernel methods is the application of a linear algorithm to the data mapped into a feature space induced by a kernel function. If projections onto this low-dimensional structure can be computed by inner products in this Hilbert space, the so-called kernel trick [

There are numeric, algorithmic, and conceptual subtleties to a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR, the

The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs) and in particular Mercer kernels [

Given predictor variables

Given a Mercer kernel, there exists a unique map or embedding

The random variable

Let

For all

There exists

Under Assumption

We assume the following model for the relationship between

Assume the following linear design condition for

Proposition

An immediate consequence of Proposition

Under Assumption

the operator

where

the eigendecomposition problem (

The discussion in Section

Given

Without loss of generality, we assume that the mapped predictor variables are mean zero, that is,

Bin the

Compute the sample between-group covariance matrix

Estimate the SIR directions
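The equations for these steps are elided in this excerpt, but the procedure (center, slice, form the between-group covariance, eigendecompose) can be sketched for an explicit finite-dimensional feature matrix. This is an illustrative sketch under standard SIR conventions, not the paper's exact formulation; the names `sir_directions`, `n_slices`, and `ridge` are our own:

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, n_slices=10, n_dirs=2, ridge=1e-6):
    """Sliced inverse regression on explicit (finite-dimensional) features.

    Mirrors the steps in the text: center the predictors, bin the
    responses into slices, form the between-slice covariance of the
    slice means, and solve the generalized eigenproblem against the
    total covariance.  `ridge` is a small jitter for numerical stability.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # mean-zero predictors
    Sigma = Xc.T @ Xc / n                        # total covariance
    # Bin responses into roughly equal-sized slices by rank.
    slices = np.array_split(np.argsort(y), n_slices)
    Gamma = np.zeros((p, p))                     # between-slice covariance
    for idx in slices:
        m = Xc[idx].mean(axis=0)
        Gamma += (len(idx) / n) * np.outer(m, m)
    # Leading generalized eigenvectors of Gamma w = lam * Sigma w.
    lam, W = eigh(Gamma, Sigma + ridge * np.eye(p))
    return W[:, ::-1][:, :n_dirs]                # top directions first
```

As the next paragraph notes, this explicit computation is infeasible when the feature map cannot be formed; the Gram-matrix formulation removes that obstacle.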

This procedure is computationally impossible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (

The key quantity in this alternative formulation is the centred Gram matrix

Given the centered Gram matrix

Given the observations
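The matrix equations are elided above, so the following sketch assumes one common dual formulation of KSIR: each e.d.r. function is expanded over the training sample, and the coefficient vectors solve a generalized eigenproblem built from the centred Gram matrix and a slice-averaging matrix. The formulation, the `eps` regularizer, and all names are illustrative assumptions, not the paper's exact equations:

```python
import numpy as np
from scipy.linalg import eigh

def rbf_gram(X, gamma=1.0):
    """Gaussian (RBF) Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def ksir(K, y, n_slices=10, n_dirs=2, eps=1e-3):
    """Dual-form KSIR sketch (assumed formulation; see lead-in).

    Centers the Gram matrix, builds the slice-averaging matrix J,
    and solves  Kc J Kc a = lam (Kc Kc + n*eps*I) a  for the
    expansion coefficients of the e.d.r. functions.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                            # centred Gram matrix
    J = np.zeros((n, n))                      # within-slice averaging
    for idx in np.array_split(np.argsort(y), n_slices):
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    M = Kc @ J @ Kc
    B = Kc @ Kc + n * eps * np.eye(n)         # regularized scale matrix
    lam, A = eigh(M, B)
    A = A[:, ::-1][:, :n_dirs]                # top coefficient vectors
    variates = Kc @ A                         # projections of training data
    return A, variates
```

Everything here is computed from the n-by-n Gram matrix alone, which is what makes the method practical when the feature space is infinite dimensional.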

This result was proven in [

It is important to remark that when

It is necessary to clarify the difference between (

Beyond these theoretical subtleties, in applications with relatively small samples, the eigendecomposition in (

We motivate two types of regularization schemes. The first one is the traditional ridge regularization. It is used in both linear SIR and functional SIR [

Another type of regularization is to regularize (

Let

This algorithm is termed Tikhonov regularization. For linear SIR, it is shown in [
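With the displayed equations elided here, the two schemes can only be contrasted schematically, under the standard conventions: ridge replaces the inverse covariance by `(Sigma + s*I)^{-1}`, while Tikhonov uses `(Sigma^2 + s*I)^{-1} Sigma`. This is a sketch of the generic constructions, not necessarily the paper's exact operators:

```python
import numpy as np

def ridge_inv(Sigma, s):
    """Ridge-regularized inverse: (Sigma + s I)^{-1}."""
    p = Sigma.shape[0]
    return np.linalg.inv(Sigma + s * np.eye(p))

def tikhonov_inv(Sigma, s):
    """Tikhonov-regularized inverse: (Sigma^2 + s I)^{-1} Sigma."""
    p = Sigma.shape[0]
    return np.linalg.inv(Sigma @ Sigma + s * np.eye(p)) @ Sigma
```

Both regularized inverses remain well defined when `Sigma` is singular or ill conditioned, and both converge to the true inverse as `s` tends to zero for a well-conditioned `Sigma`.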

Besides improving computational stability, regularization also makes the matrix forms of KSIR, (

For both ridge and the Tikhonov regularization scheme of KSIR, the eigenfunctions

The conclusion follows from the observation that

To close, we remark that KSIR is computationally advantageous even for the case of linear models when

In this subsection, we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which the rate of convergence is

Note that various consistency results are available for linear SIR [

In the following, we state the consistency results for Tikhonov regularization. A similar result can be proved for ridge regularization, but the details are omitted.

Assume

If the e.d.r. directions

This theorem is a direct corollary of the following theorem which is proven in Appendix

Define the projection operator and its complement for each

Assume

If

In this section, we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons are used to address two questions:

We would like to remark that the assessment of nonlinear dimension reduction methods could be more difficult than that of linear ones. When the feature mapping

Our first example illustrates that both the nonlinearity and regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables

Dimension reduction for model (

Figure

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation seems to agree with the model in (

This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.

The regression model has ten predictor variables

For this model, it is known that SIR will miss the direction along the second variable

If we use a second-order polynomial kernel
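As a concrete illustration of such a kernel (a hypothetical sketch, since the paper's exact kernel expression is elided), the second-order polynomial kernel `(1 + <x, z>)^2` corresponds to an explicit feature map consisting of all monomials of degree at most two with suitable weights:

```python
import numpy as np

def poly2_kernel(x, z):
    """Second-order polynomial kernel k(x, z) = (1 + <x, z>)^2."""
    return (1.0 + x @ z) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product reproduces the kernel:
    constant, linear, square, and cross terms with sqrt(2) weights."""
    p = len(x)
    feats = [1.0]
    feats += [np.sqrt(2.0) * xi for xi in x]          # linear terms
    feats += [xi * xi for xi in x]                    # squared terms
    for i in range(p):
        for j in range(i + 1, p):
            feats.append(np.sqrt(2.0) * x[i] * x[j])  # cross terms
    return np.array(feats)
```

The agreement of the two computations is exactly the kernel trick: inner products in the feature space can be evaluated without ever forming the features.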

We drew

Error in e.d.r. as a function of

When SIR is applied to classification problems, it is equivalent to a Fisher discriminant analysis. For the case of multiclass classification, it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [
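Treating each class as a slice makes the slice structure trivial to construct from the labels. A minimal sketch (the helper name `class_slices` is our own illustrative choice):

```python
import numpy as np

def class_slices(labels):
    """Slice-averaging matrix for classification: one slice per class.
    J[i, j] = 1/n_c when samples i and j share class c, else 0."""
    labels = np.asarray(labels)
    n = len(labels)
    J = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        J[np.ix_(idx, idx)] = 1.0 / len(idx)
    return J
```

Multiplying a centred data (or Gram) matrix by `J` replaces each sample by its class mean, which is the inverse-regression step for classification.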

The MNIST data set (Y. LeCun,

We compared regularized SIR (RSIR) as in (

The mean and standard deviation of the classification accuracy over

Mean and standard deviations for error rates in classification of digits.

Digit | RKSIR | KSIR | RSIR | kNN
---|---|---|---|---
0 | 0.0273 (0.0089) | 0.0472 (0.0191) | 0.0487 (0.0128) | 0.0291 (0.0071)
1 | 0.0150 (0.0049) | 0.0177 (0.0051) | 0.0292 (0.0113) | 0.0052 (0.0012)
2 | 0.1039 (0.0207) | 0.1475 (0.0497) | 0.1921 (0.0238) | 0.2008 (0.0186)
3 | 0.0845 (0.0208) | 0.1279 (0.0494) | 0.1723 (0.0283) | 0.1092 (0.0130)
4 | 0.0784 (0.0240) | 0.1044 (0.0461) | 0.1327 (0.0327) | 0.1617 (0.0213)
5 | 0.0877 (0.0209) | 0.1327 (0.0540) | 0.2146 (0.0294) | 0.1419 (0.0193)
6 | 0.0472 (0.0108) | 0.0804 (0.0383) | 0.0816 (0.0172) | 0.0446 (0.0081)
7 | 0.0887 (0.0169) | 0.1119 (0.0357) | 0.1354 (0.0172) | 0.1140 (0.0125)
8 | 0.0981 (0.0259) | 0.1490 (0.0699) | 0.1981 (0.0286) | 0.1140 (0.0156)
9 | 0.0774 (0.0251) | 0.1095 (0.0398) | 0.1533 (0.0212) | 0.2006 (0.0153)
Average | 0.0708 (0.0105) | 0.1016 (0.0190) | 0.1358 (0.0093) | 0.1177 (0.0039)

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework, so the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches, resulting in a robust, response-driven nonlinear dimension reduction method.

RKHS has also been introduced into supervised dimension reduction in [

There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions is expected to reduce the computational burden of procedures based on cross validation. While these issues are well understood for linear dimension reduction, little is known in the nonlinear setting. We leave them for future research.

There are some interesting connections between KSIR and functional SIR, which are developed by Ferré and his coauthors in a series of papers [

Under the assumption of Proposition

Since

Recall that for any

Since for each

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where

Let

We need to show the KSIR variates

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define

The above formulation of

In order to prove Theorems

Given a separable Hilbert space

Given a bounded operator

Let

A well-known result from perturbation theory for linear operators states that if a set of linear operators

We will use the following result from [

For the first term, observe that

Since

If all the e.d.r. directions

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM 081883). Any opinions, findings and conclusions, or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.