A novel semisupervised dimensionality reduction method named Semisupervised Tangent Space Discriminant Analysis (STSD) is presented, where we assume that data can be well characterized by a linear function on the underlying manifold. For this purpose, a new regularizer using tangent spaces is developed, which not only captures the local manifold structure from both labeled and unlabeled data, but is also complementary to the Laplacian regularizer. Furthermore, STSD has an analytic form of the globally optimal solution, which can be computed by solving a generalized eigenvalue problem. To perform nonlinear dimensionality reduction and process structured data, a kernel extension of our method is also presented. Experimental results on multiple real-world data sets demonstrate the effectiveness of the proposed method.
Dimensionality reduction aims to find a low-dimensional representation of high-dimensional data while preserving as much data information as possible. Processing data in the low-dimensional space can reduce computational cost and suppress noise. Provided that dimensionality reduction is performed appropriately, the discovered low-dimensional representation will benefit subsequent tasks, for example, classification, clustering, and data visualization. Classical dimensionality reduction methods include supervised approaches like linear discriminant analysis (LDA) [
LDA is a supervised dimensionality reduction method. It finds a subspace in which the data points from different classes are projected far away from each other, while the data points belonging to the same class are projected as close together as possible. One merit of LDA is that it can extract the discriminative information of data, which is crucial for classification. Due to its effectiveness, LDA is widely used in many applications, for example, bankruptcy prediction, face recognition, and data mining. However, LDA may produce undesirable results when the labeled examples available for learning are insufficient, because the between-class scatter and the within-class scatter of the data could then be estimated inaccurately.
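As a concrete illustration of the scatter matrices just mentioned, the following is a minimal NumPy sketch of Fisher's LDA on synthetic two-class data (an illustrative stand-in, not the paper's implementation; the function name and the toy data are hypothetical):

```python
import numpy as np

def lda_directions(X, y, n_components=1):
    """Fisher LDA: find directions maximizing between-class scatter
    relative to within-class scatter, i.e. S_b w = lambda S_w w."""
    classes = np.unique(y)
    mean_total = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_total).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)
    # Solve S_w^{-1} S_b; a tiny ridge keeps S_w well conditioned
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w + 1e-8 * np.eye(d), S_b))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_components]].real

# Two well-separated Gaussian classes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([3, 3], 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = lda_directions(X, y)   # (2, 1) projection direction
z = X @ w                  # 1-D projection separating the two classes
```

With ample labeled data per class the two projected clusters are far apart relative to their spread; with very few labeled points the estimated scatter matrices, and hence the direction, become unreliable, which is the weakness noted above.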
PCA is a representative unsupervised dimensionality reduction method. It seeks a set of orthogonal projection directions along which the sum of the variances of the data is maximized. PCA is a common data preprocessing technique for finding a low-dimensional representation of high-dimensional data. In order to meet the requirements of different applications, many unsupervised dimensionality reduction methods have been proposed, such as Laplacian Eigenmaps [
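The variance-maximization view of PCA can be sketched in a few lines of NumPy (a minimal illustration via eigendecomposition of the sample covariance; the function name and synthetic data are illustrative):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / (len(X) - 1)            # sample covariance
    evals, evecs = np.linalg.eigh(C)        # ascending eigenvalues
    order = np.argsort(evals)[::-1]         # sort by variance, descending
    return evecs[:, order[:n_components]], evals[order[:n_components]]

rng = np.random.default_rng(0)
# Data stretched along the first axis: the top principal
# direction should align with that axis
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
W, var = pca(X, 2)                # W: (3, 2) projection; var: top-2 variances
Z = (X - X.mean(axis=0)) @ W      # low-dimensional representation
```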
In many real-world applications, only limited labeled data can be accessed, while a large amount of unlabeled data is available. In this case, it is reasonable to perform semisupervised learning, which can utilize both labeled and unlabeled data. Recently, several semisupervised dimensionality reduction methods have been proposed, for example, Semisupervised Discriminant Analysis (SDA) [
Although all of these methods have their own advantages in semisupervised learning, the essential strategy many of them use to exploit unlabeled data relies on Laplacian regularization. In this paper, we present a novel method named Semisupervised Tangent Space Discriminant Analysis (STSD) for semisupervised dimensionality reduction, which can reflect the discriminant information and a specific manifold structure from both labeled and unlabeled data. Instead of adopting a Laplacian-based regularizer, we develop a new regularization term which can discover the linearity of the local manifold structure of the data. Specifically, by introducing tangent spaces we represent the local geometry at each data point as a linear function and require that these functions change as smoothly as possible. This means that STSD favors a linear function on the manifold. In addition, the objective function of STSD can be optimized analytically by solving a generalized eigenvalue problem.
Consider a data set consisting of
Define the total scatter matrix as
In practice, we usually impose a regularizer on (
As a supervised method, LDA has no ability to extract information from unlabeled data. Motivated by Tangent Space Intrinsic Manifold Regularization (TSIMR) [
TSIMR [
Consider a transformation
Substituting
Armed with the above results, we can formulate our regularizer for semisupervised dimensionality reduction. Consider data
Relating data with a discrete weighted graph is a popular choice, and there is indeed a large family of graph-based statistical and machine learning methods. It also makes sense for us to generalize the regularizer
Therefore, the generalization of the proposed regularizer turns out to be
The regularizer (
It should be noted that, besides following the same principle as TSIMR, the regularizer (
With the regularizer developed in Section
The optimization of the objective function (
In many applications, especially when the dimensionality of data is high while the data size is small, the matrix
Tradeoff parameters;
(1) Construct the adjacency graph;
(2) Calculate the weight matrix;
(3) Construct the required matrices;
(4) Compute the eigenvectors corresponding to the leading eigenvalues;
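The final step, solving a generalized eigenvalue problem, can be sketched as follows (a minimal illustration using `scipy.linalg.eigh`, with toy symmetric matrices standing in for STSD's scatter and regularizer matrices; the function name is hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

def top_generalized_eigvecs(A, B, k):
    """Solve A w = lambda B w for symmetric A and positive-definite B,
    returning the k eigenvectors with the largest eigenvalues."""
    evals, evecs = eigh(A, B)                       # ascending eigenvalues
    return evecs[:, np.argsort(evals)[::-1][:k]]    # top-k eigenvectors

# Toy symmetric matrices standing in for the matrices built in step (3)
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T                              # symmetric PSD "numerator" matrix
B = np.eye(5) + 0.1 * np.ones((5, 5))    # well-conditioned "denominator"
W = top_generalized_eigvecs(A, B, 2)     # (5, 2) projection matrix
```

The columns of `W` would play the role of the transformation that maps data into the low-dimensional space.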
The main computational cost of STSD lies in building tangent spaces for
However, given a neighborhood size
In summary, the overall runtime of STSD is
Essentially, STSD is a linear dimensionality reduction method, which cannot be used for nonlinear dimensionality reduction or for processing structured data such as graphs, trees, or other types of structured inputs. To handle this problem, we extend STSD to a Reproducing Kernel Hilbert Space (RKHS).
Suppose examples
Let
Recall that STSD aims to find a set of transformations to map data into a lowdimensional space. Given examples
Let
Given the eigenvectors
In order to illustrate the behavior of STSD, we first run STSD on a toy data set (two moons) and compare it with PCA and LDA. The toy data set contains 100 data points and is used under different label configurations. Specifically, 6, 10, 50, and 80 data points are randomly labeled, respectively, and the rest are unlabeled; PCA is trained on all the data points without labels, LDA is trained on the labeled data only, and STSD is trained on both the labeled and unlabeled data. In Figure
Illustrative examples of STSD, LDA, and PCA on the two-moon data set under different label configurations (6, 10, 50, and 80 labeled points). The circles and squares denote the data points in the positive and negative classes, and the filled or unfilled symbols denote the labeled or unlabeled data, respectively.
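A setup like this two-moons experiment can be reproduced as follows (a sketch using scikit-learn's `make_moons` with PCA and LDA only; STSD itself is not included, and the labeled-subset selection is illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 100 two-moons points; only 10 of them (5 per class) are treated as labeled
X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
labeled = np.r_[np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]]

# PCA ignores labels entirely; LDA sees only the labeled subset
z_pca = PCA(n_components=1).fit_transform(X)
z_lda = LinearDiscriminantAnalysis(n_components=1).fit(
    X[labeled], y[labeled]).transform(X)
```

Plotting the 1-D projections against the true classes makes the contrast visible: with few labels, directions estimated from the labeled subset alone can be unstable, which is the gap a semisupervised method aims to close.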
In this section, we evaluate STSD on real-world data sets. Specifically, we first perform dimensionality reduction to map all examples into a subspace and then carry out classification using the nearest neighbor classifier (1-NN) in that subspace. This protocol for evaluating semisupervised dimensionality reduction methods is widely used in the literature, such as [
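The reduce-then-classify protocol can be sketched with any off-the-shelf reducer (here PCA on the scikit-learn digits data serves as an illustrative stand-in for the compared methods; the split and dimensionality are arbitrary choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Protocol: reduce dimensionality first, then classify with 1-NN
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

reducer = PCA(n_components=20).fit(X_train)   # stand-in for any DR method
clf = KNeighborsClassifier(n_neighbors=1).fit(
    reducer.transform(X_train), y_train)
error = 1 - clf.score(reducer.transform(X_test), y_test)   # test error rate
```

Swapping `reducer` for another method while keeping the 1-NN step fixed isolates the contribution of the learned subspace, which is exactly what the error-rate tables below measure.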
In our experiments, we compare STSD with multiple dimensionality reduction methods including PCA, LDA, SELF, and SDA, where LDA is performed only on the labeled data, while PCA, SELF, SDA, and STSD are performed on both the labeled and unlabeled data. In addition, we also compare our method with a baseline that simply applies the 1-NN classifier to the labeled data in the original space. Since the performance of PCA and SELF depends on the dimensionality of the embedding subspace discovered by each method, we report the best results for them.
For the graph-based methods, including SELF, SDA, and STSD, the number of nearest neighbors for constructing adjacency graphs is determined by fourfold cross-validation. The parameters
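The fourfold cross-validation machinery used for such hyperparameter selection can be sketched as follows (illustrated on a 1-NN classifier's neighbor count rather than on the adjacency-graph construction itself; the candidate grid is arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Four-fold cross-validation over candidate neighborhood sizes
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=4).mean()
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)   # size with the best mean CV accuracy
```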
Two types of data sets under different label configurations are used in our experiments. One type consists of high-dimensional face images, and the other comprises low-dimensional UCI data sets. For convenience of description, we name each experimental configuration as "Data Set" + "Labeled Data Size." For example, for the experiments with the face images, "Yale 3" means the experiment is performed on the Yale data set with 3 labeled examples per class. Analogously, for the experiments with the UCI data sets, "BCWD 20" means the experiment is performed on the Breast Cancer Wisconsin (Diagnostic) data set with a total of 20 labeled examples from all classes.
It is well known that high-dimensional data such as images and texts are supposed to live on or near a low-dimensional manifold. In this section, we test our algorithm on the Yale and ORL face data sets, which are deemed to satisfy this manifold assumption. The Yale data set contains 165 images of 15 individuals, with 11 images per subject. The images have different facial expressions, illuminations, and facial details (with or without glasses). The ORL data set contains 400 images of 40 distinct subjects under varying expressions and illuminations. In our experiments, every face image is cropped to consist of
Mean values and standard deviations of the unlabeled error rates (%) with different label configurations on the face data sets.
Method  Yale 3  Yale 4  ORL 2  ORL 3
Baseline  49.50 ± 4.86  43.93 ± 4.71  30.31 ± 3.11  21.13 ± 2.29
PCA  47.67 ± 4.40  42.60 ± 5.05  29.23 ± 2.56  20.30 ± 2.22
LDA  32.56 ± 3.85  25.60 ± 2.98  17.17 ± 3.23  8.05 ± 2.51
SELF  54.22 ± 3.88  52.07 ± 4.67  48.79 ± 4.39  37.48 ± 2.81
SDA  32.33 ± 4.11  25.93 ± 3.22  16.67 ± 3.36  7.85 ± 2.48
STSD  –  –  –  –
Mean values and standard deviations of the test error rates (%) with different label configurations on the face data sets.
Method  Yale 3  Yale 4  ORL 2  ORL 3
Baseline  46.17 ± 7.67  46.67 ± 8.65  29.94 ± 3.66  19.19 ± 3.50
PCA  40.67 ± 8.06  42.00 ± 7.29  28.06 ± 3.92  18.13 ± 3.71
LDA  32.33 ± 8.31  26.17 ± 7.74  16.56 ± 3.97  9.13 ± 3.63
SELF  50.00 ± 6.49  49.33 ± 8.28  47.88 ± 4.82  35.56 ± 3.52
SDA  32.00 ± 8.40  26.17 ± 7.67  16.13 ± 4.05  9.00 ± 3.33
STSD  –  –  –  –
In this set of experiments, we use three UCI data sets [
From the results reported in Tables
Mean values and standard deviations of the unlabeled error rates (%) with different label configurations on the UCI data sets.
Method  BCWD 10  BCWD 30  CMSC 10  CMSC 30  CTG 20  CTG 160
Baseline  11.90 ± 4.04  10.22 ± 3.21  14.39 ± 6.40  14.05 ± 1.90  63.71 ± 3.73  47.91 ± 1.73
PCA  11.87 ± 4.01  10.21 ± 3.27  11.86 ± 2.51  13.43 ± 2.40  63.74 ± 3.75  47.89 ± 1.66
LDA  20.34 ± 8.76  9.61 ± 2.76  13.18 ± 4.49  14.21 ± 3.28  67.28 ± 6.32  41.60 ± 2.65
SELF  13.43 ± 3.63  14.10 ± 4.20  10.06 ± 3.30  11.88 ± 2.53  67.00 ± 4.50  44.09 ± 2.66
SDA  10.10 ± 3.26  7.12 ± 2.17  9.06 ± 0.97  8.71 ± 0.78  58.27 ± 5.01  41.91 ± 2.17
STSD  –  –  –  –  –  –
Mean values and standard deviations of the test error rates (%) with different label configurations on the UCI data sets.
Method  BCWD 10  BCWD 30  CMSC 10  CMSC 30  CTG 20  CTG 160
Baseline  12.75 ± 6.56  10.65 ± 4.20  14.63 ± 8.48  12.81 ± 4.37  64.15 ± 4.74  48.76 ± 2.44
PCA  12.75 ± 6.56  10.50 ± 4.16  8.75 ± 1.90  9.13 ± 2.66  64.07 ± 4.76  48.66 ± 2.41
LDA  20.60 ± 10.34  10.85 ± 5.06  13.19 ± 5.66  15.13 ± 5.35  67.47 ± 7.27  41.95 ± 3.43
SELF  14.65 ± 6.88  13.70 ± 4.37  9.06 ± 2.66  8.69 ± 1.84  67.02 ± 5.06  43.34 ± 2.81
SDA  –  8.75 ± 3.09  8.69 ± 3.15  8.06 ± 1.70  58.72 ± 4.26  41.67 ± 3.11
STSD  10.15 ± 4.55  –  –  –  –  –
Notice that the error rates of several dimensionality reduction methods over the CMSC data set do not improve with the increasing size of labeled data. The reason may be that the data in the CMSC data set contain some irrelevant features as reflected by the original data description [
It should be noted that, overall, the experiments are conducted on 5 data sets, and in terms of the results across all of the data sets STSD is likely to beat the other methods according to a sign test's
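For intuition about the sign test on 5 data sets: if one method wins on all of them, the one-sided p-value under the null hypothesis that wins are coin flips is 0.5^5 ≈ 0.031. The win count below is purely illustrative, not the paper's exact tally:

```python
from scipy.stats import binomtest

# Sign test: one-sided p-value for n_wins wins out of n_datasets trials
# under a fair-coin null (each method equally likely to win a data set)
n_datasets, n_wins = 5, 5   # illustrative counts
p = binomtest(n_wins, n_datasets, 0.5, alternative="greater").pvalue
```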
Essentially, both STSD and SDA are regularized LDA methods with specific regularizers. STSD imposes the regularizer (
Although the previous experiments have shown that STSD obtains better results than SDA in most situations, SDA can achieve results similar to STSD in some configurations. However, this does not mean that STSD and SDA are similar or, in other words,
Note that, given a graph, the performance of STSLap can ideally be at least identical to that of SDA or STSD, because STSLap degenerates to SDA or STSD when the parameter
Tables
Mean values and standard deviations of the unlabeled error rates (%) with mediumsized labeled data on different data sets.
Method  BCWD 30  CMSC 30  CTG 160  Yale 3  ORL 2
SDA  –  9.60 ± 2.27  41.97 ± 2.72  –  20.81 ± 2.76
STSD  6.96 ± 2.45  –  43.47 ± 2.83  32.56 ± 6.67  16.48 ± 2.14
STSLap  7.07 ± 2.46  9.60 ± 2.24  –  33.39 ± 7.01  –
Mean values and standard deviations of the test error rates (%) with mediumsized labeled data on different data sets.
Method  BCWD 30  CMSC 30  CTG 160  Yale 3  ORL 2
SDA  6.90 ± 2.86  9.56 ± 3.28  41.85 ± 3.23  33.33 ± 5.92  20.63 ± 5.98
STSD  6.70 ± 2.81  9.44 ± 3.45  42.47 ± 3.57  33.00 ± 6.20  14.81 ± 4.20
STSLap  –  –  –  –  –
STSD is a semisupervised dimensionality reduction method under a certain manifold assumption. More specifically, we assume that the distribution of data can be well approximated by a linear function on the underlying manifold. One related method named SDA [
Rather than constructing an appropriate regularizer on a given graph, SSDA [
For the manifold-related learning problem considered in STSD, estimating bases for the tangent spaces is an important step. In this paper, we use local PCA with a fixed neighborhood size to calculate the tangent spaces, and the neighborhood size is set to be the same as the one used to construct the adjacency graph. This is certainly not the optimal choice, since manifolds can have varying curvature and data could be nonuniformly sampled. Note that the neighborhood size determines the evolution of the calculated tangent spaces along the manifold. When a small neighborhood size
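The local PCA step for estimating a tangent-space basis can be sketched as follows (a minimal NumPy illustration on points sampled from a circle, where the true tangent is known; the function and parameter names are hypothetical):

```python
import numpy as np

def tangent_space(X, i, k, d):
    """Estimate an orthonormal basis of the d-dimensional tangent space
    at point X[i] by local PCA over its k nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(dists)[:k + 1]          # includes the point itself
    local = X[nbrs] - X[nbrs].mean(axis=0)    # center the neighborhood
    # Right singular vectors of the centered block = local PCA directions
    _, _, Vt = np.linalg.svd(local, full_matrices=False)
    return Vt[:d].T                           # (ambient_dim, d) basis

# Points on the unit circle: the 1-D tangent at (1, 0) is the y-axis
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
T = tangent_space(X, 0, k=6, d=1)   # estimated tangent basis at (1, 0)
```

Varying `k` in this sketch makes the trade-off mentioned above tangible: a small neighborhood tracks curvature but is noise-sensitive, while a large one smooths over the manifold's bends.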
In our method, each example in the data matrix can be treated as an anchor point, where local PCA is used to calculate the tangent space. The number of parameters to be estimated in our method basically grows linearly with the number of anchor points. Therefore, in order to reduce the parameters to be estimated, one possible approach is to reduce the anchor points so that only "key" examples are kept as anchor points. This amounts to a form of data set sparsification, and different criteria can be devised to decide whether or not an example should be regarded as a "key" one.
Research on anchor point reduction is especially useful when the training data are large-scale, as it promises to speed up the training process. In addition, data can exhibit different manifold dimensions in different regions, especially for complex data. Therefore, adaptively determining the dimensionality at different anchor points is also an important refinement of the current approach.
In this paper, we have proposed a novel semisupervised dimensionality reduction method named Semisupervised Tangent Space Discriminant Analysis (STSD), which can extract the discriminant information as well as the manifold structure from both labeled and unlabeled data, where a linear function assumption on the manifold is exploited. Local PCA is involved as an important step to estimate tangent spaces, and certain relationships between adjacent tangent spaces are derived to reflect the adopted model assumption. The optimization of STSD is readily achieved by an eigenvalue decomposition.
Experimental results on multiple real-world data sets, including comparisons with related methods, have shown the effectiveness of the proposed method. Furthermore, the complementarity between our method and the Laplacian regularization has also been verified. Future work directions include finding more accurate methods for tangent space estimation and extending our method to different learning scenarios such as multiview learning and transfer learning.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China under Project 61370175 and Shanghai Knowledge Service Platform Project (no. ZF1213).