Dimensionality Reduction by Supervised Neighbor Embedding Using Laplacian Search

Dimensionality reduction is an important issue for numerous applications including biomedical images analysis and living system analysis. Neighbor embedding, those representing the global and local structure as well as dealing with multiple manifolds, such as the elastic embedding techniques, can go beyond traditional dimensionality reduction methods and find better optima. Nevertheless, existing neighbor embedding algorithms can not be directly applied in classification as suffering from several problems: (1) high computational complexity, (2) nonparametric mappings, and (3) lack of class labels information. We propose a supervised neighbor embedding called discriminative elastic embedding (DEE) which integrates linear projection matrix and class labels into the final objective function. In addition, we present the Laplacian search direction for fast convergence. DEE is evaluated in three aspects: embedding visualization, training efficiency, and classification performance. Experimental results on several benchmark databases present that the proposed DEE exhibits a supervised dimensionality reduction approach which not only has strong pattern revealing capability, but also brings computational advantages over standard gradient based methods.


Introduction
The classification of high-dimensional data, such as biological characteristic sequences, high-definite images, and gene expressions, remains a difficult task [1]. The main challenges that these high-dimensional data pose include inferior outcome performance, tremendous storage requirements, and high computational complexity. Dimensionality reduction (DR) [2], as the core research topic in subspace learning community, is the well-acknowledged solution for this curse of dimensionality problem. For the classification tasks, the goal of DR focuses on constructing a lowdimensional representation of data in order to achieve better discrimination and accelerate the subsequent processing. In this realm, very straightforward algorithms are dominated, as the computational complexity of advanced DR techniques proposed is too high.
Fisher discriminant analysis (FDA) [3] and its variants [4,5], which incorporate the class labels information and aim at the preservation of classification accuracy in the embedded subspace, are the mostly adopted DR techniques. FDA amplifies the between-class scatter and simultaneously shrinks the within-class scatter in subspace for the purpose of desirable separability. Recently, LFDA [6], MMDA [7], DCV [8], and MMPA [9] markedly improve the performance of FDA by solving different kinds of existing thorny issues. LFDA adds the locality preservation property to the Fisher criterion, which preserves the multimodal structure. MMDA presents a novel criterion that straightly maximizes the minimum pairwise distances of the whole classes for better between-class separability. DCV circumvents the "small sample size" problem by using two different methods, the withinclass scatter matrix and the Gram-Schmidt orthogonalization procedure, to obtain the discriminative common vectors. MMPA takes into account both intraclass and interclass geometries and also possesses the orthogonality property for the projection matrix. Broadly speaking, all these methods have a unique solution computed by a generalized eigensolver and exhibit acceptable performance on most data, but they may be suboptimal for the data with nonuniform density or multiple manifolds.
To deal with more complex structural data, especially in biomedical applications [10][11][12], a batch of novel DR techniques [13][14][15][16][17][18][19] based on stochastic neighbor embedding (SNE) [13] absorbs a lot of interest. In contrast with the FDA-type techniques, those consider only the original highdimensional space for constructing the objective function. SNE matches similarities which are achieved both from the high-dimensional and low-dimensional spaces. Furthermore, -SNE [14] extends SNE with symmetric similarities and by using student's -distribution in low-dimensional space. Symmetric SNE [15] explains why the heavy-tailed distribution and the symmetric similarity form in -SNE lead to better performance. NeRV [16] uses the "dual" Kullback-Leibler (KL) divergence for well-content DR results in information retrieval perspective. Lee et al. [17] adopted a scaled version of the generalized Jensen-Shannon divergence that better preserves small K-ary neighborhoods. Bunte et al. [18] analyzed the visualization performance of SNE with arbitrary divergences and claimed that KL divergence is the most acceptable. In terms of visualization results, all these techniques outperform most of the past techniques. However, the reasons of this well behavior remain obscure. Lee and Verleysen [20] investigated the role played by the specific similarities and identified that appropriate normalization with property of shift invariance is the main cause of the admirable performance. However, Carreira-Perpiñán [21] revealed the fundamental relation between SNE and the Laplacian eigenmaps [22] method and proposed a new DR method, the elastic embedding (EE), that can be seen as learning both the coordinates and the affinities between data points without the shift invariance property.
EE is more efficient and robust than SNE, -SNE, and NeRV. Even so, it cannot be directly applied in classification tasks because of the unideal discrimination ability, the outof-sample problem, and the high computational complexity in some large-scale classification tasks [23,24]. Many researchers have been dedicated to solve these drawbacks. Venna et al. [16] proposed supervised NeRV with better discrimination capability by complex locally linear functions. Bunte et al. [25] presented a general framework for a variety of nonparametric DR techniques and then extended them to parametric mapping by means of optimization. Gisbrecht et al. [26] used only a fraction of whole samples for training the DR model in interactive settings. Yang et al. [27] and Maaten [28] simultaneously adopted the Barnes-Hut tree and proposed a generic approximated optimization technique which reduces the neighbor embedding optimization cost from ( 2 ) to ( log ).
Inspired by these works, we proposed a linear supervised DR technique called discriminative EE (DEE) for classification. To be specific, the linear projection matrix is introduced to solve the out-of-sample problem similarly as in [29]. The class labels information is involved in the construction of objective function as MMPA. We search for a reasonable direction in the iterative processing to solve our model by gradient-based method. The remainder of this paper is structured as follows. Section 2 provides a brief view of related works. Section 3 describes the objective function and the search direction of our proposed DEE. Section 4 gathers the experimental results. Finally, Section 5 draws the conclusions and sketches some future works.

Fundamental Contributions
Even though there were numerous previously studied algorithms in the context of embedding high-dimensional data for visualization or classification, we focus here only on a few approaches that were recently proposed and that we will use to compare our evaluations against them. The involved techniques include the elastic embedding (EE), the discriminative stochastic neighbor embedding (DSNE) [30], and the min-max projection analysis (MMPA). Let x ∈ ( = 1, 2, . . . , ) be -dimensional samples, ∈ (1, 2, . . . , ) be corresponding class labels and X = {x 1 , x 2 , . . . , x } be the matrix of all samples, where is samples size and is the classes size. The nonlinear embedding approaches proceed to look for the subspace matrix Y × , whose column vectors y ∈ are coordinates of pixel maps, where d is the subspace dimension. On the other hand, the goal of the usual linear embedding techniques is to learn a projection matrix A × , which is further used to compute the embedding coordinate Y = AX.

The Elastic
Embedding. The elastic embedding (EE) technique, which is nonlinear and unsupervised, is an extension of SNE-type algorithm. The objective function of EE is defined as where (1), called as attractive term, preserves local distances as the Laplacian eigenmaps [22]. The right (−) term in (1), called as repulsive term, preserves global distances in a plainer way more than the traditional SNE algorithm. is a tunable parameter for trading off both the attractive and the repulsive terms.
The gradient of (Y) in (1) is then computed as where the authors defined the affinities and the graph Laplacians L = D − W in the common way.
) is the degree matrix. After the gradient is obtained, EE uses the fixed-point (FP) diagonal iteration Computational and Mathematical Methods in Medicine 3 scheme to achieve global and fast convergence. First, the gradient is split as Both the objective function and the gradient for EE are intuitively clearer and less nonlinear than other SNE-type algorithms since EE avoids the cumbrous log-sum term. Moreover, the FP strategy results in fewer local optima. However, EE is still nonlinear, so the embedding for the out-of-sample input is inefficient. Furthermore, EE neglects the use of the class labels, which makes EE suboptimal for classification.

Discriminative Stochastic Neighbor
Embedding. The discriminative stochastic neighbor embedding (DSNE) technique, which is linear and supervised, is an extension of -SNE algorithm. For each input data x and each potential neighbor x within the same class or not, the probability that x selects x as its neighbor is where is a regularization parameter which is set manually. For the embedded samples Y = AX, a heavy-tailed Student's -distribution with one degree of freedom for neighbors is made, so that the induced embedded probability that y selects y as its intraclass or interclass neighbors is The aim of DSNE is to place close together intraclass samples and place far apart interclass samples. This is achieved by minimizing the objective function, which is the sum of KL divergences between and with consideration of the class labels The gradient of (A) in (7) can be obtained as where L intra and L inter are the Laplacian matrices for intraclass samples and interclass samples, respectively. DSNE introduces explicit projection matrix and labels information to make it suitable for classification tasks. However, the cumbersome log-sum term and the tedious conjugate gradient training method make DSNE converge slowly.

Min-Max Projection
Analysis. The min-max projection analysis (MMPA) is another recently proposed linear supervised dimension reduction technique. MMPA combines the main advantages of block optimization and whole alignment strategy [31]. It also incorporates a desirable property from marginal Fisher analysis [32], that is, pulling together the far apart within class neighbors as close as possible, as well as placing away the neighbors having different labels as far as possible. The combination of these properties leads to a technique that is qualified for online input stream data and has desirable discrimination capability. The projection matrix A derived from MMPA is a result of solving the following objective function: By resolving the generalized eigenvalue problem in (9), MMPA gets a closed form solution without any iteration process, which is closely related to the classical dimension reduction algorithms such as DCV and LFDA. All these techniques present efficient computation cost. However, they always present crowding problem that leads to cluttered subspace visualization.

Discriminative Elastic Embedding
In this section, we depict the DEE technique that focuses on exploring an explicit mapping, presenting a large disjoint interclass region and achieving a faster convergence. We begin with an introduction of the embedding formulation.
3.1. Formulation. As mentioned in Section 2, the eigenmaptype algorithms such as MMPA adopt simple affinity function for constructing direct generalization eigenvalue problems, which is sensitive to noise and results in crowded embedded subspace. DSNE can go beyond the spectral techniques and find better optima, exhibiting large gaps between different classes as well as dealing with multiple manifolds. However, the optimization of DSNE is costly and apt to local optima. In addition, our understanding of these SNE-type algorithms is limited to an intuitive interpretation of their cost function. EE furthers our understanding by deriving the explicit relation between SNE and Laplacian eigenmaps. Moreover, EE adopts the simpler unnormalized model for more efficient and robust optimization. The objective function of EE is formed by a local distance term and a global distance term to represent better global and local embedding structure. However, the purpose of this paper is to enlarge the disjoint region for the different classes. We resolve this problem by introducing the class labels to both the attractive affinity weights and the repulsive affinity weights similar to MMPA: In addition, we adopt the explicit projection matrix A to make EE linear and avoid the out-of-sample problem. That is, in (1), we replace y as Ax to make it become Equation (12) is chosen here as our objective function. We call the model as discriminative elastic embedding (DEE). In next sections, we will present the optimization strategy for the optimal projection matrix A.

The Fixed-Point Search
Direction. The cost function in (12) characterizes the desired embedding: objects of intraclass samples are encouraged to embed nearby, but objects of interclass samples are encouraged to embed far away. However, this equation is nonconvex, and there is no closedform solution. Some gradient based methods such as gradient descent, conjugate gradients, and multiplicative updates [33] are used for the existing SNE-type algorithms. These are all reported as very slow and with tiny steps. The fixedpoint iteration strategy used in EE works much better, so we introduce this FP method into our DEE technique in this section. The gradient of DEE is obtained as where we replace x = x − x and = + − − exp(−‖y − y ‖ 2 ) for brevity. From the stationary point equation of the gradient (13), we can split / A into two parts as where D + is symmetric, positive, and definite and (L − D + ) is symmetric. Then, we can get the FP iteration scheme A ← AX(D + − L)X (XD + X ) −1 , which further implies the FP The objective function will be decreased and converged to a stationary point along with the FP direction by a line search A ← A + Δ FP satisfying the Wolfe conditions for > 0 [34]. The main cost of each FP iteration is dominated by the gradient equation (13) which is (2 + 2 ).

The Laplacian Search Direction.
Our goal in this section is to present a search direction that can lead to fast and global convergence. There are two common ways for achieving this objective. One is to speed up the computation in the gradient based iteration scheme. The other is to achieve the optimal objective value with as a few iterations as possible. The intuitive method for speeding up the computation is to reduce the samples size. This obvious approach of subsampling always produces inferior results. In [27,28], the authors simultaneously adopted the Barns-Hut tree to approximate the SNE-type gradients, which leads to substantial computational advantages over existing neighbor embedding techniques. However, this Barns-Hut tree strategy requires sufficient training samples for maintaining preferable performance. Moreover, the Barns-Hut tree based neighbor embedding methods can only be applied for embedding data in two or three dimensions subject to the tree size. In conclusion, we turn to explore the best search direction for less iteration. From the numerical optimization theory [34], we repeat the line search method as where Δ is the chosen search directions, g is the gradient of the objective function, H is a positive definite matrix ensuring the descent direction Δ g < 0, and > 0 satisfies the Wolfe conditions. Our purpose here is to select a desirable matrix H ranges from I to the Hessian matrix obtained as where I is the × identity matrix. When H is selected as the identity matrix I, the optimization is refined as gradient descent, which is very slow for convergence. At the other extreme, when the Hessian is used, the optimization is termed as Newton's method, which consumes too much computation each iteration. Our intuitive principle is to employ as much Hessian info as possible that leads to an efficient solution of Since L − is closely related to the variable A, the second part and the third part of (17) need recomputation each iteration. The first part is constant and it only needs be computed once in the first iteration. Moreover, since the entries in W + are symmetric and nonnegative from (10), the term XL + X is symmetric, positive, and semidefinite, and we can add a small I to it for achieving a positive definite matrix. In conclusion, the attractive Hessian 4(XL + X )I constructs our final search direction which is the desirable compromise of fast convergence and efficient calculation. Since this direction is mainly related to the attractive Laplacian L + , we call it as the Laplacian direction (LD). Note that to avoid the direct calculation of H Δ = −g which costs ( 3 × ) we can firstly achieve the Cholesky factor R of H and then use two backsolves R (RΔ ) = −g for the Laplacian direction Δ . The cost of Cholesky decomposition is in ( 3 /3) and it needs only to be computed once. The cost of two backsolves is in ( 2 ). We find the Laplacian direction works surprisingly well with more less iteration times.

Experiments and Results
We evaluate the performance of the proposed algorithm in this section. First, four methods are compared for DEE: gradient descent (GD), used in SNE; conjugate gradients (CG), used in -SNE; fixed-point (FP), used in EE; and the Laplacian direction (LD), presented in this paper. Afterwards, we demonstrate the effectiveness of DEE in clustering visualization compared with some classical algorithms such as -SNE, DSNE, and EE. Finally, we present the experimental results on image classification. Four datasets are used for evaluation: the COIL20 images database, the ORL faces database, the Yale faces database, and the USPS handwritten digits database.

Datasets.
The COIL20 database contains 20 objects. The images of each object were taken 5 degrees apart as the object is rotated on a turntable and each object has 72 images. There are several parameter values that require the user to set for all the three implemented methods. Commonly, there is little clue on which parameter values are the most appropriate. This is the main reason why LD method, which has no parameters to be set, is our preferred choice. We set most of the parameters to be the same as [13,14,21]. The maximum iterations were set 1000 constantly and the ultimate convergence condition was set to be 1 − 3. For the first three datasets, COIL20, ORL, and Yale, we used all the samples as the input data. And for avoiding clutter, we randomly selected sixty samples for every class as the input data for the USPS handwritten digits dataset.
The visualization results are shown in Figures 2, 3, 4, and 5, where all the input data are projected into 2D space. The different colors stand for diverse classes. Figure 6 demonstrates the learning curves as a function of progressive iterations. It also states the elapsed time in seconds for a single model construction. From Figures 2-5, we can see that, with different training methods, DEE is useful for clustering diverse class data. However, the LD method is clearly more competitive than the other three methods. In Figures 2 and  5, the colored coordinates show that DEE with LD method accurately separates the underlying disjoint structure present in diverse class. However, the other three methods have more overlaps incurred between different classes. In Figures 3 and  4, although all the four methods exhibit clearly the disjoint factors between diverse classes, the within class coordinates for FP, CG, and GD are more scattered, which is suboptimal for classification. From Figure 6, it is clear that DEE with LD method can achieve more precise objective values with less iteration times. In decreasing efficiency, the four methods should be roughly ordered as LD > FP > CG and GD. The CG method needs the most iteration times to meet the convergence condition. However, the objective value of CG is a little more precise than GD's value. This also explains why the clustering results for CG are slightly more accurate than GD's in Figures 2-5. Mostly, FP is more efficient than CG and GD, and it costs less time for completing a DEE model construction. Nevertheless, the runtime in every iteration loop for FP is more than CG and GD. So the construction time for FP is slower than GD in the COIL20 dataset, where these two methods spend close iteration times. LD not only achieves the most precise objective values, but also requires the least runtime. Take ORL dataset as example; LD needs only 13 iterations to obtain the optimal objective value, but FP needs about 390 iterations for the same convergence condition. So, LD costs about 1 second to construct the DEE model, which is about 38 times faster than FP, the second efficient method in the order. This result adheres to the theoretical analysis in Section 2 that the spectral direction is more useful for rapid convergence.

Evaluation of Different Embedding
Algorithms. The DEE model with LD strategy was proved to be the most effective and efficient method in the preceding evaluation. To begin the classification performance analysis of the proposed approach, we secondly compare it with other embedding algorithms for assessing its capability of avoiding overlaps with different classes. We carried out comparisons to DSNE, EE, and -SNE in 2D embedding space. The visualization results are illustrated in Figures 7,8,9,and 10. What is clear from these figures is that DEE and DSNE are more capable of separating data apart from different classes than EE and -SNE. Note that EE and -SNE both neglect the class labels for model construction. This demonstrates that the class labels ought to be a far more significant factor for enhancing classification performance. Furthermore, from Figures 7 and 9, we can see that DSNE not only has more interclass overlaps, but also has more intraclass scatters than DEE. This is due to two main factors. First, the traditional SNE-type embedding algorithms such as -SNE or DSNE use normalization probabilities, which is cumbersome and unnecessary. However, DEE abandons the normalization term but focuses on the important and explicit relation between nonlinear and spectral methods, which makes DEE more robust to various types of data. Second, DEE uses the spectral direction for optimization, which is efficient and has no parameters to tune. Although DSNE uses conjugate gradient method for optimization, there are many parameters that need to be manually adjusted, which is difficult and time consuming. Besides, the conjugate gradient method is apt to fall into local optimum, which leads to cluttered subspace coordinates.

The Evaluation of Classification Performance.
In [19,30], some comparison studies of SNE-type embedding algorithms and spectral methods were demonstrated for image data and hyperspectral data, respectively. Those demonstrations showed that, while SNE-type embedding algorithms do improve the classification performance, the requirement of even more concise subspace dimension remains a challenge. From the experimental results in Section 4.3, we know that the class labels are important to the classification performance. Here we compare DEE with three other supervised dimensionality reduction techniques, DSNE, DCV, and MMPA. DCV and MMPA are two recently proposed spectral methods. DCV has no special parameters needed to be tuned. For MMPA, we set the parametric pair and to be the average intraclass and interclass Euclidean distances, respectively. For DSNE, we follow the parametric set as in [30]. To illustrate the classification performance in the projected spaces, a nearest neighbor classifier is used to produce the decision results. For COIL20 dataset, we randomly selected 15 samples for each object as the training samples. For ORL and Yale datasets, we uniformly provided 50% training samples. In USPS, 25 samples in each class were used for input data. All the rest data were used as testing samples. Figure 11 shows the recognition rate versus progressive subspace dimension for DEE, DSNE, DCV, and MMPA in four different datasets. All the results in Figure 11 were come into being with ten replications. From this illustration, we can see that the maximum subspace dimension of DCV is limited to C-1, due to the rank of the difference matrix. This limitation makes DCV perform poorly in some datasets. Besides, DCV demands more null space information in intraclass scatter matrix for better recognition rate. So, in USPS dataset, the optimal accuracy for DCV is only 83%, which is the worst compared with other three algorithms. Without this restriction, the other three algorithms are free for the choice of subspace dimension. However, since the conjugate gradient method  is unstable and suboptimal, DSNE only gets a little better recognition rate than DCV, and its accuracy curve is more fluctuant. The best recognition rate of MMPA and DEE is very close. By comparison, the recognition rate curve of DEE is smoother than MMPA's. This reduces the complexity of choosing a proper subspace dimension value in a wide range for the users. Furthermore, DEE reaches the higher recognition rate with lower subspace dimension value, which complies with the essence of dimensionality reduction. In other words, DEE is capable of using as little as possible

Conclusions
A new embedding algorithm based on EE is proposed in this paper. The algorithm can be used for clustering visualization and classification. Our experimental illustrations were focused on image data embedding; however, this algorithm can also be extended to dimension reduction of other data without any adjustment. DEE, as a supervised embedding algorithm, is capable of pulling together the intraclass examples as well as pushing away the interclass examples. This "pull-push" property makes DEE qualified for discrimination tasks. The main disadvantage of all the SNEtype algorithms is that their optimization is a nonconvex issue requiring relatively slow iterative process. We introduced the Laplacian search direction to improve this gradient based optimization strategy. Empirically, the solutions solved by Laplacian direction are faster and more effective than the existing optimization methods. The experimental results in this paper on four image datasets show that DEE outperforms existing state-of-the-art algorithms for clustering visualization and classification. With fewest computation cost and more concise subspace dimension, DEE shows better embedded structure and reaches highest recognition rate.
In future work, we plan to speed up the computation cost in every iteration loop for LD strategy, which brings "big data" within reach of visualization and classification. We will also investigate the scalable optimization of all SNE-type algorithms, from which we can establish the uniform SNE based embedding framework.