Fast Discriminative Stochastic Neighbor Embedding Analysis

Feature is important for many applications in biomedical signal analysis and living system analysis. A fast discriminative stochastic neighbor embedding analysis (FDSNE) method for feature extraction is proposed in this paper by improving the existing DSNE method. The proposed algorithm adopts an alternative probability distribution model constructed based on its K-nearest neighbors from the interclass and intraclass samples. Furthermore, FDSNE is extended to nonlinear scenarios using the kernel trick and then kernel-based methods, that is, KFDSNE1 and KFDSNE2. FDSNE, KFDSNE1, and KFDSNE2 are evaluated in three aspects: visualization, recognition, and elapsed time. Experimental results on several datasets show that, compared with DSNE and MSNP, the proposed algorithm not only significantly enhances the computational efficiency but also obtains higher classification accuracy.


Introduction
In recent years, dimensional reduction which can reduce the curse of dimensionality [1] and remove irrelevant attributes in high-dimensional space plays an increasingly important role in many areas. It promotes the classification, visualization, and compression of the high dimensional data. In machine learning, dimension reduction is used to reduce the dimension by mapping the samples from the high-dimensional space to the low-dimensional space. There are many purposes of studying it: firstly, to reduce the amount of storage, secondly, to remove the influence of noise, thirdly, to understand data distribution easily, and last but not least, to achieve good results in classification or clustering.
Currently, many dimensional reduction methods have been proposed, and they can be classified variously from different perspectives. Based on the nature of the input data, they are broadly categorized into two classes: linear subspace methods which try to find a linear subspace as feature space so as to preserve certain kind of characteristics of observed data, and nonlinear approaches such as kernel-based techniques and geometry-based techniques; from the class labels' perspective, they are divided into supervised learning and unsupervised learning; furthermore, the purpose of the former is to maximize the recognition rate between classes while the latter is for making the minimum of information loss. In addition, judging whether samples utilize local information or global information, we divide them into local method and global method.
We briefly introduce several existing dimensional reduction techniques. In the main linear techniques, principal component analysis (PCA) [2] aims at maximizing the variance of the samples in the low-dimensional representation with a linear mapping matrix. It is global and unsupervised. Different from PCA, linear discriminant analysis (LDA) [3] learns a linear projection with the assistance of class labels. It computes the linear transformation by maximizing the amount of interclass variance relative to the amount of intraclass variance. Based on LDA, marginal fisher analysis (MFA) [4], local fisher discriminant analysis (LFDA) [5], and maxmin distance analysis (MMDA) [6] are proposed. All of the three are linear supervised dimensional reduction methods. MFA utilizes the intrinsic graph to characterize the intraclass compactness and uses meanwhile the penalty graph to characterize interclass separability. LFDA introduces the locality to the LFD algorithm and is particularly useful for samples consisting of intraclass separate clusters. MMDA considers maximizing the minimum pairwise samples of interclass.
To deal with nonlinear structural data, which can often be found in biomedical applications [7][8][9][10], a number of nonlinear approaches have been developed for dimensional reduction. Among these kernel-based techniques and geometrybased techniques are two hot issues. Kernel-based techniques attempt to obtain the linear structure of nonlinearly distributed data by mapping the original inputs to a high-dimensional feature space. For instance, kernel principal component analysis (kernel PCA) [11] is the extension of PCA using kernel tricks. Geometry-based techniques, in general, are known as manifold learning techniques such as isometric mapping (ISOMAP) [12], locally linear embedding (LLE) [13], Laplacian eigenmap (LE) [14], Hessian LLE (HLLE) [15], and local tangent space alignment (LTSA) [16]. ISOMAP is used for manifold learning by computing the pairwise geodesic distances for input samples and extending multidimensional scaling. LLE exploits the linear reconstructions to discover nonlinear structure in high-dimensional space. LE first constructs an undirected weighted graph, and then recovers the structure of manifold by graph manipulation. HLLE is based on sparse matrix techniques. As for LTSA, it begins by computing the tangent space at every point and then optimizes to find an embedding that aligns the tangent spaces.
Recently, stochastic neighbor embedding (SNE) [17] and extensions thereof have become popular for feature extraction. The basic principle of SNE is to convert pairwise Euclidean distances into probabilities of selecting neighbors to model pairwise similarities. As extension of SNE, -SNE [18] uses Student's -distribution to model pairwise dissimilarities in low-dimensional space and it alleviates the optimization problems and the crowding problem of SNE by the methods below: (1) it uses a symmetrized version of the SNE cost function with simpler gradients that was briefly introduced by Cook et al. [19], and (2) it employs a heavy-tailed distribution in the low-dimensional space. Subsequently, Yang et al. [20] systematically analyze the characteristics of the heavy-tailed distribution and the solutions to crowding problem. More recently, Wu et al. [21] explored how to measure similarity on manifold more accurately and proposed a projection approach called manifold-oriented stochastic neighbor projection (MSNP) for feature extraction based on SNE and -SNE. MSNP employs Cauchy distribution rather than standard Student's -distribution used in -SNE. In addition, for the purpose of learning the similarity on manifold with high accuracy, MSNP uses geodesic distance for characterizing data similarity. Though MSNP has many advantages in terms of feature extraction, there is still a drawback in it: MSNP is an unsupervised method and lacks the idea of class label, so it is not suitable for pattern identification. To overcome the disadvantage of MSNP, we have done some preliminary work and presented a method called discriminative stochastic neighbor embedding analysis (DSNE) [22]. DSNE effectively resolves the problems above, but since it selects all the training samples as their reference points, it has high computational cost and is thus computationally infeasible for the large-scale classification tasks with high-dimensional features [23,24]. On the basis of our previous research, we present a method called fast discriminative stochastic neighbor embedding analysis (FDSNE) to overcome the disadvantages of DSNE in this paper.
The rest of this paper is organized as follows: in Section 2, we introduce in detail the proposed FDSNE and briefly compare it with MSNP and DSNE in Section 3. Section 4 gives the nonlinear extension of FDSNE. Furthermore, experiments on various databases are presented in Section 5. Finally, Section 6 concludes this paper and several issues for future works are described.

Fast Discriminative Stochastic Neighbor Embedding Analysis
Consider a labeled data samples matrix as where x ∈ is a -dimensional sample and means the th sample in the th class. is the number of sample classes, is the number of samples in the th class, and = 1 + 2 + ⋅ ⋅ ⋅ + . In fact, the basic principle of FDSNE is the same as -SNE which is to convert pairwise Euclidean distances into probabilities of selecting neighbors to model pairwise similarities [18]. Since the DSNE selects all the training samples as its reference points, it has high computational cost and is thus computationally infeasible for the large-scale classification tasks with high-dimensional features. So according to the KNN classification rule, we propose an alternative probability distribution function which makes the label of target sample determined by its first -nearest neighbors in FDSNE. In this paper, NH (x ) and NM (x ) are defined. They, respectively, denote the th-nearest neighbor of x from the same class and the different classes in the transformed space. Mathematically, the joint probability is given by In formula (2), is the Euclidian distance between two samples x and x , the parameter is the variance parameter of Gaussian which determines the value of , and 1 ≤ ≤ 2 }, and then the denominator in formula (2) means all of the reference points under selection from the same class or the different classes. In particular, the joint probability not only keeps symmetrical characteristics of the probability distribution matrix but also makes the probability value of interclass data to be 1 and the same for intraclass data.
For low-dimensional representations, FDSNE uses counterparts y and y of the high-dimensional datapoints x and x . It is possible to compute a similar joint probability via the following expression: In what follows, we introduce the transformation by a linear projection: . Then by simple algebra formulation, formula (3) has the following equivalent expression: Note that all data have the intrinsic geometry distribution and there is no exception for intraclass samples and interclass samples. Then the same distribution is required to hold in feature space. Since the Kullback-Leiber divergence [25] is wildly used to quantify the proximity of two probability distributions, we choose it to build our penalty function here. Based on the above definition, the function can be formulated as: In this work, we use the conjugate gradient method to minimize (A). In order to make the derivation less cluttered, we first define four auxiliary variables , , , and as: Then differentiating (A) with respect to the transformation matrix A gives the following gradient, which we adopt for learning: 4

Computational and Mathematical Methods in Medicine
Let U be the order matrix with element , and let U be the order matrix with element . Note that U and U are symmetric matrices; therefore, D can be defined as a diagonal matrix that each entry is column (or row) sum of U and the same for D , that is, D = ∑ U and D = ∑ U . With this definition, the gradient expression (7) can be reduced to Once the gradient is calculated, our optimal problem (5) can be solved by an iterative procedure based on the conjugate gradient method. The description of FDSNE algorithm can be given by the following.
Step 1. Collect the sample matrix X with class labels, and set -nearest neighborhood parameter 1 , 2 , the variance parameter , and the maximum iteration times .
Step 2. Compute the pairwise Euclidian distance for X and compute the joint probability by utilizing formula (2) and class labels.
Step 3 (set = 1 : ). We search for the solution in loop: firstly, compute the joint probability by utilizing formula (4); then, compute gradient (A)/ (A) by utilizing formula (8); finally, update A based on A −1 by conjugate gradient operation.
Step 4. Judge whether − −1 < (in this paper, we take = 1 − 7) converges to a stable solution or reaches the maximum value . If these prerequisites are met, Step 5 is performed; otherwise, we repeat Step 3.
Step 5. Output A = A .
Hereafter, we call the proposed method as fast discriminative stochastic neighbor embedding analysis (FDSNE).

Comparison with MSNP and DSNE
MSNP is derived from SNE and -SNE, and it is a linear method and has nice properties, such as sensitivity to nonlinear manifold structure and convenience for feature extraction. Since the structure of MSNP is closer to that of FDSNE, we briefly compare FDSNE with MSNP and DSNE in this section.
FDSNE, MSNP, and DSNE use different probability distributions to determine the reference points. The difference can be explained in the following aspects.
Firstly, MSNP learns the similarity relationship of the high-dimensional samples by estimating neighborhood distribution based on geodesic distance metric, and the same distribution is required in feature space. Then the linear projection matrix A is used to discover the underlying structure of data manifold which is nonlinear. Finally, the Kullback-Leibler divergence objective function is used to keep pairwise similarities in feature space. So the probability distribution function of MSNP and its gradient used for learning are respectively given by where geo is the geodesic distance for x and x and is the freedom degree parameter of Cauchy distribution. DSNE selects the joint probability to model the pairwise similarities of input samples with class labels. It also introduces the linear projection matrix A as MSNP. The cost function is constructed to minimize the intraclass Kullback-Leibler divergence as well as to maximize the interclass KL divergences. Its probability distribution function and gradient are, respectively, given as by Computational and Mathematical Methods in Medicine Note that on the basis of the DSNE, FDSNE makes full use of class label which not only keeps symmetrical characteristics of the probability distribution matrix but also makes the probability value of interclass data and intraclass data to be 1, and it can effectively overcome large interclass confusion degree in the projected subspace. Secondly, it is obvious that the selection of reference point in MSNP or DSNE is related to all training samples, while FDSNE only uses the first -nearest neighbors of each sample from all classes. In other words, we propose an alternative probability distribution function to determine whether x would pick x as its reference point or not. Actually, the computation of gradient during the optimization process mainly determines the computational cost of MSNP and DSNE. So their computational complexity can be written as (2 + 2 ) in each iteration. Similarly, the computational complexity of FDSNE is (2 + ) in each iteration, where = 1 + 2 . It is obvious that ≪ . Therefore, FDSNE is faster than MSNP and DSNE during each iteration.

Kernel FDSNE
As a bridge from linear to nonlinear, kernel method emerged in the early beginning of the 20th century and its applications in pattern recognition can be traced back to 1964. In recent years, kernel method has attracted wide attention and numerous researchers have proposed various theories and approaches based on it.
The principle of kernel method is a mapping of the data from the input space to a high-dimensional space , which we will refer to as the feature space, by nonlinear function. Data processing is then performed in the feature space, and this can be expressed solely in terms of inner product in the feature space. Hence, the nonlinear mapping need not be explicitly constructed but can be specified by defining the form of the inner product in terms of a Mercer kernel function .
Obviously, FDSNE is a linear feature dimensionality reduction algorithm. So the remainder of this section is devoted to extend FDSNE to a nonlinear scenario using techniques of kernel methods. Let which allows us to compute the value of the inner product in without having to carry out the map.
It should be noted that we use to denote (x ) for brevity in the following. Next, we express the transformation A with We define B = [ (1) , . . . , ( ) ] and Φ = [ 1 , . . . , ] , and then A = BΦ. Based on above definition, the Euclidian distance between x and x in the space is ] is a column vector. It is clear that the distance in the kernel embedding space is related to the kernel function and the matrix B.
In this section, we propose two methods to construct the objective function. The first strategy makes B parameterize the objective function. Firstly, we replace (A) with (A) in formula (3) so that 1 , 1 which are defined to be applied in the high dimensional space can be written as Then, we denote (B) by modifying (A) via substituting A with B into the regularization term of formula (5). Finally, 6 Computational and Mathematical Methods in Medicine  by the same argument as formula (7), we give the following gradient: In order to make formula (15) easy to be comprehended, 1 , 1 , 1 , and 1 are given by Meanwhile, the gradient expression (15) can be reduced to where U 1 is the order matrix with element 1 , and U is the order matrix with element 1 . Note that U 1 and U 1 are symmetric matrices; therefore, D 1 can be defined as a diagonal matrix that each entry is column (or row) sum of U 1 and the same for D 1 , that is, D 1 = ∑ U 1 and D 1 = ∑ U 1 . For convenience, we name this kernel method as FKD-SNE1.
Another strategy is that we let (A) be the objective function in the embedding space . So its gradient can be written as ] Φ (18) in this form, ( − ) can be regard as the × matrix with vector − in the th column, and vector − in the th column and the other columns are all zeros.
This method is termed as FKDSNE2. Note that Φ is a constant matrix. Furthermore, the observations of formula (18) make us know that updating the matrix A in the optimization only means updating the matrix B. Additionally, Φ does not need to be computed explicitly. Therefore, we do not need to explicitly perform the nonlinear map (x) to minimize the objective function (A). The computational complexity of FKDSNE1 and FKDSNE2, is respectively, (2 2 + ) and (2 + 2 ) in each iteration. Hence, it is obvious that FKDSNE2 is faster than FKDSNE1 during each iteration.
In the first set of experiments, we focus on the visualization of the proposed methods which are compared with that of the relevant algorithms, including SNE [17], -SNE [18], and MSNP [21]. In the second set of experiments, we apply our methods to recognition task to verify their feature extraction capability and compare them with MSNP and DSNE [22]. Moreover, the elapsed time of FDSNE, FKDSNE1, FKDSNE2, and DSNE is compared in the third set of experiments. In particular, the Gaussian RBF kernel (x, x ) = exp(−‖x− x ‖ 2 /2 2 ) is chosen as the kernel function of FKDSNE1 and FKDSNE2, where is set as the variance of the training sample set of X. COIL-20 is a dataset of gray-scale images of 20 objects. The images of each object were taken 5 degrees apart as the object is rotated on a turntable and each object has 72 images. The size of each image is 40 × 40 pixels. Figure 1 shows sample images from COIL-20 images dataset.
USPS handwritten digit dataset includes 10 digit characters and 1100 samples in total. The original data format is of 16 × 16 pixels. Figure 2 shows samples of the cropped images from USPS handwritten digits dataset.
ORL consists of gray images of faces from 40 distinct subjects, with 10 pictures for each subject. For every subject, the images were taken with varied lighting condition and different facial expressions. The original size of each image is 112 × 92 pixels, with 256 gray levels per pixel. Figure 3 illustrates a sample subject of ORL dataset. Using FDSNE, FKDSNE1, and FKDSNE2. We apply FDSNE, FKDSNE1, and FKDSNE2 to visualization task to evaluate their capability of classification performance. The experiments are carried out, respectively, on COIL-20, USPS, and ORL datasets. For the sake of computational efficiency as well as noise filtering, we first adjust the size of each image to 32×32 pixels on ORL, and then we select five samples from each class on COIL-20, fourteen samples from each class on USPS, and five samples from each class on ORL.

Visualization
The experimental procedure is to extract a 20-dimensional feature for each image by FDSNE, FKDSNE1, and FKDSNE2, respectively. Then to evaluate the quality of features through visual presentation of the first two-dimensional feature.
FDSNE, FKDSNE1, and FKDSNE2 are compared with three well known visualization methods for detecting classification performance: (1) SNE, (2) -SNE, and (3) MSPN. The parameters are set as follows: the -nearest neighborhood parameter of FDSNE, FKDSNE1, and FKDSNE2 methods is 1 = ℎ − 1 (let ℎ denote the number of training samples in each class), 2 = 40; for SNE and -SNE, the perplexity parameter is perp = 20 and the iteration number is = 1000; for MSNP, the degree freedom of Cauchy distribution is = 4 and the iteration number is 1000 as well.       MSNP, both of them are linear methods and were proved to be better than existing feature extraction algorithms such as SNE, -SNE, LLTSA, LPP, and so on in [21,22]. The procedure of recognition is described as follows: firstly, divide dataset into training sample set X train and testing sample set X test randomly; secondly, the training process for the optimal matrix A or B is taken for FDSNE, FKDSNE1 and FKDSNE2; thirdly, feature extraction is accomplished for all samples using A or B; finally, a testing image is identified by a nearest neighbor classifier. The parameters are set as follows: the -nearest neighborhood parameter 1 , 2 in FDSNE, FKD-SNE1, and FKDSNE2 is 1 = ℎ − 1, 2 = 40; for DSNE, the perplexity parameter is = 0.1 and the iteration number is = 1000; for MSNP, the freedom degree of Cauchy distribution in MSNP is determined by cross validation and the iteration number is 1000 as well.  The maximal recognition rate of each method and the corresponding dimension are given in Table 1, where the number in bold stands for the highest recognition rate. From Table 1,   we can find that FKDSNE1 and FKDSNE2 outperform MSNP, DSNE, and FDSNE on COIL-20, USPS, and ORL. As can be seen, FKDSNE1 and FKDSNE2 enhance the maximal recognition rate for at least 2% compared with other three methods. Besides, FKDSNE1 and FKDSNE2 achieve considerable recognition accuracy when feature dimension is 20 on the three datasets. It indicates that FKDSNE1 and FKDSNE2 grasp the key character of face images relative to identification with a few features. Though the maximal recognition rate of DSNE and FDSNE is closer to that of FKDSNE1 and FKDSNE2 on ORL dataset, the corresponding dimension of FKDSNE1 and FKDSNE2 is 20 while that of DSNE and FDSNE exceeds 30. From the essence of dimensional reduction, this result demonstrates that FDSNE and DSNE are inferior to FKDSNE1 and FKDSNE2.

Analysis of Elapsed Time.
In this subsection, we further compare the computational efficiency of DSNE, FKDSNE, FKDSNE1, and FKDSNE2. The algorithm MSPN is not compared since its recognition rate is obviously worse than other algorithms. The parameters of the experiment are the same to Section 5.3. Figures 10, 11, and 12, respectively, show the elapsed time of four algorithms under different subspace dimensions on the three datasets. It can be observed from the figures that FKDSNE2 has the lowest computational cost among the four algorithms while DSNE is much inferior to other nearest-neighbor-based algorithms on all datasets. Particularly, on the COIL-20 dataset, the elapsed time of FKD-SNE2 is more than 2 times faster than DSNE. As for DSNE and FDSNE, the former is obviously slower than the latter. Besides, for the two kernel methods, FKDSNE2 is notably faster than FKDSNE1, which confirms our discussion in Section 4.
Furthermore, kernel-based algorithms FKDSNE1 and FKDSNE2 can effectively indicate the linear structure on high-dimensional space. Their objective function can achieve better values on desirable dimensions. For instance, Figure 13 illustrates the objective function value of MSNP, DSNE, FKD-SNE, FKDSNE1, and FKDSNE2 versus iterative number on ORL dataset. It can be found that FKDSNE2 and FKDSNE1 is close to the convergence value 1 − 7 while FDSNE and DSNE only achieve 1 − 6 and MSNP achieves 1 − 5.4 when the iterative number is 400. It means that FKDSNE1 and FKDSNE2 can get the more precise objective function value with less iterative number compared with DSNE and FDSNE; that is to say that, FKDSNE1 and FKDSNE2 can achieve the same value by using forty percent of the elapsed time of DSNE and FDSNE.

Conclusion
On the basis of DSNE, we present a method called fast discriminative stochastic neighbor embedding analysis (FDSNE) which chooses the reference points in -nearest neighbors of the target sample from the same class and the different classes instead of the total training samples and thus has much lower computational complexity than that of DSNE. Furthermore, since FDSNE is a linear feature dimensionality reduction algorithm, we extend FDSNE to a nonlinear scenario using techniques of kernel trick and present two kernel-based methods: FKDSNE1 and FKDSNE2. Experimental results on COIL-20, USPS, and ORL datasets show the superior performance of the proposed methods. Our future work might include further empirical studies on the learning speed and robustness of FDSNE by using more extensive, especially large-scale, experiments. It also remains important to investigate acceleration techniques in both initialization and long-run stages of the learning.