A Kernel Based Neighborhood Discriminant Submanifold Learning for Pattern Classification

We propose a novel method, called Kernel Neighborhood Discriminant Analysis (KNDA), which can be regarded as a supervised kernel extension of Locality Preserving Projections (LPP). KNDA nonlinearly maps the original data into a kernel space in which two graphs are constructed to depict the within-class submanifold and the between-class submanifold. A criterion function that minimizes the quotient between the within-class representation and the between-class representation of the submanifolds is then designed to separate the submanifolds constructed by the individual classes. The main contribution of this paper is that we extend the submanifold based algorithm to a general model and derive a simple result by which a given object can be classified into a predefined class effectively. Experiments on the MNIST Handwritten Digits database, the Binary Alphadigits database, the ORL face database, the Extended Yale Face Database B, and a downloaded documents dataset demonstrate the effectiveness and robustness of the proposed method.


Introduction
In many practical applications, such as data mining, machine learning, and computer vision, dimensionality reduction is a necessary preprocessing step for noise reduction and for reducing the computational complexity. The basic principle of dimensionality reduction is to map the data points from the original space to a low dimensional space through a linear or a nonlinear map. Many dimensionality reduction methods have been developed to deal with this problem. Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2] are two traditional linear methods. PCA seeks a set of bases along which the data exhibit greater variance than along other axes. LDA attempts to make the samples as separable as possible in the low dimensional space. The number of features extracted by LDA is at most $c - 1$ ($c$ is the number of classes), which is suboptimal for classification in the Bayes sense unless a posteriori probability functions are selected. Since PCA and LDA optimize the mapping based on the global correlations in the given dataset, they are likely to distort the local correlation structures of the data. Meanwhile, they use the Euclidean distance in the space, which assumes that the space is isotropic and homogeneous, but this assumption is often invalid due to the curse of dimensionality. So an algorithm based on the original Euclidean distance is not always a good choice. To address this problem, some other algorithms have been developed, such as Locally Linear Embedding (LLE) [3,4], Isomap [5], Laplacian Eigenmaps [6], Locality Preserving Projections [7,8], Neighborhood Preserving Embedding [9], and Tangent Distance Preserving Mapping [10]. These methods attempt to use the local geometry structure of the data manifold to approximate the structure of the whole original manifold. Based on different geometric intuitions, these methods can reveal low dimensional structure of the manifold that cannot be detected by classical linear methods.
Recently, many novel algorithms that extend or combine the classical linear and nonlinear dimensionality reduction algorithms have been proposed. Weighted Locally Linear Embedding (WLLE) [11] modifies the LLE algorithm based on a weighted distance measurement to improve the dimension reduction and the internal feature extraction performance, especially for data with a deformed distribution. In [12], a novel nonlinear dimensionality reduction algorithm is proposed. It uses relative distance comparisons to explore the local geometrical relations between data points.

Journal of Applied Mathematics
All such relative comparisons derived in each neighborhood on the manifold are enumerated and maintained in the low-dimensional manifold to be learned. Sparse Multinomial Kernel Discriminant Analysis (sMKDA) [13] is a method for sparse, multinomial kernel discriminant analysis. It is based on the connection between Canonical Variate Analysis (CVA) [14] and least-squares and uses forward selection via orthogonal least-squares to approximate a basis, generalizing a similar approach for binomial problems. Neighborhood Component Analysis (NCA) [15] learns a transformation by maximizing a stochastic variant of the leave-one-out kNN score on the training set, such that in the transformed space all similar inputs of the same class are clustered as tightly as possible. Some other related algorithms, including [16][17][18][19][20], have been proposed to deal with this kind of problem.
Most of these methods design a heuristic criterion by using the local geometrical information of the data manifold and obtain the low dimensional data through some optimization method. Also, most of these methods may have supervised versions that use the class label information, which is closely related to the discriminant ability of an algorithm.
Here, by using a different local geometrical intuition, we propose a novel submanifold learning method, called Kernel Neighborhood Discriminant Analysis (KNDA), which is based on the kernel trick [21][22][23]. KNDA considers both the within-class submanifold and the between-class submanifold by integrating the neighboring information into a weight matrix in the kernel space. The final goal of KNDA is to keep the within-class data as close as possible in the low dimensional space while keeping the between-class data as far apart as possible.
Since our method can be regarded as a supervised kernel extension of Locality Preserving Projections (LPP), we first briefly introduce LPP before explaining our method.
The rest of the paper is organized as follows. In Section 2, Locality Preserving Projections is reviewed, and then in Section 3, we elaborate the proposed method, Kernel Neighborhood Discriminant Analysis (KNDA). Section 4 presents the experiment results. Section 5 discusses KNDA. Finally, in Section 6, the conclusion is given.

Review of Locality Preserving Projections (LPP)
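LPP is a linear approximation of the nonlinear Laplacian Eigenmap for learning a locality preserving subspace which preserves the intrinsic geometry and the local structure of the data. A neighborhood relationship graph $G$ is built to depict the local structure of the data points.
The objective function of LPP is defined as
$$\min \sum_{i,j} \| y_i - y_j \|^2 W_{ij}, \tag{1}$$
where $W_{ij}$ is the weight which depicts the nearness relationship between the two data points $x_i$ and $x_j$, and $y_i$ and $y_j$ are the mapping results of the data points $x_i$ and $x_j$.
There are two common ways to compute $W_{ij}$ as follows.
(1) Heat Kernel Weight:
$$W_{ij} = \begin{cases} \exp\left(-\| x_i - x_j \|^2 / t\right), & \text{if } x_i \text{ and } x_j \text{ are adjacent}, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
where $t$ is the heat kernel parameter.
(2) Constant Weight:
$$W_{ij} = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ are adjacent}, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$
The weights $W_{ij}$ define a matrix $W$, whose entry $W_{ij}$ is nonzero only when $x_i$ and $x_j$ are adjacent. Also notice that the entries of $W$ are nonnegative and that $W$ is sparse and symmetric.
There is also a constraint imposed on (1), namely, $Y D Y^T = I$, where $Y = [y_1, y_2, \ldots, y_N]$ and $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$.
Finally, the minimization problem reduces to the following form:
$$\min_{A} \operatorname{tr}\left(A^T X L X^T A\right) \quad \text{s.t.} \quad A^T X D X^T A = I, \tag{4}$$
where $L = D - W$ is the Laplacian [24] of the graph $G$ constructed by LPP and $A$ is the projection matrix, whose column vectors are the mapping directions by which we can project $X$ into a low dimensional space.
In order to get the optimal projection matrix $A_{\mathrm{LPP}}$, we just need the eigenvectors corresponding to the minimum eigenvalues of the generalized eigenvalue problem:
$$X L X^T a = \lambda X D X^T a. \tag{5}$$
When we get $A_{\mathrm{LPP}}$, we can obtain the projection results of $X$ easily by $Y = A_{\mathrm{LPP}}^T X$. Here, the size of the matrix $A_{\mathrm{LPP}}$ is $n \times d$, $n$ is the dimensionality of the original space, and $d$ is the dimensionality of the space into which we map the original matrix $X$.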
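As a minimal sketch of the LPP procedure reviewed above, the following NumPy code builds a constant-weight nearest-neighbor graph, forms the graph Laplacian, and solves the generalized eigenvalue problem. The function name `lpp`, the small regularization term `reg`, and the toy data are assumptions made for illustration; they are not taken from the LPP references.

```python
import numpy as np

def lpp(X, k=10, d=2, reg=1e-6):
    """LPP sketch. X has shape (n, N): one data point per column.

    Returns the projection matrix A (n, d) and the embedding Y = A^T X (d, N).
    """
    n, N = X.shape
    # Squared Euclidean distances between all pairs of columns.
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist2, np.inf)          # a point is not its own neighbor

    # Constant-weight graph: W_ij = 1 if j is among the k nearest neighbors of i.
    W = np.zeros((N, N))
    nearest = np.argsort(dist2, axis=1)[:, :k]
    W[np.repeat(np.arange(N), k), nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)                   # symmetrize the adjacency

    D = np.diag(W.sum(axis=1))
    L = D - W                                # graph Laplacian

    # Generalized problem  X L X^T a = lambda X D X^T a.
    left = X @ L @ X.T
    right = X @ D @ X.T + reg * np.eye(n)    # reg is an assumption, added for stability
    evals, evecs = np.linalg.eig(np.linalg.solve(right, left))
    order = np.argsort(evals.real)[:d]       # indices of the d smallest eigenvalues
    A = evecs[:, order].real
    return A, A.T @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 60))                 # 60 toy points in R^5
A, Y = lpp(X, k=10, d=2)
print(A.shape, Y.shape)
```

The neighborhood size `k=10` mirrors the value used for LPP in the experiments; any other value works the same way.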

Kernel Neighborhood Discriminant Analysis (KNDA)
Although LPP has a solid theoretical basis and can project high dimensional data into a low dimensional space while preserving the local geometry structure of the original manifold, it does not use the class relationship of the data points, which is more important in pattern classification.
Here, we propose a robust and promising method, which first nonlinearly maps the original data points into a high dimensional space and then explicitly constructs affinity graphs by using the label of each point. Assume that the dataset is represented as $X = [x_1, x_2, \ldots, x_N]$, where $x_i \in R^n$ ($i = 1, 2, \ldots, N$) and each data point is assigned a label $l_i \in \{1, 2, \ldots, c\}$, where $c$ is the number of classes.
First, we utilize a nonlinear function $\phi$ to map the original data points into a kernel space $\mathcal{F}$, in which the data points can be represented as
$$\Phi(X) = [\phi(x_1), \phi(x_2), \ldots, \phi(x_N)]. \tag{6}$$
Then, in order to use the label information, we construct the within-class affinity graph and the between-class affinity graph in the kernel space $\mathcal{F}$.

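Build the Within-Class Manifold Structure.
This step is somewhat like LPP: we construct an affinity graph, each vertex of which is a data point in the kernel space. However, we require that there is an edge between two vertices only when they are from the same class; namely, they have the same label.
When the within-class affinity graph is obtained, each edge of the graph is given a value to measure the nearness relationship between the two vertices which form the edge. Here, we use the constant weight:
$$W^{\min}_{ij} = \begin{cases} 1, & \text{if } \operatorname{label}(\phi(x_i)) = \operatorname{label}(\phi(x_j)), \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$
Aiming at preserving the local geometry structure of the submanifold constructed by each class, we wish to minimize
$$\frac{1}{2} \sum_{i,j} \left\| A^T \phi(x_i) - A^T \phi(x_j) \right\|^2 W^{\min}_{ij}, \tag{8}$$
where $A$ is the projection matrix whose columns are the axes of the subspace. Equation (8) can be reduced as follows:
$$\frac{1}{2} \sum_{i,j} \left\| A^T \phi(x_i) - A^T \phi(x_j) \right\|^2 W^{\min}_{ij} = \operatorname{tr}\left(A^T \Phi(X) (D_{\min} - W_{\min}) \Phi(X)^T A\right), \tag{9}$$
where $D_{\min}$ is a diagonal matrix with $(D_{\min})_{ii} = \sum_j W^{\min}_{ij}$ and $W_{\min}$ is a matrix with $W_{\min}(i, j) = W^{\min}_{ij}$.
Notice that $A$ can be represented as a linear combination of $\phi(x_i)$ ($i = 1, 2, \ldots, N$), namely,
$$A = \Phi(X) \alpha. \tag{10}$$
So, (8) can be reduced further to
$$\operatorname{tr}\left(\alpha^T K (D_{\min} - W_{\min}) K \alpha\right), \tag{11}$$
where $K = \Phi(X)^T \Phi(X)$ is the kernel matrix in the kernel space.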
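The reduction from the pairwise sum to the trace form can be checked numerically. The sketch below is only an illustration under stated assumptions: the toy data, the polynomial kernel with offset 1, and the random coefficient matrix `alpha` are hypothetical, and the pairwise sum is taken with a factor 1/2 so that both sides agree exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, d = 12, 4, 2
X = rng.normal(size=(n, N))                  # columns are data points
labels = np.repeat(np.arange(3), 4)          # 3 toy classes, 4 points each

# Within-class constant weights: W_ij = 1 when the labels agree.
W_min = (labels[:, None] == labels[None, :]).astype(float)
D_min = np.diag(W_min.sum(axis=1))

# Polynomial kernel of degree 2 (the offset 1 is an assumption).
K = (X.T @ X + 1.0) ** 2

alpha = rng.normal(size=(N, d))              # arbitrary coefficient matrix
Y = alpha.T @ K                              # column i equals A^T phi(x_i)

# Pairwise form: 1/2 * sum_ij ||y_i - y_j||^2 W_ij.
diff = Y[:, :, None] - Y[:, None, :]         # shape (d, N, N)
pairwise = 0.5 * np.sum(np.sum(diff ** 2, axis=0) * W_min)

# Trace form: tr(alpha^T K (D_min - W_min) K alpha).
trace_form = np.trace(alpha.T @ K @ (D_min - W_min) @ K @ alpha)

print(np.isclose(pairwise, trace_form))
```

Because only the kernel matrix `K` appears in the trace form, the mapped points $\phi(x_i)$ never have to be computed explicitly.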

Build the Between-Class Manifold Structure.
The submanifolds constructed by the individual classes may overlap, which is the main reason why the recognition rate is low in classification tasks. So, how to separate the submanifolds in the low dimensional space is a key issue in dimensionality reduction research. Here, we explore the between-class relationship to build an affinity graph which reflects the relationship between data points with different labels. We link two points $\phi(x_i)$ and $\phi(x_j)$ in the kernel space only when the Euclidean distance between them is less than some limit value and they are from different classes. Here, unlike building the within-class manifold structure, we add another limit; namely, the two data points with different labels must reside in a local sphere with a small radius. We impose this limit because, if two data points with different labels are very close to each other in the space, there is a higher possibility of misclassifying them. So we add this limit in order to push points with different labels as far apart as possible in the low dimensional space. The radius should be chosen so that the local sphere contains a few data points whose labels differ from that of the data point at the centre of the sphere; meanwhile, the radius should be as small as possible, so that it reflects the local structure of the manifold faithfully.
When the between-class affinity graph is obtained, like in Section 3.1, we give each edge of the graph a value. We also utilize the constant weight here; namely,
$$W^{\max}_{ij} = \begin{cases} 1, & \text{if } \| \phi(x_i) - \phi(x_j) \| < \varepsilon \text{ and } \operatorname{label}(\phi(x_i)) \neq \operatorname{label}(\phi(x_j)), \\ 0, & \text{otherwise}, \end{cases} \tag{12}$$
where $\varepsilon$ denotes the radius of the local sphere. Since we conduct this step in the kernel space, we can calculate $\| \phi(x_i) - \phi(x_j) \|$ as follows:
$$\| \phi(x_i) - \phi(x_j) \| = \sqrt{K_{ii} - 2 K_{ij} + K_{jj}}, \tag{13}$$
where the matrix $K$ is the kernel matrix whose entries are the inner products of all pairs $\phi(x_i)$ and $\phi(x_j)$ ($i, j = 1, 2, \ldots, N$).
In order to make the between-class submanifolds constructed by the individual classes as separable as possible in the low dimensional space, we wish to maximize
$$\frac{1}{2} \sum_{i,j} \left\| A^T \phi(x_i) - A^T \phi(x_j) \right\|^2 W^{\max}_{ij}. \tag{14}$$
According to Section 3.1, (14) can be reduced to
$$\operatorname{tr}\left(\alpha^T K (D_{\max} - W_{\max}) K \alpha\right), \tag{15}$$
where $D_{\max}$ is a diagonal matrix with $(D_{\max})_{ii} = \sum_j W^{\max}_{ij}$, $W_{\max}$ is a matrix with $W_{\max}(i, j) = W^{\max}_{ij}$, and $K$ is the kernel matrix in the kernel space.
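The kernel-space distance and the between-class constant weights can be sketched as follows. The helper names `kernel_distances` and `between_class_weights`, the toy data, and the choice of the median distance as radius are assumptions for illustration only. A linear kernel is used in the check because the kernel-space distance then has to coincide with the ordinary Euclidean distance.

```python
import numpy as np

def kernel_distances(K):
    """Pairwise distances ||phi(x_i) - phi(x_j)|| computed from a kernel matrix K."""
    diag = np.diag(K)
    dist2 = diag[:, None] - 2.0 * K + diag[None, :]
    return np.sqrt(np.maximum(dist2, 0.0))   # clip tiny negatives caused by round-off

def between_class_weights(K, labels, radius):
    """Constant weights: 1 for pairs closer than radius that carry different labels."""
    close = kernel_distances(K) < radius
    different = labels[:, None] != labels[None, :]
    return (close & different).astype(float)

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))                  # 8 toy points in R^3, one per column
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

K_lin = X.T @ X                              # linear kernel, so phi is the identity
euclid = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
print(np.allclose(kernel_distances(K_lin), euclid, atol=1e-6))

# The radius is a free parameter; the median distance is a hypothetical choice.
W_max = between_class_weights(K_lin, labels, radius=np.median(euclid))
D_max = np.diag(W_max.sum(axis=1))
```

A smaller radius keeps fewer between-class edges and therefore concentrates the separation effort on the pairs that are most likely to be confused.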

Criterion Function.
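Since Kernel Neighborhood Discriminant Analysis is designed to preserve the within-class geometry structure of all the submanifolds while keeping the between-class submanifolds apart, we define the criterion function as follows:
$$J(\alpha) = \min_{\alpha} \frac{\operatorname{tr}\left(\alpha^T K (D_{\min} - W_{\min}) K \alpha\right)}{\operatorname{tr}\left(\alpha^T K (D_{\max} - W_{\max}) K \alpha\right)}. \tag{16}$$
This problem can be reformulated as a constrained minimization problem:
$$\min_{\alpha} \operatorname{tr}\left(\alpha^T K (D_{\min} - W_{\min}) K \alpha\right) \quad \text{s.t.} \quad \alpha^T K (D_{\max} - W_{\max}) K \alpha = I. \tag{17}$$
We can obtain the solution of (17) easily by solving the following generalized eigenvalue problem:
$$K (D_{\min} - W_{\min}) K \alpha = \lambda K (D_{\max} - W_{\max}) K \alpha. \tag{18}$$
If $K (D_{\max} - W_{\max}) K$ is invertible, we can reduce (18) to the common eigenvalue problem:
$$\left(K (D_{\max} - W_{\max}) K\right)^{-1} K (D_{\min} - W_{\min}) K \alpha = \lambda \alpha. \tag{19}$$
In practice, we can add a diagonal matrix with small entries, such as 0.01, to the matrix $K (D_{\max} - W_{\max}) K$ to ensure that it has full rank. Then the optimal solution vectors $\alpha^*$ of (17) are the eigenvectors corresponding to the smallest $d$ eigenvalues of $\left(K (D_{\max} - W_{\max}) K\right)^{-1} K (D_{\min} - W_{\min}) K$ ($d$ is the reduced dimensionality).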
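The effect of the small diagonal term can be illustrated on toy matrices. In the following sketch, `S_w` and `S_b` are hypothetical stand-ins for the within-class and between-class matrices of the criterion; they are random positive semidefinite matrices, not quantities from the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
M1 = rng.normal(size=(6, 6))
M2 = rng.normal(size=(6, 3))
S_w = M1 @ M1.T                              # stand-in for the within-class matrix
S_b = M2 @ M2.T                              # stand-in for the between-class matrix, singular

print(np.linalg.matrix_rank(S_b))            # rank-deficient before the fix
S_b_reg = S_b + 0.01 * np.eye(6)             # add a small diagonal term
print(np.linalg.matrix_rank(S_b_reg))        # full rank afterwards

# Ordinary eigenvalue problem: (S_b_reg)^{-1} S_w a = lambda a.
evals, evecs = np.linalg.eig(np.linalg.solve(S_b_reg, S_w))
order = np.argsort(evals.real)
evals, evecs = evals.real[order], evecs.real[:, order]

d = 2
alpha = evecs[:, :d]                         # eigenvectors of the d smallest eigenvalues

# The same vectors satisfy the generalized problem S_w a = lambda S_b_reg a.
residual_ok = np.allclose(S_w @ alpha, S_b_reg @ alpha @ np.diag(evals[:d]), atol=1e-6)
print(residual_ok)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical design choice; mathematically both correspond to the same ordinary eigenvalue problem.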
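When $\alpha^*$ is obtained, the data points in the kernel space can be mapped into a subspace by
$$Y = \alpha^{*T} K, \tag{20}$$
and a new test sample $x$ can be mapped to the subspace by
$$y = \alpha^{*T} \left[k(x_1, x), k(x_2, x), \ldots, k(x_N, x)\right]^T, \tag{21}$$
where $k(x_i, x) = \phi(x_i)^T \phi(x)$.
Algorithm 1. The formal algorithm procedure can be described as follows.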
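Step 1. Compute the kernel matrix $K$ of the data points.
Step 2. Construct the within-class affinity graph, and compute the weight matrix $W_{\min}$ and $D_{\min}$:
$$W_{\min}(i, j) = \begin{cases} 1, & \text{if } \operatorname{label}(\phi(x_i)) = \operatorname{label}(\phi(x_j)), \\ 0, & \text{otherwise}; \end{cases}$$
$D_{\min}$ is a diagonal matrix with $(D_{\min})_{ii} = \sum_j W_{\min}(i, j)$.
Step 3. Construct the between-class affinity graph, and compute the weight matrix $W_{\max}$ and $D_{\max}$:
$$W_{\max}(i, j) = \begin{cases} 1, & \text{if } \| \phi(x_i) - \phi(x_j) \| < \varepsilon \text{ and } \operatorname{label}(\phi(x_i)) \neq \operatorname{label}(\phi(x_j)), \\ 0, & \text{otherwise}; \end{cases}$$
here $\varepsilon$ is the radius of the local sphere, and $D_{\max}$ is a diagonal matrix with $(D_{\max})_{ii} = \sum_j W_{\max}(i, j)$.
Step 4. Solve the eigenvalue problem:
$$\left(K (D_{\max} - W_{\max}) K\right)^{-1} K (D_{\min} - W_{\min}) K \alpha = \lambda \alpha.$$
In practice, we can add a diagonal matrix with small entries, such as 0.01, to the matrix $K (D_{\max} - W_{\max}) K$ to ensure that it has full rank.
Step 5. Produce the mapped vectors. Let $\alpha^*_1, \alpha^*_2, \ldots, \alpha^*_d$ be the eigenvectors corresponding to the smallest $d$ eigenvalues of the eigenvalue problem shown in Step 4; the final $d$-dimensional embedding of the original data points is
$$Y = \alpha^{*T} K,$$
where $\alpha^* = [\alpha^*_1, \alpha^*_2, \ldots, \alpha^*_d]$.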
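The five steps can be put together in a short NumPy sketch. This is an illustration under several assumptions rather than a reference implementation: the names `poly_kernel`, `knda_fit`, and `knda_transform`, the kernel offset 1, the default radius (median between-class distance), and the toy data are all hypothetical. Step 4 is solved through an equivalent symmetric reformulation (Cholesky factor of the regularized between-class matrix followed by `numpy.linalg.eigh`), which yields real eigenvalues in ascending order.

```python
import numpy as np

def poly_kernel(X1, X2, degree=2):
    """Polynomial kernel between the columns of X1 and X2 (offset 1 is an assumption)."""
    return (X1.T @ X2 + 1.0) ** degree

def knda_fit(X, labels, d=2, radius=None, degree=2, eps=0.01):
    """Steps 1-5 on training data X of shape (n, N); returns alpha (N, d) and Y (d, N)."""
    N = X.shape[1]
    # Step 1: kernel matrix.
    K = poly_kernel(X, X, degree)
    # Step 2: within-class graph with constant weights.
    same = labels[:, None] == labels[None, :]
    W_min = same.astype(float)
    D_min = np.diag(W_min.sum(axis=1))
    # Step 3: between-class graph, restricted to a local sphere in kernel space.
    diag = np.diag(K)
    dist = np.sqrt(np.maximum(diag[:, None] - 2.0 * K + diag[None, :], 0.0))
    if radius is None:
        radius = np.median(dist[~same])      # hypothetical default, not from the paper
    W_max = ((dist < radius) & ~same).astype(float)
    D_max = np.diag(W_max.sum(axis=1))
    # Step 4: generalized eigenproblem with a small diagonal term added.
    S_w = K @ (D_min - W_min) @ K
    S_b = K @ (D_max - W_max) @ K
    S_b = (S_b + S_b.T) / 2.0 + eps * np.eye(N)
    C_inv = np.linalg.inv(np.linalg.cholesky(S_b))   # S_b = C C^T
    S = C_inv @ S_w @ C_inv.T
    evals, U = np.linalg.eigh((S + S.T) / 2.0)       # ascending eigenvalues
    alpha = C_inv.T @ U[:, :d]                       # d smallest eigenvalues
    # Step 5: embedding of the training points.
    return alpha, alpha.T @ K

def knda_transform(alpha, X_train, X_new, degree=2):
    """Map new samples (columns of X_new) into the learned subspace."""
    return alpha.T @ poly_kernel(X_train, X_new, degree)

rng = np.random.default_rng(3)
X = np.hstack([rng.normal(loc=m, scale=0.2, size=(10, 15)) for m in (0.0, 0.3, 0.6)])
labels = np.repeat(np.arange(3), 15)
alpha, Y = knda_fit(X, labels, d=2)
Y_again = knda_transform(alpha, X, X)        # training points mapped as "new" samples
print(Y.shape, np.allclose(Y, Y_again))
```

Mapping the training points through `knda_transform` reproduces the training embedding, since the out-of-sample formula evaluated on the training set is just $\alpha^{*T} K$.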

Experiments and Discussions
In this section, we conduct several experiments on different datasets to demonstrate the effectiveness and the robustness of our proposed method, Kernel Neighborhood Discriminant Analysis (KNDA). Note that in all the following experiments the parameter $k$ used to find the nearest neighbors in LPP and KLPP is set to 10. Also, by an auxiliary experiment, we find that KNDA with a polynomial kernel gets better results than with other kernel functions, such as the Gaussian kernel and the sigmoid kernel, and that the experiment results are not seriously affected when the parameter $p$ of the polynomial kernel is perturbed. So, for comparison and simplicity, we employ the polynomial kernel with $p = 2$ in KNDA, KPCA, KLDA, and KLPP.

MNIST Handwritten Digits.
In this experiment, we use a subset of the MNIST Handwritten Digits database [25], which contains 8-bit grayscale images of "0" through "9". Each digit image sample is represented as a high dimensional vector of length 784. Figure 1 depicts two digits, "0" and "1", of the database. We randomly choose 100 samples of every digit to form our dataset, and we compare KNDA with SLPP, SKLPP, PCA, and KPCA. The results are shown in Figure 2.
We can see that the embedding result of KNDA separates the classes of the samples almost perfectly, while all the other four methods result in overlap of the data points from different classes. That is because PCA type methods are designed to preserve the global structure of the manifold, whereas LPP type methods put more weight on the local structure of the manifold. Also, PCA and KPCA neglect the key factor "class label", which is very important in classification. Notice that digits "4" and "9" are not separated completely by KNDA (see Figure 2). If we look back at the digits "4" and "9" in the original database, we find that the two digits are somewhat similar. The images are shown in Figure 3.
Here, we look back at Figure 2. The centroids of the two clusters of digits "4" and "9" are too close; if we can make the abscissas of the two centroids more separated, we may be able to split digits "4" and "9" completely. So, we discard the eigenvector corresponding to the smallest eigenvalue of the matrix $\left(K (D_{\max} - W_{\max}) K\right)^{-1} K (D_{\min} - W_{\min}) K$ and use the two eigenvectors corresponding to its second and third smallest eigenvalues. This means that we now use the original $y$-direction as the $x$-direction. The experiment result is shown in Figure 4.
This time, digits "4" and "9" are separated completely. To a certain degree, the result can be seen as a global 90° clockwise rotation of the result obtained by KNDA in Figure 2. Notice that the $x$-direction distance between the cluster centroids of digits "4" and "9" is now almost the same as the $y$-direction distance in Figure 2. This is because the eigenvector corresponding to the second smallest eigenvalue of $\left(K (D_{\max} - W_{\max}) K\right)^{-1} K (D_{\min} - W_{\min}) K$ is now used as the embedding vector of the $x$-direction, whereas it was used as the embedding vector of the $y$-direction in the foregoing experiment.
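The change of eigenvectors can be illustrated with a small sketch. The matrices below are random placeholders (a hypothetical kernel matrix `K` and a hypothetical matrix `V` whose columns stand for eigenvectors sorted by ascending eigenvalue); only the index selection is the point of the example.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20
G = rng.normal(size=(N, N))
K = G @ G.T                                  # toy symmetric kernel matrix
V = rng.normal(size=(N, 3))                  # stand-in for eigenvectors, ascending eigenvalues

Y_first_two = V[:, [0, 1]].T @ K             # default: 1st and 2nd smallest eigenvalues
Y_shifted = V[:, [1, 2]].T @ K               # discard the 1st, use the 2nd and 3rd

# The former y-coordinate (2nd eigenvector) becomes the new x-coordinate.
print(np.allclose(Y_shifted[0], Y_first_two[1]))
```

Only the second coordinate of the new embedding carries information that was not visible before, which is why the new plot resembles a rotated version of the previous one.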

Binary Alphadigits.
The dataset used in this part is Binary Alphadigits [25], which is composed of binary 20 × 16 images of the digits "0" to "9" and the capital letters "A" to "Z". Each class has 39 examples. Here, we use all the samples of the database. Figure 5 depicts digit "9" and character "A" of the database.
By KNDA, we map all the data points to the 2-dimensional space, and the experiment results are shown in Figure 6.
It is very interesting that the samples from each class are embedded to a single point in the 2-dimensional space, which demonstrates the effectiveness and robustness of KNDA. Since this database contains 36 different classes, namely, digits "0-9" and characters "A-Z", we do not show the results produced by PCA, LPP, and NPE. The results of these three methods are somewhat like the ones on the MNIST database.
However, digit "0" and character "O" overlap, which is not surprising if we take a look at their images in the database, shown in Figure 7.
They are the same. That is why KNDA embeds digit "0" and character "O" in the same place in the 2-dimensional embedding space.

ORL.
In this experiment, we use the well-known ORL face database [26]. ORL contains 40 different subjects, and each subject has ten different images. The images include variation in facial expressions (smiling or not, open or closed eyes) and pose. Figure 8 illustrates two sample subjects of the ORL database along with variations in facial expressions and pose.
We test KNDA against DNDA (Direct Neighborhood Discriminant Analysis), PCA, KPCA, LPP, KLPP, LDA, and KLDA to demonstrate the advantage of KNDA. DNDA is conducted by directly building the within-class manifold structure and the between-class manifold structure in the original data space; the other steps are the same as in KNDA, so we do not explain DNDA at length here. We form the training set by randomly selecting 5 labeled images per individual, and the rest of the database is used as the testing set. A nearest neighbor classifier is employed in the experiments. Also, we find that if we run PCA prior to LPP, the results of LPP improve considerably. So in our experiment, we first run PCA before conducting LPP. The experiments are conducted 10 times, and we report the average results here. The results are given in Figure 9. The horizontal axis represents the dimension of the subspace and the vertical axis stands for the recognition rate.
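The evaluation protocol (random per-class split, nearest neighbor classification in the embedded space, averaging over repeated runs) can be sketched as follows. All function names, the toy data, and the identity embedding used as a placeholder for an actual dimensionality reduction method are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbor_predict(Y_train, labels_train, Y_test):
    """1-NN classifier on embedded points (columns are samples)."""
    d2 = (np.sum(Y_test ** 2, axis=0)[:, None]
          - 2.0 * Y_test.T @ Y_train
          + np.sum(Y_train ** 2, axis=0)[None, :])
    return labels_train[np.argmin(d2, axis=1)]

def split_per_class(labels, n_train, rng):
    """Randomly pick n_train samples of every class for training; the rest is for testing."""
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train.extend(rng.choice(idx, size=n_train, replace=False))
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test

def average_recognition_rate(X, labels, embed, n_train=5, runs=10, seed=0):
    """Mean accuracy over several random splits.

    embed(X_train, labels_train, X_test) must return (Y_train, Y_test).
    """
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(runs):
        tr, te = split_per_class(labels, n_train, rng)
        Y_tr, Y_te = embed(X[:, tr], labels[tr], X[:, te])
        pred = nearest_neighbor_predict(Y_tr, labels[tr], Y_te)
        rates.append(np.mean(pred == labels[te]))
    return float(np.mean(rates))

def identity_embed(X_train, labels_train, X_test):
    """Placeholder for a dimensionality reduction method: no reduction at all."""
    return X_train, X_test

# Toy stand-in for a face database: 4 subjects, 10 well separated samples each.
rng = np.random.default_rng(5)
labels = np.repeat(np.arange(4), 10)
X = rng.normal(scale=0.1, size=(6, 40)) + 3.0 * np.eye(6)[:, labels]
rate = average_recognition_rate(X, labels, identity_embed, n_train=5, runs=10)
print(rate)
```

Any method under comparison can be plugged in through the `embed` argument, so that all methods are evaluated on identical random splits.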
It can be seen that KNDA outperforms all the other 7 methods. Although DNDA, PCA, KPCA, LDA, KLDA, and LPP perform well, KNDA exceeds them by a clear margin. Another point worth noting is that LPP has the same performance when the dimension is larger than 50. This is because we run PCA to preserve 90 percent of the principal components prior to conducting LPP, and the subspace dimension obtained by PCA is smaller than 50. Also, except for KNDA, DNDA, LDA, and KLDA, the other 4 methods do not consider the class labels of the data points explicitly, so they perform worse than KNDA. Meanwhile, although DNDA, LDA, and KLDA, which consider the label information explicitly, get better results, KNDA still outperforms them to a certain degree. We also list the best recognition rate achieved by each method along with the corresponding subspace dimension $d$ in Table 1.

Extended Yale Database B.
In this section, the Extended Yale Face Database B [27] is used to conduct our experiments. The Extended Yale Face Database B contains 16128 images of 28 human subjects under 9 poses and 64 illumination conditions. We use a subset of Yale Database B here which contains 38 subjects under different illumination conditions, and each subject has 64 cropped images. Figure 10 illustrates cropped example face images from the Extended Yale Face Database B under different illuminations.
We also test KNDA against DNDA (Direct Neighborhood Discriminant Analysis), PCA, KPCA, LPP, KLPP, LDA, and KLDA. The training set is formed by randomly selecting 30 labeled images per subject, and the rest of the database is used as the testing set. A nearest neighbor classifier is employed in the experiments. Like in Section 4.2.1, we run PCA prior to LPP to achieve the best performance of LPP. The experiments are conducted 10 times, and we report the average results here. The results are given in Figure 11. The horizontal axis represents the dimension of the subspace and the vertical axis stands for the recognition rate.
This time, KNDA again outperforms almost all of the other 7 methods. Although KLDA gets a better result when the dimensionality is low, KNDA exceeds KLDA by 0.11 percent when the dimension equals 120. We must also mention that this result may be related to the dataset. In Section 4.2.1 we can see that KNDA outperforms KLDA by a large margin, and the intrinsic relationship between the dataset and the algorithm is still unknown. Meanwhile, we can see that LPP is better than PCA, KPCA, and KLPP, but KNDA exceeds it by nearly 3 percent as the dimension increases. Also, if we look back at the experiment results in Section 4.2.1, we find that this time KLPP exceeds PCA and KPCA a lot, whereas in the ORL database this is almost the reverse. However, KNDA gets the best recognition rate in both the ORL database and the Extended Yale Face Database B, which indicates a stable performance. The best recognition rate achieved by each method along with the corresponding subspace dimension $d$ is shown in Table 2.

Text Categorization.
In order to test the performance of KNDA in a real application, a random set of Web documents belonging to 25 different classes has been downloaded from the Internet by an auxiliary Web crawler program. The number of documents differs greatly between classes. Some classes, such as sport or computer, have more than 2000 documents, while classes such as decoration have fewer than 500 documents. This difference is deliberately introduced to simulate the real application context as closely as possible. We use 80 percent of each class to compose the training set, and the other 20 percent of each class are used for testing. The experiments are conducted 10 times, and we report the average results here. The results are given in Figure 12. The horizontal axis represents the dimension of the subspace and the vertical axis stands for the recognition rate. We can see clearly that KNDA is the best method overall compared with the others, although it is not the best one when the dimensionality is low. Also, this time LPP performs better than many of the other methods, which suggests that the performance of LPP is not stable across different datasets. On the other hand, since DNDA and KNDA consider both the within-class structure and the between-class structure of the data manifold, they achieve a more stable performance. Meanwhile, note that we use a randomly downloaded Web documents dataset as the training set and that the distribution of the documents is not uniform, which suggests that KNDA can give a more robust and effective result in a real application context.
Also, the best recognition rate achieved by each method along with the corresponding subspace dimension $d$ is shown in Table 3.

Discussions
KNDA is a nonlinear dimension reduction algorithm which is based on graph theory. There are other graph based algorithms, such as S-Isomap [28], Tensor Subspace Analysis [29], and Conformal Embedding Analysis (CEA) [30].
The key difference between KNDA and these algorithms lies in how the graph is constructed and how the optimization criterion is chosen, which is very important to the algorithm. KNDA uses not only the within-class information but also the between-class information to construct the graphs, which can effectively improve the discriminant ability of an algorithm.
Meanwhile, the criterion function of KNDA is similar to that of the LDA type algorithms, so here we briefly summarize the similarity and dissimilarity between them. The LDA type algorithms are based on the global correlations in the given dataset, so they are likely to distort the local correlation structures of the data. On the contrary, KNDA fully uses the local geometrical information of the data manifold to construct the discriminant criterion and can cope with the curse of dimensionality.
Also, one can combine both the global and the local information of the data to design an algorithm, such as in [31], in which a new algorithm called Distinguishing Variance Embedding (DVE) is proposed. DVE unfolds the dataset by maximizing the global variance subject to the proximity relation preservation constraint originating in Laplacian Eigenmaps.

Conclusions
In this paper, a kernel based neighborhood discriminant submanifold learning algorithm called Kernel Neighborhood Discriminant Analysis (KNDA) is proposed. KNDA first nonlinearly maps the original dataset into a kernel space, and then the within-class submanifold and the between-class submanifold are modeled in the kernel space in order to separate the submanifolds constructed by the individual classes. By solving an eigenvalue problem, we get the embedding vectors in the low dimensional space. Digit visualization, face recognition, and real document categorization experiments are conducted on several different artificial or real datasets to demonstrate the advantage of KNDA.

Figure 8: Sample face images from the ORL database along with variations in facial expressions and poses.

Figure 10: Cropped example face images from the Extended Yale Face Database B under different illuminations.

Figure 11: The recognition rates versus the subspace dimension on Extended Yale Face Database B.

Figure 12: The recognition rates versus the subspace dimension on randomly downloaded Web documents belonging to 25 different classes.

Table 1: The best recognition rate achieved by each method along with the corresponding subspace dimension $d$.

Table 2: The best recognition rate achieved by each method along with the corresponding subspace dimension $d$.

Table 3: The best recognition rate achieved by each method along with the corresponding subspace dimension $d$.