An Efficient Feature Extraction Method, Global Between Maximum and Local Within Minimum, and Its Applications

Feature extraction plays an important role as a preprocessing step in dealing with small sample size problems. LDA, LPP, and many other existing methods are confined to one aspect of the data set: either its global structure or its local geometry. To solve this problem, we propose an efficient method, named global between maximum and local within minimum (GBMLWM). It not only considers the global structure of the data set, but also makes the best use of its local geometry by dividing the data set into four domains. The method preserves nearest-neighborhood relations and demonstrates excellent performance in classification. The superiority of the proposed method is manifested in many experiments on data visualization, face representation, and face recognition.


Introduction
Nowadays, with the continual development of information technology, the amount of data has expanded enormously in domains such as pattern recognition, artificial intelligence, and computer vision. Because the dimension of the samples is often much greater than the number of available samples, we face "the curse of dimensionality" [1]. Feature extraction plays an important role in dealing with small sample size (SSS) problems. It represents original high-dimensional data in a low-dimensional space by capturing important data structure and information, and it is a common preprocessing procedure in multivariate statistical data analysis. At present, feature extraction methods have been successfully applied in many domains such as text classification [2], remote sensing image analysis [3], microarray data analysis [4], and face recognition [5, 6].
(1) The GBMLWM method shares excellent properties with LDA and MMC, and it maintains global merit in the process of global between maximum. Similar to LDA, we first keep all the samples in the data set away from the class centroid, and then we let the samples labeled the same class as a fixed sample, and lying beyond its nearest neighborhood, approach their class centroid. So GBMLWM is a supervised method, and it can separate the classes of the data set from one another while keeping each class compact.

(2) Similar to LPP and ANMM, GBMLWM preserves local geometry: we let the samples labeled the same class as the fixed sample and lying in its nearest neighborhood approach it, and at the same time we push the neighboring samples of a different class away from it. So GBMLWM maintains the submanifold structure around the fixed sample.

(3) In connection with PCA, LDA, MMC, LPP, and ANMM, we can derive those methods from the GBMLWM framework by imposing certain conditions; that is to say, they are special cases of GBMLWM. Visualization and classification experiments also indicate that the proposed method is superior to the above methods.
The rest of this paper is organized as follows. Section 2 briefly reviews global and local methods, namely PCA, LDA, MMC, LPP, and ANMM. The GBMLWM algorithm is put forward in Section 3, and its relationship with the above methods is also discussed there. The experimental results are presented in Section 4, and the conclusion appears in Section 5.

Brief Review of Global and Local Methods
Suppose that X = [x_1, x_2, ..., x_n] ∈ R^{m×n} is a set of m-dimensional samples of size n, composed of C classes C_i, i = 1, ..., C, where the ith class contains n_i samples and \sum_{i=1}^{C} n_i = n; let x_j^i be the m-dimensional column vector denoting the jth sample from the ith class. Generally speaking, the aim of linear feature extraction or dimensionality reduction is to find an optimal linear transformation W ∈ R^{m×d} (d ≪ m) from the original high-dimensional space to the target low-dimensional space, y_i = W^T x_i, so that the transformed data best represent, in terms of different optimality criteria, different information such as the algebraic and geometric structure of the data.

Principal Component Analysis
PCA attempts to seek an optimal projection direction so that the covariance of the data set is maximized, or, equivalently, the average cost of projection is minimized after transformation. The objective function of PCA is defined as follows:

    W^* = \arg\max_{W} \sum_{i=1}^{n} \| W^T x_i - W^T m_x \|^2,    (2.1)

where m_x = \frac{1}{n} \sum_{i=1}^{n} x_i is the mean of all samples. Applying standard algebra, (2.1) may be rewritten as

    W^* = \arg\max_{W} \operatorname{tr}( W^T S_t W ),    (2.2)

where

    S_t = \sum_{i=1}^{n} (x_i - m_x)(x_i - m_x)^T    (2.3)

is the sample covariance (total scatter) matrix. The optimal W = [w_1, w_2, ..., w_d] consists of the eigenvectors of S_t corresponding to the first d largest eigenvalues.
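The eigen-decomposition above can be sketched in a few lines of numpy; the function name `pca` and the column-per-sample layout are our own illustrative conventions, not part of any particular library.

```python
import numpy as np

def pca(X, d):
    """Minimal PCA sketch: columns of X are samples (m x n).

    Returns W (m x d), the eigenvectors of the total scatter
    matrix S_t associated with the d largest eigenvalues.
    """
    m_x = X.mean(axis=1, keepdims=True)   # mean of all samples, m_x
    Xc = X - m_x                          # centered data
    S_t = Xc @ Xc.T                       # total scatter matrix, as in (2.3)
    vals, vecs = np.linalg.eigh(S_t)      # eigenvalues in ascending order
    return vecs[:, ::-1][:, :d]           # top-d eigenvectors

# usage: project the samples to d dimensions via Y = W.T @ X
```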

Linear Discriminant Analysis
The purpose of LDA is discrimination and classification; it seeks an optimal discriminative subspace by maximizing the between-class scatter while minimizing the within-class scatter. LDA's objective is to find a set of vectors W according to

    W^* = \arg\max_{W} \frac{\operatorname{tr}( W^T S_b W )}{\operatorname{tr}( W^T S_w W )},    (2.4)

where

    S_b = \sum_{i=1}^{C} n_i (m_x^i - m_x)(m_x^i - m_x)^T,  \quad  S_w = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (x_j^i - m_x^i)(x_j^i - m_x^i)^T    (2.5)

respectively represent the between-class scatter matrix and the within-class scatter matrix, and m_x^i is the mean of the ith class. The projection directions W = [w_1, w_2, ..., w_d] are the generalized eigenvectors solving S_b w = λ S_w w associated with the first d largest eigenvalues.
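A minimal numpy sketch of the scatter matrices and the generalized eigenproblem follows. The function name `lda` is our own; using the pseudoinverse of S_w is one common remedy (an assumption on our part) when S_w is singular in the SSS setting.

```python
import numpy as np

def lda(X, labels, d):
    """LDA sketch: X is m x n (columns are samples), labels has length n.

    Solves S_b w = lambda S_w w via pinv(S_w) @ S_b; a small ridge
    term could be used instead when S_w is singular.
    """
    m, n = X.shape
    m_x = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((m, m))
    S_w = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        m_c = Xc.mean(axis=1, keepdims=True)
        S_b += Xc.shape[1] * (m_c - m_x) @ (m_c - m_x).T   # between-class scatter
        S_w += (Xc - m_c) @ (Xc - m_c).T                   # within-class scatter
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(vals.real)[::-1]                    # largest eigenvalues first
    return vecs[:, order[:d]].real
```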

Maximum Margin Criterion
MMC keeps the similarity and dissimilarity information of the high-dimensional space as much as possible after dimensionality reduction by employing the overall variance and measuring the average margin between different classes. MMC's projection matrix is obtained as follows:

    W^* = \arg\max_{W^T W = I} \operatorname{tr}( W^T (S_b - S_w) W ),    (2.6)

where S_b and S_w are defined as in (2.5).
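Because the MMC criterion is a difference rather than a ratio of scatter matrices, no matrix inverse is needed; an ordinary symmetric eigenproblem suffices. A sketch under the same conventions as above (the name `mmc` is ours):

```python
import numpy as np

def mmc(X, labels, d):
    """MMC sketch: maximize tr(W^T (S_b - S_w) W) with W^T W = I.

    Unlike LDA, no inverse of S_w is required, which sidesteps the
    small-sample-size problem.
    """
    m, n = X.shape
    m_x = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((m, m))
    S_w = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        m_c = Xc.mean(axis=1, keepdims=True)
        S_b += Xc.shape[1] * (m_c - m_x) @ (m_c - m_x).T
        S_w += (Xc - m_c) @ (Xc - m_c).T
    vals, vecs = np.linalg.eigh(S_b - S_w)        # symmetric, ordinary eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:d]]    # top-d eigenvectors
```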

Locality Preserving Projection
PCA, LDA, and MMC aim to preserve the global structure of the data set, while LPP aims to preserve its local structure. LPP models the local submanifold structure by maintaining the neighborhood relations between the samples before and after transformation. With the same notation as above, the objective function of LPP is defined as follows:

    W^* = \arg\min_{W} \sum_{i,j} \| W^T x_i - W^T x_j \|^2 SL_{ij} = \arg\min_{W} \operatorname{tr}( W^T X L X^T W ),    (2.7)

where D is a diagonal matrix with D_{ii} = \sum_{j} SL_{ij}, i = 1, ..., n, L = D - SL is the Laplacian matrix, and SL = (SL_{ij})_{n×n} is a similarity matrix defined as

    SL_{ij} = \begin{cases} \exp( -\| x_i - x_j \|^2 / t ), & x_j \in N_i \text{ or } x_i \in N_j, \\ 0, & \text{otherwise,} \end{cases}    (2.8)

where t is a kernel parameter and N_i is the nearest-neighborhood set of x_i. The optimal W is given by the d eigenvectors corresponding to the smallest eigenvalues of the following generalized eigenvalue problem:

    X L X^T w = λ X D X^T w.    (2.9)
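The similarity matrix SL and the Laplacian L = D − SL of (2.7)-(2.8) can be built as follows; the function name `lpp_graph` is our own, and the dense pairwise-distance computation is only a small-data sketch.

```python
import numpy as np

def lpp_graph(X, k, t):
    """Build the LPP similarity matrix SL and Laplacian L = D - SL.

    X is m x n (columns are samples); x_j is a neighbor of x_i when it
    is among the k nearest neighbors under Euclidean distance, and the
    heat-kernel weight exp(-||x_i - x_j||^2 / t) is used, as in (2.8).
    """
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    SL = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]        # k nearest neighbors, excluding i itself
        SL[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    SL = np.maximum(SL, SL.T)                    # symmetrize: x_j in N_i or x_i in N_j
    D = np.diag(SL.sum(axis=1))
    return SL, D - SL

# The projections then solve the generalized problem
# (X L X^T) w = lambda (X D X^T) w, keeping the smallest eigenvalues.
```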

Average Neighborhood Margin Maximum
Different from PCA and LDA, ANMM aims to obtain effective discriminating information by maximizing the average neighborhood margin. For each sample, ANMM pulls the neighboring samples with the same label toward it as near as possible while pushing the neighboring samples with different labels away from it as far as possible. ANMM's solution is as follows:

    W^* = \arg\max_{W^T W = I} \operatorname{tr}( W^T (A - B) W ),    (2.10)

where

    A = \sum_{i} \sum_{k: x_k \in N_i^e} \frac{(x_i - x_k)(x_i - x_k)^T}{| N_i^e |},  \quad  B = \sum_{i} \sum_{j: x_j \in N_i^o} \frac{(x_i - x_j)(x_i - x_j)^T}{| N_i^o |}.    (2.11)

A is called the scatterness matrix, B is called the compactness matrix, N_i^e and N_i^o are, respectively, the ξ nearest heterogeneous and homogeneous neighborhoods of x_i, and | · | denotes the cardinality of a set. Here, we can regard ANMM as the local version of MMC.
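The scatterness and compactness matrices of (2.11) can be assembled directly from the two kinds of neighborhoods; a sketch (the name `anmm_matrices` and the brute-force neighbor search are our own choices):

```python
import numpy as np

def anmm_matrices(X, labels, xi):
    """ANMM sketch: build the scatterness matrix A (from the xi nearest
    heterogeneous neighbors) and the compactness matrix B (from the xi
    nearest homogeneous neighbors); the projection then maximizes
    tr(W^T (A - B) W).
    """
    m, n = X.shape
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    A = np.zeros((m, m))
    B = np.zeros((m, m))
    for i in range(n):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        diff = np.where(labels != labels[i])[0]
        het = diff[np.argsort(d2[i, diff])[:xi]]   # nearest heterogeneous neighborhood
        hom = same[np.argsort(d2[i, same])[:xi]]   # nearest homogeneous neighborhood
        for k in het:
            v = (X[:, i] - X[:, k])[:, None]
            A += v @ v.T / len(het)
        for j in hom:
            v = (X[:, i] - X[:, j])[:, None]
            B += v @ v.T / len(hom)
    return A, B
```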

Global between Maximum and Local Within Minimum
In this section, we present our algorithm: global between maximum, simultaneously local within minimum (GBMLWM). It profits from both global and local methods: the GBMLWM algorithm preserves not only the local neighborhood submanifold structure but also the global information of the data set. To state the proposed algorithm, we first define four domains about x_i as follows.

Domain I: the samples that belong to the nearest neighborhood of x_i and are labeled the same class as x_i.

Domain II: the samples that also belong to the nearest neighborhood of x_i but are labeled a different class from x_i.

Domain III: the samples labeled the same class as x_i that do not lie in its nearest neighborhood.

Domain IV: the samples that do not lie in the nearest neighborhood of x_i and are labeled a different class from x_i.

Figure 1 gives an intuition about the above four domains. The nearest neighborhood of x_i consists of domains I and II. The samples labeled the same class as x_i lie in domains I and III, and the samples labeled a different class lie in domains II and IV.
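The four-domain partition described above is mechanical to compute; a sketch follows, where the function name `four_domains` and the dictionary representation are our own illustrative choices.

```python
import numpy as np

def four_domains(X, labels, i, K):
    """Partition all samples except x_i into the four GBMLWM domains.

    Domain 1: same class as x_i, inside its K-nearest neighborhood.
    Domain 2: different class, inside the neighborhood.
    Domain 3: same class, outside the neighborhood.
    Domain 4: different class, outside the neighborhood.
    """
    n = X.shape[1]
    d2 = ((X - X[:, [i]]) ** 2).sum(axis=0)            # squared distances to x_i
    others = np.array([j for j in range(n) if j != i])
    nbrs = set(others[np.argsort(d2[others])[:K]])     # K nearest neighbors of x_i
    dom = {1: [], 2: [], 3: [], 4: []}
    for j in others:
        same = labels[j] == labels[i]
        near = j in nbrs
        if same and near:
            dom[1].append(j)
        elif near:
            dom[2].append(j)
        elif same:
            dom[3].append(j)
        else:
            dom[4].append(j)
    return dom
```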

Global between Maximum
The purpose of classification and feature extraction is to keep samples with different labels apart from each other. We first operate on the points in domains II and IV by maximizing the global and local between-class scatter. That is to say, our aim is not only to make the data globally separable, but also to maximize the distance between different classes within the nearest neighborhood. Thus, our objective functions are defined as in (3.1).

Local Within Minimum
For classification, maximizing the between-class scatter alone is not adequate; compacting the within-class scatter is also required. So we now make the samples from domain I close to x_i itself, push the samples from domain II away from x_i, and make the samples from domain III close to their own class centroid; the corresponding objectives and matrices are given in (3.3)-(3.5).

GBMLWM Algorithm
In the previous description, the nearest neighborhood of x_i is taken to be its K-nearest neighborhood based on the Euclidean distance between samples of the data set. Our objective function is defined in (3.6), with the matrix M given in (3.7). The optimal projection directions W are then the solutions to the optimization problem (3.8).

So W = [w_1, ..., w_d] consists of the eigenvectors of Mw = λw corresponding to the first d largest eigenvalues. It is obvious that the GBMLWM algorithm is fairly straightforward: it computes no matrix inverse and thus completely avoids the SSS problem. The algorithm procedure of GBMLWM is formally summarized as follows:

(1) for each sample x_i, i = 1, ..., n, divide the remaining samples of the data set into the four domains I, II, III, and IV;

(2) compute S_t, L_1, L_2, and SL_w according to (2.3), (3.3), (3.4), and (3.5), respectively;

(3) obtain the matrix M according to (3.7);

(4) compute the eigenvectors of Mw = λw, and form the optimal projection matrix W = [w_1, ..., w_d] from those corresponding to the d largest eigenvalues, where d is the rank of M.

For a testing sample x, its image in the lower-dimensional space is given by x → y = W^T x (3.9).

Discussion
Here we find that methods limited to the global structure or the local geometry of the data set are special cases of the GBMLWM algorithm. PCA regards the data set as one whole domain and demands that all the samples lie far from the total mean of the data set; thus, PCA is an unsupervised special case of the GBMLWM algorithm. Both MMC and LDA divide the samples except x_i in the data set into two domains: one is composed of the samples labeled the same class as x_i, called the within-class domain (S_w); the other contains the samples labeled a different class from x_i, called the between-class domain (S_b). They correspond, respectively, to the domains I ∪ III and II ∪ IV, as illustrated in Figure 1. The local methods, such as LPP and ANMM, differ from the above global methods: they divide the whole data set into domains according to the nearest neighborhood of x_i. LPP operates in I ∪ II, while ANMM operates in domains I and II separately, as depicted in Figure 1. These local methods do not utilize the global information of the data set and are local special cases of the algorithm proposed in this paper.

Training cost is the amount of computation required to find the optimal projection vectors and the sample feature vectors of the training set for comparison. We compare the training cost of the methods on the basis of their computational complexities. Here, we suppose that each class has the same number of training samples. If we regard each column vector as a computational cell and do not consider the computational complexity of the eigen-analysis, we can estimate the approximate computational complexity of the six different algorithms, covering both local and global techniques. Table 1 gives this analysis. From Table 1, we can see that our method has the largest training cost. However, in practice, the size of the neighborhood and the number of classes are often not large enough to cause much extra computation. The computational complexity of GBMLWM also shows that our algorithm not only considers the global information but also utilizes the local geometry, which makes it reflect the intrinsic structure of the training set efficiently. The following experimental results also manifest this point.

Experiments
In this section, we carry out several experiments to show the effectiveness of the proposed GBMLWM method for data visualization, face representation, and recognition. We compare the global methods (PCA, LDA, and MMC) and the local methods (LPP, NPE, and ANMM) with our proposed method on four databases: the MNIST digit, Yale, ORL, and UMIST databases. In the PCA step, we retain only N − C dimensions to ensure that the scatter matrix is nonsingular. In the testing phases, the size of the neighborhood k is determined by 5-fold cross-validation in all experiments, and the nearest neighbor (NN) rule is used for classification. For the LPP and GBMLWM algorithms, the weight between two samples is computed with a Gaussian kernel, and the kernel parameter is selected as follows: we first compute the pairwise distances among all the training samples; then t is set equal to half the median of those pairwise distances.
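The kernel-parameter rule just described (t equal to half the median pairwise training distance) can be written as a short numpy sketch; the function name `heat_kernel_t` is our own.

```python
import numpy as np

def heat_kernel_t(X):
    """Select the Gaussian-kernel parameter t as half the median of
    all pairwise distances between training samples (columns of X).
    """
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])   # distinct pairwise distances
    return 0.5 * np.median(dists)
```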

Data Visualization
In this subsection, we first use a publicly available handwritten digit database to illustrate data visualization. The MNIST database [20] has 10 digits, and each digit contains 39 samples, for a total of 390 images of size 20 × 16. Here, we select only 20 samples from each digit, so the size of the training set is 320 × 200, and each image is represented lexicographically as a high-dimensional vector of length 320. Figure 2 shows all the samples of the ten digits. For visualization, we project the data set into 2-D space by all seven subspace learning methods, and the experimental results are depicted in Figure 3. With the exception of LDA and GBMLWM, the samples from different digits heavily overlap. Compared with the GBMLWM algorithm, LDA makes the samples from the same class collapse to a single point. Although this phenomenon is helpful for classification, it has poor generalization ability since it does not reveal the within-class variation of each object. The GBMLWM algorithm not only separates the digits but also shows what is hidden within each digit. When the number of nearest neighbors of x_i is reduced from K = 15 to K = 2, the samples from the same object become more and more compact. This also verifies that LDA is a special case of GBMLWM.

Yale Database
This experiment aims to demonstrate the ability to capture important information on the Yale face database [21], here called face representation. The Yale face database contains 165 gray-scale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration: center-light, with/without glasses, happy, left/right light, normal, sad, sleepy, surprised, and wink. All images from the Yale database were cropped, and the cropped images were normalized to 32 × 32 pixels with 256 gray levels per pixel. Some samples from the Yale database are shown in Figure 4. Here, the training set is composed of all the samples from this database. The 10 most significant eigenfaces obtained from the Yale face database by the seven subspace learning methods are shown in Figure 5. From Figure 5 we can clearly see that our algorithm captures more of the basic information of the face than the other methods.

UMIST Database
Figure 3: Two-dimensional projections of the handwritten digits obtained by the seven related subspace learning methods; the different markers denote the digits 0 to 9.

The UMIST database [22] contains 564 images of 20 individuals, each covering a range of poses from profile to frontal views. Subjects cover a range of race, sex, and appearance. We use a cropped version of the UMIST database that is publicly available at S. Roweis' Web page. All the cropped images are normalized to 64 × 64 pixels with 256 gray levels per pixel. Figure 6 shows some images of an individual. We randomly select three, four, five, and six images of each individual for training and use the rest for testing. We repeat these trials ten times and compute the average results. The maximal average recognition rates of the seven subspace learning methods are presented in Table 2. From Table 2, we find that the GBMLWM algorithm's highest accuracies are, respectively, 79.88%, 86.10%, 91.85%, and 93.80% on the different training sets and corresponding testing sets. The improvements are significant. Furthermore, the dimensions of the four GBMLWM subspaces corresponding to the maximal recognition rates are remarkably low: 15, 13, 11, and 18, respectively.

ORL Database
The ORL face database [23] contains 40 distinct subjects, each with ten different images, giving 400 images in all. For some subjects, the images were taken at different times, varying the lighting, facial expressions, and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position. All images from the ORL database are cropped, and the cropped images are normalized to 32 × 32 pixels with 256 gray levels per pixel. Some samples from this database are shown in Figure 7. In this experiment, the four training sets contain, respectively, three, four, five, and six samples from each subject, and the remaining samples form the corresponding testing sets. We repeat these trials ten times and compute the average results. The recognition rates versus the reduced dimensions are shown in Figure 8, and the best average recognition rates of the seven subspace learning methods are presented in Table 3. It can be seen that the GBMLWM algorithm's recognition rates remarkably outperform the other methods on all four training subsets, with highest accuracies of 90.07%, 94.67%, 97.20%, and 97.31%, respectively. The standard deviations of GBMLWM corresponding to the best results are 0.03, 0.02, 0.02, and 0.02.

Conclusions
In this paper, we have proposed a new linear projection method, called GBMLWM. It is an efficient linear subspace learning method that combines supervised and unsupervised characteristics. Similar to PCA, LDA, and MMC, we consider the global character of the data set; at the same time, similar to LPP, NPE, and ANMM, we make the best use of its local geometric structure. We have pointed out that several existing linear subspace learning methods are special cases of our GBMLWM algorithm. A large number of experiments demonstrate that the proposed method is clearly superior to other existing methods, such as LDA and LPP.

Figure 1: The four domains into which the samples other than x_i in the data set are divided. The left figure shows the four domains in the original high-dimensional space, and the right depicts them in the low-dimensional space.

Figure 2: All the samples of handwritten digits from 0 to 9 used in our data visualization experiment.

Figure 4: Some face samples from the Yale database.

Figure 5: The 10 most significant eigenfaces obtained from the Yale face database using the seven subspace learning methods: PCA, LDA, MMC, LPP, NPE, ANMM, and GBMLWM, from top to bottom.

Figure 6: Some face samples from the UMIST database.

Figure 7: Some face samples from the ORL database.

Figure 8: Average recognition rates of the seven subspace learning methods versus the reduced dimensions on the ORL database, for different numbers of samples from each object.

Table 1: Estimated computational complexity of the six different algorithms, where n denotes the total number of training samples, and k and C are the size of the neighborhood and the number of classes, respectively.

Table 2: Recognition accuracy (%) of different algorithms on the UMIST database; the numbers in brackets are the corresponding dimensions.

Table 3: The best recognition accuracy (%) of different algorithms on the ORL database; the numbers in brackets are the corresponding standard deviations.