Local and Global Geometric Structure Preserving and Application to Hyperspectral Image Classification

Locality Preserving Projection (LPP) has proved efficient for feature extraction. LPP captures locality through K-nearest neighborhoods. However, recent work has demonstrated the importance of global geometric structure in discriminant analysis, so both local and global geometric structure are critical for dimension reduction. In this paper, a novel linear supervised dimensionality reduction algorithm, called Locality and Global Geometric Structure Preserving (LGGSP) projection, is proposed. LGGSP encodes both the local and the global structure information into its objective functions. Specifically, two adjacency matrices, a similarity matrix and a variance matrix, are constructed to detect the local intrinsic structure, and a margin matrix is defined to capture the global structure of different classes. Finally, the three matrices are integrated into the graph embedding framework to obtain the optimal solution. The proposed scheme is illustrated using both simulated data points and the well-known Indian Pines hyperspectral data set, and the experimental results are promising.


Introduction
Hyperspectral image (HSI) processing, a typical application of high dimensional data analysis, has attracted great interest among researchers worldwide [1]. The acquisition of hyperspectral images is usually concerned with the analysis, measurement, understanding, and interpretation of a given scene captured at various flight distances by satellite [2]. Different HSI data sets pose different levels of challenge for data analysis. However, a common issue of HSI data is the high dimensional feature space combined with a relatively small sample size [3], which is also known as the "Hughes phenomenon." To increase efficiency, the dimensionality of HSI data must be reduced before further processing. Dimension reduction therefore plays a significant role in the HSI community [4].
Recently, some articles [11] pointed out that high dimensional data may lie on a submanifold that reflects the inherent geodesic structure. Under this circumstance, both PCA and LDA may fail to find the hidden manifold, whereas nonlinear methods, such as LLE and Isomap, have been proposed and developed to tackle this difficulty. However, the mappings of LLE and Isomap are implicit, and there is no explicit expression for computing the embedding of new data points. That is, the projected data points of LLE and Isomap are defined only on the training data points, and neither method can directly embed a new data point in the projected space. Moreover, these methods are computationally expensive. These drawbacks make the algorithms hard to develop further and limit their wide application in various areas, especially in the hyperspectral image analysis community.
To address this issue, He and Niyogi [10] proposed the Locality Preserving Projection (LPP), a linear approximation of the intrinsic manifold, to reduce high dimensional facial feature vectors to a low dimensional subspace. The neighborhood relationship in LPP is preserved in the projected submanifold. However, LPP is an unsupervised algorithm, and the discriminant information is ignored. Wong and Zhao [11] proposed a supervised version of LPP, in which the discriminant information of different classes is adopted to improve classification performance. Vasuhi and Vaidehi [12] found that the basis of LPP in the projected space is not orthogonal. They applied an orthogonal basis to facial classification and found that the classification accuracy of the orthogonal basis was better than that of conventional LPP. A common theme of many discriminant analysis based methods is this: by minimizing the neighborhood distance within the same class, locality based approaches use discriminant information to maximize the distance among data points from different classes while preserving the intraclass compactness. The distance between adjacent data points represents the local geometrical structure of the same class, whereas the distance between data points of different classes indicates the global geometrical structure. By doing so, the structure of data points in the projected space is expected to be similar to that in the original space.
Despite this, some articles, such as Local and Global Structures Preserving Projection (LGSPP) [13] and Joint Global and Local Structure Discriminant Analysis (JGLDA) [14], reported that, besides locality, the global structure is also important. The locality can generally be captured by a Laplacian matrix derived from the neighborhood relationship, that is, the adjacency graph. The global geometric structure can also be captured by a relationship matrix, for example, a penalty matrix [15], a $k$-farthest neighborhood adjacency matrix [16], or a nearest neighborhood adjacency matrix [17]. However, these methods capture only the similarity structure of data points to learn the intrinsic geometric structure (local structure). They ignore the distribution of data points, so the structure of data points in the embedded space is destroyed, which leads to an incorrect description of the data structure. In most instances, locality alone is insufficient for describing the intrinsic geometry of data points. Thus, the description will be more discriminative if both local and global statistical properties are integrated to describe the geometry of data points.
Motivated by these factors, in this paper we propose a novel approach, Locality and Global Geometric Structure Preserving (LGGSP) projection, that makes use of both the local structure and the global structure of data points to reduce the dimensionality of feature vectors. Specifically, we focus on the global distribution of data points, where the local structure is characterized by the similarity and the diversity of samples from the same class, and the global structure is characterized by the margin between samples of different classes. To discover both the local and global structures hidden in data points, we first define three optimization functions. We then solve them in the framework of graph embedding to make the LGGSP algorithm supervised. Finally, a linear transformation is found by applying the principle of discriminant analysis.
The rest of this paper is organized as follows. Section 2 provides a brief analysis of basic discriminant techniques. The proposed LGGSP is presented in Section 3. Results on synthetic data sets and real hyperspectral image data are presented in Section 4. Finally, concluding remarks and discussion are given in Section 5.

Related Works
Before further discussion, some of the notations that will be used throughout this paper are listed in Notations section.
A brief review of discriminant analysis techniques, for example, Locality Preserving Projection (LPP) [10] and discriminant analysis [13,14], is provided in this section. To facilitate the following discussion, we start with a supervised learning problem. Suppose that the $m$-dimensional data set $X = \{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^{m}$, is distributed on a $d$-dimensional submanifold ($d < m$), and that this data set belongs to $c$ classes with class labels $\{l_i\}_{i=1}^{N}$. Let $n_k$ be the number of samples in class $k$; then $N = \sum_{k=1}^{c} n_k$. We seek a transformation $A$: $Y = A^\top X$, $A \in \mathbb{R}^{m \times d}$, that projects the $m$-dimensional data points $X = \{x_i\}_{i=1}^{N}$ to $d$-dimensional data points $Y = \{y_i\}_{i=1}^{N}$ with the goal of preserving the data structure without losing any information needed. The notation $\top$ represents the transpose of a matrix or a vector. The problem at hand is therefore how to evaluate the data model and formulate the objective transformation $A$.

Locality Preserving Projection.
LDA aims to learn a global structure that separates samples efficiently. Nevertheless, for most real world applications, the local structure of the neighborhood is also important. Locality Preserving Projection (LPP) is a graph based subspace learning algorithm in which the neighborhood structure is preserved in the projected space. To achieve this goal, a weighted graph $G = (\mathcal{V}, E, W)$ is constructed, where $\mathcal{V}$ represents the vertex set, $E$ denotes the edges between connected data points, and $W$ is the similarity weight that characterizes the likelihood of pairwise data points.
For a new point $x_i$, LPP defines a linear transformation into the mapping space; that is, $y_i = A^\top x_i$. The criterion function of LPP then becomes
$$\min_{A} \sum_{i,j} \|y_i - y_j\|^2 W_{ij}, \tag{1}$$
where $W$ is the similarity matrix of pairs of data points. If two neighboring data points $x_i$ and $x_j$ are mapped far apart, then $W_{ij}$ incurs a heavy penalty. This property ensures that adjacent data points stay as close as possible in the embedded space. By simple algebra, it can be deduced from (1) that
$$\frac{1}{2} \sum_{i,j} \|y_i - y_j\|^2 W_{ij} = \operatorname{tr}\left(A^\top X (D - W) X^\top A\right) = \operatorname{tr}\left(A^\top X L X^\top A\right), \tag{2}$$
where $D$ is a diagonal matrix whose entries are the column (or row) sums of $W$; that is, $D_{ii} = \sum_j W_{ij}$. $L = D - W$ is the Laplacian matrix, which is the discrete approximation of the Laplace-Beltrami operator on a compact Riemannian manifold [11]. Naturally, the matrix $D$ provides a measure on the data points: the larger $D_{ii}$ is, the more important $y_i$ is.
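The algebra behind (2) can be spelled out as follows. This is a sketch in the standard LPP notation, assuming a symmetric weight matrix $W$, the degree matrix $D$ with $D_{ii} = \sum_j W_{ij}$, a projection matrix written here as $A$ (so that $y_i = A^\top x_i$), and $Y = A^\top X$ with the $y_i$ as columns; the symbol names are assumptions made for illustration.

```latex
\begin{aligned}
\frac{1}{2}\sum_{i,j}\|y_i - y_j\|^2 W_{ij}
  &= \frac{1}{2}\sum_{i,j}\left(y_i^\top y_i - 2\,y_i^\top y_j + y_j^\top y_j\right) W_{ij} \\
  &= \sum_{i} D_{ii}\, y_i^\top y_i - \sum_{i,j} W_{ij}\, y_i^\top y_j \\
  &= \operatorname{tr}\left(Y D Y^\top\right) - \operatorname{tr}\left(Y W Y^\top\right) \\
  &= \operatorname{tr}\left(A^\top X (D - W) X^\top A\right)
   = \operatorname{tr}\left(A^\top X L X^\top A\right).
\end{aligned}
```

The second line uses the symmetry of $W$, which makes the row sums and column sums coincide, so both squared-norm terms collapse to $\sum_i D_{ii}\, y_i^\top y_i$.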
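To make a uniform measurement and remove the arbitrary scaling factor in the embedding, LPP imposes an additional constraint:
$$Y D Y^\top = I \;\Longrightarrow\; A^\top X D X^\top A = I. \tag{3}$$
This constraint is incorporated into the objective function. The minimization problem then reduces to
$$\arg\min_{A} \operatorname{tr}\left(A^\top X L X^\top A\right) \quad \text{s.t.} \quad A^\top X D X^\top A = I. \tag{4}$$
The solution can be obtained by solving the generalized eigenvalue problem
$$X L X^\top a = \lambda\, X D X^\top a. \tag{5}$$
Let $\{\lambda_i\}_{i=1}^{d}$ be the $d$ smallest eigenvalues of (5) in ascending order, that is, $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_d$, and $\{a_i\}_{i=1}^{d}$ the corresponding eigenvectors. Then the solution of (4) is given by
$$A_{\mathrm{LPP}} = [a_1, a_2, \ldots, a_d]. \tag{6}$$
For a new testing instance $x$, the new data point $\mu$ in the embedding space is given by
$$\mu = A_{\mathrm{LPP}}^\top x. \tag{7}$$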
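The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name `lpp`, the heat-kernel weighting with parameter `t`, the symmetrised kNN graph, and the small `ridge` term added for numerical stability are all assumptions.

```python
import numpy as np


def lpp(X, k=5, t=1.0, d=2, ridge=1e-9):
    """Sketch of LPP. X has shape (m, N): one column per sample.

    Returns the projection matrix A (m, d) and the d smallest eigenvalues.
    """
    m, N = X.shape
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbour

    # Heat-kernel weights on a symmetrised kNN graph (assumed weighting).
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(dist2[i])[:k]
        W[i, nbrs] = np.exp(-dist2[i, nbrs] / t)
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))
    L = D - W
    left = X @ L @ X.T
    right = X @ D @ X.T + ridge * np.eye(m)  # ridge keeps Cholesky stable

    # Reduce left a = lambda right a to a symmetric standard eigenproblem.
    C = np.linalg.cholesky(right)               # right = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ left @ C_inv.T
    evals, U = np.linalg.eigh((M + M.T) / 2.0)  # ascending eigenvalues
    A = C_inv.T @ U[:, :d]                      # back to the original variables
    return A, evals[:d]
```

Because the map is an explicit matrix, a new sample `x_new` of shape `(m,)` is embedded simply as `A.T @ x_new`, which is the out-of-sample property that LLE and Isomap lack.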
LPP can effectively find a projection that preserves the data structure. However, because it is unsupervised, data points close to class boundaries may be placed close together in the projected space even though they belong to different classes. Besides, LPP makes use only of nearest neighborhoods, and the global geometric structure is completely ignored in the calculation. This drawback makes the algorithm apt to overfit the training samples. From the above analysis, we can see that LPP is sensitive to noise in defective samples. For this reason, LPP has inherent deficiencies in learning ability and robustness.

Laplacian Discriminant Analysis.
As an extension of discriminant analysis, the efficiency of Laplacian linear discriminant analysis (LapLDA) has been demonstrated by many studies [18]. The common feature of these approaches is that an adjacency graph is employed to model the geometrical structure of the intrinsic manifold [19]. There are two popular approaches to constructing the adjacency matrix: the first adopts the $k$-nearest neighborhood (the $k$NN approach), and the other places an edge between two data points within a controllable Euclidean distance $\epsilon$ (the $\epsilon$-neighborhood approach). LapLDA depicts the locality by the following quadratic function:
$$\min \sum_{i,j} \|y_i - y_j\|^2 W_{ij}, \tag{8}$$
where $W_{ij}$ denotes the "weights" of connected points, defined in (9) in terms of $k\mathrm{NN}(x_*)$, the $k$ nearest neighbors of $x_*$. By this definition, the smaller the distance between two connected neighbors, the larger the "weight" they receive, and the closer they should be kept in the mapped space. Nevertheless, (8) also forces data points separated by a larger distance to be closer in the low dimensional subspace, which may disturb the structure between connected data pairs. To cope with this issue, some researchers proposed an approach that integrates both global and local structure into the objective function [14]. In order to construct a reasonable locality adjacency matrix, the typical global structure of neighboring data points can be represented by the penalty matrix defined in (10), where $k\mathrm{FN}(x_*)$ denotes the $k$-farthest neighborhood of $x_*$ and $d(x_i, x_j)^2$ is the squared Euclidean distance between two points $x_i$ and $x_j$.
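The two graph constructions described above can be sketched as follows. The heat-kernel weight $\exp(-\|x_i - x_j\|^2 / t)$ used here is a common choice that matches the stated behaviour (smaller distance, larger weight), but it is an assumption and not necessarily the exact definition in (9); the function names and the parameter `t` are likewise hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_heat_weights(points, k, t=1.0):
    """kNN approach: connect i and j if either is among the other's k nearest."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        for j in others[:k]:
            w = math.exp(-sq_dist(points[i], points[j]) / t)
            W[i][j] = w
            W[j][i] = w  # keep the graph symmetric
    return W


def eps_heat_weights(points, eps, t=1.0):
    """Epsilon-neighborhood approach: connect points closer than eps."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sq_dist(points[i], points[j])
            if d2 <= eps ** 2:
                W[i][j] = W[j][i] = math.exp(-d2 / t)
    return W
```

The kNN variant guarantees that every point has at least `k` edges, whereas the epsilon variant can leave isolated points when `eps` is small, which is one practical reason the kNN graph is often preferred.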

Proposed Methodology
The structure of HSI data is very complex; hence it is insufficient to represent HSI data using only global or only local properties. To model complex HSI data, a novel approach that preserves both the local and global geometric structure of data samples is proposed in this section. The new approach is called Locality and Global Geometric Structure Preserving (LGGSP) projection. Detailed motivation and formulation are given below.

Capturing the Local Structure of Intraclass Samples.
Inspired by [11,14], the local structure in LGGSP is described by two adjacency matrices, that is, the similarity matrix and the diversity matrix. To model the local structure, two adjacency graphs, $G_v = \{X, W^v\}$ and $G_s = \{X, W^s\}$, are adopted to model the diversity and similarity over the whole set of training samples from the same class, where $X$ denotes the whole set of training samples, $W^v$ is the diversity matrix, and $W^s$ is the similarity matrix. $G_v$ reflects the variance of nearby data points, and $G_s$ characterizes the similarity among nearby data points.
To make samples more separable, we define the similarity matrix $W^s$ in (11), where $p(l_i)$ ($l_i \in C$) is the class prior probability of the $l_i$th class, $t > 0$ is the slack parameter, and $\|\cdot\|_F^2$ is the squared Frobenius norm.
Statistically, if two samples $x_i$ and $x_j$ are very close, that is, $\|x_i - x_j\|_F^2$ is small, then the distance between them is also small, and their similarity should be large in the embedding space. In contrast, if $\|x_i - x_j\|_F^2$ is large, which implies that the samples tend to be dissimilar, the corresponding similarity will be small. Note that, in (11), the class prior $p(l_i)$ is imposed to ensure that the samples have the same class prior probability in the embedded space.
On the other hand, to measure the distribution of nearby data points, diversity is introduced. In contrast to the similarity matrix, the diversity between two connected samples is large when their distance is large and small when their distance is small. This property accounts for the small diversity of two adjacent points from the same class. The diversity matrix $W^v$ is defined accordingly in (12), which contains two free tuning parameters. Now consider the problem of mapping the original HSI data to a line so that the connected points from the same class are preserved. Let $Y = \{y_i\}_{i=1}^{N}$ be the mapped points. With $W^s$ the similarity matrix and $W^v$ the diversity matrix, the two within-class criteria are
$$\min \sum_{i,j} \|y_i - y_j\|^2 W^s_{ij}, \tag{13}$$
$$\max \sum_{i,j} \|y_i - y_j\|^2 W^v_{ij}. \tag{14}$$
Note that (13) incurs a heavy penalty on the within-class graph if two adjacent points $x_i$ and $x_j$ that are close to each other and come from the same class are mapped far apart. Similarly, (14) incurs a heavy penalty on the within-class graph if two neighboring points $x_i$ and $x_j$ that share the same label are mapped so close together that they collapse to a single point. Hence, minimizing (13) ensures that neighboring points with the same label are also close in the embedding space, while maximizing (14) helps prevent overfitting and preserves the variation in the projected space. The limitation of (13) is that it may force connected points separated by a large distance to be very close to each other in the reduced space, which violates the preservation of topological structure. The constraint of (14) alleviates this situation. By integrating (13) and (14), the structural topology can be approximately preserved in the embedding space. That is, connected data points that are far apart tend to remain far apart, while samples that are close together are kept close in the embedded space. Thus, the local structure can be preserved under the objective functions (13) and (14).

Capturing the Global Structure of Interclass Samples.
To capture the global structure of different classes, an adjacency graph $G_m = \{X, W^m\}$ is constructed over the whole set of training samples. The notation $W^m$ denotes the weight matrix of graph $G_m$; it represents the variation (i.e., margin or distribution) of the connected samples from different classes on the entire training data set. Similar to the local Fisher's goal [20], we do not "weight" the values of samples from different classes. The reason is that, since we want to separate samples from different classes as much as possible, their affinity in the original feature space is ignored in the embedded subspace. To encode the discriminant information into the variation matrix $W^m$, its elements are defined in (15). Now consider the problem of mapping HSI data to a line so that connected data points in the adjacency matrix $W^m$ stay as far apart as possible. In order to encode the discriminant information, a reasonable mapping can be found by optimizing the following function:
$$\max \sum_{i,j} \|y_i - y_j\|^2 W^m_{ij}. \tag{16}$$
Note that the objective function (16) on the between-class graph $G_m$ incurs a heavy penalty when two neighboring points $x_i$ and $x_j$ are mapped close together despite the fact that their labels are different. In this case, maximizing (16) forces the corresponding mapped points $y_i$ and $y_j$ to stay far apart. Thus, the global geometric structure of interclass samples can be well detected by (16).
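A between-class graph of this kind can be sketched as follows. The exact element definition in (15) is not reproduced here; the sketch assumes binary (unweighted) edges between each sample and its `k` nearest samples from other classes, which is consistent with the statement that samples from different classes are not "weighted". The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def between_class_graph(points, labels, k):
    """Binary margin graph: link each point to its k nearest other-class points."""
    n = len(points)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if labels[j] != labels[i]]
        others.sort(key=lambda j: sq_dist(points[i], points[j]))
        for j in others[:k]:
            M[i][j] = 1
            M[j][i] = 1  # symmetric adjacency
    return M


def margin_objective(Y, M):
    """Sum of M[i][j] * ||y_i - y_j||^2 over all ordered pairs (to be maximised)."""
    n = len(Y)
    return sum(M[i][j] * sq_dist(Y[i], Y[j])
               for i in range(n) for j in range(n))
```

Evaluating `margin_objective` on candidate embeddings shows the intended effect: a projection that pushes linked points of different classes apart yields a larger value.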
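Optimal Solution.
Let $x_i$ and $x_j$ be connected points in the original space, $A \in \mathbb{R}^{m \times d}$ the projection matrix, and $y_i$ and $y_j$ the embedded points; that is, $y_i = A^\top x_i$ and $y_j = A^\top x_j$. Let $W^s$, $W^v$, and $W^m$ denote the similarity, diversity, and margin weight matrices. To solve the objective functions (13), (14), and (16) in the Laplacian graph embedding framework, we substitute $y_i = A^\top x_i$ into the three functions. By simple algebra, the three objective functions can be written as
$$\sum_{i,j} \|y_i - y_j\|^2 W^s_{ij} = 2 \operatorname{tr}\left(A^\top X L_s X^\top A\right). \tag{17}$$
Likewise,
$$\sum_{i,j} \|y_i - y_j\|^2 W^v_{ij} = 2 \operatorname{tr}\left(A^\top X L_v X^\top A\right), \qquad \sum_{i,j} \|y_i - y_j\|^2 W^m_{ij} = 2 \operatorname{tr}\left(A^\top X L_m X^\top A\right), \tag{18}$$
where
$$L_s = D_s - W^s, \qquad L_v = D_v - W^v, \qquad L_m = D_m - W^m. \tag{19}$$
The notations $D_s$, $D_v$, and $D_m$ represent the $N$-dimensional diagonal matrices whose $i$th diagonal elements are
$$(D_s)_{ii} = \sum_j W^s_{ij}, \qquad (D_v)_{ii} = \sum_j W^v_{ij}, \qquad (D_m)_{ii} = \sum_j W^m_{ij}, \tag{20}$$
respectively. The matrices $L_s$, $L_v$, and $L_m$ are the Laplacian matrices in graph embedding [15]. The three objective functions in (17) and (18) are then joined into one objective function, and the final optimization problem (21) is expressed through two matrices $S_b$ and $S_w$ defined in (22). The notations $\alpha_1$, $\alpha_2$, and $\beta$ in (22) represent nonnegative constants that balance the "importance" of each criterion, where $0 \le \alpha_1 \le 1$, $0 \le \alpha_2 \le 1$, and $0 \le \beta \le 1$. In all experiments, we take $\alpha_1 = 0.8$, $\alpha_2 = 0.1$, and $\beta = 0.5$. Moreover, the notations $S_b$ and $S_w$ represent the between-class scatter matrix and the within-class scatter matrix, respectively. Note that optimizing (21) leads to a generalized eigenvalue decomposition problem:
$$S_b\, a = \lambda\, S_w\, a. \tag{23}$$
Let the column vectors $\{a_i\}_{i=1}^{d}$ be the solutions of (23) corresponding to the eigenvalues ordered as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Then the optimal projection of LGGSP is given by
$$A_{\mathrm{LGGSP}} = [a_1, a_2, \ldots, a_d], \qquad y_i = A_{\mathrm{LGGSP}}^\top x_i, \tag{24}$$
where $y_i \in \mathbb{R}^{d}$ is a $d$-dimensional vector, $A_{\mathrm{LGGSP}} \in \mathbb{R}^{m \times d}$ is the projection matrix, and $x_i \in \mathbb{R}^{m}$ is the original high dimensional point.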
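The final step, extracting the leading generalized eigenvectors, can be sketched with NumPy as below. The sketch takes the between-class and within-class matrices (named `S_b` and `S_w` here) as given inputs and does not attempt to reproduce how they are assembled in (22); the function name and the `ridge` regulariser are assumptions, the latter added because the within-class matrix can be singular when the sample size is small relative to the dimensionality.

```python
import numpy as np


def top_generalized_eigvecs(S_b, S_w, d, ridge=1e-8):
    """Solve S_b a = lambda S_w a and return the d leading eigenvectors.

    S_b must be symmetric and S_w symmetric positive semidefinite.
    """
    m = S_w.shape[0]
    # Regularised Cholesky factor: S_w + ridge * I = C C^T.
    C = np.linalg.cholesky(S_w + ridge * np.eye(m))
    C_inv = np.linalg.inv(C)
    # Equivalent symmetric standard problem: (C^-1 S_b C^-T) u = lambda u.
    M = C_inv @ S_b @ C_inv.T
    evals, U = np.linalg.eigh((M + M.T) / 2.0)  # ascending order
    order = np.argsort(evals)[::-1][:d]         # indices of the d largest
    A = C_inv.T @ U[:, order]                   # map back: a = C^-T u
    return A, evals[order]
```

With the columns of `X` holding the samples, the reduced features are then `A.T @ X`, and the same matrix embeds unseen test pixels.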

Experiment on Synthetic Data Sets.
To illustrate the effectiveness of the proposed LGGSP algorithm, five synthetic data sets were investigated: a toy example, the "tulip" data set, the "ripley" data set, a generated multimodal example, and a "two-moon" data set. Seven methods were compared: LDA [21], PCA [22], LPP [10], MFA [23], LGSPP [13], JGLDA [14], and the proposed LGGSP algorithm. There are 100 test samples for the toy example, 100 for "tulip," 1000 for "ripley," 200 for the multimodal example, and 100 for the "two-moon" data set. All algorithms were implemented in MATLAB, and all computations were carried out on an Acer Aspire-5750G laptop with an i7-2670QM processor (2.2 GHz) running Ubuntu 12.04.1 LTS (64-bit).
Figure 1 shows the results for the simple two-class case on the first three test data sets. Several conclusions can be drawn from these examples. First, LDA, MFA, JGLDA, and the proposed LGGSP algorithm work quite well on the simple linearly separable toy example. All algorithms produce comparable results on the "tulip" data set. For the "ripley" data set, only LPP, LDA, JGLDA, and the proposed LGGSP find the optimal direction. The three examples indicate the robustness of the proposed LGGSP algorithm.
Figure 2 shows the experimental results for the relatively complex examples. The features extracted by LGGSP in Figure 2(a) (the multimodal example) are nearly optimal. In particular, LDA and LGSPP perform very poorly on this classification problem. In the "two-moon" example (shown in Figure 2) [...] optimal method under this situation. LDA is slightly different from JGLDA and LGGSP but is still better than the other methods. LGSPP performs the worst on this data set. Note that, in this scenario, LGGSP still produces results comparable with those of JGLDA, which reflects the robustness of the proposed algorithm.

Experiments on Real HSI Data Set.
In this subsection, we compare our proposed method with PCA [22], LapLDA [24], MFA [23], LPP [10], RP [9], LGSPP [13], and JGLDA [14] on a hyperspectral image, the well-known Indian Pines scene. In the following experiments, dimension reduction techniques are first applied to reduce the dimensionality of the input features, followed by classification with a concrete classifier (e.g., an NN classifier or an SVM classifier). The Indian Pines 1992 data set was gathered by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the northwestern Indian Pines test site in 1992; it consists of 145 × 145 pixels and 224 spectral reflectance bands in the [...] Since the purpose of this paper is to reduce the dimensionality of HSI data for classification, performance is measured by overall accuracy (OA), kappa coefficient (kappa), and average accuracy (AA).
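For reference, the three measures can be computed from a confusion matrix as sketched below, using their standard definitions: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies, and kappa corrects OA for chance agreement. The function name and the row/column convention (rows are true classes, columns are predicted classes) are assumptions of this sketch.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i
    that were assigned to class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: correctly classified samples over all samples.
    oa = correct / total

    # Average accuracy: mean per-class recall, skipping empty classes.
    row_tot = [sum(row) for row in confusion]
    per_class = [confusion[i][i] / row_tot[i]
                 for i in range(n_classes) if row_tot[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement beyond what the marginals would give by chance.
    col_tot = [sum(confusion[i][j] for i in range(n_classes))
               for j in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (total ** 2)
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

On an imbalanced scene such as Indian Pines, OA is dominated by the large classes, so AA and kappa are reported alongside it to expose poor performance on small classes.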

Numerical Results.
In total, 1029 samples are chosen from the available labeled samples for training. Specifically, 15 samples are first chosen from each labeled class; the rest of the training samples are then randomly chosen from the samples not yet selected. Table 1 summarizes the number of training samples in each class. Finally, the remaining samples are used for testing (Figure 3(c)).
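The sampling scheme described above can be sketched as follows. The function name, the fixed random seed, and the handling of classes with fewer samples than the per-class quota are assumptions; the input is assumed to contain labeled pixels only.

```python
import random


def select_training_indices(labels, per_class, total, seed=0):
    """Pick `per_class` samples from every class, then fill up to `total`
    with random draws from the samples not yet chosen.

    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)

    chosen = []
    for lab in sorted(by_class):
        members = by_class[lab]
        # Small classes contribute all their samples if fewer than the quota.
        chosen.extend(rng.sample(members, min(per_class, len(members))))

    chosen_set = set(chosen)
    remaining = [i for i in range(len(labels)) if i not in chosen_set]
    extra = max(0, min(total - len(chosen), len(remaining)))
    chosen.extend(rng.sample(remaining, extra))

    train = sorted(chosen)
    test = sorted(set(range(len(labels))) - set(chosen))
    return train, test
```

For the setting in the text this would be called with `per_class=15` and `total=1029`; the per-class quota guarantees that even the smallest classes are represented in the training set.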
Two experiments were performed in this section. Note that the dimensionality $d$ in LapLDA is a fixed value; that is, $d \equiv c$. For this reason, in the first experiment, we reduce the dimensionality of HSI data to 16. The SVM classifier with a radial basis function (RBF) kernel has two critical parameters ($C$ and $\gamma$). Parameter $C$ controls the trade-off between the margin and the size of the slack variables. We use five-fold cross-validation to find the best $C$ and $\gamma$ (as suggested by [25]). For the other classifiers, the default parameters were employed. In the second experiment, we use all available data to generate the classification map and evaluate the performance of all methods on the whole HSI data. Table 2 summarizes the classification performance in the first experiment. These results show that the nonlinear SVM with the RBF kernel generally outperforms the other classifiers. Moreover, the proposed LGGSP algorithm achieves nearly the best classification performance with the different classifiers (except for the linear SVM classifier). MFA produces the worst classification performance among all methods. JGLDA and LapLDA perform approximately the same, producing barely acceptable results. LGSPP behaves a little better than JGLDA and is close to PCA. LPP and PCA are better than LapLDA and JGLDA but still worse than LGGSP. MFA is excluded from the next experiment because of its very poor results.
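The five-fold cross-validation search over the two RBF-SVM parameters can be sketched generically as below. The classifier itself is abstracted behind a hypothetical callback `train_and_score(train_idx, val_idx, C, gamma)` that trains on one index set and returns the validation accuracy on the other; the function names and the grids are assumptions, not the settings used in the experiments.

```python
import itertools
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def grid_search_cv(n, train_and_score, C_grid, gamma_grid, k=5, seed=0):
    """Return (best_mean_score, best_C, best_gamma) over the parameter grid."""
    folds = k_fold_indices(n, k, seed)
    best = None
    for C, gamma in itertools.product(C_grid, gamma_grid):
        scores = []
        for f in range(k):
            val_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(train_and_score(train_idx, val_idx, C, gamma))
        mean_score = sum(scores) / k
        if best is None or mean_score > best[0]:
            best = (mean_score, C, gamma)
    return best
```

Grids of exponentially spaced values, for example powers of two for both parameters, are a common choice, since the useful range of each parameter spans several orders of magnitude.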
The accuracy for each class is listed in Table 3. The embedded data are classified by the 5NN classifier and the RBF-SVM classifier. These results show that the proposed LGGSP algorithm outperforms the other dimension reduction methods, providing nearly the best classification performance in most cases. In the following experiment, we turn to visual inspection of the classification maps for all available samples. Owing to the limited length of this paper, 5NN and RBF-SVM were selected for demonstration, and the classified pseudocolor images are shown for visual comparison. The best classification result is used to generate the classified images.
For this purpose, the dimensionality is set to 15, except for LapLDA, whose only permitted embedding dimensionality is fixed at 16 by its construction. Figures 4 and 5 display the classification maps as pseudocolor images. It is clear that LapLDA and JGLDA both perform poorly, while PCA achieves results comparable to those of the proposed LGGSP with the RBF-SVM classifier. However, the proposed LGGSP still works better than the other methods. With the 5NN classifier, the proposed LGGSP again produces acceptable results and performs better than LapLDA, JGLDA, and LPP. When combined with the 5NN classifier, LPP exhibits a markedly different classification performance, which indicates the instability of LPP. Note that, in this scenario, LPP is very competitive with the state-of-the-art linear PCA. Moreover, the performance of LGSPP is almost the same with the two classifiers.
The class accuracy is summarized in Table 4. This table shows that different classes of HSI data achieve their highest accuracy with different feature extraction methods and different classifiers. For example, for class 1 (i.e., Alfalfa; see Table 1 for details), the highest class accuracy is obtained with LGGSP feature extraction plus a 5NN classifier. In contrast, the highest accuracy for class 3 (i.e., Corn-min-till) is obtained with PCA followed by the RBF-SVM classifier. The reason is that the distributions of different classes may differ: for some classes the structure may be simple, whereas for others it may be very complex. Many research papers [26] have reported that there is no "best" classifier that works perfectly on "every data set." Despite this, the proposed LGGSP does provide an effective way to extract "representative" features. The data reduced by the proposed LGGSP are more separable than those of PCA, LPP, LapLDA, LGSPP, JGLDA, MFA, and RP. The point of this paper is that including geometric information in the form of similarity and deviation can indeed further improve discriminative capability at essentially no additional cost under the framework of graph embedding [27].

Discussion and Conclusion
A novel LGGSP method was proposed in this paper for dimensionality reduction and classification. LGGSP integrates both local and global geometric structures: the local structure is captured by the similarity matrix and the variance matrix, while the global discriminant geometric structure is characterized by a weight matrix that encodes the $k$-nearest neighborhood relationship between different classes. By combining the three objective functions into a single objective function, the proposed LGGSP projection is obtained by solving a generalized eigenvalue decomposition. Since this method is built on the theoretical basis of graph embedding, we also supply a theoretical analysis of the Laplacian method.
The effectiveness and stability of LGGSP were demonstrated on both the synthetic data sets and the real hyperspectral image data set. The experimental results show that the proposed LGGSP algorithm outperforms the other methods in most cases and delivers acceptable classification performance. Moreover, the proposed LGGSP significantly outperforms the other dimension reduction methods when all available samples are used for testing.
The proposed LGGSP can also be used for other applications as a preprocessing step for object recognition and high dimensional data visualization.

Figure 1: Simple examples of dimensionality reduction on generated 2D data sets produced by LDA, PCA, LPP, MFA, LGSPP, JGLDA, and LGGSP, respectively. The samples from different classes are marked with different shapes and colors (blue cross × and red circle ∘), respectively.

Figure 2: Complex examples of dimensionality reduction on generated 2D data sets produced by LDA, PCA, LPP, MFA, LGSPP, JGLDA, and LGGSP, respectively. The samples from different classes are marked with different shapes and colors (blue cross × and red circle ∘), respectively.

Figure 3: RGB composition, classification map, and training set for AVIRIS Indian Pines 1992 scenario.

Table 1: Training set for Indian Pines.
* Numerical value in each row refers to number of total samples, number of training samples, and percentage of training samples in each class, respectively.