Optimizing Kernel PCA Using Sparse Representation-Based Classifier for MSTAR SAR Image Target Recognition

Different kernels induce different class discriminations owing to the different geometrical structures they give the data in the feature space. In this paper, a kernel optimization method that maximizes a measure of class separability in the empirical feature space, combined with a sparse representation-based classifier (SRC), is proposed to solve the problem of automatically choosing kernel functions and their parameters in kernel learning. The proposed method first adopts a so-called data-dependent kernel to generate an efficient kernel optimization algorithm. Then, a constrained optimization function solved by the general gradient method is created to find combination coefficients that vary with the input data. After that, the optimized kernel PCA (KOPCA) is obtained via the combination coefficients and used to extract features. Finally, the sparse representation-based classifier performs the pattern classification task. Experimental results on MSTAR SAR images show the effectiveness of the proposed method.


Introduction
Recently, kernel learning, or the kernel machine, has aroused broad interest in the pattern recognition and machine learning communities. For supervised kernel learning in classification problems, different kernel geometrical structures give different class discriminations. The geometrical structure of the mapped data in the feature space is entirely determined by the kernel matrix, so the separability of the data in the feature space can even worsen if an inappropriate kernel is chosen. The selection of the kernel thus greatly influences the performance of kernel learning, and optimizing the kernel can be regarded as an effective way to improve classification performance. Since merely tuning the parameters of a fixed kernel function cannot change the geometrical structure of the data in the feature space [1,2], parameter tuning alone cannot improve the performance of kernel learning. In this sense, Scholkopf et al. [3] proposed an empirical kernel map which maps the original input data space into a subspace of the empirical feature space. Since the training data have the same geometrical structure in both the empirical feature space and the feature space, and the former is easier to access than the latter, it is easier to study the adaptability of a kernel to the input data, and to improve it, in the empirical feature space. Cristianini et al. [4] and Lanckriet et al. [5] first proposed choosing the kernel by optimizing a measure of data separation in the feature space; they employ, respectively, the alignment and the margin as the measure of data separation to evaluate the adaptability of a kernel to the input data. Zhang et al. proposed several variants of KPCA [6,7] for fault diagnosis and nonlinear processes, and then applied improved kernel learning techniques to the statistical analysis of nonlinear fault detection [8], large-scale fault diagnosis [9], and the monitoring of dynamic processes [10].
Meanwhile, sparse representation has recently gained great interest in pattern recognition and computer vision. Wright et al. [11] presented a sparse representation-based classification method [12] and applied it to real-world face recognition problems [11,12]; the method proved effective and robust for face recognition under varying expression and illumination, as well as occlusion and disguise.
The paper is organized as follows. In Section 2, we first introduce the concepts of the data-dependent kernel and the empirical feature space; we then optimize the kernel in the empirical feature space by seeking the optimal combination coefficients of the data-dependent kernel based on the Fisher criterion. In Sections 3 and 4, the optimized kernel PCA (KOPCA) is applied to MSTAR SAR images to obtain dimensionality-reduced empirical features, on which a sparse representation-based classifier performs pattern classification. Finally, in Section 5, experiments are carried out on MSTAR SAR images to demonstrate the improvement in classification performance obtained by using the optimized kernel and the sparse representation-based classifier.

Kernel Optimization in the Empirical Feature Space

2.1. Data-Dependent Kernel. Since different kernels create different geometrical structures of the data in the feature space and lead to different class discriminations [13], no single kernel function can be adaptive to all datasets in kernel learning. A data-dependent kernel is therefore needed to deal with this problem.
In this paper, we employ the data-dependent kernel proposed by Amari and Wu [14] as the objective kernel function for kernel optimization. It should be noted that the data-dependent kernel is a conformal transformation of a basic kernel. Given a set of $n$ training samples $x_1, \ldots, x_n \in \mathbb{R}^d$, the data-dependent kernel is defined as

$$k(x, y) = q(x)\,q(y)\,k_0(x, y), \quad x, y \in \mathbb{R}^d, \qquad (1)$$

where $k_0(x, y)$ is a basic kernel such as a polynomial kernel or a Gaussian kernel, and $q(\cdot)$ is a positive real-valued factor function; different choices of $q(\cdot)$ give the data-dependent kernel different properties. Amari and Wu [14] expand the spatial resolution in the margin of an SVM by taking

$$q(x) = \alpha_0 + \sum_{i=1}^{m} \alpha_i k_1(x, a_i), \quad k_1(x, a_i) = \exp(-\delta \|x - a_i\|^2), \qquad (2)$$

where, in their setting, $a_i \in \mathbb{R}^d$ is the $i$th support vector. The set $\{a_i \mid i = 1, \ldots, m\}$, called the "empirical cores," can be determined according to the distribution of the training data; $\delta$ is a free parameter, and the positive combination coefficients $\alpha_i$ ($i = 0, 1, \ldots, m$) act as contribution weights corresponding to the $a_i$. In matrix form, with $Q = \operatorname{diag}(q(x_1), \ldots, q(x_n))$, the kernel matrices of $k$ and $k_0$ on the training set satisfy

$$K = Q K_0 Q. \qquad (3)$$

Meanwhile, the data-dependent kernel is a valid kernel function, as it satisfies the Mercer condition [3].
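As a concrete illustration, the data-dependent kernel of formulas (1)–(3) can be sketched as follows. This is a minimal NumPy sketch under our own naming: `gaussian_basic_kernel`, `factor_q`, and the parameter defaults are illustrative choices, not part of the original method.

```python
import numpy as np

def gaussian_basic_kernel(X, Y, gamma=1e-5):
    """Basic kernel k0(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def factor_q(X, cores, alpha, delta=16.0):
    """Factor function q(x) = alpha_0 + sum_i alpha_i exp(-delta ||x - a_i||^2), formula (2)."""
    k1 = np.exp(-delta * ((X[:, None, :] - cores[None, :, :]) ** 2).sum(axis=-1))
    return alpha[0] + k1 @ alpha[1:]

def data_dependent_kernel(X, Y, cores, alpha, delta=16.0, gamma=1e-5):
    """Data-dependent kernel k(x, y) = q(x) q(y) k0(x, y), formula (1)."""
    qx = factor_q(X, cores, alpha, delta)
    qy = factor_q(Y, cores, alpha, delta)
    return qx[:, None] * qy[None, :] * gaussian_basic_kernel(X, Y, gamma)
```

On the training set this reproduces the matrix relation $K = Q K_0 Q$ of formula (3).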
We can prove that the training data have the same geometric structure in both the empirical feature space and the feature space. Let $\Phi^e(x) = \Lambda^{-1/2} P^{\mathrm{T}} [k(x, x_1), \ldots, k(x, x_n)]^{\mathrm{T}}$, where $K = P \Lambda P^{\mathrm{T}}$; then the dot product matrix of $\{\Phi^e(x_i)\}$ in the empirical feature space can be calculated as

$$[\Phi^e(x_i) \cdot \Phi^e(x_j)]_{n \times n} = K P \Lambda^{-1} P^{\mathrm{T}} K = P \Lambda P^{\mathrm{T}} = K.$$

Notice that $K = [\Phi(x_i) \cdot \Phi(x_j)]_{n \times n}$ is exactly the dot product matrix of $\{\Phi(x_i)\}$ in the feature space; therefore, the empirical feature space preserves the geometric structure of the feature space.
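This geometry-preserving property is easy to check numerically. The sketch below is our own construction, assuming only an eigendecomposition of a positive semidefinite kernel matrix; it builds the empirical images and lets one verify that their Gram matrix equals $K$.

```python
import numpy as np

def empirical_images(K, tol=1e-10):
    """Stack the empirical images Phi_e(x_i) = Lambda^{-1/2} P^T K[:, i] as rows.

    K = P Lambda P^T is the (PSD) kernel matrix; only the r positive
    eigenvalues are kept, so the result is n x r.
    """
    lam, P = np.linalg.eigh(K)
    keep = lam > tol
    lam, P = lam[keep], P[:, keep]
    return K @ P / np.sqrt(lam)   # row i is Phi_e(x_i)^T
```

The Gram matrix of these images reproduces $K$ exactly, which is precisely the statement proved above.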

2.3. Fisher Criterion Based Kernel Optimization.
As shown in Section 2.2, the training data have the same geometric structure in both the empirical feature space and the feature space, and the former is easier to access, so it is preferable to measure class separability in the empirical feature space. In this paper, we choose the well-known Fisher criterion to measure class separability:

$$J = \frac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)}, \qquad (4)$$

where $S_b$ is the between-class scatter matrix, $S_w$ is the within-class scatter matrix, and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. $J$ measures the class separability in the feature space and is independent of the projections in the common projection subspace, so it is a satisfactory objective for kernel optimization.
Up to now, the kernel optimization problem has been transformed into maximizing the Fisher scalar $J$. Let $n_i$ ($i = 1, 2, \ldots, c$) be the number of training samples in class $i$, and let $n$ denote the total number of training samples. Moreover, let $m_i$ and $m_0$ denote the center of the training samples of class $i$ ($i = 1, 2, \ldots, c$) and the center of all training samples in the empirical feature space, respectively, and let $y_i = \Phi^e(x_i)$ ($i = 1, 2, \ldots, n$) be the images of the training samples in the empirical feature space. Write the kernel matrix in block form as

$$K = \begin{pmatrix} K_{11} & \cdots & K_{1c} \\ \vdots & \ddots & \vdots \\ K_{c1} & \cdots & K_{cc} \end{pmatrix},$$

where $K_{pq}$ ($p, q = 1, 2, \ldots, c$) is the $n_p \times n_q$ submatrix of $K$ formed by the rows of class $p$ and the columns of class $q$. Let the matrices

$$B = \operatorname{diag}\!\left(\tfrac{1}{n_1} K_{11}, \ldots, \tfrac{1}{n_c} K_{cc}\right) - \tfrac{1}{n} K, \qquad W = \operatorname{diag}(k_{11}, \ldots, k_{nn}) - \operatorname{diag}\!\left(\tfrac{1}{n_1} K_{11}, \ldots, \tfrac{1}{n_c} K_{cc}\right)$$

be called the "between-class" and "within-class" kernel scatter matrices, respectively. We likewise use $B_0$ and $W_0$ to denote the kernel scatter matrices corresponding to the basic kernel $k_0$. Now we establish the relation between the Fisher scalar $J$ and the proposed kernel scatter matrices. Let $1_n$ be the $n$-dim vector whose elements are all equal to 1; then

$$J = \frac{1_n^{\mathrm{T}} B\, 1_n}{1_n^{\mathrm{T}} W\, 1_n}.$$

The proof is given in the appendix.
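The Fisher scalar $J = (1_n^{\mathrm{T}} B 1_n)/(1_n^{\mathrm{T}} W 1_n)$ can be computed directly from the kernel matrix. The sketch below is our reading of the between-/within-class kernel scatter matrices; the helper names and the use of `scipy.linalg.block_diag` are ours.

```python
import numpy as np
from scipy.linalg import block_diag

def kernel_scatter_matrices(K, class_sizes):
    """Between-class and within-class kernel scatter matrices B and W."""
    n = K.shape[0]
    blocks, start = [], 0
    for ni in class_sizes:
        blocks.append(K[start:start + ni, start:start + ni] / ni)
        start += ni
    D = block_diag(*blocks)               # diag(K_11/n_1, ..., K_cc/n_c)
    B = D - K / n
    W = np.diag(np.diag(K)) - D
    return B, W

def fisher_scalar(K, class_sizes):
    """J = (1^T B 1) / (1^T W 1), the class-separability measure of formula (4)."""
    B, W = kernel_scatter_matrices(K, class_sizes)
    ones = np.ones(K.shape[0])
    return (ones @ B @ ones) / (ones @ W @ ones)
```

With a linear kernel $K = X X^{\mathrm{T}}$, this reduces to the familiar ratio of between-class scatter around the global mean to within-class scatter around the class means.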
To maximize $J$, we adopt the general gradient method on formula (4). Define $J_1 = 1_n^{\mathrm{T}} B 1_n$ and $J_2 = 1_n^{\mathrm{T}} W 1_n$, so that $J = J_1 / J_2$. With $q = [q(x_1), \ldots, q(x_n)]^{\mathrm{T}} = K_1 \alpha$, where $K_1$ is the $n \times (m+1)$ matrix whose $i$th row is $[1, k_1(x_i, a_1), \ldots, k_1(x_i, a_m)]$ and $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_m]^{\mathrm{T}}$, we have $J_1 = q^{\mathrm{T}} B_0 q = \alpha^{\mathrm{T}} M_0 \alpha$ and $J_2 = q^{\mathrm{T}} W_0 q = \alpha^{\mathrm{T}} N_0 \alpha$, where $M_0 = K_1^{\mathrm{T}} B_0 K_1$ and $N_0 = K_1^{\mathrm{T}} W_0 K_1$. Thus

$$\frac{\partial J}{\partial \alpha} = \frac{2}{\alpha^{\mathrm{T}} N_0 \alpha} \left( M_0 - J N_0 \right) \alpha.$$

Considering the fact that it is almost impossible to make $N_0$ invertible because of the limited number of training samples in real-world applications, the general gradient method is adopted to obtain an approximate value of the optimal $\alpha$. The updating equation to maximize $J$ is

$$\alpha^{(t+1)} = \alpha^{(t)} + \eta(t) \left( M_0 - J N_0 \right) \alpha^{(t)}. \qquad (17)$$

To guarantee the convergence of formula (17), the learning rate $\eta$ is defined as a function of the iteration number, that is,

$$\eta(t) = \eta_0 \left( 1 - \frac{t}{T} \right),$$

where $\eta_0$ is a predefined initial value, $T$ denotes the total number of iterations, and $t$ represents the current iteration number.
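A minimal sketch of this gradient update follows; the vectorized rendering, the variable names `M0`/`N0`, and the fixed iteration loop are our own framing, not the authors' code.

```python
import numpy as np

def grad_J(alpha, M0, N0):
    """Gradient of J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha)."""
    J = (alpha @ M0 @ alpha) / (alpha @ N0 @ alpha)
    return 2.0 / (alpha @ N0 @ alpha) * ((M0 - J * N0) @ alpha), J

def optimize_alpha(K1, B0, W0, alpha0, eta0=0.1, T=200):
    """Gradient ascent on the Fisher scalar with the decaying learning
    rate eta(t) = eta0 * (1 - t / T)."""
    M0 = K1.T @ B0 @ K1       # numerator matrix
    N0 = K1.T @ W0 @ K1       # denominator matrix
    alpha = np.asarray(alpha0, dtype=float).copy()
    for t in range(T):
        g, _ = grad_J(alpha, M0, N0)
        alpha = alpha + eta0 * (1.0 - t / T) * g
    return alpha
```

Note that $J$ is scale-invariant in $\alpha$, so only the direction of $\alpha$ matters at convergence.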
After calculating the combination coefficient vector $\alpha$, we obtain $q$ as $q = K_1 \alpha$, and thus the optimized data-dependent kernel $k$ is easy to obtain.

Optimizing Kernel PCA (KOPCA)
In this section, we employ the optimized kernel function described above to construct the optimizing kernel PCA and extract features in the empirical feature space.
Given a set of $n$ training samples $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and the empirical feature mapping $\Phi^e$, let the input data space be mapped into the empirical feature space $\mathbb{R}^r$: $\Phi^e : x \mapsto \Phi^e(x)$. The covariance operator on the empirical feature space can be constructed as

$$C^{\Phi^e} = \frac{1}{n} \sum_{i=1}^{n} \Phi^e(x_i)\, \Phi^e(x_i)^{\mathrm{T}}.$$

It is easy to prove that all nonzero eigenvalues of $C^{\Phi^e}$ are positive and that every eigenvector $v$ of $C^{\Phi^e}$ can be linearly expanded by the images of the training samples:

$$v = \sum_{i=1}^{n} \beta_i\, \Phi^e(x_i).$$

To obtain these expansion coefficients, denote $Y = [y_1, y_2, \ldots, y_n]$ with $y_i = \Phi^e(x_i)$, and form the $n \times n$ Gram matrix $K = Y^{\mathrm{T}} Y$, whose elements are determined by the optimized kernel, that is, $K_{ij} = y_i^{\mathrm{T}} y_j = (\Phi^e(x_i), \Phi^e(x_j)) = k(x_i, x_j)$. Note that this kernel matrix $K = [k(x_i, x_j)]_{n \times n}$ is the same as the one defined in formula (7).
Centralize $K$ by

$$\tilde{K} = \left( I - \frac{1}{n} 1_n 1_n^{\mathrm{T}} \right) K \left( I - \frac{1}{n} 1_n 1_n^{\mathrm{T}} \right),$$

where $1_n$ is defined as above, that is, the $n$-dim vector whose elements are all equal to 1.
Let the eigenvectors of $\tilde{K}$ corresponding to the $l$ largest positive eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_l$ be $\gamma_1, \gamma_2, \ldots, \gamma_l$. After projecting a mapped sample $y = \Phi^e(x)$ onto the eigenvectors $V = [v_1, v_2, \ldots, v_l]$, we obtain the optimizing kernel PCA transformed feature vector $z$ by

$$z = V^{\mathrm{T}} \Phi^e(x),$$

where the $j$th optimizing kernel PCA component is

$$z_j = \frac{1}{\sqrt{\lambda_j}} \sum_{i=1}^{n} \gamma_{ji}\, \tilde{k}(x, x_i), \qquad j = 1, 2, \ldots, l.$$

Up to now, the essence of optimizing kernel PCA has been revealed: we first maximize a measure of class separability in the empirical feature space by virtue of the Fisher criterion to form the required data-dependent kernel, and then use the optimized kernel PCA to extract features in the empirical feature space.
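The whole KOPCA feature-extraction step (centering, eigendecomposition, projection) can be sketched as follows; the function and its argument layout are our own framing, with any valid kernel matrix standing in for the optimized one.

```python
import numpy as np

def kopca_features(K, K_test, l):
    """Project samples onto the l leading centered kernel principal components.

    K      -- n x n kernel matrix on the training set (here: the optimized kernel),
    K_test -- m x n kernel matrix between test and training samples,
    l      -- number of components to keep.
    """
    n = K.shape[0]
    Jn = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc = Jn @ K @ Jn                               # centered training Gram
    lam, G = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:l]              # l largest eigenvalues
    lam, G = lam[order], G[:, order]
    # center the test kernel consistently with the training centering
    Ktc = (K_test - np.ones((K_test.shape[0], n)) @ K / n) @ Jn
    return Ktc @ G / np.sqrt(lam)                  # z_j = gamma_j^T k~(x) / sqrt(lam_j)
```

On the training set itself, the resulting components are uncorrelated with variances equal to the eigenvalues, as expected of a PCA.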
Then, the linear representation of $y$ can be rewritten in terms of the whole training matrix $X$ as

$$y = X \mu_0,$$

where $\mu_0 = [0, \ldots, 0, \mu_{i,1}, \mu_{i,2}, \ldots, \mu_{i,n_i}, 0, \ldots, 0]^{\mathrm{T}} \in \mathbb{R}^n$ is a coefficient vector whose entries are zero except those associated with the $i$th class.
Here, we should take the numbers of rows and columns of $X$ into consideration. If the row number $d$ is larger than the column number $n$, the system of equations $y = X\mu$ is overdetermined, and the correct $\mu$ can usually be found as its unique solution. Nevertheless, this is not what we need, since sparse representation involves an underdetermined system of linear equations $y = X\mu$ with $d < n$. It is motivated by the following fact: given a test sample $y$, its representation is naturally sparse if the training sample size (the column number) is large enough, and the sparser the coefficient vector $\mu$ is, the easier it is to accurately recover the identity of the test sample $y$ [12].
Consequently, the dimension of the feature vector (the row number) must be smaller than the training sample size (the column number). Since we have already obtained dimensionality-reduced empirical features in Section 2 before applying sparse representation, this requirement is met.
The above discussion motivates us to seek the sparsest solution by solving the following optimization problem:

$$\hat{\mu}_0 = \arg\min_{\mu} \|\mu\|_0 \quad \text{subject to} \quad X\mu = y, \qquad (27)$$

where $\|\cdot\|_0$ denotes the $\ell_0$-norm, which counts the number of nonzero entries in a vector. However, solving the $\ell_0$ optimization problem in formula (27) is NP-hard and time-consuming. Recent research on sparse representation and compressed sensing [15,16] proves that if the solution $\hat{\mu}_0$ is sparse enough, solving the $\ell_0$ optimization problem is equivalent to solving the $\ell_1$ optimization problem

$$\hat{\mu}_1 = \arg\min_{\mu} \|\mu\|_1 \quad \text{subject to} \quad X\mu = y.$$

This problem can be solved in polynomial time by standard linear programming algorithms [17].
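The $\ell_1$ problem becomes a linear program after the standard split $\mu = u - v$ with $u, v \geq 0$. The sketch below uses SciPy's `linprog` as the LP solver; this is our own implementation choice for illustration, whereas the paper's experiments use a BPDN/homotopy solver [17–19].

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||mu||_1 subject to X mu = y via the LP
    min 1^T (u + v)  s.t.  [X, -X][u; v] = y,  u, v >= 0,  mu = u - v."""
    d, n = X.shape
    res = linprog(np.ones(2 * n), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]
```

For a sufficiently sparse planted solution and a generic underdetermined $X$, the minimizer recovers the sparse coefficient vector, in line with the compressed-sensing results cited above.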
After obtaining the sparsest solution $\hat{\mu}_1$, we form the sparse representation-based classifier (SRC) as follows. For each class $i$ ($i = 1, 2, \ldots, c$), let $\delta_i : \mathbb{R}^n \to \mathbb{R}^n$ be the function that selects the coefficients associated with the $i$th class; that is, $\delta_i(\mu)$ is a vector whose only nonzero entries are the entries of $\mu$ associated with class $i$. Using the coefficients associated with the $i$th class, one can reconstruct the given test sample $y$ as $\hat{y}_i = X \delta_i(\hat{\mu}_1)$; $\hat{y}_i$ is often called the prototype of class $i$ with respect to the sample $y$. The residual between $y$ and its class-$i$ prototype is defined as

$$r_i(y) = \| y - X \delta_i(\hat{\mu}_1) \|_2.$$

The SRC decision rule minimizes the residual: if $r_j(y) = \min_i r_i(y)$, then $y$ is assigned to class $j$. It should be explained that our implementation minimizes the $\ell_1$-norm via the basis pursuit denoising (BPDN) algorithm for linear programming based on [17–19].
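Putting the pieces together, a compact SRC sketch might look like the following; this is our own code, and the $\ell_1$ step here uses a plain linear program rather than the homotopy BPDN package used in the experiments.

```python
import numpy as np
from scipy.optimize import linprog

def src_classify(X, labels, y):
    """Assign y to the class whose selected coefficients reconstruct it best:
    r_i(y) = ||y - X delta_i(mu_hat)||_2, predict argmin_i r_i(y)."""
    d, n = X.shape
    res = linprog(np.ones(2 * n), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method="highs")
    mu = res.x[:n] - res.x[n:]                     # l1-minimal coefficient vector
    labels = np.asarray(labels)
    residuals = {}
    for cls in np.unique(labels):
        delta = np.where(labels == cls, mu, 0.0)   # delta_i keeps class-i entries
        residuals[cls] = np.linalg.norm(y - X @ delta)
    return min(residuals, key=residuals.get), residuals
```

When the test sample lies in the span of one class's training columns, the residual for that class is near zero and the rule assigns it correctly.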

Experimental Results
In this section, experiments are designed to evaluate the performance of the proposed algorithm. The first experiment shows that class separability in the feature space can be worse than in the input space in some cases and demonstrates that the proposed kernel optimization algorithm can enlarge class separability. The second experiment is carried out on MSTAR SAR images, comparing KOPCA with conventional KPCA for feature extraction and using the nearest neighbor (NN) classifier for pattern classification. The sparse representation-based classifier (SRC) is then applied to verify its superiority and effectiveness for pattern classification compared with other classifiers. To verify the sparsity obtained via BPDN, we randomly choose a test sample and show its representation coefficients on the training set.

Kernel Optimization on Synthetic Gaussian Distributed Dataset

Before concentrating on optimizing the kernel in the empirical feature space, we use two simple synthetic Gaussian-distributed datasets to get intuition about the embedding of data from the feature space into the empirical feature space. More information about data embedding can be found in [20]. In this experiment, the polynomial kernel $k_0(x, y) = (x \cdot y)^3$ and the Gaussian kernel $k_0(x, y) = \exp(-\|x - y\|^2 / \sigma)$ with $\sigma = 1.0 \times 10^5$ are employed; both basic kernels are of the forms mentioned in formula (1). It is seen from Figures 1(b) and 1(c) that class separability is worse in the feature space than in the input space for both the polynomial and Gaussian kernels. Therefore, it is important to conduct kernel optimization. We will carry out an experiment to demonstrate that applying the kernel optimization algorithm of Section 2.3 surely enlarges the measure of class separability.
In this experiment, we set the parameter $\delta$ of the function $k_1(\cdot)$ in formula (2) to $\delta = 16$ for the given polynomial kernel $k_0(x, y) = (x \cdot y)^3$ and the given Gaussian kernel $k_0(x, y) = \exp(-1.0 \times 10^{-5} \|x - y\|^2)$. One-third of the synthetic data are randomly chosen to form the "empirical core" set $\{a_i\}$. The initial learning rate $\eta_0$ and the total iteration number $T$ are set to 0.1 and 200, respectively, for both the polynomial kernel and the Gaussian kernel. Figures 2(a) and 2(b) show the projections of the data in the empirical feature space onto the first two significant dimensions, corresponding to the two largest eigenvalues of the kernel matrix, when the polynomial kernel and the Gaussian kernel are used as mentioned above. It is seen from Figure 2 that the proposed kernel optimization algorithm substantially improves the class separability of the data in the empirical feature space and, hence, in the feature space.

KOPCA Criterion on MSTAR SAR Dataset

This experiment is conducted on the MSTAR SAR dataset provided by the Defense Advanced Research Projects Agency and the Air Force Research Laboratory (DARPA/AFRL). The data is the MSTAR public release subset, collected to initiate the Moving and Stationary Target Acquisition and Recognition (MSTAR) project, which has provided a unique opportunity to promote and assess progress in SAR ATR algorithm development.
Since the characteristics of a SAR image change greatly with aspect angle, a great many images of each target class were collected, with poses lying between 0 and 360 degrees.
The vehicles in the MSTAR SAR dataset comprise the BMP2 (sn-c21, sn-9563, sn-9566) tracked armored personnel carrier, the BTR70 (sn-c71) wheeled armored personnel carrier, and the T72 (sn-132, sn-812, sn-s7) main battle tank. Different serial numbers within one target class mean that the vehicles are variants with small differences in configuration and articulation under extended operating conditions (EOC) [21]. The scattering centers of SAR images therefore change so much that recognition ability decreases greatly; in this sense, recognizing variants in SAR images is difficult.
In this experiment, we select images of BMP2 sn-c21, BTR70 sn-c71, and T72 sn-132 at a 17° depression angle as the training samples (the numbers per class are 233, 233, and 233), and images of the variants BMP2 sn-9563, BMP2 sn-9566, T72 sn-812, and T72 sn-s7 as the testing samples (the numbers are 195, 196, 195, and 191, respectively). The testing targets have small configuration differences from the training targets. It should be explained that, in this paper, KOPCA extracts features of all MSTAR images with different aspect angles directly, without forming different aspect windows; before recognition, images are chipped to 48 × 48 pixels. We set the parameter $\delta$ of the function $k_1(\cdot)$ in formula (2) to $\delta = 16$. The kernel functions are chosen as the polynomial kernel $k_0(x, y) = (x \cdot y)^d$ with $d$ from 1 to 10 and the Gaussian kernel $k_0(x, y) = \exp(-\|x - y\|^2 / \sigma)$ with $\sigma$ from 11 to 110. One-third of the training data are chosen to form the "empirical core" set $\{a_i\}$. The initial learning rate $\eta_0$ and the total iteration number $T$ are set to 0.1 and 200, respectively, for both the polynomial kernel and the Gaussian kernel. Moreover, the feature dimension and the empirical feature dimension are set to 100 for both the KPCA and KOPCA criteria. To reflect the performance of the optimized kernel in a real-world application, the simplest nearest neighbor (NN) classifier is selected.
Then, a test sample $z$ is assigned to class $i$ if $d(z, z_j) = \min_k d(z, z_k)$ and $z_j$ belongs to class $i$.
Tables 1 and 2 show the recognition rates of KPCA and KOPCA with the polynomial kernel and the Gaussian kernel, respectively, using the nearest neighbor (NN) classifier. From them, we can see that the proposed data-dependent kernel optimization algorithm with the KPCA criterion increases the recognition rate by 10%–15% for both the polynomial and Gaussian kernels compared with conventional KPCA throughout. The class separability in the empirical feature space is improved, and thus the recognition rate is improved.

Next, we conduct another experiment on the sparse representation-based classifier (SRC). In the first part, we validate its effectiveness for pattern classification after extracting features via KOPCA (KOPCA:SRC), compared with other classifiers such as the k-nearest neighbor (KNN) classifier, the support vector classifier (SVC), and the linear regression classifier (LRC) [22]. In the second part, to verify the sparsity of SRC via BPDN, we randomly choose a testing sample and show its representation coefficients on the training set. Tables 3 and 4 show the recognition rates of KOPCA with the KNN, SVC, LRC, and SRC classifiers for the polynomial kernel and the Gaussian kernel, respectively.
From Table 3, we see that the sparse representation-based classifier (SRC) outperforms the other classifiers regardless of the order of the polynomial kernel. Throughout the experiment, the recognition rate of KOPCA:SRC stays above 95% while the others remain below 95%; only KOPCA:LRC comes close to KOPCA:SRC, and the rest are lower by 10%, even 20%. Meanwhile, there is an interesting phenomenon: although the order of the polynomial kernel varies from 1 to 10, the recognition rates of all the algorithms vary by less than 10%. However, this does not hold for the Gaussian kernel. From Table 4, we learn that KOPCA:SRC is superior and effective in that (1) when the parameter $\sigma$ of the Gaussian kernel is between 11 and 15, the recognition rate of KOPCA:SRC is slightly lower than that of KOPCA:LRC (the difference is no more than 2%) but better than those of KOPCA:KNN and KOPCA:SVC; (2) although KOPCA:LRC performs well, it has limitations when $\sigma$ is equal to or greater than 16, since its recognition rate degrades quickly and significantly, while KOPCA:SRC remains stable and superior; and (3) the performance of the other algorithms decreases rapidly when $\sigma$ is greater than 18, while KOPCA:SRC degrades far more slowly.

The basis pursuit denoising (BPDN) method is used to solve the $\ell_1$-norm minimization problem in our experiment. Here, we randomly choose a testing sample from the third class. Intuitively, most nonzero representation coefficients for this sample should lie in the range from 301 to 450 (each class contributes 150 training samples here, so the indices for the third class run from 301 to 450), and our result confirms this. From Figure 3, we see that the representation coefficients are sparse with respect to the basis, that is, the training set, and the nonzero coefficients are mostly located in the range from 301 to 450. Finally, the BPDN algorithm is fast enough for our SAR image recognition and classification task. The BPDN software package that we use is from the "L1 Homotopy" homepage: http://users.ece.gatech.edu/∼sasif/homotopy/. The running time is on the order of seconds, the same level as LRC and SVC.
From the discussion above, we conclude that optimizing kernel PCA can indeed enhance class separability in the empirical feature space compared with conventional KPCA and thus improve the recognition rate. Meanwhile, the sparse representation-based classifier is robust and highly efficient for classification compared with other strong classifiers.

Conclusion
In this paper, we proposed an efficient pattern classification method named kernel-optimized PCA with a sparse representation classifier (KOPCA:SRC). From the experiments, we draw the following conclusions.
(1) We have adopted the empirical feature space, in which the data are embedded in a way that preserves the geometrical structure of the data in the feature space.
(2) We have presented a general form of the data-dependent kernel and derived an effective algorithm for optimizing the kernel by maximizing the class separability of the dataset in the empirical feature space via the Fisher criterion.

Appendix

From formula (3) we easily get $B = Q B_0 Q$ and $W = Q W_0 Q$; simultaneously, $1_n^{\mathrm{T}} Q = q^{\mathrm{T}}$ and $Q 1_n = q$, where $Q = \operatorname{diag}(q(x_1), \ldots, q(x_n))$. Hence

$$J = \frac{1_n^{\mathrm{T}} B\, 1_n}{1_n^{\mathrm{T}} W\, 1_n} = \frac{q^{\mathrm{T}} B_0\, q}{q^{\mathrm{T}} W_0\, q}. \qquad (A.5)$$

Figure 1: 2-dim dataset and its projections in the feature space onto the first two significant dimensions. (a) Two classes of data samples with two Gaussian distributions. (b) 2-dim projection in the feature space for the polynomial kernel with $d = 3$. (c) 2-dim projection in the feature space for the Gaussian kernel with $\sigma = 1.0 \times 10^5$.

Figure 2: Projections of the data onto the first two significant dimensions of the empirical feature space after kernel optimization, for (a) the polynomial kernel and (b) the Gaussian kernel.

Figure 3: Representation coefficients of a randomly chosen testing sample from the third class.
2.2. Empirical Feature Space. Different kernels cause various class discriminations owing to the different geometrical structures they give the data in the feature space, and computing in the feature space is often inconvenient. Hence, the concept of the empirical feature space is introduced. Let $\{x_i \mid i = 1, 2, \ldots, n\}$ be a $d$-dim training dataset and $X = [x_1, x_2, \ldots, x_n]$. $K = [k(x_i, x_j)]_{n \times n}$ denotes the kernel matrix with rank $r$. Since $K$ is a symmetric positive semidefinite matrix, it can be decomposed as $K = P \Lambda P^{\mathrm{T}}$, where $P$ is the $n \times r$ matrix of orthonormal eigenvectors and $\Lambda$ is the diagonal matrix of the $r$ positive eigenvalues.

For the sparse representation setting, let $X_i$ ($i = 1, 2, \ldots, c$) be the matrix formed by the training samples of the $i$th class, that is, $X_i = [x_1, x_2, \ldots, x_{n_i}] \in \mathbb{R}^{d \times n_i}$, and define a new matrix $X$ for the total training set with $c$ classes as the concatenation of the training samples: $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{d \times n}$. Given a test sample $y$ from the $i$th class, $y$ can then be approximately represented by the linear span of the training samples of the corresponding class.

Table 1: Recognition rates of KPCA and KOPCA with the polynomial kernel using the NN classifier.

Table 2: Recognition rates of KPCA and KOPCA with the Gaussian kernel using the NN classifier.

Table 3: Recognition rates of KOPCA with the polynomial kernel using different classifiers.

Table 4: Recognition rates of KOPCA with the Gaussian kernel using different classifiers.