A New Feature Selection Method for Hyperspectral Image Classification Based on Simulated Annealing Genetic Algorithm and Choquet Fuzzy Integral

Hyperspectral remote sensing technology is a rapidly developing integrated technology that is widely used in numerous areas. Rich spectral information from hyperspectral images can aid in the classification and recognition of ground objects. However, the high dimensionality of hyperspectral images causes redundancy in information, so the dimensionality of hyperspectral data must be reduced. This paper proposes a hybrid feature selection strategy based on the simulated annealing genetic algorithm (SAGA) and the Choquet fuzzy integral (CFI). The band selection method starts from subspace decomposition and combines the simulated annealing algorithm with the genetic algorithm in choosing the cross-over and mutation probabilities, as well as the individuals to mutate. The selected bands are then further refined by CFI. Experimental results show that the proposed method can achieve higher classification accuracy than traditional methods.


Introduction
Hyperspectral remote sensors provide measurements of the Earth's surface with very high spectral resolution, usually resulting in hundreds of channels. Unlike multispectral sensors, the high spectral resolution renders hyperspectral remote sensors very powerful in applications requiring the identification of subtle differences in ground covers (e.g., material quantification and target detection). On the other hand, the large-dimensional data spaces generated by these sensors introduce challenging methodological problems. In the context of supervised classification, the most important methodological issue raised by these sensors is the so-called curse of dimensionality (also known as the Hughes effect), which occurs when the numbers of features and of available training samples are unbalanced [1].
Meanwhile, hyperspectral remote sensing images have nonlinear properties. These nonlinear properties originate from multiple scattering between photons and ground targets, from within-pixel spectral mixing, and from scene heterogeneity. In addition, given that the pixel size in most remote sensing systems is sufficiently large to include different types of land cover, classification errors arise and produce unreliable classification results. In this case, traditional classifiers may fail completely.
In the remote sensing literature, numerous methods have been developed to solve the hyperspectral data classification problem. A successful approach to hyperspectral data classification is based on the support vector machine (SVM). SVM separates two classes by identifying the optimal separating hyperplane that maximizes the margin between the closest training samples and the separating hyperplane. Data samples located on the margin boundary are referred to as support vectors and are used to create a decision surface. The properties of SVM for both full-dimensional and reduced-dimensional data have been investigated, and multi-class SVM strategies have been considered in [2]. Hyperspectral image classification using different kernel-based approaches has been analyzed and compared in [3], where SVM was found to be more useful than other kernel-based methods. SVM classification performance is compared with that of other well-known neural approaches in [4], which showed that SVM provides simplicity, robustness, and increased classification accuracy compared with neural networks. In addition, some improved SVM methods have also been successfully used in hyperspectral image classification. A method called contextual SVM using Hilbert space embedding showed significant improvement over other methods on several hyperspectral images in [5]. A semisupervised method for addressing a domain adaptation problem based on multiple-kernel SVMs in the classification of hyperspectral data was presented in [6]. Thus, SVM is very suitable for hyperspectral image classification. However, dimension reduction is not sufficiently considered in SVM.
Commonly used dimension reduction methods fall into two categories, namely, feature selection and feature extraction. Since every band of hyperspectral data has its own corresponding image, the feature extraction approach maps a high-dimensional feature space to a low-dimensional space via a linear or nonlinear transformation. However, the original physical interpretation of the image cannot be retained. Thus, feature extraction approaches are unsuitable for the dimension reduction of hyperspectral images. Given that the spectral distance between adjacent bands in hyperspectral data is only 10 nm and that the correlation between them is extremely high [7], considerable redundancy is observed, which should be largely reduced by feature selection or band selection methods to improve classification efficiency and accuracy. A semisupervised feature-selection technique for hyperspectral image classification was developed in [8]. A method for unsupervised band selection by transforming the hyperspectral data into complex networks was presented in [9]. In this paper, a new dimension reduction method is proposed that combines the simulated annealing genetic algorithm (SAGA) with the Choquet fuzzy integral (CFI).
A new genetic algorithm (GA) based on a population and a temperature ladder, the so-called SAGA, was recently proposed to draw samples from a distribution defined on a space of finite binary sequences. Feature selection strategies for hyperspectral images based on GA and SVM were proposed in [10, 11]. GA-based feature selection and local Fisher's discriminant analysis-based feature projection are performed for effective dimensionality reduction in [12]. In contrast, the SAGA method works by simulating a parallel population of samples with different temperatures. The population is updated via selection, mutation, cross-over, and exchange operations that are highly similar to those of GA. SAGA has the learning capability of GA, as well as the fast-mixing capability of parallel tempering (simulated tempering). In most cases, only the classification accuracy is used as the fitness function, and the internal relations between bands and classes are not taken into account. To address this problem, a correction method based on CFI is proposed. The CFI does not assume the independence of one element from another and, based on any fuzzy measure, it is employed to perform the overall evaluation of an input pattern [13]. Moreover, the fuzzy measure defined on an attribute is used as the relative degree of importance of this attribute, such that the connection weights can be interpreted as the fuzzy measure values or the degrees of importance of the respective input variables. The band selection method of this paper, which is based on SAGA and CFI (SAGA-CFI), can not only improve the accuracy of classification but also effectively reduce the uncertainty of the information, which further improves the accuracy.
Since hyperspectral imagery contains hundreds of bands, the direct search space for SAGA and CFI on the original band space becomes extremely large. An adaptive subspace decomposition (ASD) method for hyperspectral data dimensionality reduction was proposed in [14]. To avoid the impact of enormous data sets on traditional statistical classification techniques, the ASD scheme is used. Thus, the differences between global and local statistical characteristics are fully considered, and the problem presented by a limited number of training samples is alleviated.
In this paper, we use SAGA and CFI in every subspace to choose suitable bands based on ASD, which differs from the previous work [5, 6, 8-12] in three aspects. First of all, ASD, rather than mutual information, is employed to divide the bands into disjoint subspaces. Although mutual information may perform better than ASD, it is not chosen in this paper because mutual information is closely tied to the entropy of information and can be formulated directly from entropy, and it is better to keep ASD and CFI independent. Furthermore, SAGA, which builds on GA, is used in band selection; it includes a schedule of temperatures and approaches the global minimum when the temperatures change gradually. Last but not least, CFI is employed for the first time to further optimize the band selection. Thus, we reduce the search space and computational complexity, while avoiding the selection of an excessive number of adjacent bands.
The remainder of this paper is organized as follows. Section 2 introduces subspace decomposition. Section 3 presents the proposed SAGA. In Section 4, a brief description of three related elements and the fuzzy measure, followed by CFI, is given. Section 5 provides the SVM classification adopted in this paper. Section 6 describes the proposed method. Experiments and analysis are presented in Section 7. Finally, Section 8 concludes the paper.

Subspace Decomposition
The main characteristics of hyperspectral remote sensing data are a large number of imaging channels (approximately 220 bands) and narrow spectral bands. The spectrum of hyperspectral data is highly concentrated, which makes the overall and local characteristics quite different. We may lose some important local characteristics if we select the bands from the total space. Viewed as a whole, the bands are notably characterized by groups. We can divide all bands into several groups wherever a lower correlation exists between adjacent bands. Subspace decomposition not only reduces the dimension of the images, but also significantly improves the efficiency of data processing. Division of data sources based on ASD and fusion classification based on consensus theory is proposed in [15]. ASD thus remains the commonly used method. According to the correlation matrix between the bands of the hyperspectral image, the full-dimensional data space is adaptively decomposed into several subspaces with different dimensionalities. In each subspace, the bands have very strong correlation, while the energy is more concentrated. Hence, the full data dimensionality can be logically reduced.
Since different bands have different correlations, the subspaces do not all have the same dimensionality. The goal is therefore to match the features of each subspace with one or a few classes. For this purpose, the new method primarily depends on the correlation matrix $R$ between different bands. The element $r_{ij}$ of the correlation matrix $R$ is defined as

$$r_{ij} = \frac{E[(x_i - \mu_i)(x_j - \mu_j)]}{\sqrt{E[(x_i - \mu_i)^2]\, E[(x_j - \mu_j)^2]}}. \tag{1}$$

The absolute value $|r_{ij}|$ ranges between 0 and 1. The closer $|r_{ij}|$ is to 1, the stronger the correlation between the two bands. $\mu_i$ and $\mu_j$ are the mean values of bands $x_i$ and $x_j$, respectively, and $E[\cdot]$ denotes the mathematical expectation.
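As a rough illustration of (1) and of the grouping idea, the sketch below computes the band-to-band correlation matrix with NumPy and splits contiguous bands into subspaces. It is not the authors' implementation: the function names, the synthetic five-band cube, and the rule of cutting wherever the correlation between neighbouring bands falls below the threshold are assumptions made for the example.

```python
import numpy as np

def band_correlation_matrix(cube):
    """cube has shape (rows, cols, bands); returns the bands x bands matrix R of (1)."""
    pixels = cube.reshape(-1, cube.shape[-1]).astype(float)
    return np.corrcoef(pixels, rowvar=False)  # each column (band) is one variable

def split_into_subspaces(corr, threshold):
    """Assumed rule: start a new subspace when |r| between neighbouring bands < threshold."""
    groups = [[0]]
    for band in range(1, corr.shape[0]):
        if abs(corr[band - 1, band]) >= threshold:
            groups[-1].append(band)
        else:
            groups.append([band])
    return groups

# Synthetic cube (hypothetical data): bands 0-2 share one signal, bands 3-4 another.
rng = np.random.default_rng(0)
base_a = rng.normal(size=(20, 20))
base_b = rng.normal(size=(20, 20))
bands = [base_a, base_a, base_a, base_b, base_b]
cube = np.stack([b + 0.01 * rng.normal(size=(20, 20)) for b in bands], axis=-1)

corr = band_correlation_matrix(cube)
groups = split_into_subspaces(corr, threshold=0.8)
print(groups)
```

Raising or lowering `threshold` changes how many subspaces are produced and how many bands each one holds.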

Simulated Annealing Genetic Algorithm
The traditional selection, cross-over, and mutation operators, together with fitness-proportional selection in GA, allow a superior chromosome to maintain or strengthen its predominance in subsequent generations, so the chromosome to which the search converges may not be the global optimum. SAGA combines the simulated annealing algorithm with GA. Thus, SAGA can perform the temperature-control function of the simulated annealing algorithm by controlling the selection probability [16]. If we want to sample from a distribution defined on a space of finite binary sequences, we employ the following:

$$f(x) \propto \exp\{-H(x)/\tau\},$$

where $x = \{x_1, x_2, \ldots, x_d\}$ is the $d$-dimensional binary vector with $x_i \in \{0, 1\}$, $\tau$ is the scale parameter (a so-called temperature that can be any value of interest), and $H(x)$ is the fitness function in terms of GA. First, a sequence of distributions $f_1(x_1), f_2(x_2), \ldots, f_N(x_N)$ is constructed as follows:

$$f_i(x_i) = \frac{1}{Z_i(t_i)} \exp\{-H(x_i)/t_i\}, \quad i = 1, 2, \ldots, N,$$

where, for each $i$, $Z_i(t_i) = \sum_{x_i} \exp\{-H(x_i)/t_i\}$ is the normalizing constant. For convenience, we denote the temperature ladder by $t = (t_1, t_2, \ldots, t_N)$. Note that we always set $t_N = \tau$, so that $f_N(x) = f(x)$ corresponds to the target distribution from which we obtain the sample. $\mathbf{x} = \{x_1, x_2, x_3, \ldots, x_N\}$ denotes a population of samples, where $x_i = \{x_1^i, x_2^i, \ldots, x_d^i\}$ is a sample from $f_i(x)$ and is called a chromosome or an individual in terms of GA, while $N$ represents the population size. In SAGA, the Boltzmann distribution of the population is expressed as

$$f(\mathbf{x}) = \frac{1}{Z(t)} \exp\Big\{-\sum_{i=1}^{N} H(x_i)/t_i\Big\},$$

where $Z(t) = \prod_{i=1}^{N} Z_i(t_i)$. The population is updated by selection, cross-over, mutation, and exchange operators.

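Selection. The probability of having the chromosome $x_i$ chosen first is

$$P(x_i \mid \mathbf{x}) = \frac{\exp\{-H(x_i)/t_s\}}{Z(\mathbf{x})},$$

and the probability of selecting the pair $(x_i, x_j)$ is

$$P[(x_i, x_j) \mid \mathbf{x}] = \frac{1}{(N-1)\, Z(\mathbf{x})} \big[\exp\{-H(x_i)/t_s\} + \exp\{-H(x_j)/t_s\}\big],$$

where $Z(\mathbf{x}) = \sum_{k=1}^{N} \exp\{-H(x_k)/t_s\}$ and $t_s$ is the selection temperature.

3.2. Cross-Over. One chromosome pair, such as $x_i$ and $x_j$ ($i \neq j$), is selected from the current population $\mathbf{x}$ through the roulette wheel. Two offspring, $y_i$ and $y_j$, are generated according to a specific cross-over operator. A new population $\mathbf{y} = \{x_1, x_2, x_3, \ldots, y_i, \ldots, y_j, \ldots, x_N\}$ is proposed and is accepted with probability $\min(1, r_c)$ according to the Metropolis-Hastings rule, which is expressed as follows:

$$r_c = \frac{f(\mathbf{y})}{f(\mathbf{x})} \cdot \frac{P[(y_i, y_j) \mid \mathbf{y}]}{P[(x_i, x_j) \mid \mathbf{x}]} = \exp\Big\{-\frac{H(y_i) - H(x_i)}{t_i} - \frac{H(y_j) - H(x_j)}{t_j}\Big\} \frac{P[(y_i, y_j) \mid \mathbf{y}]}{P[(x_i, x_j) \mid \mathbf{x}]},$$

where $P[(x_i, x_j) \mid \mathbf{x}]$ denotes the selection probability of $(x_i, x_j)$ from the population $\mathbf{x}$ and $P[(y_i, y_j) \mid \mathbf{y}]$ denotes the selection probability of $(y_i, y_j)$ from the population $\mathbf{y}$.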

Mutation.
We define the mutation operator as an additional move of the Metropolis-Hastings rule. One chromosome, such as $x_k$, is uniformly chosen from the current population $\mathbf{x}$. A new chromosome is generated by the addition of a random vector $e_k$, such that

$$y_k = x_k + e_k,$$

where $e_k$ is usually chosen to achieve a moderate acceptance probability for the mutation operation. The new population $\mathbf{y} = \{x_1, x_2, x_3, \ldots, y_k, \ldots, x_N\}$ is accepted with the probability $\min(1, r_m)$ according to the following Metropolis-Hastings rule:

$$r_m = \frac{f(\mathbf{y})}{f(\mathbf{x})} \cdot \frac{T(\mathbf{x} \mid \mathbf{y})}{T(\mathbf{y} \mid \mathbf{x})} = \exp\Big\{-\frac{H(y_k) - H(x_k)}{t_k}\Big\} \frac{T(\mathbf{x} \mid \mathbf{y})}{T(\mathbf{y} \mid \mathbf{x})},$$

where $T(\cdot \mid \cdot)$ denotes the transition probability between populations.

3.5. Fitness Function. Another key element of SAGA is the design of the fitness function. We use only the classification accuracy obtained from the training feature subset as the fitness function. The purpose of the iterative repetition is to determine the optimal feature subset and to maximize classification accuracy. The adopted classifier is SVM, which is described in Section 5.
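The mutation move with a temperature ladder can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's procedure: the energy is a hypothetical Hamming distance to a made-up target band mask (standing in for a fitness derived from SVM accuracy), the random vector is realized as a single bit flip, and because that proposal is symmetric the transition ratio in the acceptance rule equals 1.

```python
import math
import random

def energy(chromosome, target):
    """Toy energy H: number of bits that differ from a hypothetical target mask."""
    return sum(a != b for a, b in zip(chromosome, target))

def mutation_step(population, ladder, target, rng):
    """One Metropolis-Hastings mutation move on a uniformly chosen chromosome."""
    k = rng.randrange(len(population))            # uniform choice of chromosome
    old = population[k]
    pos = rng.randrange(len(old))
    new = old[:pos] + [1 - old[pos]] + old[pos + 1:]   # flip one bit
    delta = energy(new, target) - energy(old, target)
    # Accept with probability min(1, exp(-delta / t_k)); proposal is symmetric.
    if delta <= 0 or rng.random() < math.exp(-delta / ladder[k]):
        population[k] = new
        return True
    return False

rng = random.Random(0)
target = [1, 0, 1, 1, 0, 0, 1, 0]                 # hypothetical "good" band mask
ladder = [4.0, 2.0, 1.0, 0.5]                     # one temperature per chromosome
population = [[0] * len(target) for _ in ladder]

best = min(energy(c, target) for c in population)
for _ in range(2000):
    mutation_step(population, ladder, target, rng)
    best = min(best, min(energy(c, target) for c in population))
print(best)
```

Chromosomes at high temperature accept worse moves more often and explore, while the coldest one concentrates near low-energy masks; cross-over and exchange moves would be added on top of this loop.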

Choquet Fuzzy Integral
Based on subspace decomposition, the CFI method is used to further refine the selected bands. The definitions of the fuzzy measure and the Choquet integral are given in [18, 19].
With regard to the theory of information fusion, the fuzzy density $g_i$ serves as the importance or the contribution of the source $x_i$. The group of sources $X$ determines a unique fuzzy measure in the process of data fusion. Based on the fuzzy measure, Choquet proposed a fuzzy integral method.

Definition 3 (Choquet integral (see [19])). Given a function $h : X \to [0, 1]$, its Choquet integral with respect to the fuzzy measure $g$ is defined as

$$(C)\!\int h \, dg = \sum_{i=1}^{n} \big[h(x_i) - h(x_{i-1})\big]\, g(A_i). \tag{13}$$

In this equation, the value of the function $h(x_i)$ can be interpreted as a credibility estimation of the source $x_i$ for a specific target. Note that the values $h(x_i)$ are arranged in increasing order, $0 = h(x_0) \le h(x_1) \le h(x_2) \le \cdots \le h(x_n) \le 1$; the fuzzy measure $g$ is the importance or contribution of an information source with respect to the ultimate decision-making or estimation; and $A_i = \{x_i, x_{i+1}, \ldots, x_n\}$.
According to (13), the CFI can be seen as a weighted sum of $h(x_1), h(x_2), \ldots, h(x_n)$, where the weights depend on the rank of $\{x_i\}$ and the values $h(x_i)$ decide that rank; the CFI is therefore a nonlinear function of $h$. When $\lambda = 0$, the $g_\lambda$-fuzzy measure is a probability measure, and the CFI is a linear function of $h$. The CFI is used in data fusion if we regard $h(x_i) \in [0, 1]$ as the result of a goal judgment and $g_\lambda$ as the degree of importance or contribution. The CFI is thus a nonlinear combination of the results of the information sources with the importance of those sources.
Before computing the fuzzy integral, we must compute the value of $\lambda$. From (12), we know that $\lambda$ is the root of a high-order polynomial. If there are many sources, computing $\lambda$ becomes a burden, which hinders online and real-time use of the algorithm.
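A minimal sketch of the computation follows. It assumes that the fuzzy measure is the standard Sugeno $\lambda$-measure, whose parameter solves $1 + \lambda = \prod_i (1 + \lambda g_i)$, and that this is the polynomial the text refers to; the function names, the bisection solver, and the example densities are illustrative choices, not taken from the paper.

```python
import math

def solve_lambda(densities):
    """Root of prod(1 + lam*g_i) = 1 + lam with lam > -1, lam != 0 (bisection)."""
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0                                  # additive (probability) case
    f = lambda lam: math.prod(1.0 + lam * g for g in densities) - (1.0 + lam)
    if total < 1.0:                                 # root is positive
        lo, hi = 1e-9, 1.0
        while f(hi) < 0 and hi < 1e12:
            hi *= 2.0
    else:                                           # root lies in (-1, 0)
        lo, hi = -1.0 + 1e-9, -1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def measure(subset_densities, lam):
    """g(A) for a set of sources under the lambda-measure."""
    if lam == 0.0:
        return sum(subset_densities)
    return (math.prod(1.0 + lam * g for g in subset_densities) - 1.0) / lam

def choquet_integral(h, densities):
    """Sum of [h(x_i) - h(x_{i-1})] * g(A_i) with h sorted in increasing order."""
    lam = solve_lambda(densities)
    order = sorted(range(len(h)), key=lambda i: h[i])
    total, previous = 0.0, 0.0
    for rank, idx in enumerate(order):
        tail = [densities[j] for j in order[rank:]]  # A_i = {x_i, ..., x_n}
        total += (h[idx] - previous) * measure(tail, lam)
        previous = h[idx]
    return total

print(choquet_integral([0.1, 0.4, 0.9], [0.3, 0.4, 0.2]))
```

With only three sources, as in the band reorder step, the polynomial is of low degree, so the cost of solving for $\lambda$ is negligible; the burden mentioned above appears only when many sources are fused.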

SVM Classification Methods
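Training data are required to train the SVM model. However, these data usually cannot be separated without errors. The data points that are closest to the hyperplane are used to measure the margin, while SVM attempts to identify the hyperplane that maximizes the margin and minimizes a quantity proportional to the number of misclassification errors [20, 21]. SVM derives the optimal hyperplane as the solution of the following convex quadratic programming problem [22]:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\,(w \cdot \phi(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \tag{14}$$

where $\{(x_i, y_i)\}_{i=1}^{n}$ are the training samples with class labels $y_i \in \{-1, +1\}$, $\xi_i$ are slack variables, and $C$ is the regularization parameter. The aforementioned optimization problem can be reformulated through a Lagrange function, where the Lagrange multipliers can be found via dual optimization to generate a convex quadratic programming solution as follows [23-25]:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \tag{15}$$

where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ is the vector of Lagrange multipliers, while $K(\cdot, \cdot)$ is a kernel function, which is introduced as follows [26]:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j). \tag{16}$$

Thus, the final result is a discrimination function $f(x)$ conveniently expressed as a function of the data in the original (lower) dimensional feature space [27]:

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\Big). \tag{17}$$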
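To make the discrimination function concrete, the sketch below evaluates a kernel expansion of that form with a radial basis function kernel in NumPy. The support vectors, multipliers, and bias are hand-set, hypothetical values rather than the result of solving the dual problem, and the function names are illustrative only.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """K(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for all pairs of rows."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-gamma * np.sum(diff ** 2, axis=-1))

def classify(x, support_vectors, alphas, labels, bias, gamma):
    """sign( sum_i alpha_i * y_i * K(x_i, x) + b ) for each row of x."""
    scores = rbf_kernel(x, support_vectors, gamma) @ (alphas * labels) + bias
    return np.sign(scores)

# Hypothetical, hand-set model: one support vector per class.
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([-1.0, 1.0])
alphas = np.array([1.0, 1.0])

queries = np.array([[0.1, 0.0], [2.0, 1.9]])
pred = classify(queries, support_vectors, alphas, labels, bias=0.0, gamma=1.0)
print(pred)
```

In practice the multipliers and bias come from a trained solver, and only samples with nonzero multipliers (the support vectors) contribute to the sum.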

Proposed Method
6.1. Adaptive Subspace Decomposition. First, adaptive subspace decomposition is used to divide the bands into seven subspaces according to (1). All $r_{ij}$ values are computed, and a proper threshold is then set. Contiguous bands whose $|r_{ij}|$ reaches the threshold are placed in the same subspace. We can dynamically control the number of subspaces and the number of bands in each subspace by changing the threshold.

The Band Order Method in Subspace.
SAGA is used to find the optimal bands in each subspace. Here we choose the common binary coding method as the genetic coding mode, and the number of SAGA iterations is 50. Generally, a subspace has many bands, and all the suitable bands should be chosen. If a subspace has only one band, that band must be chosen.

The Band Reorder Method in Subspace.
After the bands are chosen according to SAGA, they also can be further optimized based on the CFI method.CFI takes into account the factors of entropy of information, correlation coefficient, and standard distance between the means.

Entropy of Information and Variance.
According to Shannon's information theory, entropy measures information content in terms of uncertainty. The entropy of the hyperspectral components represents the information content of each component. Thus, the higher the entropy, the richer the information content, resulting in a more meaningful representation. The entropy or total information [28] is defined as

$$H(X) = -\sum_{x} P_X(x) \log_2 P_X(x), \tag{18}$$

where $P_X(x)$ is the probability of pixel value $x$.
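Variance represents the deviation of the gray-scale values of the pixels from the mean value. The mean value $\mu_k$ and the variance $\sigma_k^2$ are computed as follows [29]:

$$\mu_k = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f_k(i, j), \tag{19}$$

$$\sigma_k^2 = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big[f_k(i, j) - \mu_k\big]^2, \tag{20}$$

where $k$ and $k+1$ are the indices of two adjacent bands, $M$ and $N$ represent the width and the height of the image, and $f_k(i, j)$ is the gray-scale value of pixel $(i, j)$ in band $k$.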
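The entropy, mean, and variance of a single band can be computed directly from its pixel values. The sketch below is illustrative only: it assumes integer gray levels (256 by default), a base-2 logarithm for the entropy, and the population form of the variance; the function names and the tiny 2 x 2 band are made up for the example.

```python
import numpy as np

def band_entropy(band, levels=256):
    """Shannon entropy (bits) of the gray-level histogram of one band."""
    counts = np.bincount(band.ravel(), minlength=levels)
    p = counts[counts > 0] / band.size          # empirical probabilities
    return float(-(p * np.log2(p)).sum())

def band_mean_variance(band):
    """Mean and population variance of the gray-scale values of one band."""
    mu = band.mean()
    return float(mu), float(((band - mu) ** 2).mean())

# Hypothetical band: half the pixels at 0, half at 255.
band = np.array([[0, 0], [255, 255]], dtype=np.uint8)
entropy = band_entropy(band)
mu, var = band_mean_variance(band)
print(entropy, mu, var)
```

A band with two equally likely gray levels carries one bit of entropy, while a constant band carries none, which matches the reading that higher entropy means richer information.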

Correlation Coefficients.
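In statistics, the correlation coefficient denotes the accuracy of a least-squares fit to the original data. It is a normalized measure of the strength of the linear relationship between two variables. Correlation is employed in many types of applications, such as hyperspectral image processing, where it is used to measure and quantitatively compare the similarity between bands [30]. The two-dimensional normalized correlation function for image processing is shown below:

$$\mathrm{CC}_{k,k+1} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} \big[f_k(i, j) - \mu_k\big]\big[f_{k+1}(i, j) - \mu_{k+1}\big]}{\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} \big[f_k(i, j) - \mu_k\big]^2 \, \sum_{i=1}^{M} \sum_{j=1}^{N} \big[f_{k+1}(i, j) - \mu_{k+1}\big]^2}}, \tag{21}$$

where $f_k(i, j)$ and $\mu_k$ denote the gray-scale value at pixel $(i, j)$ and the mean value of band $k$, and CC is a real number between −1 and 1.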

Standard Distance between the Means.
It is necessary to analyze in which bands the object classes are easy to distinguish [31], that is, the statistical distance between object classes in each band. The standard distance between the means $d$ is defined as

$$d = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}, \tag{22}$$

where $\mu_1$ and $\mu_2$ are the spectrum means of the corresponding regions of the two samples, and $\sigma_1$ and $\sigma_2$ are the variances of the corresponding regions of the two samples. $d$ reflects the separability of the two samples in each band.
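As a small worked illustration of the separability measure, the sketch below computes the distance for two sets of pixel values from one band. It assumes the common normalized form $|\mu_1 - \mu_2| / (\sigma_1 + \sigma_2)$ with $\sigma$ taken as the standard deviation; if the variance is intended instead, `std()` would be swapped for `var()`. The function name and sample values are hypothetical.

```python
import numpy as np

def standard_distance(sample_a, sample_b):
    """|mean_a - mean_b| / (std_a + std_b) for two sets of pixel values in one band."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    return abs(a.mean() - b.mean()) / (a.std() + b.std())

# Hypothetical pixel values of two classes in a single band.
class_a = [1, 1, 3, 3]    # mean 2, standard deviation 1
class_b = [5, 5, 9, 9]    # mean 7, standard deviation 2
d = standard_distance(class_a, class_b)
print(d)
```

A larger value means the two classes overlap less in that band, so the band is more useful for telling them apart.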
Then, the procedure of the band reorder method using CFI is as follows.
(1) According to (18), the entropies of information in each subspace are computed and recorded as $x_1$.
(2) According to (21), the correlation coefficients in each subspace are computed and recorded as $x_2$.
(3) According to (22), the standard distances between the means in each subspace are computed and recorded as $x_3$.
(4) A belief function is constructed, and the domain is $X = \{x_1, x_2, x_3\}$.
The relations between the index value of each factor and the band reorder are described below.
(1) The bigger the entropy is, the richer the information is.
(2) The smaller the correlation coefficient is, the more independent the band is.

Subspace Decomposition Experiment.
The ASD scheme is used to obtain the correlation values between the bands. Table 1 gives part of the correlation matrix $R$ computed according to (1).
As presented in Table 1, the autocorrelation coefficient of each band is equal to 1, and the correlation value is very high.In this paper, the ASD method is performed using the correlation criterion of a given threshold, which is 0.8.The full data space is decomposed into seven subspaces.The dimensions of each subspace are shown in Table 2.
From the 220 spectral channels acquired by the AVIRIS sensor, 41 bands were discarded because they were affected by atmospheric problems.The discarded bands were as follows: 1-4, 78, 80-86, 103-110, 149-165, and 217-220.As a result, the new dimensions of each subspace are shown in Table 3.
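The band bookkeeping above is easy to check in code. The short sketch below rebuilds the list of retained band numbers from the discarded ranges quoted in the text, using 1-based band numbers; the variable names are illustrative.

```python
# Discarded band ranges as listed in the text (inclusive, 1-based band numbers).
discarded_ranges = [(1, 4), (78, 78), (80, 86), (103, 110), (149, 165), (217, 220)]

discarded = {b for lo, hi in discarded_ranges for b in range(lo, hi + 1)}
kept = [b for b in range(1, 221) if b not in discarded]

print(len(discarded), len(kept))
```

The retained list is what the subspace boundaries would be re-expressed against when the new subspace dimensions are tabulated.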

SAGA in Each Subspace. The hyperspectral image is categorized into seven classes according to the ground-truth data. The ratio of training to test samples is 1 : 3 because SVM is suitable for small samples. SAGA is used in each subspace, and the fitness is computed and illustrated in Figure 3. We select the optimal band in each subspace. Table 4 shows the index values of the bands when the threshold is 0.940.
SAGA determines which band or bands are selected in each subspace, but it cannot indicate which bands have higher priority than others. The CFI index values are then used to further refine the selected bands, which yields a more effective optimization.

Computational Time Complexity.
One issue needs to be considered: constructing and analysing the proposed procedure may consume considerable time. Thus, we compare the time complexity of the four methods GA, SAGA, CFI, and SAGA-CFI in this part. The time complexity of SAGA-CFI is $O(n^2)$, the same as that of the other three methods. This means that the processing cost of SAGA-CFI is of the same order as that of the others.
In this work, we implement two other similar classification methods for hyperspectral images to compare with the algorithm proposed in this paper. One method is based on SAGA and SVM classification (SAGA-SVM). The other method is based on CFI and SVM classification (CFI-SVM). The method of this paper is based on SAGA, CFI, and SVM classification (SAGA-CFI-SVM). The error matrices of the three methods are presented in Tables 6, 7, and 8, while the total accuracy and Kappa value are given in Table 9. The threshold is 0.94 in all of the above methods. Table 10 shows how the total accuracy and Kappa value vary with the threshold.
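For readers who want to reproduce the summary figures from an error matrix, the sketch below computes the total accuracy, the Kappa value, and the per-class producer's and user's accuracies. It assumes the common convention that rows hold the classified labels and columns hold the reference labels; with the opposite layout the two per-class measures swap. The function name and the 2 x 2 matrix are made up for the example.

```python
import numpy as np

def accuracy_report(error_matrix):
    """Total accuracy, Kappa, and per-class accuracies from an error matrix."""
    m = np.asarray(error_matrix, dtype=float)
    n = m.sum()
    diag = np.diag(m)
    row_tot = m.sum(axis=1)                      # totals per classified label
    col_tot = m.sum(axis=0)                      # totals per reference label
    overall = diag.sum() / n
    expected = (row_tot * col_tot).sum() / n ** 2    # chance agreement
    kappa = (overall - expected) / (1.0 - expected)
    return {
        "overall": overall,
        "kappa": kappa,
        "producer": diag / col_tot,
        "user": diag / row_tot,
    }

# Hypothetical two-class error matrix.
report = accuracy_report([[40, 10], [5, 45]])
print(report)
```

Kappa discounts the agreement expected by chance, which is why it is reported next to the total accuracy.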
In the error matrix, the producer's accuracy (PA) is defined as

$$\mathrm{PA}_i = \frac{x_{i,i}}{x_{+i}},$$

and the user's accuracy (UA) is defined as

$$\mathrm{UA}_i = \frac{x_{i,i}}{x_{i+}},$$

where $x_{i,i}$ is the value on the major diagonal of the $i$th row in the error matrix, $x_{i+}$ is the total of the $i$th row, and $x_{+i}$ is the total of the $i$th column.

Conclusions
This paper combined SAGA, CFI, and SVM to classify hyperspectral remote sensing images. On the basis of subspace decomposition, SAGA was used in each subspace to lower the computational complexity and select suitable bands, and the CFI method was adopted to further refine the selected bands in order to increase classification accuracy. SAGA-CFI-SVM was implemented and compared with conventional algorithms. The comparison results show that the proposed method is superior in terms of classification accuracy.
The classification of hyperspectral remote sensing images based on SAGA-CFI-SVM in this paper is far from complete and thus requires further research. One open problem is the further reduction of the computational complexity of SAGA and the acceleration of the search procedure. Another is the thorough improvement of the kernel function to obtain significantly higher classification accuracy. Last but not least, we need to study classification based on a selective ensemble of support vector machines, which may further improve the accuracy.

Figure 1: Overall structure of the proposed method.

7.1. Hyperspectral Images. Experiments were conducted on a hyperspectral data set from the Northwest Indiana Indian Pine test site 3 (2 × 2 mile portion of Northwest Tippecanoe County, Indiana) on June 12, 1992. These data include 145 by 145 pixels and 220 bands. The false color image is shown in Figure 2, which is composed of band 89, band 5, and band 120.
SAGA in subspaces 3, 6, and 7 is unnecessary because each of these subspaces contains only one band. The kernel function used is a radial basis function, while the two SVM parameters (i.e., $C$ and $\gamma$) are selected based on fivefold cross-validation during the training phase. The search range is $[2^{-3}, 2^{10}]$ for $C$ and $[2^{-8}, 2^{2}]$ for $\gamma$.

7.4. Index Value of CFI. Entropy, correlation coefficient, and standard distance between the means of each band are computed. The index values of CFI are then obtained and sorted in descending order in each subspace. The bigger the index value is, the more important the band is. A threshold is then set on the index value.

7.6. Classification Experiment. The hyperspectral image is also categorized into seven classes, while the ratio of training to test samples remains 1 : 3. The numbers of training samples and of test samples are shown in Table 5.

Table 2: Dimensions and bands of each subspace.

Table 3: New dimensions and bands of each subspace.

Table 4: Band number and index value in each subspace.

Table 5: Numbers of training samples and of test samples.