A Research on the Combination Strategies of Multiple Features for Hyperspectral Remote Sensing Image Classification

It has become common to employ multiple features in the identification of images acquired by hyperspectral remote sensing sensors, since more features give more information and have complementary properties. Few studies, however, have discussed the combination strategies of multiple feature groups. This study systematically investigates the problem. We extracted different groups of features from the initial hyperspectral images and tried different combination scenarios. We integrated spectral features with different textural features and employed different dimensionality reduction algorithms. Experimental results on three widely used hyperspectral remote sensing images suggested that “dimensionality reduction before combination” performed better, especially when the textural features performed well. The study further compared different combination frameworks of multiple feature groups, including direct combination, manifold learning, and the multiple kernel method. The experimental results demonstrated the effectiveness of direct combination with an autoweight calculation.


Introduction
The analysis of hyperspectral images has been increasingly discussed in recent years. In the classification problem, it has been widely accepted that features from different views help to better recognize objects, and it is common to extract multiple features before the classification procedure. Researchers have therefore made great efforts in several aspects. First, people have long been striving to extract desirable and suitable features from hyperspectral remote sensing images for better representation [1][2][3][4][5][6]. It has been recognized that linear features are less effective than nonlinear features [7], although obtaining nonlinear features reduces efficiency in many cases. Some efforts have been made to combine both kinds of features simultaneously [8]. Second, more recently, many researchers have successfully designed different frameworks [9][10][11][12] to organize different types of features, such as texture features, shape features, and spectral features, because the views from different feature spaces have particular statistical properties [13]. Finally, great efforts have been made in the classification process. Classifiers based on kernels [14] have dealt well with the Hughes phenomenon [15]. This has led to a trend towards multiple kernel learning (MKL) [16][17][18], in which different groups of features have different kernel matrices and a composite kernel is finally produced.
In general, the above approaches combine features extracted from particular bands or features (e.g., the top principal component) with the initial hyperspectral bands and then put the different groups of features into dimensionality reduction (DR) frameworks or classifiers [19,20]. However, the complementary properties of multiple features have not been widely considered or analyzed. Few approaches have taken global features, which are transformed or converted from the complete set of hyperspectral bands, as a group of input features. More frequently, approaches like principal component analysis (PCA) [21], linear discriminant analysis (LDA) [22], isometric feature mapping (ISOMAP) [23], and Laplacian eigenmaps (LE) [24] have been exploited only as global dimensionality reduction techniques. The low-dimensional output features have reduced the contribution of each group of features [9,10].
In this paper, we first made a systematic study of two schemes for hyperspectral image classification based on multiple features, utilizing different DR tools (linear and nonlinear) and different types of features. One scheme combined spectral features with other features before the DR process; the other reduced the dimensionality of the original hyperspectral features before combining them with other extracted features. Based on the experimental results on three hyperspectral datasets, we suggest that different decisions should be made in different circumstances. Building on this study, we further compared different combination frameworks in hyperspectral remote sensing image classification. We selected complementary features including linear and nonlinear global features (in this paper, we take features converted from all the bands as "global features") and two kinds of textural features (extracted from certain bands or layers). Three combination frameworks were tested on three frequently used hyperspectral datasets, which comprised two scenes collected by the airborne visible/infrared imaging spectrometer (AVIRIS) over the Indian Pine region and Salinas valley, and one scene collected by the reflective optics spectrographic imaging system (ROSIS) over Pavia University.
The remainder of the paper is organized as follows. Section 2 provides the details of the three hyperspectral remote sensing datasets employed in the experiments and the process of the proposed method. The experimental results are then reported in detail in Section 3, including the comparison of the two classification schemes based on multiple features and the test of the combination frameworks, with the results given in both accuracy and visual perspectives. Discussions based on the classification results are also included in this section. Finally, a general summary of the paper is presented in Section 4.

Methodology
2.1. Multiple Feature Extraction
2.1.1. Dimensionality Reduction of Spectral Features. In this paper, we consider features obtained from all the spectral bands as global features (GF). Generally, GF are divided into two categories. One is linear features, which commonly derive the information from the original spectral image bands by multiplying by a transformation matrix. Among them, PCA is a conventional linear transformation that uses no class label information and has quite high time efficiency. The other is nonlinear features, including manifold learning features and kernel features [25,26]. More and more investigators have focused on exploiting nonlinear features in classification problems because linear features do not take into account the underlying nonlinear class boundaries [27]. However, linear and nonlinear global features have rarely been combined in previous studies, and the complementary properties of the two types of features have not been widely discussed.
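As a rough illustration of the linear case, the sketch below derives principal components by multiplying the mean-centered band matrix by a transformation matrix. The function name `pca_features`, the pixels-by-bands array layout, and the random stand-in data are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def pca_features(pixels, n_components=10):
    """Project (n_pixels, n_bands) spectra onto the top principal components."""
    centered = pixels - pixels.mean(axis=0)
    # Covariance between bands is symmetric, so eigh is appropriate.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    transform = eigvecs[:, order]  # (n_bands, n_components) transformation matrix
    return centered @ transform, transform

rng = np.random.default_rng(0)
cube = rng.normal(size=(500, 158))  # hypothetical stand-in for 158-band pixels
pcs, transform = pca_features(cube, n_components=10)
```

No class labels enter the computation, which matches the description of PCA as an unsupervised linear transformation.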
2.1.2. Textural Feature Extraction. In this paper, we refer to textural features derived from certain spectral bands or features as TF (textural features). Many approaches have supplemented spectral features with TF because they describe the image from a different perspective and give some detailed information. Two frequently used textural features are exploited in the scheme: filter-based Gabor features and statistical GLCM features.
(1) 2D Gabor Textural Feature. We follow the procedure in [30], in which a Gabor function $G_{s,d}(\mathbf{x})$ is defined over the image location $\mathbf{x} = (x, y)$ in the spatial domain, and the frequency vector $\mathbf{k}$ determines the scales and directions of the Gabor functions. In our experiment, parameter $f$ is fixed to 2. The scale parameter $s$ ranges from 0 to 3, and the direction parameter $d$ ranges from 0 to 7, which gives 4 scales and 8 directions; $s$ and $d$ are both integers. Parameter $\delta$ is fixed to $2\pi$, which represents the number of oscillations under the Gaussian envelope. According to [10], the textural images derived from Gabor filters are the real part of the convolution of the image $I(x, y)$ with the Gabor functions for the different $s$ and $d$:
$F_{s,d}(x, y) = G_{s,d}(x, y) * I(x, y)$. (3)
(2) GLCM Textural Feature. The gray-level co-occurrence matrix (GLCM) [31] textural feature is a widely used statistical feature. Given a certain distance and direction, a gray-level co-occurrence matrix is built by calculating the probability that two gray levels occur together at that offset from a pixel. Various features can be obtained from the GLCM, and we extract 8 features for the combination in the experiment: mean value, variance, homogeneity, contrast, dissimilarity, entropy, second moment, and correlation (details of the calculation process can be found in [31]). In the experiment, the grayscale quantization level is fixed to 64 and the processing window is 3 × 3.

Combination
(1) Concatenate and normalize different groups of features.
(2) Calculate the mean values and variances of all classes for each feature from the sample set, and compute the standardized distance between each pair of classes for each feature. The standardized distance $d$ of a feature is computed from $\mu_s$ and $\mu_t$, the mean values of class $s$ and class $t$, and $\sigma_s$ and $\sigma_t$, the standard deviations of class $s$ and class $t$.
According to the standardized distances, if a set of selected features gives rise to the smallest distance within classes and the biggest distance between classes, the best classification result is likely to be obtained. Next, we extend the method to fit multiple classes and multiple features.
(3) Calculate the sum of the standardized distances between each pair of classes for each feature, and allocate the weight multipliers for the different feature groups. The expression of a sample is defined as $X = \{x_{i1}, x_{i2}, \ldots, x_{iL_i}\}_{i=1}^{n}$, in which $L_i$ is the number of features in the $i$th group and $n$ is the number of feature groups ($n = 4$: PC, LE or ISOMAP, Gabor, and GLCM). We define $w_i$ as the weight multiplier of the $i$th group. It is calculated from $d^{i}_{pqk}$, the average standardized distance of the $k$th feature in the $i$th group between classes $p$ and $q$, and from $l$, the number of all the classes.
(4) Renew the representation of the sample: $X = \{w_i x_{i1}, w_i x_{i2}, \ldots, w_i x_{iL_i}\}_{i=1}^{n}$.
The advantages of the weight estimation method are as follows:
(i) Each type of feature has its own weight, which maintains the specific properties of the different features.
(ii) The weight multipliers are around 1, so the method does not greatly alter the normalized values of the features.
(iii) The method has high time efficiency because it needs no iterative procedure.
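The weight estimation steps can be sketched as follows. The exact formulas are not reproduced here, so two points are assumptions rather than the paper's definitions: the standardized distance is taken in the common form |μ_s − μ_t| / (σ_s + σ_t), and each group's weight is its mean per-feature distance sum divided by the average over groups, which keeps the multipliers around 1. All function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def standardized_distance_sums(X, y):
    """Per-feature sum over all class pairs of the standardized distance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) for c in classes])
    total = np.zeros(X.shape[1])
    for s, t in combinations(range(len(classes)), 2):
        # ASSUMPTION: d = |mu_s - mu_t| / (sigma_s + sigma_t); the small
        # constant only guards against division by zero.
        total += np.abs(means[s] - means[t]) / (stds[s] + stds[t] + 1e-12)
    return total

def group_weights(groups, y):
    """groups: list of (n_samples, L_i) arrays, assumed already normalized."""
    # ASSUMPTION: score a group by its mean per-feature distance sum, then
    # divide by the average score so that the weights average exactly 1.
    scores = np.array([standardized_distance_sums(G, y).mean() for G in groups])
    return scores / scores.mean()

def reweight(groups, y):
    """Multiply every feature of group i by w_i and concatenate the groups."""
    w = group_weights(groups, y)
    return np.hstack([w_i * G for w_i, G in zip(w, groups)]), w

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
informative = y[:, None] * 2.0 + rng.normal(size=(90, 3))  # separates classes
noisy = rng.normal(size=(90, 5))                           # carries no class signal
X_weighted, w = reweight([informative, noisy], y)
```

In this toy setting the class-separating group receives the larger multiplier, and no iteration is involved, in line with advantage (iii).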

Combination and Classification.
As discussed before, we have obtained several global features and two kinds of textural features. For GF, we select the first 10 PC, 10 ISOMAP features, and 10 LE features, while for TF, 32 Gabor features (4 scales associated with 8 directions) and 8 GLCM features are extracted from the top PCA component. The two textural features have been recognized to have complementary properties [32]. For GF, 10 features both avoid a large amount of calculation and give sufficient information, while for TF, the 8 GLCM features are frequently employed, as are the Gabor features. Although different studies have selected different numbers of scales and directions for Gabor features, we use relatively few features in order to increase the efficiency of the calculation procedure. We do not analyze the properties of different parameters in detail in this paper.
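A minimal sketch of a 32-filter Gabor bank (4 scales, 8 directions) applied to a single image, such as the top PCA component, is given below. The kernel follows the widely used Gabor wavelet form; that form, the maximum frequency π/2, the 15 × 15 kernel size, and the function names are assumptions and may differ from the definition in [30].

```python
import numpy as np

def gabor_kernel(s, d, f=2.0, delta=2 * np.pi, size=15):
    """Complex 2D Gabor kernel for scale s (0..3) and direction d (0..7)."""
    k_mag = (np.pi / 2) / f ** s   # ASSUMPTION: maximum frequency pi / 2
    phi = np.pi * d / 8            # 8 evenly spaced directions
    kx, ky = k_mag * np.cos(phi), k_mag * np.sin(phi)
    half = size // 2               # size is expected to be odd
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k_mag ** 2 / delta ** 2) * np.exp(
        -k_mag ** 2 * (x ** 2 + y ** 2) / (2 * delta ** 2))
    # Oscillating carrier minus a constant that removes the DC response.
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-delta ** 2 / 2)
    return envelope * carrier

def convolve_same(image, kernel):
    """FFT-based linear convolution cropped to the input image size."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    full = np.fft.ifft2(np.fft.fft2(image, s=shape) * np.fft.fft2(kernel, s=shape))
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + H, left:left + W]

def gabor_features(image, n_scales=4, n_directions=8, size=15):
    """Stack the real parts of the filter responses: one map per (s, d)."""
    maps = [convolve_same(image, gabor_kernel(s, d, size=size)).real
            for s in range(n_scales) for d in range(n_directions)]
    return np.stack(maps)

rng = np.random.default_rng(0)
image = rng.random((24, 24))       # hypothetical stand-in for the top PC image
feats = gabor_features(image)
```

A fixed 15 × 15 support truncates the widest (coarsest-scale) envelopes, so a real implementation would likely scale the kernel size with s.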
For the first two datasets, ISOMAP features are employed as the only features based on manifold learning, while for the third dataset, we utilize LE features instead of ISOMAP.
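The GLCM side of the TF group can be sketched in a similar way. The statistic definitions below are common textbook forms and are assumptions; the exact formulas used in the paper are those of [31]. The sketch computes one symmetric co-occurrence matrix for a quantized patch and a single offset, whereas the paper evaluates the statistics in a 3 × 3 window with 64 gray levels.

```python
import numpy as np

def quantize(image, levels=64):
    """Map an image linearly onto integer gray levels 0 .. levels - 1."""
    span = image.max() - image.min()
    scaled = (image - image.min()) / (span if span > 0 else 1.0)
    return np.clip((scaled * levels).astype(int), 0, levels - 1)

def glcm(patch, levels=64, offset=(0, 1)):
    """Symmetric, normalized co-occurrence matrix for one (row, col) offset."""
    dr, dc = offset
    rows, cols = patch.shape
    P = np.zeros((levels, levels))
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            i, j = patch[r, c], patch[r + dr, c + dc]
            P[i, j] += 1
            P[j, i] += 1   # count both directions so that P is symmetric
    return P / P.sum()

def glcm_stats(P):
    """Eight statistics of a symmetric normalized GLCM (assumed definitions)."""
    i, j = np.indices(P.shape)
    mean = (i * P).sum()
    var = ((i - mean) ** 2 * P).sum()
    nz = P[P > 0]
    return {
        "mean": mean,
        "variance": var,
        "homogeneity": (P / (1.0 + (i - j) ** 2)).sum(),
        "contrast": ((i - j) ** 2 * P).sum(),
        "dissimilarity": (np.abs(i - j) * P).sum(),
        "entropy": -(nz * np.log(nz)).sum(),
        "second_moment": (P ** 2).sum(),
        # A constant patch has zero variance; report full correlation there.
        "correlation": ((i - mean) * (j - mean) * P).sum() / var if var > 0 else 1.0,
    }

checker = np.array([[0, 1], [1, 0]])
stats = glcm_stats(glcm(checker, levels=2))
```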
The paper addressed three scenarios to combine multiple features.

(i) Scenario 1: direct combination
We directly combine different types of feature vectors, such as GF and TF; thus, a longer feature vector is formed.
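Direct combination amounts to normalizing each group and stacking the groups side by side. In the sketch below, the min-max normalization and the function names are assumptions; the group sizes (10 PC, 10 manifold features, 32 Gabor, and 8 GLCM features) follow the text and give the 60-dimensional GF + TF vector.

```python
import numpy as np

def min_max_normalize(G):
    """Scale every feature (column) of a group to the range [0, 1]."""
    lo, hi = G.min(axis=0), G.max(axis=0)
    return (G - lo) / np.where(hi > lo, hi - lo, 1.0)

def direct_combination(groups):
    """Concatenate the normalized feature groups into one longer vector."""
    return np.hstack([min_max_normalize(G) for G in groups])

rng = np.random.default_rng(0)
group_sizes = [10, 10, 32, 8]   # PC, ISOMAP or LE, Gabor, GLCM
groups = [rng.normal(size=(50, k)) for k in group_sizes]
X = direct_combination(groups)  # shape (50, 60)
```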

(ii) Scenario 2: dimensionality reduction framework
The state-of-the-art dimensionality reduction algorithm LE has been widely discussed and utilized in many studies [33]. We consider the method reported in [9].
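For orientation, a plain Laplacian eigenmaps embedding of stacked features can be sketched with NumPy alone. This is not the framework of [9]; the binary k-nearest-neighbour graph, the neighbour count, and the function name are assumptions chosen to keep the example short.

```python
import numpy as np

def laplacian_eigenmaps(X, n_components=10, n_neighbors=8):
    """Embed (n_samples, n_features) data with basic Laplacian eigenmaps."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    # Binary k-nearest-neighbour graph; column 0 of the sort is the point itself.
    idx = np.argsort(dist2, axis=1)[:, 1:n_neighbors + 1]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)   # symmetrize the graph
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: its eigenvectors u give solutions
    # y = D^{-1/2} u of the generalized problem L y = lambda D y.
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    # Skip the first (trivial) eigenvector, whose eigenvalue is close to 0.
    return d_inv_sqrt[:, None] * eigvecs[:, 1:n_components + 1]

rng = np.random.default_rng(0)
stacked = rng.normal(size=(80, 60))   # hypothetical stacked GF + TF samples
embedding = laplacian_eigenmaps(stacked, n_components=10)
```

The dense distance matrix and eigen-decomposition scale poorly with the number of pixels, which is consistent with the longer run times reported for the nonlinear DR framework.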

(iii) Scenario 3: multiple kernel method
The existing multiple kernel learning algorithms have to calculate the weight factors through an iterative process. To avoid the complex iterative calculation, we estimate the weights of the different features before classification, using a method based on distance measurement. Finally, the basis kernels are multiplied by the relevant weight factors, which converts the problem to a simple kernel classifier. Once the combinations are decided, the SVM [34] classifier is employed to test on these features, with its parameters C and g determined through cross-validation with the training samples [35].
Table 1 lists the exhaustive class information and the number of samples for the IP dataset. Figure 1 shows the image and the labeled condition.
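For Scenario 3, the idea of multiplying basis kernels by fixed, pre-estimated weights can be sketched as a weighted sum of one RBF kernel per feature group. The RBF form, the shared `gamma`, the example weights, and the function names are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF basis kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def composite_kernel(groups_a, groups_b, weights, gamma=1.0):
    """Weighted sum of one basis kernel per feature group (weights are fixed)."""
    return sum(w * rbf_kernel(A, B, gamma)
               for w, A, B in zip(weights, groups_a, groups_b))

rng = np.random.default_rng(0)
groups = [rng.normal(size=(20, 10)), rng.normal(size=(20, 32))]
weights = [1.1, 0.9]   # hypothetical pre-estimated weight multipliers
K = composite_kernel(groups, groups, weights, gamma=0.1)
```

Because the weights are fixed in advance, the result is a single Gram matrix that could be handed to any SVM implementation accepting a precomputed kernel (for instance scikit-learn's `SVC(kernel="precomputed")`), with C and the kernel width chosen by cross-validation.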

Experimental Section
3.1.2. Salinas Scene. The Salinas (SL) dataset (Figure 2) was also acquired by AVIRIS. 204 valid bands are selected from the total of 224 bands. The resolution of the scene is 3.7 m, and the size is 512 × 217 pixels, of which 54129 pixels are labeled. 1% of the labeled pixels are considered as training samples for each class. The dataset is divided into 16 classes with the details listed in Table 2.

Research on Different Dimensionality Reduction Scenarios.
A systematic study is made of two DR scenarios for hyperspectral image classification based on multiple features. Different DR tools (linear and nonlinear) and different types of textural features are employed. One scenario is the conventional procedure, characterized by combining the hyperspectral bands with other features before reducing the dimensionality; the other reduces the dimensionality of the original hyperspectral bands before combining them with other extracted features. In addition, classification scenarios using only spectral features and only textural features help to compare and analyze the results. As a result, four scenarios are listed in Table 4. In scenario 3, we search for the best output dimension d among 5-45 with an interval of 5 for each dataset. In scenario 4, the first 10 features are selected after the DR process.
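The difference between the two DR scenarios can be made concrete with a short sketch. PCA stands in for the DR tool here, and the function names and random stand-in data are assumptions; the candidate dimensions 5 to 45 and the 10 retained spectral features follow the text.

```python
import numpy as np

def pca_reduce(X, d):
    """Project the rows of X onto the top d principal directions (via SVD)."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:d].T

def scenario_3(spectral, textural, d):
    """Combine first, then reduce the stacked features to d dimensions."""
    return pca_reduce(np.hstack([spectral, textural]), d)

def scenario_4(spectral, textural, n_spectral=10):
    """Reduce the spectral bands first, then append the textural features."""
    return np.hstack([pca_reduce(spectral, n_spectral), textural])

rng = np.random.default_rng(0)
spectral = rng.normal(size=(300, 158))   # hypothetical band values
textural = rng.normal(size=(300, 32))    # hypothetical Gabor features

candidate_d = list(range(5, 50, 5))      # 5, 10, ..., 45 as searched in scenario 3
features_3 = {d: scenario_3(spectral, textural, d) for d in candidate_d}
features_4 = scenario_4(spectral, textural)
```

In scenario 3 the textural columns are mixed with the much more numerous spectral columns inside the projection, whereas scenario 4 passes them through untouched.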
We repeat ten independent experiments for each case. In each trial, the samples are randomly selected from all the labeled pixels, and the selection is stratified by classes. We calculated the average overall accuracy, the kappa index, and the best d for the DR scheme in scenario 3. The parameters are optimized using the training samples. Three datasets (IP, SL, and PU) are tested, and the performance is reported in Table 5. For each method, d relates to the best performance in scenario 3. Table 5 shows that the classification accuracies in scenario 4 are generally higher than those in scenario 3, especially when the textural features perform well by themselves. Even if the textural features do not perform well, scenario 3 is not always superior to scenario 4. The reason can be explored by referring to scenario 1 and scenario 2. When the textural features outperform the spectral features and are fewer in number, scenario 4 clearly outperforms scenario 3. The feature group with the larger number of features may dominate a global DR transformation applied after the feature combination, regardless of whether the DR tool is linear or nonlinear. As a result, scenario 4 yields a higher accuracy, owing to the good performance and sufficient number of the textural features, such as Gabor features. In scenario 3, by contrast, the initial hyperspectral features, which are both more numerous and worse performing, reduce the accuracy. However, when the spectral features outperform the textural features, the accuracies in scenario 3 are close or superior to those in scenario 4, according to the "PC-GLCM" and "LE-GLCM" methods in Table 5. In addition, Table 5 shows that, regardless of whether the DR algorithm is linear or nonlinear, it is not easy to find an empirical d during the DR procedure.
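The evaluation protocol described above (class-stratified random sampling, with overall accuracy and the kappa index averaged over ten trials) can be sketched in plain Python. The function names and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, fraction, rng):
    """Pick the same fraction of sample indices from every class."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train = []
    for idxs in by_class.values():
        k = max(1, round(fraction * len(idxs)))   # at least one sample per class
        train.extend(rng.sample(idxs, k))
    return sorted(train)

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    po = overall_accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (po - pe) / (1 - pe)   # undefined when chance agreement pe equals 1

labels = [0] * 100 + [1] * 40     # toy label vector
splits = [stratified_sample(labels, 0.05, random.Random(trial))
          for trial in range(10)]  # ten independent trials
```

Each entry of `splits` would feed one training run; the reported figures are then the means of `overall_accuracy` and `kappa` over the ten runs.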
With the development of textural feature extraction techniques, the textural features that researchers exploit often outperform the initial hyperspectral features while being fewer in number. As a result, it is appropriate to reduce the dimensionality of the hyperspectral features before combining them with textural or other features. Generally, we cannot exactly predict the performance of different groups of features, so we simply reduce the dimensionality of the hyperspectral features to a certain extent and select a moderate number of low-dimensional representations.

Combination Frameworks.
In the experiment, we design 5 feature selection scenarios for comparison: hyperspectral bands only, GF only, TF only, integrating GF and TF, and integrating GF, TF, and hyperspectral bands. The results of the 5 groups of features associated with the 3 combining strategies are listed. In addition, for the IP scene, details of the DR framework (with no more than 60 dimensions) will be presented; for the SL scene, the classification accuracies with different groups of features will be shown in an intuitive way; for the PU scene, we will investigate the complementary properties of different feature groups. Also, the complementary properties of linear and nonlinear features will be discussed. Finally, classification results with and without weight estimation will be compared in accuracy.
Ten independent experiments are repeated for each case. In each trial, the samples are randomly selected from all the labeled pixels, and the selection is stratified by classes. Overall accuracy (OA) is calculated over the ten trials. We also get

IP Scenes.
Figures 4-6 present the classification results based on different groups of features associated with the 3 combining strategies. The accuracies can be seen in Table 6. In each figure, GF + TF has the best performance in both accuracy and visual perspectives. It should be mentioned that there is no need to add the initial hyperspectral features in the proposed method, because the accuracy decreases when the spectral features are added, as the results in Figures 4(e), 5(e), and 6(e) show. As a result, 60 input features yield the best result. Table 6 also shows that the nonlinear LE dimensionality reduction tool spends more time when dealing with the combined features. According to the comparison of the three organization schemes, it can be concluded that the overall calculation amount of the nonlinear DR procedure is greater than that of the classification procedure with relatively higher input dimensions. Also, the results in Figure 4 are better than those in Figures 5 and 6, so the dimensionality reduction framework or the multiple kernel method has not given rise to a better classification result than the direct way of combination. In addition, it is not easy to find a desirable dimension in the dimensionality reduction process (Figure 7).

SL Scenes.
The SL dataset has more bands and a higher resolution. Extremely high accuracy is yielded with the proposed scenario, and regularities similar to those in the IP dataset can be found. The results are listed in Figures 8-10. Table 7 lists the accuracies by different scenarios. Among the 16 classes, it is challenging to distinguish class 8 and class 15, while the GF + TF strategy shows a perfect accuracy rate.
However, no matter what kind of combination strategy is employed, the results do not vary much within a fixed group of features and remain in a certain range. The accuracies change more obviously and regularly with the variation of the input features (Figure 11). So we can suggest that the selection of features is more important than the combination approach in hyperspectral image classification. In addition, we can conclude that the proposed strategy applies to different combination schemes.

3.3.3. PU Scenes.
Among the three datasets, PU has the highest resolution and the most pixels. Table 8 lists the accuracies by different scenarios. In this high-resolution dataset, the complementary properties of GF and TF are more apparent. For example, in Figures 12, 13, and 14(a) and 14(b), with GF, misclassification occurs frequently between class 2 (grass) and class 6 (bare soil), while TF discriminates this class pair well; class 1 (asphalt road) and class 4 (tree) are challenging classes for TF because of their close location, but GF or spectral features perform well between the 2 classes. Figure 15 presents a direct view of the complementary properties of different feature groups.

Journal of Sensors
As discussed before, different groups of features have their specific properties. Different feature extraction algorithms yield different numbers of features, so it seems necessary to add weight factors to different groups of features in different circumstances. Table 9 shows the improvement brought by the autoweighting method in the proposed GF + TF strategy.

Discussions.
It is necessary to discuss the experimental results on the hyperspectral datasets.
The summary of the experiments with different DR schemes is as follows: combining textural features with low-dimensional spectral features proves to be a more appropriate strategy according to the experiments in Section 3.2, especially when the textural features perform well and are fewer in number. When the textural features do not work well, this is not always the case. However, great efforts have been made to extract favorable textural features for hyperspectral image expression, so in most circumstances, the textural features we select are empirically superior. As a result, scenario 4 in Table 4 is recommended.
The summary of the experiments with the combination scenarios is as follows: for all the datasets, the GF + TF method performs best in both accuracy and visual perspectives; "global features" and "textural features" present clear complementary properties in the classification results. The GF + TF combination with only 60 input features works well without the incorporation of the original hyperspectral bands. The weight estimation method proves to be effective when dealing with multiple features. In addition, the number of input features of the GF + TF strategy is independent of the band number of the images and of the type of sensor.
The summary of the experiments with different feature combination frameworks is as follows: the results yielded by different combination strategies do not vary much once the input features are fixed; compared with the "framework" strategy, the "direct combination" strategy with only the 60 GF + TF input features not only performs better but also avoids a large amount of calculation and a large number of dimensions; the multiple kernel method and the nonlinear DR framework sometimes lead to desirable results, but the best d is hard to determine, which is reflected both in our experiments and in other studies.
From the experimental results and the discussions above, we can conclude that the choice of which features to combine influences the classification results to a larger extent than the strategy for integrating multiple features. In addition, multiple features with complementary properties lead to good classification results. In this paper, we exploit 4 different features for combination, including linear and nonlinear "global features" and filter-based and statistics-based "textural features," which ensure complementary and diverse properties.
Pattern recognition has been applied in many fields. Hyperspectral remote sensing image classification is a particular application and has its own characteristics. Compared with image recognition (as in [9,13]) and the medical field (such as gene and protein classification [7]), remote sensing image classification problems have a relatively lower dimension or at least a smaller d/N (d represents the dimension and N represents the labeled samples). According to the overall statistics in [15], when d ≫ N (as in some cases of body, face, or object image recognition, or gene classification), the accuracy appears low (Figure 16). In this case, a DR algorithm will help to improve both the results and the classification efficiency. However, in the case of N > d, the accuracy is rather good, and a DR framework does not lead to evidently better results. In addition, it is not easy to find the best d in different cases according to our experimental results and Figure 7. In fact, researchers have not found an effective way of … (Figure 16). With the development of hyperspectral sensors, the number of bands may increase; thus, the area of the red frame in Figure 16 may grow. However, the proposed GF + TF strategy may still be practical because it is independent of the number of bands of the sensor.
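As a small worked example of the d/N argument, the ratio can be computed from the band counts and training fractions reported for the three datasets. Taking N as the approximate number of training samples (the labeled pixels times the training fraction) is an assumption, and per-class rounding in the actual experiments would change the counts slightly.

```python
datasets = {
    # name: (valid bands d, labeled pixels, training fraction)
    "IP": (158, 10171, 0.05),
    "SL": (204, 54129, 0.01),
    "PU": (113, 42776, 0.02),
}

def d_over_n(d, labeled, fraction):
    """Return (d / N, N) with N approximated as labeled * fraction."""
    n_train = round(labeled * fraction)
    return d / n_train, n_train

# Ratio when all valid bands are used as input features.
band_ratios = {name: d_over_n(*spec)[0] for name, spec in datasets.items()}
# Ratio for the 60-dimensional GF + TF input, which does not depend on the band count.
gf_tf_ratios = {name: d_over_n(60, labeled, frac)[0]
                for name, (_, labeled, frac) in datasets.items()}
```

Under this approximation every ratio stays below 1, which corresponds to the N > d regime discussed above.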

Conclusions
Hyperspectral sensors provide more spectral detail; however, problems arise along with the advantages. One is the high-dimension problem, and the other is the possibility of extracting multiple features from the hyperspectral images. In this paper, a systematic study is made to find an appropriate strategy to deal with classification problems involving multiple features. We then further exploit complementary features to improve hyperspectral image classification performance. Experiments on 3 hyperspectral datasets suggest that the GF + TF scheme is effective. The main contributions can be summarized as follows. First, based on the experiments, we suggest that DR algorithms work better as feature acquisition methods than as tools that merely reduce the dimensions. The paper further selects features from a different perspective in the multiview problem. In previous work, there have been feature combination ideas characterized by "spectral and nonspectral" or more recent schemes like "linear and nonlinear." However, we present a "global and nonglobal" strategy and take linear and nonlinear global features (like PC, ISOMAP, and LE) as a portion of the features for the first time. We have also reached the conclusion that features of different types are more likely to have complementary properties. Second, as concluded from the experimental results, feature selection proves to be more important than how multiple features are organized in hyperspectral image classification problems. Third, a systematic study has been made of the combination frameworks of multiple features, including direct combination, manifold learning, and the multiple kernel method. We have found that complex methods like the manifold framework and multiple kernels do not lead to an increase in accuracy; instead, the direct combination strategy with an autoweight calculation performs the best. Finally, we have compared hyperspectral image classification problems with other applications of pattern recognition and clearly analyzed the characteristics of the former.
As future work, we will continue to find more complementary features for integration in hyperspectral remote sensing image classification based on the experiments.For example, shape features have been widely developed recently and have not been considered in the study.In addition, we will further discuss the internal redundancy of each group of features.

3.1. Hyperspectral Image Data. Three commonly used datasets were tested in the experiments. All of them were acquired by hyperspectral sensors with different spatial resolutions. Researchers have been trying to improve the classification performance on these scenes for a long time. In this paper, experiments are carried out on these scenes under similar experiment conditions.
3.1.1. Indian Pine Scene. The Indian Pine (IP) dataset, derived from the airborne visible/infrared imaging spectrometer (AVIRIS), is one of the most commonly used hyperspectral image datasets for testing. The resolution of the image is 30 m, and the size is 145 × 145 pixels. The sensor contains 220 bands, of which 62 bands have to be discarded due to water absorption or noise, so that 158 valid bands are finally retained. The dataset mainly covers agricultural lands with 10171 labeled data points divided into 12 classes. In the experiment, 5% of the labeled data points are considered as training samples for each class.
3.1.3. Pavia University Scene. The Pavia University (PU) dataset (Figure 3) was acquired by ROSIS, and the location is Pavia University, Italy. The resolution is 1.3 m, which is the highest among the three datasets. The image size is 610 × 340 pixels, including 207400 data points. 113 valid bands are selected from the total of 115 bands, with 2 noisy bands removed. Different from the former scenes, PU mainly covers artificial lands. 2% of the 42776 labeled pixels are considered as training samples for each class. The land cover details are listed in Table 3.

Figure 4: Classification maps of the IP dataset with directly combined features; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the feature dimension for input.

Figure 5: Classification maps of the IP dataset with features under a manifold-based framework; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the dimension associated with the best performance.

Figure 6: Classification maps of the IP dataset by the multiple kernel method; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the feature dimension for input.

Figure 7: Relationship of d and accuracy in the IP dataset (d ≤ 60).

Figure 8: Classification maps of the SL dataset with directly combined features; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the feature dimension for input.

Figure 9: Classification maps of the SL dataset with features under a manifold-based framework; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the dimension associated with the best performance.

Figure 10: Classification maps of the SL dataset with directly combined features; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the feature dimension for input.

Figure 11: Accuracy change range with features and strategies on the SL dataset.

Figure 12: Classification maps of the PU dataset with features under a manifold-based framework; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the feature dimension for input.

Figure 13: Classification maps of the PU dataset by the multiple kernel method; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the dimension associated with the best performance.

Figure 14: Classification maps of the PU dataset with directly combined features; (a)-(e), respectively, represent different groups of input features: spectral features, GF, TF, GF combined with TF, and all features along with the spectral features. d is the feature dimension for input.

Figure 15: Complementary properties of PC and LE features for all classes in the PU dataset.

Figure 16: Hughes curves labeled by common application of remote sensing image classification.

Table 1: Sample distribution of IP dataset.

Table 2: Sample distribution of SL dataset.

Table 3: Sample distribution of PU dataset.

Table 4: Details of experimental scenarios.

Table 5: Experimental results of different datasets associated with different scenarios. "Sc." represents different scenarios in Table 4.

Table 6: Classification results by different features for IP dataset, including OA, kappa index, and the execution time.

Table 7: Classification results by different features for SL dataset, including OA, kappa index, and execution time.

Table 8: Classification results by different features for PU dataset, including OA, kappa index, and execution time.

Table 9: The effect of autoweighting method in the GF + TF strategy.