A Search Method for Optimal Band Combination of Hyperspectral Imagery Based on Two Layers Selection Strategy

This paper proposes a band selection method based on a two layers selection (TLS) strategy, which forms an optimal subset of the all-bands set to reconstitute the original hyperspectral imagery (HSI) and aims to achieve better performance with fewer bands. As its name implies, TLS reduces the dimensionality of HSI in two phases by picking bands with low correlation and a large amount of information into the target set. Specifically, the fast density peaks clustering (FDPC) algorithm is first used to select the most representative node in each cluster to build a candidate set. During the implementation, we normalize the local density and relative distance and use a dynamic cutoff distance to weaken the influence of density, so that selection is more likely to be carried out in scattered clusters than in high-density ones. After that, we conduct a further selection in the candidate set using the mRMR strategy and the comprehensive measurement of information (CMI), and the eventual winners are selected into the target set. Compared with six other state-of-the-art unsupervised algorithms on three real-world HSI data sets, the results show that TLS groups bands with lower correlation and richer information and has obvious advantages in overall accuracy (OA), average accuracy (AA), and Kappa coefficient.


Introduction
Hyperspectral imagery (HSI) is a combination of computer generated imagery (CGI) and spectral detection technology, and it helps us analyze the characteristics of objects without direct contact. Since each pixel in HSI carries both plane coordinates and spectral information, we usually describe HSI as a three-dimensional cube; that is, each band on the spectral axis corresponds to a 2D image. Because object surfaces absorb and reflect electromagnetic waves of various wavelengths to different degrees, and because the accuracy of spectral acquisition instruments keeps improving, spectra are distributed continuously over hundreds of narrow bands (bandwidth generally less than 10 nm). Up to now, HSIs obtained via remote sensing mapping have been widely applied to data analysis in many fields, such as mineral exploration [1], environmental and atmospheric monitoring [2,3], and agricultural information services [4]. Compared with color and multispectral images, HSI records more information because of its high resolution, which is very useful for target classification. However, it also brings technological obstacles such as high dimensionality and information redundancy owing to similar or overlapping bands. Existing studies have shown that high correlations frequently appear between adjacent bands, which may cause the "Hughes phenomenon" [5]. Therefore, we always preprocess the spectrum before classification, including noise removal and redundancy reduction, which effectively cuts operation costs and improves processing speed while maintaining the accuracy of image recognition.
There are two approaches to dimensionality reduction for HSI, i.e., band extraction and band selection (BS) [6,7]. The former projects all bands into a low-dimensional subspace to form a simplified image; however, it may change the inherent features of the information. Some recent technologies include singular spectrum analysis [8], sparse representation [9], and stacked auto-encoders [10]. In essence, BS is a combinatorial optimization problem; that is, we should find a band combination with rich information, low correlation, and good discrimination via an evaluation criterion function. Whatever method is adopted, it is difficult to increase speed and accuracy simultaneously, so a common practice is to improve the efficiency of dimensionality reduction by optimizing existing algorithms.
In this paper, our main contributions are summarized as follows. (1) A two layers selection strategy is proposed for generating an optimal band combination. Specifically, we first build a candidate set U from the bands selected by FDPC, and then spectral analyses are conducted to choose the high-quality elements of U, taking into consideration not only the information contained in a single band but also the interband correlation. (2) We put forward a new information evaluation method called comprehensive measurement of information (CMI). By introducing the standard deviation and the k-neighbors average similarity, it considers both the information of an individual band and the correlations among bands. (3) Inspired by the idea of the greedy algorithm, we adopt the mRMR method to enrich the target set iteratively, which minimizes redundancy and maximizes representativeness. The remaining sections are organized as follows. In Section 2, we introduce related research on BS technologies in recent years. In Section 3, the principles of FDPC and information analysis, as well as the implementation flow of mRMR, are presented in detail. In Section 4, we state concretely how to build an optimal subset to replace the original spectrum via the TLS algorithm. A series of experiments and comparative analyses of the results, conducted to demonstrate the efficiency of the proposed algorithm, are arranged in Section 5. Conclusions are given in the last section.

Related Work
As mentioned previously, BS is an effective way to reduce the spectral dimensions of HSI because this preprocessing removes the redundant parts contained in the original spectrum, which decreases the storage and computing consumption of subsequent image processing. If we know how various objects reflect electromagnetic waves, an object-spectrum dictionary can guide us to select bands accurately; however, obtaining such knowledge is time-consuming, costly, and in many cases impossible. Unsupervised methods adapt well to various application scenarios because they rely only on the band distribution and the interband relationships. At present, unsupervised BS is mainly categorized into ranking-based, searching-based, sparsity-based, and clustering-based methods, among others.

Unsupervised Band Selection Methods.
For a long time, research on BS has mainly focused on two themes. One is the selection algorithm, which is commonly designed as a supervised, semisupervised, or unsupervised method, and the other is the output evaluation criterion, which is adopted to measure the performance of an algorithm. For the purposes of this discussion, we briefly introduce some unsupervised methods, together with the corresponding algorithms used in Section 5. The ranking-based method evaluates the importance of each band with a criterion and uses the top-ranked bands instead of all bands to represent HSI. Clearly, this method can find the most discriminative bands, but high correlation is inevitable because the differences between bands are neglected. Constrained band selection (CBS) [11] and maximum variance principal component analysis (MVPCA) [12] are both typical ranking-based algorithms. Compared with CBS, MVPCA is more sensitive to noise, so it should be used selectively according to the characteristics of the data set. The searching-based method converts BS into an optimization problem of a given criterion and iteratively searches for the best bands to constitute a target set. For example, in [13], linear prediction (LP) is adopted to evaluate the similarity between a single band and the others, and on this basis, the best band in the current round is picked out. The sparsity-based method uses sparse representation or regression to reveal the information structure of a data set, and the representative bands are selected by solving an optimization problem with sparsity constraints. The improved sparse subspace clustering (ISSC) algorithm [14], which will be employed for comparative analysis in Section 5, is a sparse representation-based method.
Nodes belonging to the same cluster have similar features, so clustering-based methods select several exemplars to replace the entire cluster. Hierarchical clustering first initializes either the whole set or each single node as a cluster and then groups the nodes by splitting or aggregation. Classical algorithms include WaLuDi [15], BIRCH [16], and CURE [17]. For partition-based clustering algorithms, both the number and the centers of the clusters must be initialized in advance. The composition of each cluster is adjusted by constantly updating the ownership of nodes until the stop condition is met. Some typical algorithms, such as K-means [18] and FCM [19], are widely applied in various classification applications. From the perspective of geometric distribution, high-density areas are separated by low-density ones, and each cluster corresponds to a connected data subset with the maximum local density. Therefore, density-based algorithms handle classification problems with irregular spatial distributions well, and some algorithms, such as AP [20], DBSCAN [21], and FDPC [22], have shown good performance in clustering nonspherical distributions.
In the process of seeking a target set to reconstitute HSI, both the initialization parameters and noise strongly affect the result. In many cases, the capability of an algorithm depends heavily on its initial parameters, and improper parameters may cause deviations between the clustering results and the actual situation, or even significant errors. Furthermore, during image acquisition, noise is inevitably generated by the environment or the imaging equipment, which may bring obstacles to subsequent processing. Usually, noise nodes are outliers with low density, so density-based clustering algorithms have a prominent advantage in noise recognition.

Fast Density Peaks Clustering.
The FDPC algorithm was proposed by Rodriguez and Laio in 2014. It can obtain the globally optimal solution with a few parameters and a simple process (no iteration required), and an obvious advantage over other clustering-based algorithms is that it can find arbitrarily shaped clusters rather than just spherical regions. In the field of HSI processing, besides band selection [23], FDPC is also applied to superpixel segmentation [24]. Nevertheless, its performance still needs to be improved, including reducing the computational complexity, enhancing the adaptive ability of the parameters, and improving accuracy and robustness. The time complexity of FDPC is $O(n^2)$ without considering dimensions, where $n$ is the number of nodes. Accordingly, the algorithm is unsuitable for large-scale data clustering because of its high complexity. As improvements, researchers introduce parallel algorithms (e.g., EDDPC [25] and LSH-DDP [26]) or apply grid treatment in advance (e.g., DGB [27], DPCG [28], and PDPC [29]) to accelerate it. For example, FastDPC-KNN [30] provides a solution in which KNN cooperates with FDPC, and the time complexity is reduced to O(n log2 n). It uses a cover tree to speed up the calculation by distinguishing the type of density peak so as to avoid calculating distances over the global range.
Parameter self-adaption (PS) is another important research issue for FDPC. Specifically, the cutoff distance delimits the neighborhood size of each node, which directly determines the local density statistics and also strongly influences the composition of clusters. PS reduces the probability of errors caused by empirical settings and is more adaptable to various data scenarios. Researchers have employed methods such as density estimation [31] and ADPC-KNN [32] to realize PS.

FDPC for Candidate Set.
FDPC is based on the following two assumptions about each cluster: first, the density of a center is higher than that of the surrounding nodes, and second, the distance between the center and any node of higher density is relatively large. Moreover, there are two extremely important values in the algorithm, i.e., the local density $\rho$ and the relative distance $\delta$, and both depend on the similarity matrix $S$. A hyperspectral image $I$ can be described in both spectral and geometric spaces, $I = (b_1, b_2, \ldots, b_L) = (x_1, x_2, \ldots, x_N)$, where $L$ and $N$ denote the number of bands and pixels, respectively. Thus, $b_l = \{b_l^i \mid i = 1, 2, \ldots, N\}$ is the response of all pixels to the $l$th band, and $x_t = \{x_t^i \mid i = 1, 2, \ldots, L\}$ is the reflection of the $t$th pixel on the different bands. Generally, we first build an initial similarity matrix $S \in \mathbb{R}^{L \times L}$, in which the similarity between bands $b_i$ and $b_j$ is expressed by equation (1). In practice, the Gaussian kernel function $R_G(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, which is built on the Euclidean distance, is commonly applied. In equation (2), $d_{ij}$ is defined as the interband distance based on matrix $S$, from which we obtain the correlation between pairwise bands. Obviously, the closer two bands are, the higher the redundancy is.
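The quantities in this paragraph can be illustrated with a minimal sketch. It treats every band as a plain list of pixel responses and computes the pairwise Euclidean distance and the Gaussian kernel similarity $R_G$ quoted above. The helper names (`euclidean`, `gaussian_similarity`, `pairwise`), the toy bands, and the value of `sigma` are assumptions made for illustration; they are not the exact equations (1) and (2) of the paper.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equally long band vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian kernel R_G(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return math.exp(-euclidean(x, y) ** 2 / (2.0 * sigma ** 2))

def pairwise(bands, fn):
    """Build the L x L matrix whose entry (i, j) is fn(bands[i], bands[j])."""
    L = len(bands)
    return [[fn(bands[i], bands[j]) for j in range(L)] for i in range(L)]

# Toy data (hypothetical): three bands, each observed on two pixels.
bands = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.1]]
D = pairwise(bands, euclidean)  # interband distances
S = pairwise(bands, lambda x, y: gaussian_similarity(x, y, sigma=5.0))
```

In this toy example, bands 1 and 2 are almost identical, so their similarity is close to 1 and their distance is close to 0, which is the redundancy the paragraph refers to.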
The local density $\rho_i$ is closely related to the cutoff distance $d_c$. A convenient and intuitive way to obtain $\rho_i$ is to let an indicator function $\chi$ count the nodes whose Euclidean distance from $b_i$ is less than $d_c$. However, this approach does not distinguish the contribution of distance to density, since $\rho_i$ increases by one whenever $d_{ij} < d_c$. Hence, we also adopt the Gaussian kernel function to overcome this limitation.
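The two ways of obtaining the local density can be sketched as follows. The code uses the forms commonly given for density peaks clustering: a hard count of neighbors closer than $d_c$, and a Gaussian-weighted sum $\sum_{j \neq i} \exp(-(d_{ij}/d_c)^2)$. Whether the paper uses exactly these expressions is an assumption, and the function names and the toy distance matrix are hypothetical.

```python
import math

def density_cutoff(D, d_c):
    """Hard-threshold density: number of other nodes with d_ij < d_c."""
    return [sum(1 for j, d in enumerate(row) if j != i and d < d_c)
            for i, row in enumerate(D)]

def density_gaussian(D, d_c):
    """Soft density: nearer nodes contribute more than distant ones."""
    return [sum(math.exp(-(d / d_c) ** 2) for j, d in enumerate(row) if j != i)
            for i, row in enumerate(D)]

# Toy distance matrix (hypothetical): three nodes at positions 0, 1, and 5.
D_toy = [[0, 1, 5],
         [1, 0, 4],
         [5, 4, 0]]
rho_hard = density_cutoff(D_toy, d_c=2.0)
rho_soft = density_gaussian(D_toy, d_c=2.0)
```

The hard count cannot separate nodes 0 and 1, because both have exactly one neighbor within $d_c$, whereas the Gaussian form ranks node 1 slightly higher because its far neighbor is closer.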
$d_c$ is the only parameter provided for human-machine interaction, and experience shows that the algorithm performs well when $d_c$ is set to 1%-2% of all interband distances sorted in descending order. An inappropriate $d_c$ may cause high overlap between clusters or produce a large number of meaningless clusters. Since FDPC is very sensitive to $d_c$, it should be set precisely through a principled method, e.g., PSO [33] or ADPclust [34], rather than by relying on empirical values. In Section 4.2, we determine $d_c$ according to the number of required bands. Next, we define the relative distance $\delta_i$. When $b_i$ has the maximum local density, $\delta_i$ is the distance between $b_i$ and the node farthest from it. More generally, when there are nodes with higher density than $b_i$, the distance between $b_i$ and the nearest of them is taken as $\delta_i$. After $\rho$ and $\delta$ of all nodes are obtained, we establish the decision graph to describe them. In Figure 1(b), the nodes that can act as cluster centers usually appear as outliers; for example, node 1 has the largest projection values on both axes, which means that it has both the highest local density and sufficient intercluster spacing.
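The definition of the relative distance translates into a short routine. This is a sketch under stated assumptions: `relative_distance` is a hypothetical helper name, the inputs are a distance matrix and a list of densities, and nodes with equal density are not treated as "higher", which is one possible tie-breaking choice.

```python
def relative_distance(D, rho):
    """delta_i: distance to the nearest node of strictly higher density.

    For the node with the highest density, no such node exists, so
    delta_i is the distance to the node farthest from it.
    """
    n = len(rho)
    delta = []
    for i in range(n):
        higher = [D[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(D[i]))
    return delta

# Toy example (hypothetical): three nodes at positions 0, 1, and 5.
D_toy = [[0, 1, 5],
         [1, 0, 4],
         [5, 4, 0]]
rho_toy = [1.2, 1.5, 0.3]
delta_toy = relative_distance(D_toy, rho_toy)
```

Node 1 has the highest density and therefore receives the largest possible value, while the isolated node 2 also receives a large $\delta$ despite its low density, which matches the description of potential noise nodes near the vertical axis of the decision graph.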

Computational Intelligence and Neuroscience
However, most of the remaining nodes are concentrated near the bottom of the graph with small $\delta$ (region B), which indicates that they are grouped around the high-density nodes and are less able to act as independent centers. In addition, FDPC has a strong noise detection capability, which helps us eliminate interference bands before BS. In Figure 1(b), the nodes near the vertical axis are probably noise, e.g., nodes 27 and 28.
For each $b_i$, we first use the product $c_i = \rho_i \times \delta_i$ to integrate density and distance and then sort $c$ in descending order to get a priority sequence $c_1 > c_2 > \cdots > c_m > c_{m+1} > \cdots > c_L$. On this basis, we select the $m$ top-ranked bands, i.e., $b_{\mathrm{spt}(c_1)}, b_{\mathrm{spt}(c_2)}, \ldots, b_{\mathrm{spt}(c_m)}$, to form a candidate set $U$, where $\mathrm{spt}(c_i)$ is the subscript of the band corresponding to $c_i$ and $m$ is the number of required bands. The outputs of FDPC are the exemplars of each cluster, so the vital information is maintained. However, some bands that can provide more information for the classifier, such as boundary ones, are probably not picked out because FDPC is more likely to select in high-density regions than in low-density ones. Hence, based on the candidate set, we must conduct further analysis from the perspective of spectral information.
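The ranking step above can be sketched in a few lines. The function name `candidate_set` and the toy values of $\rho$ and $\delta$ are assumptions for illustration; the returned indices play the role of $\mathrm{spt}(c_1), \ldots, \mathrm{spt}(c_m)$.

```python
def candidate_set(rho, delta, m):
    """Return the indices of the m bands with the largest c_i = rho_i * delta_i."""
    c = [r * d for r, d in zip(rho, delta)]
    # Sort band indices by c in descending order; ties keep their original order.
    order = sorted(range(len(c)), key=lambda i: c[i], reverse=True)
    return order[:m]

# Toy values (hypothetical) for four bands.
rho_toy = [0.9, 1.0, 0.2, 0.6]
delta_toy = [0.1, 1.0, 0.8, 0.5]
U = candidate_set(rho_toy, delta_toy, m=2)
```

Band 0 has a high density but a small relative distance, so its product is small and it is not selected, even though a ranking by density alone would place it second.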

Layer 2 Selection for Target Set.
In this paper, we take the amount of information (AoI) and the band correlation as the foundation and integrate them to evaluate information comprehensively. As stated in the last section, the bands selected by FDPC are already representative. However, this selection is one-sided because it considers only spatial position, which implies that bands with little, similar, or overlapping information may still exist in the candidate set.

Comprehensive Measurement of Information.
Shannon entropy is a common index for measuring event uncertainty, and researchers usually employ it to quantify the AoI contained in a band. It is generally believed that an event with large entropy corresponds to strong uncertainty, which means that more information can be provided for judgement. Assuming that the band $b_i$ takes different values with various probabilities, its Shannon entropy is defined in equation (6). The standard deviation is another way to measure AoI, and it reflects the uncertainty through the difference between a set of data and its mean value, as defined in equation (7), where $\bar{b}_i$ is the mean value of $b_{ik}$. Apparently, a greater $\mu_i$ corresponds to a larger AoI. It is improper to consider the AoI of a single band alone and ignore the relationships among bands, because high information correlation between adjacent bands is also very common, just like spatial redundancy.
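Both AoI measures can be computed directly from the pixel values of a band. The sketch below makes several assumptions that the text does not fix: the entropy uses base-2 logarithms over the empirical frequencies of discrete values (real-valued bands would need to be quantized first), and the standard deviation uses the population form with divisor $N$. The function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(band):
    """Entropy (in bits) of the empirical value distribution of one band."""
    n = len(band)
    counts = Counter(band)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def std_dev(band):
    """Population standard deviation of the pixel values of one band."""
    n = len(band)
    mean = sum(band) / n
    return math.sqrt(sum((v - mean) ** 2 for v in band) / n)

# Toy bands (hypothetical): a constant band carries no information.
flat = [7, 7, 7, 7]
varied = [1, 1, 2, 2]
```

For the constant band both measures are zero, while the band with two equally likely values has an entropy of one bit and a standard deviation of 0.5.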
Hence, we put forward CMI to reevaluate the information of a band by taking both AoI and information redundancy into account.
The correlation between $b_i$ and its $h$-neighbors is measured by the average similarity $\varphi_i$, where $\varphi_{i,j} \in (0, 1)$ is the correlation coefficient between neighboring bands and $h$ is an even number. For example, when $h = 2$, we judge the information independence of $b_i$ via the average similarity on the pairs $(b_{i-1}, b_i)$ and $(b_i, b_{i+1})$, which is called the nearest neighbor metric. In general, we compute $\varphi_i$ over a wider neighborhood by appropriately increasing $h$ because we cannot guarantee that $\varphi_{i,i+1}$ is always greater than $\varphi_{i,i+2}$. However, the probability of information redundancy between bands whose indices are far apart is very small, so an excessive $h$ may cause meaningless computation. According to equation (8), if $b_i$ is what we are looking for, it should have either a large $\mu_i$ or a small $\varphi_i$ or both. Hence, CMI prevents bands with high information redundancy from being selected.
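The average similarity over the $h$-neighbors can be sketched as below. Several details are assumptions, since the text does not specify them: the correlation coefficient is taken to be the absolute Pearson correlation, the neighborhood is the $h/2$ bands on each side of $b_i$, and neighbors that fall outside the spectrum are simply skipped. The way $\mu_i$ and $\varphi_i$ are combined in equation (8) is not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long vectors (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_similarity(bands, i, h=2):
    """Mean absolute correlation between band i and its h nearest neighbors."""
    half = h // 2
    idx = [j for j in range(i - half, i + half + 1)
           if j != i and 0 <= j < len(bands)]
    return sum(abs(pearson(bands[i], bands[j])) for j in idx) / len(idx)

# Toy spectrum (hypothetical): four bands observed on four pixels.
bands_toy = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
phi_1 = average_similarity(bands_toy, 1, h=2)  # neighbors are bands 0 and 2
phi_3 = average_similarity(bands_toy, 3, h=2)  # only band 2 is in range
```

Band 1 is a scaled copy of band 0 and a mirrored copy of band 2, so its average similarity is 1 and it would be a poor choice, whereas band 3 is less correlated with its neighbor.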

Further Selection Using mRMR.
The abbreviation mRMR denotes maximum representativeness and minimum redundancy. During the implementation of FDPC, $\rho$ has a larger weight because of its measurement scale, so the targets are most probably generated in the high-density regions. In Figure 1(b), if we want to select more bands from region B, the results must be the neighbors of node 1 rather than any other node. Clearly, FDPC guarantees the representativeness of the candidate set but tends to cause redundancy, so we apply the mRMR strategy to its outputs as a further filter. Let $U = \{b_1, b_2, \ldots, b_m\}$ be the candidate set, and let the target set and residual set be denoted as $U_t$ and $U_r$, respectively, with $U = U_t \cup U_r$. Supposing that $k$ ($k \geq 1$) bands already exist in $U_t$, if the $(k+1)$th band is required from $U_r$ to enrich $U_t$, the best one should satisfy the following conditions. (1) Lowest correlation with $U_t$, that is, its average distance to the elements of $U_t$ is the largest. (2) Highest similarity with $U_r$, which indicates that it has the most power to represent the other bands in $U_r$. According to formula (9), we select the most appropriate band. Motivated by the above descriptions, and taking AoI and information correlation into account simultaneously as in formula (10), we get a target set with strong representativeness, good discrimination, and low redundancy to achieve dimensionality reduction for HSI.
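One greedy step under the two conditions above can be sketched as follows. The scoring rule is an assumption made for illustration: it simply adds the average distance to $U_t$ and the average similarity to the rest of $U_r$ with equal weight, which is not necessarily formula (9). The function name `mrmr_pick` and the toy matrices are hypothetical.

```python
def mrmr_pick(D, S, target, residual):
    """Pick the residual band that is far from U_t and similar to the rest of U_r.

    D[i][j] is an interband distance and S[i][j] an interband similarity.
    The unweighted sum of the two averages is an illustrative assumption.
    """
    def score(b):
        far = sum(D[b][t] for t in target) / len(target)
        others = [r for r in residual if r != b]
        rep = sum(S[b][r] for r in others) / len(others) if others else 0.0
        return far + rep
    return max(residual, key=score)

# Toy matrices (hypothetical) for four bands; band 0 is already in U_t.
D_toy = [[0.0, 0.1, 0.9, 0.8],
         [0.1, 0.0, 0.85, 0.75],
         [0.9, 0.85, 0.0, 0.2],
         [0.8, 0.75, 0.2, 0.0]]
S_toy = [[1.0, 0.9, 0.1, 0.2],
         [0.9, 1.0, 0.2, 0.2],
         [0.1, 0.2, 1.0, 0.9],
         [0.2, 0.2, 0.9, 1.0]]
best = mrmr_pick(D_toy, S_toy, target=[0], residual=[1, 2, 3])
```

Band 1 is almost a duplicate of the already selected band 0 and scores lowest, while band 2 is both far from band 0 and representative of band 3, so it is chosen.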

Implementation Flow.
TLS integrates spatial position, the information contained in a single band, and the correlation between bands to evaluate the importance of a band, so the comprehensiveness of its outputs makes it suitable for spectral dimensionality reduction.
In this section, we explain how TLS works. As the preliminary BS (layer 1 selection), FDPC first prioritizes the bands and selects the top-ranked ones to establish $U$. In layer 2, we perform the initialization $U_t = \{b_{\mathrm{spt}(c_1)}\}$, $U_r = \{b_j \mid j = \mathrm{spt}(c_2), \mathrm{spt}(c_3), \ldots, \mathrm{spt}(c_m)\}$, and score each band in $U$ using the CMI index. In the current round, the most valuable band $b_p$ satisfying formula (10) is picked out to join $U_t$, and the bands whose information is close to that of $b_p$ are removed from $U_r$.
The procedure iterates until $U_r = \emptyset$, at which point the informative, low-redundancy band combination is built.
The technology roadmap of TLS is given in Figure 2. TLS filters out bands with redundant information via a threshold $\lambda$. According to equation (8), $b_p$ becomes the winner of a selection round only when it has both rich AoI and strong information independence. For each $b_q \in U_r$, $p \neq q$, if $\varphi_{p,q} > \lambda$, we take $b_q$ out of $U_r$ owing to its high correlation. We state the implementation flow of TLS in Algorithm 1 and give related explanations and analyses in Sections 4.2 and 4.3.
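The layer 2 loop described in this section can be sketched as follows. It is not a reproduction of Algorithm 1: the function name `layer2_select`, the callback `score` (standing in for formula (10)), and the toy inputs are assumptions. Only the control flow follows the text: start $U_t$ with the top-ranked band, repeatedly move the best-scoring band from $U_r$ to $U_t$, and drop every remaining band whose correlation with the winner exceeds $\lambda$.

```python
def layer2_select(candidates, score, phi, lam):
    """Greedy layer 2 selection with correlation-based pruning.

    candidates: band indices in FDPC priority order (best first).
    score(b, target, residual): value of band b in the current round.
    phi[p][q]: correlation between bands p and q.
    lam: threshold above which a band counts as redundant.
    """
    target = [candidates[0]]
    residual = list(candidates[1:])
    while residual:
        p = max(residual, key=lambda b: score(b, target, residual))
        target.append(p)
        # Remove the winner itself and every band too similar to it.
        residual = [q for q in residual if q != p and phi[p][q] <= lam]
    return target

# Toy inputs (hypothetical): four candidate bands with fixed scores.
cmi_toy = {2: 0.4, 7: 0.9, 9: 0.8}
phi_toy = {2: {7: 0.1, 9: 0.3}, 7: {2: 0.1, 9: 0.95}, 9: {2: 0.3, 7: 0.95}}
selected = layer2_select([5, 2, 7, 9],
                         lambda b, target, residual: cmi_toy[b],
                         phi_toy, lam=0.9)
```

In this example band 7 wins the first round, band 9 is discarded because its correlation with band 7 exceeds the threshold, and band 2 is selected last, so the target set can end up smaller than the candidate set.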

Normalization and Parameter Initialization.
As mentioned previously, because the two quantities are measured on different scales, $\rho$ has a heavy impact on prioritization, and bands with high density are more attractive to FDPC. As a direct improvement, both $\rho$ and $\delta$ are normalized to the interval (0, 1).
In addition, we also reduce the influence of $\rho$ by adjusting $d_c$ dynamically so that the probability of selecting in the low-density regions increases gradually. An improper $d_c$ may lead to algorithm failure, or even to a domino effect that the algorithm cannot correct by itself. For the sake of simplicity, the empirical method sets $d_c$ to a fixed size; however, this is inefficient when dealing with high-dimensional data or fake peaks.
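The normalization step can be illustrated with a minimal sketch. Min-max scaling is an assumption here, since the text only states that both quantities are mapped to a common interval; note that min-max scaling produces the closed interval [0, 1]. The function name `min_max` and the toy values are hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy values (hypothetical): rho is on a much larger scale than delta.
rho_raw = [20.0, 40.0, 60.0]
delta_raw = [0.3, 0.1, 0.2]
rho_n = min_max(rho_raw)
delta_n = min_max(delta_raw)
```

After rescaling, the two quantities span the same range, so the product of density and distance is no longer dominated by whichever quantity happens to have the larger numerical scale.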
To keep the density value relatively accurate, a band should appear in different neighborhoods as rarely as possible. Hence, we deem that $d_c$ should not be fixed but should change dynamically with $m$, as shown in equation (11), where $d_{c-0}$ is the initial value of the cutoff distance. As $m$ increases, $d_c$ keeps getting smaller, and the situation in which a band belongs to different density neighborhoods gradually disappears. Usually, $m < L/2$, and $d_{c-0}$ is multiplied by a coefficient $\alpha$ to determine $d_c$. In the extreme case where each node corresponds to a cluster, i.e., $m = L$, we get $d_c = 0$. Therefore, in the field of dimensionality reduction for HSI, the time complexity of TLS is $O(N \times (L^2 + m))$, which affects the real-time performance when dealing with high-resolution images.

Besides its high time complexity, TLS also needs $m$ to be initialized in advance because it cannot automatically configure the number of clusters according to the data distribution. In layer 1, the absence of peaks or the presence of fake peaks invalidates the proposed algorithm, because the hypothesis that makes FDPC work does not hold. In addition, the outputs of mRMR are not back-traceable, which implies that a band cannot be removed once it has been selected into the target set.
In conclusion, the distinct advantage of TLS is that it finds a more effective band combination than the other algorithms when the same $m$ is used. Clearly, TLS not only inherits the characteristics of FDPC, such as good exemplar selection, noise insensitivity, and no need to initialize cluster centers, but also makes information an important reference by using CMI.

Experiment and Discussion
In this section, a series of comparative experiments is designed and implemented on three HSI data sets (Indian Pines, PaviaU, and Salinas). The essential information about them is briefly described in Table 1.
As widely used data sets, they share some common characteristics. First of all, pixels of land cover that belong to the same class have similar features, whereas the spectra corresponding to distinct classes are obviously different, which is very suitable for BS by clustering methods. Secondly, the distribution of pixels among classes is inhomogeneous, and most pixels of HSI are even concentrated in a few classes, as shown in Figure 3. Finally, some contaminated bands have been removed to ensure the validity of the data; for example, the 20 bands of Indian Pines disturbed by external circumstances, which are numbered 104-108, 150-163, and 220, are cleared beforehand.
We train KNN (K = 5) and SVM (RBF kernel function) models with labeled samples in advance, so that the classifiers gain stronger generalization ability from sufficient training experience. Because an individual result is uncertain, we take the average of 10 rounds of cross-validation as the final result to make the outputs of the algorithms more reliable and convincing. In Indian Pines/PaviaU/Salinas, 30%/10%/10% of the pixels in every class are used for training the classifiers and 10%/5%/5% for testing in each round.
For a specific data set, we set the ranges of parameters and thresholds for testing, and the values corresponding to the best results are adopted. The detailed settings are as follows:

Performance Indicators.
OA, AA, and the Kappa coefficient are commonly used indicators for evaluating classification results based on the confusion matrix. OA is the ratio of the number of correctly classified pixels to the total number of pixels; however, it cannot show the real situation when the difference in class size is relatively large. As a more reasonable metric, AA is the mean of the recognition accuracies of the individual classes. The Kappa coefficient is usually employed as a consistency check, and in general, a larger Kappa coefficient means that the prediction results are more consistent with the ground truth. Specifically, the agreement is substantial when 0.8 > Kappa > 0.6, while Kappa ≥ 0.8 corresponds to almost perfect agreement.
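The three indicators can be computed from a confusion matrix as in the following sketch. It uses the standard definitions (OA as the trace divided by the total, AA as the mean per-class recall, and Cohen's Kappa with chance agreement estimated from the row and column totals); the function name `classification_metrics` and the two toy matrices are assumptions for illustration.

```python
def classification_metrics(cm):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    cm[i][j] is the number of class-i samples predicted as class j.
    Every class is assumed to contain at least one sample.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    oa = sum(cm[i][i] for i in range(k)) / n
    aa = sum(cm[i][i] / row_tot[i] for i in range(k)) / k
    # Expected agreement if predictions were independent of the true labels.
    pe = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

balanced = [[45, 5], [10, 40]]   # two classes of 50 samples each
imbalanced = [[90, 0], [8, 2]]   # a large class and a small class
oa_b, aa_b, kappa_b = classification_metrics(balanced)
oa_i, aa_i, kappa_i = classification_metrics(imbalanced)
```

For the balanced matrix OA and AA coincide at 0.85, whereas for the imbalanced matrix OA is 0.92 while AA is only 0.6, which illustrates why OA alone can hide poor accuracy on small classes.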

Distribution of Algorithm Outputs.
As mentioned in Section 5.1.2, seven algorithms are adopted to select bands, representing the original image with reduced spectral dimensions. For example, the results of 10 bands selected in Indian Pines are shown in Figure 4, from which we can observe the band distribution and redundancy intuitively.
Obviously, the redundancy produced by MVPCA is the highest among all the employed algorithms, and most of its selected bands are concentrated in the interval [120, 140]. As stated in Section 2, the ranking-based algorithm can find critical bands efficiently by prioritizing, but it probably results in high redundancy and low discrimination because the correlation between bands is neglected. There are no significant differences in the performance of the remaining algorithms although the ideas they adopt are not exactly the same. The selection results are not uniformly distributed over the entire band interval, and concentration may appear in some local intervals. In Figure 4, different algorithms conduct selections in the same intervals, which means that an interval with remarkable characteristics is similarly attractive to the various algorithms, but the specific output within an interval may differ. Nevertheless, the global dispersion of the bands selected by TLS is still better than that of some competitors because layer 2 plays an effective role. In the illustration, the selected bands concentrate in four and three intervals for FDPC and DBSCAN, respectively, whereas they concentrate in just two for WaLuDi, LP, ISSC, and TLS.

Accuracy and Consistency Check
Comparison of Accuracy Index. From Figures 5-7, we first summarize the common characteristics. No matter which algorithm or data set is employed, OA always improves as $m$ increases. Nevertheless, the contribution of each additional band to the classifier decreases gradually, which implies that excessive selections have little significance for the evolution of the classifier parameters. In Figure 5(b), the OA of each algorithm except MVPCA improves by about 20% when $m$ increases from 6 to 30; however, if we raise the number to 36 or 42, the OAs remain at the same level and the classification capability does not improve noticeably.
Moreover, the effects of the reduced sets formed by different algorithms on image recognition are unstable and depend on both the classifier model and the data set. Intuitively, the OA of the SVM model is higher than that of the KNN model. From the view of the model, SVM seeks a hyperplane that maximizes the margin between two classes by learning from the support vectors, and it has good generalization power as well as strong noise resistance. In contrast, KNN uses nearest-neighbor voting to determine the class of a sample, and its accuracy is slightly lower than that of SVM. On the other hand, the capability of the same algorithm may vary across data sets, and for each algorithm, the OAs in PaviaU and Salinas are superior to those in Indian Pines. Evidently, there are several small-scale classes in Indian Pines, three of which have fewer than 50 samples (Figure 3(a)), and the samples in these classes have high probabilities of misclassification, which lowers OA.
According to the number of available bands $L$, we take $m = (1/5)L$ as the required number (40 bands from Indian Pines and Salinas and 20 bands from PaviaU), and the accuracy of each individual class, AA, and OA of each algorithm are shown in Tables 2-4. In each table, we notice the following facts. Firstly, whatever algorithm is employed, the recognition accuracies on most small-scale classes are relatively low (such as grass-pasture-mowed, oats, and buildings-grass-trees-drives in Table 2, gravel in Table 3, and lettuce-romaine-6wk in Table 4), but there are exceptions (such as wheat in Table 2 and shadows in Table 3). This may be caused by overfitting: the classifier takes OA as the criterion and fits the samples of the training set as closely as possible. Therefore, when the classification model is applied to the test set, its generalization ability decreases. Although AA can make up for this defect, the main remedy is to reserve an appropriate number of high-quality bands to promote the discrimination ability of the classifier. Secondly, OA is larger than AA, and the difference is more prominent in Indian Pines. Weighting each class by its share of samples, as OA does, lets poorly recognized small classes affect the accuracy less than averaging the accuracies of all classes. Finally, some algorithms perform well only on specific classes (such as MVPCA on Alfalfa in Table 2 and DBSCAN on self-blocking bricks in Table 3), which means that the effect of an algorithm relates not only to class scale but also to how well it matches the data distribution. Similarly, TLS is not superior to its competitors on some classes, such as Alfalfa and meadows, although it is the best on the entire data set.
Judged by AA and OA, TLS shows its superiority, and the performance of ISSC is closest to it. WaLuDi, DBSCAN, and FDPC have similar capabilities in BS, whereas MVPCA and LP are relatively poor. The stability of an algorithm can be shown via the variance, and a small variance corresponds to low volatility. In Tables 3 and 4, the recognition accuracy of TLS on each class deviates least from the mean value, whereas its stability is second only to that of ISSC in Table 2.
In particular, if a quick, rough recognition of HSI is wanted, TLS can pick out a few high-quality bands to accelerate the training of the classifier. For example, in Figure 6(a), the accuracy of TLS exceeds 70% when the SVM model is trained with only 6 bands, which is about 5% higher than that of WaLuDi, LP, and FDPC. However, the advantage of TLS gradually weakens as more bands are appended, and the OAs of the various algorithms are quite close once $m$ reaches a certain value.
Comparison of Consistency Index. In Table 5, the Kappa coefficients of the various algorithms with different m all lie within the interval (0.7, 0.95), indicating that the classification results are highly consistent with the actual values even though only a subset of the bands is used to represent the entire set. Similarly, the Kappa coefficient also increases with the number of selected bands, while its rate of increase slows down step by step. Moreover, regarding the classification model and data set, SVM proves more suitable for these data sets, and we obtain higher Kappa coefficients on PaviaU and Salinas owing to their relatively balanced pixel distributions compared with Indian Pines. From the aspect of the algorithm, TLS has stronger discrimination and can help the classifier make more accurate judgements.
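For reference, the Kappa coefficient can be computed from a confusion matrix as in the minimal sketch below. It assumes the standard Cohen's kappa definition (observed agreement corrected by chance agreement); the function name and the toy matrix are hypothetical and are not taken from Table 5.

```python
def kappa_coefficient(confusion):
    """Cohen's kappa for a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of matching row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up two-class example: 85 of 100 samples are classified correctly.
toy_confusion = [
    [45, 5],
    [10, 40],
]
kappa = kappa_coefficient(toy_confusion)
print(f"Kappa = {kappa:.2f}")
```

Because the chance-agreement term grows when a few classes dominate the marginals, Kappa is a stricter consistency indicator than OA on imbalanced data.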

Execution Time.
In Section 4.3, we analyzed the time complexity of TLS and pointed out that the algorithm has no advantage in execution speed. The running time of the algorithm depends mainly on the hardware configuration; however, the data set and the number of selected bands also affect it. In this paper, the experiments run on a Windows 10 computer with an Intel i5 quad-core processor and 8 GB of random-access memory. The corresponding execution times of the seven algorithms under different conditions are shown in Table 6.
According to the setting in Section 5.1.2, the number of samples used for the experiments varies across data sets, which is clearly reflected in the time consumed, so the execution time of all algorithms is longest on PaviaU. Besides, the speed of an algorithm is greatly affected by its execution mode, and sorting-based or noniterative methods generally cost less time than iterative ones. Hence, MVPCA takes the least time, followed by ISSC, FDPC, and DBSCAN in that order. TLS is faster than LP, and WaLuDi takes the longest.

Conclusion
In this paper, we propose a two layers selection (TLS) algorithm to establish a dimensionality-reduced band set for HSI. On the premise of keeping the basic features of the spectrum, the bands with strong discrimination, low redundancy, and high information are picked out to complete the image reconstitution, and TLS achieves this goal through two phases. First, we employ the FDPC algorithm to rank all nodes in the all-bands set by the product of their local density and relative distance, which builds a priority sequence, and the top-ranked bands are collected into the candidate set. Because local density has a great influence on the FDPC outputs, we use normalization and a dynamic cutoff distance so that bands are picked from scattered low-density regions as much as possible. After obtaining CMI, mRMR is adopted to iteratively move the candidate bands that meet the given requirements into the target set. To verify the effectiveness of TLS, six state-of-the-art algorithms are used as competitors in experiments on three remote sensing image data sets. The comparative results in terms of OA, AA, and Kappa coefficient show that the band combination created by TLS performs best among the compared methods. In particular, if we want a classification model to achieve higher accuracy with less training cost, TLS provides an effective way to cut down the dimensions of the samples. Besides HSI processing, it also fits applications where the samples have two or more types of features, so that hierarchical selection can be implemented.
Although much work has been done to improve the capability of BS methods, many technical obstacles remain to be overcome. Future theoretical research will mainly focus on how to cut down the complexity of the algorithms and improve their accuracy and robustness. Meanwhile, enhancing the adaptability to large-scale and high-dimensional data environments is also a direction of our future work.

Data Availability
The data used to support the findings of this study are included within the paper.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.