Bayesian Information Criterion Based Feature Filtering for the Fusion of Multiple Features in High-Spatial-Resolution Satellite Scene Classification

. This paper presents a novel classification method for high-spatial-resolution satellite scene classification introducing Bayesian information criterion (BIC)-based feature filtering process to further eliminate opaque and redundant information between multiple features. Firstly, two diverse and complementary feature descriptors are extracted to characterize the satellite scene. Then, sparsecanonicalcorrelationanalysis(SCCA)withpenaltyfunctionisemployedtofusetheextractedfeaturedescriptorsandremove theambiguitiesandredundanciesbetweenthemsimultaneously.Afterthat,atwo-phaseBayesianinformationcriterion(BIC)- basedfeaturefilteringprocessisdesignedtofurtherfilteroutredundantinformation.Inthefirstphase,wegraduallyimposea constraintviaaniterativeprocesstosetaconstraintontheloadingsforavertingsparsecorrelationdescendingbelowtoalower confidencelimitoftheapproximatedcanonicalcorrelation.Inthesecondphase,Bayesianinformationcriterion(BIC)isutilized toconductthefeaturefilteringwhichsetsthesmallestloadinginabsolutevaluetozeroineachiterationforallfeatures.Lastly, asupportvectormachinewithpyramidmatchkernelisappliedtoobtainthefinalresult.Experimentalresultsonhigh-spatial-resolutionsatellitescenesdemonstratethatthesuggestedapproachachievessatisfactoryperformanceinclassificationaccuracy.


Introduction
Scene classification has aroused more and more attention in remote sensing domain.For satellite imagery of high-spatialresolution, it evokes a great deal of challenging problems in scene classification due to high intraclass variability, low interclass disparity, and other external factors such as changes of viewpoint, illuminations and shadows, background clutter, partial occlusions, and multiple instances.In addition, with considerable increase of the spatial resolution of images, details of the targets become clearer, and a host of cues also become more distinctive, such as structure and colour.As a consequence, it is of great importance to appropriately combine and fuse them in various respects.In the past decade or so, many researchers and practitioners have made great efforts in exploiting different sources of information in the high-spatial-resolution satellite imagery to enhance classification performance [1][2][3][4][5].
Unlike the case of low-spatial-resolution satellite images, where solitary type of feature descriptor has been proved to be effective and efficient for classification [6,7], it is universally recognized that instead of adopting a solitary type of feature, it is more favorable to fuse and combine a set of diverse and complementary features such as features based on structure and colour information [8,9].Hence, how to fuse these pieces of diverse and complementary information and eliminate the ambiguities and redundancies between them becomes a critical problem.One widely praised approach is featurelevel fusion, in which features from different channels are fused, developing a new pattern for scene classification.Many approaches have been reported in the literature [10][11][12].

Journal of Sensors
The canonical correlation analysis (CCA) [13] method has been spotlighted, especially in feature fusion realm [14,15], due to its capability in expressing inherent correlation between two sets of features.In order to extract canonical correlation features from two groups of features, the CCA method firstly constructs a correlation criterion function by extracting two diverse features from the identical samples.Then the CCA method creates efficient and effective discriminant features for classification.
However, when the dimensions of features are too large, such as in case of high-spatial-resolution satellite scene classification researches, traditional CCA methods are no longer proper.Besides, when the features are extracted from the identical image, sample covariance matrices turn into being undefined or unstable, inducing extra difficulty in parameter estimation.As a consequence, a dimension reduction approach is imperative for CCA to tackle this problem.In the past decade, abundant approaches have been proposed for feature shrinkage and selection, containing the nonnegative garrote by Breiman [16], least absolute shrinkage and selection operator (Lasso) by Tibshirani [17], smoothly clipped absolute deviation (SCAD) by Fan and Li [18], and Elastic-net by Zou and Hastie [19].Recently, these approaches have been employed to CCA for relationship assessment between two sets of high-dimensional remote sensing data.The feature selection strategies are utilized to the canonical feature loadings which set some of the coefficients to exact zeros for selecting the remaining features.The crucial features selected are then called the sparse set of features, and the canonical correlation analysis exploiting these features is often known as sparse canonical correlation analysis (SCCA).Initially, Waaijenborg et al. [20] suggested a penalized form of CCA adopting an iterative regression process with the Univariate Soft Threshold (UST) form of the Elastic-net penalty.Subsequently, Parkhomenko et al. [21] suggested the SCCA approach utilizing a form of regularization resembling UST Elastic-net [19].Witten and Tibshirani [22] incorporated the Lasso penalty in their SCCA.These approaches are interesting.Yet, they do not control the sparsity in a direct way.At the same time, it has been revealed that sundry penalized likelihood approaches have the oracle property under certain conditions [18,[23][24][25][26].However, in practical application, without an appropriate model of feature selection, the oracle property cannot be realized.Accordingly, the approaches may not necessarily generate sparse set of features.To address the issue, this paper suggests a twophase process in implementing SCCA in which L1 penalty is utilized on the feature loadings during the first phase, and then a Bayesian Information Criterion (BIC) based feature filtering algorithm is conducted to further remove redundant and noisy information.To be specific, in the first phase, we gradually impose a constraint via an iterative process to set a constraint on the loadings for averting sparse correlation descending below to a lower confidence limit of the approximated canonical correlation.In the second phase, Bayesian information criterion (BIC) is utilized to conduct the feature filtering which sets the smallest loading in absolute value to zero in each iteration for all features.
The rest of this paper is organized as follows.The feature extraction process is presented in Section 2. Section 3 offers a concrete and detailed depiction of the methodology exploited in this paper.Section 4 displays experimental results and gives a performance assessment.Finally, Section 5 summarizes the work and points the directions for the future work.

Multiset Feature Extraction
2.1.Scale-Invariant Feature Transform (SIFT) Descriptor.In the detected region, SIFT descriptor extracts a gradient orientation histogram [27].The gradient image is sampled over a 4 × 4 grid in each of eight orientation planes; thus resulting descriptor is of dimension 128.A weight to the magnitude of each sample point is gained by implementing a Gaussian window function.This puts more highlights on the gradients that are near the center of the region and renders the descriptor less sensitive to the small changes in the position of the detected region.The gradient magnitude is used to weigh the contribution to the orientation and location bins.The descriptor is immune to small errors in the region detection and small geometric distortions, largely due to the quantization of orientations and gradient locations.The square root of the sum of squared components is calculated to normalize the descriptor for acquiring illumination invariance.

Colour Histogram Descriptor.
Colour histogram descriptor is three separated histograms for the R, G, and B channels [28].A colour histogram represents the approximate distribution of the colours in an image, due to the fact that each histogram bin stands for a local colour range in the given colour space.Colour histograms are invariant to the translation and rotation of the image content; meanwhile they are unsophisticated to compute.In our experiments, the number of bins is 40, and the resulting descriptor is of dimension 120 through concatenating the three independent histograms.

Sparse Canonical Correlation Analysis (SCCA).
Canonical correlation analysis (CCA) is a multivariate statistical approach proposed to grope for the correlation between two sets of features [13].Suppose that two feature sets  =  1 ,  2 , . . .,   and  =  1 ,  2 , . . .,   are of dimensions  ×  and × ( ≤ ,  ≤ ), which extracted from the same image.Let the columns of  and  be standardized to have standard deviation 1 and mean 0, let  and V be  × 1 and  × 1 vector of weights, and let  =  and  = V be linear combinations of the features of data sets  and , respectively.Note that  and  are  × 1 vectors.Subsequently, (1) will be maximized to estimate coefficient vectors  and V: where    and    are within data covariance matrices and    is the between data covariance matrix.Equation ( 1) can be reformulated as follows for scaling  and V has very insignificant influence on the correlation coefficient: The aforementioned CCA approach is not applicable when the quantity of features is excessive.Latent multicollinearity between predictor features further complicates the computation for the covariance matrices can turn into undefined or unstable.As a consequence, some critical features should be selected by standard model selection criteria.In the subsequent process, the selected set of features is utilized to compute the canonical correlation for making the results understandable, which is called sparse canonical correlation analysis (SCCA).Theoretically, SCCA is implemented by maximizing the penalized objective function below: In order to tackle the multicollinearity problem, a host of approaches have been introduced.Vinod suggested incorporating penalty terms to the diagonal elements of the covariance matrix, which appears to resemble the ridge regression thought in regression analysis [29].This needs to estimate additional ridge parameters.Other regularization forms have been proposed where the variance matrices are substituted with their corresponding identity matrices [22] or diagonal matrices [21].In our work, the matrices    and    are substituted with their corresponding diagonal matrices.

Shrinkage Methods.
Penalized linear regression mechanisms have been widely applied to analyze high-dimensional data, and it has incorporated feature selection and shrinkage techniques.Assume that  is an  × 1 vector and  is an  ×  matrix.Then the estimation of penalized regression coefficient  can be yielded by using following penalized regression model: where   () is the penalty term and  is a tuning parameter which is estimated utilizing permutation approaches or cross validation (CV).As the first penalized regression approach, ridge regression was proposed to temper the multicollinearity among the predictors in which a quadratic penalty term is embedded to the regular least square estimating equations [30].Ridge regression implements a penalty on the coefficients to shrink them towards zero.Yet the shrunken coefficients are never equal to zero.As a consequence, ridge regression fails to conduct feature selection.The least absolute shrinkage and selection operator (Lasso), Elastic-net, and smoothly clipped absolute deviation (SCAD) are different from ridge regression which solve the multicollinearity problem (i.e., shrinkage) and set some of the coefficient to exact zero, creating sparse set of features (i.e., feature selection).
In this paper, we apply different penalty functions to SCCA utilizing the algorithm elaborated by Parkhomenko et al. [21].
The tuning parameters for all penalty functions are estimated through cross validation (CV).

Least Absolute Shrinkage and Selection Operator (Lasso)
Penalty.The least absolute shrinkage and selection operator (Lasso) penalty is a shrinkage approach and has the competence of selecting discriminant features by shrinking some coefficients and setting others to zero [31].The penalty term of Lasso is defined as follows: where  is a tuning parameter.The solution of Lasso is given as This is similar to the soft thresholding rule introduced by Donoho et al. [32] and Donoho and Johnstone [33], which was utilized to estimate wavelet coefficients.

Elastic-Net Penalty.
Elastic-net is a regularization mechanism that carries out continuous shrinkage and feature selection simultaneously [34].To be specific, this approach utilizes both  1 penalty of Lasso and  2 quadratic penalty of ridge regression to create a convex combination.Consequently, this approach preserves the abilities of feature selection and coefficients shrinkage.The definition of the Elasticnet penalty can be formulated as follows: Nevertheless, thanks to two tuning parameters needed to be estimated, the computational cost of Elastic-net is somewhat higher.
As a substitute to Elastic-net, Zou and Hastie proposed a predigested version of the Elastic-net called univariate soft thresholding (UST) [34], the solution of which is shown as In this paper, the Elastic-net based on the univariate soft threshold is adopted to implement feature loadings  and V as follows:

Smoothly Clipped Absolute Deviation (SCAD) Penalty.
Fan and Li proposed a nonconvex penalty function named smoothly clipped absolute deviation (SCAD) [18].They suggested three criteria for determining an excellent penalty function, namely, (i) sparsity, (ii) continuity, and (iii) unbiasedness.They made further efforts to claim that the SCAD penalty possesses these properties.The SCAD penalty is shown as follows: When value within the range of  and , SCAD penalty function coincides with a quadratic spline function.The function is continuous, and when  > 2 and  > 0 the first derivative can be formulated as follows: The SCAD penalty is continuously differentiable on (−∞, 0) ∪ (0, ∞), but singular at 0, with its derivatives zero outside the range [−, ].This penalty function sets small coefficients to zero, shrinks mid-size coefficients towards zero, and keeps large coefficients untouched.Consequently, the SCAD penalty generates almost unbiased coefficients and a sparse solution for large coefficients.The solution of SCAD penalty is shown as follows: This thresholding rule has two unknown parameters:  and .In an ideal situation, the optimal results (, ) can be acquired utilizing a scheme involving a two-dimensional grid-search with criteria resembling cross validation approaches.However, such an execution is computationally intensive.In the Bayesian perspective, Fan and Li advised  = 3.7 is a wise option for many issues [18].They have further highlighted that the performance of feature selection issues does not boost tremendously when data-driven approaches are adopted.In this paper, we set  to 3.7 and  was selected by cross validation.Meanwhile, the thresholding rule (12) was adopted to load vectors  and V.

Hard-Threshold Penalty.
Hard-thresholding directly sets several coefficients to zero [35,36].However, this penalty function does not tackle the issue of multicollinearity among the predictors, for it does not shrink any coefficients toward zero.Nevertheless, the results obtained by this penalty are unbiased estimators with large effects.The solution of the hard-thresholding rule is revealed as follows: 3.3.The Suggested BIC Based Feature Filtering Algorithm.The major drawback of the current SCCA approaches is that they do not control sparsity directly.Hence, it is hard to complete efficient and effective elimination of noisy and redundant information.There is a trade-off between the sparsity of the features and the maximum correlation.In this paper, we suggest a two-phase process to establish an equilibrium between the sparsity of the features and the maximum correlation.In the first phase, we gradually impose a constraint via an iterative process to set a constraint on the loadings for averting sparse correlation descending below to a lower confidence limit of the approximated canonical correlation.
In the second phase, Bayesian information criterion (BIC) is utilized to conduct the feature filtering which sets the smallest loading in absolute value to zero in each iteration for all features.
The proposed feature filtering process is iterative and simple.One more coefficient of β , at each iteration, is set to be 0 according to the magnitudes of the coefficients in absolute value.Let β and β be the constrained effective dimension reduction direction; the proposed feature filtering process is shown as follows.
(ii) Define a new direction β  () by maintaining the largest  coefficients of β in absolute value and assigning the other +− coefficients to 0. Searching β  () as projection of β  () into the space   , the set of all β  () should satisfy the following.
(1) The set of zero coefficients in β  () is the same as that in β  ().
After the above feature filtering process is implemented, we obtain a sequence of BIC() as  descends from  +  to 0. Let  0 be the integer at which BIC() is minimized.Then, the  +  −  0 smallest coefficients of β in absolute value are assigned to 0. This proposed feature filtering process is a streamlined feature selection process.In feature filtering, at most  +  possibilities are taken into account, which make it viable to conduct even when  +  are large.Lastly, the features corresponding to the minimum BIC value are the final selected features.

Support Vector Machine (SVM) with Pyramid Match Kernel (PMK-SVM) Classifier. Establishment of kernel-based
learning algorithms is based on the notion of mapping data into a Euclidean space and then discovering linear relations within the mapped data.Taking a typical issue as an example, the SVM unearths the optimal separating hyperplane between two classes in a feature space.The assistance provided by a kernel function is to map pairs of data points in an input space to their inner product in the feature space, thereby estimating the similarities between all points and deciding their relative positions.Linear relations are discovered in the feature space, although a decision boundary may still be nonlinear in the input space, depending on the method of a feature mapping function.The support vector machine (SVM) with pyramid match kernel (PMK-SVM) [37] furnishes an accurate and time-saving solution for classification and the pyramid match kernel function is formulated as follows: where ,  are the input sets,  = [log 2 ] in which  is a sphere of diameter, Ψ is the feature extraction function,   () is the th histogram in Ψ(), and Γ is a histogram intersection function which measures the overlap between two histograms' bins: where  and  are histograms with  bins, and  () denotes the count of the th bin of .Since in the construction of the pyramid phase Γ(  (),   ()) = min(||, ||), and In order to preserve generality and obtain promising and satisfactory classification results.Here, we employ support vector machine (SVM) with pyramid match kernel (PMK-SVM) as classifier.Under the multiclass circumstances, a set of binary classifiers and majority vote technique are utilized to perform multiclass categorization.Figure 1 illustrates our classification scheme based on the suggested two-phase BIC filtering process.

Experiments and Results
Experiments were performed on a high-spatial-resolution satellite image with a size of 4000 pixels × 4000 pixels, as presented in Figure 2. The view was captured by the GeoEye-1 satellite on 21 November 2009 at Majuqiao town, which locates in the southwest of Tongzhou, southeast of Beijing, where the latitudes and longitudes at the lower right and upper left corners are 39 ∘ 43  N, 116 ∘ 32  E and 39 ∘ 44  N, 116 ∘ 30  E respectively.The ground sampling distance is approximately 0.5 m and the band assignment is red for band 3, green for band 2, and blue for band 1.The image mainly contains eight-class satellite scenes: factories, roads, water, farm land, high buildings, low buildings, bare land, and green land.With associated geographic information, the reference data presented in Figure 6(a) was manually labeled.Moreover, some classes have not been taken into account, which can be neglected, such as clearance between the green land and its neighbours.Figure 3 exhibits one example of each class from the eight-class satellite scene.
In this paper, the -fold cross validation approach is used to optimize the penalty parameters for each canonical feature pair, where  in -fold cross validation is set to 10 by empirical.Specifically, the obtained feature data are split into two parts: 1/ proportion of feature data for validation (testing) and the remaining  − 1/ proportion of feature data for training.The loading vectors are yielded in the training procedure and used in the testing procedure.We maximize the correlation of testing feature data to select the sparseness parameters   and  V using cross validation approach.The weighted vectors  and V are obtained by the preset values of   and  V .Subsequently, the correlation is calculated by where  − and V − are the weight vectors in the training sets   and   , respectively, and  is the implementing times of cross validation.Here,   and   are the test sets.Lastly,   and  V values are determined as optimum sparseness parameters according to the maximum value of Δ cor .
To achieve reliable results and see the convergence selected data; the acquired results were coincident as well.The classifiers were not conspicuously biased within this particular image for the moderate selected proportion of training samples and hence classification results would not be extremely exaggerated.We divided training\testing data at random and reiterated the experiments ten times.Meanwhile, we kept an account of the average of each-class classification accuracies for every run.The mean and standard deviation of the results from each individual experiment were used to quantify the final result.
For the sake of boost in classification performance to some extent, these two features should be fused in an apt tactic.For comparison purposes, SCCA with different penalty functions (Lasso, Elastic-net, SCAD, and Hardthreshold) was introduced to fuse the features yielded by feature extraction methods adopted in this paper.It is all obvious from Figure 4 and easy to observe that SCCA with SCAD penalty function acquires a relatively outstanding result over the other penalty functions.The reasons are sketched as follows.On one side, in order to grip the complete source of image information, we utilized two typical and representative features to depict structure and colour properties of the image, and these features contain the essential and intrinsic information of the image.On the other side, SCAD penalty function possesses the competence to manipulate multicollinearity between features as compare to the other penalty functions.However, these methods do not have immediate control over the sparsity.Consequently, an extra two-phase feature filtering process was indispensable to conduct for further removing redundant and opaque information between and within features.Figure 5 compares the classification performance of four different penalty functions with the suggested two-phase BIC based filtering process.It is evident that classification accuracy enhances after utilizing the suggested BIC filter, which can be explained by the possibility that without the suggested BIC filter the SCCA may not necessarily engender sparse set of features.In other words, after using the suggested BIC filter the noisy and redundant information among features have been further eliminated.
Note that the SCAD penalty function with the suggested BIC filter yielded the best classification accuracy.Figures  It is as expected that misclassification was more inclinable to emerge between short buildings and factories.This is due to the fact that factories often include dense houses and level and perpendicular lines, which resemble short buildings.Meantime green land was misclassified as farm land or bare land; the most contributing factor to this  misclassification is that these classes possess similar elements and patterns.However, confusions between some classes were tough to understand.Namely, some short buildings, roads, and factories were classified as bare land.

Conclusions
In this paper, a BIC based feature filtering approach was presented for high-spatial-resolution satellite scene classification.The SCCA with a two-phase BIC feature filtering process acts as a crucial component in satellite scene classification and can strikingly meliorate the classification accuracy by

Figure 1 :Figure 2 :
Figure 1: Flow chart of the suggested classification method.

Figure 3 :
Figure 3: Examples of each class in the eight-class satellite scene.

Figure 4 :
Figure 4: Performance comparison using different penalty functions on SCCA without the suggested BIC filter.

Figure 5 :
Figure 5: Performance comparison using different penalty functions on SCCA with the suggested BIC filter.