Research Article Statistical Analysis of the Performance of Rank Fusion Methods Applied to a Homogeneous Ensemble Feature Ranking

Feature ranking, a subcategory of feature selection, is an essential preprocessing technique that orders all features of a dataset so that the most important features, which carry the most information, appear first. Ensemble learning has two advantages. First, it is based on the assumption that combining the outputs of different models can lead to a better outcome than the output of any individual model. Second, scalability is an intrinsic characteristic, which is crucial in coping with large-scale datasets. In this paper, a homogeneous ensemble feature ranking algorithm is considered, and the nine rank fusion methods used in this algorithm are analyzed comparatively. The experimental studies are performed on six real medium-sized datasets, and the area under the feature-forward-addition curve is used as the assessment criterion. Finally, a statistical analysis by repeated-measures analysis of variance reveals that the difference in the performance of the rank fusion methods applied in a homogeneous ensemble feature ranking is small; however, this difference is statistically significant, and the B-Min method performs slightly better.


Introduction
During recent years, the amount of data generated daily has grown dramatically. IBM estimated that 2.5 quintillion bytes of data are created every day and that 90% of the data in the world today has been created in the last two years. Nowadays, such voluminous data are known as big data. The analysis of such massive data on a single machine is impossible or very slow. Hence, it is necessary to use algorithms that can be distributed across several machines or several threads.
Feature selection (FS) is a crucial preprocessing technique for dealing with the high-dimensional datasets that are common in the big data era. The primary objective of feature selection is to select a subset of features that has the same discriminating power as the original feature set [1][2][3]. This technique can reduce the dimensionality of the feature space and improve classification performance by removing irrelevant and redundant features [4].
According to the final result, feature selection techniques can be categorized into two subcategories: feature-subset selection (FSS) and feature ranking (FR). Moreover, depending on whether the label of each instance is available or not, the feature selection can be classified into supervised and unsupervised types [5][6][7][8][9].
Furthermore, according to dependency on a learning model, feature selection algorithms can be classified into three categories of Wrapper, Embedded, and Filter [10]. As it was mentioned by Brahim and Limam [11], the algorithms of the filter category have more generality than those of the other categories. Besides, these algorithms utilize a statistical criterion for feature evaluation resulting in a decrease of the computational cost.
Ensemble learning is based on the assumption that combining the outputs of several models obtains better results than the output of any individual model. Furthermore, ensemble learning algorithms are inherently distributable: every base learner can be executed independently in a separate worker or thread, which is a useful characteristic when confronting a large-scale dataset. Ensemble learning has been applied broadly in classification in the last decade; however, it can also be effective in other machine learning disciplines such as feature selection [12]. The ensemble learning approach to feature selection, called Ensemble Feature Selection (EFS), has received increased attention in recent years [13][14][15][16][17]. From another point of view, feature ranking, as a subfamily of feature selection, is a common approach when the number of features is substantial. Hence, in this paper, feature ranking algorithms (FRAs) are applied in EFS as base learners, and the resulting approach is called ensemble feature ranking (EFR) henceforth. Furthermore, because the labels of the training instances are known, it is a supervised EFR.
To sum up, the EFR approach has three advantages. First, applying feature ranking has a lower computational cost than feature selection, and it is more sensible when the number of features is high. Second, the ensemble learning approach has inherent scalability because each base learning model can be processed independently, and this ability is essential for coping with large-scale datasets. Third, we hope to obtain a more accurate result by applying several models instead of a single model.
Generally, an EFR has three steps. First, data diversity of the training datasets is provided by a subsampling method; second, several rankings of features are determined by several base FR algorithms; and, third, the intermediate rankings are fused to generate the final ranking of the features. In the last decade, several rank fusion algorithms have been introduced in various scopes, such as opinion mining and information retrieval [18,19]. The main purpose of this paper is to study the role and effectiveness of the different rank fusion methods as a part of homogeneous EFR approaches. For this purpose, seven FRAs belonging to the filter category are used as base rankers and combined with nine rank fusion methods in independent scenarios. Eventually, the experimental results are analyzed by statistical methods to answer this question: is there a big difference among the rank fusion methods applied in EFR? If so, which one can produce more accurate results than the others? The remainder of the article is organized as follows: Section 2 introduces some background knowledge of ensemble feature selection. Section 3 describes the experimental framework, and, in Section 4, the experimental results are presented. Section 5 offers some discussion. Finally, concluding remarks are given in Section 6.

Related Work
One of the most important techniques in data analysis and processing is feature selection, which is applied in broad scopes such as machine learning, pattern recognition, and data mining [20][21][22][23][24]. Furthermore, this technique can be more beneficial and sensible when the dataset is high-dimensional [25,26]. The data often have many dimensions in some scopes, such as gene analysis [27,28], cancer classification [29], robotics [30], satellite image processing [31], and big data [32][33][34], which makes feature selection essential.
In the last years, several papers were published on EFS in different fields. In general, published articles can be classified into two groups. In the first group, the output of the proposed methods is a set of features, and, in the second group, it is a rank of features. In this section, ten state-of-the-art articles are considered so that the first three items belong to the first group and the rest belong to the second group.
A method called MCF-RFE was proposed by Yang and Mao [35], in which the outputs of several FR algorithms are fused to generate the final ranking of features. Then, the irrelevant features are removed by using the SVM-RFE algorithm. In this method, both the classification performance and the stability of the feature selection result are improved simultaneously. Although the proposed method used FR algorithms as base learners, its final output is a set of selected features.
In [36], Das et al. developed a method called EFSGA by applying a biobjective genetic algorithm. Boundary region analysis of rough set theory and multivariate mutual information of information theory are employed as two objective functions in the proposed method. In their method, several subdatasets are prepared by the subsampling of the original dataset.
Then, for each of the subdatasets, the biobjective genetic algorithm is executed, and several subsets of selected features are produced. Eventually, one of the selected feature subsets is determined as the final subset of features by using a heuristic method. It must be mentioned that both objective functions and the genetic algorithm, as a population-based algorithm, are time-consuming and have a high computational cost.
Hoque et al. in 2018 [37] proposed an algorithm called EFS-MI. EFS-MI applies some FR algorithms as base learners and fuses the outputs of these FR algorithms at the final step. During the fusion step, the algorithm attempts to determine the final selected feature set so that it has both the maximum relation to the class label and the minimum relation to the other features. In this way, redundant features have fewer chances to be members of the final selected feature set. Notably, the proposed method uses an incremental approach at the fusion step, which might reduce its ability to be distributed.
Although the base learners of both MCF-RFE and EFS-MI are of the FR type, their outputs are a set of selected features, similar to the output of EFSGA. Some proposed methods whose outputs are a feature ranking are given in the following.
In the bioinformatics scope, an EFR method developed by Abeel et al. [38] uses SVM-RFE as an incremental FR. In that paper, two rank fusion algorithms called CLA and CWA are introduced. In [39], a heuristic method is developed in which a given dataset is sent to five different FR algorithms. Then, based on the outputs of the FR algorithms, five classification models are built, and, eventually, the classifiers' outcomes are combined by simple voting. It is noteworthy that, in this method, the outputs of the FR algorithms are not fused directly.
Brahim and Limam in 2013 [40] introduced a fusion method named ROB-EFS for fusing the base feature rankings. In their method, the selected features are fused based on two criteria, confidence and reliability. To assess the confidence criterion, the method relies on the SVM classification error rate, which is very time-consuming.
Boucheham and Batouche proposed a method named MEFS for fusion feature ranking [41]. In this method, the feature rankings are fused in two steps. At the first step, the base feature rankings are generated in parallel, and then they are fused. In the second step, all actions of the first step are repeated incrementally.
A heterogeneous EFR algorithm is proposed by Seijo-Pardo et al. [42]. In this algorithm, all instances of a given dataset are considered by six FR algorithms. Consequently, six feature rankings are generated, which are fused by SVM-Rank to acquire the final ranking of the features. The authors of [40] published their next work [11], in which a new rank fusion method named RAA is proposed. The RAA method, similar to ROB-EFS, utilizes the classification performance as a confidence criterion. Therefore, both articles have the same weaknesses.

Experimental Framework
In this section, the performances of different rank fusion algorithms applied in the homogeneous EFR algorithm are comparatively analyzed. Some aspects of the experimental studies should be considered. These matters are explained in the following.

Homogeneous Ensemble Feature Ranking.
In this paper, as mentioned before, rank fusion algorithms are applied in a homogeneous ensemble feature ranking, which is executed in parallel. Therefore, explaining this algorithm is essential. In the EFR, at the first step, a given dataset is sampled S times to create S subdatasets. In the second step, each subsample is processed independently by an FRA in a separate thread, and the intermediate feature rankings are produced. At the final step, the intermediate rankings are fused by a rank fusion algorithm to produce the final feature ranking. The details of the EFR are explained in Figure 1 and Algorithm 1. Note that, in Algorithm 1, DS ∈ R^{N×D} refers to the given dataset, and the input variable S determines the number of local datasets, which is set to 30 in the experimental studies. Also, the input variable I refers to the number of records in each local dataset, the input variable A refers to an FRA that can be one of the algorithms introduced in Section 3.2, the input variable F refers to a rank fusion algorithm that can be one of the algorithms introduced in Section 3.3, and, finally, the output variable O refers to the feature ranking result.
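The three steps described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' Algorithm 1: the sampling scheme (without replacement), the helper names `toy_ranker` and `mean_position_fusion`, and the use of a thread pool are assumptions made for the example. The toy ranker is a stand-in and is not one of the seven FRAs studied in this paper.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def ensemble_feature_ranking(rows, labels, S, I, rank_features, fuse, seed=0):
    """Homogeneous EFR sketch: subsample, rank each subsample, fuse.

    rank_features(rows, labels) -> feature indices, best first (the FRA "A").
    fuse(rankings) -> one list of feature indices, best first (the method "F").
    """
    rng = random.Random(seed)
    # Step 1: draw S local datasets of I records each
    # (sampling without replacement is an assumption of this sketch).
    samples = []
    for _ in range(S):
        idx = rng.sample(range(len(rows)), I)
        samples.append(([rows[i] for i in idx], [labels[i] for i in idx]))
    # Step 2: run the same ranker on every local dataset, in worker threads.
    with ThreadPoolExecutor() as pool:
        rankings = list(pool.map(lambda s: rank_features(*s), samples))
    # Step 3: fuse the intermediate rankings into the final ranking O.
    return fuse(rankings)


def toy_ranker(rows, labels):
    """Hypothetical stand-in for an FRA: |mean(class 1) - mean(class 0)|."""
    d = len(rows[0])
    scores = []
    for j in range(d):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        if pos and neg:
            scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
        else:
            scores.append(0.0)  # a subsample with one class carries no signal
    return sorted(range(d), key=lambda j: -scores[j])


def mean_position_fusion(rankings):
    """Hypothetical fusion step: order features by their summed positions."""
    total = [0] * len(rankings[0])
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            total[feature] += position
    return sorted(range(len(total)), key=lambda f: total[f])


# Synthetic data: feature 0 follows the label, feature 1 is noise-like,
# feature 2 is constant.
labels = [i % 2 for i in range(40)]
rows = [[10.0 * y + (i % 3), float((i * 7) % 5), 1.0] for i, y in enumerate(labels)]
final = ensemble_feature_ranking(rows, labels, S=30, I=20,
                                 rank_features=toy_ranker,
                                 fuse=mean_position_fusion)
print(final)
```

Because each local dataset is ranked independently, the second step is the part that can be moved to separate threads or worker nodes; only the fusion step needs all intermediate rankings at once.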

Feature Ranking Algorithms as the Base Ranker.
Seven FRAs are used in the EFR as base rankers. All these algorithms belong to the filter category, and they have a much lower computational cost than the algorithms of the wrapper and embedded categories. The algorithms of the filter category are divided into multivariate and univariate methods. In general, most of the classical feature selection approaches are univariate: each feature is considered separately, which is an important advantage for scalability, but comes at the cost of ignoring feature dependencies and may thus lead to weaker performance than other feature selection techniques. To improve performance, multivariate filter techniques have been proposed, but at the cost of reduced scalability [43]. The FRAs applied in the experiments are listed below. The first five of these algorithms are univariate, and the last two are multivariate. Accordingly, the last two algorithms are expected to perform better than the others, which is a salient point depicted in Figure 2.
(i) Information gain: this criterion, used for feature ranking, is based on the entropy measure. The larger the information gain value, the more important the feature [44][45][46].
(ii) Gain ratio: this is a normalized form of information gain. Although the two criteria are related to each other, the feature rankings they produce are different [47].
(iii) Fisher: the main idea of the Fisher score is to find a subset of features such that, in the data space spanned by the selected features, the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible [48][49][50].
(iv) Gini: the Gini index is used to measure the impurity of a feature for categorization. The smaller the value, i.e., the lower the impurity, the better the feature [51,52].
(v) OneRule: this method, named oneR, builds one simple rule to predict the target class for each feature and then sorts all the features based on the error rate of their rules. For example, a simple rule for a feature can be a set of feature values bound to their majority class [53].
(vi) ReliefF: this method uses the ability of a feature to separate similar instances. For a random sample of the training set, the nearest hit and the nearest miss instances are found. Then, the algorithm updates the weights of all features based on the values of the hit and miss instances. A feature with a larger weight can distinguish the instances of a class better [54][55][56].
(vii) QPFS: in this method, features are evaluated by minimizing a multivariate quadratic function subject to linear constraints. The quadratic function includes two components. The first one is a matrix of similarity among the features. The second one is a vector of dependency between the features and the target class. The result is a weight vector [57,58].
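As an illustration of the first two univariate criteria in the list above, the following sketch computes information gain and gain ratio for discrete features and ranks the features by information gain. It uses the standard textbook definitions (gain ratio as information gain divided by the entropy of the feature values); the function names and the tiny dataset are invented for the example and are not taken from the implementations used in the experiments.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """H(labels) minus the entropy of the labels conditioned on the feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


def rank_by_information_gain(columns, labels):
    """Return feature indices ordered from most to least informative."""
    gains = [information_gain(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -gains[j])


# Invented example: feature 1 determines the label, feature 0 does not.
labels = [0, 0, 1, 1]
columns = [["a", "b", "a", "b"], ["x", "x", "y", "y"]]
print(rank_by_information_gain(columns, labels))
```

Continuous features would first have to be discretized before these two criteria can be applied, which is a design choice this sketch leaves open.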

Rank Fusion Algorithms.
Rank fusion is known by different names, such as rank aggregation, rank combination, and preference aggregation, in various scopes. Generating a final ranking from a set of base rankings is formulated as an optimization problem known as the Kemeny ranking problem [59]. In the previous literature, several algorithms, which are collected in Figure 3, were suggested for solving this problem [19,[60][61][62][63]. These solutions are categorized into two groups, exact and approximate solutions, although not all of these algorithms are suitable for the EFR approach. The exact solutions, such as the integer linear program (ILP), have a high computational cost and are time-consuming, so these methods are inappropriate when the number of features in a given dataset is very large [62,64]. For these reasons, none of the rank fusion algorithms of the exact category are studied in this paper. The approximate solutions have a lower computational cost than the exact ones, so all rank fusion algorithms considered in the experiments belong to this category. Most of the rank fusion algorithms of the positional category [65,66] have a low computational cost, are fast, and are used in many different scopes. Therefore, in this paper, the methods Borda-Mean (B-Mean), Borda-Median (B-Median), Borda-Geometric-Mean (B-Geom), Borda-L2Norm (B-L2), and Borda-Min (B-Min) are studied. The rank fusion algorithms of the sort-based category [67][68][69][70] do not have a very high computational cost, although they have been applied infrequently in previous studies. Hence, only the Kwik method is studied as a representative of this category. The computational cost of the graph-based methods is not as high as that of the exact methods, although it is still noticeable. In this paper, only the Markov-Chain (MC4) method is investigated because it has a better result than the other methods in the graph-based category [71].
Two rank fusion algorithms of the statistical-order-based category, robust rank aggregation (RRA) and Stuart [72,73], have been applied widely in bioinformatics studies, and their computational cost is lower than that of the methods in the exact and graph-based categories. Thus, both methods are studied in the experiments. In short, the computational cost, which is a crucial factor in processing high-dimensional datasets, and the popularity in the literature are the two criteria used to select the fusion algorithms for the experimental studies. Also, at least one fusion algorithm is selected from each category except the exact solutions category. It is worth mentioning that the rank fusion algorithms studied in the experiments are distinguished by bold font and blue colour in the mind map in Figure 3.
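The five positional (Borda-type) methods named above differ only in how they aggregate the positions a feature receives in the base rankings. The sketch below makes this concrete under stated assumptions: positions are 1-based with smaller meaning better, each method applies its aggregate (mean, median, geometric mean, L2 norm, or minimum) to those positions, and ties are broken by feature name. The exact scoring and tie-breaking conventions of the implementations used in the experiments are not specified in this section, so the code should be read as one plausible reading and not as the reference definition.

```python
import math
from statistics import mean, median

# Aggregates applied to the list of positions of one feature (1 = best).
AGGREGATORS = {
    "B-Mean": mean,
    "B-Median": median,
    "B-Geom": lambda p: math.exp(sum(math.log(x) for x in p) / len(p)),
    "B-L2": lambda p: math.sqrt(sum(x * x for x in p)),
    "B-Min": min,
}


def borda_fuse(rankings, method):
    """Fuse base rankings (lists of feature names, best first)."""
    aggregate = AGGREGATORS[method]
    # Position of every feature in every base ranking, 1-based.
    positions = [{f: i + 1 for i, f in enumerate(r)} for r in rankings]
    features = rankings[0]
    score = {f: aggregate([p[f] for p in positions]) for f in features}
    # Smaller aggregated position is better; ties are broken by name
    # (an assumption of this sketch).
    return sorted(features, key=lambda f: (score[f], f))


# Invented base rankings: feature "d" is ranked first by one ranker only.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["d", "b", "a", "c"],
]
for name in AGGREGATORS:
    print(name, borda_fuse(rankings, name))
```

In this invented example, B-Median places "d" last, whereas B-Min lets it tie with "a" and "b" at the best position because a single base ranker put it first; this is the optimistic behaviour attributed to B-Min later in the paper.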

Datasets.
As mentioned before, feature selection is intended for dealing with high-dimensional datasets. For the experimental study, six datasets are collected from two popular repositories, UCI (https://archive.ics.uci.edu/ml/datasets.php) and scikit-feature (https://github.com/jundongl/scikit-feature/tree/master/skfeature/data). These datasets have between 1,440 and 21,048 instances and between 179 and 1024 features, and each is sampled 30 times during the execution of the EFR algorithm. The characteristics of these datasets are gathered in Table 1.

Performance Assessment Criterion.
In FR, as a subcategory of FS, all features are ordered based on their importance, and then, by using a threshold, the most important features at the top of the ordered list are determined as the selected features. Notice that there is no deterministic way of determining the threshold value. Therefore, in this paper, to assess a feature ranking such as R = (f_1, f_2, ..., f_n), the k top features are evaluated by using a stepwise method. The output of the stepwise evaluation method is a feature-forward-addition (FFA) curve [74]. The pseudocode for the stepwise evaluation method is given in Algorithm 2. Note that the input variable R is a ranking of features, the input variable k is the number of top features, which is set to 10% of all features, and the output variable O includes the FFA curve points, that is, the accuracy of a classifier when various numbers of features (from 2 to 10% of all features) are selected. The random forest classifier [75,76] is applied for evaluation in this paper. The classifier's implementation can be found in the CRAN repository (https://cran.r-project.org/web/packages/randomForest/index.html); its default settings, such as ntree = 500, are used during the experiments. It is expected that the most important features, which have more effect on the performance of the classification algorithm, are placed at the top of a given feature ranking. Therefore, for two feature rankings, r_A and r_B, the better feature ranking reaches the maximum point on its FFA curve with a steeper slope, and, as a consequence, it has a larger area under the FFA curve, called AUFC henceforth. For example, in Figure 4, both feature rankings r_A and r_B have the same accuracy when their fourteen top features are used, but r_A has better performance than r_B because it has a bigger area under the curve. Therefore, the AUFC is used as the criterion for assessing the performance of a feature ranking [49]. In Figure 4, the AUFC of r_B is hatched in blue.
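The stepwise evaluation and the AUFC can be illustrated with a short sketch. It is not the authors' Algorithm 2: the `evaluate` callable is a hypothetical placeholder for the cross-validated random forest accuracy, the toy accuracy function is invented, and the use of the trapezoidal rule for the area is an assumption, since the integration rule is not stated in this section.

```python
def ffa_curve(ranking, evaluate, percent=10, start=2):
    """FFA curve points (k, accuracy) for k = start .. percent% of all features.

    evaluate(selected_features) -> accuracy is supplied by the caller; in the
    paper this role is played by a random forest classifier.
    """
    k_max = max(start, len(ranking) * percent // 100)
    return [(k, evaluate(ranking[:k])) for k in range(start, k_max + 1)]


def aufc(curve):
    """Area under the FFA curve by the trapezoidal rule (an assumption)."""
    return sum((k2 - k1) * (a1 + a2) / 2
               for (k1, a1), (k2, a2) in zip(curve, curve[1:]))


# Invented accuracy model: features 0, 1, and 2 are informative and each adds
# 0.1 to a baseline accuracy of 0.5.
def toy_accuracy(selected):
    return 0.5 + 0.1 * len({0, 1, 2} & set(selected))


features = list(range(50))
r_a = features                       # informative features ranked first
r_b = [3, 4, 0, 1, 2] + features[5:]  # informative features ranked later

curve_a = ffa_curve(r_a, toy_accuracy)
curve_b = ffa_curve(r_b, toy_accuracy)
print(curve_a, aufc(curve_a))
print(curve_b, aufc(curve_b))
```

Both toy rankings reach the same accuracy at k = 5, but r_a reaches it earlier and therefore has the larger AUFC, which mirrors the r_A versus r_B comparison in Figure 4.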

Experimental Results
For the experimental study, experiments are carried out on six real datasets whose characteristics are given in Table 1. To this aim, the nine rank fusion algorithms determined in Section 3.3 and the seven FRAs introduced in Section 3.2 are applied in the EFR of Algorithm 1 in 63 independent experiments (number of feature ranking methods × number of rank fusion methods) for each dataset. Then, for each experiment, the corresponding FFA curve is generated. Note that fivefold cross-validation is used to determine the training and test datasets.
As an instance, Figure 2 depicts FFA curves for the USPS dataset, such that each subchart illustrates the FFA curve of various FRAs by applying a specific fusion method. As an example, the first subchart of Figure 2 illustrates FFA curves for FRAs and the B-Geom ranking fusion method when they are applied in an EFR.
As mentioned and expected, Figure 2 depicts that the two multivariate algorithms, QPFS and ReliefF, have a better FFA curve than the other methods in all subcharts. This matter is

Statistical Comparison among Rank Fusion Algorithms.
Nevertheless, the main purpose of this paper is to perform a comparative analysis of the performance of rank fusion methods applied in an EFR, so the FFA curves of Figure 2 should be grouped by FRA, as Figure 5 depicts. As illustrated by the Fisher chart in Figure 5, all rank fusion methods produce almost the same FFA curves when the Fisher algorithm is applied in an EFR; however, the B-Min method has a slightly better curve than the others. The same behavior is repeated in the other subcharts of Figure 5. Also, each dataset can generate a figure similar to Figure 5. In order to perform the comparative analysis, the AUFC criterion is computed for all FFA curves in Figure 5. Then, the AUFC measures are normalized by dividing them by the maximum AUFC (number of selected features × the maximum point among all curve points). This process is repeated for all subcharts and for all six experimental datasets. If the experimental results on each dataset are gathered, stacked column charts can be generated as in Figure 6. For instance, the second chart in Figure 6 is related to the USPS dataset, and its B-Geom column shows the performance of the B-Geom rank fusion method on this dataset. This column includes the AUFC criterion values obtained when the different FRAs are applied in the EFR in independent experiments, all of which use the B-Geom method as the rank fusion algorithm. In other words, this column is the sum of the normalized AUFC values of the B-Geom curve in all subcharts of Figure 5. The other columns in Figure 6 are generated in the same way. The second chart in Figure 6 shows that all of the rank fusion methods have nearly the same performance; however, the B-Min method has slightly better performance than the others. This behavior is repeated in the remaining subcharts in Figure 5.
For a deeper analysis, the experimental results are collected in Table 2, such that the average value of each column of Figure 6 is placed in the equivalent cell in Table 2. For instance, the average column values of the second chart in Figure 6 are placed in the second column in Table 2. The critical question is the following: is there a statistically significant difference in the average value of the AUFC among rank fusion algorithms when applied to an EFR?
To answer this question, first, a one-way repeated-measures analysis of variance (ANOVA) with Greenhouse-Geisser correction is conducted. The ANOVA test results reveal that there is a statistically significant difference in the mean AUFC values among rank fusion algorithms. Second, pairwise comparison t-tests with a Bonferroni adjustment, named PCT henceforth, are performed to compare the mean AUFC values among rank fusion algorithms. The results of the pairwise comparison tests are shown in the columns of Table 2 by the letter-based representation method. For each column of the table, methods that do not differ significantly from each other are marked with a shared superscript letter. For example, in the USPS column, the performance of B-Min differs significantly from that of all other methods; consequently, it has no shared letter.
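The normalization and the construction of one stacked column can be written out as a small sketch. The numbers below are invented purely to show the arithmetic and are not results from the paper; the function names are also hypothetical.

```python
def normalize_aufc(raw_aufc, n_selected, max_accuracy):
    """Divide a raw AUFC by the maximum AUFC described in the text:
    number of selected features x the maximum point among all curve points."""
    return raw_aufc / (n_selected * max_accuracy)


def stacked_columns(raw_table, n_selected, max_accuracy):
    """raw_table[fusion][fra] holds a raw AUFC value; each returned column is
    the sum of the normalized AUFC values over all FRAs for one fusion method."""
    return {
        fusion: sum(normalize_aufc(v, n_selected, max_accuracy)
                    for v in by_fra.values())
        for fusion, by_fra in raw_table.items()
    }


# Invented raw AUFC values for two fusion methods and two FRAs.
raw = {
    "B-Min": {"Fisher": 8.1, "Gini": 7.2},
    "B-Mean": {"Fisher": 7.65, "Gini": 7.2},
}
print(stacked_columns(raw, n_selected=10, max_accuracy=0.9))
```

With these invented values the maximum AUFC is 10 × 0.9 = 9, so the B-Min column sums to 0.9 + 0.8 = 1.7 and the B-Mean column to 0.85 + 0.8 = 1.65.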
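To make the test procedure more tangible, the following sketch computes the F statistic of a one-way repeated-measures ANOVA and applies a Bonferroni adjustment to a list of p-values. It is deliberately incomplete compared with the analysis described above: the Greenhouse-Geisser correction and the conversion of F and t statistics into p-values are omitted because they require distribution functions from a statistics package. What plays the role of a "subject" (for example, an FRA and dataset combination) is an assumption, and the data are invented.

```python
def rm_anova_f(data):
    """One-way repeated-measures ANOVA.

    data[i][j] is the score of subject i under condition j (here a condition
    would be a rank fusion method). Returns (F, df_conditions, df_error).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Variation left after removing condition and subject effects.
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented scores: 3 subjects (rows) measured under 3 conditions (columns).
scores = [[1, 2, 4], [2, 3, 4], [3, 5, 6]]
print(rm_anova_f(scores))
print(bonferroni([0.01, 0.02, 0.5]))
```

With nine fusion methods there are 9 × 8 / 2 = 36 pairwise comparisons, so under a Bonferroni adjustment each raw p-value would be multiplied by 36 before being compared with the significance level.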

Scientific Programming
According to the first column in Table 2, the B-Min method has a better result than the others, and the PCT finds a statistically significant difference between the B-Min method and the others, so there is no shared letter. Consequently, in the Epileptic column in Table 3, the B-Min method is set to 1 as the best fusion method on the Epileptic dataset. Also, the PCT does not find a statistically significant difference among the B-Median, Kwik, and RRA methods, so all of these methods are set to 2 in the equivalent column in Table 3. In the USPS column of Table 2, the B-Min method has better performance than the other methods, and the PCT finds that this difference is statistically significant, so it is set to 1 in Table 3; both the B-Geom and MC methods are set to 2 because there is no statistically significant difference between them.
In the third column in Table 2, the PCT does not find a significant difference among B-Geom, B-Median, Kwik, and B-Min; hence, all of these methods are set to 1 in the equivalent column in Table 3. Also, in the HAR and Isolet columns in Table 2, there is a statistically significant difference between the B-Min method and the others, so this method is set to 1 in the equivalent columns in Table 3. In contrast to the other columns, in the COIL20 column, the PCT does not find a statistically significant difference among any of the methods. Therefore, all methods in the COIL20 column in Table 3 are set to 1. Table 3 is filled according to these explanations, and it shows that the B-Min method has the best result on all of the datasets and acquires the number 1 in the Average column.
In summary, Figure 6 illustrates that, in the EFR algorithm, there is a small difference in the performance of the rank fusion methods, and Table 3 shows that the performance of the B-Min method differs significantly from that of most of the other methods. Table 4 compares the AUFC measures of the EFR with those of the individual feature rankings. As an example, the first cell in Table 4 shows the AUFC values obtained when the Fisher algorithm is applied as the base ranker in the EFR and when the Fisher feature ranker is used individually. This cell shows that the AUFC of the EFR (0.861) is bigger than the AUFC of the individual Fisher ranker (0.812). Note that the better result in each cell is highlighted in bold.
As mentioned in Section 1, inherent scalability is the prominent advantage of the EFR in confronting massive datasets, owing to the independent processing of each subdataset by worker nodes or worker threads. Moreover, the EFR can achieve more accurate results than an individual FRA because it combines multiple models instead of using a single model, in line with the old proverb "two heads are better than one" [12]. The latter is observable in Table 4. The results in this table illustrate that the EFR has good potential to produce more accurate results than the individual FRAs: in 43/47 items the EFR has better results, and in the remaining 4/47 items the EFR has results comparable with those of the individual FRAs.

Figures 5 and 6 and Table 2 illustrate that there is no big difference among the rank fusion methods applied in a homogeneous EFR. This is related to two factors. First, the homogeneous ensemble approach applies a similar feature ranking algorithm to every subdataset. Thus, all base feature rankings are generated by the same procedure and logic, which can produce similar base feature rankings. Second, in a real dataset with little noise, although each sampled subdataset contains different data instances, the patterns among the subdatasets are not very diverse. Therefore, applying similar feature ranking algorithms produces base feature rankings whose most informative features, placed in the top positions, are almost the same, whereas the less informative features are placed in varying bottom positions. Because of these two factors, the various fusion algorithms produce outcomes whose top positions are almost the same.
From another point of view, Table 3 and the statistical test results show that the B-Min method performs slightly better than the others. The B-Min method uses an optimistic approach for the fusion of the base feature rankings. In an optimistic approach, a feature placed in the top positions in at least one base feature ranking becomes a top member of the final feature ranking. In other words, in the optimistic approach, the base FRAs are assumed to be trustworthy, which is an efficient approach for real datasets with little noise.
Thus, the most informative features of the base feature rankings have a decent chance of being placed at the top of the final result. Generally, the experiments show that the optimistic approach to fusing base feature rankings that have almost the same top features can generate a slightly better result than the other fusion methods.

Conclusion
Feature selection is an essential preprocessing technique, and its importance is greater when the number of features of the given dataset is large, which is common in the big data era. Also, ensemble learning has been applied broadly in classification in the last decade; however, it can also be effective in other machine learning disciplines such as feature ranking.
Ensemble learning has inherent scalability because each subdataset can be processed independently, and this ability is especially important for coping with a large-scale dataset. The EFR has three major steps: subsampling, generating the intermediate feature rankings, and fusing the intermediate feature rankings. Because the fusion phase is a crucial step in the EFR, this paper presents a statistical analysis of the performance of nine rank fusion methods when they are utilized in an EFR.
In the statistical analysis, a one-way repeated-measures ANOVA followed by pairwise comparison t-tests with a Bonferroni adjustment was performed to compare the mean AUFC values among the rank fusion algorithms. The results of the one-way ANOVA revealed that the difference in the performance of the rank fusion methods is small, though it is statistically significant when the methods are applied to the EFR algorithm. Additionally, the pairwise comparison tests showed that the B-Min method performed slightly better than the other methods, at least on the six real datasets examined in this paper.

Data Availability
No data were used to support this study.