Evolutionary Multilabel Feature Selection Using Promising Feature Subset Generation

Recent progress in the development of sensor devices has improved information harvesting and enabled intelligent applications based on learning the hidden relations between collected sensor data and objectives. In this scenario, multilabel feature selection can play an important role in achieving better learning accuracy under limited resources. However, existing multilabel feature selection methods are search-ineffective because the generated feature subsets frequently include unimportant features. In addition, only a few feature subsets are considered relative to the size of the search space, yielding feature subsets with low multilabel learning accuracy. In this study, we propose an effective multilabel feature selection method based on a novel feature subset generation procedure. Experimental results demonstrate that the proposed method identifies better feature subsets than conventional methods.


Introduction
Recent progress in the development of sensor networks improves the precision of continuous data sensing [1], which broadens the coverage of ambient applications such as activity monitoring in daily routines, which may involve the concurrent prediction of activity level and caloric expenditure [2,3]. Owing to limitations in computational and storage capability [4,5] and redundant data sensing for denoising [6,7], composing a strategy that produces the best accuracy under given data collection conditions is considered one of the most important issues in this field [8]. Consequently, multilabel learning is considered a promising approach because it allows for improvements in accuracy by exploiting the dependency among labels [9,10].
Let W ⊂ ℝ^d denote the set of patterns described by a set of features F = {f_1, …, f_d}. Each pattern w_i ∈ W, where 1 ≤ i ≤ |W|, is assigned a label subset λ_i ⊆ L, where L = {l_1, l_2, …, l_|L|} is a finite set of labels. To attain additional improvements in accuracy, the algorithm has to exploit useful dependencies among labels based on the input feature values [11]. For this purpose, multilabel feature selection, which identifies a subset S ⊂ F of at most n ≪ d features with the largest dependency on L, can be used as a promising preprocessing step because it simplifies the complicated relations among features and labels by selecting important features and discarding unnecessary ones [12,13].
Basically, multilabel feature selection is a search problem [14]; it can be solved by identifying the optimal feature subset, the one giving the best prediction accuracy, among the ∑_{k=1}^{n} C(d, k) candidate feature subsets [15]. Because the examination of all feature subsets is impractical, conventional methods employ a heuristic search that identifies a feasible solution within a limited computational cost by sacrificing optimality [16]. Of the many search methods, the evolutionary search is considered a promising approach because it effectively narrows down the search space by examining neighbor solutions, i.e., feature subsets derived from the best solutions of past generations [17,18].
In the evolutionary search method, the best solution is replaced if a newly created neighbor solution yields a better fitness value. Therefore, generating promising solutions determines the effectiveness of the search. Owing to the extensively wide search space and the limited computational budget, a conventional strategy for tackling this difficulty is to employ a cheap evaluation method that measures the potential of candidate solutions, filters out unpromising ones, and then validates the exact fitness value of the remaining solutions [19]. However, to the best of our knowledge, there has been no serious investigation in this direction in the literature on intelligent sensor applications and multilabel feature selection.
In this study, we propose a novel and effective evolutionary search method for multilabel datasets. Previous studies on intelligent sensor applications involving multilabel feature selection did not tackle the generation of promising feature subsets, resulting in degraded search effectiveness. Our contributions can be summarized as follows: (i) The proposed method improves search effectiveness by producing a large number of feature subsets containing important features and then filtering out unpromising feature subsets using a cheap evaluation method.
(ii) A cheap feature subset evaluation method is employed to filter out unpromising feature subsets without computing the fitness value, which demands expensive computation.
(iii) We compared the performance of conventional multilabel feature wrapper methods and the proposed method on 14 multilabel datasets and conducted 53 standard statistical tests to validate the superiority of the proposed method.

Related Work
Because multilabel feature selection can improve the learning accuracy as well as the efficiency of a subsequent algorithm, such as a multilabel classifier for concurrent prediction, by highlighting important features, it has gained significant attention from diverse fields [20,21]. Feature selection methods come in two categories: filters and wrappers. Filter methods rank features by evaluating the importance of each feature according to their own criterion. For multilabel feature selection, a simple strategy that transforms the label sets into a single label set, such as the label powerset, was often considered [22]. This strategy is advantageous because it enables the use of conventional feature selection methods developed for single-label datasets. Several conventional filter methods have been reported [23]; however, filter methods commonly suffer from low multilabel classification accuracy, owing to the lack of interaction with multilabel classifiers or subsequent problems such as class imbalance in the transformed single-label data. By contrast, wrapper methods evaluate created feature subsets and iteratively improve them. In detail, they locate promising feature subsets using the employed search method and then evaluate them using a subsequent learning algorithm [17]. Although the learning algorithm can differ according to the application, a recent review indicated that the most frequent choice of search method is the evolutionary search [24], because it is effective at finding feasible solutions from a global perspective. Zhang et al. [14] proposed a multilabel feature selection method based on the genetic algorithm. However, a major drawback of genetic algorithms is their premature convergence to unrefined solutions [17]. In addition, the genetic algorithm-based nondominated sorting genetic algorithm-II [25] and multiobjective particle swarm optimization [26] have been used for multilabel feature selection.
Although most studies consider single-label sensory datasets, there are several studies on feature selection methods because of their promising potential. To enable automatic view generation, a semisupervised feature selection method was proposed for features extracted from very high-resolution remote sensing images [27]. Specifically, the features are categorized into a series of disjoint groups, and the important features in each group are then selected by solving an l_{1,2}-norm-based minimization problem. Similarly, a refined feature subset was selected from discrete wavelet transform coefficient features, extracted from artificial tongue sensor signals, by computing a dispersion ratio [12]. Activity recognition using accelerometers was also shown to be improved by feature selection [28]. Several studies identify a set of important features based on the fitness or classification accuracy derived from a learning algorithm. For example, a feature subset can be obtained by iteratively including the best feature at each step, which is referred to as the sequential forward selection algorithm [29]. This technique was applied to chiller fault detection [30], an instance of the automatic fault detection problem in a smart factory [31]. The genetic algorithm, one of the most famous evolutionary search methods in the machine learning community, was also considered for selecting discriminative features for online bearing fault diagnosis [32]. In addition, particle swarm optimization, another popular evolutionary search technique, was used to find the optimal feature subset for intrusion detection [13]. Support vector machine recursive feature elimination has been used for the analysis of correlated gas sensor data [7]. Finally, feature selection from sensor data has been used to minimize energy consumption while improving classification accuracy [5].

Proposed Method
3.1. Preliminary. Of the various evolutionary search methods, the estimation of distribution algorithm (EDA) has proven effective for solving various problems [24,33]. Unlike typical evolutionary search methods, EDAs do not use genetic operators to generate new feature subsets [19]. Instead, conventional EDAs generate new solutions, or candidates, using a probability model and update the probability model based on a statistical distribution estimated from the representation of the solutions. Thus, it provides an opportunity to generate promising feature subsets by manipulating the probability model. The probability model can be implemented as follows [33,34]:

P_i^{t+1} = (1 − LR) · P_i^t + LR · F_i^t,    (2)

where P_i^t is the selection probability of the i-th feature in the t-th generation, F_i^t is the frequency of the i-th feature in the top 50% of feature subsets in the t-th generation ranked by fitness, and LR is the learning rate, a user-defined parameter that controls the influence of F_i^t on the probability model in the next generation. Through (2), the probability of selecting each feature in the (t + 1)-th generation, P_i^{t+1}, is calculated, and the feature subsets of the (t + 1)-th generation are built accordingly. This process is repeated until the maximum allowed computational cost is exhausted. Although there are many possible stopping criteria, we set the number of spent fitness function calls (FFCs) as the termination condition for all evolutionary search methods employed in this study, for a fair comparison across diversified settings and implementations [35].
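As a concrete illustration, the PBIL-style update in (2) can be sketched as below. This is a minimal sketch, not the authors' implementation; the function and variable names are our own, and feature subsets are represented as binary vectors.

```python
def update_probability_model(probs, population, fitness, lr=0.4):
    """Move each selection probability toward the frequency of that feature
    in the best half of the population: P_i <- (1 - LR) * P_i + LR * F_i."""
    # rank subsets by fitness (higher is better) and keep the top 50%
    ranked = sorted(zip(population, fitness), key=lambda p: p[1], reverse=True)
    top = [subset for subset, _ in ranked[: max(1, len(ranked) // 2)]]
    # F_i: frequency of the i-th feature among the top subsets
    freq = [sum(s[i] for s in top) / len(top) for i in range(len(probs))]
    return [(1 - lr) * p + lr * f for p, f in zip(probs, freq)]
```

With LR = 0.4 (the value used later in the experiments), a feature present in every top subset has its probability pulled 40% of the way toward 1 in a single generation.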
In the feature selection problem, the algorithm should be capable of searching a huge parametric space; thus, significant computational cost is associated with finding a promising solution. Although simple probability models are easy to implement, they can be insufficient for solving complicated problems, such as pinpointing promising feature subsets in a large search space [36]. For example, in the conventional EDA-based feature selection method, all features are initially assigned a selection probability of 0.5, which means that nonpromising features can also be present in the feature subsets. To overcome this drawback, we devise a procedure for generating promising feature subsets. Specifically, when creating a feature subset, the algorithm considers important features more frequently by giving priority to the features ranked highly by an individual feature filter.
After creating the feature subsets, the next step amounts to selecting promising feature subsets. Although good feature subsets can be created using filter methods, nonpromising feature subsets can still occur because the creation process is probabilistic and there can be unfavorable interactions among features. Nonpromising feature subsets consume FFCs and negatively affect the search efficiency. To overcome this problem, we propose a feature subset evaluation method with a cheap computational cost. Using the tools of information theory, the proposed method calculates, for each subset, the relevance and redundancy of the subset's features and then selects feature subsets with maximal relevance and minimal redundancy. Because a subset judged promising by this criterion may be only locally promising, the proposed method uses roulette wheel selection as the selection algorithm [37]. Thus, nonpromising feature subsets are filtered out of the neighbor set without exact evaluation.
In the proposed method, there are two key functions for feature subset generation. The create function builds candidate feature subsets composed of relevant features. The select function selects promising feature subsets among the created ones by roulette wheel selection, based on the potential given by a feature subset evaluation method. Figure 1 schematically shows the proposed method. In the first stage, the probability model is initialized, indicating that feature subsets containing randomly chosen features will be created frequently. The probability model is represented as a vector in which each element encodes the selection probability of the corresponding feature. In the next step, feature subsets are created using the create function. Each feature subset is assigned a random integer size ranging from one to n. Suppose one feature subset is determined to contain two features. The proposed method ranks the features in terms of their importance using a filter method; in the first iteration of the example, the most important feature is f_4. The proposed method then draws a random number r between 0 and 1 and compares r to the selection probability of f_4 in the probability model, p_4. Because p_4 is greater than r in the example, f_4 is selected and added to the feature subset. In the second iteration, the features are ranked again using the filter method; in this case, the features' importance is measured in terms of relevance and redundancy given the selection of f_4, so the features' ranks can change. In this example, f_2 is now the most important feature. Another random number r is drawn and compared to the selection probability of f_2, p_2. However, p_2 is lower than r; thus, f_2 is not selected, and the second most important feature is considered instead. In this example, f_5 is added to the feature subset, and the iteration terminates. Through this process, the proposed method creates a series of new feature subsets that include important features. The next step amounts to selecting promising feature subsets among the created ones using the select function. The proposed method chooses m promising feature subsets by roulette wheel selection, biased by the proposed feature subset evaluation. Finally, the probability model is updated using the m promising feature subsets and (2), to reflect the presence of features in the best half of the new feature subsets ranked by fitness value.
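The subset-creation walkthrough above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes a caller-supplied `rank_features(remaining, selected)` callback that re-ranks the remaining features given the partial subset built so far (the callback name and signature are our own).

```python
import random

def create_subset(probs, rank_features, n):
    """Fill a subset of up to n features: always try the highest-ranked
    remaining feature first and accept it with its model probability."""
    subset = []
    remaining = set(range(len(probs)))
    while len(subset) < n and remaining:
        # re-rank remaining features conditioned on what is already selected
        for f in rank_features(remaining, subset):
            if random.random() < probs[f]:  # accept feature f with probability P_f
                subset.append(f)
                remaining.discard(f)
                break
        else:
            break  # a full pass accepted nothing; stop early
    return subset
```

Because acceptance is probabilistic, highly ranked features are favored but not guaranteed, which preserves diversity among the created subsets.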

3.2. Proposed Search Procedure. The proposed algorithm creates feature subsets over a large search space and filters out nonpromising subsets using the proposed subset evaluation method, which does not incur exact evaluation. Algorithm 1 shows the proposed method. For population size m and maximal number of FFCs v, the method initializes the feature subsets O_t and the probability model P (line 3). The method generates a set of m feature subsets O_t through a random assignment of at most |F| binary bits. The probability model P is a vector of length |F|, each entry of which is the probability of choosing the corresponding feature; each entry is initialized from the distribution of the features in O_t. Then, the created set O_t is evaluated (line 4). The method sets the consumed FFCs u to zero (line 5) and stores the global best feature subset in S_g (line 6). P is updated by (2) (line 8). The method creates a set of neighbor feature subsets E based on a filter method, using the create function (line 9). Then, m feature subsets are selected from E by roulette wheel selection weighted by the select function, yielding the new generation O_{t+1} (line 10). The feature subsets in O_{t+1} are evaluated (line 11), and the consumed FFCs u are updated (line 12). The feature subset S_g, which offers the globally best performance, is stored and replaced during the procedure (line 13). After all allowed FFCs are consumed, the algorithm returns the feature subset S_g. Algorithm 2 is the create function, which shows the process of creating feature subsets. Each feature subset selects a random number n of features (line 5). To introduce important features more frequently, each feature must first be ranked by its importance value. To achieve this, we evaluate the importance of each feature using the relevance and redundancy criteria [20]:

I(f_i) = Rel(f_i) − Red(f_i),    (3)

where Rel(f_i) and Red(f_i) denote the relevance and redundancy of the i-th feature and I(f_i) denotes the importance of the i-th feature. Although both functions can be implemented differently according to the subject of each study, we use a recent filter method for measuring the importance of features. In our earlier work [15], we proposed a filter method for multilabel datasets that was shown to outperform conventional filter methods; for this reason, we use it here for measuring the importance of features. Accordingly, Rel(f_i) can be implemented as

Rel(f_i) = ∑_{l ∈ L} M(f_i; l),    (4)

where M(x; y) = H(x) − H(x, y) + H(y) is the mutual information between variables x and y, H(x) = −∑ P(x) log P(x) is the entropy, and H(x, y) is the joint entropy obtained from the probabilities P(x), P(y), and P(x, y). Next, Red(f_i) can be implemented as

Red(f_i) = ∑_{f_s ∈ S} M(f_i; f_s),    (5)

where S is the set of features selected so far. Thus, the importance of feature f_i is measured by

I(f_i) = Rel(f_i) − Red(f_i).    (6)

The rank of each feature can then be determined using (6) and stored (line 8). Afterwards, the function decides whether to choose features, from the most important downwards, according to P (lines 9 to 13). If a feature is chosen, it is added to the subset S_k (line 11). After a subset is created, it is added to the set of neighbor feature subsets E (line 16).
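The information-theoretic quantities above can be computed directly from discrete feature and label columns. The sketch below assumes plain Python lists of discrete values and an mRMR-style summation for relevance; the function names are our own, not the paper's.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete sample: H(x) = -sum P(x) log2 P(x)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """M(x; y) = H(x) + H(y) - H(x, y), with H(x, y) from the paired sample."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def relevance(feature, labels):
    """Rel(f): summed mutual information between a feature and each label column."""
    return sum(mutual_information(feature, lab) for lab in labels)
```

This is why the discretization step mentioned in the experiments matters: the entropy estimates above require discrete values.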
It is a well-known fact in the feature selection community that a set of individually good features is not necessarily a good feature subset, owing to the interactions among features. This means that a created feature subset can be unpromising even though (6) only admitted important features. To address this, the select function described in Algorithm 3, which selects promising feature subsets from the neighbor set, is necessary. In the select function, a new feature subset filter method is employed [38]. Specifically, it evaluates the potential of a feature subset S as

E(S) = ∑_{f_i ∈ S} Rel(f_i) − ∑_{f_i, f_j ∈ S, i < j} M(f_i; f_j).    (7)

Using (7), the select function ranks the feature subsets in the neighbor set E (line 3). Next, the algorithm selects m feature subsets G_t using roulette wheel selection [37], a selection biased by the weights given by (7) (line 4).
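The roulette wheel step can be sketched as follows; the names and the proportional-weighting scheme are an illustrative assumption (scores are assumed nonnegative), not the authors' exact code.

```python
import random

def roulette_select(subsets, scores, m):
    """Pick m subsets with probability proportional to their filter score,
    so weaker subsets can still survive and the search avoids going greedy."""
    total = sum(scores)
    chosen = []
    for _ in range(m):
        r = random.uniform(0, total)
        acc = 0.0
        for subset, score in zip(subsets, scores):
            acc += score  # walk the wheel until the spin lands in a slice
            if acc >= r:
                chosen.append(subset)
                break
    return chosen
```

Selection is with replacement here: a strong subset may be picked more than once, which matches the bias-toward-promising-subsets behavior described above.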
In summary, during the generation of a feature subset, the algorithm ranks the importance of the features using the filter method and selects the most important feature i with probability P_i^t, considering the subset S selected up to that point. If the i-th feature is not chosen, the next most important feature j can be selected with probability P_j^t, and the process repeats until a feature is selected. Then, (7) ranks the neighbor feature subsets, and the feature subsets with the highest E(·) values are likely to be selected.

Experimental Results
We conducted experiments on 14 datasets from various domains. The Birds dataset is audio data containing samples of multiple bird calls. The Enron and Language Log (Llog) datasets were generated from text mining applications, where each feature corresponds to the presence of a word and each label represents the relevance of each text pattern to a specific subject. The Mediamill dataset contains video data from an automatic detection system. The Medical dataset is sampled from a large corpus of suicide letters obtained by natural language processing of clinical free texts. The TMC2007 dataset contains safety reports of a complex space system. The remaining eight datasets come from the Yahoo dataset collection. We performed unsupervised dimensionality reduction on the datasets consisting of more than 10,000 features, including TMC2007 and the Yahoo collection. Because our algorithm uses information theory, numeric features were discretized using a supervised discretization method [39]. Table 1 shows the standard characteristics of the multilabel datasets used in our experiments, including the number of patterns |W|, the number of features |F|, the type of features, and the number of labels |L|. The label cardinality Card is the average number of labels per instance, and the label density Den is the label cardinality divided by the total number of labels. The number of distinct label sets, Distinct, indicates the number of unique label subsets in L. Domain represents the application from which each dataset was extracted.
We compared the proposed method with conventional methods, including the genetic algorithm (GA) [14], the nondominated sorting genetic algorithm-II (NSGA-II) [25], and multiobjective particle swarm optimization feature selection (MPSOFS) [26]. We considered a conventional multilabel classifier, namely, the multilabel naïve Bayes (MLNB) classifier [14]. We used conventional hold-out cross-validation for each dataset: 80% of the patterns were randomly chosen as a training set and the remaining 20% as a test set. We set the population size to 20, and the maximal number of FFCs was limited to 100. In our proposed method, we created 500 feature subsets using the probability model and set the learning rate (LR) to 0.4. The GA and NSGA-II created two offspring feature subsets and one feature subset from mutation operators in each generation. The MPSOFS preserved the globally best particle solution and each particle's best solution, and then updated the velocity values. All experiments were repeated 10 times, and the averaged measurements were used to compare the performance of the methods.
1: Input: neighbor set E, population size m
2: Output: filtered set G_t
3: rank feature subsets in set E by Eq. (7)
4: select m feature subsets by roulette wheel selection
5: G_t ← selected feature subsets
Algorithm 3: select function.

To measure the methods' performance, we employed the following four evaluation metrics: multilabel accuracy, hamming loss, ranking loss, and normalized coverage. Multilabel accuracy is defined as

Accuracy = (1/|T|) ∑_{i=1}^{|T|} |λ_i ∩ Y_i| / |λ_i ∪ Y_i|,

where T is a given test set, λ_i denotes the correct label subset of the i-th test pattern, and Y_i denotes the predicted label subset. Hamming loss is defined by

Hamming loss = (1/|T|) ∑_{i=1}^{|T|} |λ_i Δ Y_i| / |L|,

where Δ denotes the symmetric difference between the two sets. Ranking loss is defined by

Ranking loss = (1/|T|) ∑_{i=1}^{|T|} |{(a, b) ∈ λ_i × λ̄_i | ψ_{i,a} ≤ ψ_{i,b}}| / (|λ_i| |λ̄_i|),

where λ̄_i is the complementary set of λ_i and ψ_{i,l} is the confidence score of label l for the i-th test pattern. Ranking loss measures the average fraction of (a, b) pairs with ψ_{i,a} ≤ ψ_{i,b} over all possible pairs of relevant and irrelevant labels. Finally, normalized coverage is defined as

Coverage = (1/|T|) ∑_{i=1}^{|T|} (max_{l ∈ λ_i} rank(ψ_{i,l}) − 1) / |L|,

where rank(·) returns the rank of the corresponding relevant label l ∈ λ_i according to ψ_{i,l} in nonincreasing order. Therefore, normalized coverage measures how many labels must be marked as positive for all relevant labels to be positive. Higher values of multilabel accuracy and lower values of hamming loss, ranking loss, and normalized coverage indicate good classification performance.
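Under the set-based definitions above, the first two metrics can be sketched as follows; label sets are represented as Python sets of label indices, and the function names are our own.

```python
def multilabel_accuracy(true_sets, pred_sets):
    """Average Jaccard similarity between true and predicted label sets."""
    return sum(len(t & p) / len(t | p)
               for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def hamming_loss(true_sets, pred_sets, n_labels):
    """Average size of the symmetric difference, normalized by |L|."""
    return sum(len(t ^ p)
               for t, p in zip(true_sets, pred_sets)) / (len(true_sets) * n_labels)
```

For instance, predicting {0} when the true set is {0, 1} contributes a Jaccard score of 0.5 and one wrongly assigned label to the hamming loss.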
Tables 2, 3, 4, and 5 list the experimental results for the different performance measures, averaged over the repeated experiments on the employed datasets. The best performance on each dataset is indicated in bold. In each table, the last column shows the average rank (Avg. rank) of each method over all the multilabel datasets. In terms of multilabel accuracy and ranking loss, the proposed method outperformed the GA, NSGA-II, and MPSOFS on all datasets. In terms of hamming loss, the proposed method outperformed the conventional methods on all datasets except TMC2007. In terms of normalized coverage, the proposed method outperformed the conventional methods on all datasets except Llog.
After measuring the performance of the methods on all datasets, we analyzed the results using statistical tools. We employed the Friedman test, a widely used statistical test for comparing multiple methods over a number of datasets [40]. Suppose there are k methods and N datasets, and let R_j denote the average rank of the j-th method under the null hypothesis (i.e., that all methods perform equally well). Then, the Friedman statistic F_F is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom:

F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F),

where χ²_F is defined as

χ²_F = (12N / (k(k + 1))) (∑_{j=1}^{k} R_j² − k(k + 1)²/4).

If F_F is larger than the critical value at a significance level α, the null hypothesis is rejected, implying that the compared methods perform differently. After the null hypothesis is rejected, the Bonferroni-Dunn test can be employed as a post hoc test [41]. The critical difference (CD) is used to compare the proposed method with one comparison method at a time. CD is defined as

CD = q_α √(k(k + 1) / (6N)),

where the critical value q_α is a constant determined by the number of methods and the significance level. If the difference between two methods' average ranks is greater than the CD, the better-ranking method is concluded to perform significantly better than the other. Because our experiment used four methods, including the proposed method, and 14 datasets, we set k = 4 and N = 14. We conducted the Friedman test at the significance level α = 0.05. Table 6 summarizes the results of the Friedman test. The critical value for 3 and 39 degrees of freedom is 2.845. The Friedman statistic F_F for every performance measure was above the critical value; thus, the null hypothesis that the compared methods perform equally well was rejected.
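The Friedman and Bonferroni-Dunn computations above are standard and can be reproduced as follows; the function names are our own.

```python
from math import sqrt

def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square and the Iman-Davenport F statistic
    for k methods compared over N datasets."""
    k = len(avg_ranks)
    chi2 = (12 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_f

def critical_difference(k, n_datasets, q_alpha=2.394):
    """Bonferroni-Dunn CD; q_alpha = 2.394 holds for k = 4 at alpha = 0.05."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n_datasets))
```

With k = 4 and N = 14 as in the experiments, `critical_difference(4, 14)` reproduces the CD of about 1.168 quoted below the Friedman results.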
For the Bonferroni-Dunn test, the calculated CD at α = 0.05 was 1.168, since q_α = 2.394 at this significance level. Figure 2 shows the CD diagrams for all evaluation measures, where the average rank of each method is marked at the top of each diagram. Our proposed method significantly outperforms the conventional methods on all evaluation measures.

Conclusion
To handle multilabel sensor datasets, we proposed an effective search based on a promising feature subset generation method for the multilabel feature selection problem. The main contribution of this work is to propose and validate a new feature subset generation method. Specifically, the proposed method generates candidate feature subsets using important features and chooses promising subsets without consuming significant computational cost. Experimental results show that our method converges faster than conventional methods. In the future, we would like to investigate a more effective feature subset generation procedure, because the proposed one depends strongly on the employed filter method and may produce redundant feature subsets during the search. In addition, we would like to apply the proposed method to various sensor datasets and compare its performance with conventional feature selection methods used in sensory data analysis.


Table 1: Standard characteristics of employed datasets.

Table 2: Comparison results in terms of multilabel accuracy.

Table 3: Comparison results in terms of hamming loss.

Table 4: Comparison results in terms of ranking loss.

Table 5: Comparison results in terms of normalized coverage.

Table 6: Summary of the Friedman statistics F_F (k = 4, N = 14) and the critical value for each evaluation measure.