Multi-Label Feature Selection with Conditional Mutual Information

Feature selection is an important way to optimize the efficiency and accuracy of classifiers. However, traditional feature selection methods cannot work with many kinds of real-world data, such as multi-label data. To overcome this challenge, multi-label feature selection has been developed. It plays an irreplaceable role in pattern recognition and data mining and can improve the efficiency and accuracy of multi-label classification. However, traditional multi-label feature selection based on mutual information does not fully consider the effect of redundancy among labels. This deficiency may lead to repeated computation of mutual information and leaves room to improve the accuracy of multi-label feature selection. To deal with this challenge, this paper proposes a multi-label feature selection algorithm based on conditional mutual information among labels (CRMIL). Firstly, we analyze how to reduce the redundancy among features based on existing papers. Secondly, we propose a new approach to diminish the redundancy among labels: it takes label sets as conditions when calculating the relevance between features and labels, which weakens the impact of label redundancy on the feature selection results. Finally, we analyze this algorithm and balance the effects of relevance and redundancy on the evaluation function. To test CRMIL, we compare it with eight other multi-label feature selection algorithms on ten datasets and use four evaluation criteria to examine the results. Experimental results illustrate that CRMIL performs better than the existing algorithms.


Introduction
In the era of big data, data in all fields are increasing explosively [1][2][3]. Therefore, feature selection has rapidly become a hot topic. Proper feature selection can improve the efficiency and accuracy of classifiers. Compared with traditional single-label feature selection, multi-label feature selection is more suitable for solving problems in the real world [4]. Therefore, multi-label feature selection applies to various fields, such as image processing [5,6], text categorization [7,8], and bioinformatics [9].
Multi-label feature selection algorithms usually consider how to reduce the influence of redundant information. The commonly used processing methods include the swarm intelligence algorithm [10], which regards features as individuals and groups of features as populations for reproduction, evolution, and mutation, reducing the redundancy of information and improving the algorithm's accuracy. Another idea is manifold learning [11]. This approach removes features that are useless to classifiers from the perspective of dimension reduction. A third approach considers the relevance between features and labels by calculating the mutual information between them [12]. This method helps judge which features should be kept. Much prior work has proved that mutual information is an efficient tool for extracting features [13,14]. Because mutual information is concise and effective [15], this paper explores multi-label feature selection based on mutual information.
Many multi-label feature selection algorithms have been based on mutual information [16][17][18]. Once the mutual information of two different features or two labels is greater than zero, redundancy appears. Although these algorithms have considered the relevance between features and labels and the redundancy among features, they do not adequately process the redundancy among labels, which eventually leads to unsatisfactory results. This paper proposes a new approach to deal with the redundancy among labels and a multi-label feature selection algorithm based on this approach. The rest of the paper reads as follows: In Section 2, related work is summarized. We then propose a new multi-label feature selection algorithm in Section 3. In Section 4, relevant experiments prove the efficiency of the proposed algorithm. In Section 5, we summarize this paper and outline directions for future work.
In summary, the study offers the following contributions: (i) We propose a new method to avoid repeated calculations on redundant label information. (ii) We propose a novel multi-label feature selection algorithm that obtains good results; it performs better on most datasets that have redundancy among labels. (iii) We design many experiments from different perspectives to test the proposed algorithm; some of them are innovative.

Related Work
In the early stage of multi-label feature selection, most proposed algorithms transformed multi-label datasets into multiple single-label datasets and processed them with traditional single-label feature selection algorithms. For example, the literature [19] divides a dataset D into q independent 0/1 datasets by Binary Relevance (BR) and transforms each possible label combination into a unique class by Label Powerset (LP). Then the new datasets are processed with Relief and a traditional single-label feature selection algorithm based on mutual information. However, this kind of algorithm cannot work on large datasets. To overcome this challenge, the literature [20] pruned the labels that infrequently appear in datasets. This approach can reduce the size of the final datasets. However, this kind of algorithm only transforms multi-label datasets into many single-label datasets, which may ignore the interactions between features and between labels in the original datasets. In recent years, many algorithm adaptation methods have been applied to high-dimension feature selection. For example, the literature [21] details two stages to implement feature selection on gene datasets: a greedy approach is used to assign the maximum number of samples to different gene classes in the first step, and clustering and lasso methods are selected to extract the remaining features in the second step. Additionally, a Deep Neural Network has been embedded into a high-dimension feature selection method [22]. To reduce the effects of outliers and noise in datasets, the literature [23] proposes Unsupervised Feature Selection with Robust Data Reconstruction (UFS-RDR) by minimizing a graph-regularized weighted data reconstruction error function; the relevant estimation tools are also developed. To evaluate the stability of high-dimension feature selection approaches, the literature [24] proposes a novel estimator considering the inter- and intra-stability of subsets.
These high-dimension feature selection algorithms provide ideas for multi-label feature selection. In particular, multi-label feature selection based on mutual information attracts extensive attention. The literature [25] considered the interaction between selected and unselected features and proposed MDMR, where S is the selected feature set and L is the label set. The literature [26] considers redundancy when computing the relevance between features and labels. This paper regards the redundancy existing among information as part of the relevance: the coefficient C should become greater when the selected features are strongly dependent on other features, and conversely, C should become smaller. Therefore, I(f_k, f_i) can be a part of C. Additionally, because C ∈ (0, 1), H(f_k) is used to normalize I(f_k, f_i), so that C takes the form I(f_k, f_i)/H(f_k). However, the algorithm directly combines the relevance and redundancy without further processing, which might leave the effects of relevance and redundancy unbalanced. To solve this problem, the literature [27] proposes granular feature selection, which transforms features into granular feature groups; after computing the relevance and redundancy, the results are divided by the size of the related sets.
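The normalization in [26] can be sketched numerically. The helper names below (`entropy`, `mutual_information`, `normalized_coefficient`) are illustrative, not from the cited papers; entropies are estimated in bits from discrete sample columns.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def normalized_coefficient(f_k, f_i):
    """C = I(f_k, f_i) / H(f_k); bounded by [0, 1], reaching the
    boundary 1 only in the degenerate identical-feature case."""
    return mutual_information(f_k, f_i) / entropy(f_k)

f_k = [0, 0, 1, 1]
f_i = [0, 1, 0, 1]    # independent of f_k -> C == 0
f_dup = [0, 0, 1, 1]  # identical to f_k  -> C == 1
print(normalized_coefficient(f_k, f_i), normalized_coefficient(f_k, f_dup))
```

The two extremes bracket the open interval C ∈ (0, 1) that the text describes for real, partially dependent features.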
In this idea, after computing the relevance and redundancy, the results are divided by the granularity |G| of the related sets. However, these algorithms do not consider the redundancy among labels. The literature [15,28] achieves better results after considering the redundancy among labels; the corresponding algorithms can be described by formula (5).
Although the redundancy among labels has been considered, the redundant information may be accumulated more than once. This problem is detailed in Section 3, where we also propose a solution.

Multi-Label Feature Selection considering Redundancy on Mutual Information of Labels (CRMIL)
Firstly, a problem in traditional multi-label feature selection is introduced. Many multi-label feature selection algorithms have been proposed to solve this problem, but they have shortcomings.
To improve the accuracy, we propose a new method to compute the redundancy among labels. This method can reduce the redundancy among labels while calculating the relevance between features and labels. Then, the redundancy among features is computed. Finally, we propose the new multi-label feature selection algorithm and detail its pseudocode.

A Problem.
Traditional multi-label feature selection, which does not consider redundancy among labels, might encounter the following problem. Figures 1 and 2 show that Feature A and Feature B contain 16% and 20% of useful information, respectively, so Feature B should be selected. If the redundancy among labels is ignored, the valuable information apparently provided by Feature A and Feature B is 24% and 20%, respectively, and Feature A will be selected because of the redundancy among labels. After considering the redundancy among labels, the mutual information between features and labels is 16% and 20%, respectively, and Feature B will be selected. Therefore, the redundancy among labels is worth considering. The following parts detail how we design a multi-label feature selection algorithm that considers this redundancy.
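The toy numbers above can be checked by arithmetic. The split is a hypothetical reading of the figures: the two labels share 8% of their information (the gap between 24% and 16%), all of Feature A's useful information lies inside that shared region, and a naive per-label sum counts it twice.

```python
# Hypothetical information fractions from the Figure 1/2 example.
unique_info_A = 0.16   # Feature A's useful information, all in the label overlap
unique_info_B = 0.20   # Feature B's useful information, outside the overlap
label_overlap = 0.08   # information the two labels share (double-counted naively)

# Summing I(f, l) over each label separately counts the overlap twice.
naive_A = unique_info_A + label_overlap   # 0.24, as in the text
naive_B = unique_info_B                   # 0.20, unchanged

print(naive_A > naive_B)                  # naive score wrongly prefers Feature A
print(unique_info_A < unique_info_B)      # true usefulness prefers Feature B
```

The naive score flips the ranking, which is exactly the failure mode the conditional formulation is meant to avoid.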

Multi-Label Conditional Mutual Information.
Existing multi-label feature selection algorithms usually use conditional mutual information to calculate the redundancy among labels. In the literature [15,28], I(f, l_i | l_j) is essential to compute the redundancy among labels. However, these algorithms enumerate every label as a condition and sum up all the conditional mutual information.
The sum can be regarded as the relevance between features and labels with the redundancy among labels diminished, as in formula (6). Once more than two labels contain the same information, the overlapping information will be counted more than once. This situation may reduce the accuracy of the result.
Here f is the pending feature, and l_i and l_j are label elements that differ at any time. Formula (6) has been proved and detailed in the literature [22]. We propose that regarding part of the label set as the condition of the mutual information can overcome this challenge. In the proposed multi-label feature selection algorithm, the part that computes the redundancy among labels can be written as formula (7):

Σ_{l_i ∈ L} I(f, l_i | Y),  (7)

where Y = {l_j | l_j ∈ L, l_j ≠ l_i}. This can reduce the effects of the redundancy among labels.
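The conditional term in formula (7) can be estimated from discrete samples via the identity I(X; Y | Z) = H(X, Z) + H(Y, Z) − H(Z) − H(X, Y, Z). A minimal sketch (function names are illustrative; the condition Y is formed by zipping the remaining label columns into one joint variable):

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())

def conditional_mi(x, y, z):
    """I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)

def relevance(feature, labels):
    """Formula (7): sum of I(f, l_i | Y) over labels, where Y is the
    joint variable built from all labels except l_i."""
    total = 0.0
    for i, label in enumerate(labels):
        rest = [l for j, l in enumerate(labels) if j != i]
        y = list(zip(*rest)) if rest else [0] * len(feature)  # constant if no condition
        total += conditional_mi(feature, label, y)
    return total

# With a trivial (constant) condition, CMI reduces to plain mutual information.
print(conditional_mi([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]))  # -> 1.0 bit
```

Conditioning on the joint remaining-label variable, rather than on each label in turn, is what prevents shared label information from being accumulated more than once.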
Proof. Compared with the traditional formula (6), formula (7) does not sum over every label as a separate condition. Therefore, this method obtains the better result described in Section 3.1: formula (7) avoids the repeated calculation of information that many labels contain. □

3.3. Alleviate the Redundancy among Features. After considering the redundancy among labels, the proposed algorithm calculates the redundancy among features. Mutual information can reflect the total information shared by two random variables, and in feature selection, features can be seen as random variables. Therefore, we regard the mutual information of all pairs of features as the redundancy among features. Then, when a new feature is considered, the redundancy of features is computed by formula (9):

Σ_{f_i ∈ S} I(f, f_i),  (9)

where f is a pending feature and S is the selected feature set.
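Formula (9)'s pairwise accumulation, under the same discrete-estimation assumptions as above (helper names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def feature_redundancy(pending, selected):
    """Formula (9): sum of I(f, f_i) over already-selected features f_i."""
    return sum(mutual_information(pending, f_i) for f_i in selected)

pending = [0, 0, 1, 1]
selected = [[0, 0, 1, 1],   # duplicate of pending: contributes 1 bit
            [0, 1, 0, 1]]   # independent column: contributes 0 bits
print(feature_redundancy(pending, selected))  # -> 1.0
```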

Proposed Algorithms.
Based on the above analysis, features with a larger value of formula (8) and a smaller value of formula (9) should be selected. After analyzing the relevance and redundancy of information, we use the sizes of the label set (α/|L|) and the selected feature set (β/|S|) to balance the effects of relevance and redundancy on the results. α and β adjust the importance of the label set and the selected feature set, respectively; we choose α = β = 1 (this choice is validated in Section 4.3). Finally, we propose the new multi-label feature selection algorithm (CRMIL), whose evaluation function can be defined as

J(f_k) = (α/|L|) Σ_{l_i ∈ L} I(f_k, l_i | Y) − (β/|S|) Σ_{f_i ∈ S} I(f_k, f_i),

where Y = {l_j | l_j ∈ L, l_j ≠ l_i} and f_k is a pending feature.
Property 2. J(f_k) ∈ (−1, 0) when most of the relevance between features and labels satisfies 0 < I(f_k, l_i | Y) < α (α → 0) and most of the redundancy among features is large. However, this is hardly the case in normal datasets. □

Property 3. Because the size of the datasets is considered, in normal datasets J(f_k) ∈ (0, 1).
In the beginning, S is empty. To choose k features, we need k steps. In every step, we choose the feature with the largest J(f_k), put the selected feature into S, and delete it from the feature set. Finally, the output is a k-dimensional vector containing the indices of the selected features. The proposed algorithm requires a feature set F, a label set L, and the number of features K, and returns the index set of selected features. Lines 1-2: initialize the index set of selected features and the number of selected features k. Lines 3-7: preprocess the relevance between features and labels using formula (7). Lines 8-22: select k features by iterating. Among these lines, lines 9-10 select the first feature; the feature with the greatest relevance is selected because there is no element in the selected feature set yet. Lines 12-17: the redundancy among features is calculated using formula (8). Lines 18-20: after a feature is selected, it is added to the selected feature set and deleted from the original feature set. Finally, the index set of selected features is returned.
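The greedy loop described above can be sketched as follows. This is a reading of the pseudocode rather than the authors' implementation: names are illustrative, entropies are estimated in bits from discrete columns, α = β = 1 as chosen in the text, and the 1/|S| factor is skipped for the first pick, when S is still empty.

```python
from collections import Counter
from math import log2

def entropy(*cols):
    joint = list(zip(*cols))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())

def cmi(x, y, z):
    return entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)

def mi(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def crmil(features, labels, K, alpha=1.0, beta=1.0):
    """Greedy CRMIL selection; returns indices of K selected features."""
    n_samples = len(labels[0])

    def relevance(f):  # formula (7): condition on the joined remaining labels
        total = 0.0
        for i, label in enumerate(labels):
            rest = [l for j, l in enumerate(labels) if j != i]
            y = list(zip(*rest)) if rest else [0] * n_samples
            total += cmi(f, label, y)
        return total

    remaining = list(range(len(features)))
    selected = []
    while len(selected) < K and remaining:
        if not selected:   # first pick: greatest relevance only
            best = max(remaining, key=lambda i: relevance(features[i]))
        else:              # later picks: relevance minus feature redundancy
            def score(i):
                rel = alpha / len(labels) * relevance(features[i])
                red = beta / len(selected) * sum(
                    mi(features[i], features[j]) for j in selected)
                return rel - red
            best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Tiny sanity check: feature 0 mirrors label 0; feature 1 is constant noise.
X = [[0, 0, 1, 1, 0, 1], [0, 0, 0, 0, 0, 0]]
L = [[0, 0, 1, 1, 0, 1], [0, 1, 0, 1, 0, 1]]
print(crmil(X, L, K=1))  # -> [0]
```

The informative feature is chosen first, matching the algorithm's intent; a real implementation would precompute the relevance table once (lines 3-7 of the pseudocode) instead of recomputing it inside the loop.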

Time Complexity Analysis.
In the following explanation, N is the number of samples, |F| is the number of features, and |L| is the number of labels. The time complexity of the proposed algorithm consists of three main parts. Firstly, processing the mutual information among features needs to enumerate two different features.
This step consumes O(|F|^2), and calculating information entropy needs O(N); therefore, this part consumes O(N|F|^2). Secondly, the proposed algorithm preprocesses the relevance between features and labels, which is the main part of the algorithm. Enumerating every feature and label consumes O(|F||L|), and computing the conditional mutual information consumes O(N|L|); therefore, the time complexity of this part is O(N|L|^2|F|). Thirdly, the algorithm needs to select K features. In every selection, pending features and selected features need to be enumerated simultaneously, which consumes O(|F|^2) at most; therefore, the upper bound on this part is O(K|F|^2). As a result, the algorithm's time complexity is max(O(N|F|^2), O(N|L|^2|F|)), which depends on the kind of data in the datasets.
Following the time-cost test of prior work [29], we use an Intel(R) Core(TM) i9-9880H CPU @ 2.30 GHz to measure the time cost on different datasets. All results are averages over five runs. For example, when a dataset consists of 850 instances, 1000 features, and 50 labels, selection takes 9.2 s on average. When the number of instances is doubled, the dataset costs around 17.3 s; furthermore, if the number of features is halved, the time needed is around 2.1 s. These results suggest that the time complexity analysis holds in practice.

Evaluation Criteria.
This paper uses four evaluation criteria to examine the results of multi-label feature selection: Hamming Loss, Average Precision, One Error, and Ranking Loss. These criteria are commonly used in multi-label feature selection papers [36,37]. Hamming Loss can be defined as

Hamming Loss = (1/N) Σ_{i=1}^{N} (1/|L|) |L_i' ⊕ L_i|,

where L_i' is the predicted label set for each sample, L_i is the real label set, and ⊕ is the XOR (symmetric difference) operation. Hamming Loss reflects the misclassification of every single label; the lower the Hamming Loss, the better the classification performance. Average Precision can be defined as

Average Precision = (1/N) Σ_{i=1}^{N} (1/|L_i|) Σ_{l ∈ L_i} |{l' ∈ L_i | rank(f_i, l') ≤ rank(f_i, l)}| / rank(f_i, l),

where |L_i| is the size of each sample's relevant label set, and rank(f, l) records the rank of l after all labels are sorted in descending order of predicted value. Average Precision reflects the average fraction of relevant labels ranked higher than a specific label; the greater the Average Precision, the better the classification performance. One Error can be defined as

One Error = (1/N) Σ_{i=1}^{N} [arg max_l f(f_i, l) ∉ L_i].

One Error records the percentage of samples whose label with the highest predicted value is not contained in the relevant label set; the lower the One Error, the better the classification performance. Ranking Loss can be defined as

Ranking Loss = (1/N) Σ_{i=1}^{N} (1/(|L_i||L̄_i|)) |{(l_a, l_b) | f(f_i, l_a) ≤ f(f_i, l_b), l_a ∈ L_i, l_b ∈ L̄_i}|,

where f(f, l) is the likelihood that l is a proper label of f, and L̄_i is the complementary set of L_i. Ranking Loss reflects the average fraction of wrongly ordered label pairs.

Algorithm 1: CRMIL.
Input: a feature set F, a label set L, and the number of selected features K.
Output: selected feature subset S.
(1) S ← ∅
(2) k ← 0
(3) for i = 1 to n do
(4)   for j = 1 to m do
(5)     calculate the relevance between f_i and l_j
(6)   end for
(7) end for
(8) while k < K do
(9)   if k == 0 then
(10)    select the feature f_i with the greatest relevance
(11)  else
(12)    for every element f_i in F do
(13)      for every element f_j in S do
(14)        sum the redundancy between f_i and f_j
(15)      end for
(16)      calculate J(f_i) according to formula (16)
(17)    end for
(18)    select the feature f_i with arg max J(f_i)
(19)  end if
(20)  S ← S ∪ {f_i}; F ← F \ {f_i}
(21)  k ← k + 1
(22) end while
return S
The lower the Ranking Loss, the better the classification performance.
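Three of the four criteria can be computed directly from their definitions (Average Precision is analogous, built from the same per-sample rankings). A self-contained sketch with illustrative names: `Y_true` holds the relevant label-index sets, `Y_pred` the predicted sets, and `scores` the per-label confidence values.

```python
def hamming_loss(Y_true, Y_pred, n_labels):
    """Fraction of label slots predicted wrongly (lower is better)."""
    n = len(Y_true)
    wrong = sum(len(t ^ p) for t, p in zip(Y_true, Y_pred))  # ^ = symmetric difference
    return wrong / (n * n_labels)

def one_error(Y_true, scores):
    """Fraction of samples whose top-ranked label is not relevant (lower is better)."""
    misses = sum(1 for t, s in zip(Y_true, scores)
                 if max(range(len(s)), key=s.__getitem__) not in t)
    return misses / len(Y_true)

def ranking_loss(Y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly."""
    total = 0.0
    for t, s in zip(Y_true, scores):
        irrelevant = set(range(len(s))) - t
        if not t or not irrelevant:
            continue  # undefined for all-relevant or all-irrelevant samples
        bad = sum(1 for a in t for b in irrelevant if s[a] <= s[b])
        total += bad / (len(t) * len(irrelevant))
    return total / len(Y_true)

Y_true = [{0}, {1}]
Y_pred = [{0, 1}, {1}]
scores = [[0.9, 0.1], [0.8, 0.2]]   # sample 2 ranks the wrong label first
print(hamming_loss(Y_true, Y_pred, n_labels=2))  # -> 0.25
print(one_error(Y_true, scores))                 # -> 0.5
print(ranking_loss(Y_true, scores))              # -> 0.5
```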

4.2. Datasets. The ten datasets are from the Mulan Library [38], and Table 1 lists their detailed information. The domains of Corel5k, Flags, and Scene are images; Delicious, Medical, and Enron are text; GenBase is biology. The ten datasets cover various orders of magnitude in the number of instances, features, and labels. Additionally, the datasets include different types of features, such as binary and polybasic. For the experiments, every dataset has been divided into a training set and a test set by referring to the recommended split of the Mulan Library.

4.3. Experiment 1. The visualized results are represented in Figure 3. The corresponding bars are the lowest when α is equal to β. Moreover, the larger the ratio of α to β, the worse the results roughly become. This indicates that CRMIL selects the best feature subset when α is equal to β; therefore, the constant in formula (15) is suitable.

Experiment 2.
To explore the comparative performance of CRMIL on different datasets, we test CRMIL and the other eight multi-label feature selection algorithms on the mentioned datasets. The results are evaluated by Hamming Loss, Average Precision, One Error, and Ranking Loss. Tables 2-5 present all experimental results in detail.
These experimental results are averages over five simulations, after which they tend to stabilize.
According to Hamming Loss, CRMIL performs better than the best-performing of the other eight algorithms on the ten datasets. For example, CRMIL is 25.8%, 23.5%, and 12.8% better than AMI on Enron, Corel5k, and Delicious, respectively. Compared with FSSL, CRMIL improves the target by 9% on Flags, 15.9% on Medical, and 9.9% on Scene.

Experiment 3.
To study how many features should be selected before CRMIL achieves stable experimental results, we record the results on Flags and Scene with increasing numbers of selected features. This experiment indicates that CRMIL converges faster: compared with other algorithms, CRMIL achieves better results and tends to become stable when the number of selected features is small.

Experiment 4.
To further explore the performance of CRMIL and investigate the improvement obtained when algorithms consider the redundancy among labels, we make a comparative experiment with SCLS as the baseline. SCLS performs multi-label feature selection using mutual information without considering the redundancy among labels. By measuring both the redundancy among labels and the improvement of the results on every dataset, we can relate label redundancy to the improvement obtained by CRMIL; to some extent, this verifies the efficiency of CRMIL on label-redundant datasets.
We take the mean of the optimization percentages of CRMIL over SCLS on Hamming Loss, Average Precision, One Error, and Ranking Loss as the measure of improvement. Table 6 details the mean values. Additionally, to show the relation directly, we report both the redundancy between every two labels and the total label redundancy of every dataset. To illustrate the redundancy between every two labels, we use heatmaps (Figures 12-16) of five datasets: both the x and y axes represent labels, and the heat represents the redundancy between every two labels, with brighter colors meaning more redundancy. From Figures 12-16, we can see that the heatmaps become brighter, which means the redundancy between every two labels increases in the order of Corel5k, Delicious, Medical, Scene, and Flags. According to Table 6 and Figures 12-16, the proposed algorithm obtains better results when more redundancy exists among labels. To describe the total label redundancy of the datasets, we use formula (18) to compute the redundancy value among labels; Table 7 records the results.
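Since formula (18) is not reproduced in this copy, the total label redundancy can be sketched under the assumption that it accumulates the pairwise mutual information over all label pairs; the function name is illustrative.

```python
from collections import Counter
from math import log2
from itertools import combinations

def entropy(*cols):
    joint = list(zip(*cols))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())

def mi(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def label_redundancy(labels):
    """Hypothetical reading of formula (18): sum of I(l_i, l_j) over label pairs."""
    return sum(mi(a, b) for a, b in combinations(labels, 2))

# Two identical labels share 1 bit; a third independent label adds nothing.
L = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(label_redundancy(L))  # -> 1.0
```

A matrix of the same pairwise values is what the heatmaps in Figures 12-16 visualize.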
According to Tables 6 and 7, the larger the redundancy among labels, the better CRMIL performs. As shown in Figure 17, the improvement of the results is roughly proportional to the redundancy among labels.

Conclusion and Future Work
In recent years, multi-label feature selection has become a hot topic. However, the existing multi-label feature selection algorithms have not fully considered the redundancy among labels.
This paper proposes a new multi-label feature selection algorithm (CRMIL) that takes the label set as the condition when computing the mutual information between features and labels.
To test the performance of this algorithm, we compare CRMIL with eight existing multi-label feature selection algorithms (SCLS, D2F, FIMF, PMU, AMI, NMDG, FSSL, and MFS-MCDM) on ten commonly used datasets (Corel5k, Delicious, Flags, Medical, Scene, Enron, GenBase, Social, Yeast, and Emotions) and use four evaluation criteria to examine the results. However, in the proposed multi-label feature selection algorithm, when the redundancy among labels is too dense, part of the mutual information may not be counted in the final result, which can reduce the accuracy of the results. We may implement more high-dimension methods to partly overcome these challenges. In the future, we will take more special cases into account, study how to deal with the redundancy among labels more reasonably, and make the relevance between features and labels closer to the real value.
Data Availability

The data that support the findings of this study are available from the author upon reasonable request.

Conflicts of Interest
The author declares no conflicts of interest.