Feature Selection and Overlapping Clustering-Based Multilabel Classification Model

Multilabel classification (MLC) learning, which is widely applied in real-world applications, is a very important problem in machine learning. Some studies show that a clustering-based MLC framework performs effectively compared with a nonclustering framework. In this paper, we explore the clustering-based MLC problem. Multilabel feature selection also plays an important role in classification learning because many redundant and irrelevant features can degrade performance, and a good feature selection algorithm can reduce computational complexity and improve classification accuracy. In this study, we consider feature dependence and feature interaction simultaneously, and we propose a multilabel feature selection algorithm as a preprocessing stage before MLC. Typically, existing clustering-based MLC frameworks employ a hard clustering method, which assigns each instance of a multilabel dataset to a single cluster; however, the overlapping nature of multilabel instances means that, in real-life applications, an instance may not belong to only a single class. Therefore, we propose an MLC model that combines feature selection with an overlapping clustering algorithm. Experimental results demonstrate that different clustering algorithms yield different MLC performance and that the proposed overlapping clustering-based MLC model may be more suitable.


Introduction
The multilabel classification (MLC) problem, which is applicable to a wide variety of domains, such as music classification and bioinformatics [1], has received increasing attention. However, situations where single instances are associated with multiple labels remain challenging. Most algorithms treat such MLC tasks as multiple binary classification tasks; however, this approach may not consider potential correlations among features and labels.
A good MLC solution must be effective and efficient; however, a large number of redundant and irrelevant attributes may increase computational costs and the time required to learn and test a multilabel classifier, which reduces classification performance. Feature selection, an important technique in data mining and machine learning, has been widely used in classification models to enhance performance. Selecting features before applying classification methods to the original datasets has many advantages, such as refining the data, reducing computational costs, and improving classification accuracy [2,3]. Therefore, we utilise a feature selection algorithm to improve the quality of MLC.
Various feature selection methods have been proposed, based, for example, on statistics, rough sets, information gain, and mutual information (MI). A wide variety of research has shown that no single feature selection method can handle all situations. Many studies have demonstrated that MI-based feature selection methods are effective and efficient because MI can handle different types of attributes, makes no assumptions about the data distribution, and can measure nonlinear relations between variables [4]. Recently, many algorithms that select significant features for MLC have been proposed; however, most of these methods do not consider that a single attribute may affect various labels differently. The concept of interaction information has become more relevant because it can reflect the relevance, redundancy, and complementarity among attributes and labels; thus, it supports effective feature selection. In this study, we propose an algorithm to improve MLC performance by selecting significant attributes based on the interaction information between attributes and labels.
Some studies have shown that clustering-based MLC methods can improve predictive performance and reduce time costs; however, those studies used nonoverlapping clustering methods to handle multilabel datasets. In MLC, one object may belong to multiple classes; however, algorithms based on nonoverlapping clustering, that is, hard division methods, do not consider such situations. In contrast, overlapping clustering-based methods account for this situation when they handle datasets. Therefore, we propose an overlapping clustering-based MLC (OCBMLC) model.
The remainder of this paper is organised as follows: Section 2 describes related work, Section 3 provides background information, Section 4 describes the proposed multilabel feature selection algorithm and MLC model, Section 5 introduces experimental data, evaluation criteria, and experimental results, and conclusions and suggestions for future work are presented in Section 6.
Related Work

Problem transformation (PT) methods convert multilabel data to single-label data; thus, a single-label classification method can be used. Label powerset, binary relevance [10], and random ensemble learning with k-label sets [11] are classic PT methods. Algorithm adaptation (AA) approaches extend single-label algorithms to process multilabel data directly. BP-MLL [12] and ML-KNN are two popular AA methods. BP-MLL is a widely used MLC backpropagation algorithm; an important characteristic of this algorithm is the introduction of an error function that considers multiple labels. The ML-KNN AA method [13] determines the labels of a new object using the maximum a posteriori principle: it obtains a label set based on the statistical information of the label sets of the k-nearest neighbours of a test instance.
Many studies have shown that redundant and irrelevant features can increase computational costs, reduce performance, and result in overfitting. These problems also exist in MLC, and many feature selection methods have been proposed to address them. Battiti [14] proposed the Mutual Information Feature Selection algorithm, which selects features by a maximum relevance term. Peng et al. [15] introduced an improved algorithm, minimal-redundancy maximal-relevance (mRMR), and Lin et al. [16] proposed a multilabel feature selection algorithm that combines MI with max-dependency and min-redundancy. In addition, over the past few years, unsupervised, clustering, and other technologies have been used to reduce dimensionality. For example, Li et al. [17] proposed a clustering-guided sparse structural learning algorithm that integrates clustering and a sparse structure in a unified framework to select the most useful features. They also proposed an algorithm [18] that employs nonnegative spectral clustering and controls the redundancy between features to select significant features. Cai et al. [19] presented the Unified Sparse Subspace Learning (USSL) framework, which employs a dimension reduction technique that incorporates a subspace learning method; the USSL framework has demonstrated good performance. Li et al. [20] proposed the Robust Structured Subspace Learning (RSSL) framework, which combines subspace learning theory and feature learning. Their experimental results demonstrated that the RSSL framework performed well for image understanding tasks.
Recently, Kommu et al. [21] proposed two methods based on probabilistic theory to solve multilabel learning problems. In the first method, their algorithm uses logistic regression and a nearest neighbour classifier for MLC; note that partial information is used in this approach. In the second method, their algorithm deals with the concept of grouping related labels, and association rules are introduced. Guo and Li [22] proposed the Improved Conditional Dependency Networks framework for MLC; this method uses label correlations in the training stage and CDNs in the testing stage. Yu et al. [23] used a rough sets approach for MLC that considers the associations between labels, and they evaluated its performance using seven multilabel datasets.
Nasierding et al. [24,25] presented an effective clustering-based MLC (CBMLC) framework that combines a clustering algorithm with an MLC algorithm. Various clustering methods, such as k-means, EM, and Sequential Information Bottleneck, are used for training; note that, with this framework, labels are ignored during the training phase. Nasierding et al. [26] compared clustering and nonclustering MLC methods for image and video annotation. Tahir et al. [27] proposed a method that combines a multilabel learning approach with fusion techniques; they used various multilabel learners to select a label set and demonstrated that ensemble techniques can avoid the disadvantages of individual learners.

Entropy and Mutual Information.
In this section, we introduce the theories of entropy and MI. Here, we assume that all variables are discrete or that continuous attributes can be discretised. Shannon's entropy [28] is the uncertainty measure of a random variable, and it has been widely used in various domains. Let X \in \{x_1, x_2, \ldots, x_n\} be a discrete variable and p(x_i) = \Pr\{X = x_i\} be its probability density function. Formally, the entropy of X is defined as follows:

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i).

Assume that X and Y \in \{y_1, y_2, \ldots, y_m\} are two discrete random variables, and let p(x_i, y_j) be the joint probability of X and Y. The joint entropy H(X, Y) is defined as follows:

H(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i, y_j).

If the value of random variable X is known and variable Y is not, the remaining uncertainty of Y can be measured by the conditional entropy, defined as follows:

H(Y \mid X) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(y_j \mid x_i).

The minimum value of H(Y \mid X) is zero, attained when Y is completely determined by X; the conditional entropy is maximal when the two variables are statistically independent.
The relationship between conditional and joint entropy can be expressed as follows:

H(X, Y) = H(X) + H(Y \mid X).

MI is the amount of information shared by two variables and is defined as follows:

I(X; Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}.

Note that I(X; Y) is zero when the two random variables are statistically independent. The relation between MI and entropy can be expressed as follows:

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).

Let Z = \{z_1, z_2, \ldots, z_k\} be a third random variable. The conditional MI and joint MI represent the information of two variables in the context of a third variable and are defined as follows:

I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z),
I(X, Y; Z) = I(X; Z \mid Y) + I(Y; Z).

Multi-information, which was introduced by McGill [29], extends MI beyond two random variables and can measure the interaction among three of them. Mathematically, multi-information is defined as follows:

I(X; Y; Z) = I(X, Y; Z) - I(X; Z) - I(Y; Z).
Multi-information can be positive, negative, or zero [30]. If the multi-information value is zero, the random variables are independent in the context of the third variable. If the value is negative, the variables carry redundant information, and a positive value indicates that together the random variables provide more information than each variable taken individually.
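To make these quantities concrete, the following sketch estimates entropy, MI, and three-way multi-information from discrete samples. The function names are illustrative assumptions, not code from the paper; the XOR example shows the positive (synergistic) case described above:

```python
from collections import Counter
import math

def entropy(xs):
    """Shannon entropy H(X) of a discrete sequence, in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    joint = list(zip(xs, ys))
    return entropy(xs) + entropy(ys) - entropy(joint)

def interaction_information(xs, ys, zs):
    """I(X;Y;Z) = I(X,Y;Z) - I(X;Z) - I(Y;Z) (McGill's multi-information)."""
    xy = list(zip(xs, ys))
    return (mutual_information(xy, zs)
            - mutual_information(xs, zs)
            - mutual_information(ys, zs))

# XOR example: z = x ^ y. Neither x nor y alone tells us anything about z,
# but together they determine it, so the multi-information is positive.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]
print(interaction_information(x, y, z))  # 1.0
```

A negative value would appear if, for instance, x and y were copies of each other, since they would then carry redundant information about z.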

Overlapping Clustering Algorithm.
Fuzzy C-means (FCM) algorithms are widely used in fuzzy clustering learning. Fuzzy clustering, a type of overlapping clustering, differs from hard clustering. The FCM clustering algorithm assigns data points (examples) to clusters, and the fuzzy membership of a data point indicates the extent to which it pertains to each cluster [31].
Note that examples can belong to more than one cluster with different degrees of membership. The FCM algorithm minimises the following objective function [32]:

J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2},

where U = \{u_{ij}\} is the membership degree matrix, the parameter m \in (1, +\infty) is the weight exponent that defines the fuzziness of the resulting clusters, and d_{ij} = \|x_j - v_i\| is the Euclidean distance between object x_j and the cluster centre v_i. The objective function is minimised by alternately updating the cluster centres and the partition matrix:

v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}},

u_{ij} = \frac{1}{\sum_{k=1}^{c} (d_{ij} / d_{kj})^{2/(m-1)}},

where u_{ij} is the membership value of the jth object in the ith cluster, c is the number of clusters, and v_i is the centre of the ith cluster.
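The alternating updates above can be sketched compactly with NumPy. This is a minimal illustration with a fixed iteration count and random membership initialisation (both assumptions for brevity), not a production FCM implementation:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: alternate the centre and membership
    updates that minimise J_m = sum_i sum_j u_ij^m ||x_j - v_i||^2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                # memberships sum to 1 per object
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # weighted cluster centres
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))                 # u_ij proportional to d_ij^{-2/(m-1)}
        U = inv / inv.sum(axis=0)
    return U, V

# Two separated groups: each point ends up with high membership in one
# cluster, but every membership column still sums to 1 across clusters.
X = np.vstack([np.arange(10).reshape(5, 2) * 0.1,
               10.0 + np.arange(10).reshape(5, 2) * 0.1])
U, V = fcm(X, 2)
```

Unlike k-means, the returned matrix U keeps a graded membership for every cluster, which is exactly the overlap property exploited later by the proposed model.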

Proposed Multilabel Feature Selection Algorithm.
Ideally, multilabel feature selection would search for the feature subset S that maximises the joint MI between the subset and the label set, I(S; L). Such a method is impractical because it is difficult to estimate the required high-dimensional probability density functions. Therefore, efficient methods have been proposed to approximate this ideal solution [14-16]. Generally, most MI-based multilabel feature selection methods consider relevance and redundancy terms. In practice, such methods and their variants calculate the MI between a candidate feature and the selected feature subset; however, they do not sufficiently consider interaction information among attributes and class labels, ignore feature cooperation, and treat all features as purely competitive.
We know that a candidate feature for multilabel feature selection should have one of the highest MI values with respect to all class labels. This is referred to as the relevance term. Multilabel feature relevance terms have been defined previously, and we use the following definition.

Definition 1. Let f_k denote a candidate feature and l_i \in L be a class label. The relevance term is expressed as follows:

D(f_k) = \sum_{l_i \in L} I(f_k; l_i). (14)

We can obtain two properties according to this definition (Properties 2 and 3), and according to these properties we can use Definition 1 to select relevant candidate features. However, classes combined with previously selected features may produce interaction. Therefore, we should consider the interaction information among the candidate feature, the selected features, and the classes during feature selection. Differing from existing feature selection methods, we consider both the interaction information between a feature and a single class and the pairwise interaction between features and all class labels. Our interaction metric is defined as follows:

V(f_k, f_j) = \sum_{l_i \in L} I(f_k; f_j; l_i). (15)

Here, S is the selected feature subset, f_j \in S is a previously selected feature, L denotes the label set, and f_k denotes the candidate feature.
It is well known that multilabel feature selection attempts to select the set of features with the highest discrimination power for all labels. According to the above discussion, we combine (14) and (15) using a maximum-of-the-minimum criterion on feature interaction to propose a new goal function, referred to as max-dependence and interaction (MDI), for multilabel feature selection. Here, the candidate features are considered to have the highest relevance to, and beneficial interaction with, all class labels. The proposed MDI goal function is expressed as follows:

J_{MDI} = \max_{f_k \in F \setminus S} \Big[ \sum_{l_i \in L} I(f_k; l_i) + \min_{f_j \in S} \sum_{l_i \in L} I(f_k; f_j; l_i) \Big]. (16)

In this function, the first term is the relevance between the candidate feature and all class labels, and the second term focuses on the interaction information among f_k, f_j, and L. The goal function can therefore select features with the greatest discrimination power. The pseudocode of the proposed algorithm is given in Pseudocode 1.
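A greedy forward search under this kind of MDI-style score can be sketched as follows. The helper names (`H`, `I2`, `I3`, `mdi_select`) and the exact scoring are illustrative assumptions, not the paper's Pseudocode 1:

```python
from collections import Counter
import math

def H(*cols):
    """Joint Shannon entropy of one or more discrete columns, in bits."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in Counter(rows).values())

def I2(a, b):
    """Mutual information I(A;B) = H(A) + H(B) - H(A,B)."""
    return H(a) + H(b) - H(a, b)

def I3(a, b, c):
    """Multi-information I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)."""
    ab = list(zip(a, b))
    return I2(ab, c) - I2(a, c) - I2(b, c)

def mdi_select(features, labels, k):
    """Greedy forward selection: score each candidate by its summed relevance
    to all labels plus the minimum summed interaction with the features
    already selected, and pick the best candidate at every step."""
    selected, candidates = [], list(range(len(features)))
    for _ in range(k):
        def score(f):
            rel = sum(I2(features[f], l) for l in labels)
            if not selected:
                return rel
            inter = min(sum(I3(features[f], features[s], l) for l in labels)
                        for s in selected)
            return rel + inter
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# A feature identical to the label is picked before an uninformative one.
label = [0, 1, 0, 1, 0, 1, 0, 1]
noise = [0, 0, 0, 0, 1, 1, 1, 1]
print(mdi_select([noise, label], [label], 1))  # [1]
```

Note that only the first pick uses pure relevance; every later pick is penalised or rewarded by its worst-case interaction with the already-selected subset, which is where this sketch differs from a plain max-dependence search.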

Proposed Multilabel Classification Model.
Some experimental results show that CBMLC methods can improve predictive classification performance and reduce algorithm training time compared to existing popular multilabel methods [24-26]. The results of those models show that the classification performance of clustering-based methods is effective. However, those algorithms used nonoverlapping clustering methods, such as EM and k-means, prior to MLC. Consequently, such methods partition the original data into several disjoint clusters.
Clustering methods are usually classified into hard clustering and fuzzy clustering. In hard clustering, each instance is assigned to a single cluster. However, due to the overlapping nature of instances in real-world applications, they generally do not belong to only a single class. This property limits the practical application of hard clustering, especially for MLC.
FCM is an effective classic fuzzy clustering method based on an objective function concept and is widely used in clustering. The FCM approach uses an alternating optimisation strategy to solve nonlinear, unsupervised clustering problems. In multilabel data, one instance may be associated with multiple classes, and the FCM algorithm can handle an instance that belongs to more than one cluster simultaneously. This allows the use of a fuzzy clustering method that assigns a single object to several clusters. Therefore, we propose the OCBMLC model in combination with the FCM algorithm to improve performance. Figure 1 shows the basic procedure of the proposed OCBMLC model.
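To make the flow concrete, the following sketch combines fuzzy memberships with per-cluster classifiers at prediction time. It is a simplified stand-in for the proposed model: the function name, the membership threshold, the 1-NN base learner (in place of ML-KNN), and the membership-weighted vote are all illustrative assumptions:

```python
import numpy as np

def ocbmlc_predict(X_train, Y_train, X_test, U_train, U_test, threshold=0.3):
    """Sketch of an OCBMLC-style prediction: given fuzzy memberships U
    (clusters x instances) from FCM, fit one 1-NN multilabel predictor per
    cluster on the instances whose membership exceeds `threshold`, then
    combine the per-cluster predictions for each test instance, weighted
    by that instance's cluster memberships."""
    c = U_train.shape[0]
    scores = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for k in range(c):
        idx = np.where(U_train[k] >= threshold)[0]   # members of cluster k
        if idx.size == 0:
            continue
        # 1-NN inside cluster k: copy the labels of the nearest member
        d = np.linalg.norm(X_test[:, None, :] - X_train[idx][None, :, :], axis=2)
        nn = idx[d.argmin(axis=1)]
        scores += U_test[k][:, None] * Y_train[nn]
    return (scores >= 0.5).astype(int)

# Two clusters with hard memberships: each test point inherits the labels
# of its own cluster's training instances.
X_train = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
Y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
U_train = np.array([[1., 1., 0., 0.], [0., 0., 1., 1.]])
X_test = np.array([[0., 0.5], [10., 10.5]])
U_test = np.array([[1., 0.], [0., 1.]])
pred = ocbmlc_predict(X_train, Y_train, X_test, U_train, U_test)
```

Because an instance can exceed the threshold in several clusters, it contributes to several local classifiers, which is precisely what a hard partition forbids.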

Datasets.
In our experiments, we used three public multilabel datasets: emotions, yeast, and scene. These datasets were taken from the Mulan library. The emotions dataset contains examples of songs labelled according to people's emotions [33], the yeast dataset includes information about gene functions [34], and the scene dataset [35] includes a series of landscape patterns. Table 1 shows the statistics of the three multilabel benchmark datasets.
In Table 1, "Domain" denotes the dataset domain, "Instances" is the number of instances in the dataset, "Features" is the number of attributes, "Labels" is the number of labels in the datasets, and "Cardinality" is the average number of labels associated with each instance.

Experimental Setting.
At the multilabel feature selection stage, to make the MI computation convenient, we discretised continuous features into 10 bins using an equal-width strategy. The evaluation approaches for MLC differ from those of traditional single-label classification. Note that the Hamming loss and micro F1-measure evaluation criteria are widely used for MLC; thus, we used these criteria in our experiments. The non-OCBMLC models use the k-means and EM algorithms to cluster the original datasets, and the OCBMLC model uses the FCM algorithm on the data after dimension reduction. The overlapping and nonoverlapping frameworks both employ ML-KNN as the classifier. The number of clusters k in k-means, EM, and FCM was set between 2 and 7. In this study, a cross-validation strategy was used for each combination of algorithm framework and dataset. All experiments used MATLAB 2012 on an Intel Core i5 2.3 GHz processor with 8 GB memory.
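The equal-width binning used here can be sketched as follows; the function name and the NumPy-based approach are illustrative assumptions:

```python
import numpy as np

def equal_width_discretise(col, bins=10):
    """Discretise a continuous feature column into `bins` equal-width
    intervals, as done before computing MI in the experiments."""
    edges = np.linspace(col.min(), col.max(), bins + 1)
    # digitize against the interior edges yields bin ids 0..bins-1
    return np.digitize(col, edges[1:-1])

col = np.linspace(0.0, 1.0, 100)
binned = equal_width_discretise(col, bins=10)
```

Equal-width binning is sensitive to outliers (one extreme value stretches every interval), which is worth keeping in mind when reproducing MI-based scores.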

Evaluation Metrics.
The evaluation of an MLC system differs from that of a single-label classification system. Several criteria for evaluating the performance of an MLC system have been employed previously [36]; among them, we employed the Hamming loss and micro F1-measure criteria.
Here, let T = \{(x_1, Y_1), \ldots, (x_n, Y_n)\} be a set of n test examples, let Y_i^{*} = h(x_i) be the predicted label set for test instance x_i, and let Y_i be the ground-truth label set for x_i. The Hamming loss indicates the fraction of erroneous labels relative to the total number of labels, where a smaller Hamming loss value indicates better classification performance. The Hamming loss value is calculated as follows:

HLoss = \frac{1}{n} \sum_{i=1}^{n} \frac{|h(x_i) \,\Delta\, Y_i|}{|L|},

where \Delta denotes the symmetric difference between two sets. The micro F1-measure represents the harmonic mean of precision and recall, calculated from the false positives, false negatives, true positives, and true negatives pooled over all labels:

\text{micro } F1 = \frac{2 \sum_{j=1}^{|L|} TP_j}{2 \sum_{j=1}^{|L|} TP_j + \sum_{j=1}^{|L|} FP_j + \sum_{j=1}^{|L|} FN_j}.

Here, TP_j denotes true positives, FP_j denotes false positives, and TN_j and FN_j are true and false negatives, respectively, for the jth label after a separate binary evaluation is performed. Note that a greater micro F1-measure value indicates better classification performance of a multilabel algorithm.
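A minimal sketch of both metrics for binary label matrices (rows are instances, columns are labels; the helper names are assumptions):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots predicted wrongly, averaged over all
    instance-label pairs; smaller is better."""
    return np.mean(Y_true != Y_pred)

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all labels first, then
    compute a single F1 score; larger is better."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(Y_true, Y_pred))  # one wrong slot out of six
print(micro_f1(Y_true, Y_pred))
```

Micro-averaging weights every instance-label decision equally, so frequent labels dominate the score; this differs from macro-averaging, which averages per-label F1 values.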

Results.
In this study, we used the Hamming loss and micro F1-measure as experimental evaluation metrics and employed ML-KNN as the multilabel classifier. Note that, in all cases, the best results are indicated in bold in Tables 2-9.

Comparisons of Feature Selection Methods.
To demonstrate the efficacy of the proposed feature selection algorithm, we compared the proposed feature selection method with other clustering-based MLC models on the emotions dataset. We also compared a feature selection method that considers only the dependence between features and classes with the proposed algorithm, which additionally considers interaction information among features and classes. Here, we refer to the criterion that considers only dependence as the max-dependence criterion:

DEP_{max} = \max_{f_k} \sum_{l_i \in L} I(f_k; l_i).

This criterion was used to select candidate features.
In this experiment, "DEP max" represents the features selected by the max-dependence criterion, which ignores interaction information when selecting candidate features, and "MDI" represents the features selected by the proposed algorithm, which considers dependence and interaction information among the candidate features and each class simultaneously. Here, we selected the top p% (p = 20, 30, 40) of features for MLC according to the MDI and DEP max criteria, and we used the average Hamming loss and micro F1-measure for evaluation. Table 2 shows the Hamming loss values obtained when we used the features selected according to MDI, DEP max, and the original features from the emotions dataset, and Table 3 shows the corresponding micro F1-measure values. In terms of the feature selection methods, we found that the performance of DEP max is no better than that of the other models, even compared with the original feature set. However, MDI performance is better and more stable for cluster numbers k = 2 to 7. It is likely that features selected by considering only max-relevance generate abundant redundancy, meaning that the dependence among those features can be large. Therefore, the proposed feature selection function may be better suited for MLC, and the experimental results support this. The results for all models on the emotions dataset are shown in Tables 4 and 5. We selected the top p% (p = 20, 30, 40) of the selected feature subset as the final feature subset for use with the proposed model. Table 4 demonstrates that the proposed OCBMLC framework achieved the lowest Hamming loss (0.1983 ± 0.0154) on the emotions dataset, and Table 5 shows that it achieved the highest micro F1-measure (0.6745 ± 0.0265). As shown in Figures 2 and 3, the predictive performance of the proposed model achieved the best results on the emotions dataset when k = 3: the Hamming loss reaches its minimum and the micro F1-measure its maximum when the MDI feature selection method selects the top p% (p = 30) of features as the classification attribute subset.
To demonstrate the classification performance of the proposed model, we also selected the top p% (p = 20, 30, 40) of the selected feature subset as an experimental feature subset. The Hamming loss and micro F1-measure results of the MLC models on the yeast dataset are shown in Tables 6 and 7. As shown in Figures 4 and 5, the Hamming loss and micro F1-measure demonstrate the best results when k = 3 with 40% of the features selected from the original data attributes. In addition, we found that the evaluation criterion values of MLC degraded with an increasing number of clusters. Tables 8 and 9 show that the OCBMLC model achieved the top predictive performance (Hamming loss = 0.0879 ± 0.0048; micro F1 = 0.7281 ± 0.0206) on the scene dataset. Figures 6 and 7 show that the Hamming loss and micro F1-measure values outperformed those of the "EM and ML-KNN" and "k-means and ML-KNN" models when k = 2 and 30% of the features of the original data attributes were selected. When we selected the top 30% or 40% of features using the proposed feature selection algorithm for MLC, the proposed OCBMLC model achieved the best performance because the algorithm selects features with maximum dependence on the classes while also considering the interaction among features and each class. Thus, the proposed feature selection algorithm can select features with the best discrimination power. The experimental results show that the proposed feature selection algorithm directly improves classification performance, and it is almost always better than models that use all features from the data. The experimental results also show that the model based on overlapping clustering outperforms models based on hard clustering. In a multilabel dataset, one instance may belong to multiple labels; however, hard clustering methods attempt to assign each instance to a single cluster. Therefore, such methods may not be suitable for multilabel datasets. In contrast, overlapping clustering methods allow an instance to belong to several clusters, which better reflects the nature of multilabel data.

Conclusion
This paper has proposed an overlapping clustering-based MLC model that includes a feature selection phase for original datasets. We have also proposed a new multilabel feature selection algorithm that can effectively select significant features to improve classification performance. The proposed MLC framework includes an initial overlapping clustering phase. The proposed model considers the fact that multilabel data examples may not be related to a single class but may belong to multiple classes in many cases; therefore, overlapping clustering may be more suitable for such situations. Experimental results show that the proposed model can increase predictive performance compared to a nonoverlapping clustering-based model.

Figure 2: Hamming loss for all models and the number of clusters in emotions.
Figure 3: Micro F1-measure for all models and the number of clusters in emotions.

Figure 4: Hamming loss for all models and the number of clusters in yeast.

Figure 5: Micro F1-measure for all models and the number of clusters in yeast.

Figure 6: Hamming loss for all models and the number of clusters in scene.
Figure 7: Micro F1-measure for all models and the number of clusters in scene.
Property 2. If candidate feature f_k and each class label l_i \in L are mutually independent, then the MI of f_k and L is minimal.
Property 3. If each class label l_i \in L is determined completely by f_k, then the MI of f_k and L is maximal.

Table 1: Statistics of the multilabel benchmark datasets.

Table 2: Hamming loss for the models based on k-means clustering on the emotions dataset. n is the number of attributes, and k is the number of clusters. Values are the average Hamming loss.

Table 3: Micro F1-measure for all models on the emotions dataset. n is the number of attributes, and k is the number of clusters. Values are the average micro F1-measure.

Table 4: Hamming loss for all models on the emotions dataset. n is the number of attributes, and k is the number of clusters.

Table 5: Micro F1-measure for all models on the emotions dataset. n is the number of attributes, and k is the number of clusters.

Table 6: Hamming loss for all models on the yeast dataset. n is the number of attributes, and k is the number of clusters.

Table 7: Micro F1-measure for all models on the yeast dataset. n is the number of attributes, and k is the number of clusters.

Table 8: Hamming loss for all models on the scene dataset. n is the number of attributes, and k is the number of clusters.

Table 9: Micro F1-measure for all models on the scene dataset. n is the number of attributes, and k is the number of clusters.

In addition, the results demonstrate that feature selection plays an important role in classification. In future work, we plan to further explore and develop better and more robust feature selection methods and overlapping clustering algorithms for MLC tasks.