An Ensemble Method for High-Dimensional Multilabel Data

Multilabel learning is now receiving increasing attention from a variety of domains, and many learning algorithms have been proposed. Like traditional learning, multilabel learning may also suffer from the problem of high dimensionality, yet little attention has been paid to this issue. In this paper, we propose a new ensemble learning algorithm for multilabel data. The main characteristic of our method is that it exploits features with local discriminative capabilities for each label to serve the purpose of classification. Specifically, for each label, the discriminative capabilities of features on the positive and negative data are estimated, and the top features with the highest capabilities are selected. A binary classifier for each label is then constructed on these top features. Experimental results on benchmark data sets show that the proposed method outperforms three popular and previously published multilabel learning algorithms.


Introduction
Data classification is one of the major issues in data mining and machine learning. Generally speaking, it consists of two stages, that is, building classification models and predicting labels for unknown data. Depending on the number of labels tagged on each instance, classification problems can be divided into single-label and multilabel classification [1]. In the former, the class labels are mutually exclusive and each instance is tagged with only one class label. In the latter, each instance may be tagged with more than one class label simultaneously. Multilabel classification problems are ubiquitous in real-world applications, such as text categorization, image annotation, bioinformatics, and information retrieval [1,2]. For example, the movie "Avatar" may be tagged with the action, science fiction, and romance genres. Many multilabel classification algorithms have now been proposed. Roughly speaking, they can be grouped into two categories, that is, algorithm adaptation and problem transformation [1]. The first kind of technique extends traditional single-label classifiers, such as kNN, C4.5, SVM, and AdaBoost, by modifying some constraint conditions to handle multilabel data. Typical examples include AdaBoost.MH [3], BRkNN [4], and LPkNN [4]. For instance, Zhang and Zhou [5] proposed ML-kNN and applied it to scene classification, while Clare and King [2] employed C4.5 to deal with multilabel data by altering the discriminative formula of information entropy.
The second kind of technique transforms multilabel data into corresponding single-label data and then handles them one by one using traditional methods. An intuitive approach is to treat the multilabel problem as a set of independent binary classification problems, one for each class label [6,7]. However, such methods often do not consider the correlations among the class labels and may suffer from the problem of unbalanced data, especially when there are a large number of class labels [8]. To cope with these problems, several strategies have been introduced. For example, Zhu et al. [9] explored the label correlation with maximum entropy, while Cai and Hofmann [10] captured the correlation information among the labels by virtue of a hierarchical structure.
Analogous to traditional classification, multilabel learning may also encounter problems, such as overfitting and the curse of dimensionality, arising from the high dimensionality of data [11,12]. To alleviate this problem, an effective solution is to perform dimension reduction or feature selection on the data in advance. As a typical example, Ji et al. [13] extracted a common subspace shared among multiple labels by using ridge regression. One common characteristic of these methods is that they make use of only one feature set to achieve the learning purpose under the context of multilabel data. However, in reality, a single feature subset cannot represent the properties of different labels exactly. Therefore, it is necessary to choose different features for each label during the multilabel learning stage. A representative example of this kind is LIFT [14].
In this paper, we propose a new multilabel learning algorithm. The main characteristic of our method is that different feature subsets are exploited for each label during the procedure of constructing binary classifiers. More specifically, given a class label, the features with high discriminative capabilities with respect to that label are chosen and then used to train a binary classifier. This means that the selected features have local properties; note that they may have lower discriminative capabilities with respect to other class labels. The other binary classifiers are constructed in a similar manner. Finally, all binary classifiers are assembled into an overall one, which is used to predict the labels of unknown data.
The rest of this paper is organized as follows. We describe the details of the proposed method in Section 2. Experimental results conducted to evaluate the effectiveness of our method are presented in Section 3. Finally, conclusions and future work are given in the end.

Binary Classification.
According to the formal description of D, we know that multilabel data is a general form of traditional single-label data, in which each y_i involves only a single label. Thus, a natural and intuitive solution for multilabel learning is to transform the multilabel data into corresponding single-label data and then train classifiers on the generated data. There are many transformation strategies; copy, selection, and ignore are three typical techniques [1]. Besides, the label powerset is also introduced in the literature, where every distinct label set y_i is taken as a new class label.
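To make these transformation strategies concrete, the sketch below shows the copy and label-powerset transformations in illustrative Python; the toy data and function names are not from the paper:

```python
# Toy multilabel data: each sample is (feature vector, set of labels).
data = [
    ([0.1, 0.2], {"a", "b"}),
    ([0.3, 0.4], {"a"}),
    ([0.5, 0.6], {"b", "c"}),
]

def copy_transform(data):
    """Copy: replicate each sample once per label it carries,
    producing single-label instances."""
    return [(x, l) for x, ys in data for l in ys]

def label_powerset(data):
    """Label powerset: treat each distinct label set as one new class,
    so the problem becomes ordinary multiclass classification."""
    return [(x, frozenset(ys)) for x, ys in data]
```

The copy strategy grows the data set (one instance per label occurrence), while the powerset keeps the size but can create many rare classes.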
Before giving the principle of binary classification, let us introduce the concepts of positive and negative samples of labels.
Definition 1. Given a multilabel data set D with n samples associated with q labels L = {l_1, ..., l_q}, for each class label l_j ∈ L, its positive samples P(l_j) and negative samples N(l_j) are defined as follows:

P(l_j) = {x_i | l_j ∈ y_i, 1 ≤ i ≤ n},   (1)
N(l_j) = {x_i | l_j ∉ y_i, 1 ≤ i ≤ n}.   (2)

From this definition, we know that, given a label l_j, the examples of the original data set are positive if they are associated with the class label l_j and negative otherwise. Moreover, P(l_j) ∩ N(l_j) = ∅ and P(l_j) ∪ N(l_j) = D.
Binary relevance (BR), also known as the one-against-all method, is the most popular and most commonly used transformation method for multilabel learning in the literature [1]. It learns q different binary classifiers independently, one for each label in L. Specifically, it transforms the original data set D into q data sets D_j, j = 1, ..., q. Each data set D_j consists of the positive samples P(l_j) and negative samples N(l_j) with respect to l_j. Based on the new data set D_j, a binary classifier h_j for the label l_j can be built using off-the-shelf learning methods, for example, kNN and SVM. After obtaining the q binary classifiers for all labels, the prediction of BR for a new sample x is the union of the labels l_j that are positively predicted by the q classifiers; that is, h(x) = [h_1(x), h_2(x), ..., h_q(x)]^T, where h_j(x) takes a value of 1 or 0, indicating that x is predicted positively or negatively by the classifier h_j.
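The BR procedure can be sketched as follows; the tiny centroid classifier here is only a hypothetical stand-in for an off-the-shelf binary learner such as kNN or SVM:

```python
def split_by_label(data, label):
    """Definition 1: positives carry the label, negatives do not."""
    pos = [x for x, ys in data if label in ys]
    neg = [x for x, ys in data if label not in ys]
    return pos, neg

class CentroidClassifier:
    """Minimal binary learner: predicts 1 if x is closer to the
    positive-class centroid than to the negative-class centroid."""
    def fit(self, pos, neg):
        self.cp = [sum(c) / len(pos) for c in zip(*pos)]
        self.cn = [sum(c) / len(neg) for c in zip(*neg)]
        return self
    def predict(self, x):
        dp = sum((a - b) ** 2 for a, b in zip(x, self.cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, self.cn))
        return 1 if dp < dn else 0

def binary_relevance_fit(data, labels):
    # One independent binary classifier per label.
    return {l: CentroidClassifier().fit(*split_by_label(data, l))
            for l in labels}

def binary_relevance_predict(classifiers, x):
    """Prediction is the union of labels predicted positively."""
    return {l for l, h in classifiers.items() if h.predict(x) == 1}
```

Replacing `CentroidClassifier` with any binary learner exposing `fit`/`predict` leaves the BR scheme unchanged.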
BR is a straightforward transformation method and is widely used as a baseline in comparisons of multilabel learning algorithms. However, the drawback of BR is that it does not take correlations among the labels into account and treats all labels independently. In addition, it also suffers from the class imbalance problem: in multilabel data, the number of positive samples P(l_j) is significantly smaller than the number of negative samples N(l_j) for some labels due to the typical sparsity of labels. To alleviate this problem, feature selection should be performed on the data sets in advance.

Feature Selection.
The purpose of feature selection is to select significant features to represent data from the original space without losing much information. It has been extensively studied in traditional learning. However, little work on feature selection has been done in the context of multilabel learning. Currently, there are many criteria available to measure the interestingness of features [15]. Here, we exploit the concept of the density distribution of data to represent the interestingness of features.
Definition 2. Given a data set D_j with m samples, the density of the value distribution of the k-th feature is defined as

d_k = (2 / (m(m − 1))) Σ_{1 ≤ i < t ≤ m} sim(x_ik, x_tk),   (3)

where x_ik denotes the k-th feature value of x_i and sim is a similarity function between two values.
In (3), the sim function is often taken as the form of the inverse Euclidean distance. If the positive samples P(l_j) or the negative samples N(l_j) are used as the sample set in (3), we obtain the positive and negative densities of the features.

Definition 3. Given the positive samples P(l_j) and negative samples N(l_j), the positive and negative densities of the k-th feature are defined as

d_k^+ = the density of the k-th feature computed by (3) over P(l_j),   (4)
d_k^− = the density of the k-th feature computed by (3) over N(l_j).   (5)

The positive density d_k^+, as well as the negative density d_k^−, can effectively represent the specific characteristics of the data. The larger the value of d_k^+ (or d_k^−), the better the discriminative capability to distinguish positive (or negative) samples from the others.
Based on this principle, we adopt these two criteria to choose significant features during the learning stage. Specifically, for each feature f_k in D_j, we calculate its positive density d_k^+ and negative density d_k^−, respectively. Then, the positive densities of all features are ranked in decreasing order, and the top features with the highest positive densities are selected; the same is done for the negative densities. Finally, the features with high positive or negative densities are used to train the desired binary classifiers.
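A minimal sketch of this density-based ranking follows; the concrete similarity 1/(1 + |a − b|) is an assumed instance of the paper's inverse-distance sim function, and `top_features` is a hypothetical helper name:

```python
def feature_density(samples, k):
    """Eq. (3)-style density of feature k: mean pairwise similarity
    of the feature's values, using an assumed inverse-distance sim."""
    vals = [x[k] for x in samples]
    m = len(vals)
    total = sum(1.0 / (1.0 + abs(vals[i] - vals[t]))
                for i in range(m) for t in range(i + 1, m))
    return 2.0 * total / (m * (m - 1))

def top_features(pos, neg, top):
    """Rank features by positive and by negative density (Defs. 2-3)
    and keep the union of the two top-`top` lists."""
    d = len(pos[0])
    by_pos = sorted(range(d), key=lambda k: feature_density(pos, k),
                    reverse=True)
    by_neg = sorted(range(d), key=lambda k: feature_density(neg, k),
                    reverse=True)
    return set(by_pos[:top]) | set(by_neg[:top])
```

A feature whose values cluster tightly within the positive (or negative) samples gets a high density and survives the ranking.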
How many features should be selected for classification is still an open problem. Here, we empirically determine the number of selected features with a concept called the λ-minimum density, which is given in Definition 4.
The λ-minimum density can effectively measure the amount of information that one feature has. If a feature's density is larger than the λ-minimum density, the feature has enough information to represent the characteristics of the data; as a result, it will be chosen during the stage of feature selection. In other words, after calculating d_k^+ and d_k^−, we retain the features with d_k^+ or d_k^− larger than δ_j and discard the others. Note that the parameter λ in Definition 4 controls the number of selected features: the larger the value of λ is, the more features would be chosen. In our empirical experiments, the classifier achieved good performance when λ was set to 0.1.
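A hedged sketch of this thresholding step; the concrete reading of Definition 4 used here (δ as λ times the smaller of the two best feature densities) is an assumption, since the original formula is only partially recoverable:

```python
def feature_density(samples, k):
    # Eq. (3)-style density: mean pairwise inverse-distance similarity
    # of feature k's values (assumed concrete form of sim).
    vals = [x[k] for x in samples]
    m = len(vals)
    total = sum(1.0 / (1.0 + abs(vals[i] - vals[t]))
                for i in range(m) for t in range(i + 1, m))
    return 2.0 * total / (m * (m - 1))

def select_by_min_density(pos, neg, lam=0.1):
    """Keep features whose positive or negative density exceeds the
    lambda-minimum density delta (one reading of Def. 4)."""
    d = len(pos[0])
    dpos = [feature_density(pos, k) for k in range(d)]
    dneg = [feature_density(neg, k) for k in range(d)]
    # Assumed: d(P(l_j)) and d(N(l_j)) are the best densities on each side.
    delta = lam * min(max(dpos), max(dneg))
    return [k for k in range(d) if dpos[k] > delta or dneg[k] > delta]
```

With lam=0.1, as in the paper's experiments, only features whose density clears a tenth of the best observed density survive under this reading.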

The Proposed Method.
Based on the analysis above, we propose a new multilabel learning algorithm. The framework of our algorithm is shown as Algorithm 1. The proposed method works in a straightforward way and can be easily understood. It consists of two major stages, that is, the training and prediction stages. In the training stage, a new data set is generated for each class label by obtaining its positive and negative samples. Subsequently, we estimate the interestingness of the features in the data set, so as to retain significant features for classification. Finally, a binary classifier is constructed with a baseline learning method. Given a new sample, its class labels can be predicted by testing it with all binary classifiers.
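The two stages can be sketched end to end as follows; the centroid-based binary learner and the concrete density threshold are illustrative assumptions, not the paper's exact choices (the paper uses a linear SVM as the base learner):

```python
def density(vals):
    # Eq. (3)-style density: mean pairwise inverse-distance similarity.
    m = len(vals)
    s = sum(1.0 / (1.0 + abs(vals[i] - vals[t]))
            for i in range(m) for t in range(i + 1, m))
    return 2.0 * s / (m * (m - 1))

def select_features(pos, neg, lam):
    # Keep features whose positive or negative density clears delta
    # (one reading of Def. 4); never drop every feature.
    d = len(pos[0])
    dpos = [density([x[k] for x in pos]) for k in range(d)]
    dneg = [density([x[k] for x in neg]) for k in range(d)]
    delta = lam * min(max(dpos), max(dneg))
    keep = [k for k in range(d) if dpos[k] > delta or dneg[k] > delta]
    return keep or list(range(d))

def centroid(rows):
    return [sum(c) / len(rows) for c in zip(*rows)]

def train_emcfs(data, labels, lam=0.1):
    """Training stage: per label, split pos/neg (Def. 1), select dense
    features, and fit a tiny centroid model on the kept features."""
    models = {}
    for l in labels:
        pos = [x for x, ys in data if l in ys]
        neg = [x for x, ys in data if l not in ys]
        keep = select_features(pos, neg, lam)
        models[l] = (keep,
                     centroid([[x[k] for k in keep] for x in pos]),
                     centroid([[x[k] for k in keep] for x in neg]))
    return models

def predict_emcfs(models, x):
    """Prediction stage: union of labels whose binary model fires."""
    out = set()
    for l, (keep, cp, cn) in models.items():
        z = [x[k] for k in keep]
        dp = sum((a - b) ** 2 for a, b in zip(z, cp))
        dn = sum((a - b) ** 2 for a, b in zip(z, cn))
        if dp < dn:
            out.add(l)
    return out
```

Note that each label's model sees only its own selected feature subset, which is the local-feature idea that distinguishes the method from plain BR.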

Empirical Study
To validate the performance of our proposed method, we made a comparison of EMCFS with three popular multilabel learning algorithms: ML-kNN, LIFT, and Rank-SVM, standing for different kinds of learning types. In addition, we took a linear support vector machine as the baseline binary classification algorithm and set the parameter λ to 0.1. Other parameters were set to the default values suggested by their authors. For example, the number of nearest neighbors in ML-kNN was 10 and the distance measure was the Euclidean distance [5]. In the Rank-SVM classifier, the degree of the polynomial kernel was 8 and the cost parameter C was set to one [16].

Data Sets.
To validate the effectiveness of our method thoroughly, our experiments were conducted on four data sets: emotions, medical, corel16k (sample 1), and delicious. They are often used to verify the performance of multilabel classifiers in the literature and are available at http://mulan.sourceforge.net/datasets.html. Table 1 summarizes their general information, where the cardinality and density columns refer to the average number of class labels per sample and its fraction of the total number of labels, respectively. These multilabel data sets vary in the number of labels and differ greatly in the numbers of samples and features [17].
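The cardinality and density statistics reported in Table 1 can be computed as follows (the label sets here are toy values for illustration):

```python
def label_cardinality(label_sets):
    """Average number of labels per sample (Table 1's cardinality)."""
    return sum(len(ys) for ys in label_sets) / len(label_sets)

def label_density(label_sets, q):
    """Cardinality divided by the total number of labels q."""
    return label_cardinality(label_sets) / q
```

A density close to zero indicates sparse labels, which is exactly the regime where BR's class-imbalance problem is most severe.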

Experimental Results.
There are many evaluation criteria available to assess the performance of multilabel classifiers. In this work, average precision, hamming loss, one error, and coverage have been adopted to assess the effectiveness of the proposed method. Their detailed descriptions can be found in the literature, for example, in [1,18].
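Two of these criteria can be sketched directly from their standard definitions; the score dictionaries used by one error are an assumption about the classifier exposing real-valued label scores:

```python
def hamming_loss(true_sets, pred_sets, q):
    """Fraction of label slots predicted wrongly, averaged over the
    n samples and q labels (symmetric difference of label sets)."""
    n = len(true_sets)
    return sum(len(t ^ p) for t, p in zip(true_sets, pred_sets)) / (n * q)

def one_error(true_sets, scores):
    """Fraction of samples whose top-ranked label is not a true label.
    `scores` maps each sample to a dict of label -> score."""
    n = len(true_sets)
    bad = sum(1 for t, s in zip(true_sets, scores)
              if max(s, key=s.get) not in t)
    return bad / n
```

For both measures, smaller values indicate better performance, matching the convention used in Tables 3 through 5.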
Table 2 reports the average precision of the multilabel classifiers on the data sets. In this table, each row denotes an observation on one data set. The best result in each row is highlighted in boldface, where the larger the value, the better the performance. From the table, one may observe that the proposed method, EMCFS, works quite well and is comparable to or better than the others in most cases in terms of average precision. For example, on the delicious data set (the last row in the table), the precision of EMCFS is 29.5%, which is the best among all methods.
Apart from the average precision, we also compared EMCFS with the others in terms of hamming loss, one error, and coverage. Tables 3, 4, and 5 present the averaged performance of the learning algorithms under these three criteria, respectively, where the smaller the value, the better the performance. The best results are also highlighted in boldface.
According to the results in these tables, we know that, similar to the case of average precision, EMCFS is also superior to the others regarding hamming loss, one error, and coverage. Although EMCFS achieved a somewhat poor coverage of 140.428 on the corel16k data set, slightly worse than the best result, it is still not the worst, being better than ML-kNN.

Conclusions
In this paper, we propose a new ensemble multilabel learning method. The central idea of our method is that, for each label, it exploits different features to build the learning model. The advantage is that the classifiers are constructed on the features with strong local discriminative capabilities. Generally, the proposed method consists of three steps. Firstly, for each label, a new data set is generated by identifying the positive and negative samples. Then, the interestingness of the features is estimated, and the features with high density are retained to train a learning model. Finally, all binary classifiers built with the selected features are integrated into an overall one. Experimental results on four multilabel data sets show that the proposed method can potentially improve performance and outperform other competing and popular methods.

Definition 4.
Let D_j be the data set, and let P(l_j) and N(l_j) be its positive and negative samples with respect to the label l_j, respectively. The λ-minimum density of D_j with respect to l_j is defined as

δ_j = λ ⋅ min (d(P(l_j)), d(N(l_j))),   (6)

where d(⋅) denotes the feature density computed by (3) over the corresponding sample set.

Table 1:
The brief description information of the data sets in the experiments.

Algorithm 1: Ensemble multilabel classifier using feature selection (EMCFS).
Input: D = {(x_1, y_1), ..., (x_n, y_n)}: the training multilabel data set;
  λ: the parameter of δ_j;
  T = {x_1, ..., x_m}: the test data set without labels;
Output: Y = {y_1, ..., y_m}: the label set of T;
Training stage
For each label l_j in L do
  Obtain P(l_j) and N(l_j) of l_j from D according to (1) and (2);
  Calculate δ_j with P(l_j) and N(l_j) according to (6);
  For each feature f_k in D_j = P(l_j) ∪ N(l_j) do
    Calculate d_k^+ and d_k^− according to (4) and (5);
  Select the features whose d_k^+ or d_k^− is larger than δ_j;
  Train a binary classifier h_j on D_j with the selected features;
Endfor
Prediction stage
For each sample x_i in T do
  y_i = ∅;
  For each classifier h_j do
    y_i = y_i ∪ {h_j(x_i)}, where h_j returns 0 or 1;
  Endfor
Endfor

Table 3 :
A comparison of hamming loss of four classifiers on data sets.

Table 4 :
A comparison of one error of four classifiers on data sets.

Table 5 :
A comparison of coverage of four classifiers on data sets.