Scalable Multilabel Learning Based on Feature and Label Dimensionality Reduction

The data-driven management of real-life systems based on a trained model, which in turn is based on the data gathered from its daily usage, has attracted a lot of attention because it realizes scalable control for large-scale and complex systems. To obtain a model within an acceptable computational cost that is restricted by practical constraints, the learning algorithm may need to identify essential data that carries important knowledge on the relation between the observed features representing the measurement values and the labels encoding the multiple target concepts. This results in an increased computational burden owing to the concurrent learning of multiple labels. A straightforward approach to address this issue is feature selection; however, it may be insufficient to satisfy the practical constraints because the computational cost of feature selection can be impractical when the number of labels is large. In this study, we propose an efficient multilabel feature selection method to achieve scalable multilabel learning when the number of labels is large. Empirical experiments on several multilabel datasets show that the multilabel learning process can be boosted without deteriorating the discriminating power of the multilabel classifier.


Introduction
Nowadays, the data-driven management of real-life systems based on a model obtained by analyzing data gathered from their daily usage is attracting significant attention because it realizes scalable control for large-scale and complex systems [1,2]. Unfortunately, advances in the identification of important knowledge on the relation between the observed information and target concept are far from satisfactory for real-life applications such as text categorization [3], protein function prediction [4], emotion recognition [5], and assembly line monitoring [6]. This is because the underlying combinatorial optimization problem is computationally difficult. To deal with this complicated task in a scalable manner, the algorithm may need to identify essential data that carries important knowledge for building an acceptable model while satisfying practical constraints such as real-time response, limited data storage, and computational capability [7].
Although the majority of current machine learning algorithms are designed to learn the relation between information sources or features and a single concept or label, recent complex applications require that the algorithm extract the relation to multiple concepts [8]. For example, a document can be assigned to multiple categories simultaneously [9], and protein compounds can also have multiple roles in a biological system [10]. Therefore, to identify important knowledge in this scenario, the algorithm must learn the complex relation between features and labels, formalized as the multilabel learning problem in this field. This scenario differs from that of the single-label learning problem because the problem itself offers the opportunity to improve learning accuracy by exploiting the dependency between labels [11,12]. However, the algorithm eventually suffers from the computational cost of the learning process owing to the multiple labels.
To reduce the computational burden of the algorithm, a straightforward approach is to ignore unimportant features in the training process that do not influence the learning quality [13,14]. However, in the multilabel learning problem, this approach may be insufficient to satisfy the practical constraints because a large number of labels can be involved in a related application. Moreover, the number of possible combinations of features and labels that should be considered when scoring the importance of features increases exponentially; i.e., the feature selection process can become computationally impractical. Additionally, the computational burden increases significantly because the number of features in the dataset is typically large when feature selection is considered; as a result, the number of possible combinations can increase considerably [15]. This is a serious problem because conventional multilabel learning algorithms without the feature selection process are unable to finish the learning process owing to the presence of too many features, whereas those with feature selection can stall in the scoring process of the features.
In this study, we devise a new multilabel feature selection method that facilitates dimensionality reduction of labels in the scoring process. Specifically, our algorithm first analyzes the amount of information content in the labels and reduces the computational burden by discarding labels that are unimportant to the scoring of feature importance. Our contributions compared to our previous works, and our strategy for dealing with the scalability issue, can be summarized as follows: (i) we propose an efficient multilabel feature selection method based on the simplest approximation of mutual information (MI) that is scalable to the number of labels, costing constant-time computations in terms of the number of labels; (ii) the computational cost of the feature selection process can be controlled easily owing to its simple form, which is an important property when the execution time is limited; (iii) the proposed method identifies a subset of labels that carries the majority of the information content of the original label set, preserving the quality of the scoring process; (iv) according to the characteristics of the labels in terms of information content, we suggest how many labels should be considered in the feature scoring process to preserve the majority of the information content; (v) in contrast to our previous works, the proposed method explicitly discards unimportant labels from the scoring process, resulting in a significant acceleration of the multilabel feature selection process.

Multilabel Feature Selection
One of the most common approaches to multilabel feature selection is the use of a conventional single-label feature selection method after transforming the label set into one or more labels [9,16,17]. In this regard, the simplest strategy is known as binary relevance, in which each label is separated and analyzed independently [18]. A statistical measure that can be used as a score function to measure feature importance can be employed after separating the label set; these measures include the Pearson correlation coefficient [19] and the odds ratio [20]. Thus, prohibitive computations may be required to obtain the final feature score if a large label set is involved. In contrast, efficient multilabel feature selection may not be achieved if the transformation process consumes excessive computational resources. For example, ELA + CHI evaluates the importance of each feature using χ² statistics (CHI) between the feature and a single label obtained by using entropy-based label assignment (ELA), which separates multiple labels and assigns them to duplicated patterns [9]. Thus, the label transformation process can be the bottleneck that incurs a prohibitive execution time if the multilabel dataset is composed of a large number of patterns and labels.
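The binary-relevance scoring strategy described above can be sketched as follows. This is a minimal illustration, not code from the cited works; the Pearson correlation coefficient is used as the per-label score, and features and labels are represented as plain Python columns.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def binary_relevance_scores(features, labels):
    """Binary-relevance scoring: separate each label, score every feature
    against it independently, and sum the per-label |r| values; the total
    work therefore grows with |F| * |L| score evaluations."""
    scores = [0.0] * len(features)
    for y in labels:                 # one pass per separated label
        if pstdev(y) == 0:           # a constant label carries no signal
            continue
        for i, x in enumerate(features):
            if pstdev(x) == 0:
                continue
            scores[i] += abs(pearson(x, y))
    return scores
```

The nested loop makes the cost concern explicit: with many labels, even this simplest strategy performs one full pass over all features per label.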
Although the computational cost of the transformation process can be reduced by applying a simple procedure such as the label powerset, which treats each distinct label set as a class [17,21], the feature selection process may be inefficient if the scoring process incurs excessive computational costs during the evaluation of feature importance [18,22]. For example, PPT + RF identifies appropriate weight values for the features based on a label that is transformed by the pruned problem transformation (PPT) [21] and the conventional ReliefF (RF) scheme [23] for single-label feature selection [24]. Although the ReliefF method can be extended to handle multilabel problems directly [25], the execution time to obtain the final feature subset can be excessively long if the dataset is composed of a large number of patterns, because ReliefF requires similarity calculations over pattern pairs. Thus, the feature selection process itself should not incur a complicated scoring process if efficient multilabel learning is to be achieved.
Instead of a label set transformation approach that may incur side effects [26], an algorithm adaptation approach that attempts to handle the problem of multilabel feature selection directly can be considered [15, 27-31]. In this approach, a feature subset is obtained by optimizing a specific criterion, such as a joint learning criterion involving feature selection and multilabel learning concurrently [32,33], l2,1-norm function optimization [31], a Hilbert-Schmidt independence criterion [28], label ranking errors [27], F-statistics [34], label-specific feature selection [12], or memetic feature selection based on mutual information (MI) [35]. However, if multilabel feature selection methods based on this strategy consider all features and labels simultaneously, the scoring process can be computationally prohibitive or can even fail, owing to internal tasks such as finding an appropriate hyperspace using pairwise pattern comparisons [27], calculating a dependency matrix [28], or performing iterative matrix inverse operations [31].
In our previous work [29], we demonstrated that MI can be decomposed into a sum of dependencies between variable subsets, which is a very useful property for solving multilabel learning problems [12,15] because unnecessary computations can be identified prior to the actual computation and rejected [36]. More efficient score functions, specialized for an incremental search strategy [37] and a quadratic programming framework [38], have also been considered. These score functions were employed to improve the effectiveness of evolutionary searching [35,39]. However, these MI-based score functions commonly require the calculation of the dependencies between all variable pairs composed of a feature and a label [14]. Thus, they share the same drawback in terms of computational efficiency, because labels known to have no influence on the evaluation of feature importance are still included in the calculations [15,40]. In contrast to our previous studies, the method proposed in this study discards unimportant labels explicitly prior to any multilabel learning process.
Although the characteristics of multilabel feature selection methods can vary according to the manner in which the importance of features is modeled, conventional methods create a feature subset by scoring the importance of features either against all labels [9,17,28] or against all possible combinations drawn from the label set [15,27,29]. Thus, these methods inherently suffer from prohibitive computational costs when the dataset is composed of a large number of labels.

Proposed Method
In this section, a formal definition of multilabel classification and feature selection is provided. Based on this definition, the proposed label selection approach is described, and the influence of label subset selection on feature selection is discussed.
3.1. Problem Definition. Let W be a set of training examples or patterns, where each example w_i ∈ W (1 ≤ i ≤ |W|) is described by a set of features ℱ = {f_1, …, f_|ℱ|}; its association with multiple concepts can be represented using a subset of labels λ_i ⊆ ℒ, where ℒ = {l_1, l_2, …, l_|ℒ|}. In addition, let T = {(t_i, λ_i) | 1 ≤ i ≤ |T|} be a set of test patterns, where λ_i is the true label set for t_i and is unknown to the multilabel classifier; thus, U = W ∪ T and W ∩ T = ∅. The task of multilabel learning is to derive a family of |ℒ| functions h_1, h_2, …, h_|ℒ| induced from the training examples, where each function h_k(t_i) → ℝ outputs the class membership of t_i to l_k. Thus, the relevant labels of t_i based on each function can be denoted as Y_i = {l_k | h_k(t_i) > ϕ, 1 ≤ k ≤ |ℒ|}, where ϕ is a predefined threshold. For example, in the work of [41], a mapping function h_k for l_k is induced using W. Based on h_k, the class membership value h_k(t_i) for a given test pattern t_i is determined, where h_k(t_i) ∈ [0, 1]. In that work, the threshold ϕ is set to 0.5 according to the maximum a posteriori theorem. Although that algorithm outputs l_k as a relevant label for t_i if the class membership value is larger than 0.5, the range of the class membership value can differ according to the multilabel classification algorithm. Although there have been attempts to improve multilabel learning performance by adapting the threshold for each label [42], most conventional studies employ the same value for all labels.
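The thresholding step that turns the class-membership values h_k(t_i) into a predicted label set can be sketched as a hypothetical helper, with ϕ defaulting to 0.5 as in [41]:

```python
def relevant_labels(memberships, phi=0.5):
    """Predicted relevant label set Y_t = {k : h_k(t) > phi} for one test
    pattern, given its per-label class-membership values."""
    return {k for k, h in enumerate(memberships) if h > phi}
```

For example, `relevant_labels([0.9, 0.2, 0.6])` marks labels 0 and 2 as relevant; changing `phi` per label would correspond to the adaptive-threshold variants of [42].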
One of the problems of multilabel feature selection that distinguishes it from classical single-label feature selection is the computational cost of selecting a subset of features with regard to the given multiple labels. Multilabel feature selection can then be achieved through a ranking process: assessing the importance of the |ℱ| features based on a score function and selecting the top-ranked n features (n ≪ |ℱ|). To perform multilabel feature selection, an algorithm must be able to measure the dependency, i.e., the importance score, between each feature and the label set. The dependency between a feature f ∈ ℱ and the label set ℒ can be measured using MI [43]:

M(f; ℒ) = H(f) + H(ℒ) − H(f, ℒ),  (1)
where H(⋅) in (1) represents the joint entropy that measures the information content carried by a given set of variables, defined as

H(X) = −Σ_x P(x) log_a P(x),  (2)

where x is a state represented by a variable X and P(⋅) is a probability mass function. If the base a of the log function in (2) is two, this is known as the Shannon entropy. When |ℒ| is large, the calculation of H(f, ℒ) and H(ℒ) becomes unreliable because too many joint states arise from ℒ with insufficient patterns. For example, to observe all possible associations between patterns and label subsets, the dataset should contain at least 2^|ℒ| patterns. Let X* be the power set of X and X*_k = {e | e ∈ X*, |e| = k}. Equation (1) can then be rewritten using the work of Lee and Kim [15].
where × denotes the Cartesian product of two sets. Next, V_k(⋅) is defined in terms of the interaction information I(X) for a given variable set X, defined as [44]

I(X) = −Σ_{T⊆X} (−1)^(|X|−|T|) H(T).

Equation (3) indicates that M(f; ℒ) can be decomposed into interaction information terms involving the feature and all possible label subsets. With regard to (3), the most efficient approximation of (1) is known to be [36]

M(f; ℒ) ≈ Σ_{l∈ℒ} M(f; l).  (6)

Accordingly, the score function J for evaluating the importance of a given feature f is written as

J(f) = Σ_{l∈ℒ} M(f; l).  (7)

Equation (7) indicates that the computational cost increases linearly with |ℒ|. Assuming that the cost of calculating one M(⋅;⋅) term is a unit cost, the algorithm consumes |ℒ| unit costs to compute the importance of one feature.
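The quantities above can be sketched for discrete columns as follows; entropy, MI, and J are estimated from empirical counts, which is a simplification of the paper's setting:

```python
import math
from collections import Counter

def entropy(col, base=2):
    """H(X) = -sum_x P(x) log_a P(x), estimated from empirical counts (eq. 2)."""
    n = len(col)
    return -sum((c / n) * (math.log(c / n) / math.log(base))
                for c in Counter(col).values())

def mutual_information(x, y):
    """M(x; y) = H(x) + H(y) - H(x, y) (eq. 1), for two discrete columns."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def score_J(f, labels):
    """J(f) = sum over labels of M(f; l) (eq. 7): one MI term, i.e., one
    unit cost, per label, hence |L| unit costs per feature."""
    return sum(mutual_information(f, l) for l in labels)
```

As a sanity check, M(x; x) = H(x) for any column, and M vanishes for two independent uniform binary columns, so J grows only through labels that actually share information with the feature.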

Label Subset Selection.
In our multilabel feature selection problem, the rank of each feature is determined based on its importance score from (7). The bound of an MI term is known to be

M(f; l) ≤ min(H(f), H(l)).  (8)

Thus, the bound of (7) is

J(f) ≤ Σ_{l∈ℒ} min(H(f), H(l)).  (9)

Because H(f) is unknown before actually examining the input features, and no importance score can exceed the sum of the entropy values of the labels, (9) can be simplified as

J(f) ≤ Σ_{l∈ℒ} H(l).  (10)

Equation (10) indicates that the score value of each feature is bounded by the entropy value of each label, and this fact implies Proposition 1 as follows [40].
Figure 1 represents how the importance score of a feature is determined with regard to Proposition 1; the height of each blue bar indicates the entropy value of the corresponding label, and the height of each yellow bar indicates the MI between f and that label. Figures 1 and 2 represent two sample cases wherein each label carries the same amount of information content and wherein a small subset of the label set carries the majority of the information content, respectively.

Figure 1: Score value calculation when label entropy values are uniform.

Figure 2: Score value calculation when label entropy values are skewed.

As shown in Figure 1, the value of M(f; l_i) can vary with l_i ∈ ℒ; however, its value is smaller than the entropy value of each label. When the entropy values of the labels are uniformly distributed, all the MI terms between f and each label should be examined because each M(f; l_i) term has the same chance of making a significant contribution to the final score J. However, as shown in Figure 2, if there is a set of labels with small entropy, i.e., if the entropy values of the labels are skewed, there can be MI terms that contribute insignificantly to J, because every M(f; l_j) will inherently have a small value, where l_j is a label with small entropy. Although the characteristics of label entropy values can vary between the uniform and skewed cases, it is observed in most real-world multilabel datasets that the skewed case occurs more frequently than the uniform case [15]. Additionally, as shown in Figure 2, because the MI terms between a feature and labels with small entropy contribute little to the final score of the feature, they can be excluded to accelerate the multilabel feature selection process.
Figure 3 shows the entropy value of each label in the BibTeX dataset [3], which is composed of 153 labels; please refer to Table 1 for details. The BibTeX dataset was created from the transactions of user activity in a tag recommendation system. For clarity, we subsequently refer to the tool used to describe and process lists of references as BibTeX, and to the corresponding dataset as the BibTeX dataset. In this system, users freely submit BibTeX entries and assign relevant tags. The purpose of the system is to recommend relevant tags for new BibTeX entries submitted by users. The system must identify the relation between a BibTeX entry and its relevant tags based on previously gathered user transactions; hence, it can be regarded as a real-life text categorization system. For clarity, the labels in Figure 3 are sorted according to their entropy values. The figure shows that each label gives a different entropy value; more importantly, approximately half of the labels have small entropy values, indicating that the MI terms involving those labels will contribute weakly to the final score. Therefore, these labels can be discarded to accelerate the multilabel feature selection process.
Suppose that an algorithm selects Q ⊂ ℒ to reduce the computational cost of multilabel feature selection, discarding the remaining labels C = ℒ \ Q. To prevent possible degradation, i.e., a change in the upper bound of J because of label subset selection, it is preferable that Q imply an upper bound similar to that of J given by

Σ_{l∈ℒ} H(l);  (11)

in other words, the discarded subset C should minimize Σ_{l∈C} H(l).

Proposition 2. The optimal C is composed of the labels with the lowest entropy.
Proof 1. Our goal is to identify a subset of labels C that influences the upper bound of J as insignificantly as possible when C is discarded from ℒ for the feature scoring process. Equation (11) indicates that the upper bound of J is the sum of the entropy values of the labels, and the entropy function is always nonnegative; therefore, the optimal C should be composed of the labels with the lowest entropy.
Proposition 2 indicates that the optimal C can be obtained by iteratively discarding the label with the smallest entropy until Q contains the desired number of labels. After obtaining Q, the approximated score function for evaluating a feature f is written as

J̃(f) = Σ_{l∈Q} M(f; l).  (12)

Finally, the difference between J and J̃ can be calculated exactly as

J(f) − J̃(f) = Σ_{l∈C} M(f; l),  (13)

where J(f) − J̃(f) is always nonnegative because MI is nonnegative. Algorithm 1 describes the procedure of the proposed method.
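A minimal sketch of the overall procedure follows, assuming a fixed fraction of high-entropy labels is retained (e.g., 30%, as derived in the next subsection). The function names are illustrative and this is not the paper's Algorithm 1 verbatim:

```python
import math
from collections import Counter

def _H(col):
    """Shannon entropy (base 2) of a discrete sequence."""
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in Counter(col).values())

def _MI(x, y):
    """M(x; y) = H(x) + H(y) - H(x, y)."""
    return _H(x) + _H(y) - _H(list(zip(x, y)))

def select_features(features, labels, n, keep_ratio=0.3):
    """Sketch of the proposed procedure: (1) keep the keep_ratio fraction of
    labels with the largest entropy as Q, discarding the rest (the set C);
    (2) score each feature by the reduced score sum over l in Q of M(f; l);
    (3) return the indices of the top-n features.
    `features` and `labels` are lists of discrete columns."""
    order = sorted(range(len(labels)), key=lambda j: _H(labels[j]), reverse=True)
    q_size = max(1, math.ceil(keep_ratio * len(labels)))
    Q = [labels[j] for j in order[:q_size]]          # high-entropy labels kept
    scores = [sum(_MI(f, l) for l in Q) for f in features]
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]
```

Per feature, only |Q| MI terms are evaluated instead of |ℒ|, which is the source of the speedup claimed above.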

Number of Remaining Labels.
A final issue related to label subset selection concerns the number of labels that should be discarded. Because the upper bound of the error in (13) grows as the number of discarded labels increases, there is a trade-off between computational efficiency and the accuracy of each feature's score. However, the actual computational cost can also be predicted easily after examining some features: the cost of examining |ℱ| features based on (7) is |ℱ|·|ℒ| unit costs, whereas the cost based on (12) is |ℱ|·|Q|. If there is no such constraint and a user only wants to determine a reasonable value of |Q| for a fast analysis, a simple and efficient rule is helpful.
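The cost prediction reduces to simple arithmetic. The numbers below are illustrative only (a BibTeX-like setting with 1836 features and 153 labels, keeping 46 labels, i.e., roughly 30%):

```python
def predicted_unit_costs(n_features, n_labels, n_kept):
    """Predicted scoring cost in MI-term evaluations ("unit costs"):
    |F|*|L| for the full score J versus |F|*|Q| for the reduced score."""
    return n_features * n_labels, n_features * n_kept

# Illustrative BibTeX-like setting: 1836 features, 153 labels, 46 kept
full_cost, reduced_cost = predicted_unit_costs(1836, 153, 46)
```

Here the reduced scoring performs 84,456 MI evaluations instead of 280,908, roughly a 3.3x reduction, before any feature is even examined.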
Suppose that the algorithm attempts to preserve the upper bound of the score function based on Q; then the retained portion of the upper bound should be greater than or equal to the error caused by label subset selection, i.e., the inequality

Σ_{l∈Q} H(l) ≥ Σ_{l∈C} H(l)  (15)

should hold.

According to the characteristics of the given labels, the number of labels to be discarded can then be identified as in Lemmas 1, 2, and 3.

Lemma 1 (skewed case).

|Q| = 1.  (16)
Proof 2. For simplicity, suppose that ℒ is sorted according to the entropy value of each label, such that l_1 has the smallest entropy and l_|ℒ| has the largest entropy. Suppose that the entropy values of the labels are skewed, as shown in Figure 2. If l_|ℒ| is the only label with a positive entropy and the remaining labels have no entropy, then the algorithm will move l_|ℒ| to Q and l_1, …, l_(|ℒ|−1) to C, and then terminate.
So far, we have considered the uniform and skewed cases, the two extremes of the characteristics of the information content carried by each label. Next, we consider an intermediate between the uniform and skewed cases, in which the information content of each label is proportional to its position when the labels are sorted in ascending order of entropy. For this case, about 30% of the labels, those with the largest entropy, should be included in Q.

Lemma 2 (proportional case). |Q| ≈ 0.3|ℒ|.
Proof 3. For simplicity, suppose that ℒ is sorted according to the entropy value of each label, such that l_1 has the smallest entropy value and l_|ℒ| has the largest entropy value. Suppose that the entropy values of the labels are proportional to the sequence numbers of the labels in ℒ, as shown in Figure 4. In this case, an entropy value can be represented as

H(l_i) = α·i,  (18)

where i is the sequence number of label l_i in ℒ. Because the actual entropy value is unnecessary for determining superiority among labels, the term α in (18) can be ignored. The entropy value of each label with regard to its sequence can then be represented as

H(l_i) ∝ i.  (19)

Because the sum of the integers from 1 to i is equal to i(i + 1)/2, (20) is obtained using (15):

|ℒ|(|ℒ| + 1)/2 − (|ℒ| − |Q|)(|ℒ| − |Q| + 1)/2 ≥ (|ℒ| − |Q|)(|ℒ| − |Q| + 1)/2.  (20)
Equation (20) can be simplified as

2(|ℒ| − |Q|)² + 2(|ℒ| − |Q|) − |ℒ|(|ℒ| + 1) ≤ 0.  (21)

The solution of (21), viewed as a quadratic in |ℒ| − |Q|, is given as

|ℒ| − |Q| = (−1 ± √(1 + 2|ℒ|(|ℒ| + 1)))/2.  (22)

Because |Q| is always a positive integer, the negative solution can be ignored. Therefore, we obtain

|Q| ≥ |ℒ| − (−1 + √(1 + 2|ℒ|(|ℒ| + 1)))/2.  (23)

For clarity, we approximate the solution as

|Q| ≈ (1 − 1/√2)|ℒ| ≈ 0.3|ℒ|.  (24)
The approximated solution 0.3|ℒ| is slightly greater than the exact solution for |Q|, so the approximation retains slightly more labels than strictly necessary. Therefore, Lemma 2 indicates that approximately 70% of the labels will be discarded, whereas 30% of the labels will remain in Q.
Lemma 3 (uniform case). |Q| = |ℒ|/2 for even |ℒ|, and |Q| = ⌊|ℒ|/2⌋ + 1 for odd |ℒ|.

Proof 4. Suppose that the entropy values of the labels are uniformly distributed, as shown in Figure 1. The figure indicates that Q must contain a corresponding label for each discarded label. Therefore, for the even case, the numbers of labels in Q and C must be the same for (15) to hold; thus, |Q| = |ℒ|/2. For the odd case, Q must have one more label than C; thus, |Q| = ⌊|ℒ|/2⌋ + 1.
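The lemmas can be checked numerically: the smallest |Q| satisfying (15) is found by accumulating entropies from the largest downward. A quick sketch (the function name is illustrative):

```python
import math

def min_labels_to_keep(entropies):
    """Smallest |Q| such that the sum of the |Q| largest entropies is at
    least the sum of the remaining ones, i.e., condition (15)."""
    hs = sorted(entropies, reverse=True)
    total = sum(hs)
    acc = 0.0
    for q, h in enumerate(hs, start=1):
        acc += h
        if acc >= total - acc:   # condition (15): sum_Q >= sum_C
            return q
    return len(hs)

# Proportional case: H(l_i) proportional to i -> |Q| approx (1 - 1/sqrt(2))|L|
q_prop = min_labels_to_keep(list(range(1, 1001)))
# Uniform case: equal entropies -> |Q| = |L|/2 for even |L|
q_unif = min_labels_to_keep([1.0] * 1000)
```

For 1000 proportional labels this yields 294 kept labels, within 0.1 percentage points of the (1 − 1/√2) ≈ 0.293 ratio of Lemma 2, and exactly 500 for the uniform case of Lemma 3.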
The proofs indicate that the number of labels to be selected decreases as the entropy values of the labels become more skewed. In addition, the proofs guarantee that |Q| must be smaller than |ℒ| and that the computational cost of evaluating the importance of each feature based on Q must be smaller than |ℒ|/2 + 1 unit costs. Therefore, Theorem 1 can be obtained.

Figure 4: Score value calculation when label entropy values are proportional to their rank.
Theorem 1. |Q| is always smaller than |ℒ|.

Proof 5. Suppose that there are two label sets, Q and C, to be considered and ignored, respectively, for calculating the importance of each feature. Because Q should carry more information content than C, Σ_{l∈Q} H(l) should be larger than Σ_{l∈C} H(l). As shown in Proposition 2, the algorithm can achieve this goal by (1) including the label with the largest entropy in Q and removing it from ℒ, (2) including labels with the smallest entropy in C and removing them from ℒ iteratively until Σ_{l∈Q} H(l) > Σ_{l∈C} H(l), and (3) repeating (1) and (2) until ℒ has no elements. If the entropy values of all the labels are the same, i.e., the largest and smallest entropy values are equal, one label is included in C each time a label is included in Q, as in Lemma 3. Thus, C can have more labels than Q only when the smallest entropy value is strictly smaller than the largest, indicating that the uniform case is the worst case from the viewpoint of the number of labels in Q. Consequently, the number of labels in Q cannot be larger than |ℒ|/2 + 1.
Because |Q| is always smaller than |ℒ| and calculating one MI term is regarded as the unit cost, the computational cost of evaluating each feature using J̃ is constant from the viewpoint of the number of labels.

3.4. Influence on Feature Ranking.

Multilabel feature selection is performed by ranking each feature according to its importance value. After label subset selection is conducted, the importance score of each feature is calculated by summing the M(f; l_i) terms, where l_i ∈ Q. When the entropy values of the labels are skewed, the rank based on J and that based on J̃ are unlikely to differ. To demonstrate this, we illustrate how the importance score is calculated under the skewed case in Figure 5. In the figure, there are three labels, namely l_1, l_2, and l_3; l_1 has the highest entropy, whereas l_2 and l_3 have insignificant entropies. The MI between each feature and each label is represented by the yellow bars, and the final score of each feature is shown on the right-hand side of the figure. The figure indicates that (1) the MI between each feature and each label is bounded by the entropy of that label and (2) the MI between each feature and the labels of high entropy mostly determines the final score of each feature. In other words, (3) the influence of the MI between each feature and l_2 or l_3 on the final score is insignificant.
With regard to the process of feature selection, Figure 5 implies three more indications. The first concerns the influence of labels with high entropy on the final score. Because the final score is determined by summing the MI terms between a feature and all the labels, a feature that is dependent on labels with high entropy is likely to have a high importance score. Therefore, such features will be included in the final feature subset S because of their higher rank, and they show promise as potential members of S. The second indication concerns changes among similarly ranked features. Because the goal of feature selection is to select a feature subset composed of n features, the specific rank of each feature is unimportant. For example, suppose that the algorithm chooses ten features because n is set to ten by the user or because of a limitation on storage. Label subset selection may swap the ranks of the second- and third-ranked features; however, both features will still be included in the final feature subset S because the algorithm is allowed to select ten features. The final indication concerns the ranks among unimportant features. Although there may be a set of features that are dependent on labels with small entropy, these features will have low importance scores and hence will be excluded from S.
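The rank-stability argument can be illustrated with hypothetical MI values, invented for illustration and bounded by skewed label entropies:

```python
# Hypothetical MI values M(f_i; l_j) for three features and three labels,
# bounded by skewed label entropies H = [2.0, 0.1, 0.1]: l_1 dominates.
M = [
    [1.80, 0.02, 0.09],  # f_1: strongly dependent on high-entropy l_1
    [1.20, 0.10, 0.10],  # f_2
    [0.30, 0.08, 0.05],  # f_3: mostly dependent on low-entropy labels
]
J_full = [sum(row) for row in M]      # score J over all labels
J_reduced = [row[0] for row in M]     # reduced score over Q = {l_1} only

rank_full = sorted(range(3), key=lambda i: J_full[i], reverse=True)
rank_reduced = sorted(range(3), key=lambda i: J_reduced[i], reverse=True)
```

Both rankings place f_1 first, f_2 second, and f_3 last: discarding the two low-entropy labels perturbs each score by at most H(l_2) + H(l_3) = 0.2, which is too small to reorder features separated by the dominant l_1 terms.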
Although the example of Figure 6 indicates that the rank of each feature is unlikely to change, or may change only meaninglessly, empirical experiments should follow to investigate the utility of label subset selection.

Experimental Results
A description of the multilabel datasets, algorithms, statistical tests, and other settings used in the experimental study is provided in this section. Next, the experimental results based on the different multilabel learning methods and datasets, together with the corresponding analysis, are presented.
4.1. Experimental Settings. Twenty real multilabel datasets were employed in our experiments [12,25,35]; the numbers of relevant and irrelevant features in these datasets are unknown.
Table 1 shows the standard statistics of the multilabel datasets. These statistics show that the 20 datasets cover a broad range of cases with diversified multilabel properties. In cases where the feature type is numeric, we discretized the features using the LAIM discretization method [45]. In addition, datasets composed of more than 10,000 features were preprocessed to retain the top 2% and 5% of features with the highest document frequency [12,46]. We conducted 8 : 2 hold-out cross-validation, and each experiment was repeated ten times; the average value was taken to represent the classification performance. A wide variety of multilabel classifiers can be considered for multilabel classification [8]. In this study, we chose the multilabel naive Bayes classifier [41] because its learning process can be conducted quickly, owing to the well-known naive Bayes assumption, without incurring an additional tuning process, and because our primary concern in this study is efficient multilabel learning. Finally, we considered four evaluation measures employed in many multilabel learning studies: execution time for the training and test process, Hamming loss, multilabel accuracy, and subset accuracy [8,29].
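The three classification measures can be sketched for predictions represented as Python sets of label indices; this is a simplified sketch of the standard definitions, not the paper's evaluation code:

```python
def hamming_loss(Ys, Zs, n_labels):
    """Fraction of label slots predicted incorrectly (symmetric difference
    between true set Y and predicted set Z), averaged over patterns."""
    return sum(len(y ^ z) for y, z in zip(Ys, Zs)) / (len(Ys) * n_labels)

def multilabel_accuracy(Ys, Zs):
    """Mean Jaccard similarity |Y & Z| / |Y | Z| (1.0 when both are empty)."""
    return sum(1.0 if not (y | z) else len(y & z) / len(y | z)
               for y, z in zip(Ys, Zs)) / len(Ys)

def subset_accuracy(Ys, Zs):
    """Fraction of patterns whose predicted label set matches exactly."""
    return sum(y == z for y, z in zip(Ys, Zs)) / len(Ys)
```

For example, with true sets {0, 1} and {2} against predictions {0} and {2}, the Hamming loss over 3 labels is 1/6, the multilabel accuracy is 0.75, and the subset accuracy is 0.5.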
The Friedman test was employed to analyze the performance of the multilabel feature selection methods; it is a widely used statistical test for comparing multiple methods over a number of datasets [47]. The null hypothesis of equal performance among the compared algorithms is rejected for an evaluation measure if the Friedman statistic F_F is greater than the critical value at significance level α. In this case, we proceed with post hoc tests to analyze the relative performance of the compared methods. The Bonferroni-Dunn test is employed because we are interested in determining whether the proposed method achieves performance similar to that of the feature selection process considering all of the labels and to that of multilabel learning without feature selection [48]. For the Bonferroni-Dunn test, the performance of the proposed method and that of another method are deemed statistically similar if their average ranks over all datasets are within one critical difference (CD). For our experiments, the critical value at significance level α = 0.05 is 2.492, and the CD at α = 0.05 is 1.249 because q_0.05 = 2.498 [48].
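The reported CD can be reproduced from Demšar's formula CD = q_α √(k(k + 1)/(6N)); with k = 5 compared methods, N = 20 datasets, and q_0.05 = 2.498, this gives 1.249:

```python
import math

def bonferroni_dunn_cd(q_alpha, k, n_datasets):
    """Critical difference CD = q_alpha * sqrt(k(k+1) / (6N)) for comparing
    k methods over N datasets (Demsar's formulation)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

cd = bonferroni_dunn_cd(2.498, 5, 20)   # settings used in the experiments
```

Two methods whose average ranks differ by less than this value are deemed statistically similar under the Bonferroni-Dunn test.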

Comparative Studies.
In this section, we compare the proposed feature selection method based on the label subset selection strategy to conventional multilabel learning without feature selection and to the conventional feature selection method without label subset selection. The details of each method besides the proposed one are as follows: (i) No: conventional multilabel learning without the feature selection process; here, ℱ is used as the input features for the multilabel classifier. (ii) SL: multilabel learning with the proposed feature selection process; here, S is used as the input features, and only the one label with the highest entropy is considered to measure the importance of each feature. (iii) 3L: multilabel learning with the proposed feature selection process; here, S is used as the input features, and the 30% of labels with the highest entropy are chosen by the label selection strategy to compose Q. (iv) 5L: multilabel learning with the proposed feature selection process; here, S is used as the input features, and the 50% of labels with the highest entropy are chosen by the label selection strategy to compose Q. (v) AL: multilabel learning with the conventional feature selection process; here, S is used as the input features, and the same feature subset can be obtained by setting Q = ℒ in the proposed method. All methods were carefully implemented in a MATLAB 8.2 programming environment and tested on an Intel Core i7-3930K (3.2 GHz) with 64 GB memory.
Tables 2-5 report the detailed experimental results of each method under comparison on the 20 multilabel datasets. For each evaluation measure, ↓ means the smaller the better, whereas ↑ means the larger the better. The best performance among the five methods under comparison is shown in boldface with a bullet mark. In addition, the average rank of each method under comparison over all the multilabel datasets is presented in the last column of each table. Table 6 reports the Friedman statistics F_F and the corresponding critical values for each evaluation measure. As shown in Table 6, at significance level α = 0.05, the null hypothesis of equal performance among the methods under comparison is clearly rejected in terms of each evaluation measure.
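The Friedman statistics in Table 6 can be reproduced from the per-dataset ranks using the standard formulas, with the Iman-Davenport F-distributed variant F_F derived from the Friedman chi-square. A minimal sketch (standard textbook formulas, not code from the study):

```python
import numpy as np

def friedman_statistic(ranks: np.ndarray):
    """Friedman chi-square and its F-distributed variant F_F.

    ranks: (N datasets, k methods) matrix of per-dataset ranks
    (1 = best). Returns (chi2_F, F_F).
    """
    n, k = ranks.shape
    R = ranks.mean(axis=0)  # average rank of each method
    chi2 = 12.0 * n / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)  # Iman-Davenport correction
    return chi2, f_f
```

F_F is then compared against the critical value of the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom, i.e., (4, 76) for k = 5 methods and N = 20 datasets.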
To show the relative performance of the proposed method and the conventional multilabel learning methods, Figure 7 illustrates the CD diagrams for each evaluation measure, where the average rank of each method is marked along the axis with better ranks placed on the right-hand side of each figure [47]. In each figure, any comparison method whose average rank is within one CD of that of the best method is interconnected with a thick line, whose length indicates the extent of the CD on the diagram. Any method not connected with the best method is considered to have a significantly different performance from the latter. Based on the empirical experiments and statistical analysis, the following observations can be made:

(1) As Figure 7 shows, the multilabel learning and classification process is significantly accelerated by the feature selection process. In particular, multilabel classification with SL and 3L is completed significantly faster than with No, indicating the superiority of the proposed approach.

(2) Focusing on the average ranks of AL and No in Figure 7, the advantage of multilabel feature selection from the viewpoint of execution time is insignificant, indicating that the benefit of the feature selection process on execution time can disappear owing to a large number of labels.

(3) As Figure 7 shows, the feature subset selected by AL delivers a statistically similar classification performance to the baseline performance of No. This means that the dimensionality of the input space can be reduced to accelerate the multilabel learning process without degrading the predictive performance.

(4) The feature subsets selected by the proposed methods based on label subset selection, such as 3L and 5L, deliver a classification performance comparable to that of the baseline classifier when a moderate number of labels are considered for evaluating the importance of features.

(5) A notable exception can be observed in the experimental results of SL, which considers only one label for the feature scoring process. However, SL still gives a statistically better performance than No in the experiments involving Hamming loss and a comparable performance in the experiments involving multilabel accuracy and subset accuracy.

(6) Surprisingly, if a moderate number of labels are considered for the feature scoring process, as in 3L or 5L, the feature subset gives statistically better discriminating power than the baseline performance given by No. For example, in the experiments involving Hamming loss, 3L gives a better performance than No on 85% of the multilabel datasets.

In summary, the experimental results show that the proposed method based on the label subset selection strategy achieves a significantly better execution time than the baseline multilabel setting No and the conventional multilabel learning with feature selection AL, indicating that the proposed method is able to accelerate the multilabel learning process. Furthermore, the feature subsets selected by the proposed method, such as 3L and 5L, yield a classification performance similar to that of the other methods. Because the proposed method has a lower execution time than the other methods, it is able to quickly identify the important feature subset without degrading the multilabel classification performance.
Finally, we conducted additional experiments to validate the scalability and efficiency of the proposed method. For this purpose, we employed the Delicious dataset, which is composed of a large number of patterns and labels [3]. Specifically, the Delicious dataset was extracted from the del.icio.us social bookmarking site, where the textual patterns and associated labels represent web pages and relevant tags. This dataset is composed of 16,105 patterns, 500 features, and 983 labels with 15,806 unique label subsets. To demonstrate the superiority of the proposed method, we employed MLCFS [19] and PPT + RF [24] for comparison. In this experiment, we regard 3L as the proposed method because it performs better than SL, 5L, and AL, as shown in Figure 7. Table 7 reports the experimental results of the three multilabel feature selection methods, including the proposed method. The experimental results indicate that the proposed method outputs the final feature subset much faster than the compared methods while achieving similar multilabel classification performance in terms of Hamming loss, multilabel accuracy, and subset accuracy.

Conclusion
In this study, we proposed an efficient multilabel feature selection method to achieve scalable multilabel learning when the number of labels is large. Because the computational load of the multilabel learning process increases with the number of features in the input data, the proposed method accelerates multilabel learning by selecting important features to reduce the dimensionality of the feature space. In addition, we demonstrated that, by restricting the number of labels considered in the feature scoring process, the feature selection process itself can be accelerated, further speeding up the multilabel learning process. Furthermore, empirical experiments on 20 multilabel datasets showed that the multilabel learning process can be boosted without deteriorating the discriminating power of the multilabel classifier. Future research directions include scalability against a large number of training examples. Although this can be achieved by a multilabel classification approach using distributed computing [49], the performance should be tested empirically to validate this potential. In addition, we will investigate the multilabel learning performance with respect to the label selection strategy. Our experiments indicate that the feature subset selected by the proposed method can deliver a better discriminating capability even though only a part of the labels in a given label set is considered for the feature scoring process. Because this was an unexpected result, as the primary goal of this study was the acceleration of the multilabel learning process, we would like to investigate this issue more thoroughly in the future.

Figure 2: Score value calculation when label entropy values are skewed.

Figure 3: Entropy of each label in the BibTeX dataset.

Figure 5: Importance score of each feature from the viewpoint of the entropy of each label when the entropy values of labels are skewed.
(i) U: number of patterns in the dataset
(ii) ℱ: number of features
(iii) Feature type: type of feature
(iv) ℒ: number of labels
(v) Card: average number of labels for each instance (label cardinality)
(vi) Den: label cardinality divided by the total number of labels (label density)
(vii) Distinct: number of unique label subsets in ℒ (distinct label set)
(viii) PDL: number of distinct label sets divided by the total number of patterns (portion of distinct labels)
(ix) Domain: application to which each dataset corresponds
(x) S: number of features to be selected (W)
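The label-side statistics in this list (Card, Den, Distinct, PDL) are all simple functions of the binary label matrix. A minimal sketch of how they can be computed, assuming Y is an (n_patterns × n_labels) 0/1 matrix (the function name is illustrative, not from the paper):

```python
import numpy as np

def label_statistics(Y: np.ndarray) -> dict:
    """Compute the standard multilabel dataset characteristics
    Card, Den, Distinct, and PDL from a binary label matrix Y."""
    n, L = Y.shape
    card = Y.sum(axis=1).mean()                 # label cardinality
    distinct = len({tuple(row) for row in Y})   # unique label subsets
    return {
        "Card": card,
        "Den": card / L,                        # label density
        "Distinct": distinct,
        "PDL": distinct / n,                    # portion of distinct labels
    }
```

For the Delicious dataset described above (16,105 patterns, 15,806 unique label subsets), this yields the reported PDL of roughly 0.98.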

Figure 7: Bonferroni-Dunn test results of the five comparing methods with four evaluation measures. Methods not connected with the best method in the CD diagram are considered to have significantly different performance (significance level α = 0.05). This is reproduced from Lee et al. (2017) (under the Creative Commons Attribution License/public domain).

Table 1: Standard characteristics of multilabel datasets.

Algorithm excerpt: Sort ℱ by score values J in descending order; 12: Set S ← top n features with the highest scores in ℱ.

Table 2: Execution time (↓) for the training and testing process of each comparing method (mean ± std. deviation) on 20 multilabel datasets.

Table 3: 3L gives a better Hamming loss performance than No on 85% of the multilabel datasets.

Table 6: Summary of the Friedman statistics F_F (k = 5, N = 20) and the critical value in terms of each evaluation measure.

Table 7: Comparison results of the proposed method, MLCFS, and PPT + RF on the Delicious dataset.