Multilabel Classification Using Low-Rank Decomposition


In recent years, multilabel learning has gradually attracted significant attention. A variety of algorithms have been proposed, which can be broadly divided into two categories [15]: algorithm adaptation and problem transformation. The core idea of the former is to adapt an existing supervised learning algorithm so that it can solve multilabel learning problems directly, such as ML-kNN [16], while the latter converts the multilabel learning problem into another well-studied problem, such as BR [17]. Some multilabel algorithms solve the problem without using the correlation among different labels, such as LIFT [18]. The main idea of LIFT is to obtain identifying characteristics for each label and build a new feature space. It first obtains the positive and negative examples corresponding to each label, then performs cluster analysis on each set of examples to obtain cluster centers, and finally uses the cluster centers to construct label-specific features. Since LIFT does not consider label correlations, it can be regarded as a pure feature-conversion method. Other algorithms exploit label correlations [19][20][21][22][23][24][25]. For example, the basic idea in [20] is to model the correlation among labels with a Bayesian network and to achieve efficient learning through an approximate strategy. Indeed, rational use of label correlations can effectively boost multilabel classification performance. For example, if an image has the labels "football" and "rainforest," it is likely to also be labeled "Brazil," whereas a document annotated with "desert" has a low probability of being labeled "river." Therefore, how to effectively explore and fully exploit label correlations is a crucial problem in multilabel learning.
In fact, for an object with multiple labels, the related labels still differ in importance. Although the importance of each label is not given directly, it can be judged through external observation: generally speaking, the larger the proportion a concept occupies in the original object, the more important the corresponding label. Accurately expressing label importance is therefore also a challenge. The method in [26] decomposes the original output space to obtain latent label semantic information, which effectively strengthens the subsequent feature selection. Motivated by the decomposition of the label space in [26], in this paper we propose a method named label low-rank decomposition (LLRD) for multilabel classification. The LLRD algorithm first performs low-rank decomposition on the label matrix, then combines the decomposed result with the original features to form a new feature space and mines the structural information of the features through sparse reconstruction. Third, it transforms the binary labels into real-valued ones and finally converts the classification problem into a regression problem. The contributions of this paper are as follows: (1) utilize low-rank decomposition to reveal the global label correlations and achieve good classification results; (2) combine the low-rank decomposition result with the original features, reducing the information loss in the subsequent label transformation process.

The Process of LLRD.
First, LLRD decomposes the label matrix with a low-rank method. In the multilabel learning framework, the label matrix is often considered to be low rank [27,28] due to the existence of label correlations, and a low-rank structure is also a way to explore the global relationship between labels. Therefore, we can perform low-rank decomposition on the label matrix. Assuming that the rank of Y is r < q, Y can be written as

Y = AB,

where A ∈ R^{q×r} represents the dependency of B on the original label space and B ∈ R^{r×p} is a mapping of the original labels that also contains label correlation information. Second, we combine B with X to form a new feature space N = [X; B] = [n_1, n_2, ..., n_p] ∈ R^{(r+d)×p}. To reveal the inner structure of the feature space, we use a sparse reconstruction [29] method to model the relationship between the training instances. Specifically, we use W = [s_{ij}]_{p×p} to represent the training-instance relationship matrix, where s_{ij} measures the relationship between n_i and n_j. Let S_i = [s_{1i}, ..., s_{i−1,i}, s_{i+1,i}, ..., s_{pi}]^T denote the sparse reconstruction coefficients related to n_i. According to sparse representation theory, S_i can be obtained by solving

min_{S_i} (1/2) ||n_i − N_i S_i||_2^2 + λ ||S_i||_1,

where N_i = [n_1, n_2, ..., n_{i−1}, n_{i+1}, ..., n_p] is the combination of all training instances except n_i. This problem can be solved with the alternating direction method of multipliers [30]. Third, we transform the original binary label vector y_i = (l_{i1}, l_{i2}, ..., l_{iq})^T associated with each x_i in the training set into a real-valued label vector c_i = (c_{i1}, c_{i2}, ..., c_{iq})^T, where l_{ij} ∈ {−1, 1} and c_{ij} ∈ R. The real values carry more information, and the magnitude of each value also indicates the importance of the corresponding label.
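The first three steps above can be sketched numerically. The toy sizes, the regularization weight `lam`, and the ISTA solver (used here in place of the ADMM solver mentioned in the text, purely for brevity) are all illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
q, p, d, r = 6, 40, 10, 3        # labels, instances, features, assumed rank

# Toy data: a rank-r label matrix Y in R^{q x p} and features X in R^{d x p}.
Y = rng.standard_normal((q, r)) @ rng.standard_normal((r, p))
X = rng.standard_normal((d, p))

# Step 1: low-rank decomposition Y = A B via truncated SVD.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
A = U[:, :r] * s[:r]             # A in R^{q x r}
B = Vt[:r, :]                    # B in R^{r x p}

# Step 2: augmented feature space N = [X; B] in R^{(d+r) x p}.
N = np.vstack([X, B])

# Step 3: sparse reconstruction coefficients S_i for instance n_i, i.e.
# min 0.5*||n_i - N_i S_i||^2 + lam*||S_i||_1, solved with plain ISTA.
def sparse_code(N, i, lam=0.1, iters=500):
    n_i = N[:, i]
    N_i = np.delete(N, i, axis=1)        # all training instances except n_i
    L = np.linalg.norm(N_i, 2) ** 2      # Lipschitz constant of the gradient
    S = np.zeros(N_i.shape[1])
    for _ in range(iters):
        g = N_i.T @ (N_i @ S - n_i)      # gradient of the smooth part
        z = S - g / L                    # gradient step
        S = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return S

S0 = sparse_code(N, 0)
```

The soft-thresholding step is what produces the sparsity in S_i; larger `lam` yields sparser reconstruction coefficients.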
Since the input space and the label space are often interrelated, it is assumed that the relationship between n_i and n_j in the input space also holds between c_i and c_j in the label space. Accordingly, the representation error over the label space can be written as

Σ_{i=1}^{p} || c_i − Σ_{j≠i} s_{ji} c_j ||_2^2 = || C (I − W) ||_F^2,

where C = [c_1, c_2, ..., c_p]. The above quadratic programming problem can be solved with mature quadratic programming tools. The original multilabel classification problem is thus transformed into a multioutput regression problem, for which many solutions exist [31]. The learning of LLRD contains three phases: low-rank decomposition, sparse reconstruction, and multioutput regression. The time complexity of low-rank decomposition and sparse reconstruction is O(d²p + d³). If we choose multioutput support vector regression to realize the classification, its time complexity is O(qp³).
Thus, the total complexity of LLRD is O(d²p + d³ + qp³).
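Once the real-valued label matrix is available, any multioutput regressor can play the final role. The sketch below uses closed-form ridge regression as one such choice (the sizes and the regularization weight `alpha` are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, q = 50, 8, 4                       # instances, features, labels (toy sizes)
X = rng.standard_normal((p, d))          # feature matrix, one row per instance
W_true = rng.standard_normal((d, q))
C = X @ W_true                           # real-valued label matrix from the transformation step

# One standard multioutput regressor: ridge regression in closed form,
# W = (X^T X + alpha I)^{-1} X^T C, fit to all q outputs at once.
alpha = 1e-3
W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ C)

pred = X @ W                             # real-valued predictions
labels = np.where(pred > 0, 1, -1)       # threshold back to binary labels in {-1, +1}
```

Because the targets are real-valued, the regressor's outputs can be thresholded at zero to recover binary label assignments, while their magnitudes still reflect label importance.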

Experiment Setup.
In this subsection, we compare our LLRD with six other multilabel learning methods on six multilabel evaluation criteria, which fall into two categories: example-based and label-based metrics [32]. An example-based metric first measures the performance of the learning system on each test example and then returns the average over the entire test set. In contrast, a label-based metric first evaluates the system on each label and then computes the macro/micro-averaged F1 value over all labels.
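The macro/micro distinction is easy to see in code. The per-label confusion counts below are hypothetical values chosen only to make the difference visible:

```python
import numpy as np

# Hypothetical per-label confusion counts for q = 3 labels.
TP = np.array([8, 2, 5])
FP = np.array([2, 1, 0])
FN = np.array([0, 3, 5])

# Macro-averaged F1: compute F1 per label, then average over labels.
macro_f1 = np.mean(2 * TP / (2 * TP + FP + FN))

# Micro-averaged F1: pool the counts over all labels, then compute one F1.
micro_f1 = 2 * TP.sum() / (2 * TP.sum() + FP.sum() + FN.sum())
```

Macro-averaging weights every label equally, so rare labels influence it strongly; micro-averaging weights every instance decision equally, so frequent labels dominate.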
In this paper, one-error, coverage, ranking loss, and average precision are employed for example-based evaluation, while macro-averaging and micro-averaging F1 are the label-based metrics. For the example-based metrics other than average precision, larger values indicate worse performance; for the remaining metrics, larger values indicate better performance. Given a multilabel test set, f(x, l) can be seen as the confidence of l being a proper label of x; in addition, f(x, l) can be converted into a ranking function rank_f(x, l).
The six evaluation criteria used in the paper are defined as follows (m denotes the number of test examples, Y_i the relevant label set of x_i, and Ȳ_i its complement):

(1) One-error, the fraction of examples whose top-ranked label is not relevant:
one-error(f) = (1/m) Σ_{i=1}^m [[ arg max_l f(x_i, l) ∉ Y_i ]].

(2) Coverage, how far down the ranked label list one must go, on average, to cover all relevant labels:
coverage(f) = (1/m) Σ_{i=1}^m max_{l ∈ Y_i} rank_f(x_i, l) − 1.

(3) Ranking loss, the average fraction of incorrectly ordered (relevant, irrelevant) label pairs:
rloss(f) = (1/m) Σ_{i=1}^m |{(l, l′) ∈ Y_i × Ȳ_i : f(x_i, l) ≤ f(x_i, l′)}| / (|Y_i| |Ȳ_i|).

(4) Average precision, the average fraction of relevant labels ranked above each relevant label:
avgprec(f) = (1/m) Σ_{i=1}^m (1/|Y_i|) Σ_{l ∈ Y_i} |{l′ ∈ Y_i : rank_f(x_i, l′) ≤ rank_f(x_i, l)}| / rank_f(x_i, l).

(5) Macro-averaging F1, with F1 computed per label and then averaged:
Macro-F1 = (1/q) Σ_{j=1}^q 2TP_j / (2TP_j + FP_j + FN_j).

(6) Micro-averaging F1, with counts pooled over all labels before computing F1:
Micro-F1 = 2 Σ_j TP_j / (2 Σ_j TP_j + Σ_j FP_j + Σ_j FN_j),

where FN_j, TN_j, FP_j, and TP_j denote the numbers of false-negative, true-negative, false-positive, and true-positive instances with regard to l_j.

To test the effectiveness of LLRD, we chose six multilabel learning algorithms, MLFE [33], RAKEL [34], ML² [35], CLR [36], LIFT [18], and RELIAB [37], for performance comparison. MLFE makes full use of the intrinsic information in the feature space, enriching the semantics of the label space. The parameters of MLFE are set as follows: ρ = 1, c_1 = 1, c_2 = 2, and β_1, β_2, and β_3 searched from {1, 2, ..., 10}, {1, 10, 15}, and {1, 10}, respectively. RAKEL is a high-order approach whose basic idea is to transform the multilabel learning problem into an ensemble of multiclass classification problems. We use the default settings recommended for RAKEL, namely k = 3 and ensemble size n = 2q. For ML², the parameter values are as follows: λ = 1, K = l + 1, and C_1 and C_2 selected from {1, 2, ..., 10}. ML² is the first multilabel learning algorithm that attempts to explore manifolds at the label level. CLR is a second-order problem transformation method. It solves multilabel classification through label ranking, in which the ranking among labels is obtained by pairwise comparison. The associated ensemble-size parameter is set to (
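The four example-based criteria can be implemented directly from their definitions. The sketch below assumes `scores` is an m×q matrix of confidences f(x_i, l_j) and `Y` is the binary m×q ground-truth matrix (names and shapes are illustrative conventions, not the paper's notation):

```python
import numpy as np

def example_based_metrics(scores, Y):
    """One-error, coverage, ranking loss, and average precision."""
    m, q = scores.shape
    # rank_f(x, l): 1 = highest-scored label, q = lowest-scored.
    order = np.argsort(-scores, axis=1)
    rank = np.empty_like(order)
    for i in range(m):
        rank[i, order[i]] = np.arange(1, q + 1)

    # One-error: top-ranked label is not in the relevant set.
    one_error = np.mean([Y[i, np.argmax(scores[i])] == 0 for i in range(m)])
    # Coverage: worst rank among relevant labels, minus 1.
    coverage = np.mean([rank[i, Y[i] == 1].max() - 1 for i in range(m)])

    rloss, avgp = 0.0, 0.0
    for i in range(m):
        pos = np.flatnonzero(Y[i] == 1)
        neg = np.flatnonzero(Y[i] == 0)
        # Ranking loss: fraction of mis-ordered (relevant, irrelevant) pairs.
        rloss += np.mean(scores[i, pos][:, None] <= scores[i, neg][None, :])
        # Average precision over the relevant labels.
        r = rank[i, pos]
        avgp += np.mean([(r <= rk).sum() / rk for rk in r])
    return one_error, coverage, rloss / m, avgp / m
```

For a perfect ranker, one-error and ranking loss are 0 and average precision is 1; coverage attains its minimum |Y_i| − 1 averaged over examples.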

Experimental Results.
For each dataset in our experiment, we adopt the tenfold cross-validation strategy. The experimental results are mainly distributed in Tables 2 and 3, where we record the performance of the different algorithms on the multilabel datasets; specifically, the tables report the average and standard deviation of each evaluation criterion. For each evaluation metric, "↓" indicates "the smaller the better" and "↑" indicates "the larger the better", and the best results are shown in bold. We use the Friedman test [38] based on average ranks to verify whether the differences between the algorithms are statistically significant. If the hypothesis that "all algorithms have equal performance" is rejected, the performance of the algorithms differs significantly. As can be seen from Table 4, the hypothesis that there is no significant difference among the algorithms does not hold at the 0.05 significance level. Therefore, we need a post hoc test to further distinguish the algorithms. There are usually two options for the post hoc test: the Nemenyi test [38] and the Bonferroni-Dunn test [39]. For k algorithms, the former requires k(k − 1)/2 pairwise comparisons, while the latter requires only k − 1; thus, we choose the latter. The Bonferroni-Dunn test is used to check whether LLRD is more competitive than the compared algorithms, with LLRD acting as the control algorithm. When the difference in average rank between two algorithms exceeds one critical difference (CD), the performance of the two algorithms differs significantly. The CD value is calculated as CD = q_α √(k(k + 1)/(6N)), where k = 7 and N = 13; at the 0.05 significance level, the corresponding q_α = 2.638. The CD diagram associated with LLRD and the compared algorithms is shown in Figure 1.
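Plugging the stated constants into the CD formula gives the threshold used in the CD diagram:

```python
import math

k, N, q_alpha = 7, 13, 2.638   # algorithms, datasets, q value at alpha = 0.05

# CD = q_alpha * sqrt(k(k + 1) / (6N))
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 3))  # prints 2.235
```

So two algorithms are judged significantly different here when their average ranks differ by more than about 2.235.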
The numbers on the horizontal axis indicate the average rank of each algorithm under the different evaluation criteria.
There is no significant difference in performance among algorithms connected by a solid line. Through the analysis of the above experimental results, we can draw the following conclusions: (1) In terms of one-error, coverage, ranking loss, and average precision, LLRD is clearly superior to RELIAB, RAKEL, and CLR. (2) The smaller the average rank, the better the performance of the corresponding algorithm; LLRD attains the optimal average rank in five of the six CD subdiagrams, which shows that LLRD outperforms the other algorithms. (3) For regular-size datasets, LLRD ranks first in 69% of the cases under the different evaluation criteria, while for large-scale datasets it ranks first in 36.1%.

Conclusions
In this work, we propose a novel multilabel classification algorithm named LLRD, which adopts low-rank decomposition to capture the internal information of the label space and further reduces the information loss of the label transformation via the new feature space. Experimental results show that the proposed LLRD outperforms many state-of-the-art multilabel classification techniques. In the future, we will explore models that combine the low-rank decomposition and the classification into a joint optimization problem in order to account for more complex label correlations.