Multilabel learning, which focuses on whether each label is related or unrelated to an instance, can solve many ambiguity problems. Label distribution learning (LDL) reflects the importance of each related label to an instance and thus offers a more general learning framework than multilabel learning. However, current LDL algorithms ignore the linear relationship between the label distribution and the features. In this paper, we propose a regularized sample self-representation (RSSR) approach for LDL. First, the label distribution problem is formalized by sample self-representation, whereby each label distribution is represented as a linear combination of the relevant features. Second, the LDL problem is solved by L2-norm and L2,1-norm least-squares methods, which reduce the effects of outliers and overfitting. The corresponding algorithms are named RSSR-LDL2 and RSSR-LDL21. Third, the proposed algorithms are compared with four state-of-the-art LDL algorithms on 12 public datasets using five evaluation metrics. The results demonstrate that the proposed algorithms effectively identify the predictive label distribution and perform well in terms of distance and similarity evaluations.
1. Introduction
Multilabel learning allows more than one label to be associated with each instance [1]. In many practical applications, such as text categorization, ticket sales, and torch relays [2], objects have more than one semantic label, a property often expressed as the object's ambiguity. As an effective learning paradigm, multilabel learning is applied in a variety of fields [3, 4], but it mainly focuses on whether each label is related or unrelated to an instance.
Though multilabel learning can solve many ambiguity problems, it is not well suited to some practical problems [5, 6]. For example, consider the image recognition problem of a natural scene annotated with mostly water, lots of sky, some cloud, a little land, and a few trees. As can be seen from Figure 1, each label in Figure 1(a) should be assigned a different importance. Multilabel learning mainly considers whether each label is related or unrelated to an instance, rather than the difference in importance between labels [7]. This raises the question of how to determine the importance of the different labels of an instance. Label distribution learning (LDL) can reflect the importance of each label to an instance in a manner similar to a probability distribution. Figure 1(b) shows an example of a label distribution. This scenario is encountered in many types of multilabel tasks, such as age estimation [7], expression recognition [8], and the prediction of crowd opinions [9].
Figure 1: A natural scene image annotated with water, sky, cloud, land, and trees: (a) multilabel annotation; (b) label distribution.
In contrast to the multilabel learning output of a set of labels, the output of LDL is a probability distribution [10]. In recent years, LDL has become a popular topic of research as a new paradigm in machine learning. For instance, Geng et al. proposed the IIS-LDL and CPNN algorithms to estimate the ages of different faces [11]. Their approach achieves better results than previous age estimation algorithms, because they use more information in the training process. Thereafter, Geng developed a complete framework for LDL [10]. This framework not only defines LDL but also generalizes LDL algorithms and gives corresponding metrics to measure their performance.
At present, the parameter model for LDL is mainly based on Kullback-Leibler divergence [12]. Different models can be used to train the parameters, such as maximum entropy [13] or logistic regression [14], although there is no particular evidence to support their use. To some extent, the LDL process ignores the linear relationship between the features and the label distribution. Unlike other applications, LDL aims to predict the label distribution rather than the category. Thus, the overall label distribution can be effectively reconstructed from the corresponding samples.
In this paper, we propose an LDL method that uses the property of sample self-representation to reconstruct the labels. Because the labels in LDL resemble a probability distribution without actually being one, we can represent the labels through the feature matrix rather than through the distance between two probability distributions. With these considerations, we use a least-squares model to establish the objective function: as far as possible, each label distribution is represented as a linear combination of its relevant features. The goal of this optimization model is to minimize the residuals. We combine LDL with sparsity regularization and introduce regularization terms to solve the model. To solve the objective function efficiently, we use the L2-norm and L2,1-norm as the regularization terms; the corresponding regularized sample self-representation (RSSR) algorithms are named RSSR-LDL2 and RSSR-LDL21. The proposed algorithms not only have strong interpretability but also avoid overfitting. In a series of experiments, we demonstrate that the proposed algorithms are superior to four state-of-the-art algorithms in terms of a variety of distance and similarity metrics. Experimental analysis on public datasets shows that the proposed method can effectively predict the label distribution.
The remainder of this paper is organized as follows. A brief review of related work on LDL and sparsity regularization is presented in Section 2. In Section 3, we introduce the LDL task and evaluation metrics. We describe the RSSR-LDL method and develop two algorithms in Section 4. Section 5 presents and analyzes the experimental results. Finally, we conclude this paper and present some ideas for future work in Section 6.
2. Related Work
The continued efforts of researchers have led to various LDL algorithms being proposed [9, 13, 15]. There are three main design strategies in the literature [10]: problem transformation, algorithm adaptation, and specialized algorithm design. Problem transformation (PT) takes the label distribution instances and transforms them into multilabel instances or single-label instances; PT-SVM and PT-Bayes are the representative algorithms in this class. Algorithm adaptation (AA) extends some existing supervised learning algorithms to deal with the problem of label distribution, such as the AA-kNN and AA-BP algorithms [10]. Unlike problem transformation and algorithm adaptation, specialized algorithm (SA) design sets up a direct model for the label distribution data. Typical algorithms include LDLogitBoost, based on logistic regression [16], SA-BFGS, based on maximum entropy [10], and DLDL, which combines LDL with deep learning [17, 18].
Unlike traditional clustering [19] or classification learning [20], the labels in LDL have similar patterns to probability distributions. According to the definition of the label distribution, we assume that there may be a function that matches the feature to the label. We find that each label can be well approximated by a linear combination of its relevant features.
Linear reconstruction cannot, strictly speaking, be expressed as $y = Ax$, where $A$ is a coefficient matrix; instead, we minimize the residual and optimize it by the least-squares method. To avoid overfitting and make the problem well posed, an L2-norm regularization term is often added. In this way, the label distribution is constructed directly from the sample information of the data through the coefficient matrix. To find the corresponding relevant features while avoiding the effects of noise in high-dimensional data [21], we introduce sparse reconstruction [22, 23], which adds a sparse regularization term, the L1-norm or L0-norm, to the linear reconstruction. More recently, the L2,1-norm and L2,0-norm have been proposed as sparse regularization terms [24]. L2,1-norm regularization selects features across all of the data points with joint sparsity [25]. For a matrix $A = (a_{ij})$,

(1) $\|A\|_{2,1} = \sum_{i} \sqrt{\sum_{j} a_{ij}^{2}}$.
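As a concrete illustration of (1), the L2,1-norm sums the Euclidean norms of the rows of a matrix. The following is a minimal NumPy sketch of our own, not code from the paper:

```python
import numpy as np

def l21_norm(A):
    """L2,1-norm of (1): sum over rows i of the Euclidean norm of row i,
    i.e., sum_i sqrt(sum_j a_ij^2)."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(A))  # rows contribute 5 + 0 + 13 = 18.0
```

The L2,1-norm is convex but not smooth at zero rows, which is exactly what drives whole rows of the solution to zero and yields joint sparsity.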
Sparsity reconstruction is widely used in machine learning, especially for data dimensionality reduction [26, 27]. For example, Cai proposed the MCFS algorithm, which uses L1-regularized least-squares to deal with multicluster data [28], and Zhu et al. proposed the RMR algorithm based on regularized self-representation for feature selection [29]. Nie developed the RFS and JELSR algorithms, which use L2,1-regularized least-squares to optimize the objective function [25, 30]. Furthermore, L1-SVM and sparse logistic regression [31] have been shown to be effective. In general, linear reconstruction and sparsity reconstruction produce good performance in feature selection and classification [32, 33]. In the following, we propose a label distribution learning method based on linear reconstruction and sparsity reconstruction.
3. The Proposed Model
The goal of LDL is to obtain a set of probability distributions. Therefore, LDL is different from previous approaches in terms of the problem statement. In this section, the problem statement is briefly reviewed and the proposed model is introduced.
LDL is the process of describing an instance x more naturally using labels [10]. We assign a value dxy to each of the corresponding possible labels y. This value represents the extent to which the label y describes instance x, that is, the description degree. Taking account of the corresponding subset of labels can give a complete description of the sample. Therefore, it is assumed that dxy∈[0,1] and ∑ydxy=1. That is, the data for dxy have a form that is similar to a probability distribution for an instance. The learning process based on such data is called label distribution learning.
We use $x_1, x_2, \ldots, x_n$ to represent the $n$ instances, where $x_i \in \mathbb{R}^d$, and $X = [x_1, x_2, \ldots, x_n]$. Let $Y = \{y_1, y_2, \ldots, y_c\}$ denote the complete set of $c$ class labels. The corresponding label distribution is $D = [d_1, d_2, \ldots, d_n]^T$, where $d_i = [d_{x_i}^{y_1}, d_{x_i}^{y_2}, \ldots, d_{x_i}^{y_c}]$ represents the label distribution of instance $x_i$. Concretely, $d_{x_i}^{y_k}$ represents the extent to which label $y_k$ describes instance $x_i$. Therefore, the training set of label distribution learning in this paper is $S = [(x_1, d_1), (x_2, d_2), \ldots, (x_n, d_n)]^T$. In addition, the test data are defined as $X' = [x_1', x_2', \ldots, x_m']^T$ and the corresponding predicted label distributions as $D' = [d_1', d_2', \ldots, d_m']^T$.
We combine LDL with regularized sample self-representation to give the RSSR-LDL model. In RSSR-LDL, each sample and the corresponding description degrees have the following relationship:

(2) $d_i = x_i P$,

where $P \in \mathbb{R}^{d \times c}$ is the transformation matrix from the sample to the description degrees. By definition, this is equivalent to

(3) $D = XP$.

In general, $n > d$ for $X$, so (3) has no exact solution [34].
To find the optimal $P$, we introduce the residual sum function $L$:

(4) $L(P) = \|XP - D\|_2^2$.

Letting $\hat{P}$ denote the minimizer of $L(P)$, the objective function of the model is

(5) $\hat{P} = \arg\min_P \|XP - D\|_2^2$.
Setting the derivative of (4) to zero, we obtain

(6) $X^T X P - X^T D = 0$.

When $X$ is not of full rank, or there is a significant linear correlation between its columns, the determinant of $X^T X$ will be close to 0, which makes inverting $X^T X$ an ill-posed problem. This introduces a large error into the calculation of $(X^T X)^{-1}$, so (5) lacks stability and reliability. Therefore, we introduce a regularization term $R(P)$ with parameter $\gamma$ to optimize the objective function.
In other words,

(7) $\hat{P} = \arg\min_P \|XP - D\|_2^2 + \gamma R(P), \quad \gamma > 0$.

There are several possible regularizations [25]:

(8) $R_1(P) = \sum_{j=1}^{c} \|p_j\|_1$, $\quad R_2(P) = \|P\|_2^2$, $\quad R_3(P) = \sum_{k=1}^{d} \sqrt{\sum_{j=1}^{c} P_{kj}^2}$.

$R_1(P)$ is the LASSO regularization. $R_2(P)$ is ridge regression, also known as Tikhonov regularization; it is the most frequently used regularization method for ill-posed problems. $R_3(P)$ is a newer joint regularization, the L2,1-norm [25, 29].
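The three regularizers in (8) are easy to compare side by side. A small sketch of our own, with a hypothetical matrix P:

```python
import numpy as np

def r1(P):
    # LASSO: sum of the L1-norms of the columns p_j (all absolute entries)
    return np.abs(P).sum()

def r2(P):
    # ridge / Tikhonov: squared Frobenius norm of P
    return (P ** 2).sum()

def r3(P):
    # joint L2,1 regularization: Euclidean norms of the rows, summed
    return np.sqrt((P ** 2).sum(axis=1)).sum()

P = np.array([[1.0, -2.0],
              [0.0,  2.0]])
print(r1(P), r2(P), r3(P))  # 5.0, 9.0, sqrt(5) + 2
```

R2 keeps all entries small but dense; R1 zeroes individual entries; R3 zeroes whole rows of P, which corresponds to discarding whole features.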
4. Regularized Model and Algorithm
To solve the objective function efficiently, we use the L2-norm and L2,1-norm to regularize the RSSR-LDL model, resulting in RSSR-LDL2 and RSSR-LDL21. The RSSR-LDL2 and RSSR-LDL21 algorithms are presented in this section.
4.1. Regularized Sample Self-Representation by L2-Norm
We first use the L2-norm of $P$ to solve the RSSR-LDL problem; that is, we take the regularization term in (7) to be $R_2(P)$. Then, (7) becomes

(9) $\hat{P} = \arg\min_P \|XP - D\|_2^2 + \gamma_1 \|P\|_2^2$.

Because (9) is smooth, it can be solved by differentiation:

(10) $X^T X P + \gamma_1 I P - X^T D = 0$,

or equivalently,

(11) $(X^T X + \gamma_1 I) P = X^T D$.

Since $X^T X + \gamma_1 I$ is guaranteed to be nonsingular for $\gamma_1 > 0$,

(12) $P = (X^T X + \gamma_1 I)^{-1} X^T D$.
We can predict the label distribution of the test dataset using the learned matrix $P$. Specifically, the predicted label distribution is

(13) $d_i^0 = x_i' P$, $\quad d_i' = \dfrac{d_i^0}{\sum_{k=1}^{c} d_{ik}^0}$.

Based on the above analysis, we summarize Algorithm 1.
Algorithm 1: Regularized sample self-representation by L2-norm (RSSR-LDL2).
Input: Training set S, test data X′, and the regularization parameter γ1.
Output: The predicted label distributions d1′, d2′, …, dm′ of the test data X′.
(1) $P = (X^T X + \gamma_1 I)^{-1} X^T D$, where $X = [x_1, \ldots, x_n]^T$ and $D = [d_1, \ldots, d_n]^T$; // Compute the transformation matrix.
(2) $d_i^0 = x_i' P$; // Use the transformation matrix to obtain the raw predicted label distributions.
(3) $d_i' = d_i^0 / \sum_{k=1}^{c} d_{ik}^0$; // Normalize $d_i^0$ ($i = 1, 2, \ldots, m$) so that $d_x^y \in [0, 1]$ and $\sum_y d_x^y = 1$.
Algorithm 1 does not include an iterative process. In other words, the matrix P can be solved directly, which makes the algorithm faster. Although this approach is efficient and easy to understand, it is not very accurate. Thus, in the next section, we use the L2,1-norm of P to solve the RSSR-LDL problem.
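Algorithm 1 amounts to one ridge-regression solve plus a row normalization. The following is a self-contained NumPy sketch of RSSR-LDL2 written for this article; the toy data are hypothetical, and the normalization assumes positive raw predictions, as in (13):

```python
import numpy as np

def rssr_ldl2_fit(X, D, gamma1=1.0):
    """Closed-form transformation matrix of (12):
    P = (X^T X + gamma1 * I)^(-1) X^T D."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma1 * np.eye(d), X.T @ D)

def rssr_ldl2_predict(X_test, P):
    """Steps (2)-(3) of Algorithm 1: map through P, then normalize
    each row so the description degrees sum to 1."""
    D0 = X_test @ P
    return D0 / D0.sum(axis=1, keepdims=True)

# hypothetical toy data: 6 samples, 3 features, 2 labels
rng = np.random.default_rng(0)
X = rng.random((6, 3))
D = rng.random((6, 2))
D /= D.sum(axis=1, keepdims=True)  # make each row a valid distribution
P = rssr_ldl2_fit(X, D, gamma1=0.1)
D_pred = rssr_ldl2_predict(X, P)
print(D_pred.sum(axis=1))  # every row sums to 1
```

Because P has a closed form, no iteration is needed, which is the source of the speed advantage noted above.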
4.2. Regularized Sample Self-Representation by L2,1-Norm
Combining the characteristics of $R_1(P)$ and $R_2(P)$, we choose the L2,1-norm of $P$, that is, $R_3(P)$, as the regularization term. This gives the RSSR-LDL21 algorithm. The objective function of (7) becomes

(14) $\hat{P} = \arg\min_P \|XP - D\|_{2,1} + \gamma_2 \|P\|_{2,1}$,

which can be transformed into

(15) $\arg\min_P \frac{1}{\gamma_2}\|XP - D\|_{2,1} + \|P\|_{2,1}$.
According to [25], this can be further transformed into

(16) $\min_Q \|Q\|_{2,1}$ s.t. $ZQ = D$,

where $Q = [P, A]^T \in \mathbb{R}^{(d+n) \times c}$, $A \in \mathbb{R}^{n \times c}$, $Z = [X^T, \gamma_2 I] \in \mathbb{R}^{n \times (d+n)}$, and $I \in \mathbb{R}^{n \times n}$. The problem in (16) can be solved via its Lagrangian function; that is,

(17) $Q = B^{-1} Z^T (Z B^{-1} Z^T)^{-1} D$,

where $B \in \mathbb{R}^{(d+n) \times (d+n)}$ is a diagonal matrix with $b_{ll} = 1/(2\|q^l\|_2)$, $l = 1, 2, \ldots, d+n$, and $q^l$ is the $l$th row of $Q$. The iteration based on (17) is convergent [25], so it is viable. The RSSR-LDL21 algorithm is shown in Algorithm 2.
Algorithm 2: Regularized sample self-representation by L2,1-norm (RSSR-LDL21).
Input: Training set S, test data X′, the number of iterations Iter, and the regularization parameter γ2.
Output: The predicted label distributions d1′, d2′, …, dm′ of the test data X′.
(1) Initialize $B_0 \in \mathbb{R}^{(d+n) \times (d+n)}$ as the identity matrix and set $t = 0$; // Initialization.
(2) for t = 1 to Iter do
(3) $Q_{t+1} = B_t^{-1} Z^T (Z B_t^{-1} Z^T)^{-1} D$; // Calculate $Q_{t+1}$.
(4) $B_{t+1} = \mathrm{diag}(1/(2\|q_{t+1}^l\|_2))$; // Calculate $B_{t+1}$.
(5) end for
(6) $P = Q(1{:}d, :)$; // Discard the A part of Q, keeping the first d rows.
(7) $d_i^0 = x_i' P$; // Use the transformation matrix to obtain the raw predicted label distributions.
(8) $d_i' = |d_i^0| / \sum_{k=1}^{c} |d_{ik}^0|$; // Normalize $d_i^0$ ($i = 1, 2, \ldots, m$) so that $d_x^y \in [0, 1]$ and $\sum_y d_x^y = 1$.
In Algorithm 2, the iteration is repeated until Iter=30. In each iteration, Q is calculated with the previous B and B is calculated with the current Q.
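The iteration in Algorithm 2 can be sketched in NumPy as below. This is our own illustrative implementation, with hypothetical toy data; a small eps guards the division in $b_{ll} = 1/(2\|q^l\|_2)$ when a row of Q vanishes:

```python
import numpy as np

def rssr_ldl21_fit(X, D, gamma2=1.0, n_iter=30, eps=1e-8):
    """Iterative solution of (16)-(17): Z = [X, gamma2*I], Q = [P; A],
    alternately update Q and the diagonal reweighting matrix B."""
    n, d = X.shape
    Z = np.hstack([X, gamma2 * np.eye(n)])  # n x (d + n)
    Binv = np.ones(d + n)                   # diagonal of B^{-1}; B_0 = I
    for _ in range(n_iter):
        # Q = B^{-1} Z^T (Z B^{-1} Z^T)^{-1} D, with B diagonal
        Q = (Binv[:, None] * Z.T) @ np.linalg.solve((Z * Binv) @ Z.T, D)
        # b_ll = 1 / (2 ||q^l||_2), so store B^{-1} = 2 ||q^l||_2 directly
        Binv = 2.0 * np.linalg.norm(Q, axis=1) + eps
    return Q[:d]  # keep the first d rows (the P block), discard A

rng = np.random.default_rng(1)
X = rng.random((8, 4))
D = rng.random((8, 3))
D /= D.sum(axis=1, keepdims=True)
P = rssr_ldl21_fit(X, D)
D0 = X @ P
print((D0 / D0.sum(axis=1, keepdims=True)).sum(axis=1))  # rows sum to 1
```

The gamma2*I block of Z keeps the n-by-n system nonsingular at every iteration, so each update has a unique solution.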
5. Experiments
To demonstrate the performance of the proposed RSSR-LDL2 and RSSR-LDL21 algorithms, we apply them to gene expression levels, facial expression, and movie score problems. In this section, we use five evaluation metrics to test the proposed algorithms on 12 publicly available datasets (http://cse.seu.edu.cn/PersonalPage/xgeng/LDL/index.htm). The proposed algorithms are also compared with four state-of-the-art LDL algorithms.
5.1. Evaluation Metrics
In LDL, there are multiple labels associated with each instance, and these reflect the importance of each label for the instance. As a result, performance evaluation is different from that of both single- and multilabel learning. Because the label distribution is similar to a probability distribution, we use the similarity and distance between the original distribution and the predicted distribution to evaluate the effectiveness of LDL algorithms. There are many measures of the distance and similarity between probability distributions. In [35], 41 kinds of distance and similarity evaluation metrics were identified across eight classes. The various distance/similarity measures offer different performances in terms of comparing two probability distributions.
In accordance with the agglomerative single-linkage average clustering method [36], the screening rules [10], and our experimental conditions, we selected five measures: the Chebyshev distance [37], Clark distance [38], Canberra metric [39], intersection similarity [36], and cosine similarity [39]. The related names and expressions are listed in Table 1. A "↓" after a distance measure indicates that smaller values are better, whereas an "↑" after a similarity measure indicates that larger values are better.
Table 1: Evaluation metrics description.

| ID | Evaluation metric | Expression |
|----|-------------------|------------|
| 1 | Chebyshev ↓ | $\mathrm{distance}_1 = \max_k \lvert d_k - d_k' \rvert$ |
| 2 | Clark ↓ | $\mathrm{distance}_2 = \sqrt{\sum_{k=1}^{c} (d_k - d_k')^2 / (d_k + d_k')^2}$ |
| 3 | Canberra ↓ | $\mathrm{distance}_3 = \sum_{k=1}^{c} \lvert d_k - d_k' \rvert / (d_k + d_k')$ |
| 4 | Intersection ↑ | $\mathrm{similarity}_1 = \sum_{k=1}^{c} \min(d_k, d_k')$ |
| 5 | Cosine ↑ | $\mathrm{similarity}_2 = \sum_{k=1}^{c} d_k d_k' \Big/ \Big( \sqrt{\sum_{k=1}^{c} d_k^2} \sqrt{\sum_{k=1}^{c} d_k'^2} \Big)$ |
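The five measures in Table 1 are one-liners over a pair of distributions. A NumPy sketch of our own (the two example distributions are hypothetical):

```python
import numpy as np

def chebyshev(d, dp):
    return np.max(np.abs(d - dp))

def clark(d, dp):
    return np.sqrt(np.sum((d - dp) ** 2 / (d + dp) ** 2))

def canberra(d, dp):
    return np.sum(np.abs(d - dp) / (d + dp))

def intersection(d, dp):
    return np.sum(np.minimum(d, dp))

def cosine(d, dp):
    return np.sum(d * dp) / (np.sqrt(np.sum(d ** 2)) * np.sqrt(np.sum(dp ** 2)))

d = np.array([0.5, 0.3, 0.2])   # real label distribution
dp = np.array([0.4, 0.4, 0.2])  # predicted label distribution
print(chebyshev(d, dp), intersection(d, dp))
```

Note that Clark and Canberra are undefined when $d_k + d_k' = 0$; for label distributions with all description degrees positive this does not occur.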
5.2. Experimental Setting
Experiments were conducted on 12 public datasets. In the Movie dataset, each instance represents the characteristics of a movie and the distribution of scores over the rating categories. The SBU-3DFE and SJAFFE datasets contain facial expression images; each instance represents a facial expression and its scores over the possible expression classes. The Yeast family contains nine yeast gene expression datasets; each instance represents the expression level of a gene at certain time points. These datasets are described in Table 2.
Table 2: Data description.

| ID | Dataset | #Instances | #Features | #Labels | Data type |
|----|---------|------------|-----------|---------|-----------|
| 1 | Movie | 7755 | 1869 | 5 | Movie score |
| 2 | SBU-3DFE | 2500 | 243 | 6 | Facial expression |
| 3 | SJAFFE | 213 | 243 | 6 | Facial expression |
| 4 | Yeast-alpha | 2465 | 24 | 18 | Gene expression |
| 5 | Yeast-cdc | 2465 | 24 | 15 | Gene expression |
| 6 | Yeast-cold | 2465 | 24 | 4 | Gene expression |
| 7 | Yeast-diau | 2465 | 24 | 7 | Gene expression |
| 8 | Yeast-dtt | 2465 | 24 | 4 | Gene expression |
| 9 | Yeast-elu | 2465 | 24 | 14 | Gene expression |
| 10 | Yeast-heat | 2465 | 24 | 6 | Gene expression |
| 11 | Yeast-spo | 2465 | 24 | 6 | Gene expression |
| 12 | Yeast-spo5 | 2465 | 24 | 3 | Gene expression |
To verify the effectiveness and performance of our LDL method, we compared the RSSR-LDL2 and RSSR-LDL21 algorithms with four existing LDL algorithms. According to [10], we selected comparative algorithms that use different strategies.
PT-SVM. PT-SVM is applied to training sets in which label distribution is obtained by the problem resampling method [40]. PT-SVM uses pairwise coupling to solve the multiclassification problem [41]. This algorithm calculates the posterior probability of each class as the description degree of a label.
AA-BP. AA-BP is a three-layer backpropagation neural network with d input units and c output units, which receive an instance x and output its label distribution d, respectively.
SA-IIS. SA-IIS uses the maximum entropy to solve the LDL problem. The optimization strategy of this algorithm is similar to that of the scaling-based IIS [42].
SA-BFGS. SA-BFGS is an improved algorithm based on SA-IIS. This improved algorithm employs an effective quasi-Newton method and is more efficient than the standard line search approach.
For the parameter settings in these four algorithms, we refer to [10]. For our algorithms, we tuned the regularization parameter using values of {0.001,0.01,0.1,1,10,100,1000} and present the best results [29]. The performance of the above LDL algorithms was evaluated by considering the distance and similarity between the original label distribution and the predicted label distribution.
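For our algorithms the tuning loop is straightforward. The following sketch is our own, with hypothetical random data and the closed-form RSSR-LDL2 solver standing in as the model; it picks γ from the stated grid by cross-validated mean Chebyshev distance:

```python
import numpy as np

def fit(X, D, gamma):
    """Closed-form RSSR-LDL2 solver used as the inner model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ D)

def mean_chebyshev(D_true, D_pred):
    return np.mean(np.max(np.abs(D_true - D_pred), axis=1))

def tune_gamma(X, D, grid=(0.001, 0.01, 0.1, 1, 10, 100, 1000), k=10):
    """Select gamma by k-fold cross-validation on the Chebyshev distance."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), k)
    best_gamma, best_err = None, np.inf
    for gamma in grid:
        err = 0.0
        for fold in folds:
            train = np.ones(n, dtype=bool)
            train[fold] = False
            P = fit(X[train], D[train], gamma)
            D0 = X[fold] @ P
            D0 = D0 / D0.sum(axis=1, keepdims=True)  # normalize as in (13)
            err += mean_chebyshev(D[fold], D0)
        if err < best_err:
            best_gamma, best_err = gamma, err
    return best_gamma

rng = np.random.default_rng(2)
X = rng.random((50, 5))
D = rng.random((50, 4))
D /= D.sum(axis=1, keepdims=True)
print(tune_gamma(X, D))
```

Any of the five measures in Table 1 could replace the Chebyshev distance as the selection criterion.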
5.3. Results Analysis of Experiments
In this section, the performance of the proposed algorithms is compared with that of four existing state-of-the-art LDL algorithms in terms of five evaluation metrics. We also present the predicted label distribution given by the six algorithms and the real label distribution.
5.3.1. Distance and Similarity Comparison
To verify the advantages of the proposed RSSR-LDL2 and RSSR-LDL21, experiments were conducted on 12 public datasets. Each experiment used tenfold cross-validation [43, 44], and the mean value and standard deviation of each evaluation metric were recorded. Because many results were close to zero, they are reported as "(mean ± std) × 10³." Distance mainly measures the size of individual differences, whereas similarity reflects the trend and direction of the vectors. Therefore, we use both distance and similarity to demonstrate the superiority of the proposed algorithms.
The results for the Chebyshev distance, Clark distance, Canberra metric, intersection similarity, and cosine similarity are presented in Tables 3–7, respectively. In each table, the best results are given in bold and the second-best results are italicized (if the means are equal, the algorithm with the smaller standard deviation is considered better). The first three metrics measure distance, so smaller values are better; the latter two measure similarity, so larger values are better. From these results, we can see that RSSR-LDL21 achieves the best overall performance and that RSSR-LDL2 outperforms the four comparative algorithms.
Table 3: Chebyshev distance ↓ (mean ± std) × 10³ of the different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

| Dataset | PT-SVM | AA-BP | SA-IIS | SA-BFGS | RSSR-LDL2 | RSSR-LDL21 |
|---------|--------|-------|--------|---------|-----------|------------|
| Movie | 233.5 ± 26.0 | 139.8 ± 1.4 | 129.7 ± 3.0 | 126.6 ± 3.6 | 113.4 ± 0.7 | 114.1 ± 3.6 |
| SBU-3DFE | 142.2 ± 5.4 | 144.2 ± 6.1 | 133.2 ± 4.7 | 104.2 ± 4.5 | 124.0 ± 4.9 | 107.6 ± 2.4 |
| SJAFFE | 121.0 ± 10.3 | 136.3 ± 16.5 | 117.2 ± 8.5 | 105.2 ± 15.0 | 96.3 ± 18.9 | 90.7 ± 9.3 |
| Yeast-alpha | 13.8 ± 0.4 | 37.6 ± 2.4 | 16.9 ± 0.3 | 13.4 ± 0.4 | 13.4 ± 0.4 | 13.4 ± 0.5 |
| Yeast-cdc | 17.2 ± 0.8 | 38.0 ± 1.9 | 20.0 ± 0.5 | 16.2 ± 0.4 | 16.2 ± 0.5 | 16.2 ± 0.5 |
| Yeast-cold | 57.8 ± 3.9 | 57.8 ± 2.2 | 56.7 ± 1.9 | 51.1 ± 2.2 | 51.0 ± 1.7 | 50.9 ± 1.9 |
| Yeast-diau | 43.2 ± 3.9 | 49.1 ± 1.9 | 41.2 ± 1.3 | 36.9 ± 1.1 | 36.9 ± 1.0 | 36.9 ± 1.3 |
| Yeast-dtt | 38.6 ± 2.2 | 44.6 ± 2.5 | 43.3 ± 1.6 | 36.0 ± 1.8 | 35.9 ± 1.5 | 35.9 ± 1.7 |
| Yeast-elu | 17.0 ± 0.3 | 37.8 ± 2.5 | 20.2 ± 0.8 | 16.3 ± 0.5 | 16.2 ± 0.5 | 16.2 ± 0.5 |
| Yeast-heat | 44.0 ± 1.0 | 55.1 ± 4.0 | 46.5 ± 1.1 | 42.3 ± 1.0 | 42.2 ± 1.5 | 42.2 ± 1.1 |
| Yeast-spo | 64.7 ± 3.0 | 66.3 ± 2.9 | 61.7 ± 1.7 | 58.3 ± 2.8 | 58.2 ± 2.7 | 58.0 ± 2.2 |
| Yeast-spo5 | 92.9 ± 3.9 | 95.7 ± 4.2 | 94.6 ± 2.4 | 91.4 ± 3.1 | 91.1 ± 3.9 | 91.2 ± 3.4 |
Table 4: Clark distance ↓ (mean ± std) × 10³ of the different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

| Dataset | PT-SVM | AA-BP | SA-IIS | SA-BFGS | RSSR-LDL2 | RSSR-LDL21 |
|---------|--------|-------|--------|---------|-----------|------------|
| Movie | 871.2 ± 77.4 | 643.8 ± 11.8 | 553.6 ± 12.9 | 551.8 ± 10.7 | 521.4 ± 5.4 | 514.7 ± 10.6 |
| SBU-3DFE | 430.1 ± 16.3 | 469.7 ± 25.2 | 410.0 ± 9.0 | 348.5 ± 11.4 | 390.2 ± 5.6 | 380.0 ± 2.9 |
| SJAFFE | 437.9 ± 26.8 | 508.0 ± 46.4 | 417.6 ± 18.1 | 420.0 ± 36.6 | 366.9 ± 36.3 | 348.5 ± 18.1 |
| Yeast-alpha | 220.9 ± 5.1 | 752.2 ± 54.1 | 260.4 ± 4.2 | 210.0 ± 6.7 | 209.3 ± 5.3 | 209.2 ± 5.9 |
| Yeast-cdc | 228.1 ± 8.9 | 585.1 ± 27.1 | 258.9 ± 4.6 | 215.8 ± 3.5 | 215.1 ± 5.5 | 214.7 ± 5.7 |
| Yeast-cold | 155.9 ± 10.0 | 156.9 ± 5.7 | 153.1 ± 5.7 | 139.5 ± 6.3 | 139.3 ± 4.8 | 139.0 ± 5.7 |
| Yeast-diau | 235.3 ± 20.4 | 270.7 ± 12.0 | 222.2 ± 7.0 | 200.5 ± 6.9 | 200.3 ± 5.1 | 200.1 ± 6.2 |
| Yeast-dtt | 104.7 ± 6.3 | 121.6 ± 6.5 | 116.3 ± 5.2 | 98.3 ± 5.5 | 98.0 ± 3.9 | 97.9 ± 4.9 |
| Yeast-elu | 210.9 ± 4.5 | 528.0 ± 40.0 | 240.5 ± 7.1 | 198.9 ± 5.8 | 198.3 ± 5.3 | 198.3 ± 4.2 |
| Yeast-heat | 190.3 ± 5.8 | 242.0 ± 20.4 | 200.5 ± 4.7 | 182.7 ± 5.0 | 182.3 ± 6.2 | 182.0 ± 4.1 |
| Yeast-spo | 272.4 ± 11.3 | 287.5 ± 12.5 | 263.7 ± 5.8 | 249.6 ± 11.9 | 249.4 ± 11.4 | 248.7 ± 8.5 |
| Yeast-spo5 | 187.0 ± 8.6 | 192.1 ± 8.8 | 190.1 ± 4.8 | 184.3 ± 7.2 | 183.7 ± 8.8 | 183.7 ± 6.9 |
Table 5: Canberra metric ↓ (mean ± std) × 10³ of the different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

| Dataset | PT-SVM | AA-BP | SA-IIS | SA-BFGS | RSSR-LDL2 | RSSR-LDL21 |
|---------|--------|-------|--------|---------|-----------|------------|
| Movie | 1693 ± 183.1 | 1232 ± 22.2 | 1063 ± 27.0 | 1063 ± 22.9 | 992.0 ± 10.3 | 989.1 ± 24.6 |
| SBU-3DFE | 925.7 ± 34.3 | 984.1 ± 47.6 | 888.8 ± 20.6 | 725.1 ± 24.9 | 836.2 ± 14.3 | 782.5 ± 8.6 |
| SJAFFE | 917.8 ± 59.2 | 1034.8 ± 97.7 | 870.6 ± 44.4 | 862.5 ± 76.1 | 735.9 ± 80.7 | 705.8 ± 46.8 |
| Yeast-alpha | 723.1 ± 19.3 | 2483.5 ± 172.3 | 859.2 ± 16.0 | 681.9 ± 21.1 | 679.0 ± 16.7 | 678.3 ± 19.1 |
| Yeast-cdc | 685.7 ± 24.7 | 1772.5 ± 74.3 | 786.7 ± 13.4 | 647.3 ± 14.9 | 645.0 ± 14.6 | 642.3 ± 14.9 |
| Yeast-cold | 269.5 ± 17.6 | 269.9 ± 8.6 | 264.5 ± 10.0 | 240.1 ± 10.0 | 239.7 ± 8.6 | 239.4 ± 9.7 |
| Yeast-diau | 508.8 ± 47.2 | 584.7 ± 28.5 | 480.8 ± 13.9 | 430.5 ± 15.4 | 429.9 ± 10.6 | 429.7 ± 11.2 |
| Yeast-dtt | 179.9 ± 10.1 | 209.4 ± 11.3 | 201.0 ± 8.8 | 169.0 ± 8.8 | 168.6 ± 5.9 | 168.4 ± 8.4 |
| Yeast-elu | 621.2 ± 16.2 | 1546.8 ± 120.2 | 714.7 ± 18.0 | 582.6 ± 18.0 | 581.1 ± 11.5 | 581.0 ± 13.1 |
| Yeast-heat | 380.7 ± 11.8 | 486.5 ± 40.4 | 403.3 ± 9.9 | 364.4 ± 9.0 | 363.5 ± 12.0 | 362.8 ± 7.6 |
| Yeast-spo | 562.8 ± 21.8 | 589.2 ± 24.6 | 541.6 ± 12.8 | 512.9 ± 24.1 | 512.5 ± 23.4 | 511.8 ± 18.3 |
| Yeast-spo5 | 287.3 ± 12.8 | 295.5 ± 13.3 | 292.3 ± 7.3 | 283.1 ± 10.5 | 282.1 ± 12.9 | 282.3 ± 10.5 |
Table 6: Intersection similarity ↑ (mean ± std) × 10³ of the different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

| Dataset | PT-SVM | AA-BP | SA-IIS | SA-BFGS | RSSR-LDL2 | RSSR-LDL21 |
|---------|--------|-------|--------|---------|-----------|------------|
| Movie | 675.3 ± 45.9 | 795.9 ± 3.1 | 820.7 ± 4.1 | 822.1 ± 4.2 | 837.9 ± 1.4 | 837.2 ± 5.0 |
| SBU-3DFE | 833.8 ± 6.1 | 823.0 ± 7.9 | 840.8 ± 4.0 | 871.4 ± 4.7 | 850.4 ± 3.2 | 864.3 ± 1.7 |
| SJAFFE | 843.3 ± 10.5 | 823.1 ± 18.1 | 851.8 ± 8.5 | 858.1 ± 15.6 | 878.4 ± 15.6 | 883.3 ± 9.2 |
| Yeast-alpha | 960.1 ± 1.0 | 870.8 ± 8.1 | 952.0 ± 0.9 | 962.4 ± 1.1 | 962.5 ± 0.9 | 962.6 ± 1.0 |
| Yeast-cdc | 954.8 ± 1.6 | 888.2 ± 4.2 | 947.6 ± 0.9 | 957.4 ± 1.1 | 957.6 ± 0.9 | 957.7 ± 1.0 |
| Yeast-cold | 933.2 ± 4.5 | 933.4 ± 2.0 | 934.5 ± 2.4 | 940.8 ± 2.3 | 940.9 ± 2.1 | 941.0 ± 2.3 |
| Yeast-diau | 929.1 ± 6.7 | 919.0 ± 4.0 | 932.8 ± 1.9 | 940.3 ± 2.1 | 940.4 ± 1.4 | 940.4 ± 1.4 |
| Yeast-dtt | 955.6 ± 2.4 | 948.3 ± 2.8 | 950.1 ± 1.9 | 958.3 ± 2.0 | 958.4 ± 1.4 | 958.4 ± 2.0 |
| Yeast-elu | 956.1 ± 1.2 | 895.0 ± 7.8 | 948.9 ± 1.3 | 958.9 ± 1.2 | 959.0 ± 0.8 | 959.0 ± 1.0 |
| Yeast-heat | 937.4 ± 1.8 | 920.3 ± 6.2 | 933.4 ± 1.7 | 940.2 ± 1.3 | 940.3 ± 1.9 | 940.4 ± 1.2 |
| Yeast-spo | 906.7 ± 3.6 | 903.2 ± 3.8 | 910.5 ± 2.2 | 915.6 ± 3.9 | 915.6 ± 3.7 | 915.8 ± 3.0 |
| Yeast-spo5 | 907.1 ± 3.9 | 904.3 ± 4.2 | 905.4 ± 2.4 | 908.6 ± 3.1 | 908.9 ± 3.9 | 908.8 ± 3.4 |
From the results in Tables 3–7, our algorithms have obvious advantages. In particular, RSSR-LDL21 offers better performance than the other algorithms on almost every dataset. The SA-BFGS algorithm achieves equivalent performance in terms of the Chebyshev distance (Table 3) and cosine similarity (Table 7) on some datasets, mainly the yeast gene datasets. In addition, our algorithms not only produce good results but are also very stable, especially RSSR-LDL21.
Table 7: Cosine similarity ↑ (mean ± std) × 10³ of the different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

| Dataset | PT-SVM | AA-BP | SA-IIS | SA-BFGS | RSSR-LDL2 | RSSR-LDL21 |
|---------|--------|-------|--------|---------|-----------|------------|
| Movie | 766.2 ± 53.3 | 901.0 ± 2.4 | 922.5 ± 3.1 | 923.0 ± 3.6 | 936.9 ± 0.6 | 934.4 ± 4.7 |
| SBU-3DFE | 912.9 ± 5.4 | 902.6 ± 8.2 | 921.8 ± 3.4 | 947.2 ± 3.7 | 931.8 ± 3.6 | 943.4 ± 1.6 |
| SJAFFE | 927.3 ± 8.9 | 902.8 ± 20.6 | 934.3 ± 7.1 | 939.8 ± 13.9 | 953.8 ± 15.7 | 957.9 ± 7.8 |
| Yeast-alpha | 994.1 ± 0.3 | 945.1 ± 6.0 | 991.5 ± 0.2 | 994.6 ± 0.3 | 994.6 ± 0.3 | 994.6 ± 0.3 |
| Yeast-cdc | 992.5 ± 0.5 | 957.5 ± 3.2 | 990.2 ± 0.4 | 993.3 ± 0.2 | 993.3 ± 0.3 | 993.3 ± 0.3 |
| Yeast-cold | 985.5 ± 1.9 | 985.6 ± 1.2 | 986.1 ± 1.0 | 988.6 ± 0.9 | 988.6 ± 0.8 | 988.6 ± 1.0 |
| Yeast-diau | 983.8 ± 2.6 | 978.2 ± 2.0 | 985.1 ± 0.8 | 988.0 ± 0.8 | 988.0 ± 0.6 | 988.0 ± 0.6 |
| Yeast-dtt | 993.4 ± 0.8 | 991.1 ± 0.9 | 991.6 ± 0.7 | 994.1 ± 0.7 | 994.1 ± 0.4 | 994.1 ± 0.7 |
| Yeast-elu | 993.4 ± 0.3 | 961.8 ± 5.0 | 990.9 ± 0.6 | 994.0 ± 0.3 | 994.1 ± 0.3 | 994.1 ± 0.3 |
| Yeast-heat | 986.8 ± 0.6 | 978.9 ± 3.3 | 985.4 ± 0.7 | 988.0 ± 0.5 | 988.0 ± 0.8 | 988.0 ± 0.5 |
| Yeast-spo | 971.3 ± 2.1 | 970.2 ± 2.2 | 974.5 ± 1.2 | 977.0 ± 2.0 | 977.0 ± 1.7 | 977.0 ± 1.7 |
| Yeast-spo5 | 973.1 ± 2.1 | 971.4 ± 2.7 | 972.3 ± 1.1 | 974.1 ± 1.6 | 974.3 ± 2.0 | 974.2 ± 1.5 |
The proposed algorithms perform differently on the different datasets. The results show that the RSSR-LDL approach has an absolute advantage over the other algorithms on the Movie dataset. This is because the characteristics of sparse representation offer obvious advantages when there are a large number of features. The proposed algorithms also offer some advantages over the other algorithms on the facial expression datasets, although some results are similar to those given by the SA-BFGS algorithm. As the number of features in the yeast datasets is small, our algorithms do not show the best performance on all evaluation metrics there, but they still achieve performance similar to that of SA-BFGS. Overall, the proposed algorithms outperform the comparative algorithms, and the advantage is especially obvious for high-dimensional data.
5.3.2. Label Distribution Visualization
Unlike classification learning and clustering, LDL reflects the importance of each label to an instance. Hence, our ultimate goal is no longer categorization but a probability-like distribution. Two typical examples of the original label distribution and the distributions predicted by the six LDL algorithms are presented in Table 8. We select the ⌊n/2⌋th sample of each dataset as a demonstration.
Table 8: The real and predicted label distributions of two typical examples under the six algorithms. [Each cell is a plot of a label distribution; the rows are Real, PT-SVM, AA-BP, SA-IIS, SA-BFGS, RSSR-LDL2, and RSSR-LDL21, and the columns correspond to the Movie and SBU-3DFE examples. Plots omitted.]
In Table 8, the second and third columns show the real and predicted label distributions given by the six algorithms for the Movie and SBU-3DFE datasets, respectively. Each point represents the value of a label in the corresponding subgraph of Table 8, and the spline shows the trend of the label distribution. According to the distribution law of the points, the Movie distributions were fitted with a Gaussian function and the SBU-3DFE distributions with a smooth spline.
Table 8 indicates that the proposed algorithms achieve excellent performance. On the one hand, the RSSR-LDL21 algorithm has an absolute advantage, with both the values and the trend being almost consistent with the real label distribution. On the other hand, the RSSR-LDL2 algorithm is not as good as RSSR-LDL21 but achieves the same performance as SA-BFGS, which is clearly better than the other three comparative algorithms in terms of distance and similarity.
5.3.3. Parameter Sensitivity
Like many other learning algorithms, RSSR-LDL has parameters that must be tuned in advance. We tuned γ1 and γ2 over {0.001, 0.01, 0.1, 1, 10, 100, 1000} and recorded the best results in Tables 3–8. For RSSR-LDL2, the Clark distance for each value of γ1 on three representative datasets, one from each data type, is shown in Figure 2. We observe that RSSR-LDL2 is relatively insensitive to γ1 on the facial expression and gene expression datasets, whereas it is slightly more sensitive on the movie score dataset. Interestingly, as Figure 3 shows, the behavior of γ2 in RSSR-LDL21 is similar to that of γ1.
Figure 2: Clark distance of RSSR-LDL2 with respect to γ1 on the Movie, SBU-3DFE, and Yeast-alpha datasets.
Figure 3: Clark distance of RSSR-LDL21 with respect to γ2 on the Movie, SBU-3DFE, and Yeast-alpha datasets.
6. Conclusion and Future Work
LDL not only deals with instances associated with multiple labels but also reflects the importance degree of each label for the instance. In this paper, we proposed a new criterion for LDL using regularized sample self-representation. We reconstructed the labels from the features and a transformation matrix, describing each label distribution as a linear combination of features. Then, we used the L2-norm and L2,1-norm as regularization terms to optimize the transformation matrix. We conducted experiments on 12 real datasets and compared the proposed algorithms with four existing LDL algorithms using five evaluation metrics. The experimental results show that the proposed algorithms are efficient and accurate. In future work, we will use a least-angle regression model to develop a better generalization model for solving practical problems.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Grant nos. 61379049, 61379089, and 61703196).