Label Distribution Learning by Regularized Sample Self-Representation

Multilabel learning, which focuses on whether each label is related or unrelated to an instance, can solve many ambiguity problems. Label distribution learning (LDL) reflects the importance of each related label to an instance and offers a more general learning framework than multilabel learning. However, current LDL algorithms ignore the linear relationship between the label distribution and the features. In this paper, we propose a regularized sample self-representation (RSSR) approach for LDL. First, the label distribution problem is formalized by sample self-representation, whereby each label distribution is represented as a linear combination of its relevant features. Second, the LDL problem is solved by L2-norm and L2,1-norm regularized least-squares methods to reduce the effects of outliers and overfitting. The corresponding algorithms are named RSSR-LDL2 and RSSR-LDL21. Third, the proposed algorithms are compared with four state-of-the-art LDL algorithms on 12 public datasets using five evaluation metrics. The results demonstrate that the proposed algorithms effectively identify the predictive label distribution and exhibit good performance in terms of distance and similarity evaluations.


Introduction
Multilabel learning allows more than one label to be associated with each instance [1]. In many practical applications, such as text categorization, ticket sales, and torch relays [2], objects have more than one semantic label, often expressed as the objects' ambiguity. As an effective learning paradigm, multilabel learning is applied in a variety of fields [3,4], but it mainly focuses on whether a label is related or unrelated to an instance.
Though multilabel learning can solve many ambiguity problems, it is not well-suited to some practical problems [5,6]. For example, consider the image recognition problem of a natural scene that is annotated with mostly water, lots of sky, some cloud, a little land, and a few trees. As can be seen from Figure 1, each label in Figure 1(a) should be assigned a different importance. Multilabel learning mainly focuses on whether a label is related or unrelated to an instance, rather than on the difference in importance [7]. This leads to the question of how to determine the importance of different labels for an instance. Label distribution learning (LDL) can reflect the importance of each label for an instance in a similar way to a probability distribution. Figure 1(b) shows an example of a label distribution. This scenario is encountered in many types of multilabel tasks, such as age estimation [7], expression recognition [8], and the prediction of crowd opinions [9].
In contrast to multilabel learning, which outputs a set of labels, the output of LDL is a probability distribution [10]. In recent years, LDL has become a popular topic of research as a new paradigm in machine learning. For instance, Geng et al. proposed the IIS-LDL and CPNN algorithms to estimate the ages of different faces [11]. Their approach achieves better results than previous age estimation algorithms because it uses more information in the training process. Thereafter, Geng developed a complete framework for LDL [10]. This framework not only defines LDL but also generalizes LDL algorithms and gives corresponding metrics to measure their performance.
At present, the parameter model for LDL is mainly based on Kullback-Leibler divergence [12]. Different models can be used to train the parameters, such as maximum entropy [13] or logistic regression [14], although there is no particular evidence to support their use. To some extent, the LDL process ignores the linear relationship between the features and the label distribution. Unlike other applications, LDL aims to predict the label distribution rather than the category. Thus, the overall label distribution can be effectively reconstructed from the corresponding samples.
In this paper, we propose an LDL method that uses the property of sample self-representation to reconstruct the labels. As the labels in LDL are similar to a probability distribution, but not actually a probability distribution, we can represent the labels through the feature matrix instead of through the distance between two probability distributions. With the above considerations, we use a least-squares model to establish the objective function. That is, as far as possible, each label distribution is represented as a linear combination of its relevant features. The goal of this optimization model is to minimize the residuals. We combine LDL with sparsity regularization to optimize the model and then introduce regularization terms to solve it. To solve the objective function efficiently, we use the L2-norm and the L2,1-norm as regularization terms. The corresponding regularized sample self-representation (RSSR) algorithms are named RSSR-LDL2 and RSSR-LDL21. The proposed algorithms not only have strong interpretability but also avoid the problem of overfitting. In a series of experiments, we demonstrate that the similarity and distance under a variety of evaluation metrics are superior to those of four state-of-the-art algorithms. The results of the experimental analysis on public datasets show that the proposed method can effectively predict the labels.
The remainder of this paper is organized as follows. A brief review of related work on LDL and sparsity regularization is presented in Section 2. In Section 3, we introduce the LDL task and evaluation metrics. We describe the RSSR-LDL method and develop two algorithms in Section 4. Section 5 presents and analyzes the experimental results. Finally, we conclude this paper and present some ideas for future work in Section 6.

Related Work
The continued efforts of researchers have led to various LDL algorithms being proposed [9,13,15]. There are three main design strategies in the literature [10]: problem transformation, algorithm adaptation, and specialized algorithm design. Problem transformation (PT) transforms label distribution instances into multilabel or single-label instances; PT-SVM and PT-Bayes are representative algorithms in this class. Algorithm adaptation (AA) extends existing supervised learning algorithms to deal with the label distribution problem, as in the AA-kNN and AA-BP algorithms [10]. Unlike problem transformation and algorithm adaptation, specialized algorithm (SA) design sets up a direct model for the label distribution data. Typical algorithms include LDLogitBoost, based on logistic regression [16], SA-BFGS, based on maximum entropy [10], and DLDL, which combines LDL with deep learning [17,18].
Unlike traditional clustering [19] or classification learning [20], the labels in LDL have patterns similar to probability distributions. According to the definition of the label distribution, we assume that there may be a function that maps the features to the labels. We find that each label can be well approximated by a linear combination of its relevant features.
Linear reconstruction is expressed as D = XA, where A is a coefficient matrix. Because this system generally cannot be satisfied exactly, we transform it into a minimum-residual problem and optimize it by the least-squares method. To avoid overfitting and to stabilize the solution, an L2-norm regularization term is often added. In this way, the label distribution is constructed directly from the sample information of the data through the coefficient matrix. To find the corresponding relevant features while avoiding the effects of noise in high-dimensional data [21], we introduce sparse reconstruction [22,23]. Sparsity reconstruction adds a sparse regularization term, the L1-norm or L0-norm, to the linear reconstruction. More recently, the L2,1-norm and L2,0-norm sparse regularization terms have been proposed [24]. L2,1-norm regularization selects features across all of the data points with joint sparsity [25]; for a matrix A = (a_ij), the L2,1-norm is the sum of the L2-norms of the rows of A. Sparsity reconstruction is widely used in machine learning, especially for data dimensionality reduction [26,27]. For example, Cai proposed the MCFS algorithm, which uses L1-regularized least-squares to deal with multicluster data [28], and Zhu et al. proposed the RMR algorithm based on regularized self-representation for feature selection [29]. Nie developed the RFS and JELSR algorithms, which use L2,1-regularized least-squares to optimize the objective function [25,30]. Furthermore, L1-SVM and sparse logistic regression [31] have been shown to be effective. In general, linear reconstruction and sparsity reconstruction achieve good performance in feature selection and classification [32,33]. In the following, we propose a label distribution learning method based on linear reconstruction and sparsity reconstruction.
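As an illustration of the joint-sparsity penalty discussed above, the L2,1-norm of a matrix can be computed in a few lines of NumPy. This is a sketch for intuition only; the function name is our own, not from the paper:

```python
import numpy as np

def l21_norm(A):
    """L2,1-norm: the sum of the L2-norms of the rows of A.
    Penalizing it drives entire rows of A toward zero (joint sparsity)."""
    return np.sum(np.sqrt(np.sum(A ** 2, axis=1)))

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
# The rows have L2-norms 5, 0, and 1, so the L2,1-norm is 6.
print(l21_norm(A))  # 6.0
```

Because the penalty is applied row-wise, minimizing it favors solutions in which whole rows vanish, which is what makes it suitable for selecting features across all data points.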

The Proposed Model
The goal of LDL is to obtain a set of probability distributions. Therefore, LDL is different from previous approaches in terms of the problem statement. In this section, the problem statement is briefly reviewed and the proposed model is introduced.
LDL is the process of describing an instance x more naturally using labels [10]. We assign a value d_x^y to each of the possible labels y. This value represents the extent to which the label y describes instance x, that is, the description degree. Taking account of the corresponding subset of labels gives a complete description of the sample. Therefore, it is assumed that d_x^y ∈ [0, 1] and Σ_y d_x^y = 1. That is, for an instance, the values d_x^y have a form similar to a probability distribution. The learning process based on such data is called label distribution learning.
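As a small illustration of these constraints, the following sketch (with hypothetical label names and description degrees, echoing the scene labels of Figure 1) checks that a description-degree vector behaves like a probability distribution:

```python
import numpy as np

# Hypothetical description degrees for five scene labels of one image.
labels = ["water", "sky", "cloud", "land", "trees"]
d = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

# Every description degree lies in [0, 1] ...
assert np.all((0 <= d) & (d <= 1))
# ... and the degrees over the full label set sum to 1.
assert np.isclose(d.sum(), 1.0)
```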
We use x_1, x_2, ..., x_n to represent the n instances, where x_i ∈ R^q, and d_{x_i}^{y_j} represents the extent to which the label y_j describes the instance x_i. Therefore, the training set of label distribution learning is S = {(x_1, d_1), (x_2, d_2), ..., (x_n, d_n)}, where d_i is the label distribution vector of x_i. In addition, the test data are denoted X_t, and the corresponding predicted label distribution is denoted D_t. We combine LDL with regularized sample self-representation to give the RSSR-LDL model. For RSSR-LDL, each sample and the corresponding description degrees have the relationship

d_i = x_i P,

where P ∈ R^{q×c} is the transformation matrix from the sample to the description degrees. According to the definition, collecting the samples into X ∈ R^{n×q} and the label distributions into D ∈ R^{n×c}, this is equivalent to

XP = D.    (3)

In general, n > q for X, so (3) cannot be solved exactly [34].
In order to solve for the optimal P, we introduce the residual sum function ε(P) = ||XP − D||_F^2. When P = P̂, ε(P) attains its minimum value, so the objective function of the model is

min_P ||XP − D||_F^2.    (4)

Setting the derivative of (4) with respect to P to zero, we obtain

P̂ = (X^T X)^{-1} X^T D.    (5)

When X is not of full rank, or there is significant linear correlation between its columns, the determinant of X^T X will be close to 0, which makes the inversion of X^T X an ill-posed problem. This introduces a large error into the calculation of (X^T X)^{-1}, resulting in a lack of stability and reliability in (5). Therefore, we introduce a regularization term R(P) with parameter λ to optimize the objective function.
In other words,

min_P ||XP − D||_F^2 + λ R(P).    (7)

There are several possible regularization terms [25]: R_1(P) = ||P||_1 is the LASSO regularization; R_2(P) = ||P||_2^2 is ridge regression, also known as Tikhonov regularization, which is the most frequently used regularization method for ill-posed problems; and R_3(P) = ||P||_{2,1} is a newer joint regularization [25,29].

Regularized Model and Algorithm
To solve the objective function efficiently, we use the L2-norm and L2,1-norm to regularize the RSSR-LDL model, resulting in RSSR-LDL2 and RSSR-LDL21. The RSSR-LDL2 and RSSR-LDL21 algorithms are presented in this section.
Regularized Sample Self-Representation by L2-Norm

Taking R_2(P) as the regularization term, the objective function of (7) becomes

min_P ||XP − D||_F^2 + λ_1 ||P||_2^2,

or equivalently, setting its derivative with respect to P to zero,

P = (X^T X + λ_1 I)^{-1} X^T D.

Because λ_1 > 0, X^T X + λ_1 I is always a nonsingular matrix, so P can be computed directly. We can then predict the label distribution of the test dataset using the learned matrix P. Specifically, the predictive label distribution is D_t = X_t P. Based on the above theoretical analysis, we summarize Algorithm 1. Algorithm 1 does not include an iterative process; in other words, the matrix P is solved directly, which makes the algorithm fast. Although this approach is efficient and easy to understand, it is not very accurate. Thus, in the next section, we use the L2,1-norm of P to solve the RSSR-LDL problem.
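A minimal NumPy sketch of this closed-form procedure is given below. The function names, the synthetic data, and the final clip-and-renormalize step are our own illustrative additions, not part of the paper's Algorithm 1:

```python
import numpy as np

def rssr_ldl2_fit(X, D, lam=0.1):
    """Closed-form RSSR-LDL2: P = (X^T X + lam*I)^(-1) X^T D.
    lam > 0 guarantees that X^T X + lam*I is nonsingular."""
    q = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ D)

def rssr_ldl2_predict(Xt, P):
    """Predict D_t = X_t P; then clip and renormalize each row so the
    prediction is a valid label distribution (an extra practical step)."""
    Dt = np.clip(Xt @ P, 1e-12, None)
    return Dt / Dt.sum(axis=1, keepdims=True)

# Synthetic data: 50 samples, 8 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
D = np.abs(X @ rng.random((8, 3)))
D /= D.sum(axis=1, keepdims=True)          # rows sum to 1

P = rssr_ldl2_fit(X, D, lam=0.01)
Dt = rssr_ldl2_predict(X, P)
print(np.allclose(Dt.sum(axis=1), 1.0))    # True
```

Because the solution is a single linear solve, training cost is dominated by forming and factoring the q-by-q matrix X^T X + λ_1 I.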

Regularized Sample Self-Representation by L2,1-Norm
Combining the characteristics of R_1(P) and R_2(P), we choose the L2,1-norm of P, that is, R_3(P), as the regularization term. This gives the RSSR-LDL21 algorithm. The objective optimization function of (7) then becomes

min_P ||XP − D||_F^2 + λ_2 ||P||_{2,1},

which can be transformed into

min_P ||XP − D||_{2,1} + λ_2 ||P||_{2,1}.

According to [25], this can be further transformed into

min_Q ||Q||_{2,1}   s.t.   AQ = D,    (16)

where A = [X, λ_2 I] ∈ R^{n×(q+n)}, Q ∈ R^{(q+n)×c} stacks P and the scaled residual, and I ∈ R^{n×n}. The problem in (16) becomes one of solving a Lagrangian function; setting its derivative to zero yields the iterative update

Q = B^{-1} A^T (A B^{-1} A^T)^{-1} D,    (17)

where B ∈ R^{(q+n)×(q+n)} is a diagonal matrix with b_ll = 1/(2||q^l||_2), l = 1, 2, ..., q+n, and q^l denotes the l-th row of Q. The solution in (17) is convergent [25], so the iteration is viable. The RSSR-LDL21 algorithm is shown in Algorithm 2.
In Algorithm 2, the iteration is repeated until Iter = 30. In each iteration, Q is calculated using the previous B, and B is then updated using the current Q.
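The reweighted iteration can be sketched in NumPy as follows. This is an illustrative implementation under the RFS-style formulation of [25]; the variable names, the synthetic data, and the small epsilon guard against zero rows are our own choices, not taken verbatim from Algorithm 2:

```python
import numpy as np

def rssr_ldl21_fit(X, D, lam=1.0, iters=30, eps=1e-8):
    """Iteratively reweighted solver sketch:
    min ||Q||_2,1  s.t.  A Q = D, with A = [X, lam*I];
    the first q rows of the converged Q give the matrix P."""
    n, q = X.shape
    A = np.hstack([X, lam * np.eye(n)])          # n x (q + n)
    Binv = np.ones(q + n)                        # B is diagonal; keep its inverse
    for _ in range(iters):
        ABA = (A * Binv) @ A.T                   # A B^-1 A^T  (n x n, positive definite)
        Q = (Binv[:, None] * A.T) @ np.linalg.solve(ABA, D)
        row_norms = np.linalg.norm(Q, axis=1)
        Binv = 2.0 * np.maximum(row_norms, eps)  # b_ll = 1 / (2 ||q^l||_2)
    return Q[:q]                                 # P

# Synthetic data: 30 samples, 6 features, 4 labels.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6))
D = np.abs(X @ rng.random((6, 4)))
D /= D.sum(axis=1, keepdims=True)

P = rssr_ldl21_fit(X, D, lam=10.0)
print(P.shape)  # (6, 4)
```

Each pass updates Q from the current diagonal weights and then refreshes the weights from the new row norms, matching the alternating scheme described for Algorithm 2.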

Experiments
To demonstrate the performance of the proposed RSSR-LDL2 and RSSR-LDL21 algorithms, we apply them to gene expression level, facial expression, and movie score problems. In this section, we use five evaluation metrics to test the proposed algorithms on 12 publicly available datasets (http://cse.seu.edu.cn/PersonalPage/xgeng/LDL/index.htm). The proposed algorithms are also compared with four state-of-the-art LDL algorithms.

Evaluation Metrics.
In LDL, there are multiple labels associated with each instance, and these reflect the importance of each label for the instance. As a result, performance evaluation is different from that of both single- and multilabel learning. Because the label distribution is similar to a probability distribution, we use the similarity and distance between the original distribution and the predicted distribution to evaluate the effectiveness of LDL algorithms. There are many measures of the distance and similarity between probability distributions. In [35], 41 kinds of distance and similarity evaluation metrics were identified across eight classes. The various distance/similarity measures offer different performance in comparing two probability distributions.
According to the agglomerative single-linkage with average clustering method [36], the screening rules in [10], and our experimental conditions, we selected five measures: the Chebyshev distance [37], Clark distance [38], Canberra metric [39], intersection similarity [36], and cosine similarity [39]. The related names and expressions are listed in Table 1. A "↓" after a distance measure indicates that smaller values are better, whereas an "↑" after a similarity measure indicates that larger values are better.
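For concreteness, the five measures can be written as one-line NumPy functions following their standard definitions; the example vectors below are hypothetical, not taken from Table 1:

```python
import numpy as np

def chebyshev(d, p):
    return np.max(np.abs(d - p))                 # distance: smaller is better

def clark(d, p):
    return np.sqrt(np.sum((d - p) ** 2 / (d + p) ** 2))

def canberra(d, p):
    return np.sum(np.abs(d - p) / (d + p))

def intersection(d, p):
    return np.sum(np.minimum(d, p))              # similarity: larger is better

def cosine(d, p):
    return d @ p / (np.linalg.norm(d) * np.linalg.norm(p))

d = np.array([0.5, 0.3, 0.2])   # true label distribution (hypothetical)
p = np.array([0.4, 0.4, 0.2])   # predicted label distribution (hypothetical)
print(chebyshev(d, p), intersection(d, p))  # ≈ 0.1 and ≈ 0.9
```

Note that the Clark and Canberra measures divide by d + p, so they implicitly assume the two distributions are not simultaneously zero in any component.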

Experimental Setting.
Experiments are conducted on 12 public datasets. In the Movie dataset, each instance represents the characteristics of a movie and the category scores that the movie may belong to. The SBU-3DFE and SJAFFE datasets contain facial expression images; each instance represents a facial expression and the scores of the possible expression classes. The Yeast family contains nine yeast gene expression datasets; each instance represents the expression level of a gene at certain time points. These datasets are described in Table 2.
To verify the effectiveness and performance of our LDL method, we compared the RSSR-LDL2 and RSSR-LDL21 algorithms with four existing LDL algorithms.According to [10], we selected comparative algorithms that use different strategies.

PT-SVM. PT-SVM is applied to training sets in which the label distribution is obtained by the problem resampling method [40]. PT-SVM uses pairwise coupling to solve the multiclassification problem [41]. This algorithm calculates the posterior probability of each class as the description degree of a label.

AA-BP. AA-BP is a three-layer backpropagation neural network. This algorithm has q input units and c output units, which receive X and output D, respectively.

SA-IIS. SA-IIS uses maximum entropy to solve the LDL problem. The optimization strategy of this algorithm is similar to that of the scaling-based IIS [42].
SA-BFGS. SA-BFGS is an improved algorithm based on SA-IIS. It employs an effective quasi-Newton method and is more efficient than the standard line-search approach.
For the parameter settings of these four algorithms, we refer to [10]. For our algorithms, we tuned the regularization parameter over {0.001, 0.01, 0.1, 1, 10, 100, 1000} and present the best results [29]. The performance of the above LDL algorithms was evaluated by considering the distance and similarity between the original label distribution and the predicted label distribution.

Results Analysis of Experiments.
In this section, the performance of the proposed algorithms is compared with that of four existing state-of-the-art LDL algorithms in terms of five evaluation metrics. We also present the label distributions predicted by the six algorithms alongside the real label distribution.

Distance and Similarity Comparison.
To verify the advantages of the proposed RSSR-LDL2 and RSSR-LDL21, experiments were conducted on 12 public datasets. Each experiment used tenfold cross-validation [43,44], and the mean value and standard deviation of each evaluation metric were recorded. Because many results were close to zero, they are reported as "(mean ± std) × 10^3." Distance mainly measures the size of individual differences, whereas similarity reflects the trend and direction of the vectors. Therefore, we use both distance and similarity to demonstrate the superiority of the proposed algorithms.
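A tenfold split such as the one used here can be sketched as follows; this is an illustrative helper, not the paper's exact experimental protocol:

```python
import numpy as np

def tenfold_indices(n, seed=0):
    """Shuffle n sample indices and split them into 10 disjoint folds.
    Each fold serves once as the test set; the rest form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 10)

folds = tenfold_indices(25)
# The folds partition all 25 indices with no overlap.
assert sum(len(f) for f in folds) == 25
```

The mean and standard deviation of a metric over the 10 test folds then give the "(mean ± std)" entries reported in the tables.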
The results for the Chebyshev distance, Clark distance, Canberra metric, intersection similarity, and cosine coefficient are presented in Tables 3-7, respectively. In each table, the best results are given in bold and the second-best results are italicized (if the means are the same, the algorithm with the smaller standard deviation is considered better). The first three evaluation metrics measure distance, so smaller values are better; the latter two measure similarity, so larger values are better. From these results, we can see that RSSR-LDL21 achieves the best overall performance and that RSSR-LDL2 outperforms the four comparative algorithms.
From the results in Tables 4-6, our algorithms have obvious advantages. In particular, RSSR-LDL21 offers better performance than the other algorithms on almost every dataset. The SA-BFGS algorithm achieves equivalent performance in terms of the Chebyshev distance (Table 3) and cosine coefficient (Table 7) on some datasets, mainly those for the yeast genes. In addition, our algorithms not only produce good results but are also very stable, especially RSSR-LDL21.
The proposed algorithms perform differently on the different datasets. The results show that the RSSR-LDL approach has an absolute advantage over the other algorithms on the Movie dataset. This is because the characteristics of sparse representation offer obvious advantages when there are a large number of features. The proposed algorithms continue to offer some advantages on the facial expression datasets, although some results are similar to those given by the SA-BFGS algorithm. As the number of features in the yeast datasets is small, our algorithms do not achieve the best performance under all evaluation metrics but still perform similarly to the SA-BFGS algorithm. Moreover, the performance of the proposed algorithms is better than that of the other comparative algorithms; in particular, the advantage is more obvious for high-dimensional data.

Label Distribution Visualization.
Unlike classification learning and clustering, LDL reflects the importance of each label for an instance. Hence, our ultimate goal is no longer categorization but a kind of probability distribution. Two typical examples of the original label distribution and the distributions predicted by the six LDL algorithms are presented in Table 8. We select the [n/2]th sample of each dataset as a demonstration.
In Table 8, the second and third columns show the real label distribution and the label distributions predicted by the six different algorithms for the Movie and SBU-3DFE datasets, respectively. Each point in a subgraph of Table 8 represents the corresponding value of a label, and the spline shows the trend of the label distribution. According to the distribution of the points, the Movie distribution was fitted using a Gaussian function and SBU-3DFE was fitted with a smooth spline. Table 8 indicates that the proposed algorithms achieve excellent performance. On the one hand, the RSSR-LDL21 algorithm has an absolute advantage, with the values and trend being almost consistent with the real label distribution. On the other hand, the RSSR-LDL2 algorithm is not as good as RSSR-LDL21 but achieves similar performance to SA-BFGS, which is clearly better than the other three comparative algorithms in terms of distance and similarity.

Parameter Sensitivity.
Like many other learning algorithms, RSSR-LDL has parameters that must be tuned in advance. We tuned λ_1 = λ_2 ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} and recorded the best results in Tables 3-8. For RSSR-LDL2, the Clark distance given by λ_1 ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} on three representative datasets, one from each data type, is shown in Figure 2. We observe that RSSR-LDL2 is relatively insensitive to λ_1 for the facial expression and gene expression datasets, whereas it is slightly more sensitive for the movie score dataset. Interestingly, Figure 3 shows that the behavior of λ_2 in RSSR-LDL21 is similar to that of λ_1.

Conclusion and Future Work
LDL not only deals with instances associated with multiple labels but also reflects the importance degree of each label for the instance. In this paper, we proposed a new method for LDL using regularized sample self-representation. We reconstructed the labels from the features and a transformation matrix, describing each label distribution as a linear combination of features. Then, we used the L2-norm and L2,1-norm as regularization terms to optimize the transformation matrix.
We conducted experiments on 12 real datasets and compared the proposed algorithms with four existing LDL algorithms using five evaluation metrics. The experimental results show that the proposed algorithms are efficient and accurate. In future work, we will use a least-angle regression model to develop a better generalization model for solving practical problems.

Figure 1: A natural scene image which has been annotated with water, sky, cloud, land, and trees.
Algorithms 1 and 2 take as input the training matrix S and the test data X.

Table 3 :
Chebyshev distance ↓ (mean ± std) × 10^3 of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

Table 4 :
Clark distance ↓ (mean ± std) × 10^3 of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

Table 5 :
Canberra metric ↓ (mean ± std) × 10^3 of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

Table 6 :
Intersection similarity ↑ (mean ± std) × 10^3 of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

Table 7 :
Cosine similarity ↑ (mean ± std) × 10^3 of different algorithms on the twelve datasets. The best results are highlighted in bold and the second-best results are italicized.

Table 8 :
The real and predicted label distributions of two typical examples for the six algorithms.