Robust Graph Structure Learning for Multimedia Data Analysis

With the rapid development of computer network technology, we can acquire large amounts of multimedia data, and analyzing these data has become a very important task. Since graph construction, or graph learning, is a powerful tool for multimedia data analysis, many graph-based subspace learning and clustering approaches have been proposed. Among existing graph learning algorithms, sample reconstruction-based approaches have become the mainstream. Nevertheless, these approaches not only ignore local and global structure information but are also sensitive to noise. To address these limitations, this paper proposes a graph learning framework termed Robust Graph Structure Learning (RGSL). Different from existing graph learning approaches, our approach adopts the self-expressiveness of samples to capture the global structure while utilizing data locality to depict the local structure. Specifically, to improve the robustness of our approach against noise, we introduce an l2,1-norm regularization criterion and a nonnegative constraint into the graph construction process. Furthermore, an iterative updating optimization algorithm is designed to solve the objective function. Extensive subspace learning and clustering experiments verify the effectiveness of the proposed approach.


Introduction
With the rapid growth of information technology and computer network technology, large amounts of multimedia data can be collected from research fields such as computer vision, image processing, and natural language processing. However, most multimedia data exhibit high dimensionality and complex structures [1,2]. Therefore, how to accurately analyze these data has become a vital problem. Inspired by pattern recognition and machine learning techniques, many multimedia data analysis approaches based on subspace learning and clustering have been put forward recently [3][4][5][6]. In these approaches, learning or constructing a meaningful graph that describes the pairwise similarity or relationship among samples is a key issue in multimedia data analysis [7].
Nowadays, a series of graph learning approaches have been proposed, among which the heat-kernel function is the most widely used graph construction manner, as in the k-nearest-neighborhood graph (k-NN graph) or the ε-nearest-neighborhood graph (ε graph). The edges between vertices are determined by the Euclidean distance among samples, and the weight of the edge between two vertices is then estimated by the heat kernel [8]. However, these approaches have two main limitations [9]. First, the choice of parameters, such as the neighbor number k or the radius ε, is very challenging and can affect the final performance of the task. Second, the processes of neighbor selection and weight calculation are independent, which makes them sensitive to noise and often unable to reveal the real similarities of samples [10].
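For concreteness, a k-NN graph with heat-kernel weights of the kind described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the cited works; the function name and toy data are ours:

```python
import numpy as np

def knn_heat_kernel_graph(X, k=5, sigma=1.0):
    # X: (d, N), one sample per column. Returns a symmetric (N, N) weight matrix.
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(D2, np.inf)                        # exclude self-loops
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D2[i])[:k]                    # the k nearest neighbors
        W[i, nbrs] = np.exp(-D2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                           # symmetrize

# Toy usage: two well-separated clusters yield a block-structured graph.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0.0, 0.1, (2, 10)),
               rng.normal(5.0, 0.1, (2, 10))])
W = knn_heat_kernel_graph(X, k=3)
```

Note how the two limitations discussed above appear directly: k and sigma must be chosen by hand, and neighbor selection (`argsort`) is decoupled from the weight computation (`exp`).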
To overcome these drawbacks, a sparse representation (SR) based graph construction approach has been proposed, often called the l1-graph or sparse graph. In the l1-graph [11], each sample is regarded as the query sample and the remaining samples are considered as the dictionary used to represent it; the similarities between the query sample and the remaining samples can then be measured. Since the l1-graph imposes an l1-norm constraint on the regression model to select a few important samples, it has better discriminability and more robustness to noise. In the past decades, a series of excellent learning approaches based on the l1-graph have been designed and successfully applied in different areas [12]. Although the l1-graph can reveal the linear relationship between a single point and other points, it still has the following limitations. First, the l1-graph strictly assumes that the regression dictionary is overcomplete, which is not satisfied in many real applications, especially in graph learning. Second, the l1-graph pays too much attention to sparsity while neglecting the correlations between samples, so it cannot offer a smooth data representation. Therefore, SR is not an ideal choice for graph construction. To overcome these problems, Zhang et al. [13] introduced a Collaborative Representation (CR) linear regression approach employing l2-norm rather than l1-norm sparsity regularization. Compared to SR, CR provides more relaxation for the regression coefficients and obtains a smoother data representation.
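The l2-regularized coding used by CR admits the well-known ridge closed form w = (DᵀD + λI)⁻¹Dᵀy, which is why it is much cheaper than l1 optimization. A minimal sketch (names and toy data are ours):

```python
import numpy as np

def collaborative_representation(y, D, lam=0.1):
    # Ridge-regularized coding: w = (D^T D + lam * I)^{-1} D^T y.
    n = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)

rng = np.random.default_rng(1)
D = rng.normal(size=(8, 5))                # dictionary, one atom per column
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = D @ w_true                             # noiseless query sample
w = collaborative_representation(y, D, lam=1e-6)
```

With a tiny λ and a noiseless query, the recovered coefficients are close to the generating ones; a larger λ smooths the coefficients, illustrating CR's "more relaxation" compared with SR.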
Considering that both SR and CR only reveal the linear relationship between a single data point and other data points, the global structure of the data is ignored. To address this problem, Liu et al. [14] suggested Low-Rank Representation (LRR) for subspace clustering. The main purpose of LRR is to find a coefficient matrix Z by reconstructing each data point as a linear combination of all the other data points, which is called self-representation. Different from traditional distance-based similarity measurements, i.e., k-nearest neighborhood or ε-nearest neighborhood, representation-based approaches such as SR, CR, and LRR measure the similarity between data points by solving an optimization problem. These approaches better capture the data structure and thus achieve improved classification and clustering performance overall. However, the objective function of LRR is not differentiable, and solving the rank minimization problem has high computational complexity. To address this limitation efficiently, Lu et al. [15] proposed Least Squares Regression (LSR), which groups highly correlated data together and is robust to noise. Compared with LRR, LSR is simpler and more efficient.
In recent years, researchers have found that the relationships between data points in real applications are usually high-dimensional and nonlinear, so the aforementioned linear representation approaches can hardly achieve good performance. Many researchers have therefore paid more attention to revealing the nonlinear relationships between data points of interest [16][17][18][19][20][21][22][23][24][25][26][27]. For example, Wang et al. [28] explored the criterion of Locally Linear Embedding (LLE) and used it to construct a graph by computing the weights between pairs of samples. Wei and Peng [29] adopted a criterion similar to that of LLE to construct a neighborhood-preserving graph for semisupervised dimensionality reduction. Furthermore, Yu et al. [30] found that the nonzero coefficients of sparse coding are always assigned to the neighbor samples of the query sample. To encourage the coding to be local, several local feature-based coding approaches have been proposed, which achieve excellent performance on classification and clustering tasks [31]. Exploiting the merits of local constraints, Peng et al. [32] put forward Locality-Constrained Collaborative (LCC) representation, which achieves better classification performance than nonlocal approaches. Chen and Yi [33] took the local constraint and LSR into consideration and designed Locality-Constrained LSR (LCLSR) for subspace clustering. LCLSR explores both the global structure and the local linear relationships of data points, forcing the representation to prefer the selection of neighborhood points. Although LCLSR considers the locality structure of data, it still has some limitations. On the one hand, its objective function is based on the l2-norm, which is very sensitive to noise; on the other hand, the sample reconstruction process ignores the relationships between sample representations, i.e., the fact that similar original samples should generate similar coding vectors, which weakens the effectiveness of graph learning approaches.
To address these issues, we design a novel graph learning approach named Robust Graph Structure Learning (RGSL). Specifically, the self-expressiveness of samples and an adaptive neighbor selection approach are introduced to preserve both the local and global structures of the data. To enhance the robustness of graph construction, we impose the l2,1-norm constraint and a nonnegative constraint on the adjacency graph weight matrix to reduce the influence of noise points. Therefore, the proposed approach can estimate the graph from the data alone through the self-expressiveness of samples and data locality, independently of any a priori affinity matrix. We assess the benefits of the proposed approach on subspace learning and clustering tasks. Extensive experiments verify the effectiveness of the proposed approach over other state-of-the-art approaches. The framework of the proposed approach is shown in Figure 1.
The outline of this paper is as follows. Section 2 briefly reviews related work. Section 3 describes the proposed approach in detail. Section 4 presents extensive experiments demonstrating the effectiveness of the proposed approach. Section 5 draws conclusions.

Related Work
In this section, several classic and widely used graph construction approaches are introduced first. Then, two kinds of multimedia data analysis techniques, subspace learning and spectral clustering, are presented in detail.

2.1. Graph Construction Approaches.
Recently, many graph construction approaches have been proposed for multimedia data analysis. In this subsection, we review those most related to our work.
Liu et al. [14] proposed the Low-Rank Representation (LRR) graph construction approach, in which each sample is represented by a linear combination of all samples while a low-rank constraint is imposed on the coefficient matrix. Given a high-dimensional database X = [x_1, x_2, ⋯, x_N] ∈ R^{d×N}, where d is the data dimensionality and N is the number of samples, the LRR graph can be obtained by optimizing the following problem:

\min_{W} \|W\|_{*}, \quad \text{s.t. } X = XW,

where \|\cdot\|_{*} denotes the nuclear norm of a matrix, i.e., the sum of the singular values of the matrix, and W denotes the coefficient matrix of the data X with the lowest rank.
Although the LRR graph can capture the global structure of data, optimizing the nuclear norm is very time-consuming. Hence, Lu et al. [15] utilized the l2-graph based on the Frobenius norm in place of the nuclear norm to compute the weight matrix quickly. The LSR graph is defined as

\min_{W} \|X - XW\|_{F}^{2} + \lambda \|W\|_{F}^{2}, \quad \text{s.t. } \operatorname{diag}(W) = 0,

where \|\cdot\|_{F} is the Frobenius norm, λ is a balance parameter, and diag(·) denotes the diagonal operation of a matrix. To make full use of the advantage of locality constraints, Chen and Yi [33] combined LSR and the locality constraints into a unified framework and proposed the LCLSR approach for graph construction. The objective function of LCLSR is

\min_{W} \|X - XW\|_{F}^{2} + \beta_{1} \|W\|_{F}^{2} + \beta_{2} \|D \odot W\|_{F}^{2},

where β_1 and β_2 are two balance parameters and the symbol ⊙ represents the Hadamard product. D = [d_{ij}]_{N×N} = [e^{dist(x_i, x_j)}]_{N×N} denotes the distance matrix between samples, where dist(x_i, x_j) is a distance metric such as the Euclidean distance.
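Because the LSR objective is a ridge problem, its weight matrix has a closed form, which is exactly why it is faster than LRR. A minimal sketch (the function name is ours, and the diag(W) = 0 constraint is only approximated by zeroing the diagonal of the ridge solution afterwards):

```python
import numpy as np

def lsr_graph(X, lam=0.1):
    # Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T X;
    # diag(W) = 0 is approximated by zeroing the diagonal post hoc.
    N = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ X)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 12))               # 12 samples in 6 dimensions
W = lsr_graph(X)
```

A single linear solve replaces the iterative nuclear-norm optimization of LRR, which is the efficiency gain the text describes.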

2.2. Subspace Learning.
Locality Preserving Projection (LPP) [34] is a well-known subspace learning approach used to discover the geometric structure of a high-dimensional feature space. Suppose that the adjacency graph weight matrix W is given. LPP aims to ensure that if the original high-dimensional samples x_i and x_j are "close," then their low-dimensional representations y_i and y_j are close as well.
Using the weight W_ij as a penalty, LPP minimizes the following objective function:

\min_{Y} \frac{1}{2} \sum_{i,j} \|y_{i} - y_{j}\|^{2} W_{ij} = \min_{Y} \operatorname{tr}(Y L Y^{T}),

where tr(S) is the trace of a matrix S, L = D − W is the graph Laplacian, and D is a diagonal matrix with D_ii = Σ_j W_ij. D_ii measures the local density around x_i, and a bigger D_ii indicates that y_i is more important. Hence, a natural constraint can be imposed as Y D Y^T = I. Based on the equation Y = P^T X, the LPP model can be rewritten as

\min_{P} \operatorname{tr}(P^{T} X L X^{T} P), \quad \text{s.t. } P^{T} X D X^{T} P = I.

The projection matrix P is constructed from the eigenvectors associated with the d smallest nonzero eigenvalues of the generalized eigenproblem

X L X^{T} p = \lambda X D X^{T} p.

For a new high-dimensional sample x, the obtained projection matrix P yields the low-dimensional representation y = P^T x.
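The LPP procedure above can be sketched numerically by reducing the generalized eigenproblem to a standard one via B⁻¹A. This is an illustrative sketch (function name and toy data are ours; a small ridge term keeps the right-hand-side matrix invertible):

```python
import numpy as np

def lpp(X, W, dim=2, reg=1e-6):
    # Solve X L X^T p = lam X D X^T p and keep the eigenvectors of the
    # `dim` smallest eigenvalues as the columns of the projection matrix P.
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    A = X @ L @ X.T
    B = X @ D @ X.T + reg * np.eye(X.shape[0])
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)
    return vecs[:, order[:dim]].real

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 30))                   # 30 samples in 5 dimensions
sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
W = np.exp(-sq)                                # heat-kernel affinity
P = lpp(X, W, dim=2)
Y = P.T @ X                                    # low-dimensional embedding
```

A production implementation would use a symmetric generalized eigensolver, but the reduction shown here keeps the sketch self-contained.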

2.3. Spectral Clustering.
Spectral clustering is a popular clustering approach that uses the eigenvectors of a symmetric matrix derived from the distances between data points [35,36]. Given a data set of N points X ∈ R^{N×D}, spectral clustering aims to partition X into K disjoint clusters by exploiting the top K eigenvectors of the normalized graph Laplacian L. Suppose that the graph matrix W is obtained by a graph construction approach. The new representation Q ∈ R^{C×N} can be acquired by optimizing the following objective function:

\min_{Q} \operatorname{tr}(Q L Q^{T}), \quad \text{s.t. } Q Q^{T} = I,

where L = I - D^{-1/2} W D^{-1/2} is the normalized graph Laplacian and D is the degree matrix with D_{ii} = \sum_{j} W_{ij}.

Finally, data clustering can be accomplished by performing K-means on the new representation Q.

Proposed Method
In this section, some notations are introduced first. Then, we describe the proposed RGSL approach in detail. Finally, an iterative update algorithm is designed to solve the RGSL model.
3.1. Notations. Let X = [x_1, x_2, ⋯, x_N] ∈ R^{D×N} be the given high-dimensional original data matrix, where D is the dimensionality of the samples and N is the total number of samples. For a matrix B ∈ R^{D×N}, the Frobenius norm and the l2,1-norm are defined as

\|B\|_{F} = \sqrt{\sum_{i=1}^{D} \sum_{j=1}^{N} B_{ij}^{2}}, \qquad \|B\|_{2,1} = \sum_{i=1}^{D} \|b^{i}\|_{2},

where b^i and b_j are the i-th row and the j-th column of B, respectively.
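The two norms above differ only in where the square root is taken; a tiny sketch makes this concrete (names are ours):

```python
import numpy as np

def frobenius_norm(B):
    # ||B||_F: square root of the sum of all squared entries.
    return np.sqrt(np.sum(B ** 2))

def l21_norm(B):
    # ||B||_{2,1}: sum of the l2 norms of the rows of B.
    return np.sum(np.sqrt(np.sum(B ** 2, axis=1)))

B = np.array([[3.0, 4.0],
              [0.0, 0.0]])
```

For this B both norms equal 5, but the l2,1-norm grows linearly in the number of nonzero rows, which is why it encourages row sparsity and thus robustness to noisy samples.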

3.2. Objective Function.
First, in order to enhance the robustness of the graph learning algorithm to noise and obtain a discriminative graph structure, the l2,1-norm measure criterion is introduced into the traditional LSR model:

\min_{W} \|X - XW\|_{2,1} + \alpha \|W\|_{2,1},

which can be minimized through the iteratively reweighted form

\min_{W} \operatorname{tr}\big((X - XW) G (X - XW)^{T}\big) + \alpha \operatorname{tr}(W^{T} Q W),

where \|\cdot\|_{2} and \|\cdot\|_{2,1} denote the l2-norm and l2,1-norm, respectively, and α is a balance parameter. G and Q are diagonal matrices whose diagonal elements are defined as G_{jj} = 1/(\|x_{j} - Xw_{j}\|_{2} + \varepsilon) and Q_{ii} = 1/(\|w_{i}\|_{2} + \varepsilon), and ε is a small nonnegative constant that prevents the denominators from being zero.

Second, the relationship between representation coefficients is ignored in the sample reconstruction, i.e., similar original samples should generate similar coding vectors, which weakens the effectiveness of graph learning. To solve this issue, a modified manifold constraint based on the l2,1-norm is designed:

\min_{W} \sum_{i,j} s_{ij} \|w_{i} - w_{j}\|_{2},

which can be rewritten in the reweighted form tr(W^T L W), where s_ij denotes the similarity weight between samples x_i and x_j, L = D_R − R is the Laplacian of the reweighting matrix R with D_{R,ii} = Σ_j R_{ij}, and the elements of R are defined as

R_{ij} = \frac{s_{ij}}{2 \|w_{i} - w_{j}\|_{2} + \varepsilon}.

Finally, the nonnegative constraint is imposed on the representation coefficients, and the overall objective function of the proposed approach is

\min_{W} \|X - XW\|_{2,1} + \alpha \|W\|_{2,1} + \beta \sum_{i,j} s_{ij} \|w_{i} - w_{j}\|_{2}, \quad \text{s.t. } W \geq 0,

where α and β are two positive balance parameters.
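The three terms of the objective can be evaluated directly, which is useful for monitoring convergence. This sketch uses our own conventions where the text is ambiguous: the residual is measured columnwise (matching the definition of G), the regularizer rowwise (matching Q), and the manifold term couples the rows of W (the convention that matches an L·W-style update):

```python
import numpy as np

def rgsl_objective(X, W, S, alpha, beta):
    # Data term: columnwise l2,1-norm of the residual X - XW.
    data_term = np.sum(np.linalg.norm(X - X @ W, axis=0))
    # Regularizer: rowwise l2,1-norm of W.
    reg_term = np.sum(np.linalg.norm(W, axis=1))
    # Manifold term: similar samples should have similar representations.
    N = W.shape[0]
    manifold = sum(S[i, j] * np.linalg.norm(W[i, :] - W[j, :])
                   for i in range(N) for j in range(N))
    return data_term + alpha * reg_term + beta * manifold

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
val = rgsl_objective(X, np.eye(2), np.ones((2, 2)), 1.0, 1.0)
```

With W = I the residual vanishes, so the value reduces to the two regularization terms alone.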

3.3. Optimization.
In this section, we present the optimization procedure for the objective function of the proposed approach in Equation (10). Since the objective involves the l2,1-norm, it is nonsmooth with respect to the variable W, and a closed-form solution to Equation (10) cannot be given. To handle this limitation, an iterative update algorithm is designed to optimize the objective function.
3.3.1. Fix G, Q, and R, Update W. First, we fix the matrices G, Q, and R. After removing irrelevant terms, the optimization problem with respect to W in Equation (10) simplifies to

\min_{W \geq 0} \operatorname{tr}\big((X - XW) G (X - XW)^{T}\big) + \alpha \operatorname{tr}(W^{T} Q W) + \beta \operatorname{tr}(W^{T} L W).

Introducing the Lagrange multiplier matrix Φ for the nonnegativity constraint, the Lagrangian function of Equation (11) is

\mathcal{L}(W) = \operatorname{tr}\big((X - XW) G (X - XW)^{T}\big) + \alpha \operatorname{tr}(W^{T} Q W) + \beta \operatorname{tr}(W^{T} L W) + \operatorname{tr}(\Phi W^{T}).

Computing the derivative of Equation (12) with respect to W, setting it equal to zero, and applying the KKT condition Φ_{ij} W_{ij} = 0 [37], we update W as

W_{ij} \leftarrow W_{ij} \, \frac{[X^{T} G X]_{ij}}{[(X^{T} G X + \alpha Q + \beta L) W]_{ij}}.

3.4. Algorithm. In conclusion, the proposed optimization algorithm for RGSL is summarized in Algorithm 1.
In Algorithm 1, the convergence condition is that the change in the value of the objective function in Equation (10) falls below a threshold or a predefined maximum number of iterations is reached.

Experiment and Results
In this section, we first introduce the databases used in our experiments. Next, the compared graph learning approaches are described. Finally, subspace learning and clustering tasks are employed to verify the effectiveness of the proposed approach.

4.1. Databases.
Four commonly used multimedia databases, Yale [38], AR [39], CMU PIE [40], and Extended YaleB [41], are used to verify the effectiveness of the proposed approach. Detailed statistical information about the four databases is presented in Table 1.
Yale database: it contains 165 face images of 15 subjects. Each subject has 11 images with varied facial expressions, different illumination conditions, and with or without glasses. Some example images from the Yale database are depicted in Figure 2.

Although the graph structure can be obtained from the proposed approach, it is intractable to assess graph learning approaches using the estimated graph alone. Hence, we assess the quality of the learned graph through two multimedia data analysis tasks: subspace learning and spectral clustering. In our experiments, we vary the graph construction approach while fixing the learning task and observe the resulting performance on subspace learning and spectral clustering.

4.2. Comparison among Several Graph Learning Approaches.
To investigate the performance of our approach on subspace learning and clustering, several state-of-the-art graph learning approaches are chosen for comparison:

(i) KNN graph [8]: graph edges between two vertices are generated by the Euclidean distance-based K-nearest-neighbor rule, and edge weights are computed by the heat kernel

(ii) LLE graph [28]: each sample is linearly reconstructed by its neighbors within a local area to preserve the local manifold structure

(iii) L1 graph [11]: the locality structure of data is captured by l1 sparse representation optimization

(iv) LRR graph [14]: a low-rank graph is obtained based on the self-expressive property

(v) LSR graph [15]: the self-expressive property and the Frobenius norm are used for fast computation of the weight matrix

(vi) LCLSR graph [33]: it combines the locality constraint and LSR to explore both the global structure and the local linear relationships of data points

(vii) SGLS graph [42]: it integrates manifold constraints on the unknown sparse codes as a graph regularizer

(viii) RGSL graph (ours): it takes both global and local structure information into consideration and introduces the l2,1-norm regularization criterion and the nonnegative constraint into the graph construction process to enhance robustness

Algorithm 1: RGSL.
1: Input: the data matrix X = [x_1, x_2, ⋯, x_N], two balance parameters α and β.
2: Initialize: set G_t and Q_t to identity matrices, W to a random nonnegative matrix, and S to the sample similarity matrix; t = 1.
3: Repeat:
4: Update the matrix R_{t+1,ij} = s_{ij} / (2‖w_{t,i} − w_{t,j}‖_2 + ε)
5: Compute the matrix D_{t+1,ii} = Σ_{j=1}^{N} R_{t+1,ij} and the Laplacian matrix L_{t+1} = D_{t+1} − R_{t+1}
6: Update the matrix W_{ij} ← W_{ij} [X^T G X]_{ij} / [(αQ + βL + X^T G X) W]_{ij}
7: Update the matrix G_{t+1,jj} = 1 / (‖x_j − Xw_{t+1,j}‖_2 + ε)
8: Update the matrix Q_{t+1,ii} = 1 / (‖w_{t+1,i}‖_2 + ε)
9: t = t + 1
10: Until convergence
11: Output: the graph matrix W

4.3. Subspace Learning Experiment and Analysis. In this section, we employ an unsupervised subspace learning approach, Locality Preserving Projections (LPP), to verify the effectiveness of the proposed approach. In our experiments, different graphs are employed as W in the LPP approach for subspace learning, and the classification accuracy is used for performance comparison. For each database, we randomly select l images from each class as training samples; the remaining images are treated as test samples.
The values of l for the Yale, AR, CMU PIE, and Extended YaleB databases are set as {4, 5, 6}, {4, 5, 6}, {6, 8, 10}, and {10, 15, 20}, respectively. To test the performance of the proposed approach effectively and fairly, the random sample selection is repeated 20 times, and the average classification accuracy and standard deviation are reported as the final results. We employ the nearest neighbor classifier with Euclidean distance for classification due to its simplicity. The classification accuracy rate is chosen as the evaluation criterion, defined as

\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}},

where N_correct is the number of test samples correctly classified by the nearest neighbor classifier and N_total is the total number of test samples.
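The evaluation protocol above is easy to sketch; this illustrative helper (names and toy data are ours) classifies each test sample by its single nearest training sample and reports N_correct / N_total:

```python
import numpy as np

def nn_accuracy(X_train, y_train, X_test, y_test):
    # Accuracy = N_correct / N_total with a 1-NN Euclidean classifier.
    correct = 0
    for x, y in zip(X_test.T, y_test):
        d = np.linalg.norm(X_train - x[:, None], axis=0)
        correct += int(y_train[np.argmin(d)] == y)
    return correct / len(y_test)

X_train = np.array([[0.0, 10.0],
                    [0.0, 10.0]])          # one sample per column
y_train = np.array([0, 1])
X_test = np.array([[1.0, 9.0],
                   [1.0, 9.0]])
y_test = np.array([0, 1])
acc = nn_accuracy(X_train, y_train, X_test, y_test)
```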
All experiments are conducted using MATLAB 2016b on a 3.60 GHz machine with 8 GB of RAM. To acquire the optimal parameters of the different approaches, we employ a grid search in our experiments. Tables 2-5 report the average classification accuracy rates and standard deviations of the compared approaches on the Yale, AR, CMU PIE, and Extended YaleB databases, respectively. The values in brackets in Tables 2-5 indicate the data dimensionality at which the best classification accuracy rate is achieved.
From the results in Tables 2-5, we can clearly observe that most graph learning approaches perform better than the KNN graph and LLE graph. This indicates that graph construction based on Euclidean distance is very sensitive to noise points, which weakens the classification performance.
Besides, compared to the L1 graph learning approach, the LRR, LSR, SGLS, and LCLSR graphs take the locality structure of the data into consideration during the graph construction process and thus achieve better performance. Finally, the proposed RGSL approach performs best among all compared approaches. The main reasons are as follows: first, both the global structure and the local structure are essential to graph learning. Second, the l2,1-norm regularization criterion and the nonnegative constraint are introduced into the graph construction process to improve the robustness of our approach against noise. Therefore, our approach further improves the classification performance.
There are two parameters, α and β, in the objective function of the proposed approach, so setting their values appropriately is very important. In this study, we tune the values of α and β by searching the grid {0.001, 0.01, 0.1, 1, 10, 100} in an alternating manner. The best results for different parameter values on the four databases are shown in Figure 3.
As we can see from Figure 3, when the values of α and β are relatively small, the performance of the proposed approach is relatively poor. As α and β increase, the performance improves. However, after reaching the best classification result, the performance decreases dramatically as the two parameters continue to increase. Therefore, the proposed approach obtains its best classification results when α and β are set neither too large nor too small. Finally, the convergence curves of RGSL on the four databases are shown in Figure 4, where the x-axis and y-axis denote the iteration number and the value of the objective function, respectively. As seen from Figure 4, the objective value declines at each iteration and converges very quickly on all databases.

4.4. Clustering Experiment and Analysis.
In spectral clustering, the initialization has a major impact on the performance of the K-means clustering algorithm. Therefore, we run the clustering procedure 50 times with different random initializations and report the average results with standard deviations. In the experiments, three widely used clustering evaluation indicators, Accuracy (ACC), Normalized Mutual Information (NMI), and Purity, are used to evaluate the performance of the proposed approach.

For a given sample x_i, suppose that the obtained clustering result is p_i and the true label is t_i. The clustering accuracy is calculated as

\text{ACC} = \frac{1}{N} \sum_{i=1}^{N} \delta\big(t_{i}, m(p_{i})\big),

where δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise, the function m(·) maps the clustering result to the corresponding ground-truth label, and N is the number of samples. The Kuhn-Munkres algorithm [37] is employed to find the best mapping. Assuming that P and T are, respectively, the clustering result and the true label set, the Mutual Information (MI) is defined as

\text{MI}(P, T) = \sum_{p_{i} \in P, \, t_{j} \in T} Q(p_{i}, t_{j}) \log \frac{Q(p_{i}, t_{j})}{Q(p_{i}) Q(t_{j})},

where Q(p_i) and Q(t_j) represent the probabilities that a randomly selected sample belongs to p_i and t_j, respectively, and Q(p_i, t_j) represents the joint probability that a randomly selected sample belongs to both. Let H(P) and H(T) be the entropies of P and T, respectively. The Normalized Mutual Information (NMI) is calculated as

\text{NMI}(P, T) = \frac{\text{MI}(P, T)}{\max\big(H(P), H(T)\big)}.

Purity is defined as

\text{Purity} = \sum_{i=1}^{k} \frac{|C_{i}|}{N} \cdot \frac{|C_{i}^{d}|}{|C_{i}|},

where k represents the number of clusters, |C_i^d| is the number of elements in the most numerous category in cluster C_i, and |C_i| is the number of elements in cluster C_i.

Tables 6-9 show the best ACC, NMI, and Purity values of the eight approaches on the Yale, AR, CMU PIE, and Extended YaleB databases, respectively. According to these results, the following conclusions can be drawn. First, since the KNN graph and LLE graph are based on Euclidean distance, they are very sensitive to noise points, outliers, and parameter values, so their clustering performance is lower than that of the other compared approaches. Second, the performance of the LRR, LSR, SGLS, and LCLSR graphs is superior to that of the L1 graph because they take the locality structure of the data into consideration during graph construction.
However, the objective functions of these approaches are all based on the l2-norm, so they are very sensitive to noisy data.
Besides, the relationship between representation coefficients is ignored in the sample reconstruction, i.e., similar original samples should generate similar coding vectors, which weakens the effectiveness of graph learning. To overcome these disadvantages, our RGSL approach combines the l2,1-norm with manifold constraints on the coding coefficients to learn a local and smooth representation. Therefore, the performance of the proposed approach is superior to that of all the compared approaches.
Similar to the subspace learning experiment, we tune the values of α and β by searching the grid {0.001, 0.01, 0.1, 1, 10, 100} in an alternating manner. The objective function contains three terms. When α and β are set too small, the effect of the second and third terms is weakened and the first term is overemphasized. On the contrary, when they are set too large, the second and third terms dominate, reducing the effect of the first term. Therefore, the proposed RGSL approach achieves its best performance when α and β are set to moderate values, consistent with the discussion of the subspace learning experiment.
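The evaluation indicators used in this section can be sketched directly from their definitions. This is an illustrative sketch with our own names (ACC additionally requires the Kuhn-Munkres label matching and is omitted here):

```python
import numpy as np

def purity(pred, true):
    # Fraction of samples falling in the dominant true class of their cluster.
    total = 0
    for c in np.unique(pred):
        members = true[pred == c]
        total += np.bincount(members).max()
    return total / len(true)

def nmi(pred, true):
    # Mutual information normalized by max(H(P), H(T)).
    mi = 0.0
    for p in np.unique(pred):
        for t in np.unique(true):
            joint = np.mean((pred == p) & (true == t))
            if joint > 0:
                mi += joint * np.log(joint / (np.mean(pred == p) * np.mean(true == t)))
    def entropy(z):
        return -sum(np.mean(z == v) * np.log(np.mean(z == v)) for v in np.unique(z))
    return mi / (max(entropy(pred), entropy(true)) + 1e-12)

pred = np.array([0, 0, 1, 1])
true = np.array([1, 1, 0, 0])              # perfect clustering up to relabeling
```

A perfect clustering with permuted labels scores 1.0 on both indicators, showing that they are invariant to the arbitrary numbering of clusters.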

Conclusions
This paper proposes a novel graph learning framework named Robust Graph Structure Learning (RGSL) for effective multimedia data analysis. To preserve both the local and global structures of the data, we employ data self-representation to capture the global structure and an adaptive neighbor approach to describe the local structure. Furthermore, we introduce the l2,1-norm regularization criterion and a nonnegative constraint into graph learning to improve the robustness of the model against noise. Extensive experimental results on subspace learning and clustering tasks show that the proposed approach performs better than state-of-the-art graph learning approaches. Since the proposed approach is affected by graph construction when the dimensionality of the data is high, in future work we will integrate dimensionality reduction, subspace learning, and graph learning into a unified framework to address this issue.

Data Availability
The data are derived from public domain resources.

Conflicts of Interest
The authors declare that they have no conflicts of interest.