Constructing a powerful graph that can effectively depict the intrinsic connections among data points is the critical step in making graph-based semisupervised learning algorithms achieve promising performance. Among popular graph construction algorithms, low-rank representation (LRR) is a very competitive one that can simultaneously explore the global structure of data and recover the data from noisy environments. Therefore, the low-rank coefficient matrix learned by LRR can be used to construct the data affinity matrix. Considering the existing problems that (1) the essentially linear nature of LRR makes it inappropriate for handling the possibly nonlinear structure of data and (2) learning performance can be greatly enhanced by exploring the structure information of data, we propose a new manifold kernelized low-rank representation (MKLRR) model that performs LRR in a manifold adaptive kernel space. Specifically, the manifold structure is incorporated into the kernel space by using the graph Laplacian, so that the warped kernel space reflects the underlying geometry of the data. Experimental results on semisupervised image classification tasks show the effectiveness of MKLRR. For example, MKLRR can, respectively, obtain 96.13%, 98.09%, and 96.08% accuracies on the ORL, Extended Yale B, and PIE data sets when given 5, 20, and 20 labeled face images per subject.
1. Introduction
Since it is usually not easy to collect a large number of labeled samples to train learning models, the semisupervised learning (SSL) paradigm, which can harness both labeled and unlabeled samples to improve learning performance, has drawn a lot of attention in recent studies [1–7]. Among existing SSL algorithms, graph-based algorithms are among the most popular approaches, in which label propagation is performed on a graph [8–11]. The underlying idea of graph-based algorithms is to characterize the relationship between data pairs by an affinity matrix. Although researchers have pointed out that sparsity, high discriminative power, and adaptive neighborhood are desirable properties of a good graph [12], how to learn a good graph that accurately uncovers the latent relationships in data is still a challenging problem.
Among existing graph construction methods, k-nearest neighbors and ε-neighborhood are the two most widely used algorithms. However, they are usually sensitive to noisy environments, especially when the data contain outliers. To construct more effective graphs, many new algorithms have been proposed. The sparse graph [8], derived by encoding each datum as a sparse representation of the remaining samples, is parameter-free and insensitive to outliers, and it can automatically select the most informative neighbors for each datum. However, since sparse representation encodes each datum individually, the resulting sparse graph only emphasizes the local structure of data while neglecting its global structure. This property deteriorates its performance, especially when data are grossly corrupted [13]. Different from sparse representation, which enforces the representation coefficients to be sparse [14], low-rank representation aims to learn the data affinities jointly, which can reveal the global structure of data and preserve the membership of samples that belong to the same class in noisy environments [15, 16]. The learned LRR graph can capture the global mixture-of-subspaces structure via the low-rankness property and thus is both generative and discriminative for semisupervised learning tasks [9].
Apart from the conventional LRR model, many advanced variants have been proposed recently. To efficiently explore the structure information of data, Zheng et al. imposed a local constraint on the representation coefficients and thus formulated the low-rank representation with local constraint (LRRLC) model [10]. Lu et al. proposed the graph regularized LRR (GLRR), which introduces a graph regularizer to enforce the local consistency of data [17]. Zhuang et al. proposed incorporating sparse and nonnegative constraints into low-rank representation and formulated the NNLRS model [9]. The manifold low-rank representation (MLRR) [18] first uses a sparse learning objective to identify the data manifold and then incorporates the manifold information into low-rank representation as a regularizer. Additionally, [19] proposed preserving the structure information of data from two aspects: local affinity and distant repulsion. Li and Fu proposed constructing a graph based on low-rank coding and a b-matching constraint to obtain a sparse and balanced graph [20]. All the above-mentioned low-rank models are linear; therefore, they inevitably have limitations in modeling complex data distributions that do not strictly follow a linear model. To make the low-rank model effectively deal with the nonlinear structure of data, [11] proposed the kernel low-rank representation (KLRR) graph for semisupervised classification by using the kernel trick. As a nonlinear extension of LRR, KLRR has also shown excellent performance in face recognition [21].
Recent studies [22–26] have shown that learning performance can be greatly enhanced by considering the geometrical structure of data and the local invariance idea [27]. It is natural to exploit this idea both in the original data space and in the reproducing kernel Hilbert space (RKHS). However, no existing LRR variant takes into account the intrinsic manifold structure in the RKHS. In this paper, we propose a novel manifold adaptive kernelized LRR for semisupervised classification. By using the data-dependent norm on the RKHS proposed in [28], we can warp the structure of the RKHS to reflect the underlying geometry of the data. The conventional low-rank representation is then performed in the resulting manifold adaptive kernel space. The main contributions of this paper can be briefly summarized as follows:
We construct the manifold adaptive kernel space, where the underlying geometry of data can be reflected by the graph Laplacian.
We give the model formulation, the optimization method, and the complexity analysis of MKLRR in detail.
We conduct extensive experiments on semisupervised image classification tasks to evaluate the effectiveness of MKLRR, and the experimental results show that MKLRR achieves promising performance.
The remainder of this paper is organized as follows. In Section 2, we give a brief review of the conventional LRR model and the semisupervised learning framework used in our work. Section 3 describes the model formulation, optimization method, and complexity analysis of the manifold adaptive kernelized LRR model in detail. Experimental studies of MKLRR on semisupervised image classification tasks are presented in Section 4. Section 5 concludes the paper and presents an extension of MKLRR as future work.
2. Related Work
In this section, we give a brief review of the conventional low-rank representation model [15] and the semisupervised classification framework based on Gaussian Fields and Harmonic Functions (GHF) [1].
2.1. LRR
Given a set of samples $X=[x_1,x_2,\dots,x_n]\in\mathbb{R}^{d\times n}$, LRR aims to represent each sample as a linear combination of the bases in $A=[a_1,a_2,\dots,a_m]\in\mathbb{R}^{d\times m}$ by $X=AZ$, where $Z=[z_1,z_2,\dots,z_n]$ is the matrix in which each $z_i$ is the representation coefficient corresponding to sample $x_i$. Therefore, each entry in $z_i$ can be viewed as its contribution to the reconstruction of $x_i$ with $A$ as the dictionary. LRR seeks the lowest rank solution by solving the following optimization problem [15]:
$$\min_{Z}\ \operatorname{rank}(Z),\quad \text{s.t. } X=AZ. \tag{1}$$
It is NP-hard to directly optimize the rank function. Therefore, the trace norm (also called the nuclear norm) is usually used as the closest convex surrogate of the rank function, which leads to the following objective [29]:
$$\min_{Z}\ \|Z\|_{*},\quad \text{s.t. } X=AZ, \tag{2}$$
where $\|\cdot\|_{*}$ denotes the sum of the singular values of a matrix [30]. Considering the fact that samples are usually noisy or even grossly corrupted, a more reasonable objective for LRR can be expressed as
$$\min_{Z,E}\ \|Z\|_{*}+\lambda\|E\|_{2,1},\quad \text{s.t. } X=AZ+E, \tag{3}$$
where $E\in\mathbb{R}^{d\times n}$ and $\|E\|_{2,1}=\sum_{j=1}^{n}\sqrt{\sum_{i=1}^{d}e_{ij}^{2}}$. The second term in (3) characterizes the error by modeling sample-specific corruptions. Some existing studies instead employ the $\ell_1$-norm to measure the error term [31, 32]. The optimal solution $Z^{*}$ can be obtained via the inexact augmented Lagrange multiplier method [31].
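As an aside for readers who want to experiment with these objectives, the two norms appearing in (2) and (3) are straightforward to evaluate numerically. The following is a minimal NumPy sketch (ours, not part of the original formulation):

import numpy as np

def nuclear_norm(Z):
    # Nuclear (trace) norm: sum of singular values, the convex surrogate of rank in (2).
    return np.linalg.svd(Z, compute_uv=False).sum()

def l21_norm(E):
    # l2,1-norm as defined after (3): sum of the Euclidean norms of the columns of E.
    return np.linalg.norm(E, axis=0).sum()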
2.2. GHF
Assume that we have a data set $X=[x_1,\dots,x_l,x_{l+1},\dots,x_n]\in\mathbb{R}^{d\times n}$ from $c$ classes, where $x_i$, $i=1,\dots,l$, and $x_i$, $i=l+1,\dots,n$, are the labeled and unlabeled samples, respectively. The label indicator matrix $Y\in\mathbb{R}^{n\times c}$ is defined as follows: for each sample $x_i$ ($i=1,\dots,n$), $y_i\in\mathbb{R}^{c}$ is its label vector. If $x_i$ is from the $k$th ($k=1,2,\dots,c$) class, then only the $k$th entry of $y_i$ is one and all the other entries are zeros. If $x_i$ is unlabeled, then $y_i=0$.
GHF is a well-known graph-based semisupervised learning framework in which the predicted label matrix $F\in\mathbb{R}^{n\times c}$ is estimated on the graph with respect to the label fitness and the manifold smoothness. Let $f_i$ and $y_i$, respectively, denote the $i$th rows of $F$ and $Y$. GHF tries to minimize the following objective:
$$\min_{F}\ \frac{1}{2}\sum_{i,j=1}^{n}\|f_i-f_j\|^{2}s_{ij}+\lambda_{\infty}\sum_{i=1}^{l}\|f_i-y_i\|^{2}, \tag{4}$$
where $\lambda_{\infty}$ is a very large value such that the label fitting constraint $f_i=y_i$ ($i=1,\dots,l$) is approximately satisfied, and $S\in\mathbb{R}^{n\times n}$ is an affinity matrix depicting the pairwise similarity of samples. Obviously, (4) can be rewritten in the compact matrix form
$$\min_{F}\ \frac{1}{2}\operatorname{Tr}\bigl(F^{T}L_{S}F\bigr)+\operatorname{Tr}\bigl((F-Y)^{T}U(F-Y)\bigr), \tag{5}$$
where the graph Laplacian matrix $L_{S}\in\mathbb{R}^{n\times n}$ is calculated as $L_{S}=D-S$, with $D$ a diagonal degree matrix whose entries are $d_{ii}=\sum_{j}s_{ij}$ (or $\sum_{i}s_{ij}$, since $S$ is usually symmetric). $U$ is also a diagonal matrix whose first $l$ diagonal entries are $\lambda_{\infty}$ and whose remaining $n-l$ diagonal entries are $0$.
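Setting the derivative of (5) with respect to F to zero gives the familiar closed-form solution F = (L_S + U)^{-1} U Y (the constant factor arising from differentiation can be absorbed into λ∞). A minimal NumPy sketch of this prediction step is given below; the function name and arguments are ours and purely illustrative:

import numpy as np

def ghf_predict(S, Y, labeled_idx, lam_inf=1e6):
    # Solve (5) in closed form: F = (L_S + U)^(-1) U Y.
    n = S.shape[0]
    L_S = np.diag(S.sum(axis=1)) - S        # graph Laplacian of the affinity matrix
    u = np.zeros(n)
    u[labeled_idx] = lam_inf                # large penalty enforces f_i ~= y_i on labeled points
    U = np.diag(u)
    F = np.linalg.solve(L_S + U, U @ Y)
    return F.argmax(axis=1)                 # predicted class = index of the largest score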
3. Manifold Adaptive Kernelized LRR
3.1. Manifold Adaptive Kernel Space
In this section, we show how to incorporate the manifold structure into the reproducing kernel Hilbert space (RKHS), which leads to the manifold adaptive kernel space.
The kernel trick is usually applied with the hope of discovering the nonlinear structure in data by mapping the original nonlinear observations into a higher dimensional linear space [33]. The most commonly used kernels are the Gaussian and polynomial kernels. However, the nonlinear structure captured by such data-independent kernels may not be consistent with the intrinsic manifold structure, such as geodesic distance, curvature, and homology [34, 35].
In this work, we adopt the manifold adaptive kernel proposed in [28]. Let $V$ be a linear space with a positive semidefinite inner product (quadratic form) and let $S:\mathcal{H}\to V$ be a bounded linear operator. We define $\tilde{\mathcal{H}}$ to be the space of functions from $\mathcal{H}$ with the modified inner product
$$\langle f,g\rangle_{\tilde{\mathcal{H}}}=\langle f,g\rangle_{\mathcal{H}}+\langle Sf,Sg\rangle_{V}; \tag{6}$$
$\tilde{\mathcal{H}}$ is still an RKHS [28].
Given samples $x_1,\dots,x_m$, let $S:\mathcal{H}\to\mathbb{R}^{m}$ be the evaluation map
$$Sf=\bigl(f(x_1),\dots,f(x_m)\bigr)^{T}. \tag{7}$$
Denote $\mathbf{f}=\bigl(f(x_1),\dots,f(x_m)\bigr)^{T}$ and $\mathbf{g}=\bigl(g(x_1),\dots,g(x_m)\bigr)^{T}$. Note that $\mathbf{f},\mathbf{g}\in V$; thus we have
$$\langle Sf,Sg\rangle_{V}=\langle \mathbf{f},\mathbf{g}\rangle=\mathbf{f}^{T}M\mathbf{g}, \tag{8}$$
where $M$ is a positive semidefinite matrix. For a data vector $x$, we define
$$\kappa_{x}=\bigl(\kappa(x,x_1),\dots,\kappa(x,x_m)\bigr)^{T}. \tag{9}$$
It can be shown that the reproducing kernel in $\tilde{\mathcal{H}}$ is
$$\tilde{\kappa}(x,z)=\kappa(x,z)-\gamma\,\kappa_{x}^{T}(I+MK)^{-1}M\kappa_{z}, \tag{10}$$
where $I$ is an identity matrix, $K$ is the kernel matrix in $\mathcal{H}$, and $\gamma\ge 0$ is a constant controlling the smoothness of the functions. The key issue now is the choice of $M$, so that the deformation of the kernel induced by the data-dependent norm is consistent with the intrinsic geometry of the data.
Without loss of generality, we assume that $n_q$ data points are used to derive the linear space $V$. It is easy to rewrite formulation (10) in the compact matrix form
$$\tilde{K}_{\tilde{\mathcal{H}}}=K_{\mathcal{H}}-\gamma K_{\mathcal{H}_{qn}}^{T}\bigl(I_{\mathcal{H}_{q}}+M_{\mathcal{H}_{q}}K_{\mathcal{H}_{q}}\bigr)^{-1}M_{\mathcal{H}_{q}}K_{\mathcal{H}_{qn}}, \tag{11}$$
where the matrices $K_{\mathcal{H}}\in\mathbb{R}^{n\times n}$, $K_{\mathcal{H}_{qn}}\in\mathbb{R}^{n_q\times n}$, and $K_{\mathcal{H}_{q}}\in\mathbb{R}^{n_q\times n_q}$ are all computed in $\mathcal{H}$. Here, $I_{\mathcal{H}_{q}}$ is an identity matrix of the same size as $K_{\mathcal{H}_{q}}$. $\tilde{K}_{\tilde{\mathcal{H}}}$ is referred to as the kernel matrix in the warped RKHS.
As mentioned above, the key issue is the choice of $M$; the manifold structure can be discovered by the graph Laplacian associated with the data points.
3.2. The Objective Function
Following [11], the objective of kernel low-rank representation is formulated as
$$\min_{\bar{Z},\bar{E}}\ \operatorname{rank}(\bar{Z})+\lambda\|\bar{E}\|_{2,1},\quad \text{s.t. } K=K\bar{Z}+\bar{E}. \tag{12}$$
In order to learn a low-rank representation that is consistent with the manifold geometry, it is natural to take advantage of the manifold adaptive kernel in KLRR.
In order to model the manifold structure, we construct a nearest-neighbor graph $G$. For each data point $x_i$, we find its $p$ nearest neighbors, denoted by $N_p(x_i)$, and put an edge between $x_i$ and each of its neighbors. There are many choices for the weight matrix on the graph; we use the "0-1" form defined as
$$w_{ij}=\begin{cases}1, & \text{if } x_i\in N_p(x_j) \text{ or } x_j\in N_p(x_i),\\ 0, & \text{otherwise.}\end{cases} \tag{13}$$
The graph Laplacian [36] is defined as $L=D-W$, where $D$ is a diagonal degree matrix given by $d_{ii}=\sum_{j}w_{ij}$ (or $\sum_{i}w_{ij}$, since $W$ is symmetric). The graph Laplacian provides the following smoothness penalty on the graph:
$$\mathbf{f}^{T}L\mathbf{f}=\frac{1}{2}\sum_{i,j=1}^{n}\bigl(f(x_i)-f(x_j)\bigr)^{2}w_{ij}. \tag{14}$$
Therefore, it is natural to substitute the graph Laplacian $L$ for $M$. For convenience, we make use of all the available data points to derive the linear space $V$ in the warped RKHS (i.e., $n_q=n$); then (11) can be rewritten as
$$\tilde{K}_{M}=K-\gamma K^{T}(I+LK)^{-1}LK, \tag{15}$$
where $\tilde{K}_{M}$ indicates that this kernel matrix lives in the manifold adaptive RKHS.
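To make the construction of the manifold adaptive kernel concrete, the following NumPy/SciPy sketch builds the p-nearest-neighbor graph of (13), its Laplacian, a Gaussian base kernel, and the warped kernel of (15). The function name and default values are ours; the bandwidth heuristic anticipates the setting described in Section 4.1.

import numpy as np
from scipy.spatial.distance import cdist

def manifold_adaptive_kernel(X, p=5, gamma=1.0, sigma=None):
    # X: d x n data matrix (columns are samples), following the paper's notation.
    n = X.shape[1]
    D2 = cdist(X.T, X.T, 'sqeuclidean')

    # Gaussian base kernel; bandwidth set to the mean pairwise distance (Section 4.1 heuristic).
    if sigma is None:
        sigma = np.sqrt(D2[np.triu_indices(n, 1)]).mean()
    K = np.exp(-D2 / (2 * sigma ** 2))

    # "0-1" p-nearest-neighbor weight matrix, symmetrized as in (13).
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D2[i])[1:p + 1]        # skip the point itself
        W[i, nn] = 1
    W = np.maximum(W, W.T)

    L = np.diag(W.sum(axis=1)) - W             # graph Laplacian, L = D - W

    # Warped kernel of (15): K_M = K - gamma * K (I + L K)^(-1) L K.
    K_M = K - gamma * K @ np.linalg.solve(np.eye(n) + L @ K, L @ K)
    return K_M, L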
Using the nuclear norm to replace the rank function, we arrive at the following objective of manifold adaptive kernelized LRR:
$$\min_{\bar{Z},\bar{E}}\ \|\bar{Z}\|_{*}+\lambda\|\bar{E}\|_{2,1},\quad \text{s.t. } \tilde{K}_{M}=\tilde{K}_{M}\bar{Z}+\bar{E}. \tag{16}$$
Figure 1 shows the connection between MKLRR and LRR as well as its variants. As we can see, LRR variants such as GLRR, LRRLC, and MLRR can be reached by incorporating manifold information. By using the kernel trick, the KLRR model can find the lowest rank representation in RKHS. Further, by considering the geometric structure of data in RKHS, we can formulate the MKLRR model. Both KLRR and MKLRR are nonlinear models, since an implicit nonlinear mapping is employed.
Connections between several LRR models.
3.3. Optimization
To make objective (16) separable, we introduce an auxiliary variable $J$ with respect to $\bar{Z}$, which yields the following objective:
$$\min_{\bar{Z},\bar{E},J}\ \|J\|_{*}+\lambda\|\bar{E}\|_{2,1},\quad \text{s.t. } \tilde{K}_{M}=\tilde{K}_{M}\bar{Z}+\bar{E},\ \bar{Z}=J. \tag{17}$$
The corresponding augmented Lagrangian function is
$$\min_{\bar{Z},J,\bar{E},Y_1,Y_2}\ \|J\|_{*}+\lambda\|\bar{E}\|_{2,1}+\langle Y_1,\tilde{K}_{M}-\tilde{K}_{M}\bar{Z}-\bar{E}\rangle+\langle Y_2,\bar{Z}-J\rangle+\frac{\mu}{2}\Bigl(\|\tilde{K}_{M}-\tilde{K}_{M}\bar{Z}-\bar{E}\|_{F}^{2}+\|\bar{Z}-J\|_{F}^{2}\Bigr), \tag{18}$$
where $Y_1$ and $Y_2$ are Lagrange multipliers and $\mu>0$ is a penalty parameter. The inexact augmented Lagrange multiplier (ALM) algorithm is employed to optimize objective (18) [31]. The detailed optimization process is summarized in Algorithm 1.
Algorithm 1: Optimization of (18).
Input: data points $\{(x_i,y_i)\}_{i=1}^{l}\cup\{x_i\}_{i=l+1}^{n}$, regularization parameter $\lambda$;
Initialize: $\bar{Z}=J=0$, $\bar{E}=0$, $Y_1=Y_2=0$, $\mu=10^{-6}$, $\mu_{\max}=10^{6}$, $\rho=1.1$, and $\varepsilon=10^{-8}$;
Output: the low-rank representation coefficient matrix $\bar{Z}$.
(1) while not converged do
(2) Fix the other variables and update $J$ by
$J=\arg\min_{J}\ \frac{1}{\mu}\|J\|_{*}+\frac{1}{2}\bigl\|J-\bigl(\bar{Z}+\frac{Y_2}{\mu}\bigr)\bigr\|_{F}^{2}$ (*)
(3) Fix the others and update $\bar{Z}$ by
$\bar{Z}=\bigl(I+\tilde{K}_{M}^{T}\tilde{K}_{M}\bigr)^{-1}\Bigl(\tilde{K}_{M}^{T}\tilde{K}_{M}-\tilde{K}_{M}^{T}\bar{E}+J+\frac{\tilde{K}_{M}^{T}Y_1-Y_2}{\mu}\Bigr)$
(4) Fix the others and update $\bar{E}$ by
$\bar{E}=\arg\min_{\bar{E}}\ \frac{\lambda}{\mu}\|\bar{E}\|_{2,1}+\frac{1}{2}\bigl\|\bar{E}-\bigl(\tilde{K}_{M}-\tilde{K}_{M}\bar{Z}+\frac{Y_1}{\mu}\bigr)\bigr\|_{F}^{2}$
(5) Update the multipliers:
$Y_1=Y_1+\mu\bigl(\tilde{K}_{M}-\tilde{K}_{M}\bar{Z}-\bar{E}\bigr)$
$Y_2=Y_2+\mu\bigl(\bar{Z}-J\bigr)$
(6) Update the parameter $\mu$ by $\mu=\min(\rho\mu,\mu_{\max})$
(7) Check the convergence conditions:
$\|\tilde{K}_{M}-\tilde{K}_{M}\bar{Z}-\bar{E}\|_{\infty}<\varepsilon$ and $\|\bar{Z}-J\|_{\infty}<\varepsilon$
(8) end while
The updating rule for $J$ is based on the singular value thresholding operator, which is given by the following theorem [30].
Theorem 1.
Let $C\in\mathbb{R}^{m\times n}$ and let $C=U\Sigma V^{T}$ be the SVD of $C$, where $U\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{n\times r}$ have orthonormal columns, $\Sigma\in\mathbb{R}^{r\times r}$ is diagonal, and $r=\operatorname{rank}(C)$. Then
$$\mathcal{T}_{\lambda}(C)=\arg\min_{W}\ \frac{1}{2}\|W-C\|_{F}^{2}+\lambda\|W\|_{*} \tag{19}$$
is given by $\mathcal{T}_{\lambda}(C)=U\Sigma_{\lambda}V^{T}$, where $\Sigma_{\lambda}$ is diagonal with $(\Sigma_{\lambda})_{ii}=\max\{0,\Sigma_{ii}-\lambda\}$.
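Theorem 1 translates directly into a few lines of code. A minimal NumPy sketch of the singular value thresholding operator (our naming) is:

import numpy as np

def svt(C, tau):
    # Singular value thresholding T_tau(C) = U * max(Sigma - tau, 0) * V^T (Theorem 1).
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt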
The updating rule for $\bar{E}$ can be obtained by the soft-shrinkage operator [15], which is given below.
Theorem 2.
Let $Q=[q_1,q_2,\dots,q_i,\dots]$ be a given matrix and let $\|\cdot\|_F$ be the Frobenius norm. If the optimal solution to
$$\min_{W}\ \lambda\|W\|_{2,1}+\frac{1}{2}\|W-Q\|_{F}^{2} \tag{20}$$
is $W^{*}$, then the $i$th column of $W^{*}$ is
$$W^{*}(:,i)=\begin{cases}\dfrac{\|q_i\|-\lambda}{\|q_i\|}\,q_i, & \text{if } \lambda<\|q_i\|,\\ 0, & \text{otherwise.}\end{cases} \tag{21}$$
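Similarly, the column-wise shrinkage of Theorem 2 can be sketched as follows (again, the function name and the small constant guarding against division by zero are ours):

import numpy as np

def l21_shrink(Q, lam):
    # Solve (20): scale each column q_i by (||q_i|| - lam)/||q_i|| if ||q_i|| > lam, else set it to zero.
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - lam, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale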
3.4. Algorithm Workflow and Complexity Analysis
As a whole, we summarize the manifold adaptive kernelized low-rank representation-based semisupervised classification algorithm as follows:
Construct the graph Laplacian: construct a p nearest-neighbor graph G with weight matrix defined in (13) and then calculate the graph Laplacian by L=D-W
Calculate the manifold adaptive kernel: assume that the kernel matrix K in H can be induced from any data-independent kernel (e.g., Gaussian kernel or linear kernel). Then calculate the manifold adaptive kernel K~M in the warped RKHS according to (15)
Manifold kernel low-rank representation: optimize the MKLRR model and obtain the low-rank representation coefficient matrix Z¯ via Algorithm 1. Shrink some small values in Z¯ and then make it symmetric and nonnegative as Z¯=(|Z¯|+|Z¯T|)/2
Semisupervised classification: calculate the Laplacian matrix Ls=D-Z¯ and do semisupervised classification based on (5)
Below we give a brief analysis of the computational complexity of MKLRR. Constructing the $p$ nearest-neighbor graph in the first step needs $O(pn^2)$. In the second step, computing the data-independent kernel matrix $K$ needs $O(n^2)$ and computing the manifold adaptive kernel matrix needs $O(n^3)$. In the fourth step, the complexity of semisupervised learning based on GHF is $O(n^3)$ (since $c\ll n$). We now describe the complexity of Algorithm 1 in more detail. Obviously, the main computational burden of MKLRR lies in the updating of $J$, since it involves a singular value decomposition (SVD). Specifically, in equation (*) of Algorithm 1, the SVD is performed on an $n\times n$ matrix, which is time-consuming if the number of samples $n$ is large. As noted in [37], by substituting the dictionary with its orthogonal basis, this computation can be reduced to $O(r^2 n)$, where $r$ is the rank of the dictionary. The computational complexity of updating $\bar{Z}$ is trivial owing to its simple closed-form solution. The complexity of updating $\bar{E}$ is $O(n^2 r)$. Thus, the overall computational complexity of MKLRR-based semisupervised learning is $O(t(r^2 n+n^2 r)+pn^2+n^3)$, where $t$ is the number of iterations of the loop in Algorithm 1.
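To make the workflow above concrete, the following sketch strings together the helper functions sketched earlier (manifold_adaptive_kernel, svt, l21_shrink, and ghf_predict) into Algorithm 1 and the four steps of Section 3.4. It is an illustrative reimplementation under our own naming and default parameters, not the authors' code (the original experiments were run in Matlab):

import numpy as np

def mklrr(K_M, lam=1.0, rho=1.1, mu=1e-6, mu_max=1e6, eps=1e-8, max_iter=500):
    # Inexact ALM for objective (16), following the steps of Algorithm 1.
    n = K_M.shape[0]
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((n, n))
    Y1 = np.zeros((n, n)); Y2 = np.zeros((n, n))
    KtK = K_M.T @ K_M
    for _ in range(max_iter):
        J = svt(Z + Y2 / mu, 1.0 / mu)                              # step (2): nuclear-norm proximal step
        rhs = KtK - K_M.T @ E + J + (K_M.T @ Y1 - Y2) / mu
        Z = np.linalg.solve(np.eye(n) + KtK, rhs)                   # step (3): closed-form update
        E = l21_shrink(K_M - K_M @ Z + Y1 / mu, lam / mu)           # step (4): l2,1 shrinkage
        R1 = K_M - K_M @ Z - E
        R2 = Z - J
        Y1 += mu * R1; Y2 += mu * R2                                # step (5): multiplier update
        mu = min(rho * mu, mu_max)                                  # step (6)
        if max(np.abs(R1).max(), np.abs(R2).max()) < eps:           # step (7): convergence check
            break
    return Z

# End-to-end use (steps 1-4 of Section 3.4):
# K_M, L = manifold_adaptive_kernel(X, p=5, gamma=1.0)
# Z = mklrr(K_M, lam=1.0)
# W = (np.abs(Z) + np.abs(Z.T)) / 2          # symmetric, nonnegative affinity matrix
# y_pred = ghf_predict(W, Y, labeled_idx)    # GHF label propagation based on (5)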
4. Experiments
This section evaluates the effectiveness of the proposed MKLRR algorithm on semisupervised classification tasks. Specifically, we compare the performance of MKLRR with some state-of-the-art graph construction methods on four representative image data sets. All experiments are conducted on a platform with an Intel(R) Core(TM) i7-4700MQ CPU @ 2.40 GHz, 16.0 GB RAM, Windows 8.1, and Matlab 2013a.
4.1. Experimental Settings
We compare MKLRR with several baseline methods, including some state-of-the-art graph-based semisupervised learning methods:
kNN: if one sample is among the k nearest neighbors of the other, then the two samples are viewed as connected. In kNN-1, k is set to 5, and in kNN-2, k is set to 8. The edge weights are computed with the heat kernel function, where the variance δ is the average of the squared Euclidean distances over all edges of the graph
l1-graph [8, 38]: the l1-norm regularized least squares problem is optimized using the l1-ls package [39]. The regularization parameter enforcing sparsity is searched from {10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}}
LNP (linear neighborhood propagation): we follow the pipeline in [40] to construct the graph. The neighborhood size in LNP is set to 40
SPG (sparse probability graph) [41]: we implement the SPG algorithm by setting n_knn to one-quarter of the size of the data set, and λ is set to 0.001 as suggested by [41]
LRR (low-rank representation) [15]: for all data sets, we tune the parameter in the range {10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}} to achieve the best performance
GLRR (graph regularized low-rank representation) [17]: in [17], the accelerated gradient method [42] was employed to optimize GLRR by updating J, the auxiliary variable with respect to Z. In our implementation, the GLRR objective function is relaxed as described in [10] and J is updated using the SVT operator [31]
NNLRS (nonnegative low-rank and sparse graph) [9]: we construct the LRR graph with nonnegative and sparse properties. The weighted parameters are set as guided in [9]
LRCB (low-rank representation with b-matching constraint) [20]: as suggested in [20], we set the parameters λ1 and λ2 as 2 and 0.03 for all the four data sets. The parameter b is set as 5, 5, 10, and 5 in the ORL, Extended Yale B, PIE, and USPS data sets, respectively.
KLRR (kernel low-rank representation) [11]
For both KLRR and MKLRR, we use the Gaussian kernel function defined as κ(x_i, x_j) = exp(-‖x_i - x_j‖^2 / (2σ^2)), and the bandwidth parameter σ is set to the mean of all pairwise distances. When constructing the weight matrix, the number of nearest neighbors p is set to 5. The regularization parameter λ in the conventional LRR model is searched over the candidate values {10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}}. Similar to the usage in [43], we fix the parameter γ=1 in all experiments below.
4.2. Experiment on Synthetic Data
Similar to the studies in [11, 15], a synthetic data set is constructed as follows. We construct 5 independent subspaces {S_i}_{i=1}^{5} ⊂ R^100 whose bases {U_i}_{i=1}^{5} are computed by U_{i+1} = T U_i, 1 ≤ i ≤ 4, where T is a random rotation and U_1 is a random orthogonal matrix of size 100 × 100. Therefore, each subspace has a dimension of 100. We sample 200 data vectors from each subspace by X_i = U_i Q_i, 1 ≤ i ≤ 5, with Q_i being a 100 × 200 i.i.d. N(0,1) matrix. We randomly choose 30% of the total samples to corrupt. For example, if a data vector x is chosen, its observed vector is computed by adding Gaussian noise with zero mean and variance 0.3‖x‖_2.
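For reference, the synthetic data described above might be generated along the following lines. This is our reading of the construction; in particular, the interpretation of the corruption variance is an assumption:

import numpy as np
from scipy.stats import ortho_group

def make_synthetic(n_sub=5, dim=100, per_sub=200, corrupt_ratio=0.3, seed=0):
    rng = np.random.default_rng(seed)
    T = ortho_group.rvs(dim, random_state=seed)        # random rotation T
    U = ortho_group.rvs(dim, random_state=seed + 1)    # random orthogonal matrix U_1
    X, y = [], []
    for i in range(n_sub):
        Q = rng.standard_normal((dim, per_sub))        # 100 x 200 i.i.d. N(0,1) coefficients
        X.append(U @ Q)                                # X_i = U_i Q_i
        y.extend([i] * per_sub)
        U = T @ U                                      # U_{i+1} = T U_i
    X = np.hstack(X)
    # Corrupt 30% of the samples with zero-mean Gaussian noise whose variance
    # is 0.3 * ||x||_2 (our interpretation of the description above).
    idx = rng.choice(X.shape[1], int(corrupt_ratio * X.shape[1]), replace=False)
    for j in idx:
        X[:, j] += rng.standard_normal(dim) * np.sqrt(0.3 * np.linalg.norm(X[:, j]))
    return X, np.array(y)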
We select different numbers of labeled samples to evaluate the performance of the different graph construction methods. Table 1 shows the classification accuracies of different graphs on the synthetic data set. The results are obtained from ten independent runs. From the results, we can see that all LRR variants achieve good performance even when given only a few labeled samples. KLRR is slightly better than MKLRR, by 0.26%, when there is only one labeled sample per class. MKLRR obtains the best results in all the remaining cases.
Classification accuracies (%) of different graphs on the synthetic data set.

# labeled   kNN-1        kNN-2        LNP          l1-graph     SPG
1           93.67±0.74   93.81±2.32   86.22±4.72   75.12±8.73   85.30±4.70
2           93.98±0.34   94.61±0.20   89.99±1.49   85.57±5.66   87.91±2.62
3           93.93±0.13   94.65±0.24   90.80±0.92   90.32±2.95   91.59±1.97
4           93.98±0.20   94.59±0.26   90.91±0.72   91.11±2.33   93.74±1.75
5           93.95±0.24   94.67±0.24   91.22±0.33   92.83±1.26   95.52±1.45

# labeled   LRR          NNLRS        GLRR         LRCB         KLRR         MKLRR
1           86.83±5.35   90.32±4.58   89.44±5.82   90.51±2.37   92.59±2.00   92.33±1.87
2           92.34±1.00   93.81±2.17   94.10±1.25   94.10±1.98   94.69±0.58   94.98±0.55
3           92.86±0.62   94.21±1.34   94.84±0.51   94.67±1.17   95.13±0.47   95.67±0.43
4           93.11±0.56   94.79±0.57   94.92±0.49   94.98±0.62   95.32±0.34   96.04±0.37
5           93.20±0.41   95.26±0.35   95.12±0.51   95.43±0.41   95.30±0.31   96.48±0.28
Since GLRR also incorporates the structure information of data into LRR, we show the block diagonal structures learned by LRR, GLRR, KLRR, and MKLRR in Figure 2. Although the visual discrepancies between MKLRR and its counterparts are minor, the block diagonal structure obtained by MKLRR is clearer than those of the others. Most of the values within each block of the MKLRR graph are obviously larger than those of the KLRR graph.
Block diagonal structures learned by LRR, GLRR, KLRR, and MKLRR, respectively.
4.3. Experiment on ORL Data Set
The ORL data set contains ten different images of each of 40 distinct subjects. The images were taken at different times, varying the lighting, facial expressions, and facial details. Each image is manually cropped and normalized to size of 32×32 pixels. Figure 3 shows some example images of two subjects from the ORL data set.
Example face images of two subjects from the ORL data set.
We repeat all the experiments ten times. Each time, we randomly select a subset of images from each subject to create a labeled sample set. In this experiment, 1, 2, 3, 4, and 5 images per subject are randomly selected as labeled samples and the remaining images are regarded as unlabeled samples. The random indices are kept the same for all compared algorithms. The classification accuracies of the different algorithms with different numbers of labeled samples on the ORL data set are shown in Table 2, in which MKLRR outperforms all the compared algorithms. For example, when we select 1, 2, 3, 4, and 5 images per person as labeled samples, the accuracies of MKLRR are higher than those of the second best algorithm by 1.88% (LRCB), 3.18% (KLRR), 2.18% (SPG), 0.85% (SPG), and 1.28% (SPG), respectively.
Classification accuracies (%) of different algorithms with different numbers of labeled samples on the ORL data set.

# labeled   kNN-1        kNN-2        LNP          l1-graph     SPG
1           64.11±2.74   56.69±3.76   71.50±2.25   65.58±2.36   74.03±3.14
2           72.53±2.23   65.47±2.28   82.44±2.20   77.47±2.03   85.41±1.25
3           76.75±2.52   69.93±2.65   87.79±2.53   84.71±2.67   89.39±2.55
4           79.96±2.61   73.42±1.49   91.17±2.32   88.87±2.55   93.04±1.63
5           82.20±1.87   75.70±1.70   93.70±1.93   92.10±2.25   94.85±1.33

# labeled   LRR          NNLRS        GLRR         LRCB         KLRR         MKLRR
1           67.28±3.23   71.58±3.64   67.33±2.95   75.98±2.39   74.67±2.27   77.86±2.19
2           81.34±2.23   83.63±2.56   82.56±2.10   83.11±1.83   85.97±1.68   89.15±1.81
3           86.71±2.25   88.46±2.39   88.29±2.19   87.91±1.42   88.25±1.25   91.57±1.31
4           90.58±1.52   91.50±1.80   91.88±0.97   91.07±1.35   91.92±2.21   93.89±1.19
5           92.60±1.65   93.20±1.78   94.35±2.15   94.26±1.12   94.10±1.56   96.13±0.88
4.4. Experiment on Extended Yale B Data Set
The Extended Yale B data set consists of 2414 human face images of 38 subjects. Each subject has about 64 images taken under different illuminations. Half of the images are corrupted by shadows or reflection. Each image is cropped to 32×32 pixels. Figure 4 shows some images of two subjects from the Extended Yale B data set.
Example face images of two subjects from the Extended Yale B data set.
We use the first 20 subjects of the Extended Yale B data set, 1262 images in total, to evaluate the different methods. In this experiment, 4, 8, 12, 16, and 20 images per subject are randomly selected as labeled samples and the remaining images are regarded as unlabeled samples. The random indices are kept the same for all compared algorithms. Table 3 shows the classification accuracies of the different algorithms with different numbers of labeled samples on the Extended Yale B data set. We can easily find that, with an increasing number of labeled samples, all algorithms obtain better classification results. Although the results are close to saturation, MKLRR still makes some improvements. For example, it reaches an accuracy of 98.09% when given 20 labeled images per person, which is 1.13% higher than that of KLRR. In particular, when given a small number of labeled samples, MKLRR shows great superiority over the remaining algorithms: there is about a 3% improvement over KLRR when each person has only 4 labeled samples. Since this data set contains noise, the performance of the basic kNN algorithm decreases greatly.
Classification accuracies (%) of different algorithms with different numbers of labeled samples on the Extended Yale B data set.

# labeled   kNN-1        kNN-2        LNP          l1-graph     SPG
4           58.15±2.41   44.77±2.24   78.10±2.09   82.75±1.26   82.20±1.17
8           66.04±2.14   52.67±2.16   86.74±1.48   90.05±1.60   89.80±1.42
12          70.10±2.45   57.92±2.44   89.68±1.38   92.47±1.05   92.46±0.97
16          72.62±1.88   60.62±2.62   91.09±1.34   93.62±0.93   93.98±0.92
20          74.48±1.84   63.93±2.62   92.30±1.09   94.74±0.79   95.20±0.86

# labeled   LRR          NNLRS        GLRR         LRCB         KLRR         MKLRR
4           77.86±1.89   92.11±1.02   82.24±1.38   91.03±1.07   91.12±0.57   94.08±1.12
8           87.28±1.27   94.12±0.79   89.85±1.33   91.72±0.83   94.44±1.00   96.12±0.86
12          90.97±1.08   94.92±0.85   92.53±1.06   92.43±0.79   95.69±0.74   96.84±0.77
16          92.90±1.15   95.49±0.75   93.90±0.98   95.11±0.81   96.39±0.68   97.47±0.59
20          94.35±1.24   95.97±0.71   95.58±1.01   95.87±0.75   96.96±0.63   98.09±0.64
4.5. Experiment on PIE Data Set
The CMU PIE data set contains 41,368 images of 68 subjects with different poses, illuminations, and expressions. We only use the images in five near-frontal poses (C05, C07, C09, C27, and C29) under different illuminations and expressions. The first 15 subjects are selected, giving 2550 face images in total. Each image is manually cropped and resized to 32×32 pixels. Figure 5 shows some images of two subjects from the PIE data set.
Example face images of two subjects from the PIE data set.
As with the Extended Yale B data set, we select 4, 8, 12, 16, and 20 images per subject as labeled samples and let the remaining images be unlabeled. The random indices are kept the same for all compared algorithms. Table 4 shows the classification accuracies of the different algorithms with different numbers of labeled samples on the PIE data set. It is obvious that MKLRR outperforms the other algorithms in all cases. In particular, MKLRR performs much better than the others when given 4 labeled samples per person.
Classification accuracies (%) of different algorithms with different numbers of labeled samples on the PIE data set.

# labeled   kNN-1        kNN-2        LNP          l1-graph     SPG
4           56.84±3.03   55.42±2.63   78.10±2.09   78.79±2.55   84.71±3.15
8           71.01±2.99   67.18±2.72   86.74±1.48   88.85±1.52   91.31±1.63
12          76.93±2.60   71.15±1.46   89.68±1.38   91.96±1.59   93.82±1.21
16          81.25±1.41   74.67±1.57   91.09±1.34   93.58±1.21   95.03±1.06
20          83.89±1.77   77.18±2.36   92.30±1.09   94.61±1.11   95.84±0.73

# labeled   LRR          NNLRS        GLRR         LRCB         KLRR         MKLRR
4           73.00±2.45   86.82±2.30   84.59±1.69   84.81±2.29   86.01±2.43   88.31±2.28
8           86.83±1.54   92.25±1.20   92.46±0.80   90.79±1.16   91.32±1.01   93.68±0.93
12          91.69±0.83   94.44±0.95   94.37±0.39   92.94±0.76   93.62±1.13   95.81±0.64
16          94.21±0.72   95.53±0.73   95.69±0.52   94.72±0.58   94.91±0.95   96.41±0.42
20          95.39±0.51   96.13±0.50   96.08±0.48   96.08±0.41   95.54±0.65   96.98±0.57
4.6. Experiment on USPS Data Set
The USPS digit database [44] consists of 9298 handwritten images of the 10 digits (0–9). The size of each image is 16 × 16 pixels. We select 200 samples from each class, so the resulting data set has 2000 images in total. Figure 6 shows some images of the 10 digits from the USPS data set.
Example digit images from the USPS data set.
In this experiment, we randomly select 10%, 20%, 30%, 40%, and 50% of the samples per digit as labeled samples and let the remaining images be unlabeled. The random indices are kept the same for all compared algorithms. Table 5 shows the classification accuracies of the different algorithms with different numbers of labeled samples on the USPS data set. All algorithms obtain excellent performance on this data set, including the simple kNN algorithm. The classification accuracies of MKLRR are higher than those of the other algorithms in most cases.
Classification accuracies (%) of different algorithms with different numbers of labeled samples on the USPS data set.

# labeled   kNN-1        kNN-2        LNP          l1-graph     SPG
20          94.33±0.35   93.16±0.33   94.31±0.61   92.95±0.79   92.89±0.45
40          94.64±0.35   93.69±0.43   95.26±0.54   94.54±0.69   94.76±0.60
60          94.70±0.45   93.74±0.40   95.64±0.44   95.37±0.51   95.54±0.60
80          94.92±0.45   93.99±0.43   96.02±0.43   95.90±0.53   96.18±0.72
100         94.98±0.52   93.99±0.39   96.10±0.43   96.35±0.63   96.43±0.68

# labeled   LRR          NNLRS        GLRR         LRCB         KLRR         MKLRR
20          86.96±0.78   92.09±0.67   90.49±0.77   90.23±0.81   90.50±0.90   93.61±0.58
40          88.64±0.55   93.58±0.45   91.53±0.97   92.38±0.57   93.17±0.83   95.60±0.49
60          88.96±0.72   94.04±0.57   91.84±0.83   93.67±0.61   94.12±0.60   96.13±0.51
80          90.08±0.41   94.34±0.56   92.34±0.77   94.51±0.59   94.90±0.59   96.51±0.57
100         90.42±0.83   94.67±0.63   92.49±0.89   94.77±0.43   95.19±0.68   96.87±0.37
4.7. Parameter Sensitivity Analysis
There are two important parameters in MKLRR: the regularization parameter γ used to construct the manifold adaptive kernel and λ, which controls the impact of the noise term. Obviously, MKLRR boils down to KLRR when γ is set to zero. In the previous experiments, we empirically fixed γ to 1, following similar ideas in [43, 45]. In this section, we analyze the sensitivity of γ and λ by varying one while fixing the other.
Figure 7 shows how the performance of MKLRR varies with γ on the Extended Yale B and PIE data sets, respectively, where we fix λ=1. Here, four images per person are labeled and the remaining ones are unlabeled. To facilitate comparison, we also include the results of KLRR in the figure. We can find that [10^{-2}, 10] is a reasonable interval for selecting γ.
The performance of MKLRR versus parameter γ by fixing λ=1 on the Extended Yale B and PIE data sets.
Figure 8 shows how the performance of MKLRR varies with λ on the Extended Yale B and PIE data sets, respectively, where we fix γ=1. There are also four labeled samples per person. Generally, MKLRR is insensitive to the variation of λ as long as λ is set to a moderately large value.
The performance of MKLRR versus parameter λ by fixing γ=1 on the Extended Yale B and PIE data sets.
For the remaining data sets, the parameter sensitivities of MKLRR with respect to γ and λ have similar tendencies as shown in Figures 7 and 8.
5. Conclusion and Future Work
In this paper, we have proposed a new low-rank representation model for semisupervised image classification, called manifold adaptive kernelized low-rank representation (MKLRR). Different from most existing LRR variants, which consider the structure information in the original data space, the proposed model explicitly takes into account the intrinsic manifold structure depicted by a nearest-neighbor graph. The graph Laplacian corresponding to the local geometry of the data is incorporated into the manifold adaptive kernel space, in which the low-rank representation model is then computed. Extensive experiments performed on both synthetic and benchmark data sets have shown the excellent performance of the MKLRR-based graph for semisupervised classification when given limited labeled samples.
A limitation of general two-stage graph-based semisupervised learning methods is that the information of the labeled samples is neglected in the graph construction stage. Therefore, it is necessary to take this point into consideration in order to construct a more discriminative graph. This will be our future work, and one possible approach is to introduce into the LRR model a constraint matrix that depicts the partial label information of the data.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was partially supported by National Natural Science Foundation of China (61602140, 61971193, 61633010, 61473110, and 61502129), Zhejiang Science and Technology Program (2017C33049, 2018C04012, and LQ16F020004), China Postdoctoral Science Foundation (2017M620470), Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, NJUPT (BDSIP201804), and Co-Innovation Center for Information Supply & Assurance Technology, Anhui University (ADXXBZ201704).
References
[1] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), Washington, DC, USA, August 2003, pp. 912–919.
[2] X. Zhu, "Semi-supervised learning literature survey," University of Wisconsin, Madison, WI, USA.
[3] D. Cai, X. He, and J. Han, "Semi-supervised discriminant analysis," in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), Rio de Janeiro, Brazil, October 2007, pp. 1–7.
[4] Z.-H. Zhou and M. Li, "Semi-supervised learning by disagreement," 2010, vol. 24, no. 3, pp. 415–439.
[5] Y. Wang and S. Chen, "Safety-aware semi-supervised classification," 2013, vol. 24, no. 11, pp. 1763–1772.
[6] Z. Jiang, S. Zhang, and J. Zeng, "A hybrid generative/discriminative method for semi-supervised classification," 2013, vol. 37, pp. 137–145.
[7] G. Forestier and C. Wemmert, "Semi-supervised learning using multiple clusterings with limited labeled data," 2016, vols. 361-362, pp. 48–65.
[8] S. Yan and H. Wang, "Semi-supervised learning by sparse representation," in Proceedings of the SIAM International Conference on Data Mining, 2009, pp. 792–801.
[9] L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, and N. Yu, "Non-negative low rank and sparse graph for semi-supervised learning," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), June 2012, pp. 2328–2335.
[10] Y. Zheng, X. Zhang, S. Yang, and L. Jiao, "Low-rank representation with local constraint for graph construction," 2013, vol. 122, pp. 398–405.
[11] S. Yang, Z. Feng, Y. Ren, H. Liu, and L. Jiao, "Semi-supervised classification via kernel low-rank representation graph," 2014, vol. 69, pp. 150–158.
[12] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," 2010, vol. 98, no. 6, pp. 1031–1044.
[13] Y. Peng and B.-L. Lu, "Robust structured sparse representation via half-quadratic optimization for face recognition," 2017, vol. 76, no. 6, pp. 8859–8880.
[14] Y. Peng and B.-L. Lu, "Discriminative extreme learning machine with supervised sparsity preserving for image classification," 2017, vol. 261, pp. 242–252.
[15] G. Liu, Z. Lin, and Y. Yu, "Robust subspace segmentation by low-rank representation," in Proceedings of the 27th International Conference on Machine Learning (ICML '10), June 2010, pp. 663–670.
[16] W. Jiang, J. Liu, H. Qi, and Q. Dai, "Robust subspace segmentation via nonconvex low rank representation," 2016, vols. 340-341, pp. 144–158.
[17] X. Lu, Y. Wang, and Y. Yuan, "Graph-regularized low-rank representation for destriping of hyperspectral images," 2013, vol. 51, no. 7, pp. 4009–4018.
[18] Y. Peng, B.-L. Lu, and S. Wang, "Enhanced low-rank representation via sparse manifold adaption for semi-supervised learning," 2015, vol. 65, pp. 1–17.
[19] Y. Peng, X. Long, and B.-L. Lu, "Graph based semi-supervised learning via structure preserving low-rank representation," 2015, vol. 41, no. 3, pp. 389–406.
[20] S. Li and Y. Fu, "Low-rank coding with b-matching constraint for semi-supervised classification," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI '13), Beijing, China, August 2013, pp. 1472–1478.
[21] H. Nguyen, W. Yang, F. Shen, and C. Sun, "Kernel low-rank representation for face recognition," 2015, vol. 155, pp. 32–42.
[22] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai, "Graph regularized sparse coding for image representation," 2011, vol. 20, no. 5, pp. 1327–1336.
[23] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," 2011, vol. 33, no. 8, pp. 1548–1560.
[24] D. Cai, X. He, and J. Han, "Locally consistent concept factorization for document clustering," 2011, vol. 23, no. 6, pp. 902–913.
[25] X. F. He, D. Cai, Y. L. Shao, H. Bao, and J. Han, "Laplacian regularized Gaussian mixture model for data clustering," 2011, vol. 23, no. 9, pp. 1406–1418.
[26] Y. Peng, S. Wang, X. Long, and B.-L. Lu, "Discriminative graph regularized extreme learning machine and its application to face recognition," 2015, pp. 340–353.
[27] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, pp. 1735–1742.
[28] V. Sindhwani, P. Niyogi, and M. Belkin, "Beyond the point cloud: from transductive to semi-supervised learning," in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), August 2005, pp. 825–832.
[29] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" 2011, vol. 58, no. 3.
[30] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," 2010, vol. 20, no. 4, pp. 1956–1982.
[31] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," https://arxiv.org/abs/1009.5055.
[32] Y. Peng, S. Wang, and B.-L. Lu, "Structure preserving low-rank representation for semi-supervised face recognition," in Proceedings of the International Conference on Neural Information Processing, 2013, pp. 148–155.
[33] B. Schölkopf and A. J. Smola, The MIT Press, 2002.
[34] M. Belkin, V. Sindhwani, and P. Niyogi, "Manifold regularization: a geometric framework for learning from labeled and unlabeled examples," 2006, vol. 7, pp. 2399–2434.
[35] P. Niyogi, S. Smale, and S. Weinberger, "Finding the homology of submanifolds with high confidence from random samples," 2008, vol. 39, no. 1–3, pp. 419–441.
[36] F. R. Chung, vol. 92, AMS Bookstore, 1997.
[37] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," 2013, vol. 35, no. 1, pp. 171–184.
[38] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang, "Learning with l1-graph for image analysis," 2010, vol. 19, no. 4, pp. 858–866.
[39] K. Koh, S. Kim, and S. Boyd, Stanford University.
[40] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," 2008, vol. 20, no. 1, pp. 55–67.
[41] R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong, "Nonnegative sparse coding for discriminative semi-supervised learning," in Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), June 2011, pp. 2849–2856.
[42] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), ACM, June 2009, pp. 457–464.
[43] D. Cai and X. He, "Manifold adaptive experimental design for text categorization," 2012, vol. 24, no. 4, pp. 707–719.
[44] J. J. Hull, "A database for handwritten text recognition research," 1994, vol. 16, no. 5, pp. 550–554.
[45] P. Li, J. Bu, L. Zhang, and C. Chen, "Graph-based local concept coordinate factorization," 2015, vol. 43, no. 1, pp. 103–126.