Sparse models have a wide range of applications in machine learning and computer vision. Using a learned dictionary instead of an "off-the-shelf" one can dramatically improve performance on a particular dataset. However, learning a new dictionary for each subdataset (subject) at fine granularity may be unwarranted or impractical, due to the restricted availability of subdataset samples and the tremendous number of subjects. To remedy this, we consider the dictionary customization problem: specializing an existing global dictionary, which corresponds to the total dataset, with the aid of auxiliary samples obtained from the target subdataset. Motivated by empirical observation and supported by theoretical analysis, we employ a regularizer penalizing the difference between the global and the customized dictionary. By minimizing the sum of the reconstruction error and the above regularizer under sparsity constraints, we exploit the characteristics of the target subdataset contained in the auxiliary samples while maintaining the basic sketches stored in the global dictionary. An efficient algorithm is presented and validated with experiments on real-world data.
1. Introduction
Sparse models are of great interest in machine learning and computer vision, owing to their applications in image denoising [1], face recognition [2–4], traffic sign recognition [5], visual-tactile fusion [6, 7], and so forth. In sparse coding, samples or signals are represented as sparse linear combinations of the column vectors (called atoms) of a redundant dictionary. This dictionary can be a predefined one, such as the DCT bases and wavelets [8], or a learned one based on a specific task or dataset of interest.
With sufficient samples, learning a specialized dictionary instead of using the “off-the-shelf” one has been shown to dramatically improve the performance. Generally, the dictionary and the coefficients are estimated by minimizing the sum of least squared errors under the sparsity constraint. Batch algorithms such as MOD [9] and K-SVD [10] and nonparametric Bayesian methods [11] have shown state-of-the-art performance. Further, Mairal et al. [12] developed an online approach to handle large amounts of samples.
Recently, theoretical analysis of sparse dictionary learning has attracted much attention. Schnass [13] presented theoretical results on the dictionary identification problem. Sample complexity has been estimated in [14, 15]. Gribonval et al. [16] analyzed the local minima of dictionary learning. Moreover, to extend its capacity, dictionary learning with specific motivations [17–19] has also attracted considerable interest. For instance, robust face recognition [3] is dedicated to a particular application, and Hawe et al. [20] require the dictionary to have a separable structure. While a learned dictionary is highly effective on a given dataset, attaining further specialized dictionaries for subdatasets at fine granularity is an interesting and useful concept as well. For instance, given a dictionary corresponding to facial images of all humans, we want to obtain a customized dictionary for each particular individual. However, in this case, standard dictionary learning approaches may be unwarranted or impractical: on one hand, samples for a particular individual (subject) are restricted and insufficient in most cases; on the other hand, even with enough data, learning so many dictionaries becomes inefficient in computation and storage. Further examples include customizing handwriting models to different styles, flower-image dictionaries to various species, or paper corpora to specific proceedings.
For classification tasks, approaches such as Yang et al. [18] and Ma et al. [2] learn a structured dictionary consisting of N subdictionaries, one for each of N different subjects. However, these approaches are often infeasible: first, as a part of the global dictionary, a subdictionary always codes worse than the global one. Second, the subdictionaries for the N subjects must be learned together, which becomes inflexible and demanding for huge N. Third, once the global dictionary is obtained, specialization for a new (N+1)th subdataset is impossible.
In this paper, we look for an effective, economic, and flexible dictionary customization approach, which should have the following characteristics:
We specialize an existing global dictionary by utilizing auxiliary samples obtained from the target subdataset, valid for finer granularity and a small quantity of examples (hence fewer computations).
Compared with the global one, the customized dictionary has the same size but smaller reconstruction errors and better representation of the target subdataset.
The customization for each subdataset is independent; thus we can customize an arbitrary number of subdatasets or attain a particular one alone.
As depicted in Figure 1, we first observed that the corresponding dictionary atoms of the global dictionary and those of particular subjects often look "similar." This is reasonable, as the dictionary atoms describe the sketches of the object and the basic shapes of all the subjects are consistent. For a more rigorous theoretical analysis, we further considered dictionary identifiability [13] for mixed bounded signal models, that is, signals generated from more than one source (reference dictionary). We proved that if the reference dictionaries are close in the sense of the Frobenius norm, the global dictionary learned from the mixed signals is close to each of them. In fact, the global dictionary captures the common basic shapes of all the subdatasets, regarding the characteristics of individual subjects as noise and discarding them.
Three sorted dictionaries corresponding to two individual ones and the global one are demonstrated as images. Each one has a size of 64 × 256. The dictionaries for individuals 1 and 2 are trained from 40,000 patches picked from 24 of their corresponding facial images. The global one is trained from 80,000 patches sampled from 200 facial images belonging to 50 different individuals.
To formulate the dictionary customization problem, we introduce a regularizer penalizing the difference between the global and the customized dictionary. By minimizing the sum of the reconstruction error and the above regularizer under sparsity constraints, we exploit the characteristics of the target subdataset contained in the auxiliary samples while maintaining the basic shapes stored in the global dictionary. As a result, a better dictionary, close to the global one, is obtained. The solution is an asymptotically unbiased estimate of the underlying dictionary and can be seen as a trade-off between learning a new dictionary from data and using an existing one.
To minimize the objective function, we considered a general strategy, the same as in dictionary learning, that is, coding the samples and updating the atoms alternately in each iteration. Further, we present an algorithm that shares the idea of K-SVD [10], which we call C-Ksvd. The flow chart of our methods is demonstrated in Figure 2. Experiments on tasks such as denoising and superresolution illustrate that our approach handles the customization problem effectively and efficiently, outperforming both the global dictionary and the normal dictionary learning approach. In addition, our model is also promising for further tasks such as enhancing an insufficiently learned dictionary.
Flow chart of our methods.
2. Notations
Throughout this paper, we write matrices as uppercase letters and vectors as lowercase letters. Given $p>0$, the $l_p$-norm of a vector $v\in\mathbb{R}^n$ is defined as $\|v\|_p \triangleq (\sum_{i=1}^n |v_i|^p)^{1/p}$. In particular, the $l_0$-norm $\|v\|_0$ counts the nonzero entries of $v$. Let $\mathrm{sign}(v)$ denote the vector whose $j$th entry $\mathrm{sign}(v_j)$ is equal to zero if $v_j=0$ and to one (resp., minus one) if $v_j>0$ (resp., $v_j<0$).
The Frobenius norm of a matrix $M$ is denoted $\|M\|_F \triangleq (\sum_{i=1}^m\sum_{j=1}^n m_{ij}^2)^{1/2}$ and the matrix $l_1$-norm $\|M\|_1 \triangleq \sum_{i=1}^m\sum_{j=1}^n |m_{ij}|$. Define the operator norm (1) $\|M\|_{p,q} \triangleq \sup_{\|x\|_p\le 1}\|Mx\|_q$, and let $m_i$ denote the $i$th column vector of $M$.
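As a quick illustration, the norms above take only a few lines to compute (a minimal NumPy sketch; `op_norm_22` uses the fact that $\|M\|_{2,2}$ equals the largest singular value of $M$):

```python
import numpy as np

def lp_norm(v, p):
    # ||v||_p = (sum_i |v_i|^p)^(1/p)
    return float(np.sum(np.abs(v) ** p) ** (1.0 / p))

def l0_norm(v):
    # ||v||_0 counts the nonzero entries of v
    return int(np.count_nonzero(v))

def fro_norm(M):
    # ||M||_F = (sum_ij m_ij^2)^(1/2)
    return float(np.sqrt(np.sum(np.asarray(M, dtype=float) ** 2)))

def matrix_l1_norm(M):
    # ||M||_1 = sum_ij |m_ij|
    return float(np.sum(np.abs(M)))

def op_norm_22(M):
    # ||M||_{2,2} = sup_{||x||_2 <= 1} ||Mx||_2, i.e., the top singular value
    return float(np.linalg.svd(M, compute_uv=False)[0])
```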
3. Dictionary Learning with Mixed Signals
Dictionary identifiability [13], that is, recovering a reference dictionary that is assumed to generate the observed signals, is important for the interpretation of the learned atoms. In particular, Gribonval et al. [16] proved that the loss function of dictionary learning admits a local minimum in the neighborhood of the dictionary generating the signals.
In this section, we consider that there are multiple reference dictionaries and that the signals generated from them are mixed. Further, we prove that if reference dictionaries are close to each other in the sense of the Frobenius norm, dictionary learning with mixed signals admits a local minimum near both reference dictionaries simultaneously.
Without loss of generality, we analyze the case of two signal sources $S^1$, $S^2$. In particular, for the signal source $S^i$ ($i=1$ or $2$), assume its signals $x^i\in\mathbb{R}^m$ are generated by the model (2) $S^i\colon x^i = D_i\alpha^i + \varepsilon^i$, where $D_i\in\mathbb{R}^{m\times p}$ is the reference dictionary of $S^i$, $\alpha^i\in\mathbb{R}^p$ is the coefficient vector, and $\varepsilon^i\in\mathbb{R}^m$ is the noise.
Particularly, the coefficient $\alpha^i$ is drawn on an index set $J\subset\{1,2,\ldots,p\}$ such that $\alpha^i_{J^c}$ is a zero vector and $\alpha^i_J$ is a random vector. Assume $\alpha^i_J$ and $\varepsilon^i$ satisfy the following assumptions, similar to [15], where we denote $\xi^i=\mathrm{sign}(\alpha^i)$.
Assumption 1 (basic and bounded signal assumption).
There exist random variables $\alpha$, $\varepsilon$ and values $\underline{\alpha}$, $M_\alpha$, and $M_\varepsilon$, such that (3) $\mathbb{E}[\alpha^i_J\alpha^{iT}_J\mid J] = \mathbb{E}[\alpha^2]\cdot I$, $\mathbb{E}[\xi^i_J\xi^{iT}_J\mid J] = I$, $\mathbb{E}[\alpha^i_J\xi^{iT}_J\mid J] = \mathbb{E}|\alpha|\cdot I$, $\mathbb{E}[\varepsilon^i\varepsilon^{iT}\mid J] = \mathbb{E}[\varepsilon^2]\cdot I$, $\mathbb{E}[\varepsilon^i\alpha^{iT}_J\mid J] = \mathbb{E}[\varepsilon^i\xi^{iT}_J\mid J] = 0$, and $\mathbb{P}(\min_{j\in J}|\alpha^i_j| < \underline{\alpha}\mid J) = 0$, $\mathbb{P}(\|\alpha^i\|_2 > M_\alpha) = 0$, $\mathbb{P}(\|\varepsilon^i\|_2 > M_\varepsilon) = 0$, for $i=1,2$.
Remark 2.
Almost all sparse signal models, such as $k$-sparse Gaussians and Laplacians, satisfy the first five formulas, which can be seen as an abstraction and generalization of the basic sparse signal model.
Further, the additional assumptions that the signal is upper-bounded and lower-bounded are standard and mainly used to make the analysis simple and clear [15]. In practice, as digital data is gathered with sensors with limited dynamics and stored in float format with limited precision, the boundedness assumption seems to be reasonably relevant.
The index set $J$ is called the support of $\alpha^i$, and the sparsity $s$ is defined as the number of elements in $J$. Thus the signal model is parameterized by the sparsity $s$, the expected coefficient energy $\mathbb{E}[\alpha^2]$, the minimum coefficient magnitude $\underline{\alpha}$, the maximum norm $M_\alpha$, and the flatness $\kappa_\alpha \triangleq \mathbb{E}|\alpha|/\sqrt{\mathbb{E}[\alpha^2]}$.
Note that these assumptions can easily be generalized to the multiple-source case; thus we have the following definition.
Definition 3 (mixed bounded signal source).
A mixed signal source $S^O$ is defined as the union of several signal sources $S^1, S^2, S^3,\ldots$; that is, (4) $S^O = S^1\cup S^2\cup S^3\cup\cdots$, where each source generates signals as described in (2). Further, if $S^1, S^2, S^3,\ldots$ satisfy the basic and bounded signal assumptions (3) simultaneously, we say that $S^O$ is a mixed bounded signal source, or satisfies a mixed bounded signal model.
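To make the definition concrete, the following sketch draws samples from a mixed bounded source with two close reference dictionaries (all sizes, the perturbation level, and the coefficient distribution here are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, s, n = 16, 32, 3, 200          # illustrative dimensions, sparsity, sample count

def unit_norm_dictionary(m, p, rng):
    D = rng.standard_normal((m, p))
    return D / np.linalg.norm(D, axis=0)   # atoms with unit l2 norm

# two reference dictionaries that are close in Frobenius distance
D1 = unit_norm_dictionary(m, p, rng)
D2 = D1 + 0.01 * rng.standard_normal((m, p))
D2 /= np.linalg.norm(D2, axis=0)
zeta = np.linalg.norm(D1 - D2)           # ||D1 - D2||_F stays small

def draw_signal(D, rng, noise=0.01):
    # model (2): x = D alpha + eps, with a random support J of size s and
    # bounded nonzero coefficients (random signs times Uniform[0.5, 1])
    J = rng.choice(D.shape[1], size=s, replace=False)
    alpha = np.zeros(D.shape[1])
    alpha[J] = rng.choice([-1.0, 1.0], size=s) * rng.uniform(0.5, 1.0, size=s)
    return D @ alpha + noise * rng.standard_normal(D.shape[0])

# the mixed source S^O = S^1 ∪ S^2: half the samples from each source
X = np.column_stack([draw_signal(D1 if i % 2 == 0 else D2, rng) for i in range(n)])
```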
Further, for the two-signal-source case, assume $D_1$ and $D_2$ are close in the sense of the Frobenius distance; that is, there is a small $\zeta\in\mathbb{R}$ such that $\|D_1-D_2\|_F\le\zeta$. (As discussed in [15], a dictionary is invariant under sign flips and permutations of the atoms, and we simply assume the atoms have been aligned to attain the minimum distance.) Denoting by $d_j$ the $j$th column of $D$, the cumulative coherence of a dictionary $D$ is defined as (5) $\mu_k(D) \triangleq \sup_{|J|\le k}\sup_{j\notin J}\|D_J^T d_j\|_1$.
The term μk(D) gives a measure of the level of correlation between columns of D. Moreover, the lower restricted isometry constant of a dictionary D, δ_(D), is the smallest number, for any α∈Rm, satisfying (6)1-δ_Dα22≤Dα22.
Recall that, for a set $X = [x^1,\ldots,x^n]\in\mathbb{R}^{m\times n}$, the loss function of dictionary learning is (7) $F_X(D) \triangleq \frac{1}{n}\sum_{i=1}^n \inf_{\alpha^i\in\mathbb{R}^p}\frac{1}{2}\|x^i - D\alpha^i\|_2^2 + g(\alpha^i)$, where $g(\alpha)$ is a penalty function promoting sparsity. Now consider a set of mixed signals $Y = [X^1, X^2] = [x^1_1,\ldots,x^1_n, x^2_{n+1},\ldots,x^2_{2n}]$, where $X^1\subset S^1$ and $X^2\subset S^2$; dictionary learning can then be formulated as (8) $\min_{D\in\mathcal{D}} F_Y(D) = \frac{1}{2}F_{X^1}(D) + \frac{1}{2}F_{X^2}(D)$, where $\mathcal{D}$ denotes the set of dictionaries with unit-$l_2$-norm atoms. Further, we have the following asymptotic result.
Theorem 4.
Let $S^O$ be a mixed bounded signal source as described above, consisting of two signal sources $S^1$ and $S^2$. Without loss of generality, let $\|D_1\|_{2,2}^2 \ge \|D_2\|_{2,2}^2$, and assume the cumulative coherence $\mu_s(D_i)$ and the sparsity level $s$ satisfy (9) $\mu_s(D_i)\le \frac{1}{4}$, $s \le \frac{p}{16(\|D_i\|_{2,2}+1)^2}$, $i=1,2$.
Further, we define (10) $C_{\min}^i \triangleq 24\kappa_\alpha^2\frac{s}{p}(\|D_i\|_{2,2}+1)\|D_i^TD_i - I\|_F$, $C_{\min}\triangleq\max(C_{\min}^1, C_{\min}^2)$, $C_{\max}^i \triangleq \frac{2}{7}\cdot\frac{\mathbb{E}|\alpha|}{M_\alpha}(1-2\mu_s(D_i))$, $C_{\max}\triangleq\min(C_{\max}^1, C_{\max}^2)$, and assume $C_{\min} < C_{\max}$. Define $t \triangleq \mathbb{E}\|\varepsilon\|_2^2/\mathbb{E}\|\alpha_0\|_2^2$. Moreover, let $g(\alpha) = \lambda\|\alpha\|_1$ with a regularization parameter $\lambda \le \underline{\alpha}/4$, and denote $\bar\lambda \triangleq \lambda/\mathbb{E}|\alpha|$, $\pi = \pi(D_1-D_2)$, $\underline{\delta} = \max(\underline{\delta}(D_1), \underline{\delta}(D_2))$. Then there exists a radius $r$ satisfying $C_{\min}\bar\lambda + \zeta < r < C_{\max}\bar\lambda$, $M_\varepsilon/M_\alpha < \frac{7}{2}(C_{\max}\bar\lambda - r)$, and (11) $1+t < \frac{1-\underline{\delta}}{8\pi s\|D_2\|_{2,2}^2}\left(\frac{r}{\zeta}-1\right)\left(r-\zeta-C_{\min}\bar\lambda\right)$, such that the expectation of the function $F_Y(D)$ admits a local minimum $\hat D$ with $\|\hat D - D_1\|_F < r$ and $\|\hat D - D_2\|_F < r$.
Let us consider in more detail the assumptions in Theorem 4.
$\mu_s(D_i)\le 1/4$ and $s \le p/(16(\|D_i\|_{2,2}+1)^2)$ assume upper bounds on the correlation level between columns of $D_i$ and on the sparsity $s$. This is common in the analysis of sparse learning [21].
The condition $C_{\min} < C_{\max}$ is satisfied for small $s/p$; the smaller $s/p$ is, the larger $C_{\max}-C_{\min}$ becomes.
$\lambda\le\underline{\alpha}/4$ imposes an upper limit on admissible regularization parameters. Note that limits on regularization parameters are also frequent [22].
$M_\varepsilon/M_\alpha < \frac{7}{2}(C_{\max}\bar\lambda - r)$ restricts the noise level. In particular, the noiseless situation, that is, $M_\varepsilon=0$, is a special case. Besides, $t$ is typically very small; for example, if the noise level is 30 dB, then $1+t = 1.001$.
Consider $C_{\min}\bar\lambda + \zeta < r < C_{\max}\bar\lambda$ and (12) $1+t < \frac{1-\underline{\delta}}{8\pi s\|D_2\|_{2,2}^2}\left(\frac{r}{\zeta}-1\right)\left(r-\zeta-C_{\min}\bar\lambda\right)$; we can rewrite them as (13) $\zeta \le (C_{\max}-C_{\min})\bar\lambda$ and $\zeta \lesssim \frac{1-\underline{\delta}}{8\pi s\|D_2\|_{2,2}^2}$, so the conditions are satisfied for small $\zeta$, in line with $D_1$ and $D_2$ being close.
To conclude, the assumptions hold for small cumulative coherence $\mu_s(D_i)$, sparsity $s$, noise level $M_\varepsilon$, dictionary distance $\zeta$, and regularization parameter $\lambda$.
Remark 5.
For the radius $r$, its lower bound is determined by $C_{\min}$, $\bar\lambda$, and $\zeta$. With $\zeta$ fixed, if the sparsity $s$ is particularly small, $C_{\min}\bar\lambda$ is very small as well and the lower bound of $r$ is close to $\zeta$. With $s$ fixed and $\zeta$ tending to zero, that is, the mixed signal model degenerating into the single-source case, condition (14) $1+t < \frac{1-\underline{\delta}}{8\pi s\|D_2\|_{2,2}^2}\left(\frac{r}{\zeta}-1\right)\left(r-\zeta-C_{\min}\bar\lambda\right)$ always holds, and Theorem 4 degenerates into the case in [15], implying that the discussion in [15] can be seen as a special case of ours.
Moreover, the upper bound of r is implied to be less than 0.15, which can be concluded by a discussion similar to [15].
Remark 6.
Theorem 4 can easily be generalized to the $n$-source case ($n>2$) by considering the loss (15) $\min_{D\in\mathcal{D}} nF_Y(D) = F_{X^1}(D) + F_{X^2}(D) + F_{X^3}(D) + \cdots + F_{X^n}(D)$, and the proof is similar.
Proof.
Define the closed ball around a dictionary $D$ with radius $r$ as (16) $B(D,r) = \{D'\in\mathcal{D} : \|D'-D\|_F \le r\}$.
Now consider $D_1$ and $D_2$. As $\|D_1-D_2\|_F\le\zeta$ and $\zeta < \zeta + C_{\min}\bar\lambda < r$, the two balls $B(D_1,r)$ and $B(D_2,r)$ have an intersection $U = B(D_1,r)\cap B(D_2,r)$ containing both $D_1$ and $D_2$. Denote by $T$ the boundary of $U$.
Further, for a set of samples $X$ and two dictionaries $D, D'$, define (17) $f_X(D) \triangleq \mathbb{E}F_X(D)$, $\Delta f_X(D,D') \triangleq f_X(D) - f_X(D')$, $\Delta f_X(T,D') \triangleq \inf_{D\in T}\Delta f_X(D,D')$.
Note that $2f_Y(D) = f_{X^1}(D) + f_{X^2}(D)$; then we have (18) $2\Delta f_Y(T,D_1) = \Delta f_{X^1}(T,D_1) + \Delta f_{X^2}(T,D_1) = \Delta f_{X^1}(T,D_1) + \Delta f_{X^2}(T,D_2) - \Delta f_{X^2}(D_1,D_2)$.
When $g(\alpha) = \lambda\|\alpha\|_1$, the function $F_X(D)$ is Lipschitz continuous with respect to the Frobenius metric on the compact constraint set $\mathcal{D}\subset\mathbb{R}^{m\times p}$ [16]. Thus, by choosing a radius $r$ such that $\Delta f_Y(T,D_1) > 0$, the compactness of the closed set $U$ implies the existence of a local minimum $\hat D$ of $F_Y(D)$ such that $\|\hat D - D_1\|_F < r$ and $\|\hat D - D_2\|_F < r$. Now let us bound each term of (18).
First note that, assuming $\mu_s(D_1)\le 1/4$, $s\le p/(16(\|D_1\|_{2,2}+1)^2)$, $\lambda\le\underline{\alpha}/4$, and $M_\varepsilon/M_\alpha < \frac{7}{2}(C_{\max}\bar\lambda - r)$, then, by the proof of Theorem 1 in [15], for any radius $d\in(C_{\min}\bar\lambda, C_{\max}\bar\lambda)$ and any dictionary $D$ with $\|D-D_1\|_F = d$, we have (19) $\Delta f_{X^1}(D,D_1) \ge \frac{\mathbb{E}[\alpha^2]}{8}\cdot\frac{s}{p}\cdot d\,(d - C_{\min}^1\bar\lambda) > 0$.
Further, $\frac{\mathbb{E}[\alpha^2]}{8}\cdot\frac{s}{p}\cdot d\,(d - C_{\min}^1\bar\lambda)$ is monotonically increasing for $d > C_{\min}\bar\lambda$, and for $D\in T$ we have $\|D - D_1\|_F \ge r-\zeta > C_{\min}\bar\lambda$. Thus (20) $\Delta f_{X^1}(T,D_1) \ge \frac{\mathbb{E}[\alpha^2]}{8}\cdot\frac{s}{p}\cdot(r-\zeta)(r-\zeta-C_{\min}\bar\lambda)$. For the second term, $\Delta f_{X^2}(T,D_2)$, we obtain the same lower bound similarly.
Moreover, for the dictionary $D_2$ and any coefficient $\alpha$ with sparsity $s$, we have (21) $\frac{1-\underline{\delta}}{s}\|\alpha\|_1^2 \le (1-\underline{\delta})\|\alpha\|_2^2 \le \|D_2\alpha\|_2^2$.
Then, by Theorem 2 and Lemma 6 in [16], when $g(\alpha)$ is the $l_1$ norm, we have (22) $|F_{X^2}(D_1) - F_{X^2}(D_2)| \le L_{X^2}\|D_1-D_2\|_{1,2}$, where $L_{X^2} = \frac{2s}{n(1-\underline{\delta})}\|X^2\|_F^2$. Thus (23) $|F_{X^2}(D_1) - F_{X^2}(D_2)| \le \frac{2s}{n(1-\underline{\delta})}\cdot\frac{\pi}{p}\cdot\|X^2\|_F^2\,\|D_1-D_2\|_F$.
Assume $x$ is a sample in $X^2$ with sparse coefficient $\alpha_0$ and noise $\varepsilon$. As $x = D_2\alpha_0 + \varepsilon$, we have (24) $\|x\|_2^2 = \|D_2\alpha_0\|_2^2 + \|\varepsilon\|_2^2 + 2(D_2\alpha_0)^T\varepsilon$; taking expectations on each side, by the assumptions in (3), as $\mathbb{E}\|\alpha_0\|_2^2 = s\mathbb{E}[\alpha^2]$ and $\mathbb{E}[(D_2\alpha_0)^T\varepsilon] = 0$, then (25) $\mathbb{E}\|x\|_2^2 \le \|D_2\|_{2,2}^2\,\mathbb{E}\|\alpha_0\|_2^2\,(1+t) \le s\|D_2\|_{2,2}^2\,\mathbb{E}[\alpha^2](1+t)$; thus $\mathbb{E}\|X^2\|_F^2 \le ns\|D_2\|_{2,2}^2\,\mathbb{E}[\alpha^2](1+t)$. Taking expectations on each side of (23), we have (26) $\Delta f_{X^2}(D_1,D_2) < \frac{2\pi s^2(1+t)}{p(1-\underline{\delta})}\cdot\mathbb{E}[\alpha^2]\,\|D_2\|_{2,2}^2\cdot\zeta$.
By (18), (20), and (26), as long as (27) $1+t < \frac{1-\underline{\delta}}{8\pi s\|D_2\|_{2,2}^2}\left(\frac{r}{\zeta}-1\right)\left(r-\zeta-C_{\min}\bar\lambda\right)$, we have (28) $2\Delta f_Y(T,D_1) \ge 2\cdot\frac{s\,\mathbb{E}[\alpha^2]}{8p}(r-\zeta)(r-\zeta-C_{\min}\bar\lambda) - \frac{2\pi s^2(1+t)}{p(1-\underline{\delta})}\,\mathbb{E}[\alpha^2]\,\|D_2\|_{2,2}^2\,\zeta > 0$, which means that $\mathbb{E}F_Y(D)$ admits a local minimum $\hat D$ in $U$; that is, $\|\hat D - D_1\|_F < r$, $\|\hat D - D_2\|_F < r$.
The result is reasonable: when the reference dictionaries are similar, the dictionary learned from the mixed signals should be similar to each of them, in order to achieve smaller reconstruction error on each subdataset and hence a lower total loss.
4. The Regularizer and Dictionary Customization Problem
Now we turn back to the dictionary customization problem. In particular, the dataset $S^O$ consists of several separable subdatasets $S^A, S^B, S^C,\ldots$; that is, $S^O = S^A\cup S^B\cup S^C\cup\cdots$. Further, $D_0\in\mathbb{R}^{m\times p}$ ($p\gg m$) is an existing global dictionary corresponding to $S^O$. This is a common setting, as a dictionary for facial images would typically be well trained, and the corresponding dataset can be divided by individual. We would like to customize $D_0$ with some auxiliary samples $X = \{x^i \mid x^i\in\mathbb{R}^m\}_{i=1}^n \subset S^A$, requiring that the customized dictionary $D$ have the same size but behave better on $S^A$.
Obviously, $D$ should yield sparse representations and small reconstruction errors on $X$, which corresponds to minimizing $\sum_{i=1}^n\|x^i - Dw^i\|_2^2$ under the sparsity constraint $\|w^i\|_0\le s$. Further, note that $S^A, S^B, S^C,\ldots$ can be regarded as several signal sources and, hence, $S^O$ as a mixed bounded model. Moreover, on account of the fine granularity, the differences between the subdatasets $S^A, S^B, S^C,\ldots$ are small and their basic sketches are consistent, implying that the underlying dictionaries for all subdatasets are similar. Thus, by Theorem 4, $D_0$ should be close to our customized dictionary $D$ as well, which is also in accordance with the practical observation. Considering the distance induced by the Frobenius norm, this leads directly to the regularizer $\|D-D_0\|_F$.
Denote $E = D - D_0$; then the customization model can be formulated as a sum of the reconstruction errors and the above regularizer; that is, (29) $\arg\min_{E,W}\|X - (D_0+E)W\|_F^2 + \gamma\|E\|_F^2$, s.t. $\forall i$, $\|w^i\|_0\le s$, where $w^i$ is the $i$th column vector of $W$, $s\ll m$ is the sparsity level, and $\gamma\ge 0$ is a parameter balancing the prior knowledge in $D_0$ against the information in $X$.
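For reference, the objective of (29) is straightforward to evaluate (a minimal sketch with our own function name; the sparsity constraint on $W$ is assumed to be enforced by whatever coder produced it):

```python
import numpy as np

def customization_objective(X, D0, E, W, gamma):
    # ||X - (D0 + E) W||_F^2 + gamma * ||E||_F^2, the loss of problem (29)
    residual = X - (D0 + E) @ W
    return float(np.sum(residual ** 2) + gamma * np.sum(E ** 2))
```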
It is worth noting that problem (29) is connected with the matrix version of the total least squares (TLS) problem [23], which generalizes least squares by assuming noise in both the dependent and independent variables. This is interpretable: as mentioned above, the atoms of the global dictionary only capture the main sketches; they regard the characteristics belonging to different subdatasets as noise and discard them. As a result, when considering a particular subdataset, the characteristic information is absent and thus the corresponding atoms of $D_0$ can be seen as noisy. Different from TLS, the tuning parameter $\gamma$ is necessary, as the noise in $D_0$ and the noise in $X$ differ and should be balanced. We further characterize model (29) with the following properties.
Theorem 7.
Consider customization problem (29), where $X = \{x^i\}_{i=1}^n$ is the auxiliary data, $D_0$ is the global dictionary, $D^*$ is the true dictionary corresponding to the target subdataset, and $\bar D$ is the customized one attained from (29); then:
denote $\mathcal{W} = \{w\in\mathbb{R}^p \mid \|w\|_0\le s\}$; for any $\gamma\ge 0$, (30) $\inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - \bar Dw^i\|_2^2 \le \inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - D_0w^i\|_2^2$;
for fixed $\gamma$, as $n$ tends to infinity, $\bar D$ converges to $D^*$; in other words, the minimizer of (29) is an asymptotically unbiased estimator of $D^*$;
the tuning parameter $\gamma$ reflects the confidence in $D_0$; in particular, if $\gamma\to\infty$, then $\bar D = D_0$; if $\gamma = 0$, then (29) degrades into a common dictionary learning problem.
Proof.
For 1, as $\bar D$ is the optimal solution of problem (29), for any $\gamma\ge 0$ and $D\in\mathbb{R}^{m\times p}$ we have (31) $\inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - \bar Dw^i\|_2^2 + \gamma\|\bar D - D_0\|_F^2 \le \inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - Dw^i\|_2^2 + \gamma\|D - D_0\|_F^2$. Let $D = D_0$; then we have (32) $\inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - \bar Dw^i\|_2^2 \le \inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - \bar Dw^i\|_2^2 + \gamma\|\bar D - D_0\|_F^2 \le \inf_{w^i\in\mathcal{W}}\sum_{i=1}^n\|x^i - D_0w^i\|_2^2$, and equality holds only when $\bar D = D_0$.
For 2, reshape the loss function as (33) $\arg\min_{E,w^i}\frac{1}{n}\sum_{i=1}^n\|x^i - (D_0+E)w^i\|_2^2 + \frac{\gamma}{n}\|E\|_F^2$. When $n$ tends to infinity, the penalty tends to zero and the loss function degenerates into the common dictionary learning form.
For 3, it is easy to see that γ reflects the weight of the penalty in the loss function and the conclusion is reasonable.
According to the third property of Theorem 7, customization can be seen as a trade-off between learning a dictionary and using an existing one, which fills the void between them and implies a more flexible dictionary selection strategy: for datasets with coarse granularity, learn a dictionary from large amounts of samples; for subjects with fine granularity, customize the existing dictionary with some auxiliary samples; and use a predefined dictionary if no sample is available.
We also emphasize that our model (29) is valid as long as the assumption is satisfied (i.e., $\|D_0-D^*\|_F\le\zeta$). As demonstrated in the experiments, there are more applications, such as improving an insufficiently learned dictionary or correcting a contaminated one. In addition, other matrix norms can be selected for the regularizer as well. For example, with the distance induced by the matrix $l_1$-norm, $E$ will be sparse and the storage and transmission costs will be greatly reduced.
5. Optimization
In this section, we first introduce a general optimization strategy and then devise a more straightforward dictionary updating strategy similar to K-SVD [10].
5.1. A General Strategy
A general optimization strategy, not necessarily leading to a global optimum, can be found by splitting the problem into two parts which are alternately solved within an iterative loop. The two parts are as follows.
5.1.1. Sparse Coding
Keeping $E$ fixed, find $W$ by (34) $\min_W \|X - (D_0+E)W\|_F^2$, s.t. $\forall i$, $\|w^i\|_0\le s$. This can be solved by pursuit algorithms such as OMP [24] and FOCUSS [25], or relaxed to the Lasso [26].
5.1.2. Dictionary Updating
Keeping $W$ fixed, find $E$ by (35) $\min_E \|X - (D_0+E)W\|_F^2 + \gamma\|E\|_F^2$. This is a quadratic programming problem with the closed-form solution $E = (X - D_0W)W^T(\gamma I + WW^T)^{-1}$.
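Putting the two steps together gives the following sketch of the general strategy. The OMP routine here is a minimal greedy implementation written only for self-containedness, not the tuned solver used in the experiments, and all parameter defaults are illustrative:

```python
import numpy as np

def omp(D, x, s):
    # greedy orthogonal matching pursuit with a fixed sparsity s
    residual, support, coef = x.copy(), [], np.zeros(0)
    for _ in range(s):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    w = np.zeros(D.shape[1])
    w[support] = coef
    return w

def customize(X, D0, s, gamma, iters=5):
    # alternate the sparse coding step (34) with the closed-form
    # correction update (35): E = (X - D0 W) W^T (gamma I + W W^T)^{-1}
    p = D0.shape[1]
    E = np.zeros_like(D0)
    for _ in range(iters):
        D = D0 + E
        W = np.column_stack([omp(D, X[:, i], s) for i in range(X.shape[1])])
        E = (X - D0 @ W) @ W.T @ np.linalg.inv(gamma * np.eye(p) + W @ W.T)
    return D0 + E, W
```

As $\gamma\to\infty$ the correction shrinks to zero and the output stays at $D_0$, matching the third property of Theorem 7.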
5.2. C-Ksvd Algorithm
We now turn to a more involved dictionary updating strategy: rather than freezing the coefficient matrix W, we update D=D0+E together with the nonzero coefficients (i.e., only the support is fixed).
In particular, assume that both $W$ and $E$ are fixed except for one column $e_k$ of the correction matrix $E$ and the coefficients corresponding to it, the $k$th row of $W$, denoted $w_T^k$. Then the loss function can be rewritten as (36) $\|X-(D_0+E)W\|_F^2 + \gamma\|E\|_F^2 = \|X - \sum_{i\ne k}(d_i^0+e_i)w_T^i - (d_k^0+e_k)w_T^k\|_F^2 + \gamma\|e_k\|_2^2 + C = \|M_k - (d_k^0+e_k)w_T^k\|_F^2 + \gamma\|e_k\|_2^2 + C$, where $d_k^0$ is the $k$th column of $D_0$, $C = \gamma\sum_{i\ne k}\|e_i\|_2^2$ is a constant, and $M_k = X - \sum_{i\ne k}(d_i^0+e_i)w_T^i$ represents the error when the $k$th dictionary atom is removed.
Now we shrink the loss function to the support of the row vector $w_T^k$. Define $\delta_k$ as the group of indices of samples $\{x^i\}$ that use the atom $d_k = d_k^0 + e_k$; that is, $\delta_k = \{i \mid 1\le i\le n,\ w_T^k(i)\ne 0\}$. Further, define $\Omega_k\in\mathbb{R}^{n\times|\delta_k|}$ with ones at the $(\delta_k(i), i)$th entries and zeros elsewhere. Then problem (29) is transformed into (37) $\min_{e_k, w_R^k}\|M_k^R - (d_k^0+e_k)w_R^k\|_F^2 + \gamma\|e_k\|_2^2$, where $M_k^R = M_k\Omega_k$ and $w_R^k = w_T^k\Omega_k$. For this subproblem, we have the following result.
Theorem 8.
Suppose the largest singular value and the corresponding singular vectors of the matrix $[\sqrt{\gamma}d_k^0, M_k^R]\in\mathbb{R}^{m\times(|\delta_k|+1)}$ are $\sigma_1$, $u_1$, and $v_1$, and $v_{11}$ is the first element of $v_1$. Then the unique solution of problem (37) is (38) $e_k = \frac{\sigma_1 v_{11}}{\sqrt{\gamma}}u_1 - d_k^0$, $w_R^k = \frac{d_k^T M_k^R}{\|d_k\|_2^2}$, where $d_k = d_k^0 + e_k = \frac{\sigma_1 v_{11}}{\sqrt{\gamma}}u_1$.
Proof.
Denote $d_k = d_k^0 + e_k$; then (39) $\|M_k^R - (d_k^0+e_k)w_R^k\|_F^2 + \gamma\|e_k\|_2^2 = \|M_k^R - d_kw_R^k\|_F^2 + \gamma\|d_k^0 - d_k\|_2^2 = \|[\sqrt{\gamma}d_k^0, M_k^R] - [\sqrt{\gamma}d_k, d_kw_R^k]\|_F^2$.
As $[\sqrt{\gamma}d_k, d_kw_R^k] = d_k[\sqrt{\gamma}, w_R^k]$ is the outer product of two vectors, its rank is one. Then problem (37) can be rewritten as (40) $\min_{d_k, w_R^k}\|[\sqrt{\gamma}d_k^0, M_k^R] - d_k[\sqrt{\gamma}, w_R^k]\|_F^2$.
Thus $d_k[\sqrt{\gamma}, w_R^k]$ is the best rank-one approximation of $[\sqrt{\gamma}d_k^0, M_k^R]$. By the Eckart-Young-Mirsky theorem [27], we have $d_k[\sqrt{\gamma}, w_R^k] = \sigma_1 u_1 v_1^T$; thus $d_k = \frac{\sigma_1 v_{11}}{\sqrt{\gamma}}u_1$ and $e_k = d_k - d_k^0$. Substituting $e_k$ back into problem (37), it becomes a least squares problem and we have (41) $w_R^k = \frac{d_k^T M_k^R}{\|d_k\|_2^2}$.
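The closed form is easy to check numerically. The sketch below (our own helper names, on hypothetical small sizes) computes (38) via the top SVD; since (38) is the global minimizer of (37), its objective can be no larger than that of the "no correction" baseline $e_k = 0$:

```python
import numpy as np

def atom_update(dk0, MkR, gamma):
    # solution (38): top SVD of [sqrt(gamma) dk0, MkR]
    A = np.column_stack([np.sqrt(gamma) * dk0, MkR])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    dk = (S[0] * Vt[0, 0] / np.sqrt(gamma)) * U[:, 0]
    wRk = dk @ MkR / np.dot(dk, dk)
    return dk - dk0, wRk          # (e_k, w_R^k)

def subproblem_loss(dk0, MkR, ek, wRk, gamma):
    # objective of (37)
    return np.linalg.norm(MkR - np.outer(dk0 + ek, wRk)) ** 2 + gamma * np.dot(ek, ek)
```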
Thus, problem (37) has a closed-form solution and the main computation is the top SVD of $[\sqrt{\gamma}d_k^0, M_k^R]$. In the dictionary updating stage, we minimize with respect to each column $d_k$ (for simplicity, omitting $e_k$, we use $d_k$ directly in the updates) and the corresponding $w_T^k$ in sequence, keeping the support of the coefficients fixed. The complete algorithm, named "C-Ksvd", is described as Algorithm 9. Note that while K-SVD computes the top SVD of the matrix $M_k^R\in\mathbb{R}^{m\times|\delta_k|}$ for the $k$th column, C-Ksvd computes that of $[\sqrt{\gamma}d_k^0, M_k^R]\in\mathbb{R}^{m\times(|\delta_k|+1)}$.
Assuming that the sparse coding stage is performed perfectly, convergence to a local minimum is guaranteed, as the loss function is nonincreasing at each update step for $d_k$ and a series of such steps ensures a monotonic reduction. Compared with the general strategy, the update for $d_k$ is more direct, as it allows tuning the values of the corresponding coefficients. In addition, each atom can have its own parameter $\gamma_i$, reflecting the confidence level in the $i$th atom $d_i$.
Algorithm 9 (C-Ksvd algorithm).
Initialization: a global dictionary $D_0$, samples $\{x^i\}_{i=1}^n$.
Repeat:
Sparse coding stage: use any sparse recovery algorithm to compute the coefficients $w^i$ for each sample $x^i$ by approximating the solution of (42) $\min_{w^i}\|x^i - (D_0+E)w^i\|_2^2$, s.t. $\|w^i\|_0\le s$.
Dictionary updating stage: for each column $k = 1,2,\ldots,p$ of $D$, update it as follows:
Compute $M_k$ by $M_k = X - \sum_{i\ne k}d_iw_T^i$.
Define the group of samples that use this atom as $\delta_k$. Restrict $M_k$ and $w_T^k$ by choosing the columns corresponding to $\delta_k$, obtaining $M_k^R$ and $w_R^k$.
Apply a top SVD decomposition to $[\sqrt{\gamma}d_k^0, M_k^R]$, obtaining $\sigma_1, u_1, v_1$. Update (43) $d_k = \frac{\sigma_1 v_{11}}{\sqrt{\gamma}}u_1$, $w_R^k = \frac{d_k^T M_k^R}{\|d_k\|_2^2}$.
Until convergence (stopping rule).
Output: a better dictionary D.
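The steps of Algorithm 9 can be sketched compactly as follows (a self-contained illustration with an inline minimal OMP helper; sizes and parameters are illustrative, and the stopping rule is simply a fixed iteration count):

```python
import numpy as np

def omp(D, x, s):
    # greedy orthogonal matching pursuit with fixed sparsity s (helper)
    residual, support, coef = x.copy(), [], np.zeros(0)
    for _ in range(s):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    w = np.zeros(D.shape[1])
    w[support] = coef
    return w

def c_ksvd(X, D0, s, gamma, iters=5):
    # Algorithm 9: sparse coding on D = D0 + E, then per-atom rank-one
    # updates via the top SVD of [sqrt(gamma) dk0, MkR] (Theorem 8)
    D = D0.copy()
    for _ in range(iters):
        W = np.column_stack([omp(D, X[:, i], s) for i in range(X.shape[1])])
        for k in range(D.shape[1]):
            delta = np.flatnonzero(W[k, :])          # samples using atom k
            if delta.size == 0:
                continue
            # error matrix with atom k removed, restricted to delta
            MkR = X[:, delta] - D @ W[:, delta] + np.outer(D[:, k], W[k, delta])
            A = np.column_stack([np.sqrt(gamma) * D0[:, k], MkR])
            U, S, Vt = np.linalg.svd(A, full_matrices=False)
            dk = (S[0] * Vt[0, 0] / np.sqrt(gamma)) * U[:, 0]
            D[:, k] = dk
            W[k, delta] = dk @ MkR / np.dot(dk, dk)
    return D, W
```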
6. Experiments
We first showed the effectiveness of our approach on the denoising task, with an analysis of the customized dictionary and the tuning parameter γ. Further, a novel superresolution experiment was illustrated, sharing the idea of transferring knowledge from a related auxiliary data source. In addition, we conducted an experiment that enhances an insufficiently learned dictionary by C-Ksvd, illustrating that our model is also valid for further tasks.
6.1. Denoising
We demonstrated the customization results by denoising tasks on facial images drawn from the PIE Database [28]. The denoising process was similar to [1], which included sparse coding of each patch of the noisy image. As the coding performance relies heavily on the dictionary, we could assess the dictionary by the denoising results, which were evaluated by PSNR (Peak Signal-to-Noise Ratio).
In particular, the noisy images were produced by adding Gaussian noise with mean zero and different standard deviations σ. The patch size and the redundancy factor were set to 16 × 16 and 4. (We chose them for the best visual effect, while similar comparisons can be attained for different values.) OMP was used for coding, and atoms were accumulated until the average error passed the threshold, chosen empirically as ε=1.15·σ. Results corresponding to three dictionaries were compared, that is, the global dictionary D0, the one generated by K-SVD, and the one produced by our customization approach. In K-SVD and customization, D0 was used as initialization and the iteration number was set to 10. Moreover, three kinds of D0 were considered, denoted as "Global I", "Global II", and "DCT", respectively: (1) a dictionary learned by K-SVD, with 40,000 noiseless patches picked from 100 individuals; (2) similar to (1), but learned with noisy patches (σ=20); (3) the predefined DCT (discrete cosine transform).
Each experiment was repeated 5 times, and the results are reported in Table 1 and Figure 3. Customization outperformed the global dictionary and K-SVD on both PSNR and visual effects, reflecting the fact that both the common sketches in D0 and the characteristics in X were utilized. In particular, note that denoising by D0 tended to be too smooth, and the results of K-SVD were likely to be too rough. Regarding DCT as a suboptimal global dictionary, the results also showed that our customization is valid for a wide range of D0. Conducted on an i7-3770 CPU and processed with the same dataset X, the average running times for K-SVD and customization were 173.34 s and 48.21 s, respectively, showing that our approach is competitive. In particular, for K-SVD, 119.31 s were spent removing identical atoms. We also display the three dictionaries as images in Figure 4, showing that the customized one was similar to the global one, while the one corresponding to K-SVD was not.
Denoising results (PSNR, dB) on facial images of different individuals with noise level σ=30. For each image, three kinds of D0 were considered.

Individual | Original | Type of D0 | D0    | K-SVD | Customized
1          | 18.77    | Global I   | 26.62 | 26.63 | 27.34
           |          | Global II  | 26.37 | 26.62 | 27.12
           |          | DCT        | 25.42 | 26.34 | 26.61
2          | 18.52    | Global I   | 26.73 | 26.24 | 27.39
           |          | Global II  | 26.39 | 26.06 | 27.10
           |          | DCT        | 25.72 | 25.85 | 25.82
3          | 18.63    | Global I   | 26.83 | 27.16 | 27.78
           |          | Global II  | 26.41 | 26.82 | 27.31
           |          | DCT        | 25.84 | 26.49 | 26.57
Examples of denoising different facial images. For the three rows, D0 was chosen as Global I, II, and DCT, and σ = 30, 20, 25, respectively.
Three dictionaries were sorted, and the quarters at the top left corner are shown for the denoising of a noisy image with σ=30. For convenience of comparison, the global dictionary learned from noisy patches is placed in the center.
In addition, we plotted the relations among the tuning parameter γ, the average number (AN) of coefficients per patch, and the PSNR after denoising in Figure 5. It shows that γ can be chosen as the value attaining the minimum average number of coefficients, found by a quick one-dimensional search. Moreover, experimentally, for a fixed D0, the best γ was the same across different individuals, which implies we only need to tune it once while customizing.
The γ with the highest PSNR was attained at the minimum average number of coefficients.
6.2. Superresolution
Yang et al. [29] proposed a scale-up algorithm via sparse signal representation, which contains two steps: dictionary learning and patch-pair construction. To reduce the dimension and speed up processing, Elad [30] applied PCA to the samples and used K-SVD for training. However, this learned dictionary is still a global one, which means that we can further improve superresolution performance by customization.
Consider a global dictionary $D_0$ and patches $X = \{x^i\}_{i=1}^n$ sampled from related high-resolution images; we can customize this dictionary to a finer granularity. In particular, substituting for K-SVD, the low-resolution dictionary $D_l$ and the coefficients $W$ were customized by C-Ksvd, and the corresponding high-resolution dictionary was attained by (44) $D_h = D_h^0 + (X - D_l^0W)W^T(\gamma I + WW^T)^{-1}$, where $D_h^0$ and $D_l^0$ denote the initial high-resolution and low-resolution dictionaries, respectively.
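The correction in (44) has the same closed form as the update (35); a generic sketch (our own function name) that applies it to any initial dictionary with a paired training matrix and coefficients:

```python
import numpy as np

def regularized_correction(D0, X, W, gamma):
    # closed-form update D = D0 + (X - D0 W) W^T (gamma I + W W^T)^{-1},
    # the same form as (35); (44) applies it with the coefficients W obtained
    # from the low-resolution coding
    p = D0.shape[1]
    return D0 + (X - D0 @ W) @ W.T @ np.linalg.inv(gamma * np.eye(p) + W @ W.T)
```

Note that with W = 0 (or γ very large) the correction vanishes and the initial dictionary is returned unchanged.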
In this experiment, similar to the settings in [31], we evaluated the proposed approach on the Yale Face Database [32], which contains 11 different 100 × 100 facial images for each of the 15 individuals. A downscaled image was taken as the low-resolution object, and the down-scale factor was set to 3. Further, other images of the same individual were considered as high-resolution auxiliary data. The patch size was set to 3 × 3. The global dictionary D0 was trained by 34,650 patches sampled from the 80% downscaled total dataset. (This is to highlight the relevance of auxiliary data and simulate real conditions, as the total training set is relatively small and clean in our experiment.) Results produced by D0, K-SVD (i.e., the original version with D0 as initialization and X as training data), and customization were compared. For customization, 225 patches were taken from each auxiliary image. For K-SVD, the total number of sampled patches was fixed to 6,000, to gain the best results.
We varied the number of auxiliary images and repeated the experiments on different individuals, evaluating performance by PSNR. Some of the results are summarized in Table 2, where "DL" and "Cus" denote K-SVD and customization, respectively, and "3," "6," and "9" denote the number of auxiliary images. "Bicubic," that is, simple bicubic interpolation, is shown as a baseline.
Table 2: PSNR (dB) for superresolution on test images.

Task  Bicubic  D0    DL3   Cus3  DL6   Cus6  DL9   Cus9
1     32.5     34.4  33.1  35.0  34.1  35.4  35.6  35.7
2     33.8     36.2  36.1  36.9  35.0  36.8  37.0  37.2
3     32.9     35.8  32.4  36.6  31.8  37.0  36.7  37.3
4     32.0     34.0  34.0  34.7  32.1  34.9  33.8  35.1
5     36.2     38.1  37.6  38.7  36.7  38.9  38.7  38.9
It can be seen that when the number of auxiliary images is small, the results produced by K-SVD are worse than those of the global dictionary, implying that relearning is counterproductive. In contrast, even with little auxiliary data (675 patches from 3 images), superresolution by customization achieves significant improvements. Customization still outperforms, or is no worse than, the learning approach as more data is added. Note that K-SVD requires far more patches than customization, meaning more computation and time, and that once the dictionary has been customized, it remains valid for all images of the same person.
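The PSNR values reported in Table 2 follow the standard definition. A minimal sketch, assuming 8-bit images (peak value 255):

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    images; the metric used to score the superresolution results."""
    err = np.asarray(reference, float) - np.asarray(reconstructed, float)
    mse = np.mean(err ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```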
6.3. Enhancing
As mentioned above, model (29) can be applied to further tasks, as long as the assumption that D0 and the reference dictionary D∗ are close holds. In this subsection, we consider enhancing an existing dictionary by C-Ksvd and evaluate the performance on classification.
In particular, LC-KSVD [33], one of the state-of-the-art methods for image classification, introduces a triple model (D, A, W), where D is the dictionary, A holds the parameters of the label-consistency term, and W is the linear classifier. Regarding (D^T, \sqrt{\alpha} A^T, \sqrt{\beta} W^T)^T as a new dictionary, the following objective function can be solved by K-SVD:
(45) \arg\min_{D, A, W, X} \|Y - DX\|_2^2 + \alpha \|Q - AX\|_2^2 + \beta \|H - WX\|_2^2, \quad \text{s.t. } \forall i, \|x_i\|_0 \le T.
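The stacking step can be written in a few lines. A sketch, assuming matrices shaped as in [33] (Y: signals, Q: label-consistency targets, H: class labels, each with one column per sample); the square-root weighting is what makes the single stacked Frobenius objective reproduce the weighted sum in (45):

```python
import numpy as np

def stack_lcksvd(Y, Q, H, D, A, W, alpha, beta):
    """Stack the LC-KSVD data and model so that one K-SVD (or C-Ksvd)
    pass solves Eq. (45) as a single least-squares fit:
        Y_new = [Y; sqrt(alpha) Q; sqrt(beta) H],
        D_new = [D; sqrt(alpha) A; sqrt(beta) W],
    giving ||Y_new - D_new X||^2 = ||Y - DX||^2
        + alpha ||Q - AX||^2 + beta ||H - WX||^2."""
    Y_new = np.vstack([Y, np.sqrt(alpha) * Q, np.sqrt(beta) * H])
    D_new = np.vstack([D, np.sqrt(alpha) * A, np.sqrt(beta) * W])
    return Y_new, D_new
```

After solving on the stacked pair, the blocks of the learned dictionary are split back apart (and rescaled by the inverse square roots) to recover D, A, and W.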
Sometimes the learned model M0 = (D0, A0, W0) is not good enough, because the training data may be insufficient or too noisy; moreover, the original training data often becomes unavailable over time. In this case, we can enhance it with our customization model, simply replacing the K-SVD procedure with C-Ksvd. In accordance with [33], we used the Extended Yale-B dataset [34] to demonstrate the performance. The data were divided into three parts: training data to obtain the initial model, auxiliary data for model enhancement, and test data for evaluation. Parameters α and β were tuned while training the initial models and then kept fixed. The initial model M0, LC-KSVD, and C-Ksvd were compared, where LC-KSVD used M0 as initialization and X as training data. Results were analyzed in three ways.
(1) Initial models of different quality levels were obtained by varying the number of training samples, and we attempted to improve each model with 800 auxiliary images. After repeating the experiment 5 times per level, the averaged recognition accuracies are summarized in Table 3.
Table 3: Accuracy (%) for initial models of different levels and a fixed number (800) of auxiliary images.

Init     76.76  79.32  84.86  88.04  90.41  93.30
LC-KSVD  85.32  85.36  85.76  88.69  90.23  90.83
C-Ksvd   91.78  92.20  94.36  94.94  95.79  97.38
It can be seen that C-Ksvd is effective over a wide range of initial models and consistently, significantly outperforms LC-KSVD. Besides, the influence of the initial model on LC-KSVD is relatively small, in accordance with our previous analysis.
(2) For fixed initial models, we varied the number of auxiliary images from 100 to 1100 and plotted the corresponding recognition results in Figure 6. Accuracy increased significantly even when the number of auxiliary images was relatively small, whereas LC-KSVD required large amounts of images to reach a competitive result, which is often unaffordable.
Figure 6: Varying the number of auxiliary images with fixed initial models.
(3) In the previous experiments, the auxiliary images were sampled uniformly from all 38 individuals. We then considered the nonuniform case, where only images of several classes (the "enhanced classes") are available. Setting the number of enhanced classes to 19 and taking 31 images from each, we report the results in Table 4.
Table 4: Accuracy (%) on enhanced classes, remainder classes, and all classes.

Classes    Init   LC-KSVD  C-Ksvd
Enhanced   83.90  92.05    93.40
Remainder  84.86  0.0      82.83
All        84.38  46.02    88.11
While C-Ksvd improved the accuracy on the enhanced classes, the accuracy on the remainder classes was only slightly reduced, owing to the similarity between the original and the customized dictionaries. LC-KSVD, in sharp contrast, lost nearly all accuracy on the remainder classes.
7. Conclusion
In this paper, we considered the dictionary customization problem, which can be seen as a trade-off between learning a new dictionary from data and using an existing one. We supported our hypothesis with theoretical analysis and formulated a model by introducing a specific regularizer. An efficient algorithm was proposed, and experiments on real-world data demonstrate that our approach is promising.
Competing Interests
The authors declare that they have no competing interests.
References
[1] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[2] L. Ma, C. Wang, B. Xiao, and W. Zhou, "Sparse representation for face recognition based on discriminative low-rank dictionary learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), Providence, RI, USA, June 2012, pp. 2586–2593.
[3] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[4] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '10), June 2010, pp. 2691–2698.
[5] H. Liu, Y. Liu, and F. Sun, "Robust exemplar extraction using structured sparse coding," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 8, pp. 1816–1821, 2015.
[6] H. Liu, D. Guo, and F. Sun, "Object recognition using tactile measurements: kernel sparse coding methods," IEEE Transactions on Instrumentation and Measurement, vol. 65, no. 3, pp. 656–665, 2016.
[7] H. Liu, Y. Yu, F. Sun, and J. Gu, "Visual-tactile fusion for object recognition," IEEE Transactions on Automation Science and Engineering, 2016.
[8] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1999.
[9] K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), vol. 5, March 1999, pp. 2443–2446.
[10] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[11] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, G. Sapiro, and L. Carin, "Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images," IEEE Transactions on Image Processing, vol. 21, no. 1, pp. 130–144, 2012.
[12] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), Montreal, Canada, June 2009, pp. 689–696.
[13] K. Schnass, "Local identification of overcomplete dictionaries," https://arxiv.org/abs/1401.6354.
[14] K. Schnass, "On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD," Applied and Computational Harmonic Analysis, vol. 37, no. 3, pp. 464–491, 2014.
[15] R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert, "Sample complexity of dictionary learning and other matrix factorizations," https://arxiv.org/abs/1312.3790.
[16] R. Gribonval, R. Jenatton, and F. Bach, "Sparse and spurious: dictionary learning with noise and outliers," IEEE Transactions on Information Theory, vol. 61, no. 11, pp. 6298–6319, 2015.
[17] C. Lu, J. Shi, and J. Jia, "Online robust dictionary learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), Portland, OR, USA, June 2013, pp. 415–422.
[18] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Fisher discrimination dictionary learning for sparse representation," in Proceedings of the IEEE International Conference on Computer Vision (ICCV '11), Barcelona, Spain, November 2011, pp. 543–550.
[19] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach, "Supervised dictionary learning," in Advances in Neural Information Processing Systems 22 (NIPS '09), 2009, pp. 1033–1040.
[20] S. Hawe, M. Seibert, and M. Kleinsteuber, "Separable dictionary learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), Portland, OR, USA, June 2013, pp. 438–445.
[21] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[22] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers," in Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS '09), Vancouver, Canada, December 2009, pp. 1348–1356.
[23] I. Markovsky and S. Van Huffel, "Overview of total least-squares methods," Signal Processing, vol. 87, no. 10, pp. 2283–2302, 2007.
[24] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition," in Proceedings of the 27th Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, November 1993, pp. 40–44.
[25] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm," IEEE Transactions on Signal Processing, vol. 45, no. 3, pp. 600–616, 1997.
[26] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[27] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
[28] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in Proceedings of the 5th IEEE International Conference on Automatic Face and Gesture Recognition (FGR '02), 2002, pp. 46–51.
[29] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[30] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, NY, USA, 2010.
[31] Y. Guo, "Robust transfer principal component analysis with rank constraints," in Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS '13), C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 1151–1159.
[32] A. Georghiades, The Yale Face Database, Center for Computational Vision and Control, Yale University, 1997.
[33] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), Colorado Springs, CO, USA, June 2011, pp. 1697–1704.
[34] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.