An Efficient Algorithm for Learning a Dictionary under a Coherence Constraint

The dictionary learning problem has been an active research topic for decades. Most existing learning methods train the dictionary to adapt to a particular class of signals. But as the number of dictionary atoms is increased to represent the signals more sparsely, the coherence between the atoms becomes higher. According to greedy and compressed sensing theories, this works against the implementation of sparse coding. In this paper, a novel approach is proposed to learn a dictionary that minimizes the sparse representation error over the training signals while taking the coherence into consideration. The coherence is constrained by making the Gram matrix of the desired dictionary approximate an identity matrix of proper dimension. The method for handling the proposed model is based mainly on an alternating minimization procedure and, in each step, a closed-form solution is derived. A series of experiments on synthetic data and audio signals demonstrates the promising performance of the learnt incoherent dictionary and the superiority of the learning method over existing ones.


Introduction
Sparse representation (SR) theory [1, 2] indicates that a signal can be represented by a linear combination of a few atoms of a prespecified dictionary. It is an evolving field, with state-of-the-art results in many signal processing tasks, such as coding, denoising, face recognition, deblurring, and compressed sensing [3-7].
A fundamental consideration in employing the above theory is the choice of the dictionary, and this leads to the famous dictionary learning (DL) problem. DL has attracted a lot of attention since its introduction at the end of the last century [8, 9]. Most of the research has aimed to learn a data-adaptive dictionary so that a particular class of signals can be sparsely represented in this dictionary with low approximation error.
Under the SR framework, a signal vector y ∈ R^n can be expressed in the form

y = Dx, (1)

where D ∈ R^{n×K} is the dictionary with its columns {D(:, k)} referred to as atoms (throughout this paper, MATLAB notations are used) and x ∈ R^K is the corresponding sparse coefficient vector. Let k ∈ R^N with k(i) being its ith element. The ℓ_p-norm of vector k is defined as

‖k‖_p = (Σ_{i=1}^N |k(i)|^p)^{1/p}. (2)

Note that ‖k‖_p is not a norm in a strict sense for 0 ≤ p < 1. For convenience, ‖k‖_0 is used to denote the number of nonzero elements in k. A vector y given by (1) is said to be s-sparse in D if ‖x‖_0 = s. Let {y_l}_{l=1}^L be a set of training samples from a class of signals to be considered. The basic problem of DL is to find a dictionary D such that, for each y_l, there exists a vector x_l that is sparse. Such a problem has been widely investigated during the last decade or so [10-12] and can be formulated as

min_{D,X} ‖Y − DX‖_F^2 subject to ‖x_l‖_0 ≤ s_l, ∀l, (3)

where ‖·‖_F denotes the Frobenius norm, {s_l} are proper sparsity constants, ‖x_l‖_0 is the sparsity of the sparse vector x_l, and

Y ≜ [y_1, ..., y_L], X ≜ [x_1, ..., x_L]. (4)

Such a problem is difficult to solve as it is nonconvex in D and X, and ‖·‖_0 is nonsmooth and highly unstable.
A popularly used approach is based on the alternating minimization strategy. A two-stage procedure is usually carried out for solving the above problem and also for avoiding the selection of {s_l} [10-12]. The problem in the first stage is referred to as sparse coding, aiming at finding the (column-)sparse matrix X with a given D; that is,

min_X ‖Y − DX‖_F^2 subject to ‖x_l‖_0 ≤ s_l, ∀l. (5)

Note that the equivalent formulation of (5) where the constraint is a fixed sparse representation error (SRE) level can also be stated as

min_x ‖x‖_0 subject to ‖y − Dx‖_2 ≤ ε, (6)

with ε being the error threshold. Such a problem can be solved using orthogonal matching pursuit (OMP) based methods [13, 14]. Furthermore, it can be shown that the solution of the above problem coincides, under suitable conditions, with that of the ℓ1-based minimization below:

min_x ‖x‖_1 subject to ‖y − Dx‖_2 ≤ ε, (7)

while the latter can be addressed using algorithms such as basis pursuit (BP) [15] and ℓ1-based optimization techniques [16].
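Since sparse coding with OMP recurs throughout the paper, a minimal reference implementation may help. This is a sketch in NumPy (not the paper's code); it assumes unit-norm atoms and a fixed sparsity level s, and the function name is mine.

```python
import numpy as np

def omp(D, y, s):
    """Greedy orthogonal matching pursuit: pick s atoms of D to approximate y.

    Assumes the columns of D are unit-norm. A minimal sketch, not the
    optimized batch implementations used in dictionary learning toolboxes.
    """
    n, K = D.shape
    residual = y.copy()
    support = []
    x = np.zeros(K)
    for _ in range(s):
        # Atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        support.append(k)
        # Re-fit the coefficients on the whole support (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x
```

For an orthonormal D the greedy selection is exact, which makes a convenient sanity check.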
Many algorithms for solving (3) differ from each other mainly in the second stage, that is, dictionary updating. For the dictionary D, in order to code the signals of interest more sparsely, we usually set K > n, which means that D is overcomplete. However, this redundancy increases the pairwise similarity of dictionary atoms. According to the work in [13], such similarity has a direct influence on the dictionary's performance, especially on the accuracy of the sparse coding stage. If any two atoms degenerate to the same vector, this will lead to overfitting to the training data. Thus, an incoherent dictionary is expected to improve the performance of the SR model.
Yaghoobi et al. proposed a design method for parametric dictionaries [17]. The authors attempted to optimize the dictionary so that the corresponding Gram matrix approximates the Gram of an equiangular tight frame (ETF), which possesses good coherence behavior. However, this method relies heavily on a priori knowledge of an appropriate parameter-selection criterion related to the given class of signals.
A new algorithm named INK-SVD was developed in [18]. In each iteration of the K-SVD algorithm [11], the dictionary updating stage is followed by an additional decorrelation step: each pair of atoms whose coherence exceeds the threshold has its inner angle increased symmetrically so as to reduce the coherence. But this procedure implicitly destroys the original SR result from the K-SVD algorithm. To compensate for this problem, the authors of [19] improved the work of [18] by incorporating a new decorrelation step (also related to the ETF through its low coherence) and a dictionary rotation operation into the update stage. In [20], a weighted model was formulated to balance the coherence of the dictionary against its sparse representation ability, and a gradient-based method was carried out for solving the corresponding problem.
The main objective of this paper is to propose a new incoherent dictionary learning (IDL) method that constrains the coherence of the dictionary and minimizes the SRE. The contributions are threefold: (i) A novel model is proposed for learning the incoherent dictionary, with the main novelty located in the dictionary updating procedure. When minimizing the SRE, that is, min_D ‖Y − DX‖_F^2, D is kept under the coherence constraint by making the corresponding Gram matrix approximate an identity matrix of proper dimension.
(ii) An iterative algorithm that updates the sparse coefficients and the components of the dictionary alternately is put forward to solve the design problem. In every step of dictionary updating, the solution for each component of the dictionary is derived analytically.
(iii) A series of experiments on synthetic data and audio signals is carried out to demonstrate the performance of each compared algorithm.
The remainder of this paper is arranged as follows. In Section 2, some preliminaries are provided and the main issue of learning an incoherent dictionary is formulated. The algorithm proposed for addressing the corresponding design problem is investigated in Section 3. Simulations are carried out in Section 4 to examine the performance of the proposed algorithm and to compare it with existing ones. Some concluding remarks are given in Section 5.

Preliminaries and Problem Formulation
In this section, some preliminaries are introduced and the two main methods this paper is compared against are reviewed in detail. Based on these, we formulate the problem of incoherent dictionary learning, whose purpose is to increase the approximation performance of the dictionary for a particular class of signals under a coherence constraint.
The most fundamental quantity associated with a dictionary is the mutual coherence (MC) [21]. MC indicates the degree of similarity between different dictionary columns. It equals the maximum absolute inner product between two distinct (normalized) atoms:

μ(D) = max_{i≠j} |D(:, i)^T D(:, j)|, (8)

where T denotes the transpose operator. As shown in [21], an s-sparse signal generated according to (1) can be exactly recovered with OMP as long as

s < (1/2)(1 + 1/μ(D)). (9)

Roughly speaking, MC measures how much two atoms can look alike. Equation (9) is just a worst-case bound and only reflects the most extreme correlations in the dictionary. Nevertheless, MC is easy to manipulate and captures well the behavior of some dictionaries. Generally, a dictionary is called incoherent if the corresponding MC is small [18, 19]. Besides, as pointed out in [19], the coherence of a dictionary is related to the condition numbers of its subdictionaries. This implies that achieving a low MC value results in well-conditioned subdictionaries. Define the Gram matrix of D as

G ≜ D^T D. (10)

It is common to study the MC in (8) via the Gram matrix. Let D_sc be the diagonal matrix whose ith diagonal element is given by 1/√G(i, i) for i = 1, ..., K. The Gram matrix of D ≜ D D_sc, denoted as G, is then normalized, such that G(i, i) = 1, ∀i.
Obviously, μ(D) = max_{i≠j} |G(i, j)|. For D ∈ R^{n×K}, it has been shown in [22] that μ(D) is bounded from below:

μ(D) ≥ μ_W = √((K − n)/(n(K − 1))), (11)

with μ_W being the Welch bound. If each pairwise atomic inner product meets this bound with equality, the dictionary is called an ETF. An ETF has a very nice MC behavior and has been considered for use in optimal dictionary design [17, 19].
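The mutual coherence (8) and the Welch bound (11) are straightforward to compute. The following NumPy sketch (function names are mine, not from the paper) can be used to check a learnt dictionary:

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between distinct, normalized atoms."""
    Dn = D / np.linalg.norm(D, axis=0)   # column-normalize the atoms
    G = np.abs(Dn.T @ Dn)                # normalized Gram, entries in [0, 1]
    np.fill_diagonal(G, 0.0)             # ignore the unit diagonal
    return G.max()

def welch_bound(n, K):
    """Lower bound on the coherence of any n x K dictionary (K >= n)."""
    return np.sqrt((K - n) / (n * (K - 1)))
```

For an orthonormal basis (K = n) both quantities are zero; any overcomplete dictionary sits at or above the Welch bound.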

Related Works.
It is worth noting that ETFs only exist for matrices D ∈ R^{n×K} with the dimensionality constrained by K ≤ n(n + 1)/2 if the atoms are real. So, one usually replaces the set of ETF Grams with a relaxed version [17, 19] defined as

S_μ = {G = G^T : G(i, i) = 1 ∀i, max_{i≠j} |G(i, j)| ≤ μ}, (12)

where 0 < μ < 1 is a constant that controls the search space.
Clearly, when μ ≥ μ_W, S_μ contains all the ETF Grams.
Besides the space S_μ, the authors of [19] define a spectral constraint set as

F = {G = G^T : eig(G) ≥ 0, rank(G) ≤ n}. (13)

Here eig(⋅) returns the vector of eigenvalues and rank(⋅) is the rank operator. The algorithm for learning an incoherent dictionary proposed in [19] can be outlined as follows: (i) Sparse coding with OMP.
(ii) Dictionary updating with K-SVD.
(iii) Atoms decorrelation through an iterative projection procedure.
(iv) Dictionary rotation to minimize the approximate error while keeping the MC unchanged.
The main contributions of [19] lie in the last two steps.
The atom decorrelation is executed by iteratively projecting the Gram of the output dictionary of K-SVD between the sets S_μ and F until a stopping criterion is met. With the singular value decomposition (SVD) of the resulting positive semidefinite Gram matrix Ĝ expressed as

Ĝ = V_G Σ_G (V_G)^T, (14)

where V_G ∈ R^{K×K} is orthonormal and Σ_G ∈ R^{K×K} is the diagonal singular value matrix with all its elements nonnegative, the incoherent dictionary can be obtained as

D = U [Σ_G^{1/2}](1:n, :) (V_G)^T, (15)

with U being an arbitrary orthonormal matrix. Finally, the authors exploit this degree of freedom to further reduce the SRE by solving

min_{U∈O(n)} ‖Y − U [Σ_G^{1/2}](1:n, :) (V_G)^T X‖_F^2, (16)

where O(n) is the set of n × n orthonormal matrices. This is the rotation procedure.
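The factorization (15) and the free rotation U can be sketched as follows. `dictionary_from_gram` is a hypothetical helper name (not from [19]); it assumes an eigendecomposition of the PSD Gram, and the test below confirms that the rotation changes D without changing its Gram, which is exactly the degree of freedom the rotation step spends on reducing the SRE.

```python
import numpy as np

def dictionary_from_gram(G_hat, n, U=None):
    """Factor a PSD Gram of rank <= n into an n x K dictionary D with D^T D = G_hat.

    Any orthonormal U is a free rotation: it changes D but not its Gram.
    """
    vals, vecs = np.linalg.eigh(G_hat)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n]            # keep the n leading ones
    root = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))
    D = root.T                                  # n x K square root of G_hat
    if U is not None:
        D = U @ D                               # rotation: Gram is unchanged
    return D
```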
Remark 1. Compared with the decorrelation operation in [18], the above-mentioned atom decorrelation can achieve a much smaller MC value. Besides, the additional rotation procedure can partially recover the SR ability. However, the approximation performance of the dictionary is substantially damaged by the iterative projections. Though the dictionary rotation procedure is carried out as compensation, the effect of the sole degree of freedom U on the SR ability is quite limited.
In [20], the authors consider another strategy for IDL, where the dictionary's coherence is minimized along with the SRE. The cost function can be expressed as

min_{D,X} ‖Y − DX‖_F^2 + γ ‖D^T D − I_K‖_F^2, (17)

with I_K denoting the identity matrix of dimension K. It is clear that I_K is the simplest ETF Gram (with K = n). The Lagrange multiplier γ controls the trade-off between minimizing the SRE and minimizing the dictionary's coherence.
With the gradient of g(D) ≜ ‖D^T D − I_K‖_F^2 calculated as

∇g(D) = 4 D (D^T D − I_K), (18)

the update of the dictionary is then executed by the steepest descent algorithm [20].
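The gradient formula (18) and one steepest-descent step can be sketched as below. This is a toy illustration of the update strategy of [20], with an arbitrary step size of my choosing; the test checks the analytic gradient against a finite difference.

```python
import numpy as np

def incoherence_grad(D):
    """Gradient of g(D) = ||D^T D - I||_F^2, namely 4 D (D^T D - I)."""
    K = D.shape[1]
    return 4.0 * D @ (D.T @ D - np.eye(K))

def descent_step(D, step=1e-4):
    """One steepest-descent update on the incoherence penalty alone."""
    return D - step * incoherence_grad(D)
```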
Remark 2. (i) The choice of γ remains open-ended and no selection criterion is introduced in [20]. From the simulation results of [20], a larger γ introduces better performance.
(ii) It is well known that gradient-based algorithms may easily fall into a local minimum if the initialization is not properly set [23, 24]. As a gradient-based method is carried out for solving (17), the efficiency and accuracy can be further improved.

Problem Formulation.
Let D ∈ R^{n×K} be the dictionary, let Y = {y_l}_{l=1}^L be the signal set with y_l ∈ R^n, and let X = {x_l}_{l=1}^L be the corresponding sparse coefficient matrix in D with x_l ∈ R^K as defined previously. For the problem indicated in (3), we update X and D alternately. For a fixed D, X can be calculated by greedy algorithms or ℓ1-based convex optimization methods. In the following, we focus our discussion on the dictionary updating stage. For the traditional case, that is,

min_D ‖Y − DX‖_F^2, (19)

the authors of [10] simply update the dictionary as D = Y X^T (X X^T)^{−1}. But when X is not of full rank, this method fails to work. The K-SVD algorithm [11] minimizes (19) for each atom separately. When updating the dictionary, the coefficients are also renewed simultaneously. In every iteration between the coefficients and the dictionary, it needs K SVD operations. It is a time-consuming algorithm and not well suited to enforcing a coherence constraint, which is important for the implementation of sparse coding.
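The MOD-style update D = Y X^T (X X^T)^{−1} mentioned above can be made robust to rank-deficient X by using the pseudoinverse; a one-line sketch (my naming, not code from [10]):

```python
import numpy as np

def mod_update(Y, X):
    """Least-squares fit of the dictionary D to Y for fixed coefficients X.

    Uses the pseudoinverse so the update is defined even when X X^T is
    singular, the case where the plain inverse formula fails.
    """
    return Y @ np.linalg.pinv(X)
```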
As an ETF can achieve a small MC value, this motivates us to design a dictionary that is as close as possible to an ETF [17-20]. So the following constrained model is proposed:

min_{D,X} ‖Y − DX‖_F^2 subject to D ∈ arg min_D ‖D^T D − I_K‖_F^2. (20)

The closed-form solution set of {min_D ‖D^T D − I_K‖_F^2} has been derived in [7, 24] as

D = U [I_n 0] V^T, (21)

where U and V are both arbitrary orthonormal matrices of dimensions n and K, respectively. So (20) can be rewritten as

min_{U,V,X} ‖Y − U [I_n 0] V^T X‖_F^2. (22)

Remark 3. (i) Here we choose the identity matrix I_K as the target Gram for the following reasons: it is easy to handle (avoiding the iterative projection between S_μ and F as carried out in [19]) and expression (21) contains more degrees of freedom than (15) for further minimizing the SRE.
(ii) As pointed out in [20], a flatter singular value spectrum of the dictionary indicates a less coherent dictionary. Our design strategy works under constraint (21), which means that the nonzero singular values of our designed dictionary are all equal (the same as (17) with γ → ∞). Hence, better coherence performance can be expected.
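Membership in the solution set (21) amounts to all singular values of D being equal to one, so projecting an arbitrary D onto this set is a one-line SVD operation. A NumPy sketch (the helper name is mine); the test also verifies that the snapped dictionary attains the minimum value K − n of the penalty:

```python
import numpy as np

def nearest_tight_dictionary(D):
    """Replace the singular values of D by ones: D -> U [I 0] V^T.

    The result lies in the closed-form minimizer set of ||D^T D - I_K||_F^2
    over n x K matrices: every nonzero singular value equals one.
    """
    U, _, Vt = np.linalg.svd(D, full_matrices=False)
    return U @ Vt
```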

Coherence Constrained Dictionary Learning
In this section, an alternating minimization algorithm is developed to address the dictionary learning problem (22). Iter_1: number of iterations between sparse coding and dictionary updating; Iter_2: number of iterations between updating U and V.
(a) For fixed V^{(k−1)}, update U^{(k)} by solving

min_{U∈O(n)} ‖Y − U [I_n 0] (V^{(k−1)})^T X‖_F^2. (24)

(b) For fixed U^{(k)}, update V^{(k)} by solving

min_{V∈O(K)} ‖Y − U^{(k)} [I_n 0] V^T X‖_F^2. (25)

The solutions will be derived in the next subsection.

Update the Components of the Dictionary
Now, let us focus on solving (24) and (25). For convenience, we omit the iteration superscripts in the expressions. As the sparse coefficient matrix X is assumed to be fixed in Step 2, we can rewrite the cost function of (22) as

f(U, V) = ‖Y − U [I_n 0] V^T X‖_F^2, U ∈ O(n), V ∈ O(K), (26)

where U and V can be updated alternately. Let X have the following SVD (for an arbitrary matrix M, the general SVD form can be expressed as M = U_M Σ_M (V_M)^T):

X = U_X Σ_X (V_X)^T. (27)

Define

V ≜ [Ṽ_1 Ṽ_2] (28)

with Ṽ_1 ∈ R^{K×n}, so that [I_n 0] V^T = Ṽ_1^T. We then have two alternative expressions for f(U, V):

f(U, V) = ‖Y − U Ṽ_1^T X‖_F^2 = ‖Y V_X − U Ṽ_1^T U_X Σ_X‖_F^2. (29)

Assume that U^{(k−1)} and V^{(k−1)} are given. In what follows, we derive a procedure for updating (U, V) such that

f(U^{(k)}, V^{(k)}) ≤ f(U^{(k−1)}, V^{(k−1)}). (30)

Update U. First of all, consider

min_{U∈O(n)} ‖Y − U Ṽ_1^T X‖_F^2. (31)

This model can be solved by the following theorem [19]. Writing the SVD

Y (Ṽ_1^T X)^T = U_s Σ_s (V_s)^T, (35)

where U_s and V_s are both orthonormal matrices, the solution to (31) can be derived as

U^{(k)} = U_s (V_s)^T. (36)
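Theorem 4 is the classical orthogonal Procrustes result, and the U-update is a direct application of it. A NumPy sketch (the function name is mine):

```python
import numpy as np

def procrustes(B, C):
    """Solve min_U ||B - U C||_F over orthonormal U (orthogonal Procrustes).

    The minimizer is U = Us Vs^T, where B C^T = Us S Vs^T is an SVD.
    """
    Us, _, Vst = np.linalg.svd(B @ C.T)
    return Us @ Vst
```

In the U-update, B plays the role of Y and C the role of Ṽ_1^T X.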

Update V. Now consider, for the obtained U^{(k)},

min_{V∈O(K)} f(U^{(k)}, V). (37)

Since only the first n columns of V enter the cost, (37) is not a standard orthogonal Procrustes problem and is handled iteratively. Starting from V(0) ≜ V^{(k−1)}, a sequence of orthonormal matrices {V(j)} is constructed, where each auxiliary matrix Ṽ(j) = (U_s)^T V(j), with Ṽ defined through (28), is obtained by applying Theorem 4 to the corresponding subproblem in (41). It follows from (39) and (40) that every step satisfies

f(U^{(k)}, V(j)) ≤ f(U^{(k)}, V(j − 1)), (44)

which, applied repeatedly, implies

f(U^{(k)}, V(j)) ≤ f(U^{(k)}, V(0)) = f(U^{(k)}, V^{(k−1)}). (45)

That is, constructing {Ṽ(j)} in this way, and hence {V(j)} = {U_s Ṽ(j)}, makes {f(U^{(k)}, V(j))} a nonincreasing sequence; therefore, the solution to (37) can be estimated as

V^{(k)} = V(J), (46)

with J the number of inner iterations. Remark 5. (i) It may be possible that f(U^{(k)}, V(J)) = f(U^{(k)}, V^{(k−1)}), but (45) is always true. Therefore, (U, V) can be updated with (U^{(k)}, V(J)).
(ii) For the whole CCDL, there actually exist three loops (indexed by t, k, and j, resp.). For the loop indexed by j, (44) and (45) indicate that the procedure of updating V makes f(U^{(k)}, V(j)) decrease as j increases, so the solution (or an approximate one) of (37) can be obtained. Besides, the solution of (31) is derived analytically as (36). All of this results in the convergence of the second loop indexed by k, that is, dictionary updating. Assuming that OMP performs perfectly in the sparse coding stage, the nonincreasing trend of Step 1 is ensured. To sum up, the cost function (22) decreases in every step and hence the convergence of CCDL is guaranteed.

Experimental Results
In this section, we evaluate the performance of the proposed model and algorithm with synthetic data and audio signals.

4.1. Convergence Performance. Firstly, several simulations are carried out to verify the convergence behavior of the proposed CCDL. As the main contributions of the new method lie in the dictionary updating stage, we focus on the performance of designing the orthonormal matrices U and V, that is, solving (26).
Set n = 20, K = 80, and the number of signals L = 1000. There exist two loops in dictionary updating, as introduced in the second point of Remark 5, indexed by k and j, respectively. The maximum iteration numbers for k and j are both fixed to 100.
4.1.1. For Synthetic Dictionary. X is taken as a K × L Gaussian random matrix. Two orthonormal matrices Ǔ and V̌ are generated to form the authentic dictionary Ď:

Ď = Ǔ [I_n 0] V̌^T. (47)

Then Y is produced as Y = ĎX. The performance is evaluated by the sparse representation error

η = ‖Y − DX‖_F^2, (48)

with D being the learnt dictionary. Starting from an initial random dictionary, Figures 1 and 2 show the convergence performance of the loops indexed by j (with k = 1) and k, respectively. Remark 6. (i) As seen from the minimum values in Figures 1 and 2 (very close to zero), the design processes of U and V can result in a dictionary that is almost the same as the authentic one, Ď.
(ii) The loop indexed by j is embedded in that of k. When k = 1, that is, the case of Figure 1, η already achieves a minimum value that is very close to zero. This manifests that the orthonormal matrix V plays a more important role in minimizing η. The result of Figure 2 also verifies this conclusion, as the value of η converges within one iteration of the loop indexed by k. Recall Remarks 1 and 3, which state that one of the main differences between the proposed algorithm and the method in [19] is the extra degree of freedom V. So better sparse representation ability of the proposed algorithm can be expected.
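The synthetic setup of Section 4.1.1 can be reproduced roughly as follows. This is a sketch under my reading of (47) and of the error measure η, whose exact normalization is not legible in the source; variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, L = 20, 80, 1000                # dimensions used in Section 4.1

# Authentic dictionary of the ideal form (47): all singular values equal one.
U_a, _ = np.linalg.qr(rng.standard_normal((n, n)))
V_a, _ = np.linalg.qr(rng.standard_normal((K, K)))
D_true = U_a @ V_a[:, :n].T           # equals U [I 0] V^T, an n x K matrix
X = rng.standard_normal((K, L))       # K x L Gaussian coefficient matrix
Y = D_true @ X

def eta(D):
    """Sparse representation error of a candidate dictionary D."""
    return np.linalg.norm(Y - D @ X, 'fro') ** 2
```

By construction η(Ď) = 0, which is why the curves in Figures 1 and 2 can approach zero in this synthetic case.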
Figures 1 and 2 show the efficiency of the proposed CCDL when the authentic dictionary is generated with the ideal format (47). In what follows, a randomly generated dictionary will be considered.

4.1.2. For Random Dictionary.
In this case, the matrix X, the authentic dictionary Ď, and the initial dictionary are all chosen randomly with proper dimensions and without any correlation. Y is produced as Y = ĎX. Figures 3 and 4 depict the convergence performance of the loops indexed by j (with k = 1) and k, respectively.
The phenomena observed in this case are similar to the previous one, except for the fact that the value of η is larger compared to the case with the synthetic Ď. It should be pointed out that a random Ď almost surely does not have all singular values equal and hence does not belong to the constraint set (21) [25]. This explains why the minimum of η cannot approach zero in this case.

4.2. Simulations with Synthetic Data. We now carry out experiments to illustrate the performance of dictionaries learnt with different approaches. As comparisons, the algorithms in [11, 19, 20] are performed. For convenience, the learning systems are denoted as Dict_new, Dict_ksvd, Dict_BP, and Dict_SDB for the proposed CCDL and the methods in the references just mentioned, respectively. We generate two n × K dictionaries D^{(0)} and Ď, both with normally distributed entries. D^{(0)} is used as the initial condition for executing the different learning algorithms, and Ď is the authentic dictionary. A set of s-sparse K × 1 vectors {x_l}_{l=1}^L is produced, where the nonzero elements of each x_l are randomly positioned, with i.i.d. Gaussian values of zero mean and unit variance. With the authentic dictionary Ď, the set of signal vectors {y_l}_{l=1}^L is generated by y_l = Ďx_l, ∀l, for training the dictionaries.
Set n = 20, K = 80, s = 5, and L = 5000, and the number of iterations for dictionary learning is fixed to Iter_1 = 100 for all four methods. Besides, the number of iterative projections and rotations in [19] is set to 100, and the gradient descent in [20] is executed 100 times with step size equal to 0.1. For CCDL, the maximum iteration numbers for the loops indexed by k and j are fixed as 100 and 10, respectively.
The mutual coherence performance of the different dictionaries is compared and the results are shown in Figure 5. For Dict_BP, the horizontal axis refers to the constant μ which controls the search space S_μ in (12), while for Dict_SDB the horizontal axis indicates the weighting factor γ, with γ = 2^w and w an integer varying within [−10, 10]. In order to allow clear comparisons, some results beyond certain ranges have been omitted, mainly concerning Dict_BP with too small μ.
With the synthetic data, we test the sparse representation abilities of the learnt dictionaries. The representation accuracy is usually quantified with the mean square error (MSE) defined as [11]

MSE = (1/L) Σ_{l=1}^L ‖y_l − ŷ_l‖_2^2, (49)

where ŷ_l = Dx_l is the reconstructed signal, with D being the output dictionary of Dict_new, Dict_ksvd, Dict_BP, or Dict_SDB and {x_l}_{l=1}^L being the corresponding coefficients of {y_l}_{l=1}^L in the different dictionaries calculated by OMP. Figure 6 depicts the MSE results of the different systems.
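The MSE measure is trivial to compute; a sketch, assuming the per-signal normalization written above (the source does not show the formula legibly):

```python
import numpy as np

def mse(Y, Y_hat):
    """Mean square representation error averaged over the L training signals."""
    L = Y.shape[1]
    return np.linalg.norm(Y - Y_hat, 'fro') ** 2 / L
```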
Remark 7. (i) As is known, for Dict_BP, when μ approaches 1, the system regresses to Dict_ksvd. The results in Figures 5 and 6 confirm this conclusion in the cases where μ = 1. Besides, if too small a μ is chosen, the MSE performance of Dict_BP degenerates, though small mutual coherence values are achieved.
(ii) The results of Dict_SDB fluctuate a lot in both tests. Though some surprisingly good performance is achieved, this superiority is highly sensitive to the data. As will be seen in the next experiment, when musical audio signals are tested, the fluctuations of Dict_SDB become gentle.
(iii) Judging from the recovery accuracy indicator that is most crucial for evaluating the systems' performance, that is, Figure 6, the results of Dict_new are superior to those of the other three methods in most cases.

4.3. Experiments with Musical Audio Signals.
The effectiveness of all the algorithms is evaluated on an audio signal coding task, which is popularly used for testing the performance of incoherent dictionaries [19, 20].
The audio signals are selected from the "testMusic16kHz" set of SMALLbox [26]. Just like the operations in [19], we divide each recording into 50%-overlapping blocks of 256 samples with rectangular windows and arrange the resulting time-domain signals as columns of the training data matrix Y. For each musical excerpt, the resulting Y ∈ R^{256×624}; that is, n = 256 and L = 624. An overcomplete Gabor dictionary of size 256 × 512 is used as the initialization, which means K = 512. The sparsity level is fixed as s = 12 for all the tests. The numbers of iterations are kept at the same settings as used for the synthetic data. When learning the dictionaries, the OMP algorithm is applied in the sparse coding stage.
The recovery accuracy is quantified with the signal-to-noise ratio (SNR) defined as [19]

SNR = 10 log_10 (‖Y‖_F^2 / ‖Y − Ŷ‖_F^2), (50)

where Ŷ is the reconstruction whose coefficients are computed by the ℓ1-Homotopy algorithm [16] with λ = 0.001. For all of Dict_new, Dict_ksvd, Dict_BP, and Dict_SDB, we test each of the ten musical excerpts in the "testMusic16kHz" set and keep the average results for comparison. The mutual coherence behavior for each of the learnt dictionaries is depicted in Figure 7.
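The SNR measure can be sketched as follows, assuming the standard Frobenius-norm definition (the source does not print the formula legibly):

```python
import numpy as np

def snr_db(Y, Y_hat):
    """Recovery SNR in dB: 10 log10(||Y||_F^2 / ||Y - Y_hat||_F^2)."""
    err = np.linalg.norm(Y - Y_hat, 'fro') ** 2
    return 10.0 * np.log10(np.linalg.norm(Y, 'fro') ** 2 / err)
```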
When recovering with ℓ1-Homotopy, the SNR performance is shown in Figure 8. Remark 8. In this case, where musical audio signals are tested and the ℓ1-Homotopy algorithm is applied for signal reconstruction, the fluctuations of Dict_SDB become gentle for both the SNR and the mutual coherence performance versus the weighting factor γ. For Dict_BP, the results of mutual coherence and SNR are more consistent; that is, smaller mutual coherence leads to higher SNR values. The performance of Dict_new is superior to the others, and these results are obtained by averaging over the ten musical excerpts.

Conclusion
In this paper, we have investigated the problem of learning an incoherent dictionary. The contributions are threefold. The first is a novel model for IDL that minimizes the sparse representation error over the training signals under a coherence constraint, imposed by making the Gram matrix of the dictionary approximate the identity matrix. The second is an alternating minimization algorithm named CCDL for solving the learning problem, in which the solution for each component of the optimum dictionary is derived analytically. The last is a set of experiments on synthetic data and musical audio signals demonstrating the superiority of the proposed model and algorithm.

3.1. Algorithm for IDL. To solve the above multivariate problem and also avoid the selection of {s_l}, the alternating minimization strategy as introduced for addressing (3) is a natural choice. The pseudocode of the proposed algorithm (named CCDL, standing for Coherence Constrained Dictionary Learning) is summarized as follows: Initialization. D^{(0)}: initial random n × K dictionary; Y: training data.

Theorem 4. For both B and C belonging to R^{n×L}, the solution of min_{U∈O(n)} ‖B − UC‖_F^2 is given by U = U_s (V_s)^T, where U_s and V_s are the orthonormal factors of the SVD B C^T = U_s Σ_s (V_s)^T.

Figure 1: Convergence performance of the loop indexed by j with synthetic Ď.

Figure 2: Convergence performance of the loop indexed by k with synthetic Ď.

Figure 3: Convergence performance of the loop indexed by j with random Ď.

Figure 4: Convergence performance of the loop indexed by k with random Ď.

Figure 5: The mutual coherence performance. (a) Results of Dict_BP versus μ. (b) Results of Dict_SDB versus γ = 2^w.

Figure 7: The mutual coherence performance for each of the learnt dictionaries. (a) Results of Dict_BP versus μ. (b) Results of Dict_SDB versus γ = 2^w.

Figure 8: The recovery SNR performance with ℓ1-Homotopy used for reconstruction. (a) Results of Dict_BP versus μ. (b) Results of Dict_SDB versus γ = 2^w.
x̂ ≜ arg min_x ‖y − Dx‖_2^2 subject to ‖x‖_0 ≤ s.