A Constrained Algorithm Based on NMF_α for Image Representation

Nonnegative matrix factorization (NMF) is a useful tool for learning a basis representation of image data. However, its performance and applicability in real scenarios are limited by the lack of image information. In this paper, we propose a constrained matrix decomposition algorithm for image representation that contains a parameter associated with the characteristics of image data sets. In particular, we impose label information as additional hard constraints on the α-divergence-NMF unsupervised learning algorithm. The resulting algorithm is derived using the Karush-Kuhn-Tucker (KKT) conditions as well as the projected gradient method, and its monotonic local convergence is proved using auxiliary functions. In addition, we provide a method for selecting the parameter of our semisupervised matrix decomposition algorithm in the experiments. Compared with state-of-the-art approaches, our method with the selected parameters achieves the best classification accuracy on three image data sets.


Introduction
Learning an efficient representation of image information is a key problem in machine learning and computer vision. Efficiency of the representation refers to the ability to capture significant information from a high-dimensional image space. Such a high-dimensional problem is difficult to manipulate and compute with; therefore, dimension reduction becomes the crucial tool for coping with it. Fortunately, matrix factorization is a valid approach to the dimension reduction problem, and it has a long and successful history in image representation [1][2][3]. Representative matrix factorization methods include principal component analysis (PCA) [4], singular value decomposition (SVD) [5], vector quantization (VQ) [6], and nonnegative matrix factorization (NMF) [7].
Among all techniques for matrix factorization, NMF is distinguished from the others by its use of nonnegativity constraints in learning a basis representation of image data [8] and has been applied in face recognition [9][10][11], medical imaging [12,13], electroencephalogram (EEG) classification for brain-computer interfaces [14], and many other areas. However, NMF is an unsupervised learning algorithm and is inapplicable to learning a basis representation from limited image information. Thus, to make up for this deficiency, extra constraints have been implicitly or explicitly incorporated into NMF to derive semisupervised matrix decomposition algorithms. In [15], the authors impose label information as additional hard constraints on NMF based on the squared Euclidean distance and the Kullback-Leibler divergence. Such a representation encodes the data points from the same class using the indicator matrix in a new representation space, where the obtained part-based representation is more discriminating.
However, none of the semisupervised NMF algorithms mentioned above contain parameters associated with the characteristics of image data sets. In this paper, we introduce the α-divergence-NMF algorithm [16], where α is a parameter. We impose the labeled constraints on the α-divergence-NMF algorithm to derive a generic constrained matrix decomposition algorithm that includes some existing algorithms as special cases: one of them is CNMF_KL [15], recovered with α = 1. We then obtain the proposed algorithm using the Karush-Kuhn-Tucker (KKT) method as well as the projected gradient method and prove its monotonic local convergence using an auxiliary function. Comparing with the current semisupervised NMF algorithms, we analyze the classification accuracy for two fixed values of α (α = 0.5, 2) on three image data sets.
The CNMF_α algorithm does not work well for a single fixed value of α across data sets. Since the parameter α is associated with the characteristics of a learning machine, the model distribution is more inclusive when α goes to +∞ and more exclusive when α approaches −∞. The selection of the optimal value of α plays a critical role in determining the discriminative basis vectors. In this paper, we provide a method to select the parameter for our semisupervised CNMF_α algorithm; the selected value is denoted α* in what follows. The variation of α* is associated with the characteristics of image data sets. Compared with the algorithms in [15,16], our algorithm is more complete and systematic.
The rest of the paper is organized as follows. In Section 2, we give a brief overview of the standard NMF algorithm and constrained NMF. The detailed algorithms with labeled constraints and the theoretical proof of their convergence are provided in Sections 3 and 4, respectively. Section 5 presents experimental results that show the advantages of our algorithm. Finally, a conclusion is given in Section 6.

Related Work
NMF, proposed by Lee and Seung [7], is considered to provide a part-based representation and has been applied to diverse examples of nonnegative data [17][18][19][20][21], including text data mining, subsystem identification, spectral data analysis, audio and sound processing, and document clustering.
Suppose X = [x_1, . . ., x_n] ∈ R^{m×n} is a set of n training images, where each column vector x_i consists of the m nonnegative pixel values of a training image. NMF finds two nonnegative matrix factors W ∈ R^{m×r} and H ∈ R^{r×n} that approximate the original image matrix, X ≈ WH, where the positive integer r is smaller than m and n. NMF uses nonnegativity constraints alone, which makes the representation purely unsupervised. It is inapplicable to learning a basis representation from limited image information. To make up for this deficiency, extra constraints such as locality [22], sparseness [9], and orthogonality [23] have been implicitly or explicitly incorporated into NMF to identify better local features or provide a sparser representation.
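As a concrete illustration of the baseline factorization X ≈ WH, the sketch below implements the classical Lee-Seung multiplicative updates for the Frobenius cost in plain Python (all function names are our own). This is only the unconstrained building block, not the constrained algorithm derived later in this paper.

```python
import random

def matmul(A, B):
    """Naive matrix product of nested lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(X, r, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||X - WH||_F^2.

    X is an m x n nonnegative matrix (list of lists); returns W (m x r)
    and H (r x n). The eps term guards against division by zero and the
    updates keep every entry nonnegative by construction.
    """
    m, n = len(X), len(X[0])
    W = [[random.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[random.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        WT = transpose(W)
        num, den = matmul(WT, X), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(r)]
        HT = transpose(H)
        num, den = matmul(X, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(m)]
    return W, H
```

Because the updates are multiplicative, nonnegative initial factors stay nonnegative throughout, which is the property that makes NMF a parts-based decomposition.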
In [15], the authors impose label information as additional hard constraints on the NMF unsupervised learning algorithm to derive a semisupervised matrix decomposition algorithm, which makes the obtained representation more discriminating. The label information is incorporated as follows.
Suppose X = {x_i}_{i=1}^n is a data set consisting of n training images. Assume that the first l images {x_1, . . ., x_l} (l ≤ n) carry label information and that the remaining n − l images {x_{l+1}, . . ., x_n} are unlabeled. Assume there exist c classes and each image in {x_1, . . ., x_l} is assigned to one class. Then we have an l × c indicator matrix C, where c_ij = 1 if x_i belongs to class j and c_ij = 0 otherwise. From the indicator matrix C, a label constraint matrix A can be defined as the block-diagonal matrix A = diag(C, I_{n−l}), where I_{n−l} denotes the (n − l) × (n − l) identity matrix.
Imposing the label information as an additional hard constraint through an auxiliary matrix Z, the coefficients are written as H = AZ. This guarantees that h_i = h_j whenever x_i and x_j have the same label. With the label constraints, standard NMF is transformed into factorizing the large matrix X into the product of three small matrices W, Z, and A: X ≈ W(AZ)^T. Such a representation encodes the data points from the same class using the indicator matrix in a new representation space.
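The block structure of A is easy to build in code. The helper below is our own illustration (not from the paper): it constructs A = diag(C, I_{n−l}) from the class indices of the first l labeled samples. Rows of A for samples sharing a label are identical, which is exactly what forces h_i = h_j after the substitution H = AZ.

```python
def label_constraint_matrix(labels, n):
    """Build the label constraint matrix A = diag(C, I_{n-l}).

    labels: class index (0..c-1) for each of the first l labeled samples.
    n: total number of samples. Returns an n x (c + n - l) nested list.
    """
    l, c = len(labels), max(labels) + 1
    A = [[0] * (c + n - l) for _ in range(n)]
    for i, y in enumerate(labels):      # indicator block C (l x c)
        A[i][y] = 1
    for i in range(l, n):               # identity block for unlabeled samples
        A[i][c + (i - l)] = 1
    return A
```

For example, with labels [0, 1, 0] out of n = 5 samples, rows 0 and 2 of A coincide, so those two images are forced onto the same coefficient vector.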

A Constrained Algorithm Based on NMF_α
The exact form of the error measure in (1) is as crucial as the nonnegativity constraints to the success of NMF in learning a useful representation of image data.
In the research on NMF, a large number of error measures have been investigated, such as Csiszár's f-divergences [24], Amari's α-divergence [25], and Bregman divergences [26]. Here, we introduce a generic multiplicative updating algorithm [16] that iteratively minimizes the divergence between X and W(AZ)^T. We define the α-divergence as in (4), where α is a positive parameter. We combine the labeled constraints with (4) to derive the following objective function, based on the α-divergence between X and W(AZ)^T: D_α[X ‖ W(AZ)^T]. With the constraints w_ij ≥ 0 and z_jk ≥ 0, the minimization of D_α[X ‖ W(AZ)^T] can be formulated as a constrained minimization problem with inequality constraints.
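For concreteness, a commonly used form of the α-divergence in the NMF literature (e.g., Cichocki et al. [16]) is D_α(X‖Y) = (1/(α(α−1))) Σ_i [x_i^α y_i^{1−α} − αx_i + (α−1)y_i]; the paper's equation (4) may differ by a constant scaling. A minimal element-wise sketch, assuming this form:

```python
def alpha_divergence(X, Y, alpha, eps=1e-12):
    """Amari alpha-divergence between two nonnegative sequences.

    Undefined at alpha = 0 and alpha = 1, where the KL limits apply
    (alpha -> 1 recovers KL(X||Y), alpha -> 0 recovers KL(Y||X)).
    """
    if abs(alpha) < eps or abs(alpha - 1.0) < eps:
        raise ValueError("use the KL limits for alpha -> 0 or alpha -> 1")
    total = 0.0
    for x, y in zip(X, Y):
        total += (x ** alpha) * (y ** (1.0 - alpha)) - alpha * x + (alpha - 1.0) * y
    return total / (alpha * (alpha - 1.0))
```

The divergence is zero exactly when X = Y element-wise and positive otherwise, which is what makes it usable as an NMF reconstruction cost; α = 0.5 gives (twice) the squared Hellinger distance and α = 2 the Pearson χ²-divergence mentioned in the experiments.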
In the following, we will present two methods to find a local minimum of (6).

KKT Method.
Let ψ_ij ≥ 0 and φ_jk ≥ 0 be the Lagrange multipliers associated with the constraints w_ij ≥ 0 and z_jk ≥ 0, respectively. The Karush-Kuhn-Tucker conditions require that both the optimality conditions and the complementary slackness conditions ψ_ij w_ij = 0 and φ_jk z_jk = 0 be satisfied; it follows from (9) that the multipliers can be eliminated from the optimality conditions. Multiplying both sides of (7) and (8) by w_ij and z_jk, respectively, and incorporating (10), we obtain the updating rules (11) and (12).

Projected Gradient Method.
Considering the gradient descent algorithm [24,25], the updating rules for the objective function (6) can also be derived using the projected gradient method [27], where Φ(⋅) is a suitably chosen function and η_ij and μ_jk are two parameters controlling the step size of the gradient descent. Setting Φ(Ω) = Ω^α and choosing the step sizes so that the updating rules (11) and (12) hold, we obtain from (7) and (8) updating rules that are the same as (11) and (12).
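For reference, the multiplicative updates known for unconstrained α-divergence NMF (Cichocki et al. [16]) take the following form; the constrained updates (11) and (12) share this structure, with the coefficient matrix reparameterized through AZ:

```latex
h_{jk} \leftarrow h_{jk}\left(\frac{\sum_{i} w_{ij}\,\bigl(x_{ik}/(WH)_{ik}\bigr)^{\alpha}}{\sum_{i} w_{ij}}\right)^{1/\alpha},
\qquad
w_{ij} \leftarrow w_{ij}\left(\frac{\sum_{k} h_{jk}\,\bigl(x_{ik}/(WH)_{ik}\bigr)^{\alpha}}{\sum_{k} h_{jk}}\right)^{1/\alpha}.
```

At α = 1 the exponent 1/α disappears and these reduce to the familiar Kullback-Leibler multiplicative updates of Lee and Seung, consistent with CNMF_KL being the α = 1 special case.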
We have shown that the algorithm can be derived using the Karush-Kuhn-Tucker conditions and have presented an alternative derivation using the projected gradient. Agreement between the two methods supports the correctness of the algorithm theoretically.
In the following, we give a theorem that guarantees the convergence of the iterations in updates (11) and (12).

Theorem 1. For the objective function (6), D_α[X ‖ W(AZ)^T] is nonincreasing under the updating rules (11) and (12). The objective function is invariant under these updates if and only if W and Z are at a stationary point.
Multiplicative updates for our constrained algorithm based on NMF_α are given in (11) and (12). These updates find a local minimum of D_α[X ‖ W(AZ)^T], which is the final solution of (11) and (12). Note that, when α = 1, the updates (11) and (12) coincide with the CNMF_KL algorithm [15], which is thus a special case of our generic constrained matrix decomposition algorithm. In the following, we give the proof of Theorem 1.

Convergence Analysis
To prove Theorem 1, we make use of an auxiliary function of the kind used in the expectation-maximization algorithm [28,29].

Definition 2. A function G(h, h') is called an auxiliary function for F(h) if the two conditions G(h, h') ≥ F(h) and G(h, h) = F(h) are both satisfied.

Lemma 3. If G(h, h') is an auxiliary function for F(h), then F(h) is nonincreasing under the update h^{(t+1)} = arg min_h G(h, h^{(t)}).

Proof. Consider F(h^{(t+1)}) ≤ G(h^{(t+1)}, h^{(t)}) ≤ G(h^{(t)}, h^{(t)}) = F(h^{(t)}).
It can be observed that the equality F(h^{(t+1)}) = F(h^{(t)}) holds only if h^{(t)} is a local minimum of G(h, h^{(t)}). Iterating the update in (18), we obtain a sequence of estimates that converges to a local minimum h_min = arg min_h F(h) of the objective function. In the following, we show that the objective function (6) is nonincreasing under the updating rules (11) and (12) by defining appropriate auxiliary functions with respect to z_jk and w_ij.

Lemma 4. The function G(z, v) defined in (20), with the notation in (21), is an auxiliary function for F(z).

Proof. Obviously, G(z, z) = F(z). According to the definition of an auxiliary function, we only need to prove that G(z, v) ≥ F(z). To do this, we use the convexity of the function Φ(⋅) for positive α to rewrite the α-divergence function F(z) as in (22), with the weights λ_j(v) defined in (23). Note that Σ_j λ_j(v) = 1 and λ_j(v) ≥ 0 by the definition of λ_j(v). Applying Jensen's inequality [30] leads to the bound (24), from which it follows that G(z, v) ≥ F(z), satisfying the conditions of an auxiliary function.
Reversing the roles of z_jk and w_ij in Lemma 4, we define in (26) an auxiliary function G(w, ŵ) for the update (12); this is the content of Lemma 5, which can be proved in the same way as Lemma 4. From Lemmas 4 and 5, we can now demonstrate the convergence stated in Theorem 1.
Proof. To guarantee that F(z) is nonincreasing, by Lemma 3 we only need to obtain the minimum of G(z, v) with respect to z_jk. Setting the gradient of (20) to zero and solving for z_jk yields an expression of the same form as the updating rule (11).
Similarly, to guarantee that the updating rule (12) holds, the minimum of G(w, ŵ), which can be determined by setting the gradient of (26) to zero, must exist.

Experiments
In this section, the CNMF_α algorithm is systematically compared with current constrained NMF algorithms on three image data sets, namely, the ORL Database [31], the Yale Database [32], and the Caltech 101 Database [33]. The details of the three databases are described individually later. We first introduce the evaluated algorithms.
(i) The constrained nonnegative matrix factorization algorithm in [15] aims to minimize the F-norm cost.
(ii) The constrained nonnegative matrix factorization algorithm with parameter α = 0.5 in this paper aims to minimize the Hellinger divergence cost.
(iii) The constrained nonnegative matrix factorization algorithm with parameter α = 1 in [15], aiming at minimizing the KL-divergence cost, is the best reported algorithm in image representation.
(iv) The constrained nonnegative matrix factorization algorithm with parameter α = 2 in this paper aims to minimize the χ²-divergence cost.
(v) The constrained nonnegative matrix factorization algorithm with parameter α = α* in this paper aims to minimize the α-divergence cost, where the parameter α* is associated with the characteristics of the image database and designed by the presented method. The CNMF_KL algorithm is a special case of our CNMF_{α*} algorithm with α = α* = 1.
We apply these algorithms to a classification problem and evaluate their performance on three image data sets containing a number of different categories of images. For each data set, the evaluations are conducted with different numbers of clusters; here the number of clusters k varies from 2 to 10. We randomly choose k categories from one image data set and mix the images of these k categories as the collection X. Then, for the semisupervised algorithms, we randomly pick 10 percent of the images from each category in X and record their category numbers as the available label information to obtain the label matrix A. For some special data sets, the labeling process differs, and we describe the details later.
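The sampling protocol above can be sketched as follows; the helper is our own illustration (function and variable names are ours), assuming the images are grouped by category in a dictionary:

```python
import random

def make_semi_supervised_split(images_by_category, k, label_frac=0.1, seed=0):
    """Pick k categories, pool their images, and label ~10% per category.

    Returns the mixed collection (a flat list) and a dict mapping the index
    of each labeled image in that collection to its category.
    """
    rng = random.Random(seed)
    cats = rng.sample(sorted(images_by_category), k)
    collection, labels = [], {}
    for c in cats:
        imgs = images_by_category[c]
        n_lab = max(1, int(label_frac * len(imgs)))   # at least one label
        labeled = set(rng.sample(range(len(imgs)), n_lab))
        for i, img in enumerate(imgs):
            idx = len(collection)
            collection.append(img)
            if i in labeled:
                labels[idx] = c
    return collection, labels
```

The returned index-to-category dict is precisely the information needed to fill the indicator block C of the label constraint matrix A.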
Suppose a data set has k categories c_1, c_2, . . ., c_k, and the cardinalities of the labeled images in these categories are m_1, m_2, . . ., m_k, respectively. Since the label constraint matrix A is composed of the indicator matrix C and the identity matrix I, the indicator matrix C plays a critical role in the classification performance for the different categories in X. To quantify the effectiveness of C, we define in (30) and (31) a measure that represents the relationship between the cardinalities of the labeled samples and the total samples. The value of α* computed by (31) is associated with the characteristics of the image data set, since its variation is caused by both the cardinalities of labeled samples in each category and the total samples. We can obtain both quantities in our semisupervised setting. However, we cannot determine the cardinalities of labeled images exactly in many real-world applications. Moreover, the value of α varies depending on the data set. It is still an open problem how to select the optimal α [16].
To evaluate the classification performance, we define classification accuracy as the first measure. Our CNMF_α algorithm described in (11) and (12) provides a classification label for each sample. Suppose X = {x_i}_{i=1}^n is a data set consisting of n training images. For each sample, let l_i be the true class label provided by the image data set. If the image x_i is assigned to its true class, we evaluate it as a correct label; otherwise, it is counted as a false label. Eventually, we compute the percentage of correct labels as the accuracy measure defined in (32). As the second measure of classification performance, we compute the normalized mutual information, which quantifies how similar two sets of clusters are. Given two sets of clusters C and C', their normalized mutual information is defined in (33) and takes values between 0 and 1. Here p(c_i) and p(c'_j) denote the probabilities that an image arbitrarily chosen from the data set belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) denotes the joint probability that this arbitrarily selected image belongs to both c_i and c'_j at the same time. H(C) and H(C') are the entropies of C and C', respectively. Experimental results on each data set are presented as classification accuracy and normalized mutual information in Tables 1 and 2.
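Both measures are straightforward to compute. The sketch below is our own; for the normalized mutual information we use one common normalization, the geometric mean of the two entropies, which matches the stated 0-1 range, though the paper's equation (33) may normalize differently:

```python
from collections import Counter
from math import log

def accuracy(true_labels, pred_labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return hits / len(true_labels)

def nmi(a, b):
    """Normalized mutual information between two labelings of the same samples."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # Mutual information from empirical marginal and joint frequencies.
    mi = sum((c / n) * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    # Shannon entropies of the two labelings.
    ha = -sum((c / n) * log(c / n) for c in pa.values())
    hb = -sum((c / n) * log(c / n) for c in pb.values())
    return mi / max((ha * hb) ** 0.5, 1e-12)
```

Unlike accuracy, NMI is invariant to a permutation of the cluster labels, which is why it complements accuracy when the learned clusters are not aligned with the ground-truth class indices.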

ORL Database.
The Cambridge ORL Face Database has 400 images of 40 different people, 10 images per person. For some people, the images were taken at different times, with slightly varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position, allowing slight left-right out-of-plane rotation. To locate the faces, the input images are preprocessed: they are resized to 32 × 32 pixels with 256 gray levels per pixel and normalized in orientation so that the two eyes in the facial areas are aligned at the same position.
There are 10 images for each category in ORL, and 10 percent is just one image. For the fixed parameters α (α = 0.5, 1, 2), we randomly choose two images from each category to provide the label information. Note that labeling the same number of images in every category is uninformative for (30). To obtain the measure in (30), we divide the 40 categories into 3 groups of 10, 20, and 10 categories. In the first 10 categories, we pick 1 image from each category to provide the label information; we pick 2 images from each category in the next 20 categories; and we pick 3 from each category in the remaining categories. The dividing process is repeated 10 times, and the average classification accuracy is recorded as the final result.
Figure 1 shows the classification accuracy rates and normalized mutual information on the ORL Database. Note that if the samples in the collection X come from the same group, we set α* = 0.67. Because each category contains the same number of samples, the variation of α* is small even when we label different cardinalities of samples. Compared with the constrained nonnegative matrix factorization algorithms with fixed parameters, our CNMF_{α*} algorithm gives the best performance, since the selected α* suits the collection X. Table 1 summarizes the detailed classification accuracy and error bars of CNMF_{α=0.5}, CNMF_{α=2}, and CNMF_{α*}. It shows that our CNMF_{α*} algorithm achieves a 1.92 percent improvement over the best reported CNMF_KL algorithm (α = 1) [15] in average classification accuracy. For normalized mutual information, the details and error bars of our constrained algorithms with α = 0.5, α = 2, and α* are listed in Table 2. Compared with the best baseline algorithm, CNMF, our CNMF_{α*} algorithm achieves a 0.54 percent improvement.

Yale Database.
The Yale Database consists of 165 grayscale images of 15 individuals, 11 images per person. Each image corresponds to a different facial expression or configuration: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. We preprocess all the images of the Yale Database in the same way as the ORL Database. Each image is linearly stretched to a 1,024-dimensional vector in image space.
The Yale Database also has the same number of samples in each category. To obtain an appropriate α*, we perform label processing similar to that of the ORL Database: we divide the 15 individuals evenly into 3 groups, choose 1 image from each category in the first group, 2 from each category in the second, and 3 from each category in the remaining group. We repeat the process 10 times and record the average classification accuracy as the final result.
Figure 2 shows the classification accuracy and normalized mutual information on the Yale Database. We set α* = 0.55 when the measure in (30) evaluates to zero; this indicates that the samples in the collection X come from just two groups, that is, 15 images chosen from 10 categories in the first and second groups. CNMF_{α*} achieves an extraordinary performance in all cases, with CNMF_{α=0.5} following. This suggests that the constrained nonnegative matrix factorization algorithm has a higher classification accuracy when the value of α is close to α*. Compared with the best reported CNMF_KL algorithm, CNMF_{α*} achieves a 2.42 percent improvement in average classification accuracy. For normalized mutual information, CNMF_{α*} achieves a 7.2 percent improvement over CNMF. The details of the classification accuracy and normalized mutual information are provided in Tables 3 and 4, which contain the error bars of CNMF_{α=0.5}, CNMF_{α=2}, and CNMF_{α*}.

Caltech 101 Database.
The Caltech 101 Database, created by Caltech University, has images of 101 different object categories. Each category contains about 31 to 800 images, with a total of 9,144 samples of size roughly 300 × 300 pixels. This database is particularly challenging for learning a basis representation of image information, because the number of training samples per category is exceedingly small. In our experiment, we select the 10 largest categories (3,044 images in total), except the background category. To represent the input images, we preprocess them using codewords generated from SIFT features [34]. We obtain 555,292 SIFT descriptors and generate 500 codewords. By assigning the descriptors to the closest codewords, each image in the Caltech 101 database is represented by a 500-dimensional frequency histogram. We randomly select k categories from the Faces-Easy category in the Caltech 101 database and convert them to gray scale at 32 × 32. The labeling process is repeated 10 times, and the obtained values of α* computed by (31) are listed in Table 7. The variation of α* within the same k categories is great; that is, selecting an appropriate α* plays a critical role for the mixture of different categories in one image data set, especially when the number of samples in each category differs. The choice of α* can fully reflect the effectiveness of the indicator matrix C.
Figure 3 shows the effect of applying the proposed algorithm with α*. The upper part of the figure shows the original samples, which contain 26 images; the middle part shows their gray images; and the lower part shows the combination of the basis vectors learned by CNMF_{α*}.
The classification accuracy results and normalized mutual information for the Faces-Easy category in the Caltech 101 database are detailed in Tables 5 and 6, which contain the error bars of CNMF_{α=0.5}, CNMF_{α=2}, and CNMF_{α*}. The graphical results of the classification performance are shown in Figure 4. The best performance in this experiment is achieved when the parameters α* listed in Table 7 are selected. In general, our method demonstrates much better effectiveness in classification by choosing α*. Compared with the best reported algorithm other than our CNMF_α algorithm, CNMF_{α*} achieves a 2.4 percent improvement in average classification accuracy, and compared with the CNMF algorithm [15], CNMF_{α*} achieves an 8.77 percent improvement in average classification accuracy. For normalized mutual information, CNMF_{α*} achieves a 2.29 percent improvement and consistently outperforms the other algorithms.

Conclusion
In this paper, we present a generic constrained nonnegative matrix factorization algorithm by imposing label information as an additional hard constraint on the α-divergence-NMF algorithm. The proposed algorithm can be derived using the Karush-Kuhn-Tucker conditions, and an alternative derivation is presented using the projected gradient; agreement between the two methods supports the correctness of the algorithm theoretically. The image representation learned by our algorithm contains a parameter α. Since the α-divergence is a parametric discrepancy measure and the parameter α is associated with the characteristics of a learning machine, the model distribution is more inclusive when α goes to +∞ and more exclusive when α approaches −∞. The selection of the optimal value of α plays a critical role in determining the discriminative basis vectors. We provide a method to select the parameter α* for our semisupervised CNMF_α algorithm. The variation of α* is caused by both the cardinalities of labeled samples in each category and the total samples.


Figure 1: Classification performance on the ORL Database.

Figure 2: Classification performance on the Yale Database.

Figure 3: Sample images used in the YaleB Database. As shown in the three 2 × 13 montages, (a) shows the original images, (b) the gray images, and (c) the basis vectors learned by CNMF_{α*}. Positive values are illustrated with white pixels.

Figure 4: Classification performance on the Caltech 101 Database.

Table 1: The comparison of classification accuracy rates on the ORL Database.

Table 2: The comparison of normalized mutual information on the ORL Database.

Table 3: The comparison of classification accuracy rates on the Yale Database.

Table 4: The comparison of normalized mutual information on the Yale Database.

Table 5: The comparison of classification accuracy rates on the Caltech 101 Database.

Table 6: The comparison of normalized mutual information on the Caltech 101 Database.

Table 7: The value of α* for k categories.

In the experiments, we apply the fixed parameters α and the selected α* to analyze the classification accuracy on three image databases. The experimental results demonstrate that the CNMF_{α*} algorithm has the best classification accuracy. However, we cannot obtain the cardinalities of labeled images exactly in many real-world applications. Moreover, the value of α varies depending on the data set. It is still an open problem how to select the optimal α for a specific image data set.