We study the problem of detecting and localizing objects in still, gray-scale images making use of the part-based representation provided by nonnegative matrix factorizations. Nonnegative matrix factorization represents an emerging example of subspace methods, which is able to extract interpretable parts from a set of template image objects and then to additively use them for describing individual objects. In this paper, we present a prototype system based on some nonnegative factorization algorithms, which differ in the additional properties added to the nonnegative representation of data, in order to investigate if any additional constraint produces better results in general object detection via nonnegative matrix factorizations.
1. Introduction
The notion of low dimensional approximation has played a fundamental role in effectively and efficiently processing and conceptualizing huge amounts of data stored in large sparse matrices. In particular, subspace techniques, such as Singular Value Decomposition [1], Principal Component Analysis (PCA) [2], and Independent Component Analysis [3], represent a class of linear algebra methods widely adopted to analyze high dimensional datasets in order to discover latent structures by projecting the data onto a low dimensional space. Generally, a subspace method is characterized by learning a set of basis vectors from a set of suitable data templates. These vectors span a subspace which is able to capture the essential structure of the input data. Once the subspace has been found (during the off-line learning phase), the detection of a new sample can be accomplished (in the so-called on-line detection phase) by projecting it onto the subspace and finding its nearest neighbor among the templates projected onto this subspace. These methods have found efficient applications in several areas of information retrieval, computer vision, and pattern recognition, especially in the fields of face identification [4, 5], recognition of digits and characters [6, 7], and molecular pattern discovery [8, 9].
However, the pertinent information stored in many data matrices is often nonnegative (examples are pixels in images, the probability of a particular topic appearing in a linguistic document, the amount of pollutant emitted by a factory, and so on [10–15]). During the analysis process, taking this nonnegativity constraint into account could bring some benefits in terms of interpretability and visualization of large scale data, while preserving physical feasibility more closely. Nevertheless, classical subspace methods describe data as a combination of elementary features involving both additive and subtractive components; hence, they are not able to guarantee the conservation of nonnegativity.
The recent approach of low-rank nonnegative matrix factorization (NMF) becomes particularly attractive to obtain a reduced representation of data by using additive components only. This idea has been motivated in a couple of ways. Firstly in many applications (e.g., by the rules of physics) one knows that the quantities involved cannot be negative. Secondly, nonnegativity has been argued for based on the intuition that parts are generally combined additively (and not subtracted) to form a whole; moreover, psychological and physiological principles assume that humans learn objects part-based. Hence, the nonnegativity constraints might be useful for learning part-based representations [16].
In this paper, we investigate the problem of performing “generic” object detection in images using the framework of NMF. By “generic” detection, we mean detecting, inside a given image, classes of objects, such as any car or any face, rather than finding a specific object (class instance), such as a particular car or a particular face.
Generally, object detection task is accomplished by comparing object similarities to a small number of reference features which can be expressed in holistic (global) or sparse (local) terms and then adopting a learning mechanism to identify regions in the feature space that correspond to the object class of interest. Among subspace techniques, PCA constitutes an example of approach which adopts global descriptors related to the variance of the image space (the so-called eigenfaces) to visually represent a set of given face images [17]. Other holistic approaches are based on global descriptors expressed by color, texture histogram, and global image transformations [18]. On the other hand, local features have been proved to be invariant regarding noise, occlusion or pose view and they are also supported by the theory of “recognition-by-components” introduced in [19]. The most adopted features of local type are Gabor features [20], wavelet features [21], and rectangular features [22]. Some approaches using part-based representation were proposed in [23, 24], but they present the drawback of requiring manually defined object parts and vocabulary of parts to represent object in the target class. More recently, automatic extraction of parts possessing high information contents in terms of local signal change has been illustrated in [25] together with a classifier based on a sparse representation of patches extracted around interesting points in the image.
The nonnegativity constraints of NMF make this subspace method a promising technique to automatically extract parts describing the structure of object classes. In fact, these localized parts can be combined in a purely additive way (with varying combination coefficients) to describe individual objects, so NMF could be used as a learning mechanism to extract interpretable parts from a set of template images. Moreover, making use of the concept of distance from the subspace spanned by the extracted parts, NMF could also be adopted as a learning method to detect whether or not an object is present inside a given image.
An interesting example of part-based representation of the original data can be found in the context of image articulation libraries. Here, NMFs are able to extract realistic parts (limbs) from images depicting stick figures with four limbs in different articulations. However, it should be pointed out that the existence of such a part-based representation heavily depends on the objects themselves [26].
The first NMF algorithms to be proposed (the multiplicative and additive update rules presented in [11]) have been applied in the field of face identification to decompose a face image into parts reminiscent of features such as lips, eyes, and nose. More recently, comparisons between other nonnegative part-based algorithms (such as nonnegative sparse coding and local NMF) have been presented in the context of facial feature learning, demonstrating a good performance in terms of detection rate by using only a small number of basis components [27]. A preliminary comparison of three NMF algorithms (classical multiplicative NMF [11], local NMF [28], and discriminant NMF [29]) has been illustrated in [30] on the recognition of different objects in color images. Moreover, results on the influence of additional constraints on NMF, such as the sparseness proposed in [31], have been presented in [32] for various dimensions of subspaces generated for object recognition tasks (particularly, face recognition and handwritten digit identification).
Here, we investigate the problem of performing detection of single objects in images using different NMF algorithms, in order to inquire if the representation provided by the NMF framework can effectively produce added value in detecting and locating objects inside images. The problem to be explored here can be formalized as follows. Given a collection of template images representing objects of the same class, that is a group of objects which may differ slightly from each other visually but correspond to the same semantic concept, for example, cars, digits, and faces, we would like to understand if NMF is able to provide some kind of local feature representations which can be used to individuate objects in test images.
The rest of the paper is organized as follows. The next section describes the mathematical problem of computing nonnegative matrix factorization and reviews some of the algorithms proposed in the literature and adopted to learn such a matrix decomposition model. These algorithms will constitute the core of an object detection prototype system based on the learning via NMF, proposed in Section 3 together with a brief description of its off-line and on-line learning phases. Section 4 presents experimental results illustrating the properties of the adopted NMF learning algorithms and their performance in detecting objects in real images. Finally, Section 5 concludes with a summary and possible directions for future work.
2. Mathematical Background and Algorithms
The problem of finding a nonnegative low dimensional approximation of a set of data templates stored in a high-dimensional data matrix $V\in\mathbb{R}_+^{n\times m}$ can be stated as follows.
Given an initial dataset expressed by an $n\times m$ matrix V, where each column is an n-dimensional nonnegative vector of the original database (m vectors), find an approximate decomposition of the data matrix into a basis matrix $W\in\mathbb{R}_+^{n\times r}$ and an encoding variable matrix $H\in\mathbb{R}_+^{r\times m}$, both having nonnegative elements, such that $V\approx WH$.
Generally the rank r of the matrices W and H is much lower than the rank of V (usually it is chosen so that (n+m)r<nm). Each column of the matrix W contains a base vector of the spanned (NMF) subspace, while each column of H represents the weights needed to approximate the corresponding column in V by means of the vectors in W.
The NMF is actually a conical coordinate transformation: Figure 1 provides a graphical interpretation in a two dimensional space. The two basis vectors $w_1$ and $w_2$ describe a cone which encloses the dataset V. Due to the nonnegativity constraint, only points within this cone can be reconstructed through a linear combination of these basis vectors:
$$v' = (w_1, w_2)\cdot(h_1, h_2)^\top. \tag{1}$$
Figure 1: Nonnegative matrix factorization as a conical coordinate transformation: illustration in two dimensional space.
The factorization $V\approx WH$ has the disadvantage that its factors are not unique. For example, if an invertible matrix $A\in\mathbb{R}^{r\times r}$ can be found such that the two matrices $W'=WA$ and $H'=A^{-1}H$ are nonnegative, then another factorization $V\approx W'H'$ exists. Such a transformation is always possible if A is an invertible nonnegative monomial matrix (a matrix is called monomial if there is exactly one element different from zero in each row and column). However, in this case the result of the transformation is simply a scaling and permutation of the original matrices [33].
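The scaling-and-permutation ambiguity can be checked numerically. The following is a minimal sketch, assuming NumPy; the matrix sizes, the random seed, and the particular monomial matrix A are illustrative choices that do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 3))   # nonnegative basis matrix (n=6, r=3), illustrative sizes
H = rng.random((3, 5))   # nonnegative encoding matrix (r=3, m=5)

# Nonnegative monomial matrix: exactly one positive entry per row and column.
A = np.zeros((3, 3))
A[[0, 1, 2], [2, 0, 1]] = [2.0, 0.5, 4.0]

W2 = W @ A                   # columns of W permuted and rescaled
H2 = np.linalg.inv(A) @ H    # rows of H permuted and rescaled to compensate

same_product = np.allclose(W @ H, W2 @ H2)
still_nonnegative = bool((W2 >= 0).all() and (H2 >= -1e-12).all())
```

Both factor pairs reproduce the same product, which is why two runs of an NMF algorithm can return bases that differ in order and scale.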
An NMF of a given data matrix V can be obtained by finding a solution of a nonlinear optimization problem over a specified error function. Two simple error functions are often used to measure the distance between the original data V and its low dimensional approximation WH. The first is the sum of squared errors (also known as the squared Euclidean distance), which leads to the minimization of the functional
$$\|V-WH\|^2 \tag{2}$$
subject to the nonnegativity constraints on the elements $W_{ij}$ and $H_{ij}$. The second is the generalized Kullback-Leibler divergence for nonnegative matrices,
$$\mathrm{Div}(V\,\|\,WH)=\sum_{ij}\left(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij}\right), \tag{3}$$
again subject to the nonnegativity of the matrices W and H.
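The two error functions can be evaluated directly from their definitions. The sketch below assumes NumPy; the function names and the small constant `eps`, added only to avoid taking the logarithm of zero, are assumptions of this example.

```python
import numpy as np

def squared_euclidean(V, W, H):
    """Sum of squared errors between V and its approximation WH."""
    R = V - W @ H
    return float(np.sum(R * R))

def generalized_kl(V, W, H, eps=1e-12):
    """Generalized Kullback-Leibler divergence Div(V || WH).

    eps is a hypothetical guard against log(0) and division by zero.
    """
    WH = W @ H
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))
```

Both functions return zero when V equals WH exactly and grow as the approximation degrades.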
2.1. Classical Algorithm
The most popular approach to numerically solve the NMF optimization problem is the multiplicative update algorithm proposed in [11]. In particular, it can be shown that the squared Euclidean distance measure (2) is nonincreasing under the iterative update rules described in Algorithm 1.
Algorithm 1: The Lee and Seung multiplicative update rules (NMF).
Initialize nonnegative matrices W(0) and H(0)
while stopping criteria are not satisfied do
  W←W⊙(VH⊤)⊘(WHH⊤)
  H←H⊙(W⊤V)⊘(W⊤WH)
end while
{⊙ and ⊘ denote the Hadamard product (i.e., the element-wise matrix multiplication) and the element-wise division, respectively}
The Lee and Seung update rules can be interpreted as a diagonally rescaled gradient descent method (i.e., a gradient descent method using a rather large learning rate). It has been proved that the cost function is nonincreasing under these updates; in practice, the iterations settle at a stationary point, which is typically a local minimum. Other techniques, such as the alternating nonnegative least squares method or bound-constrained optimization algorithms (for example, the projected gradient method), have also been used when additional constraints are added to the nonnegativity of the matrices W or H [34–36].
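The multiplicative update rules translate almost line by line into array code. The following is a minimal sketch, assuming NumPy; the function name, the fixed iteration count used as stopping criterion, the random initialization, and the small constant `eps` in the denominators are assumptions of this example rather than details of the original implementation.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for the squared Euclidean cost.

    Returns W (n x r), H (r x m) and the error after each iteration.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps   # nonnegative random initialization
    H = rng.random((r, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Element-wise products and quotients play the role of the
        # Hadamard product and the element-wise division.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        errors.append(float(np.linalg.norm(V - W @ H) ** 2))
    return W, H, errors
```

Because every factor in the updates is nonnegative, W and H stay nonnegative without any explicit projection, and the recorded error sequence can be inspected to check the nonincreasing behavior.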
2.2. NMF Algorithms with Orthogonal Constraints
Unlike other subspace methods, the basis vectors learned by NMF are not orthonormal to each other. Different modifications of the standard cost functions (2) and (3) have been proposed to include further constraints on the factors W and/or H, such as orthogonality or sparsity.
Concerning the possibility of making the basis or the encoding matrix closer to the Stiefel manifold, that is, the set of all real l×k matrices with orthonormal columns $\{Q\in\mathbb{R}^{l\times k}\mid Q^\top Q=I_k\}$, where $I_k$ is the k×k identity matrix (which means that the vectors in W or H should be orthonormal to each other), two different update rules have been proposed in [37] to add orthogonality on W or H, respectively. In particular, when one desires that $W^\top W$ is as close as possible to the identity matrix of conformal dimension (i.e., $W^\top W\approx I_r$), the multiplicative update rules of Algorithm 1 can be modified as described in Algorithm 2 (see [38] for details).
Algorithm 2: NMF with orthogonal constraint on W.
Initialize nonnegative matrices W(0) and H(0)
while stopping criteria are not satisfied do
  H←H⊙(W⊤V)⊘(W⊤WH)
  W←W⊙((VH⊤)⊘(WW⊤VH⊤))^(1/2)
end while
{⊙ and ⊘ denote the Hadamard product and the element-wise division, respectively, and (·)^(1/2) denotes the element-wise square root operation}
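A possible realization of the orthogonally constrained updates is sketched below, assuming NumPy. The encoding matrix H is updated with the standard multiplicative rule of Algorithm 1, which is the dimensionally consistent choice, and W with the square-root rule; the function names, the iteration count, and the constant `eps` are assumptions of this example.

```python
import numpy as np

def nmf_orthogonal_w(V, r, n_iter=200, eps=1e-9, seed=0):
    """NMF with a multiplicative update that pushes W^T W toward identity."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # Standard multiplicative rule for the encoding matrix.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Square-root rule for the basis matrix.
        VHt = V @ H.T
        W *= np.sqrt(VHt / (W @ (W.T @ VHt) + eps))
    return W, H

def orth_error(W):
    """Frobenius norm of W^T W - I, the orthogonality measure orth(W)."""
    r = W.shape[1]
    return float(np.linalg.norm(W.T @ W - np.eye(r), "fro"))
```

The helper `orth_error` can be evaluated along the iterations to monitor how far the learned basis is from having orthonormal columns.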
Different orthogonal NMF algorithms have been derived by directly using the true gradient on the Stiefel manifold [38, 39], thereby imposing the orthogonality between nonnegative basis vectors while learning the decomposition.
An interesting issue, strictly tied to the computation of the orthogonal NMF when the adopted cost function is the generalized KL-divergence, is its connection with some probabilistic latent variable models. In particular, it has been pointed out in [40] that the objective function of a probabilistic latent semantic indexing model is the same as the objective function of NMF with an additional orthogonality constraint. Moreover, when the encoding matrix H is required to possess orthogonal rows (i.e., $HH^\top=I_r$), it can be proved that orthogonal NMFs are equivalent to the K-means clustering algorithm [40, 41].
2.3. NMF Algorithm with Localization Constraints
NMF algorithms optimizing a slight variation of the KL-divergence (3) can be adopted to yield a factorization which reveals local features in the data, as proposed in [28]. In particular, local nonnegative matrix factorization uses the error function
$$\sum_{ij}\left(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij}+\alpha U_{ij}\right)-\beta\sum_{i}Q_{ii}, \tag{4}$$
where α,β>0 are constants, $U=W^\top W$, and $Q=HH^\top$. The function (4) is the KL-divergence (3) with additional terms designed to enforce the locality of the basis features. In particular, the modified objective function (4) attempts to minimize the number of basis components required to represent the dataset V and the redundancy between different bases, trying to make them as orthogonal as possible. Moreover, it maximizes the total activity on each component, that is, the total squared projection coefficients summed over all training data, so that only the bases containing the most important information are retained. The iterative update rules derived from the error function (4) are described in Algorithm 3.
Algorithm 3: Local nonnegative matrix factorization update rules.
Initialize nonnegative matrices W(0) and H(0)
while stopping criteria are not satisfied do
  H←(H⊙(W⊤(V⊘WH)))^(1/2)
  W←W⊙((V⊘WH)H⊤)
  W←W diag(‖W_{*1}‖_1,‖W_{*2}‖_1,…,‖W_{*r}‖_1)^(-1)
end while
{⊙ and ⊘ denote the Hadamard product and the element-wise division, respectively; (·)^(1/2) denotes the element-wise square root operation; diag(‖W_{*1}‖_1,‖W_{*2}‖_1,…,‖W_{*r}‖_1) indicates the r×r diagonal matrix whose diagonal elements are the 1-norms of the column basis vectors of W}
It has been proved that the update rules in Algorithm 3 decrease monotonically the objective function (4) to a local minimum.
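The three steps of Algorithm 3 can be sketched as follows, assuming NumPy. The function name, the fixed number of iterations, and the constant `eps` protecting the divisions are assumptions of this example; the constants α and β do not appear because they are absent from the update rules as stated.

```python
import numpy as np

def lnmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Local NMF updates: square-root rule for H, multiplicative rule
    for W, followed by normalizing each column of W to unit 1-norm."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        H = np.sqrt(H * (W.T @ (V / (W @ H + eps))))
        W = W * ((V / (W @ H + eps)) @ H.T)
        # Right-multiplication by the inverse diagonal matrix of column
        # 1-norms amounts to dividing each column by its sum (W >= 0).
        W = W / (W.sum(axis=0, keepdims=True) + eps)
    return W, H
```

After each iteration the columns of W sum to one, so the scale of the factorization is carried entirely by H.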
2.4. NMF Algorithm with Sparseness Constraints
NMF algorithms can be extended to include the option to control sparseness explicitly in order to discover parts-based representations that are qualitatively better than those given by standard NMF, as proposed in [31]. In particular, to quantify the sparseness of a generic vector $x\in\mathbb{R}^k$, the following relationship between the 1-norm and the Euclidean norm has been adopted (Hoyer's original paper uses the terminology L1-norm and L2-norm):
$$\mathrm{sparseness}(x)=\frac{\sqrt{k}-\|x\|_1/\|x\|_2}{\sqrt{k}-1}. \tag{5}$$
Function (5) assumes values in the interval [0,1], where 0 indicates the minimum degree of sparsity, obtained when all the elements $x_i$ possess the same absolute value, while 1 indicates the maximum degree of sparsity, which is reached when only one component of the vector x is different from zero. This measure can be adopted to impose a desired degree of sparseness on the vectors in W and/or on the encoding coefficient vectors in H, depending on the specific application for which the nonnegative decomposition is sought.
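The two extreme cases of the sparseness measure are easy to verify numerically. The sketch below assumes NumPy and the standard form of Hoyer's measure, based on the square root of the vector length k; the function name is an assumption of this example, and the input is assumed to be a nonzero vector with more than one component.

```python
import numpy as np

def sparseness(x):
    """Hoyer sparseness of a nonzero vector with k > 1 components."""
    x = np.asarray(x, dtype=float).ravel()
    k = x.size
    l1 = np.abs(x).sum()            # 1-norm
    l2 = np.sqrt((x * x).sum())     # Euclidean norm
    return float((np.sqrt(k) - l1 / l2) / (np.sqrt(k) - 1.0))

dense_value = sparseness([1.0, 1.0, 1.0, 1.0])    # all entries equal
sparse_value = sparseness([0.0, 0.0, 3.0, 0.0])   # a single nonzero entry
```

Here `dense_value` is 0 and `sparse_value` is 1, matching the two limiting cases described above.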
To compute NMF with sparseness constraints, a projected gradient descent algorithm has been developed. This algorithm essentially takes a step in the direction of the negative gradient of the cost function (2) and subsequently projects onto the constraint space, that is, the cone of nonnegative matrices with a prescribed degree of sparseness, which is ensured by imposing that sparseness($W_i$)=$s_W$ and sparseness($H_i$)=$s_H$, where $W_i$ and $H_i$ are the ith columns of W and H, respectively, and $s_W$ and $s_H$ are the desired degrees of sparseness. The update rules used to compute W and H are described in Algorithm 4.
Algorithm 4: NMF with sparseness constraints.
Input: positive constants μW>0, μH>0
Choose an appropriate Projection(·) to ensure the desired degree of sparseness
Initialize nonnegative matrices W(0) and H(0)
while stopping criteria are not satisfied do
  Wij←Wij-μW((WHH⊤)ij-(VH⊤)ij)
  W←Projection(W)
  Hij←Hij-μH((W⊤WH)ij-(W⊤V)ij)
  H←Projection(H)
end while
{μW>0 and μH>0 are positive constants representing the step sizes of the algorithm and Projection(·) indicates the appropriate projection operator}
It should be observed that when no sparsity constraint is required on W or H, the corresponding update rule is the one provided by Algorithm 1 (interested readers are referred to [31] for further details on this algorithm).
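The structure of one iteration of the projected gradient scheme is sketched below, assuming NumPy. As a loudly stated simplification, the placeholder `project_nonnegative` only clips negative entries to zero; it is not the sparseness projection of [31], which additionally fixes the 1-norm and the Euclidean norm of each vector and would be passed in through the `projection` argument. All names are assumptions of this example.

```python
import numpy as np

def project_nonnegative(M):
    """Placeholder projection: clip to the nonnegative orthant only.
    This is NOT the sparseness projection of [31]."""
    return np.maximum(M, 0.0)

def projected_gradient_step(V, W, H, mu_W, mu_H,
                            projection=project_nonnegative):
    """One gradient step on W and on H, each followed by a projection."""
    W = projection(W - mu_W * (W @ H @ H.T - V @ H.T))
    H = projection(H - mu_H * (W.T @ W @ H - W.T @ V))
    return W, H
```

With sufficiently small step sizes, each call does not increase the squared error, and the projection keeps both factors feasible.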
3. Object Detection System Based on NMF
In this section, we schematically present an object detection prototype system based on the learning via NMF. The working flow of the prototype system can be roughly divided in two main phases: the off-line learning phase and the on-line detection phase (mainly devoted to the object location activity).
The off-line learning phase consists in preparing the training image data and then learning a proper subspace representation of them. To comply with the format of the data matrix V (in order to obtain one of its nonnegative factorizations), each given p×q training image has to be converted into a pq-dimensional column vector (stacking the columns of the image matrix into a single vector) and then inserted as a column of the matrix V. It should be observed that this vector representation of an image has the drawback of losing the spatial relationship between neighboring pixels inside the original image.
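The construction of the data matrix can be sketched as follows, assuming NumPy and that every training image is available as a two-dimensional array of the same size; the function name is an assumption of this example.

```python
import numpy as np

def images_to_matrix(images):
    """Stack the columns of each p x q image into a pq-vector and use
    these vectors as the columns of the data matrix V."""
    cols = [np.asarray(img, dtype=float).reshape(-1, order="F")
            for img in images]
    return np.stack(cols, axis=1)

# Two toy 2 x 3 "images" give a 6 x 2 data matrix.
toy_images = [np.array([[1, 2, 3], [4, 5, 6]]),
              np.array([[7, 8, 9], [10, 11, 12]])]
V_toy = images_to_matrix(toy_images)
```

The argument `order="F"` makes `reshape` read the array column by column, which corresponds to the column stacking described above.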
Once the image training matrix V∈ℝ+n×m is formed (now being n=pq), its NMF can be computed by applying one of the following algorithms:
the Lee and Seung multiplicative update rule (indicated by NMF and described in Algorithm 1) [11],
NMF with orthogonal additional constraint on the basis matrix W (indicated by DLPP and described in Algorithm 2) [37],
local NMF (indicated by LNMF and described in Algorithm 3) [28],
NMF with sparseness additional constraint (indicated by NMFsc and described in Algorithm 4) [31].
Once the basis and the encoding matrices have been obtained using one of the previous algorithms, the on-line detection phase can be started. In particular, for each test sample image q, the distance from the subspace spanned by the learned basis matrix W is computed by means of the following formula:
$$\mathrm{dist}(W,q)=\|q-WW^\top q\|_2. \tag{6}$$
The distance value dist(W,q) is then compared with a fixed threshold ϑ, which is adopted to positively recognize the test image q as a known object. In particular, the decisional rule which can be easily derived is
“IF dist(W,q) ≤ ϑ THEN q is labelled as known object and the object is located inside”. (7)
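The distance computation and the decisional rule can be sketched as follows, assuming NumPy and reading the norm in the distance formula as the Euclidean norm (an assumption of this example, as are the function names). Note that the product of W, its transpose, and q is the orthogonal projection of q onto the span of W only when the columns of W are orthonormal.

```python
import numpy as np

def subspace_distance(W, q):
    """Euclidean norm of q minus W W^T q for a vectorized window q."""
    q = np.asarray(q, dtype=float).ravel()
    return float(np.linalg.norm(q - W @ (W.T @ q)))

def is_known_object(W, q, theta):
    """Decisional rule: accept q when its distance is at most theta."""
    return subspace_distance(W, q) <= theta

# Toy check with an orthonormal basis spanning the first two coordinates.
W_toy = np.eye(3)[:, :2]
q_toy = np.array([1.0, 2.0, 3.0])
d_toy = subspace_distance(W_toy, q_toy)
```

In this toy case only the third coordinate of `q_toy` lies outside the subspace, so `d_toy` equals 3.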
Since the dimensions of the test image are bigger than those of the training images, we adopt a common approach to detect rigid objects such as faces or cars [42]. In particular, a frame of the same dimensions as the training images (i.e., a window-frame of p×q pixels) is slid across the test image in order to locate the subregions of the test image which contain known objects. To reduce computational costs, starting from the upper-left corner of the test image, the sliding frame is moved in steps of size 5 percent of the test image, first in the horizontal and then in the vertical direction (as shown in Figure 2).
Figure 2: Example of a sliding window moving across a test image.
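The scan of the test image can be sketched with a small generator, using plain Python. The function name is hypothetical, and the step is interpreted here as 5 percent of the corresponding image dimension (height for vertical moves, width for horizontal moves), which is an assumption about how the step size is measured.

```python
def sliding_windows(image_shape, window_shape, step_fraction=0.05):
    """Yield the (top, left) corner of every window position, moving
    first horizontally and then vertically from the upper-left corner."""
    img_h, img_w = image_shape
    win_h, win_w = window_shape
    step_r = max(1, int(round(step_fraction * img_h)))
    step_c = max(1, int(round(step_fraction * img_w)))
    for top in range(0, img_h - win_h + 1, step_r):
        for left in range(0, img_w - win_w + 1, step_c):
            yield top, left

# Positions of a 40 x 100 window inside a 100 x 200 image.
positions = list(sliding_windows((100, 200), (40, 100)))
```

Each yielded corner identifies a p×q subregion that can be vectorized and passed to the distance test of the detection phase.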
The detection threshold determines whether each query image is labelled as belonging to the subspace representation of the training set. Since a window is accepted when its distance does not exceed the threshold, raising the threshold increases the number of correct detections, but also the number of false positives; lowering the threshold has the opposite effect. To mitigate this sensitivity, a preliminary detection phase can be performed in order to determine a range [d,D] used to fix a default threshold value as follows:
$$\vartheta_{\mathrm{default}}=d+(D-d)\cdot 0.1. \tag{8}$$
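The default threshold is a simple interpolation inside the range [d,D]. The sketch below uses plain Python and assumes, as a labeled guess, that d and D are the smallest and the largest distances observed during the preliminary detection phase; the function name is hypothetical.

```python
def default_threshold(distances, factor=0.1):
    """Default threshold d + (D - d) * factor, where d and D are taken
    here as the minimum and maximum of the preliminary distances
    (an assumption of this sketch)."""
    d, D = min(distances), max(distances)
    return d + (D - d) * factor

theta_default = default_threshold([2.0, 4.0, 12.0])
```

For the toy distances above, d=2 and D=12, so the default threshold is 2+10·0.1=3.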
The multiplicative factor 0.1 has been derived empirically. The simple mechanism adopted to estimate the threshold value has a drawback: the proposed system identifies something even when it deals with images which do not contain any object of interest. Different estimation methods for the default threshold could be adopted to increase the detection rate; however, we defer this aspect to a more detailed analysis to be tackled in future work. Figure 3 provides an example of the results obtained after the on-line detection phase: the picture on the left represents the test image, while the picture on the right represents a copy of the test image in which black pixels identify those pixels belonging to sliding windows which have not been identified as known objects.
Figure 3: Example of output provided by the prototype system during the on-line detection phase.
4. Experimental Results
This section presents an experimental evaluation of the object detection/localization approach developed in the previous section. The prototype system is evaluated on single-scale images (i.e., images containing objects of the same dimensions as the training data). After a brief description of the datasets adopted in the off-line training phase, some comparisons of the above-mentioned NMF algorithms are reported. Our primary concern is the qualitative evaluation of the different algorithms in order to assess when additional constraints on the basis matrix (such as sparseness and orthogonality) and/or a different number of basis images (explicitly represented by the rank r) can produce better results in object detection.
All the numerical results have been obtained with Matlab 7.7 (R2008b) codes run on an Intel Core 2 Quad Q9400 processor, 2.66 GHz with 4 GB RAM. The execution time of each algorithm has been computed by the built-in Matlab functions tic and toc.
In order to test the object detection prototype system based on the illustrated NMF algorithms, three image datasets have been adopted: CarData, USPS, and ORL. The exploited datasets represent three different typologies of objects: cars, handwritten digits, and faces, respectively. Figure 4 illustrates some training images from the adopted datasets.
Figure 4: Examples of training images from (a) the CarData dataset, (b) the USPS dataset, and (c) the ORL dataset.
The CarData training set contains 550 gray scale training images of cars of size 100×40 pixels, while the test set is composed of 170 single-scale test images, containing 200 cars at roughly the same scale as in the training images. The USPS dataset contains normalized gray scale images of handwritten digits of size 16×16 pixels, divided into a training set of 7291 images and a test set of 2007 images including all digits from 0 to 9. A preprocessing step has been applied to USPS to rescale pixel values from the range [-1,1] to the range [0,1]. The ORL dataset contains gray scale images of faces of 40 distinct subjects. Each image is of size 92×112 pixels and has been taken against a dark homogeneous background with the subjects in an upright, frontal position, with slight left-right out-of-plane rotation. We use the first 8 images of each subject for the training set and the remaining 2 images for the test set.
4.1. Experimental Setup
The off-line learning phase has been run for different values of the rank r (representing the number of basis images) and with a selected degree of sparsity imposed on the NMFsc algorithm (particularly, the sparsity parameters in NMFsc have been fixed as sW=0.5 and sH=[]). As previously observed, we are interested in assessing the existence of any qualitative difference between the NMF learning algorithms in the context of generic object detection. In fact, the rank value r represents the dimensionality of the subspace spanned by the matrix W: an increase in its value can be interpreted as an information gain with respect to the original dataset. On the other hand, large values of r could introduce some redundancy in the basis representation of the dataset, nullifying the benefits provided by the part-based representation of the NMF. The algorithms have been trained on each dataset for various values of the rank (CarData: r=20,110, USPS: r=80,220, ORL: r=20,80). We report the results related to the lowest and the highest rank values for each dataset. For the sake of comparison, the same stopping criterion has been adopted for all NMF learning algorithms (i.e., the algorithms stop when the maximum number of iterations, set to 2500, is reached). Moreover, the results reported in the following sections represent the average values obtained over ten different random initializations of the nonnegative initial matrices W(0) and H(0). Note that, for each trial, the same randomly generated initial matrices (with proper dimensions with respect to the adopted dataset) have been used for all the algorithms.
The algorithms have been compared in terms of final approximation error, computed by $\mathrm{MSE}(W,H)=\|V-WH\|^2$, execution time (the number of seconds required by each algorithm to complete the learning phase), and degree of orthogonality of W, measured by $\mathrm{orth}(W)=\|W^\top W-I\|_F$. The latter measure has been added to highlight when additional constraints (in this specific case, the orthogonality of the basis factor) provide better results in the detection phase.
4.2. Results of the Off-Line Learning Phase
This section reports the results obtained at the end of the off-line training phase for all the three chosen image datasets. Table 1 reports the MSE, the execution time, and the degree of orthogonality of W, when the algorithms are trained on the chosen datasets. For each dataset, the results obtained for the initial value and the final value of the rank are reported. These results are related to the lower and the higher subspace approximation of each dataset.
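Table 1: Algorithm performances when applied to the CarData, USPS, and ORL datasets, respectively. Reported values refer to the lowest and highest values of the factor rank r as previously described. Time is expressed in seconds.

CarData

| Method | MSE (r=20) | Time (r=20) | orth(W) (r=20) | MSE (r=110) | Time (r=110) | orth(W) (r=110) |
|---|---|---|---|---|---|---|
| NMF | 2.441e9 | 275 | 8.7411e4 | 1.457e9 | 453 | 4.4435e5 |
| LNMF | 2.404e10 | 292 | 4.9734 | 2.373e10 | 472 | 10.2373 |
| NMFsc | 2.559e9 | 695 | 6.7818e9 | 1.422e9 | 1265 | 1.5825e9 |
| DLPP | 2.664e9 | 2271 | 1.5627 | 1.657e9 | 2591 | 3.3221 |

USPS

| Method | MSE (r=80) | Time (r=80) | orth(W) (r=80) | MSE (r=220) | Time (r=220) | orth(W) (r=220) |
|---|---|---|---|---|---|---|
| NMF | 1.297e4 | 397 | 2.8166e4 | 3.031e3 | 847 | 1.2142e5 |
| LNMF | 1.331e5 | 374 | 6.6387 | 1.609e4 | 1427 | 6.4695 |
| NMFsc | 1.318e4 | 777 | 5.2854e4 | 5.568e3 | 1409 | 2.7761e4 |
| DLPP | 1.507e4 | 637 | 3.4077 | 1.249e3 | 1144 | 3.2623 |

ORL

| Method | MSE (r=20) | Time (r=20) | orth(W) (r=20) | MSE (r=80) | Time (r=80) | orth(W) (r=80) |
|---|---|---|---|---|---|---|
| NMF | 1.027e9 | 496 | 1.5701e5 | 5.413e8 | 705 | 6.0577e5 |
| LNMF | 3.104e10 | 556 | 4.4656 | 3.080e10 | 781 | 8.8920 |
| NMFsc | 1.425e9 | 1362 | 1.0762e10 | 6.183e8 | 2164 | 2.2674e9 |
| DLPP | 1.323e9 | 14824 | 1.7690 | 8.145e8 | 15278 | 3.4647 |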
Figure 5 illustrates the part-based representation of the CarData dataset learned by the adopted algorithms. To make the visual differences between the obtained bases easier to appreciate, we plot the bases only for the smaller value r=20. Analogously, Figures 6 and 7 report the basis representations of the USPS dataset (with rank value r=80) and the ORL dataset (with rank value r=20), respectively. The NMF algorithm learns a global representation of both the car and the face images, while it provides a local representation of the handwritten digits. The LNMF, DLPP, and NMFsc algorithms, instead, learn localized image parts, some of which appear to roughly correspond to parts of faces, parts of cars, and parts of digit marks. Essentially, the NMF algorithms select a subset of the pixels which are simultaneously active across multiple images to be represented by a single basis vector.
Figure 5: Illustration of the learnt bases (with r=20) of the CarData dataset obtained via (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP.
Figure 6: Illustration of the learnt bases (with r=80) of the USPS dataset obtained via (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP.
Figure 7: Illustration of the learnt bases (with r=20) of the ORL dataset obtained via (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP.
As an example, Figure 8 illustrates the behavior of the MSE during the learning phase on the CarData dataset, with rank values r=20 and r=115, respectively. It should be observed that after some iterations all algorithms except LNMF converge to similar values of the MSE. The LNMF algorithm presents a larger value of the MSE because it is based on the KL-divergence cost function, so it provides a rougher approximation of the dataset in terms of MSE. To better appreciate the rate of convergence of all algorithms, Figure 9 reports the behavior of the MSE during the initial 600 iterations of the learning phase on the USPS dataset, with rank value r=80. A behavior similar to that depicted in Figures 8 and 9 is observed for all the other datasets and for different values of the rank r.
Figure 8: Behavior of the MSE during the learning iterations for the CarData dataset ((a) rank value r=20, (b) rank value r=115).
Figure 9: Behavior of the MSE during the initial 600 iterations of the learning phase on the USPS dataset.
Concerning the degree of orthogonality of the matrix W learned by each algorithm, Figure 10 reports the semilog plot of the orthogonality error for W during the learning iterations on the CarData dataset (with the rank values r=20 and r=115, resp.). It should be observed that both LNMF and DLPP produce a matrix W possessing a fair degree of orthogonality. On the other hand, since NMF and NMFsc do not incorporate any additional orthogonality constraint, they preserve or sometimes deteriorate the degree of orthogonality of the initial matrix W(0). Similar plots of the orthogonality error can be obtained for the matrices learned from the USPS and ORL datasets.
Figure 10: Behavior of the orthogonality error for matrix W during the learning iterations on the CarData dataset: (a) rank value r=20, (b) rank value r=115.
4.3. Results of the On-Line Detection Phase
Once the basis and the encoding matrices have been obtained at the end of the learning phase, we are ready to enter the on-line detection and localization phase in order to carry out a qualitative analysis of the considered algorithms (by means of the prototype system). To measure the performance of the NMF-based object detection/localization system, we are interested in knowing how many of the objects it detects and how often the detections it makes are false. In particular, the two quantities of interest are the number of correct detections and the number of false detections: the former should be maximized while the latter has to be minimized. As we have already observed in Section 3, the decisional rule (7), which allows a test image to be identified as a known object, depends on the detection threshold ϑ. By suitably varying the threshold ϑ, a different tradeoff between correct and false detections can be reached. This tradeoff can be estimated by considering the recall and the precision. The recall is the proportion of objects that are detected, while the precision is the fraction of correctly detected objects among the total number of detections made by the system. Denoting by TP the number of true positives, by FP the number of false positives, and by nP and nF the total numbers of positives and negatives in the dataset, respectively, the performance measures are Recall=TP/nP and Precision=TP/(TP+FP), and the fraction of false detections can be computed as 1-Precision. It should be pointed out that the precision-recall pair is a more appropriate measure than the common ROC curve, since the latter is designed for binary classification tasks, not for detection tasks [25].
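The performance measures can be computed with a few lines of plain Python. In the sketch below the function name is hypothetical, and the F-measure, which is reported in the result tables but not defined in the text, is assumed to be the usual harmonic mean of precision and recall.

```python
def detection_scores(tp, fp, n_pos):
    """Recall, precision and F-measure from detection counts.

    tp: true positives, fp: false positives, n_pos: objects present.
    The F-measure is assumed to be the harmonic mean of the two rates.
    """
    recall = tp / n_pos
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# NMF on CarData with r=20: 103 correct and 67 false detections, 200 cars.
recall, precision, f_measure = detection_scores(103, 67, 200)
```

For these counts the three values are approximately 0.515, 0.606, and 0.557, which agree with the rounded CarData entries reported for NMF at r=20.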
The evaluation results have been obtained by manually determining the location of the windows containing interesting objects. Tables 2, 3 and 4 report the performance results for Cardata, USPS, and ORL, respectively, when different values of the dimensionality r of the subspace dataset approximation are adopted. NMF algorithms evidence some differences in terms of recall and precision, particularly NMF anf NMFsc provide better results than LNMF and DLPP. The performance of the latter algorithms is also quite bad on the ORL face dataset, which represents one of the easiest database in terms of recognition.
Algorithm performances in terms of recall and precision when applied to CarData with factor ranks r=20 and r=110. Bold fonts indicate the highest values of precision and recall.

r=20
Method   TP    FP    Recall   Precision   F-measure
NMF      103   67    0.52     0.61        0.56
LNMF     92    78    0.46     0.54        0.5
NMFsc    106   64    0.53     0.62        0.57
DLPP     37    133   0.19     0.22        0.2

r=110
Method   TP    FP    Recall   Precision   F-measure
NMF      112   58    0.56     0.66        0.61
LNMF     86    85    0.43     0.5         0.46
NMFsc    110   60    0.55     0.65        0.59
DLPP     21    93    0.11     0.18        0.13
Algorithm performances in terms of recall and precision when applied to USPS with factor ranks r=80 and r=220. Bold fonts indicate the highest values of precision and recall.

r=80
Method   TP     FP     Recall   Precision   F-measure
NMF      2602   98     0.96     0.96        0.96
LNMF     2457   243    0.91     0.91        0.91
NMFsc    2615   85     0.97     0.97        0.97
DLPP     658    2042   0.24     0.24        0.24

r=220
Method   TP     FP     Recall   Precision   F-measure
NMF      2602   98     0.96     0.96        0.96
LNMF     1708   1042   0.63     0.63        0.63
NMFsc    2603   97     0.96     0.96        0.96
DLPP     2195   505    0.81     0.81        0.81
Algorithm performances in terms of recall and precision when applied to ORL with factor ranks r=20 and r=80.

r=20
Method   TP   FP   Recall   Precision   F-measure
NMF      80   0    1        1           1
LNMF     80   0    1        1           1
NMFsc    80   0    1        1           1
DLPP     40   40   0.5      0.5         0.5

r=80
Method   TP   FP   Recall   Precision   F-measure
NMF      80   0    1        1           1
LNMF     80   0    1        1           1
NMFsc    80   0    1        1           1
DLPP     41   39   0.51     0.51        0.51
Figure 11 reports the results obtained after the on-line phase on a car test example. The picture on the top shows the query image; the remaining pictures show the positive pixels returned by (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP, respectively (trained with r=110).
Output of the on-line detection phase after learning the CarData dataset: query image on the top, (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP. The off-line phase has been performed with r=110.
Figure 12 illustrates the results obtained after the on-line phase on a handwritten digit test example. The picture on the top shows the query image; the remaining pictures show the positive pixels returned by (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP, respectively (trained with r=80). As can be noted, the DLPP algorithm provides the worst result, since it marks all the background pixels around the digit images.
Output of the on-line detection phase after learning the USPS dataset: query image on the top, (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP. The off-line phase has been performed with r=80.
Figure 13 illustrates the results obtained after the on-line phase on a composite image containing different ORL test images. Again, the picture on the top shows the query image; the remaining pictures show the positive pixels returned by (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP, respectively (trained with r=80). Also in this case, the worst results are given by the DLPP algorithm, which is not able to correctly locate all the ORL test images.
Output of the on-line detection phase after learning the ORL dataset: query image on the top, (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP. The off-line phase has been performed with r=20.
4.4. Qualitative Analysis in Natural Images
The following images illustrate the results obtained during the on-line detection phase for each considered algorithm with different query images. In particular, Figure 14 provides an example of the detection of a car inside some test images taken from the CarData test set.
Output of the on-line detection phase after learning the CarData dataset: (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP with r=110 and ϑ=2.6e3.
Figure 15 illustrates the detection and localization of some digit images inserted in a large-scale image with white background, while Figure 16 reports the detection/localization results for some digits written on a large white page. Figure 17 shows the detection of some handwritten digits present in an image of a real letter envelope. In the latter case, some false positive detections can be observed, such as the two stamps and the letters in the address. For the stamps, this can be explained by their larger size with respect to the sliding window and by the bases (see Figure 6) learnt by the algorithm; for the letters, by the inherent resemblance between some handwritten numbers and letters (such as “0” and “O,” “B” and “8,” “6” and “b”).
Output of the on-line detection phase after learning the USPS dataset: (a) NMF, (b) NMFsc with r=80 and ϑ=1.0e8.
Output of the on-line detection phase on a white paper image containing some handwritten digits. Test is made with NMFsc after learning the USPS dataset, with r=80 and ϑ=1.0e8.
Output of the on-line detection phase on a letter envelope image containing some handwritten digits. Test is made with LNMF after learning the USPS dataset, with r=80 and ϑ=2.3e3.
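To make the role of the sliding window and of the threshold ϑ more concrete, the following Python sketch scans an image with a fixed-size window and accepts a window when its reconstruction error with respect to a nonnegative basis W is at most ϑ. It is a minimal sketch under stated assumptions, not the prototype system itself: the acceptance rule stands in for rule (7), which is not restated here; the encoding is computed with multiplicative updates for fixed W, chosen only for simplicity; and the names `encode_nonneg`, `scan_image`, the one-column basis, and the toy image are all hypothetical.

```python
import numpy as np


def encode_nonneg(W, x, n_iter=200, eps=1e-12):
    """Approximate argmin over h >= 0 of ||x - W h|| by multiplicative updates."""
    h = np.ones(W.shape[1])
    WtW = W.T @ W
    Wtx = W.T @ x
    for _ in range(n_iter):
        h *= Wtx / (WtW @ h + eps)
    return h


def scan_image(image, W, window_shape, theta, stride=1):
    """Return the top-left corners of the windows accepted as known objects.

    Assumption: a window is accepted when its reconstruction error
    ||x - W h|| is at most theta.
    """
    win_h, win_w = window_shape
    hits = []
    for top in range(0, image.shape[0] - win_h + 1, stride):
        for left in range(0, image.shape[1] - win_w + 1, stride):
            x = image[top:top + win_h, left:left + win_w].reshape(-1)
            h = encode_nonneg(W, x)
            if np.linalg.norm(x - W @ h) <= theta:
                hits.append((top, left))
    return hits


# Toy basis with a single part: a 3x3 cross, scaled to unit length.
template = np.array([[0.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [0.0, 1.0, 0.0]])
W = (template / np.linalg.norm(template)).reshape(-1, 1)

# White 8x8 page with the cross pasted at row 2, column 3.
image = np.ones((8, 8))
image[2:5, 3:6] = template

print(scan_image(image, W, (3, 3), theta=0.5))
print(scan_image(image, W, (3, 3), theta=1.5))
```

With this toy data, the tighter threshold keeps only the window at (2, 3), whereas the looser one also accepts four windows shifted by two pixels, a small-scale analogue of the false detections that appear when ϑ is too permissive or when the window does not match the object size.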
Figure 18 gives evidence of the capability of NMF algorithms to recognize human faces inside two real-world pictures which portray human figures with different backgrounds; as can be observed, the adopted algorithm is able to recognize the presence of a face different from the training faces learnt in the off-line training phase. This confirms that the part-based representation provided by NMF can effectively add value in detecting and locating objects inside images.
Output of the on-line detection phase after learning the ORL dataset. Test is made with NMFsc with r=20 and ϑ=2.4e3.
5. Conclusions and Future Work
To summarize, we have presented a prototype framework for learning how to detect and locate “generic” objects in images using the part-based representation provided by nonnegative matrix factorization of a set of template images. Comparisons between different NMF algorithms have been presented, showing that some additional constraints (such as sparseness) can be more suitable for identifying localized parts that describe structures in object classes. Our experiments on well-known databases demonstrated that the proposed NMF-based prototype system is able to extract such interpretable parts from a set of training images and to use them in localizing similar objects in real-world images.
Future work could be undertaken to allow the processing of object images at different scales, to improve the final localization (using, for instance, a repeated part elimination algorithm), and to apply different criteria and/or measures to identify when a test image does or does not belong to the subspace of known objects.
Acknowledgment
The authors would like to thank the anonymous referees for their suggestions and comments, which proved to be very useful for improving the paper.
References
[1] G. H. Golub and C. F. Van Loan, 3rd edition, The Johns Hopkins University Press, 2001.
[2] I. T. Jolliffe, Springer, 1986.
[3] A. Hyvarinen, "Independent component analysis," vol. 2, 2001.
[4] D. Guillamet and J. Vitrià, "Non-negative matrix factorization for face recognition," vol. 2504, pp. 336-344, 2002.
[5] X. Sun, Q. Zhang, and Z. Wang, "Face recognition based on NMF and SVM," in Proceedings of the 2nd International Symposium on Electronic Commerce and Security, pp. 616-619, 2009.
[6] D. Guillamet and J. Vitrià, "Evaluation of distance metrics for recognition based on non-negative matrix factorization," vol. 24, no. 9-10, pp. 1599-1605, 2003, doi: 10.1016/S0167-8655(02)00399-9.
[7] W. Liu and N. Zheng, "Non-negative matrix factorization based methods for object recognition," vol. 25, no. 8, pp. 893-897, 2004, doi: 10.1016/j.patrec.2004.02.002.
[8] Y. Gao and G. Church, "Improving molecular cancer class discovery through sparse non-negative matrix factorization," vol. 21, no. 21, pp. 3970-3975, 2005, doi: 10.1093/bioinformatics/bti653.
[9] H. Kim and H. Park, "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis," vol. 23, no. 12, pp. 1495-1502, 2007, doi: 10.1093/bioinformatics/btm134.
[10] P. Paatero and U. Tapper, "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values," vol. 5, no. 2, pp. 111-126, 1994.
[11] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of the Advances in Neural Information Processing Systems Conference, vol. 13, pp. 556-562, MIT Press, 2000.
[12] M. Novak and R. Mammone, "Use of nonnegative matrix factorization for language model adaptation in a lecture transcription task," in Proceedings of the IEEE International Conference Acoustic, Speech and Signal Processing, vol. 1, pp. 541-544, IEEE Computer Society, 2001.
[13] M. Chu and R. J. Plemmons, "Nonnegative matrix factorization and applications," vol. 34, pp. 2-7, 2005.
[14] V. P. Pauca, J. Piper, and R. J. Plemmons, "Nonnegative matrix factorization for spectral data analysis," vol. 416, no. 1, pp. 29-47, 2006, doi: 10.1016/j.laa.2005.06.025.
[15] D. Chen and R. Plemmons, "Nonnegativity constraints in numerical analysis," in Proceedings of the Symposium on the Birth of Numerical Analysis, pp. 541-544, World Scientific Press, 2008.
[16] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," vol. 401, no. 6755, pp. 788-791, 1999, doi: 10.1038/44565.
[17] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-591, 1991.
[18] B. Schiele and J. L. Crowley, "Recognition without correspondence using multidimensional receptive field histograms," vol. 36, no. 1, pp. 31-50, 2000, doi: 10.1023/A:1008120406972.
[19] I. Biederman, "Recognition-by-components: a theory of human image understanding," vol. 94, no. 2, pp. 115-147, 1987.
[20] T. S. Lee, "Image representation using 2D Gabor wavelets," vol. 18, no. 10, pp. 959-971, 1996.
[21] R. N. Strickland and H. I. Hahn, "Wavelet transform methods for object detection and recovery," vol. 6, no. 5, pp. 724-735, 1997.
[22] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. I511-I518, December 2001.
[23] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components," vol. 23, no. 4, pp. 349-361, 2001, doi: 10.1109/34.917571.
[24] S. Ullman, M. Vidal-Naquet, and E. Sali, "Visual features of intermediate complexity and their use in classification," vol. 5, no. 7, pp. 682-687, 2002, doi: 10.1038/nn870.
[25] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," vol. 26, no. 11, pp. 1475-1490, 2004, doi: 10.1109/TPAMI.2004.108.
[26] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Proceedings of the Neural Information Processing Systems, vol. 16, pp. 1141-1149, 2003.
[27] B. J. Shastri and M. D. Levine, "Face recognition using localized features based on non-negative sparse coding," vol. 18, no. 2, pp. 107-122, 2007, doi: 10.1007/s00138-006-0052-0.
[28] S. Z. Li, X. Hou, H. Zhang, and Q. S. Cheng, "Learning spatially localized, parts-based representation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 207-212, IEEE Computer Society, 2001.
[29] I. Buciu and I. Pitas, "A new sparse image representation algorithm applied to facial expression recognition," in Proceedings of the 14th IEEE Signal Processing Society Workshop of the Machine Learning for Signal Processing, pp. 539-548, October 2004.
[30] I. Buciu, "Learning sparse non-negative features for object recognition," in Proceedings of the IEEE 3rd International Conference on Intelligent Computer Communication and Processing (ICCP '07), pp. 73-79, September 2007, doi: 10.1109/ICCP.2007.4352144.
[31] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," vol. 5, pp. 1457-1469, 2004.
[32] D. Soukup and I. Bajla, "Robust object recognition under partial occlusions using NMF," vol. 2008, Article ID 857453, 14 pages, 2008, doi: 10.1155/2008/857453.
[33] A. Berman and R. Plemmons, Academic Press, 1979.
[34] M. T. Chu, F. Diele, R. Plemmons, and S. Ragni, "Optimality, computation and interpretation of nonnegative matrix factorizations," NCSU, 2005.
[35] M. T. Chu and M. M. Lin, "Low-dimensional polytope approximation and its applications to nonnegative matrix factorization," vol. 30, no. 3, pp. 1131-1155, 2007, doi: 10.1137/070680436.
[36] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," vol. 19, no. 10, pp. 2756-2779, 2007, doi: 10.1162/neco.2007.19.10.2756.
[37] C. Ding, T. Li, W. Peng, and H. Park, "Orthogonal nonnegative matrix tri-factorizations for clustering," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pp. 126-135, August 2006.
[38] S. Choi, "Algorithms for orthogonal nonnegative matrix factorization," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '08), pp. 1828-1832, June 2008, doi: 10.1109/IJCNN.2008.4634046.
[39] N. Del Buono, "A penalty function for computing orthogonal non-negative matrix factorizations," in Proceedings of the 9th International Conference on Intelligent Systems Design and Applications (ISDA '09), pp. 1001-1005, December 2009, doi: 10.1109/ISDA.2009.59.
[40] C. Ding, T. Li, and W. Peng, "Nonnegative matrix factorization and probabilistic latent semantic indexing: equivalence, chi-square statistic, and a hybrid method," in Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference (AAAI '06), pp. 342-347, July 2006.
[41] C. Ding, X. He, and H. D. Simon, "On the equivalence of nonnegative matrix factorization and spectral clustering," in Proceedings of the SIAM Data Mining Conference, pp. 606-610, 2005.
[42] K. Murphy, A. Torralba, D. Eaton, and W. Freeman, "Object detection and localization using local and global features," vol. 4170, pp. 394-412, 2006.