Bidirectional Nonnegative Deep Model and Its Optimization in Learning

Nonnegative matrix factorization (NMF) has been successfully applied in signal processing as a simple two-layer nonnegative neural network. Projective NMF (PNMF), which has fewer parameters, was proposed to project high-dimensional nonnegative data onto a lower-dimensional nonnegative subspace. Although PNMF overcomes the out-of-sample problem of NMF, it does not consider the nonlinear characteristics of data and is only a narrow signal decomposition method. In this paper, we combine PNMF with deep learning and nonlinear fitting to propose a bidirectional nonnegative deep learning (BNDL) model and its optimization learning algorithm, which can obtain a nonlinear multilayer deep nonnegative feature representation. Experiments show that the proposed model not only solves the out-of-sample problem of NMF but also learns hierarchical nonnegative feature representations with better clustering performance than the classical NMF, PNMF, and Deep Semi-NMF algorithms.


Introduction
In machine learning, pattern recognition, computer vision, and image processing, an important problem is to find effective representations of an input data matrix with nonnegative elements and very high dimensions. In 1999, Lee and Seung proposed a classical feature representation method, named nonnegative matrix factorization (NMF) [1], which effectively addressed this problem. The basic idea and analysis of the NMF algorithm may be described as follows.
Given a nonnegative data matrix $X \in \mathbb{R}_{+}^{m \times n}$, which is a collection of $n$ nonnegative samples as columns, NMF approximates $X$ by the product $WH$ of a nonnegative basis matrix $W$ and a nonnegative coefficient matrix $H$, so that only additive, not subtractive, linear combinations are allowed. To a degree, it can capture the essence of intelligent data description. The objective function can be defined as

$$\min_{W \geq 0,\, H \geq 0} \|X - WH\|^2. \tag{1}$$

Although NMF is optimal for learning the parts of objects, it suffers from the out-of-sample problem [2,3]; namely, obtaining the coefficients of a new example is indirect or requires repeating the factorization. To overcome this disadvantage, researchers have put forward improved methods based on the NMF algorithm. For example, Yuan et al. proposed Projective NMF (PNMF) [4] in 2009. PNMF is a modified form of the traditional NMF, with strong sparseness and orthogonality [4,5] under the projection assumption. It only needs to compute one nonnegative matrix $W$, thereby reducing the amount of computation at each iteration; that is, PNMF learns a nonnegative matrix that directly projects $X$ onto the lower-dimensional nonnegative subspace. If $W$ denotes the basis matrix, PNMF treats $H = W^T X$ as the coefficient and uses $W W^T X$ to reconstruct $X$. So its objective function is

$$\min_{W \geq 0} \|X - W W^T X\|^2. \tag{2}$$

PNMF has fewer parameters than NMF, is widely used in linear dimension reduction, and solves the out-of-sample deficiency. Like NMF, however, PNMF is a linear dimensionality reduction method, whereas many data exhibit nonlinear characteristics [6]. In addition, NMF and PNMF factorize the original data only once [7]. In many situations, nonnegative data sampled from real applications are very complex and need to be factorized several times to obtain high-level deep features with distinctiveness and strong representation ability. Some studies have shown that deep learning is needed to learn high-level representations of complex data and to perform better in image understanding and speech perception [6]. Deep learning has had a profound impact in both academia and industry since Hinton and Salakhutdinov published a well-known article [8] in Science in 2006. That article shows the following: (1) an artificial neural network with many hidden layers has an excellent ability to learn features that describe the data more essentially and facilitate visualization, clustering, and classification; (2) the difficulty of training a deep neural network can be overcome by "layer by layer initialization" (layer-wise pretraining). With the success of training deep architectures, several variants of deep learning have been introduced [6,9]. These multilayer algorithms take hierarchical approaches to feature extraction and provide efficient solutions to complex problems, and they use an error backpropagation algorithm and unsupervised learning to obtain an effective representation model. However, they have not considered the following concerns: (1) the weights should be nonnegative when many physical signals are nonnegative data; (2) a purely additive description uses few components and makes the components of the nonnegative data clear.
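The out-of-sample distinction above can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' code: the function names `pnmf_encode` and `pnmf_reconstruct` and the toy values of `W` and `x_new` are hypothetical. It shows that, once a PNMF basis has been learned, a new sample is encoded with a single matrix product, whereas NMF would need a new nonnegative fitting step.

```python
import numpy as np

def pnmf_encode(W, x_new):
    """Encode new samples with a learned PNMF basis W (m x r): h = W^T x."""
    return W.T @ x_new

def pnmf_reconstruct(W, x_new):
    """Reconstruct new samples as W W^T x, the PNMF approximation."""
    return W @ pnmf_encode(W, x_new)

# Toy basis with orthonormal nonnegative columns (illustrative values only).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x_new = np.array([0.2, 0.5, 0.0])

h = pnmf_encode(W, x_new)            # nonnegative code, no refactorization
x_hat = pnmf_reconstruct(W, x_new)   # projection of x_new onto span(W)
```

Because both `W` and `x_new` are nonnegative, the code `h` is nonnegative by construction.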
To obtain a deep nonnegative feature representation, Trigeorgis et al. applied the concept of Semi-NMF [10] to propose Deep Semi-NMF [9], which is able to learn hidden deep representations of the original data. In Semi-NMF, the goal is to construct a low-dimensional nonnegative representation $H^{+}$ of the original data $X^{\pm}$, with the basis matrix $Z^{\pm}$ serving as the mapping between the original data and its lower-dimensional representation [10]. The Deep Semi-NMF model finds a representation of the data that has a similar interpretation at the top layer. The input data matrix is further analyzed as a product of multiple factors, $X^{\pm} = Z_1^{\pm} Z_2^{\pm} Z_3^{\pm} \cdots$, which is called deep seminonnegative matrix factorization. That means it is able to decompose the data in several different ways according to multiple different attributes. Although Deep Semi-NMF uses a multilayer model to obtain more features, it only deals with seminonnegative data and remains a linear transformation with weak representation capacity. Moreover, the Deep Semi-NMF model still has the out-of-sample problem.
Based on the above analysis, PNMF computes only one projection matrix and cannot learn richer features, especially when the data lie on or near a nonlinear manifold or are hierarchically generated. Motivated by the ideas of PNMF, Deep Semi-NMF, and deep learning (especially the AutoEncoder [8,11]), in this paper we propose a novel model, which we call bidirectional nonnegative deep learning (BNDL), for learning more helpful and meaningful deep nonnegative representations of original data with nonlinear characteristics and for overcoming the out-of-sample problem. In Section 2, we introduce our BNDL method and analyze its objective functions. We give the corresponding algorithms in Section 3. Experiments are presented in Section 4. In Section 5, we briefly give some concluding remarks.

Bidirectional Nonnegative Deep Learning Model
2.1. Motivation. The particular attraction of the NMF algorithm is its nonnegative constraints, which are useful for data representation in clustering. But NMF is a simple linear coding algorithm using a single-layer network with nonnegative constraints, and it suffers from the out-of-sample deficiency: it cannot directly obtain the codes of new examples [12,13].
The PNMF algorithm uses the transpose of the learned basis matrix as the projection matrix, which yields nonnegative coefficients for any new example [4,14]. Although it overcomes the out-of-sample problem of NMF, PNMF is still a linear coding algorithm with a simple single-layer decomposition.
On the other hand, existing deep network models rarely consider nonnegative constraints; even the most recent related model, Deep Semi-NMF [9], imposes only a partial nonnegative constraint and is still a linear model.
In this paper, we propose a nonnegative hierarchical data representation model, named the bidirectional nonnegative deep learning (BNDL) model, which applies the concept of PNMF to train an initial multilayer nonlinear structure that is able to learn complete hidden deep representations of the original data.
Different from other deep architectures, BNDL first constructs a pretrained deep network by independently training nonnegative two-layer networks and stacking them, and the learning process of each layer combines PNMF with a designed nonlinear mapping. That is to say, each time we perform a one-step decomposition, the basis matrix of the two-layer BNDL can be regarded as a weight matrix of the deep network, and the output of this step, passed through a Sigmoid function, is used as the input of the next layer. Iterating this process upwards, we obtain a deep network. Downwards, we can reconstruct the original sample data. Because BNDL learns only one layer in each step, a deep network can be built quickly. This hierarchical feature extraction strategy learns more meaningful and helpful features and higher-order nonnegative nonlinear characteristics than one-step learning. Finally, fine-tuning is applied to improve the reconstruction performance and the deep features of the network under the nonnegative weight constraints.
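The upward and downward passes described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the names `encode`, `decode`, `Ps`, `Ws`, and `alpha` are hypothetical, and scaling the Sigmoid input by multiplying with a small `alpha` is only one possible reading of how the input to the Sigmoid function is kept from growing too large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(X, Ps, alpha=1.0):
    """Bottom-to-top pass; returns the input of every layer, bottom first."""
    layers = [X]
    for P in Ps:
        # Assumed form of the nonlinear mapping: Sigmoid of a scaled projection.
        layers.append(sigmoid(alpha * (P @ layers[-1])))
    return layers

def decode(top, Ws):
    """Top-to-bottom pass through the reconstruction weights."""
    X = top
    for W in reversed(Ws):
        X = sigmoid(W @ X)
    return X

# Toy nonnegative weights for a 6-4-3 network (random, for shape checking only).
rng = np.random.default_rng(0)
X = rng.random((6, 5))                           # 5 samples with 6 features
Ps = [rng.random((4, 6)), rng.random((3, 4))]    # bottom-to-top weights
Ws = [rng.random((6, 4)), rng.random((4, 3))]    # top-to-bottom weights

layers = encode(X, Ps, alpha=0.1)
X_hat = decode(layers[-1], Ws)
```

With untrained random weights the reconstruction `X_hat` is of course meaningless; the sketch only shows how the per-layer weights chain together in both directions.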

Figure 1: The two-layer binetwork structure for BNDL, showing the ith layer and the (i + 1)th layer.

Bidirectional Nonnegative Deep Learning Model. Let $X = [x_1, x_2, \ldots, x_n]$ denote the data sample set, in which $x_i \in \mathbb{R}^m$ denotes the feature descriptor of the $i$th sample and $n$ is the total number of samples. Here we assume that the input data matrix $X$ is nonnegative. Let $r$ denote the dimension of the desired dimension-reduced feature space. The task of data factorization is to obtain a nonnegative basis matrix $W = [w_1, w_2, \ldots, w_r] \in \mathbb{R}^{m \times r}$ and its corresponding coefficient matrix $H = PX$. Here $P \in \mathbb{R}^{r \times m}$ denotes the projection matrix that transforms an $m$-dimensional feature vector into an $r$-dimensional feature space. The matrices $X^{(l)}$, $l = 1, \ldots, L$, are the output matrices of the first $l - 1$ layers, and $X^{(1)}$ is equal to the original matrix $X$. The objective of the $l$th factorization is to approximate $X^{(l)}$ as closely as possible; that is, $X^{(l)} \approx W^{(l)} H^{(l)}$. So the objective function for projective nonnegative multilayer factorization can be defined as the minimization of the error between $X^{(l)}$ and $W^{(l)} P^{(l)} X^{(l)}$, where $\alpha$ is a positive factor that keeps the input to the Sigmoid function from becoming too large and $W^{(l)} \geq 0$ (or $P^{(l)} \geq 0$ or $X^{(l)} \geq 0$) denotes that each element of the matrix is nonnegative.
To preserve the same S-shaped nonlinear mapping function in the top-to-bottom operator, the top-to-bottom reconstruction basis $W^{(l)}$ should also be constrained to reconstruct the input $X^{(l-1)}$ of the $(l-1)$th layer. So the objective is further extended with a second term for this reconstruction error, weighted by the balance factor $\lambda$. In our experiments, if $l = 1$, $\lambda = 0$; otherwise $\lambda = 1$. The two-layer structure for constructing BNDL is illustrated in Figure 1.
In minimizing (7), simultaneously solving for $W^{(l)}$ and $P^{(l)}$ is an NP-hard problem. So the scalable solution of (6) is to optimize alternately with respect to $W^{(l)}$ and $P^{(l)}$, fixing one of them at a time.
When fixing $P^{(l)}$, the objective function is solved by the Lagrangian multiplier method [9]: it is transformed into the Lagrangian (8), where $\Phi$ is the Lagrangian multiplier and both $W^{(l)}$ and $\Phi$ are constrained to be nonnegative.
The minimization of (8) requires the partial derivative of the Lagrangian with respect to $W^{(l)}$ to be zero under the nonnegativity constraints. Based on the KKT conditions [15], the minimum solution of (8) satisfies $\Phi_{ij} W^{(l)}_{ij} = 0$, where $\Phi_{ij}$ is the item at the $i$th row and the $j$th column of $\Phi$. Combining (10) and (11) gives an equality (12) with respect to the parameters $W^{(l)}$.
Moving the negative items in (12) to the right-hand side, we can get a Multiplicative Update Rule (MUR) from (4) for any two-layer learning of BNDL:

$$W^{(l)} \leftarrow \frac{X^{(l)} (X^{(l)})^T (P^{(l)})^T}{W^{(l)} P^{(l)} X^{(l)} (X^{(l)})^T (P^{(l)})^T} \odot W^{(l)},$$

where $\odot$ denotes the elementwise product, that is, the product of the corresponding elements of two matrices, and the matrix division is likewise the elementwise division of corresponding elements.
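Similarly, the MUR [12] for $P^{(l)}$ is

$$P^{(l)} \leftarrow \frac{(W^{(l)})^T X^{(l)} (X^{(l)})^T}{(W^{(l)})^T W^{(l)} P^{(l)} X^{(l)} (X^{(l)})^T} \odot P^{(l)}.$$

Moreover, by following the proof of Theorem 1 in the literature [12], we can prove the following conclusion, Theorem 1. Because the proof is similar to that in [12], we omit the detailed proof and derivations owing to limited space.
Theorem 1. The Euclidean metric $\|X^{(l)} - W^{(l)} P^{(l)} X^{(l)}\|^2$ is nonincreasing under the following two Multiplicative Update Rules:

$$W^{(l)} \leftarrow \frac{X^{(l)} (X^{(l)})^T (P^{(l)})^T}{W^{(l)} P^{(l)} X^{(l)} (X^{(l)})^T (P^{(l)})^T} \odot W^{(l)}, \qquad P^{(l)} \leftarrow \frac{(W^{(l)})^T X^{(l)} (X^{(l)})^T}{(W^{(l)})^T W^{(l)} P^{(l)} X^{(l)} (X^{(l)})^T} \odot P^{(l)}.$$

And there must exist optimal $W^{(l)}$ and $P^{(l)}$ at which the value of (7) reaches a stationary point.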
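The alternating scheme behind Theorem 1 can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: it assumes the two-layer objective is the squared error between X and W P X, and the function name `two_layer_mur`, the random initialization, the iteration count, and the small constant `eps` (added to avoid division by zero) are all assumptions introduced here.

```python
import numpy as np

def two_layer_mur(X, r, n_iter=200, eps=1e-12, seed=0):
    """Alternating multiplicative updates for min ||X - W P X||^2, W, P >= 0."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    W = rng.random((m, r))      # reconstruction (basis) weights, m x r
    P = rng.random((r, m))      # generative (projection) weights, r x m
    XXt = X @ X.T
    errors = []
    for _ in range(n_iter):
        # W <- W * (X X^T P^T) / (W P X X^T P^T), elementwise
        A = XXt @ P.T
        W *= A / (W @ (P @ A) + eps)
        # P <- P * (W^T X X^T) / (W^T W P X X^T), elementwise
        B = W.T @ XXt
        P *= B / (W.T @ W @ P @ XXt + eps)
        errors.append(np.linalg.norm(X - W @ P @ X) ** 2)
    return W, P, errors

X = np.random.default_rng(1).random((20, 30))   # toy nonnegative data
W, P, errors = two_layer_mur(X, r=5)
```

Because every factor in the updates is nonnegative, `W` and `P` stay nonnegative without an explicit projection, and the recorded `errors` can be inspected to check the nonincreasing behavior that Theorem 1 describes.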

The Modification of the Top-to-Bottom Reconstruction Weight $W^{(l)}$. After decomposing the input $X^{(l)}$ to get the nonnegative generative weights $P^{(l)}$, we fix the first term and minimize the second term of the objective function (5); that is, we modify the reconstruction weights $W^{(l)}$ to obtain the optimal top-to-bottom nonlinear mapping to the $(l-1)$th layer input, $X^{(l-1)} \approx g(W^{(l)} X^{(l)})$, $l \geq 2$. So we can easily get the following theorem.
Theorem 2. For minimizing the squared objective function $\|X^{(l-1)} - g(W^{(l)} X^{(l)})\|^2$, because the Sigmoid function $g(\cdot)$ for neuron outputs is monotonic, the optimal solution of $W^{(l)}$ is

$$W^{(l)} = \left( g^{-1}\!\left(X^{(l-1)}\right) \left(X^{(l)}\right)^{+} \right)_{+},$$

where $(\cdot)^{+}$ is the pseudo-inverse and $(\cdot)_{+}$ preserves the nonnegative elements and replaces the negative elements with zero.
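One way to read Theorem 2 is as a three-step computation: invert the Sigmoid on the lower-layer input, multiply by the pseudo-inverse of the upper-layer representation, and set negative entries to zero. The NumPy sketch below follows that reading. It is an assumption-laden illustration, not the authors' code: the names `reconstruction_weights`, `X_prev`, and `X_cur` are hypothetical, the lower-layer data are assumed to lie in [0, 1], and the clipping constant `clip` is introduced here only to keep the logarithm finite.

```python
import numpy as np

def reconstruction_weights(X_prev, X_cur, clip=1e-6):
    """Nonnegative W such that sigmoid(W @ X_cur) approximates X_prev."""
    # Inverse Sigmoid (logit); clipping avoids log(0) at exact 0 or 1.
    T = np.clip(X_prev, clip, 1.0 - clip)
    target = np.log(T / (1.0 - T))
    # Least-squares fit through the pseudo-inverse of the upper-layer data.
    W = target @ np.linalg.pinv(X_cur)
    # Keep nonnegative entries and replace negative entries with zero.
    return np.maximum(W, 0.0)

# Toy check: data generated by a known nonnegative weight matrix.
rng = np.random.default_rng(0)
W_true = rng.random((5, 3))
X_cur = rng.random((3, 10))
X_prev = 1.0 / (1.0 + np.exp(-(W_true @ X_cur)))
W_est = reconstruction_weights(X_prev, X_cur)
```

In this toy setting the lower-layer data are produced exactly by a nonnegative weight matrix, so the estimate matches it up to numerical error; on real data the zeroing step generally makes the result an approximation.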

Unfolding Each Two-Layer Network to Construct a Multilayer Nonnegative Network.
Although the weight matrices of all layers have been learned, they are only effective within each two-layer network and are not optimal for the whole network.
The weights in higher layers may not be optimal for the lower layers [16,17]. So after greedily learning good initial values for the weights in every two layers, we unroll each two-layer nonnegative network by using $P^{(l)}$ and $W^{(l)}$ to construct a multilayer nonnegative nonlinear network, as shown in Figure 2. From top to bottom, we want the reconstruction error to be as small as possible. So the BP algorithm is applied to fine-tune the unrolled deep network under the nonnegative weight constraints.

The BNDL Algorithm Description
Summarizing the above analysis, after each two-layer nonnegative network is unrolled, the bottom-to-top weights $P^{(l)}$ connect each nonlinear neuron layer to form the first multilayer structure; the top-to-bottom weights $W^{(l)}$ then connect each nonlinear neuron layer from the top layer down to the bottom layer (i.e., the input layer), giving a whole multilayer nonlinear nonnegative deep network structure, as illustrated in Figure 2.
To further optimize the multilayer deep network, all weights are fine-tuned by the improved BP algorithm, which reduces the reconstruction error and keeps the weights nonnegative by constraining each gradient descent iteration to satisfy $W(t) + \Delta W \geq 0$. In our fine-tuning stage, the weights are updated by the conjugate gradient method with three line searches, using the source code from [8], and nonnegative weights are ensured by the computation $W(t + 1) = \max\{0, W(t) + \Delta W\}$.
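The nonnegativity safeguard used during fine-tuning amounts to an elementwise projection after each weight update. A minimal NumPy sketch is given below; the function name `nonnegative_step` and the toy values are illustrative assumptions, and the update `delta_W` is taken as given (in the paper it comes from the conjugate gradient routine of [8], which is not reproduced here).

```python
import numpy as np

def nonnegative_step(W, delta_W):
    """Apply W(t+1) = max{0, W(t) + delta_W} elementwise."""
    return np.maximum(0.0, W + delta_W)

# Toy weights and a toy update (illustrative values only).
W_t = np.array([[0.5, 0.1],
                [0.0, 2.0]])
delta_W = np.array([[-0.2, -0.3],
                    [0.4, -1.0]])
W_next = nonnegative_step(W_t, delta_W)
```

Entries that the update would push below zero are clipped to zero, while all other entries receive the full update.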
The main steps of the BNDL algorithm are described in Algorithm 1.
Algorithm 1 (training algorithm for BNDL). The learning algorithm for BNDL includes the following steps.
(2) Repeatedly update the reconstruction weights $W^{(l)}$ and the generative weights $P^{(l)}$ by, respectively, using the iterative formulas (14) and (15) until the maximum number of iterations is reached.
Step 2. Unroll each two-layer nonnegative network by using $P^{(l)}$ and $W^{(l)}$ to construct a multilayer nonnegative nonlinear network, as shown in Figure 2.
Step 4. Run the $k$-means algorithm on the feature representation given by the nonnegative output.

Experiments
In this section, we carry out experiments to verify the validity of BNDL on three datasets, COIL20, COIL100, and CMU PIE, as shown in Table 1. To compare the clustering results, we use Accuracy (AC) and Normalized Mutual Information (NMI) [13] as the evaluation measures.

Datasets Introduction. The COIL20 dataset [17] used in this experiment contains 20 objects. The images of each object are taken 5 degrees apart as the object is rotated on a turntable, so each object has 72 images. Each image is preprocessed to 32 × 32 pixels, with 256 grey levels per pixel. Thus, each image is represented by a 1024-dimensional vector.
The COIL100 dataset [17], which contains 100 objects, is also used in the experiment. The images of each object are taken 5 degrees apart as the object is rotated on a turntable, so each object has 72 images. Each image is preprocessed to 32 × 32 pixels, with 256 grey levels per pixel. Thus, each image is represented by a 1024-dimensional vector.
The original CMU PIE database [18] includes 41,368 images of 68 people, each person under different poses and illumination conditions and with different expressions. Owing to the limitations of the experiment platform, in this experiment the first 42 face images of each person are extracted from the preprocessed CMU PIE dataset [18], which contains 11,544 face pose images of 68 persons at a size of 32 × 32 pixels. The experimental data therefore includes 2856 face images of 68 persons, with 42 face images in different poses per person.

Clustering Experiment on COIL20, COIL100, and CMU PIE Datasets.
The related literature has demonstrated that the NMF method performs well for clustering, especially in image clustering tasks. So we mainly carry out clustering experiments that compare the most closely related methods, namely NMF [1], PNMF [2], and Deep Semi-NMF [9], with our deep learning model, BNDL. In this experiment, BNDL has five layers: the number of nodes in the first layer is the dimension of the input data, the second layer has 500 nodes, the third layer has 500 nodes, the fourth layer has 2000 nodes, and the last layer has as many nodes as there are classes, $k$. So the whole network structure is 1024-500-500-2000-$k$. Note that fine-tuning is carried out on the unrolled nine-layer network. The network structure can be seen in Figure 2.
In this subsection, "2nd layer" denotes the second layer of the deep network matrix decomposition, "rec. 1st layer" denotes the first reconstruction layer of the deep network matrix reconstruction, and so on.
In the clustering experiment, we use the features from all learned layers and reconstruction layers of BNDL to compare with the other methods. All algorithms use the same number of iterations (5000), the same initialization method (random initialization), and the same termination condition (the error is less than $10^{-6}$). The clustering performance based on AC and NMI is shown in Table 2.
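For readers who want to reproduce the two evaluation measures, the following standard-library Python sketch computes AC and NMI from true and predicted labels. It is an illustration under assumptions, not the evaluation code used in the experiments: the function names are hypothetical, the best cluster-to-class mapping is found by brute force over permutations (practical only for a handful of clusters; the Hungarian algorithm is the usual choice at scale), and normalizing the mutual information by the geometric mean of the two entropies is one common convention that may differ from the one in [13].

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """AC: fraction of samples matched under the best one-to-one label mapping."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with None so every cluster can be assigned even if clusters > classes.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    counts = Counter(zip(y_pred, y_true))
    best = max(
        sum(counts[(c, perm[i])] for i, c in enumerate(clusters))
        for perm in permutations(padded, len(clusters))
    )
    return best / len(y_true)

def normalized_mutual_information(y_true, y_pred):
    """NMI = I(true; pred) / sqrt(H(true) * H(pred))  (assumed normalization)."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    pt, pp = Counter(y_true), Counter(y_pred)
    mi = sum(c / n * math.log(c * n / (pt[a] * pp[b]))
             for (a, b), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in pt.values())
    h_p = -sum(c / n * math.log(c / n) for c in pp.values())
    denom = math.sqrt(h_t * h_p)
    return mi / denom if denom > 0 else 1.0

y_true = [0, 0, 1, 1, 2, 2]
y_perm = [1, 1, 0, 0, 2, 2]    # same partition, cluster ids permuted
y_off = [0, 0, 0, 1, 2, 2]     # one sample placed in the wrong cluster
ac_perm = clustering_accuracy(y_true, y_perm)
ac_off = clustering_accuracy(y_true, y_off)
nmi_perm = normalized_mutual_information(y_true, y_perm)
```

Both measures ignore the arbitrary numbering of clusters, which is why the permuted labeling scores the same as a perfect one.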
From the results in Table 2, we can see that BNDL has the following advantages:
(1) Compared with single-layer matrix decomposition networks, BNDL learns richer features without reducing the clustering effect.
(2) Compared with the deep matrix decomposition network, each layer of BNDL has a better clustering effect.
(3) Each layer of BNDL has stable clustering performance, which benefits data representation and downward transmission.
(4) From the results on COIL20 and COIL100, we can see that BNDL has better clustering performance for large-scale samples, which is more conducive to the feature expression of complex data.
(5) The experimental results on CMU PIE face images are somewhat disappointing. Compared with our model, NMF and Deep Semi-NMF suffer from the out-of-sample problem, whereas both our BNDL model and the classical PNMF solve the out-of-sample deficiency. Moreover, BNDL performs better than PNMF on the CMU PIE face dataset. In addition, face clustering and classification usually obtain better results with the cosine distance metric, so graph regularization with cosine similarity will be introduced to improve BNDL in the future.
The reconstruction performance of BNDL is also excellent. To compare the reconstruction performance of each algorithm, we compare the clustering performance of the reconstructed data. Table 3 gives the results on the three datasets. The final reconstruction result of BNDL does not degrade compared with the reconstruction result of single-layer matrix decomposition, and BNDL is better than the other deep network matrix decomposition method. The experimental results show that the reconstructed data of BNDL has a better clustering effect and that the deep network structure is more conducive to multilayer feature expression.
The reason why BNDL has better reconstruction results is that, after decomposing all layers, it connects the layers into a deep learning network and fine-tunes the whole network with the improved BP algorithm to minimize the reconstruction error. To verify the effect of fine-tuning on the whole network, we compare the results before and after fine-tuning, as shown in Figures 3 and 4. The two figures show that the clustering ability of BNDL improves considerably after fine-tuning, which means that fine-tuning has a strong effect on the network; that is to say, the multilayer feature representation ability of the network is strongly boosted by fine-tuning.

Conclusion
This paper proposes a bidirectional nonnegative deep learning model for obtaining effective feature representations. Given a nonnegative dataset as input, the model automatically learns a deep hierarchy of nonlinear and nonnegative feature representations, and these representations are shown to be well suited for clustering. PNMF is a linear dimensionality reduction and feature representation method that factorizes the original data in only one step. Deep Semi-NMF can construct a multistep linear factorization to learn more features, but it is still a linear dimensionality reduction method and is not fully nonnegative. Our BNDL model combines the advantages of PNMF and deep belief networks, under the inspiration of Deep Semi-NMF, and overcomes the abovementioned shortcomings. At the same time, we designed an effective learning algorithm for optimizing the parameters of the BNDL model. Lastly, we show that, to a degree, it achieves better clustering performance than the single-layer NMF and PNMF and than Deep Semi-NMF. In addition, our method avoids the out-of-sample problem and negative feature representations, which reflects a motivation different from that of Deep Semi-NMF.

Figure 4: NMI before and after fine-tuning of BNDL on COIL100.

Table 1: The characteristics of the datasets.

Table 2: Performance comparison on different datasets.

Table 3: Performance comparison of reconstruction data on different datasets.