Nonnegative Signal Decomposition with Supervision

This paper presents a novel algorithm for numerically decomposing mixed signals in a collaborative way, given supervision of the labels that each signal contains. The decomposition is formulated as an optimization problem incorporating a nonnegativity constraint, and a nonnegative data factorization solution is presented to yield the decomposed results. The optimization is shown to be efficient and to decrease the objective function monotonically. The decomposition algorithm can be applied to multilabel training samples for pattern classification. Experimental results on real data show that the proposed algorithm significantly improves multilabel image classification performance under weak supervision.


Introduction
Signal decomposition, which separates a mixed source signal into its constitutive pure components, is an important step in many practical engineering problems. For example, in image classification or speech recognition, a single collected training or test sample usually contains multiple additive signals from several classes. Decomposing the mixed signals can yield pure feature representations for better class model training and recognition. Figure 1 illustrates the decomposition of an image containing three object classes. From the given mixed signal at the top, we aim to derive a specific representation for each of its associated labels, reflecting the corresponding pure region in the image. The decomposed representation is more accurate and informative for describing object categories.
Unsupervised signal decomposition, for example, blind source separation in signal processing, is a fundamental problem that has been well explored over the past decades [1, 2]. Various methods have been proposed for different tasks, including Independent Component Analysis (ICA) for speech signals [3] and Null Space Pursuit (NSP) for electricity consumption and global surface temperature [4]. In image processing, a straightforward way to decompose an image representation into label-pure representations is manual labeling, but this is tedious and impractical in most cases. Most conventional approaches to automatic image decomposition have focused on clustering based on low-level visual cues, such as bottom-up image segmentation, in which pixels are locally grouped by appearance [5], and top-down image parsing, in which primitives (e.g., rectangles, sketches, and edges) are correlated based on a few grammar rules [6]. However, none of these methods takes supervision information into consideration. In practice, online images from the Internet are usually associated with textual labels on the webpage, which can be utilized to guide the decomposition.
In this paper, we focus on the signal decomposition problem with label supervision; that is, given a set of mixed signals, each with the class labels it contains, we decompose each signal into label-specific parts. To this end, we exploit the fundamental principle of pattern classification that signals of the same label should be close to each other in feature space, while those of different labels should be relatively separated. Based on this assumption, a novel algorithm is proposed to numerically decompose mixed signals with labels in a collaborative way. The decomposition is formulated as an optimization problem incorporating a nonnegativity constraint, motivated by the nonnegative nature of most real-life signals. Furthermore, we present a nonnegative data factorization solution, which is shown to be efficient and to decrease the objective function monotonically. The decomposition can be applied to multilabel training samples for pattern classification. Experimental results on real data show that the proposed algorithm significantly improves multilabel image classification under weak supervision.

Notation
In the following, we use these conventions to facilitate presentation: for any matrix A, A^i denotes the i-th row vector of A, its lowercase version a_i denotes the i-th column vector, and A_ij denotes the element of A at row i and column j. A label represents a class, and a tag is a specific label instance. For a sample i ∈ {1, ..., N}, its signal representation is a d-dimensional normalized vector v_i with ‖v_i‖_1 = 1, where ‖·‖_1 denotes the ℓ1-norm. The labels of this sample are denoted by a vector l_i ∈ R^C with only 0 or 1 elements, where C is the number of labels. Let a d-dimensional vector r_i^c denote the representation of the part corresponding to the c-th label in the i-th sample. For a set of training samples, we collect the representations of all nonzero label instances, that is, "tags", and reindex them as vectors {r_j}, j ∈ {1, ..., Ñ}, where Ñ = ∑_{i=1}^{N} ∑_{c=1}^{C} l_ic is the number of valid tags in the set. Note that {r_j} are unknown; they are the objective to be computed.
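As a concrete illustration of the tag reindexing above, consider the following minimal sketch (a hypothetical helper in our own notation, not the authors' code), where the N × C binary matrix collects the label vectors l_i as rows:

```python
import numpy as np

# Minimal sketch of the tag reindexing described above: L is an N x C binary
# label matrix whose i-th row is l_i; every nonzero entry is one "tag",
# reindexed 1..N_tilde in row-major order.

def index_tags(L):
    """Return (N_tilde, tags), where tags[j] = (sample index i, label index c)."""
    tags = [(i, c) for i in range(L.shape[0])
                   for c in range(L.shape[1]) if L[i, c] == 1]
    return len(tags), tags

L = np.array([[1, 0, 1],
              [0, 1, 1]])          # N = 2 samples, C = 3 labels
N_tilde, tags = index_tags(L)     # N_tilde = sum of all label entries
```

Each tag j then corresponds to one unknown representation r_j to be computed.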

Problem Formulation
In this section, the decomposition problem described above is formulated and then solved by nonnegative data factorization. The presented solution is efficient, and the monotonic decrease of the objective function is proved in the convergence analysis.
With N training samples, we formulate the decomposition problem using the following representations: a d × N matrix V = [v_1, v_2, ..., v_N] encloses the holistic signal representations in columns, and a d × Ñ matrix R = [r_1, r_2, ..., r_Ñ] encloses the decomposed representations of all the tags, also arranged in columns, where Ñ is the total number of tags in the training set. The problem formulation consists of the following two parts.
(1) The first part minimizes the error of reconstructing the additive signal representations from the decomposed representations; that is, for all i ∈ {1, ..., N}, v_i ≈ ∑_j r_j, where the sum runs over the tags j associated with the i-th sample. Considering that the {r_j} in this formulation are not normalized, we introduce reconstruction coefficients and obtain v_i ≈ ∑_j H_ji r_j over the tags j associated with the i-th sample, where H_ji is the coefficient of r_j in the reconstruction of v_i. In matrix form, with an Ñ × N matrix H enclosing the coefficients, the reconstruction error minimization becomes arg min_{R,H} ‖V − RH‖². Note that a mixed holistic representation should be reconstructed only from representations of its associated labels; therefore, H_ji must be 0 whenever the j-th tag is not associated with the i-th sample. This information is derived from the training labels.
Here we relax the 0-element constraint on H and introduce a membership matrix Y, with Y_ji = 1 if the j-th tag is associated with the i-th sample and Y_ji = 0 otherwise. The minimization problem then becomes arg min_{R,H} ‖V − R(H•Y)‖². Note that Y will be abandoned later in the proposed solution, once the 0-element constraint on H is enforced directly.
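The masked reconstruction error can be sketched as follows (our own helper, assuming numpy, with "•" realized as the elementwise product):

```python
import numpy as np

# Sketch of the masked reconstruction error: Y is the N_tilde x N membership
# matrix (Y[j, i] = 1 iff tag j belongs to sample i), so each sample is
# reconstructed only from representations of its own labels.

def masked_recon_error(V, R, H, Y):
    """Return ||V - R (H • Y)||_F^2, with '•' the elementwise product."""
    return float(np.linalg.norm(V - R @ (H * Y), 'fro') ** 2)
```

When Y is all ones, this reduces to the unconstrained error ‖V − RH‖².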
(2) The second part minimizes the intralabel distance between decomposed representations of the same label. We use an Ñ × Ñ matrix S to encode the intralabel information: S_jk = 1 if r_j and r_k correspond to the same label, and 0 otherwise. (7) To force the decomposed representations of the same label to be similar to each other, we minimize ∑_{jk} ‖r_j − r_k‖² S_jk (here, for convenience, the {r_j} are assumed normalized, which is addressed below); in matrix form this equals 2 tr(RLR^T), where L = D − S and D is the diagonal matrix with D_jj = ∑_k S_jk. Representing the context among the decomposed local representations as a graph, as shown in Figure 2, L is the Laplacian matrix of this graph, from which the graph-preserving energy [7] (similar to (8)) is derived. Note that R may not be normalized. When measuring distances between the {r_j} in the graph-based representation, we therefore introduce a diagonal matrix Q to normalize R. Since V is known to be column-normalized, minimizing the reconstruction error in (6) makes R(H•Y) approximately column-normalized, and Q is designed accordingly as a diagonal matrix defined via the Ñ-dimensional all-ones vector e = [1, ..., 1]^T. Multiplying each r_j by the corresponding Q_jj, RQ becomes approximately columnwise normalized. Taking the normalization factor Q into account, the second term of the objective, minimizing the intralabel distance, becomes min tr(RQLQR^T).
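The matrices S, D, and the Laplacian L = D − S can be built directly from the per-tag label indices; the sketch below (a hypothetical helper of ours) also exercises the identity ∑_{jk} ‖r_j − r_k‖² S_jk = 2 tr(RLR^T) used above:

```python
import numpy as np

# Build S (S_jk = 1 iff tags j and k share a label), the degree matrix D,
# and the graph Laplacian L = D - S from a list of per-tag label indices.

def intra_label_laplacian(tag_labels):
    lab = np.asarray(tag_labels)
    S = (lab[:, None] == lab[None, :]).astype(float)
    D = np.diag(S.sum(axis=1))
    return S, D, D - S
```

With R holding the tag representations r_j as columns, the pairwise intralabel distances and 2 tr(RLR^T) agree term by term.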
Based on the above two parts, the decomposition problem is formulated as a regularized nonnegative data factorization problem:

arg min_{R,H} ‖V − RH‖² + λ tr(RQLQR^T), s.t. R ≥ 0, H ≥ 0, (13)

where λ is a weight set empirically in the experiments. The constraints require R and H to be nonnegative to satisfy the reconstruction assumption: in many real applications, the proposed decomposition is physically meaningful only when the signal representations and the reconstruction coefficients are nonnegative. The decomposed label representations R are obtained from the fission of V by solving (13); the solution is given in the following. In this formulation, the given V is normalized, but there is no constraint that R or H be normalized; H will be normalized during the optimization process.

Optimizing Solution
The problem in (13) falls into the framework of nonnegative matrix factorization (NMF) [8] because of its nonnegativity constraints. NMF has shown its effectiveness in practical signal processing for image classification [9]. Similar to the optimization strategy adopted in the previous work by Lee and Seung [8], we optimize the objective function of (13) iteratively with multiplicative update rules, which guarantee nonnegativity. For initialization, to obtain an initial guess of the target decomposed representations, each r_j is initialized with k_c, the average representation of its class c computed from the training samples associated with label c. Note that "•" stands for the elementwise (Hadamard) product in this paper.
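The class-mean initialization can be sketched as follows (a hypothetical helper consistent with the text: each tag representation is seeded with the mean holistic representation k_c of its class):

```python
import numpy as np

# Seed each tag representation r_j with k_c, the mean holistic representation
# of the training samples that carry label c (a hedged sketch of the
# initialization described in the text).

def init_R(V, L, tags):
    """V: d x N sample matrix, L: N x C binary labels, tags: list of (i, c)."""
    K = np.stack([V[:, L[:, c] == 1].mean(axis=1)
                  for c in range(L.shape[1])], axis=1)     # d x C class means
    return np.stack([K[:, c] for (_, c) in tags], axis=1)  # d x N_tilde
```

All tags of the same label start from the same point, which the iterations then refine per sample.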
Most iterative procedures for solving high-order optimization problems transform the original intractable problem into a set of tractable subproblems and converge to a local optimum. The proposed iterative procedure follows this philosophy and optimizes H and R alternately. The proposed update rules can be proved to decrease the objective function value monotonically; the theoretical proof is given in Section 5. Although stationarity is a necessary condition for a local minimum, it is difficult to prove that every limit point is a stationary point. Lin [10] slightly modified the original iterative algorithm of NMF to guarantee convergence. That modification, however, increases the computational complexity while achieving similar performance, as reported in [10]. Therefore, in this work, we do not adopt the modification; experimental results show that the proposed solution usually converges to a local minimum. We give the iterative procedure as follows.
(1) Optimize H for Given R. For fixed R, the objective function in (13) with respect to H can be written as F(H) = ‖V − RH‖². (14) Let Φ_ji be the Lagrange multiplier for the constraint H_ji ≥ 0; collecting the multipliers in Φ, the Lagrange function is L = ‖V − RH‖² + tr(ΦH^T). The partial derivative of L with respect to H is ∂L/∂H = −2R^T V + 2R^T RH + Φ. (16) Using the Karush-Kuhn-Tucker (KKT) condition [11] Φ_ji H_ji = 0, from (16) the following equations are obtained for H_ji: −(R^T V)_ji H_ji + (R^T RH)_ji H_ji = 0, which leads to the update rule H_ji^(t+1) = H_ji^(t) (R^T V)_ji / (R^T RH^(t))_ji. (20) Recall that H_ji^(t) = 0 if Y_ji = 0, where t denotes the t-th iteration; this is satisfied by initializing H_ji^(0) = 0 whenever Y_ji = 0, since zeros are fixed points of the multiplicative rule, so Y can actually be neglected without any influence. It is obvious that the updated H^(t+1) is still nonnegative if R and H^(t) are nonnegative. After obtaining H^(t+1), we normalize H^(t+1) (each tag's coefficient vector) and convey the norms to the corresponding columns of the basis matrix R. (22) This rescaling of R and H leaves the product RH, and hence the value of the objective function in (13), unchanged.
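One step of the H update and the subsequent renormalization can be sketched as follows (assuming numpy; the small eps guarding against division by zero is our addition):

```python
import numpy as np

def update_H(V, R, H, eps=1e-12):
    """Multiplicative step H <- H * (R^T V) / (R^T R H), then rescale each
    tag's coefficient row to unit l1 norm, conveying the norms to the
    corresponding columns of R so that the product R H is unchanged."""
    H = H * (R.T @ V) / (R.T @ R @ H + eps)
    n = H.sum(axis=1) + eps            # per-tag l1 norms (H is nonnegative)
    return H / n[:, None], R * n[None, :]
```

Because zeros of H are fixed points of the multiplicative rule, entries initialized to zero (wherever Y_ji = 0) stay zero throughout.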
(2) Optimize R for Given Normalized H. Based on the normalized H from (22), the objective function in (13) with respect to R for given H can be written as F(R) = ‖V − RH‖² + λ tr(RLR^T), with the normalization factor Q absorbed. (23) Let Ψ_dj be the Lagrange multiplier for the constraint R_dj ≥ 0; the Lagrange function is L = F(R) + tr(ΨR^T). The partial derivative of L with respect to R is ∂L/∂R = −2VH^T + 2RHH^T + 2λRL + Ψ. (25) Using the KKT condition Ψ_dj R_dj = 0, from (25) we obtain the following equations for R_dj: −(VH^T)_dj R_dj + (RHH^T)_dj R_dj + λ(RL)_dj R_dj = 0, which, splitting L = D − S so that both factors stay nonnegative, leads to the following update rule: R_dj^(t+1) = R_dj^(t) (VH^T + λRS)_dj / (RHH^T + λRD)_dj. (27)
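The R update can be sketched in the same multiplicative style (our own sketch in the spirit of the split L = D − S above, as in graph-regularized NMF; lam and eps are our parameters):

```python
import numpy as np

def update_R(V, R, H, S, D, lam, eps=1e-12):
    """Multiplicative step for R under the graph-regularized objective
    ||V - R H||^2 + lam * tr(R L R^T) with L = D - S: splitting L keeps
    both the numerator and the denominator nonnegative."""
    num = V @ H.T + lam * (R @ S)
    den = R @ (H @ H.T) + lam * (R @ D) + eps
    return R * num / den
```

Alternating update_H and update_R then drives the objective in (13) downward while keeping both factors nonnegative.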

Convergence Proof of the Proposed Solution
Definition 1. A function G(A, A′) is an auxiliary function for a function F(A) if the conditions G(A, A′) ≥ F(A) and G(A, A) = F(A) are satisfied.

Theorem 2. If G is an auxiliary function for F, then F is nonincreasing under the update A^(t+1) = arg min_A G(A, A^(t)), where t denotes the t-th iteration.

Convergence of the Update Rule for H
The auxiliary function of F_ji is designed as

G(h, H_ji^(t)) = F_ji(H_ji^(t)) + F′_ji(H_ji^(t))(h − H_ji^(t)) + [(R^T RH^(t))_ji / H_ji^(t)](h − H_ji^(t))², (32)

and comparing (32) with the Taylor series expansion of F_ji, using (R^T RH^(t))_ji = ∑_k (R^T R)_jk H_ki^(t) ≥ (R^T R)_jj H_ji^(t), we find that G(h, H_ji^(t)) ≥ F_ji(h). Thus, (33) holds and G(H_ji, H_ji^(t)) ≥ F_ji(H_ji).
Proof. By setting ∂G(H_ji, H_ji^(t))/∂H_ji = 0, the update rule for H in (20) is obtained.

Convergence of the Update Rule for R
The auxiliary function of F_dj is designed as

G(r, R_dj^(t)) = F_dj(R_dj^(t)) + F′_dj(R_dj^(t))(r − R_dj^(t)) + [(RHH^T + λRD)_dj / R_dj^(t)](r − R_dj^(t))². (37)

Lemma 5. Equation (37) is an auxiliary function for F_dj, the part of F(R) only relevant to R_dj.

Experiment
A typical real-life application of the proposed decomposition is as follows: for a set of images related to a scene such as a campus, each associated with multiple labels, we can decompose the image feature representations into pure label representations, providing better label samples for modeling object categories and thereby facilitating object recognition, detection, and so on. This leads to a useful tool for automatic labeling of online images or personal photo albums. There are two main reasons the proposed decomposition is necessary here. First, with limited training samples, it is difficult to learn good label models directly from the original image representations without decomposition: each sample is a mixture of several labels, and the background of each label varies, so it is better to decompose the image representation optimally across labels. Second, instances of the same label within a scene usually have small intrasubspace distance, which supports the assumption underlying the proposed decomposition and guarantees its effectiveness.
To demonstrate a practical application of the proposed decomposition in automatic image annotation, we manually crawled images of "KAIST campus" from the Flickr photo sharing website [12], excluded the irrelevant images, and obtained 183 images (the data are shared at http://rcv.kaist.ac.kr/~tengli/Teng resource.html). They are labeled with nine concepts: "sky," "plant (grass/tree/flower)," "water," "ground," "man-made (building/sign/road)," "car/bus," "bicycle," "person," and "animal." 150 images are used as training samples, on which class models are trained and the proposed decomposition is applied; the rest are used for testing. Labels are highly mixed in the images.
In the implementation, the images and regions are represented using the Bag-of-Words (BoW) feature [13]; the number of visual words is set to 500 as a tradeoff between classification accuracy and computational cost, so the feature dimension is 500. With the decomposed label representations from the training images, that is, the normalized R matrix, a typical classifier, the support vector machine (SVM) [14], is applied to learn label models and classify the testing samples. For multilabel image classification, we adopt the area under the ROC curve (AUC) measurement [15]. Table 1 compares image classification using the proposed nonnegative signal decomposition with the classic BoW method [13]. To further examine the performance, besides the well-known SVM, we also apply the recently popular Sparse Representation Classification (SRC) [16] as the classifier. Training with the original BoW image representations without decomposition yields an AUC as low as around 0.5, whereas training label models with the decomposed results yields AUCs of 0.77 and 0.76 on the test samples in combination with SVM and SRC, respectively. Figure 3 shows multilabel classification results for some sample images of the KAIST campus Flickr dataset. The results appear promising.
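As a self-contained stand-in for the classification stage (the paper uses SVM [14] and SRC [16]; the nearest-class-mean scorer below is only a hypothetical illustration of how decomposed per-label representations replace mixed holistic ones):

```python
import numpy as np

# Hypothetical nearest-class-mean scorer: each label is represented by the
# mean of its decomposed tag representations (columns of R), and a test
# sample is scored against every label by cosine similarity.

def label_scores(R, tag_labels, n_labels, v_test):
    lab = np.asarray(tag_labels)
    means = np.stack([R[:, lab == c].mean(axis=1) for c in range(n_labels)],
                     axis=1)
    means = means / (np.linalg.norm(means, axis=0, keepdims=True) + 1e-12)
    v = v_test / (np.linalg.norm(v_test) + 1e-12)
    return means.T @ v  # one score per label; higher = more likely present
```

Thresholding the per-label scores yields a multilabel prediction, analogous to the per-label decisions of the SVM/SRC classifiers used in the experiments.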

Conclusion
A new technique for nonnegative signal decomposition that exploits supervised additive label information has been developed in this paper, under the assumption that intralabel signals are relatively closer to each other than interlabel signals. The nonnegativity constraint makes it applicable to many practical cases. We formulated the decomposition as an optimization problem and proposed an efficient iterative solution based on nonnegative data factorization. Convergence analysis showed that the solution guarantees monotonic decrease of the objective function.

Figure 1: Illustration of the proposed decomposition with an example image: given the top image with multiple labels, from its holistic feature representation shown to its right, we aim to derive, for each of its associated labels, a label-specific representation that reflects the label's actual region in the image. This is more informative for modeling individual labels than the holistic representation.

Figure 2: An illustration showing the graph based representation which encodes the context among decomposed signals of multiple labels: each node represents the decomposed representation to be obtained for a specific tag; tags belonging to the same label are linked with edge weight 1, while edge weights between those of different labels are set to 0.

Figure 3: Exemplar results of multilabel image classification on the KAIST campus images using the proposed decomposed training samples: corresponding regions of the correctly detected labels are marked in the images.

Notation summary: r_i^c denotes the decomposed signal representation of the c-th label in the i-th training sample; r_j denotes the decomposed representation of the j-th valid tag. For any matrix A, A^i denotes the i-th row vector of A, its lowercase version a_i denotes the i-th column vector, and A_ij denotes the element of A at row i and column j; V = [v_1, v_2, ..., v_N].

Lemma 3. Equation (32) is an auxiliary function for F_ji, the part of F(H) only relevant to H_ji.

Proof. Since G(H_ji, H_ji) = F_ji(H_ji) is obvious, we need only show that G(H_ji, H_ji^(t)) ≥ F_ji(H_ji). To this end, we compare (32) with the Taylor series expansion of F_ji.

For the update rule for R, let us go back to (23) and consider any element R_dj of matrix R. With F_dj denoting the part of F(R) relevant to R_dj, it is easy to check that F′_dj = (−2VH^T + 2RHH^T + 2λRL)_dj and F″_dj = 2(HH^T)_jj + 2λL_jj.

Table 1: Comparison of classification performance (AUC) on the KAIST campus images.