Independent component analysis (ICA) is a widely applicable and effective approach to blind source separation (BSS), with the limitation that the sources must be statistically independent. However, a more common situation is blind source separation for the nonnegative linear model (NNLM), where the observations are nonnegative linear combinations of nonnegative sources, and the sources may be statistically dependent. We propose a pattern expression nonnegative matrix factorization (PE-NMF) approach from the viewpoint of using the basis vectors most effectively to express patterns. Two regularization (penalty) terms are added to the original loss function of standard nonnegative matrix factorization (NMF) so that the basis vectors of the PE-NMF express the patterns effectively. A learning algorithm is presented, and the convergence of the algorithm is proved theoretically. Three illustrative examples of blind source separation, including heterogeneity correction for gene microarray data, indicate that the sources can be successfully recovered with the proposed PE-NMF when the two regularization parameters are suitably chosen from prior knowledge of the problem.
Blind source separation (BSS) has recently been a very active topic in the signal processing and neural network fields [
Independent component analysis (ICA) has been found very effective for BSS in cases where the sources are statistically independent.
In fact, it factorizes the observation matrix into the product of a mixing matrix and a source matrix. It is easy to recognize that BSS for the NNLM is a nonnegative matrix factorization problem, that is, to factorize the nonnegative observation matrix into the product of a nonnegative mixing matrix and a nonnegative source matrix.
In this paper, we extend NMF to pattern expression NMF (PE-NMF) from the viewpoint that the basis vectors should be the ones that express the data most efficiently. Its successful application to blind source separation in the extended bar problem, a nonnegative signal recovery problem, and a heterogeneity correction problem for real gene microarray data indicates that it has great potential for blind separation of dependent sources under the NNLM model. The loss function for the PE-NMF proposed here is a special case of that proposed in [
The NMF problem is: given a nonnegative n × m observation matrix, factorize it (approximately) into the product of a nonnegative n × r basis matrix and a nonnegative r × m coefficient matrix, so that each column of the observation matrix is expressed as a nonnegative combination of the basis vectors. The basis should satisfy three requirements: (1) the angles between the vectors in the basis should be as large as possible, such that each data point lies within the cone the vectors span; (2) the angles between the vectors in the basis should be as small as possible, to make the vectors clamp the data as tightly as possible, such that no space is left for expressing what is not included in the data; and (3) each vector in the basis should be as efficient as possible in expressing the data. The vectors defined by the above three requirements are what we call the pattern basis of the data, and the number of vectors in the basis is the factorization rank r. Notice that the second requirement in the definition of the pattern basis readily holds from the constraint of NMF that the elements of the factor matrices are nonnegative.
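For a concrete reference point, the standard NMF factorization that PE-NMF builds on can be computed with the classical Lee-Seung multiplicative updates. The sketch below is a minimal NumPy illustration; the symbols `W` and `H` for the basis and coefficient matrices are our notation, not reproduced from the paper's (missing) equations.

```python
import numpy as np

def nmf(X, r, n_iter=500, eps=1e-9, seed=0):
    """Standard NMF, X (n x m) ~= W (n x r) @ H (r x m), via the classical
    Lee-Seung multiplicative updates for the squared Frobenius loss."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r)) + eps           # random positive initialization
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# toy check: a rank-2 nonnegative matrix is recovered almost exactly
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 1.0]])
W, H = nmf(X, r=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The multiplicative form guarantees that `W` and `H` stay nonnegative as long as they are initialized positive, which is why no explicit projection step is needed.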
Given an n × m nonnegative observation matrix, PE-NMF minimizes the standard NMF reconstruction loss augmented with two penalty terms that enforce the pattern-basis requirements. This problem is a special case of the constrained optimization problem proposed in [
For the derivation of learning
algorithm for
For any r × r symmetric nonnegative matrix
By noticing that
For any
Now in Theorem
The convergence proof proceeds by introducing an appropriate auxiliary function. Now we construct the auxiliary function to be
Obviously,
The Taylor expansion of the loss function
We employ the steepest-descent search strategy for the optimal
This theorem can be proved similarly to Theorem
By representing (
It is evident that the portion of the objective function relating to the row w
The update of the
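The paper's exact update rules depend on its two penalty terms, whose equations are not reproduced here. As a hedged illustration only, the sketch below assumes a commonly used penalized loss of the form 0.5·||X − WH||² + α·Σ_{i≠j}(WᵀW)_{ij} + β·Σ_{ij}H_{ij} (the first penalty discourages correlated basis vectors, the second favors efficient/sparse coefficients) and applies the standard multiplicative-update construction to it; the function name `pe_nmf`, the parameters `alpha` and `beta`, and the loss form itself are assumptions, not taken from the text.

```python
import numpy as np

def pe_nmf(X, r, alpha=0.01, beta=0.01, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative updates for the assumed penalized loss
    0.5*||X - W H||^2 + alpha * sum_{i != j} (W^T W)_{ij} + beta * sum_{ij} H_{ij}.
    Each denominator is the positive part of the corresponding gradient,
    which keeps W and H nonnegative throughout the iteration."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    off = np.ones((r, r)) - np.eye(r)      # picks out the cross terms of W^T W
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + beta + eps)
        W *= (X @ H.T) / (W @ H @ H.T + alpha * (W @ off) + eps)
    return W, H

# toy check on an exactly low-rank nonnegative matrix
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 8))
W, H = pe_nmf(X, r=2)
rel = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

With small `alpha` and `beta` the reconstruction stays close to the standard NMF fit; larger values trade reconstruction accuracy for a basis that better satisfies the pattern-expression requirements.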
To our knowledge, there are two main reasons for NMF to converge to undesired solutions. One is that the basis of a space may not be unique theoretically, so separate runs of NMF may lead to different results. The other comes from the algorithm itself: the loss function sometimes gets stuck in a local minimum during the iteration. Revisiting the loss function of the proposed PE-NMF, it is seen that, similar to NMF, PE-NMF still sometimes gets stuck in a local minimum during its iteration, and/or the number of iterations required to obtain the desired solution is very large. For these reasons, an ICA-based technique is proposed for initializing the source matrix instead of setting it to a random nonnegative matrix: we perform ICA on the observation signals and take the absolute values of the independent components obtained from ICA as the initialization of the source matrix. In fact, there are reasons why the independent components obtained from ICA are generally not the original sources. One reason is the nonnegativity of the original sources: the centering preprocessing of ICA makes each independent component take both positive and negative values, since the mean of each independent component is zero. Another reason is that the original sources may be dependent or only partially independent, which violates the independence requirement on the sources in ICA. Hence, the independent components resulting from ICA cannot be considered a recovery of the original sources. Even so, they still provide clues to the original sources: they can be considered a reasonable initialization of the source matrix.
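The ICA-based initialization described above can be sketched as follows. This is an illustrative implementation using scikit-learn's `FastICA`; the helper name `ica_init` and the choice of scikit-learn are our assumptions, not the paper's.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_init(X, r, seed=0):
    """Initialize the source matrix for (PE-)NMF as described in the text:
    run ICA on the observations and keep the absolute values of the
    independent components (their sign and scale are arbitrary anyway)."""
    ica = FastICA(n_components=r, random_state=seed, max_iter=1000)
    S_est = ica.fit_transform(X.T).T       # r x m estimated components
    return np.abs(S_est)                   # nonnegative initialization

# toy demo: 4 nonnegative mixtures of 2 nonnegative sources
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 400)
S = np.vstack([np.abs(np.sin(2 * np.pi * 3 * t)),
               np.abs(np.cos(2 * np.pi * 7 * t))])
A = rng.random((4, 2)) + 0.1               # nonnegative mixing matrix
X = A @ S
H0 = ica_init(X, r=2)
```

As the text notes, `H0` is not a recovery of the sources (centering forces zero-mean components, and the sources may be dependent), but it is a far more informative starting point for the PE-NMF iteration than a random nonnegative matrix.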
The proposed PE-NMF algorithm has been extensively tested on many difficult benchmarks of signals and images with various statistical distributions. Three examples are given in the following to demonstrate the effectiveness of the proposed method compared with the standard NMF method and/or the ICA method. In the ICA approach here, we de-center the recovered signals/images/microarrays to restore their nonnegativity, compensating for the centering preprocessing of the ICA approach. The NMF algorithm is simply the one proposed in [
The linear bar problem
[
Bar problem solution obtained from NMF: (a) source images, (b) mixed images, (c) recovered images from ICA, and (d) recovered images from NMF.
Extended bar problem solution obtained from PE-NMF: (a) source images, (b) mixed images, (c) recovered images from PE-NMF.
Recovered images from (a) ICA, and (b) NMF for the extended bar problem.
We performed experiments on recovering 5 nonnegative signals from 9 mixtures of 5 nonnegative, dependent source signals, which is the example in [
Blind signal separation example: (a) 5 original signals, (b) 9 observations, (c) recovered signals from NMF, and (d) recovered signals from PE-NMF.
Gene expression microarrays promise powerful new tools for the large-scale analysis of gene expression. Using this technology, the relative mRNA expression levels derived from tissue samples can be assayed for thousands of genes simultaneously. Such global views are likely to reveal previously unrecognized patterns of gene regulation and to generate new hypotheses warranting further study (e.g., new diagnostic or therapeutic biomarkers). However, as a common feature of microarray profiling, gene expression profiles represent a composite of more than one distinct but partially dependent source; that is, the observed signal intensity consists of the weighted sum of the activities of the various sources. More specifically, in the case of solid tumors, the related issue is called the partial volume effect (PVE): heterogeneity within the tumor samples caused by stromal contamination. Blind application of microarray profiling could result in extracting signatures reflecting the proportion of stromal contamination in the sample, rather than the underlying tumor biology. Such “artifacts” would be real, reproducible, and potentially misleading; they would not be of biological or clinical interest, and they can severely decrease the sensitivity and specificity of the measurement of molecular signatures associated with different disease processes. Despite its critical importance to almost all follow-up analysis steps, this issue, called partial volume correction (PVC), is often less emphasized, or at least has not been rigorously addressed, compared to the overwhelming interest and effort in pheno/gene-clustering and class prediction.
The effectiveness of the proposed PE-NMF method was tested for PVC on a real-world data set, a microarray gene expression data set. The data set consists of 2308 effective gene expression values from two samples of neuroblastoma and non-Hodgkin lymphoma cell tumors [
Heterogeneity correction result: (a) observations, (b) recovered sources from PE-NMF, and (c) real sources.
The scatter plots of the real sources (blue stars) and the recovered sources (red dots) from (a) PE-NMF, and (b) NMF.
This paper proposes a pattern expression nonnegative matrix factorization (PE-NMF) approach for efficient pattern expression and applies it to blind source separation for the nonnegative linear model (NNLM). Its successful application to blind source separation in the extended bar problem, a nonnegative signal recovery problem, and a heterogeneity correction problem for real microarray gene data indicates that it has great potential for blind source separation under the NNLM model. The learning algorithm for the PE-NMF proposed here is in fact an extension of the multiplicative update algorithm proposed in [
As has been mentioned in
[
Algorithm parameters. Input: an n × m nonnegative observation matrix. Output: nonnegative factor matrices.
This work was supported by the National Science Fund of China under Grant nos. 60574039, 60371044, and Sino-Italian joint cooperation fund, and the US National Institutes of Health under Grants EB000830 and CA109872.