Independent Component Analysis Based on Information Bottleneck

Abstract and Applied Analysis


Introduction
Information theory was founded by Claude Elwood Shannon in his famous 1948 paper, "A Mathematical Theory of Communication," where he defined information and information entropy on the basis of probability theory, building a bridge between information theory and numerical mathematics. Some basic concepts of information theory (entropy, negentropy, mutual information, and so on) have been used successfully to characterize independent components (ICs) and to deal with problems in the application of blind source separation (BSS). In the past decades, information theory has been applied successfully in many fields, such as clustering [1], medical examination [2], independent component analysis [3], feature learning [4], and telecommunication [5][6][7][8]. The purpose of this paper is to use the information bottleneck to derive the maximum of the mutual information (MI) between the mixing data and the recovery data, which is no more than the MI between the recovery data and the original sources.
The rest of the paper is organized as follows. In Section 2, we explain the information theory and introduce some important formulas. In Section 3, based on entropy, mutual information (MI), and negentropy, the information bottleneck is used to illustrate the equivalence of two classical algorithms, infomax [3] and FastICA [9]. Finally, in Section 4, a series of experiments on synthetic data, sonic data, and image data compares the accuracy and complexity of the two algorithms. However, the ambiguity in the direction (sign) and scale of the recovery matrix can invert the colors of the recovered image.

Information Theory
According to Warren Weaver's explanation of communication theory, "information" relates not to what you do say but to what you could say. That is, information is a measure of one's freedom of choice when one selects a message [6,7,10].
At first, attention focused on "meaningful" or "relevant" information, which is crucial to the problem of transmitting information. Later, some scholars argued that lossy source compression provides a natural quantitative approach to "relevant information" [11,12]. The information bottleneck, which seeks a tradeoff between compressing the representation and preserving meaningful information, can therefore be decomposed into the following questions: (1) how to define the "meaningful" or "relevant" information; (2) how to extract an efficient representation of the relevant information so that it can be transmitted quickly; (3) how to recover the information as exactly and comprehensively as possible from that efficient representation alone.
The possible outcomes of an uncertain event are regarded as surprise or information [13]: the smaller the probability of an outcome, the bigger the surprise and the more information obtained when it occurs. Entropy $H(X)$, a measure of the degree of disorder, is therefore defined to measure the uncertainty of information. Assume that $X$ is a discrete random variable with probability mass function $p(x) = \Pr(X = x)$, $x \in \mathcal{X}$; then, entropy is defined as

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

Moreover, it is easy to generalize this definition to several random variables, which gives the joint entropy. On the other hand, mutual information (MI), a measure of the dependency between two sets of random variables, is regarded as the reduction in the uncertainty of one random variable given the other. Consider two random variables $X$ and $Y$ with joint distribution $p(x, y)$ and marginal distributions $p(x)$ and $p(y)$, respectively. MI can be written as follows:

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$

It is easy to prove the following equations about MI based on information entropy:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

According to the last two terms, we can find the relationship between MI and entropy. In particular, $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.
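The definitions above can be checked numerically on a small example. The following is a minimal sketch, assuming base-2 logarithms (the text does not fix the base) and a hypothetical joint distribution of two binary variables; the function names `entropy` and `mutual_information` are illustrative only.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint table joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# Hypothetical joint distribution of two dependent binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_xy = entropy([p for row in joint for p in row])
mi = mutual_information(joint)

# The identity I(X;Y) = H(X) + H(Y) - H(X,Y) holds up to rounding error.
print(mi, h_x + h_y - h_xy)

# For an independent pair, p(x, y) = p(x) p(y) and the MI is zero.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
print(mutual_information(independent))
```

The dependent table gives a strictly positive MI, while the independent table gives zero, which matches the statement that MI vanishes exactly when the two variables are independent.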

The Equivalence of the Two ICA Algorithms Based on the IBN
Information bottleneck (IBN) [14] aims to ensure that the compressed information, which is presumed to be a good representation or compression of the original information, can be recovered by the recipient. In the terminology of information theory and optimization theory, this gives rise to two conflicting optimization problems: on the one hand, we want to minimize the MI between the original information and the compressed information; on the other hand, we want to capture the maximum MI between the compressed information and the information delivered to the recipient.
The amount of information about the independent sources contained in the mixtures $X$ is given by the MI between the sources and the mixing signals, which is determinate but unknown under the preconditions of ICA.
ICA seeks recoveries $Y = (y_1, y_2, \ldots, y_n)$ that equal the original independent sources up to the ambiguity of direction (sign) and scale. Furthermore, the independent sources are the most concise representation, whereas any linear transformation of them introduces redundant information, with no redundancy if and only if the transformed components are themselves the independent sources. That is to say, we need to find the recovery matrix $W$, with $Y = WX$, in order to obtain the independent sources. Because the independent sources and the mixing matrix are unknown, IBN [14] writes the optimization problem of ICA in terms of two MI quantities, one a theoretical maximum and the other an approximate maximum.

Infomax Method.
The infomax method [3] tackles the problem of separating the mixture signals $X$: it looks for the weight matrix $W$ without knowing either the mixing matrix or the original signals. We illustrate BSS as follows. According to the optimization problem of ICA (16), we can rewrite it as

$$\max_{W} I(X; Y) = \max_{W} \left[ H(Y) - H(Y \mid X) \right].$$

Because $H(Y \mid X)$ does not depend on $W$ [3], this is also regarded as

$$\max_{W} H(Y). \tag{10}$$

The equation can be differentiated with respect to a parameter $W$ involved in the mapping from $X$ to $Y$:

$$\frac{\partial}{\partial W} I(X; Y) = \frac{\partial}{\partial W} H(Y). \tag{11}$$

Therefore, the MI between the mixtures $X$ and the recoveries $Y$ can be maximized by maximizing the entropy of the recoveries alone. A gradient method is then used to obtain the learning rule [3].
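To make the gradient step concrete, the following is a minimal sketch of an entropy-maximizing update. It assumes a logistic nonlinearity and the natural-gradient form of the rule, $\Delta W \propto (I + (1 - 2g(Y))Y^{T})W$, neither of which is spelled out in the text above; the Laplacian sources, the mixing matrix `A`, the step size, and the iteration count are all hypothetical choices for illustration.

```python
import numpy as np

def infomax_natural_gradient(X, lr=0.05, n_iter=2000):
    """Estimate an unmixing matrix W so that the rows of Y = W X are independent.

    Assumes a logistic nonlinearity g, which suits heavy-tailed
    (super-Gaussian) sources, and uses the natural-gradient update.
    """
    n, m = X.shape
    W = np.eye(n)
    identity = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        # 1 - 2 * sigmoid(y) equals -tanh(y / 2).
        phi = -np.tanh(Y / 2.0)
        # Batch average of (I + phi(Y) Y^T), applied to the current W.
        W = W + lr * (identity + phi @ Y.T / m) @ W
    return W

# Hypothetical demo: two Laplacian sources and a fixed mixing matrix.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)   # center the mixtures

W = infomax_natural_gradient(X)
P = W @ A   # close to a scaled permutation matrix when separation succeeds
print(np.round(P, 3))
```

The natural-gradient form is used here only because it avoids inverting $W$ at every step; a plain gradient ascent on $H(Y)$ follows the same idea.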
To address the slow convergence and low accuracy of the gradient method, Hyvärinen proposed FastICA, which is based on maximizing negentropy, regarded as a measure of the independence of the signal.

FastICA Method.
According to the infomax method and IBN, BSS is equivalent to the following optimization problem:

$$\max_{Y} H(Y). \tag{12}$$

Then, the optimization problem can be adapted as

$$\min_{Y} I(Y) = \sum_{i=1}^{n} H(y_i) - H(Y),$$

where $Y = (y_1, y_2, \ldots, y_n)$ and $y_i$ ($i = 1, 2, \ldots, n$) are the ICs. How can we identify and measure the independence of the recovery data? The link between non-Gaussianity and negentropy is illustrated based on the Central-Limit Theorem [9].
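One standard way to see why minimizing the mutual information among the recoveries leads to a non-Gaussianity criterion is sketched below. It assumes that the recoveries are whitened (uncorrelated, with unit variance) and that the recovery matrix is orthogonal on the whitened data; the negentropy symbol $J$ and the Gaussian reference variable $y_i^{\mathrm{gauss}}$ (a Gaussian with the same variance as $y_i$) are notation introduced here for illustration.

```latex
\begin{align*}
J(y_i) &= H\bigl(y_i^{\mathrm{gauss}}\bigr) - H(y_i) \;\ge\; 0, \\
I(Y)   &= \sum_{i=1}^{n} H(y_i) - H(Y)
        = \sum_{i=1}^{n} H\bigl(y_i^{\mathrm{gauss}}\bigr)
          - \sum_{i=1}^{n} J(y_i) - H(Y).
\end{align*}
```

Under these assumptions, $H(Y)$ and each $H(y_i^{\mathrm{gauss}})$ are constants, so minimizing $I(Y)$ amounts to maximizing the sum of the negentropies $J(y_i)$.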
Theorem 1 (Central-Limit Theorem [15]). Given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed.
According to the Central-Limit Theorem, the non-Gaussianity of the recovery data $Y$ reaches its maximum if and only if $Y$ is a permutation of the original independent sources. The corresponding formulation, which involves the identity matrix $I$, is equivalent to (7) and (10). So, we obtain the equivalence of the two classical algorithms from the point of view of IBN.
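The role of the Central-Limit Theorem can be illustrated with a small simulation: averaging more independent non-Gaussian variables produces a result that is closer to Gaussian, so an unmixed source is "more non-Gaussian" than any mixture. The sketch below uses excess kurtosis as a simple stand-in for negentropy, which is an assumption made for illustration; the uniform sources, sample size, and function names are hypothetical.

```python
import random
import statistics

def excess_kurtosis(samples):
    """Sample excess kurtosis: close to zero for Gaussian data."""
    mean = statistics.fmean(samples)
    centered = [s - mean for s in samples]
    var = statistics.fmean(c ** 2 for c in centered)
    fourth = statistics.fmean(c ** 4 for c in centered)
    return fourth / var ** 2 - 3.0

def average_of_uniforms(k, n_samples, rng):
    """Draw n_samples averages of k independent uniform(-1, 1) variables."""
    return [sum(rng.uniform(-1.0, 1.0) for _ in range(k)) / k
            for _ in range(n_samples)]

rng = random.Random(0)
kurtosis_by_k = {k: excess_kurtosis(average_of_uniforms(k, 20000, rng))
                 for k in (1, 2, 8)}

# The magnitude of the excess kurtosis shrinks as more variables are averaged.
for k, value in kurtosis_by_k.items():
    print(k, round(value, 3))
```

A single uniform variable has a theoretical excess kurtosis of $-1.2$, and the average of $k$ of them has $-1.2/k$, so the printed magnitudes should decrease with $k$.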
The approximation of negentropy and a fixed-point algorithm are applied to derive the learning rule [9].
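A fixed-point iteration of this kind can be sketched as follows. This is an illustrative sketch rather than the exact procedure of [9]: it assumes the symmetric (parallel) variant of the iteration, the $\tanh$ nonlinearity (the derivative of a log-cosh contrast), and whitening as a preprocessing step. The test signals, their frequencies, and the mixing matrix `A` are hypothetical.

```python
import numpy as np

def whiten(X):
    """Center X (signals x samples) and transform it to identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return V @ Xc, V

def symmetric_decorrelation(W):
    """Return (W W^T)^(-1/2) W, which has orthonormal rows."""
    eigval, eigvec = np.linalg.eigh(W @ W.T)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ W

def fastica(X, max_iter=200, tol=1e-8, seed=0):
    """Symmetric fixed-point iteration; returns (unmixing matrix, steps used)."""
    Z, V = whiten(X)
    n, m = Z.shape
    rng = np.random.default_rng(seed)
    W = symmetric_decorrelation(rng.standard_normal((n, n)))
    for step in range(1, max_iter + 1):
        G = np.tanh(W @ Z)
        # Row-wise update: w <- E[z g(w^T z)] - E[g'(w^T z)] w.
        W_new = G @ Z.T / m - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        W_new = symmetric_decorrelation(W_new)
        # Converged when every row is unchanged up to its sign.
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:
            break
    return W @ V, step

# Hypothetical demo: a sinusoid, a rectangular wave, and a sawtooth wave.
t = np.linspace(0.0, 1.0, 4000, endpoint=False)
S = np.vstack([
    np.sin(2 * np.pi * 7 * t),
    np.sign(np.sin(2 * np.pi * 11 * t)),
    2.0 * ((13 * t) % 1.0) - 1.0,
])
A = np.array([[1.0, 0.5, 0.3],
              [0.4, 1.0, 0.6],
              [0.7, 0.2, 1.0]])
X = A @ S

W, n_steps = fastica(X)
P = W @ A   # close to a scaled, signed permutation when separation succeeds
print(np.round(P, 3), n_steps)
```

Because the rows of the whitened unmixing matrix are kept orthonormal, the recoveries have identity covariance by construction; only their order and signs remain undetermined.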

Experiments
Based on the infomax learning rule, the experiments presented here use the synthetic data plotted in Figure 1(a) as the original data.
In order to illustrate the efficiency of the FastICA algorithm and the limitation to no more than one Gaussian variable, we list some numerical results on the blindly mixed signals shown in Figures 2, 3, and 4, using the nonquadratic function $G_1$ to approximate the negentropy. For the experiment in Figure 2, the product of the separation matrix and the mixing matrix is

$$\begin{pmatrix} -0.707 & -0.008 & 0.001 \\ 0.009 & -0.500 & 0.000 \\ 0.001 & -0.018 & 0.833 \end{pmatrix}. \tag{18}$$
Figure 2 clearly demonstrates the efficiency of the algorithm, which successfully separates the randomly mixed sinusoid, rectangular curve, and sawtooth curve. Moreover, (18) reveals that the separation matrix is an elementary transformation of the approximate inverse of the mixing matrix.
Then, it is necessary and meaningful to add a Gaussian variable to the original data to test the efficiency of the algorithm; the result is shown in Figure 3, and the product matrix is given in (19):

$$\begin{pmatrix} -0.00 & 0.03 & -0.83 & -0.14 \\ 0.71 & 0.01 & 0.00 & 0.20 \\ -0.06 & 0.01 & -0.05 & 1.50 \\ -0.01 & 0.50 & 0.01 & -0.08 \end{pmatrix}. \tag{19}$$

Finally, when the mixing data contain two Gaussian signals, the algorithm cannot separate the two Gaussian signals from each other, as shown in Figure 4.
Furthermore, the average iterative steps of the first three numerical experiments based on the FastICA algorithm are shown in Table 1.

| Original data set | The iterative steps |
| --- | --- |
| Figure 2 | Less than 15 steps |
| Figure 3 | Less than 20 steps |
| Figure 4 | Not stationary |
After the experiments on the synthetic data, the algorithm is also shown to be efficient on the real sonic data in Figure 5 and the image data in Figure 6. The corresponding product matrices are

$$\begin{pmatrix} -0.376 & 9.318 \\ 14.193 & 0.175 \end{pmatrix}, \tag{20}$$

$$\begin{pmatrix} 0.069 & 0.798 \\ 0.898 & 0.015 \end{pmatrix}. \tag{21}$$

In the process of separating the image data, the picture in Figure 7 can be obtained, because the recovery matrix, which inverts the colors of the picture, is not the exact inverse of the mixing matrix.

Conclusion
The algorithm of independent component analysis is inspired by BSS, which is a very successful application of information theory to speech recognition and image separation when the linear transformation is unknown. However, there are also some disadvantages. For example, it relies on the strong preconditions that the original data are independent and that the transformation is linear.

Figure 1: ICA. The synthetic independent data are plotted in (a), and the recovery data are shown in (b), corresponding to the matrix in (17). For every column $j$ of the matrix, the substantial entry, in row $i$ say, reflects the transformation between the original data ($j$) and the recovery data ($i$): the recovery data ($i$) are the original data ($j$) multiplied by this substantial entry, accompanied by contributions from the other nonzero entries in row $i$. For example, in the first result of the ICA experiments, the $(2, 1)$ entry, $-0.6804$, shows that the original data (1) are recovered as the recovery data (2) with this multiplier, plus some noise due to the small $(2, 2)$ and $(2, 3)$ entries.

Figure 2: Based on the FastICA algorithm, the randomly mixed sinusoid, rectangular curve, and sawtooth curve are separated successfully. The product matrix of the separation matrix and the random mixing matrix is presented in (18).

Figure 3: In this numerical experiment, a Gaussian signal is added to the above experiment, and the mixing data are again blindly separated successfully. The product matrix is given in (19).

Figure 4: The algorithm is not very efficient in separating the sinusoid from the two Gaussian signals in the mixing data. The more Gaussian variables there are, the more difficult it is to recover the original data.

Figure 5: Real sonic data (FastICA). Using real sonic data from the website, we can also obtain the recovery data and the product matrix (20).

Figure 6: Using the picture from the website and Gaussian noise, mixed with the matrix [2, 3; 2, 1], the Gaussian noise and the original picture are shown in the first row, the two mixed pictures in the second row, and the recovered pictures in the third row.

Figure 7: The recovery matrix $W$ is not the exact inverse of the mixing matrix, while the recovery data $Y$ are ordered differently from the original sources and are very accurately estimated, up to multiplicative signs (FastICA).
The result by infomax is shown in Figure 1(b), corresponding to the recovery matrix in (17). The product of the recovery matrix and the mixing matrix should be a permutation of an approximately diagonal matrix. Indeed, only one substantial entry (boxed) exists in each row and column:

$$\begin{pmatrix} -0.191 & -0.015 & \boxed{-0.800} \\ \boxed{0.680} & -0.020 & -0.223 \\ 0.023 & \boxed{0.500} & -0.063 \end{pmatrix}. \tag{17}$$
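The "one substantial entry per row and column" property can be checked mechanically. The sketch below takes the numerical values of the matrix in (17) from the text; the helper names and the dominance threshold `ratio=0.5` are hypothetical choices made for illustration, not part of the original experiments.

```python
def dominant_columns(matrix):
    """For each row, return the column index of the entry largest in magnitude."""
    return [max(range(len(row)), key=lambda j: abs(row[j])) for row in matrix]

def is_scaled_permutation(matrix, ratio=0.5):
    """Check that each row and column has exactly one substantial entry.

    An entry counts as substantial when every other entry in its row is at
    most `ratio` times its magnitude (an arbitrary illustrative threshold).
    """
    cols = dominant_columns(matrix)
    if sorted(cols) != list(range(len(matrix))):
        return False   # two rows share the same dominant column
    for row, c in zip(matrix, cols):
        peak = abs(row[c])
        if any(abs(v) > ratio * peak for j, v in enumerate(row) if j != c):
            return False
    return True

# Values of the product matrix reported in (17).
P17 = [[-0.191, -0.015, -0.800],
       [0.680, -0.020, -0.223],
       [0.023, 0.500, -0.063]]

print(dominant_columns(P17))       # [2, 0, 1]
print(is_scaled_permutation(P17))  # True
```

The list of dominant columns reads off which original signal each recovered signal corresponds to, and the sign of each dominant entry shows whether that signal was flipped.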

Table 1: The comparison of the iterative steps on the different data.