This paper establishes the equivalence of two algorithms for independent component analysis (ICA) from the viewpoint of the information bottleneck (IB). From the perspective of information theory, we explain the two classical ICA algorithms through the information bottleneck. Furthermore, numerical experiments with synthetic data, sonic data, and images show that ICA, grounded in information theory, is an instructive way to solve blind source separation (BSS). Finally, two realistic numerical experiments are conducted with FastICA to illustrate the efficiency and practicality of the algorithm, as well as its drawbacks in recovering images from the mixed images.
1. Introduction
Information theory was founded by Claude Elwood Shannon (1948) in his famous paper, “A Mathematical Theory of Communication,” where he defined information and information entropy on the basis of probability theory, building a bridge between information theory and numerical mathematics. Basic concepts of information theory (entropy, negentropy, mutual information, and so on) have been used successfully to characterize independent components (ICs) and to address applied problems in blind source separation (BSS). Over the past decades, information theory has been applied successfully in many fields, such as clustering [1], medical examination [2], independent component analysis [3], feature learning [4], and telecommunication [5–8]. The purpose of this paper is to use the information bottleneck to derive the maximum of the mutual information (MI) between the mixed data and the recovered data, which is no more than the MI between the recovered data and the original sources.
The rest of the paper is organized as follows. In Section 2, we explain the relevant information theory and introduce some important formulas. In Section 3, based on entropy, mutual information (MI), and negentropy, the information bottleneck is used to illustrate the equivalence of the two classical algorithms, infomax [3] and FastICA [9]. Finally, a series of experiments on synthetic data, sonic data, and images in Section 4 makes it easy to compare the accuracy and complexity of the two algorithms. However, the ambiguity in the sign and scale of the recovery matrix can invert the recovered image.
2. Information Theory
According to the explanation of communication theory by Warren Weaver, “information” is not related to what you do say but to what you could say. That is, information is a measure of one’s freedom of choice when one selects a message [6, 7, 10].
At first, people focused attention on the “meaningful” or “relevant” information, which is crucial in solving the problem of transmitting information. Then, some scholars argue that lossy source compression provides a natural quantitative approach to “relevant information” [11, 12].
So the information bottleneck, which seeks a tradeoff between compressing the representation and preserving meaningful information, can be decomposed into the following aspects:
how to define the “meaningful” or “relevant” information;
how to extract the efficient representation of relevant information in order to transmit it speedily;
how to recover the information as exactly and comprehensively as possible only based on the efficient representation of relevant information.
The possible outcomes of an uncertain or fuzzy quantity are regarded as surprise or information [13]: the smaller the probability of an outcome, the bigger the surprise and the more information obtained. So entropy H(X), a measure of the degree of chaos, is defined to measure the uncertainty of information. Assume that X is a discrete random variable with probability mass function p(x) = Pr(X = x), x ∈ χ; then entropy is defined as

(1) H(X) = -∑_{x∈χ} p(x) log p(x).

Moreover, it is easy to generalize this to two or more random variables, giving the joint entropy. On the other hand, mutual information (MI), a measure of the dependency between two random variables, is the reduction in uncertainty about one random variable given the other. Consider two random variables X and Y with joint pmf p(x,y) and marginal pmfs p(x) and p(y), respectively. MI can be written as

(2) I(X;Y) = ∑_{x∈χ} ∑_{y∈Y} p(x,y) log [p(x,y) / (p(x)p(y))] = E_{p(x,y)} log [p(x,y) / (p(x)p(y))].
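As a concrete check of the two definitions above, the sketch below (illustrative helpers, not part of the paper's experiments) computes H(X) and I(X;Y) for discrete distributions using base-2 logarithms; for a fair binary variable copied through a noiseless channel, I(X;Y) = H(X) = 1 bit.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x) of a discrete pmf."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p(x) = 0 contribute nothing
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [p(x,y) / (p(x) p(y))]."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1)   # marginal p(x)
    py = pxy.sum(axis=0)   # marginal p(y)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask]))

# Fair bit sent over a noiseless channel: I(X;Y) = H(X) = 1 bit.
pxy = np.array([[0.5, 0.0],
                [0.0, 0.5]])
print(entropy([0.5, 0.5]))        # 1.0
print(mutual_information(pxy))    # 1.0
```

The same helpers also verify the identity I(X;Y) = H(X) + H(Y) - H(X,Y) numerically, since the joint entropy is just the entropy of the flattened joint pmf.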
Based on the information entropy, it is easy to prove the following identities for MI:

(3) I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y∣X) = H(X) - H(X∣Y).
The last two forms show the relationship between MI and entropy; in particular, I(X;Y) = 0 if and only if X and Y are independent.
3. The Equivalence of the Two ICA Algorithms Based on the IBN
The information bottleneck (IBN) [14] is used to ensure that the compressed information X, which is presumed to be a good representation or compression of the original information S, can be recovered by the recipient in terms of Y, in the chain

(4) S ⟹ X ⟹ Y.
Now, in the terminology of information theory and optimization, there are two competing objectives: on the one hand, we want to minimize the MI between the original information S and the compressed information X; on the other hand, we want to maximize the mutual information between Y and X. The amount of information about Y carried by X is given by

(5) I(X;Y) = ∑_y ∑_x p(y,x) log [p(y,x) / (p(y)p(x))] ≤ I(S;Y),

while the mutual information between the independent sources and the mixed signals is determinate but unknown under the preconditions of ICA.
ICA seeks independent components y_i, Y = (y_1, y_2, …, y_n), equal to the original independent sources S = (s_1, s_2, …, s_n) up to the ambiguity of sign and scale. Furthermore, the independent sources are the most concise representation, while any linear transformation of the independent sources introduces redundant information:

(6) I(S;Y) = H(S) - H(S∣Y) ≤ H(S),

with equality if and only if Y are the independent sources. That is to say, we need to find the recovery matrix W, Y = WX, in order to obtain the independent sources. Because the independent sources and the mixing matrix are both unknown, the optimal problem of ICA is written via the IBN [14] as

(7) max_Y I(X;Y),

where H(S) is the theoretic maximum and H(Y) is an approximate maximum.
3.1. Infomax Method
The infomax method [3] tackles the problem of separating the mixed signals X by looking for the weight matrix W without knowing either the mixing matrix A or the original signals S. We illustrate BSS as follows:

(8) S → X = AS (mixing), X → Y = WX (recovery).

According to the optimization problem of ICA (7), we can rewrite it as

(9) max_W I(X;Y) = H(Y) - H(Y∣X),

which is also regarded as

(10) max_W H(Y).

The objective can be differentiated with respect to the parameter W involved in the mapping from X to Y:

(11) ∂/∂W I(X;Y) = ∂/∂W H(Y).

Therefore, the MI between the mixtures X and the recoveries Y can be maximized by maximizing the entropy of the recoveries alone. The gradient method is then used to obtain the learning rule [3].
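The resulting Bell-Sejnowski learning rule, in its natural-gradient form ΔW ∝ (I + (1 − 2g(Y))Yᵀ)W with the logistic nonlinearity g, can be sketched as below. This is a minimal illustration rather than the authors' code; the step size, iteration count, and initialization are assumptions.

```python
import numpy as np

def infomax(X, n_iter=500, lr=0.02, seed=0):
    """Sketch of the Bell-Sejnowski infomax rule (natural-gradient form):
    W <- W + lr * (I + (1 - 2*g(Y)) Y^T / T) W, with g the logistic sigmoid.
    X: (n, T) array holding n mixed signals over T samples."""
    n, T = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        Y = W @ X
        g = 1.0 / (1.0 + np.exp(-np.clip(Y, -30, 30)))  # logistic nonlinearity
        W += lr * (np.eye(n) + (1 - 2 * g) @ Y.T / T) @ W
    return W

# Usage: with super-Gaussian (e.g. Laplacian) sources, W @ A should approach
# a scaled permutation matrix, as in the experiment reported in (17).
```

The logistic nonlinearity suits super-Gaussian sources; for sub-Gaussian sources the extended infomax variant would be needed.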
Considering its slow convergence and low accuracy, Hyvärinen proposed FastICA, based on maximizing negentropy, which serves as a measure of the independence of the signals.
3.2. FastICA Method
According to the infomax method and the IBN, BSS is equivalent to the optimal problem

(12) max_Y H(Y).

This can be adapted to

(13) min_Y I(Y) = ∑_{i=1}^n H(y_i) - H(Y),

where Y = (y_1, y_2, …, y_n) and the y_i (i = 1, 2, …, n) are the ICs. How can we identify and measure the independence of the recovered data? The equivalence between non-Gaussianity and negentropy is illustrated via the Central Limit Theorem [9].
Theorem 1 (Central-Limit Theorem [15]).
Given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed.
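The theorem can be observed numerically: averaging many independent uniform variables, each strongly non-Gaussian (excess kurtosis −1.2), yields a variable whose excess kurtosis is close to the Gaussian value of zero. The sample sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    """E[x^4] - 3 after standardization; zero for a Gaussian,
    so nonzero values signal non-Gaussianity."""
    x = (x - x.mean()) / x.std()
    return np.mean(x ** 4) - 3.0

# A single uniform variable is strongly non-Gaussian (excess kurtosis -1.2),
# but the arithmetic mean of 30 independent uniforms is nearly Gaussian.
u = rng.uniform(-1, 1, size=100_000)
s = rng.uniform(-1, 1, size=(30, 100_000)).mean(axis=0)
print(excess_kurtosis(u))  # close to -1.2
print(excess_kurtosis(s))  # close to 0
```

This is why maximizing non-Gaussianity (negentropy) of WX pushes the recovered components away from mixtures and toward the original sources.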
According to the Central Limit Theorem, the non-Gaussianity of the recovered data Y reaches its maximum if and only if Y is a permutation of the original independent sources. Consider

(14) I(Y) = ∑_{i=1}^n H(y_i) - H(X) - log|det(W)| = (-H(X) - log|det(W)|) + ∑_{i=1}^n (H(y_i,gauss) - J(y_i)),

via the normalization of the mixed data:

(15) E[yy^T] = W E[xx^T] W^T = I,

where I is the n×n identity matrix. So (10) can be rewritten as

(16) max_Y J(Y) s.t. ‖W‖² = 1,

which is equivalent to (7) and (10). Thus we obtain the equivalence of the two classical algorithms from the viewpoint of the IBN.
The approximation of negentropy and fixed-point algorithm are applied to derive the learning rule [9].
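A minimal sketch of the resulting procedure, assuming the common tanh choice for the nonquadratic function G1 and symmetric decorrelation (a generic FastICA implementation, not the paper's code):

```python
import numpy as np

def whiten(X):
    """Center and whiten the data so that E[zz^T] = I, per (15)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    d, E = np.linalg.eigh(cov)
    V = E @ np.diag(d ** -0.5) @ E.T          # whitening matrix
    return V @ Xc, V

def fastica(X, n_iter=100, seed=0):
    """Symmetric FastICA fixed-point iteration with g = tanh:
    w <- E[z g(w^T z)] - E[g'(w^T z)] w, then decorrelate all rows."""
    Z, V = whiten(X)
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    for _ in range(n_iter):
        Y = W @ Z
        g, dg = np.tanh(Y), 1 - np.tanh(Y) ** 2
        W = g @ Z.T / T - np.diag(dg.mean(axis=1)) @ W  # fixed-point step
        U, _, Vt = np.linalg.svd(W)                     # symmetric decorrelation:
        W = U @ Vt                                      # W <- (W W^T)^(-1/2) W
    return W @ V   # unmixing matrix acting on the centered raw data

# Usage: mix two non-Gaussian sources; fastica(A @ S) @ A should be close
# to a scaled permutation matrix, matching the structure of (17)-(19).
```

The whitening step enforces constraint (15), after which the fixed-point iteration only needs to search over rotations.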
4. Experiments
Based on the infomax learning rule, the experiments presented here use the synthetic data plotted in Figure 1(a) as the original data. The result obtained by infomax is shown in Figure 1(b), corresponding to the recovery matrix in (17). Since wa = W·A is the product of the recovery matrix and the mixing matrix, it should be a permutation of an approximately diagonal matrix; indeed, exactly one substantial entry (boxed) appears in each row and column:

(17) wa = [ -0.191  -0.015  -0.800
             0.680  -0.020  -0.223
             0.023   0.500  -0.063 ].
Figure 1: ICA. The synthetic independent data are plotted in (a), and the recovered data are shown in (b), corresponding to the matrix wa in (17). In each column of wa, the substantial entry wa_ij essentially describes the transformation between the original data (i) and the recovered data (j): the original signal is multiplied by wa_ij, accompanied by noise from the minor nonzero entries wa_it, t ≠ j. For example, in the first ICA experiment, wa_21 = -0.6804 shows that original signal (1) is recovered as signal (2) with multiplier wa_21, plus some noise from the small entries wa_22 and wa_23.
(a) Synthetic independent sources. (b) Recovery data.
To illustrate the efficiency of the FastICA algorithm, and its limitation to at most one Gaussian variable, we list numerical results on blindly mixed signals in Figures 2, 3, and 4, using the nonquadratic function G1 to approximate the negentropy. Consider

(18) wa = [ -0.707  -0.008   0.001
             0.009  -0.500   0.000
             0.001  -0.018   0.833 ].
Based on the FastICA algorithm, we successfully separate the randomly mixed sinusoid, rectangular, and sawtooth signals. The product matrix wa of the separation matrix W and the random mixing matrix A is given in (18).
Figure 2: (a) Independent sources. (b) Recovery data. (c) Mixing data.
In this numerical experiment, we add a Gaussian signal to the above experiment and succeed in blindly separating the mixed data. The product matrix wa is given in (19).
Figure 3: (a) Synthetic independent sources. (b) Recovery data. (c) Mixing data.
The algorithm is not very efficient in separating the sinusoid from the two Gaussian signals in the mixing data. The more Gaussian variables there are, the more difficult it is to recover the original data.
Figure 4: (a) Independent sources. (b) Recovery data. (c) Mixing data.
Figure 2 clearly demonstrates the efficiency of the algorithm in separating the randomly mixed sinusoid, rectangular, and sawtooth signals, and (18) reveals that the matrix W is an elementary transformation of the approximate inverse of the mixing matrix A:

(19) wa = [ -0.00   0.03  -0.83  -0.14
             0.71   0.01   0.00   0.20
            -0.06   0.01  -0.05   1.50
            -0.01   0.50   0.01  -0.08 ].
It is then necessary and meaningful to add a Gaussian variable to the original data to test the efficiency of the algorithm; the result is shown in Figure 3 and the product matrix wa in (19). Finally, with two Gaussian signals in the mixture, the algorithm cannot separate the two Gaussian signals apart, as shown in Figure 4.
Furthermore, the average iterative steps on the first three numerical experiments based on FastICA algorithm are shown in Table 1.
Table 1: The comparison of the different data sets on the iterative steps.

Original data set    Iterative steps
Figure 2             Fewer than 15 steps
Figure 3             Fewer than 20 steps
Figure 4             Not stationary
After the experiments on synthetic data, the algorithm also proves efficient on real sonic data in Figure 5 and image data in Figure 6. In the process of separating the image data, the picture in Figure 7 is obtained in the opposite color, because the matrix W is not the exact inverse of A:

(20) wa = [ -0.376   9.318
            14.193   0.175 ],

(21) wa = [ 0.069  0.798
            0.898  0.015 ].
Figure 5: Real sonic data (FastICA). Using the real sonic data from the website, we obtain the recovery data and the product matrix in (20).
(a) Mixing data. (b) Recovery data. (c) Independent sources.
Figure 6: Using the picture from the website and Gaussian noise, mixed with the matrix A = [2, 3; 2, 1]: the Gaussian noise and the original picture are shown in the first row, the two mixed pictures in the second row, and the recovered pictures in the third row.
(a) Independent images. (b) Mixing images. (c) Recovery images.
Figure 7: The recovery matrix W is not the exact inverse of the mixing matrix A, and the recovered data y is ordered differently from s, but it is estimated very accurately, up to multiplicative signs (FastICA).
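The sign ambiguity behind the "opposite color" effect can be reproduced with a toy stand-in for the website image: if the recovery matrix equals −A⁻¹ rather than A⁻¹, the recovered picture is the original with inverted intensities. The synthetic gradient image and its size below are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for the website image: a 64x64 gradient "picture"
# mixed with Gaussian noise using the paper's mixing matrix A = [2 3; 2 1].
rng = np.random.default_rng(0)
img = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
noise = rng.standard_normal((64, 64))

A = np.array([[2.0, 3.0], [2.0, 1.0]])
S = np.vstack([img.ravel(), noise.ravel()])   # each source image as one row
X = A @ S                                     # the two mixed "pictures"

# If the recovery matrix is A^{-1} up to a negative sign (the scale/sign
# ambiguity of ICA), the recovered image comes out in "opposite color".
W = -np.linalg.inv(A)                         # deliberate sign flip
Y = W @ X
recovered = Y[0].reshape(64, 64)
print(np.allclose(recovered, -img))           # True: sign-flipped intensities
```

This mirrors the situation of Figure 7: the separation itself is accurate, and only the unknown sign of each recovered component inverts the displayed image.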
5. Conclusion
Independent component analysis grew out of BSS and is a very successful application of information theory to speech recognition and image separation without knowledge of the linear transformation. But there are also some disadvantages: for example, the strong preconditions that the original data must be independent and the transformation must be linear.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This investigation was supported by National Basic Research Program of China (973 Program) under Grant no. 2013CB329404, the Major Research Project of the National Natural Science Foundation of China under Grant no. 912300101, the National Natural Science Foundation of China under Grant no. 61075006, the Key Project of the National Natural Science Foundation of China under Grant no. 111311006, the Scientific Research Program Funded by Shaanxi Provincial Education Department (Program no. 2013JK1139), the China Postdoctoral Science Foundation (no. 2013M542370), and the Specialized Research Fund for the Doctoral Program of Higher Education of the People’s Republic of China (Grant no. 20136118120010).
References

[1] E. Gokcay and J. C. Principe, "Information theoretic clustering," 2002, vol. 24, no. 2, pp. 158–171.
[2] A. Kraskov, H. Stögbauer, and R. G. Andrzejak, "Hierarchical clustering using mutual information," 2005, vol. 70, no. 2, pp. 278–284.
[3] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," 1995, vol. 7, no. 6, pp. 1129–1159.
[4] A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural Image Statistics, Springer, 2009.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 2012.
[6] H. Zhang, X. Liu, and J. Wang, "Robust H∞ sliding mode control with pole placement for a fluid power electrohydraulic actuator (EHA) system," 2014, vol. 73, no. 5–8, pp. 1095–1104.
[7] H. Zhang, Y. Shi, and J. Wang, "On energy-to-peak filtering for nonuniformly sampled nonlinear systems: a Markovian jump system approach," 2014, vol. 22, no. 1, pp. 212–222.
[8] H. Zhang and J. Wang, "Combined feedback-feedforward tracking control for networked control systems with probabilistic delays," 2014, vol. 351, no. 6, pp. 3477–3489.
[9] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," 2000, vol. 13, no. 4-5, pp. 411–430.
[10] H. Zhang, X. Zhang, and J. Wang, "Robust gain-scheduling energy-to-peak control of vehicle lateral dynamics stabilisation," 2014, vol. 52, no. 3, pp. 309–340.
[11] W. Wei and Y. Qi, "Information potential fields navigation in wireless Ad-Hoc sensor networks," 2011, vol. 11, no. 5, pp. 4794–4807.
[12] W. Wei, P. Shen, Y. Zhang, and L. Zhang, "Information fields navigation with piece-wise polynomial approximation for high-performance OFDM in WSNs," 2013, Article ID 901509.
[13] Z. Shuai, H. Zhang, and J. Wang, "Lateral motion control for four-wheel-independent-drive electric vehicles using optimal torque allocation and dynamic message priority scheduling," 2014, vol. 24, pp. 55–66.
[14] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, September 1999, pp. 368–377.
[15] J. Rice, Mathematical Statistics and Data Analysis, Cengage Learning, 2006.