Convergence Analysis of Contrastive Divergence Algorithm Based on Gradient Method with Errors

Contrastive Divergence has become a common way to train Restricted Boltzmann Machines; however, its convergence is still not well understood. This paper studies the convergence of the Contrastive Divergence algorithm. We relate the Contrastive Divergence algorithm to the gradient method with errors and derive convergence conditions for the Contrastive Divergence algorithm using the convergence theorem of the gradient method with errors. We give specific convergence conditions of Contrastive Divergence learning for Restricted Boltzmann Machines in which both the visible units and the hidden units can take only a finite number of values. Two new convergence conditions are obtained by specifying the learning rate. Finally, we give specific conditions that the number of Gibbs sampling steps must satisfy in order to guarantee convergence of the Contrastive Divergence algorithm.


Introduction
Deep belief networks have recently been successfully applied to many problems [1][2][3][4][5]. Restricted Boltzmann Machines (RBMs), one of the important building blocks of deep belief networks, have also been widely applied in many fields [2,[6][7][8][9][10][11]. The learning of RBMs and deep belief networks has been an important and active topic in machine learning research. The learning process is a parameter estimation problem. Because general parameter estimation methods are challenging in this setting, Hinton proposed the Contrastive Divergence (CD) learning algorithm [12]. Although it has been widely used for training deep belief networks, its convergence is still not clear. Recently, more and more researchers have studied the theoretical properties of CD. Bengio and Delalleau [13] showed that running a short Gibbs chain of length $k$ yields a biased estimator of the log-likelihood gradient. Akoho and Takabatake [14] gave an information-geometrical interpretation of the CD learning algorithm. Sutskever and Tieleman [15] gave proofs showing that CD is not the gradient of any function and that it is possible to construct regularization functions that cause it to fail to converge. Yuille [16] related CD to the stochastic approximation literature and derived elementary conditions which ensure convergence (with probability 1). However, those convergence conditions are relatively strict; in particular, they involve the model parameter which minimizes the Kullback-Leibler divergence $KL(p_{0}(x)\,\|\,p(x\mid\theta))$ between the empirical distribution $p_{0}(x)$ of the observed data and the model $p(x\mid\theta)$.
In this paper, we study the convergence of the CD learning algorithm. By exploring the relation between the CD algorithm and the gradient method with errors, we obtain convergence conditions for CD using the convergence theorem of the gradient method with errors. Our convergence conditions are more practical than those given by Yuille [16]. We also analyze the convergence of the CD algorithm for RBMs, in particular the convergence conditions of the CD algorithm for RBMs in which both the visible units and the hidden units take only a finite number of values. We give two new convergence conditions by specifying the learning rate. Finally, we give a theoretical analysis of the convergence conditions of the CD algorithm for RBMs and of the relationship that the learning rate and the number of Gibbs sampling steps must satisfy in order to guarantee convergence of the CD algorithm.
The rest of the paper is organized as follows. In Section 2, we give a brief overview of the CD algorithm. In Section 3, we first present the gradient method with errors and its convergence theorem and then relate the CD algorithm to the gradient method with errors; convergence conditions of the CD algorithm are derived. In Section 4, we analyze the convergence conditions of the CD algorithm for RBMs. We draw some conclusions in Section 5.

Contrastive Divergence Learning Algorithm
Given a probability distribution over a vector $v$,
$$p(v,h;\theta)=\frac{e^{-E(v,h;\theta)}}{Z(\theta)}, \qquad (1)$$
where $Z(\theta)=\sum_{v,h}e^{-E(v,h;\theta)}$ is a normalization constant or partition function, $h$ is a hidden variable, and $E(v,h;\theta)$ is an energy function. This class of random-field distributions has been used in many fields. The marginal likelihood is
$$p(v;\theta)=\frac{1}{Z(\theta)}\sum_{h}e^{-E(v,h;\theta)}. \qquad (2)$$
The gradient of the marginal log-likelihood with respect to the model parameter $\theta$ is
$$\frac{\partial\log p(v;\theta)}{\partial\theta}=-\mathbb{E}_{p(h\mid v;\theta)}\!\left[\frac{\partial E(v,h;\theta)}{\partial\theta}\right]+\mathbb{E}_{p(v,h;\theta)}\!\left[\frac{\partial E(v,h;\theta)}{\partial\theta}\right]. \qquad (3)$$
The log-likelihood gradient algorithm can be expressed as
$$\theta_{t+1}=\theta_{t}+\eta_{t}\left(-\mathbb{E}_{p(h\mid v;\theta_{t})}\!\left[\frac{\partial E(v,h;\theta_{t})}{\partial\theta}\right]+\mathbb{E}_{p(v,h;\theta_{t})}\!\left[\frac{\partial E(v,h;\theta_{t})}{\partial\theta}\right]\right), \qquad (4)$$
where $\eta_{t}$ denotes the learning rate at the $t$th update. The first term in the bracket on the right-hand side of (4) can be computed exactly; however, the second term (also called the expectation under the model distribution) is intractable because the calculation of $Z(\theta)$ is extremely difficult. In order to apply the log-likelihood gradient algorithm, we have to do alternating blocked Gibbs sampling from the conditionals $p(v\mid h;\theta)$ and $p(h\mid v;\theta)$. This requires an infinite number of Gibbs transitions per update to fully characterize the expectation. Hinton [12] proposed a modification of the log-likelihood gradient algorithm known as Contrastive Divergence.
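To make the intractability of the second term concrete, the following minimal sketch computes the exact log-likelihood gradient (3) for a tiny binary RBM by enumerating every joint state. The bias-free energy $E(v,h)=-v^{\top}Wh$, the sizes, and all variable names are illustrative assumptions rather than the paper's notation; the brute-force sum over all $(v,h)$ states is exactly what becomes infeasible at realistic sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# Tiny binary RBM with (assumed) energy E(v, h) = -v @ W @ h, small enough that
# the partition function can be computed by brute force.
n_vis, n_hid = 4, 3
W = 0.1 * rng.standard_normal((n_vis, n_hid))
v0 = np.array([1.0, 0.0, 1.0, 1.0])          # one observed data vector

hs = np.array(list(itertools.product([0, 1], repeat=n_hid)), dtype=float)
vs = np.array(list(itertools.product([0, 1], repeat=n_vis)), dtype=float)

# Data-dependent term: E_{p(h | v0)}[v0 h^T], computed exactly.
w_pos = np.exp(hs @ W.T @ v0)
w_pos /= w_pos.sum()
positive = np.outer(v0, w_pos @ hs)

# Model term: E_{p(v, h)}[v h^T]; requires summing over every joint state.
scores = np.exp(vs @ W @ hs.T)                # unnormalized p(v, h) on the full grid
Z = scores.sum()                              # partition function Z(theta)
negative = (vs.T @ scores @ hs) / Z

# With E = -v @ W @ h, the exact gradient of log p(v0) w.r.t. W is:
grad_W = positive - negative
print(grad_W)
```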
The idea of $k$-step Contrastive Divergence learning (CD-$k$) is simple: instead of approximating the second term in the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for only $k$ steps. The Gibbs chain is initialized with a training example $v^{(0)}$ from the training set and yields the sample $v^{(k)}$ after $k$ steps. Each step $t$ consists of sampling $h^{(t)}$ from $p(h\mid v^{(t)};\theta)$ and subsequently sampling $v^{(t+1)}$ from $p(v\mid h^{(t)};\theta)$. The gradient (3) of the log-likelihood for one training example $v^{(0)}$ is approximated by
$$\mathrm{CD}_{k}(\theta,v^{(0)})=-\mathbb{E}_{p(h\mid v^{(0)};\theta)}\!\left[\frac{\partial E(v^{(0)},h;\theta)}{\partial\theta}\right]+\mathbb{E}_{p(h\mid v^{(k)};\theta)}\!\left[\frac{\partial E(v^{(k)},h;\theta)}{\partial\theta}\right]. \qquad (5)$$
The expectation in the CD algorithm can thus be expressed in terms of the empirical distribution function $p_{k}(\tilde{x},\tilde{h};\theta)$ of the samples obtained by taking the data $v^{(0)}$ and running the Markov chain forward for $k$ steps, $p_{k}(\tilde{x},\tilde{h};\theta)=P(v^{(k)}=\tilde{x},\,h^{(k)}=\tilde{h})$.
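As an illustration of the CD-$k$ procedure just described, the sketch below draws the $k$-step Gibbs sample and forms the corresponding gradient estimate for a binary RBM. The parameterization $E(v,h;\theta)=-v^{\top}Wh-b^{\top}v-c^{\top}h$ and all names and sizes are assumptions made for the example, not notation taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradient(W, b, c, v0, k=1):
    """CD-k gradient estimate for a binary RBM with (assumed) energy
    E(v, h) = -v @ W @ h - b @ v - c @ h."""
    ph0 = sigmoid(c + v0 @ W)            # p(h = 1 | v^(0)): positive phase
    v = v0.copy()
    for _ in range(k):                   # k steps of blocked Gibbs sampling
        h = (rng.random(c.shape) < sigmoid(c + v @ W)).astype(float)
        v = (rng.random(b.shape) < sigmoid(b + W @ h)).astype(float)
    phk = sigmoid(c + v @ W)             # p(h = 1 | v^(k)): negative phase
    dW = np.outer(v0, ph0) - np.outer(v, phk)
    return dW, v0 - v, ph0 - phk         # estimates w.r.t. W, b, c

# Example: one CD-1 update direction for a random 6x4 RBM and one data vector.
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
dW, db, dc = cd_k_gradient(W, b, c, rng.integers(0, 2, 6).astype(float), k=1)
```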
An asymptotically unbiased estimator of the parameters can be obtained by using the log-likelihood gradient algorithm; the asymptotic properties of the parameter estimator in CD-$k$ learning are discussed in the next section.

Convergence of Contrastive Divergence Algorithm
In this section, we study the convergence of the CD learning algorithm. The CD algorithm has a form similar to that of the gradient method with errors. We therefore relate the CD algorithm to the gradient method with errors and derive convergence conditions for CD. To this end, we first present the gradient method with errors and give its convergence theorem.

Gradient Methods with Errors and Convergence Theorem.
Consider the optimization problem $\min_{\theta\in\mathbb{R}^{n}}f(\theta)$, where $\mathbb{R}^{n}$ denotes the $n$-dimensional Euclidean space and $f(\theta):\mathbb{R}^{n}\to\mathbb{R}$ is a continuously differentiable function such that, for a positive constant $L$, we have
$$\|\nabla f(\theta)-\nabla f(\bar{\theta})\|\leq L\|\theta-\bar{\theta}\|,$$
where $\|\theta\|=(\sum_{i=1}^{n}\theta_{i}^{2})^{1/2}$ and $\|\cdot\|$ stands for the Euclidean norm in $\mathbb{R}^{n}$.
The gradient method with errors has the following form:
$$\theta_{t+1}=\theta_{t}+\eta_{t}\left(s_{t}+V_{t}\right),$$
where $\eta_{t}$ is a positive step-size sequence, $s_{t}$ is a descent direction, and $V_{t}$ is an error. The error $V_{t}$ can be deterministic or stochastic; in both cases, the gradient method has been studied in the literature [17][18][19][20]. In this paper we consider a stochastic $V_{t}$, since this is the case arising in the CD algorithm. The gradient method with stochastic errors can be viewed as a stochastic approximation algorithm or stochastic approximation procedure [21,22]. Younes [22] analyzed the convergence of the stochastic approximation procedure (SAP) and gave almost sure convergence conditions for SAP using the ODE (ordinary differential equations) approach. He generated a persistent Markov chain and studied a recursive algorithm in which several iterations of the simulation procedure are performed before updating the current parameter, with the update based on the average of the obtained values. Bertsekas and Tsitsiklis [18] studied the convergence of the gradient method in which the expectation of the stochastic error $V_{t}$ is zero with probability 1. We present a gradient method with a different kind of stochastic error; convergence of this gradient method is guaranteed by the following theorem. We will need a known lemma, which has been proved by Grimmett and Stirzaker [23].
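A minimal sketch of this iteration, $\theta_{t+1}=\theta_{t}+\eta_{t}(s_{t}+V_{t})$, is given below. The quadratic objective, the zero-mean Gaussian error, and the $1/t$ step size are illustrative assumptions chosen only to show the form of the method.

```python
import numpy as np

rng = np.random.default_rng(1)

def gradient_method_with_errors(grad_f, theta0, steps=5000, noise_scale=0.5):
    """Iterate theta_{t+1} = theta_t + eta_t * (s_t + V_t), where
    s_t = -grad f(theta_t) is a descent direction and V_t is a stochastic error."""
    theta = np.array(theta0, dtype=float)
    for t in range(1, steps + 1):
        eta = 1.0 / t                                        # sum eta_t = inf, sum eta_t^2 < inf
        s = -grad_f(theta)                                   # descent direction s_t
        V = noise_scale * rng.standard_normal(theta.shape)   # stochastic error V_t
        theta = theta + eta * (s + V)
    return theta

# Example: minimize f(theta) = ||theta - 3||^2 despite noisy gradient information.
theta_hat = gradient_method_with_errors(lambda th: 2.0 * (th - 3.0), np.zeros(2))
print(theta_hat)   # close to [3, 3]
```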
Theorem 2. Let $\theta_{t}$ be a sequence generated by the method
$$\theta_{t+1}=\theta_{t}+\eta_{t}\left(s_{t}+V_{t}\right),$$
where $\eta_{t}$ is a positive step size, $s_{t}$ is a descent direction, and $V_{t}$ is a stochastic error. Let $\mathcal{F}_{t}$ be an increasing sequence of $\sigma$-fields ($\mathcal{F}_{t}$ should be interpreted as the history of the algorithm up to time $t$). We assume the following: (1) $f(\theta)\geq 0$ for $\theta\in\mathbb{R}^{n}$, and $f(\theta_{t})$ and $\theta_{t}$ are $\mathcal{F}_{t}$-measurable.
Using the assumption that $f(\theta_{t})$ is $\mathcal{F}_{t}$-measurable, and applying Lemma 1 together with the assumption $\sum_{t=0}^{\infty}\eta_{t}^{2}<\infty$, we see that $f(\theta_{t})$ converges.
The theorem is proved.
Strictly speaking, the conclusion of the theorem holds only with probability 1. For simplicity, an explicit statement of this qualification will often be omitted. We will use the theorem to derive convergence conditions for CD based on the similarity between the CD algorithm and the gradient method with errors.

Convergence of CD.
In order to derive convergence conditions for the CD learning algorithm using the convergence theorem of the gradient method with errors, we have to explore the relation between the CD algorithm and the gradient method with errors. We can recast the CD algorithm in the form of a gradient optimization problem.
The theorem for the gradient method with errors involves four basic concepts. The first is an optimization function $f(\theta)$, which must be continuously differentiable and satisfy, for some constant $L$, $\|\nabla f(\theta)-\nabla f(\bar{\theta})\|\leq L\|\theta-\bar{\theta}\|$. The second is the descent direction $s_{t}$. The third is the error vector $V_{t}$. The last is the step size $\eta_{t}$; $\eta_{t}$ can be regarded as the learning rate in the CD learning algorithm.
The gradient method with errors will converge provided the conditions of Theorem 2 are satisfied.
We can derive a convergence theorem for the CD learning algorithm by selecting appropriate $s_{t}$ and $V_{t}$. Next, we give the convergence theorem of the CD learning algorithm using the convergence theorem of the gradient method with errors.
Proof. The CD algorithm can be described in the form of a gradient optimization problem. The CD algorithm is given by (21). In (21), let the descent direction and the error term be chosen as indicated; the CD algorithm can then be written in the form of the gradient optimization problem (23). In (23), identifying the corresponding quantities with $s_{t}$ and $V_{t}$ yields the bound (31). Taking the constant to be four times the square of this bound, the assumptions of Theorem 2 are satisfied. By Theorem 2, $f(\theta_{t})$ converges, and hence the CD learning algorithm converges. The theorem is proved.
We derived convergence conditions for the CD algorithm in the above theorem. These convergence conditions mainly involve three aspects. The first is the function $\log p(v;\theta)$ of the parameter $\theta$. The second is the learning rate of the CD learning algorithm. The third consists of two terms: the first term concerns the error between the empirical distribution function $p_{k}(\tilde{x},\tilde{h};\theta_{t})$ and the distribution function $p(\tilde{x},\tilde{h};\theta_{t})$, which can be controlled by the number of Gibbs sampling steps; the second term is a quantity related to the energy function $E(\tilde{x},\tilde{h};\theta_{t})$.
The convergence conditions derived here differ from those obtained by Yuille [16]: the conditions obtained by Yuille depend on the model parameter, whereas ours do not. Because the task of learning is to estimate the model parameter, which is generally unknown, the convergence conditions in this paper have more practical significance than those obtained by Yuille.

The Learning Rate and Convergence Conditions.
Among the convergence conditions of the CD algorithm, the condition that $\eta_{t}$ must satisfy is a necessary condition, namely $\sum_{t=1}^{\infty}\eta_{t}=\infty$ and $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$. Based on the fact that $\sum_{t=1}^{\infty}(1/t)=\infty$ and $\sum_{t=1}^{\infty}(1/t^{2})<\infty$, we take $\eta_{t}=1/t$ and $\eta_{0}=0$; we then have the following new convergence conditions derived from Theorem 3.
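As a quick numerical illustration of why $\eta_{t}=1/t$ is a natural choice here (partial sums of $1/t$ diverge while partial sums of $1/t^{2}$ stay bounded), the snippet below evaluates both partial sums; the truncation point $10^{6}$ is arbitrary.

```python
import numpy as np

# Illustration of the schedule eta_t = 1/t used above: partial sums of eta_t
# grow without bound, while partial sums of eta_t^2 remain bounded.
T = 10**6
t = np.arange(1, T + 1)
eta = 1.0 / t
print(eta.sum())          # ~14.39 for T = 10^6; grows like log(T), i.e. diverges
print((eta ** 2).sum())   # ~1.6449; converges to pi^2 / 6
```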

Consistency of CD.
It is clear that CD is equivalent to the Monte Carlo version of log-likelihood gradient descent as the number of MCMC steps $k$ goes to infinity, because the empirical distribution $p_{k}(\tilde{x},\tilde{h};\theta)$ converges to the distribution $p(\tilde{x},\tilde{h};\theta)$. It is known that CD gives a good solution even when $k$ is relatively small. Akoho and Takabatake [14] give an intuitive interpretation, by means of information geometry, of why CD approximates well.
In the above sections, we studied the convergence of the CD algorithm; now, we consider its consistency. If $\theta^{*}$ is a limit point of $\theta_{t}$, then $f(\theta_{t})$ converges to the finite value $f(\theta^{*})$ by Theorem 2, and $\theta^{*}$ is a stationary point of $f$ ($\theta^{*}=\arg\min f(\theta)$); furthermore, every limit point of $\theta_{t}$ is a stationary point of $f$. CD is an approximation of the log-likelihood gradient, and the convergence conditions of Theorem 3 ensure that the approximation error is small enough for CD to converge.
If the convergence conditions of Theorem 3 are satisfied, CD will converge. We know that the conclusions of Theorems 2 and 3 hold with probability 1. We can therefore conclude the following: if the CD algorithm converges with probability 1, the convergence point coincides with a stationary point of the objective function $\log p(v;\theta)$, which is in general a local optimum.

Convergence of CD Algorithm for RBMs
In this section, we consider the convergence of the CD algorithm for RBMs. In the following, we consider the case where both the visible units $v$ and the hidden units $h$ take only a finite number of values.
In Section 3, we considered the convergence of the CD algorithm and derived a convergence theorem for the CD learning algorithm based on the convergence theorem of the gradient method with errors. Now, we give the convergence theorem of the CD learning algorithm for RBMs.
Since $v$ and $h$ take only a finite number of values, $\|(v,h)\|$ is bounded above; we denote this upper bound by a constant.
By Theorem 3, the CD learning algorithm converges.
The theorem is proved. We have thus obtained convergence conditions for the CD learning algorithm for RBMs. Next, we study the relationship between the learning rate and the convergence conditions. Based on the fact that $\sum_{t=1}^{\infty}(1/t)=\infty$ and $\sum_{t=1}^{\infty}(1/t^{2})<\infty$, we again take $\eta_{t}=1/t$ and $\eta_{0}=0$; we then have the following new convergence conditions derived from Theorem 5.
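A compact sketch of how the $\eta_{t}=1/t$ schedule would plug into CD-$k$ updates for a small binary RBM is shown below. The toy data, network sizes, and parameter names are illustrative assumptions rather than a prescription from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy binary data and a small RBM (shapes are illustrative).
data = rng.integers(0, 2, size=(100, 6)).astype(float)
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)

k = 1                                      # CD-k with a short Gibbs chain
for t in range(1, 2001):
    eta = 1.0 / t                          # learning rate eta_t = 1/t from the corollary
    v0 = data[rng.integers(len(data))]
    ph0 = sigmoid(c + v0 @ W)              # positive phase
    v = v0
    for _ in range(k):                     # k Gibbs steps from the training example
        h = (rng.random(n_hid) < sigmoid(c + v @ W)).astype(float)
        v = (rng.random(n_vis) < sigmoid(b + W @ h)).astype(float)
    phk = sigmoid(c + v @ W)               # negative phase after k steps
    W += eta * (np.outer(v0, ph0) - np.outer(v, phk))
    b += eta * (v0 - v)
    c += eta * (ph0 - phk)
```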
The result of Corollary 6 shows that, once the learning rate is fixed in this way, the convergence of the CD algorithm depends on the error between the empirical distribution function $p_{k}(\tilde{x},\tilde{h})$ and the distribution function $p(\tilde{x},\tilde{h})$; this error can be controlled by the number of Gibbs sampling steps.
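The claim that the error between $p_{k}$ and $p$ is controlled by the number of Gibbs sampling steps can be checked directly on an RBM small enough to enumerate: the sketch below propagates a starting distribution through exact blocked Gibbs sweeps and reports its total variation distance to the model distribution. All sizes and parameter values are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Tiny binary RBM so that the joint distribution can be enumerated exactly.
n_vis, n_hid = 3, 2
W = 0.5 * rng.standard_normal((n_vis, n_hid))
b = 0.1 * rng.standard_normal(n_vis)
c = 0.1 * rng.standard_normal(n_hid)

def energy(v, h):
    return -(v @ W @ h + b @ v + c @ h)

vs = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_vis)]
hs = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_hid)]

# Exact model distribution p(v, h) over all joint states.
p = np.array([[np.exp(-energy(v, h)) for h in hs] for v in vs])
p /= p.sum()

def gibbs_step(q):
    """One blocked Gibbs sweep applied to a joint distribution q(v, h):
    resample h from p(h | v), then v from p(v | h), both exactly."""
    pv = q.sum(axis=1)                          # current marginal over v
    p_h_given_v = p / p.sum(axis=1, keepdims=True)
    q = pv[:, None] * p_h_given_v               # after resampling h | v
    ph = q.sum(axis=0)                          # marginal over h
    p_v_given_h = p / p.sum(axis=0, keepdims=True)
    return p_v_given_h * ph[None, :]            # after resampling v | h

# Start the chain from a uniform joint distribution (an arbitrary starting point).
q = np.full_like(p, 1.0 / p.size)
for k in range(1, 6):
    q = gibbs_step(q)
    tv = 0.5 * np.abs(q - p).sum()              # total variation distance to p(v, h)
    print(f"k = {k}: TV distance = {tv:.4f}")   # shrinks as k grows
```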

Theoretical Analysis of Convergence Conditions.
In Section 4.1, we gave the convergence conditions of the CD algorithm for RBMs. The most important term is the error between the empirical distribution function $p_{k}(\tilde{x},\tilde{h};\theta_{t})$ and the distribution function $p(\tilde{x},\tilde{h};\theta_{t})$. The empirical distribution function $p_{k}(\tilde{x},\tilde{h};\theta_{t})$ is the empirical distribution on the samples obtained by taking the data $v$ and running the Markov chain forward for $k$ steps, and the distribution function $p(\tilde{x},\tilde{h};\theta_{t})$ is the limit distribution of the empirical distribution. Fischer and Igel [24]