A Simplified Natural Gradient Learning Algorithm

Adaptive natural gradient learning avoids singularities in the parameter space of multilayer perceptrons. However, it requires a larger number of additional parameters than ordinary backpropagation in the form of the Fisher information matrix. This article describes a new approach to natural gradient learning that uses a smaller Fisher information matrix. It also uses a prior distribution on the neural network parameters and an annealed learning rate. While this new approach is computationally simpler, its performance is comparable to that of Adaptive natural gradient learning.


Introduction
Amari et al. developed the adaptive natural gradient learning (ANGL) algorithm for multilayer perceptrons [1][2][3].It is similar to Newton's method in which the gradient of the error function is multiplied by the inverse of the Hessian matrix of the error function.However, Amari's algorithm replaces the inverse of the Hessian matrix with the inverse of the Fisher information matrix.The simplified natural gradient learning (SNGL) algorithm introduced in this paper uses a new formulation of the Fisher information matrix.SNGL is based on the backpropagation algorithm [4].In addition, the SNGL algorithm also uses regularization [5] to penalize solutions with large connection weights.This regularization also prevents the Fisher information matrix from approaching a singular matrix.The learning rate is also annealed [6] to improve the general performance of SNGL.
The remainder of this section will review the natural gradient and discuss the details of Amari's ANGL algorithm in [2,7,8].The next section will introduce the SNGL algorithm and the theory behind it.The third section will describe two improvements used in the SNGL algorithm.These improvements are regularization and annealed learning rates.The following section will show the experimental results and compare the performance of these two algorithms with that of ordinary backpropagation.The final section will conclude the ideas discussed in this paper.

Mathematical Preliminaries and Notation.
The natural gradient learning algorithm is "natural" in the sense that it follows the manifold of the underlying parameter space of multilayer perceptrons.This parameter space has Riemannian geometry.Calculating the direction of steepest descent in a Riemannian manifold is more complicated than it is in a Euclidean space.
The parameter space of multilayer perceptrons is an affine space in which probability distributions are the points and random variables are vectors [9,10].Given a probability density function p of a random variable x, the translation operation is p + x = p exp(x).
1. 1.1. Exponential Families.An exponential family is a set of distributions with probability density functions of the form (1) Thus, the distributions in the exponential family are parameterized by {θ 1 , . . ., θ r }, and these parameters are the canonical coordinates of the distribution.The number A(θ 1 , . . ., θ r ) is chosen such that p(x | θ 1 , . . ., θ r )dx = 1.Given an open set U in the event space of the random variable X, Advances in Artificial Neural Systems the probability that the result of the random variable X lies within U is ( The sufficient statistics for the exponential family are the r random variables η 1 , . . ., η r in (1).These statistics can be used to estimate the coordinates θ 1 , . . ., θ r from the values of samples of x.
The log-likelihood function of a probability distribution is the logarithm of its probability density function log p.The score functions are the derivatives of the log-likelihood function with respect to the coordinates θ 1 , . . ., θ r .The score for parameter i is The Fisher information of two score functions is the expected value of their product.For the score functions s i and s j , the Fisher information is where E p is the expectation operator for the distribution with probability density function p.The Fisher information matrix is a real symmetric r × r matrix G p = (g i j p ) whose elements are the Fisher information for each pair of score functions.

Direction of Steepest Descent.
When finding the minimum of a function f in a Euclidean space, a descent direction [11] is a unit norm tangent vector v p such that The minimum value of ( 5) is −α ∇ f (p) 2 where α is an arbitrary scalar and ∇ f (p Thus, in a Euclidean space, the direction of steepest descent is the unit tangent vector Finding the direction of steepest descent turns out to be harder than it first appears to be.In a Euclidean geometry, it is relatively straightforward.In a Riemannian geometry, the coordinate system is curved, and a gradient descent algorithm should follow the curvature of the manifold.The direction of steepest descent should be compensated for this curvature.
In a Riemannian space, a metric g p exists at every point p.If the tangent vectors v 1 p , . . ., v n p form a basis for the tangent space T p at p, then g i j p = v i p , v j p is the Riemannian metric where •, • is an inner product defined on the tangent space T p .
Given a tangent vector w p = n i=1 α i v i p and the gradient ∇ f (p) = n i=1 β i v i p , their inner product will be If w p is the steepest descent direction, then, according to the analysis above, w p , ∇ f (p) = −α ∇ f (p) 2 .This occurs when w p = −G −1 p ∇ f (p) where G p = (g i j p ) is the matrix whose elements are equal to the inner products of the basis vectors.In other words, the coordinate system matters when doing gradient descent, and it may change at each point on the manifold.The natural gradient learning algorithm uses this "natural" descent direction at each step.

A Probability Model for Multilayer Perceptrons.
In order for this analysis to be applied to neural networks, there must be a probability density function associated with a neural network.This section describes an exponential family in which each member of the family has a probability density function of network output errors.This exponential family will then be used to determine the direction of steepest descent and the natural gradient to be used in a learning algorithm.
Let the vectors {x 1 , . . ., x n } and {y 1 , . . ., y n }, respectively, be the inputs and outputs of a training set.A convenient model for training a multilayer perceptron is where φ(x i ) represents the output of the multilayer perceptron for training input i, and each i ∼ N (0, Σ) represents the error on each training pair.A probability distribution can then be defined with the density function where m is the dimension of the vector space to which y 1 , . . ., y n belong.In this model we assume that each error vector is independent and identically distributed so that the joint density of all the training patterns is With the training set and the network architecture (e.g., number of hidden nodes, number of hidden layers, transfer functions) specified, the probability distributions in (8) can be points in a manifold [9].The network parameters are the coordinates of the points.This is called a neuromanifold [10].Random variables are the tangent vectors at the point p.If v is a random variable, then the point p + v is (by definition) the distribution with the density function p exp(v).The score functions in (3) of the network parameters form a basis for the tangent space at a point [12].If the inner product of this tangent space is defined to be v, w = E p {vw}, where E p is the expectation operator with density function p, then the Riemannian metric is the Fisher information matrix because the Fisher information between two score functions is g i j = E p {s i s j } [9,10,12].These definitions are illustrated in the following example.
1.1.4.Natural Gradient with a Single Neuron.This is an example of a neuromanifold.Consider a neural network that consists of a single neuron that is described by a weight vector w of dimension d and a bias b.This example will show how the Fisher information matrix is determined for a neural network.It also shows some of the pitfalls that are encountered when computing its inverse.
The log-likelihood of a distribution p in a neuromanifold is If the network is a single neuron where x is a d-dimensional input vector, and f is a sigmoid function such as tanh or the logistic function, then the score functions are The gradient of p is The Fisher information matrix is G p = E p {∇ p (∇ p ) T }.In this case, the straightforward calculation yields assuming that One property of sigmoid functions is that their derivatives are small outside of the region near the origin.Therefore, as the norm of w or the bias b increases, the derivative of the output tends toward zero.Since the derivative is squared in (14), the Fisher information matrix G p could become nearly singular.
The matrix G p will also be singular unless there are d linearly independent input vectors.For example, if the input vectors of the training set were three-dimensional vectors, then there must be three linearly independent vectors in order to successfully invert G p .Care must be taken when computing the inverse of G p .Calculating G p becomes much more complicated with the complex network architectures of multilayer perceptrons.

Adaptive Natural Gradient. Multilayer perceptrons can have many parameters. If the dimension of the tangent space
T p is m, then G p will be an m × m matrix.Inverting the matrix has computational complexity O(m 3 ).In the calculation of the natural gradient, the inverse of the Fisher information matrix needs to be known at each step of the algorithm.However, the Fisher information matrix itself does not need to be calculated directly for each step.
Amari et al. developed an algorithm called adaptive natural gradient learning (ANGL) [2].It uses the matrix inversion lemma [12] to perform a rank 1 update on the inverse of the Fisher information matrix at each step.The parameters are arranged and indexed as a column vector.The score function s k = ∇ p then has the same form and is used in the updated formula The variable k is an integer representing the current iteration number of the learning algorithm.At each value of k, new values of the network parameters are used to compute the training error vectors.These errors are used to calculate the changes to the network parameters.The inverse of the Fisher information matrix is first initialized as G −1 0 = δ −1 I.As mentioned in the previous paragraph, the ANGL algorithm stacks all the network parameters into one tall column vector.This is typically done when using a nonlinear least squares algorithm like Levenberg-Marquardt [11].For networks with multiple layers, this leads to tall vectors and a large Fisher information matrix.Amari avoids inverting the matrix by using the matrix inversion lemma.However, the matrix is still usually large enough to be an issue with memory.
The ANGL algorithm performs well, but it needs sufficient memory to store a large matrix and is sensitive to initial conditions and learning rates.

A Simplified Natural Gradient Learning Algorithm
This section describes the learning rule used in the SNGL algorithm.First, the directional derivative for a single-layer perceptron is determined.Then the Fisher information matrix for this network is described in detail.These two results are then generalized to find the directional derivative and the Fisher information matrix for a multilayer perceptron.Finally, the section will conclude with a discussion of the computational complexity of SNGL compared to that of ANGL.
The derivation and analysis of the simplified natural gradient learning algorithm begin with the observation that the neural network is parameterized by (A, b) in the affine transformation Ax + b.There is one affine transformation for each layer of the network.The formula in (11) was for a single neuron that has only one output.In general, the affine transformations in each layer of a multilayer perceptron can have more than one output.The objective of the algorithm is to minimize the sum of squares of the error vector norms.These norms are calculated with the standard L 2 norm of a Cartesian coordinate system.However, that does not mean that the matrix parameters need to be column-ordered into tall vectors.While the homomorphism as done in [2] preserves the vector space operations such as addition and multiplication by a scalar, it does not preserve multiplying the matrix by a vector.Therefore, it makes sense that the score function of a matrix parameter should itself be a matrix.The Frobenius norm of a m×n Matrix is tr( The inner product associated with this norm is A, B = tr(A T B).The following analysis shows how the derivative of the norm of an error vector can be equivalent to an inner product between matrices.
2.1.The Directional Derivative.Assume that the network is a single-layer perceptron with the vector-valued sigmoidal transfer function F(•).Thus, the log-likelihood function with n training examples that depends on the parameters A and b is Furthermore, assume that if the affine transformation ψ(x) then the minimum point, according to (16), is reached.Thus, C and d represent a descent direction.Let γ(t, x) = φ(x) + tψ(x) = (A + tC)x + b + td be a curve through the manifold for 0 ≤ t ≤ 1.The affine transformation γ(0, x) is the initial point, and γ(1, x) is the final point.The derivative of this curve with respect to t is γ(t, x) = Cx + d and is constant with respect to t.The loglikelihood along this curve is The directional derivative of the log-likelihood with respect to t is where F (•) is the Jacobian matrix of F(•).Here, we have isolated γ(t, x) by using the adjoint of F (•) in the second argument of the inner product.Given that γ(0, x) = Ax + b and γ(0, x) = Cx + d, the directional derivative of the loglikelihood function at t = 0 is Let Then, by using the linearity property of the inner product, (19) can be rewritten as The matrix C can be isolated because the matrix inner product A, B = tr(A T B) is being used on the right-hand side.Thus, z i , and the components of the score function are The components in (21) would point in the direction of steepest descent if the space of affine transformations was a Euclidean space.To find the direction of steepest descent, then (21) must be corrected for the curvature of the parameter space using the Riemannian metric G.

The Fisher Information Matrix.
The Fisher information matrix is the Riemannian metric in this manifold [1].For the affine transformation φ(x) = Ax + b, the Fisher information matrix is The score function with the components in (21) can be arranged in an augmented matrix form as Thus, in terms of these components, the Fisher information matrix is The partial derivative ∂ φ /∂A in ( 21) is a sum of n rank 1 matrices.Therefore, the outer product of ∂ φ /∂A with itself is the sum where However, since Σ is not known, it must be estimated.The unbiased estimator is The Fisher information of the weight matrix A is estimated as and the Fisher information of the bias vector b is estimated as The Fisher information matrix can be estimated as the sum of these two matrices.Combining like terms gives The direction of steepest descent in ( 19) is now represented as These equations are for a perceptron with only one layer.Please note that by using the matrix forms, the Fisher information matrix is only a d × d matrix.If we had stacked the weight matrix and bias vector into a single column vector, then the Fisher information matrix would be d(d +1)×d(d + 1) matrix.

Extension to Multilayer Perceptrons.
A multilayer perceptron has r layers each with its own nonlinear sigmoidal function F k (•) and affine transformation φ k (x) = A k x + b k .Each layer k feeds into the one above it with index k+1.Thus, the output of layer k is The input layer is defined to be the identity transformation φ 0 (x) = x and F 0 (x) = x.For convenience, let the vector z k i be the output of layer k when given input x i .The output z i can be written recursively, with z 0 i = x i .Two vector spaces V and W can be joined together by means of a direct sum V ⊕ W in which stacking a vector v ∈ V on top of a vector w ∈ W results in the vector The analog to the direct sum with matrices is to create a block matrix.Given two matrices A and B, their direct sum is The direct sum of the score functions of the layers in a multilayer perceptron is used as the analog for stacking the layers.
The Fisher information matrix is then a block-diagonal matrix where each square matrix along the main diagonal is the Fisher information matrix for a specific layer.The score functions for layer k are where B k i is the backpropagation matrix that is determined with the recurrence relation where r is the number of layers in the network.Thus, for the output layer, the layer index is r and the backpropagation matrix is the Jacobian matrix of the output layer's transfer function.
Let i = y i − z r i be the error for pattern i.The estimate of the Fisher information matrix for layer k is Thus, the updates for the parameters in layer k are if m, n, and p are positive integers.
It is also less computationally intensive to invert the block matrix used in the SNGL algorithm since the matrix inversion algorithms would have complexity O(p 3 + n 3 ).The complexity to update Amari's recursion in (15) is O((m + 1) 2 p 2 +2(m+1)p(p+1)n+(p + 1) 2 n 2 ).This takes into account that ANGL exploits the matrix inversion lemma.Table 1 shows the computational complexity in both space and time of the ANGL algorithm and the SNGL algorithm for three problems described by Park et al. [3].The last two columns labelled "Ratio" in Table 1 are the ratios of the space and time complexities of the SNGL algorithm compared to that of the ANGL written as percentages.

SNGL Improvements.
There are two more elements of the simplified natural gradient learning algorithm.The first is the regularization of the gradient descent algorithm by adding a prior distribution to the probability density function of the network errors [5].The second is annealing the learning rate of the algorithm [6].Neither has any significant impact on the computational complexity.The complexity of the regularization is O(mn + np) for a two-layer network.The learning rate is recalculated once per learning epoch.

Regularization of Network Parameters.
The goal of regularization is to penalize larger connection weights.This helps to discourage overtraining.The penalty of the weights is governed by the hyperparameter α.For a neuron with parameters w and b, the probability density function of these parameters in the prior distribution is where q is the dimension of the vector space to which w belongs.Thus, the weight vector is distributed normally as w ∼ N (0, (1/α)I) and the bias as b ∼ N (0, 1/α).The loglikelihood function is The score functions of w and b with respect to α are, respectively The probability density function of this prior distribution can be added to the probability density function of the network error distribution.Then the updates for the parameters in a multilayer perceptron are When the algorithm is overtraining, the parameters are still increasing.Subtracting α times the current parameter from the update effectively counteracts the update when the norm of that parameter is large enough.The algorithm will eventually reach a steady state where the parameters are longer increasing.
Since E pα {w} = 0, E pα {b} = 0, E pα {ww T } = (1/α)I, and E pα {b 2 } = 1/α, the Fisher information matrix for the regularized algorithm is Adding αI to the Fisher information matrix helps the algorithm avoid inverting an almost singular matrix.Thus, if all the errors are zero, then the Fisher information matrix will be αI.The effect will be multiplying all the weight update elements by 1/α.This is the largest value the inverse will reach during the SNGL algorithm's execution.
In [13] Park adds regularization to the ANGL algorithm with an Akaike Information Criterion.

Annealed Learning Rates.
Annealing the learning rate [6] is a technique in which a learning algorithm will slowly decay its learning rate after a specific epoch during the execution of the algorithm.Let the learning rate of the SNGL algorithm be η t , where t is the current learning epoch.It is defined as where η 0 is the initial learning rate and τ is a factor that determines which learning epoch begins the annealing process.The epoch in which the decay begins is t 0 = τ/η 0 .

Experimental Results
The exclusive OR function was the first Boolean function to be learned by a multilayer perceptron that was not linearly separable [4].Encoding data with a low density parity check (LDPC) code [14] could be considered the exclusive OR problem's big brother.An LDPC is computed by multiplying (0,1)-vectors with a (0,1)-matrix over GF (2) to get a longer (0,1)-vector.Remember that in GF(2) addition is the XOR operation and multiplication is the AND operation.Formally, a low density parity check code is a code in which extra bits that are parity checks are added the message.
A matrix is used to represent the transformation of the bits over GF (2).The encoding function is the matrix This code has a rate of 1/2 because the code words are twice as long as the message words.
The training set consists of 32 training patterns.The inputs are the 32 possible messages, and the outputs are the 32 10-bit code words corresponding to each message.Thus, the network has 5 input units and 10 output units.The multilayer perceptron will also have 10 hidden units.During the training the bits are changed according to a bit error rate of 0.01.This will help the network deal with bit errors.
First, the performance of the three algorithms will be compared.This will be followed by a comparison of their effective learning rates.Finally, an experiment in which the effective learning rate of the SNGL and ANGL algorithms when these learning rates are used in the ordinary gradient learning (OGL) algorithm as described in [4] will be discussed.

Performance Comparison.
Figure 1 shows the sumsquared error of the three algorithms when learning this LDPC encoding function.As can be seen, the performance of the SNGL algorithm is comparable to that of the OGL algorithm, but it reaches convergence in less than 1% of the time.
Table 2 lists the parameters used in all three algorithms and some of the performance metrics.The ANGL algorithm achieves a much lower MSSE.Using the prior probability by adding the current parameters multiplied by the hyperparameter α caused the components of the Fisher information matrix to approach infinity.Therefore, it was omitted, and the weights of the network were allowed to grow without constraint.Thus, the ANGL overtrained the network significantly.
Table 2 may at first seem to be an unfair comparison.However, each of the algorithms were run with and without the regularization and annealed learning rates.The algorithm that performed the best is listed above.This was done to conserve space and simplify the reports.Thus, OGL performs better without an annealed learning rate on this problem.There were several difficulties implementing ANGL with the regularization.It did not perform well with regularization.

The Effective Learning Rate.
The effective learning rate is defined to be the product of the baseline learning rate and the smallest eigenvalue of the matrix chosen to represent the inverse of the Fisher information matrix.For OGL the effective learning rate is just the learning rate chosen for the algorithm.For ANGL it is the learning rate multiplied by the smallest eigenvalue of the estimated Fisher information matrix.For SNGL it is the inverse of the largest eigenvalue of the Fisher information matrix multiplied by the learning rate.Figure 2 shows the effective learning rates for these three algorithms.It is a simplified number to give the reader a sense of the step size used by the algorithm during each learning epoch.
The effective learning rate for the SNGL algorithm is small for the first 100 learning epochs.Between the hundredth and the first thousandth epoch, the learning rate becomes quite large, for a step size, becoming larger than one for a few epochs.When the algorithm reaches convergence, then the learning rate begins to decrease due to annealing.The LDPC function is more complicated than XOR, and this is reflected in the number of learning epochs required to reach the three phases of learning for SNGL.

Substituting Learning Rates.
One might argue that SNGL and ANGL perform better than OGL because they both use higher learning rates.However, as Figure 2 shows, both of these algorithms have smaller learning rates at first, and the learning rates do not increase until progress is made towards good parameter values.
Figure 3 shows the performance of OGL with the effective learning rates of ANGL and SNGL in addition to OGL with a flat learning rate along with ANGL and SNGL.The plot shows that using the effective learning rate of ANGL improves the performance of OGL.However, the performance of OGL does not improve with the SNGL algorithm's effective learning rate.For this problem, the Fisher information matrix used in SNGL is a block-diagonal matrix with two 10-by-10 blocks on the main diagonal.In the ANGL algorithm, it is a 61-by-61 matrix.These matrices are positive definite.The eigenvalues represent how much information is associated with each direction.The eigenvectors associated with these eigenvalues are these directions.Using the maximum eigenvalue to determine an effective learning rate is good for demonstration purposes.However, it will not improve performance of the algorithm because the directional information is missing.

Conclusion
This paper developed a simplified natural gradient learning algorithm.Using an algebraic justification to form the gradient of a multilayer perceptron as a block matrix, a smaller block-diagonal Fisher information matrix was discovered.
For the three problems studied in the experiments, the Fisher information matrix was much smaller than that in ANGL.The minimum sum-squared error performance of this algorithm is comparable to Amari et al. [2] but uses less memory and computational resources.

Figure 1 :
Figure 1: Sum-squared error of the OGL, SNGL, and ANGL algorithms when learning the low density parity check (LDPC).

Figure 2 :
Figure 2: Effective learning rate of the OGL, SNGL, and ANGL algorithms when learning the LDPC encoding function.

Figure 3 :
Figure 3: Sum-squared error of the OGL, SNGL, and ANGL algorithms compared to the OGL algorithm with the effective LR of the ANGL and SNGL algorithms.

Table 1 :
Complexity comparison between the SNGL and ANGL algorithms on other problems.