Building Recurrent Neural Networks to Implement Multiple Attractor Dynamics Using the Gradient Descent Method

The present paper proposes a recurrent neural network model and learning algorithm that can acquire the ability to generate desired multiple sequences. The network model is a dynamical system in which the transition function is a contraction mapping, and the learning algorithm is based on the gradient descent method. We show a numerical simulation in which a recurrent neural network obtains a multiple periodic attractor consisting of five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. The present analysis clarifies that the model contains many stable regions as attractors, and multiple time series can be embedded into these regions by using the present learning method.


Introduction
Recurrent neural networks (RNNs) have been successfully applied to the modeling of various types of dynamical systems. Since the universal approximation ability of multilayer neural networks has been proved, RNNs can model arbitrary dynamical systems and Turing machines [1][2][3]. However, training an RNN to realize a desired model may be very difficult even if such an RNN exists [4]. For example, building RNNs to implement required multiple attractor dynamics is a difficult problem for standard training, such as the gradient descent method. Doya and Yoshizawa [5] demonstrated that RNNs can acquire two limit cycles by the gradient descent method using initialization with small connection weights, whereas learning more than three limit cycles is difficult [6]. This is because the learning of several time series causes conflicts in the changes of the connection weights. How to form RNN models that can learn several temporal sequence patterns has proved to be a challenging problem.
There have been some approaches to this problem. In order to avoid conflicts in the change of parameters, the mixture-of-experts-type architecture has been investigated [7,8]. The mixture-of-experts model consists of RNNs as experts and a hierarchical gating mechanism. At the end of successful learning, each expert implements attractor dynamics as locally represented knowledge, and a gating mechanism chooses only one expert at any time. The system can acquire many attractor patterns, although it has the disadvantage that it does not generalize across the attractor patterns. As another approach to implementing multiple patterns, the parametric bias (PB) method has been developed to improve the learning capability of RNNs [9,10]. In an RNN that employs the PB method (RNNPB), PB values provide the information needed to individualize each sequence. It has been reported that RNNPBs can learn a greater number of time series than RNNs without PB. However, the PB method cannot avoid the conflict caused by the learning of each attractor. Therefore, learning multiple time series by an RNNPB tends to fail when the number of time series increases.
In the present study, we focus on the training method for RNNs to learn multiple attractor dynamics. Furthermore, we show that the present research is related to research into RNNs with contraction transition functions. In recent years, RNNs with contraction transition mapping have been investigated with respect to the performance of time series learning [11][12][13], generalization ability [14], and memory capacity [15]. Jaeger [11,12] demonstrated that an "echo state network," which is an RNN with contraction mapping, successfully learns the Mackey-Glass chaotic time series, a well-known benchmark system for time series prediction. In order to formally express the generalization ability, Hammer and Tiňo proved that RNNs with contraction are distribution-independent learnable in the probably approximately correct (PAC) sense [14]. From the above results, RNNs with contraction might be regarded as powerful tools for modeling dynamical systems. However, RNNs with contraction have difficulty in representing multiple attractor dynamics because dynamic states governed by the contraction transition function are globally attracted to one point. In this paper, the representation capability of RNNs with contraction mapping is improved such that the RNNs can obtain multiple attractor dynamics.
We start by defining the concepts of the RNN and the training method for multiple attractor dynamics. The RNN has the Elman net-type architecture, and the training method for RNNs is based on the backpropagation through time (BPTT) algorithm [16]. We then show in numerical simulation that the RNNs can acquire multiple periodic attractors constituted by five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. Moreover, we consider why the RNNs successfully learn multiple attractors and how the learning performance depends on parameters of the RNNs. Finally, we link the results obtained herein to other learning strategies, and consider other advanced research topics.

Recurrent Neural Network.
We first consider a neural network model with recurrent connection, such as the Elman net [17] (see Figure 1). The RNN contains I/O units, orthogonal units, and internal units. We denote the dynamic states of I/O units, orthogonal units, and internal units at time step $n$ by $x_n \in \mathbb{R}^{N_1}$, $r_n \in \mathbb{R}^{N_2}$, and $u_n \in \mathbb{R}^{N_3}$, respectively. The RNN is defined by functions $f_\theta$ and $g_\theta$ with a parameter $\theta \equiv (W_1, W_2, W_3, V, b, d)$, where $F$ denotes a componentwise application such as $F_i = \tanh$. Dynamic states of the RNN at time step $n$ are updated according to $f_\theta$ and $g_\theta$. From these equations, the RNN can be represented by an $N_3$-dimensional dynamical system. We now define bistability for the RNN.
The function $f_\theta$ is bistable with respect to the third variable $u$ if a real value $\omega > 1$ and an integer $N_s$ exist such that condition (4) holds for every element $w_{ij}$ of the matrix $W_3$.
The bistability of a function $f_\theta$ is a key concept of our learning method. We will show in Section 4.1 that the bistable function $f_\theta$ plays an important role in the learning of multiple attractor dynamics.
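The intuition behind bistability can be illustrated with a single unit. The sketch below is not the network function $f_\theta$ itself: it assumes a hypothetical scalar unit with self-feedback gain `omega` and activation tanh, and iterates $u \leftarrow \tanh(\omega u)$. For a gain above 1 the iteration settles on one of two symmetric nonzero fixed points, depending on the sign of the initial state; for a gain below 1 it collapses to the single fixed point 0.

```python
import numpy as np

def iterate_self_feedback(u0, omega, steps=200):
    """Iterate the scalar map u <- tanh(omega * u) and return the final state."""
    u = float(u0)
    for _ in range(steps):
        u = float(np.tanh(omega * u))
    return u

omega = 2.5  # illustrative gain; the definition in the text requires omega > 1
u_plus = iterate_self_feedback(0.1, omega)    # positive start: positive fixed point
u_minus = iterate_self_feedback(-0.1, omega)  # negative start: mirrored fixed point
u_weak = iterate_self_feedback(0.1, 0.5)      # gain below 1: state decays to 0
print(u_plus, u_minus, u_weak)
```

The two stable states of such a unit are what allows a vector of these units to store a sign pattern, which is the property the later analysis relies on.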

Learning Method.
We present a formulation of the training procedure for the RNN with multiple teacher I/O time series. For every $1 \le k \le m$ and $L_k \in \mathbb{N}$, we assume that the $k$th teacher I/O time series has length $L_k$.

Initialization of Parameters.
We initialize every element of matrices $W_1(0)$, $W_2(0)$, and $V(0)$ and vectors $b(0)$ and $d(0)$ randomly from the uniform distribution in the interval $(-1/N_3, 1/N_3)$. A matrix $W_3(0)$ is randomly assigned such that $f_\theta$ is bistable. Assume that $r^{(k)}_n(0)$ is an $m$-tuple of vectors $(s^{(k,1)}_n(0), \ldots, s^{(k,m)}_n(0))$ for $1 \le k \le m$ and $1 \le n \le L_k$, and that the dimension of $s^{(k,l)}_n(0)$ is equivalent to that of $s^{(k',l')}_n(0)$ if $l = l'$. We initialize $s^{(k,l)}_n(0)$ according to (5).

Advances in Artificial Neural Systems

Run Network with Teacher I/O and Compute Error Function.
For every $1 \le k \le m$, the sequence $(x^{(k)}_1(t), \ldots, x^{(k)}_{L_k}(t))$ of I/O units of the RNN at learning step $t$ is obtained by running the network with the $k$th teacher I/O time series. The error function $E^{(k)}(t)$ of the RNN at learning step $t$ with the $k$th teacher I/O time series is defined in terms of the mean square error function $e$ between this sequence and the teacher I/O time series. Finally, the error function $E(t)$ at learning step $t$ is defined from the errors $E^{(k)}(t)$ of all $m$ teacher I/O time series. The parameters other than $W_3$ are updated by the gradient descent method with a momentum term $\Delta\rho(t)$, where $\Delta\rho(0) = 0$; $\alpha$ and $\beta$ are the constants of the learning rate and momentum, respectively. In contrast, the connection matrix $W_3(t)$ is kept fixed, $W_3(t+1) = W_3(t)$, in order to preserve the bistability condition. We compute the initial state $u^{(k)}_1(t+1)$ of the internal units at learning step $t+1$ by a corresponding update, where $\Delta u^{(k)}_1(0) = 0$, and $\alpha'$ is the constant of the learning rate of the initial state. The vector $s^{(k,l)}_n(t)$, a component of the orthogonal units $r^{(k)}_n(t)$, is updated likewise, where $\Delta s^{(k,l)}_n(0) = 0$, and $\alpha''$ is the constant of the learning rate of the orthogonal units.
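As a rough illustration of an update of this kind, the sketch below assumes the common momentum form $\Delta\rho(t+1) = -\alpha\,\partial E/\partial\rho + \beta\,\Delta\rho(t)$ with $\Delta\rho(0) = 0$. This form, the helper `momentum_step`, the dictionary layout, and the toy quadratic error are all assumptions made for the example, not the paper's exact equations. The only feature taken directly from the text is that $W_3$ is excluded from the update.

```python
import numpy as np

def momentum_step(params, grads, velocity, alpha, beta, frozen=("W3",)):
    """One gradient-descent step with momentum; names in `frozen` stay unchanged."""
    new_params, new_velocity = {}, {}
    for name, value in params.items():
        if name in frozen:
            # W3(t+1) = W3(t): keep the bistability condition intact.
            new_params[name] = value
            new_velocity[name] = velocity[name]
            continue
        delta = -alpha * grads[name] + beta * velocity[name]
        new_params[name] = value + delta
        new_velocity[name] = delta
    return new_params, new_velocity

# Toy error E = 0.5 * sum of squared entries, so each gradient equals the parameter.
rng = np.random.default_rng(0)
params = {"W1": rng.normal(size=(3, 2)), "W3": 2.5 * np.eye(3)}
velocity = {name: np.zeros_like(value) for name, value in params.items()}  # delta(0) = 0
w3_initial = params["W3"].copy()

for _ in range(500):
    grads = {name: value.copy() for name, value in params.items()}
    params, velocity = momentum_step(params, grads, velocity, alpha=0.1, beta=0.9)

print(np.abs(params["W1"]).max())
```

In a real training loop the gradients would come from BPTT over the teacher sequences rather than from a closed-form toy error.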
Note that the maximum value of the error function $E(t)$ depends on the number of units and the length of the teacher I/O time series. Thus, we should scale the learning rates $\alpha$, $\alpha'$, and $\alpha''$ with the number of units and the length of the sequences. In the present paper, we consider parameters $\gamma$, $\gamma'$, and $\gamma''$ that determine the scaled learning rates.

Numerical Experiments
In this section, we conduct two types of experiments as examples of using the training method for RNNs proposed in Section 2. The first experiment shows the learning of five Lissajous curves. The second experiment shows the training of multiple attractors of a Van der Pol oscillator with 12 different parameters.

Teacher I/O Time Series. Our first task is to learn five Lissajous curves, for which we consider constants $M = 32$ and $L_k = 200$ for all $1 \le k \le 5$ (see Figure 2).

Learning and Testing.
We now describe the specific conditions applied to RNN training. The time constant is set to 0.1. The number $N_2$ of orthogonal units is 10, and the dimension of a vector $s^{(k,l)}_n$ is 2 for all $1 \le l \le 5$. Suppose that $f_\theta$ is bistable with $N_3 = 30$, $N_s = 15$, and $\omega = 2.5$. The learning rates and momentum are given by $\gamma = 0.1$, $\gamma' = \gamma'' = 0.01$, and $\beta = 0.9$, respectively.
Figure 3 shows the error function $E^{(k)}(t)$ for 20 000 learning steps. We also show the Kullback-Leibler divergence between the teacher I/O time series and a sequence of I/O units in the RNN computed by (3), which does not use external perturbation by the teaching sequences. We use the Kullback-Leibler divergence as a measure of the discrepancy between two sequences. Formally, the Kullback-Leibler divergence between two probability distributions $p$ and $q$ is defined as $D(p \,\|\, q) = \sum_x p(x) \log \left( p(x) / q(x) \right)$. By definition, in order to compute the Kullback-Leibler divergence, it is necessary to obtain probability distributions of the teacher I/O time series and a sequence of I/O units. However, obtaining the probability distribution of a sequence of I/O units is very difficult. Therefore, we quantize a time series of real-valued vectors into a symbolic sequence such that if the real value is less than 0, then the symbol 0 is assigned, and otherwise the symbol 1 is assigned. In addition, we use the probability distribution with which sub-blocks of block length $l$ appear in the symbolic sequence given by the above quantization.
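The quantization and block statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions: it handles a single scalar component rather than a vector-valued series, the helper names (`quantize`, `block_distribution`, `kl_divergence`) are hypothetical, the small constant `eps` used for blocks that never occur in the second sequence is an assumption (the text does not say how such blocks are treated), and the two sine waves merely stand in for a teacher series and a generated series.

```python
import numpy as np
from collections import Counter

def quantize(series):
    """Assign symbol 0 to values below 0 and symbol 1 otherwise."""
    return (np.asarray(series) >= 0).astype(int)

def block_distribution(symbols, block_len):
    """Relative frequency of each sub-block of length `block_len`."""
    seq = np.asarray(symbols).tolist()
    blocks = [tuple(seq[i:i + block_len]) for i in range(len(seq) - block_len + 1)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return {block: count / total for block, count in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) = sum_b p(b) * log(p(b) / q(b)); eps guards blocks missing in q."""
    return float(sum(pb * np.log(pb / max(q.get(b, 0.0), eps)) for b, pb in p.items()))

n = np.arange(200)
teacher = np.sin(2 * np.pi * n / 32 + 0.3)    # stand-in for a teacher series
generated = np.sin(2 * np.pi * n / 20 + 0.3)  # stand-in for a mismatched output

p = block_distribution(quantize(teacher), block_len=4)
kl_same = kl_divergence(p, p)
kl_diff = kl_divergence(p, block_distribution(quantize(generated), block_len=4))
print(kl_same, kl_diff)
```

A sequence compared with itself gives a divergence of zero, while a sequence with a different period gives a positive value, which is the behavior expected of a discrepancy measure.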
Figure 4 shows attractors of the trained RNN computed by (3), for which the initial state of the internal units is $u^{(k)}_1(t)$ for each $1 \le k \le 5$. By comparing the attractors with the teacher I/O time series displayed in Figure 2, we can see that the RNN can generate sequences similar to the training data.
In Figure 5, examples of attractors for the RNN with random initial states are displayed. This shows that, in addition to the attractors corresponding to the teacher I/O time series, there exist many attractors of the RNN.

Experiment 2: Van der Pol Attractors
3.2.1. Teacher I/O Time Series. Our second task is to learn multiple attractors given by the Van der Pol oscillator with different parameters. The Van der Pol oscillator is a model of an electronic circuit that appeared in very early radios. It is well known that there exists a limit cycle for the Van der Pol oscillator. In this experiment, we consider twelve teacher I/O time series, where the $k$th teacher I/O time series $x^{(k)}_n$ is given by (16) for $\mu = 0.25$ and $a = 0.15$, where $b_k$ and $c_k$ are constant parameters representing the center position of the limit cycle, and $\tau_k$ is a time constant of the oscillator. We assume that the parameters $b_k$, $c_k$, and $\tau_k$ are given by combining the values of $b_k = \pm 0.4$, $c_k = \pm 0.4$, and $\tau_k = 2, 4, 6$. Figure 6 shows the teacher I/O time series given by (16).
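The limit cycle and the count of twelve parameter settings can be checked with a short sketch. It assumes the textbook form of the oscillator, $\ddot{x} - \mu (1 - x^2) \dot{x} + x = 0$, integrated with a classical fourth-order Runge-Kutta scheme; the scaling, offset, and time-constant terms of the paper's own equation (the roles of $a$, $b_k$, $c_k$, and $\tau_k$) are not reproduced here, and the step size and initial state are arbitrary choices.

```python
import itertools
import numpy as np

def van_der_pol(state, mu):
    """Textbook Van der Pol vector field: x' = y, y' = mu * (1 - x^2) * y - x."""
    x, y = state
    return np.array([y, mu * (1.0 - x * x) * y - x])

def rk4_trajectory(state, mu, dt, steps):
    """Integrate with the classical fourth-order Runge-Kutta method."""
    traj = np.empty((steps, 2))
    for n in range(steps):
        k1 = van_der_pol(state, mu)
        k2 = van_der_pol(state + 0.5 * dt * k1, mu)
        k3 = van_der_pol(state + 0.5 * dt * k2, mu)
        k4 = van_der_pol(state + dt * k3, mu)
        state = state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        traj[n] = state
    return traj

# Start inside the limit cycle and let the trajectory spiral outward onto it.
traj = rk4_trajectory(np.array([0.5, 0.0]), mu=0.25, dt=0.01, steps=20000)
amplitude = np.abs(traj[-2000:, 0]).max()  # peak |x| over the last 20 time units

# Combining b = +/-0.4, c = +/-0.4, and tau in {2, 4, 6} gives 2 * 2 * 3 settings.
settings = list(itertools.product((-0.4, 0.4), (-0.4, 0.4), (2, 4, 6)))
print(amplitude, len(settings))
```

For small $\mu$ the limit cycle of the textbook oscillator has an amplitude close to 2, which the sketch reproduces regardless of the starting point inside the cycle.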

Learning and Testing.
The parameters for learning are set as follows. Let $f_\theta$ be bistable with $N_3 = 40$ and $N_s = 20$. The dimension of the vector $s^{(k,l)}_n$ is 1 for every $1 \le l \le 12$ so that $N_2 = 12$. Other parameters are the same as in experiment 1.
The error function and the Kullback-Leibler divergence for 200 000 learning steps are displayed in Figure 7. Figure 8 shows attractors of the trained RNN, for which the initial state of the internal units is set to $u^{(k)}_1(t)$ for every $1 \le k \le 12$.
This result indicates that the RNN acquires multiple periodic attractors constituted by the teacher I/O time series.

Numerical Analysis
4.1. Contraction and Bistability. Assume that $X$ and $U$ are sets and that $U$ is equipped with a metric $d$. A function $f : X \times U \to U$ is a contraction with respect to $U$ if a real value $C \in [0, 1)$ exists such that the inequality $d(f(x, u_1), f(x, u_2)) \le C\, d(u_1, u_2)$ holds for all $x \in X$ and $u_1, u_2 \in U$.

Lemma 2. Let one consider a dynamical system on $\mathbb{R}^{N_3}$ defined by the transition function $u_{n+1} = f_\theta(g_\theta(u_n), u_n)$, where $f_\theta$ and $g_\theta$ are defined in (1) and (2), respectively. Assume that each element $w_{ij}$ of the matrix $W_3$ satisfies (4), and $\nu \in \mathbb{R}$ is the maximum absolute value of elements in $W_1$, $W_2$, and $b$. If there exist three solutions of (18), then (1) there are $2^{N_s}$ invariant sets of the dynamical system, and (2) the restriction of $f_\theta$ to each invariant set is a contraction with respect to the third variable.

Proof. We suppose that (18) has three solutions $x_1 > x_2 > x_3$ (see Figure 9). In general, $x_1, x_2 > 0$ and $x_3 < 0$.
(1) Assume $1 \le i \le N_s$. The region $[x_1, \infty)$ is a stable set of the $i$th element of vector $u_n$ in the sense that if $u^{(i)}_n \ge x_1$, then $u^{(i)}_{n+1} \ge x_1$. Similarly, we can easily show that if $u^{(i)}_n \le -x_1$, then $u^{(i)}_{n+1} \le -x_1$. Thus, there are two stable regions of the $i$th element of vector $u_n$ for each $1 \le i \le N_s$. Then, there are $2^{N_s}$ invariant sets.
(2) Let $U \subset \mathbb{R}^{N_3}$ be the invariant set presented above, and assume that $u_n, u'_n \in U$. The contraction inequality is obtained by bounding the elements with $1 \le i \le N_s$ and the elements with $N_s \le i \le N_3$ separately.

For any $N_1, N_2 \in \mathbb{N}$ and $\nu \ge 0$, there is a real number $q$ such that if $\omega \ge q$, then (18) has three solutions. Thus, if $\omega$ is large enough and matrices $W_1$ and $W_2$ represent small connection weights, then $f_\theta$ contains $2^{N_s}$ invariant sets, and each restriction of $f_\theta$ to an invariant set is a contraction with respect to the third variable. Moreover, the integer $N_3 - N_s$ is the effective degree of freedom for each contraction mapping restricted to an invariant set. If $N_3 - N_s$ is large, then the RNN can acquire a more complex time sequence. In Figures 10 and 11, we plot the Kullback-Leibler divergence of the trained RNN for parameters $\omega$ and $N_s$, in which the training data are the same as those for experiment 1. These results imply that $\omega$, $2^{N_s}$, and $N_3 - N_s$ must all be large in order to learn multiple attractor dynamics.
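The counting argument for the invariant sets can be illustrated numerically. The sketch below is a toy model, not the transition function of the lemma: it assumes each of `n_s` units has self-feedback gain `omega` and receives an additional bounded input of magnitude at most `eps`, which stands in for the small contributions of the remaining connections. Under these assumptions a unit that starts in a stable region keeps its sign, so every one of the $2^{N_s}$ sign patterns is preserved.

```python
import itertools
import numpy as np

omega, n_s, eps = 2.5, 4, 0.1  # illustrative values; n_s kept small for speed
rng = np.random.default_rng(0)

def step(u):
    """One update of the toy units: self-feedback plus a bounded random drive."""
    drive = rng.uniform(-eps, eps, size=u.shape)
    return np.tanh(omega * u + drive)

patterns = list(itertools.product((-1.0, 1.0), repeat=n_s))
preserved = 0
for signs in patterns:
    target = np.array(signs)
    u = 0.95 * target  # start inside the stable region of every unit
    for _ in range(100):
        u = step(u)
    preserved += int(np.array_equal(np.sign(u), target))

print(len(patterns), preserved)
```

With these values a unit at or beyond 0.9 in magnitude is mapped to at least $\tanh(2.5 \cdot 0.9 - 0.1) \approx 0.97$, so the bound holds at every step and not only on average.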

Orthogonality.
In the last paragraph of the previous section, we have shown that RNNs have many stable regions, and the existence of the stable regions plays an important role in the learning of multiple sequences. However, the existence of multiple stable regions is not sufficient for success in multiple attractor learning: if the change of parameters corresponding to one time series interferes with the changes corresponding to the others, the time series cannot necessarily be embedded into separate regions. This problem also appears in the RNNPB method.
In the training algorithm defined in Section 2, each state of orthogonal units $r = (s^{(1)}, \ldots, s^{(m)})$ is trained by (5) and (11). Thus, firing of $s^{(k)}$ only occurs in the generation of the $k$th teaching sequence. This implies that orthogonal units allow the conflict of parameter changes caused by multiple time series learning to be avoided because orbits corresponding to each teaching I/O time series run around the orthogonal state space of the trained RNN. In order to show the effect of the orthogonal units on the conflict among teaching sequences, we consider the $k$th learning ratio $dv^{(k)}_{ij}(t)$ of $v_{ij}$, an element of the matrix $V$. If $dv^{(k)}_{ij}(t)$ is nearly equal to 1, then the change in $v_{ij}$ is approximately independent of the teaching sequences other than the $k$th sequence. In Figure 12, we plot the value $dV^{(k)}(t)$, which is the average of the $k$th learning ratio over the set $R^{(k)}$ of indices corresponding to the elements of the vector $s^{(k)}$, that is, over the connections between internal units and orthogonal units $s^{(k)}$. In this numerical experiment, for each learning step, $dV^{(k)}(t)$ is clearly larger than $1/m = 0.2$, where $m$ is the number of teaching sequences. Hence, for the connection weights between internal units and orthogonal units $s^{(k)}$, the contribution of the $k$th teaching sequence is dominant. Therefore, in changing matrix $V$, there is no conflict generated by multiple teaching sequences. However, we could not find a strong bias of the learning ratio for the matrices $W_1$ and $W_2$ and every element $v_{ij}$ of $V$ with $i < N_s$. Thus, we consider that connection weights between internal units and orthogonal units encode information on an individual time series, and other connection weights encode information common to all time series.

Figure 12: Average $dV^{(k)}(t)$ of the $k$th learning ratio for the connections between internal units and orthogonal units $s^{(k)}$ for 200 000 learning steps in experiment 1.

Discussion
In this report, we have investigated a method of embedding multiple time series into a single RNN. In order to clarify the characteristics of the proposed approach, we compare it with other approaches with respect to the information representation of multiple sequences in the models. The mixture-of-RNN-experts-type model composes a local representation in an RNN for each sequence. The local representation provides robustness against changing the parameters in learning, but it lacks the ability to extract common patterns included in the sequences because the local representations are independent. In the proposed model, the local representation is constructed in the orthogonal units, while the global representation is constructed in the internal units using the connection weights between I/O units and internal units. Since each sequence generated by the proposed model shares the state space and connection weights, the model can extract common patterns of the sequences as conventional neural networks do. Another characteristic, which clarifies the difference between our model and other models, is whether the classification of each time series is self-organized into the state space. For example, in the mixture-of-RNN-experts-type model, the allocation of time series to each RNN is determined automatically. As another example, in the RNNPB model, PB values are self-organized such that the PB can individualize each time series. On the other hand, the proposed model needs the information of orthogonalization for each time series. Since the sparse firing patterns that appear in the orthogonal units, corresponding to the time series, are given externally as teaching information, the classification of sequences is not self-organized. The fact that the time series cannot be automatically classified is a disadvantage of the proposed model. However, the time series can be classified using other clustering techniques before applying the proposed method. Thus, by combining the proposed method and other clustering techniques, an algorithm that automatically classifies and generates multiple time series can be constructed.

Conclusion
In this paper, we have presented an RNN model and a learning algorithm that can acquire the ability to generate multiple sequences. The RNN model is characterized by two distinct properties called bistability and orthogonality. Bistability guarantees the existence of multiple attractor structures in RNNs, and provides the RNNs with contraction transition mapping. Orthogonality, which is given as a function of the orthogonal vectors of RNNs, helps prevent conflicts with respect to parameter changes caused by multiple training sequences. In the numerical experiments, RNNs that have bistability and orthogonality can learn multiple periodic attractors constituted by five Lissajous curves or 12 Van der Pol oscillators. Based on these results, the proposed model can be applied to the modeling of various types of dynamical systems that include multiple attractors.

Figure 1: Architecture of the recurrent neural network. Solid arrows, dotted arrows, and boxes represent fixed connections, adjustable connections, and network states, respectively.

Figure 2: Trajectories of the teacher I/O time series in experiment 1.

Figure 3: Error and Kullback-Leibler divergence between the teaching sequences and output generated by the RNN for 20 000 learning steps in experiment 1.

Figure 4: Time series $x_n$ generated by the trained RNN in experiment 1. For each time series, only the initial state $u_0$ is different.

Figure 5: Time series $x_n$ generated by the trained RNN with random initial state $u_0$ in experiment 1.

Figure 7: Error and Kullback-Leibler divergence between the teaching sequences and output generated by the RNN for 200 000 learning steps in experiment 2.

Figure 11: Kullback-Leibler divergence between the teaching sequences and output generated by the trained RNN.