Encoding Static and Temporal Patterns with a Bidirectional Heteroassociative Memory

The brain-inspired artificial neural network approach offers the ability to develop attractors for each pattern if feedback connections are allowed. It also exhibits great stability and adaptability with regard to noise and pattern degradation, and can perform generalization tasks. In particular, the Bidirectional Associative Memory (BAM) model has shown great promise for pattern recognition because of its capacity to be trained with either a supervised or an unsupervised scheme. This paper describes such a BAM, one that can encode patterns of real and binary values, perform multistep pattern recognition of variable-size time series, and accomplish many-to-one associations. Moreover, it is shown that the BAM can be generalized to multiple associative memories, and that it can be used to store associations from multiple sources as well. The various behaviors result from topological rearrangements alone; the same learning and transmission functions are kept constant throughout the models. A consistent architecture is therefore used for different tasks, which increases the model's practical appeal and modeling importance. Simulations illustrate the BAM's various capacities using several types of encoding and recall situations.


Introduction
Being able to recognize and recall patterns of various natures and in different contexts is something that human beings accomplish routinely and with little effort. But these tasks are difficult to reproduce in artificial intelligent systems. A successful approach has consisted of distributing information over parallel networks of processing units, as is done in biological neural networks. This brain-inspired, artificial neural network approach offers the ability to develop attractors for each pattern if feedback connections are allowed (e.g., [1, 2]). It also exhibits great stability and adaptability with regard to noise and pattern degradation, and can perform generalization tasks. In particular, the Bidirectional Associative Memory (BAM) model has shown great promise for pattern recognition because of its capacity to be trained using a supervised or unsupervised scheme [3]. Given n bipolar column-vector pairs, (X_1, Y_1), (X_2, Y_2), ..., (X_i, Y_i), ..., (X_n, Y_n), BAM learning is accomplished with a simple Hebbian rule, according to the equation

W = YX^T.    (1.1)

In this expression, X and Y are matrices that represent the sets of bipolar pairs to be associated, and W is the weight matrix. To assure perfect storage and retrieval, the input patterns need to be orthogonal (X^T X = K, with typically K = I, the identity matrix; [4]). In that case, all the positive eigenvalues of the weight matrix will be equal and, when an input is presented for recall, the correct output can be retrieved. For example, if the input x represents the encoded pattern X_i, then the output will be given by

y = Wx = YX^T X_i = Y_i.    (1.2)
The output y will thus correspond to the encoded stimulus Y_i. Equation (1.1) uses a one-shot learning process, since Hebbian association is strictly additive. A more natural learning procedure would make the learning incremental but, then, the weight matrix would grow unbounded with the repetition of the input stimuli during learning. In addition, the eigenvalues will reflect the frequency of presentation of each stimulus. This property may be acceptable for orthogonal patterns, but it leads to disastrous results when the patterns are correlated. In that case, the weight matrix will be dominated by its first eigenvalue, and this will result in recalling the same pattern whatever the input. For correlated patterns, a compromise is to use a one-shot learning rule to limit the domination of the first eigenvalue, and to use a recurrent nonlinear transmission function that allows the network to filter out the different patterns during recall. Kosko's BAM effectively used a signum transmission function to recall noisy patterns, despite the fact that the weight matrix developed by using (1.1) is not optimal. The nonlinear transmission function usually used by the BAM network is expressed by the following equations:

y(t+1) = f(Wx(t)),  x(t+1) = f(W^T y(t+1)),

where f is the signum function and t is a given discrete time step. Initially, BAMs had poor storage capacity, could only store binary patterns, and were limited to static inputs. Nowadays, storage and recall performance have much improved; BAMs can learn and recall binary patterns with good performance (e.g., [5-20]), and that capability has been extended to real-valued patterns (e.g., [7, 8, 21-24]). Attention has also been given to learning and retrieving temporal sequences (see [25] for a review). These multistep associations have been

Analysis of the Transmission Function
This section describes how the transmission function is derived from analytical and numerical results for the one-dimensional cubic function.

One-Dimensional Symmetric Setting
The transmission function used in the model is based on the classic Verhulst equation [34].
Since this quadratic function has only one stable fixed-point, it is extended to a cubic form described by the dynamic equation

dx/dt = f(x) = rx - x^3,    (2.1)

where x represents the input value and r is a free parameter that affects the equilibrium states of (2.1). Figure 1 illustrates its shape for r = 1. The fixed-points are the roots of f(x) = 0. For example, if r = 1 the corresponding fixed-points are x = -1, 0, and 1. The stability properties of the fixed-points are determined by computing the derivative of (2.1):

f'(x) = r - 3x^2.    (2.2)

If the derivative at a given fixed-point is greater than zero, then a small perturbation results in growth; if the derivative is negative, then a small perturbation results in decay. The first situation represents an unstable fixed-point, whereas the second represents a stable one.
In the previous example, both x = 1 and x = -1 are stable fixed-points, while x = 0 is an unstable fixed-point. This is illustrated in Figure 1 by filled and empty circles, respectively.
Another way to visualize the dynamics of a recurrent neural network is based on the physical idea of energy,

dx/dt = -dE/dx,    (2.3)

where the negative sign indicates that the state vector moves downhill in the energy landscape. Using the chain rule, it follows from (2.3) that

dE/dt = (dE/dx)(dx/dt) = -(dx/dt)^2 <= 0.    (2.4)

Thus, E(t) decreases along trajectories or, in other words, the state vector globally converges towards lower energy states. Equilibrium occurs at locations of the vector field where dx/dt = 0; local minima of the energy correspond to stable fixed-points, and local maxima correspond to unstable ones. To find them, we need to find E(x) such that

dE/dx = -(rx - x^3).    (2.5)

The general solution is

E(x) = x^4/4 - rx^2/2 + C,    (2.6)

where C is an arbitrary constant (we use C = 0 for convenience). Figure 2 illustrates the energy function when r = 1. The system exhibits a double-well potential with two stable equilibrium points (x = -1 and x = 1).
The r parameter plays an important role in determining the number and positions of the fixed-points. Figure 3 illustrates the situation. For r < 0, the system has only one stable fixed-point, x = 0; for r > 0, there are three fixed-points, x = 0 and x = ±√r, of which the first is unstable and the other two are stable. Finally, for r = 0, we have a pitchfork bifurcation point. We deduce from the preceding that we must have r > 0 for the system to store binary stimuli.
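The fixed-point and energy analysis above can be checked numerically. The sketch below (the function names are ours, for illustration only) enumerates the fixed-points of dx/dt = rx - x^3, tests their stability with the sign of f'(x) = r - 3x^2, and evaluates the double-well energy E(x) = x^4/4 - rx^2/2:

```python
import numpy as np

def fixed_points(r):
    """Fixed-points of dx/dt = r*x - x**3: x = 0, plus x = ±sqrt(r) when r > 0."""
    pts = [0.0]
    if r > 0:
        pts += [np.sqrt(r), -np.sqrt(r)]
    return pts

def is_stable(x, r):
    """A fixed-point is stable when the derivative f'(x) = r - 3x^2 is negative."""
    return (r - 3 * x**2) < 0

def energy(x, r):
    """Double-well potential E(x) = x^4/4 - r*x^2/2 (integration constant C = 0)."""
    return x**4 / 4 - r * x**2 / 2

# r = 1: x = ±1 are stable, x = 0 is unstable; each well has energy -1/4
for x in fixed_points(1.0):
    print(x, is_stable(x, 1.0), energy(x, 1.0))
```

For r = -1 only the stable fixed-point x = 0 remains, in line with the pitchfork bifurcation at r = 0.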

M-Dimensional Symmetric Setting
In an m-dimensional space, (2.1) takes a vector form and becomes

dx_i/dt = r_i x_i - x_i^3,  i = 1, ..., m,    (2.7)

where x = (x_1, x_2, ..., x_m)^T, r = (r_1, r_2, ..., r_m)^T, and m is the network's dimension. When the weight matrix is diagonal, the system is uncoupled and the analysis is straightforward. As in the one-dimensional system, the fixed-points are defined by the roots of (2.7). The stability properties of each one are determined by finding the eigenvalues of its Jacobian matrix. Depending on the eigenvalues, different types of fixed-point behaviors are obtained. For example, if r = (1, 1)^T there will be nine fixed-points, as illustrated in Figure 4. Four of them are stable: (-1, -1), (-1, 1), (1, -1), and (1, 1); they are indicated by filled circles. The other five are unstable and indicated by empty circles. Here also, the stability of the fixed-points can be determined from the energy of the system, which generalizes to

E(x) = Σ_{i=1}^{m} (x_i^4/4 - r_i x_i^2/2).    (2.8)

In the case of a bipolar pattern, if r is set to (1, ..., 1)^T, then the energy of the system will be equal to -m/4. The function is plotted in Figure 5 for a two-dimensional system, using the previous values of r. The stable fixed-points correspond to the local minima in the plot, and they partition the recall space into four equal wells. For example, if the desired correct output fixed-point is x = (1, 1)^T, the probability that this fixed-point attracts a uniformly distributed pattern x whose elements x_i ∈ [-1, 1] is 25%.

Numerical Approximation Using Euler's Method
From the previous analysis, two stable fixed-points exist when r_i is positive. For simplicity, if all the elements of r are set to one, then (2.7) becomes

dx/dt = x - x^3.    (2.9)

This differential equation can then be approximated in discrete time by using the Euler method:

x(t+1) = x(t) + δ(x(t) - x(t)^3),    (2.10)

where δ is a small positive constant and t is a given discrete time step. Rearranging the terms yields

x(t+1) = (1 + δ)x(t) - δx(t)^3.    (2.11)

However, as it stands, (2.11) does not reflect the impact of the weight connections. To take into account that the connection strength between units is modified by a learning rule, we pose a(t) = Wx(t), where W represents the weight connections. Equation (2.11) then becomes

x(t+1) = (1 + δ)a(t) - δa(t)^3.    (2.12)
This last function is illustrated in Figure 6. As the figure shows, if a_i(t) = 1, then x_i(t+1) will have the same value of 1.
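A minimal sketch of the Euler-discretized transmission function (2.12) illustrates its attractor behavior: starting from a perturbed value, repeated application converges to the nearest stable point, and ±1 are exact fixed-points (the function name and starting value are ours, for illustration):

```python
def transmit(a, delta=0.1):
    """Euler-discretized cubic transmission (2.12): x(t+1) = (1+d)*a - d*a^3."""
    return (1 + delta) * a - delta * a**3

# Iterating from a perturbed state converges to the stable fixed-point +1
x = 0.4
for _ in range(200):
    x = transmit(x)
print(x)
```

With δ inside the stability range (0 < δ < 0.5 for bipolar stimuli, see below), the approach to ±1 is monotonic; larger δ values can produce oscillatory or aperiodic behavior.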
The slope of the derivative of the transmission function determines its stability. In our case, it is desired that the fixed-points have a stable monotonic approach. Therefore, the slope must be positive and less than one [35]:

0 < (1 + δ) - 3δa(t)^2 < 1.    (2.13)

This condition is satisfied when 0 < δ < 0.5 for bipolar stimuli. In that case, a = Wx(t) = ±1. Another way to analyze the behavior of the network over the whole range of δ is by performing a Lyapunov analysis. This analysis allows us to discriminate aperiodic (chaotic) attractors from periodic ones. The Lyapunov exponent for the case of a one-dimensional network is approximated by

λ ≈ (1/T) Σ_{t=1}^{T} ln |f'(x(t))|,    (2.14)

where T is the number of network iterations, set to 10,000 to establish the approximation. Again, for simplicity, we perform the analysis in the case of independent units. In other words, the derivative term is obtained from (2.13) with W = I, so that λ is given by

λ ≈ (1/T) Σ_{t=1}^{T} ln |(1 + δ) - 3δx(t)^2|.    (2.15)
To estimate the range of values for a given period, a bifurcation diagram can be computed. Figure 7 illustrates both the Lyapunov and bifurcation analyses and shows that for values of δ less than 1.0, the network converges to a fixed-point attractor. However, at higher values (e.g., 1.6), the network may converge to an aperiodic attractor.
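The Lyapunov estimate (2.15) for the uncoupled (W = I) case can be sketched directly; the parameter values are from the text, while the function name and starting point are ours. A negative exponent indicates convergence to a fixed-point; at large δ the exponent may turn positive, signaling the aperiodic regime:

```python
import numpy as np

def lyapunov(delta, x0=0.5, T=10_000):
    """Approximate Lyapunov exponent (2.15) of the map x -> (1+d)*x - d*x^3."""
    x, s = x0, 0.0
    for _ in range(T):
        s += np.log(abs((1 + delta) - 3 * delta * x**2))  # log of |f'(x(t))|
        x = (1 + delta) * x - delta * x**3                 # iterate the map
    return s / T

print(lyapunov(0.1))   # negative: fixed-point attractor
print(lyapunov(1.6))   # may be positive: aperiodic regime
```

A bifurcation diagram is obtained the same way, by recording the visited x values over a sweep of δ instead of accumulating the log-derivatives.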

Numerical Approximation Using the Fourth-Order Runge-Kutta Method
Another way to numerically approximate the network's dynamics is to use the Fourth-Order Runge-Kutta (FORK) method. Contrary to Euler's method, FORK uses an averaged estimation to approximate the next time step:

x(t+1) = x(t) + (Δ/6)(k_1 + 2k_2 + 2k_3 + k_4),
k_1 = f(x(t)),  k_2 = f(x(t) + (Δ/2)k_1),  k_3 = f(x(t) + (Δ/2)k_2),  k_4 = f(x(t) + Δk_3),    (2.16)

where Δ is a small approximation parameter and f(x) = x - x^3. Again, to take the weight connections into account, we pose a(t) = Wx(t). Substituting a(t) for x(t) in (2.16) yields the FORK transmission function (2.17).
This last function is illustrated in Figure 8. As the figure shows, if a_i(t) = 1, then x_i(t+1) will have the same value of 1. Again, as for any nonlinear dynamic system, to guarantee that a given output converges to a fixed-point x(t), the slope of the derivative of the transmission function must be positive and less than one [35]. This condition is satisfied when 0 < Δ < 0.135 for bipolar stimuli. In that case, a = Wx(t) = ±1. Here also, bifurcation and Lyapunov exponent diagrams were computed. Figure 8 shows that if the value of Δ is lower than 1.7, the network will converge to a fixed-point. However, if the value is too high, then a given input will converge to an aperiodic attractor.
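A sketch of one FORK step for the cubic dynamics (2.9) follows (function names are ours). Note that ±1 are exact fixed-points of the step, since f(±1) = 0 makes every k term vanish:

```python
def f(x):
    return x - x**3  # cubic dynamics (2.9) with r = 1

def rk4_step(x, dt):
    """One fourth-order Runge-Kutta (FORK) step of dx/dt = x - x^3."""
    k1 = f(x)
    k2 = f(x + dt * k1 / 2)
    k3 = f(x + dt * k2 / 2)
    k4 = f(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Iterating from a perturbed state converges to the stable fixed-point +1
x = 0.4
for _ in range(500):
    x = rk4_step(x, 0.1)
print(x)
```

As in the Euler case, the weight connections enter by substituting a(t) = Wx(t) for x(t) in the step.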

Performance Comparison between Euler and FORK Approximations
The purpose of this simulation was to compare the performance of the Euler and FORK approximations. Although FORK gives a more precise approximation of the ordinary differential equation (2.7), we need to evaluate whether a better approximation translates into a larger radius of attraction.

Methodology
A low and a high memory load were used as the basis of comparison. For the low memory load situation, the first 10 correlated patterns (a-j) were associated together. The patterns are 7 × 7 pixel images of alphabetic characters, where a white pixel is given the value -1 and a black pixel the value 1. This led to a memory load equal to 20% (10/49) of the 49-dimensional space capacity. Normally, such a high load cannot be handled by Kosko's BAM, whose capacity is about 14%. For the high memory load situation, the associations were extended to all 26 correlated patterns (a-z). This situation led to a memory load equal to 53% (26/49). Usually, around 50% is about the maximum load an optimized Hebbian-type associative memory can store without major performance decline [36]. Figure 9 illustrates the stimuli used for the simulation. The images were converted to 49-dimensional vectors before being input to the network.
Both methods were tested with their output parameter set to 0.05 and 0.025. These values are much lower than the theoretical limit found analytically; simulation results (not reported) show that a higher value makes the model unable to associate the patterns. The associations between patterns were accomplished using the learning rules (3.4) and (3.5) described in the next section.
After associating the desired pattern pairs, the radii of attraction obtained by Euler and FORK were evaluated on noisy recall tasks. The first task consisted of recalling noisy patterns obtained by adding random normally distributed vectors to the patterns. Each noise vector was distributed with a mean of 0 and a standard deviation of π (where π represents the desired proportion of noise). For the simulation, π varied from 0.1 to 1.0. The second task was to recall the correct associated stimulus from a noisy input obtained by randomly flipping pixels in the input pattern. The number of pixel flips varied from 0 to 10, corresponding to a noise proportion of 0 to 20%.
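The two noise models can be sketched as follows (the function names and the all-ones stand-in pattern are ours, for illustration): additive Gaussian noise with standard deviation equal to the noise proportion, and random sign flips of a fixed number of pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(pattern, prop):
    """Task 1: add zero-mean normal noise with std = prop to a bipolar pattern."""
    return pattern + rng.normal(0.0, prop, size=pattern.shape)

def flip_pixels(pattern, n_flips):
    """Task 2: flip the sign of n_flips randomly chosen pixels."""
    noisy = pattern.copy()
    idx = rng.choice(pattern.size, size=n_flips, replace=False)
    noisy[idx] *= -1
    return noisy

p = np.ones(49)                  # stand-in for a 7x7 bipolar character
print(flip_pixels(p, 10).sum())  # 39 ones and 10 minus-ones sum to 29
```

Flipping 10 of 49 pixels corresponds to the 20% noise ceiling used in the comparison.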

Results
Figure 10 illustrates the various performances. If the memory load is low (10 patterns over 49 pixels), as shown in the left column, then Euler's approximation has a slight advantage over FORK for random pixel flipping. This advantage increases if the memory load is high (26 patterns over 49 pixels), as shown in the right column. Moreover, the higher the value of the output parameter, the higher the performance. However, if the output parameter is set too high, then the network might not converge to a fixed-point. Therefore, the output transmission function will be based on Euler's method.

Architecture
The proposed BAM architecture is similar to the one proposed by Hassoun [37], as illustrated in Figure 11, where x(0) and y(0) represent the initial input states (stimuli), t is the number of iterations over the network, and W and V are weight matrices. The network is composed of two Hopfield-like neural networks interconnected in head-to-toe fashion. These interconnected layers allow a recurrent flow of information that is processed bidirectionally. The y-layer returns information to the x-layer and vice versa. If similar patterns are presented at both layers, then the network acts like an autoassociative memory; if the patterns at the y-layer are associated with different ones at the x-layer, then the network acts like a heteroassociative memory [3]. As a result, it encompasses both unsupervised (self-supervised) and supervised learning. In this particular model, the two layers can be of different dimensions and, contrary to usual BAM designs, the weight matrix on one side is not necessarily the transpose of that on the other side.

Transmission Function
Based on the numerical results of Section 2.5.2, the transmission function is expressed by the following equations. The input activations to the y- and x-layers are computed as follows:

a(t) = Wx(t),  b(t) = Vy(t).    (3.1)
The outputs are then obtained using Euler's approximation (2.12), defined by the following equations:

y_i(t+1) = (1 + δ)a_i(t) - δa_i(t)^3,  ∀i = 1, ..., N,
x_j(t+1) = (1 + δ)b_j(t) - δb_j(t)^3,  ∀j = 1, ..., M,    (3.2)

where N and M are the numbers of units in each layer, i and j index the respective vector elements, y(t+1) and x(t+1) represent the layers' contents at time t+1, and δ is a general output parameter. The shape of this function is illustrated in Figure 6. This function has the advantage of exhibiting continuous-valued (gray-level) attractor behavior when used in a recurrent network (for a detailed example, see [7]). Such properties contrast with standard nonlinear transmission functions, which only exhibit bipolar attractor behavior (e.g., [3]).

Learning
The network tries to solve the following nonlinear constraints:

y(0) = f(Wx(0)),  x(0) = f(Vy(0)),    (3.3)

where f is the transmission function defined previously (3.2). The form of the constraints and the recurrent nature of the underlying network call for a learning process that is executed online: the weights are modified as a function of the input and the obtained output. In addition, since incremental learning is favored, the network must be able to self-converge. As a result, the learning is based on time-difference Hebbian association [7, 38-40]. It is formally expressed by the following equations:

W(k+1) = W(k) + η(y(0) - y(t))(x(0) + x(t))^T,    (3.4)
V(k+1) = V(k) + η(x(0) - x(t))(y(0) + y(t))^T,    (3.5)

where η represents the learning parameter. In (3.4) and (3.5), the weight updates follow this general procedure: first, the initial inputs x(0) and y(0) are fed to the network; then, those inputs are iterated t times through the network (Figure 11). This results in outputs x(t) and y(t) that are used for the weight updates. Therefore, the weights will self-stabilize when the feedback is the same as the initial inputs (y(t) = y(0) and x(t) = x(0)); in other words, when the network has developed fixed-points. This contrasts with most BAMs, where the learning is performed solely on the activation (offline). Learning convergence is a function of the value of the learning parameter η. For simplicity, we assume that both input and output are the same (y(0) = x(0)). Therefore, to find the maximum value that the learning parameter can take, we need to find the derivative of the learning equation where the slope is positive and then solve it for η [35]. If t = 1, in the case of a one-dimensional pattern in a one-dimensional network, the situation simplifies to

w(k+1) = w(k) + η(x(0)^2 - x(1)^2).    (3.7)

In this case, the network converges when w(k+1) = w(k) = 1, and the solution will be

η < 1/(2(1 - 2δ)),  δ ≠ 1/2.    (3.8)

As with any network, as the dimensionality increases, η must be set to lower values. Therefore, in the case of a BAM of dimensions M and N, the learning parameter must be set according to

η < 1/(2(1 - 2δ) Max[M, N]),  δ ≠ 1/2.    (3.9)
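The online learning procedure can be sketched end to end on a toy problem. The sketch below (toy pattern pairs, names, and trial count are ours, not the paper's) applies the time-difference Hebbian updates (3.4)-(3.5) with t = 1 output iteration per update, then checks recall through the transmission function:

```python
import numpy as np

def transmit(a, delta=0.1):
    """Euler transmission (3.2), applied elementwise."""
    return (1 + delta) * a - delta * a**3

def learn(pairs, eta=0.01, delta=0.1, trials=2000, seed=0):
    """Time-difference Hebbian learning, t = 1:
    W <- W + eta*(y0 - y1)(x0 + x1)^T,  V <- V + eta*(x0 - x1)(y0 + y1)^T."""
    rng = np.random.default_rng(seed)
    n, m = pairs[0][0].size, pairs[0][1].size
    W, V = np.zeros((m, n)), np.zeros((n, m))
    for _ in range(trials):
        x0, y0 = pairs[rng.integers(len(pairs))]  # pick a pair at random
        y1 = transmit(W @ x0, delta)              # one iteration through W
        x1 = transmit(V @ y0, delta)              # one iteration through V
        W += eta * np.outer(y0 - y1, x0 + x1)
        V += eta * np.outer(x0 - x1, y0 + y1)
    return W, V

pairs = [(np.array([1., -1., 1.]), np.array([1., 1.])),
         (np.array([-1., 1., 1.]), np.array([-1., 1.]))]
W, V = learn(pairs)
print(np.sign(transmit(W @ pairs[0][0])))  # recovers the stored y-pattern
```

The chosen η = 0.01 and δ = 0.1 satisfy the bound (3.9) for this 3 × 2 network; the weights self-stabilize once the fed-back outputs match the initial inputs.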

Simulations
The following simulations first provide a numerical example that illustrates how learning and recall are performed using a small network. The next simulation reproduces a classic BAM one-to-one association task. Finally, the third simulation extends the association properties of the model to a many-to-one association task.

Numerical Example
The first simulation uses a toy example to show in detail how the learning and recall processes are performed.

Learning
For this simulation, the transmission function parameter δ was set to 0.1 and the learning parameter η was set to 0.01. Also, to limit the simulation time, the number of output iterations before each weight matrix update was set to t = 1 for all simulations. The stimulus pairs for the simulation are given by

First Learning Trial
Assume that the second pair is randomly selected.Then,

Second Learning Trial
Assume that the first pair is randomly selected.Therefore,

Recall
To test whether the network can effectively recall a given pattern, a pattern is selected and iterated through the network until stabilization. For example, if the following pattern (a noisy version of pattern 2) is presented to the network, the state vectors will be as follows. The output of the second layer will give

y(1) = [0.363, -0.363, 1]^T.    (4.10)

Then, this output is sent back to the first layer:

x(1) = [0.344, 0.344, 1, -0.344]^T.    (4.12)
Therefore, the new input is classified as part of the second pattern pair. The next section presents a classic example of one-to-one association.

One-to-One Association
The task of the network consists of associating a given picture with a name. The network should thus output the corresponding name from a picture, and the corresponding picture from a name.

Methodology
The first stimulus set consists of 8-bit grey-level pictures. Each image has a dimension of 38 × 24 and therefore forms a 912-dimensional real-valued vector. They are reduced-size versions of the California Facial Expressions set (CAFE; [41]). Each pixel was normalized to values between -1 and 1. The second set consists of 4-letter words on a 7 × 31 grid that identify each picture. For each name, a white pixel is assigned a value of -1 and a black pixel a value of 1. Each name forms a 217-dimensional bipolar vector. Therefore, the W weight matrix has 217 × 912 connections and the V weight matrix 912 × 217 connections. The network's task was to associate each image with its corresponding name, as depicted in Figure 12. The learning parameter η was set to 0.0025 and the output parameter δ to 0.1. Both values meet the requirements for weight convergence and fixed-point development. Since the model's learning is online and iterative, the stimuli were not presented all at once. To save computational time, the number of iterations before each weight update was set to t = 1. The learning followed the same general procedure described in the previous simulation.

Results
It took about 12,000 learning trials before the learning converged. The network was then able to perfectly associate a name with the corresponding picture, and vice versa. Tables 1 to 4 illustrate some examples of the noise tolerance and pattern completion properties of the network. More precisely, Table 1 shows how the network was able to recall the proper name from a noisy input. The input consisted of the first picture contaminated with a noise level of 22%; in other words, 200 pixels out of 912 were randomly selected and their values multiplied by -1.
In addition, because of the bidirectional nature of the network, it is possible not only to recall the appropriate name but also to clean the noisy input. Table 2 shows an example of pattern completion. In this case, the eye band consisted of pixels of zero value. The network was able to correctly output the name and restore the missing eyes. The network can also output a picture given an appropriately noisy name. For example, Table 3 shows that the network was able to recall the appropriate picture even though the input name was incorrect ("Kate" instead of "Katy"). Finally, Table 4 shows that since "Pete" is the only name that begins with a "P", only the first letter is necessary for the network to output the correct face. The general performance of the network has been evaluated by Chartier and Boukadoum [7]. The next section extends the simulations by introducing many-to-one associations.

Transmission Function with Hard Limits
For some applications (e.g., [7]), it is desirable that the transmission function lie within a fixed range of values. Saturating limits at -1 and 1 can be added to (3.2), and the transmission function is then expressed by the following equations:

y_i(t+1) = 1, if (1 + δ)a_i(t) - δa_i(t)^3 > 1,
y_i(t+1) = -1, if (1 + δ)a_i(t) - δa_i(t)^3 < -1,
y_i(t+1) = (1 + δ)a_i(t) - δa_i(t)^3, otherwise,  ∀i = 1, ..., N.    (4.13)
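The hard-limited variant (4.13) is a clamped version of the same cubic. A minimal sketch (function name and demo values are ours; the paper plots the function for δ = 0.5):

```python
import numpy as np

def transmit_hard(a, delta):
    """Cubic transmission with saturating limits at ±1 (4.13)."""
    out = (1 + delta) * a - delta * a**3
    return np.clip(out, -1.0, 1.0)

a = np.array([1.5, 0.5, -1.5])
print(transmit_hard(a, 0.2))  # endpoints clamp to ±1; the interior value stays cubic
```

Unlike a sigmoid, values strictly inside the slope region pass through the cubic unchanged, so gray-level attractor behavior is preserved.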
In contrast to a sigmoid transmission function, this function is not asymptotic at the ±1 values, and it still has the advantage of exhibiting continuous-valued (gray-level) attractor behavior when used in a recurrent network. Figure 13 illustrates the shape of the transmission function when δ = 0.5. We compared the same network with and without hard limits on the task (Figure 9) previously described in Section 2.5.1. The performance of the network was evaluated with transmission function δ values between 0.05 and 0.3; values outside this range gave worse results. In addition, the network was compared with the best results (δ = 0.05) obtained from the simulation illustrated in Figure 10, where no hard limits were used. After the association of the desired pattern pairs, the radius of attraction of the network was evaluated under noisy inputs obtained by randomly flipping pixels in the input pattern. The number of pixel flips varied from 0 to 10.
Figure 14 illustrates the various performances. The results show that under low-to-medium noise, the network performs best (about a 10% increase) if no hard limits are used. However, under medium-to-high levels of noise the situation is reversed; the network performs best (about a 5% increase) if hard limits are used (δ = 0.1).

Many-to-One Association
This simulation illustrates many-to-one association. The idea is to associate different emotions, depicted by different people, with the proper letter tag. Therefore, upon presentation of a given image, the network should output the corresponding emotion. The simulation is based on Tabari et al. [42].

Methodology
As in the previous section, the first stimulus set consists of 8-bit grey-level pictures. Each image has a dimension of 38 × 24 and therefore forms a 912-dimensional real-valued vector. They are reduced-size versions of the California Facial Expressions sample (CAFE; [41]). Each pixel was rescaled to values between -1 and 1. For each of the 9 individuals, 7 images reflect a given emotion (anger, disgust, fear, happy, maudlin, neutral, and surprised). The second set consists of letters placed on a 7 × 7 grid that identify each emotion (A, D, F, H, M, N, and S). For each letter, a white pixel is assigned a value of -1 and a black pixel a value of 1. Each letter forms a 49-dimensional bipolar vector. Therefore, the W weight matrix has 49 × 912 connections and the V weight matrix 912 × 49 connections. The network's task was to associate each emotion, expressed by different persons, with the corresponding letter (Figure 15). η was set to 0.0025 and δ to 0.1. Both values meet the requirements for weight convergence and fixed-point behavior. The learning followed the general procedure described previously.

Results
Once the learning trials were finished (about 1000 epochs), the network was able to correctly recall each emotion from the different people expressing it. As in one-to-one association, the network was able to remove the noise and correctly recall the appropriate emotion tag. Table 5 shows two examples of noisy images that were contaminated by 200 pixel flips (22%). Table 6 shows that the network recalled the appropriate emotion tag even in the absence of picture parts that are usually deemed important for identifying emotions, namely the eyes and the mouth.
However, the most important property in this context is the fact that the two weight matrices are dynamically linked. As shown before, the network outputs the corresponding letter given a face, but which face should the network output for a given letter? In this case, going from the letter layer to the picture layer implies a one-to-many association. Without an architecture modification that allows contextual encoding [7], the network will not be able to output the correct pictures. Rather, the network will average all the individuals' emotional expressions. Therefore, as shown in Figure 16, the network is able to extract which features in the images make each emotion different.

Model Extension to Temporal Associative Memory
Until now, the associations between the different stimuli were static. In some situations, the network can also perform multistep pattern recognition [43], as is the case when the output is a time series. To allow such encoding, the network's architecture must be modified.

Architecture Modification
Figure 17 shows that, instead of having two heteroassociative networks connected together as in the Okajima et al. model [44], the network is composed of a heteroassociative and an autoassociative layer. The heteroassociative layer is used to map a given pattern to the next one, whereas the autoassociative layer acts like a time-delay circuit that feeds the initial input back to the heteroassociative layer. With this delay, it becomes possible for the overall network to learn temporal pattern sequences. Figure 17 illustrates the difference between a temporal associative memory and a bidirectional associative memory. The initial value of the input pattern to the heteroassociative part is y(0), and its output, y(t), feeds the autoassociative part as x(0). The latter yields an output, x(t), that is fed back into the heteroassociative part as well as into the autoassociative part. Thus, the autoassociative part serves as a context unit for the heteroassociative part. These two networks work in synergy to allow the necessary flow of information for multistep pattern recall while, as our simulations will show, still being able to filter noise out of the inputs.
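The recall loop of this temporal architecture can be sketched with hand-set toy weights (the function name, the 2-unit patterns, and the choice W = -I, V = I are ours, chosen so that the heteroassociative layer maps each pattern to its successor and the autoassociative layer leaves clean patterns unchanged):

```python
import numpy as np

def transmit(a, delta=0.1):
    return (1 + delta) * a - delta * a**3

def recall_sequence(y0, W, V, steps, delta=0.1):
    """Alternate heteroassociative (W: current -> next pattern) and
    autoassociative (V: clean-up / context) steps, collecting the outputs."""
    seq, y = [], y0
    for _ in range(steps):
        x = transmit(W @ y, delta)   # heteroassociative step: next pattern
        y = transmit(V @ x, delta)   # autoassociative step: feedback/clean-up
        seq.append(np.sign(y))
    return seq

# Two bipolar patterns that are each other's successor: a period-2 limit cycle
W, V = -np.eye(2), np.eye(2)
seq = recall_sequence(np.array([1., -1.]), W, V, steps=4)
print(seq)  # alternates between [-1, 1] and [1, -1]
```

In the full model, W and V are of course learned with the rules of Section 3 rather than set by hand.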

Simplification of the Learning Function
In the case of autoassociative learning, the function expressed by (3.5) can be simplified by using the fact that the weight matrix is then square and symmetric. The symmetry property has the effect of canceling the cross terms in (3.5), and the autoassociative learning function becomes

V(k+1) = V(k) + η(x(0) - x(t))(x(0) + x(t))^T,

where V represents the connection weight matrix. The learning rule is then a sum of a positive and a negative Hebbian term [7, 38].
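The simplified autoassociative rule can be sketched on a toy set of orthogonal bipolar patterns (the names, patterns, and trial count are ours). After learning, each stored pattern should be a fixed-point of the layer:

```python
import numpy as np

def transmit(a, delta=0.1):
    return (1 + delta) * a - delta * a**3

def learn_auto(patterns, eta=0.01, delta=0.1, trials=1000, seed=0):
    """Simplified autoassociative rule (square, symmetric V), t = 1:
    V <- V + eta*(x0 - x1)(x0 + x1)^T, with x1 = f(V x0)."""
    rng = np.random.default_rng(seed)
    m = patterns[0].size
    V = np.zeros((m, m))
    for _ in range(trials):
        x0 = patterns[rng.integers(len(patterns))]
        x1 = transmit(V @ x0, delta)                 # one output iteration
        V += eta * np.outer(x0 - x1, x0 + x1)        # positive + negative Hebbian
    return V

pats = [np.array([1., -1., 1., -1.]), np.array([1., 1., -1., -1.])]
V = learn_auto(pats)
print(np.sign(transmit(V @ pats[0])))  # the stored pattern is recovered
```

The positive term grows the weights Hebbian-style while the negative (anti-Hebbian) term stops growth once the feedback x(t) matches the input x(0).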

Simulation 1: Limit Cycle Behavior
This simulation illustrates how the network can perform temporal association. The task consists of learning different planar rotations of the same object.

Methodology
Five different pattern sequences must be learned. Each sequence is composed of the same image under 45° planar rotations. Table 7 illustrates the five pattern sequences. The network has to associate each pattern of a sequence with the following one. In other words, the network will associate the 45° image with the 90° one, the 90° with the 135°, the 135° with the 180°, the 180° with the 225°, the 225° with the 270°, the 270° with the 315°, the 315° with the 0°, and the 0° with the 45° image. Therefore, the steady state of the network should be a limit cycle of period 8. Each binary stimulus was placed on a 50 × 50 grid, where white and black pixels were assigned -1 and 1 values, respectively. The free parameters were set to η = 0.0002 and δ = 0.1, in accordance with the specifications given in Chartier et al. [8].

Results
The network took about 2000 learning trials to converge and was able to correctly learn the 5 pattern sequences. It was also able to operate with noisy inputs and perform some generalization. For instance, Table 8 illustrates how the network was able to remove noise from the "fish" image as the temporal sequence was recalled. Table 9 shows that the network can generalize its learning to similar objects. In this case, a new "butterfly" picture was used as the initial input. As the stimulus iterated through the network, the network recalled the original "butterfly" depicted in Table 7 (the butterfly output at step 8 is different from the original one at step 0). Finally, Table 10 shows that the network can correctly output the image sequence under small planar rotation variations of the initial input image. In this particular example, the new "dinosaur" picture represents a 275° rotation (270° + 5°). Although that particular rotation was not part of the initial sequence set, the network was able to correctly recall the appropriate picture sequence.

Simulation 2: Fixed-Point Behavior
Again, this simulation illustrates how the network can perform temporal association. The task is to learn different planar rotations of the same object. Contrary to the previous simulation, once the network has been through every planar rotation it converges to a fixed point.

Methodology
Five different pattern sequences must be learned. Each sequence is composed of the same image under successive 45° planar rotations. Table 11 illustrates the five pattern sequences. The network must associate each pattern of a sequence with the following one. In other words, the network will associate the 45° image with the 90°, the 90° with the 135°, the 135° with the 180°, the 180° with the 225°, the 225° with the 270°, the 270° with the 315°, the 315° with the 0°, and the 0° with the 0° image. This last association creates a fixed-point attractor that can then be used for other types of association in more complex architectures, as will be shown in the next section. Each binary stimulus was placed on a 50 × 50 grid, where white and black pixels were assigned −1 and 1 values, respectively. The free parameters were set to η = 0.0002 and δ = 0.1, in accordance with the specifications given in [8].
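The only change from the limit-cycle task is the terminal association. A sketch of the pair construction, with plain integers standing in for the eight rotated images, ordered 45° through 315° and then 0°:

```python
def fixed_point_pairs(frames):
    """Pair each frame with its successor, but associate the final
    (0-degree) frame with itself, turning the end of the sequence
    into a fixed-point attractor instead of a limit cycle."""
    pairs = [(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    pairs.append((frames[-1], frames[-1]))  # 0-degree self-association
    return pairs

pairs = fixed_point_pairs(list(range(8)))  # stand-ins for the 8 images
```

The single self-association at the end is what lets a later architecture attach further associations to the converged state.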

Results
The network took about 2000 learning trials before convergence occurred, and it was able to correctly learn the 5 pattern sequences. The same examples of noise tolerance and generalization were used to compare the network performance between the limit-cycle and the fixed-point conditions. Table 12 shows that the network can eliminate noise as the sequence is recalled. Table 13 shows learning generalization to other similar objects. After 9 time steps, the network recalled the "butterfly" depicted in Table 11. Finally, Table 14 shows that the network can also output the correct image sequence under small planar rotation variations of the initial input. In this particular example the "dinosaur" picture represents a 275° rotation, as in the previous section.
Table 14: Network recall using a different planar rotation: 275°.

Multidirectional Associative Memories
Sometimes association must be generalized to more than two directions. To deal with multiple associations, Hagiwara [45] proposed a Multidirectional Associative Memory (MAM). The architecture was later extended to deal with temporal sequences [46]. This section shows some properties of a MAM that is composed of one temporal heteroassociative memory and one bidirectional associative memory. As in the previous simulations, only the network topology is modified, as Figure 18 shows; the learning and transmission functions remain the same. In this simulation the information received by the y-layer at time t consists of x(t) and z(t). The feedback sent by z(t) is useful only once the network reaches steady state, that is, on the last pattern of the stimulus set. Therefore, to minimize its effect during temporal recall, the feedback value of z(t) is weighted lower than x(t). This is formally expressed by

y(t) = f(W1 x(t) + αW2 z(t)),    (6.1)

where 0 < α < 1, and W1 and W2 represent the weight connections from x(t) and z(t), respectively. Taken together, W1 and W2 form the weight matrix W.
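A minimal sketch of the y-layer update in Equation (6.1); the saturating cubic used for f below is only a placeholder for the model's actual transmission function (3.2), and all names and dimensions are illustrative:

```python
import numpy as np

def transmission(a, delta=0.1):
    # Placeholder saturating output function; the paper's own f
    # (Equation (3.2)) should be substituted here.
    return np.clip((delta + 1) * a - delta * a**3, -1.0, 1.0)

def y_update(x_t, z_t, W1, W2, alpha=0.4, delta=0.1):
    """Equation (6.1): the y-layer combines the temporal input x(t)
    with attenuated feedback from the z-layer (0 < alpha < 1)."""
    return transmission(W1 @ x_t + alpha * (W2 @ z_t), delta)

# Toy dimensions: 4-unit x-layer, 2-unit z-layer, 3-unit y-layer.
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 2))
y = y_update(rng.choice([-1, 1], size=4), rng.choice([-1, 1], size=2), W1, W2)
```

Because alpha scales only the z-layer term, the temporal pathway dominates until the sequence settles.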

Methodology
The simulation performs the multistep pattern recognition described in the previous section in combination with a standard one-to-one association. The one-to-one association is performed only when the network is at a given fixed point; in other words, the association occurs only when the output is at the 0° planar rotation of the recalled sequence (Table 11). The association is made between the given picture and the first letter of its name, as illustrated in Figure 19. The learning and transmission function parameters were set to η = 0.0001 and δ = 0.1. The z-layer feedback parameter was set to α = 0.4.

Results
Of course, if the temporal or bidirectional associative sections are activated independently, the resulting network performance is the same as in the previous sections. What matters in this case is the output state of the network, y(t). The best case occurs when the pattern presented to the network is close to its fixed-point behavior (315°); the effect of the feedback coming from the z-layer is then limited. Conversely, the worst case occurs when a pattern is presented at a 45° planar rotation: 8 time steps are then needed before the output vector reaches its fixed-point behavior. From that point on, the picture identification process can occur.
If the feedback coming from the z-layer is too strong, it could deviate the output vector's trajectory. This could make the network converge to the wrong fixed point and lead to an incorrect name tag. Because of this, the feedback value of the z-layer must be kept low. Table 15 shows an example of the "runner" given an initial 45° planar rotation. After one time step the network correctly outputs the 90° picture (Out 1). Of course, the name tag is incorrect (Out 2). The new input then corresponds to the output of the x-layer (Out 1) combined with the feedback given by the z-layer (Out 2, not shown); the overall effect, Out 1 + αOut 2, is used as the new input. The process is repeated until convergence. After 9 time steps, Table 15 shows that the correct 0° "runner" picture is output, as well as the correct name-tag letter "R".
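The recall loop just described can be sketched generically (hypothetical helper names: `step_x` stands for one temporal step including the α-weighted z feedback, and `step_z` for the bidirectional name-tag readout):

```python
import numpy as np

def recall(x0, step_x, step_z, max_steps=20, tol=1e-6):
    """Iterate the network until the x-side output stops changing
    (a fixed point is reached), then read the tag from the z-layer."""
    x = x0
    for _ in range(max_steps):
        x_next = step_x(x)
        if np.max(np.abs(x_next - x)) < tol:  # converged
            break
        x = x_next
    return x, step_z(x)

# Toy dynamics that collapse any input to the zero fixed point.
picture, tag = recall(np.ones(4), lambda x: np.zeros_like(x), lambda x: "R")
```

In the worst case described above (a 45° initial rotation), the loop runs through the whole sequence before the tag readout becomes reliable.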

Conclusion
Several interesting properties of a BAM architecture have been shown. First, it was shown that a simple time-delay Hebbian learning rule can perform one-to-one and many-to-one associations with both binary and real-valued patterns. In addition, the BAM can be modified in such a way that both heteroassociation and autoassociation can be accomplished within the same architecture. In this case, the autoassociative part acts like a time delay, and the overall network can be used for temporal association. A simple case of a one-step delay was shown; however, by adding l extra layers of autoassociation, the network can easily be modified to handle l-step delays. Finally, if temporal and bidirectional associations are combined into a multidirectional associative memory, more complex behaviors can be obtained.
In all cases, the same learning and transmission functions are used; the only modification concerns the network topology. This property gives the network high internal consistency. More complex architectures are possible, but a really powerful implementation improvement would be an algorithm that guides the architecture's growth based on the problem to be solved. The architecture could then be modified as a function of the task to be performed and the desired behaviors, using BAMs as building blocks.

Figure 1 :
Figure 1: Transmission function when r = 1, showing two stable and one unstable fixed points.

Figure 4 :
Figure 4: Phase portrait of a two-dimensional system with r = [1, 1]ᵀ. The stable fixed points are represented by filled circles and the unstable ones by empty circles.

Figure 5 :
Figure 5: Energy landscape of the same two-dimensional system as in Figure 4.

Figure 6 :
Figure 6: Transmission function for a value of δ = 0.4.

Figure 7 :
Figure 7: Bifurcation and Lyapunov exponent diagrams as a function of δ.

Figure 8 :
Figure 8: Bifurcation and Lyapunov exponent diagrams as a function of Δ.

Figure 9 :
Figure 9: Pattern set used for numerical evaluations.

Figure 10 :
Figure 10: Performance comparison between the Euler and FORK methods for different output parameters (0.05 and 0.025). Upper left: low memory load under normally distributed noise. Upper right: high memory load under normally distributed noise. Lower left: low memory load under pixel-flip noise. Lower right: high memory load under pixel-flip noise.

Figure 11 :
Figure 11: Architecture illustration of the bidirectional associative memory.
Learning was carried out according to the following procedure: (0) initialization of the weights to zero; (1) random selection of a pattern pair; (2) computation of x(t) and y(t) according to the transmission function (3.2); (3) update of the weight matrices W and V according to (3.4) and (3.5); (4) repetition of steps 1 to 3 until the weight matrices converge.

Using (3.2), we compute y(1) and x(1) to get

x(1) = [0.011, 0.011, 0.011, −0.011]ᵀ,    y(1) = [0.022, −0.022, 0.022]ᵀ.

Using (3.4) and (3.5), the weight matrices are then updated; note that the resulting weight matrices are not the transpose of each other, unlike in Kosko's model [3]. If the process is iterated, for example over 100 learning trials, the weight matrices converge.

Figure 12 :
Figure 12: Pattern pairs used to train the network.

The learning procedure was as follows: (1) initialization of the weights to zero; (2) random selection of a pair following a uniform distribution; (3) iteration of the stimuli through the network according to Equations (3.1) and (3.2) (one cycle); (4) weight update according to Equations (3.4) and (3.5); (5) repetition of steps 2 to 4 until the desired number of learning trials is reached (k = 15 000).
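The five steps above can be sketched as a training loop; the linear pass and the time-difference Hebbian update below are only stand-ins for the paper's Equations (3.1)-(3.2) and (3.4)-(3.5), and the parameter values are illustrative:

```python
import numpy as np

def train(pairs, eta=0.01, trials=500, rng=None):
    """(1) zero weights; (2) pick a random pair; (3) one cycle through
    the network; (4) update W and V; (5) repeat for the given trials."""
    rng = rng or np.random.default_rng(0)
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W, V = np.zeros((m, n)), np.zeros((n, m))     # step (1)
    for _ in range(trials):                        # step (5)
        x0, y0 = pairs[rng.integers(len(pairs))]   # step (2)
        y_t = np.clip(W @ x0, -1.0, 1.0)           # step (3), placeholder f
        x_t = np.clip(V @ y0, -1.0, 1.0)
        W += eta * np.outer(y0 - y_t, x0 + x_t)    # step (4), placeholder rule
        V += eta * np.outer(x0 - x_t, y0 + y_t)
    return W, V

x0 = np.array([1.0, -1.0, 1.0, -1.0]); y0 = np.array([1.0, 1.0, -1.0])
W, V = train([(x0, y0)])
```

Note that W and V are trained separately, so they need not be transposes of each other, consistent with the worked example above.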

Figure 14 :
Figure 14: Comparison between the transmission function with (disks) or without (squares) hard limits, as a function of various values of δ.

Figure 15 :
Figure 15: Pattern pairs used to train the network.

Figure 16 :
Figure 16: Pictures reconstructed using the emotion tags.

Figure 17 :
Figure 17: Comparison between the temporal associative memory and the standard BAM.

Figure 18 :
Figure 18: Architecture of the multidirectional associative memory.The network is composed of a temporal heteroassociative memory and a bidirectional associative memory.

Figure 19 :
Figure 19: Pattern pairs used to train the network.

Table 1 :
Network recall for a noisy input.

Table 2 :
Network recall for an incomplete pattern.

Table 3 :
Network recall for an incorrect name.

Table 6 :
Network recall when important features (the eyes and mouth) are removed.

Table 7 :
Five binary sequences composed of 8 time steps.

Table 9 :
Network recall using a different picture: another example of butterfly.

Table 11 :
Five binary sequences composed of 8 time steps.

Table 13 :
Network recall using a different picture: another example of butterfly.