A Self-Organizing Incremental Spatiotemporal Associative Memory Networks Model for Problems with Hidden State

Identifying hidden states is important for solving problems with hidden state. We prove that any deterministic partially observable Markov decision process (POMDP) can be represented by a minimal, looping hidden state transition model, and we propose a heuristic algorithm for constructing such a model. A new spatiotemporal associative memory network (STAMN) is proposed to realize the minimal, looping hidden state transition model. STAMN utilizes neuroactivity decay to realize short-term memory, connection weights between different nodes to represent long-term memory, and presynaptic potentials together with a synchronized activation mechanism to complete identifying and recalling simultaneously. Finally, we give empirical illustrations of the STAMN and compare its performance with that of other methods.


Introduction
The real environment in which an agent operates is generally unknown and contains partially observable hidden states, as the large partially observable Markov decision process (POMDP) and hidden Markov model (HMM) literatures attest. The first problem in solving a POMDP is identifying the hidden states. Many papers have proposed using a k-step short-term memory to identify hidden states. The k-step memory is generally implemented through tree-based models, finite state automata, and recurrent neural networks.
The most classic tree-based model is the U-tree model [1]. This model is a variable-length suffix tree model; however, it can only obtain task-related experience rather than general knowledge of the environment. A feature reinforcement learning (FRL) framework [2, 3] has been proposed, which considers maps from the past observation-reward-action history to an MDP state. Nguyen et al. [4] introduced a practical context tree search algorithm for realizing the MDP. Veness et al. [5] introduced a new Monte-Carlo tree search algorithm integrated with the context tree weighting algorithm to realize general reinforcement learning. Because the depth of the suffix tree is restricted, these tree-based methods cannot efficiently handle long-term dependent tasks. Holmes and Isbell Jr. [6] first proposed looping prediction suffix trees (LPST) for deterministic POMDP environments, which can map long-term dependent histories onto a finite LPST. Daswani et al. [7] extended the feature reinforcement learning framework to the space of looping suffix trees, which is efficient in representing long-term dependencies and performs well in stochastic environments. Daswani et al. [8] introduced a Q-learning algorithm for history-based reinforcement learning that uses a value-based cost function. In similar work, Timmer and Riedmiller [9] presented the identify-and-exploit algorithm, a model-free reinforcement learning algorithm that solves POMDPs with history lists. Talvitie [10] proposed temporally abstract decision trees for learning partially observable models. These k-step memory representations based on multidimensional trees require additional computation, resulting in poor time performance and larger storage requirements. Moreover, these models tolerate fault and noise poorly because each item must be matched exactly.
More closely related to our work, finite state automata (FSA) have been proved to approximate the optimal policy on belief states arbitrarily well. McCallum [1] and Mahmud [11] both introduced incremental search algorithms for learning probabilistic deterministic finite automata, but these methods learn extremely slowly and carry other restrictions. Other scholars use recurrent neural networks (RNN) to acquire memory capability. A well-known RNN architecture is Long Short-Term Memory (LSTM), proposed by Hochreiter and Schmidhuber [12]. Deep reinforcement learning (DRL) [13] was first proposed by Mnih et al., using deep neural networks to capture and infer hidden states, but this method still applies only to MDPs. Recently, deep recurrent Q-learning was proposed [14], where a recurrent LSTM model is used to capture long-term dependencies in the history. Similar methods have been proposed to learn hidden states for solving POMDPs [15, 16]. A hybrid recurrent reinforcement learning approach combining supervised learning with RL was introduced for customer relationship management [17]. These methods can capture and identify hidden states automatically. However, because these networks use shared weights and a fixed structure, it is difficult for them to achieve incremental learning. They are suited to spatiotemporal pattern recognition (STPR), that is, the extraction of spatiotemporal invariances from the input stream. For learning and recalling temporal sequences in a more accurate fashion, as in trajectory planning, decision making, robot navigation, and singing, special neural network models for temporal sequence learning may be more suitable.
Biologically inspired associative memory networks (AMN) have shown some success in such temporal sequence learning and recalling. These networks are not limited to a specific structure and realize incremental sequence learning in an unsupervised fashion. Wang and Yuwono [18] established a model that recognizes and learns complex sequences and is capable of incremental learning, but it requires different identifiers to be provided artificially for each sequence. Sudo et al. [19] proposed the self-organizing incremental associative memory (SOIAM) to realize incremental learning. Keysermann and Vargas [20] proposed a novel incremental associative learning architecture for multidimensional real-valued data. However, these methods cannot address temporal sequences. Using a time-delayed Hebb learning mechanism, a self-organizing neural network for learning and recall of complex temporal sequences with repeated and shared items was presented in [21, 22] and successfully applied to robot trajectory planning. Tangruamsub et al. [23] presented a new self-organizing incremental associative memory for robot navigation, but this method only deals with simple temporal sequences. Nguyen et al. [24] proposed a long-term memory architecture characterized by three features: hierarchical structure, anticipation, and one-shot learning. Shen et al. [25, 26] provided a general self-organizing incremental associative memory network. This model not only learns binary and nonbinary information but also realizes one-to-one and many-to-many associations. Khouzam [27] presented a taxonomy of temporal sequence processing methods. Although these models realize heteroassociative memory for complex temporal sequences, the memory length is still decided by the designer and cannot vary adaptively.
Moreover, these models are unable to handle complex sequences with looped hidden states.
The rest of this paper is organized as follows. In Section 2, we introduce the problem setup, present the theoretical analysis for a minimal, looping hidden state transition model, and derive a heuristic constructing algorithm for this model. In Section 3, the STAMN model is analyzed in detail, including its short-term memory (STM), long-term memory (LTM), and heuristic constructing process. In Section 4, we present detailed simulations and analysis of the STAMN model and compare its performance with that of other methods. Finally, a brief discussion and conclusion are given in Sections 5 and 6, respectively.

Problem Setup
A deterministic POMDP environment can be represented by a tuple E = ⟨S, A, O, T, Ω⟩, where S is the finite set of hidden world states, A is the set of actions that can be taken by the agent, O is the set of possible observations, T is a deterministic transition function T(s, a) → s′, and Ω is a deterministic observation function Ω(s, a) → o. In this paper, we only consider the special observation function Ω(s) → o that depends solely on the state s. A history sequence h is defined as a sequence of past observations and actions {o_1, a_1, o_2, a_2, ..., a_{t-1}, o_t}, which is generated by the deterministic transition function T and the deterministic observation function Ω. The length |h| of a history sequence is defined as the number of observations in it.
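As a concrete sketch, a deterministic POMDP of this form can be written as two lookup tables plus a rollout routine. The two-state environment and the names s1, s2, a1, o1 below are illustrative assumptions, not the paper's Figure 1; note that both states emit the same observation, which is exactly what makes the state hidden.

```python
# A minimal sketch of a deterministic POMDP (assumed toy environment).
T = {("s1", "a1"): "s2", ("s2", "a1"): "s1"}   # transition function T(s, a) -> s'
Omega = {"s1": "o1", "s2": "o1"}               # observation function Omega(s) -> o
# both states emit o1, so a single observation cannot identify the state

def generate_history(s0, actions):
    """Roll out a history sequence {o_1, a_1, o_2, ..., a_{n-1}, o_n}."""
    h, s = [Omega[s0]], s0
    for a in actions:
        s = T[(s, a)]
        h += [a, Omega[s]]
    return h

print(generate_history("s1", ["a1", "a1"]))
```

The history length |h| of the rollout above is 3, counting only the observations, matching the definition in the text.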
The environment that we discuss is deterministic, and the state space is finite. We also assume the environment is strongly connected. Although the environment is deterministic, it can be highly complicated and nondeterministic at the level of observations. The hidden state can be fully identified by a finite history sequence in a deterministic POMDP, as proved in [2]. We define the following notation: trans(h, a) = {o | o is a possible observation following h after taking action a}.
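The trans(h, a) notation can be sketched as a scan over recorded episodes; the helper name and the toy episode below are assumptions for illustration.

```python
# A sketch of trans(h, a): collect the observations that followed
# history h after taking action a, over a log of recorded episodes.
def trans(episodes, h, a):
    outcomes = set()
    n = len(h)
    for ep in episodes:
        for i in range(0, len(ep) - n - 1):
            # match the history suffix, then the action that followed it
            if ep[i:i + n] == h and ep[i + n] == a:
                outcomes.add(ep[i + n + 1])
    return outcomes

episodes = [["o1", "a1", "o2", "a1", "o1", "a1", "o2"]]
print(trans(episodes, ["o1"], "a1"))
```

In a deterministic environment a sufficiently long h yields a singleton set; several possible outcomes signal that h is too short to identify the hidden state.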
Our goal is to construct a minimal, looping hidden state transition model from sufficient history sequences. First, we present a theoretical analysis showing that any deterministic POMDP environment can be represented by a minimal, looping hidden state transition model. Then we present a heuristic constructing algorithm for this model. The corresponding definitions and lemmas are as follows.
Definition 1 (identifying history sequence). An identifying history sequence h_s is a history sequence that uniquely identifies the hidden state s. In the rest of this paper, a hidden state is regarded as equivalent to its identifying history sequence h_s, so s and h_s can replace each other.
A simple example of a deterministic POMDP is illustrated in Figure 1. From it we conclude easily that an identifying history sequence h_{s_2} for s_2 is expressed by a single observation {o_2}, while an identifying history sequence h_{s_1} for s_1 requires a longer history. Note that there may exist infinitely many identifying history sequences h_s for s, because the environment is strongly connected, and there may exist unboundedly long identifying history sequences h_s for s because of uninformative looping. This leads us to determine the minimal identifying history sequence length k.
Definition 2 (minimal history sequence length k). The minimal identifying history sequence h_s for the hidden state s is an identifying history sequence h_s such that no suffix of h_s also identifies s. However, the minimal identifying history sequence may have unbounded length, because an identifying history sequence can include a loop arbitrarily many times, which merely lengthens the history sequence. For example, two identifying history sequences h_{s_1} and h′_{s_1} for s_1 can be expressed as {o_2, a_1, o_1, a_1, o_1} and {o_2, a_1, o_1, a_1, o_1, a_1, o_1, a_1, o_1}, respectively; in this situation the repeated subsequence {a_1, o_1} is treated as the looped item. So the minimal identifying history sequence length k for the hidden state s is the minimal length over all identifying history sequences after excising the looping portions.
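The "excising the looping portion" step of Definition 2 can be sketched as repeatedly deleting an immediately repeated (action, observation) block; the function name and the pair-aligned scan are illustrative assumptions.

```python
def excise_loops(h):
    """Remove immediately repeated (action, observation) blocks from a
    history sequence [o, a, o, a, o, ...] - a sketch of 'excising the
    looping portion' in Definition 2."""
    changed = True
    while changed:
        changed = False
        # block sizes are even: whole (action, observation) pairs
        for size in range(2, len(h) // 2 + 1, 2):
            # start at 1 so blocks are aligned on (action, observation) pairs
            for i in range(1, len(h) - 2 * size + 1):
                if h[i:i + size] == h[i + size:i + 2 * size]:
                    h = h[:i] + h[i + size:]   # drop one copy of the loop
                    changed = True
                    break
            if changed:
                break
    return h

print(excise_loops(["o2", "a1", "o1", "a1", "o1", "a1", "o1", "a1", "o1"]))
```

Applied to the long identifying history sequence from the example above, this collapses the repeated {a_1, o_1} blocks back to the short form {o_2, a_1, o_1}.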
Definition 3 (a criterion for identifying history sequences). Given a sufficient history sequence, consider the set of history sequences for the hidden state s: if a history sequence h for the hidden state s satisfies that trans(h, a) is a single, consistent observation for each action a, then we consider h an identifying history sequence. Definition 3 is correct iff a fully sufficient history sequence is given.
Lemma 4 (backward identifying history sequence). Assume an identifying history sequence h_s identifies a single hidden state s. If h′ is a backward extension of h_s obtained by appending an action a ∈ A and an observation o ∈ O, then h′ is a new identifying history sequence for the single state s′ = T(s, a). s′ is called the prediction state of s.
Lemma 5 (forward identifying history sequence). Assume an identifying history sequence h_s identifies a single hidden state s. If h′ is a prefix of h_s obtained by removing the last action a ∈ A and the last observation o ∈ O, then h′ is an identifying history sequence for the single state s′ such that s = T(s′, a). s′ is called the previous state of s.
Proof. Assume h_s is a history sequence {o_1, a_1, ..., a_{t-1}, o_t} generated by the transition function T and observation function Ω, and h′ is the history sequence {o_1, a_1, ..., a_{t-2}, o_{t-1}} obtained as a prefix of h_s by removing the action a_{t-1} and the observation o_t. Since h_s is an identifying history sequence for the state s, it must be that Ω(s) = o_t. Because the environment is a deterministic POMDP, the current state s is uniquely determined by the previous state s_{t-1} such that T(s_{t-1}, a_{t-1}) = s. Then h′ is the identifying history sequence for the state s′ = s_{t-1} such that T(s′, a_{t-1}) = s.
Theorem 6. Given sufficient history sequences, a finite, strongly connected, deterministic POMDP environment can be represented soundly by a minimal, looping hidden state transition model.
Proof. Assume E = ⟨S, A, O, T, Ω⟩, where S is the finite set of hidden world states and |S| is the number of hidden states. First, every hidden state has at least one finite-length identifying history sequence: at least one state s has a finite identifying history sequence h_s, and because the environment is strongly connected, for any other state s′ there must exist a transition history h_{ss′} from s to s′; according to Lemma 4, backward extending h_s by appending h_{ss′} yields an identifying history sequence for s′. Since the hidden state transition model can identify looped hidden states, the maximal transition history sequence length from s to s′ is bounded by |S|. Thus, this model has a minimal hidden state space of size |S|.
Since the hidden state transition model correctly realizes the hidden state transition s′ = T(s, a), it can correctly express all identifying history sequences for all hidden states, possesses perfect transition prediction capacity, and has a minimal hidden state space of size |S|. This model is a k-step variable history length model, and the memory depth k is a variable value for different hidden states. This model is realized by the spatiotemporal associative memory network in Section 3.
To construct a correct hidden state transition model, we first need to collect sufficient history sequences and perform statistical tests to determine the minimal identifying history sequence length k. However, in practice it can be difficult to obtain a sufficient history. So we propose a heuristic state transition model constructing algorithm. Without sufficient history sequences, the algorithm may produce a premature state transition model, but this model is at least correct for the past experience. We develop the heuristic constructing algorithm by way of two definitions and two lemmas as follows.
Definition 7 (current history sequence). A current history sequence h_t is the k-step history sequence {o_{t-k+1}, a_{t-k+1}, o_{t-k+2}, ..., a_{t-1}, o_t} generated by the transition function T and the observation function Ω. At t = 0, the current history sequence h_0 is empty. o_t is the observation vector at the current time t, and o_{t-k+1} is the observation vector k−1 steps before time t.
Definition 8 (transition instance). The history sequence associated with time t is captured as a transition instance. A transition instance is represented by the tuple (h_t, a_t, h_{t+1}), where h_t and h_{t+1} are the current history sequences occurring at times t and t+1 in an episode. A set of transition instances is denoted by the symbol F, which may contain transitions from different episodes.

Lemma 10. For any two identifying history sequences h_{s_i} and h_{s_j} identifying the hidden states s_i and s_j, respectively, h_{s_i} ≠ h_{s_j} if s_i ≠ s_j.

Proof by Contradiction. Since an identifying history sequence uniquely identifies a hidden state, h_{s_i} uniquely identifies the hidden state s_i and h_{s_j} uniquely identifies the hidden state s_j. Assume h_{s_i} = h_{s_j}; then there must exist s_i = s_j, which contradicts s_i ≠ s_j. Thus, the original proposition is true.
However, the converse of Lemma 10 does not always hold, because several different identifying history sequences may exist for the same hidden state.
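Definition 8 maps naturally onto a small record type; the field names and the toy rollout below are illustrative assumptions.

```python
from collections import namedtuple

# A sketch of Definition 8: a transition instance links the current
# history sequence at time t, the action taken, and the history at t+1.
Transition = namedtuple("Transition", ["h_t", "a_t", "h_next"])

F = []  # the set F of transition instances, possibly across episodes
h = ["o1"]
for a, o in [("a1", "o2"), ("a1", "o1")]:
    h_next = h + [a, o]
    F.append(Transition(tuple(h), a, tuple(h_next)))
    h = h_next

print(len(F), F[0].h_next)
```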
Algorithm 11 (a heuristic constructing algorithm for the state transition model). The initial state transition model is constructed using the minimal identifying history sequence length k = 1, and the model is empty initially.
The transition instance chain is empty, and t = 0: (1) We assume the given model can perfectly identify the hidden state. The agent makes a step in the environment (according to the history sequence definition, the first item in h is the observation vector). It records the current transition instance at the end of the chain of transition instances. At each time t, for the current history sequence h_t, execute Algorithm 12; if the current state is identified as a looped hidden state s, go to step (2); otherwise a new node is created in the model, and the algorithm goes to step (4).
(3) According to Lemma 10, for any two identifying history sequences h_{s_i} and h_{s_j} for s_i and s_j, respectively, h_{s_i} ≠ h_{s_j} if s_i ≠ s_j. So the identifying history sequence length k for h_{s_i} and h_{s_j} must increase to k+1 until h_{s_i} and h_{s_j} are discriminated. We must reconstruct the model based on the new minimal identifying history sequence length k, and go to step (1).
(4) If the action node and the corresponding exploration function for the current identifying state node exist in the model, the agent chooses its next action node based on the exhaustive exploration function. If they do not exist, the agent chooses a random action instead, and a new action node is created in the model. After the action control signal is delivered, the agent obtains the new observation vector via trans(s, a); go to step (1).
(5) Steps (1)-(4) continue until all identifying history sequences h_s for the same hidden state s correctly predict trans(h_s, a) for each action a.
Note that we apply a k-step variable history length starting from k = 1, and the history length is a variable value for different hidden states. We adopt a minimalist hypothesis model in the state transition model constructing process; this constructing algorithm is a heuristic. If we instead adopted an exhaustive hypothesis model, the probability of missing a looped hidden state would increase exponentially and many valid loops could be rejected, yielding larger redundant state spaces and poor generalization.
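The grow-k loop of Algorithm 11 can be sketched offline over recorded episodes: keep k minimal and increase it only when two histories mapped to the same k-step state disagree on a predicted observation. The suffix-based state abstraction below is a simplification of the paper's node-based model, and all names are illustrative.

```python
# A simplified sketch of the heuristic grow-k construction.
def build_model(episodes, k_max=4):
    for k in range(1, k_max + 1):
        # model state = the k-step suffix of the history (<= k observations)
        pred = {}          # (suffix, action) -> predicted next observation
        consistent = True
        for ep in episodes:
            for i in range(0, len(ep) - 2, 2):      # ep alternates o, a, o, ...
                suffix = tuple(ep[max(0, i - 2 * k + 2):i + 1])
                a, o_next = ep[i + 1], ep[i + 2]
                if (suffix, a) in pred and pred[(suffix, a)] != o_next:
                    consistent = False              # suffix too short: grow k
                    break
                pred[(suffix, a)] = o_next
            if not consistent:
                break
        if consistent:
            return k, pred
    return k_max, pred

# two episodes that alias under k = 1 (same observation oX, same action,
# different outcomes) but separate under k = 2
episodes = [["oA", "a", "oX", "a", "oL"],
            ["oB", "a", "oX", "a", "oR"]]
k, pred = build_model(episodes)
print(k)
```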
Algorithm 12 (the current history sequence identifying algorithm). In order to construct the minimal looping hidden state transition model, we need to detect looped states by identifying the hidden state; for this, the current k-step history sequence is needed. According to Definition 7, the current k-step history sequence at time t is expressed by {o_{t-k+1}, a_{t-k+1}, o_{t-k+2}, ..., a_{t-1}, o_t}. There are three identifying processes.
(2) Following Identifying. If the current state is identified as the hidden state s, the next transition history h_{t+1} is represented by the transition instance (h_t, a_t, h_{t+1}) through trans(s, a). According to Lemma 4, the next transition history h_{t+1} is identified as the transition prediction state s′ = T(s, a).
In the STAMN, the corresponding following activation value of the state node is computed.
(3) Previous Identifying. If there exists a transition instance (h_t, a_t, h_{t+1}) such that h_{t+1} is an identifying history sequence for state s, then, according to Lemma 5, the previous state s′ is uniquely identified by h_t such that s = T(s′, a_t). So if there exists a transition prediction state s with s = T(s′, a_t), then the previous state s′ is uniquely identified.
In the STAMN, the corresponding previous activation value of the state node is computed.
Algorithm 13 (transition prediction criterion). If the model correctly represents the environment as a Markov chain, then for every state s and every identifying history sequence h for that state, trans(h, a) is the same observation for the same action a.
In the STAMN, the transition prediction criterion is realized by the transition prediction value P_{s_j}(t).
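Algorithm 13's criterion reduces to a consistency check over a table of observed transitions; the function and variable names below are illustrative assumptions.

```python
# A sketch of the transition prediction criterion: all identifying
# history sequences assigned to the same hidden state must predict the
# same observation for the same action.
def criterion_holds(assignment, trans_table):
    """assignment: history -> state; trans_table: (history, action) -> obs."""
    seen = {}
    for (h, a), o in trans_table.items():
        key = (assignment[h], a)
        if key in seen and seen[key] != o:
            return False        # two histories for one state disagree
        seen[key] = o
    return True

trans_table = {(("o1",), "a1"): "o2",
               (("o2", "a1", "o1"), "a1"): "o2"}
assignment = {("o1",): "s1", ("o2", "a1", "o1"): "s1"}
print(criterion_holds(assignment, trans_table))
```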

Spatiotemporal Associative Memory Networks
In this section, we relate the spatiotemporal sequence problem to the idea of identifying the hidden state by a sequence of past observations and actions. An associative memory network (AMN) is expected to have several characteristics:

(1) An AMN must memorize incrementally, that is, learn new knowledge without forgetting the learned knowledge.

(2) An AMN must be able to record not only the temporal order of sequence items but also their duration in continuous time.

(3) An AMN must be able to realize heteroassociative recall.

(4) An AMN must be able to process real-valued feature vectors in a bottom-up manner, not just symbolic items.

(5) An AMN must be robust and able to recall a sequence correctly from incomplete or noisy input.

(6) An AMN should realize learning and recalling simultaneously.

(7) An AMN should realize interaction between STM and LTM; dual-trace theory suggests that the persistent neural activity of STM can lead to LTM.

The first thing to be defined is the temporal dimension (discrete or continuous). Previous research is mostly based on regular intervals Δt, and few AMNs have been proposed to deal with not only the sequential order but also item duration in continuous time. However, this characteristic is important for many problems. In speech production, writing, music generation, motor planning, and so on, item duration and item repetition have essentially different meanings. For example, "A4" and "A-A-A-A" are entirely different: the former represents that item A is sustained for 4 timesteps, while the latter represents that item A emerges repeatedly 4 times. STAMN can explicitly distinguish these two temporal characteristics. So a history of past observations and actions can be expressed as a special spatiotemporal sequence h = {o_{t-k+1}, a_{t-k+1}, o_{t-k+2}, ..., a_{t-1}, o_t}, where k is the length of h; the items of h include observation items and action items, where o_t denotes the real-valued observation vector generated by taking action a_{t-1}.
a_t denotes the action taken by the agent at time t. Here t does not index a discrete temporal dimension sampled at regular intervals Δt; rather, it represents the t-th step in the continuous-time dimension, whose duration is the time between the current item and the next one.
A spatiotemporal sequence can be classified as a simple sequence or a complex sequence. A simple sequence is a sequence without repeated items; for example, the sequence "A-B-C-D" is a simple sequence, whereas sequences containing repeated items are defined as complex sequences. In a complex sequence, a repeated item can be classified as a looped item or a discriminative item by identifying the hidden state; for example, in the history sequence "A-B-C-B-A", "A" and "B" may be looped items or discriminative items. Identifying the hidden state requires introducing contextual information, resolved by the k-step memory. The memory depth k is not fixed and varies in different parts of the state space.
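The simple/complex distinction, and the search for repeated items that must then be disambiguated by context, can be sketched in a few lines; the letter sequences reuse the illustrative items above.

```python
def is_simple(seq):
    """A simple sequence contains no repeated items."""
    return len(set(seq)) == len(seq)

def repeated_items(seq):
    """Items occurring more than once: candidates for looped or
    discriminative items, disambiguated only via k-step context."""
    return {x for x in seq if seq.count(x) > 1}

print(is_simple(["A", "B", "C", "D"]))
print(repeated_items(["A", "B", "C", "B", "A"]))
```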

Spatiotemporal Associative Memory Networks Architecture.
We build a new spatiotemporal associative memory network (STAMN). This model makes use of neuron activity decay of nodes to achieve short-term memory, connection weights between different nodes to represent long-term memory, presynaptic potentials and a neuron synchronized activation mechanism to realize identifying and recalling, and a time-delayed Hebb learning mechanism to fulfil one-shot learning. STAMN is an incremental, possibly looping, non-fully connected, asymmetric associative memory network. Its nonhomogeneous nodes correspond to hidden state nodes and action nodes.
For a state node s_j, the input values are defined as follows: (1) the current observation activation value, responsible for the matching degree between state node s_j and the current observed value, which is obtained from the preprocessing neural networks (if the matching degree is greater than a threshold value, this activation is set to 1); (2) the observation activation values of the presynaptic node set of the current state node s_j; (3) the identifying activation value of the previous state node of state node s_j; (4) the activation value of the previous action node of state node s_j. The output values are defined as follows: (1) the identifying activation value, which represents whether the current state is identified as the hidden state s_j or not; (2) the transition prediction value P_{s_j}(t), which represents whether the state node s_j is the current state transition prediction node or not.
For an action node a_i, the input value is the current action activation value, responsible for the matching degree between action node a_i and the current motor vector. The output value is the activation value of the action node, which indicates that the action node has been selected by the agent to control the robot's current action.
In the STAMN, not all nodes and connection weights exist initially. The weights connected to state nodes can be learned incrementally by time-delayed Hebb learning rules, representing the LTM. The weights connected to action nodes can be learned by reinforcement learning. All nodes have an activity self-decay mechanism that records the duration time of the node, representing the STM. The output of the STAMN is the winner state node or winner action node, selected by winner-takes-all. The STAMN architecture is shown in Figure 2, where black nodes represent action nodes and concentric-circle nodes represent state nodes.
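The node types and weight stores described above map onto a small class sketch; the class and field names are illustrative assumptions, not the paper's notation.

```python
# A structural sketch of STAMN nodes. State nodes carry decaying
# activations (STM) and incoming weight dictionaries (LTM); the network
# grows incrementally, so nodes and weights start empty.
class StateNode:
    def __init__(self):
        self.obs_activation = 0.0      # match with the current observation
        self.ident_activation = 0.0    # identified-as-current-state value
        self.context_weights = {}      # presynaptic node -> weight (k-step LTM)
        self.trans_weights = {}        # (prev state, action) -> weight

class ActionNode:
    def __init__(self):
        self.activation = 0.0          # selected-action indicator

class STAMN:
    def __init__(self):
        self.state_nodes = []          # created incrementally, none initially
        self.action_nodes = []

    def add_state_node(self):
        node = StateNode()
        self.state_nodes.append(node)
        return node

net = STAMN()
n1 = net.add_state_node()
n1.obs_activation = 1.0
print(len(net.state_nodes))
```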

Short-Term Memory.
We use self-decay of neuron activity to accomplish short-term memory, for both observation activations and identifying activations. The activity of each state node has a self-decay mechanism to record the temporal order and the duration time of the node, with a self-decay factor in (0, 1) and activation values in [0, 1]. The activity of each action node has the same self-decay characteristic, with its own self-decay factor and activation values in [0, 1]. The active thresholds and the self-decay factors together determine the depth of short-term memory, where t is a discrete time point obtained by sampling at regular intervals Δt, and Δt is a very small regular interval.
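The decay equations were lost in extraction; a plausible reconstruction, assuming a multiplicative decay factor gated by an active threshold (the symbols γ₁, γ₂, θ₁, θ₂ are assumptions, not the paper's), is:

```latex
x_{s_j}(t+\Delta t) =
\begin{cases}
\gamma_1 \, x_{s_j}(t), & \gamma_1 \, x_{s_j}(t) \ge \theta_1,\\
0, & \text{otherwise,}
\end{cases}
\qquad
x_{a_i}(t+\Delta t) =
\begin{cases}
\gamma_2 \, x_{a_i}(t), & \gamma_2 \, x_{a_i}(t) \ge \theta_2,\\
0, & \text{otherwise,}
\end{cases}
```

with γ₁, γ₂ ∈ (0, 1) the state and action self-decay factors and θ₁, θ₂ the active thresholds; once an activation decays below its threshold it is cut to zero, which bounds the STM depth.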

Long-Term Memory.
Long-term memory can be classified into semantic memory and episodic memory. Learning of history sequences is treated as episodic memory, generally realized by one-shot learning. We use time-delayed Hebb learning rules to fulfil the one-shot learning.
(1) The k-Step Long-Term Memory Weight. The weights connected to state nodes represent the k-step long-term memory. This is a past-oriented behaviour in the STAMN. The weight w_{ij}(t) is adjusted when state node s_j is the current identified (winner) node. Because the identifying activation process is a neuron synchronized activation process, when s_j is identified, all nodes whose observation or identifying activation values are not zero constitute the contextual information related to state node s_j; these nodes form the presynaptic node set of the current state node s_j. Their activation values at time t record not only the temporal order but also the duration time, because of the self-decay of neuron activity: the smaller a node's activation value, the earlier that node occurred relative to the current state node s_j. w_{ij}(t) is the activation weight between presynaptic node i and state node s_j; it records the context information related to state node s_j, to be used in identifying and recalling, where w_{ij}(t) ∈ [0, 1], w_{ij}(0) = 0, and the update uses a learning rate.
The weight w_{ij}(t) carries time-related contextual information; the update process is shown in Figure 3.
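The time-delayed Hebb update can be sketched as moving each context weight toward the decayed activation trace of its presynaptic node when the postsynaptic node wins; the incremental form w ← w + η(trace − w) and the learning rate η are assumptions, not the paper's exact equation.

```python
# A sketch of the time-delayed Hebb rule for k-step LTM weights
# (assumed incremental form; eta and traces are illustrative).
def hebb_update(context_weights, presynaptic, eta=0.5):
    """presynaptic: node id -> decayed activation (temporal trace)."""
    for node_id, act in presynaptic.items():
        if act > 0.0:                          # only still-active context nodes
            w = context_weights.get(node_id, 0.0)
            context_weights[node_id] = w + eta * (act - w)
    return context_weights

# older context nodes carry smaller (more decayed) activations,
# so the learned weights also encode temporal order
presynaptic = {"s1": 0.9, "s0": 0.6, "a0": 0.6}
w = hebb_update({}, presynaptic)
print(w["s1"], w["s0"])
```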

(2) One-Step Transition Prediction Weight. The transition weights connected to state nodes represent one-step transition prediction in LTM. This is a future-oriented behaviour in the STAMN. Using the time-delayed Hebb learning rules, the transition weight is adjusted when state node s_j is the current identified node. The transition activation of the current state node is associated only with the previous winning state node and action node, that is, the nodes with maximal activation value at the previous step. The weight lies in [0, 1], is initialized to 0, and is updated with a learning rate.
This weight carries one-step transition prediction information; the update process is shown in Figure 4.
(3) The Weights Connected to Action Nodes. The activation of an action node is associated only with the corresponding state node that directly selects it. We assume the state node with the maximal identifying activation value is the current state node at time t, so the connection weight to the currently selected action node is adjusted via a value function Q(s, a). This paper only discusses how to build a generalized environment model, not how to learn the optimal policy, so this value is set by an exhaustive exploration function based on a curiosity reward, described by (7). When an action is invalid, its value is defined to be a large negative constant, which avoids going into dead ends. In (7), c is a constant, N represents the agent's count of explorations of the current action, and N_ave is the average count of explorations over all actions; N/N_ave represents the degree of familiarity with the currently selected action. The curiosity reward is updated when each action is finished. The Q(s, a) update equation is given by (8), with a learning rate. The update process is shown in Figure 5. The action node is selected according to Q(s, a) over the valid action set of the current state node; when the state node is identified and the action node is selected by the agent to control the robot's current action, the action node's activation value is set to 1.
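The exhaustive-exploration choice can be sketched with a curiosity reward that decreases with the familiarity N/N_ave of an action; the linear form, the constant c, and the greedy selection are assumptions standing in for the paper's equations (7)-(9).

```python
# A sketch of curiosity-driven exhaustive exploration (assumed form).
def curiosity_reward(counts, a, c=1.0):
    n_ave = sum(counts.values()) / len(counts)
    if n_ave == 0:
        return c                                # nothing explored yet
    return c * (1.0 - counts[a] / n_ave)        # unfamiliar actions score higher

def select_action(counts):
    """Greedy choice over the valid action set by curiosity reward."""
    return max(counts, key=lambda a: curiosity_reward(counts, a))

counts = {"a1": 5, "a2": 1, "a3": 3}            # exploration counts per action
print(select_action(counts))
```

Here the least-explored action wins, which pushes the agent to visit every action of every identified state node at least a few times.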

The Constructing Process in STAMN.
In order to construct the minimal looping hidden state transition model, we need to detect looped states by identifying the hidden state. The identifying phase and the recalling phase (transition prediction phase) proceed simultaneously in the STAMN constructing process. There is a chicken-and-egg problem during construction: building the STAMN depends on state identifying; conversely, state identifying depends on the current structure of the STAMN. Thus, exhaustive exploration and a k-step variable memory length (depending on the prediction criterion) are used to try to avoid the local minima that this interdependence causes.
According to Algorithm 12, the identifying activation value of a state node depends on three identifying processes: k-step history sequence identifying, following identifying, and previous identifying. We provide calculation equations for each identifying process.

(1) k-Step History Sequence Identifying. The current k-step history sequence h_t is matched against the stored context to identify a looping state node s_j. First, we compute the presynaptic potential of the state node s_j by summing over its presynaptic node set: each term weighs the current activation value of a presynaptic node by a confidence parameter, representing that node's importance degree in the presynaptic node set (the confidence parameters can be set in advance and sum to 1), and by a similarity function φ measuring the similarity between the activation value of the presynaptic node and the contextual information stored in LTM. φ ∈ [0, 1]; when the activation value and the stored trace are highly similar, φ is close to 1. According to winner-takes-all, among all nodes whose presynaptic potentials exceed the threshold, the node with the largest presynaptic potential is selected.
The presynaptic potential represents the synchronous activation process of the presynaptic node set of s_j, that is, the matching of the previous k−1 steps of contextual information for state node s_j. To realize full k-step history sequence matching, the k-step history sequence identifying activation value of the state node s_j additionally requires the current observation to match: the node with the largest potential value is selected among all nodes whose presynaptic potentials exceed the threshold, and its observation activation value must indicate a match between state node s_j and the current observed value. If this identifying activation value equals 1, the current state is identified with the looped state node s_j by the k-step memory.
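The k-step identifying step can be sketched as a confidence-weighted similarity between the current context activations and each candidate node's stored LTM traces, followed by a thresholded winner-takes-all; the similarity function 1 − |a − w| and the threshold value are illustrative assumptions.

```python
# A sketch of k-step identifying via presynaptic potentials.
def presynaptic_potential(context_weights, activations, confidence):
    total = 0.0
    for node_id, w in context_weights.items():
        a = activations.get(node_id, 0.0)
        sim = 1.0 - abs(a - w)            # stored trace close to current -> ~1
        total += confidence.get(node_id, 0.0) * sim
    return total

def identify(candidates, activations, threshold=0.8):
    """Winner-takes-all over nodes whose potential exceeds the threshold."""
    best, best_p = None, threshold
    for node_id, (cw, conf) in candidates.items():
        p = presynaptic_potential(cw, activations, conf)
        if p >= best_p:
            best, best_p = node_id, p
    return best

candidates = {
    "s1": ({"x": 0.9, "y": 0.5}, {"x": 0.5, "y": 0.5}),   # (LTM traces, confidences)
    "s2": ({"x": 0.2, "y": 0.9}, {"x": 0.5, "y": 0.5}),
}
activations = {"x": 0.9, "y": 0.5}        # current decayed context (STM)
print(identify(candidates, activations))
```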
(2) Following Identifying. If the current state is identified as the state s_j, then the next transition prediction state is identified as the state s_k = δ(s_j, a). First, we compute the transition prediction value p_k(t) for the state node s_k according to

p_k(t) = ∑_{i=arg max(a_i(t))} μ_i ⋅ a_i(t) ⋅ w_ik,

where a_i(t) is the identifying activation value of the state node and action node at time t, and i = arg max(a_i(t)) indicates that the state node and action node i are the previous winner nodes. μ_i is the confidence parameter expressing node i's importance degree; its value can be set in advance, with ∑_{i=arg max(a_i(t))} μ_i = 1. The weight w_ik records the one-step transition prediction information related to state node s_k, to be used in the identifying and recalling phases. If the current state is identified as the hidden state s_j, p_k(t) represents the probability that the next transition prediction state is s_k.
If the next prediction state node is consistent with the current observation value, the current history sequence is identified with the state node s_k. The following identifying value e_k(t) of the state node s_k is given below. If e_k(t) = 1, the current history sequence h_t is identified with the looped state node s_k by following identifying. If p_k(t) ≥ max_{j:1→n}(p_j(t)), p_k(t) ≥ θ, and e_k(t) ≠ 1, then d_k(t) ≠ 1, representing a mismatch between state node s_k and the current observed value, so that trans(h_i, a) ≠ trans(h_j, a). According to the transition prediction criterion (Algorithm 13), the current history sequence h_t is not identified with the hidden state s_k, so the identifying history sequence lengths for h_i and h_j must each increase to k + 1.
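A hedged sketch of following identifying, with a dictionary standing in for the transition weights w_ik; the names (`predict_next`, `following_identify`) and the threshold default are our own assumptions:

```python
def predict_next(trans_weights, prev_state, prev_action):
    """Transition prediction values p_k(t): for the previous winner
    state/action pair, return the stored one-step prediction distribution."""
    return trans_weights.get((prev_state, prev_action), {})

def following_identify(trans_weights, prev_state, prev_action, obs,
                       node_obs, theta=0.5):
    """Following identifying: the predicted next state is accepted only if
    its prediction value exceeds theta AND its stored observation matches
    the current one; a mismatch means the memory depth must increase."""
    preds = predict_next(trans_weights, prev_state, prev_action)
    if not preds:
        return None
    k = max(preds, key=preds.get)      # winner among predicted states
    if preds[k] >= theta and node_obs[k] == obs:
        return k                       # history identified with node k
    return None                        # mismatch: deepen the memory
```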

(3) Previous Identifying. If the current history sequence is identified as the state s_j, then the previous state is the state s_i such that s_j = δ(s_i, a). First, we compute the identifying activation values for all states. If there exists b_j(t) = 1, then the current state is identified as state s_j, and the previous state at time t − 1 is identified as the previous state of state s_j. The previous identifying value of the state node s_i is defined as g_i(t). The previous state is s_i = arg max(g_i(t)) satisfying Condition 1 and Condition 2; then we set g_i(t) accordingly. According to the above three identifying processes, the identifying activation value of a state node is defined as in (16). According to Algorithm 11, we give Pseudocode 1 for Algorithm 11.
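How the three processes might be combined can be sketched as a simple vote. The paper's (16) gives the exact combination, so the function below is only an illustrative stand-in with names of our own choosing:

```python
def identifying_activation(k_step_id, following_id, previous_id):
    """Combined identifying activation over the three processes
    (k-step, following, previous). As a simplification, return the node
    identified by the most processes; None means no looped node was
    identified and a new state node should be created."""
    votes = {}
    for node in (k_step_id, following_id, previous_id):
        if node is not None:
            votes[node] = votes.get(node, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)
```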
The pseudocode for Algorithm 12, the current history sequence identifying algorithm (identifying phase), is as follows: compute the identifying activation value according to (16).
The pseudocode for Algorithm 13 is in Pseudocode 2.
According to Lemma 10, for any two identifying history sequences h_i and h_j for two distinct hidden states, h_i ≠ h_j if and only if the states differ. So the identifying history sequence lengths for h_i and h_j must each increase to 2. If the two length-2 identifying history sequences are still identical, a longer transition instance is needed.
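The lengthening step in this argument can be made concrete with a small helper; the function and the example histories are ours, not the paper's:

```python
def discriminating_length(hist_a, hist_b):
    """Smallest suffix length k at which two transition-instance histories
    differ; None if no suffix up to the shorter history's length separates
    them, in which case longer transition instances are needed."""
    for k in range(1, min(len(hist_a), len(hist_b)) + 1):
        if hist_a[-k:] != hist_b[-k:]:
            return k
    return None

# Two histories sharing a 2-item suffix are only discriminated at k = 3:
h1 = ['o1', 'a1', 'o1']
h2 = ['o2', 'a1', 'o1']
```

When `discriminating_length` returns None, the shared suffix exhausts one of the histories, which is exactly the case in which the lemma demands a longer transition instance.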
According to k-step history identifying, following identifying, and previous identifying, h_t can be represented compactly. According to Pseudocode 1, the STAMN model is constructed incrementally using h_t as in Figure 7(a), and the LPST is constructed as in Figure 7(b). The STAMN has the fewest nodes because the state nodes in the STAMN represent hidden states, while the state nodes in the LPST represent observation states.
To illustrate the difference between the STAMN and LPST, we present the comparison results in Figure 8.
After 10 timesteps, the algorithms use the currently learned model to realize transition prediction, and the transition prediction criterion is expressed by the prediction error. Each data point represents the average prediction error over 10 runs. We ran three algorithms: the STAMN with exhaustive exploration, the STAMN with random exploration, and the LPST.
Set initial memory depth k = 1; t = 0;
All nodes and connection weights do not exist initially in the STAMN;
The transition instance h_0 is empty;
A STAMN is constructed incrementally through the agent's interaction with the environment, expanding the STAMN until the transition prediction contradicts the current minimal hypothesis model; the hypothesis model is then reconstructed by increasing the memory depth k. In this paper, the constructing process includes identifying and recalling simultaneously.
while one pattern o_t or a_t can be activated from the pattern cognition layer do
  h_t = (h_{t−1}, a_t, o_t)
  for all existing state nodes do
    compute the identifying activation value by executing Algorithm 12
    if there exists b_j(t) = 1 then
      the looped state is identified, and the weight w_ij is adjusted according to (4), (5)
    else
      a new state node is created; set b_j(t) = 1, and the weight w_ij is adjusted according to (4), (5)
    end if
  end for
  for all existing state nodes do
    compute the transition prediction criterion by executing Algorithm 13
    if IsNotTrans(s_j) then
      for the previous winner state node s_i of the state node s_j do
        set memory depth k = k + 1 until each h_t is discriminated
      end for
      reconstruct the hypothesis model with the new memory depth k according to Algorithm 11
    end if
  end for
  for all existing action nodes do
    compute the activation value according to (9)
    if there exists b_j(t) = 1 then
      the weight w_jk is adjusted according to (6)
    else
      a new action node is created; set b_j(t) = 1, and the weight w_jk is adjusted according to (6)
    end if
  end for
end while
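The control flow of Pseudocode 1 can be sketched as follows. The dictionary-based node store, the suffix matching in `identify`, and the rebuild-on-contradiction step are simplified stand-ins for Algorithms 11-13 and the weight updates (4)-(6), with all names our own:

```python
class STAMN:
    """Skeleton of the incremental constructing process: identify a looped
    state node or create a new one, then check the transition prediction
    criterion and deepen the memory when two histories are confounded."""

    def __init__(self):
        self.k = 1          # memory depth
        self.states = {}    # node id -> k-step history suffix identifying it
        self.trans = {}     # (node, action) -> successor node

    def identify(self, hist):
        """Algorithm 12 stand-in: match the k-step suffix of the history,
        creating a new state node if no existing node matches."""
        suffix = tuple(hist[-(2 * self.k - 1):])  # k obs interleaved w/ actions
        for node, sig in self.states.items():
            if sig == suffix:
                return node
        node = len(self.states)
        self.states[node] = suffix
        return node

    def step(self, prev_node, action, obs, hist):
        node = self.identify(hist)
        if prev_node is not None:
            key = (prev_node, action)
            if key in self.trans and self.trans[key] != node:
                # Transition prediction contradicted: deepen memory, rebuild.
                self.k += 1
                self.states.clear()
                self.trans.clear()
                node = self.identify(hist)
            else:
                self.trans[key] = node
        return node
```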

IsNotTrans(s_j):
  true unless the transition prediction value exceeds the threshold θ according to (12) and the following identifying value equals 1 according to (13).

Figure 8 shows that all three algorithms can eventually produce the correct model with zero prediction error, because there is no noise. However, the STAMN with exhaustive exploration has better performance and a faster convergence speed because of its exhaustive exploration function.

Experimental Comparison in the 4 * 3 Grid
Problem. First, we use a small POMDP problem to test the feasibility of the STAMN. The 4 * 3 grid problem, shown in Figure 9, is selected. The agent wanders inside the grid and has only a left sensor and a right sensor to report the existence of a wall at its current position. The agent has four actions: forward, backward, turn left, and turn right. The reference direction of the agent is northward.
In the 4 * 3 grid problem, the hidden state space size |S| is 11 and the action space size |A| is 4 for each state. The comparison results are shown in Figure 10, which shows that the STAMN with exhaustive exploration still has better performance and a faster convergence speed in the 4 * 3 grid problem. The numbers of state nodes and action nodes in the STAMN and the LPST are described in Table 2. In the STAMN, a state node represents a hidden state, but in the LPST, a state node represents an observation value, and the hidden state is expressed by k-step past observations and actions; thus, more observation nodes and action nodes are repeatedly created, and most of those observation nodes and action nodes are the same. The LPST has perfect observation prediction capability, the same as the STAMN, but it is not a state transition model.
In the STAMN, the setting of the decay parameters is very important, since they determine the depth of short-term memory. We initially set the decay parameters to 0.9, representing k = 1. When we need to increase k to k + 1, we only need to decrease the decay parameters, by 0.2 and 0.1 respectively. The other parameters are determined relatively easily. We set the learning rates to 0.9 and set the confidence parameters to 1/|π_j| and 1/2, representing equal importance degrees. The constant values are set to 5 and 0.9.

Experimental Results in Complex Symmetrical
Environments. The symmetrical environment used in this paper is very complex; it is shown in Figure 11(a). The robot wanders in the environment. By wall-following behaviour, the agent can recognize the left-wall, right-wall, and corridor landmarks. These observation landmarks differ from the observation vectors in the previous grid problem: every observation landmark has a different duration time. To analyse the fault tolerance and robustness of the STAMN, we assume the disturbed environment shown in Figure 11(b).
The numbers in Figure 11(a) represent the hidden states. The robot has four actions: forward, backward, turn left, and turn right. The initial orientation of the robot is shown in Figure 11(a). Since the robot follows the wall in the environment, it has only one optional action in each state. The reference direction of an action is the robot's current orientation. Thus, the paths 2-3-4-5-6 and 12-13-14-15-16 are identical, and a memory depth k = 6 is necessary to identify the hidden states reliably. We present a comparison between the STAMN and the LPST in the noise-free environment and the disturbed environment; the results are shown in Figures 12(a) and 12(b). Figure 12(b) shows that the STAMN has better noise tolerance and robustness than the LPST because of the neuron synchronized activation mechanism, which tolerates fault and noise in the sequence item duration times and can realize reliable recalling. However, the LPST cannot converge correctly in reasonable time because of its accurate matching.

Discussion
In this section, we compare the related work with the STAMN model. The related work mainly includes the LPST and the AMN.

Looping Prediction Suffix Tree (LPST).
The LPST is constructed incrementally by expanding branches until they are identified or become looped, using the observable termination criterion. Given sufficient history, this model can correctly capture all predictions of identifying histories and can map all infinite identifying histories onto a finite LPST.
The STAMN proposed in this paper is similar to the LPST. However, the STAMN is a looping hidden state transition model, so in comparison with the LPST, the STAMN has fewer state nodes and action nodes, because the nodes in the LPST are based on observations, not hidden states. Furthermore, the STAMN has better noise tolerance and robustness than the LPST. The LPST realizes recalling by successive accurate matching, which is sensitive to noise and fault; the STAMN instead offers the neuron synchronized activation mechanism, so even in a noisy and disturbed environment it can still realize reliable recalling. Finally, the algorithm for learning an LPST is an additional computation model, not a distributed computational model. The STAMN is a distributed network and uses the synchronized activation mechanism, so its performance does not degrade as history sequences lengthen and the scale increases.

Associative Memory Networks (AMN).
The STAMN is proposed based on the development of associative memory networks. In existing AMN, first, almost all models are unable to handle complex sequences with looped hidden states; the STAMN truly realizes identification of looped hidden states, and so can be applied to HMM and POMDP problems. Furthermore, most AMN models can only obtain a memory depth determined by experiments, whereas the STAMN offers a self-organizing incremental memory depth learning method in which the memory depth is variable in different parts of the state space. Finally, existing AMN models generally record only the temporal order with discrete intervals Δt, rather than sequence item durations in continuous time; the STAMN explicitly deals with the duration time of each item.

Conclusion and Future Research
POMDP is a long-standing difficult problem in machine learning. In this paper, the STAMN is proposed to identify looped hidden states only from transition instances in a deterministic POMDP environment. The learned STAMN can be seen as a variable-depth k-Markov model. We proposed a heuristic constructing algorithm for the STAMN, which is proved to be sound and complete given sufficient history sequences. The STAMN is a truly self-organizing, incremental, unsupervised learning model. The transition instances can be obtained through the agent's interaction with the environment, or from training data that does not depend on a real agent. The STAMN is very fast, robust, and accurate. We have also shown that the STAMN outperforms some existing hidden state methods in deterministic POMDP environments. The STAMN can generally be applied to almost all temporal sequence problems, such as the simultaneous localization and mapping (SLAM) problem, robot trajectory planning, sequential decision making, and music generation. We believe that the STAMN can serve as a starting point for integrating associative memory networks with the POMDP problem. Further research will be carried out on the following aspects: how to scale our approach to the stochastic case by a heuristic statistical test; how to incorporate reinforcement learning to produce a new distributed reinforcement learning model; and how to apply the STAMN to robot SLAM to resolve practical navigation problems.