Parameterless-Growing-SOM and Its Application to a Voice Instruction Learning System

An improved self-organizing map (SOM), the parameterless-growing-SOM (PL-G-SOM), is proposed in this paper. To overcome problems of the traditional SOM (Kohonen, 1982), various structure-growing SOMs and parameter-adjusting SOMs have been invented, usually separately. Here, we combine the idea of growing SOMs (Bauer and Villmann, 1997; Dittenbach et al. 2000) with a parameterless SOM (Berglund and Sitte, 2006) into a novel SOM named PL-G-SOM to realize additional learning, optimal neighborhood preservation, and automatic tuning of parameters. The improved SOM is applied to construct a voice instruction learning system for partner robots that adopts a simple reinforcement learning algorithm. The user's voice instructions are first classified by the PL-G-SOM; the robot then chooses an expected action according to a stochastic policy. The policy is adjusted by the reward/punishment given by the user of the robot. A feeling map is also designed to express the learning degree of voice instructions. Learning and additional learning experiments using instructions in multiple languages, including Japanese, English, Chinese, and Malaysian, confirmed the effectiveness of the proposed system.

Generally, the SOM algorithm maps n-dimensional feature data $x(x_1, x_2, \ldots, x_n)$ in an input space to a unit i in a low-dimensional output space with connections $m_i(m_1, m_2, \ldots, m_n)$ by a simple winner-takes-all rule based on the Euclidean distance; that is, a high-dimensional input is assigned to the most suitable unit i at position c, the best-match unit (BMU), on the output map. For all inputs, starting from connections with random initial values, a competitive learning rule (2) ensures that input data with similar features stay close together on the visualized topological output map, where α is the learning rate and $h_{ci}$ is a neighborhood function (3). Here, $c_i$ and c denote the positions of an arbitrary unit on the output map and of the BMU, respectively, $i = 1, 2, \ldots, k \le N \times M$, and σ is a constant. Obviously, $h_{ci}(x) \ge 0$, $h_{ci}(0) = 1$, and $h_{ci}(\infty) = 0$.
In the original SOM, the size of the output space N × M is fixed in advance, and parameters such as the learning rate α and the scale of the neighborhood σ are often determined empirically. These constraints cause 2 kinds of problems in technical applications [6][7][8][9][10][11][12][13][14]: (1) the fixed size of the output map prevents additional learning, because BMUs are difficult to find on the trained output map when new feature data are presented; (2) annealing schemes for tuning the learning rate and the neighborhood size are necessary to improve the operation rate of the output map; however, realizing the annealing usually increases the computational load.
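The winner-takes-all search and the competitive update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the Gaussian form of the neighborhood function is an assumption chosen only because it satisfies the stated properties $h_{ci}(0) = 1$ and $h_{ci}(\infty) = 0$.

```python
import math
import random


def find_bmu(weights, x):
    """Index of the unit whose weight vector is closest to x (winner-takes-all)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def gaussian_neighborhood(pos_i, pos_c, sigma):
    """Assumed Gaussian h_ci: equals 1 at the BMU and decays to 0 far from it."""
    return math.exp(-math.dist(pos_i, pos_c) ** 2 / sigma ** 2)


def som_step(weights, positions, x, alpha, sigma):
    """One competitive-learning update applied in place; returns the BMU index."""
    c = find_bmu(weights, x)
    for i, m in enumerate(weights):
        h = gaussian_neighborhood(positions[i], positions[c], sigma)
        weights[i] = [mj + alpha * h * (xj - mj) for mj, xj in zip(m, x)]
    return c


# Usage: a 2 x 2 output map with 3-dimensional inputs.
random.seed(0)
positions = [(row, col) for row in range(2) for col in range(2)]
weights = [[random.random() for _ in range(3)] for _ in positions]
x = [0.2, 0.4, 0.6]
before = math.dist(weights[find_bmu(weights, x)], x)
c = som_step(weights, positions, x, alpha=0.5, sigma=1.0)
after = math.dist(weights[c], x)  # the BMU has moved halfway toward x
```

With a fixed α and σ, as here, the map size and both parameters must be chosen by hand, which is exactly the limitation the two problems above describe.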
Variations of SOM with growing structures have been proposed to solve the first problem [7][8][9][10]. The basic idea of these SOMs is to start with a small output feature map, for example, 2 units, and then insert rows/columns into the map during training, either where a most frequently visited BMU exists [7,10] or according to the deviation of the distance between the units on the input layer and the output map [8,9]. We proposed another method to address the lack of units: a memory layer stores the matured units of the feature map during training, and the matured units on the feature map are then released and initialized, that is, they become available for reuse [12,13]. When a feature data set is input to the learning system, the process first searches for the corresponding BMU on the memory layer, so the feature map produced by SOM becomes merely an intermediate map; we therefore called it transient SOM (T-SOM).
To solve the second problem, various approaches have also been proposed, such as reducing the learning rate (α in (2)) and the neighborhood size (σ in (3)) linearly, that is, multiplying them by attenuation coefficients, calculating the neighborhood size in the input space, or using Kalman filters to find the BMU on the output space [6]. Berglund and Sitte recently proposed a low-cost parameterless SOM algorithm (PLSOM), which uses only the fitting error between the input and the map to decide the annealing schemes [11].
In this paper, we combine the idea of the growing SOM algorithm and the method of PLSOM to construct a novel SOM named parameterless-growing-SOM (PL-G-SOM) that tackles both problems of SOM described above. PL-G-SOM grows its structure in adaptation to the input data and anneals its parameters automatically to realize sensitive clustering on the output space. We also adopt PL-G-SOM into a voice instruction learning system, where it serves as an automatic classifier of input features, just as T-SOM has been applied to a hand image instruction learning system [12,13] and a voice instruction learning system [14].
The rest of this paper is organized as follows. Section 2 presents the details of PL-G-SOM. Section 3 describes a voice instruction learning system using PL-G-SOM. In Section 4, instruction learning experiments with 4 languages are reported to confirm the learning and additional learning abilities of the proposed system. Section 5 is the conclusion.
Figure 2: The structure of a voice instruction learning system using PL-G-SOM. It is similar to the system using T-SOM in [12][13][14]; however, instead of the memory layer of BMUs in T-SOM, each map grows with training. The annealing schemes of the neighborhood size and learning rate are given by PL-G-SOM.

A New SOM: PL-G-SOM
To enlarge the output map, various criteria have been proposed. Fritzke chose to insert a new row/column adjacent to the most often visited BMU in his Growing Grid [7]. The reason for this criterion of map enlargement is that the earlier map may be considered a coarse one, and the likely BMUs need a higher resolution to deal with changes in the input. Meanwhile, Bauer and Villmann suggested adding units in the direction, or even in a new dimension, of the largest error between the input data and the output map in their GSOM [8,9]. However, the process of enlarging the output map is similar in Growing Grid and GSOM, and it is shown in Figure 1. When a new row/column r needs to be inserted next to a BMU c, for example, between c and f, the weights of the connections between the input and the new nodes take the average values of those of c and f (4), and so do those of r's neighbors (5), where l = 1, 2, ..., N or M. Unit f is the neighbor of c that has the largest Euclidean distance from the BMU c; after this process, the map size changes to N × (M + 1) or (N + 1) × M. We use the same growing process here; however, a new criterion for choosing the BMU is proposed, based on a reinforcement learning algorithm, for the case in which the SOM is adopted into a human-machine interaction learning system. The details are given in Section 3.
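The growing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the map is a nested list `grid[row][col]` of weight vectors, only the 4 direct neighbors of the BMU are considered, and every unit of the inserted row/column is the average of the two units it is placed between. The function name `grow_map` is hypothetical.

```python
import math


def grow_map(grid, bmu):
    """Insert a row or column between the BMU c and its most distant neighbor f."""
    n_rows, n_cols = len(grid), len(grid[0])
    r, c = bmu

    def mean(a, b):
        return [(u + v) / 2 for u, v in zip(a, b)]

    # Candidate neighbors of the BMU that lie inside the map.
    neighbors = [(r + dr, c + dc)
                 for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols]
    # f is the neighbor whose weights are farthest from those of the BMU.
    fr, fc = max(neighbors, key=lambda p: math.dist(grid[p[0]][p[1]], grid[r][c]))

    if fr == r:
        # f is a horizontal neighbor: insert a column, map becomes N x (M + 1).
        lo = min(c, fc)
        for row in grid:
            row.insert(lo + 1, mean(row[lo], row[lo + 1]))
    else:
        # f is a vertical neighbor: insert a row, map becomes (N + 1) x M.
        lo = min(r, fr)
        grid.insert(lo + 1, [mean(a, b) for a, b in zip(grid[lo], grid[lo + 1])])
    return grid


# Usage: a 2 x 2 map with 1-dimensional weights; the BMU is the unit at (0, 1).
grid = [[[0.0], [1.0]],
        [[0.0], [4.0]]]
grow_map(grid, (0, 1))  # the unit below the BMU is farthest, so a row is inserted
```

Because the new units are interpolated rather than randomly initialized, the inserted row/column preserves the neighborhood ordering already learned by the map.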

Annealing of Parameters.
To decide the learning rate and the size of the neighborhood function, we adopt the method of PLSOM proposed by Berglund and Sitte [11]. Both the learning rate α = ε(t) and the neighborhood size σ(t) are calculated from the distance between the input and the BMU ((6) and (7)), where $\sigma_{max}$ and $\sigma_{min}$ are positive parameters; for example, their values may be the size of the map and 1.0, respectively.
The competitive learning rule of the connections between input and output units, that is, (2), can then be changed to an online learning algorithm in which the learning rate α is replaced by ε(t) and the constant σ by σ(t).
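One way to read this annealing scheme is sketched below. The details are assumptions made for illustration: ε(t) is taken as the current fitting error divided by the largest fitting error seen so far, and σ(t) is interpolated linearly between $\sigma_{min}$ and $\sigma_{max}$ by ε(t). The class name `PLSOMScaler` is hypothetical.

```python
import math


class PLSOMScaler:
    """Derives the learning rate and neighborhood size from the fitting error."""

    def __init__(self, sigma_max, sigma_min=1.0):
        self.sigma_max = sigma_max
        self.sigma_min = sigma_min
        self.rho = 0.0  # largest input-to-BMU distance observed so far

    def update(self, x, m_c):
        """Return (epsilon, sigma) for input x and BMU weight vector m_c."""
        err = math.dist(x, m_c)
        self.rho = max(self.rho, err)
        eps = err / self.rho if self.rho > 0 else 0.0
        # Assumed linear interpolation: sigma_min when eps = 0, sigma_max when eps = 1.
        sigma = (self.sigma_max - self.sigma_min) * eps + self.sigma_min
        return eps, sigma


# Usage: the error shrinks as the map fits the data, then jumps for a novel input.
scaler = PLSOMScaler(sigma_max=5.0, sigma_min=1.0)
first = scaler.update([1.0, 0.0], [0.0, 0.0])
fitted = scaler.update([0.5, 0.0], [0.0, 0.0])
novel = scaler.update([2.0, 0.0], [0.0, 0.0])
```

Under these assumptions no annealing schedule has to be tuned by hand: a well-fitted input yields a small ε and a narrow neighborhood, while an unfamiliar input restores a large ε and a wide one.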

A Voice Instruction Learning System Using PL-G-SOM
The voice instruction learning system is intended as an internal model of an autonomous robot which, when an external voice signal is presented, at first performs any of its available actions and then learns to output the requested action using the reward or punishment from the instructor. The system thus gives the robot learning and additional learning abilities. For example, a robot with the voice instruction learning system is able to "understand" human instructions in different languages, and a pet robot like "AIBO" [16] can easily get used to a new owner.

The Structure.
To realize the human-machine interaction, an internal model of an autonomous robot is constructed as shown in Figure 2. The structure is similar to that of a learning system using Transient-SOM (T-SOM), which was proposed in our previous work [12][13][14]. In [12,13], a hand image instruction learning system with 5 layers, namely, Input Layer, Feature Map, Action Map, Feeling Map, and Memory Layer, was composed with the SOM algorithm and reinforcement learning rules. Instructions to the robot are presented as various shapes of a human hand; the robot categorizes them, that is, image signals in an 80-dimensional space, with SOM, and the instructions are labeled with a series of autonomous actions according to a stochastic policy.
The instructor observes the action of the robot and provides a reward/punishment for the action, so the action policy of the robot can be modified to agree with the hand image instructions. For online learning and additional learning, T-SOM adopted a memory layer which stores "matured" units, and input features are matched with units on the memory layer before SOM is executed on the feature map. We also introduced an annealing plan to decide the size of the neighborhood and the learning rate into T-SOM, and a voice instruction learning system using the improved T-SOM, named PL-T-SOM, was proposed [14]. However, a problem that exists in T-SOM is that its memory layer stores only the values of matured units, without the topology of the feature map. Even if the memory layer could remember the topology of the feature map trained online, a new topology could not be established on it. For this reason, we propose a new voice instruction learning system using the PL-G-SOM given in Section 2 instead of T-SOM.
In Figure 2, the Feature Map is the basic growing SOM, and the sizes of the Action Map and the Feeling Map grow with the Feature Map too. Instructions given as voice data are first transformed into feature vectors of the input space (layer); the PL-G-SOM algorithm is then executed on the Feature Map, and the growing rules given by (4) and (5) (Figure 1) are also applied to enlarge the Action Map and the Feeling Map. The Action Map is composed of units which correspond to the units on the Feature Map; that is, each unit on the Action Map represents one kind of feature of the input data. The units on the Action Map are labeled by the reinforcement learning algorithm given in Section 3.2 to associate each feature with suitable actions of the robot. The Feeling Map has the same distribution of units as the Action Map. The action number that comes from the Action Map is assigned a feeling value which expresses the degree to which the action has been mastered by the robot. The details of the Feeling Map are described in Section 3.3.

Reinforcement Learning Algorithm.
The value of the units on the Action Map is given by a value function of state and action, that is, (8), where $Q(s_t, a_t(i))$ is the value of action $a_t(i)$ when the robot is in state $s_t$, $Q(s_0, a_0(i))$ is random initially, and ±r is the empirical value of the reward (+)/punishment (−) given by the instructor, for example, a positive constant when the robot acts correctly according to its function and a negative constant otherwise. Now suppose that there are N × M units on the Action Map; that is, N × M states exist in the environment of a Markov decision process (MDP), and each unit has K actions available for selection. Then a reinforcement learning (RL) algorithm [17] can be used to label the classes of the states, which are the units of the Action Map yielded by the Feature Map. According to (8), a Q-value table can be established as shown in Table 1.
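A Q-value table of this kind can be sketched as follows. The update used here is an assumption made for illustration, namely the simplest additive reading in which +r or −r is added to the entry of the action that was just performed; the function names and the default r are hypothetical.

```python
import random


def init_q_table(n_units, n_actions, rng):
    """One row per Action Map unit (state), one random initial value per action."""
    return [[rng.random() for _ in range(n_actions)] for _ in range(n_units)]


def update_q(q_table, state, action, correct, r=1.0):
    """Add the reward (+r) or punishment (-r) to the selected entry (assumed form)."""
    q_table[state][action] += r if correct else -r
    return q_table[state][action]


# Usage: a 5 x 5 Action Map (25 states) with K = 4 actions.
rng = random.Random(0)
q_table = init_q_table(n_units=25, n_actions=4, rng=rng)
initial = q_table[3][2]
rewarded = update_q(q_table, state=3, action=2, correct=True, r=0.5)
punished = update_q(q_table, state=3, action=2, correct=False, r=0.5)
```

When the map grows, the table would need one additional row per inserted unit, so that each state on the enlarged Action Map keeps its own K action values.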
Table 1: Q-value table. Each unit of the Action Map has a $Q_t(s_t, a_t(i))$ value corresponding to an action.
For each state $s_t$, that is, each presented voice instruction, the robot selects a valuable action $a_t(i)$ according to a stochastic action policy π given by the Gibbs distribution (Boltzmann distribution):
$$\pi_t(a_t(i) \mid s_t) = \frac{e^{Q_t(s_t, a_t(i))/T}}{\sum_{j \in A} e^{Q_t(s_t, a_t(j))/T}}.$$
Here, T is called the temperature [17]: a higher T causes active exploration of actions (each action is selected with a similar probability), whereas a lower T gives a greedy selection of the action with the higher Q value.
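The effect of the temperature on this policy can be checked with a short sketch. The function names are hypothetical, and subtracting the maximum Q value before exponentiation is a numerical-stability detail added here that does not change the resulting probabilities.

```python
import math
import random


def boltzmann_policy(q_values, temperature):
    """Selection probabilities of each action under the Gibbs distribution."""
    q_max = max(q_values)  # shift for numerical stability only
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(q_values, temperature, rng):
    """Draw one action index according to the stochastic policy."""
    probs = boltzmann_policy(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


# Usage: the same Q values under a high and a low temperature.
q_values = [1.0, 2.0, 3.0]
exploring = boltzmann_policy(q_values, temperature=100.0)  # nearly uniform
greedy = boltzmann_policy(q_values, temperature=0.01)      # nearly deterministic
action = select_action(q_values, temperature=1.0, rng=random.Random(0))
```

A common design choice, not stated in the text, is to start with a high T and lower it as learning proceeds, so that early exploration gives way to exploitation of the learned Q values.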
We propose to use $Q(s_t, a_t(i))$ as the criterion for growing the Feature Map, Action Map, and Feeling Map. Specifically, when the robot chooses an action with a high $Q(s_t, a_t(i))$ but the instructor judges it to be wrong, a new row/column is inserted near $s_t$, that is, the BMU c. The growing process is described in Section 2.1.
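The growing criterion can be expressed as a simple test, sketched below. What counts as a "high" Q value is not specified in the text, so the threshold `q_threshold` and its default value are purely hypothetical, as is the function name.

```python
def should_grow(q_value, instructor_approved, q_threshold=0.8):
    """True when a confidently chosen action was nevertheless judged wrong.

    q_threshold is a hypothetical cutoff for a "high" Q value.
    """
    return (not instructor_approved) and q_value >= q_threshold


# Usage: only the confident but wrong action triggers the insertion of a row/column.
confident_wrong = should_grow(q_value=0.95, instructor_approved=False)
hesitant_wrong = should_grow(q_value=0.10, instructor_approved=False)
confident_right = should_grow(q_value=0.95, instructor_approved=True)
```

The intuition is that a confident but wrong action suggests that two different instructions share one BMU, so the map needs more resolution around that unit.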

Feeling Map.
To express the degree to which a voice instruction has been learned by the robot, a Feeling Map which has the same number of units as the Action Map is designed (Figure 2). The distance from the input pattern to the BMU of the Feature Map and the reward from the instructor are used to calculate feeling values, which are normalized in [−1.0, 1.0], where a high positive value means happiness, 0.0 is the initial value of each unit, and negative values express sadness. In the learning algorithm, which was also used in [12][13][14], F(i) denotes the feeling value of unit i on the Feeling Map (zero initially), the number of consecutive rewards/punishments is taken into account, $D_i$ is the Euclidean distance (squared error) between the unit on the Feature Map corresponding to i and the input data, and a, b are constants with 0 < a < 1, 0 < b 1.

4.1.
Learning and additional learning experiments were performed using the system with PL-G-SOM proposed in Section 3 and the system with T-SOM in [12][13][14]. Four kinds of voice instructions were used in the experiments: sit down, lie down, stand up, and walk. Instructions in Japanese were used to train the system. Additional learning using voice instructions in other languages was executed after the training in Japanese. Three languages, English, Chinese, and Malaysian, were used to confirm the additional learning ability of the system. The voices were recorded in a normal room by 3 males who pronounced each instruction 3 times. Thus, 3 samples of each instruction were used for each language, giving 48 samples in total for the 4 actions.
Sound waves were preprocessed by normalization and noise elimination and divided into 20 windows to yield feature vectors in a 20-dimensional input space. Figure 4 shows an example of the instruction "sit down" pronounced in Japanese ("Osuwari"), English ("Sit"), Chinese ("Zuoxia"), and Malaysian ("Duduk"). Parameters used in the experiments are shown in Table 2.
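A preprocessing step of this kind can be sketched as follows. The text does not state which quantity is computed in each window, so the mean absolute amplitude used here is an assumption, the noise elimination step is omitted, and the function name `extract_features` is hypothetical.

```python
import math


def extract_features(wave, n_windows=20):
    """Normalize a waveform and reduce it to one value per window.

    Assumption: each feature is the mean absolute amplitude of its window,
    after scaling the waveform so that its peak amplitude is 1.0.
    """
    peak = max(abs(s) for s in wave) or 1.0  # avoid dividing by zero on silence
    normalized = [abs(s) / peak for s in wave]
    size = len(normalized) / n_windows
    features = []
    for k in range(n_windows):
        chunk = normalized[int(k * size):int((k + 1) * size)]
        features.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return features


# Usage: a synthetic waveform of 4000 samples whose amplitude grows over time.
wave = [math.sin(0.01 * i) * (i / 4000) for i in range(4000)]
features = extract_features(wave)  # 20 values, each within [0.0, 1.0]
```

Whatever feature is used, a fixed number of windows gives every utterance the same input dimension regardless of its duration, which is what the Input Layer requires.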

Results and Analyses.
Both T-SOM and PL-G-SOM achieved 100% recognition rates for the 4 actions in different languages after learning and additional learning. However, PL-G-SOM showed faster and better convergence than T-SOM in terms of the Euclidean distance (SE: squared error) between the input and the BMUs (Figure 5). This means that the classification of the input patterns was executed more efficiently by PL-G-SOM. Furthermore, the feeling values, which express the instruction recognition rate, showed more clearly that the correct actions of the robot corresponding to the voice instructions were acquired more quickly and stably (Figure 6). Figure 7 shows the internal states of the Feature Map and the Action Map during the learning process. After training with Japanese and additional learning with English, Chinese, and Malaysian, 300 times respectively, the Feature Map and the Action Map of PL-G-SOM had grown from 25 (5 × 5) units to 165 (11 × 15) units (Figure 8). The scaling variable ε used in PL-G-SOM ((6)-(7)) changed during training. Figure 9 confirms that ε gradually decreased during the initial learning; however, whenever a new language was input, ε suddenly became larger and its annealing scheme was repeated. Figure 10 shows the increase in the number of units on the Memory Layer of T-SOM and in the number of units of PL-G-SOM. Both grew with additional learning: the number of units on the Memory Layer of T-SOM stopped at 33, while 140 units were inserted into each layer of PL-G-SOM. To confirm the robustness of the two learning systems, we also tested noisy samples.
Table 3 shows the recognition rates of different actions when 10%, 20%, and 30% noise was added to the 48 voice samples (i.e., N% of the data in 20 dimensions were replaced by random numbers in [0.0, 1.0]). The average success rates of actions using T-SOM and PL-G-SOM were 48.0% and 86.7%, respectively, over 10 executions. Table 4 shows the recognition rates of different languages with the respective noisy samples.
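The noise model described above can be sketched as follows. How the replaced dimensions were chosen is not stated, so selecting them uniformly at random without repetition is an assumption, and the function name `add_noise` is hypothetical.

```python
import random


def add_noise(features, noise_ratio, rng):
    """Replace a fraction of the feature values by uniform random numbers.

    Assumption: the replaced dimensions are drawn at random without repetition.
    """
    noisy = list(features)
    n_replace = round(noise_ratio * len(noisy))
    for idx in rng.sample(range(len(noisy)), n_replace):
        noisy[idx] = rng.random()
    return noisy


# Usage: 10% noise on a 20-dimensional feature vector replaces 2 of its values.
rng = random.Random(1)
clean = [0.05 * k for k in range(20)]
noisy = add_noise(clean, noise_ratio=0.10, rng=rng)
```

With 20 dimensions, noise levels of 10%, 20%, and 30% correspond to 2, 4, and 6 replaced values per sample.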
Figure 11 shows the comparison of the recognition rates of T-SOM and PL-G-SOM when 10% noise was added to all 48 instruction samples.
The results using the proposed PL-G-SOM were better than those of the conventional learning system in all cases. We also investigated the use of frequency features for the recognition of different instructions; similar results were observed in those experiments.

Conclusion
PL-G-SOM, a novel self-organizing map, was proposed using a reinforcement learning algorithm and annealing schemes of parameters. Online learning and additional learning are available with PL-G-SOM, and it was adopted into a voice instruction learning system for an autonomous robot in place of the conventional T-SOM. Experimental results showed that the advantages of the new learning system are its speed and noise robustness.

Figure 1: Inserting a row/column into the feature map. Unit c is a BMU, f is the farthest unit among the neighbors of c, and r is the inserted row/column.

Figure 3: Flow chart of the proposed voice instruction learning system processing.

Figure 4: Features of an instruction (sit down) input to the robot in different languages. Left: sound waves; right: normalized features in 20-dimensional space.

Figure 6: Feeling values rose to the maximum happiness of 1.0 with the training iterations. The proposed PL-G-SOM (solid lines) showed faster and longer convergence than T-SOM in [12-14] (dashed lines).

Figure 7: Change of each map during the learning process. Left maps show Feature Maps and right maps show Action Maps in (a)-(d). Comparing the learning results shown in (c) and (d), it is easy to see that, except for action "1", PL-G-SOM was more effective than T-SOM in gathering similar inputs as neighboring outputs on its Action Map.

Figure 8: Results of feature classification and instruction learning/additional learning using PL-G-SOM. The maps grew from 5 × 5 (25 units) to 11 × 15 (165 units) in the experiment.

Figure 9: The change of scaling variable ε of PL-G-SOM in the learning/additional learning process.

Figure 11: Comparison of recognition rates between T-SOM and PL-G-SOM using voice instructions with 10% noise.

Table 3: Recognition rates of different actions with noises.

Table 4: Recognition rates of different languages with noises.