A Sequential Algorithm for Training the SOM Prototypes Based on Higher-Order Recursive Equations

A novel training algorithm is proposed for the formation of Self-Organizing Maps (SOM). In the proposed model, the weights are updated incrementally by using a higher-order di ﬀ erence equation, which implements a low-pass digital ﬁlter. It is possible to improve selected features of the self-organization process with respect to the basic SOM by suitably designing the ﬁlter. Moreover, from this model, new visualization tools can be derived for cluster visualization and for monitoring the quality of the map.


Introduction
The self-organizing map, with its variants, is one of the most popular neural network algorithms in the unsupervised learning category [1][2][3]. The topologically preserving mapping from a high dimensional space to a low dimensional grid formed by the SOM finds a very wide range of applications [4][5][6]. The SOM consists in an ordered grid of a finite number of cells i = 1 · · · D located in a fixed, low-dimension output array, where a distance metric d(c, i) is defined. Each unit is associated to a model vector (or weight) m i ∈ R n that lives in the same high dimensional space of the input patterns r ∈ Δ ⊂ R n , where Δ is the dataset to be analyzed. The distribution of the model vectors, after the training, resembles a vector quantization of the input data with the additional important feature that model vectors tend to assume, in the input space, the same topological arrangement of the correspondent cells in the predefined output grid so that the topology of the data is reflected into the lattice. In general, the learning is based on a competitive paradigm, where for each input presented to the map a best matching unit is selected by maximizing the similarity between the input and model vectors. Successively, the winning unit and its topological neighbours are adaptively updated in order to increase the matching with the input. The incremental learning algorithm of the SOM has been established heuristically following this paradigm, where the matching of the models with the input is increased by moving all the vectors in the direction of the present input sample The update step for the cell i is proportional to a timedependent smoothing kernel function h ci (t), and c is the winner node The model vector represents the center of the receptive field of the cell. A widely applied kernel is written in terms of the Gaussian function centered over the winner node The scalar function h ci (t) has a central role in the learning process defined by (1), and a large part of the literature on SOM is focused on the theoretical and practical aspects of the different possible implementations of this kernel function.

Advances in Artificial Neural Systems
The SOM learning rule (1) can be interpreted as an approximate gradient descent of a cost function related to the quantization error [7], but it does not possess an exact cost function for continuous input distributions. In [3] it is shown that in order to better minimize a distortion measure the one step rule (1) may be modified formulating an improved but still approximate minimizing step. Equation (1) can be related to the asymptotic point density of the SOM or the so called magnification factor, which relates the density of the SOM weights to the density of the input data [8]. In [9], equation (1) is slightly modified in a convex or concave nonlinear expression achieving a certain control on the magnification factor. In [10] there is a similar nonlinear modification of the learning rule, but the focus is on time series reconstruction. In [11] a rival model penalization is introduced in the SOM, where the rivals are identified as those cells that are not direct neighbours of the winner, except their model vectors are closer to the input than those of the topological neighbours of the winner. For these rival cells, the kernel function assumes negative values so that through (1) their models are moved away from the input. In [12], a time invariant learning rate is proposed, and in [13] a long-term depression of the learning rate is proposed, but with a repulsive kernel function that leads to a novelty detector. Other works are related to an automatic or adaptive generation of the parameters μ(t) and σ(t) [14][15][16].
In the simple approach of [14] μ(t) and σ(t) are obtained as functions of the worst distance, registered during the training, between the winner and the input data.
The common feature of all this SOM variants is that the structure of (1) is maintained almost unchanged while different types of h ci (t) functions are adopted. Then, they have different rules to determine the length of the update step, while the direction of the update step is always imposed by the input. Consequently, at each step the entire map moves toward the presented sample (in [11] some rival models are moved in the opposite direction), without any memory of the previous update directions. This behaviour has theoretical support only in the final convergence phase of the learning [5,8] when it determines a good local statistical accuracy, and it controls the magnification factor. However, during the first phase (when the width of the kernel is shrinking and the global topology of the input data is learned), a smoother change of the update step direction than the one forced by (1) would be a desired feature.
To obtain this feature, in this paper a novel SOM model is proposed where the learning is accomplished by means of a higher-order linear difference equation that implements a predefined tracking filter. In the proposed model, each cell contains a set of memory vectors in order to take trace of the past positions of the model vectors and of the inputs. The update direction is defined by the dynamic of a properly defined filter that guides the movement of the model vectors as a combination of the past vectors. In that way, the proposed model gives a general framework to define different training strategies, where the basic SOM is a particular case. Moreover the proposed method has some new useful additional visualization and data analysis capabilities. In particular each cell can give more information such as local density and learning trajectories when higher order filters are used.

Model Description
2.1. Learning Filter. From (1), the new weight depends explicitly on the present input and on the present weight position (it is in fact a convex combination of these two vectors), and there is not a direct dependence on past input and weight values.
Our idea is to add more degrees of freedom to the incremental learning equation of the SOM, in order to improve the relaxation process that takes place during the self-organization of the model vectors.
This is obtained by adding to the neuron model a "memory" of the past values of the inputs and model vectors of the cell. The new weight vector is calculated as a linear combination of the vectors in the memory. To properly define this linear combination, it is necessary to employ a stable, discrete-time filter, which in the following will be referred as learning filter (LF). Using the Z-transform formalism, the learning filter is defined by means of its transfer function G(z) The filter G(z) is implemented by means of a linear difference equation described by the LF coefficients a = [a 1 , . . . , where r(t − k) represents the input sequence and m(t − k) represents the output sequence. A rigorous requirement of the LF is that it has to be a low-pass filter at least of type 1, that is, with unitary static gain: G(z = 1) = 1. This also means that for a constant input sequence r(t) = r the error sequence r − m(t) will tend to zero as time goes to infinity. Hence, the limit lim t → ∞ m(t) = r holds and the LF can track constant inputs with zero error. Guidelines for the design of the LF will be described in subsequent sections.

Neuron Model.
In order to introduce the dynamic of the LF in the SOM, we associate two matrices (composed by N column) to each element of index i of the map vectors, The columns of M i represent the sequence of the last N values of the weight vector of the cell i, while the columns of R i take trace of the last N values of the input r(t) ∈ R n to the cell i. Note that r(t) in (1) represents the sequence of samples randomly selected from the input distribution during the learning, which are presented to the whole map. Then, in the basic SOM all the cells receive the same input at each time step. Conversely, in the proposed model each cell contains a personalized input sequence R i , related to the winning frequency of the cell.
The n × N memory matrices R i and M i represent the memory added to the neuron model in order to take trace of the past events. Each cell of the proposed SOM calculates the value of the model vector at time t as a linear combination of the memory vectors at time t, through the coefficients of the LF where the represents the whole memory of cell i, while the 2N column vector g = b a contains the LF coefficients. Hence, the model vector m i (t) is directly calculated-from a given memory matrix Q i (t) and the LF coefficients g.

Network Training Procedure for the Proposed Model.
Given the vectorial input data to analyze r ∈ Δ ⊂ R n , we first create the output grid array of the SOM, and then choose the neighbourhood arrangement, defining the distance function d(c, i) between two cells of indexes c and i. Then, we design the linear difference equation of order N that drives the learning process, and suitably define the coefficients T of the numerator and denominator of the low-pass LF G(z) in (4). Initial values for the memory vectors Q i (0) can be selected at random, and the so-called linear initialization, that uses the principal axes of the input distribution, can give some benefit as in a classical SOM.
When a new input sample r(t) is presented to the map during the training the winner unit c is selected as the node that has the nearest model vector m i (t) to the input sample r(t) The receptive field of the cell i is given by the Voronoi region of the model vector Then, the memory of the cells needs to be updated in order to take trace of the new sample. If each neuron is considered as a standalone filter, without neighborhood collaboration, the update of the memory matrices is a straightforward one-step time shift of the memory vectors To include the neighborhood collaboration in our model, we consider the following update expressions: and the global memory matrix where the Gaussian neighborhood function is calculated as In that way, only the memory of the winner and its topological neighbors have to be updated, whereas the memory of the units that are far from the winner remain almost unchanged.

Update Direction and Convergence.
The neighbour collaboration is accomplished by means of an adaptive update of the "memory" of the neurons. Neurons that do not recognise the input pattern will not update their memory, while the memory of the neuron that recognise the pattern is updated together with the memory of its topological neighbours. It is known that in order to have the selforganization of the model vectors the update has to increase the similarity with the input pattern [3]. In our scheme, the model vectors are a linear combination of the memory vectors, as the update direction of the model vectors is determined by (6) and (11). Substituting (11) in (6), we have the following implicit expression of the updated prototypes: Then, the model vector is updated in the direction of m i + (t), which is the next value of the LF output in the case of the absence of neighbourhood cooperation It is central to point out the similarity of the implicit update expression (13) of our algorithm with the classic SOM update rule (1). In the classic SOM update rule (1), the target vector is represented by the randomly selected input pattern r(t), which is the same for all the cells, whereas in the update expression (13) the same role is played by the vector m i + (t), which represents a target pattern different for each cell, defined as a linear combination of the memory vectors and the actual input sample and model vector. In particular this combination is given by an LF that is designed in order reduce the error r(t) − m i + (t) to zero in the case of a continuously repeated input sample r(t) = r. In this sense, the similarity of the model vector with the input sample is increased when the model vector moves in the direction defined by (13). A necessary requirement of every kind of LF to have this property is that they need to be at least of type 1 [17]. This requirement can be translated in a constraint on the coefficients of the LF which is equivalent to require a unitary static gain G(1) = 1. This means that the linear combination in (6) is an affine combination of the vectors r i k , m i k , k = 1 · · · N and, in the case that a i , b i ≥ 0, it becomes a convex combination. This yields some possible geometrical interpretations of (6). In the classic SOM, the space for the weight update is the segment between the weight and the present input sample. In the proposed scheme, the new weight can move in the higher dimensional space given by the affine combination, defined by the LF coefficients of the 2N vectors [r i Another important remark is related to the fact that the target vector sequence m i + (t) can be seen as the output of a filter G(z), where the input is related to the stochastic sequence r(t). Therefore, using an LF with static gain G(1) = 1 corresponds to require a filter which maintains the firstorder moment of the stochastic process r(t), which is also the mean value of the input distribution. In fact, it is known that the mean value of the output sequence m i + (t) of a linear filter G(z) is related to the mean value of the input stochastic sequence r(t) by the following relationship: where the symbol E{·} represents the expectation. As a result the static properties of the input distribution are well represented by the sequence of target vectors m i + (t), and moving the prototypes as in (13) reduces the distance to the input distribution. Consequently, the local stability and convergence of the proposed model are assured by the stability and static gain features of the learning filter, while the global behaviour of the map follows the same heuristic principle of the basic SOM. In particular, to have the global convergence of the model vectors it is necessary that h ci (t) → 0, when t → ∞, and it is necessary to respect the well-known convergence conditions of the Robbins-Monro stochastic approximation method [2].
In the proposed model, the function h ci (t) can be defined similarly to other self organizing paradigms as vector quantization, neural gas [18], or SOM, obtaining analogous self organization behaviours.
As a final result of the proposed model a new self organizing algorithm is defined by (6), (7), and (11). Very stable and robust self organization activities of the model vectors are observed when executing (6), (7), and (11) if the LF G(z) is carefully designed.

Guidelines to Design the Learning Filter (LF).
In this section we describe general principles for a useful design of the LF. We first describe the LF that gives the basic SOM. To reproduce the classical SOM in our model, we have to consider, for example, the kernel function where we consider the following annealing schemes [3]: and T 1 , T 2 are, respectively, the decreasing times of the kernel width and of the learning rate.
The exact SOM equations are obtained from (6), (7), and (11) considering the following first-order learning filter LF1: where α ∈ (0, 1) is a global "learning constant." By substituting the coefficients of the LF1 in (19) into (6) we have that (6), (7), and (11) become formally equivalent to the basic SOM. Hence, the basic SOM algorithm is included in our framework and it can be obtained by using the firstorder LF1. The learning constant α, which determines the bandwidth of the learning filter, in this contest acts as a global attenuation constant of the learning factor μ(t).
A second-order learning filter LF2 has the form Filters coefficients can be chosen so that the learning filter (20) is a low-pass filter with the same normalized cut-off frequency ω 0 ∈ (0, 1) (−3 db band) of filter (19). In this case, the same global attenuation of the learning factor can be expected and the advantage of filter (20) on the classical SOM (19) is the enhanced filtering action obtainable by the higher-order update equation. Due to their well-known "optimal" characteristics of gain and bandwidth reported in circuit analysis, we use discrete low-pass Butterworth filters [17] as LF in many numerical experiments obtaining good results. With this choice, the only free parameters of the LF are the order N of the filter and the cut-off frequency ω 0 [17]. Figure 1 shows two different 20 × 20 maps trained with two different values of ω 0 , by using a second-order low-pass Butterworth filter. The input distribution is clustered in four regions with uniform distribution in R 2 as in Figure 1. Trained model vectors are the nodes of the depicted grid. With a high value of ω 0 = 0.8, the final distribution of model vectors in Figure 1(a) shows a better fitting (or regression) of the input distribution. For a lower value of ω 0 = 0.1, the map has a more interpolative Advances in Artificial Neural Systems 5 behaviour shown in Figure 1(b). This is due to the fact that with smaller values of ω 0 the learning rate factor decreases more rapidly, while the decrease of the neighbourhood width is unchanged. Then, it is possible to select suitable values of ω 0 depending on the particular data to be analyzed with the SOM. For example, usually a more interpolative map is desired for cluster analysis. In a basic SOM, this can be achieved by generating different maps changing the initial values of the learning factor μ(t) or by taking a greater final value of the neighbourhood width σ(t). In the proposed framework, we utilize the learning velocity constant ω 0 , which defines the band of the learning filter, to obtain maps with different interpolative behaviours, instead of operating on the annealing scheme of the learning rate factor μ(t).

Visualization Tools
The proposed method add a new feature to the SOM algorithm, by taking into account the old values of the model vector and of the input during training and using this information when updating the model vectors. By the numerous performed simulations, it comes out that, from these vectors m i 1 , . . . , m i N , useful information can be extracted for visualization purposes. If the LF is of order N ≥ 2, the following quantity can be computed for each cell after the training of the network: The norm of this vector gives the average lengths of the distances between subsequent vectors m i k . It comes out from simulations that e i is related to the point density of the input in the Voronoi region of the cell i. In particular, this norm is smaller for higher-density regions, and vice versa. When e i is visualized in the output grid, defining the colour of the unit, a visualization map similar to the U-matrix [19] is obtained and we call the proposed visualization the E-matrix. The U-matrix is obtained by visualizing the distances between the model vectors of adjacent units while the E-matrix is obtained by considering information given by each cell itself, without considering the topological neighbourhood. Examples of the usefulness of this visualization tool are given in the next section.
A second visual aid on the map is obtained by considering the direction of the normalized vectors e i / e i , that indicate the final direction of the trajectory of the model vector during training. This direction is defined in the input space and can be translated in the output map space by finding, within the neighbours of the cell i, the one whose model vector e i is "pointing to". This is achieved by first finding, for the cell i, all vectors d i q = m q − m i , q ∈ Λ i , where Λ i is the set of the direct neighbours of cell i, and then computing where ·, · denote the dot product. Then, in the output map grid, an arrow can be traced that connects the cell i to the cell u i , and we call the proposed visualization the A-matrix. This reveals to be a useful tool for visualizing the zones of attraction, or contrarily the empty zones of the map.

Numerical Results
We consider now various examples of input distributions in order to evaluate the performance of the proposed algorithm and the effect of changing the filter order N and cut-off frequency ω 0 .
To show the characteristics of the proposed model, we will consider some features of the convergence phase and compare them with data obtained with the basic SOM. To evidence the effects of the newly introduced filter, we will adopt the same kernel function h ci (t) by choosing the same annealing schemes for σ(t) and μ(t).
The total number of presentations used along our experiments is 100000. The presented results show typical trends of several tests performed by the authors. To assess the quality of the map, two indexes are used: quantization error and distortion measure.
Quantization error is calculated as where Δ denote the cardinality of the input dataset. Quantization error indicates how close the model vectors are fitting the data, but does not take into account the ordering state of the map. The following empirical distortion measure DM(σ) is used: where w ci (σ) = exp(−d(c, i) 2 /2σ 2 ). Unlike the quantization error, the distortion measure DM(Δ, σ) considers the SOM topology and we have DE(Δ, σ) → QE(Δ) when σ → 0. The minimization of QE is the goal of vector quantization, while the SOM can be regarded as a computational intelligence algorithm that aims to minimize the distortion measure DM(Δ, σ) for some σ > 0. It is known that the SOM only minimizes DM(Δ, σ) approximately, whereas the effective minimization of DM(Δ, σ) is extremely heavy numerically [3,20,21].

Analysis of a Gaussian Distribution.
In the first experiment, we consider that the input belongs to a Gaussian cluster Δ d g ∈ R d centred in the origin of the axes and with unitary variance. In this experiment, we intend to show the effects of the application of our method on the quality of the obtained maps, by using a Butterworth (BW) LF and changing the bandwidth of the filter and the filter order.
A classic 20 × 20 SOM with hexagonal neighbourhood has been trained where model vectors have been initialized using random initialization.
We also trained a set of maps with our method by adopting different learning filters. Training uses the same number of steps and the same learning parameters employed for the basic SOM, but the learning filter is a low-pass Butterworth filter, and a different map is trained for various values of the cut-off frequency ω 0 ∈ (0.1, 0.8) and filter order N = 2, 3, 4, 5. Each training procedure is repeated several times with different random initializations in order to avoid possible topological defects, and to obtain statistically significant results. Figures 2 and 3 show the averaged quantization error QE(Δ 10 g ) and distortion measure DM(Δ 10 g , 1) at the end    filter order is increased. Hence, the better QE are obtained for high-filter bandwidth and low-filter order. For higher values of ω 0 the BW filters of order 2 and 3 can reach better QE than the classic SOM. The opposite situation can be noticed observing the DM curves. Increasing the filter bandwidth, for ω 0 > 0.3, the DM decreases, while it improves by increasing the filter order. The use of the BW learning filter of order N = 5 leads to trained maps with better DM than classic SOM for all the used values of the filter bandwidth. The trade-off between QE and DM is a common problem that arises in the formation of SOMs. The proposed algorithm furnishes new instruments to directly control the quality of the map. Augmenting the filter order for fixed filter bandwidth produces maps with improved distortion measure but affects the quantization error, while increasing the filter bandwidth produces opposite trends. Depending on the particular analysis to be carried out, different weights can be given by the user to these features of the map. If a map that strictly fits the data is crucial than the minimization of quantization, error is the priority and filters with high bandwidth and low order can be used. If a smoother mapping is needed than higher-order filters can be used. Furthermore, it is noticeable that every BW filter has a range of bandwidth where both QE and DM perform better than the classic SOM. In this sense, we can state that the proposed model gives better self-organization performances than the basic SOM.
If the main scope of the SOM analysis is cluster visualization, then, the use of higher-order filters brings enhanced visualization results when the E-matrix or the Amatrix visualization tools are used as shown in the following examples.

Analysis of a Clustered Distribution.
In the second experiment, we consider that the input distribution is formed by two clusters in R 2 , where the cluster 1 has a uniform probability density function p(x ∈ 1) = 1/3 and the cluster 2 has a uniform probability density function p(x ∈ 2) = 2/3.
A classic 20 × 20 SOM with hexagonal neighbourhood has been trained where model vectors have been initialized using random initialization. The final state of the model vectors after 100000 presentations is shown in Figure 5. In the trained map, an undesired curvature is present in the grid distribution of the model vectors at cluster 2. This is due to the fact that the two principal axes of the input distribution have different lengths, while the map has a 20 × 20 square dimension. In this case, a rectangular map may behave better, but we may be interested in searching a 20 × 20 map that does not exhibit this kind of distortion. In classic SOM, this can be done by adjusting the learning parameters that define h ci (t), then it is not a trivial task to choose these parameters in order to improve the map quality. On the other hand, in the proposed method we can obtain the desired feature by suitably selecting the filter order and cut-off frequency as shown in the previous example. By using a Butterworth filter of order N = 4 and cut-off frequency ω 0 = 0.4, without changing the annealing schemes for the kernel function, the map in Figure 6 is obtained.   The model vectors shown in Figure 6 are the centers of the receptive fields m i calculated as in (6). At the end of the training, we also have, for each cell i, the vectors r i k (t), m i k (t), k = 1 · · · N. In this 2D example they can be visualized directly in the input space. In particular, for each cell i, the trajectory vector e i is calculated as in (21) and it is shown as an arrow that starts from m i . Figure 7 shows the vectors e i for the map in Figure 6. The trajectory vectors clearly indicate the zones of attraction of the input space. In the classic SOM, each cell has one model vector and this kind of visualization is not possible. Now let us consider the visualization of the map on the output grid. A popular tool for visualizing the clusters on the map is the U-matrix. We calculate the U-matrix and the E-matrix for the map of Figure 6 obtaining Figures 8  and 9, respectively. Also the A-matrix, the projections of the trajectory vectors in the output space, is displayed in Figure 9. The E-matrix in Figure 9 shows the norms of the vectors e i of Figure 7, and the A-matrix shows their directions projected on the topographic surface. The U-matrix gives information of the distances between neighbouring model vectors, whereas the E-matrix gives information of the final trajectory of the model vector of each cell independently. The figures show that similar information can be extracted by the two matrices, attesting that the trajectory vectors e i of the interpolative units (model vectors that remain between clusters) have higher final values than units that fall inside more dense areas. This confirm that the E-matrix can be used as a cluster visualization tool. Furthermore, it has additional features with respect to the U-matrix since some information on the map quality can be derived from E-matrix observing how the model vectors fit the underlying distribution.

Analysis of the 4D Anderson's Iris Dataset.
We show now the potentiality of the proposed method for visualizing 4D data, analysing the well-known Anderson's Iris dataset [22]. This dataset is composed of 150 patterns of four real variables, and each pattern belongs to one of three different classes. A first map obtained from a classic SOM of grid dimensions 40 × 10 and a second map obtained by using a second order Butterworth LF are considered. Note that the number of neurons is higher than the number of samples in the data, then the SOM is used as a nonlinear regressioninterpolation method. We consider here results that show similar quantization error and the distortion measure of the two maps. Therefore, equivalent information is achievable from the final distribution of the model vectors as shown in Figure 10. However, it is important to evidence the situation of a very short training length duration, where the number of presentations is limited to one or two epochs. In this case, the U-matrix does not give significant information, as can be seen from Figure 11(a), but from the E-matrix and trajectory vectors a rough, but correct, indication of the presence of two zones of attraction is easily obtained, as shown in Figure 11(b). Then, from this "rough" information it is possible to change some map parameters (number of neurons, filter order, etc.) and immediately optimize the learning process.

Computational Aspects
Considering the numerical complexity of the proposed model, it is important to note that the more demanding operations are the comparisons needed to find the winner unit. In the proposed model, the search of the winner is accomplished as in classical SOM, so that if no shortcut winner search is used, the numerical complexity is o(D 2 ) where D is the number of map units. The proposed model increases the number of updating operations per unit: when the classic SOM needs 1 scalar-vector multiplications at each update step, the proposed method needs 4N + 1 scalar-vector multiplications, so that the increase of the computational load is linear with the filter order N. With regard to the memory requirements, the proposed model needs 2N times the memory needed by the classic SOM. The increased requirements of computational resources are acceptable when the order of the filter is kept below N = 10, whereas the benefits of the proposed model are noticeable even for N = 2 or N = 3.

Conclusions
The presented work shows the feasibility of implementing the SOM learning with a higher-order difference equation that takes into account the data of the previous steps of the update process. The incremental update rule proposed is based on a discrete difference implementation of a suitably designed low-pass digital filter. The proposed model is a framework that allows the flexibility to design different learning strategies for the incremental training of various algorithms related to vector quantization, including the topology preserving methods such as SOMs. It is shown that this approach allows a simple and flexible control of the SOM formation. This gives better performances indexes with respect to a classical SOM at a cost of a slightly higher computational cost. Moreover, the memory of previous data used in the filter implementation can be used at the end of the training for visualizing the training trajectories. This novel visualization tools are very useful to understand the dynamic of the map formation and can be used for visual cluster analysis.