A Contribution to the Study of Ensemble of Self-Organizing Maps

This study presents a factorial experiment to investigate ensembles of Kohonen Self-Organizing Maps. Cluster validity indexes and the Mean Square Quantization Error were used as criteria for fusing Kohonen Maps, through three different equations and four approaches. Computational simulations were performed with traditional datasets, including those with high dimensionality, non-linearly separable classes, Gaussian mixtures, almost touching clusters, and unbalanced classes, from the UCI Machine Learning Repository and from the Fundamental Clustering Problems Suite, with variations in map size, number of ensemble components, and the percentage of the dataset used in bagging. The proposed method achieves a better classification than a single Kohonen Map, and we applied the Wilcoxon Signed Rank Test to evidence its effectiveness.


Introduction
An ensemble is a collection of individual classifiers with different parameters, which may lead to higher generalization than any single classifier working separately, as well as a decrease in model variance and higher noise tolerance compared to a single component [1]. Each classifier operates independently and generates a solution; these solutions are combined by the ensemble to produce a single output, as illustrated in Figure 1.
For a successful outcome of the ensemble, the components must generalize differently and the errors introduced by each component must be uncorrelated, because there is no point in combining models that adopt the same procedures and assumptions [2]. Furthermore, the greater the diversity among the components, the better the ensemble's performance [3].
For Kohonen Self-Organizing Maps (SOM) [4], uncorrelated errors can be obtained, for example, by using different training sets or by varying the training parameters of each neural network. The outputs of the SOM maps can be combined into a single output by fusing the neurons of the maps that compose the machine committee. These neurons must represent the same region of the input space; namely, the weight vectors of the neurons to be fused should be very close.
Interest in ensemble methods has grown rapidly and applications can be found in a variety of knowledge areas. In addition to "ensemble" and "machine committee", several denominations appear in the literature, such as classifier committees, combined classifiers, classifier ensembles, and multiple classifier systems.
There are many ensemble applications to solve problems in different areas. In [5], merged maps were compared with the traditional SOM for document organization and retrieval. As a criterion for combining maps, the Euclidean distance between neurons was used to select the neurons that were aligned, working with two maps at a time until all maps were fused into one. In [6], the proposed method outperforms the SOM in MSQE (Mean Squared Quantization Error) and topology preservation by effectively locating the prototypes and relating the neighbor nodes. In [7], the authors aimed to preserve the map's topology in order to obtain the most truthful visualization of datasets. In [8], the authors investigate the use of relative validity indexes in cluster ensemble selection; these indexes select the components with the highest quality to participate in the ensemble.
Different applications can be mentioned, such as mechanical fault diagnosis for high-voltage circuit breakers [9], audio classification and segmentation [10], biomedical data [11], image segmentation [12], robotics [13], identification and characterization of computer attacks [14], unsupervised analysis of outliers in astronomical data [15], financial distress modeling [16], credit risk assessment [17], computer vision quality inspection of components [18], performance optimization of a classifier ensemble employed for target tracking in video sequences [19], disease prediction [20], and prediction of the duration of bus trips several days ahead [21]. Interesting reviews on this subject can be found in [22][23][24][25].
This work presents a contribution to the study of ensembles of Self-Organizing Maps of equal sizes. The goal is to improve classification accuracy when compared to a single map. The paper is organized as follows: Section 2 presents some important concepts for this work, Section 3 shows the methodology proposed for map fusion, Section 4 presents the results obtained from the simulations, and Section 5 presents the conclusions and proposals for future work.

Background
2.1. Self-Organizing Maps. The Self-Organizing Map (SOM) is a neural network model known as a method for dimensionality reduction, data visualization, and data classification. It uses competitive, unsupervised learning, performing a nonlinear projection of the input space $\mathbb{R}^n$, with $n \gg 2$, onto a grid of neurons arranged in a two-dimensional array. This neural network has only two layers: an input and an output layer [4].
The network inputs $x$ correspond to the $n$-dimensional vector space. Each neuron $i$ of the output layer is connected to all the inputs of the network and is represented by a vector of synaptic weights $w_i$, also in the $n$-dimensional space. Neurons in the output layer are connected to adjacent neurons through a neighborhood relationship that describes the topological structure of the map.
During the training phase, following a random sequence, input patterns $x$ are compared to the neurons of the output layer. Through the Euclidean distance criterion a winning neuron, called the BMU (Best Matching Unit), is chosen: the BMU is the neuron whose weight vector has the smallest distance to the input pattern; that is, the BMU is the most similar to the input $x$. Denoting the winner neuron index by $c$, the BMU can be formally defined as

$$c = \arg\min_i \|x - w_i\|. \quad (1)$$

Equation (2) adjusts the weights of the BMU and its neighboring neurons. Here $t$ indicates the iteration of the training process, $x(t)$ is the input pattern, $\alpha(t)$ is the learning rate, and $h_{ci}(t)$ is the neighborhood kernel around the winner neuron $c$:

$$w_i(t+1) = w_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - w_i(t)]. \quad (2)$$

There are two ways of training a Kohonen map: sequential and batch training. The sequential training process of Self-Organizing Maps consists of three stages:

(1) Competition: the input patterns are presented to the network one by one and the output layer neurons compete against each other. The smallest Euclidean distance between the synaptic weights of the neurons and the input pattern determines the only winner (BMU).
(2) Cooperation: once the winner neuron is chosen, it becomes the center of a topological neighborhood of activated neurons.
(3) Adaptation: neurons in the topological neighborhood adjust their synaptic weights so that neurons near the BMU receive greater adjustments than more distant neurons.
Figure 2 shows the process of updating the weights of the winner neuron and its neighborhood. The solid lines correspond to the situation before the update, and the dashed lines show the change as the BMU and its neighbors move toward the input pattern $x$. The bright and dark circles represent, respectively, the neurons before and after the weight adjustment.
The Gaussian function is used to describe the influence of the BMU on its neighbors when updating the synaptic weights. The activated neuron excites neurons in its immediate vicinity most intensively, and the excitation decays smoothly with lateral distance. Equation (3) shows how the excitation of neuron $i$ is calculated in relation to the winning neuron $c$:

$$h_{ci}(t) = \exp\left(-\frac{d_{ci}^2}{2\sigma^2(t)}\right), \quad (3)$$

where $d_{ci}$ is the distance between neurons $c$ and $i$ and $\sigma(t)$ is the neighborhood radius. At the end of the learning process, the output layer neurons represent a nonlinear approximation of the input patterns. In the batch training process all input patterns are presented to the network at once. Thus, the learning rate parameter $\alpha(t)$ is not necessary in (2) and randomness is not required in the presentation of the input vectors to the network.
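The three training stages above can be sketched in a few lines of Python. This is a minimal illustrative implementation (rectangular grid, linear decay schedules), not the configuration used in the paper; all function and parameter names here are ours:

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=100,
              alpha0=0.5, sigma0=4.0, seed=0):
    """Sequential SOM training: competition, cooperation, adaptation."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows * cols, dim))
    # Grid coordinates of each output neuron (rectangular lattice for simplicity).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    n_steps = epochs * len(data)
    step = 0
    for epoch in range(epochs):
        for x in data[rng.permutation(len(data))]:       # random presentation order
            alpha = alpha0 * (1 - step / n_steps)        # decaying learning rate alpha(t)
            sigma = max(sigma0 * (1 - step / n_steps), 1.0)  # shrinking radius sigma(t)
            # Competition: BMU c = argmin_i ||x - w_i||   (Eq. 1)
            c = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Cooperation: Gaussian kernel h_ci = exp(-d_ci^2 / (2 sigma^2))  (Eq. 3)
            d2 = np.sum((grid - grid[c]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))
            # Adaptation: w_i(t+1) = w_i(t) + alpha(t) h_ci(t) (x - w_i(t))  (Eq. 2)
            weights += alpha * h[:, None] * (x - weights)
            step += 1
    return weights
```

After training, the weight vectors approximate the input distribution, so the mean quantization error over the data should be small.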

2.2. Cluster Validity Index.
Cluster validity indexes (CVI) are used to evaluate clustering algorithm results. With them, it is possible to check whether an algorithm found the groups that best fit the data. The literature presents several CVI, and most of them have high computational complexity, which can be a complicating factor in applications involving large data volumes. To overcome this problem, a modification of the CVI calculations was proposed in [26], using the vector quantization produced by the Kohonen Map. The synaptic weight vectors (prototypes) are used instead of the original data; this reduces the amount of data, so the computational complexity of calculating the CVI decreases too. In the author's proposal, the hits should be part of the equation, not just the prototypes, to avoid possible differences between the values calculated with all the data and with only the prototypes.
The CVI calculation always involves a distance measure. The modification described in [26] changes the way these computations are performed. We explain this by comparing two equations: the first computes the distance between two clusters $C_k$ and $C_l$ in the traditional way (4), while the second includes the modifications for use with the SOM (5). One has

$$dist(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{x_i \in C_k} \sum_{x_j \in C_l} d(x_i, x_j). \quad (4)$$

In (4), $d(x_i, x_j)$ is a distance measure, and $|C_k|$ and $|C_l|$ refer to the number of points in clusters $C_k$ and $C_l$, respectively. When the number of points is high, the computational complexity is also high.
Equation (5) shows the modification considering the hits [26]:

$$dist(C_k, C_l) = \frac{\sum_{w_i \in W_k} \sum_{w_j \in W_l} h(w_i)\, h(w_j)\, d(w_i, w_j)}{\sum_{w_i \in W_k} h(w_i) \sum_{w_j \in W_l} h(w_j)}, \quad (5)$$

where $W_k$ and $W_l$ are the SOM prototype sets that represent clusters $C_k$ and $C_l$, respectively, $d(w_i, w_j)$ is the same distance measure as in (4), and $h(w_i)$ and $h(w_j)$ are the hits of prototypes $w_i \in W_k$ and $w_j \in W_l$. Equation (5) has a lower computational cost, since the quantities involved, $|W_k|$ and $|W_l|$, are lower than $|C_k|$ and $|C_l|$. Including the prototypes' hits minimizes the error caused by the vector quantization that the Kohonen Map produces, since it introduces into the calculation an approximation of the point density in the input space, here represented by the prototypes.
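To make the cost difference concrete, here is a small Python sketch comparing an average pairwise inter-cluster distance in the style of (4) with a hit-weighted version over prototypes in the style of (5). The function names and the exact weighted form are our reading of the description above, not code from [26]:

```python
import numpy as np

def intercluster_distance(data_k, data_l):
    """Traditional distance between two clusters, as in Eq. (4):
    average pairwise distance over all points -- O(|Ck| * |Cl|) terms."""
    d = np.linalg.norm(data_k[:, None, :] - data_l[None, :, :], axis=2)
    return d.mean()

def intercluster_distance_som(protos_k, hits_k, protos_l, hits_l):
    """Hit-weighted version over SOM prototypes, as in Eq. (5):
    far fewer terms, with hits approximating the point density."""
    d = np.linalg.norm(protos_k[:, None, :] - protos_l[None, :, :], axis=2)
    w = hits_k[:, None] * hits_l[None, :]
    return (w * d).sum() / w.sum()
```

When each prototype coincides with one data point and has exactly one hit, the two computations agree, which is a quick sanity check of the weighted form.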

2.3. Bagging.
There are three steps to obtain the components of an ensemble: generation, selection, and combination of the individual components. In the component generation step, diversity among the components is necessary to ensure that each network generalizes differently. It can be achieved with random initialization of the weights of each component, variation in the architecture of the neural network, variation of the training algorithm, or data resampling, among others.
For data resampling, Bagging [27] is a widely used technique. It consists of generating different training subsets from a single set via resampling with replacement. Sample selection from the original set follows a uniform probability distribution; therefore, some samples will be selected more than once because of the replacement. Individual components may not generalize satisfactorily, but the ensemble of these models can result in a better generalization than each of the individual components.
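The resampling step can be written in a couple of lines; this is a generic bagging sketch (the function name and `fraction` parameter are ours, chosen to match the percentages used later in the paper):

```python
import random

def bagging_subsets(data, n_subsets, fraction):
    """Draw n_subsets bootstrap samples, each containing fraction * len(data)
    items sampled uniformly WITH replacement (so duplicates can occur)."""
    size = int(fraction * len(data))
    return [random.choices(data, k=size) for _ in range(n_subsets)]
```

Because sampling is with replacement, subsets typically contain repeated samples while omitting others, which is precisely what makes the trained components diverse.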

Materials and Methods
The method consists of fusing Kohonen maps built from fractions of the dataset created by the bagging algorithm. The process overview is shown in Figure 3. Fusion occurs between two neurons that have the minimum Euclidean distance between them, indicating that they represent the same region of the input space.
The bagging method was used for generating subsets (resampling) from the training set, with uniform probability and replacement. The SOM was applied to each subset produced by bagging, and all maps were segmented by the k-means algorithm. Since not all the components should be used, because this can affect the final ensemble performance [28], the candidates are selected through cluster validity indexes, modified for use with SOM as proposed in [26]. In this work the cluster validity indexes CDbw [29], Calinski-Harabasz [30], generalized Dunn [31], PBM [32], and Davies-Bouldin [33] were used as parameters in the fusion decision process (consensus function).
The ranking-based selection method can be described as follows: the candidates to compose the ensemble are ranked based on their cluster validity index values, which measure the individual performance of each component.

3.1. Fusion Equations.
We tested three different equations for neuron fusion. The first, (6), is an arithmetic average between the weight vectors of the neurons to be fused:

$$w_f = \frac{w_a + w_b}{2}. \quad (6)$$

The purpose of the second equation was to avoid the influence of neurons without hits, as shown in

$$w_f = \frac{h_a w_a + h_b w_b}{h_a + h_b}. \quad (7)$$

Through (8), the third equation tested, we investigated whether fusion weighted by CVI could improve classification accuracy:

$$w_f = \frac{h_a I_a w_a + h_b I_b w_b}{h_a I_a + h_b I_b}, \quad (8)$$

where $w_f$ is the fusion result of neurons $w_a$ and $w_b$, $h_a$ and $h_b$ are the prototypes' hits, and $I_a$ and $I_b$ are the CVI of the maps to be fused.
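The three fusion rules transcribe directly to code. This is one plausible reading of (6)-(8) as described above; the identifiers are illustrative:

```python
import numpy as np

def fuse_mean(wa, wb):
    # Eq. (6): plain arithmetic average of the two weight vectors.
    return (wa + wb) / 2

def fuse_by_hits(wa, ha, wb, hb):
    # Eq. (7): hit-weighted average; a neuron with zero hits contributes nothing.
    return (ha * wa + hb * wb) / (ha + hb)

def fuse_by_hits_and_cvi(wa, ha, ia, wb, hb, ib):
    # Eq. (8): average weighted both by hits and by the CVI (ia, ib) of each map.
    return (ha * ia * wa + hb * ib * wb) / (ha * ia + hb * ib)
```

Note the behavior that motivates (7): if one neuron has zero hits, the fused vector equals the other neuron's weights, so unused prototypes cannot drag the fusion away from the data.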

3.2. Factorial Experiment.
We defined a factorial experiment to validate our approach. The map size was varied among 10 × 10, 15 × 15, and 20 × 20. Through bagging, collections of subsets were generated containing 10, 20, 30, and so on up to 100 subsets. These subsets were created with different data rates: 50%, 70%, and 90% of the training set; that is, each map size was combined with ten different numbers of subsets and three different bagging percentages. This combination results in 900 maps. Thus, the first experiment has map size equal to 10 × 10, with 10 subsets generated by bagging with a percentage of 50%, as shown in Figure 4.
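The parameter grid described above can be enumerated directly; a minimal sketch using the values stated in the text (each combination in turn involves training one SOM per subset):

```python
from itertools import product

map_sizes = [(10, 10), (15, 15), (20, 20)]
subset_counts = list(range(10, 101, 10))   # 10, 20, ..., 100 subsets
bagging_pcts = [0.5, 0.7, 0.9]             # 50%, 70%, 90% of the training set

# 3 sizes x 10 subset counts x 3 percentages = 90 parameter combinations.
grid = list(product(map_sizes, subset_counts, bagging_pcts))
```

The first entry of `grid` corresponds to the first experiment mentioned above: a 10 × 10 map, 10 subsets, 50% bagging.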

3.3. Approaches.
Four distinct approaches were defined for map fusion, as combinations of map ranking and map fusion criteria. The ranking was based on the five CVI and on the MSQE (Mean Square Quantization Error), defined in [6], and the map fusion was based on the five CVI and on an MSQE improvement criterion. The MSQE indicates how well the units of the map approximate the data in the dataset [7]; that is, it can be employed to evaluate the quality of the adaptation to the data [6].
In the first approach, the maps were ranked by each of the five CVI, but the fusion process was controlled by the MSQE improvement of the fused map. In the second approach, the maps were ranked by MSQE and the fusion process was validated by the improvement of each CVI. In the third approach, maps were ranked by each CVI and the fusion was controlled by the improvement of that same CVI; that is, if the maps were ranked according to the Davies-Bouldin cluster validity index, the fusion was controlled by the improvement of this index. Finally, in the fourth approach the maps were ranked by MSQE and fused by the MSQE improvement criterion. Figure 5 shows the fusion process.

3.4. Proposed Fusion Algorithm.
The proposed algorithm was applied to each of the combinations shown in Figure 4, according to the approaches in Figure 5. It can be described as follows.
(1) Generate, through the bagging, subsets from the training set.
(2) Train a SOM with each subset.
(3) Segment the maps with the k-means algorithm.
(4) Calculate the CVI and MSQE of each map $M_i$ ($i = 1, 2, \ldots$) and rank the maps according to this value, from the best index to the worst.
(5) The base map (the one with the best CVI or MSQE value) is fused with the next best map, according to the sorted CVI or MSQE, using (6), (7), and (8).
(6) Calculate the CVI or MSQE of the fused map.
(7) If there is an improvement in the CVI or MSQE value, the map resulting from the fusion becomes the base map; return to step (5) until all maps have been processed this way. Otherwise, discard the fusion between these two maps and return to step (5) until all maps have been processed.
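Steps (4)-(7) amount to a greedy improvement loop. A minimal sketch, with `score` and `fuse` as placeholders for the paper's CVI/MSQE criteria and fusion equations (a lower score is assumed better, as with MSQE or Davies-Bouldin):

```python
def greedy_fusion(maps, score, fuse):
    """Greedy fusion of ranked maps.
    score(m) -> quality value (lower is better here);
    fuse(a, b) -> candidate fused map.
    Both callables stand in for the criteria and equations in the paper."""
    ranked = sorted(maps, key=score)          # best index first (step 4)
    base = ranked[0]
    for candidate in ranked[1:]:              # step 5: fuse with next best map
        fused = fuse(base, candidate)
        if score(fused) < score(base):        # steps 6-7: keep only on improvement
            base = fused
        # otherwise discard this fusion and try the next ranked map
    return base
```

The usage below treats "maps" as plain numbers with a toy score, just to exercise the accept/reject logic.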
The main SOM parameters were hexagonal lattice, Gaussian neighborhood, initial training radius equal to 4, final training radius equal to 1, and 2000 training epochs.
Due to random initialization, the k-means algorithm is repeated 10 times and the result with the smallest sum of squared errors is selected.
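The restart strategy can be sketched as follows: a plain 1-D Lloyd's algorithm run from several random initializations, keeping the result with the smallest sum of squared errors (SSE). This is illustrative, not the paper's implementation:

```python
import random

def kmeans_best_of(points, k, restarts=10):
    """Run k-means `restarts` times from random initializations and keep
    the run with the smallest SSE. Works on 1-D points for brevity."""
    def run(seed):
        rng = random.Random(seed)
        centers = rng.sample(points, k)                  # random initialization
        for _ in range(100):
            # Assignment step: each point goes to its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
                groups[j].append(p)
            # Update step: each center moves to the mean of its group.
            new = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
            if new == centers:                           # converged
                break
            centers = new
        sse = sum(min((p - c) ** 2 for c in centers) for p in points)
        return sse, centers
    return min(run(s) for s in range(restarts))          # smallest SSE wins
```

Selecting the minimum-SSE run makes the outcome far less sensitive to an unlucky initialization.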

Results and Discussion
4.1. Datasets. The proposed method was tested on datasets with different characteristics from the UCI Repository [34] and from the Fundamental Clustering Problems Suite (FCPS) [35], as shown in Table 1. Rows with missing values were removed from the datasets.

4.2. Results. The aim of the simulation was to verify which combination of map size, number of subsets, and bagging percentage yields the best accuracy value. Each experiment was run 10 times and mean values for a 95% confidence interval were obtained. The results were compared with a single SOM map to verify the improvement in classification accuracy. Table 2 shows the best experimental results for each dataset; that is, which approach and which CVI resulted in the best accuracy value for each dataset. The column "Equation" shows the best fusion equation for each dataset. Equation (6) is the simplest of the three tested equations, being just a simple average between two neurons' weight vectors. Equation (7) is a weighted average using the neurons' hits. Equation (8) makes an average weighted by hits and by CVI, the most complex equation. The results showed an equilibrium between (6) and (7), while (8) achieved the best accuracy for only two datasets.
The column "Approach" refers to the way the maps were ranked and fused, as specified in Section 3.3.The best accuracy values were achieved with approaches B (maps ranked by MSQE and fused by CVI improvement criterion) and C (maps ranked by CVI and fused by CVI improvement criterion).
The column "CVI" shows which CVI was used to achieve the best fusion accuracy in relation to the single map, for each dataset (DB means Davies-Bouldin index and CH means Calinski and Harabasz index).The Davies-Bouldin index was the one that achieved the best fusion results over other CVI for these datasets and PBM index did not lead to good results.
The column "Map size" refers to the map size that has achieved the best accuracy value.The 15 × 15 map did not get the best results for these datasets.
The column "Bagging percentage" shows that, for the majority of datasets, using 90% of the samples for bagging produces better results. A higher bagging percentage provides a wider range of data for training, leading to better fusion accuracy. These results show that it is possible to get good accuracies without using the entire training set.
The column "Subsets number" shows the number of subsets that were evaluated to find the best accuracy. For most datasets used in this experiment, the best accuracies were obtained with large numbers of subsets. This is an important conclusion: the number of maps available for fusing influences the accuracy that can be achieved. Figure 6 shows the classification accuracy obtained for each dataset. Black vertical lines show the single SOM accuracy value. As we can see, in all situations the accuracies achieved by the proposed method were higher than the classification accuracies for a single map, except for Tetra, with a combination of higher amounts of subsets and a higher percentage of bagging.

Conclusions
This work investigated the possibility of fusing Kohonen maps ranked by cluster validity indexes and MSQE. Computer simulations evaluated different settings for map size, number of subsets, and different amounts (percentages) of the training set. Datasets with distinct characteristics, including unbalanced classes, were selected to test the proposed method.
The Kohonen Map fusion method achieves values similar or superior to those observed with a regular SOM (single map). Using only a percentage of the training data (not all of it), the proposed model achieved satisfactory results compared to the SOM, which used the entire training set.
The best results were obtained with high numbers of subsets and with high bagging percentages. Among the five CVI used in this study, only the PBM index did not lead to good results. Good results were achieved by ranking the maps by MSQE and fusing by the CVI criterion, and by ranking the maps by CVI and fusing by the CVI criterion.
An important contribution of this paper is showing that CVI and MSQE can be used as a consensus function in ensembles of Self-Organizing Maps.
Future work may explore the influence of parameters such as the number of maps, the adjustments in SOM training, and the segmentation method, and may evaluate strategies to enhance visualization of the fused maps.

Figure 3: Equal sizes maps fusion process overview.

Figure 4: Structure of the factorial experiment.

Figure 5: The four approaches to maps fusion.

Figure 6: Classification accuracy for each dataset.