Clustering Ensemble Model Based on Self-Organizing Map Network

This paper proposes a clustering ensemble method that introduces a cascade structure into the self-organizing map (SOM) to address the poor performance of a single clusterer. Cascaded SOM extends the classical SOM with a cascaded structure: the outputs of multiple SOM networks are combined in a cascaded manner and used as the input to another SOM network. The method also exploits the insensitivity of high-dimensional data to changes in the values of a small number of dimensions, which allows part of the SOM networks' erroneous outputs to be ignored. Since the initial parameters of each SOM network and the sample training order are randomly generated, the model does not need to provide different training samples for each SOM network to generate differentiated SOM clusterers. Experiments on several classical datasets show that the model can effectively improve the accuracy of pattern recognition by 4%~10%.


Introduction
Advances in network techniques provide more access to data, especially unlabeled data, which is a significant factor in the fast development of clustering algorithms. Nowadays, clustering plays an important role in areas like pattern recognition, image processing, recommendation systems, and data mining. The rise of ensemble learning has also naturally led to its introduction into clustering, which is called the clustering ensemble [1]. A clustering ensemble model is able to combine the results of multiple clusterers and thereby enhance performance compared with a single clusterer. Meanwhile, the clustering ensemble is robust against noise and isolated points in the input space. A clustering ensemble model called cascaded self-organizing map (cascaded SOM) is proposed in this paper. Cascaded SOM realizes the ensemble function by learning the responses of different clusterers to make a final decision. When learning the responses, the model is learning high-dimensional data, which is insensitive to changes in the values of a few dimensions. This characteristic of high-dimensional data enables the model to ignore the errors of several clusterers and still output the correct result.
Self-organizing map (SOM) was proposed by Kohonen in 1990 [2]. SOM is a competitive network that consists of an input layer and an output layer (competitive layer). SOM represents the distribution characteristics of input samples on the competitive layer using topographically ordered nodes and, in doing so, achieves clustering through dimensionality reduction. Due to its intrinsic nonlinear mapping capabilities on its low-dimensional neural surface, SOM is distinguished from other widely used clustering algorithms. Such advantage makes SOM a tool of visualizing nonlinear relations of data, topology-based cluster analysis, vector quantization, and projection of multidimensional data. Due to its versatility, SOM has been applied in areas like pattern recognition, image-text processing, data mining, genomics, and medical diagnostics.
Since SOM was proposed, plenty of researchers have put effort into improving it and have proposed several variant algorithms. Growing-SOM [3] improves the performance of SOM in incremental learning by replacing the original competitive layer with a variable-scale one; TS-SOM [4] changes the competitive layer from a single layer into multiple layers with a tree structure, whose recursive characteristic helps reduce training time; G-SOM [5] introduces a genetic algorithm into the parameter initialization step of SOM to generate better parameters than the original randomized initialization method; similar to G-SOM, particle swarm optimization (PSO) can also be applied in the parameter initialization step [6]. Moreover, there is research comparing SOM with an uncommon grid structure against a common grid structure, proving that SOM with an uncommon grid structure performs better on data with special distributions [7]. During the past decade, researchers have proposed further methods to improve SOM. For example, DASOM introduces a denoising autoencoder to reduce the noise in the input space [8]; constrained SOM preserves the topology structure by blocking the input space [9]; robust SSGSOM introduces the HQ method into semisupervised growing-SOM to improve network robustness [10]. PLSOM uses an adaptive learning rate instead of reducing the learning rate through training, which lets the network focus on new patterns [11]. PLSOM2 solves the problem that a large error in the early period leads to a small learning rate in the later period [12], as well as reducing the influence of noise. Inspired by PLSOM, researchers have used the eigenvalues of the autocorrelation matrix of the input vector to control the learning rate, reaching a faster convergence speed [13]. In classic SOM, there might be several useless neurons between different patterns, reducing competitive layer utilization.
A segmentation method based on the distribution of neurons is proposed to raise the competitive layer utilization of 1D SOM and get a more stable network structure [14].
Each neuron in the SOM competitive layer represents a prototype of one pattern in the input space, which is more representative than any single input that excites this neuron. This characteristic is applied in several lines of research. Facing situations where samples of one pattern are far fewer than those of other patterns, SOM oversampling (SOMO) uses SOM to train prototypes of the minority samples as the input of the SMOTE algorithm, providing oversampling for minority samples while keeping the oversampled data representative [15]. Also, in the real world, there should be neither samples nor prototypes in certain areas of the input space. Classic SOM is not able to handle this problem. Forbidden region SOM (FRSOM) sets forbidden regions according to prior knowledge [16]. During training, FRSOM considers the distance differently; that is, the path that represents the distance between two points must not cross the forbidden area. This strategy ensures that no prototype will be placed in the forbidden region, improving the clustering result.
Ensemble learning uses multiple learners to solve a single problem, whose answer is given by combining the outputs of all the learners, which can improve the result [17]. When introduced into clustering, ensemble learning can reduce the effect of noise, leading to higher robustness. The clustering ensemble is usually divided into two parts: cluster generation and cluster fusion. Specifically, n samples in the input space X are learned by M clusterers, and a fusion algorithm combines all the clusterers to get the final learner P (see Figure 1). Therefore, clustering ensemble algorithms can be divided into those that focus on clusterer generation and those that focus on clusterer fusion. As for clusterer generation, there are algorithms that generate different clusterers with different clustering algorithms [1] or with the same clustering algorithm under different parameters [18, 19]. Also, dataset generation methods are used to derive different datasets from the original dataset to train different clusterers [20-22]. As for clusterer fusion, there is also plenty of research. One algorithm resamples and groups different samples into the same cluster [23]; another uses a graph segmentation algorithm to solve the clustering ensemble task [24]; furthermore, a mixed model has been proposed [25], and Dempster-Shafer evidence theory has also been introduced into the clustering ensemble area [26]. Apart from these, some researchers classify the samples into transition samples and core samples and then use the core samples as the basis of cluster segmentation [27, 28] (see Figure 2). Besides these two aspects, prior knowledge is also applied to improve the clustering ensemble result [29, 30].
Most research on the clustering ensemble introduces a new algorithm into the ensemble step to correct the results of certain subclusterers and combine the results of all clusterers. Research on using the self-organizing map quantization error to indicate single-pixel changes in large random patterns [31] shows that the self-organizing map cannot detect a small shift of a single pixel. This indicates that the self-organizing map itself can ignore a small shift in high-dimensional data, and this characteristic makes the self-organizing map suitable for the ensemble step.

Materials and Methods
The SOM network can create a map between high-dimensional data and low-dimensional data that preserves the original topology structure. As a nonlinear dimension-reduction algorithm, SOM trains competition-layer neurons, which represent prototypes of the patterns in the high-dimensional space, to correspond with the topology structure of the original data. The network structure of SOM is shown in Figure 3. The network consists of k neurons V = {v_1, ..., v_k}, and each neuron v has a weight vector w_v ∈ R^d whose dimension equals that of the input space. Two values need to be calculated when SOM is being trained. The first is N(v_i, v_j), the topological distance between two neurons on the competitive layer; the second is D(x, w_v) = ||x − w_v||_2, the Euclidean distance between the input vector and the weight vector. During the training period, the target is to reduce the Euclidean distance between the weight vectors of neurons that have a small topological distance.
When the SOM is trained on dataset D, for each input x_i there is a winner neuron v whose weight vector has the smallest Euclidean distance to the input vector in the competition layer. Then, the weight vectors of both the winner neuron and its neighboring neurons are updated as shown in Algorithm 1, where η is the learning rate and f(·) is a neighborhood function that decreases as N(u, v_j) increases.
Each round of training will force the weight vectors of the winner neuron and its neighboring neurons to move towards the input vector x i . Along with the iteration time t growing, the value of f(·) decreases, leading to the separation of the neurons in the competition layer. Finally, the neurons will be spontaneously gathered around different clusters. Since one or several clusters can represent a pattern in the input dataset, the Euclidean distance of weight vectors can represent the distance between different patterns, i.e., reserve the data topology structure.
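The training procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear decay schedules for the learning rate η and the neighborhood width, the Gaussian form of f(·), and all function and parameter names are assumptions.

```python
import numpy as np

def train_som(data, grid_h, grid_w, n_iters, eta0=0.5, sigma0=None, seed=0):
    """Sketch of the SOM learning rule (cf. Algorithm 1), under assumed schedules."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    weights = rng.random((grid_h * grid_w, d))       # w_v, randomly initialized
    # Grid coordinates define the topological distance N(v_i, v_j).
    coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)],
                      dtype=float)
    if sigma0 is None:
        sigma0 = max(grid_h, grid_w) / 2.0
    for t in range(n_iters):
        frac = t / n_iters
        eta = eta0 * (1.0 - frac)                    # learning rate decays with t
        sigma = sigma0 * (1.0 - frac) + 1e-3         # neighborhood shrinks with t
        x = data[rng.integers(len(data))]            # random sample order
        # Winner: smallest Euclidean distance D(x, w_v).
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        topo = np.linalg.norm(coords - coords[winner], axis=1)  # N(winner, v_j)
        f = np.exp(-(topo ** 2) / (2.0 * sigma ** 2))           # neighborhood f(.)
        weights += eta * f[:, None] * (x - weights)  # pull winner and neighbors to x
    return weights
```

Because f(·) shrinks over time, late updates only move the winner's close neighbors, which is what separates the neurons into clusters as the text describes.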
Cascaded SOM combines the output of multiple competition layers as the input of another SOM model, leading to the ensemble of multiple SOM models, which can improve the performance of the learning model. Figure 4 shows the structure of cascaded SOM (2-layer cascaded SOM as an example).
Since the competition layer of the SOM network indicates the topology structure of the input patterns, the cascaded SOM needs to retain this characteristic, which the raw output of the SOM network cannot provide. Inspired by the observation that the winner neurons for the same input pattern appear in the same area, the output of the competition layer is converted into a one-hot vector that passes the location of the winner neuron to the next layer. After this processing, the output data can be used as the training samples of the next layer, whose training step is exactly the same as that shown in Algorithm 1.
The conversion of the competition layer's output increases the dimension of the next layer's input space, and a change of the winner in a single competition layer only changes the values in two dimensions of the concatenated vector. When the Euclidean distance is calculated, the result is only slightly influenced by a difference in a few dimensions, which improves the robustness of the model.
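A sketch of the one-hot conversion and of the insensitivity argument above. The helper names are hypothetical; only the idea (concatenating one-hot winner encodings, and the fact that one disagreeing sub-network changes exactly two dimensions) comes from the text.

```python
import numpy as np

def one_hot_winner(x, weights):
    """One-hot output of a single SOM: a 1 at the winner neuron's index."""
    out = np.zeros(len(weights))
    out[int(np.argmin(np.linalg.norm(weights - x, axis=1)))] = 1.0
    return out

def cascade_input(x, first_layer_weights):
    """Concatenate the one-hot outputs of M first-layer SOMs into one
    second-layer input vector."""
    return np.concatenate([one_hot_winner(x, w) for w in first_layer_weights])

# Insensitivity demo: if one of M sub-networks changes its winner, exactly two
# dimensions of the concatenated vector change, so the Euclidean distance
# between the two encodings is sqrt(2) regardless of how large M * k is.
M, k = 10, 100
a = np.zeros(M * k)
b = np.zeros(M * k)
for m in range(M):
    a[m * k] = 1.0
    b[m * k] = 1.0
b[0], b[1] = 0.0, 1.0   # the first sub-network disagrees
```

As M grows, the fixed distance of sqrt(2) between `a` and `b` becomes relatively smaller, which is why the second layer can ignore a few erroneous sub-networks.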

Results and Discussion
To measure the improvement brought by the proposed cascaded SOM over the classic SOM, several experiments are conducted on the MNIST handwritten digit dataset, the USPS handwritten digit dataset, and the Fashion MNIST dataset (Figure 5). Data in MNIST and Fashion MNIST are 2-dimensional arrays with 28 * 28 elements, while those in USPS are 2-dimensional arrays with 16 * 16 elements. In the experiments, each 2-dimensional array is flattened into a 1-dimensional array: the data in MNIST and Fashion MNIST become 1 * 784 arrays, and the data in USPS become 1 * 256 arrays. The rest of this section demonstrates the design of these experiments and discusses their results.
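The flattening step can be written in one line with NumPy; row-major (C) order is assumed here, which is the NumPy default.

```python
import numpy as np

# A 28x28 MNIST-style image flattens to a 1x784 row; a 16x16 USPS image to 1x256.
img = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
flat = img.reshape(1, -1)                     # shape (1, 784)

batch = np.stack([img] * 5)                   # shape (5, 28, 28)
flat_batch = batch.reshape(len(batch), -1)    # shape (5, 784), row order preserved
```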
In the experiments, a two-layer cascaded SOM network is used. The number of training iterations, the number of first-layer networks, the difference between training samples, and the network size are chosen to test their influence on the performance of the proposed algorithm. The accuracy of the clustering is utilized to evaluate the result:

ACC = (1/N) * sum_{i=1}^{N} δ(c_i, map(ω_i)),   (1)

where N is the total number of samples, Ω = {ω_i | i = 1, 2, ..., k} is the set of clusters, C = {c_j | j = 1, 2, ..., J} is the set of real labels of the samples, and map(ω_i) is the category of ω_i; δ(c_i, map(ω_i)) compares c_i and map(ω_i) and then outputs 1 if they are the same and 0 if they are not. When the iteration count of the second layer and the number of first-layer networks are kept fixed, the accuracy of the first-layer SOM networks as well as of the cascaded SOM network is positively correlated with the iteration count of the first-layer SOM networks until the first-layer SOM networks converge (Figure 6). After that, the accuracy of the first-layer SOM networks shows no obvious improvement. Compared with the first layer, the accuracy of the cascaded SOM network shows a stable improvement (5% on MNIST and USPS, 4% on Fashion MNIST).
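The accuracy measure can be sketched as below. The paper does not spell out how map(ω_i) is obtained, so mapping each cluster to its majority true label is an assumption (a common realization of this metric).

```python
import numpy as np
from collections import Counter

def clustering_accuracy(true_labels, cluster_ids):
    """ACC = (1/N) * sum_i delta(c_i, map(omega_i)), where map(.) assigns each
    cluster its majority true label (assumed realization of map)."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for omega in np.unique(cluster_ids):
        members = true_labels[cluster_ids == omega].tolist()
        # Majority label count = number of samples with delta(...) = 1.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)
```

For example, two clusters that each purely contain one class score 1.0, while one misplaced sample out of four scores 0.75.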
When the parameters of the first-layer networks are fixed while the iteration count of the second-layer SOM network increases, the improvement of the accuracy is shown in Figure 6. When the iteration count is only one, training is incomplete, so the accuracy drops sharply instead of improving. After several rounds of training, the accuracy improvement of the cascaded SOM network tends to a fixed value (5.7% on MNIST, 5% on USPS, 4.5% on Fashion MNIST). The core of cascaded SOM is the exploitation of the insensitivity to changes in the values of a small number of dimensions. Therefore, the ratio of the number of value-changed dimensions to the total number of dimensions can influence the performance of the cascaded SOM network. The number of first-layer networks is utilized to control this ratio. Also, the first-layer networks should differ from each other to avoid identical misclassification of the same input, which can be ensured using different initialized parameters or different training samples.
When all the first-layer SOM networks are trained with the same training dataset, and the parameter difference is provided by randomly initialized weight vectors, the improvement of the accuracy reaches a limit when the number of first-layer networks increases to 13, as shown in Figure 7.
While the difference among first-layer networks is provided by both different training datasets and randomly initialized weight vectors, the extra difference leads to better generalization ability. As a result, the upper limit of the accuracy is higher than that of a cascaded SOM network whose first-layer SOM networks are trained with the same dataset. However, when the number of first-layer SOM networks is small, the generalization ability is not high enough to offset the accuracy loss caused by the first-layer training dataset differing from the second-layer training dataset. Therefore, as shown in Figure 8, the cascaded SOM network whose first-layer SOM networks are trained with the same dataset has higher accuracy than the one whose first-layer SOM networks are trained with different datasets when the number of first-layer SOM networks is small.
Meanwhile, the size of the first-layer SOM networks can influence the performance of the SOM network. The next experiment examines the effect of the size of the first-layer SOM networks on accuracy. In one round of experiments, all the first-layer and second-layer SOM networks have the same size, the shape of those SOM networks is square, and the number of first-layer SOM networks is 10.
A bigger SOM network leads to better performance of the cascaded SOM network (Figure 9). The reason is that a bigger SOM network helps separate different clusters and reduces the number of intercluster neurons, whose cluster membership is hard to judge. This benefit works on both the first layer and the second layer, increasing the accuracy of both the first-layer SOM networks and the cascaded SOM network. Contrary to the performance of the cascaded SOM network, the accuracy of the first-layer SOM networks decreases when the network size keeps increasing. This phenomenon appears because each neuron gains less knowledge as the network size grows, which raises the noise-to-knowledge ratio and reduces accuracy. Nevertheless, for the cascaded SOM network, a larger network size means a higher-dimensional input to the second-layer SOM network, which improves the antinoise ability of the network, so the accuracy of the cascaded SOM network keeps improving instead. Table 1 shows the performance of the cascaded SOM network with 3 different parameter settings on the MNIST, USPS, and Fashion MNIST datasets. The accuracy of the cascaded SOM is positively correlated with the iteration count of the first-layer SOM networks, the iteration count of the second-layer SOM network, the number of first-layer SOM networks, and the size of the SOM networks. Due to the differences in data distribution and data complexity among the datasets, the performance of the cascaded SOM network varies slightly, but, in general, the cascaded SOM network performs better than the classic SOM network.
All the experiments above use the 2-layer structure. Table 2 shows the performance of the 3-layer cascaded SOM network. The additional third competition layer provides only a very limited improvement in accuracy over the 2-layer structure. This indicates that the additional competition layer cannot compensate for the lack of difference among second-layer SOM networks trained on the same first-layer SOM networks. Such a problem can be solved by increasing the difference among both first-layer and second-layer SOM networks, but the extra cost in training time should also be considered. The experiments on MNIST, USPS, and Fashion MNIST indicate that the proposed cascaded SOM network can stably improve the performance of the SOM network. At the same time, as in any other ensemble model, the difference among sublearners raises the generalization ability, and this difference can result from randomly initialized weight vectors or different training datasets. Though there is evidence that adding an extra layer can further improve the performance of the cascaded SOM network, the extra time cost must be considered.
The cascaded SOM network has proved to be an effective algorithm for improving on the classic SOM network. The rest of this section compares the proposed algorithm with other clustering ensemble algorithms.
In the following experiment, ARI (adjusted Rand index) and NMI (normalized mutual information) are utilized as the evaluation indices of the clustering ensemble algorithms, defined as follows:

ARI = [ Σ_{i,j} C(n_ij, 2) − ( Σ_i C(n_i, 2) Σ_j C(n_j, 2) ) / C(n, 2) ] / [ (1/2) ( Σ_i C(n_i, 2) + Σ_j C(n_j, 2) ) − ( Σ_i C(n_i, 2) Σ_j C(n_j, 2) ) / C(n, 2) ],

NMI(P*, P^t) = [ Σ_{i=1}^{k*} Σ_{j=1}^{k^t} n_ij log( n·n_ij / (n_i·n_j) ) ] / sqrt( ( Σ_{i=1}^{k*} n_i log(n_i/n) ) ( Σ_{j=1}^{k^t} n_j log(n_j/n) ) ),   (2)

where n is the total number of samples, n_ij is the number of objects in the intersection of clusters c_i ∈ P^t and c_j ∈ P*, P* is the clustering result, P^t holds the true class labels of the samples, n_i and n_j are the numbers of samples in clusters c_i ∈ P^t and c_j ∈ P*, respectively, and C(n, 2) = n!/(2!(n − 2)!) is the binomial coefficient. The maximum value of both ARI and NMI is equal to 1.

ALGORITHM 1: Learning rule of SOM network.

Table 3: Details of the UCI datasets (samples, attributes, classes).

Dataset      Samples  Attributes  Classes
Iris         150      4           3
Wine         178      13          3
Thyroid      215      5           3
Mfeatures    2000     2           10
Glass        214      9           6
BCW          683      9           2
Soybean      47       35          4
Ionosphere   351      34          2
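Both indices can be computed directly from the contingency table n_ij. The sketch below follows the definitions above; the function names are hypothetical, and no edge-case handling (e.g., a single cluster, where NMI's denominator vanishes) is included.

```python
import numpy as np

def _contingency(true, pred):
    """Contingency table n_ij between true classes and predicted clusters."""
    t, p = np.unique(true), np.unique(pred)
    return np.array([[np.sum((true == a) & (pred == b)) for b in p] for a in t],
                    dtype=float)

def adjusted_rand_index(true, pred):
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    cont = _contingency(true, pred)
    comb2 = lambda x: x * (x - 1) / 2.0          # binomial coefficient C(x, 2)
    sum_ij = comb2(cont).sum()
    sum_i = comb2(cont.sum(axis=1)).sum()
    sum_j = comb2(cont.sum(axis=0)).sum()
    expected = sum_i * sum_j / comb2(n)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

def normalized_mutual_info(true, pred):
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    cont = _contingency(true, pred)
    ni, nj = cont.sum(axis=1), cont.sum(axis=0)
    nz = cont > 0                                # skip empty cells (0 log 0 = 0)
    num = (cont[nz] * np.log(n * cont[nz] / np.outer(ni, nj)[nz])).sum()
    den = np.sqrt((ni * np.log(ni / n)).sum() * (nj * np.log(nj / n)).sum())
    return num / den
```

A perfect clustering (up to label permutation) scores 1.0 on both indices, matching the maxima stated above.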
Eight real-world datasets from the UCI Machine Learning Repository were used, including Iris, Wine, Thyroid, Multiple Features (Mfeatures), and the Ionosphere dataset. Table 3 shows the details of these datasets. BCW has an attribute with values missing in some objects, which is removed. The second attribute in the Ionosphere dataset is also removed.
The proposed algorithm has the best performance on several datasets (Wine, Mfeatures, and Glass) and relatively good performance on the other datasets as well.

Conclusions
In this paper, a new clustering ensemble approach is investigated. The proposed method is based on introducing a cascaded structure into the self-organizing map. To test the clustering ensemble ability of the algorithm, experiments are conducted on the MNIST, USPS, and Fashion MNIST datasets. The experiments on those datasets indicate that the proposed cascaded SOM network has the following characteristics: (1) It can stably improve the performance of the SOM network. (2) The difference among sublearners, which can result from randomly initialized weight vectors or different training datasets, raises the generalization ability. (3) The 2-layer cascaded SOM network can improve the performance of the SOM network, while the 3-layer cascaded SOM network only slightly increases the accuracy compared with the 2-layer one, and the extra time cost must be considered.
The proposed algorithm has been shown to have advantages on several datasets compared with other clustering ensemble algorithms.
Also, the proposed algorithm can still be improved. For now, all the methods to improve the performance of the cascaded SOM, except using different training datasets, increase the data dimension, which increases the training time cost at the same time. Therefore, the network can hardly reach its theoretical limit in practical applications due to time cost control.

Data Availability
The data used to support the findings of this study are included in the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.