Memetic Variable Clustering and Its Application

Clustering analysis is an important and difficult task in data mining and big data analysis. Although variable clustering is a widely used clustering analysis technique, it has not received enough attention in previous studies. Inspired by the metaheuristic optimization techniques developed for clustering data items, we try to overcome the main shortcoming of the k-means-based variable clustering algorithm.


Introduction
Clustering analysis, or clustering, is the task of grouping a set of objects in such a way that, according to a certain similarity measure, objects in the same group (called a cluster) are more similar to each other than to objects in different groups (clusters). Clustering analysis is widely used in the data preprocessing and data mining steps of KDD (Knowledge Discovery in Databases) [1] (Figure 1), and it is the main task of exploratory data mining and unsupervised machine learning. Recently, clustering analysis has been pointed out as a powerful meta-learning tool for accurately analyzing big data [2]. Clustering analysis also plays an important role in many other fields, including pattern recognition, image analysis, information retrieval, and bioinformatics. Besides being important, clustering analysis is also challenging, because the unsupervised nature of the task implies that the structural characteristics of the dataset are not known unless some domain knowledge about the dataset is available in advance [3].
Because of the importance and difficulty of clustering analysis, many clustering algorithms have been proposed in the literature. Some popular clustering algorithms, for example k-means clustering, suffer from the shortcoming of being sensitive to outliers; therefore, metaheuristic methods such as evolutionary algorithms and swarm intelligence algorithms are widely used to improve clustering algorithms from the optimization perspective. Almost all the metaheuristic-based improvements of clustering algorithms in the literature are devoted to clustering data items, but clustering analysis of variables is also a common technique in statistical data analysis for dimension reduction or (unsupervised) feature selection, especially in practical statistical data analysis activities. The most famous example is the VARCLUS procedure in SAS, and there are also other versions of variable clustering methods implemented in R and SPSS. In contrast to its wide application, research contributions to variable clustering techniques have been insufficient. Moreover, k-means-based variable clustering algorithms suffer from the same shortcoming of being sensitive to the initial centroids.
We studied the metaheuristic approach to variable clustering based on our previous work. In our previous research, MCLPSO [4] was proposed to improve CLPSO [5] in two respects: the chaotic local search and the SA-based local search. First, we integrated a chaotic local search operator into CLPSO to enable stagnant particles to escape from local optima. An SA-based local search operator combined with the "cognition-only" model was developed to enhance the local search ability of the elite members. The experimental results demonstrated that MCLPSO is competitive in optimizing multimodal functions. In this work, MCLPSO is reorganized under a novel metaheuristic paradigm, memetic computing. Furthermore, MCLPSO is used to optimize the k-means-based variable clustering algorithm as a metaheuristic approach. The experimental results demonstrate that MCLPSO can improve the k-means-based variable clustering algorithm effectively. We also developed a web-based interactive software platform to implement this approach and give a practical case study: analyzing the performance of a semiconductor manufacturing system by MCLPSO-based variable clustering.
The main contributions of this work include the following: (i) the novel memetic algorithm MCLPSO proposed in our previous research is described under a sounder and more general theoretical framework, memetic computing; (ii) to the best of our knowledge, this is the first time a metaheuristic method is used to improve the results of variable clustering, and the improved variable clustering is used to deal with some complex tasks; (iii) to facilitate the practical use of the MCLPSO-based variable clustering algorithm, we developed an interactive software system for this approach and give a real-world case study.
The rest of the paper is organized as follows. In Section 2, we review the related work. In Section 3, we describe our previous work MCLPSO in detail under the memetic computing framework. In Section 4, MCLPSO is used to optimize the k-means-based variable clustering problem. In Section 5, experimental results on several datasets are presented and discussed. In Section 6, a web-based interactive software system developed for clustering variables is introduced. Finally, we conclude in Section 7.

Related Work
As mentioned in Section 1, clustering analysis is an important and difficult task, and dozens of clustering algorithms have been proposed in the literature for a variety of clustering analysis applications.
These clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods [6]. Regardless of the category, the main objective of a clustering algorithm is to maximize both the homogeneity within each cluster and the heterogeneity among different clusters [6]. From an optimization perspective, if the homogeneity within each cluster and the heterogeneity among different clusters can be measured by a clustering criterion, metaheuristic algorithms, including evolutionary algorithms (EAs) such as genetic algorithms (GAs) and swarm intelligence algorithms such as particle swarm optimization (PSO), can be applied to improve the clustering results by adjusting the hyperparameters of those clustering algorithms that are sensitive to them.
As one of the best-known and most commonly used clustering techniques, k-means clustering suffers from the deficiency of being sensitive to its k value and initial centroids, and several metaheuristic-based clustering methods have been proposed in the literature to overcome this deficiency. Maulik and Bandyopadhyay proposed a GA-based clustering technique that exploits the searching capability of genetic algorithms so that the clustering metric can be optimized by searching for appropriate cluster centroids [7]. Van der Merwe and Engelbrecht introduced two PSO-based clustering algorithms [8]: in the PSO-based clustering algorithm, PSO is used to find the optimum centroids directly, while in the hybrid PSO and k-means clustering algorithm, the result of k-means is used to initialize the PSO-based clustering for quick convergence. The results were compared with the k-means algorithm, and the conclusion was that the proposed approaches gave better convergence and lower quantization error than k-means. Esmin improved the PSO-based clustering algorithm by modifying the evaluation function, and this modification brought good improvements to the clustering results [9]. Ahmadyfard proposed a two-stage clustering algorithm [10] in which PSO is used at the first stage to find optimum centroids directly; these optimized centroids are then used to initialize k-means at the second stage. The combined method has the advantages of both PSO and k-means if the algorithm switches to k-means when PSO is close to the global optimum.
Recently, memetic algorithms have been used as a novel metaheuristic paradigm to improve clustering algorithms. A memetic algorithm (MA) is an EA that includes one or more local search operators to improve the individuals within its evolution cycles [11]. In MAs, "memes" refer to the local search operators used to enhance the local search ability of EAs [12]. Moscato first introduced the concept of the meme into EAs by combining SA with the crossover operator of the genetic algorithm to solve the Travelling Salesman Problem (TSP) [13]. MA is inspired by the concept of a meme, which represents a unit of cultural evolution that can exhibit local refinement: population evolution cooperates with individual learning, and the memetic model is a more detailed explanation of adaptation in natural systems than the genetic model [12]. Most EAs can find the regions around the local optima, but some EAs, including PSO, lack local search ability, and MA was proposed to overcome this deficiency. The promising regions of the search space are found by the global search operators, while the local search operators perform fine-grained search around these regions [12]; the global search cooperates with the local search to find the global optima. Ong extended the notion of MA and defined memetic computation (MC) [14]. The concept of the meme used in MC is more general than that used in MA: in MC, a meme can denote a learning strategy, an operator, or a local search procedure.
Sheng proposed an approach for simultaneous clustering and feature selection using a niching memetic algorithm, NMA_CF [15]. In NMA_CF, both the feature selection and the cluster centers, with different numbers of clusters, are encoded in the chromosomes; local search operations are introduced to refine the feature selection and cluster centers encoded in the chromosomes; and a niching method is integrated to preserve population diversity and prevent premature convergence. The experimental results demonstrated that the simultaneous global clustering and feature subset optimization mechanism is effective for this problem. Recently, Sheng improved NMA_CF by introducing multiple local search operations and an adaptive niching strategy [16]. In our previous research [17], we proposed a novel memetic algorithm, GS-MPSO, and used it to optimize the initial centroids for k-means clustering; in GS-MPSO, the k-means clustering algorithm is integrated into the function evaluation so that the improvement of the clustering results is significant.
Although most clustering algorithms are devoted to clustering data items, variable clustering is also a widely used technique in practical statistical analysis activities. The function of variable clustering is provided in almost all statistical tools, such as R, SPSS, and SAS; the most famous implementation is the VARCLUS procedure in SAS. In VARCLUS, the similarity of variables is measured by the Pearson correlation, the centroid is computed as the first principal component of the variables in the cluster, and the variables are grouped into hierarchical clusters by hierarchical clustering. In almost all variable clustering algorithms, PCA (Principal Component Analysis) is used to compute the representative of the variables in a cluster. Vigneau proposed a variable clustering algorithm named Clustering around Latent Variables to segment quantitative variables [18,19]. Chavent proposed ClustOfVar to cluster variables of mixed types [20]; in [20], PCAMIX is used to calculate the centroids, and a k-means-based variable clustering and a hierarchical variable clustering are studied to optimize the homogeneity criterion.

Memetic Comprehensive Learning PSO
In many applications, MAs are more competitive in both effectiveness and efficiency than traditional EAs, but designing an MA with good performance is intricate. To design a competitive MA, the local search components should be kept in balance with the global search component to achieve a balance between exploration and exploitation. In some MAs, excessive use of local search can lead to a loss of diversity in the population, and if local search is applied to a candidate that is already a local optimum, or if the local search depth is too high, computing time may be wasted on unnecessary local search. The local search operators should cooperate with the evolutionary operators to find a balance between global search and local search. Therefore, the following design and parameterization issues of MA are considered [21]: (i) How often should local search be applied? (ii) On which solutions should local search be used? (iii) How long should local search be run? The memetic strategy used in MCLPSO gives answers to these design issues; we propose an adaptive memetic strategy based on the status and quality of particles.
Although some MAs have proved to be effective, the framework of MA was found too specific to describe some complex hybrid algorithms, and some researchers have tried to develop a more general and more formal definition of MA. For example, Nguyen presents a probabilistic memetic framework to model the process of MA [22]. Ong defines memetic computation (MC) as a paradigm that uses the notion of meme(s) as units of information encoded in computational representations for problem-solving [14]. An MC algorithm is composed of several interacting memes and uses these memes to solve complex problems; in MC, a meme can denote an operator, a learning strategy, or a local search procedure, so the concept of the meme used in MC is extended. Iacca gave a thorough analysis of MC and introduced the "Ockham's Razor" principle, which states that "entities should not be multiplied unnecessarily" [23]. Iacca pointed out that, from the perspective of Ockham's Razor, simplicity helps to design an efficient and compact memetic computational algorithm, and summarized four kinds of memes that perform different exploration in MC: (i) stochastic long-distance exploration, (ii) stochastic moderate-distance exploration, (iii) deterministic short-distance exploration, and (iv) random long-distance exploration. In our previous work, we developed some novel memetic algorithms under the framework of MC, and these memetic algorithms were applied to data clustering [17] and missing data estimation [24]. In this work, we develop MCLPSO by following the analysis of MC in [14] and design the following "memes" as in [24].
(i) Stochastic long-distance exploration: comprehensive learning strategy. (ii) Stochastic moderate-distance exploration: chaotic local search. (iii) Deterministic short-distance exploration: SA-based local search. Diversity can benefit from random long-distance exploration, but random long-distance exploration may lower the quality of the swarm when the comprehensive learning strategy is used; therefore, random long-distance exploration is disabled in MCLPSO to keep the swarm stable.
Based on the above discussion, we will discuss the memes used in MCLPSO in detail and propose the memetic strategy for MCLPSO.

Classification of the Particles.
In MCLPSO, CLPSO is responsible for the global search. The chaotic local search operator is applied to stagnant particles to improve them, and the SA-based local search operator performs fine-grained search around the promising regions.
At each iteration of CLPSO, the ith particle's solution x_i is updated by adding a velocity v_i, which is calculated by learning from pbest_{f_i(d),d} at each dimension d. pbest_j is the best solution found by the jth particle so far, and f_i = (f_i(1), f_i(2), ..., f_i(D)) defines the ith particle's corresponding learning exemplars at each dimension. Some variables are introduced to classify the particles for the purpose of designing an adaptive memetic strategy; the classification depends on the searching status of the particle.
(i) For the ith particle, flag_i records the number of generations for which the ith particle has not improved its pbest_i. If flag_i ≥ m, f_i is reassigned and flag_i is reset to 0; m is the refreshing gap and is set to 7 [5]. (ii) For the ith particle, stagnant_i records the number of reassignments of f_i during which pbest_i has not been improved, i.e., pbest_i has not been improved for m · stagnant_i + flag_i generations.
If stagnant_i ≥ stagnant_max, particle i is stagnant.
(iii) For the ith particle, improve_i records the number of generations for which pbest_i has been improved continuously, i.e., pbest_i has changed continuously for improve_i generations. (iv) The particle i with the best pbest_i in the population is a promising particle if improve_i ≥ improve_max.
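The bookkeeping behind this classification can be sketched as follows; class and attribute names are illustrative, not taken from the paper's implementation.

```python
class ParticleStatus:
    """Per-particle counters used to classify a particle as stagnant or promising."""

    def __init__(self, m=7, stagnant_max=10, improve_max=3):
        self.m = m                      # refreshing gap
        self.stagnant_max = stagnant_max
        self.improve_max = improve_max
        self.flag = 0       # generations since pbest last improved (flag_i)
        self.stagnant = 0   # exemplar reassignments without pbest improvement (stagnant_i)
        self.improve = 0    # consecutive generations pbest improved (improve_i)

    def record(self, pbest_improved):
        """Update counters after one generation; return True if exemplars need refreshing."""
        if pbest_improved:
            self.flag = 0
            self.stagnant = 0
            self.improve += 1
            return False
        self.improve = 0
        self.flag += 1
        if self.flag >= self.m:         # refreshing gap reached: reassign f_i
            self.flag = 0
            self.stagnant += 1
            return True
        return False

    @property
    def is_stagnant(self):
        return self.stagnant >= self.stagnant_max

    def is_promising(self, has_best_pbest):
        """Promising = best pbest in the population and improve_i >= improve_max."""
        return has_best_pbest and self.improve >= self.improve_max
```

In this sketch, `record` is called once per generation per particle, and its return value tells the main loop when to call the exemplar-refreshing procedure of Figure 2.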

Stochastic Long-Distance Exploration-Comprehensive Learning Strategy.
CLPSO is adapted from the original PSO by using a novel velocity updating equation (1), called the comprehensive learning strategy:

v_i^d = w · v_i^d + c · rand() · (pbest_{f_i(d)}^d − x_i^d),   (1)

where w is the inertia weight, c is the weight of comprehensive learning, rand() generates a random number in [0, 1] from the uniform distribution, and f_i = (f_i(1), ..., f_i(D)) defines the ith particle's corresponding learning exemplars at each dimension. At the dth dimension, the ith particle follows pbest_{f_i(d)}^d, which denotes the dth value of the f_i(d)th particle's best solution found so far. Pc_i is the probability that the ith particle will learn from other particles' pbest, empirically defined as

Pc_i = 0.05 + 0.45 · (exp(10(i − 1)/(ps − 1)) − 1)/(exp(10) − 1),   (2)

where ps is the population size. The learning exemplars of the ith particle are selected as follows. For each dimension of the ith particle, a random number is generated between 0 and 1 from the uniform distribution. If this random number is larger than Pc_i, the corresponding dimension learns from the particle's own pbest_i; otherwise, two particles are chosen randomly from the swarm excluding the ith particle, and the one with the better pbest is selected as the exemplar for particle i at that dimension. This process is summarized in Figure 2. For efficiency, the ith particle is not allowed to refresh its learning exemplars f_i until it has ceased improving for m generations, where m is called the refreshing gap.
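A minimal sketch of the exemplar selection and the comprehensive-learning velocity update, assuming minimization and a 0-based particle index i (so that the exponent in Pc_i uses i rather than i − 1); the function names and clamping bounds are illustrative.

```python
import math
import random

def refresh_learning_exemplar(fitness_of_pbest, i, ps, D):
    """Reselect learning exemplars f_i for particle i (Figure 2); minimization assumed."""
    # CLPSO's empirical learning probability Pc_i (0-based index i)
    pc_i = 0.05 + 0.45 * (math.exp(10 * i / (ps - 1)) - 1) / (math.exp(10) - 1)
    f_i = []
    for _ in range(D):
        if random.random() < pc_i:
            # tournament between two other particles' pbest values
            a, b = random.sample([j for j in range(ps) if j != i], 2)
            f_i.append(a if fitness_of_pbest[a] < fitness_of_pbest[b] else b)
        else:
            f_i.append(i)   # learn from the particle's own pbest at this dimension
    return f_i

def comprehensive_learning_step(x, v, pbest, f, i, w=0.7, c=1.49445,
                                vmin=-1.0, vmax=1.0):
    """One velocity/position update (Eq. (1)) for particle i.

    x, v, pbest: lists of per-particle vectors; f[i][d] is the index of the
    particle whose pbest the ith particle learns from at dimension d.
    """
    for d in range(len(x[i])):
        exemplar = pbest[f[i][d]][d]
        v[i][d] = w * v[i][d] + c * random.random() * (exemplar - x[i][d])
        v[i][d] = min(vmax, max(vmin, v[i][d]))   # clamp velocity to its range
        x[i][d] += v[i][d]
```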
In CLPSO, each particle learns from pbest_{f_i}, which is derived from different particles' historical best positions.
The updating strategy (1) has been shown, by an analysis of search behavior, to yield a larger potential search space than that of the original PSO [5], and the swarm's diversity can be kept by the comprehensive learning strategy.
Therefore, the performance is improved when solving complex multimodal problems, but this improvement comes at the cost of convergence speed because the effect of the current global best position is weakened. If all the particles share a pbest similar to the current global best position, comprehensive learning is not able to enable the swarm to escape from the local optimum. Like other EAs, CLPSO also lacks the ability of local search. In this study, CLPSO is investigated under the framework of MC, and two local search operators are introduced to overcome these deficiencies.

Stochastic Moderate-Distance Exploration-Chaotic Local Search.
We study the chaotic local search operator to improve a stagnant particle i that cannot improve its pbest_i by the comprehensive learning strategy. Chaotic_local_search is adapted from the chaotic local search operator in [25]. The logistic equation (3) is used to generate the chaotic sequence:

x_{k+1} = μ · x_k · (1 − x_k),   (3)

where μ is the control factor and x is the chaotic variable. Although (3) is deterministic, it exhibits chaotic dynamics when μ = 4 and x_k ∉ {0, 0.25, 0.5, 0.75, 1}. Equation (4) is used to generate the chaotic sequence for the dth dimension of particle i. The sequence generated by (4) is sensitive to the initial value: a minute difference in the initial value of the chaotic variable results in a considerable difference in its long-term behavior. Equation (5) is used to normalize the initial value of the chaotic variable in (4). The stagnant particle i is perturbed with probability P_Chaotic by the denormalized value of a chaotic variable.
The denormalized value is derived from (6). In Chaotic_local_search, x_i is reset to pbest_i and then normalized between 0 and 1 by (5) to initialize the chaotic vector cx_i; [x_min,d, x_max,d] is the range of the dth dimension of the search space. A chaotic sequence is generated for each dimension by (4), where cx_id is the chaotic variable for the dth value of particle i and k is the iteration number. cx_id evolves by (4) iteratively, and the track of cx_id during the evolution can travel ergodically over the whole search space. During the evolution of the chaotic variables, the position x_i is perturbed with probability P_Chaotic by x′_ir(k) to escape from the local optimum, where x′_ir(k) is denormalized from cx_ir(k). The details of Chaotic_local_search are described in Figure 3; Chaotic_ls_length represents the number of iterations.
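The procedure of Figure 3 can be sketched as follows, assuming minimization; the parameter names and the greedy bookkeeping of the best position are illustrative.

```python
import random

def chaotic_local_search(pbest, fitness, bounds, evaluate,
                         length=100, p_chaotic=0.1, mu=4.0):
    """Chaotic local search sketch: perturb a stagnant particle's position with
    logistic-map values. bounds is a list of (lo, hi) per dimension; evaluate
    returns the (minimized) objective value of a position.
    """
    # normalize pbest into (0, 1) to seed the chaotic variables (Eq. (5)),
    # nudging values off the fixed points of the logistic map
    cx = [(pbest[d] - lo) / (hi - lo) for d, (lo, hi) in enumerate(bounds)]
    cx = [min(max(c, 1e-6), 1 - 1e-6) for c in cx]
    best, best_fit = list(pbest), fitness
    x = list(pbest)
    for _ in range(length):
        # iterate the logistic map (Eq. (4)); mu = 4 gives chaotic dynamics
        cx = [mu * c * (1 - c) for c in cx]
        for d, (lo, hi) in enumerate(bounds):
            if random.random() < p_chaotic:
                x[d] = lo + cx[d] * (hi - lo)   # denormalize (Eq. (6))
        fit = evaluate(x)
        if fit < best_fit:                       # keep the best position found
            best, best_fit = list(x), fit
    return best, best_fit
```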

Deterministic Short-Distance Exploration-SA-Based Local Search.
CLPSO is used as the global search component of MCLPSO because diversity can be kept by comprehensive learning, but the lack of local refinement ability in CLPSO can lead to missing the local optima. To solve this problem, a novel local search operator combining the cognition-only model [26] with SA was developed in our previous work [17] to enhance the local search ability of CLPSO. The details of this SA-based local search operator are described in Figure 4.
In Figure 4, T is the temperature variable and T_0 is the initial temperature; SA_ls_length represents the number of iterations. pbest′_i is obtained by introducing a Cauchy perturbation to the rth dimension of pbest_i according to (7), in which [A_r, B_r] is the range of the rth parameter and u is generated randomly from the uniform distribution between 0 and 1. pbest_i is perturbed with a probability P_SA each time for the purpose of fine-grained local search around the promising regions. pbest_i is updated in a greedy way, but the new position x′_i generated by the cognition-only model is accepted subject to the Metropolis rule (Kirkpatrick, 1983), so that a local search around the promising region can be performed. Thus, the local refinement ability of PSO is enhanced by SA_local_search.
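A sketch of the SA-based local search, assuming minimization; the geometric cooling schedule, the scale of the Cauchy step, and the exact form of the perturbation are assumptions, since Eq. (7) is not reproduced here.

```python
import math
import random

def sa_local_search(pbest, fitness, bounds, evaluate,
                    T0=10.0, length=100, p_sa=0.1, cooling=0.95, scale=0.1):
    """SA-based local search sketch (Figure 4): Cauchy-perturb random dimensions
    of the current position and accept worse moves by the Metropolis rule.
    bounds is a list of (lo, hi) per dimension; evaluate is minimized.
    """
    T = T0
    best, best_fit = list(pbest), fitness   # pbest is updated greedily
    x, x_fit = list(pbest), fitness
    for _ in range(length):
        cand = list(x)
        for d, (lo, hi) in enumerate(bounds):
            if random.random() < p_sa:
                # one common Cauchy step: scale * range * tan(pi * (u - 0.5))
                u = random.random()
                step = scale * (hi - lo) * math.tan(math.pi * (u - 0.5))
                cand[d] = min(hi, max(lo, cand[d] + step))
        cand_fit = evaluate(cand)
        delta = cand_fit - x_fit
        # Metropolis acceptance rule (Kirkpatrick, 1983)
        if delta < 0 or random.random() < math.exp(-delta / T):
            x, x_fit = cand, cand_fit
        if x_fit < best_fit:
            best, best_fit = list(x), x_fit
        T *= cooling                        # assumed geometric cooling schedule
    return best, best_fit
```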
Adaptive Memetic Strategy 1: SA_local_search is applied only to the promising particle, to give fine-grained local search around the promising regions, and Chaotic_local_search is applied to a stagnant particle that cannot improve its own pbest by the comprehensive learning strategy, to enable stagnant particles to escape from the local optima.
Although in some other MAs the local search is applied to all particles, we adopt Adaptive Memetic Strategy 1 in MCLPSO because of the high cost of local search and because frequent application of local search would result in a disastrous loss of diversity. In Adaptive Memetic Strategy 1, the swarm evolves along with the local refinement around the promising regions and the chaotic local search of the stagnant particles. Pseudocode for MCLPSO is given in Figure 5.
Adaptive Memetic Strategy 1 answers two of the design and parameterization issues raised at the beginning of this section: the local search operators are applied adaptively according to the current particle's quality and status, with SA_local_search always applied to the promising candidate solutions and Chaotic_local_search always applied to a stagnant particle that cannot improve its pbest by the comprehensive learning strategy. For the third question, the depth of Chaotic_local_search and SA_local_search, we believe a moderate value of SA_ls_length is sufficient for SA_local_search to find the local optimum, because local search is always applied to particles of high quality, and Chaotic_ls_length is set to the same value to balance exploration and exploitation.
In MCLPSO, the velocity of particle i is restrained within [v_min,d, v_max,d], the range of the dth velocity value, by min(v_max,d, max(v_min,d, v_id)). f(x_i) is evaluated only if x_i is inside the search bounds. All the pbest are kept inside the search bounds, and a particle outside the bounds will be attracted back by its learning exemplars.

Clustering of Variables Based on MCLPSO
The main objective of this work is to improve the k-means-based variable clustering algorithm by MCLPSO. As mentioned in Section 1, the k-means clustering method is sensitive to the initial centroids and easily becomes trapped in local optima, but k-means remains the most popular clustering algorithm because of its effectiveness and efficiency, and some variable clustering algorithms are implemented with k-means. In this section, we first introduce the k-means-based variable clustering algorithm; then MCLPSO is used to optimize the initial centroids for the k-means-based variable clustering algorithm. Some notation used in variable clustering is defined as follows: let X = {X_1, ..., X_N} be the set of N candidate variables, and let S be the dataset composed of M observations of these N variables. We consider hard partitioning clustering in this work, so each variable belongs to exactly one cluster. Based on the above notation, we can give a formal description of variable clustering:

Definition 1. Clustering of variables can be defined as a K-partition of the variable set, Partition_K = {Cluster_1, ..., Cluster_K}. Partition_K should satisfy the following constraints: Cluster_k ≠ ∅ for 1 ≤ k ≤ K; Cluster_i ∩ Cluster_j = ∅ for 1 ≤ i ≠ j ≤ K; and Cluster_1 ∪ ... ∪ Cluster_K = X.

In Sections 5 and 6, we choose datasets generated by the sensors of a complex manufacturing system, so the variables discussed in this section are quantitative. In [20], some variable clustering methods are developed to cluster variables of mixed types.

Principal Component Analysis-PCA. The centroid update rule is critical to k-means-based variable clustering.
In almost all the variable clustering algorithms in the literature, PCA is used to compute the first principal component as the centroid of a group of variables in a cluster. In this section, we first give a brief introduction to PCA.
PCA is a widely used dimension reduction method. The essence of PCA is a coordinate transformation: the projection of the data onto the new coordinate system maximizes the variance. PCA transforms x_i to x_i′ by projecting x_i onto the new coordinate system U′ as in (13); the dimension of x_i′ is less than N. U′ is a submatrix of U obtained by deleting some columns from U. U is an N × N orthogonal matrix whose jth column U_j is the jth eigenvector of the sample covariance matrix C. C = (c_ij)_{N×N} is the sample covariance matrix of the dataset S, defined by (11). From (12), we can see that C is a real symmetric matrix. By the properties of real symmetric matrices, C has N real eigenvalues (λ_1, λ_2, ..., λ_N), possibly with λ_i = λ_j for 1 ≤ i ≠ j ≤ N, and the eigenvectors of C, (U_1, U_2, ..., U_N), corresponding to (λ_1, λ_2, ..., λ_N), are real vectors. Eigenvectors corresponding to different eigenvalues are orthogonal to each other.
C U_j = λ_j U_j, where λ_j is an eigenvalue of C and U_j is the eigenvector corresponding to λ_j. Let λ_1 ≥ λ_2 ≥ ... ≥ λ_N; then U_1, U_2, ..., U_N are sorted by their corresponding eigenvalues. The projection of S in the U_1 direction has the largest variance, the projection of S on U_2 has the second largest variance, and so on; these eigenvectors are all orthogonal to each other. We can choose the T eigenvectors with the T largest eigenvalues, U′ = (U_1, U_2, ..., U_T). Then S′, the projection of S on U′, is the reduced dataset.
Based on the discussion above, we can summarize the steps to calculate the FPC (First Principal Component) of S: (1) calculate the sample covariance matrix C of S; (2) calculate the eigenvalues λ_1, λ_2, ..., λ_N of C by the Jacobi method; (3) choose the largest eigenvalue λ_1 and its corresponding eigenvector U_1; (4) compute the projection of S on U_1 to get FPC = SU_1. The pseudocode of FPC is described in Figure 6.
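The steps above can be sketched as follows; power iteration stands in for the Jacobi method (only the dominant eigenvector is needed), and the data are centered before projection, which is an assumption about the exact form of FPC = SU_1.

```python
import math

def first_principal_component(S):
    """FPC sketch: project the dataset S (M rows, N columns) onto the top
    eigenvector of its sample covariance matrix C.
    """
    M, N = len(S), len(S[0])
    means = [sum(row[j] for row in S) / M for j in range(N)]
    X = [[row[j] - means[j] for j in range(N)] for row in S]   # center the data
    # step (1): sample covariance matrix C (Eq. (11))
    C = [[sum(X[i][a] * X[i][b] for i in range(M)) / (M - 1) for b in range(N)]
         for a in range(N)]
    # steps (2)-(3): power iteration for the dominant eigenvector U_1
    u = [1.0] * N
    for _ in range(200):
        w = [sum(C[a][b] * u[b] for b in range(N)) for a in range(N)]
        norm = math.sqrt(sum(v * v for v in w)) or 1.0
        u = [v / norm for v in w]
    # step (4): FPC = X @ U_1, one score per observation
    return [sum(X[i][j] * u[j] for j in range(N)) for i in range(M)]
```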

Variable Clustering Based on KMEANSVAR.
We use MCLPSO to optimize the k-means-based variable clustering algorithm KMEANSVAR, which is the same as CLV_kmeans in the R package ClustVarLV [19]. In KMEANSVAR, the variables are clustered iteratively, and the key components of KMEANSVAR are defined as follows. (1) Similarity. In variable clustering, the similarity between variables is usually defined by a correlation coefficient. In KMEANSVAR, the Pearson correlation (14) is used to measure the similarity between variables: the more highly correlated two variables are, the closer they are to each other, and vice versa. The similarity between variables is defined by (15).
(2) Update of Centroid. In KMEANSVAR, the centroid of a cluster of variables is always kept as the FPC of the variables in the cluster.
SCluster_k is the sample matrix composed of the M observations of the random vector (X_k1, X_k2, ..., X_kP); it can be obtained by keeping X_k1, X_k2, ..., X_kP and deleting X − {X_k1, X_k2, ..., X_kP} from S. FPC(SCluster_k) is the centroid of Cluster_k. (3) Clustering Criterion. The quality of a clustering result is measured by a clustering criterion; a high-quality clustering of variables maximizes this criterion. In [20], a clustering criterion is proposed for both quantitative and qualitative variables; in this work, we only take quantitative variables into consideration. In KMEANSVAR, the clustering criterion is defined by the homogeneity of the variables in each cluster. H(Cluster_k) denotes the homogeneity of the variable cluster Cluster_k, which is defined by (17); in (17), centroid_k is the centroid of Cluster_k obtained by FPC(SCluster_k).
H(Partition_K) denotes the homogeneity of a clustering of variables Partition_K, which is defined by (20) as the sum of the homogeneities H(Cluster_k) of the K clusters. Based on the discussion above, we can give the steps of KMEANSVAR in detail.
(1) Initialize the K cluster centroids. (2) Compute the distance between each variable and each cluster, where the distance between a variable and a cluster is defined as the distance between the variable and the centroid of the cluster. (3) Assign each variable to its closest cluster. (4) Compute centroid_k = FPC(SCluster_k) as the new centroid of Cluster_k. (5) Repeat (2) to (4) until the maximum number of iterations is reached. The pseudocode of KMEANSVAR is described in Figure 7.
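A compact sketch of these steps, with two stated simplifications: squared Pearson correlation measures closeness directly (rather than the distance of Eq. (15)), and the member average stands in for the FPC as the centroid update.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def kmeansvar(variables, centroids, iters=20):
    """KMEANSVAR sketch: variables is a list of observation vectors (one per
    variable); centroids is the list of K initial centroid vectors. Returns the
    cluster labels and the homogeneity H of the partition (mean squared
    correlation with the assigned centroid).
    """
    K = len(centroids)
    labels = [0] * len(variables)
    for _ in range(iters):
        # steps (2)-(3): assign each variable to the most correlated centroid
        for j, var in enumerate(variables):
            labels[j] = max(range(K), key=lambda k: pearson(var, centroids[k]) ** 2)
        # step (4): recompute each centroid from its members (stand-in for FPC)
        for k in range(K):
            members = [variables[j] for j in range(len(variables)) if labels[j] == k]
            if members:
                M = len(members[0])
                centroids[k] = [sum(v[i] for v in members) / len(members)
                                for i in range(M)]
    H = sum(pearson(variables[j], centroids[labels[j]]) ** 2
            for j in range(len(variables))) / len(variables)
    return labels, H
```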

Variable Clustering Based on MCLPSO.
Although KMEANSVAR can cluster the variables efficiently, it is as sensitive to the initial centroids as other k-means-based methods: the clustering criterion easily becomes trapped in local optima, and the quality of clustering cannot be guaranteed. To overcome this shortcoming, MCLPSO is used to optimize the initial centroids for KMEANSVAR, and MCLPSO-KMEANSVAR is proposed. In MCLPSO-KMEANSVAR, the solution is coded as the initial centroids for KMEANSVAR, and KMEANSVAR is embedded into the objective function of MCLPSO. MCLPSO optimizes the following. (1) Coding of the solution: particle i's solution is coded as a D-dimensional vector (21), D = K × M, where K is the number of clusters and M is the number of observations. The kth component of solution_i, centroid_ik, denotes that the centroid of the kth cluster, centroid_k, is initialized by centroid_ik; thus, solution_i determines the initial centroids for the clusters. The pos_i and pbest_i of particle i can be denoted as solution_i. (2) Objective function: in order to improve the quality of the clustering of variables, MCLPSO-KMEANSVAR optimizes H(Partition_K) by optimizing the initial centroids for KMEANSVAR. solution_i can be decomposed into K initial centroids, centroid_i1, ..., centroid_iK. The clustering of variables is obtained by calling KMEANSVAR parameterized by these centroids, i.e., Partition_K = KMEANSVAR(centroid_i1, ..., centroid_iK), and the clustering criterion H(Partition_K) can be obtained by (20). 1/H(Partition_K) is defined as the value of the objective function f. KMEANSVAR is embedded into the objective function of MCLPSO; therefore, the clustering result of KMEANSVAR can be optimized by adjusting the initial centroids for KMEANSVAR (Figure 8).
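The decoding and objective evaluation can be sketched as follows; the callable `kmeansvar(variables, centroids)` is assumed to return the pair (labels, H).

```python
def objective(solution, variables, K, kmeansvar):
    """Objective for MCLPSO-KMEANSVAR: decode a flat K*M solution vector into K
    initial centroids, run KMEANSVAR, and return 1/H(Partition_K) to minimize.
    """
    M = len(solution) // K
    # split the D = K*M vector into K centroid vectors of length M
    centroids = [solution[k * M:(k + 1) * M] for k in range(K)]
    _, H = kmeansvar(variables, centroids)
    return 1.0 / H if H > 0 else float("inf")   # guard against degenerate H
```

A particle's position is thus a point in R^(K*M), and MCLPSO's global and local search operators move it through the space of possible initializations of KMEANSVAR.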

Experiment
As the clustering criterion of the clustering results has been defined in Section 4, we give some experimental results in this section. We evaluate the performance of the proposed algorithm MCLPSO-KMEANSVAR and compare it with some other variable clustering methods: MCLPSO-KMEANSVAR is compared with CLPSO-KMEANSVAR (KMEANSVAR initialized by CLPSO) and the original version of KMEANSVAR with random initialization. In [18-20], cutting the dendrogram is recommended to initialize the k-means-based variable clustering method, but it is also stated that hierarchical variable clustering lacks scalability as the number of candidate variables increases because of its O(N²) complexity, where N is the number of candidate variables. We choose a more scalable alternative, k-means++ initialization [27], in which the first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center. It is easy to apply k-means++ initialization to KMEANSVAR, yielding KMEANSVAR++. The variable clustering methods and some experimental settings are listed in Table 1.
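The k-means++ seeding described above can be sketched as follows; for KMEANSVAR++, `points` would be the variables and `dist` a correlation-based distance, which is an assumption about the exact measure used.

```python
import random

def kmeanspp_init(points, K, dist):
    """k-means++ seeding [27]: the first center is uniform at random; each
    further center is drawn with probability proportional to its squared
    distance to the nearest already-chosen center.
    """
    centers = [random.choice(points)]
    while len(centers) < K:
        # squared distance of every point to its closest existing center
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        if total == 0:                   # all points coincide with centers
            centers.append(random.choice(points))
            continue
        r, acc = random.random() * total, 0.0
        for p, w in zip(points, d2):     # roulette-wheel selection over d2
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```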

Datasets.
We choose several real-world datasets as the benchmark datasets to test the variable clustering methods in Table 1. The detailed information of the datasets is listed in Table 2. D1 is chosen from the UCI datasets. D2 and D3 are collected from the MES (Manufacturing Execution System) database of a large-scale semiconductor manufacturing system located in Shanghai. D2 is composed of the values of the manufacturing performance variables, and D3 is composed of the values of the manufacturing status variables. D4 is the SECOM dataset described in [28]. A complex modern semiconductor manufacturing process such as house line testing is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. SECOM is collected from the database of the FCS (Floor Control System) of the semiconductor manufacturing process. In D1-D4, only continuous variables are considered. The number of clusters for each dataset is set according to its number of variables.
In order to ensure the validity of the evaluation, D1-D4 are preprocessed before the experiment. First, all the values are normalized to [0, 1] by the min-max normalization method. In particular, D4 contains some null values because the FCS is sometimes influenced by sensor drift, which results in data loss. Therefore, we apply the following rules to clean D4. After cleaning, the complete dataset D4 consists of 1560 instances, each with 440 variables. D4 is a difficult variable clustering problem.
(i) Remove the variables with unchangeable data
(ii) Remove the variables with more than 50% missing data
(iii) Remove the data items with more than 30% missing data
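The three cleaning rules plus min-max normalization can be sketched with pandas. The paper does not state how the residual missing values are filled (rows with up to 30% missing survive rule (iii)); mean imputation is assumed here purely for illustration.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    df = df.loc[:, df.nunique(dropna=True) > 1]      # (i) drop unchanging variables
    df = df.loc[:, df.isna().mean() <= 0.5]          # (ii) drop variables with >50% missing
    df = df.loc[df.isna().mean(axis=1) <= 0.3]       # (iii) drop items with >30% missing
    df = df.fillna(df.mean())                        # assumption: mean imputation (not in the paper)
    return (df - df.min()) / (df.max() - df.min())   # min-max normalize to [0, 1]
```

Applying rule (i) first also protects the final normalization step: constant columns, which would cause a division by zero, are already gone.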

Parameter Setting.
There are many parameters in MCLPSO. According to the "No Free Lunch" theorem [29], there does not exist a universally optimal parameterization. We set the parameters by following some empirical rules mentioned in previous studies [30, 31].
For the parameters in the global search component of MCLPSO, w decreases from 0.9 to 0.4 linearly, c = 1.49445, m = 7, the number of generations is set at 100, and the population size is set at 20. For the parameters in the SA_local_search, T0 = 10 to give a fine-grained local search. For the parameters in the "cognition-only" model, w decreases from 0.9 to 0.4 linearly along with the evolution cycles and c1 = 1.49445. P_SA and P_Chaotic are both set to 0.1. For the other parameters, we found two heuristic rules through some tentative experiments. Chaotic_ls_length should be positively correlated with stagnant_max, as stagnant_max determines the degree of stagnation of the particle: a high value of stagnant_max implies that a high value of Chaotic_ls_length is needed to enable the stagnant particle to escape from the local optimum. SA_ls_length is negatively correlated with improve_max, because a high value of improve_max denotes a high-quality, promising particle, so a moderate value of SA_ls_length is enough to detect the local optima. These parameters are set empirically: stagnant_max = 10, improve_max = 3, Chaotic_ls_length = 100, and SA_ls_length = 100.
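For reference, the parameterization above can be collected in one place. The dictionary keys are our illustrative names, not identifiers from the authors' code.

```python
# MCLPSO parameter values used in the experiments (taken from the text above)
MCLPSO_PARAMS = {
    "w_range": (0.9, 0.4),       # inertia weight, decreasing linearly
    "c": 1.49445,                # acceleration coefficient, global search
    "c1": 1.49445,               # coefficient of the "cognition-only" model
    "m": 7,                      # refreshing gap (as in CLPSO)
    "generations": 100,
    "population_size": 20,
    "T0": 10,                    # initial temperature of SA_local_search
    "P_SA": 0.1,
    "P_Chaotic": 0.1,
    "stagnant_max": 10,          # higher value -> longer chaotic local search needed
    "improve_max": 3,            # higher value -> shorter SA local search suffices
    "Chaotic_ls_length": 100,
    "SA_ls_length": 100,
}
```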
In CLPSO, w decreases from 0.9 to 0.4 linearly and c is set at 1.49445 as recommended by [5].
In KMEANSVAR, the number of clusters for each dataset has been specified in Table 2. When KMEANSVAR is evaluated in the fitness function, the maximum number of iterations is set to 10.

Results and Discussion.
The mean values and the standard deviations are recorded in Table 3, with the best result in bold.
First, we assess the effect of introducing metaheuristic optimization on variable clustering. From Table 3, we can find that the mean values of the clustering criterion obtained by KMEANSVAR on D1-D4 are relatively poor because of the intrinsic deficiency of k-means clustering, namely its sensitivity to initial centroids. KMEANSVAR is easily trapped in local optima, which results in relatively poor variable clusterings. KMEANSVAR also shows a large standard deviation, so its performance is not stable. Even on the simplest dataset, D2, with only 11 variables, the clustering criterion values obtained by KMEANSVAR are not satisfactory and the variance values remain large. KMEANSVAR++ can improve on KMEANSVAR by choosing each centroid with probability proportional to its squared distance from the closest existing cluster center; the improvement is definite but not significant. CLPSO-KMEANSVAR can improve the clustering results significantly compared with KMEANSVAR.
The mean values of the clustering criterion obtained by CLPSO-KMEANSVAR are improved, and the variance values of the clustering criterion are also reduced.
Therefore, the clustering result can be improved more significantly by introducing metaheuristic optimization than by using k-means++ seeding.
Second, we analyze the effect of introducing the local search operators and the adaptive memetic strategy into the population-based metaheuristic optimization for variable clustering. On D1-D2, the difference between MCLPSO-KMEANSVAR's results and CLPSO-KMEANSVAR's results is not significant because of the limited number of clusters and variables. When the number of variables and clusters increases, the advantage of MCLPSO-KMEANSVAR becomes more significant. On D3, MCLPSO-KMEANSVAR performs better than CLPSO-KMEANSVAR, and the improvement becomes more significant as the number of clusters increases. The advantage of MCLPSO-KMEANSVAR is most significant on D4, a complex real-world industry dataset with 440 variables. Therefore, the global search operators and the local search operators take effect when dealing with datasets with a large number of variables.
Furthermore, we analyze the robustness of MCLPSO-KMEANSVAR when dealing with the complex real-world dataset. Although MCLPSO-KMEANSVAR can generally improve the quality of the clustering of variables by optimizing the clustering criterion, its variance values on D1-D4 are not reduced. To show the robustness of the above approaches, the boxplots of MCLPSO-KMEANSVAR's, CLPSO-KMEANSVAR's, and KMEANSVAR's results on D4 are depicted in Figures 9-11. From Figure 9, we can find that KMEANSVAR lacks robustness because of its intrinsic deficiency. CLPSO-KMEANSVAR's result distributions are flatter: the variable clustering results can be improved significantly by introducing metaheuristic optimization. Compared with CLPSO-KMEANSVAR's results, the range and interquartile range of MCLPSO-KMEANSVAR's results are relatively higher, so robustness is not improved by introducing the local search operators and the adaptive memetic strategy. However, Figure 11 shows that MCLPSO-KMEANSVAR can avoid some extremely bad cases, and Figure 10 shows that the possibility of finding satisfactory results is also higher.
To confirm that the improvement brought by MCLPSO-KMEANSVAR over CLPSO-KMEANSVAR is definite, nonparametric Wilcoxon rank sum tests are conducted between MCLPSO's results and CLPSO's results. The results of the tests are presented in the last row of Table 3. If h = 1, the performances of the two algorithms are statistically different with 95% certainty; if h = 0, the performances are not statistically different. From Table 3, we find that MCLPSO-KMEANSVAR and CLPSO-KMEANSVAR become statistically different as K and the number of candidate variables increase.
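The h values above come from a two-sided Wilcoxon rank sum test at the 5% level. A numpy-only sketch using the normal approximation (adequate for samples of 30+ runs per algorithm; `scipy.stats.ranksums` is the usual off-the-shelf tool) could look like:

```python
import math
import numpy as np

def ranksum_h(a, b, alpha=0.05):
    """Wilcoxon rank sum test (normal approximation); returns h = 1 if the
    two result samples differ significantly at level alpha, else h = 0."""
    combined = np.concatenate([a, b])
    order = combined.argsort()
    ranks = np.empty(len(combined))
    ranks[order] = np.arange(1, len(combined) + 1)
    for v in np.unique(combined):               # average ranks over ties
        tied = combined == v
        ranks[tied] = ranks[tied].mean()
    n1, n2 = len(a), len(b)
    W = ranks[:n1].sum()                        # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2.0               # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (W - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value
    return int(p < alpha)
```

Being rank-based, the test makes no normality assumption about the clustering criterion values, which is why it suits comparing stochastic optimizer runs.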

Implementation and Computational Time.
The algorithms discussed above are all implemented in Java 8, so we can use multithreading to accelerate the particles' comprehensive learning process. We run the code on an Intel i5-8365U CPU with a parallelism of 8, i.e., 8 particles can perform the comprehensive learning operation simultaneously.
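The parallel evaluation pattern (8 particles at a time) can be sketched as follows; the Java implementation presumably uses an equivalent thread pool, and the function names here are ours.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_swarm(particles, fitness, parallelism=8):
    """Evaluate the fitness of up to `parallelism` particles concurrently;
    results come back in the particles' original order."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(fitness, particles))
```

Since each evaluation calls KMEANSVAR and is CPU-bound, a process pool would parallelize better in CPython; threads suffice when the fitness function releases the GIL (e.g., numpy-heavy code) or on a runtime without one, as in the Java original.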
When we use MCLPSO to optimize KMEANSVAR, we run KMEANSVAR with a maximum number of evaluations (calls to KMEANSVAR) of 2000, as stated in Table 1. In Section 5.2, we stated that when KMEANSVAR is called in the fitness function, the number of iterations is restricted to 10. The computational time of MCLPSO-KMEANSVAR is about 40-60 times that of KMEANSVAR. The computational times on D4 with different K are listed in Table 4. The computation times of MCLPSO-KMEANSVAR and KMEANSVAR increase linearly with K, but that of KMEANSVAR++ increases dramatically because the initialization of KMEANSVAR++ is sensitive to K. Thus MCLPSO-KMEANSVAR is more scalable than KMEANSVAR++ (Table 5).

A Web-Based Interactive Software Platform
In Section 5, several datasets are used to evaluate MCLPSO-KMEANSVAR. Except for D1, the datasets D2-D4 are collected from the information system databases of semiconductor manufacturing factories. The relationship between D2-D4 is explained in Figure 12. The variable clustering analysis of the variables of D2-D4 is important, practical work. It helps to find useful insights into manufacturing systems from different perspectives and to improve operation management through further analysis such as performance analysis, optimal control, and fault diagnosis.
For practical usage, we have also developed a web-based interactive software platform based on MCLPSO-KMEANSVAR. In this section, we introduce the usage of the software platform by demonstrating each step. The performance analysis of the semiconductor manufacturing system is introduced as a case study.

Performance of Semiconductor Manufacturing System.
The semiconductor manufacturing system is very complicated, and its performance can be affected by the manufacturing environment, scheduling rules, equipment failure rate, and rush orders. The analysis of performance is useful to improve the operation management of the semiconductor manufacturing system. We choose 8 performance variables; their detailed description is presented in Table 3.

A Web-Based Interactive Variable Clustering System.
First, the dataset of the performance history data is uploaded (Figure 13). Then some statistics of each variable can be viewed (Figure 14). The user can also choose thresholds to smooth the outliers of each variable before variable clustering.
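The outlier-smoothing step can be sketched as percentile clipping. The platform's actual thresholding rule is not specified in the text, so the percentile defaults and function name here are assumptions.

```python
import numpy as np

def smooth_outliers(column, lower_pct=1.0, upper_pct=99.0):
    """Clip a variable's values to user-chosen percentile thresholds,
    smoothing extreme outliers before variable clustering."""
    lo, hi = np.percentile(column, [lower_pct, upper_pct])
    return np.clip(column, lo, hi)
```

Clipping (rather than dropping) keeps the number of observations intact, which matters because every variable must share the same M observations for the correlation-based clustering.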
The result of MCLPSO-KMEANSVAR is presented in Figure 15. We can reduce the number of optimization objectives by variable clustering.

Conclusion
In this work, MCLPSO, a novel memetic algorithm presented in our previous research, is introduced as a metaheuristic approach to improve k-means-based variable clustering. The experimental results show that MCLPSO-KMEANSVAR outperforms KMEANSVAR significantly. We also develop a web-based interactive software platform implementing MCLPSO-KMEANSVAR and give a case study of performance analysis for a semiconductor manufacturing system. In future research, we will further study the practical use of MCLPSO-KMEANSVAR in other problems and develop a distributed MCLPSO-KMEANSVAR for analyzing big data.

Y1-Y3 are long-term global performance variables, Y4-Y6 are short-term global performance variables, and Y7-Y8 are short-term local performance variables.
(1) Initialize centroid_1, ..., centroid_K for clusters Cluster_1, ..., Cluster_K
(2) Clear the clusters Cluster_1, ..., Cluster_K
(3) For each X_i ∈ X, find its nearest cluster Cluster_nearest and assign X_i to Cluster_nearest:
    Cluster_nearest = arg min_{Cluster_k ∈ Partition_K} d(X_i, Cluster_k)

First principal component of S:
    get the eigenvalues λ_1, λ_2, ..., λ_N of C, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N
    get the corresponding eigenvectors U_1, U_2, ..., U_N
    return SU_1 as the first principal component of S

function f(solution: sol)
begin
    decompose sol and get centroid_1, ..., centroid_K
    Partition_K = KMEANSVAR(centroid_1, ..., centroid_K)
    return 1/H(Partition_K) as the value of f
end
Figure 8: Objective function of MCLPSO-KMEANSVAR.

Table 1 :
Variable clustering methods for comparison.

Table 3 :
The result of the clustering of variables.

From Table 3, we can find that on D1-D2, the mean values of the clustering criterion obtained by MCLPSO-KMEANSVAR are similar to the mean values obtained by CLPSO-KMEANSVAR. The number of possible clustering results can be derived from the number of variables and the number of clusters; for example, the number of possible clustering results on D2 involves C_11^2, C_11^3, and C_11^4 when the number of clusters is 2, 3, and 4, respectively.

Table 4 :
The result of computational time on D4 (minutes).