Subspace Clustering of High-Dimensional Data: An Evolutionary Approach



Introduction
The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. The task of clustering has been studied in statistics [1], machine learning [2][3][4], bioinformatics [3][5][6][7], and more recently in databases [8][9][10]. Clustering algorithms find a partition of the points such that points within a cluster are more similar to each other than to points in different clusters [11]. In traditional clustering each dimension is equally weighted when computing the distance between points. Most of these algorithms perform well in clustering low-dimensional data sets [12][13][14][15]. However, in higher-dimensional feature spaces their performance and efficiency deteriorate considerably [16]. In the clustering task, the problem of high dimensionality presents a dual aspect. First, the presence of irrelevant attributes obscures any clustering tendency, because such features cause the algorithm to search for clusters where none exist. This also happens with low-dimensional data, but the likelihood of irrelevant features being present, and their number, grows with the dimension. The second problem is the so-called "curse of dimensionality." For clustering this means that clusters do not show across all attributes, as they are hidden by irrelevant attributes or blurred by noise. Clustering methods are typically based either on distances (like partitioning and hierarchical clustering) or on densities (like density-based methods). In [17] the authors study in detail the effects of high dimensions on the distance to the nearest neighbor dist_min(o) and to the farthest neighbor dist_max(o) of an object o. They have proven that, for a wide range of distributions and every ε > 0,

lim_{dim → ∞} P( dist_max(o) ≤ (1 + ε) · dist_min(o) ) = 1.
This statement formalizes that, with growing dimensionality (dim), the distance to the nearest neighbor becomes nearly equal to the distance to the farthest neighbor (distances become more and more similar). Consequently, clustering methods based on distance functions have problems extracting meaningful patterns in high-dimensional spaces, as they either cluster only one object (the nearest neighbor) or nearly the complete data set (the farthest neighbor). Figure 1 shows that clusters are embedded in different subspaces of high-dimensional data sets.
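The distance-concentration effect is easy to reproduce empirically. The sketch below is an illustration of our own (the sampling setup is not taken from [17]): it draws points uniformly from the unit hypercube and reports the ratio of the farthest to the nearest distance from the origin, which approaches 1 as the dimensionality grows.

```python
import numpy as np

def contrast(dim, n=2000, seed=0):
    """Ratio of farthest to nearest distance from the origin for
    points drawn uniformly from the unit hypercube of dimension `dim`."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))
    dists = np.linalg.norm(pts, axis=1)
    return dists.max() / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(contrast(dim), 2))
```

For dim = 2 the ratio is large (some points lie very close to the origin, others near the far corner), while for dim = 1000 it is close to 1.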
Densities also suffer from the curse of dimensionality. In [18] the authors describe an effect of higher dimensions on density distributions: 99% of the mass of a ten-dimensional normal distribution is at points whose distance from the origin is greater than 1.6. The effect is the direct opposite in lower-dimensional spaces: for a one-dimensional distribution, 90% of the objects have a distance of less than 1.6 standard deviations from the origin. Density-based clustering methods [19, 20] hence have problems determining the density of a region, as the objects are scattered over the data space. Grid-based methods [10, 21, 22] are capable of discovering clusters of any shape and are also reasonably fast. However, none of these methods addresses how to efficiently cluster very large data sets that do not fit in memory. Furthermore, these methods only work well with input spaces of low to moderate dimensionality. As the dimensionality of the space increases, grid-based methods face serious problems: the number of cells grows exponentially, and finding adjacent high-density cells to form clusters becomes prohibitively expensive. Often, especially in high-dimensional spaces, not all dimensions are relevant to a given cluster; along the irrelevant dimensions the data points are spread out. It is vital for a clustering method to be able to detect clusters embedded in subspaces possibly formed by different combinations of dimensions in different data localities.
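The density effect from [18] can likewise be checked by simulation. The following sketch (our own illustration, not code from [18]) estimates the fraction of standard-normal mass farther than 1.6 from the origin in one and ten dimensions.

```python
import numpy as np

def mass_beyond(dim, radius=1.6, n=100_000, seed=0):
    """Monte Carlo estimate of the fraction of standard-normal samples
    whose Euclidean distance from the origin exceeds `radius`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    return (np.linalg.norm(x, axis=1) > radius).mean()

for d in (1, 10):
    print(d, round(mass_beyond(d), 3))
```

For d = 1 roughly 11% of the mass lies beyond 1.6; for d = 10 about 99% does, matching the figures quoted in the text.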
These observations motivate our effort to propose a novel subspace clustering algorithm called multiobjective subspace clustering (MOSCL) that efficiently clusters high-dimensional numerical data sets. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detection of dense regions it eliminates outliers. The discussion details key aspects of the proposed MOSCL algorithm, including the representation scheme, the fitness functions to be maximized, and novel genetic operators. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to subspace clustering methods such as PROCLUS [23]. Additionally, we investigate the effect of the first phase, which detects dense regions, on the results of subspace clustering. The performance of MOSCL is evaluated by the clustering error (CE) distance [24]. It is an intuitive way to compare clusterings because it uses the proportion of points which are clustered differently in the generated subspace clusters and the real subspace clusters after an optimal matching of clusters.
The remainder of this paper is structured as follows. In Section 2, we review related work. Section 3 describes multiobjective subspace clustering: Section 3.1 presents the preprocessing phase that detects dense and sparse regions and removes outliers, and Section 3.2 describes the design concepts of the multiobjective subspace clustering (MOSCL) algorithm. Section 4 presents the experimental evaluation. Finally, we conclude in Section 5.

Related Work
Subspace clustering algorithms can be divided into two categories: partition-based and grid-based. Partition-based algorithms partition the set of objects into mutually exclusive groups. Each group, together with the subset of dimensions in which its members show the greatest similarity, is known as a subspace cluster. Similar to the K-means method, most algorithms in this category define an objective function to be minimized during the search. The major difference between these methods and the K-means algorithm is that here the objective functions are related to the subspaces in which each cluster resides. CLIQUE (clustering in quest) [25] is one of the first subspace algorithms that attempts to find clusters within subspaces of the data set. It combines density- and grid-based clustering and uses an APRIORI-style technique to find clusterable subspaces. It is difficult for CLIQUE to discover high-quality clusters in all subspaces, because it identifies dense units in all subspaces with a single absolute unit-density threshold while unit densities vary across subspace cardinalities; it therefore suffers from a trade-off between precision and recall. SUBCLU (density-connected subspace clustering) [26] overcomes limitations of grid-based approaches such as the dependence on the positioning of the grids. Instead of grids it uses the density-connected sets of the DBSCAN [19] cluster model. SUBCLU is based on a bottom-up, greedy algorithm to detect the density-connected clusters in all subspaces of high-dimensional data. However, SUBCLU also suffers from the density divergence problem. PROCLUS (projected clustering) [23], a typical partition-based subspace clustering algorithm, searches for a partition of the data set into clusters together with the set of dimensions on which each cluster is correlated. PROCLUS is a variation of the k-medoid algorithm in which the number of clusters k and the average number of dimensions of the clusters l need to be specified before running the algorithm. The algorithm also assumes that one data point can be assigned to at most one subspace cluster or classified as an outlier, while a dimension can belong to multiple clusters. ORCLUS (arbitrarily oriented projected cluster generation) [27] is a generalization of PROCLUS [23] which finds clusters in arbitrarily oriented subspaces. ORCLUS finds projected clusters as a set of data points C together with a set of orthogonal vectors such that these data points are closely clustered in the defined subspace. A limitation of these two approaches is that the process of forming the locality is based on the full dimensionality of the space; it is not useful to look for neighbors in data sets with very low-dimensional projected clusters. In addition, PROCLUS and ORCLUS require the user to provide the average dimensionality of the subspace, which is very difficult to do in real-life applications. FIRES (filter refinement subspace clustering) [28] is a general framework for efficient subspace clustering. It is generic in that it works with all kinds of clustering notions. CLICK (subspace clustering of categorical data via maximal k-partite cliques) [29] uses a novel formulation of categorical subspace clusters based on the notion of mining cliques in a k-partite graph. It implements an efficient algorithm to mine maximal k-partite cliques, which correspond to the clusters. COSA (clustering objects on subsets of attributes) [30] formalizes the subspace clustering problem as an optimization problem. The algorithm returns mutually exclusive subspace clusters, with each data point assigned to exactly one subspace cluster; one dimension can belong to more than one subspace cluster. However, the subspace in which each cluster is embedded is not explicitly known from the algorithm. The FINDIT (fast and intelligent subspace clustering algorithm using dimension voting) [31] algorithm uses a dimension voting technique to find subspace clusters. A dimension-oriented distance is defined to measure the distance between points based not only on the value information but also on the dimension information. DENCOS (density-conscious subspace clustering) [32] tackles the density divergence problem: its authors devise a subspace clustering model that discovers clusters based on the relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities.
Grid-based subspace clustering algorithms consider the data matrix as a high-dimensional grid and the clustering process as a search for dense regions in the grid. ENCLUS (entropy-based clustering) [33] uses entropy instead of density and coverage as a heuristic to prune away uninteresting subspaces. It finds correlated, high-density, and high-coverage subspaces using the level-wise search of CLIQUE [25]. However, this algorithm finds only the subspaces within which meaningful clusters exist, without explicitly finding the actual clusters. A more significant modification of CLIQUE is presented in MAFIA [34], which extends the base units of CLIQUE to adaptive, variable-sized units in each dimension. These variable-sized bins are taken as building blocks to form units in higher subspaces. MAFIA also uses a candidate generate-and-test scheme to generate candidate dense units in higher subspaces, resulting in unavoidable information overlap in the clustering result. pMAFIA (merging adaptive finite intervals and is more than a clique) [35] likewise proposes adaptive units instead of the rigid ones used in CLIQUE [25]. Each dimension is partitioned into windows of small size, and adjacent windows with similar distributions are merged to form larger windows. However, as in CLIQUE, the search complexity increases exponentially as a function of the highest dimensionality of the dense units, and pMAFIA may have difficulty discovering high-quality clusters in all subspace cardinalities. DOC (density-based optimal projective clustering) [36] proposes a mathematical definition of an optimal projective cluster along with a Monte Carlo algorithm to compute approximations of such optimal projective clusters. DOC tries different seeds and neighboring data points in order to find the cluster that optimizes the quality function, and the entire process is repeated to find other projected clusters. Since DOC scans the entire data set repeatedly, its execution time is very high. O-Cluster (orthogonal partitioning clustering) [37] combines a novel active partitioning sampling technique with an axis-parallel strategy to identify continuous areas of high density in the input space. The EWKM (entropy-weighting k-means) algorithm [38] is a K-means-type subspace clustering algorithm for high-dimensional sparse data. It simultaneously minimizes the within-cluster dispersion and maximizes the negative weight entropy in the clustering process.

Multiobjective Subspace Clustering
Genetic algorithms, a particular class of evolutionary algorithms, have been recognized to be well suited to multiobjective optimization problems. In our work, we employ the multiobjective subspace clustering (MOSCL) algorithm for clustering data sets based on the subspace approach. In this section, we discuss the important concepts of the preprocessing phase and the design concepts of MOSCL.

Preprocessing Phase.
The goal of the preprocessing step is to identify all dimensions in a data set which exhibit some cluster structure, by discovering dense regions and their locations in each dimension [39]. By cluster structure we mean a region that has a higher density of points than its surrounding regions. Such a dense region represents the one-dimensional projection of some cluster. Hence, by detecting dense regions in each dimension, we are able to discriminate between dimensions that are relevant to clusters and irrelevant ones. The identified dimensions represent potential candidates for relevant dimensions of the subspace clusters. The irrelevant attributes contain noise/outliers and sparse data points [23].
Let us first give some definitions. Let D be a data set of n data points of dimensionality d. Let A = {A_1, A_2, ..., A_d} be the set of all attributes A_j of the data set D, and let S = A_1 × A_2 × ⋯ × A_d be the corresponding d-dimensional data space. Any k-dimensional subspace T ⊆ S is a space spanned by k dimensions drawn from the d attributes, where k ≤ d. The cardinality of a subspace is defined as the number of dimensions forming it. The input consists of a set of d-dimensional points X = {x_1, x_2, ..., x_n}, where x_i = (x_i1, x_i2, ..., x_id). The projection of a point x_i ∈ X into a subspace T ⊆ S is denoted by proj_T(x_i). The distance function between data points is denoted by dist and is assumed to be one of the L_p-norms. The set of k nearest neighbors of a data point x_i, denoted NN_k(x_i), is the smallest set NN_k(x_i) ⊆ X that contains at least k data points from the data set D. In order to detect densely populated regions in each attribute, we compute the variance of the local neighborhood of each data point by measuring the variance of its k nearest neighbors in that attribute [40]. The variance degree v_ij of the value x_ij (i = 1, ..., n and j = 1, ..., d) is defined as

v_ij = (1/k) Σ_{p ∈ NN_k(x_ij)} (p − μ_ij)²,

where μ_ij is the mean of the k nearest neighbor values of x_ij. A large value of v_ij means that the data point belongs to a sparse region, while a small one indicates that it belongs to a dense region. Calculating the k nearest neighbors is an expensive task, especially when the number of data points n is very large. We can search the k nearest neighbors efficiently by presorting the values in each attribute and limiting the number of distance comparisons to a maximum of 2k values. The major advantage of using the variance degree is that it provides a relative measure on which dense regions are easily distinguishable from sparse regions. On the other hand, when a dimension contains only sparse regions, all the estimated v_ij for that dimension tend to be very large. Our objective now is to determine whether or not dense regions are present in a dimension. In order to identify dense regions in each dimension, we are interested in all values v_ij that are small with respect to a predefined density threshold ε ∈ R; setting 0 < ε ≤ 0.1 is a reasonable choice. We apply this density threshold to estimate the binary weight z_ij of the value x_ij: if v_ij < ε, then z_ij = 1 and x_ij belongs to a dense region; otherwise z_ij = 0 and x_ij belongs to a sparse region. We thus obtain a binary matrix Z(n × d) which records whether each data point falls in a dense or a sparse region of each attribute (Table 1 shows such a matrix). Clearly, the computation of z_ij depends on the input parameters k and ε. In practice, since the variance degree v_ij is the indicator for dense regions and its values vary significantly depending on the attributes, we first normalize the v_ij of each attribute A_j onto the interval [0, 1] before applying the threshold. The role of the parameter k is intuitively easy to understand; if k is chosen too small, the variance degrees v_ij are not meaningful, because the few neighbors will have distances close to zero.
Obviously, the parameter k is related to the expected minimum cluster size and should be much smaller than the number of objects n in the data. To obtain a clear picture of the variance of the neighborhood of a point, we choose k ≤ √n in this phase. In order to capture irrelevant attributes and outliers, we use binary similarity coefficients to measure the similarity between the binary rows z_i (i = 1, ..., n) of the matrix Z. One commonly used similarity measure for binary data is the Jaccard coefficient [41]. This measure is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. In the preprocessing phase, we require a similarity measure that can reflect the degree of overlap between the binary rows z_i within the identified dense regions of the matrix Z. Since dense regions are encoded by 1 in Z, the Jaccard coefficient is suitable for our task because it considers only matches on 1s to be important. The Jaccard coefficient is given as

J(z_i, z_j) = a / (a + b + c),

where a indicates the number of dimensions in which the two objects both have the binary value 1, and b and c count the number of dimensions in which the two objects have different binary values. The Jaccard coefficient [41] takes values between 0 and 1. A pair of points is considered similar if the estimated Jaccard coefficient between them exceeds a certain threshold δ; we set δ = 0.8 for measuring the degree of similarity between two binary vectors in the preprocessing phase. On the other hand, when all of the binary weights of a row z_i are equal to zero, the related data point x_i is systematically considered an outlier, because it does not belong to any of the discovered dense regions. The |O| identified outliers are discarded from the data set, and their corresponding rows are eliminated from the matrix. By eliminating outliers, the size of the data set D is reduced to n′ = n − |O|, with new associated binary weight matrix Z(n′ × d) (see Table 2). MOSCL uses the binary weights of the matrix Z for the representation of chromosomes, and it uses a modified distance function that considers contributions only from relevant dimensions when computing the distance between a data point and a cluster center. In concrete terms, we associate the binary weights z_ij (i = 1, ..., n′; j = 1, ..., d) of the matrix Z with the Euclidean distance. This makes the distance measure more effective, because the computation of distance is restricted to the subsets where the data point values are dense. Formally, the distance between a point x_i and a cluster center c_l is

dist_w(x_i, c_l) = ( Σ_{j=1}^{d} z_ij (x_ij − c_lj)² )^{1/2}.   (5)
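The preprocessing phase can be sketched as follows. This is an illustrative implementation under our own assumptions: the function names, the exact per-point variance formula, and the normalization details are our choices, not prescribed by the text.

```python
import numpy as np

def binary_weights(X, k=5, eps=0.1):
    """Dense-region indicator matrix Z (n x d): z_ij = 1 when the
    min-max-normalised variance degree of x_ij is below eps."""
    n, d = X.shape
    Z = np.zeros((n, d), dtype=int)
    for j in range(d):
        order = np.argsort(X[:, j])
        s = X[order, j]                       # presorted attribute values
        var = np.empty(n)
        for idx in range(n):
            # the 1-D k nearest neighbours lie among 2k presorted neighbours
            lo, hi = max(0, idx - k), min(n, idx + k + 1)
            window = np.delete(s[lo:hi], idx - lo)
            nn = window[np.argsort(np.abs(window - s[idx]))[:k]]
            var[idx] = nn.var()
        span = var.max() - var.min()
        var = (var - var.min()) / (span if span > 0 else 1.0)  # -> [0, 1]
        Z[order, j] = (var < eps).astype(int)
    return Z

def jaccard(u, v):
    """Jaccard coefficient of two binary rows: matches on 1s over
    positions where at least one row has a 1."""
    u, v = np.asarray(u), np.asarray(v)
    both = int(np.sum((u == 1) & (v == 1)))
    either = int(np.sum((u == 1) | (v == 1)))
    return both / either if either else 0.0

def remove_outliers(X, Z):
    """Drop points whose binary-weight row is all zeros."""
    keep = Z.sum(axis=1) > 0
    return X[keep], Z[keep]
```

On a column with a tight cluster plus scattered values, the clustered points receive weight 1 while the scattered ones are mostly marked sparse.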

Multiobjective Subspace Clustering Algorithm.
In this subsection, we dwell on the important design issues of MOSCL, including individual representation, fitness function, selection operator, search operators, and elitism. The basic steps of MOSCL are presented in Procedure 1.

Representation.
For any genetic algorithm, a chromosome representation is needed to describe each individual in the population of interest [42]. The representation method determines how the problem is structured in the algorithm and which genetic operators can be used. Each chromosome is made up of a sequence of genes from a certain alphabet; an alphabet can consist of binary digits (0 and 1), floating-point numbers, integers, or symbols [43]. A straightforward yet effective representation scheme for subspaces is the standard binary encoding: all individuals are represented by binary strings of fixed and equal length d, where d is the number of dimensions of the data set. Using the binary alphabet Σ = {0, 1} for gene alleles, each bit in the individual takes the value "0" or "1", indicating whether or not its corresponding attribute is selected ("0" indicates the corresponding attribute is absent and "1" that it is present). As a simple example, the individual 110101, for data set dimensionality d = 6, represents a 4-dimensional subspace cluster containing the 1st, 2nd, 4th, and 6th attributes of the data set.
In our method, each chromosome is described by a binary matrix of size d × k_c, where d is the dimension of the feature space and k_c is the number of subspace clusters described by the chromosome. That is to say, the chromosome is written as a concatenation of binary vectors: the first d values represent the weights of the d dimensions of the first subspace cluster, the next d values represent the weights of the second subspace cluster, and so on.
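A minimal decoder for this representation might look as follows (the function name and flat-string encoding are illustrative choices):

```python
def decode(chromosome, d):
    """Split a flat binary chromosome into per-cluster attribute lists
    (1-based indices, following the example in the text)."""
    clusters = [chromosome[i:i + d] for i in range(0, len(chromosome), d)]
    return [[j + 1 for j, bit in enumerate(c) if bit == '1'] for c in clusters]

print(decode('110101', 6))        # single subspace cluster -> [[1, 2, 4, 6]]
print(decode('110101001100', 6))  # two subspace clusters
```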

Objective Functions.
In the single-objective optimization problem, the objective function and fitness function are often identical and are usually used interchangeably [43]. Hence, quality evaluation of individuals in single-objective optimization problems can be conducted directly based on their objective function values. Single-objective methods of data clustering can be categorized based on the criterion they optimize [44]. One group of clustering algorithms tries to form compact spherical clusters by minimizing the intracluster distance between data points and cluster representatives, as in k-means [12], average-link agglomerative clustering [13], and self-organizing maps [45]. Another group of clustering algorithms tries to form clusters in which data items close to each other fall into the same cluster, hence optimizing connectedness. This category of clustering algorithms can find clusters of arbitrary shapes; however, they might fail when clusters are close to each other. An example of this group is single-link agglomerative clustering. Clustering algorithms with a single criterion can thus find one category of cluster shapes and fail on the other category. In contrast, in genetic multiobjective optimization the objective function and fitness function differ. The objective function f, applied to the decision vector x, produces a k-dimensional objective vector S. MOSCL uses a multiobjective genetic algorithm during fitness evaluation, with multiple objective functions for finding subspace clusters. The first objective measures the density of each subspace cluster C_i.
On the other hand, minimizing compactness tends to split the data set into as many subspace clusters as possible. Therefore the two objectives, density and compactness, interact towards making semioptimal clustering solutions. The mathematical definitions of the objectives of MOSCL are as follows.
Let A = {A_1, ..., A_d} be the set of d attributes. We define the density of a single attribute A_j as the fraction of "1" entries in the column A_j of the binary matrix Z, denoted dens(A_j). The density of a subspace cluster C_i, denoted dens(C_i), is the ratio between the number of data points contained in the subspace cluster, |C_i|, and the total number of data points in the data set, |D|. We propose a weighted density measure for all subspace clusters. The weighted density of a subspace cluster C_i, denoted dens_w(C_i), is defined as the ratio between dens(C_i) and the average density over all attributes contained in its subspace T_i; that is, dens_w(C_i) = dens(C_i) / ((1/|T_i|) Σ_{A_j ∈ T_i} dens(A_j)), where |T_i| is the number of attributes contained in the subspace cluster. We call the denominator (1/|T_i|) Σ_{A_j ∈ T_i} dens(A_j) the weight. The second objective is to minimize the intracluster variance, or compactness. To measure the compactness of a subspace clustering, we use the Euclidean distance in weighted form as the distance measure. In order to define the weighted Euclidean distance for subspace clustering, the data points x_i and v_j can be thought of as the projections proj_S(x_i) and proj_S(v_j) in their own subspace S. The compactness is

comp = Σ_{C_i} Σ_{x ∈ C_i} ( Σ_{j=1}^{d} z_ij (x_j − m_ij)² )^{1/2},

where C_i is a subspace cluster, m_i its center, and z_ij are the binary weights associated with each attribute. This makes the distance measure more effective because the computation of distance is restricted to the subsets where the object values are dense. Thus, by minimizing this objective we give higher fitness to compact subspace clusters.
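The two objectives can be sketched directly from these definitions. The code below is an illustrative reading of the formulas, with argument names of our own choosing.

```python
import numpy as np

def weighted_density(Z, members, attrs):
    """Weighted density dens_w(C): cluster density divided by the mean
    attribute density over the cluster's attributes (the 'weight')."""
    dens_c = len(members) / Z.shape[0]
    weight = Z[:, attrs].mean(axis=0).mean()   # mean fraction of 1s per column
    return dens_c / weight

def compactness(X, members, center, w):
    """Sum of binary-weighted Euclidean distances from the cluster
    members to the cluster centre."""
    diffs = X[np.asarray(members)] - np.asarray(center)
    return float(np.sum(np.sqrt(np.sum(np.asarray(w) * diffs ** 2, axis=1))))
```

A cluster whose attribute columns are dense and whose members sit close to the centre scores high on the first objective and low on the second.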

Selection.
The selection operator governs the choice of the individuals from a population that will be allowed to mate in order to create a new generation of individuals. Genetic algorithm methods attempt to develop fitness methods and elitism rules that find a set of optimal values quickly and reliably [46]. But, ultimately, it is the selection method that is responsible for the choice of a new generation of parents. Thus, the selection process maps the fitness values of the input population to probabilities of selection and therefore to the probability of an individual's genetic material being passed to the next generation. Here we adopt the tournament selection method [47] because its time complexity is low. The basic concept of the tournament method is as follows: randomly select a positive number N_tour of chromosomes from the population and copy the best fitted among them into an intermediate population. The process is repeated P times, where P is the population size. The algorithm of tournament selection is shown in Procedure 2:

begin
  for i = 1 to P do
    s_i ← best fitted item among N_tour elements randomly selected from P;
  return {s_1, ..., s_P}
end

Procedure 2: Procedure of the tournament selection method.
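A compact sketch of tournament selection as described in Procedure 2 (the fixed seed and function signature are our own choices):

```python
import random

def tournament_select(population, fitness, n_tour, seed=0):
    """Fill a mating pool of size |P| by repeated tournaments: each
    round picks the fittest of n_tour randomly drawn individuals."""
    rng = random.Random(seed)
    pool = []
    for _ in range(len(population)):
        contenders = rng.sample(range(len(population)), n_tour)
        best = max(contenders, key=lambda i: fitness[i])
        pool.append(population[best])
    return pool
```

With n_tour equal to the population size every tournament returns the fittest individual; smaller values keep more diversity in the pool.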

Crossover.
Crossover is the feature of genetic algorithms that distinguishes them from other optimization techniques. As with other optimization techniques, genetic algorithms must calculate a fitness value, select individuals for reproduction, and use mutation to prevent convergence on local maxima; but only genetic algorithms use crossover to take some attributes from one parent and other attributes from a second parent. We use the uniform crossover [48] in the proposed algorithm. The uniform crossover is applied to all genes of the chromosome simultaneously. Two chromosomes are randomly selected as parents from the current population. The crossover creates the offspring chromosomes on a bitwise basis, copying each allele from one of the parents according to a random real number r_i uniformly distributed in the interval [0, 1]. Let P_1 and P_2 be two parents, and let C_1 and C_2 be the offspring chromosomes; the i-th allele in each offspring is defined as

C_1,i = P_1,i and C_2,i = P_2,i if r_i < 0.5;  C_1,i = P_2,i and C_2,i = P_1,i otherwise.
For example, suppose chromosomes P_1 and P_2 are selected for uniform crossover and the random numbers r_i = [0.3, 0.7, 0.8, 0.4, 0.2, 0.5] are generated. Each allele is then copied from the corresponding parent, as discussed in the uniform crossover operation, to produce the two offspring.

Mutation.

Unlike the crossover and selection operators, mutation is not necessarily applied to all individuals [49]. Multiple mutation specifications are available: bit, uniform, and normal. Each mutation type has an independent probability of being applied to each individual; therefore, some individuals may be mutated by several mutation methods, some by a single mutation method, and some not at all. Here we apply bit mutation to the genes [48]. This results in some bits in the genes of the children being reversed: "1" is changed to "0" and "0" is changed to "1". The gene values are modified as

ĝ = g + N(0, σ),

where ĝ is the mutated gene, g is the existing gene value, and N(0, σ) is a normally distributed random number with mean 0 and standard deviation σ. After some trial and error, we find that σ = 0.1 gives good results.
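The two variation operators can be sketched as follows. The seeds, signatures, and the 0.5 mixing ratio for uniform crossover are our assumptions; the mutation sketch implements the bit-flip variant described above rather than the real-valued formula.

```python
import random

def uniform_crossover(p1, p2, seed=0):
    """For each gene draw r in [0, 1]; r < 0.5 keeps the parents'
    alleles in place, otherwise the alleles are swapped between the
    two offspring."""
    rng = random.Random(seed)
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        if rng.random() >= 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return ''.join(c1), ''.join(c2)

def bit_mutation(chrom, p_m=0.1, seed=0):
    """Flip each bit independently with probability p_m."""
    rng = random.Random(seed)
    return ''.join(('1' if b == '0' else '0') if rng.random() < p_m else b
                   for b in chrom)
```

Note that uniform crossover never invents alleles: at every position the pair of offspring alleles is a rearrangement of the pair of parent alleles.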

Elitism.

When creating a new population by crossover and mutation, we have a high chance of losing the best subspace clusters found in the previous generation due to the randomness of evolution [50]. Hence, besides the Pareto-based selection method, we also use an elitism method to achieve a constantly improving population of subspace clusters. We adopt an archive fashion to implement elitism in MOSCL.
In our elitism strategy, the best or a few best subspaces are directly copied to the new population, together with the other new individuals generated from the mating pool. In addition, all the solutions obtained by MOSCL in different generations are stored in an external archive. In the current implementation of MOSCL, there is no limit on the size of this archive.
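A minimal sketch of this elitism strategy, assuming a simple generational replacement and an unbounded list as the external archive (both assumptions of ours):

```python
def elitist_replacement(population, offspring, fitness, archive, n_elite=2):
    """Carry the n_elite best individuals unchanged into the next
    generation, fill the rest with offspring, and log everything in
    the unbounded archive."""
    elite = sorted(population, key=fitness, reverse=True)[:n_elite]
    next_pop = elite + offspring[:len(population) - n_elite]
    archive.extend(next_pop)
    return next_pop
```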

Experimental Evaluation
The experiments reported in this section used a 2.0 GHz Core 2 Duo processor with 2 GB of memory. After data preparation, we conduct an experimental evaluation of MOSCL and compare its performance with competitive methods. We use both synthetic and real-life data sets in our experiments. There are four major parameters in MOSCL: the total number of generations to be performed (denoted by N_gen), the number of individuals in the population of each generation (denoted by N_pop), the probability that crossover between two selected individuals will be performed (denoted by p_c), and finally the probability that each bit of an individual will be mutated (denoted by p_m). The typical parameter setup in MOSCL is N_gen = 100, N_pop = 50, p_c = 0.8, and p_m = 0.1. One can change these parameter values to obtain different search workloads and search strengths for MOSCL.
4.1. Performance Measure. The performance measure used for MOSCL is the clustering error (CE) distance [24] for subspace clustering. It is an intuitive way to compare clusterings because it uses the proportion of points which are clustered differently in the generated subspace clusters and the real subspace clusters after an optimal matching of clusters. In other words, it is the scaled sum of the nondiagonal elements of the confusion matrix, minimized over all possible permutations of rows and columns. The CE distance has been shown to be the most desirable metric for measuring agreement between partitions in the context of projected/subspace clustering [24].
Let us have two subspace clusterings: GS = {GS_1, GS_2, ..., GS_p}, a collection of experimentally generated subspace clusters, and RS = {RS_1, RS_2, ..., RS_q}, a collection of real subspace clusters. In order to construct a clustering error distance measure for subspace clusters, we need to define the union and the intersection of the subspace clusters in GS and RS. We denote the union of the two subspace clusterings as

U = sup(GS) ∪ sup(RS),

where sup(GS) and sup(RS) are the supports of the subspace clusterings GS and RS. The support of GS is defined as sup(GS) = ∪_{i=1,...,p} GS_i, and analogously for RS.
We use the Hungarian method [51] to find a permutation of the cluster labels such that the sum of the diagonal elements of the confusion matrix M is maximized; D_max denotes this maximized sum. The clustering error (CE) distance for subspace clustering is then given by

CE(GS, RS) = (|U| − D_max) / |U|.

The value of the clustering error (CE) is always between 0 and 1. The more similar the two partitions RS and GS, the smaller the CE value.
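The CE distance can be sketched as below. For simplicity this illustration replaces the Hungarian method with brute-force matching over label permutations, which gives the same optimum for small numbers of clusters, and compares flat point-label vectors rather than full subspace supports.

```python
import numpy as np
from itertools import permutations

def ce_distance(gs_labels, rs_labels):
    """Clustering-error distance: 1 minus the maximal fraction of points
    that agree under the best one-to-one matching of cluster labels."""
    gs, rs = np.asarray(gs_labels), np.asarray(rs_labels)
    k = int(max(gs.max(), rs.max())) + 1
    M = np.zeros((k, k), dtype=int)          # confusion matrix
    for g, r in zip(gs, rs):
        M[g, r] += 1
    d_max = max(sum(M[i, p[i]] for i in range(k))
                for p in permutations(range(k)))
    return 1.0 - d_max / len(gs)
```

Identical partitions with permuted labels have CE = 0; the more points that disagree under the best matching, the closer CE gets to 1.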

4.2. Data Sets
4.2.1. Synthetic Data Sets. The synthetic data sets were generated by the data generator of [23]. It permits controlling the size and structure of the generated data sets through parameters such as the number and dimensionality of the subspace clusters, the dimensionality of the feature space, and density parameters for the whole data set as well as for each cluster. The parameters used in the synthetic data generation are the size of the data set n, the number of clusters, the data set dimensionality d, the average cluster dimensionality, the domain of the values of each dimension [min_j, max_j], the standard deviation value range [sd_min, sd_max], which governs the distribution of points in each cluster, and the outlier percentage. Using these parameters, the data generator proceeds by defining seed points among which the clusters will be distributed, as well as the dimensions associated with each seed point [23]. It then determines how many points will be associated with each cluster and finally generates the cluster points. Projected values of cluster points are generated according to a normal distribution in their relevant dimensions, with the mean randomly chosen from [min_j, max_j] and the standard deviation from [sd_min, sd_max]. The experiments used 10 data sets with n = 1000 to 50000, number of dimensions d = 20 to 250, min_j = −100, and max_j = 100. The values of sd_min and sd_max were set to 1 percent and 5 percent of the standard deviation on that dimension, respectively.
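A rough sketch of such a generator follows. This is our simplification, not the exact generator of [23]: each cluster gets a random set of relevant dimensions, points follow a normal distribution there and are uniform elsewhere, and a fixed fraction of points is left as outliers. All parameter names are illustrative:

```python
import numpy as np

def make_subspace_data(n=1000, d=20, k=5, avg_dim=5,
                       low=-100.0, high=100.0,
                       sd_frac=(0.01, 0.05), outlier_pct=0.10, rng=None):
    """Generate n points in d dimensions with k clusters hidden in
    random avg_dim-dimensional subspaces; label -1 marks outliers."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(low, high, size=(n, d))      # background / irrelevant dims
    labels = np.full(n, -1)
    n_out = int(n * outlier_pct)
    members = rng.permutation(n)[n_out:]         # non-outlier points
    span = high - low
    for c, pts in enumerate(np.array_split(members, k)):
        dims = rng.choice(d, size=avg_dim, replace=False)
        for j in dims:                           # normal in relevant dims only
            mean = rng.uniform(low, high)
            sd = rng.uniform(sd_frac[0], sd_frac[1]) * span
            X[pts, j] = rng.normal(mean, sd, size=pts.size)
        labels[pts] = c
    return X, labels
```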

4.2.2. Real Data Sets.
We also tested our proposed algorithm on the breast cancer and mushroom data sets and on the multiple features (MF) data. All three data sets are available from the UCI machine learning repository [http://archive.ics.uci.edu/ml/]. The breast cancer data set contains 569 instances with 30 features. The mushroom data set contains 8124 instances and 22 attributes. The multiple features (MF) data consists of features of handwritten numerals ("0"-"9") extracted from a collection of Dutch utility maps. Two hundred patterns per class (for a total of 2,000 patterns) have been digitized in binary images. These digits are represented in terms of the following six feature sets (files): (1) mfeat-fou: 76 Fourier coefficients of the character shapes; (2) mfeat-fac: 216 profile correlations; (3) mfeat-kar: 64 Karhunen-Loève coefficients; (4) mfeat-pix: 240 pixel averages in 2 × 3 windows; (5) mfeat-zer: 47 Zernike moments; (6) mfeat-mor: 6 morphological features.
For the experimental results we used five of the feature sets (mfeat-fou, mfeat-fac, mfeat-kar, mfeat-zer, and mfeat-mor). All values in each feature set were standardized to mean 0 and variance 1.
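The standardization step can be expressed, for example, as:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to mean 0 and variance 1, as done for
    the MF feature sets; constant columns are mapped to zero."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # guard against division by zero
    return (X - mu) / sd
```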

MOSCL Results Analysis
The scalability of MOSCL with increasing data set size and dimensionality is discussed in this section. In all of the following experiments, the results returned by MOSCL are better than those of PROCLUS.

Scalability with the Data Set Size
The scalability of MOSCL with the size of the data set is depicted in Figure 2. The results for scaling with the data size n verify that the algorithm is approximately linear in n. The experiments used data sets varying the number of data points from 1000 to 50000. The dimensionality of the data sets is set to 20. The data points were grouped in 5 clusters, with 10 percent of the data points generated as outliers. For this set of experiments, the distributions of points in the bounded dimensions of all clusters follow a uniform distribution. MOSCL successfully locates all input clusters within 40 generations on average. The execution time of MOSCL is comparable to that of PROCLUS [23]. The slight nonlinearity comes from the final stage of extracting the points from the clustered dimensions.

Scalability with Dimensionality of the Data Set
Figure 3 shows the scalability of MOSCL as the dimensionality of the data set increases from 50 to 250. In this series of experiments, each data set has 5000 data points, and 5 clusters are embedded in some 5-dimensional subspace. As the data dimensionality increases, the application of the genetic operators becomes increasingly expensive, mainly because of the growing number of constraint-checking computations required to yield individuals without overlapping candidate solutions. As in the scalability experiments with the size of the data set, the execution time of MOSCL is usually better than that of PROCLUS [23]. The results show a linear increase of the runtime when increasing the dimensionality of the data set.

Scalability with Average Cluster Dimensionality
Figure 4 shows the dependency of the execution time on the average cluster dimensionality, where the latter increases from 10 to 50 in a 100-dimensional data space. In each case, the data set has 50000 points distributed over 5 clusters with a fixed 10 percent level of outliers. The superlinear growth of the execution time with the dimensionality of the hidden clusters is explained as follows: the higher the dimensionality of the hidden clusters, the more likely it is for the evolutionary search operators to produce chromosomes along the bounded dimensions, and extra operations are then required to obtain a feasible chromosome. Additionally, as the dimensionality of the hidden clusters increases, more space becomes available for the mutation and crossover operators. Higher average cluster dimensionality therefore increases the computational overhead of the genetic operators.

Overall, the scalability experiments show that MOSCL scales well for large and high-dimensional data sets.

Results on Real Data Sets
The quality of the clustering results is evaluated in terms of accuracy. The accuracy achieved by the PROCLUS algorithm on the breast cancer data is 71.0 percent, which is less than the 94.5 percent achieved by the MOSCL clustering algorithm. The improved performance of MOSCL in comparison with PROCLUS suggests that some dimensions are likely not relevant to some clusters in this data. The accuracy achieved by MOSCL on the mushroom data set is 93.6 percent, which is greater than the accuracy (84.5 percent) achieved by PROCLUS on the same data. PROCLUS and MOSCL were also used to cluster the MF data; the accuracies achieved by these algorithms on this data are 94.5 percent and 83.5 percent, respectively. These results suggest that the moderate number of dimensions of this data does not have a major impact on algorithms that consider all dimensions in the clustering process. MOSCL was executed for a total of 50 generations. Overall, the experimental results show the superiority of the MOSCL algorithm in comparison to PROCLUS: MOSCL detects high-quality subspace clusters on all real data sets, whereas PROCLUS is unable to detect them. The accuracy results on the real data sets are presented in Table 3.

Performance Measure of Clustering Results.
We computed the CE distance for all clustering results. MOSCL is able to achieve highly accurate results, and its performance is generally consistent. As can be seen from Figure 5, MOSCL is more robust to cluster dimensionality than the PROCLUS algorithm. The experiments show that our algorithm is able to detect clusters and their relevant dimensions accurately in various situations. MOSCL successfully avoids the selection of irrelevant dimensions in all the data sets used in these experiments. The results of PROCLUS are less accurate than those of MOSCL; in particular, PROCLUS performs poorly when data sets contain projected clusters of low dimensionality.

Conclusion
In this paper we have presented MOSCL (multiobjective subspace clustering), a robust algorithm for the challenging problem of high-dimensional clustering, and illustrated its suitability in tests and comparisons with previous work. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set; after detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. The discussion details the key aspects of the proposed methodology, including the representation scheme, the fitness functions to be maximized, and the novel genetic operators. Our experiments on large and high-dimensional synthetic and real-world data sets show that MOSCL outperforms subspace clustering algorithms such as PROCLUS. Moreover, our algorithm yields accurate results when handling data with outliers. There are still many directions to be explored in the future. The behavior of MOSCL on generated data sets with low dimensionality suggests that our approach can be used to extract useful information from gene expression data, which usually have a high level of background noise. Another interesting direction is to extend the scope of the preprocessing phase of MOSCL beyond attribute relevance analysis.

Figure 1: An example of subspace clusters.

Figure 2: Scalability with the size of the data set.

Figure 3: Scalability with the dimensionality of the data set.
NN_k^+(x_i) = NN_k(x_i) ∪ {x_i} and |NN_k^+(x_i)| = k + 1. Here NN_k(x_i) denotes the set of k-nearest neighbors of x_i in attribute a_j, and c_i is the center of the set NN_k^+(x_i); the center is calculated as the mean of the points in NN_k^+(x_i).

Table 2 :
Binary weight matrix of data points after removing outliers.

sup(GS) = sup(C_1) ∪ ⋅⋅⋅ ∪ sup(C_m), where sup(C_s) is the support of a subspace cluster C_s = (DS, AS), given by sup(C_s) = {(x, a) | x ∈ DS ∧ a ∈ AS}; DS is a subset of the data points of D, and AS is a subset of the dimensions of A. Let M = (m_ij) denote the confusion matrix, in which m_ij represents the number of data points in the intersection of the subspace clusters GS_i and RS_j; we can write m_ij as m_ij = |sup(GS_i) ∩ sup(RS_j)|.

Table 3 :
Average accuracy results on real data sets.