Clustering high-dimensional data is a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient when the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm. Additionally, we investigate the effect of the dense-region detection phase on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL discovers clusters in all subspaces with high quality, and its efficiency outperforms that of PROCLUS.
The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. The task of clustering has been studied in statistics [
This statement formalizes that, with growing dimensionalities
An example of subspace clusters.
Densities also suffer from the curse of dimensionality. In [
These observations motivate our effort to propose a novel subspace clustering algorithm, multi-objective subspace clustering (MOSCL), that efficiently clusters high-dimensional numerical data sets. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. The discussion details key aspects of the proposed MOSCL algorithm, including the representation scheme, maximization fitness functions, and novel genetic operators. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to methods such as PROCLUS [
The remainder of this paper is structured as follows. In Section
Subspace clustering algorithms can be divided into two categories: partition-based subspace clustering algorithms and grid-based subspace algorithms. Partition-based algorithms partition the set of objects into mutually exclusive groups; each group, together with the subset of dimensions in which it shows the greatest similarity, is known as a subspace cluster. Similar to the
Grid-based subspace clustering algorithms treat the data matrix as a high-dimensional grid and the clustering process as a search for dense regions in the grid. ENCLUS (entropy-based clustering) [
A genetic algorithm, a particular class of evolutionary algorithm, is well suited to multi-objective optimization problems. In our work, we employ a multi-objective subspace clustering (MOSCL) algorithm that clusters data sets using a subspace approach. In this section, we discuss the key concepts of the preprocessing phase and the design of MOSCL.
The goal of the preprocessing step is to identify all dimensions in a data set that exhibit some cluster structure by discovering dense regions and their locations in each dimension [
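The per-dimension dense-region detection described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure; the bin count and density threshold are hypothetical parameters introduced here for concreteness.

```python
def dense_regions_1d(values, n_bins=10, density_threshold=2.0):
    """Return the (low, high) intervals of histogram bins whose point
    count exceeds density_threshold times the uniform expectation."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # guard against constant dims
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    expected = len(values) / n_bins            # expected count if uniform
    return [(lo + i * width, lo + (i + 1) * width)
            for i, c in enumerate(counts) if c >= density_threshold * expected]
```

A dimension with at least one dense interval would then be considered relevant, while points that fall only in sparse bins become outlier candidates.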
Let us first give some definitions. Let
A large value of
Binary weight matrix (rows: data points; columns: attributes; entries are 0/1 relevance indicators).
Obviously, the parameter
Binary weight matrix (rows: data points; columns: attributes; entries are 0/1 relevance indicators).

This makes the distance measure more effective because the computation of distance is restricted to subsets where the data point values are dense. Formally, the distance between a point
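Since the formula is cut off in the extracted text, the following is a sketch of one standard way to restrict the distance computation to relevant dimensions, using a binary weight vector of the kind shown in the weight matrices earlier. The function name and signature are ours, not the paper's.

```python
import math

def subspace_distance(x, y, w):
    """Euclidean distance over the dimensions marked relevant in the
    binary weight vector w (w[j] = 1 for dense dimensions, 0 otherwise)."""
    return math.sqrt(sum(wj * (xj - yj) ** 2 for xj, yj, wj in zip(x, y, w)))
```

With `w = [1, 0, 1]`, for instance, the middle dimension is ignored entirely, so two points that differ wildly there can still be close in the subspace.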
In this subsection, we discuss the important design issues of MOSCL, including individual representation, the fitness function, the selection operator, the search operators, and elitism. The basic steps of MOSCL are presented in Procedure
Begin
    Apply the preprocessing phase on the data set
    While (the termination criterion is not satisfied)
        for each offspring to be produced
            Randomly select chromosomes from the population
            Call subspace_update ( )
        end for
    end while
    Select the best solution from the population
End

Procedure subspace_update ( )
    repeat
        Compute the membership matrix
        for each data point
            for each cluster
                if dist(point, cluster center) is the minimum, assign the point to that cluster
                end if
            end for
        end for
        Compute the cluster centers
        Compute the cluster weights and update the membership matrix
    until convergence
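The subspace_update procedure above can be sketched in Python. The weighted distance and the mean-based center update are our interpretation of the description, not the paper's exact formulas, and the fixed iteration cap stands in for the convergence test.

```python
def subspace_update(points, centers, weights, n_iter=10):
    """Assign each point to the nearest center under a per-cluster
    weighted squared distance, then refresh each center as the mean
    of its members; repeat for n_iter rounds (the 'until convergence'
    loop of the pseudocode)."""
    d = len(points[0])
    members = [[] for _ in centers]
    for _ in range(n_iter):
        members = [[] for _ in centers]
        for p in points:
            # weighted squared distance to every cluster center
            dists = [sum(w[j] * (p[j] - c[j]) ** 2 for j in range(d))
                     for c, w in zip(centers, weights)]
            members[dists.index(min(dists))].append(p)
        for k, pts in enumerate(members):      # recompute cluster centers
            if pts:
                centers[k] = [sum(p[j] for p in pts) / len(pts)
                              for j in range(d)]
    return centers, members
```

In MOSCL proper, the weights come from the chromosome being evaluated, so this routine would be called once per fitness evaluation.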
For any genetic algorithm, a chromosome representation is needed to describe each individual in the population of interest [
In a single-objective optimization problem, the objective function and the fitness function are often identical and are usually used interchangeably [
Let
The selection operator governs the choice of individuals from the population that will be allowed to mate in order to create a new generation. Genetic algorithm methods attempt to develop fitness functions and elitism rules that find a set of optimal values quickly and reliably [
begin
    for each of the parents to be
        selected from the population, choose an individual according to its fitness
    return the selected parents
end
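The selection procedure above survives only as a fragment in the extracted text; one common instantiation consistent with its shape is tournament selection, sketched below. The tournament size `k` is an assumption, not a parameter from the paper.

```python
import random

def tournament_selection(population, fitness, n_parents, k=2):
    """Pick n_parents individuals; each is the fittest member of a
    random size-k tournament drawn from the population (maximization)."""
    selected = []
    for _ in range(n_parents):
        contenders = random.sample(range(len(population)), k)
        best = max(contenders, key=lambda i: fitness[i])
        selected.append(population[best])
    return selected
```

Larger `k` raises selection pressure toward the fittest individuals; `k = 2` keeps more diversity in the mating pool.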
Crossover is the feature that distinguishes genetic algorithms from other optimization techniques. Like other optimization techniques, genetic algorithms must calculate a fitness value, select individuals for reproduction, and use mutation to avoid convergence on local maxima; but only genetic algorithms use crossover to take some attributes from one parent and the remaining attributes from a second parent. We used the uniform crossover [
For example, chromosome
The random number
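Uniform crossover, as referenced above, can be sketched as follows: an independent coin flip per gene decides which parent contributes it to each child. The 0.5 mixing ratio is the standard default rather than a value taken from the paper.

```python
import random

def uniform_crossover(parent1, parent2, p=0.5):
    """For each gene position, child1 takes the gene from parent1 with
    probability p and from parent2 otherwise; child2 gets the opposite."""
    child1, child2 = [], []
    for g1, g2 in zip(parent1, parent2):
        if random.random() < p:
            child1.append(g1); child2.append(g2)
        else:
            child1.append(g2); child2.append(g1)
    return child1, child2
```

Unlike one-point or two-point crossover, every gene position can recombine independently, which suits subspace encodings where relevant dimensions need not be contiguous.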
Unlike the crossover and selection operators, mutation is not necessarily applied to all individuals [
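If chromosomes encode subspace membership as bit strings, as the binary weight matrices earlier suggest, mutation is typically a per-gene bit flip applied with a small probability. The mutation rate below is illustrative, not a value from the paper.

```python
import random

def bit_flip_mutation(chromosome, rate=0.01):
    """Flip each bit of a binary chromosome independently with
    probability `rate`; low rates preserve good solutions while
    still injecting diversity into the population."""
    return [1 - g if random.random() < rate else g for g in chromosome]
```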
When creating a new population by crossover and mutation, there is a high chance of losing the best subspace clusters found in the previous generation due to the random nature of evolution [
The experiments reported in this section were run on a 2.0 GHz Core 2 Duo processor with 2 GB of memory. After data preparation, we conduct an experimental evaluation of MOSCL and compare its performance with other competitive methods. We use both synthetic and real-life data sets for performance evaluation. There are four major parameters used in MOSCL: the total number of generations to be performed (denoted by
The performance measure used in MOSCL is the clustering error (CE) distance [
Let us have two subspace clusters:
We use the Hungarian method [
The value of the clustering error (CE) is always between 0 and 1. The more similar the two partitions RS and GS, the smaller the CE value.
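For a small number of clusters, the CE computation can be sketched by brute force: enumerate one-to-one cluster matchings and keep the one that maximizes agreement, which yields the same optimum the Hungarian method finds efficiently. This sketch handles flat point partitions only; the full subspace CE in the paper also accounts for the dimensions of each cluster.

```python
from itertools import permutations

def clustering_error(labels_rs, labels_gs):
    """CE distance between two partitions given as per-point labels:
    1 - (points agreeing under the best one-to-one cluster matching) / n.
    Assumes len(set(labels_rs)) <= len(set(labels_gs))."""
    rs = sorted(set(labels_rs))
    gs = sorted(set(labels_gs))
    n = len(labels_rs)
    # confusion[i][j]: number of points in RS cluster i and GS cluster j
    confusion = [[sum(1 for a, b in zip(labels_rs, labels_gs)
                      if a == ri and b == gj) for gj in gs] for ri in rs]
    best = 0
    for perm in permutations(range(len(gs)), len(rs)):
        best = max(best, sum(confusion[i][perm[i]] for i in range(len(rs))))
    return 1 - best / n
```

Identical partitions give CE = 0 even when cluster labels are permuted, which is exactly why the matching step is needed.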
The synthetic data sets were generated by data generator [
We also tested our proposed algorithm on the breast cancer, mushroom, and multiple features (MF) data sets. All three are available from the UCI Machine Learning Repository [
mfeat-fou: 76 Fourier coefficients of the character shapes;
mfeat-fac: 216 profile correlations;
mfeat-kar: 64 Karhunen-Loève coefficients;
mfeat-pix: 240 pixel averages in
mfeat-zer: 47 Zernike moments;
mfeat-mor: 6 morphological features.
For the experimental results we used five feature sets (mfeat-fou, mfeat-fac, mfeat-kar, mfeat-zer, and mfeat-mor). All values in each feature set were standardized to mean 0 and variance 1.
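The standardization step applied to the MF feature sets is a z-score per column, which can be sketched as:

```python
import statistics

def standardize(column):
    """Z-score a list of values: subtract the mean and divide by the
    population standard deviation, giving mean 0 and variance 1."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]
```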
The scalability of MOSCL with increasing data set size and dimensionality is discussed in this section. In all of the following experiments, the results returned by MOSCL are better than those of PROCLUS.
The scalability of MOSCL with the size of the data set is depicted in Figure
Scalability with the size of the data set.
Figure
Scalability with the dimensionality of the data set.
Figure
Scalability with average dimensionality of clusters.
The scalability experiments show that MOSCL also scales well for large and high-dimensional data sets.
The quality of the clustering result is evaluated in terms of accuracy. The accuracy achieved by the PROCLUS algorithm on the breast cancer data is 71.0 percent, which is less than the 94.5 percent achieved by the MOSCL clustering algorithm. The enhanced performance of MOSCL over PROCLUS suggests that some dimensions are likely not relevant to some clusters in this data. The accuracy achieved by MOSCL on the mushroom data set is 93.6 percent, which is greater than the accuracy (84.5 percent) achieved by PROCLUS on the same data. PROCLUS and MOSCL were also used to cluster the MF data; their accuracies on this data are 83.5 percent and 94.5 percent, respectively. Such results suggest that the moderate number of dimensions of this data does not have a major impact on algorithms that consider all dimensions in the clustering process. MOSCL was executed for a total of 50 generations. The experimental results show the superiority of the MOSCL algorithm over PROCLUS: MOSCL detects high-quality subspace clusters on all real data sets that PROCLUS is unable to detect. The accuracy results on the real data sets are presented in Table
Average accuracy results on real data sets.
Data set            PROCLUS   MOSCL
Cancer              71.5      94.5
Mushroom            84.5      93.6
Multiple features   83.5      94.5
We have computed the CE distance for all clustering results. MOSCL achieves highly accurate results, and its performance is generally consistent. As we can see from Figure
Distances between MOSCL output and the true clustering.
In this paper we have presented a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering and illustrated its suitability in tests and comparisons with previous work. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. The discussion details key aspects of the proposed methodology, including the representation scheme, maximization fitness functions, and novel genetic operators. Our experiments on large and high-dimensional synthetic and real-world data sets show that MOSCL significantly outperforms subspace clustering algorithms such as PROCLUS. Moreover, our algorithm yields accurate results when handling data with outliers. There are still many directions to be explored in the future. The behavior of MOSCL on generated data sets with low dimensionality suggests that our approach can be used to extract useful information from gene expression data, which usually have a high level of background noise. Another interesting direction is to extend the scope of the preprocessing phase of MOSCL beyond attribute relevance analysis.