Enhanced Connectivity Validity Measure Based on Outlier Detection for Multi-Objective Metaheuristic Data Clustering Algorithms

Data clustering algorithms have difficulty identifying data points that are noise or outliers. This paper therefore proposes an enhanced connectivity measure based on outlier detection for multi-objective data clustering problems. The proposed algorithm aims to improve solution quality by combining the local outlier factor (LOF) method with the connectivity validity measure. The modification changes the mechanism that selects neighbouring data points so that outliers are excluded. The performance of the proposed approach is assessed by applying the multi-objective algorithms to eight real-life and seven synthetic two-dimensional datasets. External validity is evaluated using the F-measure, while performance assessment metrics, namely the coverage and the overall non-dominated vector generation, are used to assess the quality of the Pareto-optimal sets. The experimental results show that the proposed outlier detection method enhances the performance of the multi-objective data clustering algorithms.


Introduction
Data clustering arranges collections of data points using similarity functions so that the resulting groups can be used to understand the data. A wide range of applications use data clustering algorithms to recognise the structures embedded in the data, and to obtain a precise collection of clusters that can be investigated further to identify the features of each cluster [1,2]. The quality of the clusters can be assessed using internal validity/similarity measures, such as connectedness, compactness, and isolation. These validity measures play an important part in the development of clustering algorithms that are built on distance measures, such as the k-means partitioning algorithm. In general, partitioning algorithms aim to identify spherically shaped clusters, but they are inefficient at recognising arbitrarily shaped clusters, such as the non-convex or interlaced clusters studied in several applications. Moreover, partitioning algorithms have difficulty recognising data points that are outliers or noise [3]. Unlike other validity measures, cluster connectivity is indifferent to the shape of the clusters [4]; it measures the degree to which the neighbours of a data point are located in the same cluster. However, the robustness of the connectivity measure depends on the associated L-nearest neighbours [5,6]. The neighbours used to quantify the connectivity measure can include outliers, and such non-reliable data points can strongly affect the accuracy of the connectedness [7]. Therefore, the mechanism for choosing neighbour data points can be adjusted to eliminate such outliers and so enhance the performance of the connectivity measure. Data clustering and outlier detection are complementary, in that a data point is recognised either as a cluster member or as an outlier.
Data clustering algorithms commonly incorporate a mechanism for managing outliers that removes these data points from the clusters. Applicability across different problem fields is one significant problem for outlier analysis [7][8][9][10]. Also, the effectiveness of an outlier analysis algorithm is quantified by its performance across different thresholds for the outlier score. Local distance methods have been applied in several outlier detection methods [7,11]. The primary assumption of these methods is that normal data points reside within dense neighbourhoods, whereas outliers lie far from their nearest neighbours. One of the most common local distance outlier detection algorithms is the local outlier factor (LOF) algorithm, which is used in several applications [7]. LOF was introduced by [12] and relates the local density of a point to that of the surrounding neighbourhood points [7,13]. Although the geometric intuition behind LOF applies to low-dimensional data, the LOF algorithm can be used with different dissimilarity functions [14]. The LOF algorithm has outperformed competitor algorithms in several disciplines, such as fault detection [15] and network intrusion detection [16]. LOF variants can be generalised and implemented in various applications, such as detecting outliers in big data [17], machine learning [18], and data streams [19].
Additionally, the LOF algorithm can be employed for different cluster shapes with different dissimilarity functions. Among the other local distance methods, the connectivity-based outlier factor (COF) deals with outliers that deviate from patterns other than spherical density shapes, such as lines; the influenced outlierness (INFLO) method handles clusters that lie close to each other; and the local outlier probability (LoOP) method yields scores for data points in one dataset that can be compared with those of other datasets. To address the concerns explained above, this paper tackles multi-objective data clustering problems using an outlier detection approach. The contribution of the paper is twofold.
(1) We introduce a modified connectivity validity measure based on the outlier detection approach (denoted Conn_LOF) for multi-objective data clustering problems. (2) We develop an algorithm that enhances the quality of the solutions generated by the multi-objective metaheuristic approach by combining the LOF with the connectivity validity measure.
This paper is organised as follows: The related works on multi-objective metaheuristic clustering are briefly reviewed in Section 2. Section 3 discusses the theoretical background and concepts such as the data clustering problem, outlier detection methods, and the LOF method. Section 4 describes the modified Conn_LOF approach. Section 5 presents the experimental design for the Conn_LOF approach, and Section 6 explains the experimental results. Finally, Section 7 presents the paper's conclusions and future work.

Related Works
Several multi-objective metaheuristic approaches have been introduced to solve data clustering problems [20][21][22][23][24][25][26]. The multi-objective data clustering approach was initially offered by [27], who proposed a multi-objective data clustering algorithm based on one or more cluster quality measures. Their algorithm used the Pareto envelope-based selection algorithm (PESA-II), a multi-objective algorithm, to optimise the deviation and connectivity cluster quality measures. Their research was extended in [28], where they investigated the performance of four different pairs of criteria (cluster quality measures) in multi-objective clustering. Reference [29] introduced a new dynamic multi-objective evolutionary algorithm (MOEA) for data clustering, which applies a variable-length chromosome scheme to search for the optimal number of clusters and the cluster centres. Reference [30] proposed a multi-objective optimisation algorithm for solving the categorical data clustering problem (MOGA). Reference [31] offered a multi-objective evolutionary ensemble algorithm for addressing texture image segmentation (MECEA). Reference [32] introduced an enhanced multi-objective evolutionary approach for data clustering (EMCOC), which aims to address the problem of overlapping, complex-shaped datasets. Reference [33] offered a multi-objective genetic fuzzy clustering (MOVGA) for the segmentation of multispectral magnetic resonance imaging (MRI). Reference [34] proposed a multi-objective clustering algorithm (MOCA) for data clustering.
Recently, [35] proposed a multi-objective algorithm based on the artificial bee colony optimisation algorithm and the non-dominated sorting (NSABC) to solve the data clustering problems. Reference [21] offered a particle swarm optimisation using the multi-objective approach (MOPSO) to increase the diversity of the solutions. Later, [36] presented an improved binary gravitational search algorithm using the multi-objective approach for feature selection (IMBGSAFS).
The Pareto-based approach is used in the algorithm to obtain better solution diversity by optimising the silhouette index and feature cardinality validity measures. Reference [37] introduced a multi-objective clustering algorithm based on a reduced-length representation. Reference [23] proposed a kernel-based, attribute-weighted multi-objective optimisation data clustering algorithm, which uses the compactness and separation cluster quality measures to find an optimal clustering solution. Table 1 shows that most of the offered multi-objective clustering approaches were based on the NSGA-II multi-objective algorithm, which was widely used to achieve high-quality solutions. Several multi-objective clustering algorithms optimise more than one validity measure simultaneously, for example by minimising two validity measures such as cluster connectivity (Conn) and overall cluster deviation (Dev).
According to the related studies of data clustering algorithms based on multi-objective metaheuristics, further enhancements are required to cope with the rapid growth of data complexity while preserving the accuracy of the clustering algorithm [7]. Although the majority of clustering algorithms attempt to detect outliers during the clustering analysis stage [7], few algorithms offer validity measures that can handle the detection of these outliers [38]. The connectivity measure of the cluster, which is commonly used in multi-objective clustering algorithms, measures the level of connectedness of the neighbour data objects that are located in the same cluster [6,35]; it may therefore measure connectedness based on non-reliable data objects that are outliers [7]. Therefore, the mechanism for selecting neighbour data objects can be modified to exclude such outliers and consequently improve the performance of the connectivity measure.

Background
This section introduces the concepts of the data clustering problems, the outlier detection methods, and the LOF method.

Data Clustering Problems.
Data clustering is an essential task of data mining that intends to group N data objects $X = \{x_1, x_2, \ldots, x_N\}$ into a set of clusters $C = \{C_1, C_2, \ldots, C_K\}$, where all data objects in the same cluster are similar according to a specified similarity measure. The clustering methods must ensure the following hard constraints [39]: (i) Each cluster should not be empty and must hold at least one data object: $C_k \neq \emptyset$ for $k = 1, \ldots, K$. (ii) Different clusters should not share data objects: $C_k \cap C_l = \emptyset$ for $k \neq l$. (iii) Every data object should be included in a cluster: $\bigcup_{k=1}^{K} C_k = X$. The mathematical representation of a multi-objective data clustering problem with M objectives is given in equation (4) [40]:

$$\text{optimise } F(X, C) = [f_1(X, C), f_2(X, C), \ldots, f_M(X, C)] \quad (4)$$

subject to the constraints $g_i(X, C)$, $i = 1, \ldots, p$, and $h_j(X, C)$, $j = 1, \ldots, q$, where $f_m(X, C)$ is the $m$-th objective function that measures the quality of the partitions produced by the clustering algorithm; each objective can be minimised or maximised depending on the similarity/dissimilarity measure employed. $g_i(X, C)$ denotes the p inequality constraints, and $h_j(X, C)$ denotes the q equality constraints.
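The three hard constraints can be checked mechanically. The following is a minimal sketch, assuming clusters are given as lists of object indices in the range 0 to N-1; the function name `is_valid_partition` is hypothetical and not taken from the paper.

```python
def is_valid_partition(n_objects, clusters):
    """Check the three hard constraints for clusters given as lists of object indices."""
    # (i) No cluster may be empty.
    if any(len(cluster) == 0 for cluster in clusters):
        return False
    members = [obj for cluster in clusters for obj in cluster]
    # (ii) Clusters must not share objects: no index may appear twice.
    if len(members) != len(set(members)):
        return False
    # (iii) Every object must belong to some cluster.
    return set(members) == set(range(n_objects))


print(is_valid_partition(5, [[0, 1], [2, 3, 4]]))     # valid partition
print(is_valid_partition(5, [[0, 1], [1, 2, 3, 4]]))  # object 1 is shared
```

A check of this kind is mainly useful when solutions are decoded from another representation, since an encoding can in principle produce empty clusters.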

Connectivity of the Cluster.
Connectivity of the cluster [27,35] is an objective function, to be minimised, that measures the extent to which neighbouring data points are placed in the same cluster. The mathematical formulation of the cluster connectivity is given in equations (5) and (6), where N is the number of data points and the parameter M is the number of neighbour data points considered when measuring the connectivity.
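As an illustration, the sketch below computes a connectivity value under an assumed formulation: the widely used variant in which the j-th nearest neighbour of a point contributes a penalty of 1/j when it lies in a different cluster, and 0 otherwise. The 1/j weighting, the Euclidean distance, and the function names are assumptions made for this example and may differ from the exact form of equations (5) and (6).

```python
import math


def nearest_neighbours(points, n_neighbours):
    """Return, for each point, the indices of its nearest points (closest first)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(others[:n_neighbours])
    return result


def connectivity(labels, neighbours):
    """Sum a penalty of 1/rank for every neighbour placed in a different cluster."""
    total = 0.0
    for i, neighbour_list in enumerate(neighbours):
        for rank, j in enumerate(neighbour_list, start=1):
            if labels[i] != labels[j]:
                total += 1.0 / rank
    return total


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
neighbours = nearest_neighbours(points, 1)
print(connectivity([0, 0, 1, 1], neighbours))  # neighbours share clusters
print(connectivity([0, 1, 0, 1], neighbours))  # every nearest neighbour is split off
```

Lower values indicate that neighbouring points tend to share a cluster, which is why the measure is minimised.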

Outlier Detection Methods.
The outlier detection methods are applied to overcome the influence of outliers when creating descriptive or predictive models, and are also adopted in the pre-processing stage of several data mining applications. The common outlier detection techniques are classified into distance-based, density-based, distribution-based, clustering-based, and probabilistic-based methods. In addition, outlier detection approaches are divided into local and global methods. Global methods give each data point an anomaly score that depends on all points in the dataset, whereas local distance methods assign each data point an anomaly score that depends on its surrounding neighbourhood. Many variants of the local distance methods have been introduced to produce a simple anomaly score representation and to identify outliers that are hidden from the global methods. The variants of the local distance methods include the following: (1) Local Outlier Factor (LOF) [12]: It is recognised as the most broadly adopted local method, and it relates the local density of a data object to the average distance of its k-nearest-neighbour objects. The anomaly score of the LOF algorithm is defined as the ratio of the average local density of the neighbourhood points to the local density of the data point.
(3) Influenced Outlierness (INFLO) [42]: It was introduced to produce more reliable results when clusters of different densities lie near each other. (4) Local Outlier Probability (LoOP) [43]: It uses statistical methods to define the anomaly score as a probability. These probabilities allow data points in one dataset to be analysed against those of other datasets.
The local distance methods have been used in several outlier detection methods [7,44,45,46]. The primary assumption of these methods is that normal data points lie inside dense neighbourhoods, whereas outliers lie far from their nearest neighbours. The nearest neighbour methods need a distance metric to measure the distance separating two data points [7]. One of the popular local distance outlier detection algorithms is the LOF algorithm, which is applied in several applications [7].

Local Outlier Factor (LOF).
LOF is one of the commonly used local outlier detection algorithms. It was introduced by [12] and relates the local density of a point to that of the surrounding neighbourhood points [7,13]. The outlier factor is local in that it considers only the neighbourhood of each point. The local reachability density of a point p is the inverse of the average reachability distance based on the minPts-nearest neighbours of p.
Thus, minPts is a primary parameter of the LOF algorithm; it indicates the number of nearest neighbours used to determine the local neighbourhood of each point. The local reachability density (lrd) is defined by equation (7), and the reachability distance is defined by equation (8) [12]:

$$lrd_{minPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{minPts}(p)} reach\_dist_{minPts}(p, o)}{|N_{minPts}(p)|} \right) \quad (7)$$

$$reach\_dist_{minPts}(p, o) = \max\{minPts\_distance(o),\ dist(p, o)\} \quad (8)$$

where minPts denotes a positive integer, D denotes the dataset points, and $\{o, p\} \in D$. $dist(p, o)$ is the distance between point p and point o, and $minPts\_distance(o)$ is the distance from o to its minPts-th nearest neighbour. Given the minPts_distance of p, the minPts_distance neighbourhood of p, $N_{minPts}(p)$, contains every point whose distance from p is not greater than the minPts_distance. The outlier factor of point p represents the degree to which p is considered an outlier, and is defined in equation (9):

$$LOF_{minPts}(p) = \frac{\sum_{o \in N_{minPts}(p)} \frac{lrd_{minPts}(o)}{lrd_{minPts}(p)}}{|N_{minPts}(p)|} \quad (9)$$

The use of ratios ensures that the local distance behaviour is properly assessed. Therefore, $LOF_{minPts}$ for points in dense regions is close to 1 (LOF ≃ 1). Otherwise, $LOF_{minPts}$ for outlier points will be much higher (LOF ≫ 1), because it is measured relative to the average reachability distances of the neighbours. Essentially, the maximum value of $LOF_{minPts}$ over a range of minPts values is used as the outlier score to determine the optimal neighbourhood size.
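The LOF computation described above can be sketched in a few lines. This is a minimal, unoptimised illustration following the definitions in [12], assuming Euclidean distance, a dataset with more than minPts points, and no duplicate points (duplicates would make the average reachability distance zero). The function name `lof_scores` is hypothetical.

```python
import math


def lof_scores(points, min_pts):
    """Return the LOF score of every point (quadratic time, for illustration only)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []   # distance from each point to its min_pts-th nearest neighbour
    neigh = []    # indices of all points within that distance (ties included)
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[min_pts - 1]]
        k_dist.append(kd)
        neigh.append([j for j in order if dist[i][j] <= kd])

    def reach_dist(p, o):
        # Reachability distance of p with respect to o.
        return max(k_dist[o], dist[p][o])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for p in range(n):
        mean_reach = sum(reach_dist(p, o) for o in neigh[p]) / len(neigh[p])
        lrd.append(1.0 / mean_reach)

    # LOF: average ratio of the neighbours' lrd to the point's own lrd.
    return [
        sum(lrd[o] for o in neigh[p]) / (len(neigh[p]) * lrd[p])
        for p in range(n)
    ]


# Four corners of a unit square plus one distant point.
scores = lof_scores([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], min_pts=2)
print(scores)
```

In this toy example the four corner points receive scores close to 1, while the distant point receives a much larger score, matching the interpretation of LOF ≃ 1 and LOF ≫ 1 given above.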

The Proposed Outlier Detection Approach.
The proposed outlier detection approach for the connectivity measure (named Conn_LOF) is discussed in this section. The flowchart of the approach is shown in Figure 1, which includes the following stages: (v) Stage 5. The computation of the connectivity measure: computes the connectivity validity measure using equation (5), excluding the neighbourhood points that are labelled as outliers. (vi) Stage 6. The execution of the multi-objective clustering algorithm: executes the multi-objective clustering algorithm, such as the non-dominated sorting genetic algorithm (NSGA-II) [47] or the strength Pareto evolutionary algorithm (SPEA-II) [48].
The algorithmic steps of the proposed method are shown in Algorithm 1, where λ denotes the threshold value used in the LOF algorithm, which is set to 1; the LOF value of each neighbourhood point is computed and then compared with the threshold λ. The C_label matrix stores the labels of the neighbourhood points.
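The exclusion step can be illustrated with a short sketch. It assumes that the nearest-neighbour lists and the LOF scores have already been computed, that a neighbour is labelled an outlier when its LOF score is at least the threshold λ (as in Algorithm 1), and that remaining neighbours keep their original rank with a 1/rank penalty. The rank handling, the 1/rank weighting, and the name `conn_lof` are assumptions of this example rather than details confirmed by the paper.

```python
def conn_lof(labels, neighbours, lof, threshold=1.0):
    """Connectivity that skips neighbours whose LOF score marks them as outliers."""
    # Label every point whose LOF score reaches the threshold as an outlier.
    is_outlier = [score >= threshold for score in lof]
    total = 0.0
    for i, neighbour_list in enumerate(neighbours):
        for rank, j in enumerate(neighbour_list, start=1):
            if is_outlier[j]:
                continue  # outlier-labelled neighbours do not contribute
            if labels[i] != labels[j]:
                total += 1.0 / rank
    return total


labels = [0, 0, 1]
neighbours = [[1, 2], [0, 2], [1, 0]]   # hypothetical neighbour lists
lof = [0.9, 0.95, 3.0]                  # hypothetical LOF scores; point 2 is an outlier

print(conn_lof(labels, neighbours, lof))                          # outlier excluded
print(conn_lof(labels, neighbours, lof, threshold=float("inf")))  # nothing excluded
```

Setting the threshold to infinity disables the exclusion, which makes it easy to compare the modified measure against the standard connectivity on the same labelling.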

Experimental Design
The performance of the proposed Conn_LOF outlier detection method is examined using eight real-life datasets of varying complexity, obtained from the UCI repository of machine learning databases [49], and seven synthetic two-dimensional datasets [5], as shown in Table 2.
Since most of the state-of-the-art multi-objective clustering algorithms are based on NSGA-II (as shown in Table 1), the NSGA-II and SPEA-II algorithms are used to demonstrate the contribution of this paper. Other multi-objective algorithms are not used, since the proposed Conn_LOF method is performed before running the multi-objective clustering algorithm (as shown in Figure 1) and does not affect the algorithmic steps of any given algorithm.
To evaluate the performance and the effectiveness of the proposed Conn_LOF method, the NSGA-II algorithm [47] is modified to use two conflicting objectives, the intra-cluster distance [50] and the proposed Conn_LOF measure (named eNSGA-II), and is compared with the NSGA-II algorithm using the intra-cluster distance [50] and the standard connectivity of the cluster [27] as its pair of conflicting objectives. Similarly, SPEA-II [48] is modified to use the intra-cluster distance and the Conn_LOF measure (named eSPEA-II) and is then compared with the standard SPEA-II using the intra-cluster distance [50] and the standard connectivity of the cluster [27]. The data clustering solutions are represented using a label-based representation consisting of a one-dimensional array, where a solution is denoted as a set of N data objects. Figure 2 shows an example of a solution representation with eight data objects and three clusters. The solutions are randomly generated: each data object is randomly assigned to a cluster. The algorithms' external validity is evaluated using the F-measure [51]. The running time of the algorithms is not investigated, since the Conn_LOF method runs before the execution of the multi-objective clustering algorithm (as shown in Figure 1) and does not affect the running time of the competing algorithms. The inference time for a particular dataset is the same and depends on the number of attributes and instances.
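The label-based representation can be sketched as follows: position i of a one-dimensional array holds the cluster label of data object i. The helper names and the example labels are hypothetical and are not taken from Figure 2.

```python
import random


def random_solution(n_objects, n_clusters, rng):
    """Assign each data object to a randomly chosen cluster label."""
    return [rng.randrange(n_clusters) for _ in range(n_objects)]


def decode(solution, n_clusters):
    """Turn a label array into a list of clusters, each a list of object indices."""
    clusters = [[] for _ in range(n_clusters)]
    for obj, label in enumerate(solution):
        clusters[label].append(obj)
    return clusters


# A hypothetical solution with eight data objects and three clusters.
solution = [0, 0, 1, 2, 1, 0, 2, 2]
print(decode(solution, 3))

# A randomly generated solution, seeded for repeatability.
print(random_solution(8, 3, random.Random(0)))
```

Note that purely random assignment can leave a cluster empty, so an implementation would typically need to repair or reject such solutions to respect the hard constraints.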
Also, performance assessment indices (PI) are used to assess the quality of the Pareto-optimal sets and to compare the performance of different multi-objective algorithms. To assess the multi-objective metaheuristic clustering algorithms, we follow the performance indices used in recent data clustering research [36,52], namely the Overall Non-dominated Vector Generation (ONVG) [53] and the coverage [54]. The details of these indices are below:
1. Coverage of Two Sets (C) [54]: Coverage is used to compare two solution sets based on domination. Assuming that $S_1$ and $S_2$ are two Pareto-fronts/sets, $C(S_1, S_2)$ indicates the portion of set $S_2$ that is dominated by the solutions in set $S_1$. The mathematical formulation of the coverage is shown in equation (10):

$$C(S_1, S_2) = \frac{|\{s_2 \in S_2 : \exists\, s_1 \in S_1,\ s_1 \prec s_2\}|}{|S_2|} \quad (10)$$

where $s_1 \prec s_2$ means that $s_1$ dominates $s_2$. Values of C lie within the range [0, 1], and higher values denote better dominance.
2. Overall Non-dominated Vector Generation (ONVG) [53]: ONVG represents the number of solutions in the Pareto-front set S. The mathematical formulation of the ONVG is shown in equation (11):

$$ONVG(S) = |S| \quad (11)$$
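Both indices are straightforward to compute. The sketch below assumes that every objective is minimised and that each solution is represented by a tuple of objective values; the function names and the example fronts are hypothetical.

```python
def dominates(a, b):
    """True if a dominates b when every objective is minimised."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def coverage(s1, s2):
    """Fraction of solutions in s2 dominated by at least one solution in s1."""
    covered = sum(1 for b in s2 if any(dominates(a, b) for a in s1))
    return covered / len(s2)


def onvg(front):
    """Number of solutions in a Pareto front."""
    return len(front)


front_a = [(1, 4), (2, 2), (4, 1)]
front_b = [(2, 5), (3, 3), (3, 0.5)]
print(coverage(front_a, front_b), coverage(front_b, front_a))
print(onvg(front_a))
```

Coverage is not symmetric, so both C(S1, S2) and C(S2, S1) need to be reported when two algorithms are compared.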
To evaluate the performance of the multi-objective methods using the PI indices, a Pareto-front pool is generated from all the Pareto-fronts of the competing multi-objective algorithms: the non-dominated solutions found in the N runs of every algorithm are joined. Some PIs, such as the coverage measure, require a Pareto-front pool. Each competing algorithm was run independently 31 times on each of the 15 datasets, and the average value and the standard deviation of the F-measure were then computed. The population size is set to 20 and the maximum number of iterations is set to 1000. The number of nearest data points L is set to 21. Lastly, the algorithms were implemented in Java 1.8 and run on a personal computer with an Intel Core i7 CPU (2.6 GHz) and 4 GB of memory.

ALGORITHM 1: Pseudo-code of the proposed LOF-based algorithm.
    //Inputs:
    C        //the nearest neighbours matrix that is generated from stage (1)
    L        //number of nearest neighbours (minPts in the LOF algorithm)
    λ        //the threshold used in the LOF algorithm
    C_label  //the labels matrix generated by LOF
    for each C_j in C do
        Compute the L-distance neighbourhood points of C_j;
        Compute the reachability distance for neighbourhood points of C_j;
        Compute the LOF of neighbourhood points of C_j;
        //stage (4)
        for each neighbourhood point P_i of C_j do
            If LOF of P_i ≥ λ then
                Label P_i as outlier and store it in C_label;
            Endif
        End for
    End for
    //stage (5)
    Compute connectivity of C by excluding outliers in C_label;
    //stage (6)
    Execute the multi-objective clustering algorithm;

Experimental Results and Discussion
Table 3 shows the results of the coverage (C), where A, B, C, and D denote eNSGA-II, NSGA-II, eSPEA-II, and SPEA-II, respectively. The C(A, B) values compared with the C(B, A) values show better coverage for the datasets 2d-20c-no0, CMC, Ecoli, engytime, Flame, Seeds, Sizes5, Sonar, Soybean-small, and Thyroid, which means that the solutions in the NSGA-II pool are dominated by at least one solution in the eNSGA-II pool. On the other hand, the C(A, B) values compared to C(B, A) mostly obtained better coverage. Generally, this shows that the solutions in the pools of the modified algorithms with the Conn_LOF method dominated the standard algorithms' solutions in a considerably high ratio. In conclusion, the modified algorithms with the Conn_LOF method attained better performance than the standard algorithms based on the coverage PI. Table 4 reveals the results of the obtained F-measure on the Pareto-fronts produced by the competing algorithms.
The eNSGA-II algorithm achieves higher F-measure results than the NSGA-II algorithm for most of the datasets, except the 2d-20c-no0, CMC, Sizes5, and Soybean-small datasets. The eSPEA-II provides higher F-measure results than SPEA-II for most of the datasets, excluding the CMC, Iris, and Soybean-small datasets. The results indicate that the average F-measure of eNSGA-II and eSPEA-II is enhanced by adopting the Conn_LOF method compared with the corresponding standard NSGA-II and SPEA-II.
Additionally, the impact of adopting Conn_LOF is seen in the ONVG metric, as shown in Table 5, in which the eNSGA-II algorithm achieves higher ONVG results than the NSGA-II algorithm for most of the datasets except 2d-20c-no0, Ecoli, and Seeds. The eSPEA-II provides higher ONVG results than SPEA-II for most of the datasets except 2d-20c-no0, Sizes5, and Soybean-small. The table also shows a weak performance of the other competing algorithms with respect to the ONVG metric. Hence, the modified eNSGA-II and eSPEA-II achieve better ONVG performance.
The F-measure results shown in Table 4 are additionally analysed using Friedman's test ranking. As presented in Table 6, Friedman's test shows that eNSGA-II achieved the best F-measure rank. NSGA-II achieved the second rank, and the eSPEA-II algorithm achieved the third rank. Finally, SPEA-II obtained the worst rank.
In general, eNSGA-II and eSPEA-II are shown to be a reliable choice for multi-objective data clustering: by adopting the Conn_LOF outlier detection method, they provide Pareto-front solutions with efficient clustering measures for datasets with varying characteristics and complexity.

Conclusions and Future Work
In this paper, an enhanced connectivity measure based on the LOF outlier detection method (Conn_LOF) is offered to enhance the performance of the connectivity measure by eliminating the outliers. To examine the efficiency of the proposed Conn_LOF method, it is employed within the competing algorithms and tested on eight real-life datasets of varying complexity obtained from the UCI repository of machine learning databases. The efficiency of the competing algorithms is also tested on seven synthetic two-dimensional datasets with different cluster shapes and characteristics. The experimental results show that the performance of the modified eNSGA-II and eSPEA-II is enhanced by adopting the Conn_LOF method with respect to the average and the standard deviation of the F-measure. In addition, multi-objective performance assessment metrics, namely the coverage and the overall non-dominated vector generation, are used to evaluate the quality of the Pareto-optimal sets. Furthermore, the Conn_LOF outlier detection method is shown to be effective when combined with the clustering algorithms, providing better Pareto-front solutions with efficient clustering measures for datasets with varying characteristics and complexity.

Conflicts of Interest
The authors declare no conflicts of interest regarding this paper.