Morphological Accuracy Data Clustering: A Novel Algorithm for Enhanced Cluster Analysis



Introduction
Clustering algorithms play a crucial role in data analysis and machine learning by enabling the identification and grouping of similar data points. With the exponential growth of data in various domains, the need for effective data exploration and organization has become paramount. Clustering algorithms offer a solution to this challenge by partitioning data into meaningful clusters, allowing for the discovery of hidden patterns and relationships. There are various types of clustering algorithms, each with its own approach and characteristics:
(1) Centroid-based clustering: the data are grouped into clusters based on the similarity of the data points to the central position, or centroid, of each cluster.
(2) Hierarchical clustering: this algorithm builds a hierarchy of clusters, either by starting with each data point as a separate cluster and merging them iteratively, or by initially considering all data points as a single cluster and recursively splitting them. The number of clusters does not need to be specified in advance.
(3) Density-based spatial clustering: this algorithm groups data points based on their proximity and density. A data point is considered a core point if it has a sufficient number of other data points within a specified radius. These core points, along with their neighboring points, form a cluster; data points that do not meet the density requirements are treated as noise or outliers.
(4) Distribution-based clustering: this algorithm assumes the data points are generated from a specific probability distribution; the goal is to estimate the parameters of that distribution in order to identify the underlying clusters in the data.
Centroid-based clustering algorithms have emerged as powerful tools for organizing and analyzing large volumes of data in information systems. These algorithms use centroids, which represent the center or average of a group of data points, to form clusters based on similarity or dissimilarity measures. Their ability to efficiently handle high-dimensional data, together with their interpretability, makes them particularly suitable for applications in information systems. The objective of this paper is to provide a centroid-based clustering algorithm for information systems; the proposed algorithm uses a morphological accuracy measure to define the centroid of each cluster. There is a large body of research on clustering algorithms in the literature; see, for instance, the work done in [1-11].

The rest of this paper is organized as follows:
(1) Mathematical morphology: we review the fundamental concepts of mathematical morphology, a theory and technique for the analysis and processing of geometrical structures, including the basic definitions of dilation, erosion, opening, and closing [12-14].
(2) Centroid-based clustering algorithms: we survey centroid-based clustering algorithms, which are widely used for partitioning data into groups or clusters, reviewing significant algorithms such as the k-means and k-medoids algorithms.
(3) Morphological clustering algorithm: we introduce a novel clustering algorithm based on the concepts of mathematical morphology, discuss how morphological operations can be incorporated into clustering processes to enhance their effectiveness and interpretability, and introduce the morphological accuracy measure, inspired by the principles of mathematical morphology.
(4) Clustering performance evaluation: we assess the effectiveness of the clustering algorithms using the Calinski-Harabasz index, which provides a quantitative measure of the compactness and separation of clusters; we explain the calculation of the index and its interpretation, discussing how it can be used to compare and evaluate different clustering solutions.
(5) Conclusion: we summarize the key findings and contributions of the research, highlight the importance of mathematical morphology for understanding spatial structures and enhancing clustering algorithms, and conclude by discussing future research directions and potential applications of morphological concepts in clustering algorithms and information systems.

Mathematical Morphology
In the late 1960s, a distinct branch of image analysis known as "mathematical morphology" emerged. This branch focuses on the mathematical theory of describing shapes using set concepts in order to extract meaningful information from images. Mathematical morphology (MM) is a subset of nonlinear image processing and analysis that was initially developed by Serra [13, 14]; its primary focus lies in the study of the geometric structure within an image [12]. To provide a comprehensive understanding of both classical and neighborhood morphological operators, we structure this section as follows:
(1) Classical morphological operators: an overview of the concepts and definitions of the classical morphological operators [12, 14], including dilation, erosion, opening, and closing, which form the foundation of mathematical morphology.
(2) Neighborhood morphological operators: the concept and definition of neighborhood morphological operators, which extend the classical operators by considering the spatial relationships between elements within a neighborhood, capturing more detailed information about the local structure and enabling more precise analysis and processing [15].

Classical Mathematical Morphology.
In classical mathematical morphology, the two fundamental operators are dilation and erosion. In this section, we review their definitions as presented by Heijmans [12], which are consistent with the original definitions of Minkowski addition and subtraction [14].

The Dilation Operator.
The dilation operator is used to expand a set based on the shapes contained within the object under consideration [14]. It involves a structuring element, a small shape or pattern that defines the neighborhood around each element. The dilation operation places the structuring element at each point of the object and expands the object to cover all neighboring points that intersect with the structuring element. The result is a new set that encompasses the original object while incorporating the shapes and structures present in the structuring element.
Definition 1. For a universe set U and two subsets A, SE ⊆ U, the dilation of the set A with respect to the structuring element SE is defined as

A ⊕ SE = {x ∈ U : SE_x ∩ A ≠ ∅},

where SE_x denotes the translate of the structuring element SE to the point x.

The Erosion Operator.
The erosion operator is closely related to Minkowski subtraction [13]. It is used to reduce a set based on the shapes included in the object under study. Like the dilation operator, the erosion operator utilizes a structuring element. The erosion operation places the structuring element at each point of the object and checks whether all the points within the structuring element are also present in the object; if they are, the point is retained, and otherwise it is removed from the set. This process results in a new set representing the original object shrunk to the points at which the structuring element fits entirely inside it.

Definition 2. For a universe set U and two subsets A, SE ⊆ U, the erosion of the set A with respect to the structuring element SE is defined as

A ⊖ SE = {x ∈ U : SE_x ⊆ A},

where SE_x denotes the translate of the structuring element SE to the point x.

Neighborhood Mathematical Morphology.
Using the concept of the topological neighborhood, the neighborhood-dilation and neighborhood-erosion operations assign a structuring element lying in the neighborhood of each point, instead of a single structuring element for the whole universe [15].
Definition 3. For a universe set U, let SE_x ⊆ U be the neighborhood-structuring element of the point x, and let A ⊆ U. The neighborhood-erosion and neighborhood-dilation operators, nε(A) and nδ(A), are defined as follows:
(i) The neighborhood-erosion is given by nε(A) = {x ∈ U : SE_x ⊆ A}.
(ii) The neighborhood-dilation is given by nδ(A) = {x ∈ U : SE_x ∩ A ≠ ∅}.

Definition 4. For any set A ⊆ U, the morphological accuracy of A is defined to be the ratio between the size of the neighborhood-eroded set of A and the size of the neighborhood-dilated set of A, namely, acc(A) = |nε(A)| / |nδ(A)|.
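The neighborhood operators can be made concrete on a finite universe. The following Python sketch assumes the structuring elements are supplied as a dict mapping each point to its neighborhood set; the toy universe and neighborhoods below are illustrative, not taken from the paper's dataset.

```python
def n_erosion(A, se):
    """Points whose whole neighborhood-structuring element lies inside A."""
    return {x for x in se if se[x] <= A}

def n_dilation(A, se):
    """Points whose neighborhood-structuring element meets A."""
    return {x for x in se if se[x] & A}

def morphological_accuracy(A, se):
    """|n_erosion(A)| / |n_dilation(A)|, taken as 0 when the dilation is empty."""
    d = n_dilation(A, se)
    return len(n_erosion(A, se)) / len(d) if d else 0.0

# Toy universe {1..5}; each point's assumed structuring element is itself
# and its immediate neighbors.
U = {1, 2, 3, 4, 5}
se = {x: {y for y in U if abs(x - y) <= 1} for x in U}
A = {2, 3, 4}
print(n_erosion(A, se))               # {3}: only se[3] = {2,3,4} fits inside A
print(n_dilation(A, se))              # all of U: every neighborhood meets A
print(morphological_accuracy(A, se))  # 1/5 = 0.2
```

An accuracy close to 1 indicates a set that barely changes under erosion and dilation, i.e., a stable region; this is the property the MAC algorithm later exploits when choosing centroids.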

Centroid-Based Clustering Algorithms
The purpose of clustering is to assign similar objects to the same cluster and dissimilar objects to different clusters. In many applications, clusters may not have clear and sharp boundaries, and the distances between two clusters may be almost the same [16, 17].

In general, clustering techniques work as follows: a center point is chosen for each cluster, and as the distance of a data point from a center increases, the probability of that point belonging to the corresponding cluster decreases. Eventually, for a universe set U, the clusters should hold the usual partition properties: each cluster is nonempty, the clusters are pairwise disjoint, and their union is U.

The centroid-based clustering algorithms are among the best-known and most widely used clustering tools; they are fast and efficient, but sensitive to the choice of initial parameters. Two of the most widely used centroid-based techniques are the k-means algorithm [17] and the k-medoids algorithm [19]. The primary objective of this paper is to develop a centroid clustering algorithm that emphasizes simplicity and effectiveness, with specific focus on the k-means and k-medoids algorithms. In Section 4, we explore alternative approaches for selecting initial centroids and assess their performance through simulations involving student datasets. The proposed centroid clustering algorithm aims to provide a straightforward and efficient solution for partitioning data into clusters. By examining different methods for selecting initial centroids, we can evaluate their impact on the clustering results and overall algorithm performance. This analysis is conducted using simple simulated student datasets, which allow us to assess the algorithm's effectiveness in realistic scenarios [20-24].

The K-Means Algorithm.
K-means clustering is a well-known technique for performing nonhierarchical clustering. As a centroid-based algorithm, its primary objective is to partition data into a predetermined number of clusters, denoted k, which is specified by the user. The algorithm begins by randomly selecting k elements from the dataset; these elements serve as the initial centroids, or center points, of the k clusters. Each data point in the dataset is assigned to the nearest centroid based on the Euclidean distance, forming the initial clusters. The centroid of each cluster is then recalculated as the mean of all data points currently assigned to that cluster, updating the cluster centers to better represent their members. The centroids are continuously updated and refined until they no longer change, or the changes fall below a certain threshold, indicating that the algorithm has converged and the clusters have reached a steady state [10, 17, 18].
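The steps above can be sketched in plain Python. This is a minimal illustration of the standard k-means loop (random initialization, nearest-centroid assignment, mean recomputation, repeat until steady), not the exact implementation used for the paper's experiments.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal k-means on tuples of coordinates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its members.
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]
        if new == centroids:                   # converged: steady state reached
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups; any initialization converges to the same split.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))
```

With this data the final centroids are the group means (4/3, 4/3) and (25/3, 25/3); with less separated data, different seeds can yield different partitions, which is exactly the initialization sensitivity discussed below.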

Applied Computational Intelligence and Soft Computing
The steps of the k-means clustering algorithm are summarized in Figure 1. The k-means clustering algorithm is simple yet powerful, capable of efficiently partitioning data into distinct clusters. However, it does have certain limitations and disadvantages:
(1) The performance of k-means clustering is highly sensitive to the initial selection of centroids. Different initializations can lead to different clustering results, which may not be optimal. This sensitivity makes it important to choose the initial centroids carefully to ensure meaningful and accurate clustering.
(2) The number of clusters, k, needs to be specified in advance. Determining the optimal value of k can be challenging, as it requires prior knowledge or trial-and-error experimentation; an inappropriate value of k can lead to suboptimal clustering results or misinterpretation of the data structure.
(3) K-means clustering is sensitive to outliers in the data. Outliers can significantly shift the centroid positions and distort the clustering results; they may be assigned to clusters to which they do not truly belong, affecting the accuracy and interpretability of the clustering solution.
(4) K-means clustering assumes that the clusters have similar sizes and variances. Real-world data often exhibit uneven cluster sizes and varying variances, which can result in suboptimal clustering performance.
(5) K-means clustering may struggle to handle clusters with complex shapes.
Despite these limitations, k-means clustering remains a widely used and effective algorithm for many applications. Careful consideration of these limitations and appropriate preprocessing of the data can help improve its performance [6].
Example 1. To commence, we divide the dataset into two clusters (k = 2). Then, we choose two random students as the center element of each cluster; let them be β4 and β3. Using formula (1), we compute the Euclidean distance between the chosen cluster centers and all the students.

Now we can construct the two clusters (C1, C2) by assigning the students nearest to each center: C1 = {β1, β4, β5, β6, β7, β8}, with C2 containing the remaining students. With these two clusters in hand, we define the mean (centroid) of each cluster, which is shown in Table 2.

The previous steps are repeated until a steady clustering is reached, yielding the final clusters. Note that, if we start with different random center students, we get completely different clusters. For instance, Table 3 shows the results obtained with a different choice of initial centers.

The K-Medoids Algorithm.
In k-medoids clustering, representative objects called medoids are used instead of centroids. The medoid is defined as the most centrally located object within each cluster; that is, the data point with the minimum average dissimilarity to all other points in the same cluster. By relying on medoids as the cluster centers, k-medoids clustering is less sensitive to outliers than k-means clustering. The k-medoids algorithm iteratively determines the optimal medoids and assigns each object to the nearest medoid, in a similar manner to k-means clustering. However, instead of calculating the mean of the data points within each cluster, k-medoids clustering directly selects the most representative object as the medoid. This approach makes k-medoids clustering more robust to outliers and noise in the data, as the medoids are less influenced by extreme values [19].
The methodology of the k-medoids algorithm is illustrated in Figure 2.
While k-medoids clustering offers advantages over k-means clustering, it also has some limitations and disadvantages:
(1) The computational complexity of k-medoids clustering is higher than that of k-means clustering, because in each iteration the algorithm needs to calculate dissimilarities between each data point and all potential medoids. As the number of data points and clusters increases, the computational cost can become significant.
(2) The performance of k-medoids clustering can be highly dependent on the initial selection of medoids; different initializations can lead to different clustering results. It is therefore crucial to choose the initial medoids carefully to ensure meaningful and accurate clustering.
(3) K-medoids clustering may face scalability issues when dealing with large datasets. As the number of data points increases, computing dissimilarities and searching for optimal medoids become more time-consuming and memory-intensive, making the algorithm less efficient for big-data applications.
(4) Like other clustering algorithms, k-medoids clustering can struggle with high-dimensional data. In high-dimensional spaces, the concept of distance becomes less meaningful, and the dissimilarity calculations may not accurately capture the true similarities between data points; dimensionality reduction or other preprocessing methods may be necessary to address this issue.
(5) K-medoids clustering assumes that each cluster has a single representative object (medoid) at its center. This assumption restricts the flexibility of cluster shapes, as the algorithm may not handle complex cluster structures effectively.
Despite these limitations, k-medoids clustering remains a valuable technique, especially in scenarios where outlier robustness and interpretability are critical considerations.
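The algorithm can be sketched compactly, assuming a PAM-style greedy swap search (one common variant; the worked example below instead tests randomly chosen swap candidates) with the Manhattan distance used in Example 2.

```python
from itertools import product

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    """Sum of each point's Manhattan distance to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, init_medoids):
    """Greedy PAM: keep swapping a medoid for a non-medoid while cost drops."""
    medoids = list(init_medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(len(medoids)), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            if total_cost(points, trial) < total_cost(points, medoids):
                medoids, improved = trial, True   # keep the cheaper swap
    return medoids

pts = [(1, 1), (2, 1), (1, 2), (9, 9), (9, 8), (8, 9)]
print(sorted(k_medoids(pts, [(1, 1), (9, 9)])))   # already optimal: no swap helps
```

Note that the medoids returned are always actual data points, which is what gives the method its outlier robustness relative to mean-based centroids.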
Example 2. For the system given in Example 1, we apply k-medoids clustering as follows:

To commence, we divide the dataset into two clusters (k = 2). Then, we choose two random students as the medoid of each cluster; that is, c1 = β3 and c2 = β8. Using formula (5), we compute the Manhattan distance (Md) between the chosen medoid students and all the students. Now we can construct the two clusters C1 and C2 based on the nearest students to each medoid, and we compute the total cost of each cluster from the Manhattan distances between the chosen medoids and all the students.

From Table 5, we obtain the resulting clusters and costs. We then randomly select a nonmedoid student and recalculate the cost with c1 = β3 and c4 = β5. We select another nonmedoid student at random and recalculate the cost with c1 = β3 and c5 = β7. Since swapping c4 with c5 does not reduce the total cost, the final medoids are retained.

Morphological Accuracy Cluster (MAC)
In this section, we introduce a clustering algorithm based on the morphological concepts outlined earlier in [15], including the neighborhood-erosion operator, the neighborhood-dilation operator, and the morphological accuracy measure. The proposed algorithm determines the centroid of each cluster, with the number of clusters decided by the user. At the heart of the algorithm, the centroid of each cluster is chosen using the morphological accuracy measure of each data point: the higher the accuracy, the more stable the set containing the point. The remaining objects are then assigned to clusters based on the Euclidean distance between them and the centroids. Lastly, we apply the neighborhood-erosion operator to each cluster until we obtain a new centroid for each one. At this stage, we examine whether the clusters are identical to those obtained previously; if they are not, we repeat the operation until we achieve a steady set of clusters.

Remark 5. To construct a family of neighborhood-structuring elements, it is necessary to assign a neighborhood radius to each object. Unlike the fuzzy α-cut concept, where the α-cut value is the same for the entire dataset, using a different radius for each object provides more accurate results, allowing a more precise representation of the individual characteristics and relationships within the data.

The MAC Methodology.
To implement the proposed strategy for data clustering using the neighborhood and morphological concepts, we follow these steps:
(1) Define the universe set X, the distance function d, and the modified distance function z* based on the specific requirements of the problem.
(2) Determine the neighborhood radius R for each object in the dataset.
(3) Deduce a neighborhood-structuring element for each object, taking into account the adjustable parameter ξ.
(4) Apply the clustering algorithm, utilizing the morphological concepts presented earlier: the neighborhood-erosion operator, the neighborhood-dilation operator, and the morphological accuracy measure.
(5) Define the centroids based on the values of the morphological accuracy.
(6) Assign objects to clusters based on their morphological accuracy and proximity to the cluster centroids.
(7) Continuously update and refine the centroid of each cluster using the neighborhood-erosion/dilation operators until a steady set of clusters is obtained; this involves checking whether the clusters remain consistent with those previously obtained and repeating the operation if necessary.
This methodology is summarized in Figure 3.
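The first steps of the pipeline can be sketched as follows. The paper's modified distance z*, radius formulas, and exact centroid-selection rule are not reproduced in this sketch, which instead substitutes assumed stand-ins: plain Euclidean distance, a radius equal to ξ times each object's mean distance to the other objects, and the accuracy of each object's own neighborhood as its centroid score.

```python
import math

def mac_setup(points, xi=0.8):
    """Distance matrix, per-object radii, and neighborhood-structuring elements."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Assumed radius rule (stand-in for the paper's formula):
    # xi times the object's mean distance to the other objects.
    radius = [xi * sum(row) / (n - 1) for row in dist]
    # Structuring element of object i: all objects within radius[i] of it.
    se = {i: {j for j in range(n) if dist[i][j] <= radius[i]} for i in range(n)}
    return dist, se

def morphological_accuracy(A, se):
    eroded = {x for x in se if se[x] <= A}
    dilated = {x for x in se if se[x] & A}
    return len(eroded) / len(dilated) if dilated else 0.0

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
dist, se = mac_setup(points)
# Accuracy of each object's own neighborhood (step 5): higher values mark
# objects in stable, well-surrounded regions, i.e., centroid candidates.
acc = {i: morphological_accuracy(se[i], se) for i in se}
print(acc)
```

Steps 6 and 7 would then assign the remaining objects by distance to the selected centroids and re-erode each cluster until the partition stops changing.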
Example 3. For the system given in Example 1, we construct the morphological clusters as follows:
Step 1. In Table 6, we construct the modified distance matrix using formula (4).
Step 2. Using formulas (16) and (17) with ξ = 0.8, we compute the neighborhood radius for each object, which is shown in Table 7.
Step 3. Deduce the neighborhood-structuring element for each object.
Step 4. Apply the neighborhood-erosion and neighborhood-dilation operators to compute the morphological accuracy of each object, which is shown in Table 8.
Step 5. To assign the clusters' center elements, we choose the two elements with the highest and lowest accuracy values. In this example, β4 has the highest accuracy value, while five elements share the lowest value: β2, β5, β6, β7, and β8. In this case, we choose the two center elements to be β4 and β2; that is, we assign β4 to cluster C1 and β2 to cluster C2.
Step 6. Calculate the distance between all the objects in X and the two centroids, which is shown in Table 9.
Step 7. Assign each element in X to the closest cluster.
Step 8. Deduce the erosion of the two clusters.
Step 9. Examine the stability of the obtained clusters.

Clustering Performance Evaluation
In assessing the effectiveness of clustering algorithms, a key measure is the Calinski-Harabasz index, also known as the variance ratio criterion. The index is the ratio of between-cluster separation to within-cluster dispersion, each normalized by its respective degrees of freedom. A higher index value suggests a model with better-defined clusters, as it encapsulates the principle that an optimal clustering solution should maximize the distance between distinct clusters while minimizing the variance within each cluster. This concept can be expressed mathematically as follows. Given a dataset of n elements partitioned into k clusters,

C = (B / (k - 1)) / (W / (n - k)),

where B is the sum of squared distances from each cluster centroid to the overall data centroid, which measures how well the clusters are separated from each other (the higher the better), and W is the sum of squared distances from each data element to its assigned cluster centroid, which measures the compactness or cohesiveness of the clusters (the smaller the better). Now, let us use this index to evaluate the clustering algorithms presented in §3.1, §3.2, and §4.1.
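A direct implementation of the index as described above is short. Note that in the standard formulation of the criterion, each squared centroid-to-grand-centroid distance in B is weighted by the cluster's size.

```python
def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def calinski_harabasz(clusters):
    """Variance ratio criterion: (B / (k-1)) / (W / (n-k))."""
    all_pts = [p for c in clusters for p in c]
    n, k = len(all_pts), len(clusters)
    g = mean(all_pts)                       # overall data centroid
    # Between-cluster dispersion, weighted by cluster size.
    B = sum(len(c) * sq_dist(mean(c), g) for c in clusters)
    # Within-cluster dispersion (cluster compactness).
    W = sum(sq_dist(p, mean(c)) for c in clusters for p in c)
    return (B / (k - 1)) / (W / (n - k))

# Two tight, well-separated clusters score very high.
tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
print(calinski_harabasz(tight))  # 400.0
```

Because B rewards separation and W penalizes spread, comparing the index across the k-means, k-medoids, and MAC partitions of the same dataset gives a like-for-like quality ranking.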

Clustering Performance Evaluation (K-Means).
In this section, we evaluate the k-means clustering methodology presented in §3.1, using the results obtained in Example 1. Using formulas (22) and (23), we get B = 13056.5 and W = 26160; hence, the Calinski-Harabasz index is C = 2.99.
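The reported value can be checked directly from the figures above:

```python
# B and W from the k-means evaluation; n = 8 students, k = 2 clusters.
B, W, n, k = 13056.5, 26160, 8, 2
C = (B / (k - 1)) / (W / (n - k))
print(round(C, 2))  # 2.99
```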

Clustering Performance Evaluation (K-Medoids).
In this section, we evaluate the k-medoids clustering methodology presented in §3.2, using the results obtained in Example 2.

Clustering Performance Evaluation (MAC).
In this section, we evaluate the MAC clustering methodology presented in §4.1, using the results obtained in Example 3.

Conclusion
In conclusion, we have introduced a novel clustering algorithm called the morphological accuracy cluster (MAC) algorithm. Unlike existing clustering methods, the proposed algorithm uses an accuracy measure to define the centroids of the clusters, eliminating the need for predefined centroids. The accuracy measure employed, known as the morphological accuracy, is computed using the neighborhood-erosion and neighborhood-dilation operators. In the MAC algorithm, the neighborhood-erosion operation is applied to each cluster iteratively, yielding a new centroid for each cluster; this process is repeated until the desired clusters are obtained. Empirical results demonstrate that the proposed algorithm reaches a steady cluster state in fewer iterations than the traditional k-means algorithm. Additionally, the clusters produced by the k-means algorithm are highly sensitive to the user's initial centroid selection.

The MAC algorithm offers several advantages over existing clustering methods: it eliminates the need for manual centroid initialization and provides a more robust and accurate clustering solution. By incorporating the morphological accuracy measure, the algorithm can effectively capture the structural information within the data, leading to improved clustering performance. Future research directions include exploring the applicability of the MAC algorithm to various domains and datasets, as well as investigating its scalability and performance on large-scale datasets. Further refinement and optimization of the algorithm could also be pursued to enhance its efficiency and effectiveness.

Table 1 :
The data set of the information system under study.

Table 2 :
The centroid of the two clusters.

Table 3 :
The clusters with different initial center students.

Table 6 :
The modified Euclidean distance matrix between objects.

Table 7 :
The neighborhood radius for each object.

Table 8 :
The neighborhood-erosion, neighborhood-dilation, and morphological accuracy of some subsets of the information system.

Table 9 :
The Euclidean distance between each object and the two centroids of C1 and C2.