Clustering Analysis of Multivariate Data: A Weighted Spatial Ranks-Based Approach

Determining the right number of clusters without any prior information about their number is a core problem in cluster analysis. In this paper, we propose a nonparametric clustering method based on different weighted spatial rank (WSR) functions. The main idea behind WSR is to define a dissimilarity measure locally based on a localized version of multivariate ranks. We consider a nonparametric Gaussian kernel weight function. We compare the performance of the method with other standard techniques and assess its misclassification rate. The method is completely data-driven, robust against distributional assumptions, and accurate for the purpose of intuitive visualization, and it can be used both to determine the number of clusters and to assign each observation to its cluster.


Introduction
In recent years, there has been significant advancement in clustering methods in the fields of data science and machine learning. Density-based spatial clustering of applications with noise (DBSCAN) has become one of the most popular nonparametric clustering algorithms; it groups together points that are close to each other and surrounded by low-density areas [1]. Hierarchical clustering [2] has also been widely used to build a hierarchy of clusters by either merging smaller clusters into larger ones or splitting larger clusters into smaller ones. K-means clustering [3] remains one of the most popular partitioning methods, where the data are partitioned into K distinct, nonoverlapping clusters.
In addition, spectral clustering [4] has gained popularity due to its ability to use the eigenvectors of the similarity matrix to partition the data into clusters. Affinity propagation [5] assigns each data point to an exemplar, which is a representative point of a cluster, and iteratively updates the exemplars until convergence. Fuzzy clustering [6] assigns each data point a probability of belonging to each cluster, rather than assigning it to a single cluster. The K-medoids algorithm introduced by [7] uses actual data points as representatives, or medoids, for each cluster; the medoids selected from the dataset are typically the most centrally located points within a cluster. This makes K-medoids more robust to outliers and noise than K-means. Another recent method is distance density clustering (DDC), introduced by [8]: a medoid-based clustering method that takes the density of time series data into account and provides clustering results in a hierarchical fashion.
Clustering by fast search and find of density peaks (densityClust) [9] is a density-based method that identifies clusters based on the local density of data points and the distances between them, which makes it suitable for datasets with irregularly shaped clusters and varying densities. Furthermore, neighborhood grid clustering (NGC), introduced by [10], is a density-based clustering algorithm that identifies clusters based on the local density of data points and their spatial relationships. It uses a grid-based approach to partition the data space into cells and then assigns data points to their corresponding cells. The algorithm then determines the cluster assignments by considering the density and spatial proximity of the points within each cell.
These methods have their own strengths and weaknesses and are suitable for different types of data and applications. The choice of the appropriate clustering method depends on various factors such as the size and nature of the data, the desired number of clusters, and the research question being addressed.
Further advancements have been made in the use of spatial ranks to analyse multivariate data, with the development of nonparametric methods being particularly notable [11][12][13]. These methods have a number of attractive features, including being distribution-free and easy to compute. Furthermore, the traditional spatial rank function gives information about how central each observation is and its direction in relation to the centre. However, it does not capture the distances between each pair of observations, which is important for cluster analysis.
Recently, Baragilly and Chakraborty [11] utilized spatial ranks as a clustering tool, using a forward search methodology based on nonparametric multivariate spatial rank functions to determine the number of clusters in the data. Their method does not depend on the choice of the initial subsample and has been shown to perform well for different mixture distributions. This work has been extended to clustering functional datasets in medical applications [14].
This paper proposes a novel approach to utilizing spatial ranks for clustering, based on a nonparametric weighted spatial rank function that takes into account the distances between each pair of observations as weights and defines a dissimilarity measure based on spatial ranks. By measuring the distances between each pair of observations instead of their central tendency, it becomes easier to segment a given set of data into a specific number of clusters.
The main idea behind the weighted spatial ranks (WSRs) is to define a dissimilarity measure based on a localized version of spatial ranks, such that the weighted ranks can be used as a classifier and as a confirmatory tool to determine the number of clusters and assign each observation to its cluster. Proper selection of a weight function can lead to better identification of clusters, and kernel weights are a popular choice in pattern analysis, classification, cluster analysis, machine learning, and support vector machines.
The paper also demonstrates how weighted spatial ranks may be used to visualize the clusters, so that the number of clusters may be determined using weighted rank contours for a low-dimensional input space after dimension reduction.
Section 2 of the paper introduces the weighted spatial rank function and evaluates its use for different parametric and nonparametric weight functions. Section 3 presents the weighted rank-based clustering algorithm and proposes a confirmatory classifier based on weighted ranks that can be used to assign observations to the most appropriate cluster for two-dimensional data. Section 4 extends the weighted rank-based clustering algorithm to higher-dimensional data, and Section 5 provides numerical examples based on both simulated and real datasets to examine the performance of the proposed algorithm. The algorithm is compared with other clustering methods in Section 6, and concluding remarks are presented in Section 7.

Weighted Spatial Rank Functions
In this section, we propose two different weighted spatial rank functions. Suppose that X ∈ R^d has a d-dimensional distribution F, which we assume to be continuous hereon. The unweighted spatial rank function of a point x ∈ R^d with respect to F can be defined as

R_F(x) = E[u(x − X)],   (1)

where u(x) = x/‖x‖ for x ≠ 0 and u(0) = 0 is the multivariate sign (unit vector) function. Given a random sample X_1, ..., X_n from F and weights w_i = w_i(x), the first weighted spatial rank function of the point x with respect to the empirical distribution F_n is a vector function and can be defined as

WSR^(1)_{F_n}(x) = Σ_{i=1}^n w_i u(x − X_i) / Σ_{i=1}^n w_i.   (2)

The second weighted spatial rank function can be defined as

WSR^(2)_{F_n}(x) = (1/n) Σ_{i=1}^n w_i u(x − X_i),   (3)

and their L2 norms are

WSRN^(1)_{F_n}(x) = ‖WSR^(1)_{F_n}(x)‖,   (4)
WSRN^(2)_{F_n}(x) = ‖WSR^(2)_{F_n}(x)‖.   (5)

Note that the difference between (2) and (3) lies in the scale (denominator): in (2) it is the sum of the weights w_i, where w_i = w_i(x), and is therefore data dependent, whereas in (3) it is n, which is data independent.
Kernel weight functions are often used in nonparametric estimation and, as already indicated, in a range of classification and pattern recognition problems. Here, we consider the Gaussian kernel weight, one of the most commonly used nonparametric kernel weight functions, defined as

w_i(x) = e^{−‖x − X_i‖²/2},   i = 1, ..., n.   (6)
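As a concrete illustration of (3)-(6), the following sketch evaluates the normed second weighted spatial rank WSRN^(2) at a point. The function name and the implicit unit bandwidth in the Gaussian weight are our own choices; only the formulas come from the definitions above.

```python
import numpy as np

def wsrn2(x, X):
    """Normed second weighted spatial rank WSRN^(2)_{F_n}(x) with Gaussian weights.

    X is an (n, d) sample, x a d-vector. Computes
    WSR^(2)(x) = (1/n) * sum_i w_i * u(x - X_i), w_i = exp(-||x - X_i||^2 / 2),
    and returns its L2 norm.
    """
    d = x - X                              # (n, d) differences x - X_i
    nrm = np.linalg.norm(d, axis=1)
    u = np.zeros_like(d)                   # multivariate sign: u(0) = 0
    nz = nrm > 1e-12
    u[nz] = d[nz] / nrm[nz, None]          # u(v) = v / ||v||
    w = np.exp(-nrm**2 / 2)                # Gaussian kernel weights (6)
    return np.linalg.norm((w[:, None] * u).mean(axis=0))
```

Because the weights vanish for points far from x, WSRN^(2) tends to zero away from the data, which is what lets the contour plots of Section 3 trace the clusters.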

Weighted Spatial Ranks Clustering Algorithm
We now introduce the weighted spatial ranks clustering algorithm, starting with the bivariate case (d = 2) before considering the higher-dimensional case d > 2.

Weighted Spatial Ranks Clustering Algorithm (d = 2).

(1) Let X_1, ..., X_n ∈ R² be the observed sample, and construct a grid S_x of evaluation points spanning the range of the data.
(2) For each grid point s ∈ S_x, compute the normed weighted spatial rank WSRN(s) = ‖(1/n) Σ_{i=1}^n w_i u(s − X_i)‖, where i = 1, ..., n and w_i = e^{−‖s − X_i‖²/2}.
(3) Plot the weighted spatial ranks contour based on WSRN(s) and s ∈ S_x, and then determine the number of clusters K from the contour lines.
(4) Based on the contour lines, specify the assigned observations in each cluster. A lower contour level can be used for better visualization.
(5) Use the weighted spatial rank classifier's rule defined in Section 3.2 to confirm the assignment of each observation and to allocate the unassigned observations to the proper cluster.
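The grid evaluation in the early steps of the algorithm can be sketched as follows. The grid size, padding, and function name are our choices, not part of the algorithm's specification.

```python
import numpy as np

def wsrn2_grid(X, m=60, pad=1.0):
    """Evaluate WSRN^(2)(s) over an m x m grid S_x spanning the data X (n x 2)."""
    lo, hi = X.min(axis=0) - pad, X.max(axis=0) + pad
    xs, ys = np.linspace(lo[0], hi[0], m), np.linspace(lo[1], hi[1], m)
    Z = np.empty((m, m))
    for i, yv in enumerate(ys):
        for j, xv in enumerate(xs):
            d = np.array([xv, yv]) - X
            nrm = np.linalg.norm(d, axis=1)
            u = np.zeros_like(d)
            nz = nrm > 1e-12
            u[nz] = d[nz] / nrm[nz, None]
            w = np.exp(-nrm**2 / 2)        # w_i = exp(-||s - X_i||^2 / 2)
            Z[i, j] = np.linalg.norm((w[:, None] * u).mean(axis=0))
    return xs, ys, Z

# For the plotting step, a call such as matplotlib's plt.contour(xs, ys, Z)
# then shows the contour lines from which the number of clusters K is read off.
```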

Weighted Spatial Ranks Classifier

Suppose that we have k groups with distributions F_1, F_2, ..., F_k. Based on the second WSR function in (3), we assign a d-dimensional observation vector x to the i-th group if

WSRN^(2)_{F_i}(x) > WSRN^(2)_{F_j}(x),   j ≠ i, 1 ≤ i, j ≤ k.   (8)

Note that, had we used WSR^(1)_{F_i}, then since WSRN^(1)_{F_i} increases outward from the spatial median, we would instead assign x to the group that yields the smallest WSRN^(1)_{F_i}(x). So, after determining the number of clusters using the WSR contour, (8) may be used to assign each observation to the most suitable cluster.
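A minimal sketch of this decision rule follows; the helper names are hypothetical, while the rule itself (maximising WSRN^(2) over groups) is as stated above.

```python
import numpy as np

def _wsrn2(x, X):
    # normed second weighted spatial rank of x w.r.t. sample X, Gaussian weights
    d = x - X
    nrm = np.linalg.norm(d, axis=1)
    u = np.zeros_like(d)
    nz = nrm > 1e-12
    u[nz] = d[nz] / nrm[nz, None]
    w = np.exp(-nrm**2 / 2)
    return np.linalg.norm((w[:, None] * u).mean(axis=0))

def wsr_classify(x, groups):
    """Assign x to the group i whose WSRN^(2)_{F_i}(x) is largest."""
    return int(np.argmax([_wsrn2(x, G) for G in groups]))
```

For well-separated groups, a point near one group scores essentially zero against the others because the Gaussian weights vanish at that distance.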

Confirmatory Analysis Based on the Weighted Spatial Ranks Clustering Algorithm

In order to assess the performance of the weighted spatial rank functions, we compare them with other standard methods: Euclidean distances, Mahalanobis distances, spatial ranks, and spatial depth. A simulation study is used to assess the performance of the proposed weighted ranks-based clustering in defining the group structure in multivariate data. The simulated data are sampled from a bivariate normal mixture distribution that is assumed to cluster into two groups, with mixture proportion p = 0.3 and sample size n = 1000, such that X_1, ..., X_n is a random sample from

p N_2(μ_1, Σ) + (1 − p) N_2(μ_2, Σ),   (10)

where μ_1 = (0, 0)^T, μ_2 = (5, 5)^T, and Σ = I. All the contour plots that follow are derived from a random sample of 1000 observations simulated from the bivariate mixture normal distribution in (10). Figure 1 shows the contour plots of the Euclidean distances, Mahalanobis distances, spatial ranks, and spatial depth. The figure clearly shows that the contours produced fail to map the shape of the two-cluster structure in the bivariate mixture distribution.
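The simulation design in (10) can be reproduced with a short sketch; the function name and seed are arbitrary choices of ours.

```python
import numpy as np

def sample_mixture(n=1000, p=0.3, seed=0):
    """Draw n points from p*N((0,0)', I) + (1-p)*N((5,5)', I)."""
    rng = np.random.default_rng(seed)
    from_first = rng.random(n) < p                       # mixture indicator
    means = np.where(from_first[:, None], [0.0, 0.0], [5.0, 5.0])
    return means + rng.normal(size=(n, 2)), from_first
```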
In Figure 2, WSRN^(1)_{F_n} (defined in (4)) and WSRN^(2)_{F_n} (defined in (5)) are used to derive the contour plots based on the nonparametric Gaussian kernel weight function defined in (6). In general, Figure 2 reveals that, compared with Figure 1, the contours produced from both WSRN^(1)_{F_n} and WSRN^(2)_{F_n} based on the Gaussian kernel function capture more of the structure of the simulated data. However, the contour based on WSRN^(1)_{F_n} failed to detect some of the observations that are not close to either of the clusters. These undetected observations are indicated by some lines between the two clusters, suggesting the potential presence of a third cluster. In contrast, it is clear that the cluster structure of the data is better defined by using WSRN^(2)_{F_n} than by the WSRN^(1)_{F_n} shown in Figure 2(a). This is because the values of WSRN^(1)_{F_n} increase outward from the centre, or spatial median, of the cluster; thus, unassigned points may be assigned to the cluster which yields the lowest weighted rank for the point. In contrast, the contour derived from the second normed weighted spatial rank function, WSRN^(2)_{F_n}, defined in (5), is based on values that decrease outward from the spatial median. Thus, the larger the weighted spatial rank value for an individual point, the closer it is to the centre of the cluster.
In summary, WSRN^(2)_{F_n} is more successful than WSRN^(1)_{F_n} at capturing the cluster structure of the simulated data. Overall, the most accurate contour plot is obtained when WSRN^(2)_{F_n} uses Gaussian kernel weights.

Weighted Spatial Ranks-Based Clustering Algorithm for Higher Dimensions (d > 2)
When d > 2, a dimension reduction strategy may be employed; here we use principal component analysis (PCA) to reduce the dimensionality of the data to two dimensions in order to derive a contour plot (see [16]). The main idea of PCA is to find a lower-dimensional subspace that captures most of the variance in the data. Specifically, it involves finding the orthogonal rotation of the axes that maximises the variance. For a d-dimensional random variable X = (X_1, X_2, ..., X_d) with covariance matrix Σ, let C_j = a_j^T X, where a_j is a d-dimensional vector of constants. Since the variance of C_j is a_j^T Σ a_j, finding the principal component C_j requires finding

arg max_{a_j} a_j^T Σ a_j subject to a_j^T a_j = 1.

It is necessary to constrain a_j to have unit length to ensure finite values. This may be solved using the method of Lagrange multipliers, so that for Lagrange multiplier λ_j the problem reduces to solving the eigen equation Σ a_j = λ_j a_j.
It also means that the component which yields the largest eigenvalue λ_j has the largest variance, with a_j the corresponding eigenvector. It is also straightforward to show that the a_j for j = 1, ..., d are orthogonal. Thus, for the d × d matrix of eigenvectors A = [a_1, ..., a_d], the matrix of principal component scores C is given by C = XA. This represents a rigid rotation of the X axes to an orientation of maximum variance. Thus, the first principal component C_1 has the largest variance for X, the second component C_2 has the second largest variance, and so on (for more details on PCA, see [17]).
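The PCA step described above (eigendecomposition of the covariance matrix, scores C = XA) can be sketched as follows; only the first two components are retained, since that is all the contour plot requires.

```python
import numpy as np

def pca_first_two(X):
    """Return the first two principal component scores and the variance they explain."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]          # largest eigenvalue first
    A = vecs[:, order[:2]]                  # eigenvectors a_1, a_2
    explained = vals[order[:2]].sum() / vals.sum()
    return Xc @ A, explained                # scores C = X A
```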

Numerical Examples
In this section, we apply the weighted spatial ranks-based clustering algorithm to two simulated datasets and three real datasets.

Simulated Data Examples.
In the first simulated data example, we consider a mixture of three 4-variate normal distributions, with mixing proportions p_1 = 0.3 and p_2 = 0.4 (and hence p_3 = 0.3) and sample size n = 100, such that X_1, X_2, ..., X_n ∈ R^4 is a random sample from the 4-dimensional normal mixture distribution.
Figure 3(a) shows the scatterplot matrix of the principal components and reveals a mixed picture. It is clear from the component 1 versus component 2 panels that there are 3 clusters. In contrast, the component 2 versus component 3 and the component 2 versus component 4 panels suggest that there are only 2 clusters. Figure 3(b) gives the proportion of the total variance explained by each component; 97% of the total variance is explained by the first two components. Figures 3(c) and 3(d) demonstrate that the weighted spatial ranks contour accurately fits the shape of the three clusters without any misclassification. Finally, in Figures 3(e) and 3(f), the confirmatory plots based on the weighted ranks classifier for the first two components show that the observations have been correctly assigned to the three simulated clusters.
In the second example, we simulate a random sample of size n = 100 from a mixture of four 6-dimensional normal distributions with equal mixing proportions p = 0.25. From the scatter plot matrix in Figure 4(a), we can see that whilst there are four clear clusters in the panels involving component 1, the number of clusters is less clear in the other panels. However, it is clear from Figure 4(b) that the first two components explain the majority (98%) of the variance.
The weighted spatial ranks contour plots shown in Figures 4(c) and 4(d) clearly reveal the shape of the four clusters, where in the latter a lower contour level (0.001) has been used. Finally, the confirmatory plots based on the weighted spatial ranks classifier for the first two components and the original data, shown in Figures 4(e) and 4(f), demonstrate the correct assignment of the simulated observations to the right clusters.

Real Datasets' Examples.
In this subsection, the algorithm is applied to three real datasets: the iris data [18], the financial data [19], and the old faithful geyser data [20, 21]. The iris dataset consists of three different types of irises (Setosa, Versicolour, and Virginica). However, most clustering techniques consider there to be two groups, since iris Virginica and iris Versicolour are not separable without the species information that Fisher used. As we can see from Figure 5(c), the weighted ranks contour based on the first two components, which explain 97.8% of the total variance, indicates two clusters. The confirmatory plots in Figures 5(e) and 5(f) assign all of the observations to two groups.
The second real dataset is the financial data [19], which contains measurements of three variables monitoring the performance of 103 investment funds operating in Italy since April 1996 (Table A.16 of Atkinson et al. [19]). These data include two different kinds of funds (stock funds and balanced funds). From Figure 6(c), the weighted ranks contour of components 1 and 3, which explain 96.4% of the total variance, suggests there are two clusters. Moreover, the confirmatory plots provide a valid assignment of the observations, which is consistent with the two types of funds.
The third dataset, the old faithful geyser data, is taken from Azzalini and Bowman [20] and the MASS library of Venables and Ripley [21]. It includes 272 observations on two variables: the waiting time between eruptions and the duration of the eruption, in minutes, for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. This dataset consists of two apparent clusters, the short and the long eruptions. From Figures 7(b) and 7(c), it can be seen that the weighted ranks contour of the data indicates two clusters with an unassigned observation (number 174). Using the confirmatory classifier shown in Figure 7(d), observation 174 is correctly assigned to the second cluster.

Comparison with Other Clustering Methods
The WSRN method determines the number of clusters in a dataset and classifies the data into each of the clusters. In this section, we compare the WSRN method with other clustering and classification methods.
The first method is model-based clustering, "mclust" [22]. This is based on a Gaussian mixture model (GMM) [23], where the number of clusters corresponds to the model which returns the largest Bayesian information criterion (BIC). The second method is the K-means algorithm combined with the Calinski-Harabasz (CH) index [24]. The number of clusters that returns the highest CH index is selected before applying the K-means algorithm [25] to classify the data.
The third method used as a comparator is high-dimensional data clustering (HDDC) [26], again a clustering method based on the Gaussian mixture model, where the BIC is used to select the number of clusters. The fourth method is the mixture of probabilistic principal component analyses, "MixtPPCA" [27], where the number of clusters corresponds to the largest BIC. The fifth method is partitioning around medoids, "PAM" [28], where the number of clusters is estimated from the optimum average silhouette width [29]. The sixth method is density-based spatial clustering of applications with noise (DBSCAN), where the number of clusters is estimated using a density-based approach that identifies regions of high density in the data and treats these as clusters [1]. Other methods used in the comparison are KMD: clustering with K-medoids [7]; FCM: fuzzy C-means clustering [30]; GG: the Gath-Geva clustering algorithm [31]; DDC: distance density clustering [8]; SNN: clustering with shared nearest neighbours [32]; and densityClust: clustering by fast search and find of density peaks [9].
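As an illustration of the CH criterion used with the second comparator, here is a sketch of the Calinski-Harabasz index (between-cluster over within-cluster dispersion); the implementation is ours, not the one used in the paper's experiments.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index: [B/(k-1)] / [W/(n-k)] for a given labelling."""
    n = len(X)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall = X.mean(axis=0)
    B = W = 0.0
    for c in cluster_ids:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        B += len(Xc) * np.sum((mc - overall) ** 2)   # between-cluster dispersion
        W += np.sum((Xc - mc) ** 2)                  # within-cluster dispersion
    return (B / (k - 1)) / (W / (n - k))
```

The number of clusters K is then chosen as the value maximising this index over candidate K-means solutions.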
Each method was applied to the three real datasets of Section 5. As the external classes are known, the different clustering methods were compared using purity, entropy, and the misclassification rate. Although purity and entropy are external validation measures commonly used in classification, they measure the homogeneity of the data within clusters and do not penalize algorithms that identify the incorrect number of clusters. Indeed, if each cluster is homogeneous for a particular class, both purity and entropy will assign a perfect score (1 for purity, 0 for entropy) even if the number of clusters is incorrect. The following misclassification rate does penalize algorithms which identify the incorrect number of clusters.
Let there be n data points with r true classes T = {T_1, T_2, ..., T_r}, and suppose the algorithm identifies k clusters C = {C_1, C_2, ..., C_k}. Let A = {1, 2, ..., k} and B = {1, 2, ..., r}. The misclassification rate H is defined as

H = 1 − (1/n) max Σ_{(i,j)} |C_i ∩ T_j|,

where the maximum is taken over sets of index pairs (i, j) ∈ A × B subject to the constraint that if two terms |C_i ∩ T_j| and |C_t ∩ T_u| appear in the sum, then i = t if and only if j = u [14, 33]. Thus, each row and each column of the A × B matrix of counts contributes at most one element to the sum. Also, when there is only one cluster, the sum contains one term only.
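A direct sketch of H as defined above, enumerating the one-to-one matchings between clusters and classes (practical for the small numbers of clusters considered here; the function name is ours):

```python
import numpy as np
from itertools import permutations

def misclassification_rate(clusters, classes):
    """H = 1 - (1/n) * max over one-to-one matchings of sum |C_i ∩ T_j|."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    cl, tl = np.unique(clusters), np.unique(classes)
    counts = np.array([[np.sum((clusters == c) & (classes == t)) for t in tl]
                       for c in cl])
    k, r = counts.shape
    if k <= r:   # match each cluster to a distinct class
        best = max(sum(counts[i, p[i]] for i in range(k))
                   for p in permutations(range(r), k))
    else:        # match each class to a distinct cluster
        best = max(sum(counts[p[j], j] for j in range(r))
                   for p in permutations(range(k), r))
    return 1.0 - best / len(clusters)
```

Note that a single all-covering cluster scores H equal to the fraction of points outside its best-matched class, which is the penalty for identifying too few clusters.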
The adjusted Rand index (ARI) is another commonly used metric for evaluating the performance of clustering algorithms [34]. While H compares clusters based on set matching, ARI assesses clusters by counting point pairs on which the two partitions agree or disagree. Moreover, ARI adjusts for the expected value of the unadjusted Rand index, which is determined by randomly selecting entries in the contingency table with fixed column and row totals.
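For completeness, the pair-counting ARI can be computed from the contingency table as follows (a standard sketch; libraries such as scikit-learn provide an equivalent `adjusted_rand_score`):

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """ARI = (index - expected) / (max_index - expected), pair-counting form."""
    a, b = np.asarray(a), np.asarray(b)
    ct = np.array([[np.sum((a == i) & (b == j)) for j in np.unique(b)]
                   for i in np.unique(a)])
    n = int(ct.sum())
    index = sum(comb(int(v), 2) for v in ct.ravel())     # agreeing-pair term
    sum_rows = sum(comb(int(v), 2) for v in ct.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in ct.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)          # chance agreement
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)
```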
Other cluster validity indices can be used to evaluate the goodness of different clustering structures, such as the connectivity index [35], the CS index [36], and the Sym index [37] (for more extensive details, see [38]).
Table 1 shows the results of the different algorithms applied to the iris dataset, where nine of the twelve methods recorded perfect scores for purity and entropy, despite only four algorithms identifying the correct number of clusters. For these data, WSR, mclust, and densityClust have the best misclassification rate H, and all have perfect entropy, purity, and ARI scores.
For the financial dataset, as shown in Table 2, the WSR and seven other algorithms share the joint lowest misclassification rates. The HDDC algorithm records the best scores for purity and entropy, but it identifies an incorrect number of clusters. Based on the ARI, WSR, K-means, MixtPPCA, PAM, KMD, FCM, and GG record the best scores. Finally, for the old faithful dataset, Table 3 shows that the WSR algorithm has the joint third lowest H, while the HDDC and GG algorithms are the best across all four metrics for this dataset.

Concluding Remarks
In this paper, we have introduced a new clustering method based on weighted spatial ranks. The WSRN algorithm is completely data-driven, and it both determines the number of clusters and classifies the data. As a nonparametric method, it requires no assumptions on the underlying distribution(s) of the data. The weighted rank contours, based on principal component analysis when the data have more than two dimensions, allow intuitive visualization of the cluster structure in relation to the distribution of the data points.
We considered nonparametric kernel weights and introduced WSRN functions based on Gaussian kernel weights. Compared to other standard approaches, the WSRN function based on Gaussian kernel weights provided the best results in terms of cluster detection and visualization. The weighted rank contours based on Gaussian weights were more accurate and provided the best fit to the shape of the cluster structure. They captured each observation and assigned it to the proper group with a minimal probability of misclassification. The method also performed competitively with other methods when clustering and classifying the data from the three real datasets.
Although the WSRN method is invariant under orthogonal transformations, it is not affine invariant. Using affine invariant ranks has the potential to improve the results when the scales of different clusters are not similar [39]. A further possible extension of the method would be to consider generalizations of the Euclidean norm for estimating the WSRN; thus, different Lp norms for p > 2 could be investigated to establish the optimal value of p.

Figure 1: Simulated data example: contour plots of (a) Euclidean distances, (b) Mahalanobis distances, (c) spatial ranks, and (d) spatial depth, based on 1000 random observations from the bivariate mixture normal distribution with two groups.

When d > 2:
(1) Let X_1, X_2, ..., X_n ∈ R^d be a d-dimensional random sample; use PCA to obtain the first two components C_1 and C_2 of that sample and construct the matrix C consisting of these two components.
(2) Treat C_1 and C_2 as the new variables and perform the steps of the weighted spatial ranks clustering algorithm for d = 2.


Figure 2: Simulated data example: contour plots of (a) the normed weighted spatial rank function WSRN^(1)_{F_n} and (b) the normed weighted spatial rank function WSRN^(2)_{F_n}, using Gaussian kernel weights based on 1000 random observations from the bivariate mixture normal distribution with two groups.

Figure 3: Simulated data 1: (a) scatterplot matrix of the PCA components, (b) the total variance explained by each component, (c) the weighted spatial ranks contour, (d) the contour at level 0.005, and the confirmatory plots based on the weighted ranks classifier for (e) the first 2 components and (f) the original data.

Figure 4:

Figure 5: Iris data: (a) scatterplot matrix of the PCA components, (b) the total variance explained by each component, (c) the normed weighted spatial ranks contour, (d) the contour at level 0.07, and the confirmatory plots based on the weighted ranks classifier for (e) the first 2 components and (f) the original data.

Figure 6:

Figure 7: Old faithful data: (a) scatterplot of the faithful data, (b) the normed weighted spatial ranks contour, (c) the contour at level 0.005, and (d) the confirmatory plots based on the weighted ranks classifier for the original data.

Table 1: Comparison of different clustering approaches applied to the iris dataset.
*Results are based on the mean of 1000 repetitions.

Table 2: Comparison of different clustering approaches applied to the financial dataset.