Clustering for Probability Density Functions by New k-Medoids Method

This paper proposes a novel and efficient clustering algorithm for probability density functions based on k-medoids. Further, a scheme for selecting good initial medoids is suggested, which reduces the computational time significantly. Also, a general proof of convergence of the proposed algorithm is presented. The effectiveness and feasibility of the proposed algorithm are verified and compared with various existing algorithms on both artificial and real datasets in terms of adjusted Rand index, computational time, and iteration number. The numerical results reveal an outstanding performance of the proposed algorithm as well as its potential applications in real life.


Introduction
Clustering plays a pivotal role in exploring the intrinsic structure of data, especially in data mining. Its main idea is to separate subgroups from an initial group such that objects in each subgroup are as similar as possible. Therefore, it aims to minimize the intracluster variation and to maximize the intercluster variation [1]. Cluster analysis is divided into two kinds: hard (crisp) clustering and soft (fuzzy) clustering [2]. For crisp clustering, the k-means and k-medoids algorithms are the typical ones [1].
The primary difference between these algorithms is the way they define the center of a cluster. In each iteration, k-means updates its center as the center of mass of each cluster, called the centroid. However, with this approach, k-means is well known to be sensitive to outliers despite its efficiency in computational time. To overcome this shortcoming, k-medoids clustering (KMC) is a good solution because this technique uses an object from the initial input as the reference point instead of the center of mass [2]. That is the reason why its centers are named medoids. Among numerous KMC algorithms, the partition around medoids (PAM), first proposed in [3], is known to be the most powerful. However, computational time is still a drawback of PAM when it is applied to large problems [4]. Therefore, in this paper, one robust but straightforward scheme is employed to address this difficulty. This scheme, which is inspired by [5], aims to find the most central objects to serve as initial medoids.
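To make the contrast between a centroid and a medoid concrete, the following minimal Python sketch uses a hypothetical one-dimensional cluster containing a single outlier. The data values and the function names `centroid` and `medoid` are illustrative assumptions and are not part of the proposed method.

```python
def centroid(points):
    """Arithmetic mean of 1-D points (the k-means style center)."""
    return sum(points) / len(points)


def medoid(points):
    """Member of `points` with the smallest total absolute distance to all members."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))


# Hypothetical cluster: four nearby values and one outlier (100.0).
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = centroid(cluster)    # pulled toward the outlier
center_medoid = medoid(cluster)    # stays at an actual, central data point
print(center_mean, center_medoid)
```

The mean is dragged far from the bulk of the data, whereas the medoid remains one of the original objects near the middle of the cluster.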
Historically, the usual objects of clustering have been discrete elements, on which a lot of work has been done, for example [6][7][8][9][10][11]. Nevertheless, with the fluctuation of data nowadays, it seems more appropriate to represent the data by series of numbers or functions rather than by a single point. This leads to considering probability density functions (pdfs) as another object of clustering besides discrete elements [12]. So far, some of the state-of-the-art works related to clustering for pdfs are as follows. Chen and Hung proposed a simple but effective automatic clustering algorithm for pdfs based on an ad hoc technique [13]. Besides, Nguyentrang and Vovan considered many approaches to the clustering problem in both hierarchical and nonhierarchical ways [12]. Among them, a remarkable work related to k-means for pdfs, called the nonhierarchical method, is proposed. Furthermore, Tai et al. applied an evolutionary technique to optimize the clustering solution [14].
Nevertheless, an overview of the works related to clustering for pdfs shows that no research has studied KMC for pdfs. Also, for a massive amount of data in the form of pdfs, the computational time should be taken into consideration. Therefore, on the one hand, this paper proposes a KMC algorithm for pdfs (KMCF) for the first time. On the other hand, the convergence of the KMCF algorithm is established.
Many numerical examples are performed to evaluate the robustness as well as the effectiveness of the proposed method. The numerical results of the KMCF algorithm are compared with those of existing algorithms in the literature. All results show the dominance of the proposed method in terms of both accuracy and computational time.
The remaining part of the paper is organized as follows. Section 2 presents some related theory and proposes an algorithm for clustering pdfs based on the k-medoids method. Section 3 proves the convergence of the proposed algorithm. Section 4 discusses the numerical results of the proposed algorithm and existing ones. Section 5 concludes the whole work.

Related Theory and the Proposed Algorithm
(ii) Each object definitely belongs to one cluster.
(iii) There is no common object between two clusters.
According to [15], the clustering problem is NP-hard when the number of clusters exceeds 3. In the case of the KMCF problem, the representatives, here called medoids, are objects in the initial input. Therefore, the set of the representing pdfs is defined as $F_m = \{f_{m_1}, f_{m_2}, \ldots, f_{m_k}\}$, and $F_m \subset F$ as a result. For more details, one example will be given.
Suppose that we have 4 pdfs estimated from an initial dataset. These pdfs are partitioned into 2 clusters, $C_1$ and $C_2$. By some technique, the clustering result is $C_1 = \{f_1, f_2\}$ and $C_2 = \{f_3, f_4\}$, where $f_1$ and $f_4$ are, respectively, the medoids of $C_1$ and $C_2$. Then, the partition matrix $W$ has an entry equal to 1 exactly where a pdf belongs to a cluster and 0 elsewhere, and the set of the medoids is $F_m = \{f_1, f_4\}$.

2.2. $L^1$-Distance. Addressing a clustering problem requires determining the similarity between elements or pdfs before grouping. This task can be handled by certain criteria such as distance, density, or shape [16]. In the field of clustering for pdfs, the $L^1$-distance, first proposed by Pham-Gia et al. [17], is one of the most common criteria used to evaluate the similarity between pdfs. This distance is primarily based on the maximum function to assess the level of proximity or separation between pdfs, which has many advantages as discussed in [18]. The definition of the $L^1$-distance is stated as follows.
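As a hedged numerical illustration, the $L^1$ distance between two pdfs can be taken as $\int |f_1 - f_2|\,dx$. Because $\max(a, b) = (a + b + |a - b|)/2$ and each pdf integrates to 1, this equals $2\left(\int \max(f_1, f_2)\,dx - 1\right)$, which shows how a maximum-function form arises. Whether the authors use exactly this normalization is an assumption; the two normal pdfs, the integration range, and the helper names below are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the univariate normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))


def f1(x):
    return normal_pdf(x, 0.0, 1.0)   # hypothetical pdf 1


def f2(x):
    return normal_pdf(x, 1.0, 1.0)   # hypothetical pdf 2


lo, hi = -12.0, 13.0                 # wide enough that both tails are negligible

# Absolute-difference form of the L1 distance.
l1_abs = integrate(lambda x: abs(f1(x) - f2(x)), lo, hi)

# Equivalent maximum-function form: 2 * (integral of max(f1, f2) - 1).
l1_max = 2.0 * (integrate(lambda x: max(f1(x), f2(x)), lo, hi) - 1.0)

print(l1_abs, l1_max)
```

Both forms agree up to numerical error, and the value lies between 0 (identical pdfs) and 2 (pdfs with disjoint supports).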
We see that problem $P$ is a nonconvex program in which a local minimum point need not be a global minimum. Based on the above notation, the proposed k-medoids clustering algorithm for pdfs (KMCF) is presented as follows.
2. Compute $V_j$ for each object $j$, following the scheme inspired by [5].
3. Sort the $V_j$ in ascending order. Select the first $k$ objects having the smallest values $V_j$ as the initial medoids. Then we have $k$ initial medoids $f_{m_j}^{(t)}$, $j = 1, \ldots, k$ ($f_{m_j}^{(t)}$ is the $j$th cluster center at the $t$th iteration).
4. Assign each object $f_i$, $i = 1, \ldots, n$, to the nearest medoid, which is equivalent to fixing the values of $w_{ij}$. Set $t = 1$.
5. Compute the sum of distances from all objects to their medoids, $g(W, F_m)$.
Assign each object $f_i$, $i = 1, \ldots, n$, to the nearest center, which is equivalent to fixing the values of $w_{ij}$.
Compute the sum of distances from all objects to their new medoids $F_m^{(t+1)}$.
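The initialization, assignment, and cost steps above can be sketched in Python as follows. The score used for $V_j$ here, $v_j = \sum_i d_{ij} / \sum_l d_{il}$, is the normalized-distance score from Park and Jun's simple and fast k-medoids algorithm; that it coincides with the authors' Step 2 is an assumption. The function names and the six one-dimensional points are hypothetical, absolute difference stands in for the distance between pdfs, and storing $W$ with one row per object is likewise an assumed convention.

```python
def initial_medoids(dist, k):
    """Return indices of the k objects with the smallest score v_j (assumed formula)."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]   # assumes no row sums to zero
    v = [sum(dist[i][j] / row_sums[i] for i in range(n)) for j in range(n)]
    return sorted(range(n), key=lambda j: v[j])[:k]


def assign(dist, medoids):
    """For each object, the position (in `medoids`) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def total_cost(dist, medoids, labels):
    """Sum of distances from every object to the medoid of its cluster."""
    return sum(dist[i][medoids[labels[i]]] for i in range(len(dist)))


# Hypothetical data: two well-separated groups on the real line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]

# The distance matrix is computed once and reused by every later step.
dist = [[abs(p - q) for q in points] for p in points]

medoids = initial_medoids(dist, k=2)
labels = assign(dist, medoids)
cost = total_cost(dist, medoids, labels)

# 0/1 partition matrix: W[i][c] == 1 exactly when object i is in cluster c.
W = [[int(labels[i] == c) for c in range(len(medoids))] for i in range(len(points))]
print(medoids, labels, cost)
```

On this toy input the score selects the two objects closest to the overall middle of the data, one from each group, which matches the stated aim of picking the most central objects.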
By the above proposed scheme in Step 1, the distance matrix is computed only once. Moreover, the method tends to select the $k$ most central objects as the initial medoids. As a result, the computational time is reduced significantly.

Convergence of the Proposed Algorithm
Proof of Lemma 2. Consider two points $W_1$ and $W_2$ and let $\gamma$ be any scalar such that $0 \le \gamma \le 1$; then $G(\gamma W_1 + (1 - \gamma) W_2) \ge \gamma G(W_1) + (1 - \gamma) G(W_2)$.
Therefore, $G$ is concave. Next, we show an important property of the constraint set (5).
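The concavity argument can be written out step by step under one assumption: that the objective, written here as $g(W, F_m)$, is linear in $W$ for fixed medoids, as is the case for a weighted sum of distances. The symbols $g$, $G$, and $\gamma$ are labels chosen for this sketch.

```latex
\begin{align*}
G(\gamma W_1 + (1-\gamma) W_2)
  &= \min_{F_m} g(\gamma W_1 + (1-\gamma) W_2,\, F_m) \\
  &= \min_{F_m} \bigl[\gamma\, g(W_1, F_m) + (1-\gamma)\, g(W_2, F_m)\bigr] \\
  &\ge \gamma \min_{F_m} g(W_1, F_m) + (1-\gamma) \min_{F_m} g(W_2, F_m) \\
  &= \gamma\, G(W_1) + (1-\gamma)\, G(W_2).
\end{align*}
```

The second line uses linearity in $W$, and the inequality holds because the minimum of a sum is at least the sum of the minima when $\gamma$ and $1 - \gamma$ are nonnegative.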
Lemma 3. Consider a set $S$ given by
The extreme points of $S$ satisfy constraint (5).
Proof. To visualize the proof of Lemma 3, suppose that $k = 3$ and the probabilities of $f_1$ belonging to the 3 clusters are 0.8, 0.1, and 0.1, respectively. Then, the pdf $f_1$ will be assigned to the first cluster due to the highest probability. Thus, 0.8 is one of the extremes of $S$ corresponding to pdf $f_1$. Moreover, this extreme point establishes a basis as $\{1, 0, 0\}$, which forms an identity matrix. Each basic variable receives the value 1 and every other variable the value 0. This completes the proof. Therefore, we have the following definition.
Definition 4. The reduced problem $RP$ of the problem $P$ is given as follows: minimize $G(W)$ subject to $W \in S$.
As the function $G$ is concave, there exists an extreme solution of the problem $RP$ which in turn satisfies the constraint set (5). Therefore, the following statement is given immediately.
Lemma 5. Problems $P$ and $RP$ are equivalent.
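Lemma 3 and Lemma 5 rest on the fact that a linear objective over a probability simplex attains its minimum at a vertex, that is, at a 0/1 row. The sketch below checks this numerically for one pdf and $k = 3$ clusters. The distances are hypothetical, and treating one row of $W$ as a point of the simplex (nonnegative weights summing to 1) is an assumption about how $S$ is defined.

```python
import random


def relaxed_cost(weights, distances):
    """Linear cost of one row of W: sum over clusters of weight * distance."""
    return sum(w * d for w, d in zip(weights, distances))


def random_simplex_point(k):
    """A random nonnegative weight vector of length k that sums to 1."""
    raw = [random.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical distances from one pdf to the medoids of k = 3 clusters.
distances = [0.4, 1.3, 2.1]

soft = [0.8, 0.1, 0.1]   # fractional membership, as in the proof's example
hard = [1.0, 0.0, 0.0]   # the corresponding extreme point (vertex)

soft_cost = relaxed_cost(soft, distances)
hard_cost = relaxed_cost(hard, distances)

# No sampled fractional row does better than the best vertex.
random.seed(0)
samples = [random_simplex_point(3) for _ in range(1000)]
never_beaten = all(relaxed_cost(w, distances) >= hard_cost - 1e-12 for w in samples)
print(soft_cost, hard_cost, never_beaten)
```

In this example the vertex chosen by the highest membership coincides with the nearest medoid, so hardening the row lowers the cost rather than raising it.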

Thus, the following two problems are defined in order to obtain the partial optimal solution. Problem $P_1$. Given $\hat{F}_m$, minimize $g(W, \hat{F}_m)$ subject to $W \in S$.
Then, the algorithm below generates the partial optimal solutions. It is therefore essential to restate the KMCF algorithm. Since the step to find $V_j$ for the object $j$ is unchanged, it is not shown here.
The Restated KMCF
1. Choose initial medoids based on the values of $V_j$ to get $F_m^{(0)}$; solve $P_1$ with $\hat{F}_m = F_m^{(0)}$; then $W^{(0)}$ is an optimal basic solution of problem $P_1$. Set $t = 0$. Denote $f_{m_j}^{(t)}$ as the $j$th cluster center at the $t$th iteration.
Theorem 6. The restated KMCF algorithm converges to a partial optimal solution of problem $P$ in a finite number of iterations.
Proof. First we show that an extreme point of $S$ is visited at most once by the algorithm before it stops. Assume that this is not true; that is, $W^{(t_1)} = W^{(t_2)}$ for some $t_1$, $t_2$, where $t_1 \ne t_2$. When applying step (ii), we get two optimal solutions $F_m^{(t_1+1)}$ and $F_m^{(t_2+1)}$ for $W = W^{(t_1)}$ and $W = W^{(t_2)}$, respectively; that is, $g(W^{(t_1)}, F_m^{(t_1+1)}) = g(W^{(t_2)}, F_m^{(t_2+1)})$ (9). However, the sequence $g(\cdot, \cdot)$ generated by the algorithm is strictly decreasing. That means (9) is false. Therefore, an extreme point of $S$ is visited at most once by the algorithm before it stops. Moreover, because there are a finite number of extreme points of $S$, the algorithm will reach the partial optimal solution after a finite number of iterations. Therefore, this guarantees the convergence of k-medoids type algorithms in general.
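The finite-termination argument can be observed on a toy instance. The following sketch alternates an assignment step (in the spirit of problem $P_1$) with a medoid-update step and records the objective after each improving iteration; it stops as soon as the objective fails to decrease strictly. The function name `alternate`, the stopping rule, the deliberately poor starting medoids, and the data are illustrative assumptions rather than the authors' exact restated algorithm.

```python
def alternate(dist, medoids):
    """Alternate assignment and medoid update; return final state and cost history."""
    n = len(dist)
    history = []
    while True:
        # Assignment step: each object goes to its nearest current medoid.
        labels = [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: in each cluster, pick the member minimizing total distance.
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i in range(n) if labels[i] == c]
            if not members:                     # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[i][m] for i in members)))
        cost = sum(dist[i][new_medoids[labels[i]]] for i in range(n))
        # Stop when the objective no longer strictly decreases.
        if history and cost >= history[-1]:
            return medoids, labels, history
        history.append(cost)
        medoids = new_medoids


# Hypothetical data and a poor start: both initial medoids lie in the left group.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in points] for p in points]

final_medoids, final_labels, history = alternate(dist, [0, 1])
print(final_medoids, final_labels, history)
```

Because the recorded objective strictly decreases and only finitely many 0/1 assignments exist, no assignment can recur, which is the mechanism behind Theorem 6.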
The expected value of the ARI for random partitions is zero, and it takes the value 1 for perfect agreement between two partitions. Therefore, the ARI is used in this paper to evaluate the results of the clustering algorithms.
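For reference, the ARI can be computed from the contingency table of two labelings. The sketch below uses the standard Hubert and Arabie pair-counting form, which is assumed to be the variant cited as [21]; the function name and the example labelings are hypothetical, and at least two objects are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same n >= 2 objects."""
    n = len(labels_true)
    # Pairs of objects that are together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that are together in each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                     # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Same partition with permuted cluster names: perfect agreement.
ari_perfect = adjusted_rand_index([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])

# One object placed in the wrong cluster: partial agreement.
ari_partial = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
print(ari_perfect, ari_partial)
```

The index does not depend on how clusters are named, which is why the permuted labeling still scores 1.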

Numerical Results
In this section, four datasets are used to evaluate the performance of the proposed algorithm. The first two are simulated data that have already been published in [13, 18]. The third is taken from the well-known CUReT dataset, which is available at http://www1.cs.columbia.edu/CAVE//software/curet. The final one is real data extracted from a video of the traffic situation at Ton Duc Thang University in Vietnam at a fixed moment. Besides, three other algorithms are also taken into account for comparison with the proposed algorithm. The first is the proposed algorithm with medoids chosen randomly, namely, the random k-medoids algorithm. Another is the modification of k-means for pdfs called the nonhierarchical approach [20]. The last is one of the state-of-the-art algorithms for pdfs, namely, self-update, or briefly SU. All the compared algorithms are given the suitable number of clusters in advance, except for SU. For the termination condition, epsilon is $10^{-3}$ in the case of SU; distance-based criteria are employed for the remaining cases. Further, to test stability, each algorithm is executed over 50 independent runs for every dataset and the average is reported as the final result. The performance of all algorithms is evaluated on three aspects: accuracy (ARI) [21], computational time (seconds), and iteration number. All the numerical results were produced in the 2015 version of Matlab on an Intel(R) Core(TM) i3-4005U CPU @ 1.70 GHz with 4 GB main memory in a Windows Server 2010 environment.
Example 1. In this example, the dataset is simple simulated data with a "well-behaved" class structure that has been well studied by previous algorithms in the field of clustering for pdfs. The data include seven univariate normal pdfs, as presented in Figure 1. The details of the estimated parameters can be found in [18]. From Figure 1, one can obtain the appropriate partition into three clusters. The clustering results of all compared algorithms are listed in Table 1. Concerning accuracy, the proposed algorithm and SU achieve perfect results with ARI = 1, followed by the random k-medoids algorithm and the nonhierarchical approach, respectively. Regarding the computational time and iteration number, the proposed algorithm ranks first among the four algorithms. Although both the proposed algorithm and SU obtain good accuracy, the proposed algorithm is still far superior to SU in computational time. Therefore, it can be concluded that the proposed algorithm performs best on the first dataset.
Example 2. In this example, the considered dataset is more complex due to a greater number of pdfs in two-dimensional space. Hence, an increase in the computational time is expected. In more detail, the data contain nine pdfs estimated by the bivariate $t$ distribution with $v$ degrees of freedom as described in [13]. All pdfs are shown in Figure 2. From the figure, one may find that the appropriate number of clusters is 3. The performance of all algorithms is reported in Table 2. A similar trend to the first example is observed in this case in terms of ARI. Besides, concerning the computational time and iteration number, the proposed method is a bit slower than the random k-medoids algorithm and the nonhierarchical approach. Moreover, despite its high precision, the comparable algorithm SU is still far inferior to the proposed algorithm regarding the computational time. So far, considering accuracy and computational pace, the proposed algorithm can be seen as the best candidate through the first two examples.

Example 3. In this example, we employ a large number of image objects to measure the robustness of the proposed algorithm. The dataset includes 114 texture images of size 640 × 480 pixels taken from the CUReT database. These objects are divided into two categories: 57 samples each of human skin and ribbed paper, as demonstrated in Figure 3.
Subsequently, Figure 4 illustrates the estimated pdfs of these images. This case appears more complicated to cluster due to the significant overlapping area and the large number of pdfs. The nominal partition follows the two categories above. From Table 3, it is clear that there is no change in the order of the algorithms with regard to ARI. In terms of computational time, the random k-medoids algorithm consumes the least time, in contrast to SU. Regarding the number of iterations, the nonhierarchical approach, rather than SU as in the previous cases, needs the most iterations to reach a satisfactory partition. Meanwhile, the two remaining algorithms need just one iteration to deduce the final result. Nevertheless, considering all surveyed perspectives, the proposed algorithm strikes a better balance among them than the others. Therefore, it can be said that the proposed algorithm has achieved an outstanding performance in this case.
Example 4. In this example, real data are considered for applying the proposed method. Most countries in Southeast Asia, including Vietnam, regularly deal with traffic congestion. This problem typically occurs during rush hour at busy public places. To study this situation, we extract images from a short daily video taken in front of Ton Duc Thang University, Ho Chi Minh City, Vietnam. In total, 116 images of size 1920 × 1080 pixels are taken into account. The no-traffic-jam group includes 46 photos and the traffic-jam group 70 photos. The file will be provided upon request. From these photos, the pdfs are estimated as shown in Figure 5. The nominal partition is $C_1 = \{f_1, f_2, \ldots, f_{46}\}$, $C_2 = \{f_{47}, f_{48}, \ldots, f_{116}\}$.
The results in Table 4 reveal that this dataset is not easy to tackle for any of the algorithms. For the first time, we see a reduction in ARI for all compared algorithms, with the exception of the proposed algorithm. In particular, SU reaches an ARI of only 0.84 instead of 1 as in the previous examples. A similar trend is witnessed in the ARI of the random k-medoids algorithm. Due to a greater number of pdfs, the computational time and the iteration number both increase for all algorithms. Although SU was always the most competitive algorithm before, it is defeated convincingly in the last two cases. Therefore, it can be concluded that the proposed method has good potential in real applications.

Across the 4 examples, all results of the proposed method are summarized in Table 5 together with its ranks on each criterion. Here, the rank ranges from 1st to 4th, corresponding to the total number of algorithms considered in the numerical part of the paper. From the table, the proposed method almost always ranks first in the accuracy of the deduced partition (ARI). Meanwhile, the other algorithms do not produce a final partition as good as that of the proposed method. This not only confirms the improved accuracy of the proposed method but also reveals its stability. A similar trend can be seen in the number of iterations of the proposed method. Despite some limitations in its computational time, the proposed method still performs best overall across all numerical examples compared with the three remaining algorithms.

Conclusion
In this paper, we have suggested a robust but straightforward algorithm for clustering pdfs based on k-medoids. By the nature of k-medoids clustering, the proposed method handles outliers better than clustering algorithms based on means. In addition, the recommended scheme speeds up the convergence of the proposed method

3.1. The Properties of Problem $P$. First, we define the reduced objective function of the problem $P$ as follows: $G(W) = \min_{F_m} g(W, F_m)$, where $W$ is any matrix of the same size as the partition matrix.
Lemma 2. The reduced objective function $G$ is a concave function.

Figure 1: Pdfs of seven univariate normal distribution functions.

Figure 4: Pdfs of 114 images consisting of human skin and ribbed paper in CUReT dataset.

Figure 5: Estimated pdfs of 116 photos taken from the short video at Ton Duc Thang University in daily time.

Table 4: Comparison of all algorithms in Example 4 (116 traffic images).