An Enhanced Wu-Huberman Algorithm with Pole Point Selection Strategy

The Wu-Huberman clustering algorithm is a typical linear-time clustering algorithm. It models the relationships among data points as an artificial “circuit” and then applies the Kirchhoff equations to obtain the voltage value at each point of the circuit. However, the performance of the algorithm depends crucially on the selection of pole points. In this paper, we present a novel pole point selection strategy for the Wu-Huberman algorithm (named the PSWH algorithm), which aims at preserving the merits and increasing the robustness of the algorithm. The pole point selection strategy filters the pole points by introducing a sparse rate. Experimental results demonstrate that the PSWH algorithm significantly improves clustering accuracy and efficiency compared with the original Wu-Huberman algorithm.


Introduction
Traditional data mining approaches can be divided into two categories [1]. The first is supervised learning, which aims to predict the labels of new data points from observed data-label pairs. Typical supervised learning methods include the support vector machine and decision trees. The second is unsupervised learning, whose goal is to organize the observed data points without labels. Typical unsupervised learning tasks include clustering [2] and dimensionality reduction [3]. In this paper, we focus on the clustering problem, which aims to divide data into groups of similar objects. From a machine learning perspective, clustering learns the hidden patterns of the dataset in an unsupervised way. From a practical perspective, clustering plays a vital role in data mining applications such as information retrieval, text mining, web analysis, marketing, and computational biology [4][5][6][7].
In the last decades, many methods [8][9][10][11][12] have been proposed for clustering. Recently, graph-based clustering has attracted much interest in the machine learning and data mining community [13]. The cluster assignments of the dataset can be obtained by optimizing some criteria defined on the graph. For example, spectral clustering is one of the most representative graph-based clustering approaches, and it aims to optimize some cut values (e.g., [14,15]) defined on an undirected graph. After some relaxations, these criteria can usually be optimized via eigendecompositions, and the solutions of the relaxed problems are guaranteed to be globally optimal. In this way, spectral clustering efficiently avoids the problems of the traditional $k$-means method.
Wu and Huberman proposed a clustering method based on the notion of voltage drops across the network [16]. The algorithm uses a statistical method to avoid the "poles problem" instead of solving it. The idea is to pick two poles at random, apply the algorithm to divide the graph into two communities, and repeat this many times. The algorithm then uses a majority vote to determine the communities [16]. However, in our experiments, we have found that the choice of the pole points affects the accuracy of some of the individual clusterings so seriously that the majority voting result is degraded. The specific details are presented in Section 4.1 (Figure 1).
To overcome these disadvantages of the Wu-Huberman algorithm, in this paper we first construct a graph from the data points. Then we propose a novel strategy for pole point selection. After that, we iteratively solve the Kirchhoff equations to perform clustering and obtain the clustering result. In this paper, we consider only the 2-community clustering case and leave the case of more than two clusters to future research.

Related Works
The Wu-Huberman algorithm treats the graph as an electric circuit. The purpose is to classify the points in the graph into two communities, that is, clusters. We denote a graph by $G = (V, E)$, where $V$ is the point set of the graph and $E$ is the edge set. The voltages of the points are denoted by $U_1, \dots, U_n$. Suppose points $A$ and $B$ are known to belong to different communities, $G_1$ and $G_2$, respectively. By solving the Kirchhoff equations, the voltage value of each point can be obtained, which lies between 0 and 1. Whether a point belongs to $G_1$ or $G_2$ is decided by the voltage value of the point [17]. The graph is regarded as an electric circuit by associating a unit resistance with each of its edges. Two of the nodes in the graph, assumed without loss of generality to be node 1 and node 2, are given a fixed potential difference. The Wu-Huberman method is based on an approximate iterative algorithm that solves the Kirchhoff equations for node voltages in linear time [16,18]. The Kirchhoff equations of an $n$-point circuit can be written as
$$U_1 = 1, \quad U_2 = 0, \quad U_i = \frac{1}{k_i} \sum_{j=1}^{n} a_{ij} U_j, \quad i = 3, \dots, n, \tag{1}$$
where $k_i$ is the degree of point $i$ and $a_{ij}$ is the adjacency matrix of the graph. After convergence, each community, that is, cluster, is defined as the set of nodes with a specific voltage value within a tolerance. The battery is attached to points 1 and 2, which are termed the pole points.
Because of the complexity, the algorithm does not solve the Kirchhoff equations exactly but rather iteratively. The algorithm initially sets $U_1 = 1$ and $U_2 = \cdots = U_n = 0$. In the first round, the algorithm updates the points from point 3 to the $n$th point in the following way. When the $i$th point is reached, its voltage is replaced by the average voltage of its $k_i$ neighbors according to (1). A round is finished when the algorithm reaches the last point $n$. After the updating process is repeated for a finite number of rounds, each point reaches a voltage value that approximately satisfies the Kirchhoff equations within a certain precision. Then the algorithm finds the communities by a threshold decision.
The Wu-Huberman algorithm inherits the advantages of graph-based clustering. The final cluster solutions are globally optimal. In particular, the running time of the algorithm is linear. However, the algorithm does not always work well [16]. Besides, one critical problem still seriously affects its accuracy and efficiency in real applications: both are greatly affected by the selected poles, that is, node 1 and node 2. Therefore, improving the method of selecting poles is of primary importance. In this paper, we present the PSWH algorithm, which improves the accuracy and effectiveness of the algorithm through a pole point selection strategy.
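The round-by-round update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed number of rounds, the threshold of 0.5, and the toy graph are all hypothetical, and the poles are passed in as indices rather than being fixed to points 1 and 2.

```python
def wu_huberman_voltages(neighbors, pole_high, pole_low, rounds=100):
    """Approximate the voltages of (1) by repeated neighbor averaging.

    neighbors[i] lists the points adjacent to point i (unit resistances).
    pole_high is held at voltage 1 and pole_low at voltage 0.
    """
    voltage = [0.0] * len(neighbors)
    voltage[pole_high] = 1.0
    for _ in range(rounds):
        for i, adjacent in enumerate(neighbors):
            if i == pole_high or i == pole_low or not adjacent:
                continue  # poles stay fixed; isolated points are skipped
            voltage[i] = sum(voltage[j] for j in adjacent) / len(adjacent)
    return voltage


def split_by_threshold(voltage, threshold=0.5):
    """Assign each point to community 1 (high voltage) or 0 (low voltage)."""
    return [1 if v >= threshold else 0 for v in voltage]


# Toy graph (assumed): triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
neighbors = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
voltage = wu_huberman_voltages(neighbors, pole_high=0, pole_low=5)
labels = split_by_threshold(voltage)
```

With one pole in each triangle, the voltages drop sharply across the bridging edge, so the threshold separates the two triangles.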

The PSWH Algorithm
Here $\sigma$ is a dataset-dependent parameter used in computing the similarity between data points.

The Pole Point Selection Strategy
The Wu-Huberman algorithm selects the pole points randomly. Based on extensive experiments, we find that the clustering results are very sensitive to the choice of pole points. The algorithm may produce wrong clustering results if inappropriate points are chosen as the poles. Figure 1 gives an intuitive illustration of this problem.
To solve this problem, we introduce the concept of "sparse points." A sparse point is a point whose diameter to its neighborhood is maximal. The existence of sparse points biases the final clustering results. An important observation from our experiments is that if sparse points are chosen as the pole points, the Wu-Huberman algorithm becomes less accurate. For this reason, the sparse points should not be selected as the pole points. Therefore, we propose the sparse rate defined in (2) to discriminate the sparse points from the others. Additionally, in order to exclude the impact of the distribution of similarity and degree, the average similarity of the neighbors and the similarity summation of the neighbors are taken into account in the sparse rate. In (2), $d_i$ is the maximum diameter between the $i$th point and its neighborhood, $d_i = \max (x_{jm} - x_{lm})^2$, $m = 1$ to $M$, where $x_j$ and $x_l$ are neighbors of $x_i$, $j$ and $l$ run from 1 to $k_i$, $M$ is the number of features of $x_i$, and $x_{jm}$ is the $m$th attribute feature of the $j$th neighbor of $x_i$. Here $s_i$ is the similarity (weight) summation of the neighborhood of $x_i$, $s_i = \sum_{j=1}^{k_i} w_{ij}$, $i = 1$ to $n$, and $\bar{s}_i$ is the average weight of the neighborhood of $x_i$, $\bar{s}_i = s_i / k_i$.
Figure 1(e) shows the sparse rate of each point in Figure 1. A point is identified as a sparse point if its sparse rate is significantly larger than those of most other points. Sparse points lie far from the other points, between two different clusters, so they should not be chosen as the pole points.
We define an extent to describe the allowed number of sparse points. For example, an extent of 5% in the two-moon example means that the allowed number of sparse points is the number of points times the extent, that is, $100 \times 5\% = 5$. In other words, we take the 5 points with the highest sparse rate as the sparse points. The specific experimental details are given in Section 4.1.
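The extent rule and the subsequent random choice of poles can be sketched as follows. This is an illustrative sketch only: the function names and the rounding of the allowed count are assumptions, and the sparse rates are taken as given inputs because the sketch does not compute (2).

```python
import random


def select_sparse_points(sparse_rate, extent):
    """Return the indices of the points with the highest sparse rate.

    The allowed count is (number of points) * extent, rounded to an
    integer (the rounding rule is an assumption of this sketch).
    """
    n = len(sparse_rate)
    allowed = int(round(n * extent))
    ranked = sorted(range(n), key=lambda i: sparse_rate[i], reverse=True)
    return set(ranked[:allowed])


def choose_poles(sparse_rate, extent, rng=random):
    """Pick two distinct poles at random among the non-sparse points."""
    sparse = select_sparse_points(sparse_rate, extent)
    candidates = [i for i in range(len(sparse_rate)) if i not in sparse]
    pole_a, pole_b = rng.sample(candidates, 2)
    return (pole_a, pole_b), sparse


# Hypothetical sparse rates for 100 points; five of them stand out.
rates = [0.1] * 100
for index, value in [(2, 0.9), (19, 0.8), (34, 0.7), (44, 0.95), (82, 0.6)]:
    rates[index] = value
poles, sparse = choose_poles(rates, extent=0.05, rng=random.Random(0))
```

With 100 points and an extent of 5%, exactly five points are excluded, and the two poles are drawn from the remaining 95.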

Iteratively Solving the Kirchhoff Equations
We illustrate the procedure for iteratively solving the Kirchhoff equations with an example. Suppose that, according to the results of (2), the pole points are the 1st and the $n$th points. The task is then to obtain the voltage value of each point other than the pole points, at which the voltage values are fixed. The voltage of each such point is set to the similarity-weighted average of the voltages of its neighbors. One round of updating ends when we have gone through the 2nd to the $(n-1)$th points. This process is repeated until the voltage values converge within a stable error range. In our experiments, we set 0.001 as the termination threshold of the iteration.
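A minimal sketch of this weighted iteration with a stopping rule is given below. Several details are assumptions rather than statements from the text: the update is read as $\sum_j w_{ij} U_j / \sum_j w_{ij}$ over the neighbors of point $i$, convergence is measured by the largest voltage change within a round, and the names `solve_voltages`, `tol`, and `max_rounds` are hypothetical.

```python
def solve_voltages(weights, pole_high, pole_low, tol=1e-3, max_rounds=100):
    """Iterate similarity-weighted neighbor averaging until it stabilizes.

    weights[i] maps each neighbor j of point i to the similarity w_ij.
    Returns the voltages and the number of rounds that were performed.
    """
    voltage = [0.0] * len(weights)
    voltage[pole_high] = 1.0
    for round_no in range(1, max_rounds + 1):
        largest_change = 0.0
        for i, row in enumerate(weights):
            total = sum(row.values())
            if i == pole_high or i == pole_low or total == 0.0:
                continue  # pole voltages are fixed; skip unconnected points
            updated = sum(w * voltage[j] for j, w in row.items()) / total
            largest_change = max(largest_change, abs(updated - voltage[i]))
            voltage[i] = updated
        if largest_change < tol:
            return voltage, round_no
    return voltage, max_rounds


# Toy chain 0 - 1 - 2 - 3 with a weak middle link (assumed weights).
weights = [{1: 1.0}, {0: 1.0, 2: 0.1}, {1: 0.1, 3: 1.0}, {2: 1.0}]
voltage, rounds = solve_voltages(weights, pole_high=0, pole_low=3)
```

In this toy chain the weak link between points 1 and 2 carries almost the whole voltage drop, which is the behavior a threshold decision relies on.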

The Procedure of the PSWH Algorithm
Input. Dataset $X = \{x_i\}_{i=1}^{n}$ and the neighborhood size $K$.
Output. The cluster membership of each data point.

Procedure
Step 1: construct the $K$-nearest-neighbor graph.
Step 2: compute the sparse rate using (2) and apply the extent to determine the sparse points. Then exclude the sparse points from the graph and randomly choose two other points as the pole points.
Step 3: obtain the voltage value of each data point based on (1).
Step 4: output the cluster assignment of each data point.

Experimental Results
In this section, we use the well-known two-moon example to illustrate the effectiveness of the PSWH algorithm. The original dataset is a standard benchmark for machine learning algorithms [19] and is generated according to a pattern of two intertwining crescent moons. This benchmark is available online at http://www.ml.uni-saarland.de/GraphDemo/GraphDemo.html. In the experiments, Gaussian noise with mean 0 and variance 0.01 has been added. The number of data points is set to 100 for the two moons.
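For readers who want a comparable dataset without the GraphDemo benchmark, a two-moon sample can be sketched as below. This is not the benchmark generator: the moon geometry (two unit semicircles with the lower one shifted by (1, 0.5)), the even split between moons, the function name, and the seed are all assumptions. Only the point count of 100 and the noise variance of 0.01 come from the text.

```python
import math
import random


def two_moons(n_points=100, noise_variance=0.01, seed=0):
    """Sample two interleaved crescents with additive Gaussian noise."""
    rng = random.Random(seed)
    std = math.sqrt(noise_variance)  # gauss() expects a standard deviation
    half = n_points // 2
    points, labels = [], []
    for index in range(n_points):
        upper = index < half
        angle = math.pi * rng.random()
        if upper:
            x, y = math.cos(angle), math.sin(angle)
        else:
            x, y = 1.0 - math.cos(angle), 0.5 - math.sin(angle)
        points.append((x + rng.gauss(0.0, std), y + rng.gauss(0.0, std)))
        labels.append(0 if upper else 1)
    return points, labels


points, labels = two_moons()
```

The returned labels record which moon each point was drawn from, so they can serve as ground truth when measuring clustering accuracy.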

Pole Points' Influence on the Clustering Accuracy.
In the Wu-Huberman algorithm, the choice of the pole points significantly affects the clustering results. Taking the two-moon dataset as an example, we set $\sigma$ to 0.5 and $K$ to 5. In Figure 1(e), the sparse points are the 3rd, 20th, 35th, 45th, and 83rd points. To improve the clustering accuracy, we do not choose the sparse points as the poles. When the non-sparse 22nd and 86th points are chosen as the poles (Figures 1(a) and 1(b)), the clustering accuracy is 100%. In contrast, when the sparse 45th point is chosen as a pole, Figure 1(c) illustrates that the clustering accuracy is low no matter what threshold is chosen. That is to say, the choice of the poles has a great effect on the clustering results.

Pole Points' Influence on the Number of Iterations.
In the experiment, we find that the choice of the pole points has an impact on the number of iterations. The two-moon dataset is again taken as an example. All of the experiments are conducted under the same parameter settings: $\sigma = 0.5$, the iteration error tolerance is 0.001, and the maximum number of iterations is 100. We first construct the KNN ($K = 5$) graph of the original dataset. Then the degree of each point is computed and displayed in Figure 2(b). Next, we obtain the sparse rate of each point based on the degree distribution, which is the same as in Figure 1(e). Finally, we choose the poles based on the sparse rate, compute (1) to obtain the voltage value of each point, and display the number of iterations of each point in Figures 2(c) and 2(d), respectively, for different choices of poles.
From Figure 2, we can conclude that a greater degree of the poles corresponds to a larger number of iterations until convergence. Therefore, in order to decrease the number of iterations of the algorithm, we should choose the points with smaller sparse degree as the poles. The clustering accuracy in Figure 2 is 100%.

Comparison with Other Algorithms.
We compare the PSWH algorithm with other algorithms on datasets from the UCI repository, which is available at http://archive.ics.uci.edu/ml/.
From Table 1, we find that the PSWH algorithm does slightly better than the other algorithms on most datasets. However, on some datasets, the accuracy of the PSWH algorithm is lower than that of the LCLGR algorithm. Nevertheless, the complexity of the PSWH algorithm is linear, which is lower than that of the LCLGR algorithm. Therefore, overall, the PSWH algorithm compares favorably with the others.

Conclusions and Future Work
In this paper, we propose the PSWH algorithm for enhancing the clustering accuracy and efficiency of the Wu-Huberman algorithm, which extends the applicability and increases the robustness of the algorithm. The concept of sparse points and a selection procedure are presented to obtain suitable pole points for the algorithm. The experimental results show that the PSWH algorithm is effective and stable when applied to clustering problems. In the future, we will give a theoretical analysis of the new algorithm and apply it to more general and larger datasets. Furthermore, we will try to extend the new algorithm to text, image, and video retrieval.

Figure 1: Clustering results of the Wu-Huberman algorithm for the two-moon pattern with different pole point selections. (a) The distribution of the voltage values when the 22nd and 86th points are chosen as the poles. (b) The clustering results corresponding to (a). (c) The distribution of the voltage values when the 45th and 86th points are chosen as the poles, with the same dataset, algorithm, and parameters as in (a). (d) The clustering results corresponding to (c). (e) The graph used for determining the pole points. The $x$-axis is the data point number, and the $y$-axis is the value of the sparse rate.

Figure 2: Different pole points applied in the Wu-Huberman algorithm lead to different numbers of iterations until convergence. (a) The KNN ($K = 5$) graph. (b) The degree distribution graph. (c) The number of iterations (vertical axis) when the poles are the 2nd point (degree 8) and the 77th point (degree 11). (d) The number of iterations (vertical axis) when the poles are the 5th point (degree 6) and the 56th point (degree 5). In (c) and (d), the $x$-axis represents the data points and the $y$-axis represents the number of iterations.
3.1. Graph Construction. Let $G = (V, E)$ be an undirected graph with point set $V = \{v_1, \dots, v_n\}$ and edge set $E \subseteq V \times V$. The degree of point $v_i \in V$ is defined as $k_i$, which is the number of edges connected to $v_i$. Constructing the $K$-nearest-neighbor graph models the local neighborhood relationships between the data points. Given data points $x_1, \dots, x_n$, we link $x_i$ and $x_j$ with an undirected edge if $x_i$ is among the $K$ nearest neighbors of $x_j$ or if $x_j$ is among the $K$ nearest neighbors of $x_i$. We define $x_i$ and $x_j$ to be adjacent if $x_i \in N(x_j)$ or $x_j \in N(x_i)$, where $N(x_i)$ and $N(x_j)$ are the neighborhoods of $x_i$ and $x_j$, respectively. $w_{ij}$ is the similarity between $x_i$ and $x_j$, and its computation depends on a dataset-dependent parameter $\sigma$.
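The linking rule of the graph construction can be sketched as follows. The Gaussian similarity $\exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$ used here is an assumption standing in for the similarity formula, which is not reproduced in the text; the function name and the toy points are likewise hypothetical.

```python
import math


def knn_graph(points, k, sigma):
    """Build a symmetric K-nearest-neighbor graph with similarity weights.

    Points i and j are linked if either one is among the k nearest
    neighbors of the other. Returns weights[i] = {j: w_ij}.
    """
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    n = len(points)
    weights = [dict() for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: squared_distance(points[i], points[j]),
        )
        for j in others[:k]:
            # Assumed Gaussian similarity; stored in both directions so
            # that the "or" rule yields an undirected edge.
            d2 = squared_distance(points[i], points[j])
            w = math.exp(-d2 / (2.0 * sigma ** 2))
            weights[i][j] = w
            weights[j][i] = w
    return weights


# Four points on a line (assumed): the last one is far from the rest.
toy_points = [(0.0,), (1.0,), (2.0,), (10.0,)]
toy_weights = knn_graph(toy_points, k=1, sigma=0.5)
degrees = [len(row) for row in toy_weights]
```

Because an edge is added whenever either endpoint lists the other among its nearest neighbors, a point can end up with more than `k` neighbors, which is why the degree varies from point to point.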

Table 1: Comparison with other algorithms on the clustering accuracy.