Community detection in social networks plays an important role in cluster analysis. Many traditional techniques for one-dimensional problems have proven inadequate for high-dimensional or mixed-type datasets because of data sparseness and attribute redundancy. In this paper we propose a graph-based clustering method for multidimensional datasets. The method has two distinguishing features: a non-binary hierarchical tree and multi-membership clusters. The non-binary hierarchical tree clearly highlights meaningful clusters, while the multi-membership feature may support more useful service strategies. Experimental results on a customer relationship management dataset confirm the effectiveness of the new method.
A social network is a set of people or groups, each of which has connections of some kind to some or all of the others. Although the general concept of social networks seems simple, the underlying structure of a network implies a set of characteristics typical of all complex systems. Social networks play an extremely important role in many systems and processes and have been intensively studied over the past few years in order to understand both local phenomena, such as clique formation and its dynamics, and network-wide processes, for example, the flow of data in computer networks [
Clustering analysis is a data mining technique developed for the purpose of identifying groups of entities that are similar to each other with respect to certain similarity measures. Many different clustering methods have been proposed and used in a variety of fields. Jain [
In recent years, community detection based on clustering has become a growing research field, partly as a result of the increasing availability of a huge number of real-world networks. The most intuitive and common definition of community structure is that such networks appear to contain communities: subsets of vertices within which vertex-vertex connections are dense, but between which connections are relatively sparse. Yang and Luo [
Most traditional community detection algorithms based on clustering are limited to handling one-dimensional datasets [
With these considerations, in this paper we first introduce two pretreatment methods for multidimensional and mixed-type data, followed by a new clustering approach for community detection in social networks. In this approach, individuals and their relationships are represented by weighted graphs, and the graph density we define gives a better quantitative description of the overall correlation among individuals in a community, so that a reasonable clustering output can be produced. In particular, our method produces trees with a simple hierarchy and allows for fuzzy (overlapping) clusters, which distinguishes it from other methods. To verify the effectiveness of our method, we carried out a preliminary evaluation on a mobile customer segmentation use case; the numerical results provide supporting evidence for further improvement and application.
The rest of the paper is organized as follows. In Section
The detection of communities has brought about significant advances in the understanding of many real-world complex networks. Plenty of detection algorithms and techniques have been proposed, drawing on methods and principles from many different areas, including physics, artificial intelligence, graph theory, and even electrical circuits [
Random walks have also been successfully used in finding network communities [
Spectral clustering techniques have seen an explosive development and proliferation over the past few years [
Many other community detection algorithms have also been proposed in the recent literature. For example, Wu and Huberman [
Traditional distance functions include Euclidean distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, weighted Minkowski distance, and cosine distance. Among these, Mahalanobis distance is based on correlations between variables, by which different patterns can be identified and analyzed. It gauges the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the dataset and is scale-invariant; in other words, it is a multivariate effect size.
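To make the contrast concrete, the following Python sketch (with an illustrative, hand-picked covariance matrix) shows how Mahalanobis distance discounts deviations along high-variance directions, whereas Euclidean distance treats all directions equally.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y under covariance matrix cov."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Covariance with 4x the variance along the first axis: a deviation of
# 2 units there counts the same as 1 unit along the second axis.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
print(np.linalg.norm([2.0, 0.0]))        # Euclidean distance: 2.0
print(mahalanobis([2, 0], [0, 0], cov))  # Mahalanobis distance: 1.0
```

With a non-diagonal covariance the same mechanism also discounts differences that merely follow the correlation structure of the data, which is what makes the measure scale-invariant.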
All these distance functions have their own advantages and disadvantages in practical applications. Some research results show that Euclidean distance performs better in vector models, while other numerical examples in high-dimensional spaces show that the farthest and nearest distances become almost equal when Euclidean distance is used to measure the similarity between data points. That is, in high-dimensional data, traditional similarity measures as used in conventional clustering algorithms are usually not meaningful. This problem and related phenomena require adapting clustering approaches to the nature of high-dimensional data, and this area of research has been highly active in recent years. Common approaches are known as, for example, subspace clustering, projected clustering, pattern-based clustering, or correlation clustering. Subspace clustering is the task of detecting all clusters in all subspaces, which means that a point might be a member of multiple clusters, each existing in a different subspace; subspaces can be either axis-parallel or affine. Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces; the general approach is to use a special distance function together with a regular clustering algorithm. Correlation clustering provides a method for clustering a set of objects into the optimal number of clusters without specifying that number in advance.
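The distance-concentration phenomenon described above is easy to observe empirically. The short Python sketch below (point counts, dimensions, and the random seed are arbitrary choices) prints the farthest-to-nearest Euclidean distance ratio from a query point to uniform random points as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Farthest-to-nearest Euclidean distance ratio from a random query point
# to 1000 uniform random points, for increasing dimensionality.
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(dim, dists.max() / dists.min())
```

The ratio shrinks toward 1 as the dimension grows, which is exactly why raw Euclidean distance loses its discriminating power in high-dimensional spaces.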
In 2011, a new function “Close()” was presented, based on an improvement of traditional algorithms, to compensate for their inadequacy in high-dimensional spaces [
It depicts the similarity degree between two data points and has the following properties.
The minimum value of the function is 0, which means that the similarity degree between
The maximum value of the function is 1, which means that the similarity degree between
Similar to the weighted operator in traditional distance functions, the close function can be corrected as
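Since the exact formula of Close() is not reproduced above, the Python sketch below uses an exponential stand-in with the stated properties: it equals 1 for identical points, approaches 0 as points separate, and admits per-attribute weights in the spirit of the weighted correction. The functional form is an assumption for illustration, not the cited definition.

```python
import numpy as np

def close(x, y, weights=None):
    """Illustrative similarity in [0, 1]: 1 for identical points, tending
    to 0 as points separate.  The exponential form is only a stand-in for
    the Close() function of the cited paper."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, float)
    # Weighted Euclidean distance mapped monotonically into (0, 1].
    return float(np.exp(-np.sqrt(np.sum(w * (x - y) ** 2))))

print(close([1, 2], [1, 2]))    # identical points -> 1.0
print(close([0, 0], [3, 4]))    # distance 5 -> exp(-5), close to 0
```

Any strictly decreasing map from distance to (0, 1] would serve equally well here; the weights play the same role as the weighting operator in traditional distance functions.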
For clustering multi-attribute datasets, we first introduce a method for measuring the similarity between items as follows [
Consider
For the pure categorical datasets, the similarity can be defined as
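For intuition, a purely categorical similarity is often taken as the fraction of attributes on which two records agree (simple matching). The sketch below is a generic version of that idea and not necessarily the paper's exact definition, which may weight attributes differently.

```python
def categorical_similarity(a, b):
    """Simple-matching similarity for purely categorical records: the
    fraction of attributes on which the two records agree.  A common
    generic choice, standing in for the paper's definition."""
    if len(a) != len(b):
        raise ValueError("records must have the same attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two of three attributes agree (package type and payment mode are
# hypothetical attribute names used only for this example).
print(categorical_similarity(("B", "urban", "prepaid"),
                             ("B", "rural", "prepaid")))  # 2/3
```

The value lies in [0, 1], with 1 for identical records and 0 for records that differ on every attribute, matching the range of the numerical similarity above.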
A graph or network is one of the most commonly used models to represent real-valued relationships among a set of input items. Since many traditional techniques for one-dimensional problems have proven inadequate for high-dimensional or mixed-type datasets due to data sparseness and attribute redundancy, the graph-based clustering method for single-dimensional datasets proposed in [
Let
For a subgraph
In single weighted graph
Clustering is a process that detects all dense subgraphs in
A heuristic process is applied here for finding all quasi-cliques with densities of various levels. The core of the algorithm is deciding whether or not to add a vertex to an already selected dense subgraph
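As an illustration of such a heuristic, the Python sketch below grows a dense subgraph from a seed edge, accepting a vertex only while a weighted density stays above a threshold. Both the density definition used here (total edge weight divided by the number of vertex pairs) and the greedy acceptance rule are assumptions standing in for the paper's exact criterion.

```python
import itertools

def density(graph, vertices):
    """Weighted density of the sub-graph induced by `vertices`: total
    edge weight over the number of vertex pairs (assumed definition)."""
    vs = list(vertices)
    if len(vs) < 2:
        return 0.0
    total = sum(graph.get(frozenset(p), 0.0)
                for p in itertools.combinations(vs, 2))
    return total / (len(vs) * (len(vs) - 1) / 2)

def grow_dense_subgraph(graph, seed_edge, threshold):
    """Greedy heuristic: starting from a seed edge, add a vertex only
    while the density stays >= threshold (illustrative rule)."""
    cluster = set(seed_edge)
    candidates = {v for e in graph for v in e} - cluster
    changed = True
    while changed:
        changed = False
        for v in sorted(candidates):
            if density(graph, cluster | {v}) >= threshold:
                cluster.add(v)
                candidates.remove(v)
                changed = True
                break
    return cluster

# Toy weighted graph: a dense triangle {a, b, c} plus a weak pendant d.
g = {frozenset({"a", "b"}): 0.9, frozenset({"b", "c"}): 0.8,
     frozenset({"a", "c"}): 0.7, frozenset({"c", "d"}): 0.1}
print(sorted(grow_dense_subgraph(g, ("a", "b"), threshold=0.5)))
# adding d would drop the density from 0.8 to about 0.42, so it is rejected
```

Running the heuristic from every sufficiently heavy edge, at decreasing threshold levels, yields quasi-cliques of various densities, which is the spirit of the procedure described above.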
In short, the main steps of our algorithm can be described as shown in Algorithm
inclusion relation.
While
begin
determine the value of
for each edge in
community
Merge two communities according to their common vertices;
Contract each community to a vertex and redefine the weights of the corresponding edges.
Store the resulting graph in
End.
Trace the process of each vertex, and obtain the hierarchical tree.
Our detailed community detection algorithm that can find
Pick
If, for every
If Inequality (
If
Trace the movement of each vertex and generate the hierarchical tree.
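The contraction step of the algorithm above (contract each community to a super-vertex and redefine the edge weights) can be sketched as follows. Summing the cross-community edge weights is an illustrative assumption, and for simplicity this sketch assigns each vertex to a single community, ignoring multi-membership.

```python
def contract(graph, communities):
    """Contract each community to a super-vertex and re-weight the edges
    between super-vertices as the sum of the original cross-community
    edge weights (the sum rule is an assumed re-weighting)."""
    owner = {v: i for i, c in enumerate(communities) for v in c}
    contracted = {}
    for (u, v), w in graph.items():
        cu, cv = owner[u], owner[v]
        if cu != cv:                         # internal edges disappear
            key = (min(cu, cv), max(cu, cv))
            contracted[key] = contracted.get(key, 0.0) + w
    return contracted

# Two communities {a, b} and {c, d}; the b-c and a-c edges collapse
# into a single super-edge carrying their combined weight.
g = {("a", "b"): 0.9, ("b", "c"): 0.3, ("c", "d"): 0.8, ("a", "c"): 0.2}
print(contract(g, [{"a", "b"}, {"c", "d"}]))
```

Iterating this contraction and recording which super-vertex each original vertex moves into is exactly the trace that yields the hierarchical tree.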
If the input data is an unweighted graph
In order to validate the feasibility of the proposed approach on multidimensional datasets, we randomly took 3000 customers’ consumption lists of August 2012 from Shandong Mobile Corporation and used the new approach to divide these customers into distinct clusters according to four evaluation indices: local call fee, long distance call fee, roaming fee, and text message and WAP fee. The original data of the 3000 customers are listed in Table
Some information of 3000 mobile customers.
Customer number   Local call fee (Yuan)   Long distance call fee (Yuan)   Roaming fee (Yuan)   Text message and WAP fee (Yuan)   Package type
1        55.3    13.7   120.6   14.2   D
2       132      44.8    36.2    5.6   B
3        47.1   233.6    79.4    6.2   B
4       173      19.3    87.5   19.3   C
5        23.7    80.5    21      9     A
6        62.3    62.9    77.8   10.6   E
7       242.5    21.8    23.5   24.2   A
8       166.2    34.5     8     19.5   C
⋮
3000     77.6    67      21.2   24.7   D
We have applied our approach to this problem, and the results of segmentation and their average consumption are listed in Table
The customer segmentation of mobile network.
Cluster number   Average local call fee (Yuan)   Average long distance call fee (Yuan)   Average roaming fee (Yuan)   Average text message and WAP fee (Yuan)   Number of customers
1   156.9   172.8    39.8   58.5    121
2   299.1    43.2    38.7   46.9     64
3    42.6    32.9   174.7   36.2    168
4   212.8   103.3   574.3   39.7     13
5   187.9   871.5    35.3   28.7      9
6   162.1   262.3   354.8   21.2     12
7    43.0    25.8    13.7   21.2   2077
8    19.2     7.5     4.8   13.5    792
Average consumption list of 8 Groups.
As we can see from the clustering results, the long distance fee of Group 1 accounts for a high proportion of its total expenses; Groups 3 and 4 have high roaming fees; Group 8 has lower costs on every index; and Groups 2, 3, and 4 have higher text message and WAP fees. Mobile corporations can design corresponding policies according to the clustering results. For example, for the customers in Groups 2, 3, and 4, a mobile corporation could provide discounted text message packages; for the customers in Groups 3, 4, and 6, discounted roaming packages would also help to increase customer loyalty and stability.
On the other hand, we noticed that the sum of the last column of the segmentation table, 121 + 64 + 168 + 13 + 9 + 12 + 2077 + 792 = 3256, exceeds the 3000 customers in the input. This reflects the multi-membership feature of our method: some customers belong to more than one cluster.
In this paper, a new graph-based clustering method for multidimensional datasets is proposed. Due to the inherent sparsity of data points, most existing clustering algorithms do not work efficiently for multidimensional datasets, and it is not feasible to find interesting clusters in the original full space of all dimensions. Previous research mainly focused on representing a set of items with a single attribute, which cannot accurately represent all the attributes or capture the inherent dependency among multiple attributes. The new clustering method proposed in this paper overcomes this problem by directly clustering items according to the multidimensional information. Since it does not need data preprocessing, the new method may significantly improve clustering efficiency. It also has two distinguishing features: a non-binary hierarchical tree and multi-membership clusters. The application to customer relationship management demonstrates the efficiency and feasibility of the new clustering method.
Peixin Zhao, CunQuan Zhang, Di Wan, and Xin Zhang certify that there is no actual or potential conflict of interest in relation to this paper.
The first author is partially supported by the China Postdoctoral Science Foundation funded Project (2011M501149), the Humanity and Social Science Foundation of the Ministry of Education of China (12YJCZH303), the Special Fund Project for Postdoctoral Innovation of Shandong Province (201103061), the Informationization Research Project of Shandong Province (2013EI153), and the Independent Innovation Foundation of Shandong University, IIFSDU (IFW12109). The second author is partially supported by an NSA Grant H982301210233 and an NSF Grant DMS1264800.