A Fuzzy Co-Clustering Algorithm via Modularity Maximization

In this paper we propose a fuzzy co-clustering algorithm via modularity maximization, named MMFCC. Its objective function uses the modularity measure as the criterion for co-clustering object-feature matrices. After being converted into a constrained optimization problem, the objective is optimized by an iterative alternating procedure. MMFCC offers several advantages: it directly produces a block-diagonal matrix with an interpretable description of the resulting co-clusters, and it automatically determines the appropriate number of final co-clusters. Experimental studies on several benchmark datasets demonstrate that, in terms of accuracy, MMFCC yields higher-quality co-clusters than competing fuzzy co-clustering and crisp block-diagonal co-clustering algorithms.


Introduction
Co-clustering, also called biclustering or block clustering, aims to group a set of objects and a set of features simultaneously, and has found increasingly wide use in many fields. In the bioinformatics domain, co-clustering methods have been used to simultaneously cluster genes and conditions [1]. Hussain and Ramazan [2] drew on the theory of co-clustering text data and proposed a new algorithm based on weighted similarity, which generates two co-similarity matrices, one between genes and one between conditions. In collaborative filtering, co-clustering techniques are commonly used to simultaneously cluster users and items [3]. Honda et al. [4] utilized a sequential weighted co-clustering approach to find user-item co-clusters and recommend the proper items to users according to the memberships of users and items. In text mining, co-clustering algorithms group documents and words simultaneously [5]. Tjhi and Chen [6] proposed a new fuzzy co-clustering algorithm for documents and words based on Ruspini's condition. Their proposal remedies a drawback of some previous co-clustering algorithms concerning the constraint on word memberships, which caused a failure to obtain practical and valuable natural fuzzy word clusters.
Fuzzy co-clustering extends co-clustering by adding a fuzzy membership function. It therefore simultaneously keeps the advantages of co-clustering and fuzzy clustering, including dimensionality reduction, interpretable clusters, and improved accuracy [7]. In a fuzzy co-clustering framework, the fuzzy relationship is represented by the captured degrees to which objects, as well as features, belong to each cluster. Based on this fuzzy relationship, an objective function is designed, which is minimized or maximized through multiple iterations until convergence. Oh et al. [8] proposed the well-known fuzzy clustering algorithm for categorical multivariate data (FCCM), which maximizes the co-occurrence of categorical attributes and individual patterns in clusters. Kummamuru et al. [9] modified the FCCM algorithm so that it can be used to cluster large text corpora. Hanmandlu et al. [10] presented a novel fuzzy co-clustering algorithm for color image segmentation, named FCCI, which formulates an objective function containing a multidimensional distance function as the dissimilarity measure and entropy as the regularization term. Forsati et al. [11] introduced a new fuzzy co-clustering approach based on a genetic algorithm, computed the similarity between web pages and users, and proposed recommendations for hybrid recommender systems. These works show that fuzzy co-clustering can often achieve higher clustering accuracy, because its fuzzy mathematical method captures the fuzzy relationships among samples.
Although fuzzy co-clustering algorithms operate simultaneously on both the object and feature dimensions, the co-clusters they generate cannot be intuitively illustrated via graphs. To address this issue, some block-diagonal co-clustering techniques have been proposed that generate a block-diagonal matrix, which has the advantage of directly producing interpretable descriptions of the resulting document clusters [12].
Aykanat et al. [13] proposed bipartite graph and hypergraph models that transform linear programming problems into block-diagonal form and obtain good clustering performance on sparse datasets. Laclau and Nadif [14] proposed hard and fuzzy diagonal co-clustering algorithms based on double K-means, which can process sparse binary and continuous data effectively and generate a block-diagonal data matrix by minimizing a criterion based on the intraclass variance and the centroid effect. Dhillon [15] described the co-clustering problem as a bipartite graph partitioning problem and proposed a new spectral co-clustering algorithm, Spec, aiming to find minimum-cut vertex partitions in a bipartite graph between documents and words.
Some researchers have proposed block-diagonal clustering approaches based on modularity. Compared with non-diagonal co-clustering algorithms such as ITCC [16], an information-theoretic divisive algorithm, these approaches can generate more descriptive and significant co-clusters. For example, Labiod and Nadif [17] proposed a generalized modularity measure and a spectral approximation of the modularity matrix. Ailem et al. [18] proposed a novel block-diagonal co-clustering algorithm named CoClus, which maximizes the modularity using an iterative alternating procedure and has proved effective. Experimental results for these existing algorithms show that methods using modularity maximization as the criterion perform well in document clustering. Furthermore, modularity-based clustering algorithms help to easily determine the appropriate number of final (co-)clusters, which is one of the troublesome issues in clustering.
However, these existing algorithms, including CoClus, are all hard clustering algorithms. In the field of clustering, fuzzy (co-)clustering, which allows each object to belong to more than one cluster, is thought to be more consistent with human thinking than hard (co-)clustering. Therefore, in this paper we propose a block-diagonal fuzzy co-clustering algorithm via modularity maximization, named MMFCC. This algorithm not only evaluates clustering quality by applying modularity to the fuzzy co-clustering technique, but also enhances the validity and accuracy of the final clustering result by introducing fuzzy membership degrees.
The main contributions of this paper include the following: (i) we propose a novel fuzzy co-clustering algorithm that uses the modularity measure as its criterion; (ii) MMFCC readily produces a block-diagonal matrix, which can be intuitively illustrated via graphs; (iii) we implement MMFCC on several real benchmark datasets, and the experimental results show that it achieves higher accuracy than the comparative algorithms; (iv) MMFCC helps determine the appropriate number of final co-clusters.
The rest of this paper is organized as follows. Section 2 reviews the application of modularity in the fields of community detection and co-clustering. Section 3 presents our MMFCC approach. In Section 4, we demonstrate the effectiveness of MMFCC through experiments on eleven benchmark datasets. Finally, conclusions and future work are given in Section 5.

Modularity
The maximization of modularity is known as an efficient and meaningful method for indicating and evaluating the partitioned community structure of a network [19], where the sizes of the communities are not usually preestablished, which is similar to the mechanism of clustering in unsupervised learning. Given this, modified modularity has recently been extended to text mining as a criterion for clustering text data.

Modularity for Community Detection.
Newman and Girvan [20] proposed an approach that measures the quality of a community division by the modularity, defined as

Q = Σ_i (e_ii − a_i²),

where Q denotes the value of modularity, e_ij represents the ratio of the number of edges connecting community i and community j to the total number of edges, and a_i = Σ_j e_ij denotes the fraction of edge ends attached to vertices in community i.
In order to study the community structure of larger and more complex networks, Clauset et al. [21] defined the modularity as

Q = (1/2m) Σ_{vw} [A_vw − k_v k_w / (2m)] δ(c_v, c_w),

where m is the total number of edges in the network and A_vw is an element of the adjacency matrix: A_vw = 1 if there is an edge between nodes v and w, and 0 otherwise. k_v and k_w represent the degrees of nodes v and w, respectively, and c_v and c_w denote the communities in which nodes v and w are located. δ(c_v, c_w) equals 1 if c_v = c_w, meaning that nodes v and w are in the same community, and 0 otherwise. Let us define the indicator S_vr, whose value is 1 if node v belongs to community r and 0 otherwise. Then the modularity can be rewritten in matrix form as

Q = (1/2m) Tr(S^T B S),

where Tr(S^T B S) is the sum of the elements on the main diagonal of S^T B S and the elements of the modularity matrix B are

B_vw = A_vw − k_v k_w / (2m).

Nowadays, modularity has become a popular measure of the structure of networks and graphs: the higher the modularity, the more desirable and accurate the network division.
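As a concrete illustration, the matrix form above can be computed directly with NumPy. This is a minimal sketch under our own naming (the function and variable names are not from the paper); the example graph below is ours, not from the experiments.

```python
import numpy as np

def modularity(A, labels):
    """Modularity Q = (1/2m) Tr(S^T B S) with B_vw = A_vw - k_v k_w / (2m).

    A is a symmetric 0/1 adjacency matrix; labels gives each node's community.
    """
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                # node degrees k_v
    two_m = k.sum()                  # 2m: twice the number of edges
    B = A - np.outer(k, k) / two_m   # modularity matrix B
    labels = np.asarray(labels)
    # S[v, r] = 1 if node v belongs to community r, 0 otherwise
    S = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    return np.trace(S.T @ B @ S) / two_m
```

For two disconnected triangles labeled as two communities, this returns the textbook value Q = 0.5.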

Modularity for Co-Clustering
In the CoClus algorithm [18], for a data matrix X of size n × d, the modularity is written as

Q(Z, W) = (1/x_..) Σ_{i=1..n} Σ_{j=1..d} Σ_{k=1..g} (x_ij − x_i. x_.j / x_..) z_ik w_jk,

where x_.. = Σ_{i,j} x_ij = |E| represents the total number of edges, x_i. = Σ_j x_ij and x_.j = Σ_i x_ij express the degrees of row i and column j, respectively, and z_ik and w_jk represent the memberships of objects and features, respectively, which differ from the corresponding parameters in traditional co-clustering. The difference lies in that z_ik and w_jk have the same significance for the simultaneous clustering of documents and terms. If object i belongs to cluster k, z_ik takes the value 1 and 0 otherwise. Similarly, if feature j belongs to cluster k, the value of w_jk is 1, and 0 otherwise.
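The co-clustering criterion above can be sketched in the same style; here the indicator matrices Z and W are built from hard label vectors, and all names are ours rather than the paper's.

```python
import numpy as np

def coclustering_modularity(X, row_labels, col_labels):
    """Q(Z, W) = (1/x..) sum_ij sum_k (x_ij - x_i. x_.j / x..) z_ik w_jk."""
    X = np.asarray(X, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    x_total = X.sum()  # x..
    # delta_ij = x_ij - x_i. * x_.j / x..
    delta = X - np.outer(X.sum(axis=1), X.sum(axis=0)) / x_total
    g = int(max(row_labels.max(), col_labels.max())) + 1
    Z = np.eye(g)[row_labels]  # n x g row-cluster indicator
    W = np.eye(g)[col_labels]  # d x g column-cluster indicator
    return np.trace(Z.T @ delta @ W) / x_total
```

On a perfectly block-diagonal 4 x 4 matrix with two co-clusters, this criterion evaluates to 0.5.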

Fuzzy Block-Diagonal Co-Clustering Based on Modularity Maximization
In this paper, we propose an innovative and effective block-diagonal fuzzy co-clustering algorithm, MMFCC, which combines the improved modularity discussed above with the idea of fuzzy co-clustering. The algorithm is described in detail as follows.
Let X be an n × d original data matrix, and let u_ik and v_jk be the fuzzy membership degrees with which object i and feature j, respectively, belong to cluster k; U = {u_ik} is the n × g document membership matrix and V = {v_jk} is the d × g term membership matrix. The MMFCC algorithm defines the fuzzy modularity as

Q(U, V) = (1/x_..) Σ_{i=1..n} Σ_{j=1..d} Σ_{k=1..g} (x_ij − x_i. x_.j / x_..) u_ik v_jk. (6)

According to this definition, the objective function of MMFCC is given as

J(U, V) = Q(U, V) − T_u Σ_{i=1..n} Σ_{k=1..g} u_ik log u_ik − T_v Σ_{j=1..d} Σ_{k=1..g} v_jk log v_jk, (7)

where Σ_{i,k} u_ik log u_ik and Σ_{j,k} v_jk log v_jk denote the separate entropy regularizing terms of the document and term membership functions, respectively; minimizing these two terms corresponds to maximizing the fuzzy entropies −Σ_{i,k} u_ik log u_ik and −Σ_{j,k} v_jk log v_jk. Introducing T_u and T_v, two weighted fuzzy parameters that specify the degree of ambiguity, helps increase the clustering accuracy.
The objective function is maximized subject to the following constraints:

Σ_{k=1..g} u_ik = 1, u_ik ∈ [0, 1], i = 1, ..., n, (8)
Σ_{k=1..g} v_jk = 1, v_jk ∈ [0, 1], j = 1, ..., d. (9)

The above constrained optimization problem of MMFCC can be solved by the Lagrange multiplier method. Applying the Lagrange multipliers λ_i and γ_j to constraints (8) and (9), respectively, gives the Lagrangian

L = J(U, V) + Σ_{i=1..n} λ_i (1 − Σ_{k=1..g} u_ik) + Σ_{j=1..d} γ_j (1 − Σ_{k=1..g} v_jk). (10)

Differentiating (10) with respect to u_ik and setting the gradient to zero, we have

(1/x_..) Σ_{j=1..d} (x_ij − x_i. x_.j / x_..) v_jk − T_u (log u_ik + 1) − λ_i = 0. (11)

Solving (11) for u_ik and eliminating λ_i by means of constraint (8) yields the update rule

u_ik = exp((1/(T_u x_..)) Σ_{j=1..d} (x_ij − x_i. x_.j / x_..) v_jk) / Σ_{k'=1..g} exp((1/(T_u x_..)) Σ_{j=1..d} (x_ij − x_i. x_.j / x_..) v_jk'). (12)

Similarly, differentiating (10) with respect to v_jk, setting the gradient to zero, and eliminating γ_j by means of constraint (9) gives

v_jk = exp((1/(T_v x_..)) Σ_{i=1..n} (x_ij − x_i. x_.j / x_..) u_ik) / Σ_{k'=1..g} exp((1/(T_v x_..)) Σ_{i=1..n} (x_ij − x_i. x_.j / x_..) u_ik'). (13)

Substituting the final u_ik and v_jk obtained after the iterations into (6) yields the maximized modularity.
The MMFCC procedure is outlined in Algorithm 1.
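Since the algorithm alternates softmax-style membership updates, the loop admits a compact sketch. The following is an illustrative reimplementation under our own assumptions (constants folded into T_u and T_v, a fixed iteration count instead of the convergence test), not the authors' code.

```python
import numpy as np

def mmfcc(X, g, Tu=1.0, Tv=1.0, n_iter=100, seed=0):
    """Alternating MMFCC-style updates (illustrative sketch).

    U and V are fuzzy memberships updated by softmax rules derived from the
    entropy-regularized modularity objective; exact constants may differ
    from the paper's expressions.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    x_total = X.sum()
    # centred matrix: delta_ij = x_ij - x_i. x_.j / x..
    delta = X - np.outer(X.sum(axis=1), X.sum(axis=0)) / x_total
    n, d = X.shape
    U = rng.dirichlet(np.ones(g), size=n)  # n x g document memberships
    V = rng.dirichlet(np.ones(g), size=d)  # d x g term memberships

    def softmax_rows(M):
        M = M - M.max(axis=1, keepdims=True)  # numerical stability
        E = np.exp(M)
        return E / E.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        U = softmax_rows((delta @ V) / (Tu * x_total))    # update u_ik
        V = softmax_rows((delta.T @ U) / (Tv * x_total))  # update v_jk
    Q = np.trace(U.T @ delta @ V) / x_total               # fuzzy modularity
    return U, V, Q
```

Small values of Tu and Tv make the memberships crisper; large values push them toward uniform, mirroring the role of the entropy terms.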

Numerical Experiments
In this section, MMFCC is compared with several well-known block-diagonal and non-block-diagonal algorithms to validate its clustering performance.

Datasets.
Eleven text datasets with different sizes and sources are selected in the following experiments, and their detailed information is summarized in Table 1.
To ensure the reliability, integrity, and accuracy of our experiments, each dataset is randomly divided into two parts: a train set containing 75% of the samples and a test set containing the remaining 25%.
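The random 75/25 split can be sketched as follows; this is a generic utility of our own, not the paper's code.

```python
import numpy as np

def train_test_split_indices(n, train_frac=0.75, seed=0):
    """Randomly split n sample indices into train (75%) and test (25%) parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)        # random order of all sample indices
    cut = int(round(train_frac * n)) # boundary between train and test
    return perm[:cut], perm[cut:]
```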

Evaluation Metrics.
Many common evaluation criteria exist for assessing clustering quality [22], such as entropy, Acc (accuracy), and NMI (normalized mutual information). In this paper, Acc and NMI are chosen as the two evaluation indices for the MMFCC algorithm.
Acc describes how close the result values are to the true values over the series of experiments. It is defined as

Acc = (TP + TN) / (TP + TN + FP + FN),

where a TP (true positive) decision assigns two similar documents to the same cluster, a TN (true negative) decision assigns two dissimilar documents to different clusters, an FP (false positive) decision assigns two dissimilar documents to the same cluster, and an FN (false negative) decision assigns two similar documents to different clusters. A high Acc value indicates good clustering results.
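The pair-counting definition of Acc can be implemented directly; the helper name below is ours.

```python
from itertools import combinations

def pairwise_accuracy(true_labels, pred_labels):
    """Acc = (TP + TN) / (TP + TN + FP + FN) over all document pairs."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1          # similar pair placed together
        elif not same_true and not same_pred:
            tn += 1          # dissimilar pair kept apart
        elif not same_true and same_pred:
            fp += 1          # dissimilar pair placed together
        else:
            fn += 1          # similar pair split up
    return (tp + tn) / (tp + tn + fp + fn)
```

Note that this measure is invariant to a permutation of the cluster labels, which is why it suits unsupervised evaluation.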
The other assessment index is NMI. Assume that Ω = {ω_1, ..., ω_g} and C = {c_1, ..., c_g} are two distributions of the N sample labels; the entropies of the two distributions are

H(Ω) = −Σ_k P(ω_k) log P(ω_k), H(C) = −Σ_j P(c_j) log P(c_j),

where P(ω_k) = |ω_k|/N and P(c_j) = |c_j|/N. The mutual information between Ω and C is defined as

I(Ω, C) = Σ_k Σ_j P(ω_k, c_j) log [P(ω_k, c_j) / (P(ω_k) P(c_j))],

where P(ω_k, c_j) = |ω_k ∩ c_j|/N. After normalization, the mutual information becomes

NMI(Ω, C) = I(Ω, C) / [(H(Ω) + H(C)) / 2].

A high NMI value indicates good clustering results.
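The NMI computation above can be sketched as follows (natural logarithms; function names are ours).

```python
import numpy as np
from math import log

def nmi(labels_a, labels_b):
    """NMI(A, B) = I(A; B) / ((H(A) + H(B)) / 2), with P(a, b) = |a ∩ b| / N."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    N = len(labels_a)

    def entropy(lbl):
        _, counts = np.unique(lbl, return_counts=True)
        p = counts / N
        return -np.sum(p * np.log(p))

    mi = 0.0
    for a in np.unique(labels_a):
        for b in np.unique(labels_b):
            p_ab = np.mean((labels_a == a) & (labels_b == b))  # joint P(a, b)
            if p_ab > 0:
                p_a = np.mean(labels_a == a)
                p_b = np.mean(labels_b == b)
                mi += p_ab * log(p_ab / (p_a * p_b))
    return mi / ((entropy(labels_a) + entropy(labels_b)) / 2)
```

NMI equals 1 for identical partitions (up to label permutation) and 0 for independent ones.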

Fuzzy Parameter Settings.
In fuzzy co-clustering algorithms, including our MMFCC, two fuzzy parameters, T_u and T_v, are introduced. These two parameters control the degree of fuzziness of the algorithm. In many other algorithms, T_u and T_v are simply assigned manually.
In this paper, in order to ensure precision, we determine the values of these two parameters from experiments on the train sets. To limit the amount of computation, T_u and T_v take values in the range [1e−8, 1e+8] and increase by integer powers of 10. Figure 1 depicts the influence of different pairs of T_u and T_v on the values of modularity.
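The factor-of-10 grid scan over T_u and T_v can be sketched as follows; `evaluate` is a hypothetical stand-in for running MMFCC on a train set and reading off the resulting modularity.

```python
import numpy as np

def grid_search_TuTv(evaluate, exponents=range(-8, 9)):
    """Scan Tu, Tv over powers of ten in [1e-8, 1e+8] and return the pair
    (with its score) that maximizes the supplied modularity evaluator."""
    best = (None, None, -np.inf)
    for eu in exponents:
        for ev in exponents:
            Tu, Tv = 10.0 ** eu, 10.0 ** ev
            q = evaluate(Tu, Tv)       # e.g. modularity of an MMFCC run
            if q > best[2]:
                best = (Tu, Tv, q)
    return best
```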
Figure 1 contains eleven graphs, one for each dataset. Each graph displays the three-dimensional surface of modularity values over log T_u and log T_v; a three-dimensional surface-fitting technique is adopted to fit the mean modularity values.
It can be observed from Figure 1 that the relevant ranges of log T_u and log T_v are [−2, 4] and [−3, 3], which means that, for all datasets, T_u takes values in [0.01, 10000] and T_v in [0.001, 1000] by integer powers of 10. Generally, valid modularity lies in the range [0, 1], and Figures 1(a)-1(k) illustrate this point well.
It is easy to observe that each surface in Figure 1 has a highest point representing the maximum value of modularity. Table 2 summarizes the maximum modularity values and their corresponding pairs of T_u and T_v.

Evaluation of Clustering Performance.
As mentioned above, we select Acc and NMI as the evaluation standards of our experiments.

Validation of Classification Correctness. One of the disadvantages of cluster analysis is that it is very difficult to determine the number of classes of a dataset in advance.
In this case, users need to supply empirical values, which reduces the automation level of clustering. In MMFCC, this problem is solved well by using the modularity measure, which helps find the optimal number of final clusters. In our experiments, we implemented MMFCC on the eleven datasets. Because the numbers of classes of these datasets differ (3, 4, 5, 6, and 10 clusters, respectively), we let g = 2, 3, ..., 8; g = 2, 3, ..., 10; g = 6, 7, ..., 14; and g = 7, 8, ..., 13 as the variation ranges of the number of co-clusters, respectively. In MMFCC, for each dataset and each value of g, there exists a pair of T_u and T_v that maximizes the modularity. The relationship between the value of g and the maximum modularity obtained for different g on all datasets is shown in Figure 2.
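Choosing the number of co-clusters by maximizing modularity then reduces to a one-line selection; `modularity_for_K` below is a hypothetical placeholder for a full MMFCC run (with its own T_u, T_v grid search) at each candidate number of co-clusters.

```python
def best_num_coclusters(modularity_for_K, K_range):
    """Return the K in K_range whose best achievable modularity is largest."""
    return max(K_range, key=modularity_for_K)
```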
As mentioned above, for each group of co-clusters of each dataset, there is a pair of T_u and T_v that maximizes the resulting modularity. These pairs of T_u and T_v for the different co-clusterings of the eleven datasets are summarized in Tables 4-14, respectively.
According to the data shown in Tables 4-14, it is easy to see that the values of T_u and T_v fall in the range [0.1, 10.0]. However, the pair of T_u and T_v at the optimal number of co-clusters is often clearly different from the pairs at the other compared numbers.
Because logarithm and exponent operations are involved in the objective function, a change in the order of magnitude of T_u and T_v may generally lead to different results. It can be seen from Figure 2 that the curves for the cstr, tr23, la2, and ohscal datasets are flatter when the corresponding fuzzy parameters stay invariant, whereas for the other datasets the fluctuation of the modularity values is relatively apparent even when T_u and T_v do not change. Moreover, it is easily observed that the optimal numbers of co-clusters for the eleven datasets are, respectively, 3 for classic3; 4 for cstr, classic4, and RCV1; 5 for NG5; 6 for tr23, k1b, and la2; and 10 for tr41, tr45, and ohscal.
Figure 3 displays the clustering result graphs of MMFCC and five comparative algorithms. Given the limited space, we choose only the cstr dataset as a representative for intuitive illustration. We know that the optimal number of co-clusters on the cstr dataset is four, and the MMFCC and CoClus algorithms both find the four co-clusters, as shown in Figure 3. However, the CoClus algorithm cannot group many documents and terms synchronously, and there are many scattered points compared with MMFCC. The Spec algorithm also generates block-diagonal co-clusters, but the number of co-clusters is incorrect; the ITCC algorithm does not perform well and makes it hard to identify the true meaning of each co-cluster. This indicates that the classification quality of these two algorithms is somewhat unsatisfactory.
The two classical fuzzy co-clustering algorithms, FCCI and FCCM, cannot even produce intuitive diagonal co-clusters; their Acc and NMI values are nevertheless listed in Table 3. In the time-complexity expressions of the six algorithms, t, C, N, M, k, and l represent the number of iterations, result clusters, objects, features, row clusters, and column clusters, respectively, together with the number of nonzero values of the data matrix. From these expressions, we can draw a preliminary conclusion that the computational effort of MMFCC is somewhat higher than that of CoClus, ITCC, and FCCM, but lower than that of Spec and FCCI.
In our experiments, we recorded the clustering time of these algorithms on the test sets; the results are listed in Table 15. The MMFCC algorithm takes less time than the Spec and FCCI algorithms, except on the cstr dataset, but slightly more time than the other three algorithms (CoClus, ITCC, and FCCM). The experimental results are consistent with the above theoretical analysis of time complexity.

Conclusions
In this paper, we propose a novel and efficient fuzzy co-clustering method, named MMFCC, which groups data by maximizing the modularity. This algorithm not only greatly improves the accuracy of block-diagonal co-clustering but also overcomes a shortcoming of traditional fuzzy co-clustering algorithms, namely the difficulty of determining the number of final co-clusters.
In addition, modularity is no longer limited to applications in the field of graphs or networks. When processing text data, it can also verify the correctness of the original class counts of multiple datasets and determine the optimal number of classes, which helps MMFCC improve its clustering accuracy.
Network and graph datasets were not used in our experiments, but employing the MMFCC algorithm in the field of networks and graphs is one prospective research direction. Other significant future directions include combining other clustering methods with the fuzzy co-clustering algorithm and adding weights to the fuzzy co-clustering algorithm.

Table 3 :
Comparison of several co-clustering algorithms in terms of Acc and NMI on the eleven datasets. The bold values represent the best results.
As mentioned above, in the CoClus algorithm, Ailem et al. defined a block seriation relation R = Z W^T, where r_ij = Σ_{k=1..g} z_ik w_jk, and Z and W are n × g and d × g index matrices, respectively (g, n, and d represent the numbers of co-clusters, documents, and terms, respectively). For a data matrix X of size n × d, the modularity can then be rewritten as

Q(Z, W) = (1/x_..) Σ_{i=1..n} Σ_{j=1..d} (x_ij − x_i. x_.j / x_..) r_ij.

Algorithm 1: MMFCC algorithm.
1. Input: the original data matrix X, the number of co-clusters g, and the fuzzy parameters T_u, T_v
2. Output: fuzzy membership matrices U = {u_ik} and V = {v_jk}
3. Randomly initialize U and V
4. Alternating iterations:
5. Update u_ik by its closed-form update rule
6. Update v_jk by its closed-form update rule
7. Compute Q(U, V) from the obtained U and V
8. Stop the iterations once Q(U, V) no longer changes
9. From the final U and V, take the index of the maximum entry in each row as the cluster assignment
10. Reorganize the data matrix according to these two sets of indices

Table 1 :
The details of the datasets. Table 3 lists all the Acc and NMI values.

Table 2 :
The values of T_u and T_v when the modularity of each dataset is maximized, respectively. Even on the cstr dataset, the only dataset in the experiments on which MMFCC cannot achieve the best clustering accuracy in terms of Acc, the Acc value of MMFCC is comparable to that of Spec, the most outstanding algorithm on that dataset. In brief, this group of experiments shows that MMFCC is better than or comparable to such competitors as Spec, ITCC, FCCM, FCCI, and CoClus in terms of Acc and NMI.

Table 4 :
Values of T_u and T_v that maximize modularity for different co-clusters of the cstr dataset.

Table 5 :
Values of T_u and T_v that maximize modularity for different co-clusters of the tr23 dataset.

Table 6 :
Values of T_u and T_v that maximize modularity for different co-clusters of the tr41 dataset.

Table 7 :
Values of T_u and T_v that maximize modularity for different co-clusters of the tr45 dataset.

Table 8 :
Values of T_u and T_v that maximize modularity for different co-clusters of the k1b dataset.

Table 9 :
Values of T_u and T_v that maximize modularity for different co-clusters of the la2 dataset.

Table 10 :
Values of T_u and T_v that maximize modularity for different co-clusters of the classic3 dataset.

Table 11 :
Values of T_u and T_v that maximize modularity for different co-clusters of the classic4 dataset.

Table 12 :
Values of T_u and T_v that maximize modularity for different co-clusters of the RCV1 dataset.

Table 13 :
Values of T_u and T_v that maximize modularity for different co-clusters of the NG5 dataset.

Table 14 :
Values of T_u and T_v that maximize modularity for different co-clusters of the ohscal dataset.

Table 15 :
Comparison of the time (s) taken by these six algorithms to cluster eleven datasets.