A Collaborative Framework for Privacy Preserving Fuzzy Co-Clustering of Vertically Distributed Cooccurrence Matrices

In many real-world data analysis tasks, much more useful knowledge is expected to be obtained by utilizing multiple databases stored in different organizations, such as corporation groups, state organs, and allied countries. However, such organizations often hesitate to publish their databases because of privacy and security issues, although they recognize the advantages of collaborative analysis. This paper proposes a novel collaborative framework for utilizing vertically partitioned cooccurrence matrices in fuzzy co-cluster structure estimation, in which cooccurrence information among objects and items is separately stored in several sites. In order to utilize such distributed data sets without fear of information leaks, a privacy preserving procedure is introduced to fuzzy clustering for categorical multivariate data (FCCM). Withholding each element of the cooccurrence matrices, only object memberships are shared among the sites, and their (implicit) joint co-cluster structures are revealed through an iterative clustering process. Several experimental results demonstrate that collaborative analysis reveals the global intrinsic co-cluster structures of separate matrices better than individual site-wise analysis. The novel framework makes it possible for many private and public organizations to share common structural knowledge of their data without fear of information leaks.


Introduction
Data mining is a powerful tool for many private and public organizations in supporting efficient decision making, and they have been utilizing various databases, which are independently and securely stored in each organization. However, it is often quite expensive or impossible for each organization to store enough data by itself, and many analysts believe that much more useful knowledge can be obtained by utilizing multiple databases stored in different organizations. In such collaborative data analysis, a significant problem is the privacy issue. For example, in many corporations, customer segmentation by clustering is a fundamental approach in marketing, while customer privacy must be securely protected and each data record, such as purchase history and personal profiles, must not be published to other corporations or organizations. Similar situations are found in many other organizations, such as hospitals with clinical records and governments with military intelligence.
Privacy preserving data mining (PPDM) [1] is a fundamental approach for utilizing multiple databases including personal or sensitive information without fear of information leaks. A possible approach is a priori k-anonymization of databases for secure publication [2,3], but such anonymization can bring information losses. Another approach for utilizing all distributed information is to analyze the information without revealing each element. In k-means clustering, several secure processes for estimating cluster centers were proposed [4,5], in which the mean vector of each cluster is calculated with an encryption operation.
In this paper, a novel collaborative framework for utilizing vertically partitioned cooccurrence matrices in fuzzy co-cluster structure estimation is proposed, where cooccurrence information among objects and items is separately stored in several sites. In vertically distributed databases, it is assumed that all sites share common objects but the objects are characterized by different independent items in each site. The goal is to reveal the global co-cluster structures spread over the whole set of separate databases without publishing each element of the independent databases to other sites.
The remaining parts of this paper are organized as follows: Section 2 gives a brief review of related works and Section 3 shows their problems and possible solutions. Section 4 explains the conventional fuzzy co-clustering model and Section 5 proposes a novel collaborative framework for applying fuzzy co-clustering considering privacy issues. In Section 6, several experimental results demonstrate that collaborative analysis reveals the global intrinsic co-cluster structures of separate matrices better than individual site-wise analysis. Finally, a summary conclusion is given in Section 7.

Background
Co-clustering is a fundamental technique for summarizing mutual cooccurrence information among objects and items. For example, in document clustering, mutual cooccurrence information of documents and keywords is utilized for revealing intrinsic document clusters with their keyword summaries. In purchase history analysis, mutual connections among customers and their promising products are investigated considering purchase preferences. Co-clustering provides pairwise cluster structures among objects and items and has been widely investigated in both probabilistic [6] and heuristic contexts [7]. This paper focuses on fuzzy clustering approaches.
Fuzzy clustering has been shown to have many advantages over hard clustering from such viewpoints as sensitivity to noise and initialization. Fuzzy variants of co-clustering have also been demonstrated to be useful in such applications as document analysis [8] and collaborative filtering [9,10]. The goal of fuzzy co-clustering is to simultaneously estimate memberships of both objects and items from a cooccurrence information matrix. For example, in document analysis, each document (object) is characterized by several keywords (items) with their appearance frequencies (degrees of cooccurrence), and the goal is to extract document-keyword clusters with their fuzzy memberships for analyzing their contents.
In order to analyze distributed databases in k-means-type clustering, several secure processes for estimating cluster centers were proposed [4,5], in which the mean vector of each cluster is calculated with an encryption operation. However, in fuzzy co-clustering, the clustering criterion of cluster aggregation degree is defined without cluster centers, and the conventional secure framework cannot be adopted. A novel secure mechanism is therefore needed; the main problems to be solved are summarized in the next section.

Problems and Solution
In the k-means-type secure clustering model for vertically distributed data [4,5], multiple sites share common objects, such as customers and patients, while having their own vector observations only, such as customer profiles of their own stores and clinical records in their own hospitals. In order to reveal the intrinsic object clusters without publishing each observation, each coordinate of the cluster centers is separately calculated in each site and the derived coordinates are shared by all sites.
On the other hand, fuzzy co-clustering does not use cluster centers as cluster prototypes and utilizes two types of fuzzy memberships only. Therefore, the conventional secure framework for k-means-type clustering cannot be adopted, and a secure process for calculating the fuzzy memberships must be developed.
In the following, a novel framework for calculating fuzzy memberships in fuzzy co-clustering of vertically distributed cooccurrence matrices is proposed, following a brief review of the conventional fuzzy co-clustering models. In order to calculate object memberships, the sum of products of item memberships and cooccurrence observations is needed, and vice versa. In the proposed secure process, the sum calculation is securely achieved through an encryption operation, in which the sum can be calculated while concealing each value.
The novel framework is constructed in the FCCM context only, which is the basic model of fuzzy co-clustering. However, a similar extension is expected to be directly applicable to the other FCCM variants, because all of them are based on the FCCM updating process.

Methodology of Fuzzy Co-Clustering
Assume that we have a cooccurrence matrix $R = \{r_{ij}\}$ on objects $i = 1, \ldots, n$ and items $j = 1, \ldots, m$, in which $r_{ij}$ represents the degree of cooccurrence of item $j$ with object $i$. The goal of co-clustering is to simultaneously partition objects and items into $C$ co-clusters by estimating two types of fuzzy memberships. Object partitions are represented by object memberships $u_{ci}$, which is the membership degree of object $i$ to cluster $c$ and is forced to be exclusive in the same way as in FCM such that $\sum_{c=1}^{C} u_{ci} = 1$. On the other hand, in order to avoid trivial solutions, item partitions are represented by item memberships $w_{cj}$, which are mostly responsible for representing the mutual typicalities in each cluster such that $\sum_{j=1}^{m} w_{cj} = 1$. Oh et al. [11] proposed the FCM-type co-clustering model, which is called FCCM, by modifying the FCM algorithm for handling cooccurrence information, where the aggregation degree of each cluster is maximized:

$$L_{fccm} = \sum_{c=1}^{C}\sum_{i=1}^{n}\sum_{j=1}^{m} u_{ci} w_{cj} r_{ij} - \lambda_u \sum_{c=1}^{C}\sum_{i=1}^{n} u_{ci}\log u_{ci} - \lambda_w \sum_{c=1}^{C}\sum_{j=1}^{m} w_{cj}\log w_{cj}. \tag{1}$$

The first term to be maximized measures the aggregation degree of objects and items in cluster $c$, such that it becomes larger when mutually familiar objects and items having a large $r_{ij}$ simultaneously have large memberships in a cluster. By itself, this aggregation degree is only suited to hard partitions: because the term is linear with respect to both $u_{ci}$ and $w_{cj}$, its maximizers always satisfy $u_{ci} \in \{0, 1\}$ and $w_{cj} \in \{0, 1\}$. Therefore, in order to derive fuzzy memberships $u_{ci} \in [0, 1]$ and $w_{cj} \in [0, 1]$, the aggregation measure must be nonlinearized.
In FCCM, the entropy-based fuzzification method [13,14] was adopted instead of the standard approach in FCM because the exponential weight in FCM can work only in the minimization framework of positive objective functions. $\lambda_u$ and $\lambda_w$ tune the degree of fuzziness of the object and item memberships, respectively, where a larger $\lambda$ brings fuzzier partitions while a smaller $\lambda$ brings crisper partitions.
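As a minimal illustration of how the entropy terms reward fuzziness, the following sketch evaluates an entropy-regularized FCCM objective on a toy matrix. It assumes the standard form of the objective (aggregation term minus $\lambda_u$- and $\lambda_w$-weighted negative entropies); the function name `fccm_objective`, the list-of-lists layout, and the toy data are hypothetical and are not taken from the original experiments.

```python
import math


def fccm_objective(R, U, W, lam_u, lam_w):
    """Entropy-regularized FCCM objective (to be maximized).

    R: n x m cooccurrence matrix (list of rows).
    U: C x n object memberships (each object sums to 1 over clusters).
    W: C x m item memberships (each cluster sums to 1 over items).
    """
    n, m, C = len(R), len(R[0]), len(U)

    def xlogx(x):
        # x * log(x), with the limit value 0 at x = 0
        return x * math.log(x) if x > 0.0 else 0.0

    aggregation = sum(
        U[c][i] * W[c][j] * R[i][j]
        for c in range(C) for i in range(n) for j in range(m)
    )
    neg_entropy_u = sum(xlogx(U[c][i]) for c in range(C) for i in range(n))
    neg_entropy_w = sum(xlogx(W[c][j]) for c in range(C) for j in range(m))
    return aggregation - lam_u * neg_entropy_u - lam_w * neg_entropy_w


# Toy example (hypothetical data): two objects, two items, C = 2.
R = [[1, 0], [0, 1]]
crisp = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
obj_crisp = fccm_objective(R, crisp, crisp, lam_u=0.1, lam_w=0.1)
obj_uniform = fccm_objective(R, uniform, uniform, lam_u=0.1, lam_w=0.1)
```

With small weights the crisp partition scores higher on this toy matrix; increasing `lam_u` and `lam_w` raises the relative score of the uniform (fuzziest) memberships.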
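The clustering algorithm is an iterative process of updating $u_{ci}$ and $w_{cj}$ using the following rules:

$$u_{ci} = \frac{\exp\left(\lambda_u^{-1}\sum_{j=1}^{m} w_{cj} r_{ij}\right)}{\sum_{l=1}^{C}\exp\left(\lambda_u^{-1}\sum_{j=1}^{m} w_{lj} r_{ij}\right)}, \tag{2}$$

$$w_{cj} = \frac{\exp\left(\lambda_w^{-1}\sum_{i=1}^{n} u_{ci} r_{ij}\right)}{\sum_{l=1}^{m}\exp\left(\lambda_w^{-1}\sum_{i=1}^{n} u_{ci} r_{il}\right)}. \tag{3}$$

This FCCM process was also reconstructed with other fuzzification mechanisms. For example, Fuzzy CoDoK [8] utilized the quadratic term-based regularization [19] for avoiding calculation overflows. Honda et al. [15] adopted K-L information-based regularization [20] for handling unbalanced cluster sizes. As discussed in Section 3, these extended models generally follow the original FCCM procedure and have similar characteristics. So, in this paper, the novel collaborative framework is described in the FCCM context only.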
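The alternating updates can be sketched as follows, assuming the usual softmax-type FCCM rules (object memberships normalized over clusters, item memberships normalized over items within each cluster). The function names, the fixed iteration count, and the toy matrix are illustrative assumptions; the maximum is subtracted inside the softmax only to avoid numerical overflow.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def update_object_memberships(R, W, lam_u):
    """u_ci proportional to exp(sum_j w_cj r_ij / lam_u), normalized over c."""
    n, C = len(R), len(W)
    U = [[0.0] * n for _ in range(C)]
    for i in range(n):
        scores = [sum(w * r for w, r in zip(W[c], R[i])) / lam_u
                  for c in range(C)]
        for c, u in enumerate(softmax(scores)):
            U[c][i] = u
    return U


def update_item_memberships(R, U, lam_w):
    """w_cj proportional to exp(sum_i u_ci r_ij / lam_w), normalized over j."""
    n, m = len(R), len(R[0])
    W = []
    for u_c in U:
        scores = [sum(u_c[i] * R[i][j] for i in range(n)) / lam_w
                  for j in range(m)]
        W.append(softmax(scores))
    return W


def fccm(R, W_init, lam_u, lam_w, n_iter=20):
    """Alternate the two updates for a fixed number of iterations."""
    W = W_init
    for _ in range(n_iter):
        U = update_object_memberships(R, W, lam_u)
        W = update_item_memberships(R, U, lam_w)
    return U, W


# Toy block-structured matrix (hypothetical data) with two co-clusters.
R = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W0 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
U1 = update_object_memberships(R, W0, lam_u=0.1)
W1 = update_item_memberships(R, U1, lam_w=1.0)
U_final, W_final = fccm(R, W0, lam_u=0.1, lam_w=1.0)
```

In practice the loop would usually stop on a convergence test of the memberships rather than after a fixed number of iterations.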

Privacy Consideration in k-Means Clustering.
When each object is characterized by an $m$-dimensional observation $\mathbf{x}_i = (x_{i1}, \ldots, x_{im})^{\top}$, the k-means algorithm tries to minimize the within-cluster errors by iterating cluster center updating and nearest prototype assignment. Let $\mathbf{b}_c = (b_{c1}, \ldots, b_{cm})^{\top}$ be the center of cluster $c$. In cases of distributed databases, we must care about privacy issues in either of the two phases by adopting such a technique as an encryption operation [5].
For vertically distributed databases, where the elements of $\mathbf{x}_i = (x_{i1}, \ldots, x_{im})^{\top}$ are separately stored in several sites, distances between object $i$ and the $C$ cluster centers are calculated under collaboration of all sites. Here, the clustering criterion is the sum of squared errors $\sum_{j=1}^{m} |x_{ij} - b_{cj}|^2$ and should be calculated by concealing each value of $|x_{ij} - b_{cj}|^2$ from other sites. Once we find the nearest prototype assignment of each object, we can independently calculate the new $\mathbf{b}_c = (b_{c1}, \ldots, b_{cm})^{\top}$ in each site by sharing the object membership information.
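The decomposition that makes this collaboration possible is that the squared distance is a sum of site-wise partial sums, each computable from in-site coordinates only. The sketch below shows just this decomposition on hypothetical toy values; it deliberately omits the concealment step, so the partial sums are combined in the clear here.

```python
def partial_squared_distance(x_part, b_part):
    """Squared error over the coordinates held by a single site."""
    return sum((x - b) ** 2 for x, b in zip(x_part, b_part))


# One object whose five attributes are split over three sites (toy values).
x_sites = [[1.0, 2.0], [0.0], [3.0, 1.0]]

# Site-wise coordinates of two cluster centers (toy values).
center_sites = {
    0: [[1.0, 2.0], [0.0], [3.0, 0.0]],
    1: [[4.0, 0.0], [1.0], [0.0, 1.0]],
}

# Each site contributes one partial sum per cluster; only these sums
# (not the raw coordinates) are needed to find the nearest prototype.
totals = {
    c: sum(partial_squared_distance(x, b) for x, b in zip(x_sites, parts))
    for c, parts in center_sites.items()
}
nearest = min(totals, key=totals.get)
```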
Figure 1: Vertically distributed cooccurrence matrices (common objects; site-wise items).
Although the above secure framework is also useful in many other k-means-type clustering algorithms such as FCM, it cannot be directly applied to co-clustering because co-clustering does not use cluster prototypes but considers two types of memberships. In this paper, similar ideas are adapted to fuzzy co-clustering tasks.

Fuzzy Co-Clustering with Privacy Consideration.
Assume that $T$ sites ($t = 1, \ldots, T$) share $n$ common objects ($i = 1, \ldots, n$) and have different cooccurrence information on different items, which are summarized into $n \times m_t$ matrices $R_t = \{r^{t}_{ij}\}$, where $m_t$ is the number of items in site $t$ and $\sum_{t=1}^{T} m_t = m$. Figure 1 shows a visual image of vertically distributed cooccurrence matrices. For example, we have a group of $T$ corporations (or hospitals, countries, etc.) and each of them has its independent customer purchase history $R_t = \{r^{t}_{ij}\}$ (or patients' records, military intelligence, etc.). If we do not care about the privacy issues, the distributed matrices should be gathered into a full $n \times m$ matrix to be analyzed in a single process without information losses. Taking privacy preservation into account, however, each matrix should be processed in each site without broadcasting personal information, although the reliability of each site-wise co-cluster structure may be insufficient because of information losses. The goal of the collaborative fuzzy co-clustering analysis is therefore to estimate object and item memberships that are as similar to the full-data case as possible by sharing object partition information without broadcasting the cooccurrence information $R_t = \{r^{t}_{ij}\}$. Object memberships $u_{ci}$ to be shared by the sites are common and are defined in the same manner as in the conventional FCCM. On the other hand, item memberships $w_{cj}$ are somewhat different because they follow the within-cluster sum constraint. In this paper, it is assumed that item memberships are independently estimated in each site following the site-wise constraint $\sum_{j=1}^{m_t} w^{t}_{cj} = 1$, where $w^{t}_{cj}$ is the item membership of item $j$ in site $t$. Note that the item memberships $w^{t}_{cj}$ should not be opened to other sites from the viewpoint of privacy.
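For simulation purposes, a vertically distributed setting can be emulated by cutting the columns of a full matrix into consecutive site-wise blocks. The helper below is a minimal sketch under that assumption; `split_vertically` and the toy matrix are hypothetical names and data, and in a real deployment the full matrix would never exist in one place.

```python
def split_vertically(R, sizes):
    """Split the columns of R (list of rows) into consecutive site blocks.

    sizes[t] plays the role of m_t; the sizes must sum to the number of items m.
    """
    if sum(sizes) != len(R[0]):
        raise ValueError("site sizes must sum to the number of items")
    blocks, start = [], 0
    for size in sizes:
        blocks.append([row[start:start + size] for row in R])
        start += size
    return blocks


# Toy 2 x 6 matrix (hypothetical data) distributed over T = 3 sites.
R = [[1, 0, 1, 1, 0, 0],
     [0, 1, 0, 0, 1, 1]]
site_matrices = split_vertically(R, [3, 2, 1])
```

Every block keeps all rows, which reflects the assumption that the sites share the same objects and differ only in their items.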
In applying FCCM clustering to distributed cooccurrence matrices, (2) implies that each object membership depends on $\sum_{j=1}^{m} w_{cj} r_{ij}$, which is the sum of the site-wise partial sums $\sum_{j=1}^{m_t} w^{t}_{cj} r^{t}_{ij}$ over $t = 1, \ldots, T$.
Figure 2: Secure calculation of the sum (panel labels: (1) random vector generation; (2) encryption key).
Assume that we have at least three sites, that is, $T > 2$, and that two sites $t_1$ and $t_T$ are selected as representative sites. Figure 2 summarizes the process for secure calculation of $\sum_{t=1}^{T}\sum_{j=1}^{m_t} w^{t}_{cj} r^{t}_{ij}$.
Once object memberships $u_{ci}$ are broadcast to all sites, each item membership $w^{t}_{cj}$ is calculated by (3) in each site using in-site information only, where the site-wise item memberships $w^{t}_{cj}$ follow the site-wise normalization constraints $\sum_{j=1}^{m_t} w^{t}_{cj} = 1$. It should be noted that, in this algorithm, item memberships are independently estimated in each site under the assumption that each site does not have any information on the items handled by other sites, such as the number of items and the degree of fuzziness of item memberships. Additionally, the algorithm cannot exactly reproduce the co-clustering result of the whole data case, where all cooccurrence information is shared without care for privacy issues, even if the same parameter setting is used in all sites. This is because the piecewise constraint $\sum_{j=1}^{m_t} w^{t}_{cj} = 1$ is independently imposed on the item memberships in each site, while only $\sum_{j=1}^{m} w_{cj} = 1$ is considered in the whole data case.
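One common way to obtain such a sum without exposing any single site's contribution is additive masking along a ring of sites. The sketch below is an illustration under that assumption and is not claimed to be the exact protocol of this paper: a first site adds a random mask, every site adds its own partial sums to the running vector, and the last site removes the mask (received separately from the first site) before computing the object memberships. It presumes semi-honest, non-colluding sites; all names and toy values are hypothetical.

```python
import math
import random


def site_partial_sums(R_t, W_t, i):
    """Within one site: sum_j w^t_cj * r^t_ij for every cluster c."""
    return [sum(w * r for w, r in zip(w_c, R_t[i])) for w_c in W_t]


def masked_ring_sum(site_vectors, rng):
    """Add up per-site vectors while each site only sees a masked running sum."""
    n_clusters = len(site_vectors[0])
    # The first site draws the random mask (acts as the shared key).
    mask = [rng.uniform(-1000.0, 1000.0) for _ in range(n_clusters)]
    running = list(mask)
    for vector in site_vectors:  # passed from site to site
        running = [acc + v for acc, v in zip(running, vector)]
    # The last site removes the mask to recover the plain total.
    return [acc - key for acc, key in zip(running, mask)]


def object_memberships(total, lam_u):
    """Softmax over clusters of the aggregated sums divided by lam_u."""
    top = max(total)
    exps = [math.exp((s - top) / lam_u) for s in total]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy setting (hypothetical data): one object, T = 3 sites, C = 2 clusters.
site_R = [[[1, 0]], [[1, 0]], [[1]]]
site_W = [
    [[0.5, 0.5], [0.5, 0.5]],  # each row sums to 1 within its site
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0], [1.0]],
]
partials = [site_partial_sums(R_t, W_t, 0) for R_t, W_t in zip(site_R, site_W)]
total = masked_ring_sum(partials, random.Random(0))
u_i = object_memberships(total, lam_u=0.5)
```

Because floating-point masking leaves small rounding errors, a practical implementation would more likely work with integers or fixed-point values modulo a large number.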

Numerical Experiments
In this section, three experimental results are shown for demonstrating the characteristics of the proposed algorithm. Section 6.1 demonstrates the basic features of the proposed framework with a simple data set and Section 6.2 discusses the applicability to more realistic situations with a data set having an unbalanced cluster structure. Then, an application experiment is shown in Section 6.3, where a virtual alliance of military sections is simulated using a real-world benchmark data set.

6.1. Data Set 1: Homogeneous Cluster Partition.
An artificially generated $100 \times 90$ cooccurrence matrix $R = \{r_{ij}\}$ was used in this experiment, where 100 objects and 90 items form roughly 4 co-clusters. Figure 3(a) shows the original whole data matrix, where black and white cells depict $r_{ij} = 1$ and $r_{ij} = 0$, respectively. Vertically distributed cooccurrence submatrices were generated by arranging the $100 \times 90$ noisy matrix into four sites. Figure 3(b) shows the arranged cooccurrence matrix, where the $m = 90$ items were divided into $(m_1, m_2, m_3, m_4) = (27, 24, 21, 18)$. As a result, the four co-cluster structures are only very weakly implied in each site, and the global co-cluster structure is expected to be revealed only through collaboration of all sites. This is a virtual situation of a group of four corporations, where they share 100 customers but have independent purchase history data on their own products. Here, the goal of collaborative fuzzy co-clustering is to reveal the intrinsic four customer clusters associated with their familiar products, which can be captured in the whole data strategy without privacy consideration but cannot be found in the site-wise independent analysis.
The co-clustering results of the distributed matrices are compared with those of the whole data case, where the conventional FCCM algorithm was applied to the original $100 \times 90$ cooccurrence matrix $R = \{r_{ij}\}$ without privacy consideration. Figure 4 shows the item membership vectors given in the [...] and $w_{cj} = 0$, respectively. The goal is to estimate site-wise item memberships $w^{t}_{cj}$ that are as similar to the original $w_{cj}$ as possible. Therefore, in this experiment, the similarity between the original $w_{cj}$ and the site-wise $w^{t}_{cj}$ is measured by their correlation coefficient.
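The similarity measure can be sketched as a plain Pearson correlation between two membership vectors over the same items. Because correlation is invariant to positive scaling, it is unaffected by the different normalizations of site-wise and whole-data memberships. The function name and the toy vectors are hypothetical, and how clusters are matched between the two results is not specified here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy membership vectors over the same three items (hypothetical values).
w_whole = [0.10, 0.05, 0.01]   # whole-data memberships restricted to one site
w_site = [0.50, 0.25, 0.05]    # same pattern, rescaled by site-wise normalization
w_reversed = [0.01, 0.05, 0.10]

same_pattern = pearson(w_whole, w_site)
opposite_pattern = pearson(w_whole, w_reversed)
```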
Table 1 compares the correlation coefficients between the site-wise or proposed item memberships and the original result, where the best and the mean values in 50 trials with different initializations are shown. In the site-wise FCCM, the conventional FCCM was applied to each submatrix (each small chunk) in each site. The fuzzification weights were set as $\lambda_u = 0.001$ and $\lambda_w = 100.0$, respectively. The table indicates that the proposed framework is useful for estimating reliable item memberships under collaboration of all sites, while the derived item membership vectors are not necessarily equivalent to those of the whole data case because of the site-wise independent constraints.

6.2. Data Set 2: Heterogeneous Cluster Partition.
Next, the applicability of the proposed framework is investigated in a heterogeneous cluster partition case. The second artificial $100 \times 90$ cooccurrence matrix $R = \{r_{ij}\}$ was vertically distributed into 4 sites as shown in Figure 5(a), where $(m_1, m_2, m_3, m_4) = (27, 24, 21, 18)$. In contrast to the previous experiment, each site has a different number of virtual co-clusters such that $(C_1, C_2, C_3, C_4) = (4, 3, 2, 4)$. This situation is similar to the case where four corporations in the group have different product characteristics and cannot capture the real customer features without collaboration.
The goal of collaborative co-cluster analysis is to reveal the intrinsic global co-cluster structures, which can be found only with the global whole data. Applying the proposed secure framework with various cluster numbers, the FCCM algorithm could derive at most $C = 3$ co-clusters; that is, when $C > 3$, the 4th or later clusters consisted of a few noise objects only.
In order to intuitively validate the $C = 3$ co-clusters derived by the proposed framework, Figure 5(b) provides the arranged whole data matrix, where all 90 items were first re-sorted in descending order of item fuzzy memberships of the first cluster in order to extract the items of the first cluster, and then the remaining items were re-sorted in descending order of memberships of the second cluster. Note that, in real applications, we cannot construct such whole data [...]. Figure 6 compares the item memberships derived by the proposed secure framework. Although sites 1 and 3 had different numbers of co-clusters from the global co-cluster structures, that is, $(C_1, C_3) = (4, 2)$, their co-cluster structures were also summarized into $C = 3$. In site 1, the first 2 co-clusters were merged into a single co-cluster. On the other hand, in site 3, the second co-cluster was shared by two co-clusters because they cannot be distinguished in the global whole co-cluster structure. Finally, the derived item memberships are compared with the whole data case, where we do not care about privacy issues. Table 2 compares the correlation coefficients between the site-wise or proposed item memberships and the whole data result. In a similar manner to the previous experiment, the table also supports the high performance of the proposed method in collaborative fuzzy co-cluster analysis.
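The item rearrangement used for such a visual check can be sketched as below. The rule for deciding which items belong to a cluster is an assumption made here for illustration (each item goes to the cluster where its membership is largest); the function name and the toy memberships are hypothetical.

```python
def arrange_items(W):
    """Order items cluster by cluster, in descending membership within each.

    W: C x m item memberships. An item is attributed to the cluster in
    which its membership is largest (assumed extraction rule).
    """
    C, m = len(W), len(W[0])

    def best_cluster(j):
        return max(range(C), key=lambda c: W[c][j])

    order = []
    for c in range(C):
        members = [j for j in range(m) if best_cluster(j) == c]
        members.sort(key=lambda j: W[c][j], reverse=True)
        order.extend(members)
    return order


# Toy memberships (hypothetical values): C = 2 clusters, m = 4 items.
W = [[0.1, 0.6, 0.0, 0.3],
     [0.5, 0.1, 0.4, 0.0]]
item_order = arrange_items(W)
```

Reordering the columns of the data matrix by `item_order` (and the rows analogously by object memberships) places the cells of each co-cluster next to each other.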
6.3. Data Set 3: Terrorist Attacks.
Third, the proposed secure framework is applied to a social network data set. The terrorist attacks data set, which is available from the LINQS webpage of the Statistical Relational Learning Group @ UMD (http://linqs.cs.umd.edu/projects//index.shtml), consists of 1293 terrorist attacks, each assigned to one of 6 labels indicating the type of the attack. Each attack is characterized by 106 distinct features with a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. The goal of this experiment is to extract structural knowledge on the terrorist attacks from the $1293 \times 106$ cooccurrence matrix.
In this experiment, a virtual situation of four allied states is considered, where the 106 distinct features are separately [...]. First, the item memberships derived from the distributed matrices are compared with the whole data result. The whole data result was given by applying the conventional FCCM algorithm with $(\lambda_u, \lambda_w) = (0.001, 180.0)$. The goal is to estimate fuzzy memberships from the distributed matrices that are similar to the whole data result. The proposed framework and the site-wise FCCM were applied with $(\lambda_u, \lambda_w) = (0.0035, 100.0)$ and $(\lambda_u, \lambda_w) = (0.01, 100.0)$, respectively.
Table 3 compares the correlation coefficients between the site-wise or proposed item memberships and the whole data result. In a similar manner to the previous experiments, the collaborative knowledge is much closer to the whole data result than the site-wise one. This result implies the applicability of the proposed framework in strategic collaboration of allied states.
Next, the cross tabulations of the labeled classes and clusters are compared for validating the utility of the object partitions. In Table 4, the three main classes are compared with the maximum membership cluster assignment. Although the site-wise models derived only quite degraded object partitions, the proposed collaborative model could reconstruct a result almost equivalent to the whole data case.
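Such a cross tabulation can be sketched in a few lines: each object is assigned to the cluster with its maximum membership and the (class, cluster) pairs are counted. The labels and memberships below are toy values chosen for illustration only and are not results from the experiment.

```python
from collections import Counter


def cross_tabulate(labels, U):
    """Count (class label, cluster) pairs under maximum membership assignment.

    labels: class label of each object; U: C x n object memberships.
    """
    C, n = len(U), len(labels)
    assigned = [max(range(C), key=lambda c: U[c][i]) for i in range(n)]
    return Counter(zip(labels, assigned))


# Toy example (hypothetical values): three objects, C = 2 clusters.
labels = ["Bombing", "Bombing", "Kidnapping"]
U = [[0.9, 0.2, 0.1],
     [0.1, 0.8, 0.9]]
table = cross_tabulate(labels, U)
```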
These results show that the proposed model efficiently achieves secure co-clustering from the viewpoints of both object and item partitions and is suitable for collaborative co-clustering tasks.

Conclusions
In this paper, a novel framework for collaborative fuzzy co-cluster analysis was proposed, in which vertically distributed cooccurrence matrices can be jointly analyzed with personal privacy preservation. In the joint calculation of object fuzzy memberships, a secure encryption operation was adopted for calculating cluster-wise typicalities without broadcasting each element of the individual cooccurrence matrices. Then, item fuzzy memberships are securely estimated in each site. Several experimental results demonstrated that collaborative analysis reveals the global intrinsic co-cluster structures of separate matrices better than individual site-wise analysis.
The proposed framework is expected to enhance the collaborative utilization of many distributed databases, such as strategic marketing in corporation groups, collaborative medical development in hospitals, and strategic military actions in allied countries, because the participants can share common knowledge while withholding their independent sensitive information.
A possible future work is to evaluate the responsibility (utility) degree of each site. In the present model, each site is equally responsible for clustering estimation, while some sites may have only unreliable independent information. Because the site-wise sum-to-one condition on item memberships can bring an undesirable influence from sites with low confidence, the responsibility of each site should be evaluated considering its confidence and should be fairly reflected in the object membership calculation. A noise rejection mechanism [21,22] would be promising for removing unreliable sites.
[...] in site $t_T$. Then, site $t_T$ broadcasts $u_{ci}$ to all sites.

Table 3: Comparison of partition quality measured by correlation coefficients among item memberships (terrorist attacks).

Table 4: Comparison of cross tabulation tables of object partition (terrorist attacks).

[...] four states and they want to get collaborative knowledge on the terrorist attacks without publishing their observed features such as military intelligence. The 106 features were distributed to the four states such that $(m_1, m_2, m_3, m_4) = (26, 26, 27, 27)$; that is, each state has only a part of the whole features (1293 × * matrices), but the states want to get the knowledge that would be given by the whole data case. Because three of the six labeled classes have fewer objects (attacks), the characteristics of the three major classes (Bombing, Kidnapping, and Weapon-Attack) are mainly discussed with $C = 3$.