Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data

During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available for researchers.


Introduction
One of the most important research fields in data mining is mining interesting patterns (such as sequences, episodes, association rules, correlations, or clusters) in large data sets. Frequent itemset mining is one of the earliest such concepts, originating from economic market basket analysis with the aim of understanding the behaviour of retail customers, or, in other words, finding frequent combinations and associations among items purchased together [1]. Market basket data can be considered as a matrix with transactions as rows and items as columns. If an item appears in a transaction it is denoted by 1 and otherwise by 0. The general goal of frequent itemset mining is to identify all itemsets that occur in at least a required number of transactions, referred to as the minimum support threshold. By definition, all subsets of a frequent itemset are frequent. Therefore, it is also important to provide a minimal representation of all frequent itemsets without losing their support information. Such itemsets are called frequent closed itemsets. An itemset is defined as closed if none of its immediate supersets has exactly the same support count as the itemset itself. For comprehensive reviews of efficient frequent itemset mining algorithms, see [2,3].
Independently of frequent itemset mining, biclustering, another important data mining concept, was proposed to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters purely based on their similarities. This property makes biclustering a powerful approach especially when it is applied to data with a large number of objects. During recent years, many biclustering algorithms have been developed especially for the analysis of gene expression data [4]. With biclustering, genes with similar expression profiles can be identified not only over the whole data set but also across subsets of experimental conditions by allowing genes to simultaneously belong to several expression patterns. For comprehensive reviews on biclustering, see [4][5][6].
One of the most important properties of biclustering when applied to binary (0, 1) data is that it provides the same results as frequent closed itemset mining (Figure 1). Such biclusters, called inclusion-maximal biclusters (or IMBs), were introduced in [7] together with a mining algorithm, BiMAX, to discover all biclusters in a binary matrix that are not entirely contained in any other bicluster. By default an IMB can contain any number of genes and samples. Once an additional minimum support threshold is imposed, so that only clusters having at least the required minimum number of genes are discovered, BiMAX and all frequent closed itemset mining methods yield the same patterns.
In this paper we propose an efficient pattern mining method to find frequent closed itemsets/biclusters in binary high-dimensional data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment, rigorously tested on both synthetic and real data sets, and is freely available for researchers (http://pr.mk.unipannon.hu/Research/bit-table-biclustering/).

Problem Formulation
In this section we show how both market basket data and gene expression data can be represented as bit-tables before providing a new mining method in the next section. In the case of real gene expression data, it is common practice in the field of biclustering to transform the original gene expression matrix into a binary one in such a way that gene expression values are transformed to 1 (expressed) or 0 (not expressed) using an expression cutoff (e.g., twofold change of the log2 expression values). The binarized data can then be used as classic market basket data and defined as follows (Figure 2): let $T = \{t_1, \ldots, t_N\}$ be the set of transactions and let $I = \{i_1, \ldots, i_n\}$ be the set of items. The transaction database can be transformed into a binary matrix, $\mathbf{B}^0$, where each row corresponds to a transaction and each column corresponds to an item (right side of Figure 2). Therefore, the bit-table contains 1 if the item is present in the current transaction and 0 otherwise [8].
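The two preprocessing routes described above can be illustrated with a minimal sketch in plain Python (not the authors' MATLAB code). The function names, the sample values, the cutoff of 1.0, and the choice of `>=` for the comparison are all assumptions made for illustration only.

```python
def binarize(expression, cutoff):
    """Map each expression value to 1 if it reaches the cutoff, else 0.

    Whether the comparison is strict is an assumption of this sketch.
    """
    return [[1 if value >= cutoff else 0 for value in row] for row in expression]


def to_bit_table(transactions, items):
    """Build a bit-table: one row per transaction, one column per item."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


# Hypothetical expression values: 3 genes (rows) x 4 conditions (columns).
expression = [
    [2.3, 0.1, 1.4, 0.0],
    [0.2, 1.9, 1.1, 0.3],
    [1.0, 1.2, 0.4, 2.5],
]
binary_expression = binarize(expression, cutoff=1.0)

# Hypothetical market basket data over four items.
items = ["a", "b", "c", "d"]
transactions = [{"a", "c"}, {"b", "c"}, {"a", "b", "d"}]
bit_table = to_bit_table(transactions, items)
```

In this toy example both routes happen to produce the same bit-table, which mirrors the point that binarized expression data can be handled exactly like market basket data.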
Using the above terminology, a transaction $t$ is said to support an itemset $J$ if it contains all items of $J$; that is, $J \subseteq t$. The support of an itemset is the number of transactions that support this itemset. Using $\mathrm{sup}$ for the support count, the support of itemset $J$ is $\mathrm{sup}(J) = |\{t \mid J \subseteq t, t \in T\}|$. An itemset is frequent if its support is greater than or equal to a user-specified threshold, $\mathrm{sup}(J) \geq \mathrm{minsupp}$. An itemset $J$ is called a $k$-itemset if it contains $k$ items from $I$; that is, $|J| = k$. An itemset $J$ is a frequent closed itemset if it is frequent and there exists no proper superset $J' \supset J$ such that $\mathrm{sup}(J') = \mathrm{sup}(J)$.
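These definitions translate almost directly into code. The following plain Python sketch uses hypothetical function names and toy data; it checks closedness by testing only one-item extensions, which is sufficient because adding items can never increase the support.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def is_frequent(itemset, transactions, minsupp):
    """True if the support reaches the user-specified threshold."""
    return support(itemset, transactions) >= minsupp


def is_closed(itemset, transactions, items):
    """True if every one-item extension has strictly smaller support."""
    base = support(itemset, transactions)
    return all(
        support(itemset | {item}, transactions) < base
        for item in items - itemset
    )


# Hypothetical toy database.
items = {"a", "b", "c", "d"}
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]

sup_a = support({"a"}, transactions)            # "a" occurs in 3 transactions
sup_ab = support({"a", "b"}, transactions)      # "a" and "b" co-occur in 3
a_closed = is_closed({"a"}, transactions, items)        # False: {a, b} has the same support
ab_closed = is_closed({"a", "b"}, transactions, items)  # True
```

Here {a} is frequent for minsupp = 2 but not closed, because its superset {a, b} is supported by exactly the same transactions.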
The problem of mining frequent itemsets was introduced by Agrawal et al. in [1] and the first efficient algorithm, called Apriori, was published by the same group in [9]. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of the previously determined frequent itemsets to identify longer and longer frequent itemsets. Mannila et al. proposed the same technique independently in [10], and both works were combined in [11]. In many cases, frequent itemset mining approaches have good performance, but they may generate a huge number of substructures satisfying the user-specified threshold. It is easy to see that if an itemset is frequent then all its subsets are frequent as well (for more details, see "downward closure property" in [9]). Although increasing the threshold might reduce the number of resulting itemsets and thus solve this problem, it would also remove interesting patterns with low frequency. To overcome this, the problem of mining frequent closed itemsets was introduced by Pasquier et al. in 1999 [12], where frequent itemsets which have no proper superitemset with the same support value (or frequency) are searched. The main benefit of this approach is that the set of closed frequent itemsets contains the complete information regarding its corresponding frequent itemsets. During the following few years, various algorithms were presented for mining frequent closed itemsets, including CLOSET [13], CHARM [14], FPclose [15], AFOPT [16], CLOSET+ [17], DBV-Miner [18], and STreeDC-Miner [19]. The main computational task of closed itemset mining is to check whether an itemset is a closed itemset. Different approaches have been proposed to address this issue. CHARM, for example, uses a hashing technique on its TID (transaction identifier) values, while AFOPT, FPclose, CLOSET, CLOSET+, and STreeDC-Miner maintain the identified itemsets in an FP-tree-like pattern-tree. Further reading about closed itemset mining can be found in [20].
The formulations above reveal the close relationship between closed frequent itemsets and biclusters, since the goal of biclustering is to find biclusters $B_k = (R_k, C_k)$, with row set $R_k$ and column set $C_k$, that are not entirely contained in any other bicluster $B_l = (R_l, C_l)$; that is, $R_k \not\subseteq R_l$ or $C_k \not\subseteq C_l$ for every $l \neq k$. Therefore, while the size restriction on the rows of a bicluster corresponds to the frequency condition of itemsets, the "maximality" of a bicluster corresponds to the closedness of an itemset. Thus, if itemsets that are supported by fewer than the minimum number of rows (min rows) are filtered out, the set of all closed frequent itemsets will be equal to the set of all maximal biclusters.
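The correspondence can be checked on a small bit-table. The sketch below, in plain Python with hypothetical function names and toy data, treats an itemset as a set of column indices, takes the supporting transactions as the rows of the bicluster, and tests whether the resulting all-ones submatrix can be extended by another row or column.

```python
def supporting_rows(columns, table):
    """Rows (transactions) that contain a 1 in every given column (item)."""
    return [r for r, row in enumerate(table) if all(row[c] for c in columns)]


def is_maximal(rows, columns, table):
    """True if no further row or column can be added to the all-ones block."""
    extra_column = any(
        all(table[r][c] for r in rows)
        for c in range(len(table[0])) if c not in columns
    )
    extra_row = any(
        all(table[r][c] for c in columns)
        for r in range(len(table)) if r not in rows
    )
    return not extra_column and not extra_row


# Hypothetical bit-table: 3 transactions (rows) x 3 items (columns).
table = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]

rows_0 = supporting_rows([0], table)        # itemset {item 0}
rows_01 = supporting_rows([0, 1], table)    # itemset {item 0, item 1}
max_0 = is_maximal(rows_0, [0], table)      # False: item 1 can be added
max_01 = is_maximal(rows_01, [0, 1], table) # True: matches a closed itemset
```

Itemset {0} is not closed (adding item 1 keeps the support at 2), and accordingly its bicluster is not maximal, whereas the closed itemset {0, 1} yields a maximal bicluster.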

Mining Frequent Closed Itemsets Using Bit-Table Operations
In this section we introduce a novel frequent closed itemset mining algorithm and propose an efficient implementation of the algorithm in the MATLAB environment. Note that the proposed method can also be applied to various biclustering application fields, such as gene expression data analysis, after a proper preprocessing (binarization) step. The schematic view of the proposed pipeline is shown in Figure 3.

The Proposed Mining Algorithm.
The mining procedure is based on the Apriori principle. Apriori is an iterative algorithm that determines frequent itemsets level-wise in several steps (iterations). In any step $k$, the algorithm calculates all frequent $k$-itemsets based on the already generated $(k-1)$-itemsets. Each step has two phases: candidate generation and frequency counting. In the first phase, the algorithm generates a set of candidate $k$-itemsets from the set of frequent $(k-1)$-itemsets from the previous pass. This is carried out by joining frequent $(k-1)$-itemsets together. Two frequent $(k-1)$-itemsets are joinable if their lexicographically ordered first $k-2$ items are the same and their last items are different. Before the algorithm enters the frequency counting phase, it discards every new candidate itemset having a subset that is infrequent (utilizing the downward closure property). In the frequency counting phase, the algorithm scans through the database and counts the support of the candidate itemsets. Finally, candidates with support not lower than the minimum support threshold are added to the set of frequent itemsets.
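The candidate generation phase (join followed by pruning) can be sketched as follows in plain Python. The function name and the representation of itemsets as sorted tuples are assumptions of this illustration, not part of the published implementation.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Join frequent (k-1)-itemsets and prune by downward closure.

    Itemsets are assumed to be sorted tuples of equal length k-1.
    """
    frequent_set = set(frequent_prev)
    ordered = sorted(frequent_set)
    candidates = []
    for a, b in combinations(ordered, 2):
        # Joinable: identical first k-2 items; the last items differ
        # because a and b are distinct itemsets with the same prefix.
        if a[:-1] != b[:-1]:
            continue
        candidate = a + (b[-1],)
        # Prune: every (k-1)-subset of the candidate must be frequent.
        subsets = combinations(candidate, len(candidate) - 1)
        if all(subset in frequent_set for subset in subsets):
            candidates.append(candidate)
    return candidates


# Hypothetical frequent 2-itemsets.
frequent_2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
candidates_3 = generate_candidates(frequent_2)
```

In this example (a, b) and (a, c) join to (a, b, c), which survives pruning, while (b, c) and (b, d) join to (b, c, d), which is discarded because its subset (c, d) is not frequent.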
A simplified pseudocode of the Apriori algorithm is presented in Pseudocode 1, which is extended by extracting only the closed itemsets in line 9. While the candidate generation procedure produces the candidate itemsets, the support counting procedure (in row 5) counts the support of all candidate itemsets and removes the infrequent ones.
The storage structure of the candidate itemsets is crucial to keep both memory usage and running time reasonable. In the literature, hash-tree [9,11,21] and prefix-tree [22,23] storage structures have been shown to be efficient. The prefix-tree structure is more common, due to its efficiency and simplicity, but a naive implementation can still be very space consuming.
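A level-wise loop of this kind, extended with a closedness filter, might look like the following plain Python sketch. It is not a transcription of Pseudocode 1: the function names, the dictionary-based bookkeeping, and the toy data are assumptions. An itemset is reported as closed when no frequent itemset of the next level contains it with the same support.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions containing all items of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def frequent_closed_itemsets(transactions, minsupp):
    """Apriori-style level-wise search that keeps only closed itemsets."""
    items = sorted(set().union(*transactions))
    level = {(i,): support((i,), transactions) for i in items}
    level = {s: c for s, c in level.items() if c >= minsupp}
    closed = {}
    while level:
        # Candidate generation: join on a common prefix, then prune.
        candidates = []
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in level for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
        # Frequency counting: keep candidates that reach minsupp.
        next_level = {}
        for cand in candidates:
            count = support(cand, transactions)
            if count >= minsupp:
                next_level[cand] = count
        # Closedness: no immediate frequent superset with equal support.
        for itemset, count in level.items():
            absorbed = any(
                set(itemset) < set(bigger) and c == count
                for bigger, c in next_level.items()
            )
            if not absorbed:
                closed[itemset] = count
        level = next_level
    return closed


# Hypothetical toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]
result = frequent_closed_itemsets(transactions, minsupp=2)
```

Checking only the next level is enough here because any superset with the same support is itself frequent and therefore appears among the frequent itemsets of that level.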
Our procedure is based on a simple and easily implementable matrix representation of the frequent itemsets. The idea is to store the data and itemsets in vectors. Then, simple matrix and vector multiplication operations can be applied to calculate the supports of itemsets efficiently.
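To indicate the iterative nature of our process, we define the input matrix $\mathbf{A}_{N \times n}$ as $\mathbf{A}_{N \times n} = \mathbf{B}^0_{N^0 \times n^0}$, where $\mathbf{b}^0_j$ represents the $j$th column of $\mathbf{B}^0_{N^0 \times n^0}$, which is related to the occurrence of the $j$th item in the transactions. The support of item $i_j$ can be easily calculated as $\mathrm{sup}(\{i_j\}) = (\mathbf{b}^0_j)^T \mathbf{b}^0_j$. Similarly, the support of the itemset $X_{i,j} = \{i_i, i_j\}$ can be obtained by a simple vector product of the two related vectors, because when both items $i_i$ and $i_j$ appear in a given transaction the product of the two related bits equals the AND connection of the two items: $\mathrm{sup}(X_{i,j} = \{i_i, i_j\}) = (\mathbf{b}^0_i)^T \mathbf{b}^0_j$. The main benefit of this approach is that counting and storing the itemsets are not needed; only matrices of the frequent itemsets are generated based on the element-wise products of the vectors corresponding to the previously generated frequent $(k-1)$-itemsets. Therefore, simple matrix and vector multiplications are used to calculate the supports of the potential $k$-itemsets: the $(i, j)$th element of $(\mathbf{B}^{k-1})^T \mathbf{B}^{k-1}$ is the support of the itemset obtained by joining the $i$th and $j$th elements of $L^{k-1}$, where $L^{k-1}$ denotes the set of $(k-1)$-itemsets. As a consequence, only matrices of the frequent itemsets are generated, by forming the columns of $\mathbf{B}^k_{N^k \times n^k}$ as the element-wise products of the columns of $\mathbf{B}^{k-1}_{N^{k-1} \times n^{k-1}}$; that is, $\mathbf{b}^k_{i,j} = \mathbf{b}^{k-1}_i \circ \mathbf{b}^{k-1}_j$, where $\mathbf{A} \circ \mathbf{B}$ means the Hadamard product of matrices $\mathbf{A}$ and $\mathbf{B}$.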
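The support computation by products of bit-vectors can be illustrated with a short sketch in plain Python (nested lists instead of MATLAB matrices). The helper names and the toy bit-table are assumptions made for this example.

```python
def columns_of(bit_table):
    """Return the columns of a row-major bit-table as lists."""
    return [list(col) for col in zip(*bit_table)]


def support_matrix(bit_table):
    """Compute S = B^T B; S[i][j] is the support of items i and j together."""
    cols = columns_of(bit_table)
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


def hadamard(u, v):
    """Element-wise product of two bit-vectors (logical AND for 0/1 data)."""
    return [x * y for x, y in zip(u, v)]


# Hypothetical bit-table: 4 transactions (rows) x 3 items (columns).
B0 = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]

S = support_matrix(B0)           # diagonal: item supports; off-diagonal: pair supports
b0, b1, b2 = columns_of(B0)
b01 = hadamard(b0, b1)           # bit-vector of the 2-itemset {item 0, item 1}
sup_012 = sum(hadamard(b01, b2)) # support of the 3-itemset {0, 1, 2}
```

The diagonal of `S` holds the supports of the single items, and the bit-vector `b01` can be reused directly to obtain supports of longer itemsets without returning to the original table.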
The concept is simple and easily interpretable and supports a compact and effective implementation. The proposed algorithm follows a philosophy similar to that of the AprioriTID [24] method for generating candidate itemsets. Neither method has to revisit the original data table, $\mathbf{B}^0_{N^0 \times n^0}$, to compute the support of larger itemsets. Instead, our method transforms the table as it proceeds with the generation of the $k$-itemsets. $\mathbf{B}^1_{N^1 \times n^1}$ represents the data related to the frequent 1-itemsets. This table is generated from $\mathbf{B}^0_{N^0 \times n^0}$ by erasing the columns related to the nonfrequent items, which reduces the size of the matrices and improves the performance of the generation process.
Rows that do not contain any frequent itemsets (the sum of the row is zero) in $\mathbf{B}^k_{N^k \times n^k}$ are also deleted. If a column remains, the index of its original position is written into a matrix that stores only the indices ("pointers") of the elements of the itemsets, $\mathbf{L}^1_{n^1 \times 1}$. When the $\mathbf{L}^{k-1}_{n^{k-1} \times (k-1)}$ matrices related to the indices of the $(k-1)$-itemsets are ordered, it is easy to follow the heuristics of the Apriori algorithm, as only those $L^{k-1}$ itemsets will be joined whose first $k-2$ items are identical (the sets of these itemsets form the blocks of the $\mathbf{B}^{k-1}_{N^{k-1} \times n^{k-1}}$ matrix). Figure 4 represents the second step of the algorithm, using minsupp = 3 in the support counting procedure.
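One way to picture the table transformation is the following plain Python sketch. It is an illustration under stated assumptions rather than the authors' MATLAB code: the function names are hypothetical, minsupp is assumed to be at least 1, and the Apriori subset-pruning step is omitted for brevity.

```python
def prune_infrequent_items(bit_table, minsupp):
    """Build B^1 and L^1: drop infrequent columns, then all-zero rows."""
    columns = list(zip(*bit_table))
    kept = [j for j, col in enumerate(columns) if sum(col) >= minsupp]
    rows = [[row[j] for j in kept] for row in bit_table]
    rows = [row for row in rows if any(row)]
    index = [[j] for j in kept]  # "pointers" to the original column positions
    return rows, index


def next_level(table, index, minsupp):
    """Build B^k and L^k from B^(k-1) and L^(k-1) by element-wise products."""
    columns = list(zip(*table))
    new_columns, new_index = [], []
    for i in range(len(index)):
        for j in range(i + 1, len(index)):
            # Join only itemsets that share their first k-2 items (one block).
            if index[i][:-1] != index[j][:-1]:
                continue
            product = [x * y for x, y in zip(columns[i], columns[j])]
            if sum(product) >= minsupp:
                new_columns.append(product)
                new_index.append(index[i] + [index[j][-1]])
    rows = [list(r) for r in zip(*new_columns)] if new_columns else []
    rows = [r for r in rows if any(r)]  # delete rows whose sum is zero
    return rows, new_index


# Hypothetical bit-table: 5 transactions x 4 items, with minsupp = 3.
B0 = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
]
B1, L1 = prune_infrequent_items(B0, minsupp=3)
B2, L2 = next_level(B1, L1, minsupp=3)
```

In this example only items 0 and 1 are frequent, so `B1` keeps two columns and loses the fourth transaction, and `B2` holds the single bit-vector of the frequent 2-itemset whose indices are stored in `L2`.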

Experimental Results
In this section we compare our proposed method to BiMAX [7], which is a highly recognized reference method within the biclustering research community. As BiMAX is regularly applied to binary gene expression data, it serves as a good reference for the comparison. Using several biological and various synthetic data sets, we show that, while both methods are able to discover all patterns (frequent closed itemsets/biclusters), our pattern discovery approach outperforms BiMAX.
To compare the two mining methods and demonstrate their computational efficiency, we applied them to several real and synthetic data sets. The real data come from various biological studies previously used as reference data in biclustering research [25][26][27][28]. For the comparison of the computational efficiency, all biological data sets were binarized. For both the fold-change data (stem cell data sets) and the absolute expression data (Leukemia, Compendium, and Yeast-80), a fold-change cutoff of 2 was used. Results are shown in Table 1 (synthetic data) and Table 2 (real data), respectively. Both methods were able to discover all closed patterns for all synthetic and real data sets. The results show that our method outperforms BiMAX and provides the best running times in all cases, especially when the number of rows and columns is higher. Biological validation of the discovered patterns together with detailed explanations is given in [28].

Conclusion
In this paper we have proposed a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on a simple bit-table based matrix and vector multiplication approach and ensures that all patterns can be discovered in a fast manner. The proposed algorithm can be successfully applied to various bioinformatics problems dealing with high-density biological data, including high-throughput gene expression data.

Disclosure
Attila Gyenesei is joint first author.