Evaluation of Plaid Models in Biclustering of Gene Expression Data

Background. Biclustering algorithms have been proposed for the analysis of high-dimensional gene expression data. Among them, the plaid model is arguably one of the most flexible biclustering models to date. Objective. The main goal of this study is to evaluate plaid models on both simulated data and real gene expression datasets. Methods. Two simulated matrices with different degrees of overlap and noise were generated, and the known structure of these data was compared with the biclusters returned by the algorithm. We also assessed the biological significance of the discovered biclusters by GO analysis. Results. When there was no noise, the algorithm discovered almost all of the biclusters. With moderate noise, it did not perform well in finding overlapping biclusters, and with large noise the biclustering result was not reliable. Conclusion. The plaid model needs to be modified, because it cannot find good biclusters when the data contain moderate or large noise. Because it is a statistical model and quite flexible, the model can be adapted and the error distribution changed in order to reduce these errors.


Introduction
In biology, the cell is the basic structure of any organism. All cells of an organism carry the same genes, which may be expressed at different levels across numerous conditions [1]. Different conditions can affect both whether a particular gene is expressed and how strongly it is expressed, and altered expression may compromise the health of the organism. It is therefore crucial to evaluate gene expression levels under stress factors [2]. In recent years, DNA microarray technology has made it possible to monitor the expression of thousands of genes simultaneously while cells are under different conditions and undergoing various processes. This technology plays a key role in accelerating gene expression studies and increasing their efficiency [3]. Its development has led to the availability of gene expression matrices whose rows contain thousands of genes and whose columns contain hundreds of conditions [4]. Clustering has been one of the most important techniques for pattern recognition and can find groups of genes with similar expression patterns [3]. In gene expression data, however, some genes behave similarly only under a subset of conditions and may not be expressed under the other conditions. Furthermore, a gene can take part in more than one such subset. Traditional clustering methods therefore fail to discover such patterns [5]. To overcome these constraints and find the appropriate gene expression patterns, biclustering methods have been proposed, whose computational framework is more flexible [6]. A bicluster is a subset of genes that has similar expression patterns over a subset of conditions, so biclustering methods identify homogeneous submatrices [7]. The first biclustering algorithm, the so-called block clustering, was developed in [8].
Cheng and Church proposed the first biclustering algorithm for the analysis of high-dimensional gene expression data [9]. Since then, many different biclustering algorithms have been developed. Currently, there is a diverse spectrum of biclustering tools that follow different algorithmic concepts, based on the type of bicluster and the definition of the pattern they target [10]. Each algorithm is built on a particular coherence pattern and therefore identifies different submatrices. For instance, the plaid model finds constant-value biclusters, the Cheng and Church (CC) model finds constant-row biclusters, and OPSM and ISA find coherent-evolution biclusters [11]. Biclustering algorithms nevertheless share some common issues. The first is noise or error in the data, which limits the discovery of appropriate biclusters [12]. The second is the ability of an algorithm to find overlapping biclusters [13]. An important question is therefore whether the algorithms can find valid biclusters in the presence of these issues. Most algorithms ignore noise and discover biclusters from all of the gene expression data [14], and some of them cannot find overlapping biclusters [13].
Distribution parameter identification is one class of biclustering algorithms. It assumes that the data structure follows a statistical model and fits the parameters of that model to the data by minimizing a certain criterion through an iterative approach [15]. Plaid models, spectral biclustering, and rich probabilistic models are examples of this class. Among them, the plaid model is arguably one of the most flexible biclustering models to date. The model, which describes the biclustering structure of the data matrix, was proposed in [16,17] and modified in [18]. It defines the expression levels as a sum of layers, each of which is constructed as a bicluster. The main goal of this study is to provide a systematic evaluation of plaid models. To that end, we investigate the model on both simulated data and real gene expression datasets using validity indices.

Overview of Plaid
Model. The plaid model is a model-based biclustering approach used for the analysis of gene expression data. It is a statistical model that assumes each matrix entry is the sum of a uniform background and the contributions of the biclusters. The expression matrix $Y$ with $n$ genes (rows) and $m$ conditions (columns) is therefore represented as $Y_{ij} = \mu_0 + \sum_{k=1}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk} + \varepsilon_{ij}$, where $\mu_0$ is a general matrix background, $\varepsilon_{ij}$ is the residual, and $\theta_{ijk} = \mu_k + \alpha_{ik} + \beta_{jk}$. Here $\mu_k$ is the added background in bicluster $k$, and $\alpha_{ik}$ and $\beta_{jk}$ are row- and column-specific additive constants in bicluster $k$. Also, $\rho_{ik} \in \{0, 1\}$ and $\kappa_{jk} \in \{0, 1\}$ are the gene-bicluster and condition-bicluster membership indicator variables. The general biclustering problem is then formulated as finding parameter values such that the resulting matrix fits the original data as closely as possible. Formally, the problem is to minimize $\sum_{i,j} \left[ Y_{ij} - \mu_0 - \sum_{k=1}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk} \right]^2$.
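As a minimal sketch of the additive structure described above, the following Python code builds the fitted matrix from a background value plus one or more layers and evaluates the least squares criterion. The function names, the dictionary layout of a layer, and the toy numbers are illustrative assumptions and are not taken from the study or from any plaid implementation.

```python
def plaid_fitted(n_rows, n_cols, mu0, layers):
    """Fitted value for every cell: background plus all layers covering it."""
    fitted = [[mu0] * n_cols for _ in range(n_rows)]
    for layer in layers:
        for i in range(n_rows):
            if not layer["rho"][i]:        # gene i is not in this bicluster
                continue
            for j in range(n_cols):
                if layer["kappa"][j]:      # condition j is in this bicluster
                    fitted[i][j] += layer["mu"] + layer["alpha"][i] + layer["beta"][j]
    return fitted


def residual_sum_of_squares(data, fitted):
    """Least squares criterion that the plaid fit tries to minimize."""
    return sum((y - f) ** 2
               for row_y, row_f in zip(data, fitted)
               for y, f in zip(row_y, row_f))


# Hypothetical toy input: one layer on rows 0-1 and columns 0-1 of a 3 x 3 matrix.
layer = {
    "mu": 2.0,                    # layer background
    "alpha": [0.5, -0.5, 0.0],    # row effects
    "beta": [1.0, -1.0, 0.0],     # column effects
    "rho": [1, 1, 0],             # gene membership indicators
    "kappa": [1, 1, 0],           # condition membership indicators
}
fitted = plaid_fitted(3, 3, 1.0, [layer])
data = [[4.0, 3.0, 1.0], [3.5, 1.5, 1.0], [1.0, 1.0, 2.0]]
rss = residual_sum_of_squares(data, fitted)
```

Because the layers are simply added, a cell that belongs to two biclusters receives both layer effects, which is how the model represents overlap.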

Simulation Data.
Two simulated matrices with different degrees of overlap and noise were generated, and the known structure of these data was compared with the biclusters returned by the algorithm. We embedded two biclusters in each matrix, with overlap degrees of 0%, 10%, and 25% and noise degrees of 0%, 1%, 3%, 5%, and 10%. The simulated matrices had sizes 50 × 20 and 500 × 50. Background entries were generated with distribution (0, 100), and the two embedded biclusters were drawn from normal distributions with means of 3 and 9, respectively, and variance 0.1. Noise was introduced with a binary distribution, and the noise values were then generated from a normal distribution with mean 20 and variance 4. We evaluated the performance of the plaid algorithm on three criteria: the number of rows, the number of columns, and the overlap degree of the biclusters. We ran the algorithm for 1000 iterations and report the averages of these criteria. Table 1 lists statistics for the numbers of rows and columns of the generated biclusters.
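The generation scheme can be sketched as follows. Several details are assumptions made only for illustration and are labeled in the comments: the background is taken to be uniform on (0, 100) because the text does not preserve the distribution family, the binary noise step is read as an independent per-cell indicator, cells shared by two biclusters take the value of the later bicluster, and the bicluster positions and sizes are hypothetical.

```python
import random


def simulate_matrix(n_rows, n_cols, biclusters, noise_rate, seed=0):
    """Return (matrix, number of noisy cells) for one simulated dataset."""
    rng = random.Random(seed)
    # ASSUMPTION: background drawn uniformly from (0, 100).
    data = [[rng.uniform(0, 100) for _ in range(n_cols)] for _ in range(n_rows)]
    # Embed each bicluster as normal values with the given mean and variance 0.1.
    # ASSUMPTION: in an overlap, the later bicluster overwrites the earlier one.
    for rows, cols, mean in biclusters:
        for i in rows:
            for j in cols:
                data[i][j] = rng.gauss(mean, 0.1 ** 0.5)
    # ASSUMPTION: each cell becomes noise with probability noise_rate and
    # is then replaced by a draw from a normal with mean 20 and variance 4.
    noisy = 0
    for i in range(n_rows):
        for j in range(n_cols):
            if rng.random() < noise_rate:
                data[i][j] = rng.gauss(20, 2.0)
                noisy += 1
    return data, noisy


# Hypothetical bicluster positions for the 50 x 20 case, without overlap.
biclusters = [(range(0, 10), range(0, 5), 3), (range(10, 20), range(5, 10), 9)]
clean, clean_noisy = simulate_matrix(50, 20, biclusters, noise_rate=0.0)
noisy_data, n_noisy = simulate_matrix(50, 20, biclusters, noise_rate=0.05)
```

Shifting the row and column ranges of the second bicluster so that they intersect those of the first gives the overlapping scenarios.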

Biological Significance.
Biclustering techniques applied to microarray data return groups of genes that are strongly coexpressed, so we expect the genes in a group to share the same functions. Gene Ontology (GO) can be used to measure this similarity; it covers three domains: cellular component, molecular function, and biological process. GO enrichment validation is a hypergeometric test for GO enrichment. The test is significant when the genes in a bicluster are annotated with a GO term more often than would be expected by chance. To evaluate the quality of the biclusters, we therefore applied the plaid algorithm to a real dataset and assessed the biological significance of the discovered biclusters with the Database for Annotation, Visualization and Integrated Discovery (DAVID) bioinformatics resources [19]. The real dataset comes from a 2005 study of breast cancer (docetaxel resistance) and is included in CGED [20]. Forty-four breast tumor tissues were sampled through biopsy, and 2453 genes were assayed.
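For illustration, the hypergeometric tail probability behind a GO enrichment test can be computed as in the sketch below. This is a generic version of the test, not the exact procedure used by DAVID, and the function name and the counts in the usage line are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, x) * comb(population - annotated, sample - x)
               for x in range(hits, upper + 1))
    return tail / total


# Hypothetical counts: 2453 assayed genes, 40 annotated with a term,
# a bicluster of 189 genes, 12 of which carry the term.
p_value = hypergeom_enrichment_p(2453, 40, 189, 12)
```

A small p value indicates that the term occurs in the bicluster more often than random sampling of genes would explain. When many GO terms are tested, a multiple-testing correction is normally applied as well.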

Simulation Data.
In this section, we applied the plaid algorithm to the two simulated datasets and evaluated the results. Two R resources, the biclust package and Bioconductor, were used. Table 1 shows the results for the matrix of size 50 × 20. When there was no noise, the algorithm was capable of discovering biclusters with different degrees of overlap, and the results corresponded exactly to what had been generated, so the plaid model is efficient in this case. With a little noise (about 1%), however, about 20-50% of the biclusters could not be found correctly, and when the biclusters overlapped, the algorithm could not discover the overlap. With larger noise, the plaid algorithm could not identify any biclusters. Table 2 shows the results for the matrix of size 500 × 50. With no noise and no overlap in the data, the algorithm discovered all biclusters correctly. When the overlap degree was 10% or 25%, the algorithm still performed well and discovered 90% of the elements. With 1%, 3%, and 5% noise in the dataset, the algorithm correctly discovered 80%, 40-50%, and less than 40% of the biclusters, respectively. With 10% noise, the algorithm could not find the biclusters correctly and most of the biclusters were missed. In addition, whenever there was noise in the data, the algorithm could not discover the overlapping biclusters. Figure 1 shows the percentage of correctly recovered rows and columns of the biclusters in the 500 × 50 matrix. As the amount of noise grows, the proportion of correctly identified biclusters falls, especially when the biclusters overlap.
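The percentages above compare the rows, columns, and overlap of a discovered bicluster with the embedded one. A minimal sketch of such a comparison is given below; the function names and the index sets are hypothetical, and the study's exact scoring rule may differ.

```python
def recovery(true_idx, found_idx):
    """Fraction of the embedded rows (or columns) that were discovered."""
    true_set, found_set = set(true_idx), set(found_idx)
    if not true_set:
        return 0.0
    return len(true_set & found_set) / len(true_set)


def overlap_cells(rows_a, cols_a, rows_b, cols_b):
    """Cells shared by two biclusters, as (row, column) pairs."""
    shared_rows = set(rows_a) & set(rows_b)
    shared_cols = set(cols_a) & set(cols_b)
    return {(i, j) for i in shared_rows for j in shared_cols}


# Hypothetical example: 10 embedded rows, of which 8 are discovered.
row_recovery = recovery(range(0, 10), range(2, 12))

# Hypothetical overlap of two embedded biclusters versus two discovered ones.
true_overlap = overlap_cells(range(0, 10), range(0, 5), range(8, 18), range(3, 8))
found_overlap = overlap_cells(range(0, 9), range(0, 5), range(8, 18), range(4, 8))
overlap_recovery = len(true_overlap & found_overlap) / len(true_overlap)
```

Averaging these fractions over repeated runs gives summary figures of the kind reported in the tables.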

Real Data.
First, the dataset was normalized with the median approach, and missing values were then imputed with the k-nearest-neighbor (KNN) method. In our experiment we found 5 biclusters, the smallest of size 3 and the largest of size 189. Information about the discovered biclusters is shown in Table 3. In this table, the first column contains the label of each bicluster, the second and third columns report the numbers of genes and conditions, respectively, and the last column contains the mean square residue of the bicluster. Table 4 shows the significant GO terms for the set of genes in each bicluster, along with their p values. We used the web tool DAVID to evaluate the discovered biclusters. For each bicluster, we first recorded the number of GO terms and then evaluated the significance of the functions.
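The mean square residue reported for each bicluster can be illustrated with the sketch below, which assumes the usual Cheng and Church definition: the average squared deviation of each entry from the value predicted by its row mean, column mean, and overall mean within the bicluster. The function name and the toy matrices are hypothetical.

```python
def mean_square_residue(data, rows, cols):
    """Mean square residue of the submatrix given by `rows` and `cols`."""
    rows, cols = list(rows), list(cols)
    sub = [[data[i][j] for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_means = [sum(r) / m for r in sub]
    col_means = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    overall = sum(row_means) / n
    return sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
               for i in range(n) for j in range(m)) / (n * m)


# A perfectly additive pattern (row effect plus column effect) has residue 0.
additive = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
msr_additive = mean_square_residue(additive, range(3), range(3))

# A checkerboard pattern is not additive and has a positive residue.
checker = [[1, 0], [0, 1]]
msr_checker = mean_square_residue(checker, range(2), range(2))
```

Smaller values indicate a more coherent bicluster. Very small biclusters tend to reach low residues easily, so the measure is best read together with the bicluster size.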

Discussion
Many biclustering algorithms and models have already been proposed. To date, one of the most flexible biclustering models is the plaid model [21].
To evaluate the plaid model statistically in the biclustering of gene expression data, we generated two datasets with different degrees of noise and overlap and also used a real dataset. The results were then assessed with statistical and biological criteria. The algorithm performs well when the data matrix is small and there is no noise. In this case, it is capable of discovering biclusters with different degrees of overlap. When there is a little noise, however, the algorithm cannot discover all biclusters correctly and most of the information is missed. Likewise, when the noise is large, it cannot identify any biclusters. For the matrix of size 500 × 50, the biclusters are well discovered when there is no noise in the data, and in this case the algorithm finds almost all overlapping biclusters. With little (1%) and moderate (3% and 5%) noise in the dataset, the algorithm does not perform well in finding overlapping biclusters, and when the noise is large, the biclustering result is not reliable. For the biological evaluation, we applied the plaid model to a breast cancer dataset containing 2243 genes and 70 conditions. We found 5 biclusters whose MSR values are small, which led to their acceptance in the experiment. For each bicluster, we checked the number of GO terms. The minimum and maximum numbers of GO terms are 10 and 748, which correspond to the biclusters with 3 and 189 genes, respectively. As a result, biclusters A, B, and D are highly enriched, and the largest biclusters are the most acceptable. Biclusters with very few genes may not be interpretable and should perhaps be rejected.
Most studies that use biclustering are concerned with introducing a new approach, while only a few have evaluated existing methods, especially the plaid model. In addition, most of these evaluations have been carried out on gene expression datasets rather than on simulated data. Prelic et al. in 2006 evaluated 5 biclustering algorithms: CC [9], SAMBA [3], OPSM [22], ISA [23], and xMotif [24]. Their study showed that, on noise-free data, ISA and SAMBA discovered 80% of the biclusters correctly, whereas CC and xMotif discovered less than 40%. When the noise level rose above 5%, the efficiency of all algorithms decreased sharply, except for ISA, which is robust to noise [4]. Eren et al. in 2012 compared 12 biclustering algorithms on synthetic data and demonstrated that no algorithm was able to fully separate biclusters with substantial overlap. They also showed that model-based algorithms seem more robust to noise than the others. In the end, they found the highest proportion of enriched biclusters in gene ontology analysis [25].

Conclusions
In this study, we evaluated the capability of the plaid algorithm to identify biologically significant groups of genes that are coexpressed under a number of conditions. The evaluation criteria used in our study were GO annotation and simulation studies. GO enrichment analysis showed that the biological significance of each bicluster is high, especially when the bicluster is large. The purpose of the simulation study was to evaluate the plaid model under different degrees of noise and overlap. The results show that when there is no noise in the data, the algorithm correctly discovers biclusters with overlap. When there is a little noise and the data matrix is small, the algorithm cannot find the biclusters properly, yet if the matrix is large it can still recognize them. Furthermore, when there is large noise in the data, the algorithm cannot discover the biclusters and the results are not reliable. Both the simulation studies and the real data analysis demonstrate that the plaid algorithm is suitable for discovering patterns, and provides useful information for researchers, in large datasets with little noise.
Some issues should be considered when using the plaid model. The plaid model is a statistical model and a quite flexible one, so it can be improved and used in genomic studies.
The model was constructed on the basis of the normal distribution, and its parameters are estimated by minimizing the least squares criterion. When the data are not normally distributed, however, the least squares criterion seems to be inefficient [26]. The distribution of normalized gene expression values often has heavy tails and is asymmetric. The traditional centering and scaling indexes used in normal approximations, the mean and the standard deviation, are sensitive to outliers [27]. The normal distribution is therefore not efficient when the data are noisy, and it is better to use the Laplace distribution in the plaid model: it uses the median as the location parameter and the mean absolute deviation as the scale parameter, which is robust to noise and outliers.
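The difference between the two sets of estimates can be seen in a small numerical sketch. Under a normal model the location and scale estimates are the mean and the standard deviation, whereas under a Laplace model the maximum likelihood estimates are the median and the mean absolute deviation from the median. The function name and the sample values below are hypothetical.

```python
from statistics import mean, median, pstdev


def laplace_location_scale(values):
    """Maximum likelihood estimates for a Laplace distribution:
    the median and the mean absolute deviation from the median."""
    loc = median(values)
    scale = mean([abs(v - loc) for v in values])
    return loc, scale


clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]    # one value replaced by a gross outlier

# Normal-based estimates: both move strongly toward the outlier.
normal_clean = (mean(clean), pstdev(clean))
normal_outlier = (mean(with_outlier), pstdev(with_outlier))

# Laplace-based estimates: the location does not move at all.
laplace_clean = laplace_location_scale(clean)
laplace_outlier = laplace_location_scale(with_outlier)
```

In this toy case the mean jumps from 3 to 22 while the median stays at 3. The Laplace scale estimate also grows, but less than the standard deviation, because deviations enter linearly rather than squared.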
The results also show that more than 80% of the overlapping elements can be discovered by the plaid model in noise-free data, but noise in the data causes problems in the discovery of overlap among biclusters. Using the Laplace distribution, which makes the algorithm more robust to noise, should therefore also improve the discovery of overlap among biclusters.