An Unsupervised Approach to Predict Functional Relations between Genes Based on Expression Data

This work presents a novel approach to predict functional relations between genes using gene expression data. Genes may have various types of relations between them, for example, regulatory relations, or they may be concerned with the same protein complex or metabolic/signaling pathways and obviously gene expression data should contain some clues to such relations. The present approach first digitizes the log-ratio type gene expression data of S. cerevisiae to a matrix consisting of 1, 0, and −1 indicating highly expressed, no major change, and highly suppressed conditions for genes, respectively. For each gene pair, a probability density mass function table is constructed indicating nine joint probabilities. Then gene pairs were selected based on linear and probabilistic relation between their profiles indicated by the sum of probability density masses in selected points. The selected gene pairs share many Gene Ontology terms. Furthermore a network is constructed by selecting a large number of gene pairs based on FDR analysis and the clustering of the network generates many modules rich with similar function genes. Also, the promoters of the gene sets in many modules are rich with binding sites of known transcription factors indicating the effectiveness of the proposed approach in predicting regulatory relations.


Introduction
The cell works as a system governed by integrated action of the genes indicating that genes are functionally related; for example, they may have regulatory relations between each other or they may be concerned with the same protein complex or metabolic/signaling pathways and so on. Determining functional relations between genes enables development of a genetic network which leads to the prediction of the complex rolls of the genes in different systems in the cell. Nucleotide and/or amino acid sequence similarities have been extensively used to predict functional relation between genes [1,2]. Affinity purification [3,4] and yeast two-hybrid assays [5,6] are employed to determine physical association between proteins which are gene products. Synthetic lethal screens [7] measure the tendency for genes to compensate the loss of other genes. Scientists have performed numerous studies in an attempt to better understand and classify digenic epistatic relationships [8]. In [9] a probabilistic functional network of yeast genes was constructed by integrating diverse genomic data. In [10] an algorithm was proposed for regulatory networks of gene modules that combines information from genome wide location and expression data sets. Constraintbased Bayesian Structure Learning (BSL) techniques, namely, (a) PC Algorithm, (b) Grow-shrink (GS) algorithm, and (c) Incremental Association Markov Blanket (IAMB), were used to model the functional relationships between genes associated with differentiation potential of aged myogenic progenitors in the form of acyclic networks from the clonal expression profiles [11]. Attempts have been made not only to determine functional relationship between individual genes but also to measure functional relationship between gene sets [12]. Many more similar studies can be cited. Microarray gene expression data incorporating with other information have been extensively used for predicting regulatory relation between genes [13][14][15]. However it is logical to assume that expression data contains information about various types of functional relations between genes. In the present work we propose an approach for estimating integrated linear and probabilistic relations between expression profiles of genes and applied the concept to determine functional relations between yeast genes solely based on gene expression data. The proposed method successfully detected functionally related gene pairs that share many GO terms. The method also shows promise to be utilized in the process of detecting regulatory relations between genes.

Data Used in This
Work. The data used in this work was previously used in other works [16][17][18][19]. The data is a 2467 × 79 matrix containing some missing values. Each data point produced by a DNA microarray hybridization experiment represents the log of the ratio of expression levels of a particular gene under two different experimental conditions. The result, from an experiment with genes on a single chip, is a series of log-transformed expression-level ratios. Typically, the numerator of each ratio is the expression level of the gene in the varying condition of interest, whereas the denominator is the expression level of the gene in some reference condition. The expression measurement is positive if the gene is induced (turned up) with respect to the reference state and negative if it is repressed (turned down). The data were collected at various time points during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks.

Missing Value Imputation.
In microarray gene expression data missing values often occur due to various reasons, such as insufficient resolution, image corruption, dust, or scratches on the slide. Usually, microarray datasets are estimated to have more than 5% missing values and up to 90% of genes are affected [20,21]. The gene expression data considered in this work contains 3760 missing values. The missing values were filled based on principal component analysis (PCA) by using the package pcaMethods [22]. Using PCA we can model a matrix by defining two parameter matrices, the scores, , and the loadings, , such that when multiplied with each other they well reconstruct the original matrix as follows: where is the error matrix and 1 × denotes the original variable averages. Now if contains missing values but and can be completely estimated, then we can usê as an estimate for if is missing.

Digitization of Gene Expression Matrix.
After missing value imputation, let us denote the gene expression data matrix as . For each row of we calculate the average and standard deviation. Let for the th row the average and Table 1: Nine joint probabilities calculated for each gene pair.
standard deviations be denoted as avg and sd . Now, the digitized matrix is created as follows: In the above equations "th" is a threshold which should be a real number and in most practical cases it is within 0 to 2. We digitized the data using the values of threshold "th" as 0.5, 1, and 1.5. For each case the distribution of the genes with respect to the count of 1 s in their profiles is shown in Figure 1. In case of th = 0.5, the distribution approaches roughly normal and we observed similar trend in case of −1. Hence in this work we considered th = 0.5 for the digitization of the gene expression data. Table. Based on a digitized matrix containing only 1, 0, and −1 a probability density mass function table can be constructed corresponding to each gene pair indicating nine joint probabilities as shown in Table 1.

Probability Density Mass Function
Any element of the above table ( , ) (corresponding to two genes say, gene and gene ) where , ∈ {1, 0, −1} can be calculated by assuming TRUE = 1 and FALSE = 0 in (4) as follows: Here is the width of matrix . We assume that the joint probabilities of Table 1 and corresponding conditional probabilities contain important clues to estimate functional relations between genes.

Hypothesis.
In this work we hypothesize that when gene is positively functionally related to gene , then ( = 1 | = 1) should be statistically high. Using Bayes rule we can write ( = 1 | = 1) = ( = 1, = 1)/ ( = 1). Now if ( = 1) is very small, then ( = 1 | = 1) can be very high and that can sometimes happen because of noisy data. To avoid this problem we can consider ( = 1, = 1) as an indicator that gene is positively functionally related to gene . To further strengthen the case we consider that when both ( = 1, = 1) and ( = 1, = 1) + ( = 0, = 0) + ( = −1, = −1) are statistically significant then gene a and gene b are positively functionally related. Considering other joint probability masses might be useful for finding functional relations between some multi function genes. By intuition we can realize that the sum of probabilities ( = 1, = 1) + ( = 0, = 0) + ( = −1, = −1) actually indicates an integrated measure of both linear and probabilistic relations between the profiles of two genes and this term will be referred to as positive linear and probabilistic relation (LPR pos ) in the following. To our knowledge this is the first approach to measure similarity between two multivariate entities based on joint probability density masses in selected points giving emphasis on both linear and probabilistic relations.

Effectiveness of LPR pos .
The distribution of all gene pairs in the context of (1, 1) is shown in Figure 2(a). The average value of (1, 1) is 0.0819. We calculated LPR pos for the gene pairs for which (1, 1) is larger than the average value. The distribution of those gene pairs with respect to LPR pos is shown in Figure 2(b). The average value of LPR pos is 0.429. Initially we selected the highest 1%, 2%, 3%, 4%, and 5% gene pairs from the distribution of Figure 2(b), that is, gene pairs with higher LPR pos values, and determined the number of GO terms [23] shared by both the genes of each pair. Figure 3(a) shows the percentage of selected gene pairs that share at least 1, 2, and, 3 GO terms and also that of equal number of randomly selected gene pairs. In the context of minimum number of shared GO terms the percentage of selected gene pairs is always much higher compared to that of randomly selected pairs. Figure 3(a) further shows that the higher the lower cutoff value of LPR pos for a group of gene pairs is, the higher proportion of the gene pairs share common GO terms. To further illustrate the result we show in Figure 3(b) the actual number of shared GO terms for the highest 1% selected gene pairs and the equal number of random gene pairs which implies that the gene pairs selected based on LPR pos share much more GO terms. Thus LPR pos is a good measure to determine functional relation between genes.

FDR Analysis.
We conducted FDR (false discovery rate) [24,25] analysis to statistically assess the false positive rates among the selected gene pairs based on LPR pos . For each pair of genes for which (1, 1) is above average we did the following.
(v) Based on the chi-square value, a -value for the gene pair is determined using statistical software. Note that LPR pos is directly proportional to ∑ =1,0,−1 ( , ) real .  Figure 4(b) shows the plot of FDR with respect to cutoff -values. As the cutoff -value decreases, FDR decreases rapidly and becomes roughly constant at -value of 0.001. There are 25559 gene pairs for which the -value is less than 0.001.

Network and Modules of the Selected Gene Pairs.
Based on the FDR analysis of the above section, we selected 25559 gene pairs having highest LPR pos values. Such selected gene pairs make a network consisting of 2131 nodes. We determined high density modules in that network using the network clustering algorithm DPClusO [26] and found 1154 modules of size 3 or more (see Supplementary File 1 in supplementary material available online at http://dx.doi.org/10.1155/2014/154594).

Richness of Similar Function Genes.
To evaluate the richness of similar function genes in the modules we calculated their hypergeometric -values by using the package GOstats [27] in the context of all three types of GO terms: biological process (BP), cellular compartment (CC), and molecular function (MF). Figures 5(a), 5(b), and 5(c) show the distribution of the modules with respect to −log( -value) which implies that almost all the modules are statistically significant. We selected 10 lowest -value clusters corresponding to different GO terms from each of the three distributions of Figure 5 and their set union resulted in 22 clusters. Some biological information from the SGD database [28] about those 22 clusters is summarized in Table 2. Column 3 in Table 2 shows the -values and corresponding GO terms determined by GOstats. Column 4 in Table 2 shows other GO terms retrieved from SGD database associated to the clusters covering many genes which implies that almost all the genes of each of the clusters could be associated to important GO terms which confirms the fact that the proposed method is a promising way to establish functional relation between genes based on expression data.

Richness of Similar Binding
Sites. Furthermore to verify the presence of similar binding sites in the promoters of the genes included in individual modules we used the tool PRIMA (PRomoter Integration in Microarray Analysis) [29] from the software package EXPANDER [30]. Total 180 modules were found to have -values less than 10 −3 in the context of binding site enrichment of 57 various transcription  (9), protein binding (11), cellular protein metabolic process (10) Table 1). Table 3 shows information about 10 modules corresponding to lowest -values involving 10 different transcription factors. We downloaded a list of known regulatory relations from the YEASTRACT database [31] and verified whether the genes in a module have regulatory relation with the associated transcription factor. Column 6 of Table 3 shows that a large number of genes in individual modules are already reported to be regulated by the corresponding transcription factor. Only in case of CID736, though all 3 genes contain in their promoters the binding site of the transcription factor DAL82, no regulatory relation between those genes is reported in the YEASTRACT database presently. However based on our analysis regulatory relations between DAL82 and those three genes may be predicted. Thus the proposed measure can also be integrated to other types of information for developing a method to predict regulatory relations between genes which is one of our future works.

Conclusions
In this work we propose a novel measure to determine functional relation between genes based on gene expression data. The present approach first digitizes the log-ratio type gene expression data to a matrix consisting of 1, 0, and −1 indicating highly expressed, no major change and highly suppressed conditions for genes, respectively. Then a probability density mass function table is constructed indicating nine joint probabilities for each pair of genes. Those pairs of genes were considered as functionally related for which the sum of probability density masses in selected points are statistically significant. We applied the method to a sample gene expression data of S. cerevisiae. It was found that substantial majority of the selected gene pairs share many GO terms. Also the network consisting of the selected gene pairs contains high density modules. It was shown that those modules were rich with similar function genes. Furthermore, it was verified that for many modules many of the genes contain similar binding sites in their promoters corresponding to known transcription factors of yeast and those transcription factors are known regulators of many of the genes in the corresponding module. Above all this work introduces a new approach for simultaneously measuring both linear and probabilistic relations between multivariate entities which is useful for handling multivariate data and big data biology.