This work presents a novel approach to predicting functional relations between genes from gene expression data. Genes may be related in various ways: for example, they may have regulatory relations, or they may be involved in the same protein complex or metabolic/signaling pathway, and gene expression data should contain clues to such relations. The present approach first digitizes the log-ratio type gene expression data of
The cell works as a system governed by the integrated action of its genes, which indicates that genes are functionally related; for example, they may regulate one another, or they may be involved in the same protein complex or metabolic/signaling pathway. Determining functional relations between genes enables the development of a genetic network, which in turn leads to the prediction of the complex roles of the genes in different systems in the cell. Nucleotide and/or amino acid sequence similarities have been extensively used to predict functional relations between genes [
The data used in this work was previously used in other works [
In microarray gene expression data, missing values often occur for various reasons, such as insufficient resolution, image corruption, dust, or scratches on the slide. Usually, microarray datasets are estimated to have more than 5% missing values, and up to 90% of genes are affected [
After missing value imputation, let us denote the gene expression data matrix as
In the above equations, “th” is a real-valued threshold that in most practical cases lies between 0 and 2. We digitized the data using threshold values “th” of 0.5, 1, and 1.5. For each case, the distribution of the genes with respect to the count of 1s in their profiles is shown in Figure
Distribution of the genes with respect to the count of 1s in their profiles in the context of the digitized matrix.
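The digitization step can be illustrated with a minimal Python sketch. It assumes a symmetric rule in which values at or above “th” become 1, values at or below −“th” become −1, and everything else becomes 0; whether the comparisons are strict or inclusive is an assumption of this sketch. The function names and the toy log-ratio values are hypothetical.

```python
from typing import List


def digitize(matrix: List[List[float]], th: float) -> List[List[int]]:
    """Map log-ratio values to 1, 0, or -1 with a symmetric threshold.

    Assumption of this sketch: x >= th -> 1, x <= -th -> -1, otherwise 0.
    """
    if th < 0:
        raise ValueError("th must be non-negative")
    digitized = []
    for row in matrix:
        out = []
        for x in row:
            if x >= th:
                out.append(1)    # highly expressed
            elif x <= -th:
                out.append(-1)   # highly suppressed
            else:
                out.append(0)    # no major change
        digitized.append(out)
    return digitized


def count_ones(profile: List[int]) -> int:
    """Number of conditions in which the gene is marked as highly expressed."""
    return sum(1 for v in profile if v == 1)


# Toy log-ratio profiles for two genes (rows) over four conditions (columns).
log_ratios = [[1.7, 0.2, -1.1, 0.6], [-0.4, 1.0, 0.9, -2.3]]
for th in (0.5, 1.0, 1.5):
    digital = digitize(log_ratios, th)
    print(th, digital, [count_ones(row) for row in digital])
```

A larger threshold marks fewer conditions as 1 or −1, which is why the distribution of the count of 1s shifts with “th”.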
Based on a digitized matrix containing only 1, 0, and −1, a probability density mass function table indicating nine joint probabilities can be constructed for each gene pair, as shown in Table
Nine joint probabilities calculated for each gene pair.

|  | 1 | 0 | −1 |
|---|---|---|---|
| 1 |  |  |  |
| 0 |  |  |  |
| −1 |  |  |  |

Any element of the above table
Here
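As an illustration, the nine joint probabilities of a gene pair can be estimated from two digitized profiles. The sketch below assumes that each probability is the fraction of conditions in which the two genes take the corresponding pair of states; the function name and the toy profiles are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

STATES = (1, 0, -1)


def joint_probabilities(profile_a: List[int],
                        profile_b: List[int]) -> Dict[Tuple[int, int], float]:
    """Estimate the nine joint probabilities for a pair of digitized profiles.

    Assumption of this sketch: each probability is the number of conditions
    showing that (state of gene A, state of gene B) pair divided by the
    total number of conditions.
    """
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile_a)
    counts = Counter(zip(profile_a, profile_b))
    return {(a, b): counts[(a, b)] / n for a in STATES for b in STATES}


# Toy digitized profiles of two genes over four conditions.
gene_a = [1, 1, 0, -1]
gene_b = [1, 0, 0, -1]
table = joint_probabilities(gene_a, gene_b)
for a in STATES:
    print(a, [table[(a, b)] for b in STATES])
```

Each printed row corresponds to one state of the first gene and each column to one state of the second gene; the nine entries sum to 1.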
We assume that the joint probabilities of Table
In this work we hypothesize that when gene
The distribution of all gene pairs in the context of
Distribution of gene pairs in the context of (a)
Figure
(a)
We conducted FDR (false discovery rate) [ The numbers of 1s, 0s, and −1s in the digital profiles of both genes are counted. Random profiles of both genes are constructed by randomly placing the same numbers of 1s, 0s, and −1s. This process is repeated 100 times. Then, a chi-square value is calculated as follows where Based on the chi-square value, a
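The randomization step can be sketched in Python as follows. Shuffling a profile keeps its numbers of 1s, 0s, and −1s unchanged, which corresponds to the random profiles described above. The score used here (the fraction of conditions in which both genes are 1 or both are −1) is a hypothetical stand-in for the measure of this work, and the sketch reports a simple empirical fraction rather than the chi-square value; both choices are assumptions made only for illustration.

```python
import random
from typing import List


def agreement_score(a: List[int], b: List[int]) -> float:
    """Hypothetical score: fraction of conditions where both genes are 1
    or both are -1 (an assumption, not the measure defined in the text)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != 0) / len(a)


def null_scores(a: List[int], b: List[int],
                n_rounds: int = 100, seed: int = 0) -> List[float]:
    """Scores of randomized profile pairs that keep the same counts of
    1s, 0s, and -1s as the observed profiles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        rand_a, rand_b = list(a), list(b)
        rng.shuffle(rand_a)  # same state counts, random positions
        rng.shuffle(rand_b)
        scores.append(agreement_score(rand_a, rand_b))
    return scores


def empirical_fraction(observed: float, scores: List[float]) -> float:
    """Fraction of randomized scores at least as large as the observed one."""
    return sum(1 for s in scores if s >= observed) / len(scores)


# Toy digitized profiles of two genes over eight conditions.
gene_a = [1, 1, -1, 0, 0, 0, 1, -1]
gene_b = [1, 1, -1, 0, 0, 1, 0, -1]
observed = agreement_score(gene_a, gene_b)
print(observed, empirical_fraction(observed, null_scores(gene_a, gene_b)))
```

A small empirical fraction indicates that the observed agreement is rarely reached by random profiles with the same composition.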
Figure
(a) Distribution of the gene pairs with respect to the
Figure
Based on the FDR analysis of the above section, we selected 25559 gene pairs having the highest
To evaluate the richness of similar-function genes in the modules, we calculated their hypergeometric
Richness of similar function genes in selected clusters. For each cluster, hypergeometric

| CID | Total number of genes | Some relevant GO terms (corresponding number of genes) |
|---|---|---|
| 4 | 97 | Cytosolic ribosome (94), structural constituent of ribosome (94), cytoplasmic translation (93), ribosome (96) |
| 16 | 76 | Ribosomal subunit (37), structural molecule activity (38) |
| 19 | 73 | Ribonucleoprotein complex (47), intracellular part (73) |
| 226 | 8 | Nuclear nucleosome (8), DNA bending complex (8) |
| 1 | 113 | Cellular metabolic process (104), intracellular part (109) |
| 44 | 34 | Cytosolic part (21), cytoplasm (34) |
| 35 | 44 | Gene expression (41), primary metabolic process (43) |
| 85 | 17 | Mitochondrial part (14), mitochondrion (16) |
| 155 | 11 | Protein folding (9), protein binding (11), cellular protein metabolic process (10) |
| 278 | 7 | Proteasome complex (7), proteasome storage granule (5) |
| 87 | 16 | Nucleolus (12), non-membrane-bounded organelle (14) |
| 107 | 14 | Mitochondrion organization (12), cellular component organization (13) |
| 121 | 13 | Glycolysis (7), generation of precursor metabolites and energy (9) |
| 442 | 5 | Mitochondrial respiratory chain (5), oxidoreductase complex (5) |
| 173 | 10 | Protein folding (7), unfolded protein binding (5), protein binding (8) |
| 282 | 7 | Modification-dependent protein catabolic process (7), proteasomal ubiquitin-independent protein catabolic process (5) |
| 71 | 15 | Ribosome (13), ribonucleoprotein complex (14) |
| 725 | 3 | Acid phosphatase activity (2) |
| 214 | 9 | Hydrogen ion transmembrane transporter activity (5), single-organism metabolic process (7) |
| 736 | 3 | Asparaginase activity (3) |
| 1092 | 3 | Heme-copper terminal oxidase activity (3) |
| 270 | 7 | Ion transmembrane transporter activity (6) |

Distribution of the modules with respect to −log(
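For reference, a hypergeometric enrichment value for one module and one GO term can be computed as an upper-tail probability. The sketch below uses only the Python standard library; the function name, the background size of 6000 genes, and the count of 60 annotated genes are hypothetical values chosen for illustration, while the module size and hit count (8 and 8) follow the Nuclear nucleosome entry of cluster 226.

```python
from math import comb, log10


def hypergeom_pvalue(population: int, annotated: int,
                     module_size: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population:  number of genes in the background set
    annotated:   background genes carrying the GO term
    module_size: number of genes in the module
    hits:        module genes carrying the GO term
    """
    if not (0 <= annotated <= population and 0 <= module_size <= population):
        raise ValueError("inconsistent population parameters")
    if not (0 <= hits <= module_size):
        raise ValueError("hits must lie between 0 and module_size")
    total = comb(population, module_size)
    upper = min(annotated, module_size)
    # Sum the probabilities of drawing k annotated genes for k = hits..upper.
    tail = sum(comb(annotated, k) * comb(population - annotated, module_size - k)
               for k in range(hits, upper + 1))
    return tail / total


# Hypothetical background: 6000 genes, 60 of them annotated with the term.
p = hypergeom_pvalue(population=6000, annotated=60, module_size=8, hits=8)
print(p, -log10(p))
```

Plotting −log10 of such values across modules gives a distribution of the kind described in the caption above.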
Furthermore, to verify the presence of similar binding sites in the promoters of the genes included in individual modules, we used the tool PRIMA (PRomoter Integration in Microarray Analysis) [
Richness of binding sites in the promoters of the module genes corresponding to 10 different transcription factors.

| CID | Size | TF | Number of promoters (PRIMA) | Known regulatory relations (YEASTRACT) |
|---|---|---|---|---|
| 3 | 98 | YP00066 [SFP1] | 58 | 98 |
| 5 | 95 | M00213 [RAP1] | 55 | 93 |
| 72 | 18 | YP00036 [MBP1] | 10 | 12 |
| 155 | 11 | M00169 [HSF] | 7 | 11 |
| 230 | 8 | YP00068 [SIP4] | 5 | 4 |
| 227 | 8 | YP00064 [RPN4] | 8 | 8 |
| 725 | 3 | M00064 [PHO4] | 3 | 3 |
| 259 | 7 | YP00076 [STB1] | 5 | 2 |
| 736 | 3 | YP00013 [DAL82] | 3 | 0 |
| 233 | 8 | YP00043 [MSN4] | 8 | 7 |

In this work we propose a novel measure to determine functional relations between genes based on gene expression data. The present approach first digitizes the log-ratio type gene expression data into a matrix consisting of 1, 0, and −1, indicating highly expressed, no major change, and highly suppressed conditions for genes, respectively. Then a probability density mass function table is constructed, indicating nine joint probabilities for each pair of genes. Gene pairs for which the sum of the probability density masses at selected points is statistically significant were considered functionally related. We applied the method to sample gene expression data of
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research is supported by the National Bioscience Database Center in Japan and the Ministry of Education, Culture, Sports, Science, and Technology of Japan (Grant-in-Aid for Scientific Research on Innovation Areas “Biosynthetic Machinery. Deciphering and Regulating the System for Creating Structural Diversity of Bioactivity Metabolites (2007)”).