1. Introduction

BMRI

BioMed Research International

2314-6141 2314-6133

Hindawi Publishing Corporation

856325

10.1155/2013/856325

856325

Research Article

DeGNServer: Deciphering Genome-Scale Gene Networks through High Performance Reverse Engineering Analysis

Jun

http://orcid.org/0000-0002-3551-4998

Wei

Hairong

^2,3

http://orcid.org/0000-0002-3460-5564

Zhao

Patrick Xuechun

¹ Zhao

Zhongming

Bioinformatics Lab

Plant Biology Division

Samuel Roberts Noble Foundation

2510 Sam Noble Parkway

Ardmore

OK 73401

USA

noble.org

School of Forest Resources and Environmental Science

Michigan Technological University

1400 Townsend Drive

Houghton

MI 49931

USA

mtu.edu

Department of Computer Science

Michigan Technological University

1400 Townsend Drive

Houghton

MI 49931

USA

mtu.edu

2013

17 11 2013

2013 22 08 2013 01 10 2013

2013

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Analysis of genome-scale gene networks (GNs) using large-scale gene expression data provides unprecedented opportunities to uncover gene interactions and regulatory networks involved in various biological processes and developmental programs, leading to accelerated discovery of novel knowledge of various biological processes, pathways and systems. The widely used context likelihood of relatedness (CLR) method based on the mutual information (MI) for scoring the similarity of gene pairs is one of the accurate methods currently available for inferring GNs. However, the MI-based reverse engineering method can achieve satisfactory performance only when sample size exceeds one hundred. This in turn limits their applications for GN construction from expression data set with small sample size. We developed a high performance web server, DeGNServer, to reverse engineering and decipher genome-scale networks. It extended the CLR method by integration of different correlation methods that are suitable for analyzing data sets ranging from moderate to large scale such as expression profiles with tens to hundreds of microarray hybridizations, and implemented all analysis algorithms using parallel computing techniques to infer gene-gene association at extraordinary speed. In addition, we integrated the SNBuilder and GeNa algorithms for subnetwork extraction and functional module discovery. DeGNServer is publicly and freely available online.

1. Introduction

The advent of high-throughput technologies including microarray experiments and RNA-Seq technologies has generated terabytes of gene expression data for systematically identifying transcriptional regulation and interactions through the reconstruction of gene networks on genome-wide scale. Analysis of whole genome-scale networks can provide a holistic view of all transcription regulations among and within different subnetworks and allows us to gain a more comprehensive understanding of regulation of cellular processes and events. In the past few years, large amount of gene expression data sets from numerous labs has been published and deposited in public databases such as ArrayExpress [1] and Gene expression Omnibus [2], and the volume of this kind of data is still exploding at an accelerated rate. Previous effort in analyzing these public available data has led to the discovery of large amount of novel biological knowledge, making it become increasingly clear that reverse engineering of such “big data” for genome-scale network reconstruction and analysis is one of the most efficient approaches for understanding how life functions through learning holistic transcription regulation and gene interaction.

To date, reverse engineering of aggregated high volume gene expression data for building accurate gene network is still very challenging. The challenge lies in the high dimensionality of gene space and large sample numbers that demand fast and high efficient algorithms, and enhanced computational power as well. A set of the algorithms operates under such a hypothesis that coexpressed [3–5], roughly coordinated genes [6, 7] and genes with dependency [8–10] across a set of samples indicate a functional relationship [11, 12]. As one of the best gene network construction methods, the context likelihood of relatedness (CLR) method [9] utilizing the mutual information (MI) for scoring the similarity of gene pairs has been widely used to decipher gene networks for multiple species, such as yeast, bacteria, mammalian, and plants [9, 13–16]. However, it is computationally infeasible to decipher genome-scale networks for species with large genomes on a single computer due to physical limits on CPU speeds and memory capacities. For example, there are more than thirty-five thousand genes (transcripts) in human genome. To decipher a genome-scale network through such reverse engineering method, it will need to calculate more than 1.2 billion MI values if we evaluate genes in pairwise fashion, and it is more likely that we will need to evaluate genes in triples or quadrants. Even for those species with small genomes, it is still a big computational challenge to use this method. When CLR was used to construct global networks for Escherichia coli in [9], the authors had to trim the number of genes down to a few thousands in order to reduce the computational complexity to a manageable scale. Obviously, this kind of gene reduction prior network construction could miss many potential gene regulations and interactions in the constructed networks. This is because many important transcription factors or genes involved in signaling transduction are expressed at low level and do not necessarily have high variability in expression [17–19]. These genes can be easily eliminated during data trimming process.

Meanwhile, the estimation of mutual information adopted in CLR method heavily relies on the number of microarray data sets. The mutual information value could be estimated accurately only when the number of microarray profiles is larger than one hundred [20]. However, as more microarray and RNA-seq data become available in public database, this, in turn, demands fast, accurate, and less computational complexity. Therefore it is urgently called to develop a high performance reverse engineering system for large-scale gene network analysis through both innovations in efficient algorithm development and parallel computing implementation.

In this study, we integrated parallel computing technologies into DeGNServer to accelerate network reconstruction and subnetwork extraction, which enables DeGNServer to analyze the “big data” in at least one hundred times faster than the original mutual information based CLR, making it much feasible for reverse-engineering global gene networks using the data from a large genome and discovering novel biological knowledge. Meanwhile, we integrated multiple gene association methods into our DeGNServer for network construction. The benchmark data set demonstrated that most of these different association-based CLR methods could reach very similar accuracy as the original mutual information-based CLR method. In addition, we also integrated the SNBuilder [21] and GeNa [22] communities-finding algorithms for identifying subnetworks by providing some seed genes. The major purpose of our system is to provide a practical system to construct the gene association networks from large scale gene expression data.

2. Implementation 2.1. Overview of Gene Network Analysis Methods and Data Analysis Workflow

We extended the CLR method through integrating several gene-gene association estimation methods, of which includes Pearson, Spearman correlation [6], Kendall, Theil-sen [23], and Weighted Rank methods [24] as well as the mutual information-based method proposed in the original CLR method [9], in the DeGNServer. In addition, the recently published method, maximal information coefficient (MIC) method [25], which has demonstrated capability in discovering novel associations in large data sets, was also integrated into our DeGNServer. To help the biologists to interpret the inferred network, we integrated SNBuilder [21] and GeNa [22] approaches for subnetwork analysis/functional module discovery. All algorithms have been implemented and deployed on our in-house parallel computing platform, namely, BioGrid, which has dedicated over 700 CPU Cores. Figure 1 illustrates the data analysis workflow in DeGNServer. Utilizing our high performance DeGNServer, typical genome-scale gene networks involving 40,000~50,000 gene models could be constructed from expression data that consists of ~200 microarray hybridizations in less than 30 minutes.

Figure 1

The DeGNServer data analysis workflow.

2.2. Parallel Computing for the Accelerating of GN Construction

To accelerate the GN construction through the parallel computing, we split the whole data sets of these gene pairs into multiple subsets. Let M denote the n×m gene expression matrix, where n denotes the number of genes and m represents the number of gene expression profiles. The computational complexity of association value for all gene-gene pairs of is O(n2×m). The reconstruction of network will be very time-consuming when there exists massive number of expression profiles (e.g., n>20,000 and m>1,000). To tackle this issue, we implemented the GN analysis algorithms using parallel computing techniques. When this task is distributed to all the computing nodes in our Biogrid system, the total computational time complexity is then reduced to O(n2×m/p), where p is the number of allocated processors.

When a gene regulatory network is inferred from n genes, the algorithm will need to compute n×(n-1)/2 pairwise associated values. A two-dimensional n×n matrix D is used to denote these gene pairs. For gene pair (i,j), the association value of this gene pair will be calculated when the following requirements are satisfied.(1)

When n is even (1)if i≤⌈n2⌉, then j∈[i+1,min(n-1,i-1+⌈n2⌉)],if i>⌈n2⌉, then j∈[i+1,n-1]∪[0,i+1-⌈n2⌉].

(2)

When n is odd (2)if i≤⌈n2⌉, then j∈[i+1,min(n-1, i+⌈n2⌉)],if i>⌈n2⌉, then j∈[i+1,n-1]∪[0,i-1-⌈n2⌉].

For every processor in our Biogrid system, we assign n/p rows of matrix to this processor for the calculation of their corresponding association values.

2.3. Input Detail

DeGNServer accepts normalized expression data either in a tab-delimited text file or tab-delimited text. The server DeGNServer provides two options to construct different networks, that is, the coexpression networks and the CLR method-based association networks. Users may adjust the parameter settings, including gene-gene association estimation method and cut-off threshold, to control the size of constructed networks. After the networks are reconstructed, user may submit a list of genes-of-interest and select different subnetwork identification methods to further mine and visualize the same subnetwork generated from different extraction methods.

2.4. Output Detail

DeGNServer lists links to the constructed networks/subnetworks in Cytoscape [26] compatible text files, which can be easily imported into the popular Cytoscape software for downstream analysis. In addition, the DeGNServer output page provides interfaces for query and network visualization through Cytoscape web plug-in [24] for each identified subnetwork.

2.5. Technical Detail

The DeGNServer is currently deployed on Linux using resin Java server 4.0. It has been tested using the popular web browsers, such as Internet Explorer, Firefox, and Google Chrome. The web interfaces are implemented in JAVA and JSP scripts. All backend integrated analysis algorithms are implemented with parallel programming techniques in efficient C++ computing language and are deployed on an in-house developed Linux cluster, namely, BioGrid, which currently consists of about 700 CPU Cores, to achieve high performance computing capacity. Upon job submission through DeGNServer web server, the master node of BioGrid system firstly divides the gene expression matrix into multiple submatrixes and transfers these submatrixes to slave computing nodes in the Linux Cluster. Next, the master node remotely calls to execute the analysis pipelines and monitors analysis progresses in these computing nodes. Finally the master node collects the association values of all gene-gene pairs for gene network construction and subnetwork analysis. For those species with large genomes, the distributions of gene-gene pairs are close to the normal distribution, so we applied the normal distribution to calculate the z-score of gene-gene pairs. Based on the preset z-score threshold, those gene-gene pairs whose z-scores are less than the threshold would be discarded. Figure 2 illustrates the parallel implementation of the CLR Method.

Figure 2

Parallel implementation of the CLR method.

3. Results 3.1. Performance Evaluation with Synthetic Data

To comprehensively evaluate performance of integrated network construction methods, we generated two groups of synthetic compendium gene expression data sets, each group with a series of data sets of various sizes, using the SynTReN software [27] and the regulatory network models based upon Escherichia coli experimental data as original seeds. The sampled sizes of Group A data sets are 30, 40, 50, 60, 70, 80, and 90, while the sizes for Group B are 100, 200, 300, 400, 500, 600, 700, 800, 900, and 1000 samples. We analyzed each of these compendium data sets with various sample sizes and then generated respective subnetworks containing 50 genes. The prediction accuracy against the corresponding reference network in SynTReN software uses the area under the receiver operating characteristic curve (ROC) curve, namely, the AUC scores [28], to represent the accuracy of each method. The AUC scores resulting from all compendium data sets within each group were averaged, and results of averaged AUC scores for all method in each group are shown in Figure 3.

Figure 3

Average AUC scores from different association-based CLR methods for networks with larger and smaller numbers of expression profiles; Group A: networks constructed with smaller number of gene expression samples (30~90 samples), Group B: networks constructed with larger number of expression samples (100~1000). AUC scores were obtained through varying different threhold settings. A perfect model will have AUC score of 1, while random guessing will score an AUC around 0.5.

The ROC curve indicates the change of sensitivity (true positive rate) versus specificity (true negative rate) under different thresholds, and AUC score can represent the accuracy of each method better because it is independent of different thresholds.

The following formula is used to calculate the sensitivity and the specificity: (3)Sensitivity=TPTP+FN,Specificity=FPFP+TN. All methods were applied to construct GNs with each sampled data set in either Group A or B with the positive regulatory relationships being counted. We then calculated their respective AUC scores. For each group (smaller and large number of expression data sets), we compared their average AUC scores for different methods. Figure 3 shows that the prediction accuracies of Spearman-based CLR method have higher average AUC scores than other methods, suggesting that Spearman-based CLR method may produce better results in term of network construction.

3.2. Case Study 1: Deciphering Genome-Scale Pluripotency Networks in Human Embryonic Stem Cells 3.2.1. Human Stem Cell Microarray Data Set

To validate the performance of DeGNServer, we analyzed genome-scale networks from 189 human stem cell microarray profiles. These data sets were generated in 17 individual experiments in which human embryonic stem cells were treated with various reagents for inducing differentiation. Therefore, this compendium data set is enriched with regulatory events and interaction of pluripotency maintenance and transition from pluripotent stem cells to differentiated cell linages, and thus it can serve as an ideal testing data for the performance of DeGNServer in discovering functionally associated gene subnetworks governing these processes. Of these 189 microarray data sets, there are 104 high-density human gene expression arrays from HG17 assembly. This platform of microarray contains 388,634 probes from 36,494 human locus identifiers. These 104 chips were compiled from 15 experiments in which stem cells were treated with different reagents that disrupted pluripotency. The reagents and the conditions included 12-O-tetradecanoylphorbol-13-acetate (TPA) treatment in conditioned medium, TPA treatment in TeSR medium, BMP4 treatment with FGF, BMP4 treatment without FGF, and coculture with mouse OP9 cells. The remained 85 high-density human gene expression arrays have 381,002 probes from 47,633 human locus identifiers from the HG18. These 85 microarray data sets were compiled from a set of experiments where a variety of different growth factors were applied to human embryonic stem cells at varying conditions for 3 days. Both HG17 and HG18 microarray platforms were manufactured by NimbleGen Systems (http://www.nimblegen.com/). All probes are 60 mers and all chips were hybridized to Cy5 labeled mRNAs extracted from human embryonic stem cells (hESCs) from undifferentiated to differentiated stages. Raw data were extracted using NimbleScan software v2.1. The two data sets were joined by gene mapping via selection of shared common probes between the same genes on the two platforms. More than 99.5% of mapped genes share at least 6 common probes, and the signal intensities from these common probes were normalized with the Robust Multiple-chip Analysis (RMA) algorithm [29]. Thus, the whole data set obtained contains 36,398 genes.

3.2.2. Results on Pluripotency Network Analysis in Human Embryonic Stem Cells

The gene networks including 21,167 genes and 200,000 links were reconstructed in less than 20 minutes with a z-score threshold of 4.3 and spearman-based association method. The built network could be retrieved at http://plantgrn.noble.org/DeGNServer/Result.jsp?time4=&sessionid human&method=1_1&cutoff=4.3. We also tested with original mutual information-based CLR, and it took 53.3 hours to complete whole genome-scale network construction.

Generally, global networks with huge numbers of regulations and interactions are a “hairball”, from which we can hardly identify any patterns. To facilitate the identification of subnetworks or modules that regulate a specific biological process or developmental program, we integrated both SNBuilder [21] and GeNa [22] methods to extract smaller subnetworks/functional modules by providing a few seed genes. We used NANOG, POU5F1, SOX2, and PHC1 as seed genes to bait the subnetwork shown in Figure 4.

Figure 4

The identified subnetwork contains the essential transcription factors and other genes required for pluripotency maintenance. The twelve genes on the inner ring are transcription factors known to play essential or important role in pluripotency renewal of human embryonic stem cells. These include three master transcription factors, NANOG, POU5F1, and SOX2, which are absolutely required for pluripotency maintenance. The genes located on the outer ring were identified by DeGNServer for being closely coordinated with those transcription factors in the inner ring. The genes on outer ring, but highlighted in yellow, are those that are implicated by the existing literature to participate in the pluripotency renewal. This subnetwork was generated by using SNBuilder method [21] with NANOG, POU5F1, SOX2, and PHC1 as query seeds.

Figure 4 shows the subnetwork that is implicated to control the pluripotency renewal of human embryonic stem cells. The literature evidence supporting the involvement of those transcription factors on inner ring in regulating pluripotency in human stem cells is already shown in our earlier publication and these TFs could be identified by our TF-Cluster that is capable of constructing gene association network with all TFs as an input [6]. However, it cannot be used to build the genome-wide GN mainly due to computational complexity. In this study, our DeGNServer identified 14 of 16 TF genes that were identified previously by TF-Cluster tool from the same data for governing pluripotency renewal. These 14 TFs include three master transcription factors, NANOG, POU5F1 (or OCT4), and SOX2, which are necessary for pluripotency maintenance, and they alone can convert skin cells to induced pluripotent cells [30]. Although two TFs were missed by our method, we identified six more other genes that are to be involved in pluripotency maintenance in human stem cells. In this study, we only examined the existing literature of six genes that are located on the outer rings (Figure 4). The developmental pluripotency-associated 2 (DPPA2) gene plays important roles in the maintenance of pluripotency and proliferation of human embryonic stem cells by regulating chromatin structures [31]. Although there is no direct evidence from human stem cells, study on mouse stem cells shows that DPPA2 knockdown induces the differentiation, while it represses proliferation of mouse embryonic stem cells [31]. PRDM14 is an important determinant of the human embryonic stem cell (ESC) identity, and it works in concert with the core ESC regulators to activate pluripotency-associated genes [32]. PRDM14 binds to silenced genes and serves as a direct repressor of differentiation genes in human stem cells though the exact mechanism of this repressive activity remains unknown. ZMYND8 encodes a zinc finger protein with a complex role in maintaining pluripotency. Although only expressed at low levels, either up- or downregulation of ZMYND8 can induce differentiation in ES cells [33]. JARID2 is a component of chromatin modification complex PRC2 in embryonic stem cells and is required for multilineage differentiation. It plays a role in recruiting PRC1 and RNA Polymerase II to developmental regulators. We found that JARID2 and CD99 in our subnetwork and previous study have shown JARID2 functions together with CD99 in controlling autism spectrum disorder [34]. The exact function of DEPDC2 is currently unknown, but it is known that the promoter of DEPDC2 is bound by the three master transcription factors, NANOG, SOX 2, and POU5F1 as mentioned above [35]. DEPDC2 is a molecular marker for human stem cell [36] though its exact function remains unknown. Similarly, the exact function of CHST4 is currently unknown, but it is know that CHST4 is one of the 16 methylation markers of embryonic stem cells, and these 16 methylation markers also include PRDM14 as mentioned above [37].

To further examine the sensitivity, specificity, and prediction accuracy of the case study described above, we made some assumptions. (1) We assumed that the genes that are evidenced to be involved in pluripotency maintenance in the existing literature are all positive genes; we then counted the true positive (TP) and false positive (FP) genes within each subnetwork. The true negative (TN) and false negative (FN) genes were calculated from the rest of network that was adjusted to the same size of each subnetwork. For comparison, we rescaled all numbers to one hundred before we calculated sensitivity, specificity, and prediction accuracy. The results were shown in Table 1. The results demonstrate the high accuracy of the DeGNServer.

Table 1

Sensitivity, specificity, and prediction accuracy of two case studies.

Case studies	TP	FP	TN	FN	Sensitivity	Specificity	Prediction accuracy
Human stem cell	2.42	97.58	99.95	0.05	98%	50.6%	51.2%
Mouse heart	39.6	60.4	97.5	2.50	94.1%	61.7 %	68.6%

Prediction accuracy = ((TP + TN)/(TP + FP + TN + FN)) × 100%.

3.3. Case Study 2: Deciphering Genome-Scale Pluripotency Networks in Murine Heart Tissues 3.3.1. Mouse Heart Microarray Data Set

We also analyzed a compendium microarray data set from heart tissues of Mus musculus to evaluate the efficiency of the DeGNServer. This compendium data set includes 172 Affymetrix microarray chips of platform GPL1261, which contains 45,101 probes. The data was downloaded from NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/). These 172 microarray data were from nine independent experiments that have the following GEO accession IDs: GSE11291, 15078, 19875, 29145, 30495, 3440, 38754, 5500, and 7781. The compendium data were generated through pooling the raw data of 172 microarray data and then normalized with RMA algorithm [29]. For quality control, we used two methods that were previously described [38].

3.3.2. Overall Performance

The gene networks including 41,742 genes and 3,869,157 links were deciphered in less than 30 minutes with a z-score threshold of 3.8 and spearman-based association method. The built network could be retrieved at http://plantgrn.noble.org/DeGNServer/Result.jsp?sessionid=1367625665687&method=1_1&cutoff=3.8#. We also tested with original mutual information-based CLR, which took 81.6 hours to complete whole genome-scale network construction.

3.3.3. Subnetworks Controlling Murine Heart Development

The pathway that controls murine heart development can be obtained from NCBI’s BioSystems database with an accession number of 672437 [39]. From the pathway diagram, we can find the three central genes, Nkx2-5, Tbx1, and Mef2c, which play very important roles in heart development, as showed up in the subnetwork, we obtained (Figure 5). Nkx2-5 is known to be involved in cardiac muscle cell differentiation [40], proliferation [41], contraction [42], and muscle tissue development [43]. Lack of Nkx2-5 can lead to the myogenic and morphogenetic defects in the heart tubes [43]. Mef2c and Nkx2.5 are known to control common downstream targets and exhibit striking phenotypic similarities when disrupted [43]. Tbx1 affects asymmetric cardiac morphogenesis by regulating Pitx2 in the secondary heart field [44]; it also controls regional coronary artery morphogenesis [45], aorta morphogenesis [46], and blood vessel development [47]. Prox1 is known to function as a direct upstream modifier of Nkx2.5 and is responsible for maintaining muscle structure and growth [48, 49]. CAMTAs promote cardiomyocyte hypertrophy and activate the ANF gene, at least in part, by associating with the cardiac homeodomain protein Nkx2-5 [50]. The transcriptional activity of CAMTAs is governed by association with class II histone deacetylases (HDACs), which negatively regulate cardiac growth [50]. Smarca4, as a nuclear notch signaling component required for the establishment of left-right asymmetry [51], is also essential for heart development by involving chromatin remodeling complexes [51]. Kdm6 interacts with Smarca4 to control T-box family member-dependent gene expression [52]. Wnt2 is required for atrial and inflow tract morphogenesis, and it regulates expansion of secondary heart field progenitors [53]. Myocd controls cardiac muscle cell proliferation, growth, and differentiation [54]. Eno3 is highly expressed in skeletal muscle and heart [55]. The specific function of murine Chst2 is currently unknown, but human umbilical vein endothelial cells predominantly express CHST2 [56, 57]. The heart requires glycerol as an energy substrate through aquaporin 7, a glycerol facilitator [58]. Glycerol is taken into cardiomyocytes and is finally converted to pyruvate by Gpd2 enzymes [59]. EphA4 mutant mice exhibit defects in the coronal suture and neural crest-mesoderm boundary [60].

Figure 5

The subnetwork that is responsible for heart growth and development in mouse. The whole genome-scale network was constructed from 175 chips of GPL1261 platform using DeGNServer and then extracted using community-finding algorithm called GeNa [22] with Nkx2-5, Prox1, and Mef2c as query seeds. Genes highlighted in red are implicated by the existing literature to participate in heart growth and development.

3.4. Sensitivity, Specificity, and Prediction Accuracy of the above Two Case Studies

To further examine the sensitivity, specificity, and prediction accuracy of the three case studies as shown above, we made some assumptions: (1) for human pluripotency renewal, we assumed that the genes that are evidenced to be involved in pluripotency maintenance in the existing literature are all positive genes; (2) for heart development, due to the large number of genes involved in these biological processes, we cannot search the literature evidence for all genes. We classified all genes involved in heart development to be positive based on gene ontologies. We then counted the true positive (TP) and false positive (FP) genes within each subnetwork. The true negative (TN) and false negative (FN) genes were calculated from the rest of network that was adjusted to the same size of each subnetwork. For comparison, we rescaled all numbers to one hundred before we calculated sensitivity, specificity, and prediction accuracy. The results were shown in Table 1.

4. Discussions

We developed the DeGNServer to enable the reconstruction of genome-scale GN using the increasingly accumulated large-scale gene expression data in public domain. Users may use it to generate whole genome scale GNs from large amount of gene expression data in any species. After whole genome GN construction, users can obtain the subnetworks by providing a few genes of interest. All subnetworks generated with different genes of interest and thresholds will be automatically listed online for downloading and studying. When genome-wide network construction was performed with 189 human microarray profiles as an input for DeGNServer, we could identify a subnetwork containing majority of genes involved in pluripotency maintenance in human embryonic stem cells [6, 30, 35]. It is worth mentioning that TF-Cluster pipeline that we developed earlier [6] is capable of building a coordinated network using the same human compendium data set and identifies only those transcription factors located on the inner ring in Figure 4, but it misses all genes that are located on the outer ring in Figure 4 mainly because it can build a local transcription factor coordination network rather than the whole genome-scale network. When genome-wide GNs were constructed using the DeGNServer, we could identify more genes (shown in outer ring in Figure 4) that regulate human pluripotency renewal together with those major transcription factors as shown in inner ring in Figure 4. To test if DeGNServer can identify true subnetworks in different circumstances, we also applied it to a murine compendium data set we downloaded and pooled from GEO database. The data is from heart tissues of Mus musculus. We obtained a subnetwork that contains functionally cohesive genes known to control the heart developmental program in mouse. This evidence clearly indicated that the use of DeGNServer can lead to the deciphering of the more comprehensive networks from which we can discover new genes involved in a specific biological process. We thus think that DeGNServer is useful in identifying genes governing a specific biological process, pathway, or a developmental program.

Although we have tested with synthetic data and found that Spearman-based CLR appears to have better performance than any of other methods including original mutual information based CLR, we still make all methods available in DeGNServer. This is because the efficiency of different methods may be dependent on the properties of biological data, as we showed in a previous study [7]. For subnetwork extraction, we integrated both SNBuilder and GeNa algorithms; both are found to be proficient in identifying the true subnetworks. However, GeNa usually produces small subnetworks with cohesive function.

5. Conclusions

We have developed a high performance web-based platform, namely, DeGNServer, for genome-scale GN construction and subnetwork extraction. DeGNServer is capable of analyzing gene expression data with very high dimensionality of gene space and very large number of gene expression profiles. As tested, it can analyze hundreds of microarray profiles of human (36,000 genes) for reconstruction of gene association networks within 30 minutes, mainly through the improvement of gene association estimation algorithms and parallel computing in combination. The DeGNServer is as accurate and sensitive as the original CLR method and runs hundreds to thousands times faster. Furthermore, through the integration of network decomposition methods, the DeGNServer is capable of identifying novel functional cohesive subnetworks or modules.

Acknowledgments

The authors thank Dr. Xinbin Dai for his assistance in DeGNServer deployment. This work was supported by the National Science Foundation (Grant DBI : 0960897 to Patrick Xuechun Zhao) and the Samuel Roberts Noble Foundation.

Parkinson

Sarkans

Shojatalab

Abeygunawardena

Contrino

Coulson

Farne

Lara

G. G.

Holloway

Kapushesky

Lilja

Mukherjee

Oezcimen

Rayner

Rocca-Serra

Sharma

Sansone

Brazma

ArrayExpress—a public repository for microarray gene expression data at the EBI

Nucleic Acids Research 2005 33 D553 D555

2-s2.0-13444249955

10.1093/nar/gki056

Barrett

Troup

D. B.

Wilhite

S. E.

Ledoux

Rudnev

Evangelista

Kim

I. F.

Soboleva

Tomashevsky

Edgar

NCBI GEO: mining tens of millions of expression profiles—database and tools update

Nucleic Acids Research 2007 35 1 D760 D765

2-s2.0-33846064113

10.1093/nar/gkl887

Wei

Persson

Mehta

Srinivasasainagendra

Chen

Page

G. P.

Somerville

Loraine

Transcriptional coordination of the metabolic network in arabidopsis

Plant Physiology 2006 142 2 762 774

2-s2.0-33750063171

10.1104/pp.106.080358

Zhu

X. L.

Z. H.

Wang

Y. L.

Teng

Y. C.

Weighted gene co-expression network analysis in identification of endometrial cancer prognosis markers

Asian Pacific Journal Cancer Prevention 2012 13 9 4607 4611

Presson

A. P.

Sobel

E. M.

Papp

J. C.

Suarez

C. J.

Whistler

Rajeevan

M. S.

Vernon

S. D.

Horvath

Integrated weighted gene co-expression network analysis with an application to chronic fatigue syndrome

BMC Systems Biology 2008 2 article 95

2-s2.0-58549117328

10.1186/1752-0509-2-95

Nie

Stewart

Ruan

Thomson

J. A.

Zhang

Cui

Wei

TF-cluster: a pipeline for identifying functionally coordinated transcription factors via network decomposition of the Shared Coexpression Connectivity Matrix (SCCM)

BMC Systems Biology 2011 5 article 53

2-s2.0-79955091925

10.1186/1752-0509-5-53

Kumari

Stewart

Nie

Evaluation of gene association methods for coexpression network construction and biological knowledge discovery

PLoS ONE 2012 7 11

10.1371/journal.pone.0050411

e50411

Margolin

A. A.

Nemenman

Basso

Wiggins

Stolovitzky

Favera

R. D.

Califano

ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context

BMC Bioinformatics 2006 7 supplement 1, article S7

2-s2.0-33947305781

10.1186/1471-2105-7-S1-S7

Faith

J. J.

Hayete

Thaden

J. T.

Mogno

Wierzbowski

Cottarel

Kasif

Collins

J. J.

Gardner

T. S.

Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles

PLoS Biology 2007 5 1, article e8

2-s2.0-34548289712

10.1371/journal.pbio.0050008

Wei

Liu

Zhao

P. X.

GPLEXUS: enabling genome-scale gene association network reconstruction and analysis for very large-scale expression data

Nucleic Acids Research 2013

10.1093/nar/gkt983

K.-C.

Genome-wide coexpression dynamics: theory and application

Proceedings of the National Academy of Sciences of the United States of America 2002 99 26 16875 16880

2-s2.0-0037168518

10.1073/pnas.252466999

Horvath

Dong

Geometric interpretation of gene coexpression network analysis

PLoS Computational Biology 2008 4 8

2-s2.0-50949084896

10.1371/journal.pcbi.1000117

e1000117

Madar

Greenfield

Vanden-Eijnden

Bonneau

DREAM3: network inference using dynamic context likelihood of relatedness and the inferelator

PloS ONE 2010 5 3

e9803

10.1371/journal.pone.0009803

Nazri

Lio

Investigating meta-approaches for reconstructing gene networks in a mammalian cellular context

PLoS ONE 2012 7 1

2-s2.0-84855485205

10.1371/journal.pone.0028713

e28713

Xiong

Yuan

Fillingham

J. S.

Garg

Chang

Liu

Pearlman

R. E.

Miao

Gene network landscape of the ciliate tetrahymena thermophila

PLoS ONE 2011 6 5

2-s2.0-79957560183

10.1371/journal.pone.0020124

e20124

Michoel

de Smet

Joshi

van de Peer

Marchal

Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks

BMC Systems Biology 2009 3 article 49

2-s2.0-66449116353

10.1186/1752-0509-3-49

Wang

Zhang

Hearn

D. J.

Kang

I.-H.

Punwani

J. A.

Skaggs

M. I.

Drews

G. N.

Schumaker

K. S.

Yadegari

Identification of transcription-factor genes expressed in the arabidopsis female gametophyte

BMC Plant Biology 2010 10 article 110

2-s2.0-77953491224

10.1186/1471-2229-10-110

Osorno

Chambers

Transcription factor heterogeneity and epiblast pluripotency

Philosophical Transactions of the Royal Society B 2011 366 article 1575 2230 2237

2-s2.0-79960058253

10.1098/rstb.2011.0043

Linnekin

Park

L. S.

Farrar

W. L.

Dissociation of human cytokine receptor expression and signal transduction

Blood 1992 80 8 1896 1904

2-s2.0-0026787603

Edwards

Elements of information theory, 2nd edition

Information Processing & Management 2008 44 1 400 401

F.-X.

Mining and state-space modeling and verification of sub-networks from large-scale biomolecular networks

BMC Bioinformatics 2007 8 article 324

2-s2.0-38649114712

10.1186/1471-2105-8-324

Aluru

Zola

Nettleton

Aluru

Reverse engineering and analysis of large genome-scale gene networks

Nucleic Acids Research 2013 41 1, article e24

Wilcox

R. R.

A note on the Theil-Sen regression estimator when the regresser is random and the error term is heteroscedastic

Biometrical Journal 1998 40 3 261 268

2-s2.0-0032373378

Lyakhov

I. G.

Krishnamachari

Schneider

T. D.

Discovery of novel tumor suppressor p53 response elements using information theory

Nucleic Acids Research 2008 36 11 3828 3833

2-s2.0-46349085558

10.1093/nar/gkn189

Reshef

D. N.

Reshef

Y. A.

Finucane

H. K.

Grossman

S. R.

McVean

Turnbaugh

P. J.

Lander

E. S.

Mitzenmacher

Sabeti

P. C.

Detecting novel associations in large data sets

Science 2011 334 6062 1518 1524

2-s2.0-83755163018

10.1126/science.1205438

Kohl

Wiese

Warscheid

Cytoscape: software for visualization and analysis of biological networks

Methods in Molecular Biology 2011 696 291 303

2-s2.0-79952202448

van den Bulcke

van Leemput

Naudts

van Remortel

Verschoren

de Moor

Marchal

SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms

BMC Bioinformatics 2006 7 article 43

2-s2.0-33144486498

10.1186/1471-2105-7-43

Baldi

Brunak

Chauvin

Andersen

C. A. F.

Nielsen

Assessing the accuracy of prediction algorithms for classification: an overview

Bioinformatics 2000 16 5 412 424

2-s2.0-0033931867

Irizarry

R. A.

Hobbs

Collin

Beazer-Barclay

Y. D.

Antonellis

K. J.

Scherf

Speed

T. P.

Exploration, normalization, and summaries of high density oligonucleotide array probe level data

Biostatistics 2003 4 2 249 264

2-s2.0-0142121516

Vodyanik

M. A.

Smuga-Otto

Antosiewicz-Bourget

Frane

J. L.

Tian

Nie

Jonsdottir

G. A.

Ruotti

Stewart

Slukvin

I. I.

Thomson

J. A.

Induced pluripotent stem cell lines derived from human somatic cells

Science 2007 318 5858 1917 1920

2-s2.0-36749043230

10.1126/science.1151526

Watabe

Roles of Dppa2 in the regulation of the present status and future of pluripotent stem cells

Journal of Biochemistry 2012 152 1 1 3

10.1093/jb/mvs057

Chan

Y. S.

Göke

A PRC2-dependent repressive role of PRDM14 in human embryonic stem cells and induced pluripotent stem cell reprogramming

Stem Cells 2012 31 4 682 692

Barbaric

Harrison

N. J.

Rediscovering pluripotency: from teratocarcinomas to embryonic stem cells. Cardiff, 10–12 October 2011

The International Journal of Developmental Biology 2012 56 4 197 206

10.1387/ijdb.123503ib

Ramos

P. S.

Sajuthi

Langefeld

C. D.

Immune function genes CD99L2, JARID2 and TPO show association with autism spectrum disorder

Molecular Autism 2012 3 1, article 4

10.1186/2040-2392-3-4

Boyer

L. A.

Tong

I. L.

Cole

M. F.

Johnstone

S. E.

Levine

S. S.

Zucker

J. P.

Guenther

M. G.

Kumar

R. M.

Murray

H. L.

Jenner

R. G.

Gifford

D. K.

Melton

D. A.

Jaenisch

Young

R. A.

Core transcriptional regulatory circuitry in human embryonic stem cells

Cell 2005 122 6 947 956

2-s2.0-25144525014

10.1016/j.cell.2005.08.020

Calloni

Cordero

E. A.

Henriques

J. A.

Bonatto

Reviewing and updating the major molecular markers for stem cells

Stem Cells and Development 2013 22 9 1455 1476

10.1089/scd.2012.0637

Kim

Kang

T.-W.

Lee

H.-C.

Han

Y.-M.

Kim

Shin

H. D.

Cheong

H. S.

Lee

Kim

S.-Y.

Kim

Y. S.

Identification of DNA methylation markers for lineage commitment of in vitro hepatogenesis

Human Molecular Genetics 2011 20 14 2722 2733

2-s2.0-79959808342

10.1093/hmg/ddr171

Persson

Wei

Milne

Page

G. P.

Somerville

C. R.

Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets

Proceedings of the National Academy of Sciences of the United States of America 2005 102 24 8633 8638

2-s2.0-20844445318

10.1073/pnas.0503392102

Srivastava

Making or breaking the heart: from lineage determination to morphogenesis

Cell 2006 126 6 1037 1048

2-s2.0-33748621746

10.1016/j.cell.2006.09.003

Vincentz

J. W.

Barnes

R. M.

Firulli

B. A.

Conway

S. J.

Cooperative interaction of Nkx2.5 and Mef2c transcription factors during heart development

Developmental Dynamics 2008 237 12 3809 3819

2-s2.0-57149103137

10.1002/dvdy.21803

Prall

O. W. J.

Menon

M. K.

Solloway

M. J.

Watanabe

Zaffran

Bajolle

Biben

McBride

J. J.

Robertson

B. R.

Chaulet

Stennard

F. A.

Wise

Schaft

Wolstein

Furtado

M. B.

Shiratori

Chien

K. R.

Hamada

Black

B. L.

Saga

Robertson

E. J.

Buckingham

M. E.

Harvey

R. P.

An Nkx2-5/Bmp2/Smad1 negative feedback loop controls heart progenitor specification and proliferation

Cell 2007 128 5 947 959

2-s2.0-33847344204

10.1016/j.cell.2007.01.042

Takeda

Briggs

L. E.

Wakimoto

Marks

M. H.

Warren

S. A.

J. T.

Weinberg

E. O.

Robertson

K. D.

Chien

K. R.

Kasahara

Slow progressive conduction and contraction defects in loss of Nkx2-5 mice after cardiomyocyte terminal differentiation

Laboratory Investigation 2009 89 9 983 993

2-s2.0-69349086498

10.1038/labinvest.2009.59

Lyons

Parsons

L. M.

Hartley

Andrews

J. E.

Robb

Harvey

R. P.

Myogenic and morphogenetic defects in the heart tubes of murine embryos lacking the homeo box gene Nkx2-5

Genes and Development 1995 9 13 1654 1666

2-s2.0-0029090829

Nowotschin

Liao

Gage

P. J.

Epstein

J. A.

Campione

Morrow

B. E.

Tbx1 affects asymmetric cardiac morphogenesis by regulating Pitx2 in the secondary heart field

Development 2006 133 8 1565 1573

2-s2.0-33646706385

10.1242/dev.02309

Théveniau-Ruissy

Dandonneau

Mesbah

Ghez

Mattei

M.-G.

Miquerol

Kelly

R. G.

The del22q11.2 candidate gene Tbx1 controls regional outflow tract identity and coronary artery patterning

Circulation Research 2008 103 2 142 148

2-s2.0-48849103985

10.1161/CIRCRESAHA.108.172189

Zhang

Baldini

In vivo response to high-resolution variation of Tbx1 mRNA dosage

Human Molecular Genetics 2008 17 1 150 157

2-s2.0-37549052138

10.1093/hmg/ddm291

Merscher

Funke

Epstein

J. A.

Heyer

Puech

M. M.

Xavier

R. J.

Demay

M. B.

Russell

R. G.

Factor

Tokooya

Jore

B. S.

Lopez

Pandita

R. K.

Lia

Carrion

Schorle

Kobler

J. B.

Scambler

Wynshaw-Boris

Skoultchi

A. I.

Morrow

B. E.

Kucherlapati

TBX1 is responsible for cardiovascular defects in velo-cardio-facial/DiGeorge syndrome

Cell 2001 104 4 619 629

2-s2.0-17744395906

10.1016/S0092-8674(01)00247-1

Risebro

C. A.

Searles

R. G.

Melville

A. A. D.

Ehler

Jina

Shah

Pallas

Hubank

Dillard

Harvey

N. L.

Schwartz

R. J.

Chien

K. R.

Oliver

Riley

P. R.

Prox1 maintains muscle structure and growth in the developing heart

Development 2009 136 3 495 505

2-s2.0-64549159888

10.1242/dev.030007

Risebro

C. A.

Petchey

L. K.

Smart

Epistatic rescue of Nkx2. 5 adult cardiac conduction disease phenotypes by prospero-related homeobox protein 1 and HDAC3

Circulation Research 2012 111 2 e19 e31

10.1161/CIRCRESAHA.111.260695

Song

Backs

McAnally

Gerard

R. D.

Richardson

J. A.

Hill

J. A.

Bassel-Duby

Olson

E. N.

The transcriptional coactivator CAMTA2 stimulates cardiac growth by opposing class II histone deacetylases

Cell 2006 125 3 453 466

2-s2.0-33646144852

10.1016/j.cell.2006.02.048

Lickert

Takeuchi

J. K.

Von Both

Walls

J. R.

McAuliffe

Adamson

S. L.

Henkelman

R. M.

Wrana

J. L.

Rossant

Bruneau

B. G.

Baf60c is essential for function of BAF chromatin remodelling complexes in heart development

Nature 2004 432 7013 107 112

2-s2.0-8544270889

10.1038/nature03071

Miller

S. A.

Mohn

S. E.

Weinmann

A. S.

Jmjd3 and UTX play a demethylase-independent role in chromatin remodeling to regulate t-box family member-dependent gene expression

Molecular Cell 2010 40 4 594 605

2-s2.0-78649263390

10.1016/j.molcel.2010.10.028

Tian

Yuan

Goss

A. M.

Wang

Yang

Lepore

J. J.

Zhou

Schwartz

R. J.

Patel

Cohen

E. D.

Morrisey

E. E.

Characterization and in vivo pharmacological rescue of a Wnt2-Gata6 pathway required for cardiac inflow tract development

Developmental Cell 2010 18 2 275 287

2-s2.0-76549113391

10.1016/j.devcel.2010.01.008

Wang

D.-Z.

Chang

P. S.

Wang

Sutherland

Richardson

J. A.

Small

Krieg

P. A.

Olson

E. N.

Activation of cardiac gene expression by myocardin, a transcriptional cofactor for serum response factor

Cell 2001 105 7 851 862

2-s2.0-0035967868

10.1016/S0092-8674(01)00404-4

Zhou

Deng

Long

Xiong

Characterization of porcine ENO3: genomic and cDNA structure, polymorphism and expression

Genetics Selection Evolution 2008 40 5 563 579

2-s2.0-50049090276

10.1051/gse:2008015

Murphy

P. G.

Kadono

Steeber

D. A.

Tedder

T. F.

CHST1 and CHST2 sulfotransferase expression by vascular endothelial cells regulates shear-resistant leukocyte rolling via L-selectin

Journal of Leukocyte Biology 2001 69 4 565 574

2-s2.0-0035071302

Tedder

T. F.

CHST1 and CHST2 sulfotransferases expressed by human vascular endothelial cells: cDNA cloning, expression, and chromosomal localization

Genomics 1999 55 3 345 347

2-s2.0-0033082291

10.1006/geno.1998.5653

Hibuse

Maeda

Nakatsuji

Tochino

Fujita

Kihara

Funahashi

Shimomura

The heart requires glycerol as an energy substrate through aquaporin 7, a glycerol facilitator

Cardiovascular Research 2009 83 1 34 41

2-s2.0-67649373933

10.1093/cvr/cvp095

Gambert

Héliès-Toussaint

Grynberg

Regulation of intermediary metabolism in rat cardiac myocyte by extracellular glycerol

Biochimica et Biophysica Acta 2005 1736 2 152 162

2-s2.0-24344492092

10.1016/j.bbalip.2005.08.004

Ting

M.-C.

N. L.

Roybal

P. G.

Sun

Liu

Yen

Maxson

R. E.

Jr.

EphA4 as an effector of Twist1 in the guidance of osteogenic precursor cells during calvarial bone growth and in craniosynostosis

Development 2009 136 5 855 864

2-s2.0-65549099259

10.1242/dev.028605