Analysis of genome-scale gene networks (GNs) using large-scale gene expression data provides unprecedented opportunities to uncover gene interactions and regulatory networks involved in various biological processes and developmental programs, accelerating the discovery of novel knowledge about biological processes, pathways, and systems. The widely used context likelihood of relatedness (CLR) method, which scores the similarity of gene pairs using mutual information (MI), is among the accurate methods currently available for inferring GNs. However, this MI-based reverse-engineering method achieves satisfactory performance only when the sample size exceeds one hundred, which limits its application to GN construction from expression data sets with small sample sizes. We developed a high-performance web server, DeGNServer, to reverse engineer and decipher genome-scale networks. It extends the CLR method by integrating different correlation methods suitable for analyzing data sets ranging from moderate to large scale, such as expression profiles with tens to hundreds of microarray hybridizations, and implements all analysis algorithms using parallel computing techniques to infer gene-gene associations at extraordinary speed. In addition, we integrated the SNBuilder and GeNa algorithms for subnetwork extraction and functional module discovery. DeGNServer is publicly and freely available online.
The advent of high-throughput technologies, including microarray experiments and RNA-Seq, has generated terabytes of gene expression data for systematically identifying transcriptional regulation and interactions through the reconstruction of gene networks on a genome-wide scale. Analysis of whole genome-scale networks can provide a holistic view of all transcriptional regulation among and within different subnetworks and allows us to gain a more comprehensive understanding of the regulation of cellular processes and events. In the past few years, a large number of gene expression data sets from numerous labs have been published and deposited in public databases such as ArrayExpress [
To date, reverse engineering of aggregated high-volume gene expression data to build accurate gene networks is still very challenging. The challenge lies in the high dimensionality of the gene space and the large sample numbers, which demand fast and highly efficient algorithms as well as enhanced computational power. One set of algorithms operates under the hypothesis that coexpressed [
Meanwhile, the estimation of mutual information adopted in the CLR method relies heavily on the number of microarray data sets. The mutual information value can be estimated accurately only when the number of microarray profiles is larger than one hundred [
In this study, we integrated parallel computing technologies into DeGNServer to accelerate network reconstruction and subnetwork extraction. This enables DeGNServer to analyze “big data” at least one hundred times faster than the original mutual information-based CLR, making it much more feasible to reverse engineer global gene networks using data from a large genome and to discover novel biological knowledge. We also integrated multiple gene association methods into DeGNServer for network construction. The benchmark data set demonstrated that most of these association-based CLR methods reach an accuracy very similar to that of the original mutual information-based CLR method. In addition, we also integrated the SNBuilder [
We extended the CLR method by integrating several gene-gene association estimation methods, including Pearson, Spearman correlation [
The DeGNServer data analysis workflow.
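The CLR scoring step with a correlation-based association measure can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the server's C++ implementation: it assumes the absolute Spearman correlation as the association measure, uses each gene's own row of association values as its background distribution, and combines the two clipped z-scores in the commonly described form of CLR. The function names (`ranks`, `pearson`, `spearman`, `clr_scores`) are hypothetical.

```python
import math


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average rank of the tied block
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def clr_scores(expr):
    """expr: list of per-gene expression profiles (equal length).

    Returns a symmetric matrix of CLR-style scores (assumed form).
    """
    n = len(expr)
    assoc = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(spearman(expr[i], expr[j]))  # assumption: sign is ignored
            assoc[i][j] = assoc[j][i] = a

    # Background mean and standard deviation of each gene's association values.
    mean, std = [], []
    for i in range(n):
        row = [assoc[i][j] for j in range(n) if j != i]
        m = sum(row) / len(row)
        mean.append(m)
        std.append(math.sqrt(sum((a - m) ** 2 for a in row) / len(row)))

    def z(i, j):
        # z-score of the pair within gene i's background, clipped at zero.
        if std[i] == 0:
            return 0.0
        return max(0.0, (assoc[i][j] - mean[i]) / std[i])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
            scores[i][j] = scores[j][i] = s
    return scores
```

Swapping `spearman` for `pearson` (or another association measure) changes only the first loop; the background correction is unchanged, which is what makes it straightforward to plug several association methods into the same CLR framework.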
To accelerate the GN construction through the parallel computing, we split the whole data sets of these gene pairs into multiple subsets. Let
When a gene regulatory network is inferred from When When
For every processor in our Biogrid system, we assign
DeGNServer accepts normalized expression data either as a tab-delimited text file or as tab-delimited text. It provides two options for network construction: coexpression networks and CLR method-based association networks. Users may adjust the parameter settings, including the gene-gene association estimation method and the cut-off threshold, to control the size of the constructed networks. After the networks are reconstructed, users may submit a list of genes of interest and select different subnetwork identification methods to further mine and visualize the subnetworks generated by the different extraction methods.
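As an illustration of the input and thresholding steps, the sketch below parses tab-delimited expression text and keeps only gene pairs whose score reaches a cut-off. The table layout (a header row of sample names, then one row per gene with the gene identifier in the first column) is an assumption made for this example, and the function names `parse_expression_table` and `edges_above_cutoff` are hypothetical; the server's actual input specification may differ.

```python
import csv
import io


def parse_expression_table(text):
    """Parse tab-delimited text into (sample names, {gene: profile}).

    Assumed layout: first row is a header, first column holds gene IDs.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    profiles = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError("row for %s has %d values, expected %d"
                             % (gene, len(values), len(samples)))
        profiles[gene] = [float(v) for v in values]
    return samples, profiles


def edges_above_cutoff(genes, scores, cutoff):
    """Return (gene_a, gene_b, score) for pairs with score >= cutoff."""
    return [
        (genes[i], genes[j], scores[i][j])
        for i in range(len(genes))
        for j in range(i + 1, len(genes))
        if scores[i][j] >= cutoff
    ]
```

Raising the cut-off shrinks the edge list, which is how the threshold setting controls the size of the resulting network.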
DeGNServer lists links to the constructed networks/subnetworks in Cytoscape [
DeGNServer is currently deployed on Linux using the Resin Java server 4.0. It has been tested with popular web browsers, such as Internet Explorer, Firefox, and Google Chrome. The web interfaces are implemented in Java and JSP. All integrated backend analysis algorithms are implemented in C++ with parallel programming techniques and are deployed on an in-house Linux cluster, BioGrid, which currently consists of about 700 CPU cores, to achieve high-performance computing capacity. Upon job submission through the DeGNServer web server, the master node of the BioGrid system first divides the gene expression matrix into multiple submatrices and transfers these submatrices to the slave computing nodes in the Linux cluster. Next, the master node remotely launches the analysis pipelines and monitors the progress of the analyses on these computing nodes. Finally, the master node collects the association values of all gene-gene pairs for gene network construction and subnetwork analysis. For those species with large genomes, the distributions of gene-gene pairs are close to the normal distribution, so we applied the normal distribution to calculate the
Parallel implementation of the CLR method.
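The divide, compute, and collect pattern described above can be sketched on a single machine. In this minimal example, threads stand in for the cluster's computing nodes, Pearson correlation stands in for the association measure, and the gene pairs are split into nearly equal contiguous chunks; these choices and the function names (`split_pairs`, `score_chunk`, `parallel_associations`) are assumptions for illustration, not the BioGrid implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import math


def split_pairs(n_genes, n_workers):
    """Split all unordered gene pairs into n_workers near-equal chunks."""
    pairs = list(combinations(range(n_genes), 2))
    base, extra = divmod(len(pairs), n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        chunks.append(pairs[start:start + size])
        start += size
    return chunks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def score_chunk(expr, chunk):
    """Worker step: association values for one chunk of gene pairs."""
    return {(i, j): pearson(expr[i], expr[j]) for i, j in chunk}


def parallel_associations(expr, n_workers=4):
    """Master step: distribute chunks, then merge the partial results."""
    chunks = split_pairs(len(expr), n_workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for part in pool.map(lambda c: score_chunk(expr, c), chunks):
            merged.update(part)
    return merged
```

Because each pair's association value depends only on the two expression profiles involved, the chunks can be processed independently and merged in any order.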
To comprehensively evaluate performance of integrated network construction methods, we generated two groups of synthetic compendium gene expression data sets, each group with a series of data sets of various sizes, using the SynTReN software [
Average AUC scores from different association-based CLR methods for networks with larger and smaller numbers of expression profiles. Group A: networks constructed with a smaller number of gene expression samples (30~90 samples); Group B: networks constructed with a larger number of expression samples (100~1000 samples). AUC scores were obtained by varying the threshold settings. A perfect model has an AUC score of 1, while random guessing scores an AUC of around 0.5.
The ROC curve indicates the change of sensitivity (true positive rate) versus specificity (true negative rate) under different thresholds, and AUC score can represent the accuracy of each method better because it is independent of different thresholds.
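The threshold sweep behind an AUC score can be sketched as follows. This minimal example assumes each candidate gene pair has a score and a true/false label from a reference network; it lowers the threshold one distinct score at a time, records the true positive rate against the false positive rate (1 minus specificity), and integrates with the trapezoidal rule. The function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by sweeping every distinct score threshold."""
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every pair tied at this threshold before adding a ROC point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid
        prev_tpr, prev_fpr = tpr, fpr
    return auc
```

A scoring that ranks every true edge above every false one gives an area of 1, while scores that carry no information (for example, all tied) give 0.5.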
The following formula is used to calculate the sensitivity and the specificity:
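In standard usage, with TP, FP, TN, and FN denoting the numbers of true positives, false positives, true negatives, and false negatives, the two quantities are defined as shown here. The sensitivity and specificity values reported in the table of case studies are consistent with these definitions.

```latex
\text{Sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
```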
To validate the performance of DeGNServer, we analyzed genome-scale networks from 189 human stem cell microarray profiles. These data sets were generated in 17 individual experiments in which human embryonic stem cells were treated with various reagents to induce differentiation. This compendium data set is therefore enriched with regulatory events and interactions involved in pluripotency maintenance and in the transition from pluripotent stem cells to differentiated cell lineages, and thus it can serve as ideal test data for the performance of DeGNServer in discovering functionally associated gene subnetworks governing these processes. Of these 189 microarray data sets, 104 are high-density human gene expression arrays based on the HG17 assembly. This microarray platform contains 388,634 probes from 36,494 human locus identifiers. These 104 chips were compiled from 15 experiments in which stem cells were treated with different reagents that disrupted pluripotency. The reagents and conditions included 12-O-tetradecanoylphorbol-13-acetate (TPA) treatment in conditioned medium, TPA treatment in TeSR medium, BMP4 treatment with FGF, BMP4 treatment without FGF, and coculture with mouse OP9 cells. The remaining 85 high-density human gene expression arrays have 381,002 probes from 47,633 human locus identifiers based on HG18. These 85 microarray data sets were compiled from a set of experiments in which a variety of growth factors were applied to human embryonic stem cells under varying conditions for 3 days. Both the HG17 and HG18 microarray platforms were manufactured by NimbleGen Systems (
The gene networks including 21,167 genes and 200,000 links were reconstructed in less than 20 minutes with a
Generally, global networks with huge numbers of regulations and interactions are a “hairball”, from which we can hardly identify any patterns. To facilitate the identification of subnetworks or modules that regulate a specific biological process or developmental program, we integrated both SNBuilder [
The identified subnetwork contains the essential transcription factors and other genes required for pluripotency maintenance. The twelve genes on the inner ring are transcription factors known to play essential or important roles in pluripotency renewal of human embryonic stem cells. These include three master transcription factors, NANOG, POU5F1, and SOX2, which are absolutely required for pluripotency maintenance. The genes located on the outer ring were identified by DeGNServer as being closely coordinated with the transcription factors on the inner ring. The genes on the outer ring that are highlighted in yellow are those implicated by the existing literature in pluripotency renewal. This subnetwork was generated using the SNBuilder method [
Figure
To further examine the sensitivity, specificity, and prediction accuracy of the case study described above, we made some assumptions.
Sensitivity, specificity, and prediction accuracy of two case studies.
Case studies | TP | FP | TN | FN | Sensitivity | Specificity | Prediction accuracy |
---|---|---|---|---|---|---|---|
Human stem cell | 2.42 | 97.58 | 99.95 | 0.05 | 98% | 50.6% | 51.2% |
Mouse heart | 39.6 | 60.4 | 97.5 | 2.50 | 94.1% | 61.7% | 68.6% |
Prediction accuracy = ((TP + TN)/(TP + FP + TN + FN)) × 100%.
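As a worked check, the short sketch below recomputes the three metrics from the TP, FP, TN, and FN values in the table. It assumes the standard definitions of sensitivity, TP/(TP + FN), and specificity, TN/(TN + FP), together with the prediction accuracy formula given above; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Values taken from the two rows of the table.
human_stem_cell = confusion_metrics(tp=2.42, fp=97.58, tn=99.95, fn=0.05)
mouse_heart = confusion_metrics(tp=39.6, fp=60.4, tn=97.5, fn=2.50)
```

Up to rounding, both tuples agree with the sensitivity, specificity, and prediction accuracy columns reported for the two case studies.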
We also analyzed a compendium microarray data set from heart tissues of
The gene networks including 41,742 genes and 3,869,157 links were deciphered in less than 30 minutes with a
The pathway that controls murine heart development can be obtained from NCBI’s BioSystems database with an accession number of 672437 [
The subnetwork that is responsible for heart growth and development in mouse. The whole genome-scale network was constructed from 175 chips of the GPL1261 platform using DeGNServer, and the subnetwork was then extracted using a community-finding algorithm called GeNa [
To further examine the sensitivity, specificity, and prediction accuracy of the two case studies as shown above, we made some assumptions:
We developed DeGNServer to enable the reconstruction of genome-scale GNs from the increasingly large volume of gene expression data accumulating in the public domain. Users may use it to generate whole genome-scale GNs from large amounts of gene expression data in any species. After whole-genome GN construction, users can obtain subnetworks by providing a few genes of interest. All subnetworks generated with different genes of interest and thresholds are automatically listed online for downloading and study. When genome-wide network construction was performed with 189 human microarray profiles as input for DeGNServer, we could identify a subnetwork containing the majority of genes involved in pluripotency maintenance in human embryonic stem cells [
Although our tests with synthetic data indicated that Spearman-based CLR appears to perform better than any of the other methods, including the original mutual information-based CLR, we still make all methods available in DeGNServer. This is because the efficiency of different methods may depend on the properties of the biological data, as we showed in a previous study [
We have developed a high-performance web-based platform, DeGNServer, for genome-scale GN construction and subnetwork extraction. DeGNServer is capable of analyzing gene expression data with a very high-dimensional gene space and a very large number of gene expression profiles. As tested, it can analyze hundreds of human microarray profiles (36,000 genes) to reconstruct gene association networks within 30 minutes, mainly through the combination of improved gene association estimation algorithms and parallel computing. DeGNServer is as accurate and sensitive as the original CLR method and runs hundreds to thousands of times faster. Furthermore, through the integration of network decomposition methods, DeGNServer is capable of identifying novel functionally cohesive subnetworks or modules.
The authors thank Dr. Xinbin Dai for his assistance in DeGNServer deployment. This work was supported by the National Science Foundation (Grant DBI: 0960897 to Patrick Xuechun Zhao) and the Samuel Roberts Noble Foundation.