Gene Regulation, Modulation, and Their Applications in Gene Expression Data Analysis

Common microarray and next-generation sequencing data analysis concentrate on tumor subtype classification, marker detection, and transcriptional regulation discovery during biological processes by exploring the correlated gene expression patterns and their shared functions. Genetic regulatory network (GRN) based approaches have been employed in many large studies in order to scrutinize for dysregulation and potential treatment controls. In addition to gene regulation and network construction, the concept of the network modulator that has significant systemic impact has been proposed, and detection algorithms have been developed in past years. Here we provide a unified mathematic description of these methods, followed with a brief survey of these modulator identification algorithms. As an early attempt to extend the concept to new RNA regulation mechanism, competitive endogenous RNA (ceRNA), into a modulator framework, we provide two applications to illustrate the network construction, modulation effect, and the preliminary finding from these networks. Those methods we surveyed and developed are used to dissect the regulated network under different modulators. Not limit to these, the concept of “modulation” can adapt to various biological mechanisms to discover the novel gene regulation mechanisms.


Introduction
With the development of microarray [1] and lately the next generation sequencing techniques [2], transcriptional profiling of biological samples, such as tumor samples [3][4][5] and samples from other model organisms, have been carried out in order to study sample subtypes at molecular level or transcriptional regulation during the biological processes [6][7][8]. While common data analysis methods employ hierarchical clustering algorithms or pattern classification to explore correlated genes and their functions, the genetic regulatory network (GRN) approaches were employed to scrutinize for dysregulation between different tumor groups or biological processes (see reviews [9][10][11][12]).
To construct the network, most of research is focused on methods based on gene expression data derived from highthroughput technologies by using metrics such as Pearson or Spearman correlation [13], mutual information [14], codetermination method [15,16], Bayesian methods [17,18], and probabilistic Boolean networks [19]. Recently, new transcriptional regulation via competitive endogenous RNA (ceRNAs) has been proposed [20,21], introducing additional dimension in modeling gene regulation. This type of regulation requires the knowledge of microRNA (miRNA) binding targets [22,23] and the hypothesis of RNA regulations via competition of miRNA binding. Common GRN construction tries to confine regulators to be transcription factor (TF) proteins, a primary transcription programming machine, which relies on sequence-specific binding sites at target genes' promoter regions. In contrast, ceRNAs mediate gene regulation via competing miRNAs binding sites in target 3 UTR region, which exist in >50% of mRNAs [22,24]. In this study, we will extend the current network construction methods by incorporating regulation via ceRNAs.
In tumorigenesis, gene mutation is the main cause of the cancer [25]. The mutation may not directly reflect in the change at the gene expression level; however, it will disrupt gene regulation [26][27][28]. In Hudson et al., they found that mutated myostatin and MYL2 showed different coexpressions when comparing to wild-type myostatin. Chun et al. also showed that oncogenic KRAS modulates HIF-1 and HIF-2 target genes and in turn modulates cancer metabolism. Stelniec-Klotz et al. presented a complex hierarchical model of KRAS modulated network followed by double perturbation experiments. Shen et al. [29] showed a temporal change of GRNs modulated after the estradiol stimulation, indicating important role of estrogen in modulating GRNs. Functionally, modulation effect of high expression of ESR1 was also reported by Wilson and Dering [30] where they studied previously published microarray data with cells treated with hormone receptor agonists and antagonists [31][32][33]. In this study, a comprehensive review of existing algorithms to uncover the modulators was provided. Given either mutation or protein expression status was unknown under many of reported studies, the problem of how to partition the diverse samples with different conditions, such as active or inactive oncogene status (and perhaps a combination of multiple mutations), and the prediction of a putative modulator of gene regulation remains a difficult task.
By combining gene regulation obtained from coexpression data and ceRNAs, we report here an early attempt to unify two systems mathematically while assuming a known modulator, estrogen receptor (ER). By employing the TCGA [3] breast tumor gene expressions data and their clinical test result (ER status), we demonstrate the approach of obtaining GRN via ceRNAs and a new presentation of ER modulation effects. By integrating breast cancer data into our unique ceRNAs discovery website, we are uniquely positioned to further explore the ceRNA regulation network and further develop the discovery algorithms in order to detect potential modulators of regulatory interactions.

Regulation of Gene
Expression. The complex relationships among genes and their products in a cellular system can be studied using genetic regulatory networks (GRNs). The networks model the different states or phenotypes of a cellular system. In this model, the interactions are commonly modeled as regulator-target pairs with edges between regulator and target pair representing their interaction direction, as shown in Figure 1(a). In this model a target gene is a gene whose expression can be altered (activated or suppressed) by a regulator gene. This definition of a target gene implies that any gene can be at some point a target gene or a direct or indirect regulator depending on its position in the genetic regulatory network. The regulator gene is a gene that controls (activates or suppresses) its target genes' expression. The consequences of these activated (or suppressed) genes sometimes are involved in specific biological functions, such as cell proliferation in cancer. Examples of regulator-target pair in biology are common. For example, a target gene CDCA7 (cell division cycle-associated protein 7) is a c-Myc (regulator) responsive gene, and it is part of c-Myc-mediated transformation of lymphoblastoid cells. Furthermore, as shown in Figure 1(b), a regulator gene can also act as a target gene if there exists an upstream regulator.
If the interaction is modeled after Boolean network (BN) model [34], then where each regulator ∈ {0, 1} is a binary variable, as well as it is target . As described by (1), the target at time + 1 is completely determined by the values of its regulators at time by means of a Boolean function ∈ , where is a collection of Boolean functions. Thus, the Boolean network ( , ) is defined as a set of nodes (genes) = where Parents( ( + 1)) = { 1 ( ), . . . , ( ), ( )} is the set of regulators, or parents, of , ( ( + 1) | Parents ( ( + 1))) is the conditional distribution defining the regulatortarget relationship, and (Parents( ( + 1))) models the prior distribution of regulators. Unlike in (1), the target and regulators in (2) are modeled as random variables. Despite of this difference, in both (1) and (2), the target is always a function (or conditional distribution) of the regulator (or parents). When the relationship is defined by a Boolean function as in (1), the conditional distribution in (2) take the form of a binomial distribution (or a multinomial distribution when both regulators and target take more than two states). Other distributions such as the Gaussian and Poisson can be introduced to model more complex relationships than the Boolean. The network construction, inference, and control, however, are beyond the scope of this paper, and we leave the topics to the literatures [9,35,36].
The interactions among genes and their products in a complex cellular process of gene expression are diverse, governed by the central dogma of molecular biology [37]. There are different regulation mechanisms that can actuate during different stages. Figure 2 shows three different cases of regulation of gene expression. Figure 2(a) shows the case of regulation of expression in which a transcription factor (TF) regulates the expression of a protein-coding gene (in dark grey) by binding to the promoter region of target gene . Figure 2(b) is the case of regulation at the protein level in which a ligand protein interacts with a receptor to activate relay molecules to transduce outside signals directly into cell behavior. Figure 2(c) is the case of regulation at the RNA level in which one or more miRNAs regulate target mRNA by translational repression or target transcript degradation via binding to sequence-specific binding sites (called miRNA response elements or MREs) in 3 UTR region. As illustrated in Figure 2 between regulator-target pairs, a common element in network structure.

Modulation of Gene Regulation.
Different from the concept of a coregulator commonly referred in the regulatory biology, a modulator denotes a gene or protein that is capable of altering the endogenous gene expression at one stage or time. In the context of this paper, we specifically define a modulator to be a gene that can systemically influence the interaction of regulator-target pair, either to activate or suppress the interaction in the presence/absence of the modulator. One example of modulator is the widely studied estrogen receptor (ER) in breast cancer studies [38][39][40]; the ER status determines not only the tumor progression, but also the chemotherapy treatment outcomes. It is well known that binding of estrogen to receptor facilitates the ER activities to activate or repress gene expression [41], thus effectively modulating the GRN. Figure 3 illustrates the model of the interaction between a modulator ( ) and a regulator ( ) target ( ) pair that it modulates. Following the convention used in (1) and (2), the modulation interaction in Figure 3 can be modeled by where represents target expression, represents the parents (regulators) of target , and F ( ) (⋅) is the regulation function modulated by . When F ( ) (⋅) is stochastic, the relationship is modeled by the conditional distribution as where ( | , ) models the regulator-target relationship modulated by and ( | ) defines the prior distribution of regulators (parents) expression modulated by . Different distribution models can be used to model different mechanisms for modulation. At the biological level, there are different mechanisms for modulation of the interaction -, and currently several algorithms for prediction of the modulators has been developed. This survey presents the latest formulations and algorithms for prediction of modulators.

Survey of Algorithms of Gene Regulation and Modulation Discovery
During the past years, many computational tools have been developed for regulation network construction, and then depending on the hypothesis, modulator concept can be tested and extracted. Here we will focus on modulator detection algorithms (MINDy, Mimosa, GEM, and Hermes). To introduce gene-gene interaction concept, we will also briefly discuss algorithms for regulation network construction (ARACNE) and ceRNA identification algorithm (MuTaMe).

ARACNE (Algorithm for the Reconstruction of Accurate
Cellular Networks). ARACNE [14,42] is an algorithm that extracts transcriptional networks from microarray data by using an information-theoretic method to reduce the indirect interactions. ARACNE assumes that it is sufficient to estimate 2-way marginal distributions, when sample size > 100, in genomics problems, such that Or a candidate interaction can be identified using estimation of mutual information MI of genes and , MI( , ) = MI , where MI = 1 if genes and are identical, and MI is zero if ( , ) = ( ) ( ), or and are statistically independent. Specifically, the estimation of mutual information of gene expressions and of regulator and target genes is done by using the Gaussian kernel estimator. The ARACNE takes additional two steps to clean the network: (1) removing MI if its value is less than that derived from two independent genes via random permutation and (2) data processing inequality (DPI). The algorithm further assumes that for a triplet gene ( , , ), where regulates , through , then with no alternative path, where → represents regulation relationship. In other words, the lowest mutual information MI , is from an indirect interaction and thus shall be removed from the GRN by ARACNE in the DPI step. A similar algorithm was proposed [43] to utilize conditional mutual information to explore more than 2 regulators.

MINDy (Modulator Inference by Network Dynamics).
Similar to ARACNE, MINDy is also an information-theoretic algorithm [44]. However, MINDy aims to identify potential transcription factor-(TF-target) gene pairs that can be modulated by a candidate modulator. MINDy assumes that the expressions of the modulated TF-target pairs are of different correlations under different expression state of the modulator. For simplicity and computational consideration, MINDy considers only two modulator expression states, that is, up-( = 1) or down-expression ( = 0). Then, it tests if the expression correlations of potential TF-target pairs are significantly different for modulator up-expression versus down-expression. The modulator dependent correlation is assessed by the conditional mutual information (CMI) or ( , | = 0) and ( , | = 1). Similar to ARACNE, the CMI is calculated using the Gaussian kernel estimator. To test if a pair of TF ( ) and target ( ) is modulated by , the CMI difference can be calculated as The pair is determined to be modulated if Δ ̸ = 0. The significance values for Δ ̸ = 0 is computed using permutation tests.

Mimosa.
Similarly to MINDy, Mimosa [45] was proposed to identify modulated TF-target pairs. However, it does not preselect a set of modulators of interest but rather aims to also search for the modulators. Mimosa also assumes that a modulator takes only two states, that is, absence and presence or 0 and 1. The modulated regulator-target pair is further assumed to be correlated when a modulator is present but uncorrelated when it is absent. Therefore, the distribution of a modulated TF-target pair, and , naturally follows a mixture distribution where is the probability of the modulator being absent. Particularly, an uncorrelated and correlated bivariant Gaussian distributions were introduced to model different modulated regulator-target relationship, such that ( , | = 1) where models the correlation between and when the modulator is present. With this model, Mimosa sets out to fit the samples of every pair of potential regulator target with the mixture model (7). This is equivalent to finding

GEM (Gene Expression Modulator)
. GEM [46] improves over MINDy by predicting how a modulator-TF interaction affects the expression of the target gene. It can detect new types of interactions that result in stronger correlation but low Δ , which therefore would be missed by MINDy. GEM hypothesizes that the correlation between the expression of a modulator and a target must change, as that of the TF changes. Unlike the previous surveyed algorithms, GEM first transforms the continuous expression levels to binary states (up-(1) or down-expression (0)) and then works only with discrete expression states. To model the hypothesized relationship, the following model is proposed: where is a constant, and model the effect of modulator and TF on the target genes, and represents the effect of modulator-TF interaction on the target gene. If the modulator-TF interaction has an effect on , then will be nonzero. For a given ( , , ) triplets GEM devised an algorithm to estimate the model coefficients in (10) and a test to determine if is nonzero, or is a modulator of and .

MuTaMe (Mutually Targeted MRE Enrichment).
The goal of MuTaMe [21] is to identify ceRNA networks of a gene of interest (GoI) or mRNA that share miRNA response elements (MREs) of same miRNAs. Figure 4 shows two mRNAs, where one is the GoIy and the other is a candidate ceRNA or modulator . In the figure, the miRNA represented in color red has MREs in both mRNA and mRNA ; in this case the presence of mRNA will start the competition with for miRNA represented in color red.
The hypothesis of MuTaMe is that mRNAs that have many of the same MREs can regulate each other by competing for miRNAs binding. The input of this algorithm is a GoI, which is targeted by a group of miRNAs known to the user. Then, from a database of predicted MREs for the entire transcriptome, it is possible to obtain the binding sites and its predicted locations in the 3 UTR for all mRNAs. This data is used to generate scores for each mRNA based on several features: Then each candidate transcript will be assigned a score that results from multiplying the scores in (a) to (d). This score indicates the likelihood of the candidates to be ceRNAs and will be used to predict ceRNAs. [20] is an extension of MINDy that infers candidate modulators of miRNA activity from expression profiles of genes and miRNAs of the same samples. Hermes makes inferences by estimating the MI and CMI. However, different from MINDy (7), Hermes extracts the dependences of this triplet by studying the difference between the CMI of expression and expression conditional on the expression of and the MI of and expressions as follows:

Hermes. Hermes
These quantities and their associated statistical significance can be computed from collections of expression of genes with number of samples 250 or greater. Hermes expands MINDy by providing the capacity to identify candidate modulator genes of miRNAs activity. The presence of these modulators ( ) will affect the relation between the expression of the miRNAs targeting a gene ( ) and the expression level of this gene ( ).
In summary, we surveyed some of the most popular algorithms for the inference of modulator. Additional modulator identification algorithms are summarized in Table 1. It is worth noting that the concept of modulator applies to cases beyond discussed in this paper. Such example includes the multilayer integrated regulatory model proposed in Yan et al. [49], where the top layer of regulators could be also considered as "modulators. "

Applications to Breast Cancer Gene Expression Data
Algorithms of utilizing modulator concept have been implemented in various software packages. Here we will discuss 6 Advances in Bioinformatics MINDy Gene-pair interaction dependency on modulator candidates by using the conditional mutual information (CMI). [44] Mimosa Search for modulator by partition samples with a Gaussian mixture model. [45] GEM A probabilistic method for detecting modulators of TFs that affect the expression of target gene by using a priori knowledge and gene expression profiles. [46] MuTaMe Based on the hypothesis that shared MREs can regulate mRNAs by competing for microRNAs binding. [21] Hermes Extension of MINDy to include microRNAs as candidate modulators by using CMI and MI from expression profiles of genes and miRNAs of the same samples. [20] ER modulator Analyzes the interaction between TF and target gene conditioned on a group of specific modulator genes via a multiple linear regression. [48] two new applications, MEGRA and TraceRNA, implemented in-house specifically to utilize the concept of differential correlation coefficients and ceRNAs to construct a modulated GRN with a predetermined modulator. In the case of MGERA, we chose estrogen receptor, ESR1, as the initial starting point, since it is one of the dominant and systemic factor in breast cancer; in the case of TraceRNA, we also chose gene ESR1 and its modulated gene network. Preliminary results of applications to TCGA breast cancer data are reported in the following 2 sections.

MGERA. The Modulated
Gene Regulation Analysis algorithm (MGERA) was designed to explore gene regulation pairs modulated by the modulator . The regulation pairs can be identified by examining the coexpression of two genes based on Pearson correlation (similar to (7) in the context of correlation coefficient). Fisher transformation is adopted to normalize the correlation coefficients biased by sample sizes to obtain equivalent statistical power among data with different sample sizes. Statistical significance of difference in the absolute correlation coefficients between two genes is tested by the student -test following Fisher transformation. For the gene pairs with significantly different coefficients between two genes, active and deactive statuses are identified by examining the modulated gene expression pairs (MGEPs). The MGEPs are further combined to construct the modulated gene regulation network for a systematic and comprehensive view of interaction under modulation.
To demonstrate the ability of MGERA, we set estrogen receptor (ER) as the modulator and applied the algorithm to TCGA breast cancer expression data [3] which contains 588 expression profiles (461 ER+ and 127 ER−). By using value <0.01 and the difference in the absolute Pearson correlation coefficients >0.6 as criteria, we identified 2,324 putative ER+ MGEPs, and a highly connected ER+ modulated gene regulation network was constructed ( Figure 5). The top ten genes with highest connectivity was show in Table 2. The cysteine/tyrosine-rich 1 gene (CYYR1), connected to 142 genes, was identified as the top hub gene in the network and thus may serve as a key regulator under ER+ modulation. Table 2: Hub genes derived from modulated gene regulation network ( Figure 5). Gene Ontology analysis of CYYR1 and its connected neighbor genes revealed significant association with extracellular matrix, epithelial tube formation, and angiogenesis.

TraceRNA.
To identify the regulation network of ceRNAs for a GoI, we developed a web-based application TraceRNA presented earlier in [50] with extension to regulation network construction. The analysis flow chart of TraceRNA was shown in Figure 6. For a selected GoI, the GoI binding miRNAs (GBmiRs) were derived either validated miRNAs from miR-TarBase [51] or predicted miRNAs from SVMicrO [52]. Then mRNAs (other than the given GoI) also targeted by GBmiRs were identified as the candidates of ceRNAs. The relevant (or tumor-specific) gene expression data were used to further strengthen relationship between the ceRNA candidates and GoI. The candidate ceRNAs which coexpressed with GoI were reported as putative ceRNAs. To construct the gene regulation network via GBmiRs, we set each ceRNA as the secondary GoI, and the ceRNAs of these secondary GoIs were identified by applying the algorithm recursively. Upon identifying all the ceRNAs, the regulation network of ceRNAs of a given GoI was constructed.  To identify ceRNA candidates, three miRNAs binding prediction algorithms, SiteTest, SVMicrO, and BCMicrO, were used in TracRNA. SiteTest is an algorithm similar to MuTaMe and uses UTR features for target prediction. SVMi-crO [52] is an algorithm that uses a large number of sequencelevel site as well as UTR features including binding secondary structure, energy, and conservation, whereas BCMicrO [53] employs a Bayesian approach that integrates predictions from 6 popular algorithms including TargetScan, miRanda, PicTar, mirTarget, PITA, and DIANA-microT. Pearson correlation coefficient was used to test the coexpression between the GoI and the candidate ceRNAs. We utilized TCGA breast cancer cohort [3] as the expression data, by using 60% of GBmiRs as common miRNAs and Pearson correlation coefficient >0.9 as criteria. The final scores of putative ceRNAs (see Table 3, last column) were generated by using Borda merging method which rerank the sum of ranks from both GBmiR binding and coexpression values [54]. To illustrate the utility of the TraceRNA algorithm for breast cancer study, we also focus on the genes interacted with the estrogen receptor alpha, ESR1, with GBmiRs including miR-18a, miR-18b, miR-193b, miR-19a, miR-19b, miR-206, miR-20b, miR-22, miR-221, miR-222, miR-29b, and miR-302c. The regulation network generated by ESR1 as the initial GoI is shown in Figure 7, and the top 18 ceRNAs are provided in Table 3. The TraceRNA algorithm can be accessed http://compgenomics.utsa.edu/cerna/.

Conclusions
In this report, we attempt to provide a unified concept of modulation of gene regulation, encompassing earlier mRNA expression based methods and lately the ceRNA method. We expect the integration of ceRNA concept into the gene-gene interactions, and their modulator identification will further enhance our understanding in gene interaction and their systemic influence. Applications provided here also represent examples of our earlier attempt to construct modulated networks specific to breast cancer studies. Further investigation will be carried out to extend our modeling to provide a unified understanding of genetic regulation in an altered environment.
Advances in Bioinformatics 9