Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Clustering methods are very commonly used to obtain sets of coexpressed genes from expression arrays; nevertheless, there are no appropriate tools to study the expression networks among these sets of coexpressed genes. The aim of the developed tools is to enable the study of the complex expression dependences that exist between sets of coexpressed genes. For this purpose, we start by detecting the nonlinear expression relationships between pairs of genes, as well as the coexpressed genes. Next, we form networks among sets of coexpressed genes that maintain nonlinear expression dependences between all of them. The expression relationship between the sets of coexpressed genes is defined by the expression relationship between the
Organisms have evolved to vary internal and external cell environments by carefully controlling the abundance and activity of these proteins to suit their conditions. To simplify this task, genes whose products function together are often under common regulatory control. This regulatory control is such that these genes are coordinately expressed under the appropriate conditions. The experimental observation that a set of genes is coexpressed frequently implies that the genes share a biological function and are under common regulatory control [
Microarray technology, as well as the new techniques of next generation sequencing (NGS), allows us to obtain large size gene expression arrays [
Current statistical technologies allow us to study inclusion relationships between clusters of coexpressed genes, that is, to study which clusters of coexpressed genes would be more correlated and which others would be more uncorrelated [
With the developed tools, we expect to detect the complex expression dependences between the different sets of coexpressed genes, synthesize them, and make them easier for the researcher to interpret. With this purpose we will provide the researcher with networks that show the expression dependences between all the coexpressed gene sets of the network. In this way, the researcher will be able to study the alternation or synchronism among all these sets of coexpressed genes.
Our methodology is based on the following three principles. First, the interdependence between sets of coexpressed genes cannot be described by linear expression relationships. Second, if two genes maintain an expression relationship with a certain type of curve, the genes coexpressed with these two genes maintain expression relationships of the same type between them. Third, the curve type of the intergroup expression relationships will describe the dependence of activation and deactivation between these sets of coexpressed genes. Thus, the strategy proposed here is focused on the detection of nonlinear expression relationships between sets of coexpressed genes.
Activation and deactivation dependences between sets of coexpressed genes can be very complex. Some of the most common are the following: a set of coexpressed genes that acts as a trigger of another set of coexpressed genes; antagonist processes, where the set of coexpressed genes that carries out each process needs to be totally deactivated so that the other set can be expressed; and sets of coexpressed genes that activate or deactivate another set of coexpressed genes when they lose their basal expression values. In any case, the system does not presuppose any type of expression relationship. Since the system is able to recognize curves of very different shapes, it can process unknown activation and deactivation relationships as reliably as the best-known ones.
There are multiple works that highlight the relevance of the analysis of nonlinear expression relationships [
As it is shown in the mentioned paper [
The ultimate goal of our approach is to let researchers uncover the networks of processes hidden in their experimental data, as well as the activation and deactivation relationships between all of these processes. Furthermore, if the researcher is particularly interested in specific genes, the system allows him/her to study how the expression of a gene activates and deactivates processes other than the one carried out by this gene and the genes coexpressed with it.
The appropriate data to be analysed by our methodology must come from large sample series (i.e., expression matrices with a high number of sample conditions). Such a series should not consist of repetitions of the same sample condition; on the contrary, it must include as many different sample conditions as possible. A sample series with few experiments, or with repetitions of the same experiment, will not allow the detection of coexpressed genes, let alone of complex expression relationships. Note that denoising, normalization, and similar procedures should be considered before using our tools.
The examples provided in the paper and supplementary materials use the data from
The mathematics behind this system uses the principal curves of oriented points (PCOP) calculation [
All the nonlinear expression relationships are detected for each gene of the expression array. These nonlinear expression relationships are classified by the type of curve. The
The correlation degree provided by the PCOP calculation is what guarantees that the linear expression relationships (coexpressed genes) as well as the nonlinear expression relationships (intergroup expression relationships) are not a product of chance and have a biological meaning. For this reason, we require a high correlation degree for both the linear and the nonlinear expression relationships. We are also restrictive when classifying an expression relationship as linear and, consequently, when considering two genes as coexpressed. Even a small curvature in the relationship between two coexpressed genes can lead to different types of expression relationships between these two genes and the genes of another set of coexpressed genes. More concretely, let A and B be two coexpressed genes whose linear expression relationship has a small curvature, and let C be a set of coexpressed genes that maintain nonlinear expression relationships with A and B. The expression relationships of gene A with set C may then be of a different type from the expression relationships of gene B with set C.
A clique in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge. We consider the graph formed by all the highly correlated nonlinear expression relationships and obtain its cliques. These cliques do not yet relate sets of coexpressed genes, but individual genes. Nevertheless, since a clique relates not just a pair of genes but several genes in a network of nonlinear expression relationships, the cliques are the seed from which the sets of coexpressed genes are related to one another. A clique must contain at least three genes, and every pair of genes in the clique must maintain a nonlinear expression relationship.
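The clique detection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the highly correlated nonlinear relationships are already available as a list of gene pairs, it uses the standard Bron-Kerbosch algorithm with pivoting (the text does not say which clique algorithm the system uses), and the function name `maximal_cliques` and the gene labels are hypothetical.

```python
def maximal_cliques(edges, min_size=3):
    """Return the maximal cliques with at least `min_size` vertices.

    `edges` is an iterable of (gene_u, gene_v) pairs, one per highly
    correlated nonlinear expression relationship (assumed input format).
    Uses Bron-Kerbosch with pivoting; this is an illustrative choice.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-relationships
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    found = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already processed
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        # Choose the pivot with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


# Hypothetical gene labels: two triangles sharing gene "C", plus a pendant edge.
example_edges = [("A", "B"), ("A", "C"), ("B", "C"),
                 ("C", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
example_cliques = maximal_cliques(example_edges)
```

The `min_size=3` default mirrors the requirement that a clique contain at least three genes; the pendant edge E-F is a maximal clique of size two and is therefore discarded.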
Once the cliques are detected, they are grouped in pairs by relating genes that belong to the same set of coexpressed genes.
The cliques that will form each pair will meet two conditions. Each one of the genes of a clique will be coexpressed with a different gene of the other clique, forming pairs of coexpressed genes. The type of curve that relates two pairs of coexpressed genes will be the same in both cliques.
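The two pairing conditions can be checked with a short sketch like the one below. It is only an illustration under stated assumptions: coexpression is given as a set of unordered gene pairs, each nonlinear relationship carries a single symmetric curve-type label (the real curve types may depend on the orientation of the pair), and the names `match_cliques`, `coexpressed`, `curve_type`, the gene labels, and the labels `"type_1"`/`"type_2"` are all hypothetical. The brute-force search over permutations is only practical for small cliques.

```python
from itertools import combinations, permutations


def match_cliques(clique_a, clique_b, coexpressed, curve_type):
    """Return a gene-to-gene mapping if the two cliques can be paired, else None.

    Condition 1: every gene of clique_a is coexpressed with a different
                 gene of clique_b.
    Condition 2: the curve type between two genes of clique_a equals the
                 curve type between their partners in clique_b.
    """
    genes_a = sorted(clique_a)
    if len(genes_a) != len(clique_b):
        return None
    for perm in permutations(sorted(clique_b)):
        mapping = dict(zip(genes_a, perm))
        # Condition 1: each (gene, partner) pair must be coexpressed.
        if any(frozenset((g, mapping[g])) not in coexpressed for g in genes_a):
            continue
        # Condition 2: curve types must agree for every pair of genes.
        if all(curve_type[frozenset((g, h))]
               == curve_type[frozenset((mapping[g], mapping[h]))]
               for g, h in combinations(genes_a, 2)):
            return mapping
    return None


# Hypothetical data: two cliques of three genes each.
coexpressed = {frozenset(p) for p in [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]}
curve_type = {
    frozenset(("a1", "a2")): "type_1", frozenset(("b1", "b2")): "type_1",
    frozenset(("a1", "a3")): "type_2", frozenset(("b1", "b3")): "type_2",
    frozenset(("a2", "a3")): "type_1", frozenset(("b2", "b3")): "type_1",
}
mapping = match_cliques({"a1", "a2", "a3"}, {"b1", "b2", "b3"},
                        coexpressed, curve_type)
```

In this toy case the mapping pairs each `a` gene with the `b` gene of the same index; changing the curve type of a single relationship in one clique makes the function return `None`.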
This provides us with pairs of cliques. Each gene of a clique will be coexpressed with a different gene of the other clique forming pairs of coexpressed genes. Then, the expression relationships that relate genes from different pairs of coexpressed genes will maintain the same type of curve for the two genes of the pair.
In the previous section we obtained the nonlinear expression relationships between pairs of coexpressed genes. Now, we obtain sets of coexpressed genes that maintain nonlinear expression relationships between them by grouping these pairs of coexpressed genes into sets of coexpressed genes. Consider a graph whose vertices are the cliques of nonlinear relationships between genes and whose edges link the pairs of linear isomorphic cliques; we calculate the cliques of this new graph, obtaining the cliques of cliques. Thereby we obtain the
The genes of the
The higher the number of genes of the
The correlation degree required to consider an expression relationship correlated enough depends on the number of genes in the expression array. This is useful for small expression matrices, where nonlinear expression relationships between sets of coexpressed genes can be detected even though these expression relationships have high entropy. The aim is to always detect enough nonlinear expression relationships to be able to find the
The threshold used to classify an expression relationship as linear or nonlinear also depends on the number of genes; it is more restrictive for linear relationships in matrices with fewer genes. Thus, large sets of coexpressed genes with very sharp curves between them can be formed for large expression matrices, whereas smaller sets of coexpressed genes, as well as more subtle nonlinear expression relationships between the sets, are considered for small matrices.
The expression relationships have been filtered by the uncorrelation factor provided by the PCOP calculation to be considered correlated enough [
A higher number of genes in the expression array increases the number of coexpressed genes and nonlinear expression relationships, which facilitates finding
The system allows the researcher to study sets of coexpressed genes that maintain nonlinear expression relationships among them, as well as the nonlinear expression relationships that a specific gene of interest maintains with different sets of coexpressed genes. This can be studied for the target gene as well as for the genes coexpressed with it. In the microarray of 1416 genes used in the examples, 4573 nonlinear relationships and 20269 pairs of coexpressed genes (all highly correlated) were found.
The study of the expression relationships between sets of coexpressed genes can start from the researcher's genes of interest. All the nonlinear expression relationships that a gene of interest maintains with different sets of coexpressed genes are shown, classified by curve type, because each curve type implies a different activation/deactivation relationship. Only the nonlinear expression relationships that maintain a sufficient correlation degree are shown.
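To put these counts in perspective, the short calculation below compares them with the number of gene pairs that 1416 genes can form. It assumes that each reported relationship corresponds to one unordered pair of genes, which the text does not state explicitly; the variable names are illustrative.

```python
from math import comb

n_genes = 1416          # genes in the example microarray
n_nonlinear = 4573      # reported nonlinear relationships
n_coexpressed = 20269   # reported pairs of coexpressed genes

# Number of unordered gene pairs: n * (n - 1) / 2
total_pairs = comb(n_genes, 2)

# Share of all possible pairs (assuming one relationship per unordered pair)
pct_nonlinear = 100 * n_nonlinear / total_pairs
pct_coexpressed = 100 * n_coexpressed / total_pairs

print(f"possible pairs: {total_pairs}")
print(f"nonlinear:      {pct_nonlinear:.2f}% of pairs")
print(f"coexpressed:    {pct_coexpressed:.2f}% of pairs")
```

Under that assumption, roughly 0.46% of the 1,001,820 possible pairs are retained as nonlinear relationships and about 2.02% as coexpressed pairs, which illustrates how restrictive the correlation requirement is.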
We will study the activation and deactivation relationship between our gene of interest and different sets of coexpressed genes starting from these high-correlated nonlinear expression relationships. Two lists of coexpressed genes will be shown in a new view for each high-correlated nonlinear expression relationship of the gene of interest (Figure
This view shows a nonlinear expression relationship where a researcher’s gene of interest participates. The gene of interest is displayed on the top of the view on the left side. The column on the left side displays the genes coexpressed with the gene of interest, while the column on the right displays a set of coexpressed genes that maintain a nonlinear expression relationship with the gene of interest. The coexpressed genes are ordered by their correlation degree with their respective gene at the top (the
An icon shows the type of nonlinear relationship between the two main genes and, by extension, between the two sets of coexpressed genes. The curve type is very important, since it determines the role of the genes in each expression dependence.
The system obtains the inner pattern of the curve for any type of expression relationship and classifies it. The only requirement is that the data cloud must be continuous. A
A
Other relationships, such as those of type
As pointed out in the introduction, one of the three principles of our methodology is as follows. The type of curve between two genes is also maintained between the genes coexpressed with each one of them. Let us see an example: HLA genes are indicative of cell maturation marking the cell so it is recognised by the immune system [
Hypertrophic scarring (HS) is a result of increased fibrogenesis, which is thought to be caused by an exaggerated inflammatory response [
The relation of HLA and hypertrophic scar is already documented, but using our tool we found that this relation is mediated by NREP, because HLA genes must be overexpressed to activate NREP (a gene directly linked to HS [
In this way, it is valuable that even though the technologies to obtain gene-expression arrays do not capture regulatory genes because of the low variability in their gene expression, these technologies do allow studying the regulation between processes through the genes that perform these processes (coexpressed genes that result from the activation cascade started by regulatory genes). This is because these final genes do maintain wide enough expression ranges, which allows our high-throughput tool to analyse the expression dependence between the sets of coexpressed genes.
The different networks of nonlinear expression relationships among sets of coexpressed genes are classified by the number of sets and the curve types of the expression relationships between the sets. Once a network type is selected, the networks found in the expression matrix that maintain this pattern in the intergroup expression relationships are displayed.
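One simple way to picture this classification is to group networks by a signature made of the number of sets and the curve types of the relationships between them. The sketch below is only an assumption about how such a grouping could be done: the dictionary layout of a network, the function names, the gene labels, and the curve-type labels are hypothetical, and comparing sorted curve types is a simplification that ignores which sets each curve connects.

```python
from collections import defaultdict


def network_signature(network):
    """Signature: (number of sets, sorted curve types between the sets)."""
    return (len(network["sets"]), tuple(sorted(network["curves"].values())))


def group_by_pattern(networks):
    """Group networks that share the same signature."""
    groups = defaultdict(list)
    for net in networks:
        groups[network_signature(net)].append(net)
    return dict(groups)


# Hypothetical networks: each has gene sets and one curve label per pair of sets.
networks = [
    {"sets": [{"g1", "g2"}, {"g3"}, {"g4", "g5"}],
     "curves": {(0, 1): "type_1", (0, 2): "type_2", (1, 2): "type_1"}},
    {"sets": [{"g6"}, {"g7", "g8"}, {"g9"}],
     "curves": {(0, 1): "type_1", (0, 2): "type_1", (1, 2): "type_2"}},
    {"sets": [{"g1"}, {"g9"}, {"g3"}],
     "curves": {(0, 1): "type_2", (0, 2): "type_2", (1, 2): "type_2"}},
]
groups = group_by_pattern(networks)
```

Here the first two networks fall under the same pattern (three sets, two `type_1` curves and one `type_2` curve), while the third forms its own group. Selecting a network type then amounts to looking up one key of `groups`.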
In the view that shows the networks (Figure
This view shows networks of concrete types of nonlinear expression relationships between sets of coexpressed genes. The icons at the top show the curve-type pattern of the networks listed. The networks will always form a complete graph. The pink line separates the networks found for the curve-type pattern. The columns contain the genes of the
Four expression relationships are shown. The different sample conditions of the expression matrix (the sample series) constitute the data cloud. The PCOP describes the expression-relationship inner pattern. (b) and (c) show coexpressed genes. HLA-A and HLA-F are coexpressed genes (b) and GRAMD1A and NREP are also coexpressed genes (c). (a) and (d) show nonlinear expression relationships of
In Figure
To respond to diverse and frequently changing conditions, cells must precisely mediate the synthesis and function of the proteins in the cell. This is controlled in part by the overall genomic expression program that results from the combined action of different regulatory factors, each of which responds to specific extra- and intracellular signals. These regulators govern the expression of sets of coexpressed genes that perform the appropriate cell functions. The variations in the expression of these coexpressed genes can be captured by high-throughput technologies to obtain gene expression arrays. In this way, the researcher is able to know which processes are carried out in the conditions he/she wishes to study, by knowing the different genes coexpressed in them. But what if the researcher wishes to know more? What if he/she wishes to know how those different processes relate to one another? When working with large sample series, how do we know how these processes activate, deactivate, and reactivate one another? If the researcher suspects that certain genes can be therapeutic targets, how can he/she know the effect of their expression on the processes to which a target gene does not belong, since it is expressed with a different set of coexpressed genes? Knowing this could lead to results ranging from the discovery of unknown side effects to new ways of manipulating the expression of this gene.
In order to address all these issues, we perform our high-throughput analysis. We obtain the coexpressed genes and the highly correlated nonlinear expression relationships, and from them we obtain cliques (complete graphs) between sets of coexpressed genes that maintain nonlinear expression relationships between them. In these networks, every set of coexpressed genes maintains a nonlinear expression relationship with each and every one of the other sets of coexpressed genes of the network. It is therefore always known how to move from one process to another, passing through any other intermediate process. As a result, multiple networks are provided by a dynamic system that allows the detection of sets of coexpressed genes related to one another by complex activation dependences. The networks found will always depend on the analysed microarray. Large enough sample series and wide enough gene expression ranges facilitate the detection of coexpressed genes as well as the detection of nonlinear expression relationships between them. It is important to note that, for a nonlinear expression dependence between two sets of coexpressed genes to be detected, this dependence must exist in the data, even if it affects only two small sets of coexpressed genes. When such dependences exist, the developed tools provide very relevant information for the researcher, since they make it possible to observe how all the sets of coexpressed genes of his/her experiment interact with one another. We think this approach could be a useful complement to other computational methods commonly used to analyse gene expression data.
As stated in the introduction, the expression dependences between sets of coexpressed genes, as well as between the processes these sets of coexpressed genes carry out, cannot be described as linear. This is why new tools like the one presented here are necessary.
The authors declare no conflicts of interest.
This research was supported by Grant nos. BFU2010-22209-C02-01 and BIO2013-48704-R from MCYT (Ministerio de Ciencia y Tecnología, Spain), from the Centre de Referència de R + D de Biotecnologia de la Generalitat de Catalunya, and by Comisión Coordinadora del Interior (Uruguay).