NCR-PCOPGene: An Exploratory Tool for Analysis of Sample-Classes Effect on Gene-Expression Relationships

Background. Microarray technology is so expensive and powerful that it is essential to extract maximum value from microarray data. Our tools allow researchers to formulate and test anything from a single hypothesis to entire models. Results. The objective of the NCR-PCOPGene is to study the relationships among gene expressions under different conditions, to classify these conditions, and to study their effect on the different relationships. The web application makes it easier to define the sample classes, grouping the microarray experiments either by using (a) biological, statistical, or any other previous knowledge or (b) their effect on the expression relationship maintained among specific genes of interest. By means of the type (a) class definition, the researcher can add biological information to the gene-expression relationships. The type (b) class definition allows for linking genes correlated neither linearly nor nonlinearly. Conclusions. The PCOPGene tools are especially suitable for microarrays with large sample series. This application helps to identify cellular states and the genes involved in them in a flexible way. The application takes advantage of the ability of our system to relate gene expressions, even when these relationships are noncontinuous and cannot be found using linear or nonlinear analytical methods.


Introduction
DNA microarray technology enables high-throughput gene-expression analysis, which allows researchers to compare the activity of genes in multiple cellular conditions. There are several relevant web applications for microarray analysis, for example, GEO [1], BIOREL [2], ArrayExpress [3], and MicroGen [4]. Currently, most tools try to extract biological information from such high-throughput expression data by combining information from coexpressed genes [5] with additional annotations extracted from Gene-Ontology [6], phylogenetic information [7], or pathway data [8]. In this paper, a different and complementary approach is proposed, based on the effect of the experiments on the fluctuations of gene-expression relationships.
The suitable data for our type of analysis can be provided by (a) temporal series, useful to study synchronous cellular events, and (b) serial analysis of gene-expression samples under different conditions (e.g., chemotherapy, temperature, radiation, starvation, etc.), which are more useful for studying asynchronous events. The progressive increase of microarray sample series [9] motivates a more thorough analysis of expression relationships and gene dependencies throughout these large series, with the aim of recovering global gene behaviours, cellular states, and phenotypes. The GEO database [1] facilitates the study of the microarray experiments grouped into predefined classes introduced by the microarray authors. Nevertheless, if the researcher wants to understand the effect of the microarray experiments on the expression relationships and elucidate hidden cell states, he/she needs a more specific approach. To study the effect of the microarray experiments on the relationships among different sets of genes, our web tool allows the user to define flexible sample classes. This is a significant new approach, going from the linear, nonlinear, and mutually exclusive gene relationships to the complex, noncontinuous dependence among genes or sets of genes.

Advances in Bioinformatics
Our strategy begins with the analysis of the "continuous gene-expression relationships" (abbreviated in the paper as "gene relationships" or "expression relationships"). This analysis provides the "inner pattern" of the expression relationships. This "inner pattern" describes the relationship in an n-space (sample space of n dimensions), where each axis represents the expression level of one of the genes being related and the microarray experiments form the data cloud. This pattern analysis detects the linear, nonlinear, and mutually exclusive relationships, providing a series of points in the n-space which describe the "inner pattern" of the expression relationship. Beyond this first analysis lies the second objective: the study of the "noncontinuous dependence" among gene expressions. To perform this, the sample classes are defined, allowing for the study of their effect on gene expressions and on the continuous dependencies already found. This task makes it possible to relate genes which do not have continuous relationships but rather have local, unidirectional, or other kinds of complex dependence in their expression. The final objective is that the defined sample classes help to identify the hidden cellular states and phenotype changes (and provide the genes involved).

Methods
Pattern extraction methodologies are very powerful techniques to extract biological knowledge from data, as was shown in our previous work [10], where this approach was used for analysis of populations, diagnosis, and prognosis. In the present work, a useful extension to high-throughput microarray analysis is presented.

Preprocess
When microarray data are uploaded, the inner pattern of the expression relationship of all pairs of microarray genes is calculated automatically. These calculations are made by the principal curves of oriented points (PCOPs) calculus [11] and recognise the linear, nonlinear, and mutually excluding relationships between genes (both genes cannot be overexpressed or underexpressed at the same time, or one gene can be overexpressed if the other one is underexpressed) [12]. The noncorrelation factor between each pair of gene expressions is calculated in the preprocess pattern analysis, leading to the identification of those genes which are more correlated with each gene. How this leads to the "analysis expansion" around the researcher's genes of interest is shown in Results, and a brief description of the PCOP calculation [11] is given in Section 2.4.
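The ranking step of the preprocess can be illustrated with a minimal sketch. It does not reproduce the PCOP noncorrelation factor [11]; as a loudly stated assumption, it substitutes the stand-in dissimilarity 1 - |Pearson r|, which only captures linear and anti-correlated pairs and would miss the nonlinear relationships that PCOP detects. The names `pearson`, `rank_partners`, and the toy genes `g1` to `g4` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no relationship information
    return sxy / sqrt(sxx * syy)


def rank_partners(expr, gene, top=3):
    """Rank the other genes by a stand-in dissimilarity (lower = more related).

    expr maps a gene name to its expression values, one per sample.
    ASSUMPTION: 1 - |r| replaces the PCOP noncorrelation factor.
    """
    scores = []
    for other, values in expr.items():
        if other == gene:
            continue
        scores.append((1.0 - abs(pearson(expr[gene], values)), other))
    scores.sort()
    return [(name, round(score, 3)) for score, name in scores[:top]]


# Toy data: four samples per gene.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],    # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],    # falls as g1 rises (mutually excluding)
    "g4": [1.0, -1.0, -1.0, 1.0],  # no linear relationship with g1
}
ranking = rank_partners(expr, "g1")
```

In this sketch both the co-expressed gene and the mutually excluding gene rank ahead of the unrelated one, which mirrors how a per-gene rank of best-correlated partners could drive the expansion of an initial gene set.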

Defining Sample Classes
The researcher can test his/her hypotheses and intuitions by defining the sample classes. The sample-class definition can be made in the following three ways.
(i) Selecting the samples from a range of gene-expression data.
(ii) Clustering the samples from a gene-expression relationship: once the inner pattern among a set of correlated genes of interest is calculated (once again by PCOP calculus [11]), the samples that constitute the different "local behaviours" of this inner pattern can be clustered (a higher correlation implies a better-fitted pattern). Then, the effect of these conditions can be studied on any other gene relationship, whose genes need not be correlated with the set initially used to cluster the microarray experiments. Note that although in the preprocess PCOP is calculated for two dimensions only (a pair of genes), here it is calculated for n ≥ 2 dimensions (the n genes of the gene set).
(iii) Classifying the microarray experiments into different classes using previous knowledge.
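As an illustration of class definition by expression range, and of how a class could then be projected onto another gene relationship as colours, a minimal sketch follows. The function names, the threshold values, and the default colour are hypothetical and are not taken from the application; the check that no sample belongs to two classes is an assumption borrowed from the constraint mentioned in the Conclusions.

```python
def select_by_range(values, low, high):
    """Return the indices of samples whose expression lies in [low, high]."""
    return [i for i, v in enumerate(values) if low <= v <= high]


def colour_samples(n_samples, classes, default="grey"):
    """Build one colour label per sample from user-defined classes.

    classes maps a colour name to a list of sample indices.
    ASSUMPTION: a sample may belong to at most one class.
    """
    colours = [default] * n_samples
    for colour, members in classes.items():
        for i in members:
            if colours[i] != default:
                raise ValueError(f"sample {i} appears in more than one class")
            colours[i] = colour
    return colours


# Toy expression profile of one gene across six samples.
gene_a = [-1.2, -0.8, 0.1, 0.3, 1.1, 1.4]

classes = {
    "red": select_by_range(gene_a, -2.0, -0.5),  # underexpressed samples
    "blue": select_by_range(gene_a, 1.0, 2.0),   # overexpressed samples
}
colours = colour_samples(len(gene_a), classes)
```

The resulting `colours` list can be passed to any scatter-plot routine to colour the same samples in the plot of a different pair of genes.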

Colouring the Sample Classes in a Genes Relationship
Now, the graphical interface facilitates the visualisation of the defined-classes effect on gene relationships by colouring the samples (as is shown in Figure 1) for every set of genes of interest with the intention of studying their effect. The study of this effect is especially relevant in the nonlinear relationships in order to understand the biological sense of the slope change, possibly due to a phenotype change.
The class definitions of types (i) (selecting the samples from a range of gene-expression values) and (ii) (clustering the samples from a gene-expression relationship) are made using these graphical visualisations of the pattern analysis of the initial genes of interest. Type (iii) is made using the web interface, which asks for the samples belonging to each class, or by directly uploading a file.

Defining the Sample Classes from a Nonlinear Expression Relationship
The system makes use of the principal curves of oriented points (PCOPs) calculus to obtain the nonlinear inner pattern [11,13] used to cluster the samples (remember that PCOP calculus is also used to obtain the correlation degree between all pairs of microarray genes). The variables analysed with the PCOP method can be independent because the method uses a hidden variable for ordering the data (in contrast to other nonlinear analyses like regression curves), which is suitable for gene-expression comparisons [14]. PCOP is defined by the generalisation, at the local level, of the principal-components variance properties. From the sample-space data of n dimensions (one per gene expression), our system obtains discretised states named "principal oriented points" (POPs), which represent this principal-component generalisation at the local level or local area. The series of POPs obtained makes up the principal curve of oriented points (PCOP), or inner pattern, of the data cloud [11,13]. The final series of POPs obtained minimises the dispersion of the samples around the PCOP [11,13]. As a result, even when the data distribution is not isotropic, we obtain a very realistic inner pattern, compared, for example, with methods like least squares, and a very accurate data-dispersion measurement (the noncorrelation factor), compared with artificial intelligence methods [11,15]. Next, when the user selects the POPs (the different discretised states obtained in the PCOP calculus) using the graphical interface, the samples belonging to the scope or local area of each POP are selected. One of the main keys of PCOP calculus is precisely the detection of the scope of the local area of each POP, as has been explained in a previous work [11]. In this way, the different sample classes are finally defined based on the "local behaviours" of the gene-expression relationship.
For more details about this clustering method based on the fluctuations of the inner pattern, consult our previous work [10].
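The last step, turning user-selected POPs into sample classes, can be sketched as follows. The PCOP calculus itself and its scope detection [11] are not reproduced here: the POP coordinates are taken as given, and, as an explicit assumption, the scope of a POP is approximated by nearest-point assignment in the expression space. All names and coordinates are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_to_pops(samples, pops):
    """Index of the nearest POP for each sample.

    ASSUMPTION: nearest-point assignment stands in for the
    local-area (scope) detection of the PCOP calculus.
    """
    return [min(range(len(pops)), key=lambda k: dist(s, pops[k]))
            for s in samples]


def classes_from_selected_pops(samples, pops, selection):
    """Build sample classes from the POPs the user clicked on.

    selection maps a class name to a set of POP indices.
    """
    nearest = assign_to_pops(samples, pops)
    return {name: [i for i, k in enumerate(nearest) if k in chosen]
            for name, chosen in selection.items()}


# Hypothetical inner pattern of a two-gene relationship (four POPs).
pops = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.5), (3.0, 1.6)]

# Five samples, each a point (expression of gene 1, expression of gene 2).
samples = [(0.1, -0.1), (0.9, 1.2), (2.1, 1.4), (3.2, 1.7), (-0.2, 0.1)]

# The user selects the POPs at the two extremes of the pattern.
sample_classes = classes_from_selected_pops(
    samples, pops, {"low_end": {0}, "high_end": {3}}
)
```

Because the samples are tuples of any length, the same sketch applies unchanged to gene sets of n ≥ 2 genes.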

Gene Search Based on the Distribution of the Classes along the Gene-Expression Range
To make it easier to relate the genes in a noncontinuous approach, the user can carry out a gene search based on the expression level required for each class defined. This noncontinuous approach leads to the correlation of gene expressions that cannot be related by the continuous analysis facilitated by the pattern analysis (as shown in the actual examples described in Section 3). In the "class-distribution" search, the genes can be searched for some of the defined classes by being upregulated or downregulated, with respect to the basal value, or by being disjointed, overexpressed, or underexpressed, with respect to some of the other classes. All possible combinations are allowed where different combinations will supply different gene sets.
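A "class-distribution" search of this kind can be sketched in a few lines. The sketch is not the application's implementation: the strict rule that every sample of a class must lie above or below the basal value, the basal value of 0, and the reading of "overexpressed with respect to another class" as disjoint ranges are all assumptions made for illustration, as are the function and gene names.

```python
def matches(values, classes, basal, required, above=()):
    """Check one gene against a requested class distribution.

    values   : expression of the gene, one value per sample
    classes  : class name -> list of sample indices
    required : class name -> "up" or "down" (relative to basal)
    above    : pairs (hi, lo); every hi sample must exceed every lo sample
    ASSUMPTION: all samples of a class must satisfy the condition.
    """
    for name, direction in required.items():
        vals = [values[i] for i in classes[name]]
        if direction == "up":
            ok = all(v > basal for v in vals)
        elif direction == "down":
            ok = all(v < basal for v in vals)
        else:
            raise ValueError(f"unknown direction: {direction}")
        if not ok:
            return False
    for hi, lo in above:
        lowest_hi = min(values[i] for i in classes[hi])
        highest_lo = max(values[i] for i in classes[lo])
        if lowest_hi <= highest_lo:
            return False
    return True


def class_distribution_search(expr, classes, basal, required, above=()):
    """Return the sorted names of the genes that follow the distribution."""
    return sorted(g for g, v in expr.items()
                  if matches(v, classes, basal, required, above))


classes = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}
expr = {
    "geneX": [0.5, 0.6, 1.5, 1.8, -0.7, -0.4],   # A up, B above A, C down
    "geneY": [0.5, 0.6, 0.4, 1.8, -0.7, -0.4],   # B overlaps A
    "geneZ": [0.5, -0.1, 1.5, 1.8, -0.7, -0.4],  # one A sample below basal
}
required = {"A": "up", "B": "up", "C": "down"}
hits = class_distribution_search(expr, classes, 0.0, required,
                                 above=[("B", "A")])
```

Changing `required` or `above` changes the set of genes returned, in the same way that different combinations in the search supply different gene sets.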

Contextualisation Consulting the GEO Database
In order to get complementary information from other public microarray data, the GEO database could be consulted.
These queries determine whether the genes supplied by the "class-distribution" search tool (genes that follow the user-defined class distribution in their expression) are gene markers in the GEO datasets. To obtain this information, the NCR-PCOPGene queries the microarray "gene-centric" GEO Profiles [1]. The profiles across all GEO datasets where the query gene displays significant expression differences among the GEO-predefined sample classes are obtained. In this way, the queried gene can be considered, for example, as a marker of osmotic stress in a microarray analysing cellular stress response, or as a marker of metastasis in another microarray analysing disease states, and so forth. These query-gene properties can then be assigned to the sample classes obtained with the NCR-PCOPGene. A demo user for visualising the above examples is available. A multimedia tutorial describing the use of the application is provided and is linked from the application help icon. For mathematical and computational details, technical reports are also available on the web.

Results
Let us describe some real-use cases of the NCR-PCOPGene and the relevance of the new knowledge supplied (the gene-expression dependences put forward in the examples are new and were unknown until now). The three ways to define the sample classes and a basic analysis procedure have been used.

Microarray Data
The suitable data to be analysed by the NCR-PCOPGene are microarrays with large synchronous or asynchronous sample series. The analysis presented in the paper uses microarray data provided by the National Cancer Institute (USA) [16]. They correspond to the profiles of 9703 cDNAs representing ∼8000 unique genes of 60 cell lines, in relation to the activity profiles of 1400 drugs. They provide a resulting table of 1376 genes and 118 compounds with the most representative substances and genes normalised for the 60 cell lines (suitable data for knowledge discovery using our tools).

Basic Analysis Procedure
The analysis begins with the researcher's genes of interest, usually gene markers of a specific disease, cell state, or function. As noted in Section 2, the correlation factors between all pairs of microarray genes (for the linear, nonlinear, and mutually excluding relationships) are automatically calculated when the microarray data are uploaded (the preprocess). Thus, for each gene the system provides its rank of best-correlated genes, and the user can expand the initial correlated gene set based on these ranks [14]. Finally, the user can launch the PCOP calculus of the gene set.
Once the pattern analysis of the query genes has been performed, the graphical interface shows their expression relationship, its inner pattern, and its fluctuations. The noncorrelation factor of the set is also provided (remember that a better correlation among the initial gene set implies an inner pattern that fits better, facilitating the detection of the fluctuations and slope changes, and also the clustering of the different microarray experiments associated with each one).
To find the expression dependence of these initial sets of correlated genes on other genes that are neither linearly nor nonlinearly correlated with the first ones (otherwise, it is easier and faster to perform PCOP analysis directly), the user should proceed as follows: first, he/she must discretise the continuous relationship among the initial gene set, forming the different classes (by simply clicking on the POPs along the relationship's inner pattern in the plot interface); second, by applying the "class-distribution" search tool for a certain distribution of that discretisation, the genes that follow the required distribution in their expressions are obtained. Finally, the researcher can perform the pattern analysis of the genes provided and observe, in its interface visualisation, the effect of each sample class on their expression relationship (with the samples of the classes coloured, as in Figure 1). This procedure shows the noncontinuous dependence between the genes provided by the search and the initial gene set, in the manner specified in the search. If the distribution required in the search is changed, the genes provided and their noncontinuous dependence with respect to the initial ones will vary too.

Example 1: Defining the Sample Classes from a Nonlinear Genes Relationship
Let us now look at a real analysis. We wish to relate the Soluble Guanylate Cyclase Beta1 3 (SGC) and Quiescin Q6 (Q6) genes (our genes of interest). The SGC is underexpressed in cellular stress [17], whereas Q6 is overexpressed in the last phases of tissue remodelling [18]. Additionally, we would like to relate this pair of genes to GATA-binding protein 3 (GATA3) and acute myeloid leukaemia 1 (AML1, RUNX1). GATA3 is involved in growth control and the maintenance of the differentiated state in epithelial cells [19]. The impairment of the AML1 function deregulates the pathways leading to cellular proliferation and differentiation [20]. The two gene sets show a correlation between their respective members (noncorrelation factors of SGC versus Q6 = 0.08; GATA3 versus AML1 = 0.1) and their inner pattern can be calculated and visualised in the graphical interfaces, as is shown in Figure 1. The problem is that these two sets of genes are neither linearly nor nonlinearly correlated with each other (noncorrelation factor for SGC versus Q6 versus GATA3 versus AML1 = 0.28). But perhaps they maintain a noncontinuous dependence in their expressions that we cannot discern with the analysis for continuous data clouds. However, we can try to find it using the class definition.
To perform this, two clusters are built from the SGC and Q6 relationship by selecting the POPs located at the two extremes of their inner-pattern relationship, one corresponding to the cellular-stress samples and the other to the tissue-remodelling samples (Figure 1, SGC and Q6 relationship). These clusters constitute two different sample classes. Now, the classes are applied to the GATA3 and AML1 relationship, painting their respective samples in red and blue (Figure 2). As can be observed in Figure 2, almost all samples corresponding to cellular stress (red) appear with an underexpression of GATA3 and AML1, indicating that the two genes are not overexpressed in cellular stress. However, the tissue-remodelling-class samples (blue) appear along the GATA3 and AML1 relationship as being both over- and underexpressed, indicating that some of these tissue-remodelling conditions are affected by the GATA3 and AML1 differentiation pathways while some are not. This indicates that the overexpression of GATA3 and AML1 implies an overexpression of SGC and Q6, but not the opposite. Note that this "unidirectional" relationship is impossible to detect by pattern- or correlation-analysis methods.

Example 2: Defining the Sample Classes from Selecting Expression Ranges
As seen above, there are some experiments involved in tissue remodelling, but not in the GATA3 differentiation processes. It would be interesting to study them. For this purpose, those samples where GATA3 is overexpressed were selected (using the graphical interface) to define a new sample class (Figure 1, GATA3 and AML1 relationship). This class of samples is coloured in the SGC and Q6 relationship, as shown in Figure 3. As we can see, the differentiation induced by GATA3 is independent of the tissue-remodelling level achieved by the SGC and Q6 relationship. Note that the sample selection from POP is the appropriate method to classify cellular states because the cause or effect of a cell state is the combined expression of multiple genes.

Example 3: Defining the Sample Classes by Classifying the Experiments Using Previous Knowledge
Previous knowledge can arise basically from two different origins: a biological/clinical origin or a statistical one. In our case, the microarray experiments are grouped by their linear correlation using principal components, but other methods like biclustering [21,22] or Locally Linear Embedding [23,24] can be used with better accuracy to define the classes. Note that in order to define the classes, the microarray experiments play the variable role and the genes make up the data cloud. We establish the hypothesis that each sample class represents a differentiated cell process, with the genes implicated in it acting jointly. The effect of the defined classes on relationships among genes of interest can now be observed. Colouring these classes in the relationship among the four genes of the above examples (Figure 1, GATA3, RUNX, SGC, and Q6 relationship), and remembering the observations of the above examples, the genes that may act jointly would be (i) the genes involved in cell stress (yellow and red) and (ii) the genes involved in tissue remodelling (green and blue), this last set being divided into genes related to GATA3 and AML1 differentiation (green) and those not implicated in it (blue).

Example 4: Gene Search Based on the Distribution of the Classes along Their Expression Range
With the classes obtained in the last example (Figure 1, GATA3, RUNX, SGC, and Q6 relationship) and using the "class-distribution" gene-search tool, it is interesting to search for the genes which mark the "transition" from the cell process involved in tissue remodelling without GATA3 differentiation (blue) to the cell process linked to tissue remodelling with GATA3 differentiation (green). For this purpose, the following "class-distribution" gene search must be performed: the blue and green classes are overexpressed with respect to the basal value; the rest of the classes are underexpressed with respect to the basal value; and the green-class samples are more overexpressed than the blue-class samples. Furthermore, we can identify gene markers of specific cell processes or pathologies (in relation to their expression levels) from the supplied genes by means of the queries made by the application against the "gene-centric" GEO Profiles. Thus, the analysis can be focused on the genes relevant to the user's biomedical interest.

Conclusions
The strength of NCR-PCOPGene resides in (a) the flexibility of the class definition, due to the nonlinear pattern analysis of gene expressions and the sample clustering along the inner pattern, combined with (b) the high-throughput approach of microarray technology, which, by means of the "class-distribution" gene search and the gene-correlation table, leads the researcher to expand his/her analysis. As a result, our application can help to relate gene expressions when their relationships are noncontinuous and cannot be found using linear or nonlinear analytical methods.
The flexibility of the tool allows the three ways of determining the classes shown in this paper to be combined in order to define and redefine the classes. For example, using the "class definition from a nonlinear genes relationship" way, the classes can be clustered from two different, uncorrelated, initial gene sets (as long as no sample appears in more than one class). The user could then search for the genes that are partially related to both sets of genes in a specific manner (e.g., being correlated with one set in underexpression, but with the other in overexpression). Alternatively, starting from the "previous-knowledge" way, the user can redefine the predefined classes by their effect on the gene-expression relationships (forming subclasses from the original ones to study subprocesses in which the genes of interest are involved).
In summary, it is a powerful tool to study genes of interest and test researchers' hypotheses by taking advantage of the high-throughput capability of microarray technology.